29 min read

Proximal Policy Optimization

RL Part 8: Trust regions, the clipped surrogate, and the workhorse of modern RL.

👉

Recap

Chapter 7 taught us about the policy gradient approach. Instead of learning values and acting greedily, we parameterized the policy directly and pushed its parameters in the direction that increases expected return.

The log-derivative trick turned an intractable gradient into something we could estimate from sampled experience.

REINFORCE was the first algorithm built from that gradient. It worked, but it suffered from high variance since the return of a single trajectory is a noisy signal.

Two training curves for REINFORCE on different random seeds, one climbing then collapsing, the other climbing and holding, illustrating how much the outcome swings with randomness

We reduced that variance in stages:

A baseline subtracted a reference value without adding bias.

Diagram showing why subtracting a baseline leaves the gradient unbiased: the score function sums to zero across all actions, so the baseline's contribution cancels in expectation, leaving the mean gradient direction unchanged while only affecting variance

The best baseline turned out to be the value function, which further gave us the advantage, i.e., how much better an action was than the typical behavior for that state.

A diagram showing Q(s,a) for several actions as bars, V(s) as a horizontal line at their average, and the advantage as the signed gap between each bar and the line

Moving ahead, we saw that learning the value function alongside the policy produced actor-critic:

The actor chooses actions.
The critic estimates values, and the critic's estimate serves as the baseline.

GAE then gave us a smooth knob to trade a little bias for a lot less variance when estimating the advantage.

We also briefly discussed about language models in the RL setting, and finally, in the hands-on section, we implemented the REINFORCE algorithm and trained it on CartPole.

If you have not read Chapter 7, we recommend doing so first:

Introduction

It is worth being precise about what the actor-critic did and did not solve. It addressed variance. By subtracting a learned baseline and bootstrapping value estimates, the gradient became far less noisy than REINFORCE's. What it left untouched was the size of each update.

👉

The gradient told us a direction to move in, but not how far we could safely go. Variance and step size are different problems, and actor-critic only fixed the first.

So suppose we already have a working actor-critic agent. The question this chapter answers is narrow but important: why is updating that agent so dangerous, and what is the safe way to do it?

The answer is a family of methods built around a single idea, which is that you do not let the policy change too much in one step. That idea is called a trust region. The algorithm that made it practical is Proximal Policy Optimization (PPO).

👉

This chapter assumes you have read Chapter 7. We build directly on REINFORCE, baselines, the advantage function, actor-critic, and GAE. If any of those feel hazy, a quick re-read will pay off here.

As always, every idea will be explained through clear examples and walkthroughs to develop a solid understanding.

Let's begin!

The problem in detail

In policy gradient methods, the data we learn from is generated by the current policy. The moment we update the policy, that data no longer exactly matches the policy being optimized. It describes the behavior of an agent that no longer exists.

Now consider what a gradient step actually does. It nudges the parameters in a direction estimated from a noisy batch. If the step is small, the new policy will often remain close to the old one, and the data is still roughly valid. If the step is large, the new policy can land somewhere very different.

👉

Every batch is a snapshot of one policy, and its value begins to decay the moment that policy changes. The larger the change, the less the batch we just paid to collect is worth.

When a step is too large, the policy can move into a region where it performs badly. Because learning is on-policy, the agent then collects new data using this worse policy. The next gradient is then estimated from bad experience, which can push it further off course. The collapse feeds itself.

You might wonder why we cannot simply use a small learning rate and be done with it. There are two reasons it is not that easy:

First, no single step size is right throughout training. Early on, when the policy is poor, large steps help; late on, near a good policy, that same size is reckless. One global number cannot be both.
Second, and more fundamentally, a learning rate limits how far the weights move, but what we care about is how far the behavior moves, and those are not quite the same thing.

👉

The network's outputs pass through a softmax to become action probabilities, and that mapping is uneven. In some states a small change in the weights barely shifts the probabilities, while in others it swings them wildly. So a small learning rate might help, but gives no reliable guarantee on how much the policy's behavior actually changes.

The fix is to constrain how far each update is allowed to move the policy, measured in the space of behaviors rather than the space of weights. That region of allowed movement is the trust region.

In summary, on-policy data is only a good estimate near the policy that made it, large updates can leap outside that valid region, and the result can be a self-reinforcing collapse.

The rest of the chapter is about updating safely.

Importance ratio

Before we constrain how far each update is allowed to move the policy, we need one more tool. It will let us do something REINFORCE could not: reuse a batch of data for more than one gradient step.

Published on Jun 14, 2026

Reinforcement Learning Course

Recap

Introduction

The problem in detail

Importance ratio