Skip to main content
25 min read

Function Approximation

RL Part 5: From tables to parameterized value functions.

👉

Recap

In the previous chapter, we entered the model-free setting, where the agent learns purely from interaction.

We began by clarifying what "model-free" means. The environment still has dynamics, of course. The point is that the algorithm does not get to see $P$ or $R$ directly. It only gets to interact, observe rewards, and adapt.

upload in progress, 0

We then covered Monte Carlo methods that estimate values by averaging full episode returns. They are unbiased but high variance, and they only work on episodic tasks because the return has to be computable.

Temporal-difference learning sits between MC and DP: it inherits the model-free property from MC and the bootstrapping idea from DP. The TD(0) update uses one observed reward plus the current estimate of the next state's value, which lets the agent learn online, after every single step.

upload in progress, 0

From TD prediction, we moved to TD control:

  • SARSA is the on-policy variant, using the value of the next action that the policy actually picks.
  • Q-learning is off-policy, using the max over next actions, so it can learn the optimal greedy policy while the agent itself explores.
upload in progress, 0

We closed with a hands-on experiment on the Cliff Walking gridworld:

  • Q-learning found the optimal cliff-edge path but suffered more during training because $\varepsilon$-greedy exploration kept knocking it off the cliff.
  • SARSA learned the safer path along the top and accumulated less negative reward, because its update reflects the policy that is actually being followed.
upload in progress, 0

If you have not read Chapter 4, we recommend doing so first:

Model-Free Learning
RL Part 4: Learning value functions and policies without a model. Monte Carlo methods, TD(0), SARSA, Q-learning, and the bias-variance bridge between them.

Introduction

Everything we did so far fits into tables, like the Q-table in chapter 4 for Cliff Walking had 48 cells.

upload in progress, 0

Now consider something like Backgammon, where the number of distinct positions is on the order of $10^{20}$, or mountain car, the classic control benchmark we will meet later in this chapter, where the state is a pair of real numbers (position and velocity) drawn from a continuous range. The table approach has nowhere to go.

This chapter is about knowing that bridge:

  • We will replace the table with a parameterized function.
  • We will set up a learning objective.
  • Derive the gradient Monte Carlo and semi-gradient TD updates.
  • Extend the control algorithms to the approximation setting.

We will then learn about the deadly triad, the combination of function approximation, bootstrapping, and off-policy learning that can cause value estimates to diverge.

Finally, we will train a semi-gradient SARSA agent on mountain car using tile coding, the workhorse linear-features technique for continuous state spaces, and plot the learned cost-to-go surface.

👉
This chapter is conceptually dense and assumes strong familiarity with the previous chapters, as well as a solid understanding of probability and linear algebra. It is recommended to follow the explanations with pen and paper at hand.

Let's begin!


Why tables fail at scale?

The tabular approach has two failure modes, and they compound:

  • Memory: When the state space is small and discrete, this is fine. When it is large, you cannot store the table. When it is continuous, you cannot even index into it.
  • Zero generalization: Updating the entry for state $s$ changes nothing about state $s'$, even if $s'$ is right next to $s$. Every state-action pair has to be visited many times for its value to settle. In a space with millions of states, you will never visit enough of them to learn anything useful, even if memory were free. What we need is a representation where updating the value at $s$ also updates nearby states automatically, where similar states share information. This is what function approximation gives us.

A concrete way to feel the problem is to take a car (known as the mountain car example), where the state is just two floats, say position $\in [-1.2, 0.6]$ and velocity $\in [-0.07, 0.07]$. There are infinitely many states. A table is structurally incapable for this. Thus we need function approximation.

One way to do this is state aggregation. Group the state space into clusters and store one value per cluster. It is the simplest possible form of generalizing approximation. For example, a 1000-state random walk can be grouped into 10 clusters of 100 states each.

upload in progress, 0

The resulting value function is a staircase, which holds a piecewise constant value within each cluster, jumping between clusters. It is crude, but it captures the broad shape of the true value function using only 10 parameters instead of 1000.

In summary, tables are out, not because they are wrong, but because they cannot scale and cannot generalize. The fix is to represent the value function as a function over states, parameterized by a small set of weights.


Parameterized value functions

The shift is from a table to a function. We write:

upload in progress, 0

where $\theta \in \mathbb{R}^d$ is a parameter vector of dimension $d$, typically much smaller than $|\mathcal{S}|$.

upload in progress, 0

The function $\hat{v}$ can be anything that is differentiable in $\theta$, for example, a linear combination of features.

The same idea applies to action values:

upload in progress, 0

The change is more profound than it looks. With a table, updating $V(s)$ for one state affected only that state. With a parameterized function, changing $\theta$ to fit the value of state $s$ also changes the predicted values at every other state, because $\hat{v}$ depends on all the components of $\theta$.

This is the generalization we need. However, it is also the source of all the new difficulties in this chapter. We can no longer make every state's estimate exactly right. The parameters are shared, so making one state more accurate can make another less accurate. Thus, we are obligated to choose what we care about.

👉
The number of parameters $d$ is typically far smaller than $|\mathcal{S}|$. This is not a limitation, it is the design. Sharing parameters is what produces generalization.

The mean square value error

The tabular case did not need an explicit objective. Updates were decoupled, and each estimate could march independently toward its true value. With function approximation, we lose that luxury, so we have to specify what "good" means.

The standard prediction objective is the mean square value error, or MSVE. It is a weighted average of squared prediction errors across states:

upload in progress, 0

Let's unpack this:

  • $v_\pi(s)$ is the true value of state $s$ under policy $\pi$
  • $\hat{v}(s, \theta)$ is our parameterized estimate at that state
  • The squared bracket measures pointwise error at $s$.
  • The factor $d(s)$ is a non-negative weighting that distributes our care across the state space.
  • The sum runs over every state.

The structure tells us two things:

  • Error at each state is squared, so positive and negative errors do not cancel, meaning every deviation costs us.
  • $d(s)$ controls which states matter. A state with $d(s) = 0$ contributes nothing to the objective no matter how badly we mispredict it, and a state with high $d(s)$ dominates.

The intuition is that with limited capacity, we cannot get everything right, so we have to triage. The natural choice for $d(s)$ is the on-policy distribution (the long-run fraction of time the agent spends in each state when following $\pi$). The states the policy visits often get prioritized. The states the policy never visits, we cannot learn about them anyway, so weighing them is irrelevant.

A useful sanity check is that if $d(s)$ were a uniform distribution and $\hat{v}$ were perfect everywhere, MSVE would be zero. Any deviation lifts it above zero, weighted by where we care.

👉
MSVE is the standard objective for prediction. It is not provably the right objective for control: the value function that minimizes MSVE may not be the value function that yields the best policy when used greedily.

One important thing to note is that we usually do not know the true value function $v_\pi(s)$. Because of that, we cannot compute MSVE directly.

Instead, MSVE serves as a theoretical objective that tells us what good predictions look like. The learning algorithms we build update the parameters in ways that, in expectation, descend MSVE, and we will track how closely they actually do so.


Linear function approximation

The most important special case is when $\hat{v}$ is linear in $\theta$. This is the case where everything generalizes cleanly from the tabular setting and where almost every theoretical result lives.

Each state is represented by a feature vector:

upload in progress, 0

The value estimate is then the inner product of features and weights:

upload in progress, 0

The terms:

  • $\phi_i(s)$ is the $i$-th feature evaluated at state $s$, a fixed function of the state.
  • $\theta_i$ is the learnable weight for that feature.
  • The product $\theta_i \phi_i(s)$ is the contribution of feature $i$ to the predicted value at $s$, and we add up all such contributions.

The intuition has two parts:

  • First, the features carry the inductive bias: similarity in feature space implies similarity in predicted value. Two states with nearly identical feature vectors must have nearly identical predicted values, by construction.
  • Second, the weights carry the learning: as $\theta$ changes, every state's predicted value changes in a coordinated way determined by the features.

The gradient is the cleanest part. Differentiating $\hat{v}(s, \theta) = \theta^\top \phi(s)$ with respect to $\theta$ gives $\phi(s)$. That is, the gradient at state $s$ is just the feature vector at $s$. This is what makes linear methods so tractable.

👉
The tabular case is a special instance of linear FA. If you choose features to be one-hot indicators (one feature per state, equal to 1 at that state and 0 elsewhere), then $\theta$ is the table, $\phi(s)$ selects one entry, and the linear method reduces exactly to the tabular update.

In summary, linear FA gives us a representation with two pieces: a fixed feature map $\phi$ that carries the inductive bias, and a learnable weight vector $\theta$ that carries the learning. The gradient is the feature vector itself, which makes everything that follows simple to derive and to implement.


Gradient Monte Carlo

Let's now understand our first learning algorithm for the function approximation setting. The idea is to treat each visit to a state as a supervised training example.

The supervised setup is this:

→ input $S_t$, target $G_t$ (the return observed from $S_t$ onward), and we want to make $\hat{v}(S_t, \theta)$ close to $G_t$. The standard tool is stochastic gradient descent on the squared error.

After observing $(S_t, G_t)$, we adjust $\theta$ in the direction that most reduces $\big[ G_t - \hat{v}(S_t, \theta) \big]^2$. That is:

upload in progress, 0

The terms:

  • $G_t$ is the actual observed return from time $t$, computed by playing the episode out to termination.
  • $\hat{v}(S_t, \theta)$ is the current estimate.
  • The bracketed difference is the prediction error.
  • The gradient $\nabla_\theta \hat{v}(S_t, \theta)$ is the direction in parameter space that increases $\hat{v}(S_t, \theta)$ fastest.
  • $\alpha$ is the step size.

For linear function approximation, the squared error is convex in $\theta$, so the local minimum is the global minimum. Gradient Monte Carlo with linear FA provably converges to the parameter vector that minimizes MSVE, under the on-policy distribution.

Published on May 24, 2026