Linear TD vs Neural TD: A Tale of Two Geometries

Posted Jun 12, 2026

By Sreejeet Maity

14 min read

Temporal-difference learning looks deceptively simple. For a fixed policy, we observe a transition, form the Bellman error, and move the value estimate in the direction suggested by that error. In tabular RL this is already intuitive: every state-action pair has its own coordinate. With function approximation, however, the geometry becomes the whole story.

This post compares two population versions of TD learning:

projected TD with linear function approximation, and
projected TD with a two-layer overparameterized ReLU network.

At a superficial level the two proofs look similar. Both use projection. Both use a Bellman contraction. Both telescope a one-step inequality. But the actual meaning of the proof is very different.

The shortest summary is this:

\[\boxed{\text{Linear TD is TD in a fixed feature space.}}\] \[\boxed{\text{Neural TD behaves like TD in a random frozen feature space, provided the network stays near initialization.}}\]

The second statement is the important one. In neural TD, the nonlinear network is not analyzed directly as a generic nonconvex function class. The proof first freezes the ReLU gates at initialization, obtains a random-feature linear class, and then proves that the actual network remains close enough to this frozen linearization. That is where overparameterization enters.

The goal of this post is to make this comparison mathematically explicit, but in a way that is readable as a blog post rather than as a compressed theorem sheet.

1. The common policy-evaluation problem

Fix a policy $\pi$. Let

\[x=(s,a)\]

be a state-action pair, and let $x’=(s’,a’)$ denote the next state-action pair generated by the environment and the policy. We assume a stationary population model:

\[(x,r,x')\sim \mathcal{D}, \qquad x\sim \mu, \qquad x'\mid x \sim P^\pi(\cdot\mid x).\]

Here $\mu$ is the stationary distribution of the Markov chain over state-action pairs induced by $\pi$. Since this post focuses on the population version, every update direction is an expectation under $\mathcal{D}$. There is no stochastic sampling error and no Markovian concentration issue in this note.

For a candidate action-value function $f:\mathcal{X}\to\mathbb{R}$, define the Bellman operator

\[(\mathcal{T}^{\pi} f)(x) := \mathbb{E}\bigl[r+\gamma f(x')\mid x\bigr], \qquad \gamma\in(0,1).\]

The Bellman residual is

\[\delta_f(x,r,x') := f(x)-r-\gamma f(x').\]

We use the $L_2(\mu)$ inner product and norm

\[\langle f,g\rangle_\mu := \mathbb{E}_{x\sim\mu}[f(x)g(x)], \qquad \lVert f\rVert_\mu^2 := \mathbb{E}_{x\sim\mu}[f(x)^2].\]

The stationarity of $\mu$ gives the basic inequality

\[\mathbb{E}[h(x)h(x')] \le \sqrt{\mathbb{E}[h(x)^2]\mathbb{E}[h(x')^2]} = \lVert h\rVert_\mu^2.\]

Therefore,

\[\langle h, h-\gamma P^\pi h\rangle_\mu = \mathbb{E}\bigl[h(x)(h(x)-\gamma h(x'))\bigr] \ge (1-\gamma)\lVert h\rVert_\mu^2.\]

This is the same contraction geometry that appears in both the linear and neural population proofs.

2. Projection and MSPBE: the real object of comparison

Let $\mathcal{F}\subset L_2(\mu)$ be a closed convex function class. The projection of a function $g$ onto $\mathcal{F}$ is

\[\Pi_{\mathcal{F}}g := \arg\min_{f\in\mathcal{F}}\lVert f-g\rVert_\mu^2.\]

The projected Bellman equation is

\[f^\star = \Pi_{\mathcal{F}}\mathcal{T}^{\pi} f^\star.\]

The corresponding mean-squared projected Bellman error is

\[\operatorname{MSPBE}_{\mathcal{F}}(f) := \lVert f-\Pi_{\mathcal{F}}\mathcal{T}^{\pi} f\rVert_\mu^2.\]

This is the first key point:

The function class $\mathcal{F}$ is not a cosmetic detail. Changing $\mathcal{F}$ changes the projected Bellman equation, the MSPBE objective, and the limiting object of TD.

The projection has a useful variational characterization. A function $f_{\mathcal{F}}=\Pi_{\mathcal{F}}g$ if and only if

\[\langle f_{\mathcal{F}}-g, f-f_{\mathcal{F}}\rangle_\mu \ge 0, \qquad \forall f\in\mathcal{F}.\]

Therefore a projected Bellman fixed point satisfies

\[\langle f^\star-\mathcal{T}^{\pi} f^\star, f-f^\star\rangle_\mu \ge 0, \qquad \forall f\in\mathcal{F}.\]

This variational inequality is the clean way to compare the two proofs. Linear TD uses a fixed linear class. Neural TD uses a random frozen-feature class induced by initialization.

3. Linear TD: fixed features, fixed geometry

Suppose we choose a fixed feature map

\[\phi:\mathcal{X}\to\mathbb{R}^d.\]

The linear approximation is

\[Q_\theta(x)=\phi(x)^{\texttt{T}}\theta, \qquad \theta\in\mathbb{R}^d.\]

Let $\Theta\subset\mathbb{R}^d$ be a compact convex feasible set. The linear function class is

\[\mathcal{F}_{\operatorname{lin}} := \bigl\{x\mapsto \phi(x)^{\texttt{T}}\theta:\theta\in\Theta\bigr\}.\]

This is a fixed class. Once $\phi$ is chosen, the geometry does not move during training.

For a transition $(x,r,x’)$, the TD residual is

\[\delta_\theta(x,r,x') := \phi(x)^{\texttt{T}}\theta-r-\gamma \phi(x')^{\texttt{T}}\theta.\]

The population TD direction is

\[\bar g_{\operatorname{lin}}(\theta) := \mathbb{E}\bigl[\delta_\theta(x,r,x')\phi(x)\bigr].\]

Since the model is linear, this direction is affine:

\[\bar g_{\operatorname{lin}}(\theta) = A\theta-b,\]

where

\[A := \mathbb{E}\bigl[\phi(x)(\phi(x)-\gamma\phi(x'))^{\texttt{T}}\bigr], \qquad b := \mathbb{E}\bigl[r\phi(x)\bigr].\]

The population projected linear TD update is

\[\theta_{t+1} = \Pi_\Theta\bigl(\theta_t-\eta(A\theta_t-b)\bigr).\]

The entire linear proof is powered by the fact that $A\theta-b$ is a fixed affine operator.

4. What is the linear TD fixed point?

Assume first that the projection set is large enough and the unconstrained solution belongs to $\Theta$. Then the population TD fixed point satisfies

\[A\theta^\star=b.\]

Equivalently,

\[Q_{\theta^\star} = \Pi_{\mathcal{F}_{\operatorname{lin}}}\mathcal{T}^{\pi} Q_{\theta^\star}.\]

To see this, the normal equation for projecting $\mathcal{T}^{\pi} Q_\theta$ onto the linear span of $\phi$ is

\[\mathbb{E}\Bigl[\phi(x)\bigl(\phi(x)^{\texttt{T}}\theta-(\mathcal{T}^{\pi} Q_\theta)(x)\bigr)\Bigr] =0.\]

But

\[(\mathcal{T}^{\pi} Q_\theta)(x) = \mathbb{E}\bigl[r+\gamma\phi(x')^{\texttt{T}}\theta\mid x\bigr],\]

so the normal equation becomes

\[\mathbb{E}\bigl[\phi(x)(\phi(x)^{\texttt{T}}\theta-r-\gamma\phi(x')^{\texttt{T}}\theta)\bigr]=0,\]

which is exactly

\[A\theta=b.\]

If the feasible set $\Theta$ is active, then the correct fixed-point condition is the variational inequality

\[\langle A\theta^\star-b,\theta-\theta^\star\rangle \ge 0, \qquad \forall \theta\in\Theta.\]

For the clean comparison in this post, we focus on the regime where $\theta^\star\in\Theta$ and $A\theta^\star=b$. In that case projection is mainly a stability device.

5. Population proof for projected linear TD

Assume:

$\theta^\star\in\Theta$ and $A\theta^\star=b$,
the linear TD operator is strongly monotone: there exists $\mu_{\operatorname{lin}}>0$ such that

\[\langle \theta-\theta^\star,A(\theta-\theta^\star)\rangle \ge \mu_{\operatorname{lin}}\lVert \theta-\theta^\star\rVert_2^2, \qquad \forall \theta\in\Theta,\]

$\lVert A\rVert_2\le L_A$, and
the stepsize satisfies

\[0<\eta\le \frac{\mu_{\operatorname{lin}}}{L_A^2}.\]

Because Euclidean projection onto a closed convex set is nonexpansive,

\[\begin{aligned} \lVert \theta_{t+1}-\theta^\star\rVert_2^2 &= \left\lVert \Pi_\Theta\bigl(\theta_t-\eta(A\theta_t-b)\bigr) - \Pi_\Theta(\theta^\star) \right\rVert_2^2 \\ &\le \left\lVert \theta_t-\eta(A\theta_t-b)-\theta^\star \right\rVert_2^2. \end{aligned}\]

Since $A\theta^\star=b$, this becomes

\[\begin{aligned} \lVert \theta_{t+1}-\theta^\star\rVert_2^2 &\le \lVert \theta_t-\theta^\star-\eta A(\theta_t-\theta^\star)\rVert_2^2 \\ &= \lVert \theta_t-\theta^\star\rVert_2^2 -2\eta\langle \theta_t-\theta^\star,A(\theta_t-\theta^\star)\rangle +\eta^2\lVert A(\theta_t-\theta^\star)\rVert_2^2. \end{aligned}\]

Using strong monotonicity and the operator norm bound,

\[\lVert \theta_{t+1}-\theta^\star\rVert_2^2 \le \bigl(1-2\eta\mu_{\operatorname{lin}}+ \eta^2L_A^2\bigr) \lVert \theta_t-\theta^\star\rVert_2^2.\]

The stepsize condition gives

\[1-2\eta\mu_{\operatorname{lin}}+ \eta^2L_A^2 \le 1-\eta\mu_{\operatorname{lin}}.\]

Therefore

\[\boxed{ \lVert \theta_{t+1}-\theta^\star\rVert_2^2 \le (1-\eta\mu_{\operatorname{lin}}) \lVert \theta_t-\theta^\star\rVert_2^2. }\]

Iterating gives the population linear TD guarantee:

\[\boxed{ \lVert \theta_T-\theta^\star\rVert_2^2 \le (1-\eta\mu_{\operatorname{lin}})^T \lVert \theta_0-\theta^\star\rVert_2^2. }\]

This is exponentially fast in the population setting because the update is a deterministic contraction in parameter space.

The proof is short because nothing moves except $\theta_t$. The features stay fixed, the operator stays affine, and the strong monotonicity is global.

6. Neural TD: the model is nonlinear, but the proof freezes a linear class

Now consider a two-layer ReLU network

\[\widehat Q(x;W) = \frac{1}{\sqrt m} \sum_{r=1}^m b_r\sigma(W_r^{\texttt{T}} x),\]

where

\[\sigma(z)=\max\{z,0\}, \qquad W=(W_1,\ldots,W_m), \qquad W_r\in\mathbb{R}^d.\]

The second-layer signs $b_r\in{\pm 1}$ are usually fixed at initialization. The first-layer weights $W_r$ are trained.

The projected neural TD update has the form

\[W_{t+1} = \Pi_{S_B}\bigl(W_t-\eta\bar g_{\operatorname{NN}}(W_t)\bigr),\]

where the projection set is a ball around initialization:

\[S_B := \{W:\lVert W-W(0)\rVert_2\le B\}.\]

The population neural TD semigradient is

\[\bar g_{\operatorname{NN}}(W) := \mathbb{E}\bigl[ \delta_W(x,r,x')\nabla_W\widehat Q(x;W) \bigr],\]

with

\[\delta_W(x,r,x') := \widehat Q(x;W)-r-\gamma \widehat Q(x';W).\]

Unlike the linear case, this is not an affine map. The feature vector

\[\nabla_W\widehat Q(x;W)\]

changes with $W$. Thus the geometry itself moves during training.

This is exactly why the neural proof needs a different idea.

7. The frozen random-feature class

At initialization, define the frozen tangent feature map

\[\Phi_0(x) := \nabla_W\widehat Q(x;W)\big|_{W=W(0)}.\]

For the two-layer ReLU network, this is

\[\Phi_0(x) = \frac{1}{\sqrt m} \begin{bmatrix} b_1\mathbf{1}\{W_1(0)^{\texttt{T}} x>0\}x \\ \vdots \\ b_m\mathbf{1}\{W_m(0)^{\texttt{T}} x>0\}x \end{bmatrix}.\]

The corresponding frozen linearized model is

\[\widehat Q_0(x;W) := \widehat Q(x;W(0))+ \langle \Phi_0(x), W-W(0)\rangle.\]

Ignoring the constant initialization term, this is a linear model in the parameter displacement $W-W(0)$. Therefore the proof introduces the random feature class

\[\mathcal{F}_{B,m} := \left\{ x\mapsto \frac{1}{\sqrt m}\sum_{r=1}^m b_r\mathbf{1}\{W_r(0)^{\texttt{T}} x>0\} W_r^{\texttt{T}} x : W\in S_B \right\}.\]

This class is random because it depends on the initialization. But conditional on initialization, it is a linear class: the ReLU gates

\[\mathbf{1}\{W_r(0)^{\texttt{T}} x>0\}\]

are frozen.

This is the central conceptual distinction:

Method	Class used in the projected Bellman equation	Geometry
Linear TD	$\mathcal{F}_{\operatorname{lin}}={x\mapsto\phi(x)^{\texttt{T}}\theta}$	deterministic fixed features
Neural TD	$\mathcal{F}_{B,m}$	random frozen features from initialization

The neural theorem is not simply saying that an arbitrary nonlinear network finds a global TD solution. It is saying that, in the overparameterized and near-initialization regime, the nonlinear TD dynamics track a projected TD method over the frozen random-feature class $\mathcal{F}_{B,m}$.

8. What is the neural target $W^\star$?

The neural proof introduces a reference parameter $W^\star\in S_B$. This object should be interpreted carefully.

It is not necessarily a global minimizer over the entire nonlinear neural-network class. Instead, $W^\star$ represents a projected Bellman solution in the frozen random-feature class.

In the clean population form, it satisfies the variational inequality

\[\left\langle \widehat Q_0(\cdot;W^\star)-\mathcal{T}^{\pi}\widehat Q_0(\cdot;W^\star), \widehat Q_0(\cdot;W)-\widehat Q_0(\cdot;W^\star) \right\rangle_\mu \ge 0\]

for all $W\in S_B$.

Equivalently,

\[\widehat Q_0(\cdot;W^\star) = \Pi_{\mathcal{F}_{B,m}} \mathcal{T}^{\pi} \widehat Q_0(\cdot;W^\star).\]

Thus the target of the analysis is the projected Bellman fixed point over $\mathcal{F}_{B,m}$, not an arbitrary nonlinear neural optimum.

9. The frozen neural proof is basically linear TD in function space

To see why the proof resembles linear TD, temporarily pretend that the algorithm uses the frozen model $\widehat Q_0$ exactly. Define the frozen residual

\[\delta_0(W;x,r,x') := \widehat Q_0(x;W)-r- \gamma\widehat Q_0(x';W).\]

The frozen population semigradient is

\[\bar g_0(W) := \mathbb{E}\bigl[ \delta_0(W;x,r,x')\Phi_0(x) \bigr].\]

Because $\widehat Q_0$ is linear in $W-W(0)$, this is the analogue of $A\theta-b$ in linear TD.

The key one-point inequality is

\[\left\langle \bar g_0(W), W-W^\star \right\rangle \ge (1-\gamma) \left\lVert \widehat Q_0(\cdot;W)-\widehat Q_0(\cdot;W^\star) \right\rVert_\mu^2.\]

Let us prove it, because this is the heart of the comparison.

Set

\[h_W := \widehat Q_0(\cdot;W)-\widehat Q_0(\cdot;W^\star).\]

Since $\widehat Q_0$ is linear in $W$,

\[\langle \bar g_0(W),W-W^\star\rangle = \left\langle \widehat Q_0(\cdot;W)-\mathcal{T}^{\pi}\widehat Q_0(\cdot;W), h_W \right\rangle_\mu.\]

Add and subtract the residual at $W^\star$:

\[\begin{aligned} &\left\langle \widehat Q_0(W)-\mathcal{T}^{\pi}\widehat Q_0(W), \widehat Q_0(W)-\widehat Q_0(W^\star) \right\rangle_\mu \\ &= \left\langle \widehat Q_0(W)-\widehat Q_0(W^\star) - \gamma P^\pi(\widehat Q_0(W)-\widehat Q_0(W^\star)), h_W \right\rangle_\mu \\ &\quad+ \left\langle \widehat Q_0(W^\star)-\mathcal{T}^{\pi}\widehat Q_0(W^\star), h_W \right\rangle_\mu. \end{aligned}\]

The second term is nonnegative by the projected Bellman variational inequality for $W^\star$. The first term satisfies

\[\begin{aligned} \left\langle h_W-\gamma P^\pi h_W,h_W\right\rangle_\mu &= \mathbb{E}[h_W(x)^2]-\gamma\mathbb{E}[h_W(x)h_W(x')] \\ &\ge (1-\gamma)\lVert h_W\rVert_\mu^2. \end{aligned}\]

Therefore

\[\boxed{ \langle \bar g_0(W),W-W^\star\rangle \ge (1-\gamma)\lVert \widehat Q_0(W)-\widehat Q_0(W^\star)\rVert_\mu^2. }\]

This is the neural analogue of strong monotonicity, but it is not Euclidean strong monotonicity in $W$. It controls prediction error in the frozen function class.

10. Telescoping the frozen update

Suppose the frozen semigradient is bounded:

\[\lVert \bar g_0(W)\rVert_2\le G_0, \qquad \forall W\in S_B.\]

The projected frozen update is

\[W_{t+1} = \Pi_{S_B}\bigl(W_t-\eta\bar g_0(W_t)\bigr).\]

Projection is nonexpansive, so

\[\begin{aligned} \lVert W_{t+1}-W^\star\rVert_2^2 &\le \lVert W_t-W^\star-\eta\bar g_0(W_t)\rVert_2^2 \\ &= \lVert W_t-W^\star\rVert_2^2 -2\eta\langle \bar g_0(W_t),W_t-W^\star\rangle +\eta^2\lVert \bar g_0(W_t)\rVert_2^2. \end{aligned}\]

Using the one-point inequality,

\[2\eta(1-\gamma) \lVert \widehat Q_0(W_t)-\widehat Q_0(W^\star)\rVert_\mu^2 \le \lVert W_t-W^\star\rVert_2^2 - \lVert W_{t+1}-W^\star\rVert_2^2 + \eta^2G_0^2.\]

Summing from $t=0$ to $T-1$,

\[\frac1T\sum_{t=0}^{T-1} \lVert \widehat Q_0(W_t)-\widehat Q_0(W^\star)\rVert_\mu^2 \le \frac{\lVert W_0-W^\star\rVert_2^2}{2\eta(1-\gamma)T} + \frac{\eta G_0^2}{2(1-\gamma)}.\]

Choosing $\eta\asymp T^{-1/2}$ gives a standard $O(T^{-1/2})$ averaged bound in this generic form. In the population neural TD theorem of Cai, Yang, Lee, and Wang, a sharper choice of projection radius and stepsize yields an averaged population error of order

\[O\left(\frac{B^2}{(1-\gamma)^2T}\right)\]

plus finite-width errors.

The structural point is more important than the exact constant: the frozen proof controls average prediction error, not Euclidean convergence to a unique neural parameter.

11. Returning to the actual neural network

The actual neural TD update does not use the frozen semigradient $\bar g_0$. It uses

\[\bar g_{\operatorname{NN}}(W) = \mathbb{E}\bigl[ \delta_W(x,r,x')\nabla_W\widehat Q(x;W) \bigr].\]

Thus we must justify replacing the nonlinear network by its frozen linearization. This is the part that has no analogue in ordinary linear TD.

The required finite-width control has two forms.

First, the actual network output must stay close to its frozen linearization:

\[\widehat Q(x;W) \approx \widehat Q_0(x;W), \qquad W\in S_B.\]

Second, the actual neural semigradient must satisfy an approximate one-point inequality:

\[\langle \bar g_{\operatorname{NN}}(W),W-W^\star\rangle \ge (1-\gamma) \lVert \widehat Q_0(W)-\widehat Q_0(W^\star)\rVert_\mu^2 - \Delta_{m,B}.\]

Here $\Delta_{m,B}$ is a finite-width error term that vanishes as the width $m$ grows, provided the projection radius $B$ is chosen appropriately. In the standard ReLU analysis, terms of the form

\[O(B^3m^{-1/2}+B^{5/2}m^{-1/4})\]

appear.

This is the exact place where overparameterization enters. The network is wide enough that, inside the ball $S_B$, not too many ReLU gates change, and the nonlinear dynamics remain close to the random-feature dynamics generated at initialization.

12. Population projected neural TD: modular theorem

The following statement captures the population neural result in the clean modular form.

Assume:

$W_t\in S_B$ because of projection,
there exists $W^\star\in S_B$ solving the projected Bellman equation over $\mathcal{F}_{B,m}$,
the approximate one-point inequality holds:

\[\langle \bar g_{\operatorname{NN}}(W),W-W^\star\rangle \ge (1-\gamma) \lVert \widehat Q_0(W)-\widehat Q_0(W^\star)\rVert_\mu^2 - \Delta_{m,B},\]

the neural semigradient norm is bounded:

\[\lVert \bar g_{\operatorname{NN}}(W)\rVert_2^2 \le G_{m,B}^2, \qquad \forall W\in S_B.\]

Then the projected update

\[W_{t+1} = \Pi_{S_B}\bigl(W_t-\eta\bar g_{\operatorname{NN}}(W_t)\bigr)\]

satisfies

\[\begin{aligned} \frac1T\sum_{t=0}^{T-1} \lVert \widehat Q_0(W_t)-\widehat Q_0(W^\star)\rVert_\mu^2 &\le \frac{\lVert W_0-W^\star\rVert_2^2}{2\eta(1-\gamma)T} +\frac{\eta G_{m,B}^2}{2(1-\gamma)} +\frac{\Delta_{m,B}}{1-\gamma}. \end{aligned}\]

This proof is exactly the same telescoping skeleton as the frozen proof, except for the additive finite-width error $\Delta_{m,B}$.

With the overparameterization estimates from the neural TD paper, this becomes a bound of the form

\[\mathbb{E}_{\operatorname{init},\mu} \Bigl[ (\widehat Q_{\operatorname{out}}(x)-\widehat Q_0(x;W^\star))^2 \Bigr] \le \frac{16B^2}{(1-\gamma)^2T} + O\left(B^3m^{-1/2}+B^{5/2}m^{-1/4}\right).\]

The output is usually an averaged iterate, selected iterate, or averaged prediction depending on the exact theorem statement.

13. Side-by-side proof comparison

Here is the cleanest way to compare the two arguments.

Item	Linear TD	Neural TD
Approximation	$Q_\theta(x)=\phi(x)^{\texttt{T}}\theta$	$\widehat Q(x;W)=m^{-1/2}\sum_r b_r\sigma(W_r^{\texttt{T}} x)$
Feature map	fixed $\phi(x)$	moving $\nabla_W\widehat Q(x;W)$
Class in projected Bellman equation	$\mathcal{F}_{\operatorname{lin}}$	frozen random-feature class $\mathcal{F}_{B,m}$
Population direction	$A\theta-b$	$\mathbb{E}[\delta_W\nabla_W\widehat Q(x;W)]$
Main inequality	Euclidean strong monotonicity	one-point monotonicity in frozen prediction norm
Projection role	keeps $\theta_t\in\Theta$	keeps $W_t$ near initialization
Convergence statement	parameter contraction	averaged function error
Extra neural ingredient	none	finite-width/local-linearization control

The common proof skeleton is:

\[\text{projection nonexpansiveness} \quad+ \quad \text{one-point inequality} \quad+ \quad \text{telescoping}.\]

But the one-point inequality means different things.

For linear TD:

\[\langle A(\theta-\theta^\star),\theta-\theta^\star\rangle \ge \mu_{\operatorname{lin}}\lVert\theta-\theta^\star\rVert_2^2.\]

For neural TD:

\[\langle \bar g_{\operatorname{NN}}(W),W-W^\star\rangle \ge (1-\gamma) \lVert \widehat Q_0(W)-\widehat Q_0(W^\star)\rVert_\mu^2 - \Delta_{m,B}.\]

Linear TD contracts in parameter space. Neural TD is controlled in prediction space, relative to the frozen random-feature solution.

14. Why the final rates look different

In the population linear case, if the operator is globally strongly monotone, the deterministic recursion contracts:

\[\lVert\theta_T-\theta^\star\rVert_2^2 \le (1-\eta\mu_{\operatorname{lin}})^T \lVert\theta_0-\theta^\star\rVert_2^2.\]

This is an exponential population rate.

In the neural case, the theorem is usually stated in averaged function error:

\[\mathbb{E}_{\operatorname{init},\mu} \Bigl[ (\widehat Q_{\operatorname{out}}(x)-\widehat Q_0(x;W^\star))^2 \Bigr] \le \frac{16B^2}{(1-\gamma)^2T} + O\left(B^3m^{-1/2}+B^{5/2}m^{-1/4}\right).\]

This is not a contradiction. The neural proof is not proving a simple Euclidean contraction of $W_t$ to a unique nonlinear parameter. It is proving that the neural predictions approach the projected Bellman solution in the frozen random-feature class, up to finite-width error.

That difference is fundamental.

15. The mental model

The comparison can be remembered as follows.

Linear TD

Choose features once:

\[\phi(x).\]

This defines a fixed approximation class:

\[\mathcal{F}_{\operatorname{lin}} = \{x\mapsto\phi(x)^{\texttt{T}}\theta\}.\]

The TD operator is affine:

\[\bar g_{\operatorname{lin}}(\theta)=A\theta-b.\]

If $A$ is strongly monotone, the population projected update contracts.

Neural TD

Initialize a wide network. The initialization creates random gates:

\[\mathbf{1}\{W_r(0)^{\texttt{T}} x>0\}.\]

These gates define a frozen random-feature class:

\[\mathcal{F}_{B,m}.\]

Projection keeps the network close to initialization, so the actual nonlinear network behaves approximately like this frozen linear class. The proof is therefore:

\[\text{actual neural TD} \approx \text{frozen random-feature TD} \approx \text{linear TD in }\mathcal{F}_{B,m}.\]

This is the main conceptual bridge between linear TD and neural TD.

16. What this comparison teaches

The important lesson is not merely that neural TD uses a bigger function approximator. The important lesson is that the proof changes the object of study.

Linear TD studies a deterministic class chosen by the analyst:

\[\mathcal{F}_{\operatorname{lin}}.\]

Neural TD studies a random class generated by initialization:

\[\mathcal{F}_{B,m}.\]

The overparameterized network is useful because it makes this random class rich while keeping the training trajectory close enough to initialization for the frozen-class analysis to remain valid.

So the neural theorem should not be read as a generic nonconvex optimization miracle. It should be read as a carefully controlled random-feature TD theorem, with an additional argument showing that the nonlinear ReLU network remains close to that random-feature model.

That is the real difference between projected linear TD and projected neural TD.

References

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. COLT, 2018.
Qi Cai, Zhuoran Yang, Jason D. Lee, and Zhaoran Wang. Neural Temporal-Difference Learning Converges to Global Optima. NeurIPS, 2019.
Aritra Mitra. A Simple Finite-Time Analysis of TD Learning with Linear Function Approximation. arXiv, 2024.
John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.

rl-blogs

This post is licensed under CC BY 4.0 by the author.

1. The common policy-evaluation problem

2. Projection and MSPBE: the real object of comparison

3. Linear TD: fixed features, fixed geometry

4. What is the linear TD fixed point?

5. Population proof for projected linear TD

6. Neural TD: the model is nonlinear, but the proof freezes a linear class

7. The frozen random-feature class

8. What is the neural target $W^\star$?

9. The frozen neural proof is basically linear TD in function space

10. Telescoping the frozen update

11. Returning to the actual neural network

12. Population projected neural TD: modular theorem

13. Side-by-side proof comparison

14. Why the final rates look different

15. The mental model

Linear TD

Neural TD

16. What this comparison teaches

References

Trending Tags