Why Do Neural TD Converge ?

Posted Jun 2, 2026

By Sreejeet Maity

19 min read

Temporal-difference learning is one of the most fundamental algorithms in reinforcement learning. In the tabular setting, TD learning is simple, elegant, and well understood. However, modern deep reinforcement learning almost always combines TD-style bootstrapping with neural networks. This gives rise to neural TD, DQN, actor-critic methods, soft actor-critic, and many other algorithms.

This combination is powerful, but theoretically delicate. TD is already not ordinary stochastic gradient descent. It uses a semi-gradient update, where the bootstrapped target is treated as fixed. Neural networks then make the parameterization nonlinear and nonconvex. In fact, nonlinear TD can diverge in general.

The paper by Cai, Yang, Lee, and Wang, “Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima” [Cai et al., 2020], addresses exactly this issue. The paper proves that, under overparameterization, neural TD converges globally to the optimum of the mean-squared projected Bellman error. It also proves an analogous convergence result for neural \(Q\)-Learning under stronger assumptions.

The main message of the paper is:

\[\boxed{ \text{Overparameterized neural TD behaves like linear TD in a random-feature space.} }\]

This blog post explains the paper in detail.

1. Policy Evaluation Setup

We begin with an MDP

\[(\mathcal S,\mathcal A,P,r,\gamma),\]

where \(\mathcal S\) is the state space, \(\mathcal A\) is the action space, \(P(\cdot\mid s,a)\) is the transition kernel, \(r(s,a)\) is the reward, and \(\gamma\in(0,1)\) is the discount factor.

A policy \(\pi\) maps each state to a distribution over actions:

\[\pi(\cdot\mid s)\in \Delta(\mathcal A).\]

The paper first focuses on policy evaluation. Therefore, the policy \(\pi\) is fixed.

For a fixed policy \(\pi\), the action-value function is

\[Q^\pi(s,a) = \mathbb E \left[ \sum_{t=0}^{\infty}\gamma^t r_t \mid s_0=s,\ a_0=a,\ a_t\sim \pi(\cdot\mid s_t) \right].\]

The Bellman evaluation operator is

\[\mathcal T^\pi Q(s,a) = \mathbb E \left[ r(s,a)+\gamma Q(s',a') \mid s'\sim P(\cdot\mid s,a),\ a'\sim \pi(\cdot\mid s') \right].\]

The true value function \(Q^\pi\) satisfies the Bellman fixed-point equation

\[Q^\pi = \mathcal T^\pi Q^\pi.\]

Thus, policy evaluation is a fixed-point problem.

2. From Bellman Equation to Optimization

Suppose we use a parametric function approximator

\[\widehat Q_\theta(s,a).\]

A natural idea is to minimize the Bellman residual

\[\widehat Q_\theta(s,a) - \mathcal T^\pi \widehat Q_\theta(s,a).\]

This leads to the mean-squared Bellman error, or MSBE:

\[\operatorname{MSBE}(\theta) = \mathbb E_{(s,a)\sim \mu} \left[ \widehat Q_\theta(s,a) - \mathcal T^\pi \widehat Q_\theta(s,a) \right]^2.\]

Here \(\mu\) is the stationary distribution of the state-action pair \((s,a)\) under policy \(\pi\).

\[\operatorname{MSBE}(\theta)=0,\]

then

\[\widehat Q_\theta = \mathcal T^\pi \widehat Q_\theta.\]

In that case, assuming the Bellman fixed point is unique,

\[\widehat Q_\theta = Q^\pi.\]

Therefore, MSBE is conceptually the cleanest objective.

However, it is difficult to optimize from samples.

The Bellman operator contains a conditional expectation:

\[\mathcal T^\pi \widehat Q_\theta(s,a) = \mathbb E \left[ r+\gamma \widehat Q_\theta(s',a') \mid s,a \right].\]

Thus,

\[\operatorname{MSBE}(\theta) = \mathbb E_{(s,a)\sim \mu} \left[ \widehat Q_\theta(s,a) - \mathbb E [ r+\gamma \widehat Q_\theta(s',a') \mid s,a ] \right]^2.\]

The conditional expectation is inside the square. If we observe only one next-state sample \(s'\), then

\[r+\gamma \widehat Q_\theta(s',a')\]

is an unbiased estimate of

\[\mathcal T^\pi \widehat Q_\theta(s,a),\]

but

\[\left( \widehat Q_\theta(s,a) - r - \gamma \widehat Q_\theta(s',a') \right)^2\]

is not an unbiased estimate of

\[\left( \widehat Q_\theta(s,a) - \mathcal T^\pi \widehat Q_\theta(s,a) \right)^2.\]

This is because, in general,

\[\mathbb E[Z^2]\neq (\mathbb E[Z])^2.\]

This issue is known as the double-sampling problem.

3. MSPBE: The Projected Bellman Objective

With function approximation, even if

\[\widehat Q_\theta \in \mathcal F,\]

we generally do not have

\[\mathcal T^\pi \widehat Q_\theta \in \mathcal F.\]

The Bellman operator may take us outside the function class. Therefore, instead of directly solving

\[Q=\mathcal T^\pi Q,\]

we solve the projected Bellman equation

\[Q=\Pi_{\mathcal F}\mathcal T^\pi Q.\]

Here \(\Pi_{\mathcal F}\) denotes projection onto the function class \(\mathcal F\) under the \(L^2(\mu)\) geometry.

Define

\[\langle f,g\rangle_\mu = \mathbb E_{(s,a)\sim \mu} [f(s,a)g(s,a)]\]

and

\[\|f\|_\mu^2 = \mathbb E_{(s,a)\sim \mu} [f(s,a)^2].\]

Then

\[\Pi_{\mathcal F}h = \arg\min_{f\in\mathcal F} \|f-h\|_\mu^2.\]

The mean-squared projected Bellman error, or MSPBE, is

\[\operatorname{MSPBE}(\theta) = \mathbb E_{(s,a)\sim \mu} \left[ \widehat Q_\theta(s,a) - \Pi_{\mathcal F}\mathcal T^\pi \widehat Q_\theta(s,a) \right]^2.\]

The goal of the paper is to show that neural TD converges globally to the optimum of this MSPBE, under overparameterization.

4. TD Is a Semi-Gradient Method

The TD residual is

\[\delta_\theta(s,a,r,s',a') = \widehat Q_\theta(s,a) - r - \gamma \widehat Q_\theta(s',a').\]

The TD update is

\[\theta' = \theta - \eta \delta_\theta(s,a,r,s',a') \nabla_\theta \widehat Q_\theta(s,a).\]

Equivalently, using the usual TD-error sign,

\[\theta' = \theta + \eta \left( r+\gamma \widehat Q_\theta(s',a') - \widehat Q_\theta(s,a) \right) \nabla_\theta \widehat Q_\theta(s,a).\]

This looks like stochastic gradient descent, but it is not exact SGD.

If we were minimizing the sampled squared Bellman residual

\[\ell(\theta) = \frac12 \left( \widehat Q_\theta(s,a) - r - \gamma \widehat Q_\theta(s',a') \right)^2,\]

then the true gradient would be

\[\nabla_\theta \ell(\theta) = \delta_\theta \left[ \nabla_\theta \widehat Q_\theta(s,a) - \gamma \nabla_\theta \widehat Q_\theta(s',a') \right].\]

Therefore, the full-gradient update would be

\[\theta' = \theta - \eta \delta_\theta \left[ \nabla_\theta \widehat Q_\theta(s,a) - \gamma \nabla_\theta \widehat Q_\theta(s',a') \right].\]

TD drops the second term

\[-\gamma \nabla_\theta \widehat Q_\theta(s',a').\]

So TD treats the target

\[r+\gamma \widehat Q_\theta(s',a')\]

as fixed during each update. This is why TD is called a semi-gradient method.

The expected TD direction is

\[\mathbb E_\mu \left[ \delta_\theta(s,a,r,s',a') \nabla_\theta \widehat Q_\theta(s,a) \right].\]

Conditioning on \((s,a)\), we get

\[\mathbb E_\mu \left[ \delta_\theta(s,a,r,s',a') \nabla_\theta \widehat Q_\theta(s,a) \right] = \mathbb E_{(s,a)\sim \mu} \left[ \left( \widehat Q_\theta(s,a) - \mathcal T^\pi \widehat Q_\theta(s,a) \right) \nabla_\theta \widehat Q_\theta(s,a) \right].\]

This is not the gradient of MSBE.

That is one of the main technical obstacles in the paper.

5. Linear TD as the Guiding Case

To understand the neural case, it is useful to recall linear TD.

Suppose

\[\widehat Q_\theta(x)=\phi(x)^{\texttt{T}} \theta,\]

where \(x=(s,a)\).

The TD update is

\[\theta_{t+1} = \theta_t - \eta \left( \phi(x_t)^{\texttt{T}}\theta_t - r_t - \gamma \phi(x_t')^{\texttt{T}}\theta_t \right) \phi(x_t).\]

The expected update direction is

\[g(\theta) = \mathbb E_\mu \left[ \left( \phi(x)^{\texttt{T}}\theta - r - \gamma \phi(x')^{\texttt{T}}\theta \right) \phi(x) \right].\]

At a fixed point,

\[g(\theta^\star)=0.\]

Thus,

\[\mathbb E_\mu \left[ \left( \phi(x)^{\texttt{T}}\theta^\star - r - \gamma \phi(x')^{\texttt{T}}\theta^\star \right) \phi(x) \right]=0.\]

Rearranging,

\[\mathbb E_\mu [ \phi(x)(\phi(x)-\gamma \phi(x'))^{\texttt{T}} ] \theta^\star = \mathbb E_\mu[r\phi(x)].\]

Define

\[A = \mathbb E_\mu [ \phi(x)(\phi(x)-\gamma \phi(x'))^{\texttt{T}} ],\]

and

\[b = \mathbb E_\mu[r\phi(x)].\]

Then the TD fixed point satisfies

\[A\theta^\star=b.\]

This is equivalent to the projected Bellman equation

\[\widehat Q_{\theta^\star} = \Pi_{\mathcal F}\mathcal T^\pi \widehat Q_{\theta^\star}.\]

So even though TD is not exact gradient descent on MSBE, linear TD has a clean projected Bellman fixed-point structure.

The paper’s strategy is to show that overparameterized neural TD behaves like linear TD in a random-feature space.

6. Neural TD Parameterization

The paper represents the state-action pair by a vector

\[x=\psi(s,a)\in\mathcal X\subseteq \mathbb R^d.\]

It assumes

\[\|x\|_2=1,\]

and the reward is bounded:

\[|r(x)|\le r.\]

The \(Q\)-function is parameterized by a two-layer ReLU network:

\[\widehat Q(x;W) = \frac{1}{\sqrt m} \sum_{r=1}^m b_r \sigma(W_r^{\texttt{T}} x),\]

where

\[\sigma(z)=\max\{0,z\}.\]

The initialization is

\[b_r\sim \operatorname{Unif}(\{-1,1\}),\]

and

\[W_r(0)\sim \mathcal N(0,I_d/d).\]

The output weights \(b_r\) are fixed after initialization. Only the first-layer weights

\[W=(W_1,\dots,W_m)\]

are updated.

The algorithm also uses a projection ball around initialization:

\[S_B = \left\{ W\in\mathbb R^{md}: \|W-W(0)\|_2\le B \right\}.\]

This projection keeps the parameters close to initialization. In practical deep RL, this projection is not usually present, but in the theory it is crucial. It ensures that the network stays in the lazy-training regime.

7. Neural TD Algorithm

At iteration \(t\), the algorithm samples

\[(s_t,a_t,r_t,s_t',a_t')\sim \mu.\]

Let

\[x_t=(s_t,a_t), \qquad x_t'=(s_t',a_t').\]

The Bellman residual is

\[\delta_t = \widehat Q(x_t;W(t)) - r_t - \gamma \widehat Q(x_t';W(t)).\]

The raw TD update is

\[\widetilde W(t+1) = W(t) - \eta \delta_t \nabla_W \widehat Q(x_t;W(t)).\]

Then the algorithm projects back to \(S_B\):

\[W(t+1) = \arg\min_{W\in S_B} \|W-\widetilde W(t+1)\|_2.\]

Finally, it uses iterate averaging:

\[\overline W \leftarrow \frac{t+1}{t+2}\overline W + \frac{1}{t+2}W(t+1).\]

The output is

\[\widehat Q_{\mathrm{out}}(\cdot) = \widehat Q(\cdot;\overline W).\]

This is the neural TD algorithm studied in the paper.

8. Why ReLU Makes Local Linearization Possible

For ReLU,

\[\sigma(z)=\max\{0,z\}.\]

For \(z\neq 0\),

\[\sigma(z)=\mathbf 1\{z>0\}z.\]

Therefore,

\[\sigma(W_r^{\texttt{T}} x) = \mathbf 1\{W_r^{\texttt{T}} x>0\}W_r^{\texttt{T}} x.\]

The neural network can be written as

\[\widehat Q(x;W) = \frac{1}{\sqrt m} \sum_{r=1}^m b_r \mathbf 1\{W_r^{\texttt{T}} x>0\} W_r^{\texttt{T}} x.\]

The nonlinearity is entirely contained in the activation gates

\[\mathbf 1\{W_r^{\texttt{T}} x>0\}.\]

If the weights remain close to initialization, then for most \(x\),

\[\mathbf 1\{W_r(t)^{\texttt{T}} x>0\} \approx \mathbf 1\{W_r(0)^{\texttt{T}} x>0\}.\]

Thus, the network behaves like the frozen-gate model

\[\widehat Q_0(x;W) = \frac{1}{\sqrt m} \sum_{r=1}^m b_r \mathbf 1\{W_r(0)^{\texttt{T}} x>0\} W_r^{\texttt{T}} x.\]

This model is linear in \(W\).

Define the random feature map

\[\Phi(x) = \frac{1}{\sqrt m} \left( b_1\mathbf 1\{W_1(0)^{\texttt{T}} x>0\}x, \dots, b_m\mathbf 1\{W_m(0)^{\texttt{T}} x>0\}x \right).\]

Then

\[\widehat Q_0(x;W)=\Phi(x)^{\texttt{T}} W.\]

Therefore, the overparameterized ReLU network behaves like a linear random-feature model near initialization.

9. The Linearized Function Class

The linearized function class is

\[\mathcal F_{B,m} = \left\{ \frac{1}{\sqrt m} \sum_{r=1}^m b_r \mathbf 1\{W_r(0)^{\texttt{T}} x>0\} W_r^{\texttt{T}} x : W\in S_B \right\}.\]

This is a random function class because it depends on the initialization.

Define the linearized Bellman residual

\[\delta_0(x,r,x';W) = \widehat Q_0(x;W) - r - \gamma \widehat Q_0(x';W).\]

The approximate stationary point \(W^\star\) is defined by the variational inequality

\[\mathbb E_\mu [ \delta_0(x,r,x';W^\star) \nabla_W \widehat Q_0(x;W^\star) ]^{\texttt{T}} (W-W^\star) \ge 0\]

for every

\[W\in S_B.\]

This is the constrained analogue of stationarity.

The associated function

\[\widehat Q_0(\cdot;W^\star)\]

is the unique projected Bellman fixed point in \(\mathcal F_{B,m}\):

\[\widehat Q_0(\cdot;W^\star) = \Pi_{\mathcal F_{B,m}} \mathcal T^\pi \widehat Q_0(\cdot;W^\star).\]

Thus, \(\widehat Q_0(\cdot;W^\star)\) is the global optimum of the MSPBE over the linearized function class.

10. Regularity Assumption

The paper assumes a regularity condition on the stationary distribution \(\mu\).

There exists a constant \(c_0>0\) such that for any \(\tau\ge 0\) and any unit vector \(w\in\mathbb R^d\),

\[\mathbb P_{x\sim \mu}\left(|w^{\texttt{T}} x|\le \tau\right)\le c_0\tau.\]

This says that the distribution of \(x\) does not put too much probability mass near any hyperplane.

Why is this needed?

A ReLU gate changes when

\[W_r^{\texttt{T}} x\]

crosses zero. If many data points lie close to the hyperplane

\[W_r^{\texttt{T}} x = 0,\]

then tiny movements of \(W_r\) can change many activation patterns.

The regularity assumption prevents this. It ensures that, if the weights move only slightly, then only a small fraction of activation gates change.

ReLU Gate Instability

In 3D, the ReLU boundary \(w^\top x = 0\) is a plane. A point changes its activation when it crosses from one side of this plane to the other. The regularity assumption prevents too many data points from lying close to this boundary.

active before and after inactive before and after gate flipped near the margin slab

Drag the plot to rotate the 3D view. The amber plane is the original ReLU boundary. The dashed red plane is the boundary after a small movement of the weight vector.

Data distribution Weight azimuth

horizontal direction of \(w\) 35°

Weight elevation

vertical direction of \(w\) 20°

Weight perturbation

small movement of \(w\) 8°

Margin slab width

\(|w^\top x| \le r\) 0.08

Number of data points

sample size 650

points inside the margin slab near \(w^\top x = 0\).

ReLU gates that changed after perturbing the weight.

Try switching to the bad distribution.

11. Local Linearization Lemma

The key technical result is that the neural network remains close to its frozen-gate linearization.

For every \(t\),

\[\mathbb E_{\mathrm{init},\mu} \left[ \left| \widehat Q(x;W(t)) - \widehat Q_0(x;W(t)) \right|^2 \right] \le 4c_1B^3m^{-1/2}.\]

This means

\[\widehat Q(x;W(t)) \approx \widehat Q_0(x;W(t))\]

along the entire training trajectory.

The difference is

\[\widehat Q(x;W) - \widehat Q_0(x;W) = \frac{1}{\sqrt m}\sum_{r=1}^m b_r\left[\mathbf 1\{W_r^{\texttt{T}} x>0\}-\mathbf 1\{W_r(0)^{\texttt{T}} x>0\}\right]W_r^{\texttt{T}} x.\]

This difference is nonzero only when the ReLU gate changes:

\[\mathbf 1\{W_r^{\texttt{T}} x>0\} \neq \mathbf 1\{W_r(0)^{\texttt{T}} x>0\}.\]

The projection ensures

\[\|W-W(0)\|_2\le B.\]

In a wide network, the average movement per neuron is small. The regularity assumption ensures that small movements cause few sign flips. Hence the nonlinear network is close to its linearization.

This is the mathematical core of the lazy-training argument.

12. Local Linearization of the TD Direction

Define the population neural TD semigradient

\[\overline g(t) = \mathbb E_\mu [ \delta(x,r,x';W(t)) \nabla_W \widehat Q(x;W(t)) ].\]

Define the linearized population semigradient

\[\overline g_0(t) = \mathbb E_\mu [ \delta_0(x,r,x';W(t)) \nabla_W \widehat Q_0(x;W(t)) ].\]

The paper proves that these directions are close:

\[\mathbb E_{\mathrm{init}} \left[ \|\overline g(t)-\overline g_0(t)\|_2^2 \right] \le C(B,r)m^{-1/2},\]

where

\[C(B,r) = 56c_1B^3+24c_2B+6c_1Br^2.\]

Therefore,

\[\overline g(t)\approx \overline g_0(t).\]

So not only does the neural network remain close to its linearization, but the TD update direction also remains close to the linearized TD direction.

This is what allows the proof to transfer linear TD behavior to neural TD.

13. One-Point Monotonicity

The proof does not rely on convexity. Instead, it relies on a one-point monotonicity property.

Let

\[e_W(x) = \widehat Q_0(x;W)-\widehat Q_0(x;W^\star).\]

Since

\[\widehat Q_0(x;W)=\Phi(x)^{\texttt{T}} W,\]

we have

\[e_W(x)=\Phi(x)^{\texttt{T}}(W-W^\star).\]

The linearized population semigradient is

\[\overline g_0(W) = \mathbb E_\mu [ (\widehat Q_0(x;W)-r-\gamma \widehat Q_0(x';W)) \Phi(x) ].\]

Similarly,

\[\overline g_0(W^\star) = \mathbb E_\mu [ (\widehat Q_0(x;W^\star)-r-\gamma \widehat Q_0(x';W^\star)) \Phi(x) ].\]

Subtracting,

\[\overline g_0(W)-\overline g_0(W^\star) = \mathbb E_\mu [ (e_W(x)-\gamma e_W(x'))\Phi(x) ].\]

Now take inner product with \(W-W^\star\):

\[\left\langle \overline g_0(W)-\overline g_0(W^\star), W-W^\star \right\rangle = \mathbb E_\mu [ (e_W(x)-\gamma e_W(x')) \Phi(x)^{\texttt{T}}(W-W^\star) ].\]

But

\[\Phi(x)^{\texttt{T}}(W-W^\star)=e_W(x).\]

Therefore,

\[\left\langle \overline g_0(W)-\overline g_0(W^\star), W-W^\star \right\rangle = \mathbb E_\mu [ (e_W(x)-\gamma e_W(x'))e_W(x) ] =\mathbb E_\mu[e_W(x)^2]-\gamma \mathbb E_\mu[e_W(x')e_W(x)].\]

By Cauchy-Schwarz,

\[\mathbb E_\mu[e_W(x')e_W(x)] \le \sqrt{\mathbb E[e_W(x')^2]} \sqrt{\mathbb E[e_W(x)^2]}.\]

Under stationarity,

\[x\sim \mu, \qquad x'\sim \mu.\]

Thus,

\[\mathbb E[e_W(x')^2] = \mathbb E[e_W(x)^2].\]

Therefore,

\[\mathbb E_\mu[e_W(x')e_W(x)] \le \mathbb E_\mu[e_W(x)^2].\]

Hence,

\[\left\langle \overline g_0(W)-\overline g_0(W^\star), W-W^\star \right\rangle \ge (1-\gamma) \mathbb E_\mu[e_W(x)^2].\]

Because \(W^\star\) satisfies the constrained stationarity condition,

\[\left\langle \overline g_0(W^\star), W-W^\star \right\rangle \ge 0.\]

Therefore,

\[\left\langle \overline g_0(W), W-W^\star \right\rangle \ge (1-\gamma) \mathbb E_\mu \left[ \left( \widehat Q_0(x;W)-\widehat Q_0(x;W^\star) \right)^2 \right].\]

This is the key progress inequality.

In convex optimization, we often use

\[\langle \nabla f(W), W-W^\star\rangle \ge f(W)-f(W^\star).\]

Here, the TD semigradient is not a true gradient. Nevertheless, it still points toward \(W^\star\) in this one-point sense.

This replaces convexity in the proof.

14. Population Descent Lemma

For the population update,

\[\widetilde W(t+1) = W(t)-\eta \overline g(t),\]

and projection gives

\[W(t+1)=\Pi_{S_B}(\widetilde W(t+1)).\]

Projection onto a convex set is non-expansive, so

\[\|W(t+1)-W^\star\|_2^2 \le \|W(t)-\eta \overline g(t)-W^\star\|_2^2.\]

Expanding,

\[\|W(t+1)-W^\star\|_2^2 \le \|W(t)-W^\star\|_2^2 - 2\eta \langle \overline g(t), W(t)-W^\star \rangle + \eta^2\|\overline g(t)\|_2^2.\]

Decompose

\[\overline g(t) = \overline g_0(t) + (\overline g(t)-\overline g_0(t)).\]

The linearized term gives descent through one-point monotonicity:

\[\langle \overline g_0(t), W(t)-W^\star \rangle \ge (1-\gamma) \mathbb E_\mu [ (\widehat Q_0(x;W(t))-\widehat Q_0(x;W^\star))^2 ].\]

The difference

\[\overline g(t)-\overline g_0(t)\]

is the local linearization error.

The resulting descent inequality has the form

The negative term is the useful descent term. The final two terms are the price paid for replacing the nonlinear neural network by its linearization.

15. Population Convergence Rate

Set

\[\eta=\frac{1-\gamma}{8}.\]

Then

\[2\eta(1-\gamma)-8\eta^2 = \frac{(1-\gamma)^2}{8}.\]

So the descent inequality becomes

\[\|W(t+1)-W^\star\|_2^2 \le \|W(t)-W^\star\|_2^2 - \frac{(1-\gamma)^2}{8} \left\| \widehat Q_0(\cdot;W(t)) - \widehat Q_0(\cdot;W^\star) \right\|_\mu^2 + \text{linearization error}.\]

Rearranging and summing over \(t\) gives a telescoping sum:

\[\frac1T \sum_{t=0}^{T-1} \left\| \widehat Q_0(\cdot;W(t)) - \widehat Q_0(\cdot;W^\star) \right\|_\mu^2 \le O\left( \frac{B^2}{(1-\gamma)^2T} \right) + \text{width error}.\]

The width error comes from the local linearization bounds and has order

\[O\left( B^3m^{-1/2} + B^{5/2}m^{-1/4} \right).\]

Using iterate averaging, the paper obtains

\[\mathbb E_{\mathrm{init},\mu} \left[ \widehat Q_{\mathrm{out}}(x) - \widehat Q_0(x;W^\star) \right]^2 \le \frac{16B^2}{(1-\gamma)^2T} + O\left( B^3m^{-1/2} + B^{5/2}m^{-1/4} \right).\]

Thus, the population update has rate

\[O(T^{-1})\]

plus finite-width error.

16. Stochastic TD Convergence Rate

For the actual stochastic update, define

\[g(t) = \delta(x,r,x';W(t)) \nabla_W \widehat Q(x;W(t)),\]

and

\[\overline g(t) = \mathbb E_\mu[g(t)].\]

Then

\[g(t)-\overline g(t)\]

is the stochastic semigradient noise.

The paper proves a variance bound:

\[\mathbb E_{\mathrm{init},\mu} \left[ \|g(t)-\overline g(t)\|_2^2 \right] \le \sigma_g^2,\]

where

\[\sigma_g^2=O(B^2).\]

The stochastic descent inequality contains one additional term:

\[\eta^2 \mathbb E [ \|g(t)-\overline g(t)\|_2^2 ].\]

Using the variance bound,

\[\eta^2 \mathbb E [ \|g(t)-\overline g(t)\|_2^2 ] \le \eta^2\sigma_g^2.\]

Choosing

\[\eta=\min\left\{\frac{1-\gamma}{8},\frac1{\sqrt T}\right\},\]

the stochastic noise contributes order

\[O(T^{-1/2}).\]

The final stochastic convergence result is

\[\mathbb E_{\mathrm{init},\mu} \left[ \widehat Q_{\mathrm{out}}(x) - \widehat Q_0(x;W^\star) \right]^2 \le \frac{16(B^2+\sigma_g^2)}{(1-\gamma)^2\sqrt T} + O\left( B^3m^{-1/2} + B^{5/2}m^{-1/4} \right).\]

Thus, stochastic neural TD converges at rate

\[O(T^{-1/2})\]

plus finite-width error.

17. Approximation to the True Value Function

The theorem gives convergence to

\[\widehat Q_0(\cdot;W^\star),\]

not directly to \(Q^\pi\).

The paper then relates this projected fixed point to the true value function.

It proves

\[\|\widehat Q_0(\cdot;W^\star)-Q^\pi\|_\mu \le \frac{1}{1-\gamma} \|\Pi_{\mathcal F_{B,m}}Q^\pi-Q^\pi\|_\mu.\]

Therefore, if the function class \(\mathcal F_{B,m}\) approximates \(Q^\pi\) well, then the projected Bellman fixed point is close to \(Q^\pi\).

Combining this approximation result with the stochastic convergence theorem gives

\[\mathbb E_{\mathrm{init},\mu} \left[ \widehat Q_{\mathrm{out}}(x)-Q^\pi(x) \right]^2 \le \frac{32(B^2+\sigma_g^2)}{(1-\gamma)^2\sqrt T} + \frac{ 2\mathbb E_{\mathrm{init},\mu} [ \Pi_{\mathcal F_{B,m}}Q^\pi(x)-Q^\pi(x) ]^2 }{(1-\gamma)^2} + O\left( B^3m^{-1/2} + B^{5/2}m^{-1/4} \right).\]

There are three errors:

\[\boxed{ \text{optimization/statistical error} }\] \[\boxed{ \text{function approximation error} }\] \[\boxed{ \text{finite-width linearization error} }\]

\[Q^\pi\in \mathcal F_{B,\infty},\]

then the approximation error vanishes in the infinite-width limit.

18. Representation Power and RKHS Interpretation

As \(m\to\infty\), the random-feature class induced by the frozen ReLU gates converges to an infinite-width class related to a reproducing kernel Hilbert space.

The relevant kernel is

\[K(x,y) = \mathbb E_{w\sim \mathcal N(0,I_d/d)} \left[ \mathbf 1\{w^{\texttt{T}} x>0,\ w^{\texttt{T}} y>0\} x^{\texttt{T}} y \right].\]

The infinite-width function class can be viewed as an RKHS-type class associated with this kernel.

This matters because it explains the representation side of overparameterization.

Overparameterization does two things:

\[\boxed{ \text{It keeps training close to a linearized regime.} }\]

and

\[\boxed{ \text{It creates a rich random-feature function class.} }\]

The first point gives optimization stability. The second point gives approximation power.

19. Extension to Neural \(Q\)-Learning

For policy optimization, the Bellman evaluation operator is replaced by the Bellman optimality operator:

\[\mathcal T Q(s,a) = \mathbb E \left[ r(s,a) + \gamma\max_{a'\in\mathcal A}Q(s',a') \mid s,a \right].\]

The optimal action-value function satisfies

\[Q^\star=\mathcal T Q^\star.\]

Neural \(Q\)-Learning uses the update

\[W(t+1) = \Pi_{S_B} \left[ W(t) - \eta \delta_t \nabla_W\widehat Q(x_t;W(t)) \right],\]

where

\[\delta_t = \widehat Q(x_t;W(t)) - r_t - \gamma \max_{a'\in\mathcal A} \widehat Q(s_t',a';W(t)).\]

The linearized residual becomes

\[\delta_0(x,r,x';W) = \widehat Q_0(x;W) - r - \gamma \max_{a'\in\mathcal A} \widehat Q_0(s',a';W).\]

The corresponding projected Bellman optimality equation is

\[Q = \Pi_{\mathcal F_{B,m}} \mathcal T Q.\]

The max operator makes the analysis harder because the greedy action depends on the current function. The paper therefore imposes stronger regularity assumptions for neural \(Q\)-Learning.

Under those assumptions, the paper proves

\[\mathbb E_{\mathrm{init},\mu_{\mathrm{exp}}} \left[ \widehat Q_{\mathrm{out}}(x) - \widehat Q_0(x;W^\star) \right]^2 = O\left( B^2T^{-1/2} + B^3m^{-1/2} + B^{5/2}m^{-1/4} \right).\]

So neural \(Q\)-Learning also converges at a stochastic rate of order

\[O(T^{-1/2})\]

plus finite-width linearization error.

20. Relation to Practical DQN

The neural \(Q\)-Learning result is related to DQN, but it is not a full theory of practical DQN.

Practical DQN usually uses:

\[\text{experience replay},\] \[\text{target networks},\] \[\epsilon\text{-greedy exploration},\] \[\text{deep networks with all layers trained},\]

and nonstationary data.

The paper’s main theoretical model uses:

\[\text{i.i.d. samples from a stationary distribution},\]

projection to a ball around initialization,

fixed output-layer weights,

and sufficiently wide networks.

Thus, the result should be viewed as a theory of neural TD and neural \(Q\)-Learning in an idealized overparameterized regime, not as a complete explanation of every engineering component of DQN.

Still, the paper gives a valuable conceptual message:

\[\boxed{ \text{Overparameterization can stabilize bootstrapped value learning.} }\]

21. Summary of the Proof Strategy

The proof can be summarized as follows.

First, define the frozen-gate linearized network:

\[\widehat Q_0(x;W) = \frac{1}{\sqrt m} \sum_{r=1}^m b_r \mathbf 1\{W_r(0)^{\texttt{T}} x>0\} W_r^{\texttt{T}} x.\]

Second, show that the nonlinear neural network stays close to this linearization:

\[\widehat Q(x;W(t)) \approx \widehat Q_0(x;W(t)).\]

Third, show that the neural TD semigradient stays close to the linearized TD semigradient:

\[\overline g(t) \approx \overline g_0(t).\]

Fourth, prove one-point monotonicity for the linearized TD direction:

\[\left\langle \overline g_0(W), W-W^\star \right\rangle \ge (1-\gamma) \left\| \widehat Q_0(\cdot;W) - \widehat Q_0(\cdot;W^\star) \right\|_\mu^2.\]

Fifth, use this monotonicity to prove a descent inequality:

\[\|W(t+1)-W^\star\|_2^2 \le \|W(t)-W^\star\|_2^2 - \text{progress} + \text{linearization error} + \text{stochastic noise}.\]

Sixth, telescope over time and use iterate averaging.

This gives the final convergence bound.

22. Main Takeaways

The paper teaches several important lessons.

First, TD is a semi-gradient method. It does not minimize the Bellman residual by ordinary SGD.

Second, linear TD works because its expected update solves a projected Bellman fixed-point equation.

Third, neural TD is difficult because the parameterization is nonlinear and nonconvex.

Fourth, overparameterization makes the neural network behave almost linearly near initialization.

Fifth, the resulting linearized model is a random-feature model.

Sixth, neural TD inherits the projected Bellman convergence structure of linear TD, up to finite-width error.

Seventh, the final error decomposes into optimization error, approximation error, and linearization error.

In one line:

\[\boxed{ \text{Neural TD converges because wide ReLU networks turn nonlinear TD into approximate linear TD.} }\]

Reference

Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2020).
Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima.
Available at: https://arxiv.org/abs/1905.10027

rl-blogs

This post is licensed under CC BY 4.0 by the author.

1. Policy Evaluation Setup

2. From Bellman Equation to Optimization

3. MSPBE: The Projected Bellman Objective

4. TD Is a Semi-Gradient Method

5. Linear TD as the Guiding Case

6. Neural TD Parameterization

7. Neural TD Algorithm

8. Why ReLU Makes Local Linearization Possible

9. The Linearized Function Class

10. Regularity Assumption

11. Local Linearization Lemma

12. Local Linearization of the TD Direction

13. One-Point Monotonicity

14. Population Descent Lemma

15. Population Convergence Rate

16. Stochastic TD Convergence Rate

17. Approximation to the True Value Function

18. Representation Power and RKHS Interpretation

19. Extension to Neural \(Q\)-Learning

20. Relation to Practical DQN

21. Summary of the Proof Strategy

22. Main Takeaways

Reference

Trending Tags