Why Vanilla Q-Learning Breaks Under Corrupted Rewards

Posted Apr 12, 2026

By Sreejeet Maity

12 min read

Vanilla Q-learning is one of the most elegant algorithms in reinforcement learning. It is model-free, computationally simple, and in the clean setting its behavior is governed by a beautiful fixed-point argument: the algorithm tracks the Bellman optimality operator, and that operator is a contraction. This is exactly why Q-learning converges to the optimal action-value function \(Q^\star\).

However, that clean story hides a fragile structural assumption: the reward channel must be trustworthy. The observations may be noisy, but they must still be centered around the true reward function. Once this assumption fails, the classical contraction argument does not disappear, but it begins to work in the wrong direction. The iterates are then driven not toward \(Q^\star\), but toward the fixed point of a perturbed Bellman operator induced by the corrupted reward process.

This is the real reason vanilla Q-learning breaks under corrupted rewards. The issue is not merely extra randomness. It is the appearance of a persistent bias in the Bellman target. Since Q-learning is bootstrapped, this bias is then propagated recursively through future updates. In the end, the algorithm may still converge, but it can converge to a point arbitrarily far from \(Q^\star\), and even learn the wrong optimal action.

In this post I will make that statement precise. I will first recall the clean Bellman geometry behind vanilla Q-learning, then describe the Huber contamination model, then explain exactly which operator the corrupted recursion tracks, and finally show, through an explicit finite MDP construction, that the asymptotic error can be made arbitrarily large.

1. The clean Bellman picture

Consider a discounted finite Markov decision process with state space \(\mathcal S\), action space \(\mathcal A\), reward function \(R:\mathcal S \times \mathcal A \to \mathbb R\), transition kernel \(P(\cdot \mid s,a)\), and discount factor \(\gamma \in (0,1)\).

The Bellman optimality operator is

\[(\mathcal T^\star Q)(s,a) = R(s,a) + \gamma \mathbb E_{s' \sim P(\cdot \mid s,a)} \left[ \max_{a' \in \mathcal A} Q(s',a') \right].\]

The optimal action-value function \(Q^\star\) is the unique fixed point of \(\mathcal T^\star\):

\[Q^\star = \mathcal T^\star Q^\star.\]

The fundamental structural fact is that \(\mathcal T^\star\) is a \(\gamma\)-contraction in the sup norm:

\[\|\mathcal T^\star Q - \mathcal T^\star Q'\|_\infty \le \gamma \|Q-Q'\|_\infty.\]

This inequality is the backbone of the clean theory. Once a recursion is shown to track \(\mathcal T^\star\), convergence to \(Q^\star\) follows from contraction plus stochastic approximation.

Vanilla asynchronous Q-learning updates only the visited coordinate. If at time \(t\) the learner visits \((s_t,a_t)\), observes reward \(Y_t\), and transitions to \(s_{t+1}\), then

\[Q_{t+1}(s_t,a_t) = (1-\alpha_t)Q_t(s_t,a_t) + \alpha_t \left( Y_t + \gamma \max_{a' \in \mathcal A} Q_t(s_{t+1},a') \right),\]

and all other coordinates remain unchanged.

Let \(\mathcal F_t\) denote the filtration generated by the trajectory and the past iterates up to time \(t\). In the clean setting, conditional on visiting the pair \((s,a)\), one has

\[\mathbb E\!\left[ Y_t + \gamma \max_{a'} Q_t(s_{t+1},a') \;\middle|\; \mathcal F_t,\; s_t=s,\; a_t=a \right] = (\mathcal T^\star Q_t)(s,a).\]

Equivalently, if we define

\[\zeta_{t+1}(s,a) = \Bigl( Y_t + \gamma \max_{a'} Q_t(s_{t+1},a') \Bigr) - (\mathcal T^\star Q_t)(s,a),\]

then

\[\mathbb E[\zeta_{t+1}(s,a)\mid \mathcal F_t,\; s_t=s,\; a_t=a] = 0.\]

So the clean update is a stochastic approximation to the fixed-point equation \(Q=\mathcal T^\star Q\) with a conditionally mean-zero fluctuation term. That is the mathematically precise reason vanilla Q-learning works in the uncorrupted setting.

2. Why corruption is not just extra noise

Now suppose the reward channel is corrupted. The learner no longer sees a clean reward sample. Instead, it sees something contaminated.

The crucial point is that corruption is fundamentally different from ordinary noise. Noise is often modeled as a centered fluctuation. Corruption is not centered. It changes the mean of the update target itself.

To model this cleanly, fix \(\varepsilon \in (0,1)\) and suppose that for each state-action pair \((s,a)\) and each round \(t\), the observed reward \(y_t(s,a)\) is generated from the Huber contamination model

\[y_t(s,a) \sim (1-\varepsilon)\mathcal R(s,a) + \varepsilon \mathcal C(s,a),\]

where \(\mathcal R(s,a)\) is the true reward distribution and \(\mathcal C(s,a)\) is an arbitrary adversarial distribution.

Define the perturbed mean reward

\[\widetilde R_c(s,a) = \mathbb E[y_t(s,a)].\]

Then the corruption-induced reward bias is

\[\Delta_c(s,a) := \widetilde R_c(s,a) - R(s,a).\]

This quantity is the real villain. It is not a fluctuation around zero. It is a structural shift in the Bellman target.

3. The corrupted recursion tracks the wrong Bellman operator

To keep the asymptotic statement clean, let us focus on the synchronous version of vanilla Q-learning. Under the Huber corruption model, the synchronous recursion is simply Q-learning applied to the reward process \(y_t(s,a)\).

The key object is now the perturbed Bellman operator

\[(\widetilde{\mathcal T}_c^\star Q)(s,a) = \widetilde R_c(s,a) + \gamma \mathbb E_{s' \sim P(\cdot \mid s,a)} \left[ \max_{a' \in \mathcal A} Q(s',a') \right].\]

Notice that the transition kernel is unchanged. Only the reward function has shifted from \(R\) to \(\widetilde R_c\).

If \(\widetilde Q_c^\star\) denotes the unique fixed point of \(\widetilde{\mathcal T}_c^\star\), then under standard stochastic approximation conditions on the step sizes,

\[\alpha_t \in (0,1), \qquad \sum_{t=1}^\infty \alpha_t = \infty, \qquad \sum_{t=1}^\infty \alpha_t^2 < \infty,\]

the synchronous vanilla Q-learning iterates converge almost surely to \(\widetilde Q_c^\star\).

This is the core theorem behind the vulnerability story.

The most important sentence here is the following:

\[Q_t \longrightarrow \widetilde Q_c^\star \quad\text{a.s., not to } Q^\star.\]

That is the mathematical meaning of “vanilla Q-learning breaks.”

It does not mean that the algorithm must explode.
It does not mean that the iterates must oscillate forever.
It does not even mean that convergence fails.

Rather, it means that convergence occurs toward the wrong fixed point.

4. The error decomposition: contraction plus bias

It is useful to write the recursion in terms of the estimation error

\[e_t := Q_t - Q^\star.\]

In the clean setting, one can think of the update schematically as

\[e_{t+1} = e_t + \alpha_t \Bigl[ (\mathcal T^\star Q_t - Q_t) + \zeta_{t+1} \Bigr],\]

where \(\zeta_{t+1}\) is conditionally mean-zero.

Under corruption, the correct decomposition is

\[e_{t+1} = e_t + \alpha_t \Bigl[ (\mathcal T^\star Q_t - Q_t) + \Delta_c(s_t,a_t) + \zeta_{t+1}^{(c)} \Bigr],\]

where \(\zeta_{t+1}^{(c)}\) is the centered fluctuation around the corrupted Bellman target. More precisely,

\[\mathbb E\!\left[ \zeta_{t+1}^{(c)} \;\middle|\; \mathcal F_t,\; s_t=s,\; a_t=a \right] = 0,\]

but the drift itself is no longer centered at \(\mathcal T^\star Q_t\). It is centered at

\[(\widetilde{\mathcal T}_c^\star Q_t)(s,a) = (\mathcal T^\star Q_t)(s,a) + \Delta_c(s,a).\]

So the update contains two fundamentally different pieces:

a centered stochastic fluctuation term \(\zeta_{t+1}^{(c)}\);
a non-centered deterministic bias term \(\Delta_c(s,a)\).

The first one can average out. The second one cannot.

This is why corruption is qualitatively worse than ordinary stochasticity.

5. A clean perturbation bound

Let us now compare the true fixed point \(Q^\star\) and the perturbed fixed point \(\widetilde Q_c^\star\).

Since

\[Q^\star = \mathcal T^\star Q^\star, \qquad \widetilde Q_c^\star = \widetilde{\mathcal T}_c^\star \widetilde Q_c^\star,\]

we have

\[\begin{aligned} \|\widetilde Q_c^\star - Q^\star\|_\infty &= \|\widetilde{\mathcal T}_c^\star \widetilde Q_c^\star - \mathcal T^\star Q^\star\|_\infty \\ &\le \|\widetilde{\mathcal T}_c^\star \widetilde Q_c^\star - \mathcal T^\star \widetilde Q_c^\star\|_\infty + \|\mathcal T^\star \widetilde Q_c^\star - \mathcal T^\star Q^\star\|_\infty \\ &\le \|\widetilde R_c - R\|_\infty + \gamma \|\widetilde Q_c^\star - Q^\star\|_\infty. \end{aligned}\]

Therefore,

\[\|\widetilde Q_c^\star - Q^\star\|_\infty \le \frac{\|\widetilde R_c - R\|_\infty}{1-\gamma}.\]

This inequality already contains the core mechanism. A bias in one-step rewards gets propagated through the Bellman recursion and amplified by the geometric factor \(1/(1-\gamma)\).

6. A finite MDP where the asymptotic gap is arbitrarily large

The perturbation bound above is informative, but one could still wonder whether it is merely a loose upper bound. The answer is no. There is a concrete finite MDP where the corrupted limiting point can be arbitrarily far from \(Q^\star\).

Take

\[\mathcal S = \{1,2,3,4,5\}, \qquad \mathcal A = \{\texttt{L},\texttt{R}\}.\]

Fix \(p \in (0,1)\). The transition structure is:

In state \(1\):
- action \(\texttt{L}\) goes to state \(2\) with probability \(p\) and to state \(5\) with probability \(1-p\);
- action \(\texttt{R}\) goes to state \(3\) with probability \(p\) and to state \(4\) with probability \(1-p\).
In state \(2\), under either action, the chain stays in state \(2\) with probability \(p\) and moves to state \(5\) with probability \(1-p\).
In state \(3\), under either action, the chain stays in state \(3\) with probability \(p\) and moves to state \(4\) with probability \(1-p\).
States \(4\) and \(5\) are absorbing.

Now assign rewards as follows:

at state \(1\), \(R(1,\texttt{L}) = d, \qquad R(1,\texttt{R}) = -d,\) with \(d>0\);
at states \(2\) and \(3\), the reward is deterministically \(1\) for both actions;
at states \(4\) and \(5\), the reward is deterministically \(0\).

Under the clean model, the optimal action at state \(1\) is plainly \(\texttt{L}\).

7. Computing the clean optimal values

Because states \(4\) and \(5\) are absorbing with zero reward,

\[Q^\star(s,a) = 0, \qquad s \in \{4,5\},\; a \in \mathcal A.\]

For states \(2\) and \(3\), the Bellman equations are

\[Q^\star(2,a) = 1 + \gamma p\, Q^\star(2,a), \qquad Q^\star(3,a) = 1 + \gamma p\, Q^\star(3,a),\]

hence

\[Q^\star(2,a) = Q^\star(3,a) = \frac{1}{1-\gamma p}.\]

Define

\[\beta := \frac{p\gamma}{1-\gamma p}.\]

Then the values at state \(1\) are

\[Q^\star(1,\texttt{L}) = d + \beta, \qquad Q^\star(1,\texttt{R}) = -d + \beta.\]

So \(\texttt{L}\) is the unique optimal action at state \(1\).

8. Designing the Huber attack

Now let the attacker corrupt rewards only at state \(1\).

For \((1,\texttt{L})\), let the observed reward be

\(d\) with probability \(1-\varepsilon\),
\(-C\) with probability \(\varepsilon\).

For \((1,\texttt{R})\), let the observed reward be

\(-d\) with probability \(1-\varepsilon\),
\(C\) with probability \(\varepsilon\).

Then the perturbed expected rewards become

\[\widetilde R_c(1,\texttt{L}) = (1-\varepsilon)d - \varepsilon C,\]

and

\[\widetilde R_c(1,\texttt{R}) = -(1-\varepsilon)d + \varepsilon C.\]

All other rewards remain unchanged.

Since the attack does not affect states \(2,3,4,5\), the perturbed fixed point coincides with the clean one there. At state \(1\), the continuation term is still \(\beta\), so

\[\widetilde Q_c^\star(1,\texttt{L}) = (1-\varepsilon)d - \varepsilon C + \beta,\]

and

\[\widetilde Q_c^\star(1,\texttt{R}) = -(1-\varepsilon)d + \varepsilon C + \beta.\]

9. Making the error arbitrarily large

Now choose the attack magnitude as

\[C = \frac{(2-\varepsilon)d + \kappa}{\varepsilon},\]

where \(\kappa > 0\) is arbitrary.

Substituting this into the formulas above gives

\[\widetilde Q_c^\star(1,\texttt{L}) = -d - \kappa + \beta,\]

and

\[\widetilde Q_c^\star(1,\texttt{R}) = d + \kappa + \beta.\]

Therefore,

\[\left| \widetilde Q_c^\star(1,\texttt{L}) - Q^\star(1,\texttt{L}) \right| = 2d+\kappa,\]

and similarly

\[\left| \widetilde Q_c^\star(1,\texttt{R}) - Q^\star(1,\texttt{R}) \right| = 2d+\kappa.\]

Hence,

\[\|\widetilde Q_c^\star - Q^\star\|_\infty = 2d+\kappa.\]

Since \(\kappa\) is arbitrary, this gap can be made arbitrarily large.

This is the sharp mathematical statement of vulnerability. Even if the attack probability \(\varepsilon\) is small, the adversary can compensate by taking the corruption magnitude \(C\) large. The asymptotic bias in vanilla Q-learning is then unbounded.

10. The optimal action flips

The same construction shows something even more alarming.

In the clean MDP,

\[Q^\star(1,\texttt{L}) = d+\beta, \qquad Q^\star(1,\texttt{R}) = -d+\beta,\]

so the optimal action is \(\texttt{L}\).

Under the attack,

\[\widetilde Q_c^\star(1,\texttt{L}) = -d-\kappa+\beta, \qquad \widetilde Q_c^\star(1,\texttt{R}) = d+\kappa+\beta,\]

so the optimal action becomes \(\texttt{R}\).

Thus the corruption does not merely perturb the value function quantitatively. It changes the qualitative decision rule.

This is exactly the kind of failure that makes corrupted feedback dangerous in control problems. The learner is not just “a bit off.” It is systematically pushed into preferring the wrong action.

11. What the whole argument teaches us

Let me summarize the logic in one chain.

First, vanilla Q-learning works cleanly because its update is centered around the true Bellman operator \(\mathcal T^\star\), up to a conditionally mean-zero fluctuation term \(\zeta_{t+1}\).

Second, under Huber reward corruption, the conditional mean target shifts by \(\Delta_c(s,a)=\widetilde R_c(s,a)-R(s,a)\). Consequently, the recursion is centered around the perturbed Bellman operator \(\widetilde{\mathcal T}_c^\star\) instead of \(\mathcal T^\star\).

Third, because \(\widetilde{\mathcal T}_c^\star\) is also a contraction, the iterates still converge, but they converge to \(\widetilde Q_c^\star\) rather than \(Q^\star\).

Fourth, Bellman geometry amplifies reward perturbations into value perturbations by roughly a factor \(1/(1-\gamma)\).

Finally, the explicit five-state construction shows that this is not merely a loose upper bound or a pathological proof trick. The asymptotic gap can truly be arbitrarily large, and the optimal action can flip.

12. Final takeaway

Vanilla Q-learning breaks under corrupted rewards for a mathematically precise reason: the one-step target is no longer an unbiased stochastic approximation to the true Bellman optimality operator. Instead, the update decomposes into a centered fluctuation term plus a non-vanishing bias term induced by corruption. This shifts the operator, shifts the fixed point, and thereby shifts the asymptotic limit of the algorithm itself. Because Bellman recursions propagate one-step bias across future time, even a small attack budget can be turned into arbitrarily large long-run value distortion when the corruption magnitude is unconstrained. The explicit finite-state example above shows that the final consequence is not just numeric error but a genuine policy error: the optimal action can be reversed.

That is why robust reinforcement learning is not a cosmetic modification of classical RL. It is a structural necessity whenever the reward channel cannot be blindly trusted.

What comes next

The next natural question is not whether vanilla Q-learning is fragile; that is now clear. The next question is: what must be changed in the update rule so that corrupted rewards do not control the asymptotic limit? That is where robust reward estimation and thresholded Bellman updates enter.

rl-blogs

This post is licensed under CC BY 4.0 by the author.