Concentration Inequalities: A Researcher's Guide from Markov to Freedman

Posted Jun 9, 2026

By Sreejeet Maity

25 min read

Concentration inequalities are among the most useful tools in modern probability, learning theory, stochastic approximation, and reinforcement learning (RL). They convert local information about a random process into global high-probability control. In the analysis of algorithms, this is often the difference between saying

\[\mathbb{E}[\mathrm{error}] \text{ is small}\]

and saying

\[\mathrm{error} \text{ is small with probability at least } 1-\delta.\]

The second statement is usually what we need in finite-time reinforcement learning. A typical proof has to control random visitation counts, empirical transition probabilities, noisy Bellman targets, martingale noise terms, and sometimes adversarial or heavy-tailed reward perturbations. Each of these terms asks for a different concentration inequality.

This post is a coherent guide to the inequalities I repeatedly use in research:

Markov’s inequality.
Chebyshev’s inequality.
Hoeffding’s inequality.
Chernoff and multiplicative Chernoff bounds.
Bernstein’s inequality.
Azuma-Hoeffding inequality.
A high-probability refinement of Azuma-Hoeffding in the spirit of Shamir and Spencer.
Freedman’s inequality for martingales.

The goal is not just to list statements. The goal is to understand the logic connecting them.

Markov’s inequality is the starting point. It only uses nonnegativity and the first moment. Because it knows almost nothing about the distribution, it gives a very general but often loose bound.

Chebyshev’s inequality uses one more piece of information: the variance. This already improves the picture. Instead of only controlling the probability that a nonnegative random variable is large, Chebyshev controls how far a random variable moves away from its mean.

Hoeffding’s inequality becomes sharper because it uses boundedness. If each random variable is known to lie in a fixed interval, then very large deviations are impossible, and the sum concentrates exponentially fast.

Chernoff bounds push this idea through moment generating functions. Rather than controlling a deviation directly, we exponentially tilt the random variable and optimize the resulting bound. This is especially powerful for sums of Bernoulli random variables and leads to the familiar multiplicative Chernoff bounds.

Bernstein’s inequality combines two kinds of information: variance and boundedness. This is often sharper than Hoeffding when the variance is small. Hoeffding only sees the worst-case range; Bernstein also sees the typical scale of fluctuations.

The same hierarchy reappears for dependent data. Azuma-Hoeffding is the martingale analogue of Hoeffding: if the increments of a martingale are uniformly bounded, then the martingale concentrates. Freedman’s inequality is the martingale analogue of Bernstein: it uses both bounded increments and the predictable variance process.

The final inequality in this post is a refined high-probability version of Azuma-Hoeffding, in the spirit of Shamir and Spencer. This refinement is useful when the increments are not always small deterministically, but are small on a high-probability event. In other words, the process has two scales: a crude worst-case scale and a much sharper typical scale.

So the guiding theme is simple:

as we move from Markov to Chebyshev to Hoeffding, Chernoff, Bernstein, Azuma-Hoeffding, Freedman, and refined high-probability martingale inequalities, we progressively inject more structure into the problem. More structure gives sharper concentration.

1. Basic notation

Let $X$ be a real-valued random variable. We write

\[\mathbb{E}[X]\]

for its expectation and

\[\operatorname{Var}(X) = \mathbb{E}\left[(X-\mathbb{E}[X])^2\right]\]

for its variance whenever these quantities exist.

For independent random variables $X_1,\ldots,X_n$, define

\[S_n = \sum_{i=1}^n X_i.\]

For martingales, let

\[\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_n\]

be a filtration. A stochastic process $M_0,M_1,\ldots,M_n$ is a martingale with respect to $\{\mathcal{F}_i\}_{i=0}^{n}$ if

\[M_i \text{ is } \mathcal{F}_i\text{-measurable}\]

and

\[\mathbb{E}[M_i \mid \mathcal{F}_{i-1}] = M_{i-1} \quad \text{for all } i\ge 1.\]

The martingale difference sequence is

\[D_i = M_i - M_{i-1}.\]

Thus

\[\mathbb{E}[D_i \mid \mathcal{F}_{i-1}] = 0.\]

This zero conditional mean property is the martingale analogue of independence plus zero mean.

2. Markov’s inequality: concentration from one moment

Markov’s inequality is the most basic concentration inequality. It is weak, but it is extremely general.

Theorem 1: Markov’s inequality

Let $X\ge 0$ be a nonnegative random variable. Then, for every $a>0$,

\[\mathbb{P}(X\ge a) \le \frac{\mathbb{E}[X]}{a}.\]

Proof

Since $X\ge 0$, we have the pointwise inequality

\[X \ge a\mathbf{1}\{X\ge a\}.\]

Taking expectations on both sides gives

\[\mathbb{E}[X] \ge \mathbb{E}\left[a\mathbf{1}\{X\ge a\}\right].\]

Since $a$ is deterministic,

\[\mathbb{E}\left[a\mathbf{1}\{X\ge a\}\right] = a\mathbb{P}(X\ge a).\]

Therefore

\[\mathbb{E}[X] \ge a\mathbb{P}(X\ge a),\]

and dividing by $a$ proves

\[\mathbb{P}(X\ge a) \le \frac{\mathbb{E}[X]}{a}.\]

Intuition

Markov’s inequality says: if a nonnegative random variable has small average size, then it cannot be large too often. It does not require independence, boundedness, variance, or tails. It only requires nonnegativity and a finite mean.

This is why Markov’s inequality is often used as a first conversion step:

\[\mathbb{E}[X] \text{ small} \quad \Longrightarrow \quad X \text{ small with some probability.}\]

But the price is that the resulting tail decays only like $1/a$. This is usually too weak for sharp finite-time learning theory.

A useful general form

If $g$ is nonnegative and increasing, then

\[\mathbb{P}(X\ge a) = \mathbb{P}(g(X)\ge g(a)) \le \frac{\mathbb{E}[g(X)]}{g(a)}.\]

This is the bridge to many stronger inequalities. For example:

taking $g(x)=x^2$ leads to Chebyshev-type bounds;
taking $g(x)=e^{\lambda x}$ leads to Chernoff-type bounds.

Markov’s inequality is therefore not just a standalone result. It is the engine underneath many concentration arguments.

3. Chebyshev’s inequality: concentration from variance

Markov’s inequality uses a first moment. Chebyshev’s inequality applies Markov’s inequality to a squared deviation.

Theorem 2: Chebyshev’s inequality

Let $X$ be a random variable with finite mean $\mu=\mathbb{E}[X]$ and finite variance $\sigma^2=\operatorname{Var}(X)$. Then, for every $t>0$,

\[\mathbb{P}\left(\lvert X-\mu\rvert\ge t\right) \le \frac{\sigma^2}{t^2}.\]

Equivalently, for every $k>0$,

\[\mathbb{P}\left(\lvert X-\mu\rvert\ge k\sigma\right) \le \frac{1}{k^2}.\]

Proof

Consider the nonnegative random variable

\[Y=(X-\mu)^2.\]

The event $\lvert X-\mu\rvert\ge t$ is the same as the event $(X-\mu)^2\ge t^2$. Therefore,

\[\mathbb{P}\left(\lvert X-\mu\rvert\ge t\right) = \mathbb{P}(Y\ge t^2).\]

By Markov’s inequality,

\[\mathbb{P}(Y\ge t^2) \le \frac{\mathbb{E}[Y]}{t^2}.\]

But

\[\mathbb{E}[Y] = \mathbb{E}\left[(X-\mu)^2\right] = \sigma^2.\]

Hence

\[\mathbb{P}\left(\lvert X-\mu\rvert\ge t\right) \le \frac{\sigma^2}{t^2}.\]

Sample mean version

Let $X_1,\ldots,X_n$ be independent random variables with common mean $\mu$ and common variance $\sigma^2$. Define the sample mean

\[\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.\]

Then

\[\mathbb{E}[\overline{X}_n]=\mu\]

and, by independence,

\[\operatorname{Var}(\overline{X}_n) = \frac{\sigma^2}{n}.\]

Applying Chebyshev gives

\[\mathbb{P}\left(\lvert \overline{X}_n-\mu\rvert\ge t\right) \le \frac{\sigma^2}{nt^2}.\]

Equivalently, with probability at least $1-\delta$,

\[\lvert \overline{X}_n-\mu\rvert \le \sqrt{\frac{\sigma^2}{n\delta}}.\]

Intuition

Chebyshev is robust because it only needs a second moment. This makes it useful when the distribution may be heavy-tailed. But the dependence on $\delta$ is weak:

\[\sqrt{\frac{1}{\delta}}.\]

In learning theory, one usually wants logarithmic dependence on $1/\delta$, such as

\[\sqrt{\frac{\log(1/\delta)}{n}}.\]

Chebyshev does not give that. To get logarithmic confidence dependence, we need stronger assumptions, such as boundedness, sub-Gaussianity, or variance plus boundedness.

4. The exponential method: the common backbone

Before Hoeffding, Chernoff, and Bernstein, it is useful to isolate the main proof pattern.

Let $S$ be a random variable. For any $\lambda>0$,

\[\mathbb{P}(S\ge t) = \mathbb{P}\left(e^{\lambda S}\ge e^{\lambda t}\right).\]

By Markov’s inequality,

\[\mathbb{P}\left(e^{\lambda S}\ge e^{\lambda t}\right) \le \frac{\mathbb{E}\left[e^{\lambda S}\right]}{e^{\lambda t}}.\]

Thus

\[\mathbb{P}(S\ge t) \le \exp(-\lambda t)\mathbb{E}\left[e^{\lambda S}\right].\]

Since this holds for every $\lambda>0$, we can optimize over $\lambda$:

\[\mathbb{P}(S\ge t) \le \inf_{\lambda>0} \exp(-\lambda t)\mathbb{E}\left[e^{\lambda S}\right].\]

This is the Chernoff method.

The quantity

\[\mathbb{E}\left[e^{\lambda S}\right]\]

is called the moment generating function of $S$. The reason exponential moments are powerful is that they transform sums into products under independence. If $S_n=\sum_{i=1}^n X_i$ and the $X_i$’s are independent, then

\[\mathbb{E}\left[e^{\lambda S_n}\right] = \mathbb{E}\left[e^{\lambda\sum_{i=1}^n X_i}\right] = \mathbb{E}\left[\prod_{i=1}^n e^{\lambda X_i}\right] = \prod_{i=1}^n \mathbb{E}\left[e^{\lambda X_i}\right].\]

This factorization is the key to Hoeffding, Chernoff, and Bernstein.

5. Hoeffding’s inequality: bounded independent noise

Hoeffding’s inequality is one of the most common concentration inequalities in learning theory. It says that sums of independent bounded random variables concentrate around their means at a sub-Gaussian rate.

Hoeffding’s lemma

The main ingredient is Hoeffding’s lemma.

Let $X$ be a random variable such that

\[X\in[a,b] \quad \text{almost surely}.\]

Let

\[\mu=\mathbb{E}[X].\]

Then, for every $\lambda\in\mathbb{R}$,

\[\mathbb{E}\left[e^{\lambda(X-\mu)}\right] \le \exp\left(\frac{\lambda^2(b-a)^2}{8}\right).\]

This says that a centered bounded random variable behaves like a sub-Gaussian random variable with proxy variance $(b-a)^2/4$.

Theorem 3: Hoeffding’s inequality

Let $X_1,\ldots,X_n$ be independent random variables with

\[X_i\in[a_i,b_i] \quad \text{almost surely}.\]

Let

\[S_n = \sum_{i=1}^n X_i\]

and

\[\mathbb{E}[S_n] = \sum_{i=1}^n \mathbb{E}[X_i].\]

Then, for every $t>0$,

\[\mathbb{P}\left(S_n-\mathbb{E}[S_n]\ge t\right) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right).\]

Similarly,

\[\mathbb{P}\left(\mathbb{E}[S_n]-S_n\ge t\right) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right).\]

Therefore, by the union bound,

\[\mathbb{P}\left(\lvert S_n-\mathbb{E}[S_n]\rvert\ge t\right) \le 2\exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right).\]

Proof

Define centered variables

\[Y_i = X_i-\mathbb{E}[X_i].\]

Then

\[S_n-\mathbb{E}[S_n]=\sum_{i=1}^n Y_i.\]

For any $\lambda>0$, Markov’s inequality gives

\[\mathbb{P}\left(\sum_{i=1}^n Y_i\ge t\right) \le \exp(-\lambda t) \mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n Y_i\right)\right].\]

By independence,

\[\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n Y_i\right)\right] = \prod_{i=1}^n \mathbb{E}\left[e^{\lambda Y_i}\right].\]

By Hoeffding’s lemma,

\[\mathbb{E}\left[e^{\lambda Y_i}\right] \le \exp\left(\frac{\lambda^2(b_i-a_i)^2}{8}\right).\]

Therefore,

\[\prod_{i=1}^n \mathbb{E}\left[e^{\lambda Y_i}\right] \le \exp\left( \frac{\lambda^2}{8} \sum_{i=1}^n (b_i-a_i)^2 \right).\]

Thus

\[\mathbb{P}\left(S_n-\mathbb{E}[S_n]\ge t\right) \le \exp\left( -\lambda t + \frac{\lambda^2}{8} \sum_{i=1}^n (b_i-a_i)^2 \right).\]

Let

\[V_H = \sum_{i=1}^n (b_i-a_i)^2.\]

We need to minimize

\[-\lambda t + \frac{\lambda^2V_H}{8}\]

over $\lambda>0$. Differentiating gives

\[-t+\frac{\lambda V_H}{4}=0,\]

so the optimizer is

\[\lambda^* = \frac{4t}{V_H}.\]

Substituting $\lambda^*$ yields

\[-\lambda^* t + \frac{(\lambda^*)^2V_H}{8} = -\frac{4t^2}{V_H} + \frac{16t^2}{V_H^2}\frac{V_H}{8} = -\frac{2t^2}{V_H}.\]

Hence

\[\mathbb{P}\left(S_n-\mathbb{E}[S_n]\ge t\right) \le \exp\left(-\frac{2t^2}{V_H}\right).\]

The lower tail follows by applying the same argument to $-Y_i$. The two-sided bound follows from the union bound.

Sample mean form

If $X_i\in[a,b]$ independently and all have mean $\mu$, then

\[\mathbb{P}\left(\lvert \overline{X}_n-\mu\rvert\ge t\right) \le 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right).\]

Equivalently, with probability at least $1-\delta$,

\[\lvert \overline{X}_n-\mu\rvert \le (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}.\]

Intuition

Hoeffding is the right first tool when:

the samples are independent;
each sample is bounded;
we do not want to use the variance;
we want exponential tails.

It gives the canonical rate

\[\sqrt{\frac{\log(1/\delta)}{n}}.\]

In RL, Hoeffding often appears when rewards are bounded, or when estimating transition probabilities for a fixed state-action pair using independent samples.

6. Chernoff bounds: concentration for sums of Bernoulli variables

Chernoff bounds are especially sharp for sums of Bernoulli random variables. They are central when controlling counts, such as the number of times a state-action pair is visited.

Let

\[X_i\in\{0,1\}\]

be independent Bernoulli random variables. Define

\[S_n=\sum_{i=1}^n X_i\]

and

\[\mu=\mathbb{E}[S_n].\]

The random variable $S_n$ counts how many successes occur.

Theorem 4: Multiplicative Chernoff bound

For every $\varepsilon>0$,

\[\mathbb{P}\left(S_n\ge (1+\varepsilon)\mu\right) \le \left(\frac{e^{\varepsilon}}{(1+\varepsilon)^{1+\varepsilon}}\right)^\mu.\]

For every $\varepsilon\in(0,1)$,

\[\mathbb{P}\left(S_n\le (1-\varepsilon)\mu\right) \le \exp\left(-\frac{\mu\varepsilon^2}{2}\right).\]

A frequently used simplified upper-tail form is

\[\mathbb{P}\left(S_n\ge (1+\varepsilon)\mu\right) \le \exp\left(-\frac{\mu\varepsilon^2}{2+\varepsilon}\right).\]

In particular, for $\varepsilon\in(0,1)$,

\[\mathbb{P}\left(S_n\ge (1+\varepsilon)\mu\right) \le \exp\left(-\frac{\mu\varepsilon^2}{3}\right).\]

Proof of the upper-tail bound

For $\lambda>0$, Markov’s inequality gives

\[\mathbb{P}(S_n\ge (1+\varepsilon)\mu) \le \exp(-\lambda(1+\varepsilon)\mu) \mathbb{E}\left[e^{\lambda S_n}\right].\]

By independence,

\[\mathbb{E}\left[e^{\lambda S_n}\right] = \prod_{i=1}^n \mathbb{E}\left[e^{\lambda X_i}\right].\]

If $X_i\sim\operatorname{Bernoulli}(p_i)$, then

\[\mathbb{E}\left[e^{\lambda X_i}\right] = 1-p_i+p_i e^{\lambda} = 1+p_i(e^{\lambda}-1).\]

Using $1+x\le e^x$,

\[1+p_i(e^{\lambda}-1) \le \exp\left(p_i(e^{\lambda}-1)\right).\]

Therefore,

\[\mathbb{E}\left[e^{\lambda S_n}\right] \le \exp\left((e^{\lambda}-1)\sum_{i=1}^n p_i\right) = \exp\left(\mu(e^{\lambda}-1)\right).\]

Hence

\[\mathbb{P}(S_n\ge (1+\varepsilon)\mu) \le \exp\left( -\lambda(1+\varepsilon)\mu + \mu(e^{\lambda}-1) \right).\]

The optimal choice is

\[\lambda = \log(1+\varepsilon),\]

so $e^{\lambda}=1+\varepsilon$. Substituting gives

\[-\lambda(1+\varepsilon) + e^{\lambda}-1 = -(1+\varepsilon)\log(1+\varepsilon)+\varepsilon.\]

Therefore

\[\mathbb{P}(S_n\ge (1+\varepsilon)\mu) \le \exp\left( \mu\left[\varepsilon-(1+\varepsilon)\log(1+\varepsilon)\right] \right).\]

This is exactly

\[\mathbb{P}(S_n\ge (1+\varepsilon)\mu) \le \left(\frac{e^{\varepsilon}}{(1+\varepsilon)^{1+\varepsilon}}\right)^\mu.\]

The simplified form follows from the elementary inequality

\[(1+\varepsilon)\log(1+\varepsilon)-\varepsilon \ge \frac{\varepsilon^2}{2+\varepsilon}.\]

Proof of the lower-tail bound

Let $\lambda<0$. By Markov’s inequality,

\[\mathbb{P}(S_n\le (1-\varepsilon)\mu) = \mathbb{P}(e^{\lambda S_n}\ge e^{\lambda(1-\varepsilon)\mu}).\]

Since $\lambda<0$, the event $S_n\le (1-\varepsilon)\mu$ becomes an upper-tail event for $e^{\lambda S_n}$. Thus

\[\mathbb{P}(S_n\le (1-\varepsilon)\mu) \le \exp(-\lambda(1-\varepsilon)\mu) \mathbb{E}\left[e^{\lambda S_n}\right].\]

The same moment generating function bound gives

\[\mathbb{E}\left[e^{\lambda S_n}\right] \le \exp\left(\mu(e^{\lambda}-1)\right).\]

Choose

\[\lambda = \log(1-\varepsilon) <0.\]

Then

\[\mathbb{P}(S_n\le (1-\varepsilon)\mu) \le \exp\left( \mu\left[-\varepsilon-(1-\varepsilon)\log(1-\varepsilon)\right] \right).\]

Using the elementary inequality

\[(1-\varepsilon)\log(1-\varepsilon)+\varepsilon \ge \frac{\varepsilon^2}{2}, \quad 0<\varepsilon<1,\]

we obtain

\[\mathbb{P}(S_n\le (1-\varepsilon)\mu) \le \exp\left(-\frac{\mu\varepsilon^2}{2}\right).\]

Intuition

Chernoff bounds are multiplicative. They say that if the expected number of visits is $\mu$, then the probability of seeing much fewer or much more than $\mu$ visits decays exponentially in $\mu$.

This is extremely useful in asynchronous RL. If a state-action pair $(s,a)$ has visitation probability $\lambda(s,a)$, then after $t$ i.i.d. draws, its visit count is roughly binomial with mean

\[\lambda(s,a)t.\]

A Chernoff or Bernstein bound can show that, after a burn-in time, every state-action pair has been visited often enough.

7. Bernstein’s inequality: variance-sensitive concentration

Hoeffding uses only boundedness. Bernstein uses boundedness plus variance. It is sharper when the variance is much smaller than the worst-case squared range.

Theorem 5: Bernstein’s inequality

Let $X_1,\ldots,X_n$ be independent random variables such that

\[\mathbb{E}[X_i]=0\]

and

\[\lvert X_i\rvert\le b \quad \text{almost surely}.\]

Let

\[V = \sum_{i=1}^n \mathbb{E}[X_i^2].\]

Then, for every $t>0$,

\[\mathbb{P}\left(\sum_{i=1}^n X_i\ge t\right) \le \exp\left( -\frac{t^2}{2(V+bt/3)} \right).\]

Consequently,

\[\mathbb{P}\left(\left\lvert \sum_{i=1}^n X_i\right\rvert\ge t\right) \le 2\exp\left( -\frac{t^2}{2(V+bt/3)} \right).\]

Proof

The proof is again based on the exponential method. We first use a moment generating function bound.

For $\lambda\in(0,3/b)$, one can show that

\[\mathbb{E}\left[e^{\lambda X_i}\right] \le \exp\left( \frac{\lambda^2\mathbb{E}[X_i^2]}{2(1-\lambda b/3)} \right).\]

This inequality follows from the Taylor expansion of the exponential and the bound $\lvert X_i\rvert\le b$. Indeed, for $k\ge 2$,

\[\lvert X_i\rvert^k \le b^{k-2}X_i^2.\]

Using $\mathbb{E}[X_i]=0$,

\[\mathbb{E}[e^{\lambda X_i}] = 1+ \sum_{k=2}^{\infty}\frac{\lambda^k\mathbb{E}[X_i^k]}{k!}.\]

Bounding the higher moments by the second moment yields

\[\mathbb{E}[e^{\lambda X_i}] \le 1+ \mathbb{E}[X_i^2] \sum_{k=2}^{\infty}\frac{\lambda^k b^{k-2}}{k!}.\]

A standard estimate gives

\[\sum_{k=2}^{\infty}\frac{\lambda^k b^{k-2}}{k!} \le \frac{\lambda^2}{2(1-\lambda b/3)}.\]

Therefore,

\[\mathbb{E}[e^{\lambda X_i}] \le 1+ \frac{\lambda^2\mathbb{E}[X_i^2]}{2(1-\lambda b/3)}.\]

Since $1+x\le e^x$,

\[\mathbb{E}[e^{\lambda X_i}] \le \exp\left( \frac{\lambda^2\mathbb{E}[X_i^2]}{2(1-\lambda b/3)} \right).\]

By independence,

\[\mathbb{E}\left[e^{\lambda\sum_{i=1}^n X_i}\right] = \prod_{i=1}^n \mathbb{E}[e^{\lambda X_i}] \le \exp\left( \frac{\lambda^2V}{2(1-\lambda b/3)} \right).\]

Thus, for every $\lambda\in(0,3/b)$,

\[\mathbb{P}\left(\sum_{i=1}^n X_i\ge t\right) \le \exp\left( -\lambda t+ \frac{\lambda^2V}{2(1-\lambda b/3)} \right).\]

Choosing

\[\lambda=\frac{t}{V+bt/3}\]

gives $\lambda b<3$ and yields

\[\mathbb{P}\left(\sum_{i=1}^n X_i\ge t\right) \le \exp\left( -\frac{t^2}{2(V+bt/3)} \right).\]

The two-sided version follows by applying the same argument to $-X_i$ and using the union bound.

Two regimes

Bernstein has two regimes. When $t$ is small compared to $V/b$, the term $V$ dominates, and the exponent behaves like

\[-\frac{t^2}{2V}.\]

This is the sub-Gaussian regime.

When $t$ is large compared to $V/b$, the term $bt$ dominates, and the exponent behaves like

\[-\frac{3t}{2b}.\]

This is the sub-exponential regime.

Thus Bernstein smoothly interpolates between Gaussian-like and exponential-like tails.

Why Bernstein is often sharper than Hoeffding

Suppose $X_i\in[-b,b]$. Hoeffding effectively uses the worst-case scale $b^2n$. Bernstein uses the actual variance

\[V=\sum_{i=1}^n \mathbb{E}[X_i^2].\]

If $V\ll nb^2$, Bernstein can be much sharper.

In RL proofs, Bernstein is useful when controlling empirical counts or transition estimates because Bernoulli variables may have small variance when the probability of the event is small.

8. Martingales: why independence is not always necessary

Many stochastic approximation algorithms are not sums of independent random variables. The iterate at time $t+1$ depends on the random data observed at time $t$, and also on the entire past through the current iterate.

This is where martingales enter.

A martingale difference sequence $(D_i)_{i=1}^n$ satisfies

\[\mathbb{E}[D_i\mid \mathcal{F}_{i-1}]=0.\]

It need not be independent. The random variable $D_i$ may depend strongly on the past. The only requirement is that, after conditioning on the past, its mean is zero.

This is precisely the structure of many noise terms in stochastic approximation. For example, a Bellman noise term often has the form

\[D_t = \text{random Bellman target at time } t - \text{conditional mean Bellman target given the past}.\]

Then

\[\mathbb{E}[D_t\mid \mathcal{F}_{t-1}]=0.\]

This is not independence, but it is enough for martingale concentration.

9. Azuma-Hoeffding inequality: Hoeffding for martingales

Azuma-Hoeffding is the martingale analogue of Hoeffding’s inequality.

Theorem 6: Azuma-Hoeffding inequality

Let $\{M_i\}_{i=0}^{n}$ be a martingale with respect to $\{\mathcal{F}_i\}_{i=0}^{n}.$ Suppose there exist deterministic constants $c_1,\ldots,c_n$ such that

\[\lvert M_i-M_{i-1}\rvert\le c_i \quad \text{almost surely for every } i.\]

Then, for every $t>0$,

\[\mathbb{P}(M_n-M_0\ge t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2} \right).\]

Similarly,

\[\mathbb{P}(M_0-M_n\ge t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2} \right).\]

Therefore,

\[\mathbb{P}(\lvert M_n-M_0\rvert\ge t) \le 2\exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2} \right).\]

Proof

Let

\[D_i=M_i-M_{i-1}.\]

Then

\[\mathbb{E}[D_i\mid \mathcal{F}_{i-1}]=0\]

and

\[D_i\in[-c_i,c_i] \quad \text{almost surely}.\]

For $\lambda>0$, Markov’s inequality gives

\[\mathbb{P}(M_n-M_0\ge t) = \mathbb{P}\left(\sum_{i=1}^n D_i\ge t\right) \le \exp(-\lambda t) \mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n D_i\right)\right].\]

We now control the exponential moment iteratively using conditional expectation. Since $D_n$ is conditionally zero-mean and bounded in $[-c_n,c_n]$, Hoeffding’s lemma gives

\[\mathbb{E}\left[e^{\lambda D_n}\mid \mathcal{F}_{n-1}\right] \le \exp\left(\frac{\lambda^2c_n^2}{2}\right).\]

Therefore,

\[\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n D_i\right)\right] = \mathbb{E}\left[ \exp\left(\lambda\sum_{i=1}^{n-1}D_i\right) \mathbb{E}\left[e^{\lambda D_n}\mid \mathcal{F}_{n-1}\right] \right]\]

is at most

\[\exp\left(\frac{\lambda^2c_n^2}{2}\right) \mathbb{E}\left[ \exp\left(\lambda\sum_{i=1}^{n-1}D_i\right) \right].\]

Repeating this argument backward gives

\[\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n D_i\right)\right] \le \exp\left( \frac{\lambda^2}{2}\sum_{i=1}^n c_i^2 \right).\]

Hence

\[\mathbb{P}(M_n-M_0\ge t) \le \exp\left( -\lambda t+ \frac{\lambda^2}{2}\sum_{i=1}^n c_i^2 \right).\]

The optimal choice is

\[\lambda=\frac{t}{\sum_{i=1}^n c_i^2}.\]

Substituting gives

\[\mathbb{P}(M_n-M_0\ge t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2} \right).\]

The lower-tail and two-sided bounds follow as before.

Intuition

Azuma-Hoeffding says that a martingale with bounded increments cannot drift too far. The increments may be dependent, but they have no predictable direction.

In stochastic approximation, this is powerful because the noise terms are usually adapted to the past rather than independent. Azuma lets us control sums like

\[\sum_{t=0}^{T-1}\alpha_t D_t,\]

provided we can bound each weighted increment:

\[\lvert \alpha_t D_t\rvert\le c_t.\]

The resulting fluctuation scale is

\[\sqrt{\sum_{t=0}^{T-1}c_t^2\log(1/\delta)}.\]

10. A refined high-probability Azuma-Hoeffding principle

Standard Azuma-Hoeffding requires deterministic bounded differences. It only sees the worst-case increment bound.

This can be too crude.

Suppose a martingale difference $D_i$ satisfies both of the following:

A crude deterministic bound:

\[\lvert D_i\rvert\le b_i \quad \text{always}.\]

A much sharper typical bound:

\[\lvert D_i\rvert\le c_i\]

with high probability, where $c_i\ll b_i$.

Standard Azuma-Hoeffding forces us to use $b_i$. But if the event $\lvert D_i\rvert>c_i$ is rare, the true fluctuations should be closer to the scale $c_i$, not $b_i$.

This is the setting where refined high-probability versions of Azuma-Hoeffding become useful.

Theorem 7: Shamir-Spencer type probabilistic Azuma-Hoeffding inequality

Let $(X_i)_{i=0}^n$ be a martingale with $X_0$ constant. Suppose that for each $0\le i<n$, there exist deterministic numbers $c_i>0$, $b_i>0$, and $r\in(0,1)$ such that

\[\mathbb{P}\left(\lvert X_{i+1}-X_i\rvert\le c_i\right) \ge 1-r,\]

and

\[\lvert X_{i+1}-X_i\rvert\le b_i \quad \text{deterministically}.\]

Assume also that

\[b_i\sqrt{r}\le c_i \quad \text{for all } i.\]

Then the following bound holds:

\[\mathbb{P}\left( \lvert X_n-X_0\rvert > \sqrt{32\left(\sum_{i=0}^{n-1}c_i^2\right)\log\left(\frac{2}{\delta}\right)} + \sum_{i=0}^{n-1}b_i\sqrt{r} \right) < \delta+2n\sqrt{r}.\]

This form is adapted from a theorem of Shamir and Spencer. It is the kind of result needed when the increments are not uniformly small deterministically, but are uniformly small on a high-probability event.

What the theorem says

The ordinary Azuma-Hoeffding bound with deterministic increments $b_i$ would give the scale

\[\sqrt{\left(\sum_{i=0}^{n-1}b_i^2\right)\log(1/\delta)}.\]

The refined inequality instead gives the scale

\[\sqrt{\left(\sum_{i=0}^{n-1}c_i^2\right)\log(1/\delta)} + \sum_{i=0}^{n-1}b_i\sqrt{r}.\]

If $c_i\ll b_i$ and $r$ is very small, this can be dramatically sharper.

The extra term

\[\sum_{i=0}^{n-1}b_i\sqrt{r}\]

is the price paid for rare bad increments. The additional failure probability

\[2n\sqrt{r}\]

is the price paid for ensuring that these rare events do not accumulate too badly.

A clean conditional version and proof

The following slightly stronger conditional version is easier to prove and captures the main mechanism.

Let

\[D_i=X_i-X_{i-1}.\]

Assume

\[\mathbb{E}[D_i\mid \mathcal{F}_{i-1}]=0,\] \[\lvert D_i\rvert\le b_i \quad \text{almost surely},\]

and

\[\mathbb{P}(\lvert D_i\rvert>c_i\mid \mathcal{F}_{i-1}) \le r \quad \text{almost surely}.\]

Define the bad event

\[B_i=\{\lvert D_i\rvert>c_i\}.\]

Now define the truncated and recentered increment

\[\widetilde{D}_i = D_i\mathbf{1}_{B_i^c} - \mathbb{E}\left[D_i\mathbf{1}_{B_i^c}\mid\mathcal{F}_{i-1}\right].\]

Then

\[\mathbb{E}[\widetilde{D}_i\mid\mathcal{F}_{i-1}]=0,\]

\[\widetilde{M}_k= \sum_{i=1}^k \widetilde{D}_i\]

is a martingale.

Next, because $D_i$ is conditionally mean-zero,

\[\mathbb{E}[D_i\mathbf{1}_{B_i^c}\mid\mathcal{F}_{i-1}] = - \mathbb{E}[D_i\mathbf{1}_{B_i}\mid\mathcal{F}_{i-1}].\]

Thus

\[\left\lvert \mathbb{E}[D_i\mathbf{1}_{B_i^c}\mid\mathcal{F}_{i-1}] \right\rvert \le \mathbb{E}[\lvert D_i\rvert\mathbf{1}_{B_i}\mid\mathcal{F}_{i-1}] \le b_i r.\]

On $B_i^c$, $\lvert D_i\rvert\le c_i$. Hence

\[\lvert \widetilde{D}_i\rvert \le c_i+b_ir.\]

If $b_i\sqrt{r}\le c_i$, then $b_ir\le c_i$, and therefore

\[\lvert \widetilde{D}_i\rvert \le 2c_i.\]

Azuma-Hoeffding applied to $(\widetilde{M}_k)$ gives

\[\mathbb{P}\left( \left\lvert\sum_{i=1}^n\widetilde{D}_i\right\rvert > \sqrt{8\left(\sum_{i=1}^n c_i^2\right)\log\left(\frac{2}{\delta}\right)} \right) \le \delta.\]

Now compare the original sum and the truncated martingale. On the event that no bad event $B_i$ occurs,

\[D_i = \widetilde{D}_i + \mathbb{E}\left[D_i\mathbf{1}_{B_i^c}\mid\mathcal{F}_{i-1}\right].\]

Therefore, on that event,

\[\left\lvert\sum_{i=1}^n D_i\right\rvert \le \left\lvert\sum_{i=1}^n\widetilde{D}_i\right\rvert + \sum_{i=1}^n b_i r.\]

Since $r\le \sqrt{r}$ for $r\in(0,1)$,

\[\sum_{i=1}^n b_i r \le \sum_{i=1}^n b_i\sqrt{r}.\]

Finally, by the union bound,

\[\mathbb{P}\left(\bigcup_{i=1}^n B_i\right) \le \sum_{i=1}^n \mathbb{P}(B_i) \le nr.\]

Since $nr\le n\sqrt{r}$, we get the high-probability control

\[\left\lvert\sum_{i=1}^nD_i\right\rvert \le \sqrt{8\left(\sum_{i=1}^n c_i^2\right)\log\left(\frac{2}{\delta}\right)} + \sum_{i=1}^n b_i\sqrt{r}\]

with probability at least

\[1-\delta-n\sqrt{r}.\]

This conditional theorem is not exactly the Shamir-Spencer statement above, but it explains the mechanism:

truncate the rare large increments;
recenter to recover a martingale;
apply Azuma to the typical increments;
pay a small price for the rare increments.

Why this matters in robust RL

In reward-agnostic robust Q-learning, the threshold may grow as a polynomial proxy for an unknown reward scale. This gives a crude deterministic bound on the iterates, for example of order

\[O(T^p),\]

but on a good event the iterates are actually bounded on the much smaller natural scale, for example

\[O(1).\]

If we use standard Azuma-Hoeffding, the martingale increment bound is forced to use the crude $O(T^p)$ scale. This can destroy the finite-time rate. The refined high-probability Azuma-Hoeffding principle allows the proof to exploit the fact that the large bound is only needed on rare events.

This is exactly the conceptual role of the Shamir-Spencer inequality in the analysis of reward-agnostic robust asynchronous Q-learning.

11. Freedman’s inequality: Bernstein for martingales

Azuma-Hoeffding is the martingale analogue of Hoeffding. Freedman’s inequality is the martingale analogue of Bernstein.

Instead of depending only on a uniform increment bound, Freedman also uses the conditional variance process.

Let $(M_i)_{i=0}^n$ be a martingale and define

\[D_i=M_i-M_{i-1}.\]

The predictable quadratic variation is

\[V_n= \sum_{i=1}^n \mathbb{E}[D_i^2\mid\mathcal{F}_{i-1}].\]

This is the martingale analogue of the variance sum

\[\sum_{i=1}^n\mathbb{E}[X_i^2]\]

in Bernstein’s inequality.

Theorem 8: Freedman’s inequality

Let $(M_i)_{i=0}^n$ be a martingale with increments

\[D_i=M_i-M_{i-1}.\]

Assume

\[D_i\le b \quad \text{almost surely for every } i.\]

Let

\[V_n= \sum_{i=1}^n \mathbb{E}[D_i^2\mid\mathcal{F}_{i-1}].\]

Then, for every $t>0$ and $v>0$,

\[\mathbb{P}\left(M_n-M_0\ge t \text{ and } V_n\le v\right) \le \exp\left( -\frac{t^2}{2(v+bt/3)} \right).\]

A two-sided version follows if $\lvert D_i\rvert\le b$ almost surely:

\[\mathbb{P}\left(\lvert M_n-M_0\rvert\ge t \text{ and } V_n\le v\right) \le 2\exp\left( -\frac{t^2}{2(v+bt/3)} \right).\]

Proof idea with the exponential supermartingale

Freedman’s proof is based on an exponential supermartingale.

For $\lambda\in(0,3/b)$, one can show that

\[\mathbb{E}\left[e^{\lambda D_i}\mid\mathcal{F}_{i-1}\right] \le \exp\left( \frac{\lambda^2}{2(1-\lambda b/3)} \mathbb{E}[D_i^2\mid\mathcal{F}_{i-1}] \right).\]

This is the conditional analogue of Bernstein’s moment generating function bound.

Define

\[\phi(\lambda) = \frac{\lambda^2}{2(1-\lambda b/3)}.\]

Then

\[\mathbb{E}\left[ \exp\left(\lambda D_i-\phi(\lambda)\mathbb{E}[D_i^2\mid\mathcal{F}_{i-1}]\right) \mid\mathcal{F}_{i-1} \right] \le 1.\]

Therefore,

\[Z_k = \exp\left( \lambda(M_k-M_0) - \phi(\lambda)V_k \right)\]

is a nonnegative supermartingale. Hence

\[\mathbb{E}[Z_n] \le \mathbb{E}[Z_0] = 1.\]

On the event

\[\{M_n-M_0\ge t,\, V_n\le v\},\]

we have

\[Z_n \ge \exp(\lambda t-\phi(\lambda)v).\]

By Markov’s inequality applied to $Z_n$,

\[\mathbb{P}(M_n-M_0\ge t,\, V_n\le v) \le \exp(-\lambda t+\phi(\lambda)v).\]

Optimizing over $\lambda$ yields the Bernstein-type exponent

\[-\frac{t^2}{2(v+bt/3)}.\]

Thus

\[\mathbb{P}\left(M_n-M_0\ge t \text{ and } V_n\le v\right) \le \exp\left( -\frac{t^2}{2(v+bt/3)} \right).\]

Intuition

Freedman’s inequality says that martingale fluctuations are governed not only by the worst possible jump size $b$, but also by the accumulated conditional variance $V_n$.

This is extremely important in adaptive processes. Even if the increments are bounded by $b$, the process may usually have much smaller conditional variance. Freedman captures this. Azuma-Hoeffding does not.

In reinforcement learning, Freedman is often the right tool when the martingale noise has a variance structure that depends on the current state, action, or value function.

12. How the inequalities fit together

Here is the conceptual progression.

Markov

Use when you only know

\[\mathbb{E}[X].\]

It gives

\[\mathbb{P}(X\ge a) \le \frac{\mathbb{E}[X]}{a}.\]

It is general but weak.

Chebyshev

Use when you know

\[\operatorname{Var}(X).\]

It gives

\[\mathbb{P}(\lvert X-\mathbb{E}[X]\rvert\ge t) \le \frac{\operatorname{Var}(X)}{t^2}.\]

It handles heavy tails with finite variance but has weak confidence dependence.

Hoeffding

Use for independent bounded variables. It gives

\[\mathbb{P}\left(\lvert S_n-\mathbb{E}[S_n]\rvert\ge t\right) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2}\right).\]

It is simple and gives clean sub-Gaussian concentration.

Chernoff

Use for sums of Bernoulli variables or nonnegative counting variables with tractable moment generating functions. It gives multiplicative control around the mean:

\[S_n \approx \mu\]

with exponentially high probability in $\mu$.

Bernstein

Use for independent bounded variables when variance information matters. It gives

\[\mathbb{P}\left(S_n\ge t\right) \le \exp\left(-\frac{t^2}{2(V+bt/3)}\right).\]

It is sharper than Hoeffding when $V\ll nb^2$.

Azuma-Hoeffding

Use for martingales with deterministic bounded increments. It gives

\[\mathbb{P}(\lvert M_n-M_0\rvert\ge t) \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).\]

It is the default martingale concentration inequality.

Refined high-probability Azuma-Hoeffding

Use when increments are always bounded by $b_i$, but are usually bounded by a much smaller $c_i$. It gives concentration at the typical scale $c_i$, plus a small penalty for rare bad increments.

This is useful when a deterministic worst-case bound exists only for technical safety, but the actual algorithm behaves much better on a good event.

Freedman

Use for martingales when conditional variance matters. It gives

\[\mathbb{P}\left(M_n-M_0\ge t \text{ and } V_n\le v\right) \le \exp\left(-\frac{t^2}{2(v+bt/3)}\right).\]

It is the variance-sensitive martingale analogue of Bernstein.

13. A research-oriented decision guide

When facing a random error term, ask the following questions.

Question 1: Is the object nonnegative and do I only know its expectation?

Use Markov.

Example:

\[\mathbb{P}(X\ge a) \le \frac{\mathbb{E}[X]}{a}.\]

Question 2: Do I only know a second moment or variance?

Use Chebyshev.

Example:

\[\mathbb{P}(\lvert X-\mu\rvert\ge t) \le \frac{\sigma^2}{t^2}.\]

Question 3: Is it a sum of independent bounded variables?

Use Hoeffding.

Example:

\[\lvert \overline{X}_n-\mu\rvert = O\left(\sqrt{\frac{\log(1/\delta)}{n}}\right).\]

Question 4: Is it a sum of indicators or a visit count?

Use Chernoff or Bernstein.

Example:

\[\mathbb{P}\left(N_t(s,a)\le \frac{1}{2}\lambda(s,a)t\right) \le \exp(-c\lambda(s,a)t)\]

for a numerical constant $c>0$.

Question 5: Is it a martingale noise term with bounded increments?

Use Azuma-Hoeffding.

Example:

\[\sum_{t=0}^{T-1}\alpha_t D_t = O\left( \sqrt{\sum_{t=0}^{T-1}\alpha_t^2 c_t^2\log(1/\delta)} \right)\]

with high probability.

Question 6: Is it a martingale noise term with small conditional variance?

Use Freedman.

Example:

\[\sum_{t=0}^{T-1}D_t = O\left( \sqrt{V_T\log(1/\delta)}+b\log(1/\delta) \right)\]

with high probability, where $V_T$ is the predictable quadratic variation.

Question 7: Is the martingale increment usually small but rarely large?

Use a refined high-probability Azuma-Hoeffding inequality.

Example:

\[\lvert D_t\rvert\le b_t \quad \text{always},\]

but

\[\lvert D_t\rvert\le c_t \quad \text{with high probability},\]

where $c_t\ll b_t$. Then the refined inequality can preserve the $c_t$-scale rather than paying the full $b_t$-scale.

14. How these tools appear in reinforcement learning proofs

Finite-time RL proofs often contain several different random objects. No single concentration inequality is best for all of them.

Visit counts

Suppose a state-action pair $(s,a)$ is visited with probability $\lambda(s,a)$. Under an i.i.d. sampling abstraction, the count

\[N_t(s,a) = \sum_{k=1}^t \mathbf{1}\{(s_k,a_k)=(s,a)\}\]

is a sum of Bernoulli random variables with mean

\[\mathbb{E}[N_t(s,a)] = \lambda(s,a)t.\]

Chernoff or Bernstein gives

\[N_t(s,a) \ge \frac{1}{2}\lambda(s,a)t\]

with high probability once $t$ is large enough. This kind of event is needed before empirical estimates for every state-action pair become reliable.

Empirical rewards

If rewards are bounded and independent, Hoeffding controls the empirical reward mean:

\[\left\lvert \frac{1}{N}\sum_{i=1}^N r_i(s,a)-R(s,a) \right\rvert = O\left(\sqrt{\frac{\log(1/\delta)}{N}}\right).\]

If rewards are heavy-tailed but have finite variance, Chebyshev gives a weaker guarantee, and one often needs robust mean estimation to recover sub-Gaussian-type behavior.

Bellman noise

A typical Q-learning update contains a random target and its conditional mean. The difference is a martingale difference. This is where Azuma-Hoeffding or Freedman enters.

If the noise increments are uniformly bounded, Azuma-Hoeffding is natural. If one can exploit conditional variance, Freedman can sharpen the result.

Robust reward estimation and rare events

In robust RL under adversarial reward corruption, the analysis often constructs a good event on which robust reward estimates are accurate. Outside this event, the algorithm must still remain controlled. This creates two scales:

a deterministic safety scale;
a sharper high-probability scale.

Standard Azuma-Hoeffding only sees the deterministic safety scale. Refined high-probability Azuma-Hoeffding inequalities are designed precisely for this two-scale situation.

15. Summary

The inequalities in this post should be viewed as a hierarchy.

At the bottom, Markov and Chebyshev need very little structure, but give weak tails. Hoeffding and Chernoff give exponential tails by exploiting boundedness and independence. Bernstein improves Hoeffding by using variance. Azuma-Hoeffding moves from independence to martingales. Freedman adds variance sensitivity back into the martingale setting. Finally, refined high-probability Azuma-Hoeffding inequalities handle the subtle situation where the increments are usually small but only crudely bounded on rare events.

This hierarchy is exactly why concentration inequalities are so central in reinforcement learning theory. A finite-time RL proof is not just one concentration argument; it is usually a careful orchestration of several different concentration tools, each matched to the specific random object being controlled.

References

K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19:357–367, 1967.
S. Bernstein. On a modification of Chebyshev’s inequality and of the error formula of Laplace. 1924.
H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.
E. Shamir and J. Spencer. Sharp concentration of the chromatic number on random graphs $G_{n,p}$. Combinatorica, 7(1):121–129, 1987.
S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
P. Massart. Concentration Inequalities and Model Selection. Springer, 2007.
S. Maity and A. Mitra. Corruption-Tolerant Asynchronous Q-learning with Near-Optimal Rates. ICML, 2026.

rl-blogs

This post is licensed under CC BY 4.0 by the author.

1. Basic notation

2. Markov’s inequality: concentration from one moment

Theorem 1: Markov’s inequality

Proof

Intuition

A useful general form

3. Chebyshev’s inequality: concentration from variance

Theorem 2: Chebyshev’s inequality

Proof

Sample mean version

Intuition

4. The exponential method: the common backbone

5. Hoeffding’s inequality: bounded independent noise

Hoeffding’s lemma

Theorem 3: Hoeffding’s inequality

Proof

Sample mean form

Intuition

6. Chernoff bounds: concentration for sums of Bernoulli variables

Theorem 4: Multiplicative Chernoff bound

Proof of the upper-tail bound

Proof of the lower-tail bound

Intuition

7. Bernstein’s inequality: variance-sensitive concentration

Theorem 5: Bernstein’s inequality

Proof

Two regimes

Why Bernstein is often sharper than Hoeffding

8. Martingales: why independence is not always necessary

9. Azuma-Hoeffding inequality: Hoeffding for martingales

Theorem 6: Azuma-Hoeffding inequality

Proof

Intuition

10. A refined high-probability Azuma-Hoeffding principle

Theorem 7: Shamir-Spencer type probabilistic Azuma-Hoeffding inequality

What the theorem says

A clean conditional version and proof

Why this matters in robust RL

11. Freedman’s inequality: Bernstein for martingales

Theorem 8: Freedman’s inequality

Proof idea with the exponential supermartingale

Intuition

12. How the inequalities fit together

Markov

Chebyshev

Hoeffding

Chernoff

Bernstein

Azuma-Hoeffding

Refined high-probability Azuma-Hoeffding

Freedman

13. A research-oriented decision guide

Question 1: Is the object nonnegative and do I only know its expectation?

Question 2: Do I only know a second moment or variance?

Question 3: Is it a sum of independent bounded variables?

Question 4: Is it a sum of indicators or a visit count?

Question 5: Is it a martingale noise term with bounded increments?

Question 6: Is it a martingale noise term with small conditional variance?

Question 7: Is the martingale increment usually small but rarely large?

14. How these tools appear in reinforcement learning proofs

Visit counts

Empirical rewards

Bellman noise

Robust reward estimation and rare events

15. Summary

References

Trending Tags