Hoeffding's inequality

In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proven by Wassily Hoeffding in 1963.[1]

Hoeffding's inequality generalizes the Chernoff bound, which applies only to Bernoulli random variables,[2] and is a special case of the Azuma–Hoeffding inequality and of McDiarmid's inequality. It is similar to, but incomparable with, the Bernstein inequality, proved by Sergei Bernstein in 1923.

Special case of Bernoulli random variables

Hoeffding's inequality can be applied to the important special case of identically distributed Bernoulli random variables, and this is how the inequality is often used in combinatorics and computer science. We consider a coin that shows heads with probability $p$ and tails with probability $1 - p$. We toss the coin $n$ times. The expected number of times the coin comes up heads is $pn$. Furthermore, the probability that the coin comes up heads at most $k$ times can be exactly quantified by the following expression:

\operatorname {P} (H(n)\leq k)=\sum _{i=0}^{k}{\binom {n}{i}}p^{i}(1-p)^{n-i},

where $H(n)$ is the number of heads in $n$ coin tosses.

When $k = (p - ε) n$ for some $ε > 0$, Hoeffding's inequality bounds this probability by a term that is exponentially small in $ε^2 n$:

\operatorname {P} (H(n)\leq (p-\varepsilon )n)\leq \exp \left(-2\varepsilon ^{2}n\right).

Similarly, when $k = (p + ε) n$ for some $ε > 0$, Hoeffding's inequality bounds the probability that we see at least $εn$ more tosses that show heads than we would expect:

\operatorname {P} (H(n)\geq (p+\varepsilon )n)\leq \exp \left(-2\varepsilon ^{2}n\right).

Hence Hoeffding's inequality implies that the number of heads that we see is concentrated around its mean, with exponentially small tails:

\operatorname {P} \left((p-\varepsilon )n\leq H(n)\leq (p+\varepsilon )n\right)\geq 1-2\exp \left(-2\varepsilon ^{2}n\right).
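The gap between the exact binomial tail and Hoeffding's bound can be checked numerically. The following is a minimal sketch, not part of the original article; the function names `binomial_lower_tail` and `hoeffding_tail_bound` and the parameter values are chosen here purely for illustration.

```python
from math import comb, exp


def binomial_lower_tail(n: int, k: int, p: float) -> float:
    """Exact P(H(n) <= k) for n tosses of a coin with heads probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def hoeffding_tail_bound(n: int, eps: float) -> float:
    """Hoeffding's upper bound exp(-2 * eps^2 * n) on a one-sided tail."""
    return exp(-2 * eps**2 * n)


# Illustrative (assumed) parameters: a fair coin tossed 100 times,
# asking for at most k = 40 heads, i.e. eps = p - k/n = 0.1.
n, p, k = 100, 0.5, 40
eps = p - k / n

exact = binomial_lower_tail(n, k, p)
bound = hoeffding_tail_bound(n, eps)
print(f"exact tail = {exact:.6f}, Hoeffding bound = {bound:.6f}")
```

The bound is never smaller than the exact tail, but it has a closed form and needs no summation, which is why it is convenient in analyses.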

For example, taking $\varepsilon ={\sqrt {\dfrac {\ln {n}}{n}}}$ gives:

\operatorname {P} \left(|H(n)-pn|\leq {\sqrt {n\ln n}}\right)\geq 1-2\exp \left(-2\ln n\right)=1-2/n^{2}.
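The substitution behind this example can be spelled out. With $\varepsilon ={\sqrt {\ln n/n}}$, the deviation $\varepsilon n$ and the exponent $2\varepsilon ^{2}n$ become:

\varepsilon n={\sqrt {n\ln n}},\qquad 2\varepsilon ^{2}n=2\cdot {\frac {\ln n}{n}}\cdot n=2\ln n,\qquad \exp \left(-2\ln n\right)=n^{-2}.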

General case of bounded random variables

Let $X_1, ..., X_n$ be independent random variables bounded by the interval $[0, 1]$: $0 \leq X_i \leq 1$. We define the empirical mean of these variables by

{\overline {X}}={\frac {1}{n}}(X_{1}+\cdots +X_{n}).

One of the inequalities in Theorem 1 of Hoeffding (1963) states

{\begin{aligned}\operatorname {P} \left({\overline {X}}-\mathrm {E} \left[{\overline {X}}\right]\geq t\right)\leq e^{-2nt^{2}}\end{aligned}}

where $t\geq 0$.

Theorem 2 of Hoeffding (1963) generalizes the above inequality to the case where each $X_i$ is known to be bounded, almost surely, by the interval $[a_i, b_i]$:

{\begin{aligned}\operatorname {P} \left({\overline {X}}-\mathrm {E} \left[{\overline {X}}\right]\geq t\right)&\leq \exp \left(-{\frac {2n^{2}t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right)\\\operatorname {P} \left(\left|{\overline {X}}-\mathrm {E} \left[{\overline {X}}\right]\right|\geq t\right)&\leq 2\exp \left(-{\frac {2n^{2}t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right)\end{aligned}}

which are valid for positive values of $t$. Here $E[X]$ is the expected value of $X$. The inequalities can also be stated in terms of the sum

S_{n}=X_{1}+\cdots +X_{n}

of the random variables:

\operatorname {P} (S_{n}-\mathrm {E} [S_{n}]\geq t)\leq \exp \left(-{\frac {2t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right),

\operatorname {P} (|S_{n}-\mathrm {E} [S_{n}]|\geq t)\leq 2\exp \left(-{\frac {2t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right).

Note that the inequalities also hold when the $X_i$ have been obtained using sampling without replacement; in this case the random variables are no longer independent. A proof of this statement can be found in Hoeffding's paper. For slightly better bounds in the case of sampling without replacement, see for instance the paper by Serfling (1974).
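The bounds for the sum and for the empirical mean differ only by the rescaling $t \mapsto nt$. The following is a minimal sketch of both, assuming the intervals are given as a list of `(a_i, b_i)` pairs; the function names `hoeffding_sum_bound` and `hoeffding_mean_bound` are hypothetical helpers, not from any library.

```python
from math import exp


def hoeffding_sum_bound(t, intervals, two_sided=False):
    """Bound on P(S_n - E[S_n] >= t) for independent X_i in [a_i, b_i].

    With two_sided=True, bounds P(|S_n - E[S_n]| >= t) instead.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    denom = sum((b - a) ** 2 for a, b in intervals)
    if denom == 0:
        raise ValueError("at least one interval must have positive width")
    bound = exp(-2 * t**2 / denom)
    # A probability never exceeds 1, so the two-sided value is capped there.
    return min(1.0, 2 * bound) if two_sided else bound


def hoeffding_mean_bound(t, intervals, two_sided=False):
    """Same bound for the empirical mean: a deviation t of the mean
    corresponds to a deviation n * t of the sum."""
    return hoeffding_sum_bound(len(intervals) * t, intervals, two_sided)


# Illustrative (assumed) setting: 50 variables, each bounded in [0, 1].
intervals = [(0.0, 1.0)] * 50
print(hoeffding_mean_bound(0.1, intervals))        # equals exp(-2 * 50 * 0.1**2)
print(hoeffding_mean_bound(0.1, intervals, True))  # two-sided version
```

When every interval is $[0, 1]$, the denominator equals $n$ and the mean bound reduces to the Theorem 1 form $e^{-2nt^{2}}$.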

General case of sub-Gaussian random variables

A random variable $X$ is called sub-Gaussian[3] if

\mathrm {P} (|X|\geq t)\leq 2e^{-ct^{2}},

for some $c > 0$. For a random variable $X$, the following norm is finite if and only if $X$ is sub-Gaussian:

\Vert X\Vert _{\psi _{2}}:=\inf \left\{c\geq 0:\mathrm {E} \left(e^{X^{2}/c^{2}}\right)\leq 2\right\}.

Let $X_1, ..., X_n$ be zero-mean independent sub-Gaussian random variables. The general version of Hoeffding's inequality then states that:

\mathrm {P} \left(\left|\sum _{i=1}^{n}X_{i}\right|\geq t\right)\leq 2\exp \left(-{\frac {ct^{2}}{\sum _{i=1}^{n}\Vert X_{i}\Vert _{\psi _{2}}^{2}}}\right),

where $c > 0$ is an absolute constant. See Theorem 2.6.2 of Vershynin (2018) for details.
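As a concrete instance of the norm defined above (an illustration added here, not taken from the cited sources), consider a random variable $X$ that takes the values $+1$ and $-1$ with probability $1/2$ each. Since $X^{2}=1$ always, the expectation in the definition is deterministic, and the norm can be computed in closed form:

\mathrm {E} \left(e^{X^{2}/c^{2}}\right)=e^{1/c^{2}}\leq 2\iff c\geq {\frac {1}{\sqrt {\ln 2}}},\qquad {\text{so}}\qquad \Vert X\Vert _{\psi _{2}}={\frac {1}{\sqrt {\ln 2}}}.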

Proof

In this section, we give a proof of Hoeffding's inequality.[4] The proof uses Hoeffding's Lemma:

Suppose $X$ is a real random variable such that $\operatorname {P} \left(X\in \left[a,b\right]\right)=1$. Then, for every real $s$,

\mathrm {E} \left[e^{s\left(X-\mathrm {E} \left[X\right]\right)}\right]\leq \exp \left({\tfrac {1}{8}}s^{2}(b-a)^{2}\right).
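The lemma can be sanity-checked numerically for a simple distribution. The sketch below, an illustration rather than part of the proof, takes $X$ to be a Bernoulli variable with parameter $p$ (so $a = 0$, $b = 1$) and compares the exact centered moment generating function with the lemma's bound on a grid; the function names and the grid are assumptions made for this example.

```python
from math import exp


def bernoulli_centered_mgf(s: float, p: float) -> float:
    """E[exp(s (X - E[X]))] for X ~ Bernoulli(p), which lies in [0, 1]."""
    return p * exp(s * (1 - p)) + (1 - p) * exp(-s * p)


def hoeffding_lemma_bound(s: float, a: float = 0.0, b: float = 1.0) -> float:
    """Right-hand side of Hoeffding's lemma: exp(s^2 (b - a)^2 / 8)."""
    return exp(s**2 * (b - a) ** 2 / 8)


# Grid of s in [-10, 10] and p in [0, 1]; both sides equal 1 at s = 0,
# so a tiny relative tolerance absorbs floating-point rounding.
s_grid = [i / 2 for i in range(-20, 21)]
p_grid = [j / 20 for j in range(21)]
violations = [
    (s, p)
    for s in s_grid
    for p in p_grid
    if bernoulli_centered_mgf(s, p) > hoeffding_lemma_bound(s) * (1 + 1e-12)
]
print(violations)  # expected to be empty if the lemma holds on this grid
```

A grid check of this kind does not prove the lemma; it only confirms that the stated inequality is consistent with one family of bounded distributions.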

Using this lemma, we can prove Hoeffding's inequality. Suppose $X_1, ..., X_n$ are $n$ independent random variables such that

\operatorname {P} \left(X_{i}\in [a_{i},b_{i}]\right)=1,\qquad 1\leq i\leq n.

Let

S_{n}=X_{1}+\cdots +X_{n}.

Then for $s, t > 0$, Markov's inequality and the independence of the $X_i$ imply:

{\begin{aligned}\operatorname {P} \left(S_{n}-\mathrm {E} \left[S_{n}\right]\geq t\right)&=\operatorname {P} \left(e^{s(S_{n}-\mathrm {E} \left[S_{n}\right])}\geq e^{st}\right)\\&\leq e^{-st}\mathrm {E} \left[e^{s(S_{n}-\mathrm {E} \left[S_{n}\right])}\right]\\&=e^{-st}\prod _{i=1}^{n}\mathrm {E} \left[e^{s(X_{i}-\mathrm {E} \left[X_{i}\right])}\right]\\&\leq e^{-st}\prod _{i=1}^{n}e^{\frac {s^{2}(b_{i}-a_{i})^{2}}{8}}\\&=\exp \left(-st+{\tfrac {1}{8}}s^{2}\sum _{i=1}^{n}(b_{i}-a_{i})^{2}\right)\end{aligned}}

To get the best possible upper bound, we find the minimum of the right-hand side of the last inequality as a function of $s$. Define

{\begin{cases}g\colon \mathbf {R_{+}} \to \mathbf {R} \\g(s)=-st+{\frac {s^{2}}{8}}\sum _{i=1}^{n}(b_{i}-a_{i})^{2}\end{cases}}

Note that $g$ is a quadratic function and achieves its minimum at

s={\frac {4t}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}.

Thus we get

\operatorname {P} \left(S_{n}-\mathrm {E} \left[S_{n}\right]\geq t\right)\leq \exp \left(-{\frac {2t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right).
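The minimization step can be written out explicitly. Using the shorthand $D=\sum _{i=1}^{n}(b_{i}-a_{i})^{2}$ (introduced here only for brevity), setting the derivative of $g$ to zero and substituting the minimizer back gives:

{\begin{aligned}g'(s)&=-t+{\frac {s}{4}}D=0\quad \Longrightarrow \quad s={\frac {4t}{D}},\\g\left({\frac {4t}{D}}\right)&=-{\frac {4t^{2}}{D}}+{\frac {1}{8}}\cdot {\frac {16t^{2}}{D^{2}}}\cdot D=-{\frac {4t^{2}}{D}}+{\frac {2t^{2}}{D}}=-{\frac {2t^{2}}{D}}.\end{aligned}}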

Usage

Confidence intervals

Hoeffding's inequality can be used to determine the number of samples needed to obtain a confidence interval, by solving the inequality in Theorem 1:

\operatorname {P} ({\overline {X}}-\mathrm {E} [{\overline {X}}]\geq t)\leq e^{-2nt^{2}}

The inequality states that the probability that the estimate ${\overline {X}}$ exceeds the true value $\mathrm {E} [{\overline {X}}]$ by at least $t$ is bounded by $e^{-2nt^{2}}$. By symmetry, the same bound holds for the other side of the difference:

\operatorname {P} (-{\overline {X}}+\mathrm {E} [{\overline {X}}]\geq t)\leq e^{-2nt^{2}}

Combining the two with the union bound gives the two-sided variant of this inequality:

\operatorname {P} (|{\overline {X}}-\mathrm {E} [{\overline {X}}]|\geq t)\leq 2e^{-2nt^{2}}

This probability can be interpreted as the level of significance $\alpha$ (probability of making an error) for a confidence interval around $\mathrm {E} [{\overline {X}}]$ of width $2t$:

\alpha =\operatorname {P} ({\overline {X}}\notin [\mathrm {E} [{\overline {X}}]-t,\mathrm {E} [{\overline {X}}]+t])\leq 2e^{-2nt^{2}}

Requiring the right-hand side to be at most $\alpha$ and solving for $n$ gives the following:

n\geq {\frac {\log(2/\alpha )}{2t^{2}}}

Therefore, we require at least $\textstyle {\frac {\log(2/\alpha )}{2t^{2}}}$ samples to acquire a $\textstyle (1-\alpha )$-confidence interval $\textstyle \mathrm {E} [{\overline {X}}]\pm t$.

Hence, the cost of acquiring the confidence interval grows logarithmically in $1/\alpha$ (sublinearly in the confidence level) and quadratically in $1/t$ (the precision).
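The sample-size formula is easy to evaluate. The following is a minimal sketch, assuming variables bounded in $[0, 1]$ as in Theorem 1 and the natural logarithm; the function name `required_samples` is a hypothetical helper chosen for this example.

```python
from math import ceil, log


def required_samples(alpha: float, t: float) -> int:
    """Smallest integer n with n >= log(2 / alpha) / (2 t^2).

    Assumes each sample lies in [0, 1]; alpha is the significance level
    and t is the half-width of the confidence interval.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if t <= 0:
        raise ValueError("t must be positive")
    return ceil(log(2 / alpha) / (2 * t**2))


print(required_samples(0.05, 0.1))   # 185
print(required_samples(0.05, 0.05))  # 738: halving t roughly quadruples n
print(required_samples(0.005, 0.1))  # 300: tenfold smaller alpha, modest growth
```

The three calls illustrate the scaling stated above: tightening the precision is far more expensive than tightening the confidence level.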

Note that this inequality is the most conservative of the three in Theorem 1, and there are more efficient methods of estimating a confidence interval.

Notes

[1] Hoeffding (1963)
[2] Nowak (2009); for a more intuitive proof, see this note
[3] Kahane (1960)
[4] Nowak (2009); for a more intuitive proof, see this note

References

Serfling, Robert J. (1974). "Probability Inequalities for the Sum in Sampling without Replacement". The Annals of Statistics. 2 (1): 39–48. doi:10.1214/aos/1176342611. MR 0420967.
Hoeffding, Wassily (1963). "Probability inequalities for sums of bounded random variables" (PDF). Journal of the American Statistical Association. 58 (301): 13–30. doi:10.1080/01621459.1963.10500830. JSTOR 2282952. MR 0144363.
Nowak, Robert (2009). "Lecture 7: Chernoff's Bound and Hoeffding's Inequality" (PDF). ECE 901 (Summer '09) : Statistical Learning Theory Lecture Notes. University of Wisconsin-Madison. Retrieved May 16, 2014.
Vershynin, Roman (2018). High-Dimensional Probability. Cambridge University Press. ISBN 9781108415194.
Kahane, J.P. (1960). "Propriétés locales des fonctions à séries de Fourier aléatoires". Stud. Math. 19. pp. 1–25.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
