§11.5: The Berry–Esseen Theorem

Now that we’ve built up some results concerning Gaussian space, we’re motivated to try reducing problems involving Boolean functions to problems involving Gaussian functions. The key tool for this is the Invariance Principle, discussed at the beginning of the chapter. As a warmup, this section is devoted to proving (a form of) the Berry–Esseen Theorem.
As discussed in Chapter 5.2, the Berry–Esseen Theorem is a quantitative form of the Central Limit Theorem for finite sums of independent random variables. We restate it here:

In this traditional statement of Berry–Esseen, the error term $\gamma$ is a little opaque. To say that $\gamma$ is small is to simultaneously say two things: the random variables $\boldsymbol{X}_i$ are all “reasonable” (as in Chapter 9.1); and, none is too dominant in terms of variance. In Chapter 9.1 we discussed several related notions of “reasonableness” for a random variable $\boldsymbol{X}$. It was convenient there to use the definition that $\|\boldsymbol{X}\|_4^4$ is not much larger than $\|\boldsymbol{X}\|_2^4$. For the Berry–Esseen Theorem it’s more convenient (and slightly stronger) to use the analogous condition for the $3$rd moment. (For the Invariance Principle it will be more convenient to use $(2,3,\rho)$- or $(2,4,\rho)$-hypercontractivity.) The implication for Berry–Esseen is the following:

Remark 56 In the Berry–Esseen Theorem, if all of the $\boldsymbol{X}_i$’s are “reasonable” in the sense that $\|\boldsymbol{X}_i\|_3^3 \leq B \|\boldsymbol{X}_i\|_2^3 = B \sigma_i^{3}$, then we can use the bound \begin{equation} \label{eqn:be-no-variance-dom} \gamma \leq B \cdot \max_i\{\sigma_i\}, \end{equation} as this is a consequence of \[ \gamma =\sum_{i=1}^n \|\boldsymbol{X}_i\|_3^3 \leq B \sum_{i=1}^n \sigma_i^{3} \leq B \cdot \max_i \{\sigma_i\} \cdot \sum_{i=1}^n \sigma_i^2 = B \cdot \max_i \{\sigma_i\}. \] (Cf. Remark 5.12.) Note that some “reasonableness” condition must hold if $\boldsymbol{S} = \sum_i \boldsymbol{X}_i$ is to behave like a Gaussian. For example, if each $\boldsymbol{X}_i$ is the “unreasonable” random variable which is $\pm \sqrt{n}$ with probability $\frac{1}{2n^2}$ each and $0$ otherwise, then $\boldsymbol{S} = 0$ except with probability at most $\frac{1}{n}$ — quite unlike a Gaussian. Further, even assuming reasonableness we still need a condition like \eqref{eqn:be-no-variance-dom} ensuring that no $\boldsymbol{X}_i$ is too dominant (“influential”) in terms of variance. For example, if $\boldsymbol{X}_1 \sim \{-1,1\}$ is a uniformly random bit and $\boldsymbol{X}_2, \dots, \boldsymbol{X}_n \equiv 0$, then $\boldsymbol{S} \equiv \boldsymbol{X}_1$, which is again quite unlike a Gaussian.

There are several known ways to prove the Berry–Esseen Theorem; for example, using characteristic functions (i.e., “real” Fourier analysis), or Stein’s Method. We’ll use the “Replacement Method” (also known as the Lindeberg Method, and similar to the “Hybrid Method” in theoretical cryptography). Although it doesn’t always give the sharpest results, it’s a very flexible technique which generalizes easily to higher-degree polynomials of random variables (as in the Invariance Principle) and random vectors. The Replacement Method suggests itself as soon as the Berry–Esseen Theorem is written in a slightly different form: Instead of trying to show \begin{equation} \label{eqn:replacement-no-suggest} \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Z}, \end{equation} where $\boldsymbol{Z} \sim \mathrm{N}(0,1)$, we’ll instead try to show the equivalent statement \begin{equation} \label{eqn:replacement-suggest} \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Z}_1 + \boldsymbol{Z}_2 + \cdots + \boldsymbol{Z}_n, \end{equation} where the $\boldsymbol{Z}_i$’s are independent Gaussians with $\boldsymbol{Z}_i \sim \mathrm{N}(0,\sigma_i^2)$. The statements \eqref{eqn:replacement-no-suggest} and \eqref{eqn:replacement-suggest} really are identical, since the sum of independent Gaussians is Gaussian, with the variances adding. The Replacement Method proves \eqref{eqn:replacement-suggest} by replacing the $\boldsymbol{X}_i$’s with $\boldsymbol{Z}_i$’s one by one. Roughly speaking, we introduce the “hybrid” random variables \[ \boldsymbol{H}_t = \boldsymbol{Z}_1 + \cdots + \boldsymbol{Z}_t + \boldsymbol{X}_{t+1} + \cdots + \boldsymbol{X}_n, \] show that $\boldsymbol{H}_{t-1} \approx \boldsymbol{H}_t$ for each $t \in [n]$, and then simply add up the $n$ errors.

As a matter of fact, the Replacement Method doesn’t really have anything to do with Gaussian random variables. It actually seeks to show that \[ \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Y}_1 + \boldsymbol{Y}_2 + \cdots + \boldsymbol{Y}_n, \] whenever $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \boldsymbol{Y}_1, \dots, \boldsymbol{Y}_n$ are independent random variables with “matching first and second moments”, meaning $\mathop{\bf E}[\boldsymbol{X}_i] = \mathop{\bf E}[\boldsymbol{Y}_i]$ and $\mathop{\bf E}[\boldsymbol{X}_i^2] = \mathop{\bf E}[\boldsymbol{Y}_i^2]$ for each $i \in [n]$. (The error will be proportional to $\sum_i (\|\boldsymbol{X}_i\|^3 + \|\boldsymbol{Y}_i\|_3^3)$.) Another way of putting it (roughly speaking) is that the linear form $x_1 + \cdots + x_n$ is invariant to what independent random variables you substitute in for $x_1, \dots, x_n$, so long as you always use the same first and second moments. The fact that we can take the $\boldsymbol{Y}_i$’s to be Gaussians (with $\boldsymbol{Y}_i \sim \mathrm{N}(\mathop{\bf E}[\boldsymbol{X}_i], \mathop{\bf Var}[\boldsymbol{X}_i])$) and then in the end use the fact that the sum of Gaussians is Gaussians to derive the simpler-looking \[ \boldsymbol{S} = \sum_{i=1}^n \boldsymbol{X}_i \approx \mathrm{N}(\mathop{\bf E}[\boldsymbol{S}], \mathop{\bf Var}[S]) \] is just a pleasant bonus (and one that we’ll no longer get once we look at nonlinear polynomials of random variables in Section 6). Indeed, the remainder of this section will be devoted to showing that \[ \boldsymbol{S}_X = \boldsymbol{X}_1 + \cdots + \boldsymbol{X}_n \quad \text{is ``close'' to} \quad \boldsymbol{S}_Y = \boldsymbol{Y}_1 + \cdots + \boldsymbol{Y}_n \] whenever the $\boldsymbol{X}_i$’s and $\boldsymbol{Y}_i$’s are independent, “reasonable” random variables with matching first and second moments.

To do this, we’ll first have to discuss in more detail what it means for two random variables to be “close”. A traditional measure of closeness between two random variables $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ is the “cdf-distance” used in the Berry–Esseen Theorem: $\mathop{\bf Pr}[\boldsymbol{S}_X \leq u] \approx \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u]$ for every $u \in {\mathbb R}$. But there are other natural measures of closeness too. We might want to know that the absolute moments of $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ are close; for example, that $\|\boldsymbol{S}_X\|_1 \approx \|\boldsymbol{S}_Y\|_1$. Or, we might like to know that $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ stray from the interval $[-1,1]$ by about the same amount: $\mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_X)] \approx \mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_Y)]$. Here we are using:

All of the closeness measures just described can be put in a common framework: they are requiring $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] \approx \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$ for various “test functions” (or “distinguishers”) $\psi : {\mathbb R} \to {\mathbb R}$.

It would be nice to prove a version of the Berry–Esseen Theorem that showed closeness for all the test functions $\psi$ depicted in Figure 11.1, and more. What class of tests might we able to handle? On one hand, we can’t be too ambitious. For example, suppose each $\boldsymbol{X}_i \sim \{-1,1\}$, each $\boldsymbol{Y}_i \sim \mathrm{N}(0,1)$, and $\psi(s) = 1_{s \in {\mathbb Z}}$. Then $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = 1$ because $\boldsymbol{S}_X$ is supported on the integers, but $\mathop{\bf E}[\psi(\boldsymbol{S}_Y)] = 0$ because $\boldsymbol{S}_Y \sim \mathrm{N}(0,n)$ is a continuous random variable. On the other hand, there are some simple kinds of tests $\psi$ for which we have exact equality. For example, if $\psi(s) = s$, then $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$; this is by the assumption of matching first moments, $\mathop{\bf E}[\boldsymbol{X}_i] = \mathop{\bf E}[\boldsymbol{Y}_i]$ for all $i$. Similarly, if $\psi(s) = s^2$, then \begin{align} \mathop{\bf E}[\psi(\boldsymbol{S}_X)] &= \mathop{\bf E}\Bigl[\bigl(\sum_i \boldsymbol{X}_i\bigr)^2\Bigr] = \sum_{i} \mathop{\bf E}[\boldsymbol{X}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{X}_i\boldsymbol{X}_j] \nonumber\\ &= \sum_{i} \mathop{\bf E}[\boldsymbol{X}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{X}_i]\mathop{\bf E}[\boldsymbol{X}_j] \label{eqn:be-quadratic1}
\end{align}
(using independence of the $\boldsymbol{X}_i$’s); and also,
\begin{align}
\label{eqn:be-quadratic2} \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] &= \sum_{i} \mathop{\bf E}[\boldsymbol{Y}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{Y}_i]\mathop{\bf E}[\boldsymbol{Y}_j] \end{align} The quantities \eqref{eqn:be-quadratic1} and \eqref{eqn:be-quadratic2} are equal because of the matching first and second moment conditions.

As a consequence of these observations we have $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$ for any quadratic polynomial $\psi(s) = a + bs + cs^2$. This suggests that to handle a general test $\psi$ we try to approximate it by a quadratic polynomial up to some error; in other words, consider its $2$nd-order Taylor expansion. For this to make sense the function $\psi$ must have a continuous $3$rd derivative, and the error we incur will involve the magnitude of this derivative. Indeed, we will now prove a variant of the Berry–Esseen Theorem for the class of $\mathcal{C}^3$ test functions $\psi$ with $\psi^{\prime\prime\prime}$ uniformly bounded. You might be concerned that this class doesn’t contain any of the interesting test functions depicted in Figure 11.1. But we’ll be able to handle even those test functions with some loss in the parameters by using a simple “hack” — approximating them by smooth functions, as suggested in Figure 11.2.

This version of the Berry–Esseen Theorem is incomparable with the standard version. Sometimes it can be stronger; for example, if for some reason we wanted to show $\mathop{\bf E}[\cos \boldsymbol{S}] \approx \mathop{\bf E}[\cos \boldsymbol{Z}]$ then the Variant Berry–Esseen Theorem gives this with error $.433\gamma$, whereas it can’t be directly deduced from the standard Berry–Esseen at all. On the other hand, as we’ll see shortly, we can only obtain the standard Berry–Esseen conclusion from the Variant version with an error bound of $O(\gamma^{1/4})$ rather than $O(\gamma)$.

We end this section by describing the “hacks” which let us extend the Variant Berry–Esseen Theorem to cover certain non-$\mathcal{C}^3$ tests $\psi$. As mentioned the idea is to smooth them out, or “mollify” them:

As $\eta \to 0$ this gives a better and better smooth approximation to $\psi$, but also a larger and larger value of $\|\widetilde{\psi}_\eta^{\prime\prime\prime}\|_\infty$. Trading these off gives the following:

Remark 60 It’s obvious that the dependence on $c$ in this theorem should be linear in $c$; in fact, since we can always divide $\psi$ by $c$ it would have sufficed to prove the theorem assuming $c = 1$.

This corollary covers all Lipschitz tests, which suffices for the functions $\psi(s) = |s|$ and $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$ from Figure 11.1. However, it still isn’t enough for the test $\psi(s) = 1_{s \leq u}$ — i.e., for establishing cdf-closeness as in the usual Berry–Esseen Theorem. Of course, we can’t hope for a smooth approximator $\widetilde{\psi}_\eta$ satisfying $|\widetilde{\psi}_\eta(s) – 1_{s \leq u}| \leq \eta$ for all $s$ because of the discontinuity at $u$. However, as suggested in Figure 11.2, if we’re willing to exclude $s \in [u - \eta, u+\eta]$ we can get an approximator with third derivative bound $O(1/\eta^3)$, and thereby obtain (exercise):

Although the error bound here is weaker than necessary by a power of $1/4$, this weakness will be more than made up for by the ease with which the Replacement Method generalizes to other settings. In the next section we’ll see it applied to nonlinear polynomials of independent random variables. Furthermore, an exercise outlines how to use it to give a Berry–Esseen theorem for sums of independent random vectors; as you’ll see, other than replacing Taylor’s theorem with its multivariate form, hardly a symbol in the proof changes.