§6.4: Applications in learning and testing

In this section we describe some applications of our study of pseudorandomness.
We begin with a notorious open problem from learning theory, that of learning juntas. Let $\mathcal{C} = \{f : {\mathbb F}_2^n \to {\mathbb F}_2 \mid f \text{ is a } k\text{-junta}\}$; we will always assume that $k \leq O(\log n)$. In the query access model, it is quite easy to learn $\mathcal{C}$ exactly (i.e., with error $0$) in $\mathrm{poly}(n)$ time (Exercise 3.36(a)). However in the model of random examples, it’s not obvious how to learn $\mathcal{C}$ more efficiently than in the $n^{k} \cdot \mathrm{poly}(n)$ time required by the Low-Degree Algorithm (see Theorem 3.36). Unfortunately, this is superpolynomial as soon as $k = \omega(1)$. The state of affairs is the same in the case of depth-$k$ decision trees (a superclass of $\mathcal{C}$), and is similar in the case of $\mathrm{poly}(n)$-size DNFs and CNFs. Thus if we wish to learn, say, $\mathrm{poly}(n)$-size decision trees or DNFs from random examples only, a necessary prerequisite is doing the same for $O(\log n)$-juntas.

Whether or not $\omega(1)$-juntas can be learned from random examples in polynomial time is a longstanding open problem. Here we will show a modest improvement on the $n^{k}$-time algorithm:

(The $3/4$ in this theorem can in fact be replaced by $\omega/(\omega + 1)$, where $\omega$ is any number such that $n \times n$ matrices can be multiplied in time $O(n^{\omega})$.)

The first observation we will use to prove Theorem 36 is that to learn $k$-juntas, it suffices to be able to identify a single coordinate which is relevant. The proof of this is fairly simple and is left for the exercises:

Assume then that we have random example access to a (nonconstant) $k$-junta $f : {\mathbb F}_2^n \to {\mathbb F}_2$. As in the Low-Degree Algorithm we will estimate the Fourier coefficients $\widehat{f}(S)$ for all $1 \leq |S| \leq d$, where $d \leq k$ is a parameter to be chosen later. Using Proposition 3.30 we can ensure that all estimates are accurate to within $(1/3)2^{-k}$, except with probability at most $\delta/2$, in time $n^d \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$. (Recall that $2^k \leq \mathrm{poly}(n)$.) Since $f$ is a $k$-junta, all of its Fourier coefficients are integer multiples of $2^{-k}$; in particular, each is either $0$ or at least $2^{-k}$ in magnitude, so we can exactly identify the sets $S$ for which $\widehat{f}(S) \neq 0$. For any such $S$, all of the coordinates $i \in S$ are relevant for $f$. So unless $\widehat{f}(S) = 0$ for all $1 \leq |S| \leq d$, we can find a relevant coordinate for $f$ in time $n^{d} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$ (except with probability at most $\delta/2$).
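As a concrete, purely illustrative sketch of this first step (the helper names, toy junta, and sample count below are ours, not part of the formal argument; a real implementation would set the number of samples per Proposition 3.30), the following Python code empirically estimates all Fourier coefficients of degree at most $d$ and collects every coordinate appearing in a set whose estimated coefficient exceeds $\frac{1}{2} \cdot 2^{-k}$:

```python
import itertools
import math
import random

def find_relevant_coords(draw_example, n, k, d, n_samples):
    """Estimate f^(S) for all 1 <= |S| <= d from random examples and
    return the coordinates appearing in a set with |estimate| > 2^-k / 2.
    Examples are pairs (x, f(x)) with entries in {-1, +1}."""
    samples = [draw_example() for _ in range(n_samples)]
    relevant = set()
    for size in range(1, d + 1):
        for S in itertools.combinations(range(n), size):
            # Empirical estimate of f^(S) = E[f(x) * chi_S(x)].
            est = sum(fx * math.prod(x[i] for i in S)
                      for x, fx in samples) / n_samples
            if abs(est) > 0.5 * 2 ** (-k):
                relevant.update(S)  # every i in S is relevant for f
    return relevant

# Toy example: f(x) = x_3 * x_7 is a 2-junta on n = 10 coordinates.
random.seed(0)
def draw_example():
    x = [random.choice([-1, 1]) for _ in range(10)]
    return x, x[3] * x[7]

rel = find_relevant_coords(draw_example, n=10, k=2, d=2, n_samples=4000)
print(rel)
```

Since the true coefficients are integer multiples of $2^{-k}$, thresholding the estimates at $\frac{1}{2} \cdot 2^{-k}$ separates the zero from the nonzero ones once the estimation error is below $(1/3)2^{-k}$.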

To complete the proof of Theorem 36 it remains to handle the case that $\widehat{f}(S) = 0$ for all $1 \leq |S| \leq d$; i.e., $f$ is $d$th-order correlation immune. In this case, by Siegenthaler’s Theorem we know that $\deg_{{\mathbb F}_2}(f) \leq k-d$. (Note that $d < k$ since $f$ is not constant.) But there is a learning algorithm running in time $O(n)^{3\ell} \cdot \log(1/\delta)$ which exactly learns any ${\mathbb F}_2$-polynomial of degree at most $\ell$ (except with probability at most $\delta/2$). Roughly speaking, the algorithm draws $O(n)^\ell$ random examples and then solves an ${\mathbb F}_2$-linear system to determine the coefficients of the unknown polynomial; see the exercises for details. Thus in time $n^{3(k-d)} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$ this algorithm will exactly determine $f$, and in particular find a relevant coordinate.
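The interpolation step can likewise be sketched in Python (an illustration under our own naming choices, not the formal algorithm): each example contributes one ${\mathbb F}_2$-linear equation in the $\sum_{j \leq \ell} \binom{n}{j}$ unknown monomial coefficients, and Gaussian elimination over ${\mathbb F}_2$ recovers them once the system has full rank.

```python
import itertools
import random

def learn_f2_polynomial(examples, n, ell):
    """Exactly learn an F_2-polynomial of degree <= ell from examples
    (x, f(x)), x in {0,1}^n, by solving a linear system over F_2.
    Returns the set of monomials (tuples of coordinates) with coefficient 1."""
    monomials = [S for size in range(ell + 1)
                 for S in itertools.combinations(range(n), size)]
    m = len(monomials)
    # One augmented bitmask row per example: bit j holds the value of
    # monomial j at x, and bit m holds f(x).
    rows = []
    for x, fx in examples:
        r = fx << m
        for j, S in enumerate(monomials):
            if all(x[i] == 1 for i in S):  # monomial prod_{i in S} x_i at x
                r |= 1 << j
        rows.append(r)
    # Gaussian elimination over F_2 (each XOR is a row operation mod 2).
    where, rank = [-1] * m, 0
    for col in range(m):
        piv = next((i for i in range(rank, len(rows))
                    if (rows[i] >> col) & 1), None)
        if piv is None:
            continue  # free column; its coefficient stays 0
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for i in range(len(rows)):
            if i != rank and (rows[i] >> col) & 1:
                rows[i] ^= rows[rank]
        where[col], rank = rank, rank + 1
    return {monomials[c] for c in range(m)
            if where[c] != -1 and (rows[where[c]] >> m) & 1}

# Toy example: f(x) = x0*x1 + x2 over F_2 (degree 2) on n = 5 variables.
random.seed(1)
examples = []
for _ in range(200):
    x = [random.randint(0, 1) for _ in range(5)]
    examples.append((x, (x[0] & x[1]) ^ x[2]))
learned = learn_f2_polynomial(examples, n=5, ell=2)
print(learned)
```

With enough random examples the system has a unique solution with high probability (a nonzero degree-$\ell$ polynomial is nonzero on at least a $2^{-\ell}$ fraction of points, so no spurious solution survives many examples).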

By choosing $d = \left\lceil \frac34 k \right\rceil$ we balance the running times of the two algorithms: equating the exponents via $d = 3(k-d)$ gives $d = \frac34 k$. Regardless of whether $f$ is $d$th-order correlation immune, at least one of the two algorithms will find a relevant coordinate for $f$ (except with probability at most $\delta/2 + \delta/2 = \delta$) in time $n^{(3/4)k} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$. This completes the proof of Theorem 36.

Our next application of pseudorandomness involves using $\epsilon$-biased distributions to give a deterministic version of the Goldreich–Levin Algorithm (and hence the Kushilevitz–Mansour learning algorithm) for functions $f$ with small $\hat{\lVert} f \hat{\rVert}_1$. We begin with a basic lemma showing that you can get a good estimate for the mean of such functions using an $\epsilon$-biased distribution:

Proof: It suffices to handle the case $U = \emptyset$ because for general $U$, the algorithm can simulate query access to $f \cdot \chi_U$ with $\mathrm{poly}(n)$ overhead, and $\widehat{f \cdot \chi_U}(\emptyset) = \widehat{f}(U)$. The algorithm will use Theorem 30 to construct an $(\epsilon/s)$-biased density $\varphi$ which is uniform over a (multi-)set of cardinality $O(n^2 s^2/\epsilon^2)$. By enumerating over this set and using queries to $f$, it can deterministically output the estimate $\widetilde{f}(\emptyset) = \mathop{\bf E}_{{\boldsymbol{x}} \sim \varphi}[f({\boldsymbol{x}})]$ in time $\mathrm{poly}(n, s, 1/\epsilon)$. The error bound now follows from Lemma 38. $\Box$
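To make the enumeration step concrete, here is a hedged Python sketch using the Alon–Goldreich–Håstad–Peralta small-bias construction (one explicit construction with parameters of the kind Theorem 30 requires; the function names and parameter choices below are ours). For $x, y \in {\mathbb F}_{2^m}$, the $i$th bit of the sample indexed by $(x,y)$ is $\langle x^i, y \rangle$; this yields a multiset of size $2^{2m}$ with bias at most $(n-1)/2^m$, over which the algorithm simply averages $f \cdot \chi_U$.

```python
import itertools

def gf_mul(a, b, m, poly):
    """Multiply in GF(2^m); elements are bitmasks, poly includes the x^m bit."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= poly
    return r

def aghp_set(n, m, poly):
    """AGHP eps-biased multiset over {0,1}^n with eps <= (n-1)/2^m and
    size 2^(2m): bit i of the sample for (x, y) is <x^i, y> over F_2."""
    pts = []
    for x in range(1 << m):
        powers = [1]  # x^0, x^1, ..., x^(n-1) computed in GF(2^m)
        for _ in range(n - 1):
            powers.append(gf_mul(powers[-1], x, m, poly))
        for y in range(1 << m):
            pts.append(tuple(bin(p & y).count('1') % 2 for p in powers))
    return pts

def estimate_coefficient(f, U, pts):
    """Deterministic estimate of f^(U): average f(x)*chi_U(x) over the
    multiset, with chi_U(x) = (-1)^{sum of x_i for i in U}."""
    total = sum(f(x) * (-1) ** sum(x[i] for i in U) for x in pts)
    return total / len(pts)

# Demo: n = 6, m = 5, irreducible x^5 + x^2 + 1, so bias <= 5/32 and the
# multiset has 1024 points (versus 2^6 queries only in this toy setting,
# but poly(n, s, 1/eps) versus 2^n in general).
pts = aghp_set(6, 5, 0b100101)
f = lambda x: (-1) ** (x[1] + x[3])  # f = chi_{1,3}, Fourier 1-norm 1
est = estimate_coefficient(f, {1, 3}, pts)
print(est)
```

The bias bound holds because $\chi_S$ of a sample equals $(-1)^{\langle \oplus_{i \in S} x^i,\, y\rangle}$, and the nonzero polynomial $\sum_{i \in S} x^i$ has at most $n-1$ roots in ${\mathbb F}_{2^m}$.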

The other key ingredient needed for the Goldreich–Levin Algorithm was Proposition 3.40, which let us estimate \begin{equation} \label{eqn:gl-key2} \mathbf{W}^{S\mid\overline{J}}[f] = \sum_{T \subseteq \overline{J}} \widehat{f}(S \cup T)^2 = \mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}[\widehat{f_{J \mid \boldsymbol{z}}}(S)^2] \end{equation} for any $S \subseteq J \subseteq [n]$. Observe that for any $z \in \{-1,1\}^{\overline{J}}$ we can use Proposition 40 to deterministically estimate $\widehat{f_{J \mid z}}(S)$ to accuracy $\pm \epsilon$. The reason is that we can simulate query access to the restricted function $f_{J \mid z}$, the $(\epsilon/s)$-biased density $\varphi$ remains $(\epsilon/s)$-biased on $\{-1,1\}^{J}$, and most importantly $\hat{\lVert} f_{J \mid z} \hat{\rVert}_1 \leq \hat{\lVert} f \hat{\rVert}_1 \leq s$ by Exercise 3.6. It is not much more difficult to deterministically estimate \eqref{eqn:gl-key2}.
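The shape of the resulting estimator can be sketched as follows (illustrative only; the helper names are ours): enumerate $z$ over one multiset, estimate each restricted coefficient $\widehat{f_{J \mid z}}(S)$ over another, and average the squares. For simplicity the demo enumerates full cubes, which are trivially $0$-biased; the derandomization would substitute the $(\epsilon/s)$-biased multisets from Theorem 30.

```python
import itertools

def chi(S, x):
    """chi_S(x) = (-1)^{sum of x_i for i in S}, with x a 0/1 tuple."""
    return (-1) ** sum(x[i] for i in S)

def estimate_weight(f, S, J, n, pts_J, pts_Jbar):
    """Estimate W^{S|J-bar}[f] = E_z[ (restricted coefficient at S)^2 ]:
    for each z in a multiset on the J-bar coordinates, estimate the
    restricted coefficient over a multiset on J, then average the squares."""
    Jbar = [i for i in range(n) if i not in J]
    total = 0.0
    for z in pts_Jbar:
        acc = 0
        for y in pts_J:
            x = [0] * n
            for i, b in zip(sorted(J), y):
                x[i] = b
            for i, b in zip(Jbar, z):
                x[i] = b
            x = tuple(x)
            acc += f(x) * chi(S, x)
        c = acc / len(pts_J)  # estimate of the restricted coefficient at S
        total += c * c
    return total / len(pts_Jbar)

# Demo with n = 4, J = {0,1}, f = chi_{0,3}: the only nonzero term
# f^({0} U T)^2 with T a subset of {2,3} is f^({0,3})^2 = 1, so the
# weight W^{ {0} | {2,3} }[f] equals 1.
f = lambda x: (-1) ** (x[0] + x[3])
cube2 = list(itertools.product([0, 1], repeat=2))
w = estimate_weight(f, S={0}, J={0, 1}, n=4, pts_J=cube2, pts_Jbar=cube2)
print(w)
```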

Propositions 40 and 41 are the only two ingredients needed for a derandomization of the Goldreich–Levin Algorithm. We can therefore state a derandomized version of its corollary Theorem 3.38 on learning functions with small Fourier $1$-norm:

Whereas the original BLR Test required exactly $2n$ independent random bits, the above derandomized version needs only $n + O(\log(n/\epsilon))$. This is very close to the minimum possible: a test using only, say, $.99n$ random bits would be able to inspect only a $2^{-.01 n}$ fraction of $f$’s values.

If $f$ is ${\mathbb F}_2$-linear then it is still accepted by the Derandomized BLR Test with probability $1$. As for the approximate converse, we’ll have to make a slight concession: we’ll show that any function accepted with probability close to $1$ must be close to an affine function — i.e., satisfy $\deg_{{\mathbb F}_2}(f) \leq 1$. This concession is necessary: the function $f : {\mathbb F}_2^n \to {\mathbb F}_2$ might be $1$ everywhere except on the (tiny) support of $\varphi$. In that case the acceptance criterion $f({\boldsymbol{x}}) + f(\boldsymbol{y}) = f({\boldsymbol{x}} + \boldsymbol{y})$ will almost always hold, reading $1 + 0 = 1$; yet $f$ is very far from every linear function. It is, however, very close to the affine function $1$.
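For concreteness, here is a hedged Python sketch of the derandomized test loop (the function names are ours): $\boldsymbol{x}$ costs $n$ fresh random bits per trial, while $\boldsymbol{y}$ is drawn from a small multiset and so costs only about $\log_2$ of its size. The demo's "small-bias" multiset is just the full cube on $n = 4$, a trivially $0$-biased stand-in for a genuine construction from Theorem 30.

```python
import itertools
import random

def derandomized_blr(f, n, biased_pts, n_trials):
    """Derandomized BLR linearity test: x is uniform on {0,1}^n (n fresh
    random bits), y is drawn from a small-bias multiset (indexing it needs
    only ~log2(len) bits); accept iff f(x) + f(y) = f(x + y) over F_2 on
    every trial."""
    for _ in range(n_trials):
        x = tuple(random.randint(0, 1) for _ in range(n))
        y = random.choice(biased_pts)
        xy = tuple(a ^ b for a, b in zip(x, y))  # x + y over F_2
        if (f(x) ^ f(y)) != f(xy):
            return False
    return True

# Trivially 0-biased stand-in multiset for the demo.
n = 4
pts = list(itertools.product([0, 1], repeat=n))
random.seed(7)
linear = lambda x: x[0] ^ x[2]  # an F_2-linear function: always accepted
accepted = derandomized_blr(linear, n, pts, 100)
print(accepted)
```

An ${\mathbb F}_2$-linear $f$ passes every trial; a function far from affine, such as $x_0 \wedge x_1$, fails some trial with overwhelming probability over 100 repetitions.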

Remark 45 The bound in this theorem works well both when $\theta$ is close to $0$ and when $\theta$ is close to $1$. E.g., for $\theta = 1-2\delta$ we get that if $f$ is accepted with probability $1-\delta$ then $f$ is nearly $\delta$-close to an affine function, provided $\epsilon \ll \delta$.