§3.4: Learning theory

Computational learning theory is an area of algorithms research devoted to the following task: given a source of “examples” $(x, f(x))$ from an unknown function $f$, compute a “hypothesis” function $h$ which is good at predicting $f(y)$ on future inputs $y$. We will focus on just one possible formulation of the task:

Definition 27 In the model of PAC (“Probably Approximately Correct”) learning under the uniform distribution on $\{-1,1\}^n$, a learning problem is identified with a concept class $\mathcal{C}$, which is just a collection of functions $f : \{-1,1\}^n \to \{-1,1\}$. A learning algorithm $A$ for $\mathcal{C}$ is a randomized algorithm which has limited access to an unknown target function $f \in \mathcal{C}$. The two access models, in increasing order of strength, are:

random examples, meaning $A$ has access to independent pairs $(x, f(x))$ where each $x \in \{-1,1\}^n$ is drawn uniformly at random;

queries, meaning $A$ can request the value $f(x)$ for any $x \in \{-1,1\}^n$ of its choice.

In addition, $A$ is given as input an accuracy parameter $\epsilon \in [0,1/2]$. The output of $A$ is required to be (the circuit representation of) a hypothesis function $h : \{-1,1\}^n \to \{-1,1\}$. We say that $A$ learns $\mathcal{C}$ with error $\epsilon$ if for any $f \in \mathcal{C}$, with high probability $A$ outputs an $h$ which is $\epsilon$-close to $f$: i.e., satisfies $\mathrm{dist}(f,h) \leq \epsilon$.

In the above definition, the phrase “with high probability” can be fixed to mean, say, “except with probability at most $1/10$”. (As is common with randomized algorithms, the choice of constant $1/10$ is unimportant; see the exercises.)

For us, the main desideratum of a learning algorithm is efficient running time. One can easily learn any function $f$ to error $0$ in time $\widetilde{O}(2^n)$ (exercise); however this is not very efficient. If the concept class $\mathcal{C}$ contains very complex functions then such exponential running time is necessary; however if $\mathcal{C}$ contains only relatively “simple” functions then more efficient learning may be possible. For example, in a later section we will show that the concept class \[ \mathcal{C} = \{f : {\mathbb F}_2^n \to \{-1,1\} \mid {\mathrm{DT}_{\mathrm{size}}}(f) \leq s\} \] can be learned with queries to error $\epsilon$ by an algorithm running in time $\mathrm{poly}(s, n, 1/\epsilon)$.
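As a concrete illustration of the trivial $\widetilde{O}(2^n)$ bound, here is a minimal sketch assuming query access; the Python lookup table is just an illustrative stand-in for the circuit representation of the hypothesis required by Definition 27:

```python
from itertools import product

def trivial_exact_learner(f, n):
    """Query f on all 2^n inputs and return a hypothesis h with dist(f, h) = 0.
    This takes 2^n queries and ~2^n further time; anything faster must exploit
    structure in the concept class."""
    table = {x: f(x) for x in product((-1, 1), repeat=n)}
    return lambda x: table[x]
```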

A common way of trying to learn an unknown target $f : \{-1,1\}^n \to \{-1,1\}$ is by discovering “most of” its Fourier spectrum. To formalize this, let’s generalize Definition 1:

Definition 28 We say that the Fourier spectrum of $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon$-concentrated on a collection of sets $\mathcal{F} \subseteq 2^{[n]}$ if $\sum_{S \notin \mathcal{F}} \widehat{f}(S)^2 \leq \epsilon$.

Most functions don’t have their Fourier spectrum concentrated on a small collection (see the exercises). But for those that do, we may hope to discover “most of” their Fourier coefficients. The main result of this section is a kind of “meta-algorithm” for learning an unknown target $f$. It reduces the problem of learning $f$ to the problem of identifying a collection of characters on which $f$’s Fourier spectrum is concentrated.

Theorem 29 Assume learning algorithm $A$ has (at least) random example access to target $f : \{-1,1\}^n \to \{-1,1\}$. Suppose that $A$ can — somehow — identify a collection $\mathcal{F}$ of subsets on which $f$’s Fourier spectrum is $\epsilon/2$-concentrated. Then using $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$ additional time, $A$ can with high probability output a hypothesis $h$ which is $\epsilon$-close to $f$.

The idea of the theorem is that $A$ will estimate all of $f$’s Fourier coefficients in $\mathcal{F}$, obtaining a good approximation to $f$’s Fourier expansion. Then $A$’s hypothesis will be the sign of this approximate Fourier expansion.
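To see why this works, write $\tilde{f}(S)$ for $A$'s estimate of $\widehat{f}(S)$ and let $g = \sum_{S \in \mathcal{F}} \tilde{f}(S)\,\chi_S$ (the notation $\tilde{f}$ is introduced just for this sketch). Whenever $\mathrm{sgn}(g(x)) \neq f(x)$ we have $|f(x) - g(x)| \geq 1$, so by Parseval's Theorem applied to $f - g$, \[ \mathrm{dist}(f, \mathrm{sgn}(g)) \;\leq\; \mathbb{E}_{x}\bigl[(f(x) - g(x))^2\bigr] \;=\; \sum_{S \in \mathcal{F}} \bigl(\widehat{f}(S) - \tilde{f}(S)\bigr)^2 + \sum_{S \notin \mathcal{F}} \widehat{f}(S)^2. \] The second sum is at most $\epsilon/2$ by the concentration assumption, and the first sum is at most $\epsilon/2$ provided $A$ estimates each of the $|\mathcal{F}|$ coefficients to within $\pm\sqrt{\epsilon/(2|\mathcal{F}|)}$.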

The first tool we need to prove Theorem 29 is the ability to accurately estimate any fixed Fourier coefficient from random examples. Since $\widehat{f}(S) = \mathbb{E}_{x}[f(x)\,\chi_S(x)]$, the empirical mean of $f(x)\,\chi_S(x)$ over $O(\log(1/\delta)/\eta^2)$ independent random examples is within $\pm\eta$ of $\widehat{f}(S)$, except with probability at most $\delta$; this is just a Chernoff bound. In particular, estimating each coefficient in $\mathcal{F}$ to accuracy $\sqrt{\epsilon/(2|\mathcal{F}|)}$, with confidence high enough for a union bound over $\mathcal{F}$, takes only $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$ time.
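A minimal sketch of this estimator in Python (the function names and the representation of inputs as tuples of $\pm 1$ are illustrative choices, not anything fixed by the text):

```python
import random

def chi(S, x):
    """The character chi_S(x): product of x[i] over i in S (x is a tuple of +/-1)."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def estimate_coefficient(f, S, n, num_samples):
    """Estimate f_hat(S) = E[f(x) * chi_S(x)] as an empirical mean over
    uniformly random examples; the error is ~1/sqrt(num_samples) w.h.p."""
    total = 0
    for _ in range(num_samples):
        x = tuple(random.choice((-1, 1)) for _ in range(n))
        total += f(x) * chi(S, x)
    return total / num_samples
```

For instance, `estimate_coefficient(f, (0, 2), n, 10000)` estimates $\widehat{f}(\{1,3\})$ (with 0-indexed coordinates) to within a few hundredths with high probability.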

As we described, Theorem 29 reduces the algorithmic task of learning $f$ to the algorithmic task of identifying a collection $\mathcal{F}$ on which $f$’s Fourier spectrum is concentrated. In the next section we will describe the Goldreich–Levin algorithm, a sophisticated way to find such an $\mathcal{F}$ assuming query access to $f$. For now, though, we observe that for several interesting concept classes we don’t need to do any algorithmic searching for $\mathcal{F}$: we can just take $\mathcal{F}$ to be all sets of small cardinality. This works whenever all functions in $\mathcal{C}$ have low-degree spectral concentration.

The “Low-Degree Algorithm” Let $k \geq 1$ and let $\mathcal{C}$ be a concept class for which every function $f : \{-1,1\}^n \to \{-1,1\}$ in $\mathcal{C}$ has its Fourier spectrum $\epsilon/2$-concentrated up to degree $k$. Then $\mathcal{C}$ can be learned using random examples only, with error $\epsilon$, in time $\mathrm{poly}(n^k, 1/\epsilon)$.
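A minimal sketch of the Low-Degree Algorithm in the same illustrative Python style (here `examples` is a list of random example pairs $(x, f(x))$; a single sample is reused for all of the roughly $n^k$ coefficient estimates, which a Chernoff bound plus a union bound justifies):

```python
from itertools import combinations

def chi(S, x):
    """chi_S(x): product of x[i] over i in S (x is a tuple of +/-1)."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def low_degree_algorithm(examples, n, k):
    """Estimate f_hat(S) for every |S| <= k by empirical means over the given
    random examples, then return the hypothesis
    h = sgn( sum over |S| <= k of estimate(S) * chi_S )."""
    estimates = {}
    for size in range(k + 1):
        for S in combinations(range(n), size):
            estimates[S] = sum(y * chi(S, x) for x, y in examples) / len(examples)

    def h(x):
        g = sum(est * chi(S, x) for S, est in estimates.items())
        return 1 if g >= 0 else -1

    return h
```

With a sample of size $\mathrm{poly}(n^k, 1/\epsilon)$ every estimate is accurate enough with high probability, which is where the stated running time comes from.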

The Low-Degree Algorithm reduces the algorithmic problem of learning $\mathcal{C}$ from random examples to the analytic task of showing low-degree spectral concentration for the functions in $\mathcal{C}$. Using the results of Section 3.1 we can quickly obtain some learning-theoretic results. For example, every monotone function has total influence at most $\sqrt{n}$, hence has its Fourier spectrum $\epsilon/2$-concentrated up to degree $2\sqrt{n}/\epsilon$; thus the concept class of all monotone functions $f : \{-1,1\}^n \to \{-1,1\}$ can be learned from random examples with error $\epsilon$ in time $n^{O(\sqrt{n}/\epsilon)}$.
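A quick sketch of the calculation behind this, using the spectral formula for total influence, $\mathbf{I}[f] = \sum_S |S|\,\widehat{f}(S)^2$, and the fact that $\mathrm{Inf}_i[f] = \widehat{f}(\{i\})$ when $f$ is monotone: by Cauchy–Schwarz and Parseval, \[ \mathbf{I}[f] = \sum_{i=1}^n \widehat{f}(\{i\}) \;\leq\; \sqrt{n}\,\Bigl(\sum_{i=1}^n \widehat{f}(\{i\})^2\Bigr)^{1/2} \;\leq\; \sqrt{n}, \] and hence for any $k > 0$, \[ \sum_{|S| > k} \widehat{f}(S)^2 \;\leq\; \frac{1}{k} \sum_{S} |S|\,\widehat{f}(S)^2 \;=\; \frac{\mathbf{I}[f]}{k} \;\leq\; \frac{\sqrt{n}}{k}, \] which is at most $\epsilon/2$ once $k = 2\sqrt{n}/\epsilon$. Plugging this degree bound into the Low-Degree Algorithm gives running time $\mathrm{poly}(n^{2\sqrt{n}/\epsilon}, 1/\epsilon) = n^{O(\sqrt{n}/\epsilon)}$.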

You might be concerned that a running time such as $n^{O(\sqrt{n})}$ does not seem very efficient. Still, it’s much better than the trivial running time of $\widetilde{O}(2^n)$. Further, as we will see in the next section, learning algorithms are sometimes used in attacks on cryptographic schemes, and in this context even subexponential-time algorithms are considered dangerous. Continuing with applications of the Low-Degree Algorithm: