I posted this earlier on MSE, but it was suggested that here may be a better place to ask.

Universal approximation theorem states that "the standard multilayer feed-forward network with a single hidden layer, which contains finite number of hidden neurons, is a universal approximator among continuous functions on compact subsets of Rn, under mild assumptions on the activation function."

I understand what this means, but the relevant papers are too far over my level of math understanding to grasp why it is true or how a hidden layer approximates non-linear functions.

So, in terms little more advanced than basic calculus and linear algebra, how does a feed-forward network with one hidden layer approximate non-linear functions? The answer need not necessarily be totally concrete.

2 Answers

Cybenko's result is fairly intuitive, as I hope to convey below; what makes things trickier is that he was aiming both for generality and for a minimal number of hidden layers. Kolmogorov's result (mentioned by vzn) in fact achieves a stronger guarantee, but is somewhat less relevant to machine learning (in particular, it does not build a standard neural net, since the nodes are heterogeneous); his result is in turn daunting because, while on the surface it is just 3 pages recording some limits and continuous functions, in reality it is constructing a set of fractals. While Cybenko's result is unusual and very interesting due to the exact techniques he uses, results of that flavor are very widely used in machine learning (and I can point you to others).

Here is a high-level summary of why Cybenko's result should hold.

A continuous function on a compact set can be approximated by a piecewise constant function.

A piecewise constant function can be represented as a neural net as follows. For each region where the function is constant, use a neural net as an indicator function for that region. Then build a final layer with a single node, whose input is a linear combination of all the indicators, each weighted by the constant value of the corresponding region in the original piecewise constant function.

Regarding the first point above, this can be taken as the statement "a continuous function over a compact set is uniformly continuous". What this means for us is that you can take your continuous function over $[0,1]^d$ and some target error $\epsilon>0$; then you can grid $[0,1]^d$ at scale $\tau>0$ (ending up with roughly $(1/\tau)^d$ subcubes) so that a function which is constant over each subcube is within $\epsilon$ of the target function.

Now, a neural net cannot precisely represent an indicator, but you can get very close. Suppose the "transfer function" is a sigmoid. (The transfer function is the continuous function you apply to a linear combination of inputs in order to get the value of the neural net node.) Then by making the weights huge, you output something close to 0 or close to 1 for most inputs. This is consistent with Cybenko's development: notice he only needs the functions involved to equal 0 or 1 in the limit; by the definition of a limit, you get exactly what I'm saying, meaning you can push things arbitrarily close to 0 or 1.
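As a quick sanity check of the huge-weights trick, a difference of two steep sigmoids approximates the indicator of an interval; the interval $[0.3, 0.6]$, the weight scales, and the helper names here are arbitrary choices.

```python
import numpy as np

# A difference of two steep sigmoids approximates the indicator of an
# interval [a, b]; the interval and the weight scales are arbitrary.

def sigmoid(z):
    # clipped for numerical stability once the weights get huge
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

def soft_indicator(x, a, b, w):
    # rises near a, falls back down near b; w is the weight we blow up
    return sigmoid(w * (x - a)) - sigmoid(w * (x - b))

xs = np.linspace(0, 1, 1001)
exact = ((xs >= 0.3) & (xs <= 0.6)).astype(float)
# measure error away from the jumps, where no continuous function can
# match the discontinuity
mask = (np.abs(xs - 0.3) > 0.05) & (np.abs(xs - 0.6) > 0.05)
for w in (10, 100, 1000):
    err = np.max(np.abs(soft_indicator(xs, 0.3, 0.6, w) - exact)[mask])
    print(f"w={w:5d}  max error off the jumps={err:.2e}")
```

Blowing up the weight $w$ pushes the output arbitrarily close to 0 or 1 everywhere except at the two jump points.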

(I ignored the transfer function in the final layer; if it's there, and it's continuous, then we can fit anything mapping to $[0,1]$ by replacing the constant weights with something in the inverse image of that constant under the transfer function.)

Notice that the above may seem to take a couple of layers: say, 2 to build the indicators on cubes, and then a final output layer. Cybenko was aiming for two points of generality: a minimal number of hidden layers, and flexibility in the choice of transfer function. I've already described how he works out the flexibility in the transfer function.

To get the minimum number of layers, he avoids the construction above, and instead uses functional analysis to develop a contradiction.
Here's a sketch of the argument.

The final node computes a linear combination of the elements of the layer below it, and applies a transfer function. That linear combination of functions is itself a function, one lying within some subspace of functions spanned by the possible nodes in the hidden layer.

A subspace of functions is just like an ordinary finite-dimensional subspace, with the main difference that it is potentially not a closed set; that's why Cybenko's arguments all take the closure of that subspace. We are trying to prove that this closure contains all continuous functions; that will mean we can get arbitrarily close to all continuous functions.

If the function space were simple (a Hilbert space), we could argue as follows. Pick some target continuous function which, for contradiction, is supposed not to lie in the subspace, and project it onto the orthogonal complement of the subspace. This residual must be nonzero. But since our subspace can represent things like those little cubes above, we can find some region of this residual, fit a little cube to it (as above), and thereby move closer to our target function. This is a contradiction, since projections choose minimal elements. (Note, I am leaving something out here: Cybenko's argument doesn't build any little cubes; he handles this in generality too. This is where he uses a form of the Riesz representation theorem, together with properties of the transfer functions (if I remember correctly, there is a separate lemma for this step, and it is longer than the main theorem).)
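Here is a toy, finite-dimensional version of that projection step; everything about it (the grid, the target, the three fixed sigmoid nodes, the cube width) is an arbitrary choice for illustration. Functions are discretized to vectors, so the "subspace" is a column span and projections are least squares; dropping one indicator cube where the residual peaks strictly shrinks the distance to the target.

```python
import numpy as np

# Toy illustration: grid, target, node placement, and cube width are all
# arbitrary choices. Functions become vectors on a grid, so the
# orthogonal projection is just least squares.

xs = np.linspace(0, 1, 400)
target = np.sin(3 * np.pi * xs)  # the continuous function to approach

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# "subspace" spanned by three fixed sigmoid nodes
basis = np.stack([sigmoid(10 * (xs - c)) for c in (0.2, 0.5, 0.8)], axis=1)

def residual(B):
    coeffs, *_ = np.linalg.lstsq(B, target, rcond=None)
    return target - B @ coeffs

res = residual(basis)
r0 = np.linalg.norm(res)

# drop an indicator "cube" where the residual is largest
peak = xs[np.argmax(np.abs(res))]
cube = ((xs >= peak - 0.05) & (xs <= peak + 0.05)).astype(float)
r1 = np.linalg.norm(residual(np.column_stack([basis, cube])))

print(f"residual before: {r0:.3f}, after adding one cube: {r1:.3f}")
```

Because the cube correlates with the residual, the projection with it included is strictly closer to the target, which is the contradiction with minimality.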

We aren't in a Hilbert space, but we can use the Hahn-Banach theorem to replace the projection step above (note, proving Hahn-Banach uses the axiom of choice).
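For the curious, the condition on the transfer function that makes this step go through (paraphrasing Cybenko, who calls such a $\sigma$ "discriminatory") can be written out: for every finite signed measure $\mu$ on $[0,1]^d$,

$$\int_{[0,1]^d} \sigma(w^\top x + b)\, d\mu(x) = 0 \ \text{ for all } w \in \mathbb{R}^d,\ b \in \mathbb{R} \quad\Longrightarrow\quad \mu = 0.$$

The Hahn-Banach functional, via Riesz representation, yields a nonzero signed measure that integrates every hidden node to zero; a discriminatory $\sigma$ says no such measure exists, and that is the contradiction.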

Now I'd like to say a few things about Kolmogorov's result. While this result does not appear to need the sort of background Cybenko's does, I personally think it is much more intimidating.

Here is why. Cybenko's result is an approximation guarantee: it does not say we can exactly represent anything. On the other hand, Kolmogorov's result provides an equality. More ridiculously, it also specifies the size of the net: you need just $\mathcal O(d^2)$ nodes. To achieve this strengthening, there is a catch of course, the one I mentioned above: the network is heterogeneous, by which I mean the transfer functions are not all the same.

Okay, so with all that, how can this thing possibly work?!

Let's go back to our cubes above. Notice that we had to bake in a level of precision: for every $\epsilon>0$, we have to go back and pick a more refined $\tau>0$. Since we are working with (finite) linear combinations of indicators, we are never exactly representing anything. (Things only get worse if you include the approximating effects of sigmoids.)

So what's the solution? Well, how about we handle all scales simultaneously? I'm not making this up: Kolmogorov's proof is effectively constructing the hidden layer as a set of fractals. Said another way, they are basically space-filling curves which map $[0,1]$ to $[0,1]^d$; this way, even though we have a combination of univariate functions, we can fit any multivariate function. In fact, you can heuristically reason that $\mathcal O(d^2)$ is "correct" via a ridiculous counting argument: we are writing a continuous function from $\mathbb{R}^d$ to $\mathbb R$ via univariate continuous functions, and therefore, to capture all inter-coordinate interactions, we need $\mathcal O(d^2)$ functions...
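For reference, the statement being described here is Kolmogorov's superposition theorem: every continuous $f : [0,1]^d \to \mathbb{R}$ can be written

$$f(x_1, \ldots, x_d) = \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \psi_{q,p}(x_p) \right)$$

with all of the $\Phi_q$ and $\psi_{q,p}$ continuous and univariate. Counting functions, there are $2d+1$ outer nodes and $(2d+1) \cdot d$ inner ones, which is the $\mathcal O(d^2)$ above; the $\psi_{q,p}$ are where the fractals live.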

Note that Cybenko's result, due to using only one type of transfer function, is more relevant to machine learning. Theorems of this type are very common in machine learning (vzn suggested this in his answer; however, he referred to Kolmogorov's result, which is less applicable due to the custom transfer functions. This requirement is relaxed in some fancier versions of Kolmogorov's result (produced by other authors), but those still involve fractals, and at least two transfer functions).

I have some slides on these topics which I could post if you are interested (hopefully less rambly than the above, and with some pictures; I wrote them before I was adept with Hahn-Banach, however). I think both proofs are very, very nice. (Also, I have another answer here on these topics, but I wrote it before I had grokked Kolmogorov's result.)

Usually when I think of Hahn-Banach I think of the geometric variant: two convex sets $A$, $B$ are disjoint if and only if a hyperplane separates them, i.e. there exists a linear functional $\phi$ such that $\forall f \in A: \phi(f) \leq 1$ and $\forall g \in B: \phi(g) > 1$. So is Cybenko simply constructing a linear functional that separates a function from the subspace of interest and deriving a contradiction, maybe by approximating the functional?
– Sasho Nikolov, May 15 '13 at 3:59


@SashoNikolov, yes, basically that. Given our closed subspace $S$ (a closed convex set) and target function $f\not\in S$, a slightly more powerful version of Hahn-Banach gives us a linear functional $L$ with $L(g) = 0$ for $g \in S$, and $L(f) = \|f\|$. Next, the Riesz representation theorem lets us write $L(f)$ as an integral with respect to some signed measure. But this finishes the proof due to Cybenko's conditions on the transfer functions (continued in next comment).
– matus, May 15 '13 at 4:13


@SashoNikolov, Cybenko's condition is that given any signed measure that is not identically zero, there exists some affine function so that the integral of the transfer function composed with that affine function, over that measure, is not equal to zero. He then has to prove the lemma that generalized sigmoids (as I gave above: limits of 0 and 1 on the left and right) fit the bill. (Continued in next comment.)
– matus, May 15 '13 at 4:16


@SashoNikolov. Above I said we could "plop a cube along the residual". That would make our job slightly easier: since the signed measure is not exactly zero, we would just pick out some little piece and plop an indicator there. In his case, he has to work a little, but similarly this boils down to moving the sigmoid around with the affine function so that it finds some easy region, thus getting a nonzero integral, contradicting Hahn-Banach (which is zero over our subspace); in the Hilbert sense, we shrank our residual, a contradiction.
– matus, May 15 '13 at 4:18


Wow, this is an extremely nice answer. Naturally, I have a few questions if you don't mind answering them. Cybenko's result (as you say) seems most useful for applications, but I get a bit lost dealing with the subspace of functions. How do we project an arbitrary continuous function onto the orthogonal complement of the subspace of linear combinations of possible nodes? For that matter, how do we conceptualize the orthogonal complement of that subspace? Do functions closer in space more closely approximate one another? (Continued).
– Matt Munson, May 16 '13 at 20:18

There is an advanced result, key to machine learning, known as Kolmogorov's theorem [1]; I have never seen an intuitive sketch of why it works. This may have to do with the different cultures that approach it. The applied learning crowd regards Kolmogorov's theorem as an existence theorem that merely indicates that NNs may exist, so at least the structure is not overly limiting, but the theorem does not guarantee these NNs can be found. Mathematicians are not so concerned with low-level applications of the theorem.

The theorem was also historically used to invoke/defend the inherent sophistication of multilayer NNs, to counter a criticism from Perceptrons (Minsky/Papert) that there were basic functions [i.e., nonlinear ones] that they couldn't learn.

Theoretical computer scientists prefer not to regard NNs as "approximations", as that term has a special/different meaning. There is probably some rough analogy with piecewise linear interpolation, but again, I haven't seen it laid out.

"this advanced result [...] have not seen an intuitive sketch of why it works." Would such a sketch be a considerable undertaking for someone in the advanced math crowd? Do the advanced math people even understand intuitively why it works? It seems that an intuitive understanding of this theorem is something that the applied learning crowd should strongly desire, if they are to devise superior topologies and learning algorithms for ANNs.
– Matt Munson, May 8 '13 at 18:49