My Ph.D. is in pure mathematics, and I admit I don't know much (i.e. anything) about theoretical CS. However, I have started exploring non-academic options for my career and in introducing myself to machine learning, stumbled across statements such as "No one understands why neural networks work well," which I found interesting.

My question, essentially, is what kinds of answers do researchers want? Here's what I've found in my brief search on the topic:

The process of SGD is well-understood mathematically, as is the statistical theory.

The universal approximation theorem is powerful and proven.

There's a nice recent paper, https://arxiv.org/abs/1608.08225, which essentially gives the answer that universal approximation is much more than we actually need in practice, because we can make strong simplifying assumptions about the functions we are trying to model with the neural network.
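
To convince myself of the approximation power numerically, I tried the following minimal sketch (my own toy experiment, not taken from the paper): a one-hidden-layer ReLU network with random hidden weights, whose output layer is fit by plain least squares, already approximates $\sin(x)$ well on an interval once the hidden layer is moderately wide.

    # Toy illustration of approximation power: random ReLU features plus a
    # least-squares output layer approximating sin(x) on [-pi, pi].
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-np.pi, np.pi, 200)            # sample points
    y = np.sin(x)                                  # target function

    n_hidden = 100                                 # width of the hidden layer
    W = rng.normal(size=n_hidden)                  # random input weights
    b = rng.uniform(-np.pi, np.pi, size=n_hidden)  # random biases
    H = np.maximum(0.0, np.outer(x, W) + b)        # hidden activations, shape (200, n_hidden)

    # Fit only the output layer by linear least squares.
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    approx = H @ coef
    print("max |error| =", np.max(np.abs(approx - y)))  # small, despite the random hidden weights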

In the aforementioned paper, they state (paraphrasing) "GOFAI algorithms are fully understood analytically, but many ANN algorithms are only heuristically understood." Convergence theorems for the implemented algorithms are an example of analytic understanding that it seems we DO have about neural networks, so a statement at this level of generality doesn't tell me much about what's known vs. unknown or what would be considered "an answer."

The authors do suggest in the conclusion that questions such as effective bounds on the size of the neural network needed to approximate a given polynomial are open and interesting. What are other examples of mathematically specific analytical questions that would need to be answered to say that we "understand" neural networks? Are there questions that may be answered in more pure mathematical language?

(I am specifically thinking of methods in representation theory due to the use of physics in this paper --- and, selfishly, because it is my field of study. However, I can also imagine areas such as combinatorics/graph theory, algebraic geometry, and topology providing viable tools.)

$\begingroup$Is GOFAI really that well understood? A lot of GOFAI seems to boil down to SAT solving, the archetypical NP-complete problem. Modern SAT solvers work remarkably well in practice, even though they should not according to the extant theory. Why?$\endgroup$
– Martin Berger, Sep 27 '16 at 20:42

$\begingroup$There is really a pre-deep-learning and post-deep-learning study/change/history in this area, and it's a major paradigm shift in the field. Deep learning took off only within the last half decade. The simple answer is that neural networks can represent arbitrarily complex functions, and that complexity is now at very advanced levels with deep neural networks. Another answer is that the problems that are studied, and maybe even "reality in general", are "built out of features", and ANNs are now adept at learning very complex features.$\endgroup$
– vzn, Sep 28 '16 at 15:49

$\begingroup$I don't think people are really searching for "an answer" here. They seek to use neural networks to solve problems, and if the problem is indeed solved then it's fine. Knowing how the networks reached that solution isn't necessarily of interest here. Nobody cares much if it's a black/opaque box as long as it solves the issue.$\endgroup$
– xji, Oct 8 '16 at 20:54

5 Answers

There are a bunch of "no free lunch" theorems in machine learning, roughly stating that there can be no one master learning algorithm that performs uniformly better than all other algorithms (see, e.g., here http://www.no-free-lunch.org/ ). Sure enough, deep learning can be "broken" without much difficulty:
http://www.evolvingai.org/fooling

Hence, to be provably effective, a learner needs inductive bias --- i.e., some prior assumptions about the data. Examples of inductive bias include assumptions of data sparsity, or low dimensionality, or that the distribution factorizes nicely, or has a large margin, etc. Various successful learning algorithms exploit these assumptions to prove generalization guarantees. For example, (linear) SVM works well when the data is well-separated in space; otherwise -- not so much.
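
As a rough illustrative sketch of that last point (a toy example, not a formal statement): the very same linear SVM that fits well-separated Gaussian classes degrades once the separability assumption is violated.

    # Toy sketch of inductive bias: a linear SVM on well-separated vs.
    # heavily overlapping classes (scikit-learn used purely for convenience).
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n = 500

    # Case 1: well-separated classes -- the linear-separability bias holds.
    X1 = np.vstack([rng.normal(-3, 1, (n, 2)), rng.normal(3, 1, (n, 2))])
    # Case 2: heavily overlapping classes -- the bias is violated.
    X2 = np.vstack([rng.normal(-0.3, 1, (n, 2)), rng.normal(0.3, 1, (n, 2))])
    y = np.array([0] * n + [1] * n)

    for name, X in [("separable", X1), ("overlapping", X2)]:
        acc = LinearSVC(dual=False).fit(X, y).score(X, y)
        print(name, "training accuracy:", round(acc, 3))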

I think the main challenge with deep learning is to understand what its inductive bias is. In other words, it is to prove theorems of the type: If the training data satisfies these assumptions, then I can guarantee something about the generalization performance. (Otherwise, all bets are off.)

$\begingroup$It should be noted that adversarial examples are not unique to deep neural networks. They can also easily be constructed for linear and logistic regression, for example: arxiv.org/pdf/1412.6572.pdf$\endgroup$
– Lenar Hoyt, Oct 1 '16 at 10:48

$\begingroup$It should perhaps also be noted that the NFL theorems might not play a big role in practical machine learning because while NFL is concerned with the class of all functions, real world problems are typically constrained to e.g. smooth functions or even more specific functions such as the ones considered in the paper by Lin and Tegmark. It might be possible to find inductive biases that cover all learning problems that we are interested in.$\endgroup$
– Lenar Hoyt, Oct 1 '16 at 19:43

$\begingroup$Then we should first formalize this space of "all learning problems that we are interested in".$\endgroup$
– Aryeh, Oct 1 '16 at 19:58

$\begingroup$That definitely seems worthwhile, especially with regards to AI safety. We need to be able to reliably specify what a machine learning algorithm is supposed to learn.$\endgroup$
– Lenar Hoyt, Oct 1 '16 at 20:31

There are two main gaps in our understanding of neural networks: optimization hardness and generalization performance.

Training a neural network requires solving a highly non-convex optimization problem in high dimensions. Current training algorithms are all based on gradient descent, which only guarantees convergence to a critical point (local minimum or saddle). In fact, Anandkumar & Ge 2016 recently proved that finding even a local minimum is NP-hard, which means that (assuming P != NP) there exist "bad", hard to escape, saddle points in the error surface.
Yet, these training algorithms are empirically effective for many practical problems, and we don't know why.
There have been theoretical papers such as Choromanska et al. 2016 and Kawaguchi 2016 which prove that, under certain assumptions, the local minima are essentially as good as the global minima, but the assumptions they make are somewhat unrealistic and they don't address the issue of the bad saddle points.
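
To see the saddle-point issue concretely, here is a toy sketch on a simple two-dimensional surface (not a neural network, just an illustration): plain gradient descent started exactly on the saddle's stable manifold converges to the saddle, while an infinitesimally perturbed start reaches a minimum instead.

    # Toy non-convex surface f(x, y) = (x^2 - 1)^2 + y^2:
    # global minima at (+1, 0) and (-1, 0), a saddle point at (0, 0).
    import numpy as np

    def grad(p):
        x, y = p
        return np.array([4 * x * (x**2 - 1), 2 * y])

    def descend(p, lr=0.05, steps=2000):
        for _ in range(steps):
            p = p - lr * grad(p)
        return p

    print("start on the x = 0 axis: ", descend(np.array([0.0, 1.0])))   # stuck at the saddle (0, 0)
    print("start slightly perturbed:", descend(np.array([1e-6, 1.0])))  # reaches the minimum (~1, 0)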

The other main gap in our understanding is generalization performance: how well does the model perform on novel examples not seen during training? It's easy to show that in the limit of an infinite number of training examples (sampled i.i.d. from a stationary distribution), the training error converges to the expected error on novel examples (provided that you could train to the global optimum), but since we don't have infinite training examples, we are interested in how many examples are needed to achieve a given difference between training and generalization error. Statistical learning theory studies these generalization bounds.
Empirically, training a large modern neural network requires a large number of training examples (Big Data, if you like buzzwords), but not so monumentally large as to be practically infeasible. But if you apply the best known bounds from statistical learning theory (for instance Gao & Zhou 2014) you typically get infeasibly huge numbers (a back-of-envelope illustration follows this paragraph). Therefore these bounds are very far from being tight, at least for practical problems.
One reason might be that these bounds tend to assume very little about the data generating distribution, hence they reflect the worst-case performance against adversarial environments, while "natural" environments tend to be more "learnable".
It is possible to write distribution-dependent generalization bounds, but we don't know how to formally characterize a distribution over "natural" environments. Approaches such as algorithmic information theory are still unsatisfactory.
Therefore we still don't know why neural networks can be trained without overfitting.
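
For a sense of scale, here is a crude back-of-envelope sketch (my own, far cruder than the bounds cited above): treat a network with ten million 32-bit weights as a finite hypothesis class and plug it into the elementary Hoeffding/union-bound sample-complexity estimate $m \geq (\ln|H| + \ln(1/\delta)) / (2\epsilon^2)$.

    # Back-of-envelope sketch: sample complexity from a finite-hypothesis-class
    # bound, with |H| = 2^(32 * 10^7) distinct weight settings.
    import math

    n_params = 10_000_000          # weights in a modest modern network
    bits_per_param = 32
    ln_H = n_params * bits_per_param * math.log(2)

    eps, delta = 0.01, 0.01        # generalization gap <= 1% with probability 99%
    m = (ln_H + math.log(1 / delta)) / (2 * eps**2)
    print(f"samples required by the bound: {m:.2e}")   # on the order of 10^12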

Furthermore, it should be noted that these two main issues seem to be related in a still poorly understood way: the generalization bounds from statistical learning theory assume that the model is trained to the global optimum on the training set, but in a practical setting you would never train a neural network until convergence even to a saddle point, as doing so would typically cause overfitting. Instead you stop training when the error on a held-out validation set (which is a proxy for the generalization error) stops improving. This is known as "early stopping" (a minimal sketch of the procedure appears at the end of this answer).
So in a sense all this theoretical research on bounding the generalization error of the global optimum may be quite irrelevant: not only can we not find it efficiently, but even if we could, we would not want to, since it would perform worse on novel examples than many "sub-optimal" solutions.
It may be the case that optimization hardness is not a flaw of neural networks; on the contrary, maybe neural networks can work at all precisely because they are hard to optimize.
All these observations are empirical and there is no good theory that explains them. There is also no theory that explains how to set the hyperparameters of neural networks (hidden layer width and depth, learning rates, architectural details, etc.). Practitioners use their intuition honed by experience and lots of trial and error to come up with effective values, while a theory could allow us to design neural networks in a more systematic way.
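
For concreteness, here is a minimal sketch of the early-stopping procedure mentioned above (the helpers train_one_epoch, validation_error and model are placeholders, not from any particular library):

    # Schematic early stopping: keep training while the held-out validation
    # error improves, and return the parameters from the best epoch.
    import copy

    def train_with_early_stopping(model, train_one_epoch, validation_error,
                                  patience=10, max_epochs=1000):
        best_err = float("inf")
        best_model = copy.deepcopy(model)
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            train_one_epoch(model)            # one pass of SGD over the training set
            err = validation_error(model)     # held-out error, a proxy for generalization
            if err < best_err:
                best_err, best_model = err, copy.deepcopy(model)
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                     # validation error stopped improving
        return best_model, best_err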

Another take on this question, to add to @Aryeh's remarks: For many other models of learning, we know the "shape" of the hypothesis space. SVMs are the best example of this, in that what you're finding is a linear separator in a (possibly-high dimensional) Hilbert space.

For neural networks in general, we don't have any such clear description or even an approximation. And such a description is important for us to understand what exactly a neural network is finding in the data.

$\begingroup$What would you call the "shape" of the hypothesis space? :) Does Theorem 2.1 (page 3) of ours answer some of your question: eccc.weizmann.ac.il/report/2017/098? :D$\endgroup$
– Anirbit, May 23 '18 at 16:13

Quoting from a Quanta Magazine article on Naftali Tishby's information-bottleneck theory:

Last month, a YouTube video of a conference talk in Berlin, shared widely among artificial-intelligence researchers, offered a possible answer. In the talk, Naftali Tishby, a computer scientist and neuroscientist from the Hebrew University of Jerusalem, presented evidence in support of a new theory explaining how deep learning works. Tishby argues that deep neural networks learn according to a procedure called the “information bottleneck,” which he and two collaborators first described in purely theoretical terms in 1999. The idea is that a network rids noisy input data of extraneous details as if by squeezing the information through a bottleneck, retaining only the features most relevant to general concepts. Striking new computer experiments by Tishby and his student Ravid Shwartz-Ziv reveal how this squeezing procedure happens during deep learning, at least in the cases they studied.
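
For reference, the information-bottleneck objective from that 1999 paper can be stated compactly (schematically): choose a compressed representation $T$ of the input $X$ that minimizes $I(X;T) - \beta\, I(T;Y)$, where $I$ denotes mutual information, $Y$ is the prediction target, and $\beta$ trades off compression against retained predictive information.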

I would say that we still need to discover an efficient algorithm for training deep neural networks. Yes, SGD does work well in practice, but finding a better algorithm with guarantees of convergence to a global minimum would be very nice.