7/11/2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

It should be noted that the paper of Quoc V. Le et al. is about training shallow networks only. The title, and the way the paper is written, is quite misleading. Moreover, it ignores the work already done long ago by LeCun and collaborators that investigated using L-BFGS for training neural networks, and concluded that SGD, done properly (and perhaps using acceleration techniques such as LeCun’s diagonal curvature matrix approximations), was almost always preferable. And I don’t trust that Le et al.’s SGD implementation and choice of learning rate schedule is particularly good, or that they tried particularly hard to tune it properly. Yes, black-box optimization packages like minfunc can do pretty well with little or no tuning, but this isn’t a particularly interesting finding.

Finally, this paper completely misinterprets (see its related work section) my results about training deep networks with Hessian-free optimization published in the previous ICML. In particular, HF is able to very successfully train deep networks from random initializations without requiring pre-training, something which was never demonstrated before (or since) with any optimizer, including L-BFGS. Le et al.’s paper incorrectly places HF (and other work) as just tools for implementing the pre-training approach of Hinton et al., a critical misinterpretation of the literature on this topic.

Could L-BFGS in principle be used for learning deep nets from random initializations (to say nothing of recurrent neural networks)? Well, maybe, since it is also a quasi-Newton optimization method. But no one has yet demonstrated this, and many have tried. In my experience, it does very poorly, even when it is allowed to store many more rank-1 updates than would be practical in most situations. The supplement to Le et al.’s paper posted on Quoc Le’s website is just a half-hearted attempt at obtaining fast convergence on *shallow nets* using some version of HF which I have little faith in. I’m not sure what this is supposed to prove, or why he even bothered posting it, except to mislead and confuse people who don’t critically examine such things.

Unfortunately, some people who have read Le et al.’s supplement have come away with the mistaken interpretation that L-BFGS is across-the-board faster than HF for neural net training. There are many problems with this, but I’ll give the two main ones. First, the version of HF for my ICML paper was never tuned to be a general purpose black-box optimization package, unlike the implementation of L-BFGS in minfunc. In my experience it is possible to tune it in order to get good convergence rates with shallow nets, but that’s hardly interesting, or really the point of it. Secondly, and much more importantly, it assumes that observations made while training 1 layer networks (with tied weights, which makes this easy problem even easier) can somehow generalize to deep or temporal networks. After all, if optimizer A is faster than B on easy problem X it should also be faster on hard problem Y, right? But this is demonstrably false, and contradicts over a decade of research done on training deep networks, where methods which worked fine on shallow networks would invariably “peter out” when applied to deep networks, I would guess because they are not as adept as HF is at handling issues of local curvature variations.

To clarify: I’m not saying that I actually think Quoc was being intentionally misleading with the posting of the supplement to his paper. However, I maintain that it has resulted in people being confused and misled.

As a researcher who relies upon optimization methods but has not made optimization the main focus of my research, I’ve found many of the recent publications on optimization methods very confusing. Let me be very clear that I don’t think this is anyone’s “fault” per se; rather, I am just offering my perspective as someone who would like to find a clear take-home message in the machine learning optimization literature and so far has not found one. I hope that my comments can help optimization researchers plan future experiments to clarify some of these issues.

Specifically, with regard to Hessian Free, I’m not certain of the following:
-Which aspects of the algorithm are important to its performance
-To what extent each of the aspects of the algorithm are truly optimization methods rather than regularization methods
-What sort of problems Hessian Free is applicable to
-How to tune Hessian Free for a particular problem / to what extent it would be possible to make Hessian Free a more black box optimizer

With regard to Le et al’s ICML paper I think some of the results certainly require further explanation. The main result I find confusing is the demonstration that (minibatch) SGD does not enjoy any speedup on GPU at all. This contradicts the Raina 2009 paper also from Andrew’s lab which shows a > 70X speedup for minibatch gradient descent when using GPU. Moreover, there is also a Le et al CVPR 2011 paper which advocates using batch gradient descent over “other methods”. It’s understandable that the CVPR paper and ICML paper do not refer to each other as they were published so close together in time, and it’s important to note that the Raina et al paper applied minibatch SGD to different models than the Le et al paper. The supplement does make some effort to explain why the GPU is slower than usual but doesn’t specifically address these papers and is also rather short on details about exactly how the limited memory of the GPU is simulated. All of this makes it difficult for an outside observer to find a clear take-home message about which optimization method to use.

In the case of Hessian Free, one could reduce the confusion by performing an ablative analysis (show its performance with different parts removed) or applying it to more tasks. With respect to the SGD speedup issue I think maybe we need more information published about the experimental setup involved in each case to know exactly what the issue is. Alternatively, if SGD is genuinely very fast on GPU for training DBNs and not very fast on GPU for training autoencoders, that would be an interesting finding. I would also like to see the GPU speedup evaluated in a realistic setting where limited memory is a genuine issue rather than a simulated handicap.

As someone who’s more interested in modeling than optimization, all of this uncertainty means that I’m probably going to continue using SGD with Polyak averaging for the time being (because it’s easy to implement and seems to work reasonably well for me), but I look forward to seeing more optimization results that could help me figure out if I should be using a different method.
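For readers who have not used it, SGD with Polyak averaging can be sketched roughly as follows. This is a minimal illustration on a toy least-squares problem, not the commenter's actual setup: the function names (`sgd_polyak`, `lsq_grad`), the learning rate, the batch size, and the synthetic data are all assumptions made for the example. The optimizer runs ordinary minibatch SGD and additionally keeps a running mean of the iterates, which is the quantity one would evaluate.

```python
import numpy as np

def sgd_polyak(grad_fn, w0, X, y, lr=0.1, epochs=20, batch_size=10, seed=0):
    """Minibatch SGD that also tracks the running average of all iterates."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    w_avg = w.copy()
    t = 0
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad_fn(w, X[idx], y[idx])
            t += 1
            w_avg += (w - w_avg) / t            # incremental mean of w_1..w_t
    return w, w_avg

def lsq_grad(w, X, y):
    """Gradient of the mean squared error 0.5 * mean((X w - y)^2)."""
    return X.T @ (X @ w - y) / len(y)

# Hypothetical synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w_last, w_avg = sgd_polyak(lsq_grad, np.zeros(3), X, y)
```

In practice the average is often started only after an initial burn-in period, so that early, far-from-optimal iterates do not dominate it; that refinement is omitted here to keep the sketch short.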

The aspects which are absolutely critical are the use of a PSD curvature matrix such as the Gauss-Newton, the truncation of CG at some reasonable point, as measured either by progress on optimizing the quadratic objective, a fixed threshold (this might be hard to set properly) or some other metric or combination of these, and the use of a damping scheme such as the Tikhonov one that I used in the paper. Note that none of the aspects of the algorithm are designed to be regularization techniques. Tikhonov “regularization” refers to the regularization of the quadratic subproblem and not the overall nonlinear objective. In particular, it does not affect the gradient at all so it’s not a regularization method in the machine learning sense of the word. Lack of damping is the main reason that implementations of HF in packages like minfunc do so poorly in practice. Without an implicit “trust-region”, strongly 2nd-order optimizers like HF are absolutely hopeless for highly non-linear optimizations such as deep auto-encoder training.
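The three ingredients named above (a PSD curvature matrix, truncated CG, Tikhonov damping) can be put together in a small sketch. This is not the code from the paper or from minfunc; `damped_cg`, its parameters, and the toy Jacobian `J` are hypothetical names chosen for illustration, and the stopping rule here is just an iteration cap plus a residual threshold rather than the progress-based test mentioned above. The sketch approximately solves (B + λI) p = -g using only matrix-vector products with B, where B = JᵀJ is a Gauss-Newton matrix. Note that λ appears only in the curvature term and never changes g.

```python
import numpy as np

def damped_cg(matvec, g, lam, x0=None, max_iters=50, tol=1e-10):
    """Approximately solve (B + lam * I) p = -g with truncated CG.

    matvec(v) must return B @ v for a PSD curvature matrix B.
    x0 optionally warm-starts the solve (e.g. from a previous solution).
    """
    p = np.zeros_like(g) if x0 is None else x0.copy()
    r = -g - (matvec(p) + lam * p)       # residual of the damped system
    d = r.copy()
    rs = r @ r
    for _ in range(max_iters):           # truncation: hard iteration cap
        if rs < tol:                     # truncation: small residual
            break
        Ad = matvec(d) + lam * d
        alpha = rs / (d @ Ad)
        p += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# Hypothetical toy problem: B = J^T J is PSD by construction.
rng = np.random.default_rng(0)
J = rng.normal(size=(20, 5))
g = rng.normal(size=5)
gn_matvec = lambda v: J.T @ (J @ v)      # never forms B explicitly

p_small = damped_cg(gn_matvec, g, lam=0.1)
p_large = damped_cg(gn_matvec, g, lam=100.0)
```

A larger λ gives a shorter, more gradient-like step, which is the sense in which damping acts as an implicit trust region. How λ is adapted between iterations is a separate design choice that this sketch leaves out.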

But most of the other ideas presented in the paper, such as including preconditioning (which by its nature is highly problem dependent) and initializing each CG run from the previous iteration are very helpful to speed things up in most cases, and they are trivial to add once you have the basic algorithm in place.

I’ve found that HF can be successfully applied to training deep networks (with or without pre-training) and RNNs. I would suggest applying it to any problem where you suspect that standard optimization methods are running into difficulties and not producing the best results possible due to issues related to difficult curvature (as sometimes measured by the condition number of the curvature matrix, although this is a vast oversimplification of a very complex issue that is more to do with the eigen-distribution of the curvature matrix than the extreme large and small values).
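The simplest version of the curvature issue can be seen on a toy quadratic. The example below is an assumption-laden illustration (a hypothetical 2-D diagonal matrix `A`, a fixed step size, 100 iterations) and, as the comment above stresses, the condition number it computes is only a crude summary of the full eigenvalue distribution. It contrasts plain gradient descent, whose step size is limited by the largest curvature, with a single exact Newton step.

```python
import numpy as np

# Hypothetical quadratic f(w) = 0.5 * w^T A w with very unequal curvatures.
A = np.diag([1.0, 1e-3])
eigs = np.linalg.eigvalsh(A)             # eigenvalues in ascending order
cond = eigs[-1] / eigs[0]                # condition number of a PD matrix

# Gradient descent: the step size is capped by the largest curvature,
# so the low-curvature direction barely moves.
w_gd = np.array([1.0, 1.0])
lr = 1.0 / eigs[-1]
for _ in range(100):
    w_gd = w_gd - lr * (A @ w_gd)        # gradient of f is A w

# One exact Newton step rescales each direction by its own curvature.
w_newton = np.array([1.0, 1.0])
w_newton = w_newton - np.linalg.solve(A, A @ w_newton)

print(cond, w_gd, w_newton)
```

After 100 gradient steps the low-curvature coordinate has shrunk by only about 10%, while the Newton step reaches the minimum of this quadratic in one move.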

As for practical advice for using the optimizer and setting the meta parameters (e.g. preconditioner, CG iteration limit, minibatch sizes, initial lambda constant), well, there are some comments in the body of the code provided on my website about this which cover the most important points, and some discussion of the minibatch issue in my paper, but in general it’s not totally clear how to make it black-box. This is something that I’ve been working on for a while in collaboration with a couple others and I now have something which seems to work much more reliably and robustly, and is less sensitive to meta-parameter choices. It involves some non-trivial mathematics and new ideas and looks very different from the original algorithm. Expect to see something on this in the coming year.

Using SGD is fine as long as you recognize that there are many problems on which it will either underperform or fail spectacularly, such as training deep neural networks (especially without pre-training) and RNNs (especially on datasets with long-term dependencies).