How to fight underfitting in a deep neural net

When I started with artificial neural networks (NNs), I thought I'd have to fight overfitting as the main problem. But in practice I can't even get my NN past the 20% error-rate barrier. I can't even beat my score with a random forest!

I'm looking for advice, general or specific, on what one should do to make a NN start capturing trends in the data.

For the NN implementation I use the Theano stacked autoencoder with the code from the tutorial, which works great (less than 5% error rate) for classifying the MNIST dataset. It is a multilayer perceptron with a softmax layer on top, and each hidden layer is pre-trained as an autoencoder (fully described in the tutorial, chapter 8). There are ~50 input features and ~10 output classes. The NN has sigmoid neurons and all data are normalized to [0,1]. I have tried lots of different configurations: the number of hidden layers and neurons in them (100->100->100, 60->60->60, 60->30->15, etc.), different learning and pre-training rates, etc.
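The [0,1] normalization mentioned above can be sketched in plain NumPy (a min-max scaler; the function name and the constant-column handling are my own choices, not part of the tutorial):

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature column to [0, 1], as described for the sigmoid net.

    Constant columns (max == min) are mapped to 0 to avoid division by zero.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

X = np.array([[0.0, 10.0], [5.0, 10.0], [10.0, 10.0]])
print(minmax_normalize(X))  # first column becomes 0, 0.5, 1; constant column becomes 0
```

Note that in practice the minimum and maximum should be computed on the training set only and then reused for the validation and test sets; otherwise information leaks between the splits, which can distort exactly the kind of validation/test gap described below.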

And the best thing I can get is a 20% error rate on the validation set and a 40% error rate on the test set.

On the other hand, when I try to use Random Forest (from scikit-learn) I easily get a 12% error rate on the validation set and 25%(!) on the test set.

How can it be that my deep NN with pre-training behaves so badly? What should I try?

Answers

The problem with deep networks is that they have lots of hyperparameters to tune and a very small solution space. Thus, finding good settings is more of an art than an engineering task. I would start with a working example from the tutorial and play around with its parameters to see how the results change; this gives good intuition (though not a formal explanation) about the dependencies between parameters and results (both final and intermediate).

They both describe RBMs, but they contain some insights on deep networks in general. For example, one of the key points is that networks need to be debugged layer-wise: if the previous layer doesn't provide a good representation of the features, further layers have almost no chance to fix it.
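As a toy illustration of that layer-wise check (not the Theano code from the question; a fixed random encoder with a trainable linear decoder, with all shapes and rates chosen arbitrarily), one can verify that a layer's reconstruction error actually falls during training before stacking anything on top of it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 8))  # toy inputs in [0, 1], like the normalized features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One autoencoder layer: a fixed random encoder producing the hidden code H,
# and a trainable linear decoder. The layer-wise check is simply: does this
# layer's reconstruction error drop as we train it?
W_enc = rng.normal(scale=0.5, size=(8, 4))
H = sigmoid(X @ W_enc)              # hidden representation of the data
W_dec = np.zeros((4, 8))            # decoder weights, trained below

def recon_error(W_dec):
    return np.mean((X - H @ W_dec) ** 2)

err_before = recon_error(W_dec)
for _ in range(200):                # plain gradient descent on squared error
    grad = 2.0 * H.T @ (H @ W_dec - X) / len(X)
    W_dec -= 0.1 * grad
err_after = recon_error(W_dec)
print(err_before, "->", err_after)  # error should fall noticeably
```

If the error does not fall for a layer (or plateaus at a uselessly high value), there is little point in tuning the layers above it.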

While ffriend's answer gives some excellent pointers for learning more about how neural networks can be (extremely) difficult to tune properly, I thought it might be helpful to list a couple of specific techniques that are currently used in top-performing classification architectures in the neural network literature.

Rectified linear units

Many recent architectures replace the traditional sigmoid hidden units with rectified linear units, $\text{relu}(z) = \max(0, z)$. This activation has two useful properties:

its output is a true zero (not just a small value close to zero) for $z \le 0$, and

its derivative is constant, either 0 for $z \le 0$ or 1 for $z > 0$.

A network of relu units basically acts like an ensemble of exponentially many linear networks, because units that receive input $z \le 0$ are essentially "off" (their output is 0), while units that receive input $z > 0$ collapse into a single linear model for that input. Also the constant derivatives are important because a deep network with relu activations tends to avoid the vanishing gradient problem and can be trained without layerwise pretraining.
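Concretely, both properties are a couple of lines of NumPy (the function names are mine):

```python
import numpy as np

def relu(z):
    # Elementwise max(0, z): exact zeros for z <= 0, identity for z > 0.
    return np.maximum(0.0, z)

def relu_grad(z):
    # Piecewise-constant derivative: 0 for z <= 0, 1 for z > 0, so any
    # gradient that does flow passes through unattenuated (unlike sigmoid,
    # whose derivative is at most 0.25 and shrinks the signal at every layer).
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # zeros for the non-positive inputs, identity for the rest
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```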

Dropout

Many research groups in the past few years have advocated the use of "dropout" in classifier networks to avoid overfitting. (See, for example, "Dropout: A simple way to prevent neural networks from overfitting" by Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov: http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf.) In dropout, during training, some constant proportion of the units in a given layer are randomly set to 0 for each input that the network processes. This forces the units that aren't set to 0 to "make up" for the "missing" units. Dropout seems to be an extremely effective regularizer for neural network models in classification tasks. See the blog article at http://fastml.com/regularizing-neural-networks-with-dropout-and-with-dropconnect/.
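A minimal sketch of the training-time operation described above (this uses the "inverted" scaling convention, where survivors are scaled up during training so nothing needs rescaling at test time; the paper itself instead scales the weights at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop, train=True):
    # During training, zero each unit independently with probability p_drop
    # and scale the survivors by 1 / (1 - p_drop), so that the expected
    # activation is unchanged. At test time, pass activations through as-is.
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h, p_drop=0.5))  # roughly half zeros, survivors doubled to 2.0
```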

A more recent development is the residual network of "Deep Residual Learning for Image Recognition" by He, Zhang, Ren, & Sun, whose authors observe:

"When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments."

To solve the problem, they made use of a skip (residual) architecture, in which each block learns a residual function that is added back to the block's input. With it they trained very deep networks (a 152-layer model on ImageNet, and up to 1202 layers on CIFAR-10) and won the ILSVRC 2015 classification challenge.
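In its simplest form, a skip connection just adds the block's input back to its output (a toy NumPy sketch; the weight shape and tanh activation here are illustrative assumptions, not the paper's exact design, which uses convolutions and batch normalization):

```python
import numpy as np

def residual_block(x, W, activation=np.tanh):
    # The learned transform F(x) plus the identity shortcut: even if F is
    # near zero (e.g. early in training), the block still passes x through,
    # so stacking many such blocks should not increase training error the
    # way stacking plain layers can.
    return activation(x @ W) + x

x = np.array([[1.0, -2.0, 3.0]])
W = np.zeros((3, 3))          # an "untrained" transform: F(x) = tanh(0) = 0
print(residual_block(x, W))   # identical to x: the shortcut dominates
```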

Thank you for your answer. You are talking about the vanishing gradient problem, but what should one do when the validation accuracy is higher than the training accuracy? It can happen with a small validation set, but sometimes it doesn't seem to depend on the validation set. Is there any other reason why validation accuracy would exceed training accuracy? – Sudip Das – 2018-03-01T15:27:01.040