P.S.: Cross-posting my post with minor edits from one of the LFD course sub-forums. But this probably belongs here. And all are welcome to comment.

Professor Abu-Mostafa:

Sorry about the length of this post, but I would appreciate your advice on a problem I am facing with a neural network that I am trying to implement for regression.

The problem is that the predicted values eventually turn out to be the same for all inputs. After some reading, it seemed to me that this may be a consequence of saturation in the hidden layer. The network has one hidden layer and one linear output neuron. The hidden layer is non-linear, and I have tried several activation functions there: first tanh and the logistic function, and then, after reading some papers on how these can saturate, rectified linear units, i.e. max(0, x). However, after an amount of time that varies with the parameters, the output values again all become equal, even with the rectified linear units.

I am using gradient descent. I have tried mini-batches of several thousand examples, iterating over each batch to decrease its cost, and I have also tried learning from just one example at a time. I have tried randomly permuting the inputs. I started with the cross-entropy error and am now working with mean squared error. I have checked the gradient calculation numerically by perturbing the weights slightly and comparing some values. I have also tried it with and without regularization.

With tanh, learning seemed quick even with one example at a time, but it ran into this stuck behavior very early. With rectified linear units, it learns more slowly and then seems to sit on a big plateau, and it took some time to get into this saturated or saturated-like state.
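For readers who want to reproduce the numerical gradient check mentioned above, here is a minimal sketch. It assumes a tanh hidden layer, a linear output, and a loss of half the mean squared error; the function names, the toy data, and the step size `eps` are illustrative choices, not the code from the original post.

```python
import numpy as np

def forward(params, X):
    W1, b1, w2, b2 = params
    H = np.tanh(X @ W1 + b1)        # hidden activations, shape (n, hidden)
    return H, H @ w2 + b2           # linear output, shape (n,)

def loss(params, X, y):
    _, pred = forward(params, X)
    return 0.5 * np.mean((pred - y) ** 2)

def backprop(params, X, y):
    W1, b1, w2, b2 = params
    H, pred = forward(params, X)
    d_out = (pred - y) / len(y)                   # dLoss/dpred
    d_hid = np.outer(d_out, w2) * (1.0 - H ** 2)  # back through tanh
    return [X.T @ d_hid, d_hid.sum(axis=0), H.T @ d_out, np.array([d_out.sum()])]

def numerical_grad(params, X, y, eps=1e-6):
    # Central differences: perturb one weight at a time, then restore it.
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            up = loss(params, X, y)
            p[idx] = old - eps
            down = loss(params, X, y)
            p[idx] = old
            g[idx] = (up - down) / (2.0 * eps)
        grads.append(g)
    return grads

def max_grad_error(params, X, y):
    analytic = backprop(params, X, y)
    numeric = numerical_grad(params, X, y)
    return max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))

def make_problem(seed=0, n=20, d=3, hidden=5):
    # Made-up data and weights, only for exercising the check.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    params = [rng.normal(scale=0.5, size=(d, hidden)), np.zeros(hidden),
              rng.normal(scale=0.5, size=hidden), np.zeros(1)]
    return params, X, y
```

Calling `max_grad_error(*make_problem())` should return a very small number if the analytic and numerical gradients agree. A check like this only confirms the gradient of the cost as coded; it says nothing about whether the optimizer then makes good use of it.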

I think I am training on a sufficient number of examples. Their number is about 10 times the number of weights, as per my understanding of the VC analysis in your lectures.

I noticed in earlier runs that the predicted values tended to converge to the mean of the target output values of the last mini-batch that it had been trained on. It seems to me that it somehow wants to minimize the cost by finding the mean of the target outputs, and then use this mean for prediction. And that is the local minimum that it seems to move to. However, this does not happen right away, so I don't think I have accidentally coded anything into the cost function or the back-propagation specifically asking it to do this. Does the math of back-propagation encourage this specific kind of local minimum (predicted value tending to mean of outputs from mini-batch)?
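One fact that may bear on this question, stated for the mean squared error only: among all constant predictions, the mean of the targets has the lowest squared error, and any stationary point of the output bias forces the average prediction to equal the average target. The symbols E, c, b, w_j, and \theta_j below are notation introduced for this note, not taken from the lectures.

```latex
% Best constant prediction c for targets y_1, ..., y_N under squared error
E(c) = \frac{1}{N}\sum_{n=1}^{N} (c - y_n)^2,
\qquad
\frac{dE}{dc} = \frac{2}{N}\sum_{n=1}^{N} (c - y_n) = 0
\;\Longrightarrow\;
c^{*} = \frac{1}{N}\sum_{n=1}^{N} y_n .

% Output bias b of a network h(x) = \sum_j w_j \theta_j(x) + b
\frac{\partial}{\partial b}\,\frac{1}{N}\sum_{n=1}^{N} \bigl(h(x_n) - y_n\bigr)^2
= \frac{2}{N}\sum_{n=1}^{N} \bigl(h(x_n) - y_n\bigr) = 0
\;\Longrightarrow\;
\frac{1}{N}\sum_{n=1}^{N} h(x_n) = \frac{1}{N}\sum_{n=1}^{N} y_n .
```

So if the hidden layer contributes nothing that varies with the input, the remaining freedom in the output bias settles on the mean of the targets it is currently trained on, which would match the mini-batch mean described above. This does not by itself explain why the hidden contribution stops varying.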

While it is certainly possible there is a bug in my code, is this kind of behavior common? If so, what measures would you recommend to address it? Specifically, if I iterate over the same examples for a much longer duration, can the neural network move out of this state?

In fact, since this happens even with rectified linear units, is it perhaps not a matter of the hidden layer saturating to fixed activations at all, but some other behavior, related to an overall tendency of the outputs towards certain local minima? Or are these really the same thing?
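One way to tell these cases apart empirically is to look at the hidden activations directly. The sketch below is a hypothetical diagnostic (the function name, the 0.99 threshold, and the returned keys are assumptions made for illustration): it reports how many activations are saturated (tanh) or inactive (rectified linear), and how much each hidden unit varies across examples.

```python
import numpy as np

def hidden_diagnostics(H, kind="tanh", sat_threshold=0.99):
    """H holds hidden activations, shape (n_examples, n_hidden)."""
    if kind == "tanh":
        stuck = np.abs(H) > sat_threshold   # near the flat ends of tanh
    elif kind == "relu":
        stuck = H <= 0.0                    # unit is switched off
    else:
        raise ValueError("kind must be 'tanh' or 'relu'")
    return {
        # Fraction of all (example, unit) pairs that are saturated or off.
        "stuck_fraction": float(stuck.mean()),
        # Units that are saturated or off for every single example.
        "units_stuck_everywhere": int(stuck.all(axis=0).sum()),
        # Near-zero values mean the unit outputs the same thing for all inputs.
        "activation_std_per_unit": H.std(axis=0),
    }
```

If `activation_std_per_unit` is close to zero for every unit, the hidden layer is producing the same vector for all inputs, which alone is enough to make a linear output constant. If the units still vary but the predictions do not, the output weights are the more likely place to look.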

Is it possible to not get into such a situation by trying out many random combinations of initial weight values?

It seems to me that this is not really a generalization problem, and that regularization may not cure this, though it may find a different minimum where it saturates, by changing the cost. Is this intuition correct?

Let me take this step by step. I assume that your training data points have different outputs and that your training gets the network to predict these outputs with reasonable approximation so that the network outputs are not the same for the training set. Is this correct?

__________________Where everyone thinks alike, no one thinks very much

After training on a mini-batch of the training set and adjusting the weights, when I predict the outputs for that same mini-batch, the network produces identical (incorrect) outputs once it has undergone some training. At the beginning of training, the predictions are far off; they differ from one another, though many are the same. After some training, however, all the predictions on the training set start to converge, and they get closer to the mean of the target outputs of the latest training mini-batch.

I get the same output value if I then run the network on a validation set. When I continue training from the next mini-batch and then test with that mini-batch and then a validation set, the behavior is the same, but the predicted value changes after each training mini-batch.

I hope I have answered your question.

Initially, I had conjectured that the network might be completely biased, providing a fixed output from the weight of the bias term with no contribution from the other weights, and so I removed the bias term altogether. But the behavior stayed the same, and I brought the bias term back.

OK, so somehow the training phase is not working. There could be different reasons for that. Can I ask you to try to train on only two examples with different outputs and see what happens?

Sure, I just tried the training that you suggested. With repeated training over the same 2 examples that have different target outputs, it learns correctly, and eventually predicts them both with no training error.
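For reference, the two-example check described above can be sketched as follows. This assumes a small tanh network with a linear output trained by full-batch gradient descent on half the mean squared error; the two inputs, the targets, the learning rate, and the network size are all made up for illustration.

```python
import numpy as np

def train_two_examples(steps=5000, lr=0.05, hidden=4, seed=0):
    rng = np.random.default_rng(seed)
    X = np.array([[0.0, 1.0], [1.0, 0.0]])   # two made-up inputs
    y = np.array([1.0, -1.0])                # two different targets
    W1 = rng.normal(scale=0.5, size=(2, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.5, size=hidden)
    b2 = 0.0
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)                      # hidden layer
        pred = H @ w2 + b2                            # linear output
        d_out = (pred - y) / len(y)                   # dLoss/dpred
        d_hid = np.outer(d_out, w2) * (1.0 - H ** 2)  # back through tanh
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
        w2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum()
    pred = np.tanh(X @ W1 + b1) @ w2 + b2
    return pred, 0.5 * np.mean((pred - y) ** 2)
```

If a network of this kind cannot drive the error on two distinct examples close to zero, the training loop itself is suspect; if it can, as reported here, the difficulty lies in fitting the full data set.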

OK. Now, this narrows it down to an inability to learn properly on the full data set. Two possibilities:

1. Computational: Not enough epochs, or a bad local minimum.

2. Inherent: The target function is almost impossible to capture given the size of the network.

Let's deal with 1 first. Try a very long run with, say, 100 times the epochs and see if the result is better. Also try 100 different runs (initial random weights seeded differently) with the smaller number of epochs and see how good the best result is.
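The multiple-restart experiment suggested above might be organized roughly like this. It is a sketch under assumptions: the toy data set, the network size, the learning rate, and the helper names `train`, `best_of_restarts`, and `make_toy_data` are all hypothetical, and full-batch gradient descent stands in for whatever training loop is actually in use.

```python
import numpy as np

def train(X, y, seed, epochs=200, lr=0.05, hidden=8):
    # One training run; only the seed of the initial weights changes.
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.5, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        d_out = (H @ w2 + b2 - y) / len(y)
        d_hid = np.outer(d_out, w2) * (1.0 - H ** 2)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
        w2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum()
    pred = np.tanh(X @ W1 + b1) @ w2 + b2
    return float(0.5 * np.mean((pred - y) ** 2))

def best_of_restarts(X, y, seeds, **train_kwargs):
    # Final training loss per seed, plus the seed that did best.
    losses = {seed: train(X, y, seed, **train_kwargs) for seed in seeds}
    best_seed = min(losses, key=losses.get)
    return best_seed, losses

def make_toy_data(n=200, seed=123):
    # Made-up smooth target, only to have something to fit.
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1]
    return X, y
```

For example, `best_of_restarts(*make_toy_data(), seeds=range(100))` runs 100 differently seeded trainings. Comparing the spread of the final losses across seeds against one much longer run helps separate "not enough epochs" from "bad local minimum".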

I have been running the training as you suggested. It's still in progress, but I thought I would mention some things I have found in the meantime.

Without regularization, it has actually not got stuck (in the identical predictions problem on the training set) after 24 epochs. This is contrary to what I had reported in my first post, about regularization not affecting this problem. I must have got my observations mixed up while juggling the different model hyper-parameters. Sorry about that.

However, with some regularization, I am seeing the problem I saw before. So, it must have been the regularization that pushed it over to a high-bias region, where the best it could do was to learn the mean of the outputs it most recently saw and predict that for every example. In that case, maybe this is similar to the condition shown in the last curve on slide 12 of your lecture on regularization?
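The effect described here, heavy regularization collapsing every prediction onto the mean of the targets, is easy to see in a much simpler stand-in model. The sketch below uses ridge regression with an unpenalized intercept rather than a neural network, on made-up data; whether the bias term was penalized in the original network is not stated in this thread, so this is only an illustration of the mechanism.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; the intercept b is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # center inputs and targets
    A = Xc.T @ Xc + lam * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def prediction_spread(X, y, lam):
    # Standard deviation and mean of the predictions on the training inputs.
    w, b = ridge_fit(X, y, lam)
    pred = X @ w + b
    return pred.std(), pred.mean()

def make_linear_data(n=500, seed=0):
    # Made-up linear target with a little noise.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
    return X, y
```

With `lam=0.0` the predictions vary about as much as the targets do. With a very large `lam` the weights shrink towards zero, the spread of the predictions shrinks with them, and what remains is the intercept, which sits at the mean of the targets.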

I still notice significant training error in the long many-epoch run, and even in the randomized runs, though it probably hasn't gone through enough of the random runs yet to be sure that this is always the case. But if this trend continues, I suppose that means the model may not have a sufficient number of parameters for this problem?

One possibility which I cannot exclude from your description is that the non-random part of the relationship between your inputs and outputs is insignificant compared to the random component. In this case, you may arrive at all inputs predicting an average of the outputs. This is the best that can be done if the outputs are entirely random, so there is no learnable content.

Thanks for your comment. If I understand your post correctly, you're saying that maybe the inputs are not far from random? Well, I'm hoping that's not the case, but it's certainly possible that my representation of the actual input has some issues, as I was trying out what, to my knowledge, may be a non-standard way to represent this input.

And just to complete the picture about the training runs ... I did run it for 100 epochs with the same initial values, and I completed 50 runs with different random initial settings. I had to stop it mid-way through the 100 random runs, as it was beginning to sort of take over my computer. :-) I ran these with no regularization, and none of them showed the “identical predictions” problem, though they do not show good learning behavior. But with regularization, I do see the problem.

The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.