
Newbie question: Confused about train, validation and test data!

First of all, I would like to thank you for the Encog Framework; without it, it would have been impossible to finish my thesis! :)

My question is about how many data sets should be used to train a neural network.

From http://www.heatonresearch.com/articles/1/page3.html (the Training Neural Networks and Validating Neural Networks sections),
I understand that two data sets are enough: one for training and one for testing.
However, looking around the net, I have read that a third data set is also used in order to minimize overfitting.

I quote:

Training Set: this data set is used to adjust the weights on the neural network.

Validation Set: this data set is used to minimize overfitting. You're not adjusting the weights of the network with this data set, you're just verifying that any increase in accuracy over the training data set actually yields an increase in accuracy over a data set that has not been shown to the network before, or at least that the network hasn't trained on (i.e. the validation data set). If the accuracy over the training data set increases, but the accuracy over the validation data set stays the same or decreases, then you're overfitting your neural network and you should stop training.

Testing Set: this data set is used only for testing the final solution in order to confirm the actual predictive power of the network.
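The early-stopping behavior described in the quote above can be sketched without any particular framework. The following is a minimal, Encog-independent illustration in Java: a toy one-parameter model (y = w * x) is fit by gradient descent on the training data while the validation error is monitored, and training stops once the validation error fails to improve for a few consecutive iterations. All names, data, and the patience value are illustrative, not part of Encog.

```java
// Sketch of early stopping with a validation set, on a toy model y = w * x.
// Everything here is illustrative; Encog's own training classes differ.
public class EarlyStoppingSketch {

    // Mean squared error of y = w * x over a data set.
    static double mse(double w, double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double e = w * x[i] - y[i];
            sum += e * e;
        }
        return sum / x.length;
    }

    // One gradient-descent step on the training data.
    static double step(double w, double[] x, double[] y, double lr) {
        double grad = 0;
        for (int i = 0; i < x.length; i++) {
            grad += 2 * (w * x[i] - y[i]) * x[i];
        }
        return w - lr * grad / x.length;
    }

    // Train until the validation error fails to improve `patience` times
    // in a row; return the weight with the best validation error seen.
    static double trainWithEarlyStopping(double[] trainX, double[] trainY,
                                         double[] valX, double[] valY) {
        double w = 0, bestW = 0;
        double bestValError = Double.MAX_VALUE;
        int badEpochs = 0, patience = 5;
        while (badEpochs < patience) {
            w = step(w, trainX, trainY, 0.01);
            double valError = mse(w, valX, valY);
            if (valError < bestValError) {
                bestValError = valError;
                bestW = w;
                badEpochs = 0;
            } else {
                badEpochs++; // training error may still fall: overfitting
            }
        }
        return bestW;
    }

    public static void main(String[] args) {
        double[] trainX = {1, 2, 3, 4}, trainY = {2, 4, 6, 8}; // y = 2x
        double[] valX = {5, 6}, valY = {10, 12};
        double w = trainWithEarlyStopping(trainX, trainY, valX, valY);
        System.out.println("learned w = " + w);
    }
}
```

The key design point is that the validation set never influences the weight update itself (`step` only sees the training data); it only decides when to stop and which weights to keep.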

So, my question is: do I also need a validation set (as described above), and if so, how exactly (in code) can I do it using Encog?

Usually the testing set is the "real" actual data that you run the neural network on after you've created it. For example, if I create a neural network to predict the S&P 500, I might use the years 1950-2000 as the training data, the years 2000-2010 as the validation set, and today onward as the testing set.
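A chronological three-way split like the one above is easy to express in plain Java. This is a hedged sketch, not Encog code: the fractions and array contents are placeholders, and it assumes the samples are already sorted oldest-to-newest.

```java
import java.util.Arrays;

// Sketch of a chronological train/validation/test split for time-series
// data, mirroring the 1950-2000 / 2000-2010 / today-onward idea above.
public class ChronologicalSplit {

    // Split samples (sorted oldest-to-newest) at two fractional cut points.
    static double[][][] split(double[][] samples, double trainFrac, double valFrac) {
        int trainEnd = (int) (samples.length * trainFrac);
        int valEnd = trainEnd + (int) (samples.length * valFrac);
        return new double[][][] {
            Arrays.copyOfRange(samples, 0, trainEnd),            // oldest: training
            Arrays.copyOfRange(samples, trainEnd, valEnd),       // middle: validation
            Arrays.copyOfRange(samples, valEnd, samples.length)  // newest: test
        };
    }

    public static void main(String[] args) {
        double[][] samples = new double[10][]; // ten chronological samples
        for (int i = 0; i < 10; i++) samples[i] = new double[] { i };
        double[][][] parts = split(samples, 0.6, 0.2);
        System.out.println(parts[0].length + "/" + parts[1].length + "/" + parts[2].length);
    }
}
```

Splitting by time rather than at random matters for forecasting tasks: a random split would let the network peek at the future it is supposed to predict.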

There are many theories on how to do this, and this is also an area of Encog that is in very active development. The idea is to get the best use out of your training data. Methods such as cross-validation, bootstrapping and jackknifing can be quite useful for this, and these are areas we are currently adding to Encog.
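Of the methods mentioned, k-fold cross-validation is the most common: the data is divided into k folds, and each fold is held out once as the validation set while the model trains on the rest. The sketch below shows only the fold bookkeeping, independent of Encog; the `FoldEvaluator` callback is a hypothetical stand-in for "train on the remaining data, score on the held-out indices".

```java
// Sketch of k-fold cross-validation bookkeeping (not Encog's API).
public class KFoldSketch {

    // Indices held out for validation in fold f of k, over n samples.
    static int[] validationIndices(int n, int k, int f) {
        int start = f * n / k, end = (f + 1) * n / k;
        int[] idx = new int[end - start];
        for (int i = start; i < end; i++) idx[i - start] = i;
        return idx;
    }

    // Hypothetical hook: "train on everything else, score these indices".
    interface FoldEvaluator { double evaluate(int[] heldOut); }

    // Average the per-fold score over all k folds.
    static double crossValidate(int n, int k, FoldEvaluator evaluate) {
        double total = 0;
        for (int f = 0; f < k; f++) {
            total += evaluate.evaluate(validationIndices(n, k, f));
        }
        return total / k; // mean validation score across folds
    }

    public static void main(String[] args) {
        // Toy evaluator: just report the held-out fold size.
        double meanSize = crossValidate(10, 5, heldOut -> (double) heldOut.length);
        System.out.println("mean held-out size = " + meanSize);
    }
}
```

Because every sample serves as validation data exactly once, cross-validation gives a less noisy estimate of generalization than a single fixed split, at the cost of training k times.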