I've been asking myself this question for a long time. Let's say you are given a large set of time series data. Your task is to find patterns that are meaningful, or that you can use for future trend prediction.

The issue now is, how do you know for sure that the patterns you extract are valid, in the sense that they don't suffer from data snooping bias or a case of "torture-the-data-until-it-confesses"?

I can always test my hypothesis as new data comes in, but even if it can predict all the trends in the past, that doesn't mean that it will continue to do so in the future. No?

6 Answers

A good way to do this is called 'cross-validation'. The data can be divided into three disjoint sets: the training set, the test set and the validation set.

Different models are developed using the training set. A reasonable way to do this is to take different subsets of points from the training set uniformly at random (for instance, by generating 100 subsets of size 90 from a training set of 100 points) and to fit your model as you normally would. This will give you a set of models with varying ability to predict. Pick the model that gives the best prediction on the test set. (Taking random subsets has the benefit of reducing the influence of outliers.) Now, having done all this, the training set and the test set have been 'tainted' by the fact that you used them to build your model.

Therefore, finally, the model should be evaluated on the validation set. This set gives a more honest estimate of how well the model generalizes to a new set of data.

There is some sophistication involved in picking the right relative sizes of the sets and in getting more out of less data. For instance, you can look up k-fold cross-validation.
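As a rough illustration of k-fold cross-validation, here is a minimal Python sketch. The "model" here is a hypothetical toy that simply predicts the training mean; in a real application you would substitute your actual model-fitting step:

```python
import random

def k_fold_score(data, k=5):
    """Estimate out-of-sample error by k-fold cross-validation:
    split the data into k folds, hold each fold out in turn,
    fit on the remaining folds, and average the held-out errors."""
    data = list(data)
    random.shuffle(data)  # shuffle so folds are not ordered runs
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        # toy "model": predict the training mean everywhere
        prediction = sum(training) / len(training)
        fold_error = sum((x - prediction) ** 2 for x in held_out) / len(held_out)
        errors.append(fold_error)
    return sum(errors) / k
```

The averaged held-out error is a less optimistic estimate of generalization than the in-sample fit, because no point is ever used to fit the model that predicts it.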

The thing is that even if it can predict new data in the past, that doesn't mean that it will continue to do so in the future.
–
Graviton Oct 24 '09 at 1:51

I meant new data in the future, not new data in the past (which is an oxymoron).
–
Darsh Ranjan Oct 24 '09 at 3:09

But what if you have to commit a large amount of $$$ to it? Even if those predictions match the upcoming trend now, that doesn't mean they will continue to do so forever.
–
Graviton Oct 28 '09 at 8:28

Your question is a difficult one, which every scientist grapples with every day of their career. Broadly, there are two mechanisms by which hypothesis testing can be biased.

The first is the obvious mechanism of biasing the data in favor of a hypothesis. Sometimes this occurs through straightforward fraud (e.g., the South Korean stem cell findings). But more frequently this source of bias is much more subtle, involving issues such as unrecognized selection bias in the data.

The second mechanism is far more difficult to prevent and is often seen in the peer-reviewed literature: biasing the hypothesis in favor of the data. Two typical examples are overfitting the data with a large number of extra parameters to obtain a better fit, or using an inappropriate statistical model on a large dataset to yield statistical significance (e.g., fitting a time-dependent model of the speed of light on cosmological scales). Another example is confusing data analysis with hypothesis testing (e.g., low-frequency heart rate modulation research).

There is no single algorithm to prevent this second form of bias, and using general data mining and data analysis tools will almost guarantee that you overfit the data. The best practice available in the scientific field is iterative: first look for the most obvious pattern that has the simplest explanation, and test that explanation's fit against a reasonably coarse dataset. If the explanation fits, refine your data, look at how your previous explanation fails on the refined data, and propose and test a more nuanced hypothesis. Continue ad infinitum, or ad nauseam, whichever comes first. So the first step is to ask a simple question for which the extraneous factors and error sources can be controlled; don't start by analyzing very detailed data, as this will only lead you on a wild goose chase.
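The overfitting mechanism can be made concrete with a toy Python sketch. The setup is hypothetical: the data is pure noise, and a "memorize the training set" model stands in for any over-parameterized fit. The memorizing model achieves a perfect in-sample fit, yet does worse than the trivial mean predictor on fresh data:

```python
import random

random.seed(0)
# hypothetical setup: pure noise, so there is no real pattern to find
train = [random.gauss(0, 1) for _ in range(200)]
test = [random.gauss(0, 1) for _ in range(200)]

def mse(predictions, actual):
    """Mean squared error between paired predictions and observations."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)

# simple model: predict the training mean everywhere
mean = sum(train) / len(train)
simple_test_error = mse([mean] * len(test), test)

# "over-parameterized" model: memorize every training point exactly
overfit_train_error = mse(train, train)  # zero by construction
overfit_test_error = mse(train, test)    # memorized values vs. fresh noise

# the memorizing model is perfect in-sample but worse out of sample
```

This is the "torture-the-data-until-it-confesses" failure mode in miniature: a model with enough freedom can always explain the past perfectly while telling you nothing about the future.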

While I agree with all of the advice above, none of it explicitly takes the time series aspect of the data into account. If you are using models (like ARIMA) where the time dependence is explicit in the model, you cannot use standard cross-validation.

What I have done in such cases is choose a model, estimate it on the first five years of data, predict the next five years and compute a discrepancy measure, then slide the window forward and repeat.
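That rolling scheme can be sketched as follows. This is a minimal Python sketch with assumed details: the data is yearly, and a naive "predict the last observed value" persistence forecast stands in for whatever model you actually estimate (e.g., an ARIMA fit):

```python
def walk_forward_errors(series, train_window=5, horizon=5):
    """Walk-forward validation: fit on a sliding window of
    `train_window` points, forecast the next `horizon` points,
    and record the mean absolute error at each origin."""
    errors = []
    for start in range(len(series) - train_window - horizon + 1):
        train = series[start:start + train_window]
        actual = series[start + train_window:start + train_window + horizon]
        # stand-in model: naive persistence forecast (repeat last value)
        forecast = [train[-1]] * horizon
        mae = sum(abs(f - a) for f, a in zip(forecast, actual)) / horizon
        errors.append(mae)
    return errors
```

Because each forecast only ever uses data from before the forecast origin, this respects the arrow of time and gives an honest estimate of how the model would have performed had you actually been using it, which ordinary shuffled cross-validation cannot do for time series.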