Are there common procedures, prior or posterior to backtesting, to ensure that a quantitative trading strategy has real predictive power and is not just one of those things that worked in the past by pure luck? Surely if we search long enough for working strategies we will end up finding one. Even a walk-forward approach doesn't tell us anything about the strategy in itself.

Some people talk about White's Reality Check, but there is no consensus on the matter.

Testing on "out of band" data (data you didn't use to come up w/ the strategy) is a common approach. The problem "if we search long enough for working strategies we will end up finding one" is actually more serious. If you have a parameterized strategy, you can just tweak the parameters to make the strategy more accurate.
– barrycarter Feb 3 '11 at 8:35


I'm not sure what you mean by "tweak the parameters". What I do is look at the robustness of the strategy by varying the parameters around my preferred values. The problem with this approach is that it is difficult to build a proper quantitative test to accept or reject the strategy's robustness. What is the radius of the hypersphere around the preferred value that one has to look at? What performance variation should we tolerate? In the end it almost comes down to how you feel about it, which I believe is dangerous in quant trading.
– Zarbouzou Feb 3 '11 at 10:01

6 Answers

Strictly speaking, data snooping is not the same as in-sample vs out-of-sample model selection and testing; rather, it has to do with sequential or multiple tests of hypotheses based on the same data set. To quote Halbert White:

Data snooping occurs when a given set of data is used more than once for purposes of inference or model selection. When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than to any merit inherent in the method yielding the results.

Let me provide an example. Suppose that you have a time series of returns for a single asset, and a large number of candidate model families. You fit each of these models on a training data set, and then check the performance of its predictions on a hold-out sample. If the number of models is high enough, there is a non-negligible probability that the predictions provided by one model will be considered good. This has nothing to do with bias-variance trade-offs: each model may well have been fitted using cross-validation on the training set, or other in-sample criteria like AIC, BIC, or Mallows' Cp. For examples of a typical protocol and criteria, check Ch. 7 of Hastie, Tibshirani and Friedman's "The Elements of Statistical Learning".

Rather, the problem is that multiple hypothesis tests are implicitly being run at the same time. Intuitively, the criterion used to evaluate multiple models should be more stringent; a naive approach would be to apply a Bonferroni correction, but that turns out to be too stringent. That's where Benjamini-Hochberg, White, and Romano-Wolf kick in: they provide efficient criteria for model selection under multiple testing. The papers are too involved to describe here, but to get a sense of the problem I recommend starting with Benjamini-Hochberg, which is both easier to read and truly seminal.
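To make the multiple-testing point concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure applied to per-strategy p-values. The synthetic returns and the naive one-sample t-test below are illustrative assumptions only; the actual White and Romano-Wolf procedures are considerably more involved.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(p_values, alpha=0.10):
    """Boolean mask of hypotheses rejected at false discovery rate alpha (BH step-up)."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                            # p-values in ascending order
    thresholds = alpha * np.arange(1, m + 1) / m     # BH critical values k*alpha/m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                         # reject the k smallest p-values
    return reject

# Illustrative setup: 200 candidate strategies, none with any true edge.
rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0, 0.01, size=(200, 750))      # 200 strategies x ~3 years
p_vals = [stats.ttest_1samp(r, 0.0).pvalue for r in daily_returns]

print("naive 5% tests 'passed':", sum(p < 0.05 for p in p_vals))      # ~10 lucky strategies
print("BH rejections at FDR 10%:", benjamini_hochberg(p_vals).sum())  # typically 0
```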

Building an effective backtest is not significantly different than building any other kind of predictive model. The goal is to have similar behavior out of sample as you have in sample. As such, there are methodologies developed in statistics and machine learning that can be useful:

You can certainly use separate training and test datasets, but other approaches can be used as well. Two common options: cross-validation (similar to segmenting the data, but it also helps with parameter selection) and ensemble methods (combining multiple models can outperform any single one and further reduce the curve-fitting problem).
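As an illustration, here is a minimal sketch of time-ordered cross-validation using scikit-learn's TimeSeriesSplit. The lagged-return features and the ridge model are placeholder assumptions, not recommendations; the point is that every fold trains only on data that precedes its validation window, which mimics a walk-forward backtest.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.01, size=1000)                  # placeholder daily returns

# Toy features: the previous 5 days of returns; target: the current day's return.
X = np.column_stack([np.roll(returns, k) for k in range(1, 6)])[5:]
y = returns[5:]

# Each split trains only on observations that precede the validation window,
# so no future information leaks into parameter estimates.
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    print(f"fold {fold}: out-of-sample R^2 = {model.score(X[test_idx], y[test_idx]):.4f}")
```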

So a few general recommendations:

Your guiding principle should be Einstein's razor: 'Everything should be kept as simple as possible, but no simpler.' In other words, fewer degrees of freedom in your model means less chance of overfitting. In the statistics world, this can involve eliminating unnecessary parameters through a selection or regularization method.
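As an illustrative sketch (the synthetic signals below are an assumption, not real data), an L1 penalty such as the lasso is one regularization method that removes unnecessary parameters by driving uninformative coefficients to exactly zero:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n_days, n_signals = 1000, 50
signals = rng.normal(size=(n_days, n_signals))      # 50 candidate predictors
true_beta = np.zeros(n_signals)
true_beta[:3] = [0.02, -0.015, 0.01]                # only 3 signals actually matter
returns = signals @ true_beta + rng.normal(0.0, 0.01, n_days)

# LassoCV chooses the penalty strength by cross-validation and zeroes out
# coefficients that do not earn their keep, cutting degrees of freedom.
model = LassoCV(cv=5).fit(signals, returns)
print("signals retained:", np.flatnonzero(model.coef_))   # typically close to [0, 1, 2]
```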

Robustness (in every respect) is also critical. Parameter values around which the expected prediction error changes sharply are more exposed to the risk of overfitting. Similarly, if the model has no fundamental basis, it should at least be applicable to a wide range of assets.
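A rough sketch of the kind of robustness check discussed in the comments above: sweep a parameter (here a moving-average lookback on a synthetic price path, both purely illustrative) around the chosen value and verify that performance degrades smoothly rather than spiking at a single setting.

```python
import numpy as np

rng = np.random.default_rng(3)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, 2000)))   # synthetic price path

def ma_strategy_sharpe(prices, lookback):
    """Annualized Sharpe ratio of a toy long/flat moving-average rule."""
    ma = np.convolve(prices, np.ones(lookback) / lookback, mode="valid")
    px = prices[lookback - 1:]
    position = np.where(px > ma, 1.0, 0.0)           # long when price sits above its MA
    rets = np.diff(np.log(px)) * position[:-1]       # yesterday's signal, today's return
    return np.sqrt(252) * rets.mean() / rets.std()

# A genuine effect should give broadly similar Sharpe ratios across the
# neighbourhood; a single spike at one lookback is a warning sign of overfitting.
for lookback in (30, 40, 45, 50, 55, 60, 70):
    print(lookback, round(ma_strategy_sharpe(prices, lookback), 2))
```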

Lastly, this applies to any kind of model: understand your data, your model, your objectives, your assumptions, and so on. Countless mistakes have been made over time by people who did not understand the meaning, implications, and risks of their models. This includes things like execution assumptions and transaction costs. Make sure that you take everything into account. Be skeptical of your data, constantly asking what can go wrong or how the future could differ from the past. Is there any survivorship bias in your data, and if so, how can you control for it? Have you introduced any look-ahead bias?

I have seen Hansen's SPA ('Superior Predictive Ability') test and stepwise variants used for this purpose. Hansen's test is a Studentized version of White's Reality Check. The stepwise variants allow one to accept or reject the null of no predictive ability on a subset of the tested strategies while controlling the familywise error rate.
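For intuition only, here is a rough sketch of the idea behind White's Reality Check. It uses a plain i.i.d. bootstrap for brevity, whereas White and Hansen use a stationary/block bootstrap that respects serial dependence, so this is not a substitute for the published procedures.

```python
import numpy as np

rng = np.random.default_rng(4)
n_days, n_strategies, n_boot = 750, 100, 2000

# Daily excess returns of each candidate strategy over the benchmark.
# Here they are pure noise, so every strategy's true edge is zero.
excess = rng.normal(0.0, 0.01, size=(n_days, n_strategies))
observed_best = excess.mean(axis=0).max()            # best in-sample mean excess return

# Recentre so each strategy has zero mean under the null, then bootstrap the
# distribution of the *maximum* mean across all candidates.
centred = excess - excess.mean(axis=0)
boot_max = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n_days, n_days)            # i.i.d. resample of days
    boot_max[b] = centred[idx].mean(axis=0).max()

# Fraction of bootstrap maxima beating the observed best: a Reality-Check-style p-value.
# Testing the winner in isolation would look impressive; accounting for the search
# over 100 candidates usually shows the result is consistent with luck.
print("best mean:", round(observed_best, 5), "p-value:", (boot_max >= observed_best).mean())
```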

In his book, 'Evidence-Based Technical Analysis,' David Aronson discusses the overfit bias very well, although I believe his techniques for minimizing the bias may only apply to technical strategies, because they rely on Monte Carlo simulations.

I think the only method completely free of data snooping is to trade live. But the problem of data snooping can be reduced by checking how significant the backtest result is compared with what would have happened if the trades had been placed at random. Using this technique also makes it clear how easily backtesting results can be deceiving.
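A minimal sketch of that comparison, assuming a simple long/flat strategy and synthetic data: shuffle the strategy's positions many times and see where the actual backtest P&L falls in the resulting distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
returns = rng.normal(0.0002, 0.01, 1500)              # placeholder daily asset returns
positions = (rng.random(1500) < 0.4).astype(float)    # placeholder signal: long ~40% of days

actual_pnl = np.sum(returns[1:] * positions[:-1])     # trade on yesterday's signal

# Null distribution: the same number of long days, placed at random.
n_shuffles = 5000
random_pnl = np.array([np.sum(returns[1:] * rng.permutation(positions)[:-1])
                       for _ in range(n_shuffles)])

# If a large share of random trade placements match the backtest, the "edge"
# is indistinguishable from luck on this data set.
p_value = (random_pnl >= actual_pnl).mean()
print(f"actual P&L = {actual_pnl:.3f}, beaten by {p_value:.1%} of random orderings")
```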

The output of your model will be a realization of your assumptions. Shane's given you a great answer. Besides doing out-of-sample testing (i.e., calibrating on period X, then testing in period Y using only information available at the time of each trade), I would add that you should test the strategy in sub-periods. If you have a big chunk of data, break it up and see how the strategy works on each subset.
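A minimal sketch of the sub-period idea, using placeholder daily P&L and arbitrary equal-length blocks:

```python
import numpy as np

rng = np.random.default_rng(6)
strategy_returns = rng.normal(0.0003, 0.01, 2520)     # placeholder: ~10 years of daily P&L

# A strategy whose profit comes almost entirely from one or two blocks deserves
# extra suspicion: it may be fitting a regime rather than a persistent effect.
for i, block in enumerate(np.array_split(strategy_returns, 10)):
    sharpe = np.sqrt(252) * block.mean() / block.std()
    print(f"block {i}: Sharpe = {sharpe:+.2f}")
```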

Thanks for the answer, as it tackles a lot of backtesting flaws: model parsimony, overfitting, survivorship bias, look-ahead bias...
But one can still look at thousands of technical trading rules and other more sophisticated strategies, and perhaps find the few that address all of these problems. Nevertheless, we would still be left with data snooping, i.e. we have used our data set until we found a satisfactory result.

You will always have variance in your out of sample performance because the future isn't exactly like the past. There's a reason that this is hard.
– Shane Feb 3 '11 at 15:50


For reference: your answer here would generally be better as a comment on an existing answer.
– Shane Feb 3 '11 at 15:52

I think that even if one could completely eliminate the data snooping bias, that wouldn't remove the variance between backtested and true results. This is due to non-stationarity, and we cannot do anything about that. OK for the comments.
– Zarbouzou Feb 3 '11 at 15:56

Do you view data snooping as different than overfitting?
– Shane Feb 3 '11 at 16:10

Yes, I may be wrong, but I view these problems as close yet distinct. If we try to find the best parameters (with maximum likelihood, MSE, MAP) and end up fitting the predicted time series to noise instead of real "structure" or "pattern", we have overfitted. In the context of backtesting, I see data snooping as the bias we introduce by trying different rules until we find one that works.
– Zarbouzou Feb 3 '11 at 16:19