Suppose one has an idea for a short-horizon trading strategy, which we will define as having an average holding period of under 1 week and a required latency between signal calculation and execution of under 1 minute. This category includes much more than just high-frequency market-making strategies. It also includes statistical arbitrage, news-based trading, trading earnings or economic releases, cross-market arbitrage, short-term reversal/momentum, etc. Before even thinking about trading such a strategy, one would obviously want to backtest it on a sufficiently long data sample.

How much data does one need to acquire in order to be confident that the strategy "works" and is not a statistical fluke? I don't mean confident enough to bet the ranch, but confident enough to assign significant additional resources to forward testing or trading a relatively small amount of capital.

Acquiring data (and not just market price data) could be very expensive or impossible for some signals, such as those based on newer economic or financial time series. As such, this question is important both for deciding which strategies to investigate and for estimating how much to invest in data acquisition.

A complete answer should depend on the expected Information Ratio of the strategy, as a low IR strategy would take a much longer sample to distinguish from noise.
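To make the dependence on IR concrete, here is a minimal sketch under a simplifying assumption that is not part of the question: returns are i.i.d., so the t-statistic of the mean return is roughly the annualized IR times the square root of the number of years. The function name `years_needed` and the 1.96 threshold are illustrative choices.

```python
def years_needed(annual_ir, z=1.96):
    """Years of data for the mean return to sit z standard errors above zero.

    Assumes i.i.d. returns, so t-stat ~= annual_ir * sqrt(years).
    """
    return (z / annual_ir) ** 2

# A low-IR strategy needs far more history than a high-IR one.
for ir in (0.5, 1.0, 2.0, 3.0):
    print(f"IR {ir:.1f}: about {years_needed(ir):.2f} years")
```

Under this rough criterion the required history scales with 1/IR², so halving the IR quadruples the sample length.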

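If $\Delta$ is the distance you want the confidence interval around the mean return to span (its full width, $2\alpha s/\sqrt{n}$) and $\alpha$ is the critical value for your confidence level, then the required sample size is

$$n = \frac{4\alpha^2 s^2}{\Delta^2}$$

where $s$ is the measured standard deviation, which you already have from your IR calculation.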

High-frequency Example

I was testing a market-making model recently that was expected to return a couple of basis points per trade, and I wanted to be confident that my returns were really positive (i.e., not a fluke). So I chose a distance of 3 bps $(\Delta = 0.0003)$. My sample's measured standard deviation was 45 bps $(s = 0.0045)$. For a confidence level of 95% $(\alpha = 1.96)$, my sample size needs to be $n = 3458$ trades. I would have picked a tighter distance if I had been simulating this model, but I was trading live and I couldn't be too choosy with money on the line.
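The arithmetic in this example can be checked with a short sketch. It assumes the sample-size rule is $n = 4\alpha^2 s^2/\Delta^2$ (a confidence interval of full width $\Delta$), which reproduces the figure quoted above; the function name `required_sample_size` is hypothetical.

```python
import math

def required_sample_size(s, delta, z=1.96):
    """Observations needed for a confidence interval of full width `delta`.

    Assumes n = 4 * z^2 * s^2 / delta^2, rounded up to a whole observation.
    """
    return math.ceil(4 * z**2 * s**2 / delta**2)

# High-frequency example: s = 45 bps, distance = 3 bps, 95% confidence.
n_trades = required_sample_size(s=0.0045, delta=0.0003)
print(n_trades)  # 3458
```

Because $n$ grows with $1/\Delta^2$, tightening the distance from 3 bps to 1.5 bps would roughly quadruple the number of trades required.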

Low-frequency Example

I imagine that for a low-frequency model that was expected to return 1.5% per month, I'd want maybe 1% as the distance $(\Delta = 0.01)$. If the hoped-for Sharpe ratio were 3, then the standard deviation would be 1.7% $(s = 0.017)$, which I came up with by backing out the monthly returns. So for a confidence level of 95% $(\alpha = 1.96)$, I'd need 45 months of data.
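One way to arrive at the 1.7% figure, sketched below, is to treat the Sharpe ratio as annualized with a zero risk-free rate, annualize the monthly return by simple multiplication by 12, and scale the volatility back down by $\sqrt{12}$. These conventions are assumptions rather than something the answer states, and the function names are hypothetical. The sample-size rule is again taken to be $n = 4\alpha^2 s^2/\Delta^2$.

```python
import math

def monthly_std_from_sharpe(monthly_return, annual_sharpe, periods=12):
    """Back out the monthly standard deviation from an annualized Sharpe ratio.

    Assumes a zero risk-free rate and simple (non-compounded) annualization.
    """
    annual_return = monthly_return * periods
    annual_std = annual_return / annual_sharpe
    return annual_std / math.sqrt(periods)

def required_sample_size(s, delta, z=1.96):
    """Observations needed for a confidence interval of full width `delta`."""
    return math.ceil(4 * z**2 * s**2 / delta**2)

s = monthly_std_from_sharpe(0.015, 3.0)   # roughly 0.0173
s_rounded = round(s, 3)                   # 0.017, the value used in the text

print(required_sample_size(s_rounded, 0.01))  # 45
print(required_sample_size(s, 0.01))          # 47
```

The 45 months quoted above corresponds to the rounded value $s = 0.017$; carrying the unrounded standard deviation through gives 47 months, so the result is sensitive to rounding at this sample size.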

Good answer. Can you post the chat transcript here as well, for completeness? I get a "page not found" error for the link above.
– Suminda Sirinath Salpitikorala, Jan 9 '12 at 3:49

@SumindaSirinathSalpitikorala I get a "This room has been automatically deleted for inactivity" error. There isn't anything for "completeness" anyway; Tal and I had a back-and-forth about examples that ultimately became the answer you see now. Feel free to look at the edit history to see how different my first draft was.
– chrisaycock, Jan 9 '12 at 13:46

How exactly did you get $s = 1.7\%$ from $r = 1.5\%$ and $SR = 3$?
– eyaler, Jan 27 '12 at 17:38

I would also note that you need to watch out for correlations between data points. (E.g., if you have a data point showing this works for oil company X, another data point for oil company Y may not actually count as separate.)
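As a rough illustration of why correlated data points count for less, here is a sketch using the standard result for equally correlated observations: the variance of the mean of $n$ observations with common pairwise correlation $\rho$ is $\sigma^2 (1 + (n-1)\rho)/n$, which gives an effective sample size of $n / (1 + (n-1)\rho)$. The equal-correlation setup and the function name `effective_sample_size` are simplifying assumptions, not something taken from this answer.

```python
def effective_sample_size(n, rho):
    """Effective number of independent observations among n correlated ones.

    Assumes every pair of observations has the same correlation rho >= 0.
    """
    return n / (1 + (n - 1) * rho)

# 100 trades across highly correlated names carry far less information
# than 100 independent trades.
for rho in (0.0, 0.1, 0.5):
    print(f"rho={rho:.1f}: n_eff={effective_sample_size(100, rho):.1f}")
```

Under this assumption, a backtest would need enough raw observations for the effective count, not the raw count, to reach the required sample size.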

If you are looking at 5-day holding periods, why not just grab all the EOD data that you can as well? EOD data is obviously not tradeable, but it can be used as a sanity check for longer-term strategy returns when you do not have the intraday data.

Hi Michael RB, welcome to quant.SE and thanks for contributing an answer. Do you have any ideas on how the correlation reduces the confidence? As for EOD data, of course it will be used as appropriate, but the question here is how much intraday data I need for a theoretical strategy.
– Tal Fishman, Sep 14 '11 at 0:40

Honestly, it was just an example. For equities, you might want to remove sector returns, market returns, etc.
– Michael WS, Sep 14 '11 at 1:36