Note: [2] In fact, machine learning and artificial intelligence algorithms can be trained to scan billions of data signals in order to design millions, if not billions, of different virtual trading strategies. AI equity research robots already track and provide views on asset prices.

Note: [3] Bailey & López de Prado (2014) and Harvey & Liu (2014) discuss ways to adjust Sharpe ratios and p-values based on the number of trials. See also Barras et al. (2010) for a discussion of false discoveries in mutual fund performance.
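To make the trial-adjustment idea concrete, the following is a minimal sketch (not the procedure of any of the cited papers): convert a Sharpe ratio into a one-sided p-value under an i.i.d.-returns normal approximation, then apply a simple Bonferroni correction for the number of strategies tried. The function names and parameters are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def sharpe_pvalue(sr, n_obs):
    """One-sided p-value for H0: SR <= 0.

    Assumes i.i.d. returns, so the estimated (per-period) Sharpe
    ratio is approximately normal with standard error 1/sqrt(n_obs).
    """
    z = sr * np.sqrt(n_obs)
    return 1.0 - norm.cdf(z)

def adjusted_pvalue(sr, n_obs, n_trials):
    """Bonferroni adjustment for the number of strategies tried."""
    return min(1.0, n_trials * sharpe_pvalue(sr, n_obs))

# A monthly Sharpe of 0.2 over 120 months looks significant in
# isolation, but not after accounting for 100 trials.
p_single = sharpe_pvalue(0.2, 120)
p_multi = adjusted_pvalue(0.2, 120, n_trials=100)
```

With many trials the adjusted p-value is pushed toward 1, illustrating why the best backtest among many strategies is weak evidence of skill.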

Note: [4] The data generating process (DGP) in the following simulations is a simplified yet standard assumption in the literature (e.g., Harvey & Liu (2014)).

Note: [5] The Python code to reproduce the results is available on the authors’ webpage.

Note: [6] Results remain qualitatively similar under a loss function with squared errors.

Note: [7] We first show results for the MCS specification using the T_{Range,M} test statistic, a moving-block bootstrap of block length ℓ = 5, and B = 500 bootstrap samples. Results are similar under alternative specifications of the T_{Range,M} statistic. However, we find somewhat inconsistent results using the T_{max,M} test statistic. See Section 3.3.7.
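The moving-block bootstrap used here can be sketched in a few lines: resample a time series by concatenating overlapping blocks of length ℓ drawn uniformly at random, so that serial dependence within each block is preserved. This is a generic sketch of the resampling step only, not the full MCS procedure; the function name is our own.

```python
import numpy as np

def moving_block_bootstrap(x, block_len=5, n_boot=500, seed=0):
    """Moving-block bootstrap resamples of a 1-D time series.

    Draws block start points uniformly at random, concatenates the
    blocks, and truncates each resample to the original length T.
    """
    rng = np.random.default_rng(seed)
    T = len(x)
    n_blocks = int(np.ceil(T / block_len))
    starts = rng.integers(0, T - block_len + 1, size=(n_boot, n_blocks))
    # Each start expands to block_len consecutive indices.
    idx = (starts[..., None] + np.arange(block_len)).reshape(n_boot, -1)[:, :T]
    return x[idx]  # shape: (n_boot, T)
```

In the MCS context, the resamples are used to approximate the distribution of the test statistic under the null of equal predictive ability.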

Note: [8] The DGP assumes M independent strategies, although we note that in practice some will tend to be correlated. Correlated returns would reduce the variance of d_{ij,t} ≡ L_{i,t} − L_{j,t}, and therefore reduce the sample size required in MCS to identify the superior model.
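The variance-reduction point follows from Var(L_i − L_j) = Var(L_i) + Var(L_j) − 2 Cov(L_i, L_j), which equals 2σ²(1 − ρ) when the two losses share variance σ² and correlation ρ. A small simulation (our own illustrative check, not from the paper) confirms this:

```python
import numpy as np

# Two loss series with equal variance sigma2 and correlation rho.
# Theory: Var(L_i - L_j) = 2 * sigma2 * (1 - rho), so higher
# correlation shrinks the variance of the loss differential.
rng = np.random.default_rng(0)
sigma2, rho, T = 1.0, 0.8, 200_000
cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
L = rng.multivariate_normal([0.0, 0.0], cov, size=T)
d = L[:, 0] - L[:, 1]
print(d.var(), 2 * sigma2 * (1 - rho))  # both close to 0.4
```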

Note: [9] Such Sharpe ratios are rarely seen in practice. As a reference, the S&P 500 Sharpe ratio is estimated at 0.38 during 1996–2014; even the best-performing hedge funds typically have average Sharpe ratios below 2 (Titman & Tiu (2010), Getmansky et al. (2015)).

Note: [11] Holm’s method stops once the first null hypothesis cannot be rejected. Holm’s method is less strict than Bonferroni’s, which inflates all p-values equally. In fact, p_m^{Holm} ≤ p_m^{Bonf} for all m ∈ M.
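Both adjustments are easy to sketch; the following is a generic implementation (function names are our own) that makes the inequality between the two sets of adjusted p-values visible:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values.

    The m-th smallest p-value is multiplied by (M - m + 1); a running
    maximum over the sorted sequence enforces monotonicity.
    """
    p = np.asarray(pvals, dtype=float)
    M = len(p)
    order = np.argsort(p)
    adj = np.empty(M)
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (M - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj

def bonferroni_adjust(pvals):
    """Bonferroni: every p-value is multiplied by M."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(1.0, len(p) * p)
```

Because Holm multiplies the m-th smallest p-value by M − m + 1 rather than by M, each Holm-adjusted p-value is at most its Bonferroni counterpart.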

Note: [12] Consistent with these results, it has come to our attention that the T_{max,M} statistic, and therefore the elimination rule e_{max,M}, is not recommended in practice. See the Corrigendum to Hansen et al. (2011).

Abstract: Recent advances in machine learning and artificial intelligence, together with the availability of billions of high-frequency data signals, have made model selection a challenging and pressing need. However, most of the model selection methods available in modern finance are subject to backtest overfitting: the probability of selecting a financial strategy that outperforms during the backtest but underperforms in practice. We evaluate the performance of the novel model confidence set (MCS) introduced in Hansen et al. (2011a) in a simple machine learning trading strategy problem. We find that MCS is not robust to multiple testing and that it requires a very high signal-to-noise ratio to be utilizable. More generally, we raise awareness of the limitations of model selection in finance.