Let's imagine that we have two separate models, both used to forecast the return for the next period. Both models are re-estimated every day, and both output a probability distribution.
How can we evaluate whether one model has forecast the distribution of future returns better than the other?

My first intuition was simply to compute, for each date, the probability that each model assigned to the ex-post event, and then sum these probabilities for each model over the different periods. The model with the highest sum would be the better one.
But I feel that this approach is not really clean and lacks robustness. Any idea on how to improve my methodology?

2 Answers

If you assume your returns are independent (your models might relax this assumption), then the two models, $Q_1$ and $Q_2$, assign probability distributions to the return on any given day $i$: $q_1^i(r^i)$ and $q_2^i(r^i)$.

Presumably you are interested in the model that more accurately predicts the state of the market over the subsequent returns, i.e. you are interested in the likelihood each model assigns to the realised sequence of returns:

$$Q_k = \prod_i q_k^i(r^i), \qquad k \in \{1, 2\}.$$

This should be familiar territory if you are used to maximum likelihood estimation. Obviously you would like to choose the model with the highest likelihood. Commonly, to avoid floating-point underflow, the maximum of the log is taken instead; since the log is a monotonic function this yields the same ranking, so consider maximising

$$\log Q_k = \sum_i \log q_k^i(r^i).$$

In this case this is also equivalent to minimising the cross entropy between $p$, the true probability distribution (which puts all its mass on the observed state), and either model $q_1$ or $q_2$. If you do not assume independence of returns then you have a slightly more complicated problem; post more details if that is the case.
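
As a concrete illustration, here is a minimal sketch of this comparison in Python, assuming (hypothetically) that each model issues a Gaussian predictive density per day; the returns and model parameters below are made-up placeholders:

```python
# A minimal sketch of the log-score comparison. The Gaussian predictive
# densities and all numbers below are hypothetical placeholders.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical data: 250 realised daily returns r^i.
realised = rng.normal(0.0, 0.01, size=250)

# Hypothetical daily forecasts: each model issues a mean and a volatility.
model1 = {"mu": np.zeros(250), "sigma": np.full(250, 0.010)}
model2 = {"mu": np.zeros(250), "sigma": np.full(250, 0.015)}

def log_likelihood(returns, mu, sigma):
    """log Q = sum_i log q^i(r^i) for Gaussian densities q^i."""
    return norm.logpdf(returns, loc=mu, scale=sigma).sum()

ll1 = log_likelihood(realised, model1["mu"], model1["sigma"])
ll2 = log_likelihood(realised, model2["mu"], model2["sigma"])
print(f"log Q_1 = {ll1:.2f}, log Q_2 = {ll2:.2f}")
```

Whichever model attains the higher log-likelihood (equivalently, the lower cross entropy against the observed outcomes) produced the better density forecasts over the sample.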

Just a thought

If your models are uncorrelated (or have limited correlation) you may be able to improve your accuracy by using an ensemble. Consider a third model, $Q_3$, whose probability distribution on day $i$ is the mixture

$$q_3^i(r^i) = \alpha\, q_1^i(r^i) + (1-\alpha)\, q_2^i(r^i),$$

and ideally you would like to find

$$\max_{\alpha} \; \log Q_3.$$

This will be at least as good as the better of $Q_1$ and $Q_2$ (take $\alpha=1$ or $\alpha=0$), but of course you need to cross-validate $\alpha$, otherwise you will just be overfitting this hyperparameter to your observed data.
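
A minimal sketch of that cross-validation, assuming the per-day densities evaluated at the realised returns, $q_1^i(r^i)$ and $q_2^i(r^i)$, are already available as arrays (random placeholders here); $\alpha$ is fitted on a training window and the mixture is scored out of sample:

```python
# Sketch: fit alpha on a training window, score the mixture out of sample.
# q1 and q2 stand for the per-day densities q_1^i(r^i), q_2^i(r^i) evaluated
# at the realised returns; random placeholders are used here.
import numpy as np

rng = np.random.default_rng(1)
q1 = rng.uniform(0.5, 2.0, size=250)
q2 = rng.uniform(0.5, 2.0, size=250)

def log_q3(alpha, q1, q2):
    """log Q_3 = sum_i log(alpha * q_1^i + (1 - alpha) * q_2^i)."""
    return np.log(alpha * q1 + (1.0 - alpha) * q2).sum()

train, test = slice(0, 200), slice(200, 250)

# A grid search is enough for a single bounded hyperparameter.
grid = np.linspace(0.0, 1.0, 101)
alpha_hat = max(grid, key=lambda a: log_q3(a, q1[train], q2[train]))

print(f"alpha* = {alpha_hat:.2f}")
print(f"out-of-sample log Q_3 = {log_q3(alpha_hat, q1[test], q2[test]):.2f}")
```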

Your feeling that there is more to the problem than adding up probabilities is very justified. To give you the bad news first: your problem as stated has no general solution. Since probability distributions have many degrees of freedom, there is no universal way to compare them. Practically speaking, your two models may be good or bad in two different, non-comparable ways. E.g. your first model might get the mean right but have tails that are too light, so it misses the extreme market movements, while the other is spot on in predicting market crashes but gets confused in the quieter times between crashes.

This is why any serious attempt at quality assessment must focus on the ultimate purpose of your models. For example, if you would like to use your prediction models to inform trading strategies to get rich quick, it would be easier to assess the performance of the trading strategies instead of the quality of your predictive distributions. Of course the ugly problem of non-comparability (a.k.a. the risk-return trade-off) will rear its head again, but at least you are clear about the target function you are interested in.
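
For instance, a minimal sketch of such purpose-driven evaluation might turn each model's predictive mean and volatility into daily positions and compare realised strategy performance; the mean-over-variance sizing rule and all numbers below are assumptions made purely for illustration:

```python
# Sketch: judge each model by the strategy it induces, not by its density.
# The mean/variance position-sizing rule and all inputs are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
realised = rng.normal(0.0005, 0.01, size=250)      # realised daily returns

# Hypothetical daily forecasts (mean, volatility) from each model.
mu1, sigma1 = rng.normal(0.0005, 0.002, 250), np.full(250, 0.010)
mu2, sigma2 = rng.normal(0.0005, 0.004, 250), np.full(250, 0.012)

def annualised_sharpe(mu, sigma, realised):
    """Size each day's position as mu/sigma^2, then annualise the Sharpe ratio."""
    pnl = (mu / sigma**2) * realised
    return np.sqrt(252) * pnl.mean() / pnl.std()

print(f"Sharpe, model 1: {annualised_sharpe(mu1, sigma1, realised):.2f}")
print(f"Sharpe, model 2: {annualised_sharpe(mu2, sigma2, realised):.2f}")
```

Even then, a single number like the Sharpe ratio hides the risk-return trade-off mentioned above, so the choice of target function remains a judgment call.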

The people seriously interested in this kind of problem are not the financial-market crowd but weather forecasters. Start with this for further reading. A textbook treatment of the topic can be found in Wilks, "Statistical Methods in the Atmospheric Sciences", Chapter 8: Forecast Verification.