When and How to Use "Best Fit Model" in Your Statistical Forecasting Suite?

Most demand planning applications, such as SAP APO, i2, Manugistics and Demantra, offer statistical forecasting as one of their major differentiating features. Statistical forecasting generates forecasts for future periods based on the history data provided. These suites offer a large number of algorithms, sometimes as many as 30 to 40 different algorithms / methods / forecasting strategies (all different terms for the same thing).

Statistical forecasting is a complex, not-so-easy-to-understand feature, and the more algorithms on offer, the more confusion they create in the minds of end users. To tackle this problem, almost every demand planning suite has a "Best Fit" algorithm, which is supposed to be a superset of all the others: when a planner applies it, all possible combinations of all algorithms are run in the background, and the one that gives the best fit on the history data is chosen to generate the forecast. Many call this the "algorithm for dummies", and it is generally those who do not understand the statistics behind the algorithms who use this option to save themselves the trouble. However, in 9 out of 10 cases this algorithm generates a forecast that is wrong and cannot be used for planning purposes. Why does this happen?

To unravel this question, let us break down what the Best Fit algorithm does in the background. First, it generates every possible parameter combination for each algorithm on offer. For example, if the Holt-Winters algorithm is on offer, it creates all possible combinations of alpha, beta and gamma values. (For those who do not know, alpha, beta and gamma are the smoothing parameters used by the Holt-Winters algorithm; their values must lie between 0 and 1.) SAP APO offers this algorithm with the additional restriction that alpha, beta and gamma can only be multiples of 0.05, so each of them can take only 20 possible values: 0.05, 0.10, 0.15 and so on up to 1.00. That gives 20 * 20 * 20 = 8,000 possible combinations of alpha, beta and gamma values. In some other suites the number of combinations can reach one million. Best Fit generates a forecast for each of these combinations and selects the one with the lowest "forecasting error". Now let us unravel how "forecasting error" is defined.
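The scale of this search is easy to sketch. Here is a minimal illustration of the parameter grid described above, assuming the 0.05-step restriction mentioned for SAP APO; the enumeration itself is generic Python, not any vendor's actual code:

```python
from itertools import product

# Parameter grid described above: alpha, beta and gamma each range over
# 0.05, 0.10, ..., 1.00 (multiples of 0.05), giving 20 values apiece.
step = 0.05
values = [round(step * i, 2) for i in range(1, 21)]  # 0.05 .. 1.00

# Every (alpha, beta, gamma) combination the Best Fit search would try.
grid = list(product(values, values, values))

print(len(values))  # 20 values per parameter
print(len(grid))    # 20 * 20 * 20 = 8000 combinations
```

Each of these 8,000 combinations requires fitting the model and scoring it against history, which is why the search becomes expensive in suites where the grid runs into the millions.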

All these forecasting suites offer the standard forecast error measures such as MAPE, MAD and RMSE, and the user chooses one of them as the selection criterion for Best Fit. For example, if the user chooses MAPE, the Best Fit algorithm will go through all the combinations and select the one with the lowest MAPE. Looks great, but as the famous saying goes, "the devil lies in the detail". How is this MAPE calculated?
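For reference, the three error measures mentioned can be sketched in a few lines of plain Python. These are the textbook definitions, not any particular suite's implementation, and the sample numbers are made up for illustration:

```python
def mape(actuals, forecasts):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actuals, forecasts)) / len(actuals)

def mad(actuals, forecasts):
    """Mean Absolute Deviation."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

def rmse(actuals, forecasts):
    """Root Mean Squared Error."""
    return (sum((a - f) ** 2
                for a, f in zip(actuals, forecasts)) / len(actuals)) ** 0.5

history  = [100, 120, 90, 110]   # illustrative actuals
forecast = [105, 115, 100, 100]  # illustrative forecasts

print(mad(history, forecast))    # 7.5
```

Note that all three measures summarise the same forecast-vs-actual differences, just with different weightings, which matters for the failure modes discussed later.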

At the very least, calculating MAPE requires a forecast number and a history number: take the difference between forecast and history, take its absolute value, divide by the history number and multiply by 100. The Best Fit algorithm works as follows. Take the Holt-Winters model, which requires a minimum of 27 history data points to generate its first forecast. The first 27 data points in history are used to generate the forecast for the 28th data point, and this forecast together with the actual history for the 28th point goes into the MAPE calculation. Similarly, the first 28 data points are used to forecast the 29th point, the first 29 points to forecast the 30th, and so on. So if you have 72 data points in history, you effectively have 45 (72 - 27) data points for which you have both a forecast and a history value, and MAPE is calculated over these 45 points. As pointed out previously, the combination with the lowest MAPE is selected by Best Fit. I see the following problems with this selection mechanism:
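The expanding-window scheme described above can be sketched as follows. A full Holt-Winters implementation is out of scope here, so a naive "repeat the last value" forecaster stands in for it; the 27-point minimum is taken from the article, and the function names are invented for this illustration:

```python
def rolling_origin_mape(history, forecast_one_step, min_points=27):
    """Expanding-window evaluation: fit on the first k points,
    forecast point k+1, then grow the window by one and repeat."""
    errors = []
    for k in range(min_points, len(history)):
        fitted = forecast_one_step(history[:k])  # forecast for point k
        actual = history[k]
        errors.append(abs(actual - fitted) / abs(actual) * 100)
    return sum(errors) / len(errors)

def naive(past):
    """Stand-in forecaster: repeat the last observation."""
    return past[-1]

history = list(range(1, 73))  # 72 synthetic data points
score = rolling_origin_mape(history, naive, min_points=27)
# With 72 points and a 27-point minimum, exactly 45 points are scored.
```

The key property to notice is that the early evaluation points are forecast from the least history, which leads directly to the first problem below.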

Quality of the forecast in the initial periods will be very poor, as it is based on the bare minimum of history. As pointed out above, only the first 27 points of history are used to generate the forecast for the 28th point. Since that forecast is based on the bare minimum number of data points, its quality will be very poor, and consequently the MAPE contributions calculated from these early points will be erroneous.

Every error measure is susceptible to problems. For example, the MAPE calculation fails if the history data contains zero values or values very close to zero; RMSE fails if the history has even one outlier data point; MAD fails if the standard deviation of the time series is high. If the history data has these problems, the chosen error measure will fail, and consequently the Best Fit model selection based on that measure will fail.
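The first two of these failure modes are easy to demonstrate with synthetic numbers (the data values here are made up purely for illustration):

```python
def mape(actuals, forecasts):
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actuals, forecasts)) / len(actuals)

def rmse(actuals, forecasts):
    return (sum((a - f) ** 2
                for a, f in zip(actuals, forecasts)) / len(actuals)) ** 0.5

# Near-zero actuals inflate MAPE: a small absolute miss on a tiny
# value becomes a huge percentage error that swamps everything else.
steady = [100, 100, 100, 0.1]
fcst   = [ 98, 102,  99, 5.0]
print(mape(steady, fcst))  # dominated by the 0.1 point, well over 1000%

# One outlier dominates RMSE because the errors are squared.
clean   = [100, 100, 100, 100]
outlier = [100, 100, 100, 1000]
fc      = [100, 100, 100, 100]
print(rmse(clean, fc), rmse(outlier, fc))  # 0.0 vs 450.0
```

A Best Fit search driven by either measure on such data is effectively being steered by one or two pathological points rather than by overall fit.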

Sectional fitment of the model is another problem. For example, with 72 data points of history, it may happen that the model selected by Best Fit fits the initial part of the history very well but fails in recent history. The overall MAPE can still be low for this combination, so Best Fit will still choose it.
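A small synthetic example of this sectional-fit problem (the numbers are invented for illustration): a model that fits 60 of 72 points perfectly but misses the last 12 by 30% still posts a low overall MAPE.

```python
def mape(actuals, forecasts):
    return 100.0 * sum(abs(a - f) / abs(a)
                       for a, f in zip(actuals, forecasts)) / len(actuals)

# Flat history: the model tracks early history exactly but drifts
# by 30% over the most recent 12 periods.
actuals   = [100] * 60 + [100] * 12
forecasts = [100] * 60 + [130] * 12

overall = mape(actuals, forecasts)              # 5.0 over all 72 points
recent  = mape(actuals[-12:], forecasts[-12:])  # 30.0 over recent history
```

An overall MAPE of 5% looks excellent, yet the model is the worst possible choice for forecasting forward, since the recent periods are the ones the future will resemble.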

Stability of the best fit model is very low. Every month a new history data point is added, and it may happen that last month's best fit model is discarded and a new one chosen. This happens because of the error measure problems described above.

With all the above problems, using the Best Fit model becomes very complicated and requires a deep understanding of the underlying statistical principles. It is not the algorithm for dummies; on the contrary, it is an algorithm for statistical experts. So the next time you are not happy with the Best Fit algorithm, you know the reason: you have to ramp up your statistical knowledge and better understand the selection mechanism that Best Fit uses.

Comments

Nikhill - interesting article. For the record, Oracle Demantra is NOT "best fit": it uses a technique called Bayesian Markov. Thought I should clarify, since the first paragraph seems to lump it in with the other "best fit" products.

While Demantra is not a best fit "per se", it does fit models during the first step. It then weights the various models that were fit to get an averaged forecast. The point here is that it is not modeling, but fitting.

Many of the points raised are very valid. Most can be overcome with 'best practice', but in practice this is seldom done.
Some specific points:
Error measures: different measures don't 'fail', but they will give different answers; the answers are unlikely to be 'significantly' different, so this does not matter.
It is, however, critical that the error measure allows for the complexity of the model. Clearly a model with, say, 14 parameters (level, trend and 12 seasonal factors) should fit history better than a 1-parameter model (an average). When using the same data to calculate the model parameters as to estimate the error, this MUST be taken into account.
The one case where it matters desperately is where the underlying model changes (flat vs trend vs seasonal vs trend + seasonal), and in these circumstances Occam's razor should be applied: use the simplest model unless there is a significant improvement from using the more complicated model.
Significant: a test for reduction in the standard deviation of forecast errors is a reasonable way to go.
Demantra uses a so-called Bayesian combination of all the forecasts it tried (more weight to forecasts with better performance). While not the same as best fit, it suffers from very similar issues.
But most of all, remember the warning from our financial colleagues on shares and funds: a good historical record is no guarantee of future performance.