Averaging Probability Forecasts: Back to the Future

Abstract

The use and aggregation of probability forecasts in practice are on the rise. In this position piece, we explore some recent, and not so recent, developments concerning the use of probability forecasts in decision-making. Despite these advances, challenges still exist. We expand on some important challenges such as miscalibration, dependence among forecasters, and selecting an appropriate evaluation measure, while connecting the processes of aggregating and evaluating forecasts to decision-making. Through three important applications from the domains of meteorology, economics, and political science, we illustrate state-of-the-art usage of probability forecasts: how they are aggregated, evaluated, and communicated to stakeholders. We expect to see greater use and aggregation of probability forecasts, especially given developments in statistical modeling, machine learning, and expert forecasting; the popularity of forecasting competitions; and the increased reporting of probabilities in the media. Our vision is that increased exposure to and improved visualizations of probability forecasts will enhance the public’s understanding of probabilities and how they can contribute to better decisions.

Many organizations face critical decisions that rely on forecasts of binary events. In these situations, organizations often gather forecasts from multiple experts or models and average those forecasts to produce a single aggregate forecast. Because the average forecast is known to be under-confident, methods have been proposed that create an aggregate forecast more extreme than the average forecast. But is it always appropriate to extremize the average forecast? And if not, when is it appropriate to anti-extremize (i.e., to make the aggregate forecast less extreme)? To answer these questions, we introduce a class of optimal aggregators. These aggregators are Bayesian ensembles because they follow from a Bayesian model of the underlying information experts have. Each ensemble is a generalized additive model of experts' probabilities that first transforms the experts' probabilities into their corresponding information states, then linearly combines these information states, and finally transforms the combined information states back into the probability space. Analytically, we find that these optimal aggregators do not always extremize the average forecast, and when they do, they can run counter to existing methods. On two publicly available datasets, we demonstrate that these new ensembles are easily fit to real forecast data and are more accurate than existing methods.
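The transform–combine–back-transform structure of these ensembles can be sketched in a few lines. The sketch below is illustrative only: it assumes a probit link for mapping probabilities to information states, equal expert weights by default, and a single scaling parameter `alpha` (values above 1 extremize the combined state, values below 1 anti-extremize); the function name and parameters are hypothetical, not the paper's fitted model.

```python
from statistics import NormalDist

def probit_ensemble(probs, weights=None, alpha=1.0):
    """Aggregate expert probabilities with a probit-linked additive model.

    Each probability p is mapped to an 'information state' z = Phi^{-1}(p),
    the states are linearly combined, scaled by alpha (> 1 extremizes,
    < 1 anti-extremizes), and mapped back to a probability via Phi.
    """
    nd = NormalDist()
    z = [nd.inv_cdf(p) for p in probs]
    if weights is None:
        weights = [1.0 / len(z)] * len(z)
    combined = alpha * sum(w * zi for w, zi in zip(weights, z))
    return nd.cdf(combined)

experts = [0.6, 0.7, 0.8]
simple_avg = sum(experts) / len(experts)            # plain average: 0.70
probit_avg = probit_ensemble(experts)               # slightly more extreme
extremized = probit_ensemble(experts, alpha=1.5)    # pushed further from 0.5
```

Even with equal weights and `alpha = 1`, the back-transformed combination differs from the simple average, which is one way to see why averaging in probability space is not generally optimal.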

Problem definition: In collaboration with Heathrow Airport, we develop a predictive system that generates quantile forecasts of transfer passengers’ connection times. Sampling from the distribution of individual passengers’ connection times, the system also produces quantile forecasts for the number of passengers arriving at the immigration and security areas.
Academic/Practical relevance: Airports and airlines have been challenged to improve decision-making by producing accurate forecasts in real time. Our work is the first to apply machine learning to produce real-time quantile forecasts in an airport setting. We focus on passengers’ connecting journeys, which have been studied by only a few researchers. Better forecasts of these journeys can improve the passenger experience and airport resource deployment.
Methodology: The predictive model developed is based on a regression tree combined with copula-based simulations. We generalize the tree method to predict complete distributions, moving beyond point forecasts. To derive insights from the tree, we introduce the concept of a stable tree that can be summarized by its key variables’ splits.
Results: We identify seven key factors that affect passengers’ connection times, which divide passengers into 16 segments. We find that incorporating correlations among the connection times of passengers arriving on the same flight improves the forecasts of arrivals at the immigration and security areas. Compared with several benchmarks, our model is more accurate in both point forecasting and quantile forecasting.
Managerial implications: Our predictive system produces accurate forecasts frequently and in real time. With these forecasts, an airport’s operations team can make data-driven decisions, identify late-connecting passengers, and assist them in making their connections. The airport can also update its resourcing plans based on the predicted passenger arrivals. Our approach can be generalized to other domains, such as passenger flow in rail networks or patient flow in hospitals.
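The copula step described above, turning individual connection-time distributions into a distribution of arrival counts, can be illustrated with a simple one-factor Gaussian copula. This is a stylized sketch, not Heathrow's deployed model: the equicorrelation `rho`, the 60-minute window, and the normal marginal are all hypothetical stand-ins for the tree-based marginals the paper actually uses.

```python
import random
from statistics import NormalDist

def simulate_arrivals(n_passengers, marginal_inv_cdf, rho=0.4,
                      window=60.0, n_sims=2000, seed=1):
    """One-factor Gaussian copula sketch: draw correlated connection times
    for passengers on one flight, then count how many reach immigration
    within `window` minutes. Returns the sorted per-simulation counts."""
    rng = random.Random(seed)
    nd = NormalDist()
    counts = []
    for _ in range(n_sims):
        common = rng.gauss(0.0, 1.0)  # flight-level shock shared by all passengers
        n_in_window = 0
        for _ in range(n_passengers):
            z = (rho ** 0.5) * common + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
            u = min(max(nd.cdf(z), 1e-12), 1.0 - 1e-12)  # clamp for inv_cdf
            if marginal_inv_cdf(u) <= window:
                n_in_window += 1
        counts.append(n_in_window)
    counts.sort()
    return counts

# Hypothetical marginal: connection times roughly normal, mean 50 min, sd 15.
marginal = NormalDist(50.0, 15.0).inv_cdf
counts = simulate_arrivals(n_passengers=30, marginal_inv_cdf=marginal)
q90 = counts[int(0.9 * len(counts))]  # 90th-percentile forecast of arrivals
```

Setting `rho = 0` recovers independent passengers; positive `rho` fattens the tails of the arrival-count distribution, which is the effect the Results paragraph attributes to modeling same-flight correlation.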

We introduce an exponential smoothing model that a manager can use to forecast demand for a new product or service. The model has five features that make it suitable for accurately forecasting product life cycles at scale. First, the trend in our model follows the density of a new distribution called the tilted-Gompertz distribution. This distribution can capture the wide range of skewed diffusions commonly found in practice—diffusions of innovations described as having “extra-Bass” skew. Second, the model's parameters can be updated via exponential smoothing; therefore, the model can react to local changes in the environment. It is the first exponential smoothing model to incorporate a life-cycle trend. Third, the model relies on multiplicative errors, instead of the additive errors primarily used in existing models. Multiplicative errors ensure that all quantile forecasts are strictly positive. Fourth, the model includes prior distributions on its parameters. These prior distributions become regularization terms in the model and allow the manager to make accurate forecasts from the beginning of a life cycle, which is notoriously difficult. The model's skewed trend, time-varying regularized parameters, and multiplicative errors can make its quantile forecasts more accurate than those of leading diffusion models, such as the Bass, gamma/shifted-Gompertz, and trapezoid models. Fifth, the model's estimation procedure is based on an efficient optimization routine, which can be used to forecast product life cycles at scale. In two empirical studies, one of search interest in social networks and the other of new computer sales, we demonstrate that our model outperforms leading diffusion models in out-of-sample forecasting: both its point forecasts and its quantile forecasts at other levels are more accurate. Accurate quantile forecasts at different horizons are critical to many operational decisions, such as capacity and inventory management.
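The interaction between a life-cycle trend and a multiplicative-error smoothing update can be sketched as follows. This is a stylized illustration, not the paper's model: it uses the standard Gompertz density as a stand-in for the tilted-Gompertz distribution (whose exact form is defined in the paper), smooths only a single scale parameter `m`, and the smoothing weight `gamma` is hypothetical.

```python
import math

def gompertz_pdf(t, eta, b):
    """Standard Gompertz density, f(t) = b*eta*exp(eta + b*t - eta*e^{b*t}).
    Used here as a stand-in for the paper's tilted-Gompertz trend."""
    return b * eta * math.exp(eta + b * t - eta * math.exp(b * t))

def smooth_scale(demands, eta=0.5, b=0.1, gamma=0.2, m0=100.0):
    """Stylized multiplicative-error update of a life-cycle scale parameter.

    At each period t, the forecast is m * f(t); the observed multiplicative
    error y_t / (m * f(t)) nudges m geometrically, which keeps m (and hence
    all quantile forecasts built from it) strictly positive.
    """
    m = m0
    for t, y in enumerate(demands, start=1):
        f = gompertz_pdf(t, eta, b)
        ratio = y / (m * f)        # observed (1 + e_t) under multiplicative errors
        m *= ratio ** gamma        # geometric smoothing step
    return m

# If demand tracks the trend exactly, the scale estimate is left unchanged.
demands = [100.0 * gompertz_pdf(t, 0.5, 0.1) for t in range(1, 11)]
m_hat = smooth_scale(demands)      # stays at 100.0
```

The geometric update is one simple way to realize the abstract's point that multiplicative errors keep forecasts strictly positive, since `m` can shrink toward zero but never cross it.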