LSTM Model Architecture for Rare Event Time Series Forecasting

This is surprising as neural networks are known to be able to learn complex non-linear relationships and the LSTM is perhaps the most successful type of recurrent neural network that is capable of directly supporting multivariate sequence prediction problems.

A recent study performed at Uber AI Labs demonstrates how both the automatic feature learning capabilities of LSTMs and their ability to handle input sequences can be harnessed in an end-to-end model that can be used for drive demand forecasting for rare events like public holidays.

In this post, you will discover an approach to developing a scalable end-to-end LSTM model for time series forecasting.

After reading this post, you will know:

The challenge of multivariate, multi-step forecasting across multiple sites, in this case cities.

An LSTM model architecture for time series forecasting comprised of separate autoencoder and forecasting sub-models.

The skill of the proposed LSTM architecture at rare event demand forecasting and the ability to reuse the trained model on unrelated forecasting problems.

Motivation

The goal of the work was to develop an end-to-end forecast model for multi-step time series forecasting that can handle multivariate inputs (e.g. multiple input time series).

The intent of the model was to forecast driver demand at Uber for ride sharing, specifically to forecast demand on challenging days such as holidays where the uncertainty for classical models was high.

Generally, this type of demand forecasting for holidays belongs to an area of study called extreme event prediction.

Extreme event prediction has become a popular topic for estimating peak electricity demand, traffic jam severity and surge pricing for ride sharing and other applications. In fact there is a branch of statistics known as extreme value theory (EVT) that deals directly with this challenge.

Classical Forecasting Methods: Where a model was developed per time series, perhaps fit as needed.

Two-Step Approach: Where classical models were used in conjunction with machine learning models.

The difficulty of these existing models motivated the desire for a single end-to-end model.

Further, a model was required that could generalize across locales, specifically across data collected for each city. This means a model trained on some or all cities with data available and used to make forecasts across some or all cities.

We can summarize this as the general need for a model that supports multivariate inputs, makes multi-step forecasts, and generalizes across multiple sites, in this case cities.

The input to each forecast consisted of both the information about each ride, as well as weather, city, and holiday variables.

To circumvent the lack of data we use additional features including weather information (e.g., precipitation, wind speed, temperature) and city level information (e.g., current trips, current users, local holidays).

Feature Extractor: Model for distilling an input sequence down to a feature vector that may be used as input for making a forecast.

Forecaster: Model that uses the extracted features and other inputs to make a forecast.

An LSTM autoencoder model was developed for use as the feature extraction model and a Stacked LSTM was used as the forecast model.

We found that the vanilla LSTM model’s performance is worse than our baseline. Thus, we propose a new architecture, that leverages an autoencoder for feature extraction, achieving superior performance compared to our baseline.

When making a forecast, time series data is first provided to the autoencoders, which is compressed to multiple feature vectors that are averaged and concatenated. The feature vectors are then provided as input to the forecast model in order to make a prediction.

… the model first primes the network by auto feature extraction, which is critical to capture complex time-series dynamics during special events at scale. […] Features vectors are then aggregated via an ensemble technique (e.g., averaging or other methods). The final vector is then concatenated with the new input and fed to LSTM forecaster for prediction.

It is not clear what exactly is provided to the autoencoder when making a prediction, although we may guess that it is a multivariate time series for the city being forecasted with observations prior to the interval being forecasted.

A multivariate time series as input to the autoencoder will result in multiple encoded vectors (one for each series) that could be concatenated. It is not clear what role averaging may take at this point, although we may guess that it is an averaging of multiple models performing the autoencoding process.

The encoded feature vectors are provided to the forecast model with ‘new input‘, although it is not specified what this new input is; we could guess that it is a time series, perhaps a multivariate time series of the city being forecasted with observations prior to the forecast interval. Or, features extracted from this series as the blog post on the paper suggests (although I’m skeptical as the paper and slides contradict this).

The model was trained on a lot of data, which is a general requirement of stacked LSTMs or perhaps LSTMs in general.

The described production Neural Network Model was trained on thousands of time-series with thousands of data points each.

An interesting approach to estimating forecast uncertainty was also implemented that used the bootstrap.

It involved estimating model uncertainty and forecast uncertainty separately, using the autoencoder and the forecast model respectively. Inputs were provided to a given model and dropout of the activations (as commented in the slides) was used. This process was repeated 100 times, and the model and forecast error terms were used in an estimate of the forecast uncertainty.

The model trained on the Uber dataset was then applied directly to a subset of the M3-Competition dataset comprised of about 1,500 monthly univariate time series forecasting datasets.

This is a type of transfer learning, a highly-desirable goal that allows the reuse of deep learning models across problem domains.

Surprisingly, the model performed well, not great compared to the top performing methods, but better than many sophisticated models. The result is suggests that perhaps with fine tuning (e.g. as is done in other transfer learning case studies) the model could be reused and be skillful.

Performance of LSTM Model Trained on Uber Data and Evaluated on the M3 DatasetsTaken from “Time-series Extreme Event Forecasting with Neural Networks at Uber.”

Importantly, the authors suggest that perhaps the most beneficial application of deep LSTM models to time series forecasting are situations where:

There are a large number of time series.

There are a large number of observations for each series.

There is a strong correlation between time series.

From our experience there are three criteria for picking a neural network model for time-series: (a) number of timeseries (b) length of time-series and (c) correlation among the time-series. If (a), (b) and (c) are high then the neural network might be the right choice, otherwise classical timeseries approach may work best.

Outliers usually are anomalies which are abnormal ie. outside a normal distribution. Something like mean+/-2*std. in a time series outliers are sparks, with much higher freq than the normal signal even with rare events.
For instance a Black Friday is rare event but fits in the normal frequency whereas an outlier is much higher frequency.
So how can I bring the frequency part in the equation?
Thanks

very well explained, as always! a lot of your other articles contain code that help us understand the concepts better. I’m sure you’re very busy, but it’d be great if you could add code to this post, or point me to some articles/repos that have some code related to this post. Thanks,

I Master Degree student and I got interested in aply this approach in climate data series. I my research I have an adiction challenge (dimesions) the are latitude and longitude of an extrem or rare event.

So, as an example, I’m interested in predict an extreme rain (> 50mm in 24h) for a selected area (0.75 resolution): latitude from-18.75 to -20.25 and longitude from 315.0 to 316.5., It’s a grid 3 x 3 = 9 grids.

The rain (total precipitation in mm) doesn’t have a gaussian distribution, so, there are a lot of 00mm days, and the time series of rain are not a continuos sequence.

In your experience, this “Ubber” approach can fit despite of distribution problem? I have some doubts about the approach, like how this “LSTM Autoencoder for Feature Extraction” works. Do you expected do code a complete example like this “Uber” approach?