Specifically, from December in year 2, we must forecast January, February and March. From January, we must forecast February, March and April. All the way to an October, November, December forecast from September in year 3.

A total of 10 3-month forecasts are required, as follows:

1. Dec, Jan, Feb, Mar
2. Jan, Feb, Mar, Apr
3. Feb, Mar, Apr, May
4. Mar, Apr, May, Jun
5. Apr, May, Jun, Jul
6. May, Jun, Jul, Aug
7. Jun, Jul, Aug, Sep
8. Jul, Aug, Sep, Oct
9. Aug, Sep, Oct, Nov
10. Sep, Oct, Nov, Dec

Model Evaluation

A rolling-forecast scenario will be used, also called walk-forward model validation.

Each time step of the test dataset will be walked one at a time. A model will be used to make a forecast for the time step, then the actual value for that month will be taken from the test set and made available to the model for the forecast on the next time step.

This mimics a real-world scenario where new Shampoo Sales observations would be available each month and used in the forecasting of the following month.

This will be simulated by the structure of the train and test datasets.

All forecasts on the test dataset will be collected and an error score calculated to summarize the skill of the model for each of the forecast time steps. The root mean squared error (RMSE) will be used as it punishes large errors and results in a score that is in the same units as the forecast data, namely monthly shampoo sales.
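As a quick illustration of the metric with made-up numbers (not from the tutorial's results):

```python
from math import sqrt
from sklearn.metrics import mean_squared_error

# made-up actual and predicted monthly sales
actual = [400.0, 500.0]
predicted = [410.0, 480.0]

# RMSE is in the same units as the data (monthly shampoo sales)
rmse = sqrt(mean_squared_error(actual, predicted))
print('RMSE: %f' % rmse)
```

Errors of 10 and 20 give an RMSE of about 15.8; the larger error dominates because errors are squared before averaging.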

Persistence Model

A good baseline for time series forecasting is the persistence model.

This is a forecasting model where the last observation is persisted forward. Because of its simplicity, it is often called the naive forecast.

You can learn more about the persistence model for time series forecasting in the post:

Running the example first prints the entire test dataset, which is the last 10 rows. The shape and size of the train and test datasets are also printed.

[[ 342.3  339.7  440.4  315.9]
 [ 339.7  440.4  315.9  439.3]
 [ 440.4  315.9  439.3  401.3]
 [ 315.9  439.3  401.3  437.4]
 [ 439.3  401.3  437.4  575.5]
 [ 401.3  437.4  575.5  407.6]
 [ 437.4  575.5  407.6  682. ]
 [ 575.5  407.6  682.   475.3]
 [ 407.6  682.   475.3  581.3]
 [ 682.   475.3  581.3  646.9]]

Train: (23, 4), Test: (10, 4)

We can see that the single input value (first column) on the first row of the test dataset matches the observation in the shampoo sales dataset for December in year 2:

"2-12",342.3

We can also see that each row contains 4 columns for the 1 input and 3 output values in each observation.
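For example, the first test row can be split into its 1 input and 3 output values with simple NumPy slicing (a minimal sketch using the values printed above):

```python
import numpy as np

# first row of the test dataset: 1 input value, then 3 output values
row = np.array([342.3, 339.7, 440.4, 315.9])
n_lag = 1

# input pattern X and multi-step output y
X, y = row[0:n_lag], row[n_lag:]
print(X)  # [342.3]
print(y)  # [339.7 440.4 315.9]
```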

Make Forecasts

The next step is to make persistence forecasts.

We can implement the persistence forecast easily in a function named persistence() that takes the last observation and the number of forecast steps to persist. This function returns an array containing the forecast.

# make a persistence forecast
def persistence(last_ob, n_seq):
    return [last_ob for i in range(n_seq)]

We can then call this function for each time step in the test dataset from December in year 2 to September in year 3.

Below is a function make_forecasts() that does this and takes the train, test, and configuration for the dataset as arguments and returns a list of forecasts.

# make a persistence forecast for each time step in the test dataset
def make_forecasts(train, test, n_lag, n_seq):
    forecasts = list()
    for i in range(len(test)):
        X, y = test[i, 0:n_lag], test[i, n_lag:]
        # make forecast
        forecast = persistence(X[-1], n_seq)
        # store the forecast
        forecasts.append(forecast)
    return forecasts

We can call this function as follows:

forecasts = make_forecasts(train, test, 1, 3)

Evaluate Forecasts

The final step is to evaluate the forecasts.

We can do that by calculating the RMSE for each time step of the multi-step forecast, in this case giving us 3 RMSE scores. The function below, evaluate_forecasts(), calculates and prints the RMSE for each forecasted time step.

# evaluate the RMSE for each forecast time step
def evaluate_forecasts(test, forecasts, n_lag, n_seq):
    for i in range(n_seq):
        actual = test[:, (n_lag + i)]
        predicted = [forecast[i] for forecast in forecasts]
        rmse = sqrt(mean_squared_error(actual, predicted))
        print('t+%d RMSE: %f' % ((i + 1), rmse))

We can call it as follows:

evaluate_forecasts(test, forecasts, 1, 3)

It is also helpful to plot the forecasts in the context of the original dataset to get an idea of how the RMSE scores relate to the problem in context.

We can first plot the entire Shampoo dataset, then plot each forecast as a red line. The function plot_forecasts() below will create and show this plot.

# plot the forecasts in the context of the original dataset
def plot_forecasts(series, forecasts, n_test):
    # plot the entire dataset in blue
    pyplot.plot(series.values)
    # plot the forecasts in red
    for i in range(len(forecasts)):
        off_s = len(series) - n_test + i
        off_e = off_s + len(forecasts[i])
        xaxis = [x for x in range(off_s, off_e)]
        pyplot.plot(xaxis, forecasts[i], color='red')
    # show the plot
    pyplot.show()

We can call the function as follows. Note that the number of observations held back on the test set is 12 for the 12 months, as opposed to 10 for the 10 supervised learning input/output patterns as was used above.

# plot forecasts
plot_forecasts(series, forecasts, 12)

We can make the plot better by connecting the persisted forecast to the actual persisted value in the original dataset.

This will require adding the last observed value to the front of the forecast. Below is an updated version of the plot_forecasts() function with this improvement.

# plot the forecasts in the context of the original dataset
def plot_forecasts(series, forecasts, n_test):
    # plot the entire dataset in blue
    pyplot.plot(series.values)
    # plot the forecasts in red
    for i in range(len(forecasts)):
        off_s = len(series) - n_test + i - 1
        off_e = off_s + len(forecasts[i]) + 1
        xaxis = [x for x in range(off_s, off_e)]
        yaxis = [series.values[off_s]] + forecasts[i]
        pyplot.plot(xaxis, yaxis, color='red')
    # show the plot
    pyplot.show()

Complete Example

We can put all of these pieces together.

The complete code example for the multi-step persistence forecast is listed below.

Running the example first prints the RMSE for each of the forecasted time steps.

This gives us a baseline of performance on each time step that we would expect the LSTM to outperform.

t+1 RMSE: 144.535304
t+2 RMSE: 86.479905
t+3 RMSE: 121.149168

The plot of the original time series with the multi-step persistence forecasts is also created. The lines connect to the appropriate input value for each forecast.

This context shows how naive the persistence forecasts actually are.

Line Plot of Shampoo Sales Dataset with Multi-Step Persistence Forecasts

Multi-Step LSTM Network

In this section, we will use the persistence example as a starting point and look at the changes needed to fit an LSTM to the training data and make multi-step forecasts for the test dataset.

Prepare Data

The data must be prepared before we can use it to train an LSTM.

Specifically, two additional changes are required:

Stationary. The data shows an increasing trend that must be removed by differencing.

Scale. The data must be rescaled to values between -1 and 1 to match the range of the hyperbolic tangent activation function of the LSTM units.

We can introduce a function to make the data stationary called difference(). This will transform the series of values into a series of differences, a simpler representation to work with.

# create a differenced series
def difference(dataset, interval=1):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
    return Series(diff)

We can use the MinMaxScaler from the sklearn library to scale the data.
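For example (a minimal sketch with made-up differenced values; in the tutorial the actual fitting happens inside prepare_data()):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up differenced values, shaped as a single-feature column
diff_values = np.array([20.0, -60.0, 40.0, -10.0]).reshape(-1, 1)

# rescale to the range -1, 1 to suit the LSTM activation function
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled_values = scaler.fit_transform(diff_values)

# keep the fit scaler so the transform can be inverted later
restored = scaler.inverse_transform(scaled_values)
```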

Putting this together, we can update the prepare_data() function to first difference the data and rescale it, then perform the transform into a supervised learning problem and train and test sets as we did before with the persistence example.

The function now returns a scaler in addition to the train and test datasets.

# transform series into train and test sets for supervised learning
def prepare_data(series, n_test, n_lag, n_seq):
    # extract raw values
    raw_values = series.values
    # transform data to be stationary
    diff_series = difference(raw_values, 1)
    diff_values = diff_series.values
    diff_values = diff_values.reshape(len(diff_values), 1)
    # rescale values to -1, 1
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaled_values = scaler.fit_transform(diff_values)
    scaled_values = scaled_values.reshape(len(scaled_values), 1)
    # transform into supervised learning problem X, y
    supervised = series_to_supervised(scaled_values, n_lag, n_seq)
    supervised_values = supervised.values
    # split into train and test sets
    train, test = supervised_values[0:-n_test], supervised_values[-n_test:]
    return scaler, train, test

We can call this function as follows:

# prepare data
scaler, train, test = prepare_data(series, n_test, n_lag, n_seq)

Fit LSTM Network

Next, we need to fit an LSTM network model to the training data.

This first requires that the training dataset be transformed from a 2D array [samples, features] to a 3D array [samples, timesteps, features]. We will fix time steps at 1, so this change is straightforward.
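The reshape itself is a one-liner (a minimal NumPy sketch; the shapes match the 23 training samples and 1 lag input used in this tutorial):

```python
import numpy as np

# 2D training input: [samples, features]
X = np.zeros((23, 1))

# 3D input expected by the LSTM: [samples, timesteps, features], timesteps=1
X = X.reshape(X.shape[0], 1, X.shape[1])
print(X.shape)  # (23, 1, 1)
```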

Next, we need to design an LSTM network. We will use a simple structure with 1 hidden layer with 1 LSTM unit, then an output layer with linear activation and 3 output values. The network will use a mean squared error loss function and the efficient ADAM optimization algorithm.

The LSTM is stateful; this means that we have to manually reset the state of the network at the end of each training epoch. The network will be fit for 1500 epochs.

The same batch size must be used for training and prediction, and we require predictions to be made at each time step of the test dataset. This means that a batch size of 1 must be used. A batch size of 1 is also called online learning as the network weights will be updated during training after each training pattern (as opposed to mini batch or batch updates).

We can put all of this together in a function called fit_lstm(). The function takes a number of key parameters that can be used to tune the network later and the function returns a fit LSTM model ready for forecasting.
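A sketch of fit_lstm() consistent with this description might look as follows, assuming the Keras Sequential API used elsewhere in this tutorial (this is a reconstruction, not the tutorial's exact listing; argument names may need adjusting for your Keras version):

```python
# fit an LSTM network to training data (a sketch; assumes the
# Keras Sequential API)
from keras.models import Sequential
from keras.layers import Dense, LSTM

def fit_lstm(train, n_lag, n_seq, n_batch, nb_epoch, n_neurons):
    # reshape training into [samples, timesteps, features]
    X, y = train[:, 0:n_lag], train[:, n_lag:]
    X = X.reshape(X.shape[0], 1, X.shape[1])
    # design network: 1 hidden LSTM layer, linear output with n_seq values
    model = Sequential()
    model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
    model.add(Dense(y.shape[1]))
    model.compile(loss='mean_squared_error', optimizer='adam')
    # fit network, manually resetting state at the end of each epoch
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
        model.reset_states()
    return model
```

It could then be called as model = fit_lstm(train, 1, 3, 1, 1500, 1) for the 1 lag, 3-step, batch-size-1, 1500-epoch, 1-neuron configuration described above.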

The configuration of the network was not tuned; try different parameters if you like.

Report your findings in the comments below. I’d love to see what you can get.

Make LSTM Forecasts

The next step is to use the fit LSTM network to make forecasts.

A single forecast can be made with the fit LSTM network by calling model.predict(). Again, the data must be formatted into a 3D array with the format [samples, timesteps, features].

We can wrap this up into a function called forecast_lstm().

# make one forecast with an LSTM
def forecast_lstm(model, X, n_batch):
    # reshape input pattern to [samples, timesteps, features]
    X = X.reshape(1, 1, len(X))
    # make forecast
    forecast = model.predict(X, batch_size=n_batch)
    # convert to array
    return [x for x in forecast[0, :]]

We can call this function from the make_forecasts() function and update it to accept the model as an argument. The updated version is listed below.

# make forecasts with the fit LSTM model
def make_forecasts(model, n_batch, train, test, n_lag, n_seq):
    forecasts = list()
    for i in range(len(test)):
        X, y = test[i, 0:n_lag], test[i, n_lag:]
        # make forecast
        forecast = forecast_lstm(model, X, n_batch)
        # store the forecast
        forecasts.append(forecast)
    return forecasts

This updated version of the make_forecasts() function can be called as follows:

# make forecasts
forecasts = make_forecasts(model, 1, train, test, 1, 3)

Invert Transforms

After the forecasts have been made, we need to invert the transforms to return the values back into the original scale.

This is needed so that we can calculate error scores and plots that are comparable with other models, like the persistence forecast above.

We can invert the scale of the forecasts directly using the MinMaxScaler object that offers an inverse_transform() function.

We can invert the differencing by adding the value of the last observation (the prior month's shampoo sales) to the first forecasted value, then propagating the value down the forecast.

This is a little fiddly; we can wrap up the behavior in a function named inverse_difference() that takes the last observed value prior to the forecast and the forecast as arguments and returns the inverted forecast.

# invert differenced forecast
def inverse_difference(last_ob, forecast):
    # invert first forecast
    inverted = list()
    inverted.append(forecast[0] + last_ob)
    # propagate difference forecast using inverted first value
    for i in range(1, len(forecast)):
        inverted.append(forecast[i] + inverted[i - 1])
    return inverted
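As a quick check with made-up numbers, persisting a last observation of 100.0 through the differenced forecast [2.0, 3.0, -1.0] should give [102.0, 105.0, 104.0]:

```python
# repeated here so the sketch is self-contained
def inverse_difference(last_ob, forecast):
    # invert the first forecast, then propagate down the sequence
    inverted = [forecast[0] + last_ob]
    for i in range(1, len(forecast)):
        inverted.append(forecast[i] + inverted[i - 1])
    return inverted

print(inverse_difference(100.0, [2.0, 3.0, -1.0]))  # [102.0, 105.0, 104.0]
```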

Putting this together, we can create an inverse_transform() function that works through each forecast, first inverting the scale and then inverting the differences, returning forecasts to their original scale.

# inverse data transform on forecasts
def inverse_transform(series, forecasts, scaler, n_test):
    inverted = list()
    for i in range(len(forecasts)):
        # create array from forecast
        forecast = array(forecasts[i])
        forecast = forecast.reshape(1, len(forecast))
        # invert scaling
        inv_scale = scaler.inverse_transform(forecast)
        inv_scale = inv_scale[0, :]
        # invert differencing
        index = len(series) - n_test + i - 1
        last_ob = series.values[index]
        inv_diff = inverse_difference(last_ob, inv_scale)
        # store
        inverted.append(inv_diff)
    return inverted

We can call this function with the forecasts as follows:

# inverse transform forecasts and test
forecasts = inverse_transform(series, forecasts, scaler, n_test+2)

We can also invert the transforms on the output part of the test dataset so that we can correctly calculate the RMSE scores, as follows:

actual = [row[n_lag:] for row in test]
actual = inverse_transform(series, actual, scaler, n_test+2)

We can also simplify the calculation of RMSE scores to expect the test data to only contain the output values, as follows:

def evaluate_forecasts(test, forecasts, n_lag, n_seq):
    for i in range(n_seq):
        actual = [row[i] for row in test]
        predicted = [forecast[i] for forecast in forecasts]
        rmse = sqrt(mean_squared_error(actual, predicted))
        print('t+%d RMSE: %f' % ((i + 1), rmse))

Complete Example

We can tie all of these pieces together and fit an LSTM network to the multi-step time series forecasting problem.

Running the example first prints the RMSE for each of the forecasted time steps.

We can see that the scores at each forecasted time step are better, in some cases much better, than the persistence forecast.

This shows that the configured LSTM does have skill on the problem.

It is interesting to note that the RMSE does not become progressively worse with the length of the forecast horizon, as might be expected. This is shown by the fact that t+2 appears easier to forecast than t+1. This may be because the downward tick is easier to predict than the upward tick noted in the series (this could be confirmed with more in-depth analysis of the results).

t+1 RMSE: 95.973221
t+2 RMSE: 78.872348
t+3 RMSE: 105.613951

A line plot of the series (blue) with the forecasts (red) is also created.

The plot shows that although the skill of the model is better, some of the forecasts are not very good and that there is plenty of room for improvement.

Line Plot of Shampoo Sales Dataset with Multi-Step LSTM Forecasts

Extensions

There are some extensions you may consider if you are looking to push beyond this tutorial.

Update LSTM. Change the example to refit or update the LSTM as new data is made available. A few tens of training epochs should be sufficient to retrain with a new observation.

Tune the LSTM. Grid search some of the LSTM parameters used in the tutorial, such as number of epochs, number of neurons, and number of layers to see if you can further lift performance.

Seq2Seq. Use the encoder-decoder paradigm for LSTMs to forecast each sequence to see if this offers any benefit.

Time Horizon. Experiment with forecasting different time horizons and see how the behavior of the network varies at different lead times.

Did you try any of these extensions?
Share your results in the comments; I’d love to hear about it.

Summary

In this tutorial, you discovered how to develop LSTM networks for multi-step time series forecasting.

Specifically, you learned:

How to develop a persistence model for multi-step time series forecasting.

How to develop an LSTM network for multi-step time series forecasting.

How to evaluate and plot the results from multi-step time series forecasting.

Do you have any questions about multi-step time series forecasting with LSTMs?
Ask your questions in the comments below and I will do my best to answer.

Thanks a lot for this post. I have been trying to build this for my thesis since September, without good results. But I'm having trouble: I'm not able to compile. Maybe you or someone who reads this can tell me why this happens. I'm getting the following error when running the code:

The TensorFlow library wasn’t compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.

The TensorFlow library wasn’t compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.

The TensorFlow library wasn’t compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.

The TensorFlow library wasn’t compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

The TensorFlow library wasn’t compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

The TensorFlow library wasn’t compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

Obviously it has something to do with TensorFlow (I have read about this problem and I think it's because TensorFlow was not installed from source, but I have no idea how to fix it).

Hi Jason,
Your article is very useful! I have a question: if the data series has three columns, where the 2nd column is the input data and the 3rd column is the forecast data (both including the train and test data), can they be run through the difference and transform steps?
Thank you very much!

I have discovered how to do it by asking some people. The object series is actually a Pandas Series. It’s a vector of information, with a named index. Your dataset, however, contains two fields of information, in addition to the time series index, which makes it a DataFrame. This is the reason why the tutorial code breaks with your data.

To pass your entire dataset to MinMaxScaler, just run difference() on both columns and pass in the transformed vectors for scaling. MinMaxScaler accepts an n-dimensional DataFrame object:

I mean, for a 2 variables dataset as yours, we can set, for example, this values:

n_lags=1
n_seq=2

so we will have a supervised dataset like this:

var1(t-1) var2(t-1) var1(t) var2 (t) var1(t+1) var2 (t+1)

so, if we want to train the ANN to forecast var2 (which is the target we want to predict) with the var1 as input and the previous values of var2 also as input, we have to separate them and here is where my doubt begins.

Thanks for the previous clarification. I have a doubt about the "fit network" section in the code. I'm having some trouble trying to plot the training graph (validation vs training) in order to see whether or not the network is overfitted, but due to the model.reset_states() statement, I can only save the last loss and val_loss from the history. Is there any way to solve this?

Just create 2 lists (or 1, but I see it more clearly this way) and return them from the function. Then, outside, just plot them. I'm sorry for the question, maybe the answer is obvious, but I'm starting out in Python and I'm not a programmer.

Now I'm trying to find a way to make the training process faster and reduce the RMSE, but it's pretty difficult (the idea is to make the results better than the NARX model implemented in the MATLAB Neural Network Toolbox, but its results and computational time are hard to overcome).

Thanks for the great tutorial. I'm wondering if you can help me clarify the reason you have model.reset_states() (line 83) when fitting the model; I was able to achieve similar results without the line as well.

Hi Jason,
When I applied your code to a 22-year daily time series, I found that the LSTM forecast result is similar to the persistence one, i.e. the red line is just a horizontal bar. I'm sure I did not mix up those two methods; I wonder what causes this?

Thanks to your tutorial, I've been tuning parameters such as the number of epochs and neurons these days. However, I noticed that you mentioned the grid search method for finding appropriate parameters; could you please explain how to apply it to an LSTM? I'm confused by your examples in some other tutorials that use a model class, which seems unfamiliar to me.

Thanks, I've just finished one test. What does it mean if the error oscillates violently as epochs increase, instead of steadily diminishing? Can I tune the model better, or is the LSTM incapable of modeling this time series?

Understood. Let me re-phrase the question. In a practical application, one would be interested in forecasting the last data point, i.e. in the shampoo dataset, “3-12”. How would you suggest doing that?

I have followed a couple of your articles about LSTMs and learned a lot, but here is a question on my mind: can I introduce some interference elements into the model? For example, for the shampoo sales problem, there may be some data about holiday sales, or sales data after an incident happens. If I want to make predictions for sales after those incidents, what can I do?

What's more, I noticed that you parse the date/time with a parser, but you did not really introduce a time feature into the model. For example, if I want to make a prediction for next Monday or next January, how can I feed in a time feature?

I would like to know how to do short term and long term prediction with minimum number of models?

For example, I have a 12-step input and 12-step output model A, and a 12-step input and 1-step output model B; would model A give a better prediction for the next first time step than model B?

What's more, if we have a 1-step input and 1-step output model, it is more error prone for long term prediction.
If we have a multi-step input and 1-step output model, it is still error prone over the long term. So how should we think about long term vs short term prediction?

I want to train a model with the following input size: [6000, 4, 2] ([samples, timesteps, features]).

For example, I want to predict shampoo sales for the next two years. If I have another feature, like an economy index for every year, can I concatenate the sales data and the index data in the above format? My input would then be a 3D array. How should I modify the model to train?

I always get this error: ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (6000, 2, 2).

The error comes from this line: model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False). Can you provide some advice? Thanks!

What if I want to tell the model to learn from the train data (23 samples here) and forecast only 3 steps forward (Jan, Feb, Mar)? I want to avoid the persistence model in this case and only need a 3-step direct strategy. Hope you got that.

Here, if I would like to make only one forecast for 3 steps (Jan, Feb, Mar), what do I have to change? I do not need the rest of the months (Apr, May, Jun, Jul, Aug, ... Dec). One prediction or forecast for 3 steps.

Hi,
I have this problem: I have downloaded the dataset from the link in the text, and I think this error occurred because the data in our CSV file is not in the correct format!
Can anyone give us the dataset please?

This post really helped me.
Now the next question is how do we enhance this to consider exogenous variables while forecasting?
If I simply add exogenous variable values at this step:
train, test = supervised_values[0:-n_test], supervised_values[-n_test:] (and obviously make appropriate changes to batch_input_shape in the model fit).
Would it help improve predictions?
What is the correct way of adding independent variables?

Hi Jason, thanks for writing up such detailed explanations.
I am using an LSTM layer for a time series prediction problem.
Everything works fine except for when I try to use the inverse_transform to undo the scaling of my data. I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’).

Not really sure how I can get past this problem. Could you please help me with this ?

Hi Jason
I encountered a data file format issue and similar NaN issues like Kiran saw.
The file format I downloaded doesn't have the "19" year format,
e.g.
Month,Sales of shampoo over a three year period
01-Jan,266

So I changed the parser() to just return x, as is.

Then on the Multi-Step LSTM Network I got the following NaN

ipdb> series
Month
01-Jan 266.0
…
03-Nov 581.3
03-Dec 646.9
NaN NaN
Sales of shampoo over a three year period NaN
Name: Sales of shampoo over a three year period, dtype: float64

When I try a step-by-step forecast, i.e. forecast 1 point and then feed it back as data to forecast the next point, my predictions become constant after just 2 steps, sometimes from the beginning itself.

Oh then a hybrid model using residuals from ARIMA for RNN should work well 🙂 ?
The residuals will not have any seasonal components (even scaling should be well taken care of).
Or do you expect MLPs to work better here also?

I think there is an issue with inverse differencing while forecasting multi-step (to deal with non-stationary data).
This example adds the previously forecasted (and inverse-differenced) value to the currently forecasted value. Isn't this method wrong when we have 30 points to forecast, as it keeps adding up the results and hence the output will continuously increase?

I have a question regarding a seq2seq time series forecasting problem with a multi-step LSTM.

I have created a supervised dataset of (t-1), (t-2), (t-3), ..., (t-look_back) and (t+1), (t+2), (t+3), ..., (t+look_ahead), and our goal is to forecast look_ahead timesteps.

We have tried your complete example code with a Dense(look_ahead) last layer but received not so good results. This was done using both a stateful and a non-stateful network.

We then tried using Dense(1) followed by RepeatVector(look_ahead), and we get the same (around average) value for all the look_ahead timesteps. This was done using a non-stateful network.

Then I created a stepwise prediction where look_ahead = 1 always. The prediction for t+2 is then based on the history of (t+1), (t), (t-1), ... This has given me better results, but I have only tried it with a non-stateful network.

My questions are:
– Is it possible to use RepeatVector with non-stateful networks, or must the network be stateful? Do you have any idea why my predictions are all the same value?
– What network do you recommend for this type of problem? Stateful or non-stateful, seq2seq or stepwise prediction?

The RepeatVector is only for the Encoder-Decoder architecture, to ensure that each time step in the output sequence has access to the entire fixed-width encoding vector from the Encoder. It is not related to stateful or stateless models.

I would develop a simple MLP baseline with a vector output and challenge all LSTM architectures to beat it. I would look at a vector output on a simple LSTM and a seq2seq model. I would also try the recursive model (feed outputs as inputs for repeating a one step forecast).
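The recursive model mentioned above can be sketched with a stand-in one-step forecast function (one_step() here is a hypothetical placeholder; a fit model such as an LSTM would be called in its place):

```python
import numpy as np

def one_step(history):
    # stand-in one-step forecast: the mean of the last 3 observations
    # (a real fit model would be called here instead)
    return float(np.mean(history[-3:]))

def recursive_forecast(history, n_seq):
    # repeat a one-step forecast, feeding each output back in as input
    history = list(history)
    forecasts = []
    for _ in range(n_seq):
        yhat = one_step(history)
        forecasts.append(yhat)
        history.append(yhat)
    return forecasts

print(recursive_forecast([1.0, 2.0, 3.0], 3))
```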

I had a question. When reshaping X for the LSTM (samples, timesteps, features), why did you model the problem as timesteps=1 and features=X.shape[1]? Shouldn't it be timesteps = lag window size, with the output dense layer having the size of the horizon window? This will give much better results in my opinion.

With this code, I’m able to actually forecast the future ROI%. With the other, it does a lot better at modeling the past data, but I can’t figure out how to get it to forecast the future. Both codes have elements I need, but I can’t seem to figure out how to bring them together.

I am using this framework for my first shot at an LSTM network for monitoring network response times. The data I’m working with currently is randomly generated by simulating API calls. What I’m seeing is the LSTM seems to always predict a return to what looks like the mean of the data. Is this a function of the data being stochastic?

Separate question: since LSTM’s have a memory component built into the neurons, what are the advantages/disadvantages of using a larger n_in/n_lag than 1?

Thanks. I am playing with some toy data now just to make sure I’m understanding how this works.

I am able to model a cosine wave very nicely with a 5 neuron, 100 epoch training run against np.cos(range(100)) split into 80/20 training set. This is with the scaling, but without the difference. I feed in 10 inputs, and get 30 outputs.

Does calling model.predict change the model? I am calling repeatedly with the same 10 inputs and am seeing a different result each time. It looks like the predicted wave cycles through different amplitudes.

Ah ok, I got it. Since stateful is on, I would need to do an explicit reset_states between predictions. Makes sense, I think! Stateful was useful for training, but since I won’t be “online learning” and since I feed the network lag in the features, I should not rely on state for predictions.

I have a simple question. I am trying to set up a different toy problem, with data generated as y=x over 800 points (holding out the next 200 as validation). No matter how many layers, neurons, or epochs I train over, the predictions tend to start out fairly close to the line for lower values, but diverge quickly and approach some fixed y=400 for higher values.

Hi, there is a problem with the code: when doing the data processing, i.e. calculating the difference and min-max scaling, you should not use all of the data. In a more realistic situation, you can only do this on the train data, since you have no knowledge of the test data.

So I changed the code: cut off the last 12 months as the test set, then used only 24 months of data for differencing, min-max scaling, fitting the model, and predicting months 25, 26, 27.

Then I continued, using 25 months of data for differencing, min-max scaling, fitting the model, and predicting months 26, 27, 28.
…

Hi Jason,
thanks a lot for your tutorials on LSTMs.
Do you have a suggestion on how to model the network for a multivariate multi-step forecast? I read your articles about multivariate and multi-step forecasts, but combining both seems trickier as the output of the dense layer gets a higher dimension.

In words of your example here: if I want to forecast not only shampoo but also toothpaste sales T time steps ahead, how can I achieve the forecast to have the dimension 2xT? Is there an alternative to the dense layer?

Thanks for this great tutorial. Do you think this technique is applicable to the case of a many-to-many prediction?

A toy scenario: imagine a machine which has 5 tuning knobs [x1, x2, x3, x4, x5], and as a result we can read 2 values [y, z] in response to a change of any of the knobs.

I am wondering if I can use an LSTM to predict y and z with a single model instead of building one model for y and another for z? I am planning to follow this tutorial, but I would love to hear what you think about it.

Hi Jason, thank you very much for this tutorial. I am just starting with LSTM and your series on LSTM is greatly valuable.
A question about multi-output forecasting: how do I deal with the overlap in a multi-output forecast when plotting the true data versus the predicted data?
Let’s say I have a model to forecast the next 10 steps (t, t+1…,t+9).
Using the observation at time:
–> t=0, the model will give a forecast for t =1,2,3,4,5,6,7,8,9,10
and similarly, at
–> t=1, a forecast will be output for t=2,3,4,5,6,7,8,9,10,11
etc…
There is overlap in the timestep for the forecast from t=0 and from t=1. For example, if I want to know the value at t=2, should I use the forecast from t=1 or from t=0, or a weighted average of the forecast?

Maybe using only the forecast from t=1 is enough, because it already includes the history of the time series (i.e. it already includes the observation at t=0).
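Averaging the overlapping forecasts, as the question suggests, is one reasonable option. A minimal sketch of collecting every forecast that covers a given target time and taking the mean:

```python
def combine_overlapping(forecasts):
    """forecasts[i] is the multi-step forecast made at origin t=i,
    covering target times i+1 .. i+len(forecasts[i]). For each target
    time, return the mean of all forecasts that cover it."""
    combined = {}
    for origin, fc in enumerate(forecasts):
        for k, value in enumerate(fc):
            combined.setdefault(origin + 1 + k, []).append(value)
    return {t: sum(vals) / len(vals) for t, vals in combined.items()}

# two overlapping 2-step forecasts: both cover t=2, so it is averaged
print(combine_overlapping([[10.0, 20.0], [12.0, 22.0]]))
# {1: 10.0, 2: 16.0, 3: 22.0}
```

A simple mean is shown here; a weighted average that favors shorter-horizon (more recent-origin) forecasts is a natural variation.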

You explained that “A rolling-forecast scenario” will be used, also called walk-forward model validation. You said “Each time step of the test dataset will be walked one at a time. A model will be used to make a forecast for the time step, then the actual expected value for the next month from the test set will be taken and made available to the model for the forecast on the next time step”.

What method/algorithm would you suggest in the scenario where no such test/validation data are available? In other words, I have a collection of time-series data that stops at a certain point, and I need to forecast the next points.

Thanks for this wonderful tutorial. I’m trying to solve a problem and wanted your input. I have 2 years of sales data on a daily basis, with some other predictor variables such as holiday, promotion, etc., let’s say Jan 2015 to Jan 2017, and I want to forecast the month of February. I was thinking data preparation would take the last 60 days of data as the input sequence and predict the next 30 time steps. Since the dataset is very small, do you think it will work? What’s your suggestion on this?

Mr. Jason,
I have two questions:
1. In this example, three RMSEs are reported. What should I do if I want to output the three predictions for each time step and collect all the output into a data frame (for easy inspection)?
2. What if I need to make 6-month or 12-month predictions? How do I change the code?
I’m sorry that my Python is not very good.
Thank you so much!

I see two different prediction results when I save the model and then predict with the loaded model.

But the forecast results are the same when I run the model many times before saving it.

The saved-and-loaded model also gives the same prediction output every time I run it.

The problem is that the results given before saving the model do not match those from the loaded model.

It looks like something changes inside the trained model when saving it. Before saving, the model achieves 98% accuracy, while after saving and loading, predictions give 90% accuracy.

Can you help me clarify this doubt? I have provided the code snippet with the output below. The snippet that saves the model and loads it again is from one single Python program, not multiple Python scripts.

Note: I am experimenting with a different dataset, that contains prices in decimals and similar to this tutorial dataset.
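A useful way to pin down a discrepancy like the one described above is to compare predictions on identical inputs before and after a save/load round trip; if they differ, the loss is in serialization or in preprocessing state (e.g. a fitted scaler) that was not saved alongside the model. A generic sketch of the check with pickle and a toy model stand-in (for a Keras model, the same comparison would wrap model.save() and load_model()):

```python
import pickle
import numpy as np

class ToyModel:
    """Stand-in for a trained model: y = w * x + b."""
    def __init__(self, w, b):
        self.w, self.b = w, b

    def predict(self, x):
        return self.w * x + self.b

model = ToyModel(w=2.0, b=0.5)
x = np.array([1.0, 2.0, 3.0])
before = model.predict(x)

blob = pickle.dumps(model)        # save
restored = pickle.loads(blob)     # load
after = restored.predict(x)

# identical predictions confirm the round trip preserved the model;
# a mismatch here would point at serialization or preprocessing state
assert np.allclose(before, after)
```

If this check passes for the model but end-to-end results still differ, the differencing/scaling state applied around the model is the usual suspect.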

Your blogs are really great. I have learned, and am still learning, a lot from them.

I am trying to feed tweet sentiments into an LSTM along with some numeric features (e.g. price, volume), but I have not succeeded yet. I have read some blogs and papers, but everywhere the tweets and numeric features are fed separately, whereas I want to feed both of them in a single feature vector.
Any good suggestions ?

Thank you Jason
I’ve been working through your tutorials, which are quite useful and clear, even to a non-Python programmer. In this one, though, I lost the thread around “Fit LSTM Network”. I’m concerned about “fix time steps at 1”.

What about when the time steps are not a constant size? A specific example: I am driving, recording my position, acceleration, direction, and time every five minutes. For various reasons the five minutes is approximate. Also, sometimes I lose the GPS signal, so I miss one or several records.

Obviously position depends on time. Should I resample all my records so the time periods are equal? Should I interpolate to fill in the missing ones? What if I stop overnight? Can I somehow stitch the two days’ data together?

Second question: where in this tutorial are you providing the penalty feedback to the model? I want to use an asymmetric loss function. (If I want to drive up to the edge of a precipice, it is much worse to go too far than not quite far enough.)
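For context on the asymmetric-loss idea: in this tutorial, the only "punishment" is the loss passed to the model at compile time (mean squared error), plus the RMSE reported afterwards for evaluation. An asymmetric alternative can be written as a weighted squared error. The NumPy sketch below penalizes overshooting more than undershooting (the weight value of 4.0 is arbitrary), and the same expression could be ported to a custom Keras loss using backend ops:

```python
import numpy as np

def asymmetric_squared_error(y_true, y_pred, over_weight=4.0):
    """Squared error that penalizes over-prediction (going past the
    'edge of the precipice') over_weight times more than under-prediction."""
    err = y_pred - y_true
    weights = np.where(err > 0, over_weight, 1.0)
    return np.mean(weights * err ** 2)

# overshooting by 1 costs 4x as much as undershooting by 1
print(asymmetric_squared_error(np.array([0.0]), np.array([1.0])))   # 4.0
print(asymmetric_squared_error(np.array([0.0]), np.array([-1.0])))  # 1.0
```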

I have to predict the performance of an application. The inputs will be time series of past performance data of the application, CPU usage data of the server where the application is hosted, memory usage data, network bandwidth usage, etc. I’m trying to build a solution using an LSTM that will take these input data and predict the performance of the application for the next week. I have followed your blog ‘https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/’ and understood how to work with multivariate data. I’m currently stuck at the part about predicting multiple steps into the future, i.e., the next week of application performance. Even though multi-step prediction works for me with univariate time series examples, here it is not working, and I’m not sure what I’m missing. Could you please give me some guidance?

Thanks for that great blog! I have a general question about multi-step predictions. Your prediction of t+3 is, as I understand it, independent of the prediction of t+2, which itself is independent of t+1.

Is it meaningful to feed the former predictions back into the network? If yes, what is such a model called?
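Feeding earlier predictions back into the input is usually called the recursive (or iterated) multi-step strategy, in contrast to the direct strategy used in this tutorial, where all steps come out of a single forecast. A minimal sketch with a hypothetical one-step model:

```python
def recursive_forecast(history, one_step_model, n_steps):
    """Iterated multi-step strategy: each prediction is appended to the
    input window and fed back in to produce the next step."""
    window = list(history)
    preds = []
    for _ in range(n_steps):
        yhat = one_step_model(window)
        preds.append(yhat)
        window.append(yhat)      # feed the prediction back in
    return preds

# toy one-step model: predict the mean of the last 3 observations
model = lambda w: sum(w[-3:]) / 3
print(recursive_forecast([1.0, 2.0, 3.0], model, 3))
```

A known trade-off of the recursive strategy is that prediction errors compound, since later steps are conditioned on earlier predicted values rather than observations.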

Hi Jason,
Thanks for the great tutorial! I have several questions about the predictions. If I try to deal with a dataset containing about 6000 observations, is it meaningful to make predictions from t+1 to t+500 (if n_test=1)?
By the way, when plotting the predictions, there is a small shift from the last data point. Is it a result of the transform from series to supervised? Maybe I misunderstood something.

Would it be beneficial to also use which time step (t+k) we are predicting as an input to the model? Right now we are treating all data points in the span specified by n_seq as “the same time step away from where we are predicting from”.

Hi Jason
Many thanks for your very helpful tutorials. I would be very happy to get some help regarding this problem:
Given is a time series with 20 input variables and one output variable.
The series length is about 500 samples. For 5 of the 20 variables, there are also future samples available (50 samples). I wonder how I can use the future values of these 5 variables in order to improve the prediction.
Many thanks for a helpful hint.
Best Regards

For 5 of the 20 input variables (x1..x5), I already have the values for the next 50 timesteps (these values are given). So I don’t need to predict them, but I want to use them to improve the prediction for the (one) output variable y. (There is no need to predict the other 15 input values x6–x20.)

Dear Jason, thanks for the awesome code and explanation. I have one question for you. In this case, one wants to estimate multiple steps into the future, right? For example, 10 steps ahead, where all 10 steps are unknown. The model should find them without using the actual values. But what I see here in the test and train sets is that the model estimates data points using actual values, not predictions.
Let’s look at some of the data together:
[[ 342.3 339.7 440.4 315.9]
[ 339.7 440.4 315.9 439.3]
[ 440.4 315.9 439.3 401.3]]

Let’s imagine the model predicts that for the first row [ 342.3 339.7 440.4 315.9] the next value is 439.4, but the correct actual value is 439.3 (which we don’t know!). So in the second row we should use [ 339.7 440.4 315.9 439.4] instead of [ 339.7 440.4 315.9 439.3].

The question is this: when you say this method is capable of multi-step-ahead forecasting, which of these two do you mean:
1) the one that uses no information from the future (no actual values) and just uses its own predictions, or
2) the one that predicts a point for the next step and calculates the error, but then discards the prediction and uses the realization of that point (the actual value) for the steps after it?

I believe the model here is the second one, right?
I want to make sure.

I am concerned that the good results shown here are due to the model seeing the results in the test set.

In other words, the model predicts the shampoo sales for January at 1000, but the actual value is 1200. For the February prediction, the model uses 1200 (the correct value) instead of what it predicted (1000).
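The two strategies this question distinguishes can be contrasted in a few lines, with a toy persistence forecast standing in for the LSTM and illustrative values taken from the rows above:

```python
def forecast_with_actuals(train, test, predict):
    """Option 2 / walk-forward: after each forecast, the true test value
    is revealed and appended to history before the next forecast."""
    history, preds = list(train), []
    for actual in test:
        preds.append(predict(history))
        history.append(actual)          # use the realization
    return preds

def forecast_recursive(train, n_steps, predict):
    """Option 1: no future information; each prediction feeds the next."""
    history, preds = list(train), []
    for _ in range(n_steps):
        yhat = predict(history)
        preds.append(yhat)
        history.append(yhat)            # use the model's own prediction
    return preds

persistence = lambda h: h[-1]           # toy stand-in for the LSTM
train, test = [342.3, 339.7, 440.4], [315.9, 439.3, 401.3]
print(forecast_with_actuals(train, test, persistence))    # [440.4, 315.9, 439.3]
print(forecast_recursive(train, len(test), persistence))  # [440.4, 440.4, 440.4]
```

The walk-forward loop tends to look better precisely because each new forecast is anchored on the latest observation; the recursive loop is the honest picture of forecasting with no future information at all.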