The 5 Step Life-Cycle for Long Short-Term Memory Models in Keras

Deep learning neural networks are very easy to create and evaluate in Python with Keras, but you must follow a strict model life-cycle.

In this post, you will discover the step-by-step life-cycle for creating, training, and evaluating Long Short-Term Memory (LSTM) Recurrent Neural Networks in Keras and how to make predictions with a trained model.

After reading this post, you will know:

How to define, compile, fit, and evaluate an LSTM in Keras.

How to select standard defaults for regression and classification sequence prediction problems.

How to tie it all together to develop and run your first LSTM recurrent neural network in Keras.

Let’s get started.

Update June/2017: Fixed typo in input resizing example.

Photo by docmonstereyes, some rights reserved.

Overview

Below is an overview of the 5 steps in the LSTM model life-cycle in Keras that we are going to look at.

Define Network

Compile Network

Fit Network

Evaluate Network

Make Predictions

Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

Next, let’s take a look at a standard time series forecasting problem that we can use as context for this experiment.


Step 1. Define Network

The first step is to define your network.

Neural networks are defined in Keras as a sequence of layers. The container for these layers is the Sequential class.

The first step is to create an instance of the Sequential class. Then you can create your layers and add them in the order that they should be connected. The LSTM recurrent layer comprised of memory units is called LSTM(). A fully connected layer that often follows LSTM layers and is used for outputting a prediction is called Dense().

For example, we can do this in two steps:

model = Sequential()
model.add(LSTM(2))
model.add(Dense(1))

But we can also do this in one step by creating an array of layers and passing it to the constructor of the Sequential.

layers = [LSTM(2), Dense(1)]
model = Sequential(layers)

The first layer in the network must define the number of inputs to expect. Input must be three-dimensional, comprised of samples, timesteps, and features.

Samples. These are the rows in your data.

Timesteps. These are the past observations for a feature, such as lag variables.

Features. These are columns in your data.

Assuming your data is loaded as a NumPy array, you can convert a 2D dataset to a 3D dataset using the reshape() function in NumPy. If you would like columns to become timesteps for one feature, you can use:

data = data.reshape((data.shape[0], data.shape[1], 1))

If you would like columns in your 2D data to become features with one timestep, you can use:

data = data.reshape((data.shape[0], 1, data.shape[1]))
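To make the two reshapes concrete, here is a minimal NumPy sketch. The 4-by-3 array is a made-up stand-in for a loaded dataset, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical 2D dataset: 4 rows (samples) by 3 columns.
data = np.arange(12, dtype=float).reshape(4, 3)

# Columns become timesteps of a single feature: [samples, timesteps, 1].
as_timesteps = data.reshape((data.shape[0], data.shape[1], 1))

# Columns become features of a single timestep: [samples, 1, features].
as_features = data.reshape((data.shape[0], 1, data.shape[1]))

print(as_timesteps.shape)  # (4, 3, 1)
print(as_features.shape)   # (4, 1, 3)
```

Both results hold the same 12 values. Only the interpretation changes: three observations of one variable per row, or one observation of three variables per row.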

You can specify the input_shape argument that expects a tuple containing the number of timesteps and the number of features. For example, if we had two timesteps and one feature for a univariate time series with two lag observations per row, it would be specified as follows:

model = Sequential()
model.add(LSTM(5, input_shape=(2, 1)))
model.add(Dense(1))

LSTM layers can be stacked by adding them to the Sequential model. Importantly, when stacking LSTM layers, we must output a sequence rather than a single value for each input so that the subsequent LSTM layer can have the required 3D input. We can do this by setting the return_sequences argument to True. For example:

model = Sequential()
model.add(LSTM(5, input_shape=(2, 1), return_sequences=True))
model.add(LSTM(5))
model.add(Dense(1))

Think of a Sequential model as a pipeline with your raw data fed in at one end and predictions that come out at the other.

This is a helpful container in Keras as concerns that were traditionally associated with a layer can also be split out and added as separate layers, clearly showing their role in the transform of data from input to prediction.

For example, activation functions that transform a summed signal from each neuron in a layer can be extracted and added to the Sequential as a layer-like object called Activation.

model = Sequential()
model.add(LSTM(5, input_shape=(2, 1)))
model.add(Dense(1))
model.add(Activation('sigmoid'))

The choice of activation function is most important for the output layer as it will define the format that predictions will take.

For example, below are some common predictive modeling problem types and the structure and standard activation function that you can use in the output layer:

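Regression: Linear activation function, or 'linear', and the number of neurons matching the number of outputs.

Binary Classification (2 class): Logistic activation function, or 'sigmoid', and one neuron in the output layer.

Multiclass Classification (>2 class): Softmax activation function, or 'softmax', and one output neuron per class value, assuming a one-hot encoded output pattern.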
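To see how the output activation shapes the predictions, the sketch below applies the three activations most often paired with regression, binary classification, and multiclass classification to the same raw values. It is plain NumPy rather than Keras code, and the raw values are invented for illustration.

```python
import numpy as np

def linear(z):
    # Regression: the raw value is passed through unchanged.
    return z

def sigmoid(z):
    # Binary classification: squashes each value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multiclass classification: each row becomes probabilities summing to 1.
    shifted = z - np.max(z, axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical pre-activation outputs for a single sample.
raw = np.array([[2.0, 0.0, -1.0]])

regression_out = linear(raw)       # unbounded real values
binary_out = sigmoid(raw[:, :1])   # one probability per sample
multiclass_out = softmax(raw)      # one probability per class

print(regression_out)
print(binary_out)
print(multiclass_out)
```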

Step 2. Compile Network

Once we have defined our network, we must compile it.

Compilation is an efficiency step. It transforms the simple sequence of layers that we defined into a highly efficient series of matrix transforms in a format intended to be executed on your GPU or CPU, depending on how Keras is configured.

Think of compilation as a precompute step for your network. It is always required after defining a model.

Compilation requires a number of parameters tailored to training your network: the optimization algorithm used to train the network, and the loss function that the optimization algorithm minimizes and that is used to evaluate the network.

For example, below is a case of compiling a defined model and specifying the stochastic gradient descent (sgd) optimization algorithm and the mean squared error (mean_squared_error) loss function, intended for a regression type problem.

model.compile(optimizer='sgd', loss='mean_squared_error')

Alternately, the optimizer can be created and configured before being provided as an argument to the compilation step.

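from keras.optimizers import SGD
# lr is the argument name in Keras 2.0; later releases call it learning_rate
algorithm = SGD(lr=0.1, momentum=0.3)
model.compile(optimizer=algorithm, loss='mean_squared_error')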

The type of predictive modeling problem imposes constraints on the type of loss function that can be used.

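For example, below are some standard loss functions for different predictive model types:

Regression: Mean Squared Error, or 'mean_squared_error'.

Binary Classification (2 class): Logarithmic Loss, also called cross entropy, or 'binary_crossentropy'.

Multiclass Classification (>2 class): Multiclass Logarithmic Loss, or 'categorical_crossentropy'.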
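As a rough illustration of what a loss function measures, here is a minimal NumPy sketch of mean squared error (a regression loss) and binary cross-entropy (a two-class classification loss). These are simplified re-implementations for intuition, not the Keras internals, and the target and prediction values are made up.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between target and prediction.
    return float(np.mean((y_true - y_pred) ** 2))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# Hypothetical regression targets and predictions.
y_reg_true = np.array([0.1, 0.2, 0.3])
y_reg_pred = np.array([0.1, 0.25, 0.2])
mse = mean_squared_error(y_reg_true, y_reg_pred)

# Hypothetical binary labels and predicted probabilities.
y_cls_true = np.array([1.0, 0.0, 1.0])
y_cls_pred = np.array([0.9, 0.2, 0.6])
bce = binary_crossentropy(y_cls_true, y_cls_pred)

print(mse)
print(bce)
```

In both cases a lower value means the predictions are closer to the targets, which is why the optimizer tries to minimize it.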

The most common optimization algorithm is stochastic gradient descent, but Keras also supports a suite of other state-of-the-art optimization algorithms that work well with little or no configuration.

Perhaps the most commonly used optimization algorithms because of their generally better performance are:

Stochastic Gradient Descent, or 'sgd', that requires the tuning of a learning rate and momentum.

ADAM, or 'adam', that requires the tuning of learning rate.

RMSprop, or 'rmsprop', that requires the tuning of learning rate.
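To give a feel for what the learning rate and momentum control, below is a simplified sketch of the classical SGD-with-momentum update applied to a toy one-dimensional function. The function sgd_momentum_step and the toy objective are illustrative assumptions; this is not how Keras implements its optimizer internally.

```python
def sgd_momentum_step(w, velocity, grad, lr=0.1, momentum=0.3):
    # The velocity keeps a decaying memory of past steps (momentum)
    # and adds a new step against the gradient, scaled by the learning rate.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = 0.0, 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w, v = sgd_momentum_step(w, v, grad)

print(w)  # ends very close to the minimum at w = 3
```

A learning rate that is too large makes the steps overshoot, while one that is too small makes progress slow, which is why these values usually need tuning for each problem.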

Finally, you can also specify metrics to collect while fitting your model in addition to the loss function. Generally, the most useful additional metric to collect is accuracy for classification problems. The metrics to collect are specified by name in an array.
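For example, a sketch of compiling a small binary classifier that also reports accuracy might look as follows. The layer sizes and input shape are arbitrary choices for illustration, and the snippet assumes Keras is installed as described in the Environment section.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Hypothetical binary classifier over 2 timesteps of 1 feature.
model = Sequential()
model.add(LSTM(5, input_shape=(2, 1)))
model.add(Dense(1, activation='sigmoid'))

# Metrics are passed by name in a list and are reported alongside the loss.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```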

Step 3. Fit Network

Once the network is compiled, it can be fit, which means adapting the weights on a training dataset.

Fitting the network requires the training data to be specified, both a matrix of input patterns, X, and an array of matching output patterns, y.

The network is trained using the backpropagation algorithm and optimized according to the optimization algorithm and loss function specified when compiling the model.

The backpropagation algorithm requires that the network be trained for a specified number of epochs or exposures to the training dataset.

Each epoch can be partitioned into groups of input-output pattern pairs called batches. This defines the number of patterns that the network is exposed to before the weights are updated within an epoch. It is also an efficiency optimization, ensuring that not too many input patterns are loaded into memory at a time.
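The relationship between samples, batch size, and weight updates can be sketched in plain Python. The helper names and the sample counts below are hypothetical, and the sketch assumes the final, smaller batch in an epoch is still used for an update.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # One weight update per batch; a final partial batch still counts.
    return math.ceil(n_samples / batch_size)

def iter_batches(n_samples, batch_size):
    # Yield (start, end) index pairs covering the dataset once.
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, batch_size, epochs = 95, 10, 100  # hypothetical settings

print(updates_per_epoch(n_samples, batch_size))           # 10
print(updates_per_epoch(n_samples, batch_size) * epochs)  # 1000
print(list(iter_batches(25, 10)))  # [(0, 10), (10, 20), (20, 25)]
```

Setting the batch size equal to the number of samples gives exactly one update per epoch, while a batch size of 1 gives one update per pattern.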

A minimal example of fitting a network is as follows:

history = model.fit(X, y, batch_size=10, epochs=100)

Once fit, a history object is returned that provides a summary of the performance of the model during training. This includes both the loss and any additional metrics specified when compiling the model, recorded each epoch.

Training can take a long time, from seconds to hours to days depending on the size of the network and the size of the training data.

By default, a progress bar is displayed on the command line for each epoch. This may create too much noise for you, or may cause problems for your environment, such as if you are in an interactive notebook or IDE.

You can reduce the amount of information displayed to just the loss each epoch by setting the verbose argument to 2. You can turn off all output by setting verbose to 0. For example:

history = model.fit(X, y, batch_size=10, epochs=100, verbose=0)

Step 4. Evaluate Network

Once the network is trained, it can be evaluated.

The network can be evaluated on the training data, but this will not provide a useful indication of the performance of the network as a predictive model, as it has seen all of this data before.

We can evaluate the performance of the network on a separate dataset, unseen during training. This will provide an estimate of the performance of the network at making predictions for unseen data in the future.

The model evaluates the loss across all of the test patterns, as well as any other metrics specified when the model was compiled, like classification accuracy. A list of evaluation metrics is returned.

For example, for a model compiled with the accuracy metric, we could evaluate it on a new dataset as follows:

loss, accuracy = model.evaluate(X, y)

As with fitting the network, verbose output is provided to give an idea of the progress of evaluating the model. We can turn this off by setting the verbose argument to 0.

loss, accuracy = model.evaluate(X, y, verbose=0)

Step 5. Make Predictions

Once we are satisfied with the performance of our fit model, we can use it to make predictions on new data.

This is as easy as calling the predict() function on the model with an array of new input patterns.

For example:

predictions = model.predict(X)

The predictions will be returned in the format provided by the output layer of the network.

In the case of a regression problem, these predictions may be in the format of the problem directly, provided by a linear activation function.

For a binary classification problem, the predictions may be an array of probabilities for the first class that can be converted to a 1 or 0 by rounding.

For a multiclass classification problem, the results may be in the form of an array of probabilities (assuming a one hot encoded output variable) that may need to be converted to a single class output prediction using the argmax() NumPy function.
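The two conversions described above can be sketched with NumPy. The probability arrays are invented stand-ins for what predict() might return from a sigmoid output and from a softmax output.

```python
import numpy as np

# Hypothetical predict() output for a binary problem: one probability per sample.
binary_probs = np.array([[0.91], [0.08], [0.55]])

# Hypothetical predict() output for a 3-class problem: one probability per class.
multiclass_probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
])

# Binary: round each probability to 0 or 1.
binary_classes = np.round(binary_probs).astype(int).ravel()

# Multiclass: pick the index of the largest probability in each row.
multiclass_classes = np.argmax(multiclass_probs, axis=1)

print(binary_classes)      # [1 0 1]
print(multiclass_classes)  # [1 0 2]
```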

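Alternately, for classification problems, we can use the predict_classes() function on a Sequential model, which automatically converts the raw probability predictions to crisp integer class values. This function belongs to the Keras 2.0-era API assumed in this tutorial; it was removed from later TensorFlow-bundled Keras releases, where rounding or applying argmax() to the output of predict() is the replacement.

predictions = model.predict_classes(X)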

As with fitting and evaluating the network, verbose output is provided to give an idea of the progress of the model making predictions. We can turn this off by setting the verbose argument to 0.

predictions = model.predict(X, verbose=0)

End-to-End Worked Example

Let’s tie all of this together with a small worked example.

This example will use a simple problem of learning a sequence of 10 numbers. We will show the network a number, such as 0.0 and expect it to predict 0.1. Then show it 0.1 and expect it to predict 0.2, and so on to 0.9.

Define Network: We will construct an LSTM neural network with 1 input timestep and 1 input feature in the visible layer, 10 memory units in the LSTM hidden layer, and 1 neuron in the fully connected output layer with a linear (default) activation function.

Compile Network: We will use the efficient ADAM optimization algorithm with default configuration and the mean squared error loss function because it is a regression problem.

Fit Network: We will fit the network for 1,000 epochs and use a batch size equal to the number of patterns in the training set. We will also turn off all verbose output.

Evaluate Network: We will evaluate the network on the training dataset. Typically we would evaluate the model on a test or validation set.

Make Predictions: We will make predictions for the training input data. Again, typically we would make predictions on data where we do not know the right answer.

The complete code listing is provided below.

# Example of LSTM to learn a sequence
from pandas import DataFrame
from pandas import concat
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
print(sequence)
# create X/y pairs
df = DataFrame(sequence)
df = concat([df.shift(1), df], axis=1)
df.dropna(inplace=True)
# convert to LSTM friendly format
values = df.values
X, y = values[:, 0], values[:, 1]
X = X.reshape(len(X), 1, 1)
# 1. define network
model = Sequential()
model.add(LSTM(10, input_shape=(1, 1)))
model.add(Dense(1))
# 2. compile network
model.compile(optimizer='adam', loss='mean_squared_error')
# 3. fit network
history = model.fit(X, y, epochs=1000, batch_size=len(X), verbose=0)
# 4. evaluate network
loss = model.evaluate(X, y, verbose=0)
print(loss)
# 5. make predictions
predictions = model.predict(X, verbose=0)
print(predictions[:, 0])

Running this example produces the following output, showing the raw input sequence of 10 numbers, the mean squared error loss of the network when making predictions for the entire sequence, and the predictions for each input pattern.

Outputs were spaced out for readability.

We can see the sequence is learned well, especially if we round predictions to the first decimal place.

Hello,
I have 3 classes and want to design an LSTM for 3-class classification. Any suggestions on what I am doing wrong here?
I get the below error :
Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=2

Hi Jason, I enjoy reading your blogs, you have one of the finest explanations out here.

I am trying to understand LSTMs but am still a little confused about the dimensions of input/output. Some details that I found in this post aren't mentioned anywhere else, e.g.:

"If you would like columns to become timesteps for one feature, you can use:
data = data.reshape((data.shape[0], data.shape[1], 1))
If you would like columns in your 2D data to become features with one timestep, you can use:
data = data.reshape((data.shape[0], 1, data.shape[1]))"

It would be really helpful if you could explain in detail how to reshape the data for different types of LSTM networks: one-to-one, many-to-one, and many-to-many.