Demonstration of Memory with a Long Short-Term Memory Network in Python

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning over long sequences.

This differentiates them from regular multilayer neural networks that do not have memory and can only learn a mapping between input and output patterns.

It is important to understand the capabilities of complex neural networks like LSTMs on small contrived problems as this understanding will help you scale the network up to large and even very large problems.

In this tutorial, you will discover the capability of LSTMs to remember and recall.

After completing this tutorial, you will know:

How to define a small sequence prediction problem that only an RNN like the LSTM can solve using memory.

How to transform the problem representation so that it is suitable for learning by LSTMs.

How to design an LSTM to solve the problem correctly.

Let’s get started.

A Demonstration of Memory in a Long Short-Term Memory Network. Photo by crazlei, some rights reserved.

Environment

This tutorial assumes you have a working Python 2 or 3 environment with NumPy, Pandas, and SciPy installed, as well as Keras 2.0 or higher with a TensorFlow or Theano backend.

Sequence Problem Description

Given one value in the sequence, the model must predict the next value in the sequence. For example, given a value of “0” as an input, the model must predict the value “1”.

There are two different sequences that the model must learn and correctly predict.

A wrinkle is that there is conflicting information between the two sequences, and the model must know the context of each one-step prediction (e.g. the sequence it is currently predicting) in order to correctly predict each full sequence.

This wrinkle is important to prevent the model from memorizing each single-step input-output pair of values in each sequence, as a sequence-unaware model may be inclined to do.

The two sequences to be learned are as follows:

3, 0, 1, 2, 3

4, 0, 1, 2, 4

We can see that the first value of the sequence is repeated as the last value of the sequence. This is the indicator that provides context to the model as to which sequence it is working on.

The conflict is in the transition from the second-to-last item to the last item in each sequence. In sequence one, a “2” is given as input and a “3” must be predicted, whereas in sequence two, a “2” is given as input and a “4” must be predicted.
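Written out as one-step input-to-output pairs, the two sequences imply:

3 -> 0, 0 -> 1, 1 -> 2, 2 -> 3 (sequence one)

4 -> 0, 0 -> 1, 1 -> 2, 2 -> 4 (sequence two)

The pairs 0 -> 1 and 1 -> 2 are shared between the sequences; only the pair starting with “2” differs.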

This is a problem that a multilayer Perceptron and other non-recurrent neural networks cannot learn.

This is a simplified version of “Experiment 2” used to demonstrate LSTM long-term memory capabilities in Hochreiter and Schmidhuber’s 1997 paper, Long Short-Term Memory.

Each value in a sequence is one hot encoded as a binary vector with 5 columns, one for each possible value, and the encoded sequence is split into input-output pairs. Keras LSTMs expect input in a 3D format of [samples, time steps, features]. In the case of one sequence of input data, the dimensions will be [4, 1, 5] because we have 4 rows of data, 1 time step for each row, and 5 columns in each row.

We can create a 2D NumPy array from our list of X patterns, then reshape it into the required 3D format. For example:

df = DataFrame(X)
values = df.values
array = values.reshape(4, 1, 5)

We must also convert the list of output patterns (y) into a 2D NumPy Array.

Below is a function named to_lstm_dataset() that takes a sequence and the size of the sequence alphabet as inputs and returns X and y datasets ready for use with an LSTM. It performs the required conversions of the sequence to a one-hot encoding and to input-output pairs before reshaping the data.

# convert sequence to x/y pairs ready for use with an LSTM
def to_lstm_dataset(sequence, n_unique):
    # one hot encode
    encoded = encode(sequence, n_unique)
    # convert to in/out patterns
    X, y = to_xy_pairs(encoded)
    # convert to LSTM friendly format
    dfX, dfy = DataFrame(X), DataFrame(y)
    lstmX = dfX.values
    lstmX = lstmX.reshape(lstmX.shape[0], 1, lstmX.shape[1])
    lstmY = dfy.values
    return lstmX, lstmY
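The encode() and to_xy_pairs() helper functions used inside to_lstm_dataset() are defined earlier in the full tutorial. As a guide, a minimal sketch consistent with how they are used here might look as follows (an illustrative reconstruction, not necessarily the original implementation):

# one hot encode a sequence of integers as a list of binary vectors (sketch of assumed helper)
def encode(sequence, n_unique):
    encoded = list()
    for value in sequence:
        vector = [0.0 for _ in range(n_unique)]
        vector[value] = 1.0
        encoded.append(vector)
    return encoded

# convert a list of encoded vectors into one-step input/output pairs (sketch of assumed helper)
def to_xy_pairs(encoded):
    X, y = list(), list()
    for i in range(1, len(encoded)):
        X.append(encoded[i - 1])
        y.append(encoded[i])
    return X, y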

This function can be called with each sequence as follows:

seq1 = [3, 0, 1, 2, 3]
seq2 = [4, 0, 1, 2, 4]
n_unique = len(set(seq1 + seq2))
seq1X, seq1Y = to_lstm_dataset(seq1, n_unique)
seq2X, seq2Y = to_lstm_dataset(seq2, n_unique)
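As a quick sanity check (not part of the original listing), the shapes of the prepared arrays can be printed; with 4 input-output pairs per sequence and a 5-wide one-hot encoding, we expect (4, 1, 5) for each X and (4, 5) for each y:

# confirm the prepared data matches the expected LSTM input shape
print(seq1X.shape, seq1Y.shape)
print(seq2X.shape, seq2Y.shape)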

We now have all of the pieces to prepare the data for the LSTM.

Learn Sequences with an LSTM

In this section, we will define the LSTM to learn the input sequences.

This section is divided into 4 parts; they are:

LSTM Configuration

LSTM Training

LSTM Evaluation

LSTM Complete Example

LSTM Configuration

We want the LSTM to make one-step predictions, which we have defined in the format and shape of our dataset. We also want the LSTM to be updated with errors after each time step, which means we will need to use a batch size of one.

Keras LSTMs are not stateful between batches by default. We can make them stateful by setting the stateful argument on the LSTM layer to True and managing the training epochs manually to ensure that the internal state of the LSTM is reset after each sequence.

We must define the shape of the input batch using the batch_input_shape argument with 3 dimensions [batch size, time steps, features], which will be 1, 1, and 5 respectively.

The network topology will be configured with one hidden LSTM layer with 20 units and a Dense output layer with 5 outputs, one for each of the 5 columns in an output pattern. A sigmoid (logistic) activation function will be used on the output layer because of the binary outputs, and the default tanh (hyperbolic tangent) activation function will be used on the LSTM layer.

A log loss (binary cross-entropy) function will be optimized when fitting the network because of the binary outputs, and the efficient Adam optimization algorithm will be used with all default parameters.

The Keras code to define the LSTM network for this problem is listed below.

model = Sequential()
model.add(LSTM(20, batch_input_shape=(1, 1, 5), stateful=True))
model.add(Dense(5, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

LSTM Training

We must fit the model manually one epoch at a time.

Within one epoch we can fit the model on each sequence, being sure to reset state after each sequence.

The model does not need to be trained for long given the simplicity of the problem; in this case only 250 epochs are required.

Below is an example of how the model can be fit on each sequence across all epochs.

# train LSTM
for i in range(250):
    model.fit(seq1X, seq1Y, epochs=1, batch_size=1, verbose=1, shuffle=False)
    model.reset_states()
    model.fit(seq2X, seq2Y, epochs=1, batch_size=1, verbose=0, shuffle=False)
    model.reset_states()

I like to see some feedback on the loss function when fitting a network, so verbose output is turned on for one of the sequences, but not the other.

LSTM Evaluation

Next, we can evaluate the fit model by predicting each step of the learned sequences.

We can do this by predicting the outputs for each sequence.

The predict_classes() function can be used on the LSTM model to predict the class directly. It does this by performing an argmax() on the output binary vector and returning the index of the column with the largest output. The output indices map perfectly onto the integers used in the sequence (by careful design above). An example of making a prediction is listed below:

result = model.predict_classes(seq1X, batch_size=1, verbose=0)

We can make a prediction, then print the result in the context of the input pattern and the expected output pattern for each step of the sequence.
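For example, a loop along the following lines (a sketch using the seq1 list and the fit model from above) prints each input value, the expected next value, and the predicted value, matching the output format shown later:

# predict sequence 1 and print each step in context
result = model.predict_classes(seq1X, batch_size=1, verbose=0)
model.reset_states()
for i in range(len(result)):
    print('X=%.1f y=%.1f, yhat=%.1f' % (seq1[i], seq1[i + 1], result[i]))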

LSTM Complete Example

We can now tie the whole tutorial together.

The complete code listing is provided below.

First, the data is prepared, then the model is fit and the predictions of both sequences are printed.
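The listing below assembles the pieces shown above into a single script. Note that the encode() and to_xy_pairs() helpers are the illustrative reconstructions sketched earlier, so treat this as a close approximation of the tutorial code rather than a verbatim copy.

from pandas import DataFrame
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# one hot encode a sequence of integers as a list of binary vectors (reconstructed helper)
def encode(sequence, n_unique):
    encoded = list()
    for value in sequence:
        vector = [0.0 for _ in range(n_unique)]
        vector[value] = 1.0
        encoded.append(vector)
    return encoded

# convert a list of encoded vectors into one-step input/output pairs (reconstructed helper)
def to_xy_pairs(encoded):
    X, y = list(), list()
    for i in range(1, len(encoded)):
        X.append(encoded[i - 1])
        y.append(encoded[i])
    return X, y

# convert sequence to x/y pairs ready for use with an LSTM
def to_lstm_dataset(sequence, n_unique):
    encoded = encode(sequence, n_unique)
    X, y = to_xy_pairs(encoded)
    dfX, dfy = DataFrame(X), DataFrame(y)
    lstmX = dfX.values
    lstmX = lstmX.reshape(lstmX.shape[0], 1, lstmX.shape[1])
    lstmY = dfy.values
    return lstmX, lstmY

# define the two sequences and prepare the data
seq1 = [3, 0, 1, 2, 3]
seq2 = [4, 0, 1, 2, 4]
n_unique = len(set(seq1 + seq2))
seq1X, seq1Y = to_lstm_dataset(seq1, n_unique)
seq2X, seq2Y = to_lstm_dataset(seq2, n_unique)

# define the stateful LSTM network
model = Sequential()
model.add(LSTM(20, batch_input_shape=(1, 1, 5), stateful=True))
model.add(Dense(5, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# train the network, resetting state after each sequence
for i in range(250):
    model.fit(seq1X, seq1Y, epochs=1, batch_size=1, verbose=1, shuffle=False)
    model.reset_states()
    model.fit(seq2X, seq2Y, epochs=1, batch_size=1, verbose=0, shuffle=False)
    model.reset_states()

# evaluate on sequence 1
print('Sequence 1')
result = model.predict_classes(seq1X, batch_size=1, verbose=0)
model.reset_states()
for i in range(len(result)):
    print('X=%.1f y=%.1f, yhat=%.1f' % (seq1[i], seq1[i + 1], result[i]))

# evaluate on sequence 2
print('Sequence 2')
result = model.predict_classes(seq2X, batch_size=1, verbose=0)
model.reset_states()
for i in range(len(result)):
    print('X=%.1f y=%.1f, yhat=%.1f' % (seq2[i], seq2[i + 1], result[i]))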

Running the example provides feedback regarding the model’s loss on the first sequence each epoch.

At the end of the run, each sequence is printed in the context of the predictions.

...
4/4 [==============================] - 0s - loss: 0.0930
Epoch 1/1
4/4 [==============================] - 0s - loss: 0.0927
Epoch 1/1
4/4 [==============================] - 0s - loss: 0.0925
Sequence 1
X=3.0 y=0.0, yhat=0.0
X=0.0 y=1.0, yhat=1.0
X=1.0 y=2.0, yhat=2.0
X=2.0 y=3.0, yhat=3.0
Sequence 2
X=4.0 y=0.0, yhat=0.0
X=0.0 y=1.0, yhat=1.0
X=1.0 y=2.0, yhat=2.0
X=2.0 y=4.0, yhat=4.0

The results show two important things:

That the LSTM correctly learned each sequence one step at a time.

That the LSTM used the context of each sequence to correctly resolve the conflicting input pairs.

In essence, the LSTM was able to remember the input pattern from the beginning of the sequence, 3 time steps earlier, to correctly predict the last value in the sequence.

This memory and ability of LSTMs to relate observations distant in time is the key capability that makes LSTMs so powerful and why they are so widely used.

Although the example is trivial, LSTMs are able to demonstrate this same capability across 100s, and even 1000s, of time steps.

Extensions

This section lists ideas for extensions to the examples in this tutorial.

Tuning. The configurations for the LSTM (epochs, units, etc.) were chosen after some trial and error. It is possible that a much simpler configuration can achieve the same result on this problem. Some search of parameters is required.

Arbitrary Alphabets. The alphabet of 5 integers was chosen arbitrarily. This could be changed to other symbols and larger alphabets.

Long Sequences. The sequences used in this example were very short. The LSTM is able to demonstrate the same capability on much longer sequences of 100s and 1000s of time steps.

Random Sequences. The sequences used in this tutorial were linearly increasing. New sequences of random values can be created, allowing the LSTM to devise a generalized solution rather than one specialized to the two sequences used in this tutorial.

Batch Learning. Updates were made to the LSTM after each time step. Explore using batch updates to see if this improves learning or not.

Shuffle Epoch. The sequences were shown in the same order each epoch during training and again during evaluation. Randomize the order in which sequences 1 and 2 are fit within each epoch, which might improve the generalization of the model to new, unseen sequences with the same alphabet.

Did you explore any of these extensions?
Share your results in the comments below. I’d love to see what you came up with.

Further Reading

I strongly recommend reading the original 1997 LSTM paper by Hochreiter and Schmidhuber; it is very good.

Hi Jason, thanks for your response.
If we set n_batch = 4, it will not converge and results in either 0123 for both sequences or 0124.
It behaves as if it doesn’t keep state.
The only difference I see is that with a batch size of 1 the model makes 4 weight updates while passing through a sequence before moving on to the second sequence, whereas with a batch size of 4 it makes just one weight update, then resets state and moves on to the second sequence.

I also tried concatenating both sequences so I could run one sequence of 8 pairs, and I still get the same result: it memorizes the whole sequence correctly if the batch size is 1 (i.e. 01230124), otherwise if I set the batch size to 4 or 8 it results in either 01230123 or 01240124 (i.e. it does not converge).

What am I missing here?

I also tried other examples in your course, one of them “Understanding Stateful LSTM Recurrent Neural Networks in Python with Keras”, where the network learned the alphabet successfully when increasing the batch size from 1 to the size of the training dataset, batch_size=len(dataX)=26.