Understand the Difference Between Return Sequences and Return States for LSTMs in Keras

The Keras deep learning library provides an implementation of the Long Short-Term Memory, or LSTM, recurrent neural network.

As part of this implementation, the Keras API provides access to both return sequences and return state. The use and difference between these data can be confusing when designing sophisticated recurrent neural network models, such as the encoder-decoder model.

In this tutorial, you will discover the difference and result of return sequences and return states for LSTM layers in the Keras deep learning library.

After completing this tutorial, you will know:

That return sequences return the hidden state output for each input time step.

That return state returns the hidden state output and cell state for the last input time step.

That return sequences and return state can be used at the same time.

Let’s get started.

Photo by Adrian Curt Dannemann, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

Long Short-Term Memory

Return Sequences

Return States

Return States and Sequences

Long Short-Term Memory

The Long Short-Term Memory, or LSTM, is a recurrent neural network that is composed of internal gates.

Unlike other recurrent neural networks, the network’s internal gates allow the model to be trained successfully using backpropagation through time, or BPTT, and avoid the vanishing gradients problem.

In the Keras deep learning library, LSTM layers can be created using the LSTM() class.

Creating a layer of LSTM memory units allows you to specify the number of memory units within the layer.
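For example, a layer with 32 memory units (an arbitrary count, chosen here just for illustration) might be defined as follows:

from keras.layers import LSTM

# an LSTM layer with 32 memory units (cells)
layer = LSTM(32)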

Each unit or cell within the layer has an internal cell state, often abbreviated as “c”, and outputs a hidden state, often abbreviated as “h”.

The Keras API allows you to access these data, which can be useful or even required when developing sophisticated recurrent neural network architectures, such as the encoder-decoder model.

For the rest of this tutorial, we will look at the API for accessing these data.

Return Sequences

Each LSTM cell will output one hidden state h for each input.

h = LSTM(X)

We can demonstrate this in Keras with a very small model with a single LSTM layer that itself contains a single LSTM cell.

In this example, we will have one input sample with 3 time steps and one feature observed at each time step.
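A minimal sketch of such a model follows (the variable names inputs1, lstm1, and so on are illustrative); it first prints the single hidden state output for the last time step, then sets return_sequences=True to print one hidden state output per input time step.

from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array

# define model: a single LSTM cell reading 3 time steps of 1 feature
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)

# define input data: one sample, 3 time steps, 1 feature
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# prints one hidden state value: the output for the last time step only
print(model.predict(data))

# the same model with return_sequences=True returns the hidden state
# output for every input time step, with shape (1, 3, 1)
inputs2 = Input(shape=(3, 1))
lstm2 = LSTM(1, return_sequences=True)(inputs2)
model2 = Model(inputs=inputs2, outputs=lstm2)
print(model2.predict(data))

With the default return_sequences=False, the layer outputs one hidden state value per cell for the last time step only; with return_sequences=True, it outputs a 3D array with one hidden state value per cell for each input time step. Stacked LSTM layers and sequence-output models such as the encoder-decoder require this sequence output.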

Return States

The output of an LSTM cell or layer of cells is called the hidden state.

This is confusing, because each LSTM cell retains an internal state that is not output, called the cell state, or c.

Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.

Keras provides the return_state argument on the LSTM layer that gives access to the hidden state output (state_h) and the cell state (state_c) for the last input time step. For example:

lstm1, state_h, state_c = LSTM(1, return_state=True)

This may look confusing because both lstm1 and state_h refer to the same hidden state output. The reason for these two tensors being separate will become clear in the next section.

We can demonstrate access to the hidden and cell states of the cells in the LSTM layer with a worked example listed below.

from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array

# define model
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

# define input data
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# make and show prediction
print(model.predict(data))

Running the example returns 3 arrays:

The LSTM hidden state output for the last time step.

The LSTM hidden state output for the last time step (again).

The LSTM cell state for the last time step.

[array([[ 0.10951342]], dtype=float32),
 array([[ 0.10951342]], dtype=float32),
 array([[ 0.24143776]], dtype=float32)]

The hidden state and the cell state could in turn be used to initialize the states of another LSTM layer with the same number of cells.
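As a sketch of how this looks (the encoder_inputs and decoder_inputs names here are illustrative, not from the worked example above), the functional API accepts an initial_state argument when an LSTM layer is called on its input:

from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM

# encoder: one LSTM cell returning its final hidden and cell states
encoder_inputs = Input(shape=(3, 1))
encoder_outputs, state_h, state_c = LSTM(1, return_state=True)(encoder_inputs)

# decoder: an LSTM layer with the same number of cells, initialized
# with the encoder's final states via the initial_state argument
decoder_inputs = Input(shape=(3, 1))
decoder_outputs = LSTM(1, return_sequences=True)(decoder_inputs, initial_state=[state_h, state_c])

model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_outputs)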

Return States and Sequences

We can access both the sequence of hidden state outputs and the final hidden and cell states at the same time.

This can be done by configuring the LSTM layer to both return sequences and return states.
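A minimal sketch, reusing the model from the worked example above with both arguments set:

from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array

# define model: return one hidden state per time step, plus final states
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

# define input data
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# make and show prediction: lstm1 now holds the hidden state for each
# of the 3 time steps; state_h is the hidden state for the last time
# step only; state_c is the cell state for the last time step
print(model.predict(data))

Because the layer now returns sequences, lstm1 contains one hidden state output per input time step, while state_h is the hidden state for the last time step only (the last value in lstm1). This is why lstm1 and state_h are kept as separate tensors: they only coincide when return_sequences is False.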
