This architecture was originally referred to as a Long-term Recurrent Convolutional Network, or LRCN model. In this lesson we will use the more generic name “CNN LSTM” to refer to LSTMs that use a CNN as a front end.

This architecture is used for the task of generating textual descriptions of images. Key is the use of a CNN that is pre-trained on a challenging image classification task that is re-purposed as a feature extractor for the caption generating problem.

… it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences

Implement CNN LSTM in Keras

A CNN LSTM can be defined by adding CNN layers on the front end followed by LSTM layers with a Dense layer on the output.

It is helpful to think of this architecture as defining two sub-models: the CNN Model for feature extraction and the LSTM Model for interpreting the features across time steps.

Let’s take a look at both of these sub-models in the context of a sequence of 2D inputs, which we will assume are images.

CNN Model

As a refresher, we can define a 2D convolutional network as comprised of Conv2D and MaxPooling2D layers ordered into a stack of the required depth.

The Conv2D layers will interpret snapshots of the image (e.g. small squares) and the pooling layers will consolidate or abstract the interpretation.

For example, consider a CNN that reads in 10×10 pixel images with 1 channel (e.g. black and white). A Conv2D layer with a single filter and “same” padding will read the image in 2×2 snapshots and output one new 10×10 interpretation of the image. The MaxPooling2D layer will pool the interpretation into 2×2 blocks, reducing the output to a 5×5 consolidation. The Flatten layer will take the single 5×5 map and transform it into a 25-element vector ready for some other layer to deal with, such as a Dense layer for outputting a prediction.
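The configuration just described can be sketched in Keras as follows. This is a minimal sketch, assuming the tensorflow.keras API; the ReLU activation and the variable name cnn are illustrative choices rather than details given in the text.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten

cnn = Sequential()
# 10x10 pixel images with 1 channel
cnn.add(Input(shape=(10, 10, 1)))
# one filter, 2x2 kernel; 'same' padding keeps the 10x10 size
cnn.add(Conv2D(1, (2, 2), activation='relu', padding='same'))
# pool into 2x2 blocks: 10x10 -> 5x5
cnn.add(MaxPooling2D(pool_size=(2, 2)))
# flatten the single 5x5 map into a 25-element vector
cnn.add(Flatten())

print(cnn.output_shape)  # (None, 25)
```

The None in the output shape is the batch dimension, which is left unspecified until data is passed in.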

This makes sense for image classification and other computer vision tasks.

LSTM Model

The CNN model above is only capable of handling a single image, transforming it from input pixels into an internal matrix or vector representation.

We need to repeat this operation across multiple images and allow the LSTM to build up internal state and update weights using backpropagation through time (BPTT) across a sequence of the internal vector representations of input images.

The CNN could be fixed, as when an existing pre-trained model like VGG is used for feature extraction from images. Alternatively, the CNN may not be pre-trained, and we may wish to train it by backpropagating error from the LSTM across multiple input images to the CNN model.

In both of these cases, conceptually there is a single CNN model and a sequence of LSTM models, one for each time step. We want to apply the CNN model to each input image and pass on the output of each input image to the LSTM as a single time step.

We can achieve this by wrapping the entire CNN input model (one layer or more) in a TimeDistributed layer. This layer achieves the desired outcome of applying the same layer or layers multiple times. In this case, it applies them to each input time step in turn, providing a sequence of “image interpretations” or “image features” for the LSTM model to work on.

model.add(TimeDistributed(...))
model.add(LSTM(...))
model.add(Dense(...))
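The idea behind TimeDistributed can be illustrated without Keras. The NumPy sketch below is an analogy only, not how Keras implements the layer: a hypothetical extract_features function (2×2 max pooling followed by flattening) stands in for the CNN, and a hypothetical time_distributed helper applies that one function to every time step of a sequence.

```python
import numpy as np

def extract_features(image):
    """Stand-in for the CNN (hypothetical): 2x2 max pooling, then flatten."""
    h, w = image.shape
    # group pixels into 2x2 blocks and keep the maximum of each block
    pooled = image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return pooled.reshape(-1)

def time_distributed(fn, sequence):
    """Apply the same function to every time step and stack the results."""
    return np.stack([fn(step) for step in sequence])

# one sample: a sequence of 5 time steps, each a 10x10 image
frames = np.arange(5 * 10 * 10, dtype=float).reshape(5, 10, 10)
features = time_distributed(extract_features, frames)

print(features.shape)  # (5, 25)
```

Each of the 5 images becomes one 25-element feature vector, and the resulting (time steps, features) array is the kind of sequence an LSTM consumes. Because the same function is reused at every step, there is only one set of CNN weights to learn.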

We now have the two elements of the model; let’s put them together.

CNN LSTM Model

We can define a CNN LSTM model in Keras by first defining the CNN layer or layers, wrapping them in a TimeDistributed layer and then defining the LSTM and output layers.

We have two ways to define the model that are equivalent and only differ as a matter of taste.

You can define the CNN model first, then add it to the LSTM model by wrapping the entire sequence of CNN layers in a TimeDistributed layer, as follows:

# define CNN model
cnn = Sequential()
cnn.add(Conv2D(...))
cnn.add(MaxPooling2D(...))
cnn.add(Flatten())
# define LSTM model
model = Sequential()
model.add(TimeDistributed(cnn, ...))
model.add(LSTM(...))
model.add(Dense(...))

An alternate, and perhaps easier to read, approach is to wrap each layer in the CNN model in a TimeDistributed layer when adding it to the main model.

model = Sequential()
# define CNN model
model.add(TimeDistributed(Conv2D(...)))
model.add(TimeDistributed(MaxPooling2D(...)))
model.add(TimeDistributed(Flatten()))
# define LSTM model
model.add(LSTM(...))
model.add(Dense(...))

The benefit of this second approach is that all of the layers appear in the model summary, and as such it is preferred for now.

You can choose the method that you prefer.
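To make the input and output shapes concrete, here is a small end-to-end sketch of the second approach. It assumes the tensorflow.keras API. The sizes (5 time steps of 10×10 single-channel images), the 16 LSTM units, the single regression output and the random training data are all placeholder assumptions chosen only so that the example runs.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten,
                                     LSTM, Dense, TimeDistributed)

# assumed sizes, for illustration only
n_steps, height, width, channels = 5, 10, 10, 1

model = Sequential()
# one sample is a sequence of images: (time steps, height, width, channels)
model.add(Input(shape=(n_steps, height, width, channels)))
# CNN sub-model, applied to every time step
model.add(TimeDistributed(Conv2D(1, (2, 2), activation='relu', padding='same')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Flatten()))
# LSTM sub-model, reading the sequence of 25-element feature vectors
model.add(LSTM(16))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# random placeholder data: 4 samples, each a sequence of 5 images
X = np.random.rand(4, n_steps, height, width, channels)
y = np.random.rand(4, 1)
model.fit(X, y, epochs=1, batch_size=2, verbose=0)

pred = model.predict(X, verbose=0)
print(pred.shape)
```

The input array has shape (samples, time steps, height, width, channels). Each sample is itself a sequence of images, and the batch size only controls how many such sequences are processed per weight update.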

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Hi, Jason, I am very distressed and would like to ask you a question. For example, with data in the range 0-500, the magnitudes of the values are quite different. When I use an LSTM model to predict, the accuracy is too low. Even data normalization does not help. How should the data be processed? Thank you so much!

Not yet, I’m just waiting for the next TensorFlow release, since it seems that ConvLSTM will be provided as tf.contrib.rnn.ConvLSTMCell. Instead, I’ve used CNN + LSTM on simple speech recognition experiments and it gives better results than a stack of LSTMs. It really works!

Hi, Jason.
Do you think the CNNLSTM can solve the regression problem, whose inputs are some time series data and some properties/exogenous data (spatial), not image data? If yes, how to deal with the properties/exogenous data (2D) in CNN. Thank you.

It might not make sense given that the LSTM is already interpreting the long term relationships in the data.

It might be interesting if the CNN can pick out structure that is new/different from the LSTM. Perhaps you could have both a CNN and LSTM interpretation of the series and use another model to integrate and interpret the results.

I tried to use CNN + LSTM for timeseries forecasting, hoping that CNN can uncover some structure in the input signals. So far, it seems to perform worse than a 2-layered LSTM model, even after tuning hyperparameters. I thought I would get your book to look at the details, but sounds like this was not covered in the book? Your previous posting on LSTM model was very helpful. Thank you!

I’m starting my studies with deep learning, Python and Keras.
I would like to know how to implement the CNN with ELM (extreme learning machine) architecture in Python with Keras for a classification task. Do you have a GitHub implementation?

You also need to specify a batch size in the input dimensions to that layer I guess, to get the fifth dimension. Try using: model.add(TimeDistributed(cnn, input_shape=(None, num_timesteps, 224, 224, num_chan))). The None will then allow variable batch size.

Assume there is a data set with time series data (e.g. temperature, rainfall) and geographic data (e.g. elevation, slope) for many grid positions, and I need to use it to predict (regression) future weather.

I think an LSTM (for the time series data) plus an auxiliary input (for the geographic data) could be a solution, but the forecast results are not very good. Do you have other, better methods? Or do you have any related lessons?

Hi Jason, thanks a lot for this. I am having trouble implementing the same architecture of TimeDistributed CNN with LSTM using the functional API. It throws an error when I pass the TimeDistributed layer to the max pooling step, saying the input is not a tensor. Could you please post a few lines of code for feeding the TimeDistributed CNN output into an LSTM using the functional API?

Nice intro, but it’s very incomplete. After reading this I know how to build a CNN LSTM, but I still don’t have any concept of what the input to it looks like, and therefore I don’t know how to train it. What does the input to the network look like, exactly? How do I reconcile the concepts of having a batch size but at the same time my input being a sequence? For someone who has never used RNNs before, this is not at all clear.

You say: “In both of these cases, conceptually there is a single CNN model and a sequence of LSTM models, one for each time step.”

Can you please explain how backpropagation works here? Assuming my sequence length is T, my confusion is as follows:

First interpretation: for each LSTM unit I have a corresponding CNN unit. So for an input sequence of length T, I have T LSTMs and T corresponding CNNs. If I am learning weights by backpropagation, shouldn’t all the CNNs have different weights? How could all the CNNs have weights shared across time?

Second interpretation: only one CNN and T LSTMs. Features across T frames are extracted using the same CNN and passed on to T LSTMs with different weights. But then how does this kind of network learn the weights for the CNN?

I have spent a lot of time trying to understand this but I am still confused. It would be really helpful if you could answer 🙂

Hey there,
Thanks for your informative post… It was very useful!
I want to do a similar task, but a bit more complicated. Consider that we want to generalize our network so that it can be used for different sizes. We therefore need to look at frames at patch scale, so that the effects of the patches of an image combine into the result for the image, and the results for the images combine into the result for the video. (Note that resizing is not possible in my case!)

In other words, we want to feed the network videos where each video has a different number of frames, and the frames of different videos may have a different number of patches because the frame size differs between videos. The input dimension should therefore be e.g. [None (for batch), None (for frame), None (for patch), 100, 100, 3].

I have not been able to program this with Keras or TensorFlow! Would you please help with this?

Thanks for your blog! I have some questions about how to apply this integrated model for my data. Now, I have time-series images with multiple bands for crop yield regression, how do I import these data as input for this model? Can you give me any examples or some references I can go to? Thanks so much!

Hey Jason, This example is very enlightening!
I’m currently aiming to do anomaly detection on some radio-astronomic data, which consists of .tiff image files, where the horizontal axis is the time stamp and the vertical axis is frequency. In this case, treating the frequency axis as a spatial dimension (since signals come in varied frequencies), do you think it would be better to apply a 1D convolutional layer than just using a normal LSTM layer when encoding the images? I understand there is a spatial dependence in my data, but it’s only 1-dimensional. I would like to know your opinion about this.

My images are 20000 frames (each frame adds the next 30 minute price), 50×50 with 1 channel.
The problem is that, using all of the regularization that I could, almost all of my architectures reach about 0.51 accuracy. This is the last that I made:

So I wanted to ask you how to avoid overfitting in this type of architecture, and whether the height and width of the frames affect how the model identifies the patterns. In my problem the images are almost the same, and I don’t know if the very small details that vary between them could have an impact on the accuracy and the overfitting.

I have a problem here. I have a project that uses a CNN-LSTM model. However, when I use a 1D CNN, the performance when max pooling over the number of filters is better than when max pooling over the data size. So I have to reshape the data after the CNN layer with a Permute layer. What do you think about this?

Actually, I have already changed the filter sizes multiple times. I know that normally the max pooling layer is applied to reduce the data size, not the number of filters. Even Keras only supports max pooling in 2D CNNs over the width or height of the data, so I am a little worried about this.

Greetings Dr. Jason,
Thank you for your tutorial. I’ve got some questions.
Regarding your tutorial here: https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
I wonder if I could implement your idea of a CNN LSTM with that tutorial? If so, what should I change in the code? I am trying to implement it but I am stuck.
Also, does it make sense to use this model for classification work?
I would appreciate it if you could answer, Dr. Jason. Thank you so much.

I am using GRUs for sequence learning in a captioning problem. What is meant by training loss in GRU training? My loss starts from 9.### and drops down to 0.29##, but if I keep training then it starts to increase again. Any idea what makes the loss increase again?
My loss function is

I see in the comments that you have mentioned that you might investigate the ConvLSTM layer now available in Keras. I first want to thank you for such an immense contribution; your blog has been extremely useful to me in understanding LSTMs. It must take a lot of your time to keep up with all these comments on top of providing the content that you do. However, I have read many of your posts but the knowledge I have fails me!

I am hoping to take advantage of the ConvLSTM2D for a video segmentation problem. It is a very different application to the sequence predictions frequented on this blog. I have N videos of 500 frames each and each video corresponds to a single 2D segmentation mask. I think it is a many to one problem:

Input: (N, 500, cols, rows, 1)
Output: (N, 1, cols, rows, 1)

As per your post on how to deal with long sequences, I have adjusted my input to contain sequence “fragments” , for example of 50 time steps so that I now have:

Input: (N, 10, 50, cols, rows, 1)
Output: (N, 1, 1, cols, rows, 1)

This does not work out so well because the Keras ConvLSTM2D layer expects a 5D array, not 6D. My understanding was that I would be able to feed a single sequence at a time into a stateful LSTM (500 images chopped up into fragments of 50) and that I could somehow remember the state across the 500 images in this way, in order to make a final prediction before deciding whether to update the gradients or not.

My implementation approach did not work with Input: (10, 50, cols, rows, 1) as here “10” is considered as the number of samples and thus corresponding output is required to be (10, 1, cols, rows, 1) ie. a segmentation mask every 50 frames, which is not what I am looking for.

I can duplicate the segmentation 10 times to produce the desired output but I am not sure that is the right way to go.

Awesome article as always. I would like to clear up a question that came up. Do convolutional LSTMs [https://github.com/keras-team/keras/blob/master/examples/conv_lstm.py] mean the same as convolutional neural networks followed by an LSTM? I understand you are trying to extract features using the CNN before passing them on to an LSTM, so it should technically be the same?

I was hoping to get your inputs and advice on the model I’m trying to build.

The goal of the model is to act as a PoS tagger using a combination of CNN and LSTM.
CNN portion receives as input, word vector representations from a Glove embedding and hopefully learns information about the word/sequence.

BiLSTM will then process the output from CNN.
A TimeDistributed layer is added at the dense layer for prediction.

The model trains without issues but in terms of performance, the metrics are worse than a pure LSTM model.

How do we feed the video frames as input to a CNN + LSTM model? I’m currently working on that and am unsure how this could be done. Could you guide me on this? Basically, I want to know about the input part of the model.