Sequence classification is a predictive modeling problem where you have some sequence of inputs over space or time and the task is to predict a category for the sequence.

What makes this problem difficult is that the sequences can vary in length, consist of a very large vocabulary of input symbols, and may require the model to learn the long-term context or dependencies between symbols in the input sequence.

In this post, you will discover how you can develop LSTM recurrent neural network models for sequence classification problems in Python using the Keras deep learning library.

After reading this post you will know:

How to develop an LSTM model for a sequence classification problem.

How to reduce overfitting in your LSTM models through the use of dropout.

How to combine LSTM models with Convolutional Neural Networks that excel at learning spatial relationships.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly-polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.

Keras provides built-in access to the IMDB dataset. The imdb.load_data() function allows you to load the dataset in a format that is ready for use in neural network and deep learning models.

The words have been replaced by integers that indicate the ordered frequency of each word in the dataset. The sentences in each review are therefore comprised of a sequence of integers.

Word Embedding

We will map each movie review into a real vector domain, a popular technique when working with text called word embedding. This is a technique where words are encoded as real-valued vectors in a high dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

Keras provides a convenient way to convert positive integer representations of words into a word embedding via an Embedding layer.

We will map each word onto a 32-length real-valued vector. We will also limit the total number of words that we are interested in modeling to the 5000 most frequent words, and zero out the rest. Finally, the sequence length (number of words) in each review varies, so we will constrain each review to be 500 words, truncating longer reviews and padding shorter reviews with zero values.

Now that we have defined our problem and how the data will be prepared and modeled, we are ready to develop an LSTM model to classify the sentiment of movie reviews.

Simple LSTM for Sequence Classification

We can quickly develop a small LSTM for the IMDB problem and achieve good accuracy.

Let’s start off by importing the classes and functions required for this model and initializing the random number generator to a constant value to ensure we can easily reproduce the results.


import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)

We need to load the IMDB dataset. We are constraining the dataset to the top 5,000 words. We also split the dataset into train (50%) and test (50%) sets.


# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

Next, we need to truncate and pad the input sequences so that they are all the same length for modeling. The model will learn that the zero values carry no information, so the sequences are not the same length in terms of content, but same-length vectors are required to perform the computation in Keras.


# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

We can now define, compile and fit our LSTM model.

The first layer is the Embedding layer that uses 32-length vectors to represent each word. The next layer is the LSTM layer with 100 memory units (smart neurons). Finally, because this is a classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (good and bad) in the problem.

Because it is a binary classification problem, log loss is used as the loss function (binary_crossentropy in Keras). The efficient ADAM optimization algorithm is used. The model is fit for only 2 epochs because it quickly overfits the problem. A large batch size of 64 reviews is used to space out weight updates.
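
Putting this together, a minimal sketch of the model definition, compilation, and fit, consistent with the description above (treat it as a template rather than the exact original listing):

embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=64)
# final evaluation of the model on the held-out test set
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))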

You can see that this simple LSTM with little tuning achieves near state-of-the-art results on the IMDB problem. Importantly, this is a template that you can use to apply LSTM networks to your own sequence classification problems.

Now, let’s look at some extensions of this simple model that you may also want to bring to your own problems.

LSTM For Sequence Classification With Dropout

Recurrent neural networks like the LSTM generally have the problem of overfitting.

Dropout can be applied between layers using the Dropout Keras layer. We can do this easily by adding new Dropout layers between the Embedding and LSTM layers and the LSTM and Dense output layers. For example:
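
A sketch of this modification, assuming a dropout rate of 20% (Dropout would also need to be imported from keras.layers):

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.2))  # dropout between the Embedding and LSTM layers
model.add(LSTM(100))
model.add(Dropout(0.2))  # dropout between the LSTM and Dense output layers
model.add(Dense(1, activation='sigmoid'))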

We can see dropout having the desired impact on training, with a slightly slower trend in convergence and, in this case, a lower final accuracy. The model could probably use a few more epochs of training and may achieve a higher skill (try it and see).

Alternately, dropout can be applied to the input and recurrent connections of the memory units within the LSTM, precisely and separately.

Keras provides this capability with parameters on the LSTM layer: the dropout argument for configuring dropout on the input connections and the recurrent_dropout argument for configuring dropout on the recurrent connections. For example, we can modify the first example to add dropout to the input and recurrent connections as follows:
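
A sketch, assuming the same 20% rate for both kinds of connections:

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))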

We can see that the LSTM specific dropout has a more pronounced effect on the convergence of the network than the layer-wise dropout. As above, the number of epochs was kept constant and could be increased to see if the skill of the model can be further lifted.

Dropout is a powerful technique for combating overfitting in your LSTM models, and it is a good idea to try both methods, but you may get better results with the gate-specific dropout provided in Keras.

LSTM and Convolutional Neural Network For Sequence Classification

The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and the CNN may be able to pick out invariant features for good and bad sentiment. These learned spatial features may then be learned as sequences by an LSTM layer.

We can easily add a one-dimensional CNN and max pooling layers after the Embedding layer which then feed the consolidated features to the LSTM. We can use a smallish set of 32 features with a small filter length of 3. The pooling layer can use the standard length of 2 to halve the feature map size.
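
A sketch of this architecture (assuming Conv1D and MaxPooling1D are imported from keras.layers.convolutional):

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))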

Would this network architecture work for predicting the profitability of a stock based on time series data of the stock price?

For example, with data samples of daily stock prices and trading volumes at 5-minute intervals from 9.30am to 1pm, paired with YES or NO for whether the stock price increases by more than 0.5% over the rest of the trading day?

Each trading day is one sample, and the entire data set would be, for example, the last 1000 trading days.

If this network architecture is not suitable, what other would you suggest testing out?

So, the end result of this tutorial is a model. Could you give me an example of how to use this model to predict a new review, especially using new vocabulary that isn’t present in the training data? Many thanks.

Here I see two dropout layers. The second one is easy to understand: for each time step, it just randomly deactivates 20% of the numbers in the output embedding vector.

The first one confuses me: does it do dropout on the input? For each time step, the input of the embedding layer should be only one index of the top words. In other words, the input is a single number. How can we drop it out? (Or do you mean dropping the input indices at 20% of the time steps?)

In this tutorial, an Embedding layer is used as the input layer, as the data is a sequence of words.

I am working on a problem where I have a sequence of images as an example, and a particular label is assigned to each example. The number of images in the sequence varies from example to example. I have the following questions:
1) Can I use an LSTM layer as an input layer?

2) If the input layer is an LSTM layer, is there still a need to specify max_len (which is a constraint specifying the maximum number of images an example can have)?

Thanks for this tutorial. It’s so helpful! I would like to adapt this to my own problem. I’m working on a problem where I have a sequence of acoustic samples. The sequences vary in length, and I know the identity of the individual/entity producing the signal in each sequence. Since these sequences have a temporal element (each sequence is a series in time, and sequences belonging to the same individual are also linked temporally), I thought LSTM would be the way to go.
According to my understanding, the Embedding layer in this tutorial works to add an extra dimension to the dataset, since the LSTM layer takes 3D input data.

My question is is it advisable to use LSTM layer as a first layer in my problem, seeing that Embedding wouldn’t work with my non-integer acoustic samples? I know that in order to use LSTM as my first layer, I have to somehow reshape my data in a meaningful way so that it meets the requirements of the inputs of LSTM layer. I’ve already padded my sequences so my dataset is currently a 2D tensor. Padding with zeros however was not ideal because some of the original acoustic sample values are zero, representing a zero-pressure level. So I’ve manually padded using a different number.

I’m planning to use a stack of LSTM layers and a Dense layer at the end of my Sequential model.

Great question and hard to answer. I would caution you to review some literature for audio-based applications of LSTMs and CNNs and see what representations were used. The examples I’ve seen have been (sadly) trivial.

Try LSTM as the first layer, but also experiment with CNN (1D) then LSTM for additional opportunities to pull out structure. Perhaps also try Dense then LSTM. I would use one or more Dense on the output layers.

It’s interesting to see that I am also working on a similar problem. I work on speech and image processing. I have a small doubt: may I know how you chose the padding values? Because in images we will also have zeros, and I am unable to understand how to do the padding.

I have one question. Can I use an RNN LSTM for time series sales analysis? I have only one input, the daily sales of the last year, so the total number of data points is around 278, and I want to predict the next 6 months. Are this many data points sufficient for using RNN techniques? Also, can you please explain the difference between LSTM and GRU, and where to use LSTM or GRU?

Often you can get better performance with neural networks when the data is scaled to the range of the transfer function. In this case we use a sigmoid within the LSTMs so we find we get better performance by normalizing input data to the range 0-1.
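
For example, a minimal sketch using scikit-learn (X here is a placeholder 2D array of raw input values):

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))  # rescale each column to [0, 1]
X_scaled = scaler.fit_transform(X)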

Hi, Jason! Your tutorial is very helpful. But I still have a question about using dropout in the LSTM cells. What is the difference in the actual effects of dropout_W and dropout_U? Should I just set them to the same value in most cases? Could you recommend any papers related to this topic? Thank you very much!

Thank you for your very useful posts.
I have a question.
In the last example (CNN & LSTM), it’s clear that we gained a faster training time, but how can we know that a CNN is suitable for this problem as a layer prior to the LSTM? What does the spatial structure mean here? So, if I understand how to decide whether a dataset X has a spatial structure, will this be a suitable clue to suggest a CNN layer prior to the LSTM in a sequence-based problem?

The spatial structure is the order of words. To the CNN, they are just a sequence of numbers, but we know that that sequence has structure – the words (numbers used to represent words) and their order matter.

Model selection is hard. Often you want to pick the model that has the mix of the best performance and lowest complexity (easy to understand, maintain, retrain, use in production).

Yes, if a problem has some spatial structure (image, text, etc.) try a method that preserves that structure, like a CNN.

I have been trying to use your experiment to classify text that comes from several blogs for gender classification. However, I am getting a low accuracy close to 50%. Do you have any suggestions in terms of how I could pre-process my data to fit the model? Each blog text has approximately 6000 words, and I am doing some research now to see what I can do in terms of pre-processing to apply to your model.

The words were converted to integers (one int for each word), and we model the data as fixed-length vectors of integers. Because we work with fixed-length vectors, we must truncate and/or pad the data to this fixed length.

You can convert each character to an integer. Then each input will be a vector of integers. You can then use an Embedding layer to convert your vectors of integers to real-valued vectors in a projected space.
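
As a minimal sketch of that idea (text, max_len, and the model here are hypothetical placeholders):

chars = sorted(set(text))  # the character vocabulary
char_to_int = {c: i + 1 for i, c in enumerate(chars)}  # reserve 0 for padding
encoded = [char_to_int[c] for c in text]
# the Embedding layer then projects each integer to a real-valued vector
model.add(Embedding(input_dim=len(chars) + 1, output_dim=32, input_length=max_len))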

This leaves a rather important question: does it actually learn more complicated features than word counts? And do LSTMs do so in general? Obviously there is literature out there on this topic, but I think your post is somewhat misleading w.r.t. the power of LSTMs. It would be great to see an example where an LSTM outperforms TF-IDF, and to give an idea about the type and size of data that you need. (Thank you for the quick reply though 🙂 )

LSTMs are only neat if they actually remember contextual things, not if they just fit simple models and take a long time to do so.

I have some short questions. First, I feel nervous when choosing hyperparameters for the model, such as the vector length (32), the number of Embedding units (500), the number of LSTM units (100), and the most frequent words (5000). It depends on the dataset, doesn’t it? How can we choose these parameters?

Second, I have a dataset of daily news for predicting the movement of stock market prices. But each news item has more words than each review in the IMDB dataset, on average about 2000 words. Can you recommend how I can choose appropriate hyperparameters?

According to my understanding, when training, the number of epochs is often more than 100 to evaluate a supervised machine learning result. But in your example and the Keras samples, it’s only between 3-15 epochs. Can you explain that?
Thanks,

Your book is really helpful for me. I have a question about time sequence classification. Let’s say I have 8 classes of time sequence data; each class has 200 training samples and 50 validation samples. How can I estimate the classification accuracy based on all 50 validation samples per class (something like log-maximum likelihood) using the scikit-learn package or something else? It would be very much appreciated if you could give me some advice. Thanks a lot in advance.

Hi Jason, thank you for your tutorials; I find them very clear and useful, but I have a little question when I try to apply them to another problem setting.

As pointed out in your post, words are embedded as vectors, and we feed a sequence of vectors to the model to do classification. You mentioned a CNN to deal with the implicit spatial relations inside the word vectors (hope I got it right), so I have two questions related to this operation:

1. Is the Embedding layer specific to words? That is, does Keras have its own vocabulary and similarity definition to treat the word sequence we feed in?

2. What if I have a sequence of 2D matrices, something like an image? How should I transform them to meet the required input shape for the CNN layer or directly the LSTM layer? For example, combined with your tutorial for time series data, I have a trainX of size (5000, 5, 14, 13), where 5000 is the number of samples and 5 is the look_back (or time_step), while I have a matrix instead of a single value here. I think I should use my own specific embedding technique here so I could pass a matrix instead of a vector to a CNN or LSTM layer…

Sorry if my question is not described well, but my intention is really to capture the temporal-spatial connection lying in my data… so I want to feed my model a sequence of matrices as one sample, and the output will be one matrix.

I tried it on a CPU and it worked fine. I plan to replicate the process and expand your method for a different use case. It’s high-dimensional compared to this. Do you have a tutorial on making use of a GPU as well? Can I implement the same code on a GPU, or is the format all different?

Thanks for the interesting tutorial! Do you have any thoughts on how the LSTM trained to classify sequences could then be turned around to generate new ones? I.e. now that it “knows” what a positive review sounds like, could it be used to generate new and novel positive reviews? (ignore possible nefarious uses for such a setup 🙂 )

There are several interesting examples of LSTMs being trained to learn sequences to generate new ones… however, they have no concept of classification, or understanding what a “good” vs “bad” sequence is, like yours does. So, I’m essentially interested in merging the two approaches — train an LSTM with a number of “good” and “bad” sequences, and then have it generate new “good” ones.

Thanks, if you do come up with any crazy ideas, please let me know :).

One pedestrian approach I’m thinking of is having the classifier simply “weed out” the undesired inputs, and then feeding only the desired ones into a new LSTM which can then be used to generate more sequences like those, using an approach like the one in your other post.

That doesn’t seem ideal, as it feels like I’m throwing away some of the knowledge about what makes an undesired sequence undesired… But, on the other hand, I have more freedom in selecting the classifier algorithm.

I am not sure I understand how recurrence and sequence work here.
I would expect you’d feed a sequence of one-hot vectors for each review, where each one-hot vector represents one word. This way, you would not need a maximum length for the review (nor padding), and I could see how you’d use recurrence one word at a time.
But I understand you’re feeding the whole review in one go, so it looks like a feedforward network.
Can you explain that?

Guys, this is a very clear and useful article, and thanks for the Keras code. But I can’t seem to find any sample code for running the trained model to make a prediction. It is not in imdb.py, which just does the evaluation. Does anyone have some sample code for prediction to show?

That’s not the hard part. However, I may have figured out what I need to know. That is, take the result returned by model.predict and use the last item in the array as the classification. Does anyone disagree?

I’ve noticed that in the first part you called fit() on the model with “validation_data=(X_test, y_test)”. This isn’t in the final code summary. So I wondered if that’s just a mistake or if you forgot it later on.

But then again it seems wrong to me to use the test data set for validation. What are your thoughts on this?

Hi Jason,
in the last part the LSTM layer returns a sequence, right? And after that the dense layer only takes one parameter. How does the dense layer know that it should take the last parameter? Or does it even take the last parameter?

Hi Jason,
Very interesting and useful article. Thank you for writing such useful articles. I have had the privilege of going through your other articles which are very useful.

Just wanted to ask: how do we encode new test data into the same format as required by the program? There is no dictionary involved, I guess, for the conversion. So how can we go about this conversion? For instance, consider a sample sentence: “Very interesting article on sequence classification”. What will its encoded numeric representation be?
Thanks in advance

Great article Jason. I wanted to continue the question Prashanth asked about how to pre-process the user input. If we use CountVectorizer(), sure, it will convert it into the required form, but then the words will not be the same as before. Even a single new word will create an extra element. Can you please explain how to pre-process the user input so that it matches the trained model? Thanks in advance.

I have a dataset of just a feature vector like [1,0,5,1,1,2,1] -> y, where y is binary (0,1) or categorical (0,1,2,3). I want to use an LSTM to classify the binary or categorical label. How can I do it? I just added an LSTM with a Dense layer, but the LSTM needs 3-dimensional input while Dense takes just 2 dimensions. I know I need a time sequence; I tried to find out more but couldn’t get anything. Can you explain and tell me how? Thank you so much.

Also, I have one question about another of your posts: when I use model.evaluate(x_test, y_test) to get the accuracy score of the model after training on the train dataset, it returns a result > 1 in some cases. I don’t know why, and it makes me unable to trust this function. Can you explain why?

I tried to create a random dataset and pass it to a CNN with Conv1D, but I don’t know why the Conv1D accepts my shape (I think it automatically puts in the value None) while the fit doesn’t accept it (I think because Conv1D accepts 3 dimensions). I get this error:

I wanted to ask for some suggestions on training my data set. The data I have are 1D measurements taken over time, with a binary label for each instance.

Thanks to your blogs I have successfully built an LSTM and it does a great job at classifying the dominant class. The main issue is that the proportion of 0s to 1s is very high; there are about 0.03 times as many 1s as 0s. For the most part, the 1s occur when there are high values of these measurements. So, I figured I could get an LSTM model to make better predictions if the model could see the last “p” measurements. Intuitively, it would recognize an abnormal increase in the measurement and associate that behavior with an output of 1.

Knowing some of this basic background, could you suggest a structure that may
1.) help exploit the structure of abnormally high measurement with outputs of 1
2.) help with the low exposure to 1 instances

Hi, that’s a great tutorial!
Just wondering: as you are padding with zeros, why aren’t you setting the Embedding layer flag mask_zero to True?
Without doing that, the padded symbols will influence the computation of the cost function, won’t they?

Hi Jason,
Great tutorial! Helped a lot.
I’ve got a theoretical question though. Is sequence classification just based on the last state of the LSTM, or do you have to take the dense layer over all the hidden units (100 LSTM units in this case)? Is sequence classification possible just based on the last state? In most of the implementations I see, there is a dense layer and a softmax to classify the sequence.

Hi Jason,
Can you tell me about time_step in LSTM, with an example or something easy to understand? My data has 2 dimensions, [[1,2]…[1,3]], output: [1,…0]. With Keras, the LSTM layer needs 3 dimensions, so I can just reshape the input data to 3 dimensions with time_step=1. Can I train it like this? Is it better with time_step > 1? I want to know the meaning of time_step in LSTM. Thank you so much for reading my question.

Hi Jason,
First of all, thank you for your great explanation.
I am considering setting up an AWS g2.2xlarge instance according to your explanation in another post. Would you have some benchmarks (e.g. the time for 1 epoch of one of the above examples) so that I can compare with my current hardware?

Thanks Jason for your article. I have implemented a CNN followed by an LSTM neural network model in Keras for sentence classification. But after 1 or 2 epochs, my training accuracy and validation accuracy get stuck at some number and do not change, as if stuck in some local minimum or for some other reason. What should I do to resolve this problem? If I use only a CNN in my model, then both training and validation accuracy converge to a good accuracy. Can you help me with this? I couldn’t identify the problem.

Jason, thanks for your great post.
I am a beginner with DL.
If I need to include some behavioral features in this analysis, let’s say: age, genre, zipcode, time (DD:HH), season (spring/summer/autumn/winter)… could you give me some hints on implementing that?

My data is of shape (8000,30) and i need to use 30 timesteps.
I do
model.add(LSTM(200, input_shape=(timesteps,train.shape[1])))

but when I run the code it gives me an error:
ValueError: Error when checking input: expected lstm_20_input to have 3 dimensions, but got array with shape (8000, 30)
How to change the shape of the training data in the format you mentioned
Remember, input data must be structured [samples, timesteps, features]. (8000,30,30)
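
As a minimal sketch of one interpretation, treating each of the 30 columns as one time step with a single feature (an alternative is to window the data into overlapping sequences):

import numpy as np
train = np.asarray(train)  # the (8000, 30) array from the question
train_3d = train.reshape((train.shape[0], train.shape[1], 1))  # -> (8000, 30, 1)
model.add(LSTM(200, input_shape=(30, 1)))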

I am very thankful for your blog-posts. They are undoubtedly one of the best on the internet.
I have one doubt though. Why did you use the test dataset (x_test and y_test) as the validation dataset in the very first example that you described? I just find it a little bit confusing.

I added dropout on the CNN+RNN like you said and it gives me 87.65% accuracy. I am still not clear on the purpose of combining both, as I thought CNNs were for 2D+ input like images or video. But anyway, your tutorial gives me a great starting point to dive into RNNs. Many thanks!

If I am understanding right, after the embedding layer EACH SAMPLE (each review) in the training data is transformed into a 32 by 500 matrix. When taking an analogy from audio spectrogram, it is a 32-dim spectrum with 500 time frames long.

With the equivalence or analogy above, I can perform audio waveform classification with the raw audio spectrogram as the input and class labels (whatever they are; they might be audio quality good or bad) using exactly the same code as in this post (except the embedding layer). Is that correct?

Furthermore, I am wondering why the lengths of the inputs should all be the same, i.e. 500 in the post. If I am doing online training, in which a single sample is fed into the model at a time (batch size is 1), there should be no concern about the varying length of samples, right? That is, each sample (of varying length, without padding) and its target are used to train the model one after another, and there is no worry about the varying length. Is it just an implementation issue in Keras, or in theory should the input length of each sample be the same?

I was just wondering if the RNN or LSTM in theory requires every input to be the same length.
As far as I know, one of the advantages of an RNN over a DNN is that it accepts varying-length input.

It doesn’t bother me if the requirement is for efficiency in Keras, and the zeros (if zero-padding is used) are regarded as carrying zero information. In the audio spectrogram case, would you recommend zero-padding the raw waveform (one-D) or the spectrogram (two-D)? By analogy with your post, the choice would be the former.

Is there a way in an RNN (Keras implementation) to control the attention of the LSTM?
I have a dataset where 100 time series inputs are fed as a sequence. I want the LSTM to give more importance to the last 10 time series inputs.
Can it be done?

Hi Jason,
After building and saving the model I want to use it for predictions on new texts, but I don’t know how to preprocess the plain text in order to use it for predictions. I have searched and found this approach:
text = np.array(['this is a random sentence'])
tk = keras.preprocessing.text.Tokenizer(nb_words=2000, lower=True, split=" ")
predictions = loaded_model.predict(np.array(tk.fit_on_texts(text)))

but this is not working for me and showing this error:
ValueError: Error when checking : expected embedding_1_input to have 2 dimensions, but got array with shape ()

Can You please tell me the proper way to preprocess the text. Any help is greatly appreciated.
Thanks

What I interpret is that 1 is the label for positive sentiment, and since I am using a positive statement to predict, I am expecting the output to be 1.
I made a mistake in the last comment by using model.predict() to get class labels; the correct way to get the label is model.predict_classes(), but still, it’s not giving proper class labels.
So my question is whether I made a mistake in converting the text into a one-hot vector, or whether it is the right way to do it.
Many Thanks

Can we use sequence labelling on a continuous variable? I have datasets of customers paying their debt within the due date, within the buffer period, and beyond the buffer period. Based on this, I want to score the customers from good to bad. Is it possible using sequence labelling?

I have already mapped an LSTM model from a text column to a label column. However, I need to add an alpha-numeric column alongside the text as an additional feature to my LSTM model. How can I do that in Keras?

Hi, it was really great, and I am happy that this tutorial was my first practical LSTM project. I need f-measure, false positives, and AUC instead of “accuracy” in your code. Do you have any idea how to get them?

I have a question about the built-in Embedding layer in Keras.
I have done word embedding with a word2vec model, which works based on the semantic similarity of the words: those in the same context are more similar. I am wondering whether the Keras Embedding layer also follows the w2v model or has its own algorithm to map the words into vectors.
Based on what semantics does it map the words to vectors?

hi Jason,
Great post for me.
But I want to ask you about the vector length in the Embedding layer. You said “the first layer is the Embedding layer that uses 32-length vectors to represent each word”. Why did you choose 32 instead of another number like 64 or 128? Can you give me some best practices, or the reason for your choice?
Thank you so much.

@Jason,
“Sequence classification is a predictive modeling problem where you have some sequence of inputs over space or time and the task is to predict a category for the sequence.”
this is inspiring. I am thinking about using sequence classification on the IRIS dataset.
Do you think it would work?

@Jason,
Do you mean that:
I can not use LSTM for IRIS classification? I am working on an IRIS-like dataset, so I’m exploring all possible classifiers. You have one here on your website. Besides,
I have tried RBM in SKLearn; it did not work, as my inputs are not binary inputs like the MNIST dataset (even after SKLearn’s preprocessing.binarizer() function). I think they were wrong to say that RBM in SKLearn works for data in the range [0,1]; it only works for 0 and 1.
(By the way, I sent you my code for reference.)

I have also tried a probabilistic neural net (PNN), which yields only 78% accuracy; that is low, and there is no way to increase the layers of a PNN, as it is a single-layer net (from Neupy).
Now I came to RNNs, but you said that.

@Jason,
What would you suggest? I need your expert advice.
I tried RBM in sklearn; it did not work.
You said RNN would not work for it.
I think CNN clearly does not work for it.
Are DBN and VAE left?

Jason,
I did try a multi-layer perceptron. The result was good.
I want to use a deep neural net of more than 3 layers.
What do you think about a convolutional neural network?
I originally thought it was impossible. But now I am thinking about it again.

Hey Jason! Great Post 🙂 Really helped me in my internship this summer. I just wanted to get your thoughts on a couple things.

1. I’ve trained with about 400k documents in total and I’m getting an accuracy of ~98%. I always get wary when my model does ‘too’ well. Is that a fair cause-effect due to the enormous dataset?

2. When I think of CNN’ing + max_pooling word vectors (GloVe), I think of the operation basically meshing the word vectors for 3 words (possibly forming something like a phrase representation). Am I right in my thought process?

3. I’m still a little unclear on what the LSTM learns. I understand it’s not a typical seq-2-seq problem, so what do those 100 LSTM units hold?

Good question, no the layers do not need to have the same number of units.

For example, If I had a vector of length 5 as input to a single neuron, then the neuron would have 5 weights, one for each element. We do not need 5 neurons for the 5 input elements (although we could), these concerns are separate and decoupled.

Thanks for your reply.
But here each input is already a vector, not a scalar! Would that mean in this case that each neuron will receive 5 vectors, each of them 32-dimensional? So each neuron will have 5*32 = 160 weights? And if so, what is the advantage of that over having every neuron process only one word/vector?

Hi Jason,
consider we have 500 sequences with 100 elements in each sequence.
if we do the embedding in a 32 dimensions vector, we will have a 100*32 matrix for each sequence.
Now assume we are using only a layer of LSTM(20) in our project. I am a bit confused in practice:

I know that we have a hidden layer with 20 LSTM units in parallel. I want to know how Keras gives a sequence to the model. Does it give the same 32-dimensional vectors to all LSTM units at a time, in order, with an iteration finishing at time [t+100]? (This way, I think all units would give the same (copied) value after training, and it would be equivalent to having only one unit.) Or does it give the 32-dim vectors to the model 20 by 20, in order, with the iteration ending at time [t+5]?

So, the 100 time steps are passed as input to the model with 500 samples and 1 feature, something like [500, 100, 1].

The Embedding will transform each time step into a 32 dimensional vector.

The LSTM will process the sequence one time step at a time, so one 32-dimensional embedding at a time.

Each memory cell will get the whole input. They all have a go at modeling the problem. An error propagated from deeper layers will encourage the hidden LSTM layer to learn the input sequence in a specific way, e.g. classify the sequence. Each cell will learn something slightly different.

1) I am working on malware detection using LSTMs, so I have malware activities in a sequence. As another question, I want to know more about the Embedding layer in Keras. In my project I have to convert elements into integers to feed the Embedding layer of Keras. I guess Embedding is a frozen neural network layer that converts the elements of a sequence to vectors in a way that the relations between different elements are meaningful, right? I would like to know if there is any logical issue with using Embedding in my project.

2) Do you know any references (book, paper, website, etc.) for Embedding in Keras (academic or non-academic)? I need to draw a figure describing the Embedding training network.

Hey Jason, this post was great for me.
As a question, I would like to know how to set the number of LSTM units in the hidden layer.

Is there any relationships between the number of samples (sequences) and the number of hidden units?

I have 400 sequences with 5000 elements in each. How many LSTM units should I use? I know that I should test the model with different numbers of hidden units, but I am looking for an upper bound and a lower bound for the number of hidden units.

And now comes the question: In my case I am trying to solve a task classification problem. Each task is described by 57 time series with 74 time steps each. For the training phase I do have 100 task examples of 10 different classes.

This way, I have created a [100,74,57] input and a [100,1] output with the label for each task.

This is, I have a multivariate time series to multilabel classification problem.

What type of learning structure would you suggest? I am aware that I may need to collect/generate more data, but I am new to both Python and deep learning, and I am having some trouble creating a small running example for multivariate ts -> multilabel classification.

For multi-class classification, you will need a one hot encoding of your output variable so the dimensions will be [100,10] and then use a softmax activation function in the output layer to predict the outcome probability across all 10 classes.

For the specific model, try MLPs with sliding window, then maybe some RNNs like LSTMs to see if they can do better.
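
A minimal sketch of that output encoding and layer (assuming integer labels y in 0..9 and a recent Keras with keras.utils.to_categorical):

from keras.utils import to_categorical
y_onehot = to_categorical(y, num_classes=10)  # (100,) integer labels -> (100, 10)
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])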

Thanks for your tutorial. My problem is classifying a packet (captured over time with many features) as normal or abnormal. I would like to adapt LSTM to my own problem. My data are matrices: X_train(4000,41), Y_train(4000,1), X_test(1000,41), Y_test(1000,1) – Y is the label. One of the 41 features is time; the others are input variables. I think I have to extract the time feature from the 41 features, is that correct? Is this process available in Keras?
First, I am confused about how to reshape my data in a meaningful way so that it meets the requirements of the inputs of the LSTM layer. I expect my data like this:
x_train.shape = (4000,1,41) # simple, I set time step=1; later it will be changed to > 1 to classify from many packets per time step
y_train.shape = (4000,1,1)
How do I transform my data to the structure above?
Second, I think the Embedding layer is not suitable for my problem, is that right? My model is built as:
model = Sequential()
model.add(LSTM(64, input_dim=41, input_length=41)) # e.g. 64 LSTM units
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=20, batch_size=100)
I’m new to LSTM. Can you give any advice for my problem? Thank you very much.

Thanks Jason. That means batch_size=100, right? Can I have my first layer like this:
model.add(LSTM(64, input_dim=41, input_length=400)) # hidden 1: 64
Or:
model.add(LSTM(64, batch_input_shape=(100, 1, 41), stateful=True))
Which one is correct? How do I set time_step in the first line of code?
Can you help me fix that? Many thanks

I tried to build a model with my data following your comments, but I get errors:
timesteps=2
train_x=np.array([train_x[i:i+timesteps] for i in range(len(train_x)-timesteps)]) #train_x.shape=(119998, 2, 41)
train_y=np.array([train_y[i:i+timesteps] for i in range(len(train_y)-timesteps)]) #train_y.shape=(119998, 2, 1)

Error:
File "test_data.py", line 53, in
model.fit(train_x,train_y, nb_epoch=100, batch_size=10,)
File "/home/keras/models.py", line 870, in fit
initial_epoch=initial_epoch)
File "/home/keras/engine/training.py", line 1435, in fit
batch_size=batch_size)
File "/home/keras/engine/training.py", line 1315, in _standardize_user_data
exception_prefix='target')
File "/home/engine/training.py", line 127, in _standardize_input_data
str(array.shape))
ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (119998, 2, 1)
Maybe I have a problem with the output shape? How can I fix it?
Thank you

I think that maybe I was wrong in preparing the input data for the LSTM.
I have input and labels like this: train_x(4000,41) and train_y(4000,1)
Before, I used:
timesteps=2
train_x=np.array([train_x[i:i+timesteps] for i in range(len(train_x)-timesteps)]) #train_x.shape=(119998, 2, 41)
train_y=np.array(train_y[:119998]) #train_y.shape=(119998, 1)

===> It is wrong because the rows are overlapped and train_y may be taken wrongly

In my data, each instance has multiple features, so I want to keep the features as they are, meaning multiple features at the same time step.
Please help me correct my misunderstanding about the input data:
train_y = reshape(int(train_y.shape[0]/timesteps), train_y.shape[1]) # error: IndexError: tuple index out of range ???
And I am concerned about whether the time feature is or is not included in the input data (because I read this post: https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/).
I have read many of your articles on machinelearningmastery.com, so I may be confused.

Hi Jason
After considering carefully how to prepare data for LSTMs in Keras, I realise that the term “feature” doesn’t keep its original meaning (also known as attributes or fields in a dataset); actually, it is the number of columns after converting a multivariate time series into supervised learning data. It is based on the real features and the look_back, calculated as real_feature multiplied by look_back. Am I right?
I followed https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
.
Thanks Jason and machinelearningmastery.com

I have some questions; I hope you can help out.
1. I’m trying to classify intents for a data set containing comments from users. There are several intents corresponding to comments. But the language in my case is not English. So I understand that I have to build a data set similar to the IMDB one. But how can I do that? Do you have any instructions/guidelines for building a data set like that?

2. Aside from the data set, I think that I also have to build embedding vectors for my own language. How can I do that?

Generally, you need to clean the data (punctuation, case, vocab), then integer encode it for use with a word embedding. See Keras’ Tokenizer class as a good start.

The Embedding layer will learn the weights for your data. You can try to train a word2vec model and use the pre-trained weights to get better performance, but I’d recommend starting with a learned embedding layer as a first step.
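
A minimal sketch of that starting point (train_texts is a placeholder list of cleaned strings; older Keras versions name the argument nb_words instead of num_words):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(train_texts)  # build the vocabulary from the training text
encoded = tokenizer.texts_to_sequences(train_texts)  # strings -> lists of integers
X = sequence.pad_sequences(encoded, maxlen=500)  # pad/truncate to a fixed length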

My question: in what cases does an RNN work better than an LSTM? I know that the LSTM originated from the RNN and attempts to eliminate the vanishing gradient problem of RNNs. But in my case I am using malware behavioral sequences, and I got this chart for TPR and FPR: https://imgur.com/fnYxGwK ; the figures show TPR and FPR for different numbers of units in the hidden layer.

I am working through a categorical classification task that involves evaluating a feature that can run as long as 27500 words. My problem is that there are other features that I need to feed into my RNN-LSTM as well. I had thought about combining the long text feature and the other features into one file (features separated by columns, of course), but I don’t think that will work. Instead, I was thinking of separating the long text feature into its own file, running that independently through the RNN, and then taking the other features separately. Can you give me some pointers on how I can go about designing the layers for this challenge I’m facing?

Besides, sometimes it just said “fetch failure on https://s3.amazonaws.com/text-datasets/imdb.npz”.
Is it because the imdb data source is not available or the network is unstable?
Actually, I have manually downloaded the data from https://s3.amazonaws.com/text-datasets/imdb.npz.
So if I cannot load the data online, how can I deal with the data I’ve downloaded manually in order to use it?
I’ve tried other code to load the data: (X_train, y_train), (X_test, y_test) = imdb.load_data(path="imdb_full.pkl"), and it doesn’t work either.
I’m looking forward to your reply. Thanks again!

Thanks for your reply. Now I can load the dataset. I still have two questions and need your help:
(1) You mentioned that we can “reproduce the results” by using the code “numpy.random.seed(7)”, but I still get different accuracies every time. Is my understanding of numpy.random.seed(7) correct?
(2) The results I get are always about 50.6%, which is lower than yours. Why is there such a big gap?
Thank you, and I’m looking forward to your reply~

This is an amazing post. I’m very new to nnets and now I have a question.
I do not understand why you picked LSTM and RNN for this sentiment analysis; to be clear, I don’t understand where the sequential part is that allows us to use an RNN and LSTM.
I’m wondering if you could explain this.
I also want to know if we can use LSTM for entity extraction (NLP), and where there is a good data set to train our model.

Thanks for sharing both the model and the code, and also your enthusiasm in answering all the questions. I built my model for sentence classification based on your CNN+LSTM one and it is working well. I am relatively new to neural nets, and hence I am trying to learn to interpret how the different layers interact, specifically what the data shape is like. So, given the example above, suppose our dataset has 1000 movie reviews and we use a batch size of 64; for each batch, please correct me:

embedding layer: OUTPUT – 64 (sample size) x 500 (words) x 32 (features per word)
conv1d: INPUT – as above; OUTPUT – for *each word*, 32 feature maps x (32/3) features, where 3 is kernel size.
maxpooling1d: INPUT – as above; OUTPUT – for *each word*, and for *each feature map*, a 32/3/2 feature vector
lstm: INPUT – this is where I struggle to understand… 64 is the sample size, 500 is the steps, so should be 64 x 500 x FEATURES, but is FEATURES=32/3/2, or 32 x (32/3/2) where the first 32 is the feature maps from conv1d?
OUTPUT – for *each sample*, a 100-dim feature vector

Hello, I read your blog and found it really helpful. However, could you please guide me to a code sample showing how exactly to encode my text for training? I have 20,000 reviews to train.
Or can I just use a hashing technique where every word signifies an integer?
So something like:
I find the store good.
I find good.

is represented as:
1 2 3 4 5
1 2 5

As representing every character with an integer would be exhausting, I think!
And then I can probably run the further steps for padding etc.?
In this case how will I predict new sentences having some new words?
(Which makes me rethink whether I should assign every character to an integer.) If so, could you please show me a sample?

I tried to create a model for text summarization with seq2seq in Keras. It did not work well: the prediction shows the top words by frequency. I tried blacklisting the top words in English (‘a’, ‘an’, ‘the’, etc.). The results were still not good. Some said in 2016 that Keras was not good for text summarization at the time. I wonder what is missing.

Hello sir, I am Asad. I want to know how to load a data set which is in a .txt file containing text data of movie reviews, and then how I can use it in a recurrent neural network?
Please tell me the complete procedure. Remember, the data I have is stored locally on my computer.

I would like to let you know that I have written my first ML code following your step-by-step ML project. I am using a nonlinear dataset (NSL-KDD). My dataset is in CSV format. I want to model and train my dataset using an LSTM.
For the MNIST dataset I have code.

My question is, for my dataset, how can I define the chunk size, number of chunks, and RNN size as new variables?
As I am very new, I am really confused about how to model and train my dataset to find accuracy using an LSTM. I want to use the LSTM as a classifier. I don’t know whether my questions are correct or not.
I really appreciate your help.

Is it possible to write the same code with simple neural networks for text processing?
Is Keras the best way to do text processing, or are there other libraries for implementing neural networks for text processing?

This post and the comments have helped me immensely. Thanks! I have a question regarding this sentence –
“The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and the CNN may be able to pick out invariant features for good and bad sentiment. These learned spatial features may then be learned as sequences by an LSTM layer.”

I am not able to visualize how the CNN will process words. Also, could you please throw some light on the spatial structure of words?

I have read about sequence-to-sequence learning in neural networks. We need two LSTM layers for it: the first is for the input sequence and the second is for the output sequence. Here we have to send our input sequence vector to the LSTM layers in reverse order.

My doubt is: will the LSTM layer take the input in reverse order, or do we have to give the input in reverse order?

For a sequence-to-sequence regression model, do I give one output node, or the maximum variable length of the output vectors?
Finally, we will get output vectors; how do we convert these output vectors to text? Is there a method available in Keras for vectors-to-integers conversion, like the embedding layer does for strings-to-vectors conversion?

I got some trouble with overfitting my model.
For training I am using text data in the Russian language (the language essentially doesn’t matter, because the text contains a lot of special professional terms, so sadly employing an existing word2vec won’t be an option).

I have these parameters for the training data: maximum length of an article – 969 words; size of vocabulary – 53886; number of labels – 12 (sadly they are distributed quite unevenly; for instance the first label has around 5000 examples, while the second contains only 1500 examples).

The size of the training data set is only 9876 entries. It’s the biggest problem, because sadly I can’t increase the size of the training set by any means (the only way out is to wait another year, but even that would only double the size of the training data, and even double the amount isn’t enough).

model.fit(x_train, y_train, epochs=25, batch_size=30)
scores = model.evaluate(x_, y_)
I tried different parameters and it gets really high accuracy in training (up to 98%), but it performs really badly on the test set. The maximum I managed to achieve was around 74%; the usual result is something around 64%. The best result was achieved with a small embedding_vecor_length and a small batch_size.

I know that my test set is only 10 percent of the training set, and the overall size of the data set is the biggest problem, but I want to find a way around this problem.

So my questions are: 1) Is this a correctly built model for text classification purposes? (It works.) Do I need to use simultaneous convolutions and merge the results instead? I just don’t get how the text information doesn’t get lost in the process of convolution with different filter sizes (like in my example). Can you explain how convolution works with text data? There are mainly articles about image recognition out there.

2) I obviously have a problem with overfitting my model. How can I make the performance better? I have already added Dropout layers. What can I do next?

3) Maybe I need something different? I mean a pure RNN without convolution?

How would you do sequence classification if there were no words involved? For example, I want to classify a sequence that looks like [0, 0, 0.4, 0.5, 0.9, 0, 0.4] as either a 0 or a 1, but I don’t know what format to get my data into to feed an LSTM.

What if we need to classify a sequence of numbers? Is this example applicable, and do I need the embedding layer? Can you refer me to an example that you have on the blog or elsewhere so I can understand more? Thanks

Thanks for the tutorial. Can you clarify, however, when you say:
“We can see that we achieve similar results to the first example although with less weights and faster training time.”

When you say fewer weights, what are you referring to exactly? Because when you run model.summary(), the model with the Convolution layer has 216k parameters vs. 213k parameters in the original model; technically there are more parameters to train.

Do you mean to say that with the convolution + pooling layers the input into the LSTM layer comes from 250 hidden layer nodes vs. 500 in the original model? I’m guessing the LSTM layer is harder to train, which leads to the reduced fitting time?

Hi
I tried text classification. I have data sets of tweets, and I have to train a model to determine whether the writer was happy or sad. I used your “Simple LSTM for Sequence Classification” code, but I want to know what I should do with the words before using your code.
Previously I used sequences = tokenizer.texts_to_sequences(tweets_dict["train"]) to convert the text to vectors, and after that I used your code. Is that correct?

Do you mind if I quote a few of your posts as long as I provide credit and sources back to your website? My blog site is in the exact same area of interest as yours, and my users would really benefit from a lot of the information you provide here. Please let me know if this is okay with you. Many thanks!

Thank you, sir, for providing this very nice tutorial. I am working on sequence classification. My data set contains 41 features, each of them a float, and Y has 5 classes.
Q.1 Do I need an embedding?
Q.2 I have normalized the data, so do I need top_words?
Q.3 What could the embedding vector length be?
Q.4 What could the maximum review length be?
Q.5 All examples contain 41 features; do I need padding?
I am not very clear about the embedding layer. Your suggestions would be great for me.

president obama says the us needs to do more to help stop the ebola outbreak from becoming a global crisis actdont talk RISK
i was upset and angry that thomasericduncan lied about his exposure to ebola and put us all at risk for catching this deadly disease RISK
ebola is transmitted through blood and saliva so i better stop punching my haters so hard and smooching all these gorgeous b TRANSMISSION
he got the best treatment availablebetter than liberia and i am still not convinced he didnt know he had ebolarace card again TREATMENT
obama and cdc said they will fight ebola in africa news today ebola deaths rise sharply when exactly will they fight it tcot TREATMENT
fuck this is really tough dont know if i have the mind and guts to deal with death and ebola every day of work RISK
something more serious needs to be done about this ebola shit the airport and the town he was in needs to be quarantined im sick of being PREVENTION
if you have ebola symptoms or know someone who does please hug and kiss mr obama show him respect he appreciates tcot SYMPTOM
u can only get it if u have frequent contact with bodily fluids of someone who has ebola and is showing symptoms TRANSMISSION

The first input is the sequence of online activities, which I can use the above-mentioned models to deal with.
The second input is a vector of the time differences (in minutes) between each activity and the last activity. In this case, I want my model to consider the time impact on the decision as well.

My question is what is the best way to merge the second input to the above models?
What I have done is use an LSTM layer on the second input as well and merge the output with the above one. But it seems not right, because the second input is a continuous value rather than a discrete index.

So what kind of layer should I use to apply to these real-valued vectors?

How do I take two types of inputs in this model?
One is a sequence of online activities; the second input is the time difference between each activity and the last activity.
Should I use a multimodal layer to merge them?
Should I process the second input with an LSTM layer as well? (It seems not right, as the elements of this vector are continuous values.)

Thanks for your response. I understand how to merge two layers, but my question is: in which layer shall I merge the online activities with their recency scores?

For example, I can apply an LSTM layer on the online activities and then concatenate the output of the LSTM layer (the last hidden state output) with the sequence of their recency scores. But it doesn’t make sense.

Or I can multiply the embedding output by the sequence of their recency scores, then put the output into the LSTM layer. But I don’t know whether this is right or not.

Jason,
Thanks for your excellent explanation.
I’ve made some modifications to your code in order to get higher accuracy on the test data; finally, I got an accuracy of 88.60% on the test dataset.
My question is: besides changing those hyperparameters (just like a blind man touching an elephant), what else could we do to improve the prediction accuracy on the test data? Or how do we conquer the overfitting to get higher prediction accuracy on the test data? I found it’s very easy to get higher prediction accuracy on the training data, but astonishingly hard to make the same happen on the test dataset (or validation dataset). The code I modified is as follows, if anyone else needs it as a reference:

Thanks, Jason; that article you wrote, I already read it carefully half a year ago. It’s also excellent, but I still feel we have no clear guide on how to improve the prediction accuracy on a test dataset.
We always say:
1. More training and testing data could give better performance, but not always.
2. Deeper layers in the neural network could give better performance, but still not always.
3. Fine-tuning hyperparameters could give better performance. Yes, it can, but let alone the time consumption, this kind of work improves the performance only very little (in my experience).
4. Try other neural network architectures. Yes, sometimes this can work, but soon we will get to the upper limit again and face the same problem at once: how do we improve it then?
Conquering overfitting is a really interesting but difficult problem in neural networks. I feel we could find some better working ways to fix this problem in the future.
I still appreciate your articles and replies. Have a happy weekend.

Thanks a lot, Jason, for your great post. I have difficulty understanding how an LSTM can remember long-term dependencies. Or maybe I misunderstood the meaning of “remembering dependencies”. Does it remember different parts within a specific training sample or across different training samples?

For example, if we have 100 training samples, does it learn from the 81st sample by remembering the previous training samples?