CNNs have been successful in various text classification tasks. In [1], the author showed that a simple CNN with little hyperparameter tuning and static word vectors achieves excellent results on multiple benchmarks – improving upon the state of the art on 4 out of 7 tasks.

However, when learning to apply a CNN to word embeddings, keeping track of the dimensions of the matrices can be confusing. The aim of this short post is simply to keep track of these dimensions and understand how a CNN works for text classification. We will use a one-layer CNN on a 7-word sentence, with word embeddings of dimension 5 – a toy example to aid the understanding of CNNs. All examples are from [2].

Setup

Above figure is from [2], with #hash-tags added to aid discussion. Quoting the original caption here, to be discussed later. “Figure 1: Illustration of a CNN architecture for sentence classification. We depict three filter region sizes: 2,3,4, each of which has 2 filters. Filters perform convolutions on the sentence matrix and generate (variable-length) feature maps; 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus, a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states.”

#sentence

The example sentence is “I like this movie very much!”. There are 6 words here, and the exclamation mark is treated as a word – some researchers do this differently and disregard the exclamation mark – so in total there are 7 words in the sentence. The authors chose 5 as the dimension of the word vectors. We let s denote the length of the sentence and d denote the dimension of the word vectors, hence we now have a sentence matrix of shape s x d, or 7 x 5.
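As a quick sketch in numpy – the values here are random placeholders, not the numbers from the figure – the sentence matrix is just a 7 x 5 array:

```python
import numpy as np

# Toy sentence matrix: placeholder values, one row per word "I", "like", ..., "!"
s, d = 7, 5                            # s = sentence length, d = embedding dimension
rng = np.random.default_rng(0)
sentence_matrix = rng.random((s, d))
print(sentence_matrix.shape)           # (7, 5)
```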

#filters

One of the desirable properties of a CNN is that it preserves the 2D spatial orientation in computer vision. Texts, like pictures, have an orientation: instead of being 2-dimensional, texts have a one-dimensional structure where word sequence matters. We also recall that each word in the example is replaced by a 5-dimensional word vector, hence we fix one dimension of the filter to match the word vectors (5) and vary the region size, h. Region size refers to the number of rows – each representing a word – of the sentence matrix that the filter covers.

In the figure, #filters are the illustrations of the filters themselves, not what has been filtered out from the sentence matrix by the filters; the next paragraph will make this distinction clearer. Here, the authors chose to use 6 filters – 2 complementary filters for each of the region sizes (2, 3, 4).

#featuremaps

In this section, we step through how the CNN performs convolutions / filtering. I have filled in some numbers in the sentence matrix and the filter matrix for clarity.

The above illustrates the action of the 2-word filter on the sentence matrix. First, the two-word filter, represented by the 2 x 5 yellow matrix w, overlays the word vectors of “I” and “like”. Next, it performs an element-wise product over all its 2 x 5 elements and sums them up to obtain one number (0.6 x 0.2 + 0.5 x 0.1 + … + 0.1 x 0.1 = 0.51). 0.51 is recorded as the first element of the output sequence, o, for this filter. Then, the filter moves down 1 word, overlays the word vectors of “like” and “this”, and performs the same operation to get 0.53. Therefore, o has the shape (s–h+1) x 1, in this case (7–2+1) x 1 = 6 x 1.
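The slide-multiply-sum procedure above can be sketched in numpy; the values here are random placeholders rather than the figure's numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
s, d, h = 7, 5, 2                      # sentence length, embedding dim, region size
sentence = rng.random((s, d))          # toy sentence matrix (placeholder values)
w = rng.random((h, d))                 # one 2-word filter (the yellow matrix)

# Slide the filter down one word at a time: element-wise product, then sum.
o = np.array([np.sum(sentence[i:i + h] * w) for i in range(s - h + 1)])
print(o.shape)                         # (6,), i.e. s - h + 1 = 7 - 2 + 1
```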

To obtain the feature map, c, we add a bias term (a scalar, i.e., shape 1 x 1) and apply an activation function (e.g. ReLU). This gives us c, with the same shape as o, (s–h+1) x 1.
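A minimal sketch of this step, reusing the 0.51 and 0.53 from the walk-through above; the remaining four values of o are made up for illustration:

```python
import numpy as np

# Toy output sequence: 0.51 and 0.53 are from the walk-through, the rest invented.
o = np.array([0.51, 0.53, -0.20, 0.10, -0.40, 0.30])
b = 0.1                                   # scalar bias term
c = np.maximum(0.0, o + b)                # ReLU activation gives the feature map c
print(c.shape)                            # (6,), same shape as o
```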

#1max

Notice that the dimensionality of c depends on both s and h; in other words, it will vary across sentences of different lengths and filters of different region sizes. To tackle this problem, the authors employ the 1-max pooling function and extract the largest number from each c vector.
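A small sketch of 1-max pooling over two toy feature maps of different lengths, showing how each collapses to a single number regardless of its length:

```python
import numpy as np

# Two toy feature maps of different lengths (different s or h); values invented.
c_short = np.array([0.61, 0.63, 0.00])
c_long  = np.array([0.20, 0.90, 0.00, 0.40, 0.10, 0.30])

# 1-max pooling: take the largest element of each feature map.
pooled = [float(c.max()) for c in (c_short, c_long)]
print(pooled)
```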

#concat1max

After 1-max pooling, we are certain to have a fixed-length vector of 6 elements ( = number of filters = number of filters per region size (2) x number of region sizes considered (3)). This fixed-length vector can then be fed into a softmax (fully-connected) layer to perform the classification. The error from the classification is then back-propagated into the following parameters as part of learning:

The w matrices that produced o

The bias term that is added to o to produce c

Word vectors (optional, use validation performance to decide)
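Putting the last step together, here is a toy sketch of the softmax (fully-connected) layer acting on the 6 pooled values; the weights are randomly initialized placeholders, as they would be at the start of training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Six 1-max pooled values: 2 filters per region size x 3 region sizes.
z = rng.random(6)

# Fully-connected softmax layer for binary classification (toy weights).
W = rng.random((2, 6))            # weight matrix, randomly initialized
b = rng.random(2)                 # bias vector
logits = W @ z + b
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.shape)                # (2,) -- probabilities for the two classes
```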

Conclusion

This short post clarifies the workings of a CNN on word embeddings by focusing on the dimensionality of the matrices at each intermediate step.

In the paper https://arxiv.org/pdf/1510.03820.pdf (A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification) it’s stated “we set the number of feature maps for this region size to 100.” Does this mean that the number of filters per region size is 100? Are feature maps equivalent to the number of filters per region size?

Thanks for the interesting question. From my understanding of Section 2.1, CNN Architecture, I agree with your interpretation. In the paper, feature maps are of dimension (sentence_length – region_size + 1, 1). The purpose is to learn complementary features from the same region size.

Hi, I also wonder about the same thing. What does it mean to have 100 feature maps?
What does this mean: “‘feature maps’ refers to the number of feature maps for each filter region size”?
How can one achieve more than one feature map for each region size? (Like in the example above)

Going back to our toy example in this blog, please refer to the area of the figure where it is labelled #featuremaps. Here, we have 2 feature maps for each region size. And in our toy example, the region size is (2,3,4).

Let’s try a more layman way of explaining this. Suppose we have 6 people who each independently try to build their own intuition to determine the sentiment of the sentence – two people for each window size of 2, 3 and 4 words.

You can bet that Person A and Person B will build different intuitions even though they are both looking at 4 words at a time, because their pre-existing knowledge is different (the parallel here is that the random initialization is different).

So, if we have 100 feature maps for each region size (3, 4, 5) as in the paper, it means:
100 different people are limited to looking at 3 consecutive words at a time.
100 different people are limited to looking at 4 consecutive words at a time.
100 different people are limited to looking at 5 consecutive words at a time.
A total of 300 different people.
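In code, the counting is simply:

```python
# Counting filters / feature maps for the paper's configuration.
feature_maps_per_region = 100      # "people" per window size
region_sizes = [3, 4, 5]

total_filters = feature_maps_per_region * len(region_sizes)
print(total_filters)               # 300 -- also the length of the pooled vector
```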

Is it not the case that software implementations like Keras actually use the transpose of what you show for #sentence, where it would be 5×7, and then the kernel is 5×h where the filter length h is 2, 3, 5 as above?

Following the section on #sentence: there, we have 6 words plus an exclamation mark. In the example, the exclamation mark is counted as one word, hence we have 7 words.
Since d = 5, we have a matrix of 7 x 5.

For your question, if there is just “i like”, then the matrix would be 2 * 5.

In both of the above, we have assumed that the dimension of word vectors is 5.

The high-level concept answer is that the 7 words are looked up in a lookup table of vectors. Every English word has a vector in this lookup table, and they have been pre-trained. Glove vectors and Word2Vec vectors are good examples of these. So in our toy example, each word vector has a length of 5. There are 7 words, so the resulting matrix is 7 x 5.

Thank you very much for the detailed description. I did not understand the filtering part completely. In the #featuremaps section, how are the numbers in the yellow matrix (2×5) defined? Are they randomly generated?

Hello Mr. Joshua,
I want to ask where the values in the first yellow section come from:
0.2 0.1 0.2 0.1 0.1
0.1 0.1 0.4 0.1 0.1
And one more question: is the classification of sentences the same as the classification of documents?
Please enlighten me, thank you Mr. Joshua.

(1) The values in the first yellow section are randomly initialized at the start of the training, and these values will change as the training proceeds.
(2) In principle, both sentences and documents are made up of words; therefore, we could argue for taking the same approach. On the other hand, there might be other nuances to consider – for example, a sentence is usually shorter than a document, and this difference could make a plain-vanilla character-level LSTM a viable option for sentence-level classification but maybe not for document-level classification.

Thanks for the great explanation. I read the paper which you are talking about. In it, the author mentioned a “zero-padding strategy such that all the tweets have the same length”. So when zero-padding is implemented, how does the shape of the matrix change? Will s change to the padding size?

Yeap, zero-padding refers to the act of padding rows of 0 onto the sentence matrix. Let’s refer back to the toy example in this post:
Suppose for every sentence, I want to consider 10 words instead of 7. The sentence matrix would then become 10 x 5 instead of 7 x 5. Referring to the white table #sentence in the picture, this means that we would have 3 more rows of 0 added (aka zero-padded) onto the sentence matrix. Notice we don’t decide the padding size; instead we decide the maximum number of words we want to consider, and zero-padding fills in the difference. This makes sense because every sentence is potentially of a different length.
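A quick numpy sketch of this padding step (placeholder values for the sentence matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

max_words, d = 10, 5
sentence = rng.random((7, d))                # the 7 x 5 toy sentence matrix

pad_rows = max_words - sentence.shape[0]     # 3 rows of zeros to add
padded = np.vstack([sentence, np.zeros((pad_rows, d))])
print(padded.shape)                          # (10, 5)
```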

Thanks for the question, please refer to the question of Trinadh and Radifan, they have both asked a similar question with two different perspectives. Please reach out if you have any further questions.

Yeap, the region sizes and the number of filters are hyperparameters of the model. We fix them and train the model on many different sentences of differing lengths. Zero-padding standardizes the maximum number of words considered (see my reply to Trinadh). How the CNN copes with sentences of only 1 or 2 words is through #1max. Referring back to the toy example, if we are considering a maximum of 7 words and the sentence only contains 2 words, “Great movie”, the sentence matrix will only have the top 2 rows populated; the remaining 5 rows will be filled with 0. As a result, the #featuremaps will contain many zeros, with the exception of the first few elements of each feature map. Then, #1max will capture the largest element of the #featuremaps, effectively ignoring the zeros.
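A sketch of that scenario (random placeholder values): the filter windows that see only padding produce exact zeros, which 1-max pooling then ignores:

```python
import numpy as np

rng = np.random.default_rng(0)

s_max, d, h = 7, 5, 2
# "Great movie": only the top 2 rows are populated, the rest are zero-padded.
sentence = np.zeros((s_max, d))
sentence[:2] = rng.random((2, d))

w = rng.random((h, d))                 # one toy 2-word filter
o = np.array([np.sum(sentence[i:i + h] * w) for i in range(s_max - h + 1)])

# Windows covering only padding give exact zeros; 1-max keeps the salient value.
pooled = o.max()
```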

Your article is very useful for my study, so I would like to ask several questions:

In the section #featuremaps, what is the meaning of the result from the 2-word filter, which is named the output sequence o above? I mean, what does this value represent? Does it represent a pattern, a sentiment or something else?

Unfortunately, the values of the word embeddings are not interpretable. There are techniques like t-SNE that project the multi-dimensional word embeddings into 2 dimensions so that humans get a sense of the clusters of words. But to answer your question, we don’t know whether the value represents sentiment, or any single concept/theme.

From my anecdotal experience, CNNs are faster to train. However, for a more robust analysis, please refer to “Comparative Study of CNN and RNN for Natural Language Processing” by Yin et al., 2017. They found that RNNs perform well and robustly in a broad range of tasks, except when the task is essentially a keyphrase recognition task, as in some sentiment detection and question-answer matching settings.

The method of converting a word to a vector and a text to vectors is the same. First the text is cleaned – usually involving lower-casing, removal of punctuation, etc. Then, the individual words are given an index, e.g. {1: Apple, 2: Ball, 3: Cat}. We also have a lookup embedding {1: [0.3, 0.4, 0.5], 2: [-0.2, 0.1, 0.4], 3: [0.8, 0.1, 0.2]}, so we substitute the word Apple with [0.3, 0.4, 0.5].
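A tiny sketch of that lookup, using the toy index and embedding tables from the reply above:

```python
# Toy vocabulary index and 3-d embedding table (values from the reply above).
index = {"apple": 1, "ball": 2, "cat": 3}
embedding = {1: [0.3, 0.4, 0.5], 2: [-0.2, 0.1, 0.4], 3: [0.8, 0.1, 0.2]}

def embed(word):
    # Clean (lower-case) the word, look up its index, then its vector.
    return embedding[index[word.lower()]]

print(embed("Apple"))   # [0.3, 0.4, 0.5]
```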

Thanks for your fantastic post. Just a couple of questions, I would appreciate if I get your point of view about them:

1- Do you have any intuition about what happens when we apply the CNN filters to word vectors? I can see that in the case of images the model would pick up features like edges, corners, etc., but what about text and word vectors? Also, what does max pooling mean in the context of words? Why is the max value an important feature, and why not the min?
2- Here, we have static/pre-trained word embeddings, but I saw the paper also talked about fine-tuning the word vectors while training the sentence classifier. I was wondering what fine-tuning means in this specific problem? Is it a separate neural net structure, or is it just a matrix (with the size of the number of words in the vocab by the embedding dimension size) that the model learns?

I’m sorry, I did not visit the website recently. Sure, let me try to help with those questions. I am glad you found my post helpful.

In another answer, I have drawn an analogy. For the action from #filters to #featuremaps to #1max, a rough analogy is to compare 1 filter to 1 person. This person (filter) is limited to reading only 2 words at a time across the whole sentence, to build intuition between the 2 words he/she reads and the label. This intuition is based on word vectors, and as you’ve already alluded to, each dimension of the word vector is not interpretable. As for the max pooling, the motivation is that we want the most salient feature from #featuremaps; the strongest signal, therefore, is the max signal. If we want strong signals, we wouldn’t want near-0 signals, right? There has also been ML research on building a dense layer that uses all input values from the #featuremaps to project them down to 1 dimension, instead of the naive max pooling.

2. Fine-tuning the word vectors means allowing gradient descent to backpropagate into the word vectors, such that the values of the word vectors change. It is usually the same neural net structure. The shape of the matrix (the number of words in the vocab by the embedding size) remains the same, but the values within the matrix are allowed to change.
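A rough numpy sketch of the static vs. non-static difference; the gradient values here are random placeholders, whereas a real framework would compute them via backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d = 1000, 5
E = rng.random((vocab_size, d))       # pre-trained embedding matrix
frozen = E.copy()                     # keep a copy for comparison

word_ids = np.array([3, 17, 42])      # words appearing in one training sentence
grad = rng.random((3, d))             # placeholder gradient reaching these rows
lr = 0.01

# Non-static (fine-tuned) mode: backprop updates only the rows that were used.
# Static mode would simply skip this update, leaving E frozen.
E[word_ids] -= lr * grad
```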