
1. Word Embeddings + CNN = Text Classification

The modus operandi for text classification involves the use of a word embedding for representing words and a Convolutional Neural Network (CNN) for learning how to discriminate documents on classification problems.

Yoav Goldberg, in his primer on deep learning for natural language processing, comments that neural networks in general offer better performance than classical linear classifiers, especially when used with pre-trained word embeddings.

The non-linearity of the network, as well as the ability to easily integrate pre-trained word embeddings, often lead to superior classification accuracy.

He also comments that convolutional neural networks are effective at document classification, namely because they are able to pick out salient features (e.g. tokens or sequences of tokens) in a way that is invariant to their position within the input sequences.

Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input. […] We would like to learn that certain sequences of words are good indicators of the topic, and do not necessarily care where they appear in the document. Convolutional and pooling layers allow the model to learn to find such local indicators, regardless of their position.

The approach therefore combines three key elements:

Word Embedding: A distributed representation of words where different words that have a similar meaning (based on their usage) also have a similar representation (see the short sketch after this list).

Convolutional Model: A feature extraction model that learns to extract salient features from documents represented using a word embedding.

Fully Connected Model: The interpretation of extracted features in terms of a predictive output.
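To make the first element concrete, here is a minimal sketch of an embedding layer that maps integer word indices to dense vectors; the vocabulary size and vector dimension are illustrative assumptions, not values from the papers discussed:

```python
# A minimal sketch of a word embedding layer: integer word indices
# map to dense vectors that are learned during training. The
# vocabulary size and vector dimension are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=5000, output_dim=100)

word_ids = np.array([[4, 20, 7]])   # one document encoded as three word indices
vectors = embedding(word_ids)       # shape: (1, 3, 100), one vector per word
print(vectors.shape)
```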

Yoav Goldberg highlights the CNN's role as a feature extractor in his book:

… the CNN is in essence a feature-extracting architecture. It does not constitute a standalone, useful network on its own, but rather is meant to be integrated into a larger network, and to be trained to work in tandem with it in order to produce an end result. The CNN layer’s responsibility is to extract meaningful sub-structures that are useful for the overall prediction task at hand.

The tying together of these three elements is demonstrated in perhaps one of the most widely cited examples of the combination, described in the next section.

2. Use a Single Layer CNN Architecture

You can get good results for document classification with a single layer CNN, perhaps with differently sized kernels across the filters to allow grouping of word representations at different scales.

Yoon Kim, in his study of the use of pre-trained word vectors for classification tasks with Convolutional Neural Networks, found that using pre-trained static word vectors performs very well. He suggests that pre-trained word embeddings trained on very large text corpora, such as the freely available word2vec vectors trained on 100 billion tokens of Google News text, may offer good universal features for use in natural language processing.

Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.

He also discovered that further task-specific tuning of the word vectors offers a small additional improvement in performance.

Kim describes the general approach of using CNN for natural language processing. Sentences are mapped to embedding vectors and are available as a matrix input to the model. Convolutions are performed across the input word-wise using differently sized kernels, such as 2 or 3 words at a time. The resulting feature maps are then processed using a max pooling layer to condense or summarize the extracted features.

The architecture is based on the approach used by Ronan Collobert, et al. in their paper “Natural Language Processing (almost) from Scratch“, 2011. In it, they develop a single end-to-end neural network model with convolutional and pooling layers for use across a range of fundamental natural language processing problems.

Kim provides a diagram that helps to visualize the sampling of the filters with differently sized kernels, shown as different colors (red and yellow).

An example of a CNN Filter and Pooling Architecture for Natural Language Processing. Taken from “Convolutional Neural Networks for Sentence Classification”, 2014.

Usefully, he reports his chosen model configuration, discovered via grid search and used across a suite of 7 text classification tasks, summarized as follows:

Transfer function: rectified linear (ReLU).

Kernel sizes: 3, 4, 5.

Number of filters: 100.

Dropout rate: 0.5.

Weight constraint (L2 norm): 3.

Batch size: 50.

Update rule: Adadelta.

These configurations could serve as a starting point for your own experiments.
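As an illustration, here is a minimal Keras sketch of a Kim-style single-layer CNN using the configuration above; the vocabulary size, sequence length, and number of classes are placeholder assumptions, not values from the paper:

```python
# A minimal Keras sketch of a Kim-style single-layer CNN for text
# classification, using the hyperparameters reported above. The
# vocabulary size, sequence length, and class count are assumptions.
from tensorflow.keras import layers, models, constraints

vocab_size = 20000   # assumed vocabulary size
seq_length = 400     # assumed (padded) document length
num_classes = 2      # assumed number of classes

inputs = layers.Input(shape=(seq_length,))
embedding = layers.Embedding(vocab_size, 100)(inputs)

# One convolution per kernel size, each with 100 filters and ReLU,
# followed by max pooling over time to summarize each feature map.
pooled = []
for kernel_size in (3, 4, 5):
    conv = layers.Conv1D(100, kernel_size, activation="relu")(embedding)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Concatenate()(pooled)
dropped = layers.Dropout(0.5)(merged)
outputs = layers.Dense(
    num_classes,
    activation="softmax",
    kernel_constraint=constraints.MaxNorm(3),  # L2 norm constraint of 3
)(dropped)

model = models.Model(inputs, outputs)
model.compile(optimizer="adadelta", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```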

3. Dial in CNN Hyperparameters

Some hyperparameters matter more than others when tuning a convolutional neural network on your document classification problem.

Ye Zhang and Byron Wallace performed a sensitivity analysis of the hyperparameters needed to configure a single layer convolutional neural network for document classification. The study is motivated by their claim that the models are sensitive to their configuration.

Unfortunately, a downside to CNN-based models – even simple ones – is that they require practitioners to specify the exact model architecture to be used and to set the accompanying hyperparameters. To the uninitiated, making such decisions can seem like something of a black art because there are many free parameters in the model.
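Given that sensitivity, a small systematic search over the most influential hyperparameters is a sensible response. The sketch below is purely illustrative: the build_model factory, the parameter ranges, and the synthetic data are assumptions, not the authors' experimental protocol:

```python
# A hypothetical grid search over hyperparameters a single-layer CNN
# tends to be sensitive to (kernel size, number of filters, dropout).
# The model, ranges, and synthetic data are illustrative assumptions.
import itertools
import numpy as np
from tensorflow.keras import layers, models

def build_model(kernel_size, num_filters, dropout, vocab_size=1000):
    """Single-layer CNN, parameterized by the hyperparameters under study."""
    model = models.Sequential([
        layers.Embedding(vocab_size, 50),
        layers.Conv1D(num_filters, kernel_size, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dropout(dropout),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Synthetic stand-in data so the sketch runs end to end.
rng = np.random.default_rng(1)
X = rng.integers(1, 1000, size=(200, 50))
y = rng.integers(0, 2, size=(200,))
X_train, y_train, X_val, y_val = X[:160], y[:160], X[160:], y[160:]

best_acc, best_params = 0.0, None
for ks, nf, dr in itertools.product([3, 5, 7], [50, 100], [0.0, 0.5]):
    model = build_model(ks, nf, dr)
    model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=0)
    _, acc = model.evaluate(X_val, y_val, verbose=0)
    if acc > best_acc:
        best_acc, best_params = acc, (ks, nf, dr)

print("best (kernel, filters, dropout):", best_params, "accuracy:", best_acc)
```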

4. Consider Character-Level CNNs

Text documents can be modeled at the character level using convolutional neural networks that are capable of learning the relevant hierarchical structure of words, sentences, paragraphs, and more.

Xiang Zhang, et al. use a character-based representation of text as input for a convolutional neural network. The promise of the approach is that all of the labor-intensive effort required to clean and prepare text could be overcome if a CNN can learn to abstract the salient details.

… deep ConvNets do not require the knowledge of words, in addition to the conclusion from previous research that ConvNets do not require the knowledge about the syntactic or semantic structure of a language. This simplification of engineering could be crucial for a single system that can work for different languages, since characters always constitute a necessary construct regardless of whether segmentation into words is possible. Working on only characters also has the advantage that abnormal character combinations such as misspellings and emoticons may be naturally learnt.

The model reads in one-hot encoded characters from a fixed-size alphabet. Encoded characters are read in blocks or sequences of 1,014 characters. A stack of 6 convolutional layers with pooling follows, with 3 fully connected layers at the output end of the network in order to make a prediction.
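As a concrete illustration, here is a minimal Keras sketch of this character-level architecture; the alphabet size, sequence length, and filter counts follow the paper's description, while the number of classes (and the choice of Keras) are assumptions for illustration:

```python
# A minimal sketch of the character-level ConvNet described above:
# one-hot characters from a fixed alphabet feed a stack of six
# convolutional layers with pooling, then three fully connected
# layers. Filter counts follow the paper's small configuration;
# the number of classes is an illustrative assumption.
from tensorflow.keras import layers, models

num_chars = 70      # size of the fixed alphabet
seq_length = 1014   # characters read per document
num_classes = 4     # assumed number of classes

model = models.Sequential()
model.add(layers.Input(shape=(seq_length, num_chars)))  # one-hot characters
# Six convolutional layers; pooling follows the 1st, 2nd, and 6th.
for filters, kernel, pool in [(256, 7, 3), (256, 7, 3), (256, 3, None),
                              (256, 3, None), (256, 3, None), (256, 3, 3)]:
    model.add(layers.Conv1D(filters, kernel, activation="relu"))
    if pool:
        model.add(layers.MaxPooling1D(pool))
model.add(layers.Flatten())
# Three fully connected layers make the prediction.
model.add(layers.Dense(1024, activation="relu"))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1024, activation="relu"))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(num_classes, activation="softmax"))
model.summary()
```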

The model achieves some success, performing better on problems that offer a larger corpus of text.

… analysis shows that character-level ConvNet is an effective method. […] how well our model performs in comparisons depends on many factors, such as dataset size, whether the texts are curated and choice of alphabet.

Results using an extended version of this approach were pushed to the state-of-the-art in a follow-up paper covered in the next section.

5. Consider Deeper CNNs for Classification

Better performance can be achieved with very deep convolutional neural networks, although standard and reusable architectures have not yet been adopted for classification tasks.

Alexis Conneau, et al. comment on the relatively shallow networks used for natural language processing and the success of much deeper networks used for computer vision applications. For example, Kim (above) restricted the model to a single convolutional layer.

Other architectures used for natural language reviewed in the paper are limited to 5 and 6 layers. These are contrasted with successful architectures used in computer vision with 19 or even up to 152 layers.

They suggest and demonstrate that there are benefits for hierarchical feature learning with a very deep convolutional neural network model, which they call VDCNN.

… we propose to use deep architectures of many convolutional layers to approach this goal, using up to 29 layers. The design of our architecture is inspired by recent progress in computer vision […] The proposed deep convolutional network shows significantly better results than previous ConvNets approach.

Key to their approach is an embedding of individual characters, rather than a word embedding.

We present a new architecture (VDCNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations.

Results on a suite of 8 large text classification tasks show better performance than more shallow networks; specifically, the approach achieved state-of-the-art results on all but two of the datasets tested, at the time of writing.

Generally, they make some key findings from exploring the deeper architectural approach:

The very deep architecture worked well on both small and large datasets.

Deeper networks decrease classification error.

Max-pooling achieves better results than other, more sophisticated types of pooling.

Generally, going deeper degrades accuracy beyond a point; the shortcut connections used in the architecture are important.
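To ground these ideas, here is a rough Keras sketch of the VDCNN pattern, corresponding roughly to the paper's 9-layer variant; the character embedding size and filter counts follow the paper, while the class count and the use of global max pooling in place of the paper's k-max pooling are illustrative assumptions:

```python
# A rough sketch of the VDCNN pattern: a small character embedding
# feeds stacks of "convolutional blocks" (two size-3 convolutions,
# each with batch normalization and ReLU), with the number of filters
# doubling each time pooling halves the resolution. This corresponds
# roughly to the 9-layer variant; depths and sizes are illustrative.
from tensorflow.keras import layers, models

num_chars, seq_length, num_classes = 70, 1024, 4  # class count assumed

def conv_block(x, filters):
    """Two size-3 convolutions with batch norm and ReLU."""
    for _ in range(2):
        x = layers.Conv1D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return x

inputs = layers.Input(shape=(seq_length,))
x = layers.Embedding(num_chars, 16)(inputs)   # embed characters, not words
x = layers.Conv1D(64, 3, padding="same")(x)
for filters in (64, 128, 256, 512):           # filters double per stage
    x = conv_block(x, filters)
    x = layers.MaxPooling1D(pool_size=3, strides=2, padding="same")(x)
# The paper uses k-max pooling here; global max pooling is a simple stand-in.
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(2048, activation="relu")(x)
x = layers.Dense(2048, activation="relu")(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.summary()
```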
