Understanding Natural Language with Deep Neural Networks Using Torch

This post was co-written by Soumith Chintala and Wojciech Zaremba of Facebook AI Research.

Language is the medium of human communication. Giving machines the ability to learn and understand language enables products and possibilities that are not imaginable today.

One can understand language at varying granularities. When you learn a new language, you start with words: understanding their meaning, identifying similar and dissimilar words, and developing a sense of a word's contextual appropriateness. You start with a small dictionary, building it up over time and mentally placing each newly learned word close to similar ones. Once you are familiar with your dictionary of words, you put them together into small sentences, learning grammar and structure. Eventually you combine sentences in a sensible way to write paragraphs and pages. Once you get to this stage, you are comfortable expressing complicated thoughts in language, letting others understand your thoughts and expression.

As an example, language understanding gives one the ability to understand that the sentences “I’m on my way home.” and “I’m driving back home.” both convey that the speaker is going home.

Word Maps and Language Models

For a machine to understand language, it first has to develop a mental map of words, their meanings, and their interactions with other words. It needs to build a dictionary of words and understand where they stand semantically and contextually, compared to other words in its dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, called a “word embedding”. Similar words are close to each other in this space, and dissimilar words are far apart. Some word embeddings even encode mathematical properties such as addition and subtraction (for some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion beforehand by reading large amounts of text (like Wikipedia), or specially learned for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.

After the machine has learned word embeddings, the next problem to tackle is the ability to string words together appropriately in small, grammatically correct sentences which make sense. This is called language modeling. Language modeling is one part of quantifying how well the machine understands language.

For example, given a sentence (“I am eating pasta for lunch.”) and a word (“cars”), if the machine can tell you with high confidence whether or not the word is relevant to the sentence (“cars” is related to this sentence with a probability of 0.01, and the model is quite confident about it), that indicates that the machine understands something about words and contexts.

An even simpler metric is to predict the next word in a sentence. Given a sentence, the machine assigns each word in its dictionary a probability of appearing next. For example:

“I am eating _____”

To fill in the blank, a good language model would likely give higher probabilities to all edibles like “pasta”, “apple”, or “chocolate”, and it would give lower probability to other words in the dictionary which are contextually irrelevant like “taxi”, “building”, or “music”.
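To make this concrete, here is a toy sketch in Lua (the language Torch is built on) of what such an output looks like: a score for each word in a tiny made-up dictionary, normalized into a probability distribution. The words and scores are invented purely for illustration.

```lua
-- Toy next-word scores for the context "I am eating ___".
-- A real language model would produce these from the context;
-- here they are made up to illustrate the idea.
local scores = { pasta = 5.1, apple = 4.7, chocolate = 4.2,
                 taxi = 0.3, building = 0.1, music = 0.2 }

-- Softmax: exponentiate and normalize so the values form a
-- probability distribution over the dictionary.
local probs, total = {}, 0
for word, s in pairs(scores) do total = total + math.exp(s) end
for word, s in pairs(scores) do probs[word] = math.exp(s) / total end

-- Edible continuations come out far more probable than
-- contextually irrelevant ones.
print(probs.pasta, probs.taxi)
```

The absolute scores don't matter; only their relative sizes do, since the softmax turns any set of scores into probabilities that sum to one.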

Traditionally language modeling has been done by computing n-grams—groups of words—and processing the n-grams further with heuristics, before feeding them into machine learning models. For example, the 2-grams for the sentence

“I am eating an apple.”

are “I am”, “am eating”, “eating an”, and “an apple”.

When you read a large body of text, like Wikipedia, you can generate new sentences by pairing together 2-grams and 3-grams and matching them with other pairs that were seen before. Sentences generated this way might be grammatically correct, but they can also be totally nonsensical. Over the last few years, deep neural networks have beaten n-gram-based models comfortably on a wide variety of natural language tasks.
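Extracting n-grams is simple enough to sketch in a few lines of Lua; this version splits on whitespace and skips the punctuation handling a real tokenizer would need:

```lua
-- Split a sentence into whitespace-delimited tokens (a real
-- tokenizer would also strip or separate punctuation).
local function tokenize(sentence)
  local words = {}
  for w in sentence:gmatch("%S+") do table.insert(words, w) end
  return words
end

-- Collect all n-grams of a sentence as space-joined strings.
local function ngrams(sentence, n)
  local words, grams = tokenize(sentence), {}
  for i = 1, #words - n + 1 do
    table.insert(grams, table.concat(words, " ", i, i + n - 1))
  end
  return grams
end

local g = ngrams("I am eating an apple", 2)
-- g is {"I am", "am eating", "eating an", "an apple"}
```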

Deep Neural Networks for Language

Deep learning—neural networks that have several stacked layers of neurons, usually accelerated in computation using GPUs—has seen huge success recently in many fields such as computer vision, speech recognition, and natural language processing, beating the previous state-of-the-art results on a variety of tasks, including language modeling, translation, and object recognition in images.

Within neural networks, certain kinds of networks are more popular and better suited than others to particular problems. Continuing on the topic of word embeddings, let’s discuss word-level networks, where each word in the sentence is translated into a set of numbers before being fed into the neural network. These numbers change over time while the neural net trains itself, encoding unique properties such as the semantics and contextual information of each word.

Word embeddings are not specific to any one architecture; they are common to all word-level neural language models. Embeddings are stored in a simple lookup table (or hash table) that, given a word, returns its embedding (an array of numbers). Figure 1 shows an example.

Figure 1: Word embeddings are usually stored in a simple lookup table. Given a word, the word vector of numbers is returned. Given a sentence, a matrix of vectors for each word in the sentence is returned.

Word embeddings are usually initialized to random numbers (and learned during the training phase of the neural network), or initialized from previously trained models over large texts like Wikipedia.
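In Torch, the lookup table of Figure 1 corresponds to the nn.LookupTable module. Here is a minimal sketch; the dictionary size and embedding dimension below are illustrative choices, not values from the post:

```lua
require 'nn'

local vocabSize, embeddingSize = 10000, 50  -- illustrative sizes

-- nn.LookupTable maps word indices to rows of a weight matrix;
-- the rows are the word embeddings, initialized randomly and
-- updated as the network trains.
local embed = nn.LookupTable(vocabSize, embeddingSize)

-- A "sentence" of four word indices comes back as a 4 x 50 matrix,
-- one embedding vector per word, matching Figure 1.
local sentence = torch.LongTensor{42, 7, 1999, 42}
local vectors = embed:forward(sentence)
print(vectors:size())
```

Note that the two occurrences of index 42 return the same row, which is exactly the lookup-table behavior described above.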

Feed-forward Convolutional Neural Networks

Convolutional Neural Networks (ConvNets), which were covered in a previous Parallel Forall post by Evan Shelhamer, have enjoyed wide success in the last few years in several domains including images, video, audio and natural language processing.

When applied to images, ConvNets usually take raw image pixels as input, interleaving convolution layers along with pooling layers with non-linear functions in between, followed by fully connected layers. Similarly, for language processing, ConvNets take the outputs of word embeddings as input, and then apply interleaved convolution and pooling operations, followed by fully connected layers. Figure 2 shows an example ConvNet applied to sentences.

Figure 2: ConvNets applied to text by Collobert et al. [2]. These ConvNets are largely the same as the ones used for object classification on images.

Recurrent Neural Networks (RNN)

Convolutional Neural Networks—and more generally, feed-forward neural networks—do not traditionally have a notion of time or experience unless you explicitly pass samples from the past as input. After they are trained, given an input, they treat it no differently when shown the input the first time or the 100th time. But to tackle some problems, you need to look at past experiences and give a different answer.

Figure 3: The importance of context. Predicting the next word, given only the previous word and no context.

If you send sentences word-by-word into a feed-forward network, asking it to predict the next word, it will do so, but without any notion of the current context. The animation in Figure 3 shows why context is important. Clearly, without context, you can produce sentences that make no sense. You can have context in feed-forward networks, but it is much more natural to add a recurrent connection.

A recurrent neural network has the capability to give itself feedback from past experiences. In addition to all the neurons in the network, it maintains a hidden state that changes as it sees different inputs. This hidden state is analogous to short-term memory: it remembers past experiences and bases its current answer on both the current input and past experience. An illustration is shown in Figure 4.

Figure 4: A recurrent neural network has memory of past experiences. The recurrent connection preserves these experiences and helps the network keep a notion of context.

Long Short Term Memory (LSTM)

RNNs keep context in their hidden state (which can be seen as memory). However, classical recurrent networks forget context very quickly: they take only a few words from the past into account when making predictions. Here is an example of a language modeling problem that requires longer-term memory.

I bought an apple … I am eating the _____

The probability of the word “apple” should be much higher than any other edible like “banana” or “spaghetti”, because the previous sentence mentioned that you bought an “apple”. Furthermore, any edible is a much better fit than non-edibles like “car”, or “cat”.

Long Short Term Memory (LSTM) [6] units try to address the problem of such long-term dependencies. LSTM has multiple gates that act as a differentiable RAM memory. Access to memory cells is guarded by “read”, “write” and “erase” gates. Information stored in memory cells is available to the LSTM for a much longer time than in a classical RNN, which allows the model to make more context-aware predictions. An LSTM unit is shown in Figure 5.

Figure 5: Illustration of an LSTM unit. The write gate controls the amount of current input to be remembered for the future, the read gate controls the amount of the current memory to be given as output to the next stage, and the erase gate controls what part of the memory cell is erased or retained in the current time step.

Exactly how LSTM works is unclear, and fully understanding it is a topic of contemporary research. However, it is known that LSTM outperforms conventional RNNs on many tasks.

Torch

Torch is a scientific computing framework with packages for neural networks and optimization (among hundreds of others). It is based on the Lua language, which is similar to JavaScript and serves largely as a wrapper around optimized C/C++ and CUDA code.

At the core of Torch is a powerful tensor library similar to NumPy. The Torch tensor library has both CPU and GPU backends. The neural networks package in Torch implements modules, which are different kinds of neuron layers, and containers, which can hold several modules. Modules are like Lego blocks: they can be plugged together to form complicated neural networks.

Each module implements a function and its derivative. This makes it easy to calculate the derivative of the network's objective function with respect to any neuron in the network, via the chain rule. The objective function is simply a mathematical formula that measures how well the model is doing on the given task. Usually, the smaller the objective, the better the model performs.

The following small example shows how to calculate the element-wise Tanh of an input matrix by creating an nn.Tanh module and passing the input through it. The derivative of the objective with respect to the input is obtained by passing a gradient in the backward direction.
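A sketch of that example using Torch7's nn API; the 5x5 input and the all-ones gradient are illustrative choices:

```lua
require 'nn'

local m = nn.Tanh()               -- module computing element-wise tanh
local input = torch.randn(5, 5)   -- random 5 x 5 input matrix

-- Forward pass: apply tanh to every element of the input.
local output = m:forward(input)

-- Backward pass: given the derivative of the objective with respect
-- to the module's output (all ones here, for illustration), the
-- module returns the derivative with respect to its input via the
-- chain rule.
local gradOutput = torch.ones(5, 5)
local gradInput = m:backward(input, gradOutput)
```

Every nn module follows this same forward/backward contract, which is what lets arbitrary stacks of modules be trained end to end.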

Such a ConvNet has :forward and :backward functions that allow you to train the network (on CPUs or GPUs); calling m:cuda() transfers the network to the GPU.
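As a sketch, a sentence-level ConvNet along these lines might be assembled as follows. The layer sizes, the max-pooling choice, and the two-class output are assumptions for illustration, not the post's exact network:

```lua
require 'nn'

local vocabSize, embeddingSize = 10000, 50  -- illustrative sizes
local nFilters, kW = 150, 3                 -- feature maps, kernel width

local m = nn.Sequential()
m:add(nn.LookupTable(vocabSize, embeddingSize))         -- words -> vectors
m:add(nn.TemporalConvolution(embeddingSize, nFilters, kW)) -- convolve over time
m:add(nn.Tanh())                                        -- non-linearity
m:add(nn.Max(1))                                        -- max-pool over time
m:add(nn.Linear(nFilters, 2))                           -- e.g. 2-class sentiment
m:add(nn.LogSoftMax())

-- Transfer the network to the GPU (requires the cunn package).
require 'cunn'
m:cuda()
```

The input is a torch.LongTensor of word indices; the temporal convolution slides a window of kW word vectors at a time, and the max-pooling collapses the variable-length sentence into a fixed-size feature vector.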

An extension to the nn package is the nngraph package, which lets you build arbitrary acyclic graphs of neural networks. nngraph makes it easier to build complicated modules such as the LSTM memory unit.
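For instance, a single LSTM step can be expressed in nngraph roughly as follows. This sketch follows the style of the code discussed in the comments below (params.rnn_size, the hidden dimension, is assumed to be defined elsewhere):

```lua
require 'nngraph'

-- One LSTM time step: x is the input, prev_c / prev_h are the
-- previous cell and hidden states; returns the next cell and
-- hidden states.
local function lstm(x, prev_c, prev_h)
  -- One linear map each for the input and recurrent connections,
  -- producing all four gate pre-activations at once.
  local i2h = nn.Linear(params.rnn_size, 4*params.rnn_size)(x)
  local h2h = nn.Linear(params.rnn_size, 4*params.rnn_size)(prev_h)
  local gates = nn.CAddTable()({i2h, h2h})

  -- Slice the pre-activations into the four gates.
  local reshaped = nn.Reshape(4, params.rnn_size)(gates)
  local n1, n2, n3, n4 = nn.SplitTable(2)(reshaped):split(4)
  local in_gate      = nn.Sigmoid()(n1)
  local in_transform = nn.Tanh()(n2)
  local forget_gate  = nn.Sigmoid()(n3)
  local out_gate     = nn.Sigmoid()(n4)

  -- Next cell state: erase part of the old memory (forget gate)
  -- and write part of the new input (write gate).
  local next_c = nn.CAddTable()({
    nn.CMulTable()({forget_gate, prev_c}),
    nn.CMulTable()({in_gate, in_transform})
  })
  -- Next hidden state: read from the cell through the output gate.
  local next_h = nn.CMulTable()({out_gate, nn.Tanh()(next_c)})
  return next_c, next_h
end
```

Note how each nngraph node is "called" on its inputs to wire up the graph; this is the foo(param1, param2)(param3) style that comes up in the comments below.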

With these few lines of code we can create powerful state-of-the-art neural networks, ready for execution on CPUs or GPUs with good efficiency.

cuBLAS, and more recently cuDNN, have accelerated deep learning research quite significantly, and the recent success of deep learning can be partly attributed to these libraries from NVIDIA. cuBLAS is automatically used by Torch for BLAS operations such as matrix multiplications, and accelerates neural networks significantly compared to CPUs.

To use NVIDIA cuDNN in Torch, simply replace the nn. prefix of each module name with cudnn. instead. cuDNN accelerates the training of neural networks compared to Torch’s default CUDA backend (sometimes by up to 30%) and is often many times faster than training on CPUs.
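For example, an image-style stack built on Torch's default CUDA backend switches to cuDNN just by changing module prefixes. A hedged sketch (the layer sizes are illustrative, and the cudnn.torch bindings must be installed):

```lua
require 'cunn'
require 'cudnn'

-- With the default CUDA backend this stack would be built from
-- nn.SpatialConvolution, nn.ReLU, and nn.SpatialMaxPooling.
-- With cuDNN, only the prefix changes:
local m = nn.Sequential()
m:add(cudnn.SpatialConvolution(3, 64, 5, 5))  -- 3 input planes, 64 filters, 5x5
m:add(cudnn.ReLU())
m:add(cudnn.SpatialMaxPooling(2, 2, 2, 2))    -- 2x2 pooling, stride 2
m:cuda()
```

Because the cudnn modules implement the same :forward/:backward interface as their nn counterparts, the rest of the training code is unchanged.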

We compare the training time of the network on an Intel Core i7 at 2.6 GHz with training accelerated on an NVIDIA GeForce GTX 980 GPU. Table 2 shows the training times and GPU speedups for a small RNN and a larger RNN.

Table 2: Training times of a state-of-the-art recurrent network with LSTM cells on CPU vs GPU.

Recurrent Neural Networks seem to be very powerful learning models. But how powerful are they? Would they be able to learn how to add two decimal numbers?

We trained an LSTM-RNN to predict the result of adding two decimal numbers, which is almost the same problem as language modeling. In this case we ask the model to read a “sentence” character by character and try to tell what fits best into the missing space.

123 + 19 = ____

Here, the correct answer consists of four characters: “1”, “4”, “2”, and the end-of-sequence character. Surprisingly, an LSTM with small tweaks is able to learn, with 99% accuracy, how to add numbers of up to 9 digits.

13828700 + 10188872 = 24017572

Such a task involves learning about the carry operator, and how to add digits. On seeing this result, you might feel excited about how smart and powerful LSTM potentially is. However, deeper scrutiny reveals that LSTM is a cheater. Training it on sequences up to 9 digits gives good test performance on sequences up to 9 digits. Yet it fails on longer sequences of digits. This means that the LSTM hasn’t learned the true algorithm behind number addition. Nonetheless, it did learn something about addition.
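Generating training “sentences” for this task is straightforward. In the sketch below the end-of-sequence marker is assumed to be “.”, which is an illustrative choice rather than the exact format used in the experiments:

```lua
-- Build one training example for the addition task: the input is
-- the textual sum, the target is its result followed by an
-- end-of-sequence marker ("." here, chosen for illustration).
local function example(a, b)
  local input  = string.format("%d + %d = ", a, b)
  local target = string.format("%d.", a + b)
  return input, target
end

-- Sample a random training pair with up to maxDigits digits
-- per operand.
local function randomExample(maxDigits)
  local limit = math.floor(10^maxDigits) - 1
  return example(math.random(0, limit), math.random(0, limit))
end

local input, target = example(123, 19)
-- input is "123 + 19 = " and target is "142."
```

The model never sees the numbers as numbers; it only sees character sequences, so carrying and digit-wise addition have to be discovered from examples alone.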

We have examined what an LSTM can do on much harder problems. For example, can an LSTM simulate computer program execution? We reused the code from our addition example, changing only the training examples. This time the input consists of a character-level representation of a program in a restricted subset of Python, and the target output is the result of executing the program. Our examples look like the following code snippet.

c=142012
for x in range(12):c-=166776
print(c)

target output: -1859300

Once again the LSTM proved to be powerful enough to learn, to some extent, the mapping from programs to program execution results. Prediction performance is far from the 100% achievable by a standard Python interpreter, but the LSTM predicts far better than pure chance.

How to make RNN-LSTM models even more powerful remains a research challenge. We bet that an LSTM powerful enough to act as a Python interpreter would also be good at natural language processing tasks. The only difference between the two tasks is the underlying language: Python vs. English!

Learn More at GTC 2015

If you’re interested in learning more about deep learning with Torch, Soumith Chintala will be leading a hands-on lab called “Applied Deep Learning for Vision, Natural Language and Audio with Torch7” at the 2015 GPU Technology Conference, at 3:30 PM on Wednesday, March 18, in room 211A of the San Jose Convention Center (session S5574).

With dozens of sessions on Machine Learning and Deep Learning, you’ll find that GTC is the place to learn about machine learning in 2015! Readers of Parallel Forall can use the discount code GM15PFAB to get 20% off any conference pass! Register Now!

About Soumith Chintala
Soumith Chintala is a Research Engineer at Facebook AI Research. He works on applying deep learning to computer vision and natural language problems, and on building systems that can adapt to new situations.

Comments

pSz

The performance comparison section could use some clarification. Both the CPU model and the means of parallelization used in the CPU code are omitted. One can only hope that it’s not some low-end, several-years-old i7 (mobile? do desktop i7 parts at that clock even exist?) that you compare the latest GeForce cards against. And then we have not even talked about power consumption and other aspects.

Given that past, even very recent, posts on this blog have made highly questionable and rather unfair performance comparisons, I cannot help thinking that the same is happening here.

Soumith should respond to your specific comments, but I’d like to discuss your claim that “past, even very recent, posts on this blog have made highly questionable and rather unfair performance comparisons.”

I think this claim is quite untrue, but if you can provide some pointers to specific examples, I’d be happy to investigate.

I think that the “culture of sloppy comparisons” actually changed years ago, and at NVIDIA we are very careful with accelerated computing comparisons. Again, if you can point out specific examples on this blog that you think are “sloppy” I’d be happy to look into the details.

Mark

pSz

Mark, coincidentally, it is your recent 7.0 RC overview blog article that compares the performance of the new solver library on a K40 against some 4-5 years old Sandy Bridge desktop CPU.

pSz, have you considered the cost of memory? The Xeon CPU price does not include 12GB of GDDR5 RAM with ECC, while the K40 price does. The Xeon CPU TDP does not include the power for the memory, while the K40 TDP does.

pSz

Good point! However, the GPU does need a host too, doesn’t it?

Today this host would most often have at least as much memory as the GPU, and in fact you will rarely have drastically less memory in GPU-equipped servers than in non-GPU-equipped ones (given a fixed set of use-cases with a certain memory requirement). Plugging the K40 into a desktop box defeats the purpose of the Tesla (among other things, its ECC), so the comparison between CPU-only and CPU+GPU server platforms will likely boil down to two cases: either single-socket + GPU vs. dual-socket without GPU, if density is not a concern; or, since a single-socket machine has only so much memory bandwidth and so many PCIe lanes on the CPU side, the quite likely more realistic comparison of dual-socket lower-end CPUs + GPUs (e.g. 2x2620v3 + 2xK40) vs. dual-socket higher-end CPUs (e.g. 2x2680v3).

PS: You could of course throw a third class of systems into the comparison: exotic stuff like the One Stop Systems 16-way 3U HDCA box, which gives 14*16=224 GPUs per 42U rack if we don’t count switches and hosts, so more realistically 12*16=192 GPUs/rack (if feeding this beast with power and dissipating its heat is even possible in this setup). For workloads that are *very* parallel and very GPU-friendly such a system can do miracles. However, even this density is not unheard of in the CPU-only world. Take the Dell PowerEdge FX2 (up to four half-wide 1U 2-socket server modules in 2U), which allows ~164 sockets in a 42U rack, or the Supermicro MicroBlade (up to 28 2-socket modules in 7U), which based on specs allows up to 192 sockets per 42U rack.

chinso00

Hey pSz,

While your skepticism is good, I’ve given the source code (and instructions on how to run it) right in the blog post; there is nothing to really hide here.
I’ve rerun the benchmark on the DIGITS box, which has a 6-core Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz. Hopefully that satisfies your constraints for a fair comparison. It takes 207 minutes in total on the CPU and 29 minutes on the GPU (which is actually an even bigger speedup than the one given in the blog post). The BLAS used is OpenBLAS (latest trunk) with multiple cores enabled for all BLAS operations.

p.s.: sorry for the delayed reply. This is the earliest that I could take time to rerun the benchmarks on something more appropriate.

—
Soumith

pSz

Thanks Soumith for the clarification.

I still believe that for a correct and complete description of what your “Table 2” compares, at least the following information is necessary:
– exact model number of the CPU;
– compiler, flags, etc. as well as means of parallelization used in the CPU code (or a reference to what the code used is capable of, e.g. threading, SIMD);
– amount of resources used on the CPU to execute the experiment (number of threads, HT on/off, turbo boost, etc.)

The same applies to the GPU code and the experiments done with it, too. Without all of that, IMHO such comparisons belong on the blogs of specialist communities who may not care much about such “irrelevant” technicalities.

> I’ve given the source code (and instructions) on how to run the code right in the blogpost, there is nothing to really hide here.

I suspect that you’re missing my point, after this comment even more so. This is the “Parallel Forall blog” and not a machine learning specialist one. Based on the description (https://devblogs.nvidia.com/parallelforall/about/) the blog advertises itself as focused on “detailed technical information on a variety of massively parallel programming topics” and mentions “high-performance programming techniques” as one of the topics discussed. Finally, in the last paragraph it highlights how GPU computing claims its space in the world of HPC/scientific computing.

To live up to these ideals, I believe at least this blog (but preferably the entire HPC/computing division at NVIDIA, including marketing) needs to become (even) better at being more “C” and less “H”. To be successful with a “parent and provider” living off of the gamer community, and with opponents as well as partners deeply embedded in the scientific and technical computing world (and in the minds of those who are part of it), I think it is highly beneficial, if not necessary, for new players to be as honest as possible with the coders visiting such a technical blog. And if the competition pulls dirty tricks, disprove their numbers, or ask the community to do it. I’m sure many will gladly contribute to the extent possible!

chinso00

I’ve given you the exact model number of the CPU in the first comment. It is multi-thread capable (using OpenMP where appropriate) and it is SIMD-enabled for BLAS operations (the sigmoid and softmax are not SIMD, understandably so, as the instructions to do SIMD for those operations are not obvious or universal). The amount of resources used on the CPU was not recorded.

pSz

Perhaps I’m confused, but where exactly does your comment state what your article’s “Table 2” compares against? And in my humble opinion, your article needs amending, rather than comments that provide _additional_ data while leaving the existing text unfixed.

Let me say it again: I applaud his prompt and effective actions. At the same time, we are instead arguing here (you and me) about something simple and straightforward: that the benchmark data in your article is incomplete (and possibly bogus).

Hidekazu Oki

Soumith, thanks for the wonderful post! It was very interesting to read your blog. At any rate, I have a few questions about the source code for the LSTM. How can I get help / advice understanding the code? Specifically, I am wondering about the lines that look like the following:

local i2h = nn.Linear(params.rnn_size, 4*params.rnn_size)(x)

What exactly is the variable x doing there? Why is the function call in the following format?

foo(param1, param2)(param3) ?

This would make sense if foo returns a function, but it doesn’t seem that way…

Thanks in advance for your help!

-Hidekazu

chinso00

Hi Hidekazu,

I wouldn’t do justice to explaining this, compared to this excellent post by Adam Paszke who tears apart that piece of code and explains it with the help of nice diagrams and math:

Walt Parkman

…rch-distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:185: in main chunk

[C]: at 0x00406670

[2.1933s]

So where did I mess up?
Thanks

chinso00

Hi Walt. What do you mean by: “how do I talk to it”?

Walt Parkman

I wanted to know how to give it an incomplete sentence “The clouds are in the” and have it answer “sky”. Is there a ready-made interface to the Lua program, or a tutorial describing how to make such an interface?

Sandeep Karthikeyan

Thanks a lot for such an easily understandable example of LSTM.

Ranjan

Hi Soumith,
In the TDNN example, I believe there’s some issue with this line:
m:add(nn.TemporalConvolution(sentenceLength, 150, embeddingSize))
From the nn readme for Temporal Convolution:
module = nn.TemporalConvolution(inputFrameSize, outputFrameSize, kW, [dW])
In the above example, I guess the input frame size is embeddingSize, and the output frame size is 150.
Hence, something like this seems more appropriate (with kW set to the desired kernel width):
m:add(nn.TemporalConvolution(embeddingSize, 150, kW))