Recurrent Neural Networks

Introduction

Language Modeling

In this tutorial we will show how to train a recurrent neural network on
the challenging task of language modeling. The goal of the problem is to fit a
probabilistic model which assigns probabilities to sentences. It does so by
predicting the next word in a text given a history of the previous words. For this
purpose we will use the Penn Tree Bank
(PTB) dataset, which is a popular benchmark for measuring the quality of these
models, whilst being small and relatively fast to train.

Language modeling is key to many interesting problems such as speech
recognition, machine translation, or image captioning. It is also fun --
take a look here.

For the purpose of this tutorial, we will reproduce the results from
Zaremba et al., 2014
(pdf), which achieves very good quality
on the PTB dataset.

Download and Prepare the Data

The dataset is already preprocessed and contains 10,000 different words overall,
including the end-of-sentence marker and a special symbol (<unk>) for rare
words. In reader.py, we convert each word to a unique integer identifier,
in order to make it easy for the neural network to process the data.
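For illustration, here is a small sketch of that kind of preprocessing. It is
not the exact reader.py code; build_vocab and the toy sentence are invented for
the example:

import collections

def build_vocab(words):
    # Count word frequencies and give smaller IDs to more frequent words.
    counter = collections.Counter(words)
    sorted_words = sorted(counter, key=lambda w: (-counter[w], w))
    return {word: word_id for word_id, word in enumerate(sorted_words)}

sentence = "the cat sat on the mat <eos>".split()
word_to_id = build_vocab(sentence)
data = [word_to_id[w] for w in sentence]
# "the" occurs twice, so it gets ID 0; every other word gets its own ID as well.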

The Model

LSTM

The core of the model consists of an LSTM cell that processes one word at a
time and computes probabilities of the possible values for the next word in the
sentence. The memory state of the network is initialized with a vector of zeros
and gets updated after reading each word. For computational reasons, we will
process data in mini-batches of size batch_size. In this example, it is
important to note that current_batch_of_words does not correspond to a
"sentence" of words. Every word in a batch corresponds to a time step t; that
is, the batch groups the words that occur at the same position across several
sequences. TensorFlow will automatically sum the gradients of each batch for
you.
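
As an illustration (a made-up toy example, not part of the tutorial data), two
five-word sequences arranged into a single batch look like this:

# Two example sentences processed together, so batch_size = 2 and num_steps = 5.
# (Illustrative only; in the real data the words are integer IDs.)
sentences = [["The", "brown", "fox", "is",     "quick"],
             ["The", "red",   "fox", "jumped", "high"]]

# Time step t groups the words at position t across the whole batch.
words_in_dataset = list(zip(*sentences))
# words_in_dataset[0] == ("The", "The")
# words_in_dataset[1] == ("brown", "red")
# words_in_dataset[3] == ("is", "jumped"), and so on.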

Truncated Backpropagation

By design, the output of a recurrent neural network (RNN) depends on arbitrarily
distant inputs. Unfortunately, this makes backpropagation computation difficult.
In order to make the learning process tractable, it is common practice to create
an "unrolled" version of the network, which contains a fixed number
(num_steps) of LSTM inputs and outputs. The model is then trained on this
finite approximation of the RNN. This can be implemented by feeding inputs of
length num_steps at a time and performing a backward pass after each
such input block.

Here is a simplified block of code for creating a graph which performs
truncated backpropagation:

# Placeholder for the inputs in a given iteration.
words = tf.placeholder(tf.int32, [batch_size, num_steps])

lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory, all zeros for the cell and hidden states.
initial_state = state = lstm.zero_state(batch_size, tf.float32)

for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    output, state = lstm(words[:, i], state)

    # The rest of the code.
    # ...

final_state = state
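
To run many such blocks over a dataset, the state produced by one block can be
fed back in as the initial state of the next. The loop below is only a sketch:
it assumes a session, a scalar loss op built from the LSTM outputs, and a
words_in_dataset iterator of [batch_size, num_steps] arrays, none of which are
defined above.

# Numpy copy of the LSTM state after the most recent block of words.
numpy_state = session.run(initial_state)
total_loss = 0.0
for current_batch_of_words in words_in_dataset:
    numpy_state, current_loss = session.run(
        [final_state, loss],
        # Initialize the LSTM state from the previous iteration.
        feed_dict={initial_state: numpy_state,
                   words: current_batch_of_words})
    total_loss += current_loss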

Inputs

The word IDs will be embedded into a dense representation (see the
Vector Representations Tutorial) before being fed to
the LSTM. This allows the model to efficiently represent the knowledge about
particular words. It is also easy to write; a minimal sketch (assuming an
embedding_matrix variable of shape [vocabulary_size, embedding_size] and an
integer word_ids tensor, neither of which is defined above) looks like this:
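
# embedding_matrix is a tensor of shape [vocabulary_size, embedding_size].
word_embeddings = tf.nn.embedding_lookup(embedding_matrix, word_ids)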

Stacking multiple LSTMs

To give the model more expressive power, we can add multiple layers of LSTMs
to process the data. The output of the first layer will become the input of
the second and so on.

We have a class called MultiRNNCell that makes the implementation seamless:

def lstm_cell():
    return tf.contrib.rnn.BasicLSTMCell(lstm_size)

stacked_lstm = tf.contrib.rnn.MultiRNNCell(
    [lstm_cell() for _ in range(number_of_layers)])

initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    output, state = stacked_lstm(words[:, i], state)

    # The rest of the code.
    # ...

final_state = state

Run the Code

Before running the code, download the PTB dataset, as discussed at the beginning
of this tutorial. Then, extract the PTB dataset underneath your home directory
as follows:
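
For example, assuming the downloaded archive is named simple-examples.tgz (the
actual file name may differ):

tar xvfz simple-examples.tgz -C $HOME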

There are 3 supported model configurations in the tutorial code: "small",
"medium" and "large". The difference between them is in size of the LSTMs and
the set of hyperparameters used for training.
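
To pick a configuration, pass it to the training script on the command line.
The script name and flag names below are assumptions for illustration and may
not match the tutorial code exactly:

# hypothetical script name and flags
python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small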

The larger the model, the better the results it should get. The small model should
be able to reach perplexity below 120 on the test set and the large one below
80, though it might take several hours to train.
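
Perplexity is the exponential of the average per-word negative log-likelihood,
so given the total cross-entropy loss accumulated over a dataset it can be
computed as in the sketch below (total_loss and total_words are assumed names):

import numpy as np

# Average negative log-likelihood per target word, exponentiated.
perplexity = np.exp(total_loss / total_words)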

What Next?

There are several tricks that we haven't mentioned that make the model better,
including: