Deep Learning Simplified

Quasi-Recurrent Neural Networks

TLDR; RNNs have limited parallelism because outputs depend on previous hidden states, so we have to feed in one time step at a time. CNNs, on the other hand, are highly parallelizable since the filters can be applied simultaneously across the sequence, but they lack any notion of memory/history. Quasi-RNNs strike a balance between RNNs and CNNs, offering better performance than baseline LSTMs while also providing increased parallelism.

Detailed Notes:

RNNs, including gated variants (LSTMs/GRUs), cannot be parallelized across time because each time step's output depends on the previous hidden state. Additionally, as sequences get longer, RNNs have trouble retaining pertinent information across long spans.

With CNNs, we can apply time-invariant filters in parallel across the entire input sequence. This also scales better to long sequences, since we can adjust the filter sizes. CNNs, however, have no concept of memory: they assume time invariance, so they cannot really make sense of large-scale sequence-order information.

A common solution is a hybrid model, such as the fully character-level NMT model, which used a CNN to process the character-level input (a long sequence) and then pooling to reduce the dimensionality while still preserving pertinent information; an RNN then took the reduced features as inputs. But that model still cannot be parallelized across time because it keeps the regular RNN structure.

Model:

This paper introduces the quasi-recurrent neural network (QRNN) for neural sequence modeling. QRNNs allow us to parallelize across both the time-step and minibatch dimensions.

When we apply our filters of width k to our input X to create the candidate vectors z, we do not want the filters to see future inputs when constructing z. In other words, each z_t depends only on x_{t-k+1} through x_t. This is known as masked convolution (van den Oord et al., 2016), and we implement it by simply padding the input on the left with k-1 zeros. Let's see what this looks like for a small input where k=2.
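As a minimal sketch of the left-padding trick in PyTorch (the sizes here are purely illustrative), the output at each time step then sees only current and past inputs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len, in_ch, hidden = 1, 5, 4, 8
k = 2  # filter width

x = torch.randn(batch, in_ch, seq_len)  # Conv1d expects (N, C, T)
conv = nn.Conv1d(in_ch, hidden, kernel_size=k)

# Masked convolution: pad k-1 zeros on the LEFT only, so the output
# at time t depends only on x_{t-k+1} .. x_t (no future leakage).
x_padded = F.pad(x, (k - 1, 0))
z = torch.tanh(conv(x_padded))
assert z.shape == (batch, hidden, seq_len)
```

Because the padding is one-sided, perturbing the last input position leaves every earlier output untouched, which is exactly the causality property we want.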

The candidate vectors z are passed through a tanh nonlinearity. We then use additional convolutional filters to obtain the sequences of vectors for the elementwise gates required by our pooling function (gates use elementwise sigmoid operations). If the pooling includes a forget gate and an output gate, the conv/pool operations can be summarized as follows.
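A sketch of these conv operations in the paper's notation, where W_z, W_f, W_o are the convolutional filter banks and ∗ denotes masked convolution over time:

Z = tanh(W_z ∗ X)
F = σ(W_f ∗ X)
O = σ(W_o ∗ X)

All three can be computed for the entire sequence at once, which is where the parallelism comes from.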

When k=2, the masked conv operations look very LSTM-like. When we use convolutional filters with larger width k, we are essentially computing higher-order n-gram features at each time step. This is especially important for character-level tasks, since we need to capture longer window spans.
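The pooling step itself is still sequential in time, but it is cheap elementwise work. A minimal fo-pooling sketch in PyTorch (assuming the gate sequences Z, F, O have already been produced by the masked convolutions; sizes are illustrative):

```python
import torch

def fo_pool(z, f, o):
    """Sequential fo-pooling: c_t = f_t*c_{t-1} + (1-f_t)*z_t, h_t = o_t*c_t.

    z, f, o: tensors of shape (T, batch, hidden) holding the candidate
    vectors and the forget/output gates from the masked convolutions.
    """
    c = torch.zeros_like(z[0])  # initial cell state c_0 = 0
    hs = []
    for t in range(z.shape[0]):
        c = f[t] * c + (1 - f[t]) * z[t]  # elementwise forget-gated update
        hs.append(o[t] * c)               # output gate exposes the cell state
    return torch.stack(hs)

# Illustrative sizes.
T, B, H = 6, 2, 4
Z = torch.tanh(torch.randn(T, B, H))
Fg = torch.sigmoid(torch.randn(T, B, H))
Og = torch.sigmoid(torch.randn(T, B, H))
H_out = fo_pool(Z, Fg, Og)
assert H_out.shape == (T, B, H)
```

Note there are no matrix multiplies inside the time loop, unlike an LSTM: all the heavy, learnable computation was already done in parallel by the convolutions.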

We can extend our pooling functions using different combinations of gates. The paper explores the three variants below:
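For reference, the three pooling variants from the paper (states initialized to zero; ⊙ is elementwise multiplication):

f-pooling:   h_t = f_t ⊙ h_{t-1} + (1 − f_t) ⊙ z_t
fo-pooling:  c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ z_t,  h_t = o_t ⊙ c_t
ifo-pooling: c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t,  h_t = o_t ⊙ c_t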

Regularization:

The paper extends zoneout (Krueger et al., 2016), which modifies the pooling function to keep the previous pooling state for a random subset of channels. This is equivalent to a modified forget gate. On top of this, ordinary dropout is applied between layers, including between the embedding and the first QRNN layer.
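Concretely, as in the paper, zoneout amounts to applying dropout to (1 − f), so that dropped channels get f = 1 and simply carry their previous state forward:

F = 1 − dropout(1 − σ(W_f ∗ X))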

The authors also add skip connections between every pair of QRNN layers, known as dense convolution. This substantially improves gradient flow and convergence, but keep in mind that the number of connections grows quadratically with the number of layers (on the order of L(L-1)).

Encoder-Decoder:

We use a QRNN encoder and a modified QRNN decoder (with attention). The decoder needs modification because feeding the last encoder hidden state (the output of the encoder's pooling layer) into the decoder's recurrent pooling layer would not allow the encoder state to affect any of the gates in the decoder's pooling layer. The proposed solution is to combine the final encoder hidden state with the outputs of each of the decoder's conv operations: the last encoder hidden state from the encoder's pool operations is added to the decoder's conv outputs, and both together determine the decoder's pool outputs.

So now the modified operations in the decoder look like this:
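A sketch in the paper's notation, where h̃_T^l is the final encoder hidden state of layer l and the V terms broadcast it across all decoder time steps:

Z^l = tanh(W_z^l ∗ X^l + V_z^l h̃_T^l)
F^l = σ(W_f^l ∗ X^l + V_f^l h̃_T^l)
O^l = σ(W_o^l ∗ X^l + V_o^l h̃_T^l)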

We also use soft attention, where the attentional sum is computed over the encoder's last layer's hidden states. These are compared with the decoder's last layer's un-gated hidden states via dot products and then normalized with a softmax.
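Sketching this in the paper's notation (L is the last layer, c_t^L the decoder's un-gated cell states, h̃_s^L the encoder's hidden states):

α_st = softmax over s of (c_t^L · h̃_s^L)
k_t = Σ_s α_st h̃_s^L
h_t^L = o_t ⊙ (W_k k_t + W_c c_t^L)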

Training Points:

QRNNs outperformed LSTMs of the same hidden size on language modeling, sentiment classification, and character-level machine translation.

The parallelism allowed for up to a 16x speedup in train and test times.

Unique Points:

QRNNs are used in Baidu's Deep Voice, a real-time neural TTS system built for production.