An LSTM Odyssey

January 26 2016

This week I read LSTM:// A Search Space Odyssey. It’s an excellent paper that systematically evaluates the different internal mechanisms of an LSTM (long short-term memory) block by disabling each mechanism in turn and comparing their performance. We’re going to implement each of the variants in TensorFlow and evaluate their performance on the Penn Tree Bank (PTB) dataset. This will obviously not be as thorough as the original paper but it allows us to see, and try out, the impact of each variant for ourselves.

Be sure to check out the full source for the rest of the cell definition. Mostly we create a new class inheriting from RNNCell and use the above code as the body of __call__. The nice part about this setup is that we can utilize MultiRNNCell to stack the LSTMs into multiple layers.

Notice that we initialize all of our parameters using get_variable. This is necessary so that we can reuse these variables for each time step rather than creating new parameters at each step. Also, all parameters are transposed from the paper’s definitions to avoid additional graph operations.

Then we define each equation as operations in the graph. Many of the operations have reversed inputs from the equations so that the matrix multiplications produce the correct dimensionality. Other than these details we’re directly translating the equations.

Note that from a performance perspective, this is a naïve implementation. If you look at the source for TensorFlow’s LSTMCell you’ll see that all of the cell inputs and states are concatenated together before doing any matrix multiplication. This is to improve performance, however, since we’re more interested in taking the LSTM apart, we’ll keep things simple.

Running this vanilla LSTM on the included notebook we obtain a test perplexity (e^cost) of less than 100. So far so good. This will serve as our baseline to compare to the other variants. Below is the cost (average negative log probability of the target words) on the validation set after each epoch:

<

figure>

Vanilla cost on the validation set

<

figure>

Variants

The most helpful bits for implementing each of the variants can be found in appendix A3 of the paper. The gate omission variants such as no input gate (NIG), no forget gate (NFG), and no output gate (NOG) simply set their respective gates to 1 (be sure to use floats, not integers, here):

The final variant, full gate recurrence (FGR), is the most complex, essentially allowing each gate’s previous state to interact with each gate’s next state:

<

figure>

Recurrent connections are added for each of the gates.

<

figure>

In many of the variants, we can remove parameters no longer needed to compute the cell. The FGR variant, however, adds significantly more parameters (9 additional square matrices) which also increases training time.

To implement each, we’ll simply duplicate our vanilla LSTM cell implementation and make the necessary modifications for the variant. There are too many to show here but you can view the full source for each variant on Github. To train each, we’ll use the same hyperparameters from the vanilla LSTM trial. This probably isn’t fair and a more thorough analysis (as performed in the paper) would try to find the best hyperparameters for each variant.

<

figure>

Training progress of model variants.

<

figure>

Results

The NFG and NOG variants fail to converge to anything useful while the NIAF variant diverges significantly after around the 8th epoch. (This divergence could probably be fixed with learning rate decay which I omitted for simplicity.)

<

figure>

Diverging variants

<

figure>

In contrast, the NIG, CIFG, NP and FGR variants all converge. The NIG and FGR variants do not produce great results while the NP and CIFG variants perform similarly to the vanilla LSTM.

<

figure>

Diverging variants

<

figure>

Finally the NOAF variant. Its poor performance is likely due to the lack of clamping from the output activation function so its cost explodes:

<

figure>

<

figure>

Here are the test perplexities for each variant:

<

figure>

<

figure>

Conclusion

Overall it’s been fun dissecting the LSTM. Feel free to try out the code yourself and if you’re interested in taking this further I recommend running comparisons with GRUs, looking at fANOVA or extending what’s here with more thorough analysis.