Paper summary by hlarochelle

This paper presents an extensive evaluation of variants of LSTM networks. Specifically, the authors start from what they consider to be the vanilla architecture and, from it, also consider 8 variants, each of which corresponds to a small modification of the vanilla case. The vanilla architecture is the one described in Graves & Schmidhuber (2005) \cite{journals/nn/GravesS05}, and the variants consist of removing single parts of it (the input, forget or output gates, or the activation functions), coupling the input and forget gates (which is inspired by the GRU) or having full recurrence between all gates (which comes from the original LSTM formulation).
In their experimental setup, they consider 3 datasets: TIMIT (speech recognition), IAM Online Handwriting Database (character recognition) and JSB Chorales (polyphonic music modeling). For each dataset, they tune the hyper-parameters of each of the 9 architectures, using random search based on 200 samples. Then, they keep the 20 best hyper-parameter settings and use the statistics of those as a basis for comparing the architectures.
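The protocol above (200 random samples, keep the best 20, compare their statistics) can be sketched as follows. This is only a toy illustration: `sample_config`, `toy_error` and the sampling ranges are hypothetical stand-ins of mine, not the paper's actual search space, and `toy_error` replaces what would really be a full training run.

```python
import numpy as np

def sample_config(rng):
    """Draw one hypothetical hyper-parameter setting (ranges are illustrative)."""
    return {
        "learning_rate": 10 ** rng.uniform(-6, -2),  # log-uniform sample
        "hidden_size": int(rng.integers(20, 201)),
        "momentum": float(rng.uniform(0.0, 0.99)),
    }

def toy_error(config, rng):
    """Stand-in for training a network and returning its validation error."""
    lr_penalty = (np.log10(config["learning_rate"]) + 3.5) ** 2
    return float(lr_penalty + rng.normal(scale=0.1))

def random_search(n_trials=200, keep=20, seed=0):
    """Run n_trials random trials and return the `keep` lowest-error ones."""
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((toy_error(config, rng), config))
    trials.sort(key=lambda t: t[0])  # lowest error first
    return trials[:keep]

best = random_search()
errors = np.array([err for err, _ in best])
print(f"median error of the 20 best trials: {np.median(errors):.3f}")
```

Running this once per architecture and comparing the resulting `errors` arrays (e.g. with box plots) gives the kind of comparison the paper reports.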
#### My two cents
This was a very useful read. I'd make it required reading for anyone who wants to start using LSTMs. First, I found the initial historical description of the developments surrounding LSTMs very interesting and clarifying. But more importantly, it presents a really useful picture of LSTMs that can serve both as a good basis for starting to use LSTMs and as an insightful (backed with data) exposition of the importance of each part of the LSTM.
The analysis based on fANOVA (which I didn't know about until now) is quite neat. Perhaps the most surprising observation is that momentum doesn't actually seem to help that much. Investigating second-order interactions between hyper-parameters was a smart thing to do (showing that tuning the learning rate and hidden layer size jointly might not be that important, which is a useful insight). The illustrations in Figure 4, laying out the estimated relationship (with uncertainty) between learning rate / hidden layer size / input noise variance and performance / training time, are also full of useful information.
I won't repeat here the main observations of the paper, which are laid out clearly in the conclusion (section 6).
Additionally, my personal take-away point is that, in an LSTM implementation, it might still be useful to support removing peepholes or coupling the input and forget gates, since both variants yielded the ultimate best test set performance on at least one of the datasets (I'm assuming they were also best on the validation set, though this might not be the case...).
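As a minimal sketch of what supporting both options could look like, here is a single LSTM forward step in NumPy with two switches. The function names, the parameter layout (`W`, `R`, `b`, `p`) and the initialization are my own assumptions for illustration; the coupling rule used is f = 1 - i, and peepholes are treated as element-wise (diagonal) connections from the cell state to the gates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hidden, seed=0):
    """Small random parameters; keys z/i/f/o = block input and the three gates."""
    rng = np.random.default_rng(seed)
    return {
        "W": {k: rng.normal(scale=0.1, size=(n_hidden, n_in)) for k in "zifo"},
        "R": {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for k in "zifo"},
        "b": {k: np.zeros(n_hidden) for k in "zifo"},
        "p": {k: rng.normal(scale=0.1, size=n_hidden) for k in "ifo"},  # peepholes
    }

def lstm_step(x, h_prev, c_prev, params, use_peepholes=True, coupled=False):
    """One forward step; returns the new hidden state h and cell state c."""
    W, R, b, p = params["W"], params["R"], params["b"], params["p"]
    pre = {k: W[k] @ x + R[k] @ h_prev + b[k] for k in "zifo"}

    def peep(k, c):
        # Element-wise peephole contribution, or nothing if peepholes are off.
        return p[k] * c if use_peepholes else 0.0

    z = np.tanh(pre["z"])                       # block input
    i = sigmoid(pre["i"] + peep("i", c_prev))   # input gate
    if coupled:
        f = 1.0 - i                             # forget gate tied to input gate
    else:
        f = sigmoid(pre["f"] + peep("f", c_prev))
    c = z * i + c_prev * f                      # new cell state
    o = sigmoid(pre["o"] + peep("o", c))        # output gate peeks at new cell
    h = np.tanh(c) * o
    return h, c

params = init_params(n_in=3, n_hidden=4)
x, h0, c0 = np.ones(3), np.zeros(4), np.ones(4)
h_vanilla, c_vanilla = lstm_step(x, h0, c0, params)
h_variant, c_variant = lstm_step(x, h0, c0, params, use_peepholes=False, coupled=True)
```

Keeping both switches as plain arguments makes it cheap to include the two variants in a hyper-parameter search alongside the vanilla architecture.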
The fANOVA analysis makes it clear that the learning rate is the most critical hyper-parameter to tune (it can be "make or break"). That said, this is already well known. And the fact that it explains so much of the variance might reflect a bias of the analysis towards a situation where the learning rate isn't tuned as well as it could be in practice (this is, after all, THE hyper-parameter that neural net researchers spend the most time tuning). So, as future work, this suggests perhaps doing another round of the same analysis (which is otherwise really neatly set up), where more effort is always put into tuning the learning rate, individually for each setting of the other hyper-parameters. In other words, we'd try to ignore the regions of hyper-parameter space that correspond to bad learning rates, in order to "marginalize out" the learning rate's effect. This would thus explore the perhaps more realistic setup that assumes one always tunes the learning rate as well as possible.
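One crude way to approximate this idea from existing random-search results is sketched below. It is not fANOVA and not something the paper does: for each bin of another hyper-parameter (here hidden layer size), it keeps only the lowest error, i.e. the trial whose learning rate happened to work best, and contrasts that with the plain mean over all learning rates. The data are synthetic and the function name `best_lr_profile` is hypothetical.

```python
import numpy as np

def best_lr_profile(other_values, errors, n_bins=5):
    """Per quantile bin of `other_values`, return the best and the mean error."""
    edges = np.quantile(other_values, np.linspace(0, 1, n_bins + 1))
    bins = np.searchsorted(edges, other_values, side="right") - 1
    bins = np.clip(bins, 0, n_bins - 1)  # put the maximum value in the last bin
    best = np.full(n_bins, np.nan)
    mean = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            best[b] = errors[mask].min()   # learning rate "tuned" within the bin
            mean[b] = errors[mask].mean()  # learning rate averaged over
    return edges, best, mean

# Synthetic trials: error depends strongly on learning rate, weakly on size.
rng = np.random.default_rng(0)
log_lr = rng.uniform(-6, -2, size=200)
hidden = rng.uniform(20, 200, size=200)
errors = (log_lr + 3.5) ** 2 + 50.0 / hidden + rng.normal(scale=0.05, size=200)

edges, best, mean = best_lr_profile(hidden, errors)
for lo, hi, b, m in zip(edges[:-1], edges[1:], best, mean):
    print(f"hidden size {lo:6.1f}-{hi:6.1f}: best {b:.2f}, mean {m:.2f}")
```

In the `best` column the effect of hidden size is no longer swamped by trials with poor learning rates, which is the spirit of the suggestion; a proper version would re-tune the learning rate for every sampled setting rather than rely on binning.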
Also, including a less aggressive gradient clipping in the hyper-parameter search would be interesting since, as the authors admit, clipping within [-1,1] might have been too much and could explain why it didn't help.
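To make the suggestion concrete, here is a small sketch that treats the clipping threshold as one more hyper-parameter to sample. It assumes element-wise clipping; the log-uniform range for the threshold is purely my own illustrative choice.

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Element-wise clipping of each gradient entry to [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def sample_clip_threshold(rng, low=1.0, high=100.0):
    """Hypothetical search range: log-uniform between `low` and `high`."""
    return float(10 ** rng.uniform(np.log10(low), np.log10(high)))

grad = np.array([-3.0, -0.5, 0.2, 2.5])
aggressive = clip_gradient(grad, 1.0)  # the [-1, 1] setting: large entries are cut
lenient = clip_gradient(grad, 5.0)     # wider range: this gradient is untouched

rng = np.random.default_rng(0)
threshold = sample_clip_threshold(rng)
sampled = clip_gradient(grad, threshold)
```

With a threshold of 1 the two large entries are flattened to the same magnitude as much smaller ones, which illustrates how such a tight range could discard useful gradient information.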
Otherwise, a really great and useful read!
