A peephole LSTM block with input, output, and forget gates. The exit arrows from the c_t node actually denote exit arrows from c_{t-1}, except for the single right-to-left arrow. There are many other kinds of LSTMs as well.[1]

Among other successes, LSTM achieved record results in natural language text compression[4] and unsegmented connected handwriting recognition,[5] and won the ICDAR handwriting competition (2009). LSTM networks were a major component of a network that achieved a record 17.7% phoneme error rate on the classic TIMIT natural speech dataset (2013).[6]

In the equations below, each variable in lowercase italics represents a vector whose length equals the number of LSTM units in the block.

LSTM blocks contain three or four gates that control the flow of information. Each gate uses the logistic function to compute a value between 0 and 1, and the information flowing into or out of the memory is multiplied element-wise by that value, so it is partially allowed through or partially blocked. For example, an "input" gate controls the extent to which a new value flows into the memory, a "forget" gate controls the extent to which a value remains in memory, and an "output" gate controls the extent to which the value in memory is used to compute the output activation of the block. In some implementations, the input and forget gates are merged into a single gate; the motivation for combining them is that the time to forget is when a new value worth remembering becomes available.
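As an illustration of these gates, the following is a minimal sketch in NumPy of one step of the common three-gate variant. The weight and bias names (W_i, U_i, b_i, and so on) are chosen for this example and are assumptions, not a fixed standard; exact formulations differ between LSTM variants.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a three-gate LSTM block.

    x_t    : input vector at time t
    h_prev : block output from the previous time step, h_{t-1}
    c_prev : memory (cell state) from the previous time step, c_{t-1}
    p      : dict of weight matrices W_*, U_* and bias vectors b_*
    """
    # Each gate is a logistic function of the current input and previous output.
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate

    # Candidate value that may be written into the memory.
    c_hat = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])

    # Element-wise multiplication by gate values partially allows or blocks
    # information entering (i_t) or remaining in (f_t) the memory.
    c_t = f_t * c_prev + i_t * c_hat

    # The output gate controls how much of the memory reaches the block output.
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Example with 4 input features and 3 LSTM units: every lowercase vector
# (gates, memory, output) then has length 3, matching the convention above.
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((3, 4)) for k in ("W_i", "W_f", "W_o", "W_c")}
p.update({k: rng.standard_normal((3, 3)) for k in ("U_i", "U_f", "U_o", "U_c")})
p.update({k: np.zeros(3) for k in ("b_i", "b_f", "b_o", "b_c")})
h_t, c_t = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), p)
```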

The weights in an LSTM block (W and U) direct the operation of the gates. These weights are applied at each gate to the values that feed into the block, including the input vector x_t and the output from the previous time step, h_{t-1}. Thus the LSTM block determines how to maintain its memory as a function of those values, and training its weights causes the block to learn the function that minimizes loss.
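Written out, one common formulation of these equations (matching the sketch above; the exact form is an assumption here, since it differs slightly between variants) is:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
```

Here \sigma is the logistic function, \circ denotes element-wise multiplication, the W matrices weight the input x_t, and the U matrices weight the previous output h_{t-1}.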

Peephole LSTM with forget gates.[18][19] Peephole connections allow the gates to access the constant error carousel (CEC), whose activation is the cell state.[20] h_{t-1} is not used; c_{t-1} is used instead in most places.
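In this peephole variant, the gates are driven by the cell state c_{t-1} rather than the previous output h_{t-1}. One common way of writing it (again an assumption about the exact formulation, which varies, for example in whether the output gate reads c_{t-1} or c_t) is:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f c_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i c_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o c_{t-1} + b_o) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + b_c) \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
```

Note that h_{t-1} no longer appears in the gate equations, and the candidate value written to the cell depends only on x_t.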

To minimize LSTM's total error on a set of training sequences, iterative gradient descent such as backpropagation through time can be used to change each weight in proportion to the derivative of the error with respect to that weight. A problem with using gradient descent for standard RNNs is that error gradients vanish exponentially quickly with the size of the time lag between important events. This is because lim_{n→∞} W^n = 0 if the spectral radius of W is smaller than 1.[22][23] With LSTM blocks, however, when error values are back-propagated from the output, the error remains in the block's memory. This "error carousel" continuously feeds error back to each of the gates until they learn to cut off the value. Thus, regular backpropagation is effective at training an LSTM block to remember values for long durations.
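The vanishing of W^n can be illustrated numerically: powers of a matrix whose spectral radius is below 1 shrink toward zero exponentially fast, which is the rate at which error signals fade in a standard RNN. The sketch below uses an arbitrary 8×8 matrix rescaled to spectral radius 0.9 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random recurrent weight matrix, rescaled so its spectral radius
# (largest eigenvalue magnitude) is 0.9 < 1.
W = rng.standard_normal((8, 8))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

# The norm of W^n decays roughly like 0.9^n, so an error signal propagated
# back through n time steps of a plain RNN fades exponentially with n.
P = np.eye(8)
for n in range(1, 101):
    P = P @ W
    if n % 20 == 0:
        print(f"n = {n:3d}   ||W^n|| = {np.linalg.norm(P):.3e}")
```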