Self-Attention for Machine Translation

The high-level focus of this work is improving the quality of translations produced by neural machine translation (MT) systems. One problem current neural MT architectures face is long-distance dependencies between words, both within the source sentence and within the target sentence. The choice of a particular translation for a word can constrain later translation decisions.

The most common architectures for neural MT are variants of recurrent neural networks (RNNs), such as long short-term memory (LSTM) or gated recurrent unit (GRU) networks. Although in theory these models can ‘remember’ past decisions inside a fixed-size state vector, in practice they often fail to do so. In this proposal I describe one possible augmentation of the RNN architecture, self-attention, to tackle this memory problem.

The work by Bahdanau et al. showed that moving away from a single vector representation of the source sentence to one vector per word improves translation quality. In my first PhD project I would like to take this idea further. Current neural MT architectures (GRU or LSTM) compute each new internal state from the current input and the previous state. The architectures have gates trained such that important information is ‘written’ into the new state and unnecessary information is ‘forgotten’ from it. However, some information may be needed only to translate the beginning and the end of a sentence. If it is ‘forgotten’ in the middle, the translation quality of the end of the sentence suffers.
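
For reference, the gating described above can be written out for the GRU. This is the commonly used formulation; the notation is chosen here for illustration (x_t is the input, h_t the state, z_t the update gate, r_t the reset gate), and some presentations swap the roles of z_t and 1 - z_t.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Under this convention, a dimension where z_t is close to 1 is overwritten by the candidate state (the old value is ‘forgotten’), while a dimension where z_t is close to 0 is carried forward unchanged. Whatever is overwritten at step t is no longer available to any later step, which is the limitation this proposal targets.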

A simple alternative is to use self-attention. At every time step of the RNN, a weighted average of all previous states is used as an extra input to the function that computes the next state. With this self-attentive mechanism, the network can decide to attend to a state produced many time steps earlier, so the latest state does not need to store all the information. The mechanism also gives the gradient a more direct path to all previous states, which can help against the vanishing gradient problem. On the other hand, it is possible that it will make the network harder to train. Since states are no longer required to hold all information and gradient can flow more easily into the past, it would be interesting to see whether the gating mechanisms inside GRU/LSTM architectures are still necessary.
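
The forward pass of such a network can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a proposed implementation: the cell is an ungated tanh RNN, the attention scores are dot products between the latest state and each earlier state, and the names `attend`, `step`, `run`, `W_x`, `W_h` and `W_c` are hypothetical. The weights are random and untrained, so only the data flow is meaningful.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def attend(prev_states, query):
    """Return (context, weights): a weighted average of all previous states.

    Assumption of this sketch: scores are plain dot products with the query.
    """
    weights = softmax([dot(h, query) for h in prev_states])
    context = [
        sum(w * h[i] for w, h in zip(weights, prev_states))
        for i in range(len(query))
    ]
    return context, weights


def step(x, prev_states, W_x, W_h, W_c):
    """One step of an ungated tanh RNN with the attention context as extra input."""
    h_prev = prev_states[-1]
    # Assumption: the latest state serves as the attention query.
    context, weights = attend(prev_states, h_prev)
    pre = [
        a + b + c
        for a, b, c in zip(matvec(W_x, x), matvec(W_h, h_prev), matvec(W_c, context))
    ]
    return [math.tanh(p) for p in pre], weights


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def run(inputs, hidden_size, seed=0):
    """Run the sketch over a sequence; return all states and attention weights."""
    rng = random.Random(seed)
    W_x = random_matrix(hidden_size, len(inputs[0]), rng)
    W_h = random_matrix(hidden_size, hidden_size, rng)
    W_c = random_matrix(hidden_size, hidden_size, rng)
    states = [[0.0] * hidden_size]  # initial state h_0
    all_weights = []
    for x in inputs:
        h, weights = step(x, states, W_x, W_h, W_c)
        states.append(h)
        all_weights.append(weights)
    return states, all_weights


if __name__ == "__main__":
    states, weights = run([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], hidden_size=4)
    for t, w in enumerate(weights, start=1):
        print(f"step {t}: attention over {len(w)} earlier state(s)")
```

Because every earlier state stays in `prev_states`, information dropped from the latest state can still be reached through the context vector. The cell here has no gates at all, which corresponds to the open question of whether gating remains necessary once self-attention is available. The cost is that the attention at step t looks at t states, so the total work grows quadratically with sentence length.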