An Actor-Critic Algorithm for Sequence Prediction

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio (2016)

Paper summary by dennybritz

TLDR; The authors propose using the actor-critic framework from reinforcement learning for sequence prediction. They train an actor (policy) network to generate a sequence together with a critic (value) network that estimates the Q-value function. Crucially, the actor network does not see the ground-truth output, but the critic does. This differs from log-likelihood (LL) training, where the model is trained on ground-truth prefixes but must condition on its own guesses at test time, so errors are likely to cascade. The authors evaluate their framework on an artificial spelling-correction task and a real-world German-English machine translation task, beating baselines and competing approaches in both cases.
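
To make the cascading-error point concrete, here is a minimal toy sketch. It is not from the paper: the lookup-table "model" `NEXT`, the target sequence, and all function names are made up for illustration. The model has a single wrong transition. When it is fed ground-truth prefixes it makes one mistake, but when it is fed its own guesses the same mistake derails every later step.

```python
# Toy next-token "model": a lookup table with one wrong entry (b -> x).
# Hypothetical values chosen only to illustrate error compounding.
NEXT = {"<s>": "a", "a": "b", "b": "x", "c": "d", "d": "e", "x": "x"}
TARGET = ["a", "b", "c", "d", "e"]


def decode(target, teacher_forcing):
    """Predict one token per target position.

    With teacher_forcing=True the next input is the ground-truth token
    (as in LL training); otherwise it is the model's own previous guess
    (as at test time).
    """
    prev, out = "<s>", []
    for gold in target:
        pred = NEXT[prev]
        out.append(pred)
        prev = gold if teacher_forcing else pred
    return out


def count_errors(pred, target):
    """Number of positions where the prediction differs from the target."""
    return sum(p != t for p, t in zip(pred, target))


teacher_forced = decode(TARGET, teacher_forcing=True)
free_running = decode(TARGET, teacher_forcing=False)
teacher_forced_errors = count_errors(teacher_forced, TARGET)
free_running_errors = count_errors(free_running, TARGET)
```

In this toy setup the teacher-forced pass hides the damage that the single bad transition causes once the model has to consume its own output.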
#### Key Points
- With LL training, the model sees ground-truth tokens during training but is conditioned on its own guesses during search at test time, leading to error compounding.
- The critic is allowed to see the ground truth, but the actor isn't.
- The reward is a task-specific score, e.g. BLEU
- Use bidirectional RNN for both actor and critic. Actor uses a soft attention mechanism.
- The reward is received partially at each intermediate step, not just at the end
- Framework is analogous to TD-Learning in RL
- Trick: Use an additional target network to compute q_t (see Deep-Q paper) for stability
- Trick: Use a delayed actor (as in Deep-Q paper) for stability
- Trick: Put a constraint on the critic to deal with large action spaces (is this analogous to advantage functions?)
- Pre-train actor and critic to encourage exploration of the right space
- Task 1: Correct a corrupted character sequence. AC outperforms LL training. Longer sequences lead to a stronger lift.
- Task 2: GER-ENG Machine Translation: Beats LL and REINFORCE models
- Qualitatively, critic assigns high values to words that make sense
- BLEU scores during training are lower than those of the LL model. Why? Strong regularization? Can't overfit the training data.
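
The per-step rewards, the TD-style critic target, and the target-network trick from the list above can be sketched in a few lines of plain Python. This is a rough illustration under my own assumptions, not the authors' implementation: `prefix_score` is a toy stand-in for BLEU (it just counts matching positions), the probability and Q-value tables are hand-written numbers, and the mixing rate `gamma` in `soft_update` is a hypothetical hyperparameter.

```python
def prefix_score(prefix, truth):
    """Toy stand-in for a task score such as BLEU: matching positions."""
    return sum(1.0 for p, t in zip(prefix, truth) if p == t)


def shaped_rewards(sampled, truth):
    """Per-step rewards r_t = score(prefix_t) - score(prefix_{t-1}).

    The rewards sum to the score of the full sequence, so the model gets
    partial credit at each step instead of one reward at the end.
    """
    rewards, prev = [], 0.0
    for t in range(1, len(sampled) + 1):
        cur = prefix_score(sampled[:t], truth)
        rewards.append(cur - prev)
        prev = cur
    return rewards


def td_targets(rewards, next_probs, next_q):
    """Critic targets q_t = r_t + sum_a p'(a) * Q'(a).

    next_probs[t] and next_q[t] map each candidate next action to the
    delayed actor's probability and the target critic's value for the
    step after t. The last step has no bootstrap term.
    """
    targets = []
    for t, r in enumerate(rewards):
        bootstrap = 0.0
        if t + 1 < len(rewards):
            bootstrap = sum(next_probs[t][a] * next_q[t][a] for a in next_probs[t])
        targets.append(r + bootstrap)
    return targets


def soft_update(target_params, params, gamma=0.01):
    """Move target (delayed) parameters a small step toward the live ones."""
    return [(1 - gamma) * tp + gamma * p for tp, p in zip(target_params, params)]


# Hand-made example: truth "cat", sampled "cut".
rewards = shaped_rewards(list("cut"), list("cat"))
next_probs = [{"a": 0.5, "u": 0.5}, {"t": 1.0}]
next_q = [{"a": 2.0, "u": 1.0}, {"t": 1.0}]
targets = td_targets(rewards, next_probs, next_q)
updated = soft_update([0.0, 1.0], [1.0, 1.0], gamma=0.5)
```

Taking the expectation over all next actions, rather than only the sampled one, is what makes the target use the whole critic output. The slow `soft_update` keeps the targets from chasing a rapidly changing network.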
#### Notes
- Why does the sequence length for spelling prediction only go up to 30? This seems very short to me and something that an LSTM should be able to handle quite easily. Would've liked to see much longer sequences.

First published: 2016/07/24

Abstract: We present an approach to training neural networks to generate sequences
using actor-critic methods from reinforcement learning (RL). Current
log-likelihood training methods are limited by the discrepancy between their
training and testing modes, as models must generate tokens conditioned on their
previous guesses rather than the ground-truth tokens. We address this problem
by introducing a *critic* network that is trained to predict the value
of an output token, given the policy of an *actor* network. This results
in a training procedure that is much closer to the test phase, and allows us to
directly optimize for a task-specific score such as BLEU. Crucially, since we
leverage these techniques in the supervised learning setting rather than the
traditional RL setting, we condition the critic network on the ground-truth
output. We show that our method leads to improved performance on both a
synthetic task, and for German-English machine translation. Our analysis paves
the way for such methods to be applied in natural language generation tasks,
such as machine translation, caption generation, and dialogue modelling.
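
As a rough sketch of what "predict the value of an output token, given the policy of an actor network" means in formulas, the block below writes out the critic target, the critic loss with a penalty on the spread of its outputs, and the actor gradient estimate. The notation is my own reconstruction and should be checked against the paper: $X$ is the input, $Y$ the ground-truth output, $\hat{Y}_{1 \ldots t}$ the sampled prefix, $p'$ and $Q'$ the delayed actor and target critic, and $\lambda$ an assumed penalty weight.

```latex
% Critic target: immediate reward plus expected value of the next step
q_t = r_t(\hat{y}_t; \hat{Y}_{1 \ldots t-1})
      + \sum_{a \in \mathcal{A}} p'(a \mid \hat{Y}_{1 \ldots t}, X)\,
        Q'(a; \hat{Y}_{1 \ldots t}, Y)

% Critic loss: squared TD error plus a penalty C_t on the variance of
% the critic's outputs over actions (the "constraint on the critic")
\mathcal{L}_{\text{critic}} = \sum_{t=1}^{T}
      \bigl( Q(\hat{y}_t; \hat{Y}_{1 \ldots t-1}, Y) - q_t \bigr)^2
      + \lambda\, C_t,
\qquad
C_t = \sum_{a \in \mathcal{A}} \Bigl( Q(a; \hat{Y}_{1 \ldots t-1}, Y)
      - \frac{1}{|\mathcal{A}|} \sum_{b \in \mathcal{A}}
        Q(b; \hat{Y}_{1 \ldots t-1}, Y) \Bigr)^2

% Actor gradient estimate: every action is weighted by the critic's value
\frac{dV}{d\theta} \approx \sum_{t=1}^{T} \sum_{a \in \mathcal{A}}
      \frac{d\, p(a \mid \hat{Y}_{1 \ldots t-1}, X)}{d\theta}\,
      Q(a; \hat{Y}_{1 \ldots t-1}, Y)
```

Note that only the critic terms depend on $Y$. The actor $p$ is conditioned on $X$ and its own prefix, which matches the test-time setting.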
