Learning to ExecuteLearning to ExecuteZaremba, Wojciech and Sutskever, Ilya2014

Paper summaryshagunsodhani## Problem Statement
* Evaluating if LSTMs can express and learn short, simple programs (linear time, constant memory) in the sequence-to-sequence framework.
* [Link to paper](http://arxiv.org/pdf/1410.4615v3.pdf)
## Approach
* Formulate program evaluation task as a sequence-to-sequence learning problem using RNNs.
* Train on short programs that can be evaluated in linear time and constant memory - RNN can perform only a single pass over the data and its memory is limited.
* Two parameters to control the difficulty of the program:
* `length` : Number of digits in the integer that appears in the program.
* `nesting` : Number of times operations can be combined with each other.
* LSTM reads the input program, one character at a time and produces output, one character at a time.
### Additional Learning Tasks
* **Addition Task** - Given two numbers, the model learns to add them. This task becomes the baseline for comparing performance on other tasks.
* **Memorization Task** - Give a random number, the model memorizes it and outputs it. Following techniques enhance the accuracy of the model:
* **Input reversing** - Reversing the order of input, while keeping the output fixed introduces many short-term dependencies that help LSTM in learning the process.
* **Input doubling** - Presenting the same input to the network twice enhances the performance as the model gets to look at the input twice.
## Curriculum learning
Gradually increase the difficulty of the program fed to the system.
* **No Curriculum (baseline)** - Fixed `length` and fixed `nesting` programs are fed to the system.
* **Naive Curriculum** - Start with `length` = 1 and `nesting` = 1 and keep increasing the values iteratively.
* **Mix Strategy** - Randomly choose `length` and `nesting` to generate a mix of easy and difficult examples.
* **Combined Strategy** - Each training example is obtained either by Naive curriculum strategy or mix strategy.
## Network Architecture
* 2 layers, unrolled for 50 steps.
* 400 cells per layer.
* Parameters initialized uniformly in [-0.08, 0.08]
* minibatch size 100
* norm of gradient normalized to be less than 5
* start with learning rate = 0.5, further decreased by 0.8 after reaching target accuracy of 95%
## Observations
Teacher forcing technique used for computing accuracy ie when predicting the $i_{th}$ digit, the correct first i-1 digits of the output are provided as input to the LSTM.
The general trend is (combine, mix) > (naive, baseline).
In certain cases for program evaluation, baseline performs better than naive curriculum strategy. Intuitively, the model would use all its memory to store patterns for a given size input. Now when a higher size input is provided, the model would have to restructure its memory patterns to learn the output for this new class of inputs. The process of memory restructuring may be causing the degraded performance of the naive strategy. The combined strategy combines the naive and mix strategy and hence reduces the need to restructure the memory patterns.
While LSTMs can learn to map the character level representation of simple programs to their correct output, the idea can not extend to arbitrary programs due to the runtime limitations of conventional RNNs and LSTM. Moreover, while learning is essential, the optimal curriculum strategy needs to be understood further.

## Problem Statement
* Evaluating if LSTMs can express and learn short, simple programs (linear time, constant memory) in the sequence-to-sequence framework.
* [Link to paper](http://arxiv.org/pdf/1410.4615v3.pdf)
## Approach
* Formulate program evaluation task as a sequence-to-sequence learning problem using RNNs.
* Train on short programs that can be evaluated in linear time and constant memory - RNN can perform only a single pass over the data and its memory is limited.
* Two parameters to control the difficulty of the program:
* `length` : Number of digits in the integer that appears in the program.
* `nesting` : Number of times operations can be combined with each other.
* LSTM reads the input program, one character at a time and produces output, one character at a time.
### Additional Learning Tasks
* **Addition Task** - Given two numbers, the model learns to add them. This task becomes the baseline for comparing performance on other tasks.
* **Memorization Task** - Give a random number, the model memorizes it and outputs it. Following techniques enhance the accuracy of the model:
* **Input reversing** - Reversing the order of input, while keeping the output fixed introduces many short-term dependencies that help LSTM in learning the process.
* **Input doubling** - Presenting the same input to the network twice enhances the performance as the model gets to look at the input twice.
## Curriculum learning
Gradually increase the difficulty of the program fed to the system.
* **No Curriculum (baseline)** - Fixed `length` and fixed `nesting` programs are fed to the system.
* **Naive Curriculum** - Start with `length` = 1 and `nesting` = 1 and keep increasing the values iteratively.
* **Mix Strategy** - Randomly choose `length` and `nesting` to generate a mix of easy and difficult examples.
* **Combined Strategy** - Each training example is obtained either by Naive curriculum strategy or mix strategy.
## Network Architecture
* 2 layers, unrolled for 50 steps.
* 400 cells per layer.
* Parameters initialized uniformly in [-0.08, 0.08]
* minibatch size 100
* norm of gradient normalized to be less than 5
* start with learning rate = 0.5, further decreased by 0.8 after reaching target accuracy of 95%
## Observations
Teacher forcing technique used for computing accuracy ie when predicting the $i_{th}$ digit, the correct first i-1 digits of the output are provided as input to the LSTM.
The general trend is (combine, mix) > (naive, baseline).
In certain cases for program evaluation, baseline performs better than naive curriculum strategy. Intuitively, the model would use all its memory to store patterns for a given size input. Now when a higher size input is provided, the model would have to restructure its memory patterns to learn the output for this new class of inputs. The process of memory restructuring may be causing the degraded performance of the naive strategy. The combined strategy combines the naive and mix strategy and hence reduces the need to restructure the memory patterns.
While LSTMs can learn to map the character level representation of simple programs to their correct output, the idea can not extend to arbitrary programs due to the runtime limitations of conventional RNNs and LSTM. Moreover, while learning is essential, the optimal curriculum strategy needs to be understood further.