Paper summary by mrdrozdov

### General Approach
The Neural Tree Indexer (NTI) approach reaches 87.3\% test accuracy on SNLI. Here I'll attempt to clearly describe the steps involved, based on the publication [1] and the open-sourced codebase [2].
NTI is a method to apply attention over a tree, specifically applied to sentence pairs. There are three main steps, each giving an incrementally more expressive representation of the input. It's worth noting that the tree is a full binary tree, so sentence lengths are padded to a power of 2. In this case, the padded length used is $2^5 = 32$.
- **Sequence Encoding.** Run an RNN over the sentence to get a hidden state for each element.
$$h_t = f_1^{rnn}(i_t, h_{t-1})$$
- **Tree Encoding.** Using the hidden states from the previous step, use a variant of TreeLSTM to combine leaves until you have a single hidden state representing the entire sentence. Keep all of the intermediary hidden states for the next step.
$$ h_t^{tree} = f^{tree}(h_l^{tree},h_r^{tree})$$
- **Attention on Opposite Tree.** Until now we've only been describing how to encode a single sentence. When incorporating attention, we attend on the opposite tree by using the hidden states from the previous step. For instance, here is how we'd encode the premise (where the $p,h$ superscripts denote the premise or hypothesis, and $\vec{h}^{h,tree}$ denotes all of the hidden states of the non-attended hypothesis tree):
$$h_t^p = f_1^{rnn}(i_t^p, h_{t-1}^p) \\
h_t^{p,tree} = f^{tree}(h_l^{p,tree},h_r^{p,tree}) \\
i_t^{p,attn} = f^{attn}(h_t^{p,tree}, \vec{h}^{h,tree}) \\
h_t^{p,attn} = f_2^{rnn}(i_t^{p,attn}, h_{t-1}^{p,attn})
$$
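The three steps above can be sketched in numpy. This is a minimal, hypothetical simplification: `f_rnn` and `f_tree` here are toy `tanh` cells standing in for the paper's LSTM and S-LSTM node function, and `f_attn` is plain dot-product attention; the tree size is shrunk from $2^5$ to $2^3$ leaves for readability.

```python
import numpy as np

H = 4        # hidden size (the paper uses 300)
LEAVES = 8   # padded sentence length, a power of two (the paper uses 2^5 = 32)

rng = np.random.default_rng(0)

def f_rnn(x, h_prev, W):
    # Step 1: toy recurrent cell standing in for the paper's LSTM.
    return np.tanh(W @ np.concatenate([x, h_prev]))

def f_tree(h_l, h_r, W):
    # Step 2: toy node function standing in for S-LSTM.
    return np.tanh(W @ np.concatenate([h_l, h_r]))

def f_attn(h, memory):
    # Step 3: softmax-weighted sum over the opposite tree's node states.
    scores = memory @ h
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

def encode_tree(inputs, W_rnn, W_tree):
    # Step 1: sequence encoding over the leaves.
    h = np.zeros(H)
    leaves = []
    for x in inputs:
        h = f_rnn(x, h, W_rnn)
        leaves.append(h)
    # Step 2: combine adjacent pairs bottom-up, keeping every
    # intermediate node state for the attention step.
    level, all_nodes = leaves, list(leaves)
    while len(level) > 1:
        level = [f_tree(level[i], level[i + 1], W_tree)
                 for i in range(0, len(level), 2)]
        all_nodes.extend(level)
    return np.stack(all_nodes)  # root state is all_nodes[-1]

W_rnn = rng.standard_normal((H, 2 * H)) * 0.1
W_tree = rng.standard_normal((H, 2 * H)) * 0.1

premise = rng.standard_normal((LEAVES, H))
hypothesis = rng.standard_normal((LEAVES, H))

hyp_nodes = encode_tree(hypothesis, W_rnn, W_tree)  # attention memory
prem_nodes = encode_tree(premise, W_rnn, W_tree)
# Step 3, sketched for a single premise node (the root): attend over
# all 8 + 4 + 2 + 1 = 15 hypothesis tree nodes.
attended = f_attn(prem_nodes[-1], hyp_nodes)
```

With 8 leaves, `encode_tree` returns 15 node states; the real model does the same with 32 leaves (63 nodes) and feeds `attended` back through a second RNN, per the equations above.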
### Datasets
NTI was evaluated on three datasets. Some variant of the model achieved state-of-the-art in some category for each dataset:
- SNLI [3]: Sentence Pair Classification.
- WikiQA [4]: Answer Sentence Selection.
- Stanford Sentiment TreeBank (SST) [5]: Sentence Classification.
### Implementation Details
- Batch size is $32$ pairs (so $32$ of each premise and hypothesis).
- Tree is full binary tree with $2^5 = 32$ leaves.
- All sentences are padded left to length $32$, matching the full binary tree.
- Step 1 (sequence encoding) runs on all sentences simultaneously, as does Step 2 (tree encoding). Step 3 (attention) is done first on the premise, then on the hypothesis.
- The variant of TreeLSTM used is S-LSTM. It's available as a standard function in Chainer.
- Dropout is applied in each step, with the keep rate fixed at $80\%$.
- The classifier MLP has $1$ hidden layer of dimension $1024$; the full MLP dimensions are $(2 \times H) \times 1024 \times 3$, where $H = 300$ is the size of the hidden states.
- Uses Chainer's Adam optimizer with $\alpha=0.0003$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$; gradient clipping at an L2 norm of $40$; and weight decay with rate $0.00003$.
- Weights are initialized uniformly random between $-0.1$ and $0.1$.
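The classifier head's shapes can be made concrete with a small sketch. This is a hypothetical illustration of the dimensions listed above, not the paper's code: the premise and hypothesis representations ($H = 300$ each) are concatenated and passed through one hidden layer of $1024$ units to the $3$ SNLI classes, with weights drawn uniformly from $[-0.1, 0.1]$; the ReLU nonlinearity is my assumption.

```python
import numpy as np

H = 300
rng = np.random.default_rng(0)

# Weights initialized uniformly at random in [-0.1, 0.1], as in the paper.
W1 = rng.uniform(-0.1, 0.1, size=(2 * H, 1024))
W2 = rng.uniform(-0.1, 0.1, size=(1024, 3))

premise_vec = rng.standard_normal(H)     # final premise representation
hypothesis_vec = rng.standard_normal(H)  # final hypothesis representation

x = np.concatenate([premise_vec, hypothesis_vec])  # (600,)
hidden = np.maximum(0.0, x @ W1)                   # (1024,) hidden layer
logits = hidden @ W2                               # (3,) class scores
```

The three outputs correspond to SNLI's entailment, contradiction, and neutral labels.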
[1]: https://arxiv.org/abs/1607.04492
[2]: https://bitbucket.org/tsendeemts/nti/overview
[3]: https://nlp.stanford.edu/projects/snli/
[4]: https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/
[5]: http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrix-VectorSpaces

First published: 2016/07/15

Abstract: Neural networks with recurrent or recursive architecture have shown promising
results on various natural language processing (NLP) tasks. The recurrent and
recursive architectures have their own strength and limitations. The recurrent
networks process input text sequentially and model the conditional transition
between word tokens. In contrast, the recursive networks explicitly model the
compositionality and the recursive structure of natural language. Current
recursive architecture is based on syntactic tree, thus limiting its practical
applicability in different NLP applications. In this paper, we introduce a
class of tree structured model, Neural Tree Indexers (NTI) that provides a
middle ground between the sequential RNNs and the syntactic tree-based
recursive models. NTI constructs a full n-ary tree by processing the input text
with its node function in a bottom-up fashion. Attention mechanism can then be
applied to both structure and different forms of node function. We demonstrated
the effectiveness and the flexibility of a binary-tree model of NTI, showing
the model achieved the state-of-the-art performance on three different NLP
tasks: natural language inference, answer sentence selection, and sentence
classification.
