This paper explores the problem of question answering based on natural text. While this has been explored recently in the context of Memory Networks, the problems tackled so far have been synthetically generated. In this paper, the authors propose to extract more realistic question answering examples from news sites, by treating the main body of a news article as the content (the "facts") and deriving questions from the article's bullet-point summaries. Specifically, by detecting the entities in these bullet points and replacing them with a question placeholder (e.g. "Producer X will not press charges"), they are able to generate queries which, while not grammatically questions, do require performing a form of question answering. Thanks to this procedure, two large *supervised* datasets are created, with hundreds of thousands of questions, based on the CNN and Daily Mail news sites.
Then, the authors investigate neural-network-based systems for solving this task. They consider a fairly simple Deep LSTM network, which is fed the article's content first and then the query. They also consider two architectures that incorporate an attentional mechanism based on softmax weighting. The first ("Attentive Reader") attends to the document once (i.e. uses a single softmax weight vector), while the second ("Impatient Reader") re-attends after every word of the query (akin to the soft attention architecture in the "Show, Attend and Tell" paper).
These neural network architectures are also compared with simpler baselines, which are closer to what a more "classical" statistical NLP solution might look like.
Results on both datasets demonstrate that the neural network approaches have superior performance, with the attentional models being significantly better than the simpler Deep LSTM model.
#### My two cents
This is a welcome development in the research on reasoning models based on neural networks. I've always thought it was unfortunate that the best benchmarks available were based on synthetically generated cases. This work fixes that problem in a really clever way, while still being able to generate a large amount of training data. Particularly clever is the random permutation of entity markers when processing each case. Thanks to that, a system cannot simply use general word statistics to answer questions (e.g. just from the query "The hi-tech bra that helps you beat breast X" it's obvious that "cancer" is an excellent answer). In this setup, the system is forced to exploit the content of the article, thus ensuring that the benchmark is indeed measuring the system's question-answering abilities.
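A minimal sketch of that anonymization step (entity detection is assumed done upstream; the `@entityN` and `@placeholder` marker conventions follow the paper, everything else here is my own illustration):

```python
import random

def anonymize(document, query, entities, seed=None):
    """Replace entity strings with randomly permuted @entityN markers.

    `entities` is the list of entity strings detected in this example
    (entity detection itself is assumed done upstream; substring clashes
    between entity names are ignored in this sketch).  A fresh random
    permutation per example prevents a model from memorizing which
    marker tends to co-occur with which answer.
    """
    rng = random.Random(seed)
    ids = list(range(len(entities)))
    rng.shuffle(ids)
    mapping = {ent: "@entity%d" % i for ent, i in zip(entities, ids)}
    for ent, marker in mapping.items():
        document = document.replace(ent, marker)
        query = query.replace(ent, marker)
    return document, query, mapping

# Usage: the answer entity in the query has already become @placeholder.
doc = "Producer Smith will not press charges against Jones ."
qry = "Producer X will not press charges against @placeholder ."
d, q, m = anonymize(doc, qry, ["Smith", "Jones"], seed=0)
```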
Since the dataset itself is an important contribution of this paper, I hope the authors release it publicly in the near future.
The evaluation of the different neural architectures is also really thoroughly done. The non-neural baselines are reasonable and the comparison between the neural nets is itself interesting, bringing more evidence that the softmax weighted attentional mechanism (which has been gaining in popularity) indeed brings something over a regular LSTM approach.

This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), an efficient Bayesian learning method for larger datasets, which allows one to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus produces a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to a Monte Carlo estimate of the posterior predictive distribution of the model.
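To make the update concrete, here is a toy sketch of SGLD on a 1-D problem (the toy model, a Gaussian with known unit variance and a $N(0, 10)$ prior on its mean, and all the names are mine, not the paper's):

```python
import numpy as np

def sgld_sample(data, n_steps=2000, eps=1e-3, batch=10, seed=0):
    """Toy SGLD: posterior over the mean of a unit-variance Gaussian
    with a N(0, 10) prior.  Each step is half a step-size times the
    (minibatch-rescaled) log-posterior gradient, plus N(0, eps) noise;
    every iterate is then a (correlated) posterior sample."""
    rng = np.random.default_rng(seed)
    N = len(data)
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        x = rng.choice(data, size=batch)
        grad_prior = -theta / 10.0                   # d/dtheta log N(theta; 0, 10)
        grad_lik = (N / batch) * np.sum(x - theta)   # rescaled minibatch gradient
        noise = rng.normal(0.0, np.sqrt(eps))        # the Langevin injection
        theta = theta + 0.5 * eps * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return np.array(samples)
```

Averaging the predictions made with many of these samples is the Monte Carlo predictive estimate that the paper's distillation step then tries to compress into a single student network.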
The second idea is distillation, or dark knowledge, which in short is the idea of training a smaller model (the student) to replicate the behavior and performance of a much larger model (the teacher), essentially by training the student to match the outputs of the teacher.
The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, the authors propose to approximate the output of that ensemble by training a single student network to predict it. Ultimately, this is done by having the student predict the output of the teacher corresponding to the last parameter value sampled by SGLD.
Interestingly, this process can be operated in an online fashion, alternating between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model given the current teacher model). The end result is a student model whose outputs should be calibrated to the Bayesian predictive distribution.

This paper presents a variational approach to the maximisation of mutual information in the context of a reinforcement learning agent. Mutual information in this context can provide a learning signal to the agent that is "intrinsically motivated", because it relies solely on the agent's state/beliefs and does not require an explicit definition of rewards from the ("outside") user.
Specifically, the learning objective, for a current state s, is the mutual information between a sequence of K actions a proposed by an exploration distribution $w(a|s)$ and the final state s' of the agent after performing these actions. To understand the properties of this objective, it is useful to write this mutual information as a difference of conditional entropies:
$$I(a,s'|s) = H(a|s) - H(a|s',s)$$
where $I(\cdot,\cdot|\cdot)$ is the (conditional) mutual information and $H(\cdot|\cdot)$ is the (conditional) entropy. This objective thus asks that the agent find an exploration distribution that explores as much as possible (i.e. has high entropy $H(a|s)$) but whose actions have predictable consequences (i.e. lead to a predictable state s', so that $H(a|s',s)$ is low). One could thus think of the agent as trying to learn to control as much of the environment as possible; for this reason the objective has also been coined "empowerment".
The main contribution of this work is to show how to train with this objective at larger scale (i.e. larger state and action spaces) using neural networks. The authors build on a variational lower bound on the mutual information and derive from it a stochastic variational training algorithm. The procedure has 3 components: the exploration distribution $w(a|s)$, the environment $p(s'|s,a)$ (which can be thought of as an encoder, but isn't modeled and is only interacted with/sampled from) and the planning model $p(a|s',s)$ (which is modeled and can be thought of as a decoder). The main technical contribution is in how to update the exploration distribution (see section 4.2.2 for the technical details).
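For reference, the variational lower bound being built on is, I believe, of the standard form obtained by substituting the learned planning model for the exact (intractable) posterior over actions inside $H(a|s',s)$:

$$I(a,s'|s) \geq H(a|s) + \mathbb{E}_{w(a|s)\, p(s'|s,a)}\left[\log p(a|s',s)\right]$$

The gap is the expected KL divergence between the exact action posterior and the planning model, so the bound is tight when the planning model is perfect.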
This approach exploits neural networks of various forms. Neural autoregressive generative models are used for the exploration distribution as well as for the decoder/planning distribution. Interestingly, the framework also allows learning the state representation s as a function of some "raw" representation x of states. For raw states corresponding to images (e.g. the pixels of the screen in a game), CNNs are used.

This paper presents a linear algebraic trick for computing both the value and the gradient update of a loss function that compares a very high-dimensional sparse target with a (dense) output prediction. Most of the paper develops the specific case of the squared error loss, though the trick can also be applied to some other losses, such as the so-called spherical softmax. One use case would be training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive implementation (i.e. what most people currently do) scales in $O(Dd)$, where $D$ is the input dimensionality and $d$ the hidden layer dimensionality, the authors show that their trick allows scaling in $O(d^2)$.
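To see where the $D$-independence can come from, here is a sketch of just the loss computation (my own illustration, not the paper's algorithm: the paper's real contribution is keeping the factorization consistent through *gradient updates*, which is considerably more involved). Expanding $\|y - Wh\|^2 = h^\top(W^\top W)h - 2y^\top Wh + y^\top y$ and caching $Q = W^\top W$ gives a cost of $O(d^2 + Kd)$ for a target with $K$ nonzeros:

```python
import numpy as np

def sparse_sq_loss(W, Q, h, y_idx, y_val):
    """Squared error ||y - W h||^2 without forming the D-dim output.

    W: (D, d) output weights, Q = W.T @ W cached (d, d),
    (y_idx, y_val): sparse target with K nonzeros.
    Cost is O(d^2 + K d) instead of O(D d).  (Keeping Q consistent
    across updates to W is the hard part, handled in the paper;
    here Q is simply precomputed.)
    """
    quad = h @ Q @ h                  # ||W h||^2 in O(d^2)
    cross = y_val @ (W[y_idx] @ h)    # y^T W h, touches only K rows of W
    return quad - 2.0 * cross + y_val @ y_val
```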
Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU.
#### My two cents
This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect that getting rid of the dependence on $D$ in the complexity would actually be achievable, even for the "simpler" case of the squared error.
The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use the trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (SimLex-999) can be achieved this way, though using the hierarchical softmax actually achieves better performance in about the same time.
But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!

This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists of learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of the network, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and performing a model update with backprop.
The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added to the weights can instead be added directly to the pre-activation vectors (i.e. before the activation function) during forward propagation. This works because computing the pre-activation is a linear operation, so noise at that level is also Gaussian. The advantage is that, in the context of minibatch training, one can then efficiently add independent noise to the pre-activation vectors of each example in the minibatch. By the nature of the local reparameterization trick, this is equivalent to using one corrupted version of the weights per example in the minibatch, something that wouldn't otherwise be practical computationally. This is in fact why previous work on SGVB would normally use a single corrupted version of the weights for the whole minibatch.
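A minimal sketch of the trick for one fully-connected layer with a factorized Gaussian posterior (all names are mine): the pre-activation $XW$ is Gaussian with mean $XW_\mu$ and per-unit variance $X^2 \exp(W_{\log\sigma^2})$, so it can be sampled directly with fresh noise per minibatch row:

```python
import numpy as np

def local_reparam_forward(X, W_mu, W_logvar, rng):
    """Sample pre-activations directly instead of sampling weights.

    For a factorized Gaussian posterior over weights, B = X @ W is
    Gaussian with mean X @ W_mu and variance X**2 @ exp(W_logvar), so
    each minibatch row gets its own independent noise at only
    O(batch * d_out) extra cost, rather than one weight sample being
    shared by the whole batch.
    """
    mean = X @ W_mu
    var = (X ** 2) @ np.exp(W_logvar)
    eps = rng.normal(size=mean.shape)
    return mean + np.sqrt(var) * eps
```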
The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence.
Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise instead of 0-1 masking noise) can be seen as the local-reparameterization-trick version of an SGVB objective, with a specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is a hyper-parameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate for each layer, or even for each parameter of the model.
Experiments on MNIST confirm that tuning this parameter works and allows reaching good performance for various network sizes, compared to using a default dropout rate.
#### My two cents
This is another thought-provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have made it possible to draw a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that this connection yielded a method for tuning the dropout rate automatically.
I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities.
Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done.

This paper considers the problem of structured output prediction, in the specific case where the output is a sequence and we represent the sequence as a (conditional) directed graphical model that generates from the first token to the last. The paper starts from the observation that training such models by maximum likelihood (ML) does not reflect well how the model is actually used at test time. Indeed, ML training implies that the model is effectively trained to predict each token conditioned on the previous tokens *from the ground truth* sequence (this is known as "teacher forcing"). Yet, when making a prediction for a new input, the model will actually generate a sequence by generating tokens one after another and conditioning on *its own predicted tokens* instead.
So the authors propose a different training procedure, where at training time each *conditioning* ground-truth token is sometimes replaced by the model's previous prediction. The choice of replacing the ground truth by the model's prediction is made by "flipping a coin" with some probability, independently for each token. Importantly, the authors propose to start with a high probability of using the ground truth (i.e. start close to ML) and to anneal that probability towards 0 according to some schedule (thus the name Scheduled Sampling).
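The sampling decision and one of the paper's decay schedules (the inverse sigmoid decay $k/(k + \exp(i/k))$) can be sketched as follows (function and variable names are mine):

```python
import math
import random

def scheduled_inputs(ground_truth, model_predictions, p_truth, rng):
    """Build the conditioning sequence for one training example.

    At each position, the previous *conditioning* token is the
    ground-truth token with probability p_truth, else the model's own
    prediction.  p_truth starts near 1 (pure teacher forcing) and is
    annealed toward 0 over training.
    """
    inputs = []
    for gt, pred in zip(ground_truth, model_predictions):
        inputs.append(gt if rng.random() < p_truth else pred)
    return inputs

def inverse_sigmoid_decay(step, k=100.0):
    """One schedule from the paper's family: p_truth = k / (k + exp(step / k))."""
    return k / (k + math.exp(step / k))
```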
Experiments on 3 tasks (image caption generation, constituency parsing and speech recognition) based on neural networks with LSTM units, demonstrate that this approach indeed improves over ML training in terms of the various performance metrics appropriate for each problem, and yields better sequence prediction models.
#### My two cents
Big fan of this paper. It both identifies an important flaw in how sequential prediction models are currently trained and, most importantly, suggests a simple yet effective solution. I also believe that this approach played a non-negligible role in Google's winning system for image caption generation in the Microsoft COCO competition.
My alternative interpretation of why Scheduled Sampling helps is that ML training does not inform the model about the relative quality of the errors it can make. In terms of ML, it is as bad to put high probability on an output sequence that has just one wrong token as it is to put the same amount of probability on a sequence whose tokens are all wrong. Yet, say for image caption generation, outputting a sentence that is one word away from the ground truth is clearly preferable to getting every word wrong (something that is also reflected in the performance metrics, such as BLEU).
By training the model to be robust to its own mistakes, Scheduled Sampling ensures that errors won't accumulate and makes predictions that are entirely off much less likely.
An alternative to Scheduled Sampling is DAgger (Dataset Aggregation: \cite{journals/jmlr/RossGB11}), which, briefly put, alternates between training the model and adding to the training set examples that mix model predictions and the ground truth. However, Scheduled Sampling has the advantage that there is no need to explicitly create and store an increasingly large dataset of sampled examples, a requirement that isn't appealing for online learning or learning on large datasets.
I'm also very curious and interested by one of the directions of future work mentioned in the conclusion: figuring out a way to backprop through the stochastic predictions made by the model. Indeed, as the authors point out, the current algorithm ignores the fact that, by sometimes taking its previous prediction as input, the model induces an additional relationship between its parameters and its ultimate prediction, a relationship that isn't taken into account during training. To take it into account, you'd need to somehow backpropagate through the stochastic process that generated the previous token prediction. While the work on variational autoencoders has shown that we can backprop through Gaussian samples, backpropagating through the sampling of a discrete multinomial distribution is essentially an open problem. I do believe there is work that tried to tackle backpropagating through stochastic binary units however, so perhaps that's a start. Anyways, if the authors could make progress on that specific issue, it could be quite useful not just in the context of Scheduled Sampling, but possibly for training networks with discrete stochastic units in general!

This paper describes a learning algorithm for deep neural networks that can be understood as an extension of stacked denoising autoencoders. In short, instead of reconstructing one layer at a time and greedily stacking, a unique unsupervised objective involving the reconstruction of all layers is optimized jointly by all parameters (with the relative importance of each layer cost controlled by hyper-parameters).
In more detail:
* The encoding (forward propagation) adds noise (Gaussian) at all layers, while decoding is noise-free.
* The target at each layer is the result of noise-less forward propagation.
* Direct connections (also known as skip-connections) between a layer and its decoded reconstruction are used. The resulting encoder/decoder architecture thus resembles a ladder (hence the name Ladder Networks).
* Miniature neural networks with a single hidden unit and skip-connections are used to combine the lateral (skip) signal and the top-down signal into a reconstruction. Each such network is applied element-wise (without parameter sharing across reconstructed units).
* The unsupervised objective is combined with a supervised objective, corresponding to the regular negative class log-likelihood objective (using an output softmax layer). Two losses are used for each input/target pair: one based on the noise-free forward propagation (which also provides the target of the denoising objective) and one with the noise added (which also corresponds to the encoding stage of the unsupervised autoencoder objective).
Batch normalization is used to train the network.
Since the model combines unsupervised and supervised learning, it can be used for semi-supervised learning, where unlabeled examples can be used to update the network using the unsupervised objective only. State of the art results in the semi-supervised setting are presented, for both the MNIST and CIFAR-10 datasets.
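Schematically, the combined objective can be sketched as follows (a simplification under my own naming; the actual model additionally involves batch normalization and the specific per-layer denoising functions):

```python
import numpy as np

def ladder_loss(logits_noisy, labels, z_clean, z_hat, lambdas):
    """Combined Ladder Network objective (schematic).

    logits_noisy: output of the *noisy* encoder pass (supervised cost),
    z_clean[l]:   layer-l activations of the clean pass (denoising targets),
    z_hat[l]:     decoder reconstructions of layer l,
    lambdas[l]:   per-layer weight on the denoising cost.
    For unlabeled examples, labels is None and only the denoising
    (unsupervised) term contributes, which is what enables the
    semi-supervised use of the model.
    """
    loss = 0.0
    if labels is not None:
        # negative class log-likelihood via a softmax over the noisy logits
        p = np.exp(logits_noisy - logits_noisy.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        loss += -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    for lam, zc, zh in zip(lambdas, z_clean, z_hat):
        loss += lam * np.mean((zc - zh) ** 2)
    return loss
```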
#### My two cents
What I find most exciting about this paper is its performance. On MNIST, with only 100 labeled examples, it achieves 1.13% error! That is essentially the performance of stacked denoising autoencoders trained on the entire training set (though that was before ReLUs and batch normalization, which this paper uses)! This confirms a current line of thought in Deep Learning (DL): while recent progress in DL applied to large labeled datasets does not rely on any unsupervised learning (unlike at the "beginning" of DL in the mid 2000s), unsupervised learning might instead be crucial for success in the low-labeled-data, semi-supervised setting.
Unfortunately, there is one little issue in the experiments, disclosed by the authors: while they used few labeled examples for training, model selection did use all 10k labels in the validation set. This is of course unrealistic. But model selection in the low data regime is arguably, in itself, an open problem. So I like to think that this doesn't invalidate the progress made in this paper, and only suggests that some research needs to be done on doing effective hyper-parameter search with a small validation set.
Generally, I really hope this paper will stimulate more research on DL methods to the specific case of small labeled dataset / large unlabeled dataset. While this isn't a problem that is as "flashy" as tasks such as the ImageNet Challenge which comes with lots of labeled data, I think this is a crucial research direction for AI in general. Indeed, it seems naive to me to expect that we will be able to collect large labeled dataset for each and every task, on our way to real AI.

**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network and replace the last layer with a regression layer which essentially predicts, pixel-wise, whether the object is there or not. A bounding-box inference algorithm is then added to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf)).
The paper introduces RPNs (Region Proposal Networks). They are trained end-to-end to generate region proposals. They simultaneously regress region bounds and objectness scores at each location on a regular grid.
RPNs are one type of fully convolutional network. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score.
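Shape-wise, the head of such a network can be sketched as follows (my own simplification: the real RPN slides a 3x3 conv over the feature map before two sibling 1x1 conv heads; here everything is written as per-location matrix multiplications):

```python
import numpy as np

def rpn_heads(feat, w_inter, w_cls, w_reg):
    """Schematic RPN heads on an (H, W, C) feature map with k anchors
    per location.

    w_inter: (C, C') shared intermediate layer, w_cls: (C', 2k) objectness
    head, w_reg: (C', 4k) box-regression head.  Returns objectness scores
    of shape (H, W, 2k) and box regressions of shape (H, W, 4k); because
    the computation is per-location, any H and W work.
    """
    inter = np.maximum(feat @ w_inter, 0.0)   # shared layer + ReLU
    return inter @ w_cls, inter @ w_reg
```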
## See also
* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)

TLDR; The authors propose "Highway Networks", which uses gates (inspired by LSTMs) to determine how much of a layer's activations to transform or just pass through. Highway Networks can be used with any kind of activation function, including recurrent and convnolutional units, and trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with state-of-the art for the MNIST and CIFAR tasks. Gates outputs vary significantly with the input examples, demonstrating that the network not just learns a "fixed structure", but dynamically routes data based for specific examples examples.
Datasets used: MNIST, CIFAR-10, CIFAR-100
#### Key Takeaways
- Apply LSTM-like gating to network layers: a transform gate T and a carry gate C.
- The gating forces the layer inputs/outputs to be of the same size. We can use additional plain layers for dimensionality transformations.
- Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc) to initially force the networks to pass through information and learn long-term dependencies.
- HWN does not learn a fixed structure (same gate outputs), but dynamic routing based on current input.
- On complex data sets each layer makes an important contribution, which is shown by lesioning (setting to pass-through) individual layers.
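A single highway layer, with the common tying C = 1 - T, can be sketched like this (the tanh choice for H is my assumption for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    H is the usual affine + nonlinearity (tanh here), T is the transform
    gate, and the carry gate is tied as C = 1 - T.  Initializing b_t to
    a negative value makes T near 0, so the layer starts close to the
    identity, which is what lets very deep stacks train.
    """
    h = np.tanh(x @ W_h + b_h)       # candidate transformation H(x)
    t = sigmoid(x @ W_t + b_t)       # transform gate in (0, 1)
    return t * h + (1.0 - t) * x
```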
#### Notes / Questions
- It seems the authors did not use dropout in their experiments. I wonder how the two play together. Is dropout less effective for highway networks because the gates already learn efficient paths?
- If we see that certain gate outputs have low variance across examples, can we "prune" the network into a fixed structure to make it more efficient (for production deployments)?

Deep rectified neural networks are over-parameterized in the sense that scaling of the weights in one layer can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule, whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). Path-SGD is thus principled, and shown to yield faster convergence for a standard 2-layer rectifier network across a variety of datasets (MNIST, CIFAR-10, CIFAR-100, SVHN). As the method implicitly regularizes the neural weights, this also translates to better generalization performance on half of the datasets.
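The over-parameterization in question is easy to demonstrate with a small self-contained check (my own illustration, not from the paper): since $\mathrm{relu}(cz) = c\,\mathrm{relu}(z)$ for any $c > 0$, scaling one layer up and the next layer down by the same factor leaves the network's function unchanged, even though vanilla SGD takes different steps in the two parameterizations.

```python
import numpy as np

def relu_net(x, W1, W2):
    """Two-layer rectifier network: relu(x W1) W2."""
    return np.maximum(x @ W1, 0.0) @ W2

# Scaling W1 by c > 0 and W2 by 1/c gives a different parameter vector
# but exactly the same function, which is the rescaling Path-SGD's
# update is designed to be invariant to.
```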
At its core, Path-SGD belongs to the family of learning algorithms which aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG) \cite{amari_natural_1998}, whose importance has resurfaced in the area of deep learning. Path-SGD can thus be cast as an approximation to NG which focuses on a particular type of rescaling between neighboring layers. The paper would greatly benefit from such a discussion, in my opinion. I also believe NG to be a much more direct way to motivate Path-SGD than the heuristics of max-norm regularization.

TLDR; The authors introduce a new spatial transformation module that can be inserted into any neural network. The module consists of a localization network that predicts transformation parameters, a grid generator that produces a sampling grid over the input, and a sampler that produces the output. Possible learned transformations include cropping, translation, rotation, scaling and attention. The module can be trained end-to-end using backpropagation. The authors evaluate the module on both CNNs and MLPs, achieving state-of-the-art results on distorted MNIST data, Street View House Numbers, and fine-grained bird classification.
#### Key Points:
- STMs can be inserted between any layers, typically after the input or extracted features. The transform is dynamic and happens based on the input data.
- The module is fast and doesn't adversely impact training speed.
- The actual transformation parameters (output of localization network) can be fed into higher layers.
- Attention can be seen as a special transformation that increases computational efficiency.
- Can also be applied to RNNs, but more investigation is needed.
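The grid generator and sampler can be sketched as follows (my own minimal version: the paper uses differentiable bilinear sampling, while this sketch uses nearest-neighbour to stay short):

```python
import numpy as np

def affine_grid(theta, H, W):
    """Sampling grid for a 2x3 affine matrix theta, in normalized
    [-1, 1] coordinates: maps each output pixel to a source location."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    return (theta @ coords).T.reshape(H, W, 2)                   # (H, W, (x, y))

def sample_nearest(img, grid):
    """Nearest-neighbour sampler.  img: (H, W), grid: (H, W, 2) in [-1, 1]."""
    H, W = img.shape
    xs = np.clip(np.round((grid[..., 0] + 1) * (W - 1) / 2), 0, W - 1).astype(int)
    ys = np.clip(np.round((grid[..., 1] + 1) * (H - 1) / 2), 0, H - 1).astype(int)
    return img[ys, xs]
```

With the identity transform `theta = [[1, 0, 0], [0, 1, 0]]` the sampler reproduces the input; a learned theta instead crops, translates, rotates or scales.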

TLDR; The authors propose a recurrent memory-based model that can reason over multiple hops and be trained end-to-end with standard gradient descent. The authors evaluate the model on QA and language modeling tasks. In the case of QA, the network inputs are a list of sentences, a query and (during training) an answer. The network then attends to the sentences at each time step, picking out the next piece of information relevant to the question. The network outperforms baseline approaches, but does not come close to a strongly supervised approach (where relevant sentences are pre-selected).
#### Key Takeaways
- Sentence Representation: 1. Word embeddings are averaged (BoW) 2. Positional Encoding (PE)
- Synthetic dataset with a vocabulary size of ~180. Version one has 1k training examples, version two has 10k.
- The model is similar to Bahdanau's seq2seq attention model, except that it operates on sentences, does not output at every step, and uses a simpler scoring function.
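One memory hop of this kind of model can be sketched as follows (my own naming; the sentence embeddings are assumed computed upstream, e.g. by the BoW or positional-encoding schemes above):

```python
import numpy as np

def memn2n_hop(u, mem_keys, mem_vals):
    """One memory hop: attend over sentence memories with the current
    query state u, then add the retrieved vector to u.

    mem_keys / mem_vals: (num_sentences, d) input and output memory
    embeddings of the sentences.  Stacking several hops (feeding the
    returned state back in) gives the multi-hop reasoning.
    """
    scores = mem_keys @ u
    p = np.exp(scores - scores.max())
    p /= p.sum()                     # softmax attention over sentences
    o = p @ mem_vals                 # weighted sum of output embeddings
    return u + o, p
```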
#### Questions / Notes
- The positional encoding formula is neither explained nor intuitive.
- There are so many hyperparameters and model variations (jittering, linear start) that it's easy to lose track of what's essential.
- No intuitive explanation of what the model does. The easiest way for me to understand this model was to look at it as a variation of Bahdanau's attention model, which is very intuitive. I don't understand the intuition behind the proposed weight constraints.
- The LM results are not convincing. The model beats the baselines by a little bit, but probably only due to very time-intensive hyperparameter optimization.
- What is the training complexity and training time?

TLDR; The authors show that we can pre-train RNNs using unlabeled data by either reconstructing the original sequence (SA-LSTM), or predicting the next token as in a language model (LM-LSTM). We can then fine-tune the weights on a supervised task. Pre-trained RNNs are more stable, generalize better, and achieve state-of-the-art results on various text classification tasks. The authors show that unlabeled data can compensate for a lack of labeled data.
#### Data Sets
Error Rates for SA-LSTM, previous best results in parens.
- IMDB: 7.24% (7.42%)
- Rotten Tomatoes 16.7% (18.5%) (using additional unlabeled data)
- 20 Newsgroups: 15.6% (17.1%)
- DBPedia character-level: 1.19% (1.74%)
#### Key Takeaways
- SA-LSTM: Predict sequence based on final hidden state
- LM-LSTM: Language-Model pretraining
- LSTM, 1024-dimensional cell, 512-dimensional embedding, 512-dimensional hidden affine layer + 50% dropout, Truncated backprop 400 steps. Clipped cell outputs and gradients. Word and input embedding dropout tuned on dev set.
- Linear Gain: Inject gradient at each step and linearly increase weights of prediction objectives
#### Notes / Questions
- It's not clear when/how linear gain yields improvements. On some data sets it significantly reduces performance, on others it significantly improves performance. Any explanations?
- Word dropout is used in the paper but not explained. I'm assuming it's replacing random words with `DROP` tokens?
- The authors mention a joint training model, but it's only evaluated on the IMDB data set. I'm assuming the authors didn't evaluate it further because it performed badly, but it would be nice to get an intuition for why it doesn't work, and show results for other data sets.
- All tasks are classification tasks. Does SA-LSTM also improve performance on seq2seq tasks?
- What is the training time? :) (I also wonder how the batching is done, are texts padded to the same length with mask?)

TLDR; The authors propose a new architecture called the "Pointer Network". A Pointer Network is a seq2seq architecture with an attention mechanism where the output vocabulary is the set of input indices. Since the output vocabulary varies with the input sequence length, a Pointer Network can generalize to variable-length inputs. The attention method through which this is achieved is O(n^2), and only a slight variation of the standard seq2seq attention mechanism. The authors evaluate the architecture on tasks where the outputs correspond to positions of the inputs: Convex Hull, Delaunay Triangulation and Traveling Salesman problems. The architecture performs well on these, and generalizes to sequences longer than those found in the training data.
#### Key Points
- Similar to standard attention, but don't blend the encoder states; use the attention vector directly.
- Softmax probabilities of outputs can be interpreted as a fuzzy pointer.
- We can solve the same problem artificially using seq2seq and outputting "coordinates", but that ignores the output constraints and would be less efficient.
- 512 unit LSTM, SGD with LR 1.0, batch size of 128, L2 gradient clipping of 2.0.
- In the case of TSP, the "student" networks outperforms the "teacher" algorithm.
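The pointer mechanism itself is tiny: score each input position with additive attention and softmax over positions instead of over a fixed vocabulary (names below are mine):

```python
import numpy as np

def pointer_distribution(enc_states, dec_state, W1, W2, v):
    """Pointer-style attention: the output distribution is over *input
    indices*, not a fixed vocabulary, so it adapts to any input length.

    enc_states: (n, d) encoder states, dec_state: (d,) decoder state,
    W1, W2: (d, d) projections, v: (d,) scoring vector.
    """
    u = np.tanh(enc_states @ W1 + dec_state @ W2) @ v   # (n,) scores
    e = np.exp(u - u.max())
    return e / e.sum()
```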
#### Notes/ Questions
- Seems like this architecture could be applied to generating spans (as in the newer "Text Processing From Bytes" paper), for POS tagging for example. That would require outputting classes in addition to input pointers. How?

TLDR; The authors apply an RNN to modeling a student's knowledge. The input is an exercise question and answer (correct/incorrect), either as one-hot vectors or embedded. The network then predicts whether or not the student can answer a future question correctly. The authors show that the RNN approach results in significant improvements over previous models, can be used for curriculum optimization, and also discovers latent structure in exercise concepts.
#### Key Points
- Two encodings tried: One hot, embedded
- RNN/LSTM, 200-dimensional hidden layer, output dropout, NLL.
- No expert annotation for concepts or question/answers are needed
- Blocking (series of exercises of same type) vs Mixing for curriculum optimization: Blocking seems to perform better
- Lots of cool future direction ideas
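The one-hot input encoding can be sketched as follows (a minimal sketch; which half of the vector encodes correct vs. incorrect answers is an arbitrary convention here):

```python
import numpy as np

def encode_interaction(question_id, correct, num_questions):
    """One-hot encoding of a (question, answer) interaction for the RNN.

    Following the one-hot variant described in the paper, the input has
    length 2 * num_questions: one block of indices for incorrect answers
    and one for correct answers.
    """
    x = np.zeros(2 * num_questions)
    x[question_id + (num_questions if correct else 0)] = 1.0
    return x
```

The RNN consumes a sequence of such vectors, one per answered exercise, and at each step outputs a probability of answering each future question correctly.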
#### Question / Notes
- Can we not only predict whether an exercise is answered correctly, but also what the most likely student answer would be? May give insight into confusing concepts.

TLDR; The authors evaluate the use of 9-layer deep CNNs on large-scale data sets for text classification, operating directly on one-hot encodings of characters. The architecture achieves competitive performance across datasets.
#### Key Points
- 9 layers: 6 conv/pool layers, 3 affine layers. 1024 features for the large model, 256 features for the small model.
- Authors optionally use English thesaurus for training data augmentation
- Fixed input length l: 1014 characters
- Simple n-gram models perform very well on these data sets, beating the other models on the smaller ones (<= 500k examples). The CNN wins on the larger data sets (>1M examples).
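A rough sketch of the character-level one-hot quantization; the alphabet below approximates the paper's ~70-character set (my transcription), and out-of-alphabet characters and padding positions map to all-zero columns:

```python
import numpy as np

# approximate alphabet; the paper's exact 70-character set may differ slightly
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, length=1014):
    """One-hot encode a document at character level, truncated or padded
    to a fixed length (1014 characters in the paper)."""
    x = np.zeros((len(ALPHABET), length))
    for i, c in enumerate(text.lower()[:length]):
        j = CHAR_INDEX.get(c)
        if j is not None:       # unknown chars stay as all-zero columns
            x[j, i] = 1.0
    return x
```

The resulting (alphabet_size, 1014) matrix is what the first convolutional layer consumes, so the model needs no word-level knowledge.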
#### Notes / Questions
- Comparing the CNN with input restricted to 1014 characters to models that operate on words seems unfair. Also, how long is the average document? Would've liked to see some dataset statistics. The fixed input length doesn't make a lot of sense to me.
- Contribution of this paper is that the architecture works without word knowledge and for any language, but at the same time the authors use a word-level English thesaurus to improve their performance? To be fair, the thesaurus doesn't seem to make a huge difference.
- The reason this architecture requires so much data is probably because it's very deep (How many parameters?). Did the authors experiment with fewer layers? Did they perform much worse?
- What about unsupervised pre-training? Can that reduce the amount of data required to achieve good performance. Currently this model doesn't seem very useful in practice as there are very few datasets of such size out there.

This paper addresses the problem of inverse reinforcement learning when the agent can change its objective during the recording of trajectories. This results in transitions between several reward functions, each of which explains the observed agent's trajectory only locally. Transition probabilities between reward functions are unknown. The authors propose a cascade of EM and Viterbi algorithms to discover the reward functions and the segments on which they are valid.
Their algorithm consists of maximizing the log-likelihood of the expert's demonstrated trajectories with respect to the parameters: the initial distributions of states and rewards, the local rewards, and the transition function between rewards. To do so, they use the expectation-maximisation (EM) method. Then, via the Viterbi algorithm, they partition the trajectories into segments with locally consistent rewards.
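The Viterbi segmentation step can be sketched generically as follows, assuming the EM step has already produced per-step log-likelihoods under each candidate reward function (interfaces and names here are illustrative, not the authors'):

```python
import numpy as np

def viterbi_segment(loglik, log_trans, log_init):
    """Most likely sequence of reward functions along a trajectory.

    loglik[t, k]: log-likelihood of the expert's step t under reward
    function k; log_trans[i, j] and log_init[k] are the learned
    transition/initial log-probabilities between reward functions.
    """
    T, K = loglik.shape
    delta = log_init + loglik[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # (K, K): prev -> current
        back[t] = scores.argmax(axis=0)         # best predecessor per state
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Runs of identical indices in the returned path are the segments on which a single reward function is locally consistent.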
Strengths of the paper:
1. The authors leverage existing and classical methods from the machine learning and optimization fields such as EM, Viterbi, value iteration and gradient ascent in order to build their algorithm. This will allow the community to easily reproduce their results.
2. The experiments are conducted on synthetic and real-world data. They compare their method to MLIRL, which does not use locally consistent rewards and which is the canonical choice to compare to, as their algorithm is a generalization of MLIRL. The results presented show the superiority of their method over MLIRL.
3. The idea presented by the authors is original as far as I know.
Weaknesses of the paper:
1. The paper is very dense (the figures are incorporated in the text), which makes the reading difficult.
2. The proposed algorithm needs knowledge of the dynamics and of the number of rewards. The authors plan, as future work, to extend their algorithm to an unknown number of rewards; however, they do not mention removing the requirement of knowing the dynamics. Could the authors comment on that, as some IRL algorithms do not need perfect knowledge of the dynamics?
3. The method needs to solve MDPs iteratively when learning the reward functions: for each theta in the gradient ascent, an MDP needs to be solved. Is this prohibitive for huge MDPs? Is there a way to avoid that step? The action-value function Q is defined via a softmax operator in order to have a differentiable policy; does this allow solving the MDP more efficiently?
4. The authors are using gradient ascent in the EM method, could they comment on the concavity of their criteria?
5. In the experiments (gridworlds), the number of features for the states is very small, and thus it is understandable that a reward which is linear in the features will perform badly. Do the authors consider comparing their method to an IRL method where the number of features defining the states is greater? This is the main problem that I have with the experiments: the features used are not expressive enough to consider using a classical IRL method, and this can explain why MLIRL performs badly and why its performance does not improve when the number of expert trajectories grows.
6. The performance is measured by the average log-likelihood of the expert's demonstrated trajectories which is the criterion maximized by the algorithm. I think that a more pertinent measure would be the value function of the policy produced by the optimization of the reward obtained by the algorithm. Could the authors comment on that and explain why their performance metric is more appropriate?

This paper extends the stochastic optimization algorithm SVRG proposed in recent years. The modifications mainly include: a convergence analysis of SVRG with a corrupted full gradient; mixing SGD and SVRG iterations; a mini-batch strategy; and the use of support vectors. For each modification, the authors give clear proofs and achieve linear convergence under smoothness and strong-convexity assumptions. However, the paper's novelty is limited: the improvement in convergence rate is not substantial, and the proof outline closely follows the original SVRG. Key problems, such as support for non-strongly convex losses, remain unsolved.
This paper starts with a key proposition showing that SVRG does not require a very accurate approximation of the full gradient of the objective function. The authors use this proposition to derive a batching SVRG algorithm with the same convergence rate as the original SVRG. Then, the authors propose a mixed stochastic gradient/SVRG approach and give a convergence proof for this scheme. As a different route to speeding up, they propose an acceleration technique for the Huberized hinge-loss support vector machine.
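For reference, a minimal sketch of the plain SVRG loop that the discussed variants modify (hyperparameters and the interface are illustrative):

```python
import numpy as np

def svrg(w0, grads, n, lr=0.1, outer=20, inner=None, rng=None):
    """Plain SVRG for an average of n smooth losses.

    grads(w, i) returns the gradient of the i-th loss at w.  The batching
    variant discussed above replaces the exact full gradient mu below
    with an approximation over a growing subset of the n losses.
    """
    rng = rng or np.random.default_rng(0)
    inner = inner or n
    w = w0.copy()
    for _ in range(outer):
        w_snap = w.copy()
        # full gradient at the snapshot point
        mu = np.mean([grads(w_snap, i) for i in range(n)], axis=0)
        for _ in range(inner):
            i = rng.integers(n)
            # variance-reduced stochastic gradient step
            w = w - lr * (grads(w, i) - grads(w_snap, i) + mu)
    return w
```

The variance-reduced step has the same expectation as a plain SGD step, but its variance vanishes as w approaches the snapshot, which is what yields linear convergence under strong convexity.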

The paper proposes a novel way to train a sparse autoencoder where the hidden unit sparsity is governed by a winner-take-all kind of selection scheme. This is a convincing way to achieve a sparse autoencoder, while the paper could have included some more details about their training strategy and the complexity of the algorithm.
The authors present a fully connected auto-encoder with a new sparsity constraint called the lifetime sparsity. For each hidden unit across the mini-batch, they rank the activation values, keeping only the top-k% for reconstruction. The approach is appealing because they don't need to find a hard threshold and it makes sure every hidden unit/filter is updated (no dead filters because their activation was below the threshold).
Their encoder is a deep stack of ReLU layers and the decoder is shallow and linear (note that usually non-symmetric auto-encoders lead to worse results). They also show how to apply the approach to RBMs. The effect of sparsity is very pronounced and clearly noticeable in the images depicting the filters.
They extend this auto-encoder in a convolutional/deconvolutional framework, making it possible to train on larger images than MNIST or TFD. They add a spatial sparsity, keeping the top activation per feature map for the reconstruction and combine it with the lifetime sparsity presented before.
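The lifetime sparsity rule can be sketched as follows (my reading of the mechanism; the exact tie-breaking and backpropagation details may differ from the paper):

```python
import numpy as np

def lifetime_sparsity(h, k_percent=5.0):
    """Winner-take-all lifetime sparsity: for each hidden unit, keep
    only its top-k% activations across the mini-batch and zero out the
    rest before reconstruction.

    h: (batch, hidden) activations.
    """
    batch = h.shape[0]
    k = max(1, int(round(batch * k_percent / 100.0)))
    # per-hidden-unit threshold = k-th largest activation in the batch
    thresh = np.sort(h, axis=0)[-k]              # (hidden,)
    return np.where(h >= thresh, h, 0.0)
```

Because the top-k selection is taken per hidden unit (not per sample), every filter wins on some fraction of the batch and keeps receiving gradient, which is what avoids dead filters without a hand-tuned hard threshold.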
The proposed approach builds on a mechanism close to that of the k-sparse autoencoders proposed by Makhzani et al. [14]. The authors extend the idea from [14] to build winner-take-all autoencoders (and RBMs) that enforce both spatial and lifetime regularization by keeping only the largest percentage of activations. The lifetime sparsity allows overcoming problems that could arise with k-sparse autoencoders. The authors next propose to embed their modeling framework in convolutional neural nets to deal with images larger than e.g. those of MNIST.

This paper addresses the task of image-based Q&A on 2 axes: comparison of different models on 2 datasets and creation of a new dataset based on existing captions.
The paper is addressing an important and interesting new topic which has seen a recent surge of interest (Malinowski2014, Malinowski2015, Antol2015, Gao2015, etc.). The paper is technically sound, well-written, and well-organized. They achieve good results on both datasets and the baselines are useful to understand important ablations. The new dataset is also much larger than previous work, allowing training of stronger models, esp. deep NN ones.
However, there are several weaknesses. Their main model is not very different from existing work on image-Q&A: Malinowski2015 also had a VIS+LSTM-style model (though they jointly trained the CNN and RNN, and also decoded with RNNs to produce longer answers) and achieves similar performance (except that adding bidirectionality and 2-way image input helps). Also, as the authors themselves discuss, the dataset in its current form, synthetically created from captions, is a good start but is quite conservative and limited, being single-word answers, with the transformation rules only designed for certain simple syntactic cases.
It is exploratory work and will benefit a lot from a bit more progress in terms of new models and a slightly broader dataset (at least with answers up to 2-3 words).
Regarding new models, e.g., attention-based models are very relevant and intuitive here (and the paper would be much more complete with this), since these models should learn to focus on the right area of the image to answer the given question and it would be very interesting to analyze the results of whether this focusing happens correctly.
Before attention models, since 2-way image input helped (actually, it would be good to ablate 2-way versus bidirectionality in the 2-VIS+BLSTM model), it would be good to also show the model version that feeds the image vector at every time step of the question.
Also, it would be useful to have a nearest neighbor baseline as in Devlin et al., 2015, given their discussion of COCO's properties. Here too, one could imagine copying answers of training questions, for cases where the captions are very similar.
Regarding a broader-scope dataset, the issue with the current approach is that it is too similar to the captioning task. A major motivation to move to image-Q&A is to move away from single, vague (non-specific), generic, one-event-focused captions towards a more complex and detailed understanding of, and reasoning over, the image; this doesn't happen with the paper's current dataset creation approach, and so it will also not encourage thinking of very different models to handle image-Q&A, since the best captioning models will continue to work well here. Also, having 2-3 word answers would capture more realistic and more diverse scenarios; and though it is true that evaluation is harder, one can start with existing metrics like BLEU, METEOR, CIDEr, and human evaluation. Since these will not be full sentences but just 2-3 word phrases, such existing metrics will be much more robust and stable.
Originality:
The task of image-Q&A is very recent with only a couple of prior and concurrent work, and the dataset creation procedure, despite its limitations (discussed above) is novel. The models are mostly not novel, being very similar to Malinowski2015, but the authors add bidirectionality and 2-way image input (but then Malinowski2015 was jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers).
Significance:
As discussed above, the paper shows useful results and ablations on the important, recent task of image-Q&A, based on 2 datasets -- an existing small dataset and a new large dataset; however, the second, new dataset is synthetically created by rule-transforming captions and only to single-word answers, thus keeping the impact of the dataset limited, because it keeps the task too similar to the generic captioning task and because there is no generation of answers or prediction of multi-word answers.

I think this paper has two main ideas in there, I see them as independent, for reasons explained below:
- A new penalty function that aims at regularising the second derivative of the trajectory the latent representation traces over time. I see this as a generalisation of the slowness principle or temporal constancy, more about this in the next section.
- A new autoencoder-like method to predict future frames in video. Video is really hard to forward-predict with non-probabilistic models because high level aspects of video are genuinely uncertain. For example, in a football game, you can't really predict whether the ball will hit the goalpost, but the results might look completely different visually. This, combined with L2 penalties often results in overly conservative, blurry predictions. The paper improves things by introducing extra hidden variables, that allow the model to represent uncertainty in its predictions. More on this later.
#### Inductive bias: penalising curvature
The key idea of this paper is to learn good distributed representations of natural images from video in an unsupervised way. Intuitively, there is a lot of information contained in video, which is lost if you scramble the video and look at statistics of individual frames only. The race is on to develop the right kind of prior and inductive bias that helps us fully exploit this temporal information. This paper presents one way, called learning to linearise (I'm going to call this L2L).
Naturally occurring images are thought to reside on some complex, nonlinear manifold whose intrinsic dimension is substantially lower than the number of pixels in an image. It is then natural to think about video as a journey on this manifold surface, along some smooth path. Therefore, if we aim to learn good generic features that correspond to coordinates on this underlying manifold, we should expect that these features vary in a smooth fashion over time as you play the video.
L2L uses this intuition to motivate their choice of an objective function that penalises a scale-invariant measure of curvature over time. In a way it tries to recover features that transform nearly linearly as time progresses and the video is played.
In their notations, $x_{t}$ denotes the data in frame $t$, which is transformed by a deep network to obtain the latent representation $z_{t}$. The penalty for the latent representation is as follows.
$$-\sum_{t} \frac{(z_t - z_{t-1})^{T}(z_{t+1} - z_{t})}{\|z_t - z_{t-1}\|\|z_{t+1} - z_{t}\|}$$
The expression above has a geometric meaning as the cosine of the angle between the vectors $(z_t - z_{t-1})$ and $(z_{t+1} - z_{t})$. The penalty is minimised if these two vectors are parallel and point in the same direction. In other words the penalty prefers when the latent feature representation keeps its momentum and continues along a linear path - and it does not like sharp turns or jumps. This seems like a sensible prior assumption to build on.
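The penalty can be computed directly from a latent trajectory; a minimal sketch (the small epsilon for numerical safety is my addition):

```python
import numpy as np

def curvature_penalty(z, eps=1e-8):
    """Negative sum of cosines between consecutive displacement vectors
    of the latent trajectory z (shape (T, d)).  Each term is minimised
    (at -1) when the trajectory continues along a straight line."""
    d = np.diff(z, axis=0)                      # (T-1, d) displacements
    a, b = d[:-1], d[1:]
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)
    return -cos.sum()
```

Note the scale invariance: rescaling z leaves every cosine, and hence the penalty, unchanged, while sharp turns or jumps push individual terms towards +1.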
L2L is very similar to another popular inductive bias used in slow feature analysis: the temporal slowness principle. According to this principle, the most relevant underlying features don't change very quickly. The slowness principle has a long history both in machine learning and as a model of human visual perception. In SFA one would minimise the following penalty on the latent representation:
$$\sum_{t} (z_t - z_{t-1})^{2},$$
where the square is applied component-wise. There are additional constraints in SFA, more about this later. We can understand the connection between SFA and this paper's penalty if we plot the penalty for a single hidden feature $z_{t,f}$ at time $t$, keeping all other features and values at neighbouring timesteps constant. This is plotted in the figure below (scaled and translated so the objectives line up nicely).
![](http://www.inference.vc/content/images/2015/09/-RfX4Dp2Y3YAAAAASUVORK5CYII-.png)
As you can see, both objectives have a minimum at the same location: they both try to force $z_{t,f}$ to linearly interpolate between the neighbouring timesteps. However, while SFA has a quadratic penalty, the learning to linearise objective tapers off at long distances. Compare this to Tukey's loss function used in outlier-resistant robust regression.
Based on this, my prediction is that compared to SFA, this loss function is more tolerant of outliers, which in the temporal domain would mean abrupt jumps in the latent representation. So while SFA is equivalent to assuming that the latent features follow a Brownian-motion-like Ornstein–Uhlenbeck process, I'd imagine this prior corresponds to something like a jump diffusion process (although I don't think the analogy holds mathematically).
Which one of these inductive biases/priors is better at exploiting temporal information in natural video: slow Brownian motion, or nearly-linear trajectories with potentially a few jumps? Unfortunately, don't expect any empirical answer to that from the paper. All experiments seem to be performed on artificially constructed examples, where the temporal information is synthetically engineered. Nor is there any real comparison to SFA.
#### Representing predictive uncertainty with auxiliary variables
While the encoder network learns to construct smoothly varying features $z_t$, the model also has a decoder network that tries to reconstruct $x_t$ and predict subsequent frames. This, the authors argue, is necessary in order for $z_t$ to contain enough relevant information about the frame $x_t$ (more about whether or not this is necessary later). The precise way this decoding is done contains a novel idea as well: minimising over auxiliary variables.
Let's say our task is to predict a future frame $x_{t+k}$ based on the latent representation $z_{t}$. This is a very hard problem. In video, just like in real life, anything can happen. Imagine you're modelling soccer footage, and the ball is about to hit the goalpost. In order to predict the next frames, not only do we have to know about natural image statistics, we also have to be able to predict whether the goal is in or not. An optimal predictive model would give a highly multimodal probability distribution as its answer. If you use the L2 loss with a deterministic feed-forward predictive network, it's likely to come up with a very blurry image, which would correspond to the average of this nasty multimodal distribution. This calls for something better: either a smarter objective function, or a better way of representing predictive uncertainty.
The solution the authors gave is to introduce hidden variables $\delta_{t}$, that the decoder network also receives as input in addition to $z_t$. For each frame, $\delta_t$ is optimised so that only the best possible reconstruction is taken into account in the loss function. Thus, the decoder network is allowed to use $\delta$ as a source of non-determinism to hedge its bets as to what the contents of the next frame will be. This is one step closer to the ideal setting where the decoder network is allowed to give a full probability distribution of possibilities and then is evaluated using a strictly proper scoring rule.
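A toy sketch of this inner-loop minimisation over $\delta$ (the paper's optimiser and interface will differ; I use a finite-difference gradient purely for illustration):

```python
import numpy as np

def fit_delta(decoder, z, x_target, dim_delta, steps=50, lr=0.1):
    """Inner-loop minimisation over the auxiliary variable delta:
    find the delta under which the decoder's prediction best matches
    the observed future frame, so that only the best attainable
    reconstruction enters the training loss."""
    delta = np.zeros(dim_delta)
    loss = lambda d: np.sum((decoder(z, d) - x_target) ** 2)
    for _ in range(steps):
        g = np.zeros(dim_delta)
        for j in range(dim_delta):           # finite-difference gradient
            e = np.zeros(dim_delta); e[j] = 1e-5
            g[j] = (loss(delta + e) - loss(delta - e)) / 2e-5
        delta -= lr * g
    return delta, loss(delta)
```

The effect is that the decoder is only penalised for the best prediction it could have made for some setting of $\delta$, letting it hedge across the multiple plausible futures rather than average them into a blur.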
This inner loop minimisation (of $\delta$) looks very tedious, and introduces a few more parameters that may be hard to set. The algorithm is reminiscent of the E-step in expectation-maximisation, and also very similar to the iterated closest point algorithm Andrew Fitzgibbon talked about in his tutorial at BMVC this year.
In his tutorial, Andrew gave examples where jointly optimising model parameters and auxiliary variables ($\delta$) is advantageous, and I think the same logic applies here. Instead of the inner loop, simultaneous optimisation helps fixing some pathologies, like slow convergence near the optimum. In addition, Andrew advocates exploiting the sparsity structure of the Hessian to implement efficient second-order gradient-based optimisation methods. These tricks are explained in paragraphs around equation 8 in (Prasad et al, 2010).
#### Predictive model: Is it necessary?
On a more fundamental level, I question whether the predictive decoder network is really a necessary addition to make L2L work.
The authors observe that the objective function is minimised by the "trivial" solutions $z_{t} = at + b$, where $a,b$ can be arbitrary constants. They then say that in order to make sure features do something more than just discover some of these trivial solutions, we also have to include a decoder network, that uses $z_t$ to predict future frames. I believe this is not necessary at all.
Because $z_t$ is a deterministic function of $x_t$, and $t$ is not accessible to $z_{t}$ in any other way than through inferring it from $x_t$, as long as $a\neq 0$, the linear solutions are not trivial at all. If the network discovers $z_{t} = at, a\neq 0$, you should in fact be very happy (assuming a single feature). The only problems with trivial solutions occur when $z_{t} = b$ ($z$ doesn't depend on the data at all) or when $z$ is multidimensional and several redundant features are sensitive to exactly the same thing.
These trivial solutions could be avoided the same way they are avoided in SFA, by constraining the overall spatial covariance of $z_{t}$ over the videoclip to be $I$. This would force each feature to vary at least a little bit with the data, hence avoiding the trivial constant solutions. It would also force features to be linearly decorrelated, solving the redundant features problem.
So I wonder if the decoder network is indeed a necessary addition to the model. I would love to encourage the authors to implement their new hypothesis of a prior both with and without the decoder. They may already have tried it without and found it really didn't work, so it might just be a matter of including those results. This would in turn allow us to see SFA and L2L side-by-side, and learn something about whether and why their prior is better than the slowness prior.

This post is a comment on the Laplacian pyramid-based generative model proposed by researchers from NYU/Facebook AI Research.
Let me start by saying that I really like this model, and I think - looking at the samples drawn - it represents a nice big step towards convincing generative models of natural images.
To summarise the model, the authors use the Laplacian pyramid representation of images, where you recursively decompose the image to a lower resolution subsampled component and the high-frequency residual. The reason this decomposition is favoured in image processing is the fact that the high-frequency residuals tend to be very sparse, so they are relatively easy to compress and encode.
In this paper the authors propose using convolutional neural networks at each layer of the Laplacian pyramid representation to generate an image sequentially, increasing the resolution at each step. The convnet at each layer is conditioned on the lower resolution image and some noise component $z_k$, and generates a random higher resolution image. The process continues recursively until the desired resolution is reached. For training they use the adversarial objective function. Below is the main figure that explains how the generative model works; I encourage everyone to have a look at the paper for more details:
![](http://www.inference.vc/content/images/2015/07/Screen-Shot-2015-07-23-at-11-15-17.png)
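The decomposition itself is easy to write down; here is a minimal sketch with block-average downsampling standing in for the usual blur-and-subsample operator:

```python
import numpy as np

def downsample(img):
    """2x subsampling by block averaging (a stand-in for the usual
    blur-and-subsample operator)."""
    return 0.25 * (img[::2, ::2] + img[1::2, ::2]
                   + img[::2, 1::2] + img[1::2, 1::2])

def upsample(img):
    """Nearest-neighbour 2x upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels=3):
    """Decompose an image into high-frequency residuals plus a
    low-resolution base; exactly invertible by construction."""
    residuals = []
    for _ in range(levels):
        low = downsample(img)
        residuals.append(img - upsample(low))   # sparse high-freq residual
        img = low
    return residuals, img

def reconstruct(residuals, base):
    for r in reversed(residuals):
        base = upsample(base) + r
    return base
```

The generative model runs `reconstruct` in reverse order, with each residual produced by a conditional convnet instead of being stored.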
#### An argument about Conditional Entropies
What I think is weird about the model is the precise amount of noise that is injected at each layer/resolution. In the schematic above, these are the $z_k$ variables. Adding the noise is crucial to defining a probabilistic generative process; this is how it defines a probability distribution.
I think it's useful to think about entropies of natural images at different resolutions. When doing generative modelling or unsupervised learning, we want to capture the distribution of data. One important aspect of a probability distribution is its entropy, which measures the variability of the random quantity. In this case, we want to describe the statistics of the full resolution observed natural image $I_0$. (I borrow the authors' notation where $I_0$ represents the highest resolution image, and $I_k$ represents the $k$-times subsampled version.) Using the Laplacian pyramid representation, we can decompose the entropy of an image in the following way:
$$\mathbb{H}[I_0] = \mathbb{H}[I_{K}] + \sum_{k=0}^{K-1} \mathbb{H}[I_k\vert I_{k+1}].$$
The reason why the above decomposition holds is very simple. Because $I_{k+1}$ is a deterministic function of $I_{k}$ (subsampling), the conditional entropy $\mathbb{H}[I_{k+1}\vert I_{k}] = 0$. Therefore the joint entropy of the two variables is simply the entropy of the higher resolution image $I_{k}$, that is $\mathbb{H}[I_{k},I_{k+1}] = \mathbb{H}[I_{k}] + \mathbb{H}[I_{k+1}\vert I_{k}] = \mathbb{H}[I_{k}]$. So by induction, the joint entropy of all images $I_{k}$ is just the marginal entropy of the highest resolution image $I_0$. Applying the chain rule for joint entropies we get the expression above.
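The decomposition can be checked on a toy example where a two-"pixel" image is "subsampled" to the sum of its pixels (a deterministic function, just like real subsampling):

```python
import numpy as np
from collections import Counter
from itertools import product

def H(samples):
    """Empirical entropy (bits) of a list of equally likely outcomes."""
    n = len(samples)
    return -sum((c / n) * np.log2(c / n) for c in Counter(samples).values())

def H_cond(xs, ys):
    """Conditional entropy H[X | Y] computed directly."""
    n = len(ys)
    return sum(
        (len([x for x, yy in zip(xs, ys) if yy == y]) / n)
        * H([x for x, yy in zip(xs, ys) if yy == y])
        for y in set(ys))

# I0: uniform over all binary 2-pixel images; I1 = deterministic "subsampling"
I0 = list(product([0, 1], repeat=2))
I1 = [a + b for a, b in I0]

# chain rule: H[I0] = H[I1] + H[I0 | I1], since H[I1 | I0] = 0
assert abs(H(I0) - (H(I1) + H_cond(I0, I1))) < 1e-12
assert abs(H_cond(I1, I0)) < 1e-12
```

Here $\mathbb{H}[I_0] = 2$ bits splits into $1.5$ bits for the subsampled image and $0.5$ bits of conditional entropy, exactly as the chain rule requires.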
Now, the interesting bit is how the conditional entropies $\mathbb{H}[I_k\vert I_{k+1}]$ are 'achieved' in the Laplacian pyramid generative model paper. These entropies are provided by the injected random noise variables $z_k$. By the information processing lemma $\mathbb{H}[I_k\vert I_{k+1}] \leq \mathbb{H}[z_k]$. The authors choose $z_k$ to be uniform random variables whose dimensionality grows with the resolution of $I_k$. To quote them "The noise input $z_k$ to $G_k$ is presented as a 4th color plane to low-pass $l_k$, hence its dimensionality varies with the pyramid level." Therefore $\mathbb{H}[z_k] \propto 4^{-k}$, assuming that the pixel count quadruples at each layer.
So the conditional entropy $\mathbb{H}[I_k\vert I_{k+1}]$ is allowed to grow exponentially with resolution, at the same rate it would grow if the images contained pure white noise. In their model, they allow the per-pixel conditional entropy $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ to be constant across resolutions. To me, this seems undesirable. My intuition is, for natural images, $\mathbb{H}[I_k\vert I_{k+1}]$ may grow as $k$ decreases (because the dimensionality grows), but the per-pixel value $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ should decrease or converge to $0$ as the resolution increases. Very low-resolution subsampled natural images behave a little bit like white noise: there is a lot of variability in them. But as you increase the resolution, the probability distribution of the high-res image given the low-res image becomes a lot sharper.
In terms of model capacity, this is not a problem, inasmuch as the convolutional models $G_{k}$ can choose to ignore some variance in $z_k$ and learn a more deterministic superresolution process. However, adding unnecessarily high entropy will almost certainly make the fitting of such a model harder. For example, the adversarial training process relies on sampling from $z_k$, and the procedure is pretty sensitive to sampling noise. If you make the distribution of $z_k$ unnecessarily high entropy, you will end up doing a lot of extra work during training until the network figures out to ignore the extra variance.
To solve this problem, I propose to keep the entropy of the noise vectors constant, or to make it grow sub-linearly with the number of pixels in the image. This perhaps makes the generative convnets harder to implement. Another quick solution would be to introduce dependence between components of $z_k$ via a low-rank covariance matrix, or some sort of hashing trick.
#### Adversarial training vs superresolution autoencoders
Another weird thing is that the adversarial objective function forgets the identity of the image. For example, you would want your model to behave so that
`"if at the previous layer you have a low-resolution parrot, the next layer should be a higher-resolution parrot"`
Instead, what you get with the adversarial objective is
`"if at the previous layer you have a low-resolution parrot, the next layer should output a higher-dimensional image that looks like a plausible natural image"`
So, there is nothing in the objective function that enforces dependency between subsequent layers of the pyramid. I think if you made $G_k$ very complex, it could just learn to model natural images by itself, so that $I_{k}$ is in essence independent of $I_{k+1}$ and is purely driven by the noise $z_{k}$. You could sidestep this problem by restricting the complexity of the generative nets, or, again, to restrict the entropy of the noise.
Overall, I think the approach would benefit from a combination of the adversarial and a supervised (superresolution autoencoder) objective function.

# Skip-Thought Vectors
## Introduction
* The paper describes an unsupervised approach to train a generic, distributed sentence encoder.
* It also describes a vocabulary expansion method to encode words not seen at training time.
* [Link to the paper](https://arxiv.org/abs/1506.06726)
## Skip-Thoughts
* Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
* The model is called **skip-thoughts** and the encoded vectors are called **skip-thought vectors.**
* Similar to the [skip-gram](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model in the sense that surrounding sentences are used to learn sentence vectors.
### Architecture
* Training data is in form of sentence tuples (previous sentence, current sentence, next sentence).
* **Encoder**
* RNN Encoder with GRU.
* **Decoder**
* RNN Decoder with conditional GRU.
* Conditioned on encoder output.
* Extra matrices introduced to bias the update gate, reset gate and hidden state, given the encoder output.
* **Vocabulary matrix (V)** - Weight matrix having one row (vector) for each word in the vocabulary.
* Separate decoders for the previous and next sentence which share only **V**.
* Given the decoder context **h** (at any time), the encoder output, and the list of words already generated for the output sentence, the probability of choosing *w* as the next word is proportional to exp(**V**(*w*) · **h**).
* **Objective**
* Sum of the log-probabilities for the forward and backwards sentences conditioned on the encoder output.
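A minimal sketch of one conditional-GRU decoder step, with the extra encoder-conditioned bias terms on each gate (the matrix names in the parameter dict are mine, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cond_gru_step(x, h, henc, P):
    """One step of the decoder's conditional GRU: a plain GRU whose
    update gate, reset gate and candidate hidden state each receive an
    extra term computed from the encoder output henc."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h + P["Cz"] @ henc)   # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h + P["Cr"] @ henc)   # reset gate
    hbar = np.tanh(P["W"] @ x + P["U"] @ (r * h) + P["C"] @ henc)
    return (1 - z) * h + z * hbar
```

Two such decoders (previous-sentence and next-sentence) run in parallel, sharing only the vocabulary matrix **V**.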
## Vocabulary Expansion
* Use a model like Word2Vec which can be trained to induce word representations and train it to obtain embeddings for all the words that are likely to be seen by the encoder.
* Learn a matrix **W** such that *encoder(word) = W · Word2Vec(word)* (a matrix-vector product) for all words common to both the Word2Vec model and the encoder model.
* Use **W** to generate embeddings for words that are not seen during encoder training.
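This linear mapping can be fit by least squares over the shared vocabulary; a sketch (function names are illustrative):

```python
import numpy as np

def fit_expansion(w2v_shared, enc_shared):
    """Least-squares fit of the matrix W mapping Word2Vec vectors into
    the encoder's embedding space, using the shared vocabulary.

    w2v_shared: (n_words, d_w2v) Word2Vec vectors of shared words.
    enc_shared: (n_words, d_enc) encoder embeddings of the same words.
    """
    # solve enc_shared ≈ w2v_shared @ W.T in the least-squares sense
    W, *_ = np.linalg.lstsq(w2v_shared, enc_shared, rcond=None)
    return W.T                                  # (d_enc, d_w2v)

def expand(word_vec, W):
    """Embedding for a word unseen during encoder training."""
    return W @ word_vec
```

Any word in the Word2Vec vocabulary can then be mapped into the encoder's space, even if it never appeared in the BookCorpus training text.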
## Dataset
* [BookCorpus dataset](https://arxiv.org/abs/1506.06724) having books across 16 genres.
## Training
* **uni-skip**
* Unidirectional encoder with 2400 dimensions.
* **bi-skip**
* Bidirectional model with forward (sentence given in correct order) and backward (sentence given in reverse order) encoders of 1200 dimensions each.
* **combine-skip**
* concatenation of uni-skip and bi-skip vectors.
* Initialization
* Recurrent matrices - orthogonal initialization.
* Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
* Mini-batches of size 128.
* Gradient Clipping at norm = 10.
* Adam optimizer.
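Gradient clipping at norm 10 can be sketched as rescaling by the global L2 norm (a minimal illustration, not the authors' exact implementation):

```python
import math

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale gradients so their global L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

# Toy gradient with norm 50: rescaled down to norm 10.
clipped = clip_by_global_norm([30.0, 40.0], max_norm=10.0)
```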
## Experiments
* After learning skip-thoughts, freeze the model and use the encoder as feature extractor only.
* Evaluated the vectors with linear models on the following tasks:
### Semantic Relatedness
* Given a sentence pair, predict how closely related the two sentences are.
* The **skip-thoughts** method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
* Using features learned from an image-sentence embedding model on COCO boosts performance and brings it on par with dependency tree-LSTMs.
### Paraphrase detection
* **skip-thoughts** outperforms recursive nets with dynamic pooling if no hand-crafted features are used.
* **skip-thoughts** with basic pairwise statistics produces results comparable to state-of-the-art systems that rely on complicated features and hand engineering.
### Image-sentence Ranking
* MS COCO dataset
* Task
* Image annotation - Given an image, rank the sentences on the basis of how well they describe the image.
* Image search - Given a caption, find the image that is being described.
* Though the system does not outperform the baseline systems in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.
### Classification
* **skip-thoughts** performs about as well as bag-of-words baselines but is outperformed by methods where the sentence representation is learnt for the task at hand.
* Combining **skip-thoughts** with bi-gram Naive Bayes (NB) features improves the performance.
## Future Work
* Variants to be explored include:
* Fine tuning the encoder-decoder model during the downstream task instead of freezing the weights.
* Deep encoders and decoders.
* Larger context windows.
* Encoding and decoding paragraphs.
* Alternative encoders, such as convnets.

The paper seeks to establish a connection between algorithmic stability and generalization performance. Notions of algorithmic stability have been proposed before and linked to the generalization performance of learning algorithms \cite{conf/uai/KutinN02} \cite{journals/neco/KearnsR99} and have also been shown to be crucial for learnability \cite{journals/jmlr/Shalev-ShwartzSSS10}.
\cite{PoggioETAL:04} proved that for bounded loss functions, the generalization of ERM is equivalent to the probabilistic leave-one-out stability of the learning algorithm. \cite{journals/jmlr/Shalev-ShwartzSSS10} then showed that a problem is learnable in Vapnik's general setting of learning iff there exists an asymptotically stable ERM procedure.
This paper first establishes that for Vapnik's general setting of learning, a probabilistic notion of stability is necessary and sufficient for the training losses to converge to test losses uniformly for all distributions. The paper then presents some discussion of how this notion of stability can be interpreted to give results in terms of the capacity of the function class or the size of the population.

The paper presents a solution to binary classification with symmetric label noise (SLN). They show that, in order to obtain consistency (w.r.t. the 0-1 loss in the "noiseless" case) while using a convex surrogate, one must use the loss $\ell(v,y) = 1 - vy$, the "unhinged loss", which is shown to enjoy some useful properties, including robustness to SLN. In a more restricted sense of robustness it is the only such loss, but in any case it overcomes the limitations of other convex losses for the same problem.
Different implications of using the unhinged loss are discussed; the problem of classification with SLN with the unhinged loss and "linear" classifiers is investigated and solved analytically. The authors also present an empirical evaluation to motivate that their theoretical considerations have practical impact.

The paper proposes a payment rule for crowdsourced tasks. This rule is intended to incentivize workers to accurately report their confidence (e.g. by skipping a task when they have low confidence), and to pay little to spammers. Payment is based on the product of the evaluations of a worker's responses to a set of gold-standard tasks; if the worker gets a single gold standard task wrong and asserts high confidence, the overall payment is zero.
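The multiplicative structure described above can be sketched as follows (plain Python; the response format, skip credit, and penalty constants are hypothetical illustrations, not the paper's actual rule):

```python
def payment(responses, gold, budget=1.0, skip_score=0.5):
    """Multiplicative payment over gold-standard tasks (illustrative sketch).

    responses: per gold task, either "skip" or a tuple (answer, confident).
    A single confident wrong answer zeroes the product, hence the payment.
    """
    score = budget
    for resp, truth in zip(responses, gold):
        if resp == "skip":
            score *= skip_score          # low confidence: partial credit
        else:
            answer, confident = resp
            if answer == truth:
                score *= 1.0             # correct: full credit
            elif confident:
                return 0.0               # confident error: pay nothing
            else:
                score *= 0.25            # hypothetical penalty for an unsure error
    return score

# One confident error wipes out the payment; skipping preserves partial credit.
zero_pay = payment([("A", True), ("B", True)], ["A", "A"])
half_pay = payment([("A", True), "skip"], ["A", "A"])
```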

This paper introduces ASUGS (adaptive sequential updating and greedy search), building on the previous work on SUGS by Wang & Dunson (2011) \cite{10.1198/jcgs.2010.07081}, a sequential (i.e. online) MAP inference method for DPMMs.
The main contribution of the paper is to provide online updating for the concentration parameter, $\alpha$.
The paper shows that the posterior distribution on $\alpha$ can be expected to behave as a gamma distribution (which depends on the current number of clusters and on $n$) in the large-scale limit, assuming an exponential prior on $\alpha$.
ASUGS uses the mean of this gamma distribution as the $\alpha$ for updating cluster assignments, the remainder of the algorithm proceeding as in SUGS (i.e. using conjugacy to update model parameters in an online fashion, with hard assignments of data to clusters).
The paper also shows that this choice of $\alpha$ is bounded by $\log^\epsilon n$ for an arbitrarily small $\epsilon$, so that we may expect this process to converge, or at the very least remain stable, even in large settings.
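As a rough illustration of the update: ASUGS plugs in the mean of the gamma posterior as the next $\alpha$. In the sketch below only the Gamma-mean step is faithful; the shape/rate expressions are placeholders (the paper derives the exact terms from the cluster count and $n$), and the exponential prior corresponds to shape 1:

```python
import math

def gamma_mean(shape, rate):
    """Mean of a Gamma(shape, rate) distribution."""
    return shape / rate

def next_alpha(K, n, prior_shape=1.0, prior_rate=1.0):
    """Placeholder update: the paper's actual shape/rate terms differ.

    K: current number of clusters; n: number of points seen so far.
    """
    shape = prior_shape + K           # placeholder dependence on cluster count
    rate = prior_rate + math.log(n)   # placeholder dependence on n
    return gamma_mean(shape, rate)

alpha = next_alpha(K=3, n=100)
```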

This paper presents a new method (the "covariance-controlled adaptive Langevin thermostat") for MCMC posterior sampling for Bayesian inference. Along the lines of previous work in scalable MCMC, this is a stochastic gradient sampling method. The presented method aims to decrease parameter-dependent noise (in order to speed-up convergence to the given invariant distribution of the Markov chain, and generate beneficial samples more efficiently), while maintaining the desired invariant distribution of the Markov chain. Similar to existing stochastic gradient MCMC methods, this method aims to find use in large-scale machine learning settings (i.e. Bayesian inference with large numbers of observations). Experiments on three models (a normal-gamma model, Bayesian logistic regression, and a discriminative restricted Boltzmann machine) aim to show that the presented method performs better than Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) \cite{10.1016/0370-2693(87)91197-X} and Stochastic Gradient Nose-Hoover Thermostat (SGNHT), two similar existing methods.

The authors derive an estimator of a "proxy" of the covariance matrix of a stationary stochastic process (in their case asset returns) which is robust to data outliers and does not make assumptions on the tails of the distribution. They show that for elliptical distributions, which include Gaussians, this proxy is consistent with the true covariance matrix up to a scaling factor, and that their proposed estimator of the proxy has bounded error.

This paper proposes a novel online algorithm for constructing a multiclass classifier that enjoys a time complexity logarithmic in the number of classes $k$. This is done by constructing online a decision tree which locally maximizes an appropriate novel objective function, which measures the quality of a tree according to a combined "balancedness" and "purity" score. A theoretical analysis (of a probably intractable algorithm) is provided via a boosting argument (assuming weak learnability), essentially extending the work of Kearns and Mansour (1996) \cite{conf/stoc/KearnsM96} to the multiclass setup. A concrete algorithm is given for a relaxed problem (but see below) without any guarantees; it is nonetheless quite simple, natural and interesting.

The paper presents a method to obtain a hierarchical clustering of a planar graph by posing the problem as that of approximating a set of edge weights using an ultrametric. This is accomplished by minimizing the $\ell_2$ norm between the given edge weights and the learnt ultrametric. Learning the ultrametric amounts to estimating a collection of multicuts that satisfies a hierarchical partitioning constraint. An efficient algorithm is presented that solves an approximation based on finding a linear combination of a subset of possible two-way cuts of the graph.

The paper attacks the problem of describing a sequence of images from blog posts with a sequence of consistent sentences. For this, the paper proposes to first retrieve the K=5 most similar images, and their associated sentences, from the training set for each query image. The main contribution of the paper lies in defining a way to select the most relevant sentences for the query image sequence, providing a coherent description. For this, sentences are first embedded into vectors, and the sequence of sentences is then modeled with a bidirectional LSTM. The output of the bidirectional LSTM is fed through a ReLU \cite{conf/icml/NairH10} and a fully connected layer, and then scored with a compatibility score between image and sentence. Additionally, a local coherence model \cite{journals/coling/BarzilayL08} is included to enforce compatibility between sentences.

This work addresses an important special case of the correlation clustering problem: Given as input a graph with edges labeled -1 (disagreement) or +1 (agreement), the goal is to decompose the graph so as to maximize agreement within components. Building on recent work \cite{conf/kdd/BonchiGL14} \cite{conf/kdd/ChierichettiDK14}, this paper contributes two concurrent algorithms, a proof of their approximation ratio, a run-time analysis as well as a set of experiments which demonstrate convincingly the advantage of the proposed algorithms over the state of the art.

The paper presents a data visualisation method based on the concept of space-time. The space-time representation is capable of showing a broader family of proximities than a Euclidean space of the same dimensionality. Based on the KL measure, the authors argue that the lower-dimensional representation of high-dimensional data obtained with the space-time local embedding method can retain more information than Euclidean embeddings. I am quite convinced, though one question remains about the interpretability of data visualised in space-time.

The paper presents results on recovery of low-rank semidefinite matrices from linear measurements, using nonconvex optimization. The approach is inspired by recent work on phase retrieval, and combines spectral initialization with gradient descent. The connection to phase retrieval arises because measurements which are linear in the semidefinite matrix $X = Z Z'$ are quadratic in the factors $Z$. The paper proves recovery results which imply that correct recovery occurs when the number of measurements $m$ is essentially proportional to $n r^2$, where $n$ is the dimensionality and $r$ is the rank. The convergence analysis is based on a form of restricted strong convexity (restricted because there is an $r(r-1)/2$-dimensional set of equivalent solutions along which the objective is flat). This condition also implies linear convergence of the proposed algorithm.
The implementation seems awful. When compared to recent implementations, e.g. http://arxiv.org/abs/1408.2467 the performance seems orders of magnitude away from the state of the art -- and being an order of magnitude faster than general-purpose SDP solver on the nuclear norm does not make it any better. The authors should acknowledge that and compare the results with other codes on some established benchmark (e.g. Lenna), so as to show that the price in terms of run-time brings about much better performance in terms of objective function values (SNR, RMSE) -- which is plausible, but far from certain.

This paper considers a generalization of the Interactive Submodular Set Cover (ISSC) problem \cite{conf/icml/GuilloryB10}. In ISSC, the goal is to interactively collect elements until the value of the set of elements, represented by an unknown submodular function, reaches some threshold. In the original ISSC there is a single correct submodular function, which can be revealed using responses to each selected element, and a single desired threshold. This paper proposes to simultaneously require reaching some threshold for all the possible submodular functions. The threshold value is determined as a convex function of a submodular agreement measure between the given function and the responses to all elements. Each element has a cost, and so the goal is to efficiently decide which elements to collect to satisfy the goal at a small cost.

The authors introduce a novel approach for inferring hidden physical properties of objects (mass and friction), which also allows the system to make subsequent predictions that depend on these properties. They use a black-box generative model (a physics simulator), to perform sampling-based inference, and leverage a tracking algorithm to transform the data into more suitable latent variables (and reduce its dimensionality) as well as a deep model to improve the sampler. The authors assume priors over the hidden physical properties, and make point estimates of the geometry and velocities of objects using a tracking algorithm, which comprise a full specification of the scene that can be input to a physics engine to generate simulated velocities. These simulated velocities then support inference of the hidden properties within an MCMC sampler: the properties' values are proposed and their consequent simulated velocities are generated, which are then scored against the estimated velocities, similar to ABC. A deep network can be trained as a recognition model, from the inferences of the generative model, and also from the Physics 101 dataset directly. Its predictions of the mass and friction can be used to initialize the MCMC sampler.

This paper addresses the problem of learning reserve prices that approximately maximize revenue, using sample draws from an unknown distribution over bidder valuations. The authors introduce t-level auctions, in which (roughly speaking) each bidder's bid space is effectively discretized into levels; the bidder whose bid falls on the highest level wins, and pays the lowest bid value that would still have reached the level it needed to win.
The authors bound the number of samples needed to find an approximately revenue-maximizing auction from all auctions in a set C (e.g., from the set of 10-level auctions). They bound the difference in revenue between the revenue-maximizing t-level auction and the optimal auction. Results are presented for single-item auctions but are generalized to matroid settings and single-parameter settings.
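A hedged sketch of the allocation and payment just described (single item only; the index-based tie-breaking and the omission of reserve handling are simplifying assumptions, not details taken from the paper):

```python
from bisect import bisect_right

def t_level_auction(bids, thresholds):
    """Single-item t-level auction sketch (ties go to the lower-indexed bidder).

    thresholds[i]: bidder i's sorted level boundaries; a bid's level is the
    number of boundaries it meets or exceeds.
    """
    levels = [bisect_right(th, b) for b, th in zip(bids, thresholds)]
    winner = max(range(len(bids)), key=lambda i: (levels[i], -i))
    # Lowest level the winner needed: strictly beat lower-indexed rivals,
    # merely match higher-indexed ones (ties go to the lower index).
    required = max(levels[j] + (1 if j < winner else 0)
                   for j in range(len(bids)) if j != winner)
    # Pay the lowest bid value reaching that level (reserve handling omitted).
    price = thresholds[winner][max(required, 1) - 1]
    return winner, price

# Toy 3-level example: bidder 0 reaches level 3, bidder 1 only level 2,
# so bidder 0 wins and pays its level-2 boundary.
winner, price = t_level_auction([10, 7], [[2, 5, 9], [3, 6, 8]])
```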

This paper describes using an additional time scale over trials to model (slow) non-stationarities. It adds to the successful PLDS model a gain vector, matched to the latent dimensions, that is constant within each trial. Many neuroscientific datasets indeed show such slow drifts, which could very well be captured by such a modeling effort.

The algorithm presented here is simple and interesting. Pixel luminance, chrominance, and illumination chrominance are all histogrammed, and then evaluation is simply each pixel's luminance voting on each pixel's true chrominance for each of the "memorized" illuminations. The model can be trained generatively by simply counting pixels in the training set, or trained end-to-end for a slight performance boost. This algorithm's simplicity and speed are appealing, and additionally it seems like it may be a useful building block for a more sophisticated spatially-varying illumination model.
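The voting idea can be caricatured as follows (a toy sketch, not the paper's actual histogram machinery; the likelihood table and chroma values are entirely hypothetical):

```python
def score_illuminants(pixel_chromas, candidates, likelihood):
    """Score each candidate illuminant by the chromas it would imply.

    likelihood: a learned table mapping a corrected chroma to a vote weight
    (here just a dict lookup with hypothetical values).
    Returns the highest-scoring candidate illuminant.
    """
    scores = {}
    for ill in candidates:
        # Chroma each pixel would have if 'ill' were the true illuminant.
        corrected = [(u - ill[0], v - ill[1]) for u, v in pixel_chromas]
        scores[ill] = sum(likelihood.get(c, 0.0) for c in corrected)
    return max(scores, key=scores.get)

# Toy example: the learned table favors neutral (0, 0) corrected chroma,
# so the illuminant (1, 1) that neutralizes the observed pixels wins.
likelihood = {(0, 0): 1.0}
best = score_illuminants([(1, 1), (1, 1)],
                         candidates=[(0, 0), (1, 1)],
                         likelihood=likelihood)
```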

This paper proposes a new inference mechanism for the Plackett-Luce model, based on the preliminary observation that the ML estimate can be seen as the stationary distribution of a certain Markov chain. In fact, two inference mechanisms are proposed: one is approximate and consistent; the other converges to the ML estimate but is slower. The authors then discuss the application settings (pairwise preferences, partial rankings). Finally, the authors present three sets of experiments. The first compares the proposed algorithm to other approximate inference mechanisms for the PL model in terms of statistical efficiency. Then, on real-world datasets, one experiment compares the empirical performance of the approximate methods and another the speed with which the exact methods reach a certain level of optimality.

The authors propose a probabilistic version of the "line search" procedure that is commonly used as a subroutine in many deterministic optimization algorithms. The new technique can be applied when the evaluations of the objective function and its gradients are corrupted by noise. Therefore, the proposed method can be successfully used in stochastic optimization problems, eliminating the need to specify a learning rate parameter in this type of problem. The proposed method uses a Gaussian process surrogate model for the objective and its gradients. This allows the authors to obtain a probabilistic version of the conditions commonly used to terminate line searches in the deterministic scenario. The result is a soft version of those conditions that is used to stop the probabilistic line search process. At each iteration within such a process, the next evaluation location is selected using Bayesian optimization methods. A series of experiments with neural networks on the MNIST and CIFAR10 datasets validates the usefulness of the proposed technique.

Endowing recurrent neural networks with memory is clearly one of the most important topics of deep learning and crucial for real reasoning. The proposed stack-augmented recurrent nets outperform simple RNNs and LSTMs \cite{journals/neco/HochreiterS97} on a series of synthetic problems (learning simple algorithmic patterns). The complexity of the problems is clearly defined, and the behavior of the resulting stack RNN can be well understood and easily analyzed. However, conclusions drawn solely from these synthetic datasets carry some risk: the relevance of the problems to real sequence-modeling tasks is uncertain, and the failures of the other models might be greatly reduced by more thorough hyper-parameter search. As in \cite{journals/corr/LeJH15}, a very simple trick lets an RNN work very well on a toy task (an adding problem) that seems to require modeling long-term dependencies.

The authors model confidence data from two experiments (conducted by others and previously published in the scientific literature) using a POMDP. In both experiments, subjects saw a random-dot kinematogram on each trial and made a binary choice about the dominant motion direction. The first experiment used monkeys as subjects and stimuli had a fixed duration. The second experiment used people as subjects and stimuli continued until a subject made a response. The paper reports that the POMDP model does a good job of fitting the experimental data, both the accuracy data and the confidence data.

The paper proposes a sampler for iHMMs, which the authors show has improved mixing properties and performs better in posterior inference problems when compared to the existing state-of-the-art sampling methods. An existing Gibbs sampler is turned into a particle Gibbs sampler by using a conditional SMC step to sample the latent sequence of states. The paper uses conjugacy to derive optimal SMC proposals and ancestor sampling to improve the performance of the conditional SMC step. The result is more efficient sampling of the latent states, making the sampler robust to spurious states and yielding faster convergence.

The authors introduce a new method for actively selecting the model that best fits a dataset. Contrary to active learning, where the next learning point is chosen to get a better estimate of the model hyperparameters, this method selects the next point to better distinguish between a set of models. Similar active model selection techniques exist, but they need to retrain each model for each new candidate data point. The strength of the authors' method is that it only requires evaluating the predictive distributions of the models, without retraining.
They propose to apply this method to detect noise-induced hearing loss. The traditional way of screening for NIHL involves testing a wide range of intensities and frequencies, which is time consuming. The authors show that with their method, the number of tests to be run could be drastically decreased, reducing the cost of large-scale screenings for NIHL.

The paper gives justification for the widespread use of the Good-Turing estimator for discrete distribution estimation through minimax regret analysis with two comparator classes. The paper obtains competitive regret bounds that lead to a more accurate characterization of the performance of the Good-Turing estimators and in some cases are much better than the best known risk bounds. The comparator classes considered are estimators with knowledge of the distribution up to permutation, and estimators with full knowledge of the distribution but constrained to assign the same probability mass to symbols appearing with the same frequencies.

The authors perform theoretical analysis about faster convergence with multi-player normal-form games by generalizing techniques for two-player zero-sum games. They also perform empirical evaluation by using the 4-bidder simultaneous auction game.
The paper is concerned with two problems:
1. How does the social welfare of players using regret-minimization algorithms compare to the optimal welfare?
2. Can one obtain better regret bounds when all players use a regret-minimization algorithm?
The paper deals with bounds on regret-minimization algorithms in games. The usual regret bound for these algorithms is $O(\sqrt{T})$, but this assumes the learner faces a completely adversarial opponent. It is natural to assume instead that in a game everyone plays a regret-minimization algorithm, and the question is whether better rates can be obtained in this scenario. The authors show that $O(T^{1/4})$ regret is achievable for general games.

The paper "Bandits with Unobserved Confounders: A Causal Approach" addresses the problem of bandit learning. It is assumed that in the observational setting, the player's decision is influenced by some unobserved context. If we randomize the player's decision, however, this intention is lost. The key idea is that, using the available data from both scenarios, one can infer whether one should overrule the player's intention. Ultimately, this leads to the following strategy: observe the player's intention and then decide whether to act accordingly or pull the other arm.
The authors show that current MAB algorithms actually attempt to maximize rewards according to the experimental distribution, which is not optimal in the confounded case, and propose to make use of the effect of the treatment on the treated (ETT), i.e., comparing the average payouts obtained by players for going in favor of or against their intuition. To me, the paper is interesting because it addresses the confounding issue in MAB and proposes a way to estimate some properties of the confounder (related to the casino's payout strategy in the given example) based on ETT.
At first glance, one might think that the blinking light on the slot machines (B) and the drunkenness of the patron (D) could be either modified or observed in lines 153-159, where we read about a hypothetical attempt to optimize reward using traditional Thompson sampling. If those factors were observable or subject to intervention -- and I'd think they would be, in reality -- then it would be straightforward to do better than the 30% reward rate that's given. The paper eventually makes it clear that both of these variables are unobserved and unalterable. It would help if this were explicit early in the example, or if the cover story were modified to make this aspect more intuitive.

The paper introduces a probabilistic model for nonlinear manifold discovery. It is based on a generative model with missing variables and requires a variational EM implementation, which is standard but nevertheless technical to derive in this specific context.