Paper summary by hlarochelle

This paper suggests a method (NoBackTrack) for training recurrent neural networks in an online way, i.e. without having to do backprop through time. One way of understanding the method is that it applies the [forward method for automatic differentiation](//en.wikipedia.org/wiki/Automatic_differentiation#Forward_accumulation). Because that method requires maintaining a large Jacobian matrix (number of hidden units times number of parameters), the authors propose a way of obtaining a stochastic (but unbiased!) estimate of that matrix. Moreover, the method is improved by using Kalman filtering on that estimate, effectively smoothing the estimate over time.
#### My two cents
Online training of RNNs is a big, unsolved problem. The approach people currently use is to truncate backprop to only a few steps in the past, which is more of a heuristic than a principled solution.
This paper makes progress towards a more principled approach. I really like the "rank-one trick" of Equation 7, really cute! And it is quite central to this method too, so good job on connecting those dots!
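
To make the "rank-one trick" concrete, here is a minimal sketch of the idea as I understand it: a sum of outer products $\sum_i v_i w_i^\top$ is replaced by a single outer product $(\sum_i \epsilon_i v_i)(\sum_i \epsilon_i w_i)^\top$ with independent random signs $\epsilon_i \in \{-1, +1\}$. The exact form of Equation 7 is not reproduced in this review, so the details below (no scaling factors, the toy vectors, and all function names) are my assumptions, not the authors' code. The sketch checks unbiasedness exactly by averaging over every sign pattern instead of sampling.

```python
import itertools

def outer(v, w):
    """Outer product v w^T as a list of rows."""
    return [[vi * wj for wj in w] for vi in v]

def mat_add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def exact_sum(vs, ws):
    """The full matrix sum_i v_i w_i^T that we would like to avoid storing."""
    total = [[0.0] * len(ws[0]) for _ in vs[0]]
    for v, w in zip(vs, ws):
        total = mat_add(total, outer(v, w))
    return total

def rank_one_estimate(vs, ws, signs):
    """Rank-one estimate (sum_i s_i v_i)(sum_i s_i w_i)^T for one sign vector."""
    v_mix = [sum(s * v[k] for s, v in zip(signs, vs)) for k in range(len(vs[0]))]
    w_mix = [sum(s * w[k] for s, w in zip(signs, ws)) for k in range(len(ws[0]))]
    return outer(v_mix, w_mix)

def mean_over_all_signs(vs, ws):
    """Exact expectation of the estimate under uniform independent signs."""
    patterns = list(itertools.product((-1, 1), repeat=len(vs)))
    total = [[0.0] * len(ws[0]) for _ in vs[0]]
    for signs in patterns:
        total = mat_add(total, rank_one_estimate(vs, ws, signs))
    return [[x / len(patterns) for x in row] for row in total]

# Toy data (hypothetical): three rank-one terms of shape 2 x 3.
vs = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]
ws = [[2.0, 0.0, 1.0], [1.0, 1.0, -1.0], [-2.0, 3.0, 0.0]]

target = exact_sum(vs, ws)                    # the matrix being estimated
average = mean_over_all_signs(vs, ws)         # expectation of the estimate
single = rank_one_estimate(vs, ws, (1, 1, 1)) # one particular draw
```

Here `average` matches `target` because the cross terms $\epsilon_i \epsilon_j v_i w_j^\top$ with $i \neq j$ cancel in expectation, while any `single` draw is only rank one and can be far from `target`. That variance is presumably why smoothing the estimate over time helps.
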
The authors present this work as being preliminary, and indeed they do not compare with truncated backprop. I really hope they do in a future version of this work.
Also, I don't think I buy their argument that the "theory of stochastic gradient descent applies". Here's the reason. The method tracks the Jacobian of the hidden state with respect to the parameters, which they denote $G(t)$. It is updated into $G(t+1)$ using a recursion based on the chain rule. However, between computing $G(t)$ and $G(t+1)$, a gradient step is performed during training. This means that $G(t)$ is now slightly stale: it corresponds to the gradient with respect to the old value of the parameters, not the current value. As far as I understand, this implies that $G(t+1)$ (more specifically, its stochastic estimate as proposed in this paper) isn't unbiased anymore. So, unless I'm missing something (which I might!), I don't think we can invoke the theory of SGD as they suggest.
But frankly, that last issue seems pretty unavoidable in the online setting. I suspect this will never be solved, and future research will have to design learning algorithms that are robust to this issue (or develop new theory that shows it isn't one).
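
The staleness concern above can be illustrated with a toy example. This is only a sketch under my own assumptions: a scalar RNN $h_{t+1} = \tanh(w\,h_t + x_t)$, the exact forward recursion for $G = \partial h / \partial w$ (not the paper's stochastic estimate), and a hand-picked parameter schedule standing in for gradient steps. All names and numbers are hypothetical.

```python
import math

def run(xs, w_schedule):
    """Run h_{t+1} = tanh(w_t * h_t + x_t), tracking G = dh/dw by the chain rule.

    w_schedule gives the parameter value used at each step, so a constant
    schedule means no training, and a varying one mimics online updates.
    """
    h, G = 0.0, 0.0
    for x, w in zip(xs, w_schedule):
        h_new = math.tanh(w * h + x)
        # Chain rule: d tanh(w*h + x)/dw = (1 - h_new^2) * (h + w * dh/dw)
        G = (1.0 - h_new ** 2) * (h + w * G)
        h = h_new
    return h, G

def finite_difference(xs, w, eps=1e-6):
    """Central-difference derivative of the final state w.r.t. a fixed w."""
    h_plus, _ = run(xs, [w + eps] * len(xs))
    h_minus, _ = run(xs, [w - eps] * len(xs))
    return (h_plus - h_minus) / (2 * eps)

xs = [0.5, -0.3, 0.8, 0.1, -0.6]  # toy inputs (hypothetical)
w_final = 0.7

# Parameter held fixed: the recursion gives the true derivative.
_, G_fixed = run(xs, [w_final] * len(xs))
fd = finite_difference(xs, w_final)

# Parameter changed between steps, ending at the same value w_final.
_, G_online = run(xs, [0.3, 0.4, 0.5, 0.6, 0.7])
```

With a fixed parameter, `G_fixed` agrees with the finite-difference check `fd`. With the drifting schedule, `G_online` mixes derivatives taken at several older parameter values, so it no longer equals the Jacobian one would get at the current parameter `w_final`. In this toy the updates are large on purpose; with small learning rates the gap would shrink, but it would not vanish.
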
So overall, kudos to the authors, and I'm really looking forward to reading more about where this research goes!
