To speed up training, by immediately correcting each layer with the predicted gradient

To be able to learn longer sequences

I see problems with both of them. Please note, I really like Synthetic Gradients and would like to implement them. But I need to understand where my train of thought is incorrect.

I will now show why Point 1 and Point 2 don't seem to be beneficial, and I need you to correct me, if they are actually beneficial:

Point 1:

Synthetic Gradients tell us we can rely on another "mini-helper-network" (called DNI) to advise our current layer about what gradients will arrive from above, even during fwd prop.

However, the true gradients will only arrive several operations later. The same amount of backprop has to be done as without DNI, except that now we also need to train our DNI.

Adding this asynchronicity shouldn't make layers train faster than the traditional "locked" sequence of full fwdprop -> full backprop, because the device must still perform the same number of computations. The computations are merely shifted in time.

This makes me think Point 1 will not work. Simply adding SG between each layer shouldn't improve training speed.
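For concreteness, the mechanism described in Point 1 can be sketched as follows. This is only a minimal NumPy sketch under simplifying assumptions, not the paper's implementation: the layer is a single linear map, the DNI is a linear model initialised to zero, the "true" gradient is a random stand-in, and all names (`dni_predict`, `layer_update`, `dni_update`, `W_dni`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
n_in, n_hidden = 4, 3

W = rng.normal(scale=0.1, size=(n_in, n_hidden))  # layer weights
W_dni = np.zeros((n_hidden, n_hidden))            # linear DNI: h -> predicted dL/dh


def forward(x, W):
    """Linear layer: h = x @ W."""
    return x @ W


def dni_predict(h, W_dni):
    """Synthetic gradient: the DNI's guess of dL/dh, available right after forward."""
    return h @ W_dni


def layer_update(x, grad_h, W, lr=0.1):
    """Update the layer from a gradient w.r.t. its output (true or synthetic)."""
    return W - lr * x.T @ grad_h


def dni_update(h, true_grad_h, W_dni, lr=0.1):
    """One gradient step regressing the DNI prediction onto the true gradient."""
    err = dni_predict(h, W_dni) - true_grad_h
    return W_dni - lr * h.T @ err


x = rng.normal(size=(8, n_in))
h = forward(x, W)

# 1) The layer is corrected immediately with the synthetic gradient.
W_new = layer_update(x, dni_predict(h, W_dni), W)

# 2) Later, the true gradient arrives from above and is used to train the DNI.
true_grad_h = rng.normal(size=h.shape)  # stand-in for the real backprop signal
loss_before = np.mean((dni_predict(h, W_dni) - true_grad_h) ** 2)
W_dni = dni_update(h, true_grad_h, W_dni)
loss_after = np.mean((dni_predict(h, W_dni) - true_grad_h) ** 2)
```

Note that step 2 still needs the true gradient, which is exactly the cost this question is asking about.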

Point 2:

Ok, how about adding SG only on the last layer to predict the "gradient from the future", and only at the final timestep of forward prop?

This way, even though our LSTM has to stop predicting and must backpropagate, it can still predict the future-gradient it would have received (with the help of DNI sitting on the last timestep).

Notice, we have our DNI sitting at the very end of "Session A" and predicting "what gradient I would get flowing from the beginning of Session B (from future)".
Because of that, timestep_3A will be equipped with gradient "that would have come from timestep_1B", so indeed, corrections done during A will be more reliable.
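The setup can be checked on a toy case. The sketch below assumes a scalar linear RNN `h_t = w * h_{t-1} + x_t` with loss `sum_t 0.5 * h_t**2`, and it replaces the DNI with an "oracle" that returns the exact future gradient. The model, the loss, the input values and every function name are illustrative assumptions, not anything from the paper.

```python
import math


def forward(w, xs, h0):
    """Scalar linear RNN: h_t = w * h_{t-1} + x_t. Returns the list of h_t."""
    hs, h = [], h0
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs


def bptt(w, hs, h0, future_grad):
    """Backprop through one session with loss sum_t 0.5 * h_t**2.

    future_grad is dL_future/dh_last, i.e. what a DNI sitting at the end of
    the session would have to predict. Returns (dL/dw, dL/dh0).
    """
    g_w, g_h = 0.0, future_grad
    for t in reversed(range(len(hs))):
        g_h += hs[t]                        # add the local loss gradient
        h_prev = hs[t - 1] if t > 0 else h0
        g_w += g_h * h_prev                 # contribution of step t to dL/dw
        g_h *= w                            # push the gradient one step back
    return g_w, g_h


w = 0.9
xs = [1.0, 0.5, -0.3, 0.8, 0.2, -0.1]       # 6 timesteps: 3 for A, 3 for B

hs_full = forward(w, xs, 0.0)
g_w_full, _ = bptt(w, hs_full, 0.0, 0.0)    # one long session over all 6 steps

hs_a, hs_b = hs_full[:3], hs_full[3:]
g_w_b, grad_from_future = bptt(w, hs_b, hs_a[-1], 0.0)

g_w_a_truncated, _ = bptt(w, hs_a, 0.0, 0.0)             # plain truncation
g_w_a_oracle, _ = bptt(w, hs_a, 0.0, grad_from_future)   # perfect "DNI"
```

With the oracle value, session A plus session B reproduces the full 6-step gradient, while plain truncation does not. A trained DNI would only approximate `grad_from_future`.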

But, hey! These predicted "synthetic gradients" will be very small (negligible) anyway. After all, that's why we start a new backprop session B. If they weren't too small, we would just process all 6 timesteps in a single long backprop "session A".

Ok. Maybe we can get the benefit of training "session A", "session B", etc. on separate machines? But then how is this different from simply training with the usual minibatches in parallel? Keep in mind, as mentioned in Point 2: things are worsened by session A predicting gradients which are vanishing anyway.

Question:
Please help me understand the benefit of Synthetic Gradient, because the 2 points above don't seem to be beneficial

Why do you think this won't speed up training? The only justification I see is the bare assertion that this "shouldn't improve training speed" but you don't provide your reasoning. Also it's not clear what you mean by "step 1)" as you haven't described any steps in the question. In any case, the paper demonstrates that it does provide speedups. Data beats theory any day. Have you read the paper?
– D.W. May 24 '18 at 1:44

Data beats theory any day, I agree, but the best counterexample I can make is GPUs vs CPUs. Everywhere people keep saying a GPU runs orders of magnitude faster than a CPU, and provide comparisons. However, a properly coded multithreaded CPU is only 2-3 times slower than its same-category GPU and is cheaper than the GPU. larsjuhljensen.wordpress.com/2011/01/28/… Once again, I am not going against Synthetic Gradients, they seem awesome, it's just that until I get an answer to my post, I won't be able to rest :D
– Kari May 24 '18 at 5:47

I'm not sure that a 7-year-old blog post about BLAST is terribly relevant here.
– D.W. May 24 '18 at 7:00

What I am trying to say is "there are ways to make parallelism seem better than it might actually be", in any scenario.
– Kari May 24 '18 at 11:36

2 Answers

But, hey! These predicted "synthetic gradients" will be very small (negligible) anyway. After all, that's why we start a new backprop session B. If they weren't too small, we would just process all 6 timesteps in a single long backprop "session A".

That's not necessarily correct. We typically truncate and start a new backprop because of hardware constraints, such as memory or computational speed. Vanishing gradients can be mitigated by other means, such as gradient normalization (scaling up the gradient vector if it gets too small beyond certain layers, or scaling it down if it's about to explode), or even Batch Normalization.
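As a rough illustration of the rescaling idea, here is a minimal NumPy sketch. The function name `rescale_gradient` and the thresholds `min_norm` and `max_norm` are hypothetical choices for illustration, not values from the paper or from any particular library.

```python
import numpy as np


def rescale_gradient(grad, min_norm=1e-3, max_norm=5.0):
    """Rescale grad so that its L2 norm lies in [min_norm, max_norm].

    The thresholds are hypothetical. A zero gradient is returned unchanged
    because it has no direction to scale.
    """
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return grad
    if norm > max_norm:
        return grad * (max_norm / norm)   # about to explode: scale down
    if norm < min_norm:
        return grad * (min_norm / norm)   # vanishing: scale up
    return grad


exploding = rescale_gradient(np.array([30.0, 40.0]))   # norm 50 -> 5
vanishing = rescale_gradient(np.array([3e-6, 4e-6]))   # norm 5e-6 -> 1e-3
unchanged = rescale_gradient(np.array([0.3, 0.4]))     # norm 0.5, left alone
```

Only the length of the gradient changes, its direction is preserved.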

It's important to understand how to update any DNI module. To clear things up, consider an example of a network with several layers and 3 DNI modules:

Longer sequences are more precise than minibatches; however, minibatches add a regularization effect. But, given some technique to prevent the gradient from exploding or vanishing, training on longer sequences can provide much better insight into the context of the problem. That's because the network infers its output after considering a longer sequence of input, so the outcome is more rational.

For a comparison of the benefits granted by SG, refer to the diagrams on page 6 of the paper. The main one is being able to solve longer sequences, which I feel is most beneficial (we can already parallelize via minibatches anyway, and thus SG shouldn't speed up the process when performed on the same machine, even if we indeed only propagate up to the next DNI).

However, the more DNI modules we have, the noisier the signal should be. So it might be worthwhile to train the layers and the DNIs entirely by legacy backprop at first, and only after some epochs have elapsed start using the DNI-bootstrapping discussed above.

That way, the earliest DNI will acquire at least some sense of what to expect at the start of the training. That's because the following DNIs are themselves unsure of what the true gradient actually looks like, when the training begins, so initially, they will be advising "garbage" gradient to anyone sitting earlier than them.
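The warm-up idea could be expressed as a simple schedule. This is a hypothetical sketch: the function name `training_mode`, the default of 5 warm-up epochs and the mode labels are assumptions made for illustration, not something prescribed by the paper.

```python
def training_mode(epoch, warmup_epochs=5):
    """Pick the gradient source for the layers and the target for the DNIs.

    During warm-up everything is trained on true backprop gradients, so each
    DNI first sees what real gradients look like. Afterwards the layers use
    synthetic gradients and the DNIs are trained on bootstrapped targets.
    """
    if epoch < warmup_epochs:
        return {"layer_gradient": "true", "dni_target": "true"}
    return {"layer_gradient": "synthetic", "dni_target": "bootstrapped"}


schedule = [training_mode(epoch)["layer_gradient"] for epoch in range(7)]
```

The right number of warm-up epochs would have to be found empirically.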

Don't forget that authors also experimented with predicting the actual inputs for every layer.

If your layers have expensive backprop (perhaps you have Batch-Normalization or some fancy activation functions), correcting with DNI might be a lot cheaper, once it's sufficiently well-trained. Just remember that DNI is not free - it requires matrix multiplication and probably won't provide much speed-up on a simple dense layer.
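A back-of-the-envelope count illustrates the dense-layer case. The sketch assumes a linear DNI that maps a layer's output `h` (with `n_h` units) to a predicted `dL/dh`, and compares it with backpropagating through one following dense layer of `n_next` units. It counts multiply-adds only and ignores the cost of training the DNI itself. Both function names and all sizes are hypothetical.

```python
def dense_backprop_macs(batch, n_h, n_next):
    """Multiply-adds to get dL/dh from dL/dy for y = h @ W, with W of shape (n_h, n_next)."""
    return batch * n_next * n_h


def linear_dni_macs(batch, n_h):
    """Multiply-adds for a linear DNI predicting dL/dh from h (an n_h x n_h matrix)."""
    return batch * n_h * n_h


same_width = (dense_backprop_macs(32, 256, 256), linear_dni_macs(32, 256))
wider_next = (dense_backprop_macs(32, 256, 1024), linear_dni_macs(32, 256))
```

When the next layer has the same width, the two counts are identical, so the DNI saves nothing on this measure. It only comes out ahead when the path it replaces is more expensive than the DNI.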

Minibatches give us a speedup (via parallelization) and also a regularization effect. Synthetic Gradients allow us to infer better by working with longer sequences and a (potentially) less expensive gradient. All together this is a very powerful system.

Synthetic gradients make training faster, not by reducing the number of epochs needed or by speeding up the convergence of gradient descent, but rather by making each epoch faster to compute. The synthetic gradient is faster to compute than the real gradient (computing the synthetic gradient is faster than the backpropagation), so each iteration of gradient descent can be computed more rapidly.

From my understanding, time-wise the gradients shouldn't reach the DNI faster, it's just that they are now shifted in time, and computed asynchronously while the forward prop is occurring. The DNI will still have to get the true gradient to train itself. So the Synthetic Gradients should require the same number of computations to be done in parallel as with standard BPTT. Is this correct?
– Kari May 24 '18 at 11:14

So there wouldn't be any speedup in simply introducing the SG between the layers. Yes, we do get the immediate predicted gradient from the DNI, but for each such prediction we will eventually have to pay the price by asynchronous full backpropagation towards that DNI, a bit later.
– Kari May 24 '18 at 11:35

@Kari, no, that doesn't sound right to me. If you need the same number of iterations, but now each iteration takes 50% less time on the GPU, then the resulting computation will be done earlier. Even if you need 10% more iterations/epochs (because the gradients are delayed or the synthetic gradients don't perfectly match the actual gradients), that's still a win: the speedup from being able to compute the synthetic gradient faster than the real gradient outweighs other effects. You seem confident that this can't help, but the data in the paper shows it does help.
– D.W. May 24 '18 at 13:11
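The arithmetic behind the comment above can be written out. The 50% and 10% figures are the hypothetical numbers from that comment, not measurements from the paper, and `relative_training_time` is a made-up helper name.

```python
def relative_training_time(iteration_time_ratio, iteration_count_ratio):
    """Total training time relative to the baseline (1.0 means unchanged)."""
    return iteration_time_ratio * iteration_count_ratio


# Each iteration takes 50% of the baseline time, but 10% more iterations are needed.
relative = relative_training_time(0.5, 1.1)
```

Under those assumed figures the total time is about 0.55 of the baseline. Whether each iteration really gets that much cheaper is the point the thread disputes.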

Hm, well for example, we have 4 layers sitting after our DNI. In normal backprop, we would have 4 "forward" exchanges between the layers, and then 4 "backward" exchanges, and while this occurs the system is locked. With DNI we can straight away correct our weights, but will need to get true gradients later on. But now the system is not locked, allowing more forward passes to slide by in the meantime. But we still owe the true gradient from before to our DNI... To obtain and deliver this gradient back to the DNI will take 100% of the time (same 4 steps forward, same 4 steps backward).
– Kari May 24 '18 at 13:22

It's just that our DNI says "fine, give them when possible, later on", but we still have to pay the full price, so I don't see the performance increase. I agree, the papers show great results, but how come? We already can train minibatches in parallel anyway :/
– Kari May 24 '18 at 13:56