3.
Topics of this talk: how to compute
gradients through stochastic units
First 20 min.
• Stochastic unit
• Learning methods for stochastic neural nets
Second 20 min.
• How to implement it with Chainer
• Experimental results
Take-home message: you can train stochastic NNs without
modifying the backprop procedure in most frameworks
(including Chainer)

4.
Caution!!!
This talk DOES NOT introduce
• basic maths
• backprop algorithm
• how to install Chainer (see the official documentation!)
• basic concepts and usage of Chainer (ditto!)
Some math was unavoidable to explain the work, so just
take this talk as an example of how a researcher writes scripts
in Python.

10.
Gradient estimation for stochastic NNs is difficult!
A stochastic NN is NOT deterministic
-> we have to optimize the expectation of the loss over the stochasticity
• All possible realizations of the stochastic units should be
considered (with losses weighted by their probabilities)
• Enumerating all such realizations is infeasible!
• We cannot enumerate all samples from a Gaussian
• Even with Bernoulli units, the number of configurations grows
exponentially with the number of units (2^n for n units)
-> need approximation

11.
General trick: likelihood-ratio method
• Do forward prop with sampling
• Decrease the probability of chosen values if the loss is high
• difficult to decide whether the loss this time is high or low…
-> decrease the probability by an amount proportional to the loss
• Using the log-derivative results in an unbiased gradient estimate
Not straightforward to implement in NN frameworks
(I'll show how later)
The LR gradient estimate: $\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[L(z)] = \mathbb{E}_{z \sim p_\theta(z)}[L(z)\,\nabla_\theta \log p_\theta(z)] \approx L(z)\,\nabla_\theta \log p_\theta(z)$,
where $z$ is sampled from $p_\theta(z)$, $L(z)$ is the sampled loss, and $\nabla_\theta \log p_\theta(z)$ is the log derivative.
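A minimal sketch of this in Chainer, assuming only standard Chainer functions; the layer name StochasticBernoulliLayer is hypothetical. The sample is drawn outside the computational graph, and the log-likelihood of the drawn sample is returned so it can later be multiplied by the (constant) sampled loss, the "fake loss":

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class StochasticBernoulliLayer(chainer.Chain):
    """Hypothetical layer: sigmoid probabilities, Bernoulli samples, and their log-likelihood."""

    def __init__(self, n_in, n_units):
        super(StochasticBernoulliLayer, self).__init__()
        with self.init_scope():
            self.fc = L.Linear(n_in, n_units)

    def __call__(self, x):
        p = F.sigmoid(self.fc(x))                                      # Bernoulli parameters p_theta
        z = (np.random.rand(*p.shape) < p.data).astype(np.float32)    # sample, outside the graph
        log_p = F.sum(z * F.log(p) + (1 - z) * F.log(1 - p), axis=1)  # log p_theta(z) of the sample
        return z, log_p

# "Fake loss": sampled_loss is a plain ndarray (a constant for autograd), so
# backpropagating through log_p yields  L(z) * d log p_theta(z) / d theta:
#   fake_loss = F.mean(sampled_loss * log_p)
#   fake_loss.backward()
```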

12.
Technique: LR with baseline
The LR method results in high variance
• The gradient estimate is accurate only after observing many samples
(because the log-derivative is not related to the loss function)
We can reduce the variance by shifting the loss value by a
constant: using $L(z) - b$ instead of $L(z)$
• It does not change the relative goodness of each sample
• The shift $b$ is called the baseline
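A minimal sketch of a moving-average baseline, reusing the log_p Variable from the previous sketch; the decay rate beta and the helper name lr_fake_loss are assumptions:

```python
import numpy as np
import chainer.functions as F

baseline = 0.0  # running estimate of the expected loss
beta = 0.9      # decay rate of the moving average (assumed value)

def lr_fake_loss(sampled_loss, log_p):
    """sampled_loss: plain ndarray (constant for autograd); log_p: Chainer Variable."""
    global baseline
    centered = sampled_loss - baseline                             # shift the loss by the baseline
    baseline = beta * baseline + (1 - beta) * float(np.mean(sampled_loss))
    return F.mean(centered * log_p)                                # backprop gives (L(z) - b) * dlogp/dtheta
```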

13.
Modern trick: reparameterization trick
Write the sampling procedure as a differentiable computation
• Given noise, the computation is deterministic and differentiable
• Easy to implement on NN frameworks (as easy as dropout)
• The variance is low!!
(diagram: the noise enters as a separate input, e.g. $z = \mu + \sigma\,\epsilon$ with noise $\epsilon \sim \mathcal{N}(0, I)$)
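A minimal sketch of the Gaussian case, assuming the network outputs a mean mu and a log-variance ln_var; Chainer's chainer.functions.gaussian(mean, ln_var) implements the same reparameterized sampling:

```python
import numpy as np
import chainer.functions as F

def reparameterized_gaussian(mu, ln_var):
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and ln_var."""
    eps = np.random.standard_normal(mu.shape).astype(np.float32)  # noise, sampled outside the graph
    return mu + F.exp(0.5 * ln_var) * eps                         # deterministic given the noise
```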

14.
Summary of learning stochastic NNs
For Gaussian units, we can use the reparameterization trick
• It has low variance, so we can train them efficiently
For Bernoulli units, we have to use likelihood-ratio methods
• They have high variance, which is problematic
• To capture the discrete nature of data representations, it is
better to use discrete units, so we have to develop a fast
algorithm for learning discrete units

26.
Other note on experiments (1)
Plain LR does not learn well; it always needs a baseline.
• There are many techniques, including
• Moving average of the loss value
• Predict the loss value from the input
• Optimal constant baseline estimation
It is better to use momentum SGD and an adaptive learning rate
• i.e., Adam
• Momentum effectively reduces the gradient noise
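For example, setting up Adam in Chainer (alpha=0.001 is just Adam's usual default; model is the Chain being trained, defined elsewhere):

```python
from chainer import optimizers

optimizer = optimizers.Adam(alpha=0.001)  # adaptive learning rate with momentum-like moment estimates
optimizer.setup(model)                    # `model` is the Chain defined elsewhere
```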

27.
Other note on experiments (2)
Use Trainer!
• The snapshot extension makes it easy to suspend and resume training,
which is crucial for long experiments
• Adding a custom extension is super easy: I wrote
• an extension to keep the model with the best validation score so far
(for early stopping)
• an extension to report variance of estimated gradients
• an extension to plot the learning curve at regular intervals
Use the report function!
• It makes it easy to collect statistics of any values computed
as by-products of the forward computation
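A minimal sketch of these pieces, assuming a trainer, a model, and an Evaluator reporting validation/main/loss already exist; the extension name keep_best and the file name best_model.npz are hypothetical:

```python
import chainer
from chainer import training
from chainer.training import extensions

best = {'loss': float('inf')}

@training.make_extension(trigger=(1, 'epoch'))
def keep_best(trainer):
    # Keep the model with the best validation loss so far (for early stopping).
    loss = trainer.observation.get('validation/main/loss')
    if loss is not None and float(loss) < best['loss']:
        best['loss'] = float(loss)
        chainer.serializers.save_npz('best_model.npz', model)  # `model` is defined elsewhere

trainer.extend(extensions.snapshot(), trigger=(1, 'epoch'))  # enables suspend/resume
trainer.extend(keep_best)

# Inside a Chain's forward computation, report() makes by-product statistics
# appear in the trainer's observations and logs, e.g.:
#   chainer.report({'grad_variance': grad_var}, self)
```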

29.
My research
My current research is on low-variance gradient estimators for
stochastic NNs with Bernoulli units
• They need extra computation, which is embarrassingly
parallelizable
• Theoretically guaranteed to have lower variance than LR
(even vs. LR with the optimal input-dependent baseline)
• Empirically shown to learn faster

30.
Summary
• Stochastic units introduce stochasticity into neural networks
(and their computational graphs)
• The reparameterization trick and likelihood-ratio methods are
often used for learning them
• The reparameterization trick can be implemented in Chainer
as a simple feed-forward network with an additional noise input
• Likelihood-ratio methods can be implemented in Chainer
using a fake loss