Learning with Stochastic Gradient Descent

Basics

Many numerical learning algorithms amount
to optimizing a cost function that can be
expressed as an average over the training examples.
The loss function measures how well (or how poorly)
the learning system performs on each example.
The cost function is then the average of these
loss values over all training examples,
possibly augmented with capacity control terms.
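
In symbols, writing w for the parameters of the
learning system, such a cost might be written as
follows (the notation, including the capacity
control term lambda*Omega(w), is chosen here for
illustration rather than taken from this page):

    C(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i, w) + \lambda\, \Omega(w)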

Computing such an average takes a time
proportional to the number of examples n.
This factor n appears in the running time of
any optimization algorithm that considers the
complete cost function.
Stochastic gradient descent instead
updates the learning system on the basis
of the loss function measured for a single example.
Such an algorithm works because, on average,
these updates follow the gradient of the
complete cost function.
Although the convergence is much noisier,
eliminating the factor n from the cost of
each update can be a huge advantage for
large-scale problems.
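
The contrast is easy to see in code. The sketch
below uses a least-squares loss on synthetic data;
the function names, the fixed learning rate eta,
and the data are assumptions made for illustration,
not anything taken from this page.

    import numpy as np

    def batch_gradient_step(w, X, y, eta):
        # One full-gradient step: touches all n examples.
        grad = X.T @ (X @ w - y) / len(y)
        return w - eta * grad

    def sgd_step(w, x_i, y_i, eta):
        # One stochastic step: touches a single example.
        grad = (x_i @ w - y_i) * x_i
        return w - eta * grad

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.ones(5) + 0.1 * rng.normal(size=1000)
    w, eta = np.zeros(5), 0.01
    for t in range(1000):
        i = rng.integers(len(y))  # pick one example at random
        w = sgd_step(w, X[i], y[i], eta)

Each sgd_step costs O(d) regardless of n, whereas
each batch_gradient_step costs O(nd).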

Stochastic Gradient Descent has historically been
associated with the back-propagation algorithm
in multilayer neural networks.
These nonlinear nonconvex problems can be very difficult.
Therefore it is useful to see how Stochastic Gradient Descent
performs on simple linear and convex problems such
as linear Support Vector Machines (SVMs)
or Conditional Random Fields (CRFs).
Source code and benchmarks are available on
the page about Stochastic Gradient Examples.
The benchmarks clearly show the superiority
of Stochastic Gradient Descent on large learning problems.
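
In the linear SVM case, for instance, each
stochastic update reduces to a one-line rule on
the hinge loss. The sketch below is a minimal
illustration assuming an L2-regularized hinge
loss with regularization lam and learning rate
eta; it is not the implementation distributed on
the examples page.

    import numpy as np

    def svm_sgd_step(w, x, y, lam, eta):
        # One SGD step for lam/2 * ||w||^2 + max(0, 1 - y*(w.x)),
        # with labels y in {-1, +1}.
        if y * (w @ x) < 1:
            # Margin violated: regularizer plus hinge subgradient.
            return w - eta * (lam * w - y * x)
        # Margin satisfied: only the regularizer acts.
        return w - eta * lam * w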

Research

Many learning algorithms of the early nineties can be expressed as stochastic gradient descent,
including perceptrons, adalines, multilayer networks, k-means, learning vector quantization, etc.
Studying stochastic optimization allowed me to
characterize their convergence and their speed
in a unified framework.
This theoretical background became a solid basis to study
the practical issues surrounding stochastic gradient descent,
and for designing learning algorithms for new categories of problems.
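
To make the first point concrete: the perceptron
update can be read as stochastic gradient descent
on the loss max(0, -y w.x) with labels y in
{-1, +1}; the notation here is again chosen for
illustration. A subgradient step with rate eta
gives

    w \leftarrow
    \begin{cases}
      w + \eta\, y\, x & \text{if } y\, w^\top x \le 0, \\
      w                & \text{otherwise,}
    \end{cases}

which is exactly the classical perceptron rule.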