Learning Mixture Models

Learning mixture models is a natural formalization of the clustering
problem. A mixture model is a collection of distributions D1, ..., Dk
and weights w1, ..., wk. To sample from a mixture model, we draw an
index i with probability wi, and then draw a random sample from Di.
Thus, each cluster in the data corresponds to a different distribution
Di in the mixture, and the weight wi corresponds to the fraction of
points in the entire dataset that come from cluster i.
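
The sampling process just described can be sketched directly; a minimal illustration, where the two-component weights and Gaussian samplers are made-up examples:

```python
import random

def sample_mixture(weights, samplers, n):
    """Draw n labeled points from a mixture: pick component i with
    probability weights[i], then draw a sample from that component."""
    points = []
    for _ in range(n):
        [i] = random.choices(range(len(weights)), weights=weights)
        points.append((i, samplers[i]()))  # keep the source label i
    return points

# Hypothetical two-component mixture of 1-D Gaussians, weights 0.3 / 0.7.
data = sample_mixture(
    weights=[0.3, 0.7],
    samplers=[lambda: random.gauss(-5.0, 1.0), lambda: random.gauss(5.0, 1.0)],
    n=1000,
)
```

In the learning problem, of course, the labels i are hidden and must be recovered from the samples alone.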
In the problem of learning mixture models, we are given samples from a
mixture model, and our goal is to (a) learn the parameters of the
distributions Di and (b) classify each sample according to its source
distribution.
The most popular algorithms for learning mixture models in practice are the
Expectation-Maximization (EM) and k-means algorithms.

Separation

If the distributions in the mixture are very close together in space, then
intuitively the mixture will be hard to learn. To quantify this effect,
Dasgupta [Das99] introduced the idea of a separation condition: a promise
that every pair of distributions in the mixture is far apart. Given a
mixture with a certain separation, the goal of the algorithm designer is
to learn the correct clustering of the data.

A long line of literature has produced algorithms with smaller and
smaller separation requirements. Typically, separation is measured by the
minimum distance between the means of any pair of distributions in the
mixture, expressed as a multiple of a standard deviation. The parameters
here are d, the dimensionality of the data; σ, the maximum directional
standard deviation; and σ*, the maximum directional standard deviation in
the subspace containing the means.
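
Under this measure, the separation of a known mixture is simple to compute; a small sketch, where the means and σ below are hypothetical values:

```python
import math
from itertools import combinations

def separation(means, sigma):
    """Minimum pairwise distance between component means, expressed as a
    multiple of the maximum directional standard deviation sigma."""
    dmin = min(math.dist(a, b) for a, b in combinations(means, 2))
    return dmin / sigma

# Hypothetical example: three means in the plane, sigma = 1.
print(separation([(0, 0), (10, 0), (0, 10)], sigma=1.0))  # -> 10.0
```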

The table below summarizes some of the bounds.

Reference     | Separation      | Comments                                             | Algorithm
[Das99]       | σ d^(1/2)       | Spherical Gaussians                                  | Random Projections
[DS00]        | σ d^(1/4)       | Bounded eccentricity                                 | EM
[AK01]        | σ d^(1/4)       | Includes other conditions, like concentric Gaussians | Distance thresholding
[VW02]        | σ k^(1/4)       | Spherical Gaussians                                  | PCA Projections
[KSV05, AM05] | σ / w_min^(1/2) | —                                                    | PCA Projections
[CR08]        | σ* k^(1/2)      | Product distributions                                | Canonical Correlations
[BV08]        | —               | Low Fisher discriminant values                       | Isotropic PCA
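
Several rows above (e.g. [VW02], [KSV05, AM05]) first project the data onto the span of the top principal components, where inter-mean distances are roughly preserved while much of the noise is discarded. A minimal NumPy sketch of that projection step, not any particular paper's full algorithm; the cluster parameters below are made up:

```python
import numpy as np

def pca_project(X, k):
    """Center the data and project it onto the top-k principal
    components (rows of Vt from the SVD of the centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # shape (n, k)

# Made-up example: two spherical Gaussian clusters in 50 dimensions whose
# means are 8 standard deviations apart; the gap survives the projection.
rng = np.random.default_rng(0)
mu = np.zeros(50)
mu[0] = 8.0
X = np.vstack([rng.normal(0.0, 1.0, (200, 50)),
               rng.normal(mu, 1.0, (200, 50))])
Y = pca_project(X, 2)
```

After the projection, a simple clustering step (e.g. distance-based) can be run in the low-dimensional space.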

Learning Mixtures with No Separation Condition

In the absence of a lower bound on the separation, the problem of learning
mixture models is considered quite hard, particularly when there are
more than two clusters. In this setting, one is again given samples from a
mixture model, and the goal is to produce a mixture whose KL-divergence
to the true mixture is at most ε.

For a mixture of two product distributions, Freund and Mansour [FM99]
provide an algorithm with running time polynomial in n and 1/ε.
For a mixture of k product distributions, Feldman, O'Donnell and Servedio [FOS05]
provide an algorithm with running time
d^(O(k^3/ε^3)).

Sample Requirement

Another aspect of an algorithm that learns mixture models is its sample
requirement. [CHRZ07, Cha07] show an algorithm that learns a mixture of
two binary product distributions given Ω(d/s^4) samples, where s is
the separation. [BCFZ07] shows an SVD-based algorithm that learns a mixture
of k binary product distributions given Ω(d/s^2) samples from the
mixture. [L96] shows an information-theoretic lower bound of
Ω(1/s^4) samples on any algorithm for learning a mixture of two
spherical Gaussians on a line.

[SSR06] experimentally studies the sample requirement of EM for
learning Gaussian mixtures. They conjecture that learning mixtures
has three phases, depending on the number of samples: with fewer than
about d/s^4 samples, the problem is information-theoretically hard;
with more than about d/s^2 samples, it is computationally easy; and in
between, it is computationally hard, but easy in an
information-theoretic sense.

Learning Mixtures of More General Distributions

[DHKS05] examines the question of learning mixtures of more general
distributions than Gaussians, particularly distributions with
heavy tails, and provides an algorithm for this problem that runs in
time exponential in the dimensionality of the data. [CR08b] provides
a polynomial-time algorithm for this problem, using ideas
from the metric-embedding literature.

Also, [AM05] and [KSV05] show that their results apply to mixtures of a
slightly broader class of distributions than Gaussians, including mixtures
of log-concave distributions.

EM and k-means

The most common algorithms used in practice for learning mixture models
remain k-means and EM. However, not much is known about these
algorithms theoretically.

[DS00] shows a two-round variant of EM that produces the correct clustering
when the separation between the Gaussians is O(σ d^(1/4)).
[RW84] analyze the likelihood surface around the optimal solution and show
that EM has first-order convergence; [XJ96] characterizes some conditions
under which EM has second-order convergence.
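
For concreteness, here is one EM iteration for a mixture of two 1-D Gaussians, with both variances fixed to 1 for simplicity. This is only a sketch, not the specific variant analyzed in the papers above; the data and starting values are made up:

```python
import math

def em_step(xs, w, mu0, mu1):
    """One EM iteration for a two-component 1-D Gaussian mixture with
    unit variances: the E-step computes the posterior probability that
    each point came from component 1; the M-step re-estimates the
    mixing weight and the two means from those posteriors."""
    post = []
    for x in xs:
        p0 = (1 - w) * math.exp(-0.5 * (x - mu0) ** 2)
        p1 = w * math.exp(-0.5 * (x - mu1) ** 2)
        post.append(p1 / (p0 + p1))
    s = sum(post)
    w_new = s / len(xs)
    mu1_new = sum(p * x for p, x in zip(post, xs)) / s
    mu0_new = sum((1 - p) * x for p, x in zip(post, xs)) / (len(xs) - s)
    return w_new, mu0_new, mu1_new

# Toy data: two clusters around -4 and +4; crude starting guesses.
xs = [-4.2, -3.9, -4.1, -3.8, 4.0, 4.1, 3.9, 4.2]
w, mu0, mu1 = 0.5, -1.0, 1.0
for _ in range(20):
    w, mu0, mu1 = em_step(xs, w, mu0, mu1)
```

On this well-separated toy data the iteration quickly drives mu0 and mu1 to the two cluster means; with small separation, convergence can be far slower, which is one way the separation conditions above enter the analysis.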