week 06: mixed feelings

goals this week

Introduction to single-cell RNA-seq analysis, and how we use it to
identify new “cell types” by clustering.

Understanding the K-means clustering algorithm, one of a few basic
clustering algorithms.

Viewing K-means clustering as a mixture model, an important class
of generative probability model, and viewing the K-means clustering
algorithm as an example of expectation maximization.

clustering single cell RNA-seq data

Starting around 2009, people began doing RNA-seq experiments on
single cells, and the
technologies for doing these experiments are advancing rapidly. Single
cell RNA-seq is a major breakthrough because it allows us to see
cell-to-cell heterogeneity in transcriptional profiles, including
different cell types. So long as we’re willing to use transcriptional
profile as a proxy for “cell type”, by using single-cell RNA-seq we
can comprehensively survey all the cell types present in some
tissue. This promises to be especially powerful in understanding
complex biological structures composed of many cell types, like
brains.

A single mammalian cell only contains about 400K mRNAs, and many are
inevitably lost in the experimental procedure. Depending on the
protocol, perhaps a few million different fragments are generated per
cell. There’s little point in sequencing single-cell libraries to the
depth of a standard RNA-seq library (30M-100M reads). Typically,
experiments obtain 200K-5M reads per cell, and sometimes as few as 50K
[Pollen et al 2014].

These reads are distributed over 20K mammalian gene
transcripts, so we only get small numbers of counts per typical gene:
on the order of 2-100 mapped read counts, depending on sequencing
depth. Only highly expressed genes are detected reliably, and
low-level transcripts may not be sampled at all. There is a “dropout”
effect, correlated with low expression: puzzlingly often, no counts
at all are detected for some expressed genes. Statistical
modeling of single-cell RNA-seq data typically includes a model of the
dropout effect
[Kharchenko et al, 2014].

For these reasons single-cell RNA-seq experiments typically detect
expression of fewer genes than population RNA-seq experiments
do. Population RNA-seq experiments are better at deeply sampling the
“complete” transcriptome of a population of cells. Single cell
experiments are most useful for identifying enough higher-expressed
genes to distinguish useful cell types. These used to be called marker
genes, when we had few of them – cell types are still often
operationally identified by a single marker gene they express highly,
such as the parvalbumin+ (Pv+) interneurons in mammalian hippocampus.

One can pool single cell data after clustering to approximate
population RNA-seq on purified cell type populations, although the
higher technical difficulty of single cell RNA-seq experiments
probably makes this an imperfect approximation, with higher technical
noise.

Single cell RNA-seq experiments are very noisy. The amount of material
is small, and has to be amplified with noisy amplification protocols.
Important experimental trickery is used, including the use of UMI
barcodes (unique molecular identifiers), to detect and correct
amplification artifacts. One good review of the experimental design,
analysis, and interpretation of single-cell RNA-seq experiments is
[Stegle, Teichmann, and Marioni, 2015].

The heart of the analysis, and what concerns us here, is that we want
to cluster the transcriptional profiles of the single cells, to
identify groups of cells with similar profiles.

Our focus this week is on understanding a classic clustering algorithm
called K-means clustering.

k-means clustering

We have $N$ data points $x_1, \dots, x_N$, and we want to assign them to
$K$ clusters.

Each $x_i$ is a point in a multidimensional space. It has
coordinates $x_{iz}$ for each dimension $z = 1 \dots Z$. In two dimensions
($Z = 2$) we can visualize the $x_i$ as dots on a scatter plot.
It’s easiest to think about K-means clustering in 2D, but important to
remember that many applications of it are in high dimensions. Gene
expression vectors for $G$ different genes in each cell are points
in a $G$-dimensional space.

A cluster is a group of points that are close together, for some
definition of “close”. The easiest to imagine is “close in space”. The
Euclidean distance between any two points $x_i$ and $x_j$ is

$$
d(x_i, x_j) = \sqrt{ \sum_{z=1}^{Z} (x_{iz} - x_{jz})^2 }
$$

definition of a k-means clustering

Suppose we know there are $K$ clusters. (This is a drawback of
K-means: we have to specify $K$.) Suppose we assign each point
$i$ to a cluster (i.e. a set) $C_k$, so we can refer to the
indices $i \in C_k$. We can calculate the centroid of each
cluster, $\mu_k$: its mean location, its center of mass, simply by
averaging independently in each dimension $z$:

$$
\mu_{kz} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{iz}
$$

where the denominator $|C_k|$ is the number of
points in cluster $k$ – the cardinality of set $C_k$, if you
haven’t seen that notation before.

The centroid $\mu_k$ is just another point in the same multidimensional space.
We can calculate a distance $d(x_i, \mu_k)$ of any of our items $x_i$ to
a centroid $\mu_k$, the same way we calculated the distance between
two points $d(x_i, x_j)$.

For any given assignment $C$ consisting of sets $C_1, \dots, C_K$, we
can calculate the centroids $\mu_k$, and then calculate the
distance from each point $x_i$ to its assigned centroid
$\mu_k$. We can calculate a total distance $D$ for a clustering $C$:

$$
D = \sum_{k=1}^{K} \sum_{i \in C_k} d(x_i, \mu_k)^2
$$

(using the squared distance, so that the mean is the centroid that
minimizes each cluster’s contribution). A K-means clustering of the data
is defined by finding a clustering $C$ that minimizes the total distance
$D$, given $K$. We literally try to place $K$ means (centroids) in the
space, such that each point is close to one of them. That’s
the definition of a K-means clustering; how do we find one?

k-means clustering algorithm

The following algorithm almost does the job – well enough for
practical purposes. (Almost, because it is a local optimizer rather
than a global optimizer, as we’ll discuss in a sec.) Start with some
randomly chosen centroids $\mu_k$. Then iterate:

Assignment step: Assign each data point $x_i$ to its closest
centroid $\mu_k$.

Update step: Calculate new centroids $\mu_k$, as the means of the
points currently assigned to each cluster $C_k$.

Iterate until the assignments don’t change.

It is possible for a centroid to have no points assigned to it, so you
have to do something to avoid dividing by zero in the update step. One
common thing to do is to leave such a centroid unchanged.

The algorithm is guaranteed to converge, albeit to a local rather than
a global optimum. It’s sometimes called Lloyd’s algorithm.
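Here’s a minimal NumPy sketch of Lloyd’s algorithm (a hypothetical
`kmeans` function of my own devising, not the p-set’s; it initializes
the centroids at $K$ randomly chosen data points, one of the strategies
discussed in the next section, and returns $D$ so you can compare runs):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=None):
    """Hard K-means (Lloyd's algorithm) on an (n, Z) array X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)  # init at K random data points
    prev = None
    for _ in range(max_iters):
        # assignment step: each point goes to its closest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K) squared distances
        assign = d2.argmin(axis=1)
        if prev is not None and np.array_equal(assign, prev):
            break                                # assignments stopped changing: converged
        prev = assign
        # update step: move each centroid to the mean of its assigned points
        for k in range(K):
            pts = X[assign == k]
            if len(pts) > 0:                     # leave an empty cluster's centroid unchanged
                mu[k] = pts.mean(axis=0)
    D = d2[np.arange(n), assign].sum()           # total (squared) distance
    return mu, assign, D
```

Because it only finds a local optimum, in practice you’d call this
several times with different seeds and keep the result with the lowest
$D$, as the next section discusses.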

local minima problem; initialization strategies

K-means is extremely prone to spurious local optima. For instance, you
can have perfectly obvious clusters yet end up with multiple centroids
on one cluster and none on another.

People take two main approaches to this problem. They try to choose
the initial guessed centroid locations “wisely”, and they run K-means
many times and take the best solution (best being the clustering that
gives the lowest total distance $D$).

Various ways of initializing the centroids include:

choose $K$ random points in space;

choose $K$ random points from the data $x_i$ and put the initial centroids
there;

assign all points $x_i$ randomly to $K$ clusters $C_k$, then calculate
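For illustration, here are those three strategies as NumPy snippets
(assuming the data are in an (n, Z) array `X`; the variable names are
my own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, Z, K = X.shape[0], X.shape[1], 3   # assumes X is an (n, Z) data array

# 1. K random points in space (here, within the data's bounding box)
mu = rng.uniform(X.min(axis=0), X.max(axis=0), size=(K, Z))

# 2. K random data points as the initial centroids
mu = X[rng.choice(n, size=K, replace=False)]

# 3. assign all points randomly to K clusters, then take their centroids
#    (with n >> K, each cluster is almost surely non-empty)
assign = rng.integers(K, size=n)
mu = np.array([X[assign == k].mean(axis=0) for k in range(K)])
```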

soft k-means

Standard K-means is distinguished by a hard assignment: data point
$i$ belongs to cluster $k$, and no other cluster. We can also
imagine doing a soft assignment, where we assign a probability that
point $i$ belongs to each cluster $k$.

Let’s introduce $A_i$, the assignment of point $i$, so we can
talk about $P(A_i = k \mid x_i, \mu)$: the probability that we assign
point $i$ to cluster $k$ given the K-means centroids $\mu$.

What should this probability be, since we haven’t described K-means in
probabilistic terms? (Not yet, anyway.) The standard definition of
soft k-means calculates a “responsibility” $r_{ik}$:

$$
r_{ik} = \frac{ e^{-\beta\, d(x_i, \mu_k)^2} }{ \sum_{k'} e^{-\beta\, d(x_i, \mu_{k'})^2} }
$$

where $\sum_k r_{ik} = 1$ for each point $i$, since the
denominator forces normalization. We call it a
“responsibility” instead of a probability because we haven’t given it
any well-motivated justification (again, not yet).

People call the parameter $\beta$ the stiffness. The higher $\beta$
is, the closer a centroid $\mu_k$ has to be to a point $x_i$
to get any responsibility for it. In the limit of $\beta \to \infty$,
the closest centroid gets $r_{ik} = 1$, and we
get standard K-means hard clustering as a special case.

The centroid of each cluster is now a weighted mean, with each
point $x_i$ weighted by its responsibility $r_{ik}$ (remember that each
point is a vector; we’re going to drop the indexing on its dimensions,
for more compact notation):

$$
\mu_k = \frac{ \sum_i r_{ik}\, x_i }{ \sum_i r_{ik} }
$$

The soft K-means algorithm starts from an initial guess of the
centroids $\mu_k$ and iterates:

Assignment step: Calculate responsibilities $r_{ik}$ for
each data point $i$, given the current centroids $\mu_k$.

Update step: Calculate new centroids $\mu_k$, given
responsibilities $r_{ik}$.
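A sketch of that iteration, assuming the squared-distance
responsibility equation above (again, the function and variable names
here are mine, not the p-set’s):

```python
import numpy as np

def soft_kmeans(X, K, beta, n_iters=100, seed=None):
    """Soft K-means on an (n, Z) array X, with stiffness beta."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # assignment step: responsibilities r_ik proportional to exp(-beta * d(x_i, mu_k)^2)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # (n, K)
        r = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))    # shift for numerical stability
        r /= r.sum(axis=1, keepdims=True)                           # rows sum to 1
        # update step: centroids become responsibility-weighted means
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
    return mu, r
```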

This “responsibility” business looks ad hoc. Is there a
justification for what we’re doing?

mixture models

A mixture model is a generative probability model composed of
other probability distributions. A mixture model is specified by:

$K$ components.

each component $k$ has a mixture coefficient $\pi_k$, the
probability of sampling each component ($\sum_k \pi_k = 1$).

each component $k$ has its own parameters $\theta_k$, and generates a data sample
$x$ with some probability $P(x \mid \theta_k)$.

The probability of sampling (generating) an observed data point $x$
goes through two steps:

sample a component $k$ according to its mixture coefficient $\pi_k$;

sample data point $x$ from the component $k$’s probability
distribution $P(x \mid \theta_k)$.

This is a simple example of a hierarchical model. More complicated
hierarchical models have more steps. It’s also a simple example of a
Bayesian network composed of two random variables, with a random
variable for the observed data dependent on a hidden random variable
for the component choice.

The probability that an observed data point $x$ is generated by a
mixture model is the marginal sum over the unknown component $k$:

$$
P(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, P(x \mid \theta_k)
$$

If we had to infer which (hidden) component $k$ generated observed
data point $x_i$, we can calculate the posterior probability:

$$
P(k \mid x_i, \theta) = \frac{ \pi_k\, P(x_i \mid \theta_k) }{ \sum_{k'} \pi_{k'}\, P(x_i \mid \theta_{k'}) }
$$

The mean (average) over many data points $x_i$ assigned to a mixture
is, for each component $k$:

$$
\mu_k = \frac{ \sum_i P(k \mid x_i, \theta)\, x_i }{ \sum_i P(k \mid x_i, \theta) }
$$

You can see this is the update step of soft K-means, with a
probability $P(k \mid x_i, \theta)$ in place of the ad hoc responsibility
$r_{ik}$. But there’s a missing piece: we haven’t said anything yet
about what the component probability distributions can be.

That’s because the component distributions $P(x \mid \theta_k)$ can be anything
you like. It’s most common to make them all the same elemental type of
distribution. So we can have mixture exponential distributions with
parameters $\lambda_k$; we can have Poisson mixture distributions
also with parameters $\lambda_k$; we can have mixture multinomial
distributions with parameters $\vec{p}_k$; and so on.

For example, one common mixture model is a mixture Gaussian
distribution, with parameters $\mu_k$ and $\sigma_k$ for each
component. This might arise in trying to fit data that had multiple
modes (peaks), with different peak widths, and different proportions
of data in each peak. A mixture Gaussian (in one dimension) would give:

$$
P(x \mid \theta) = \sum_{k=1}^{K} \pi_k\, \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(x - \mu_k)^2}{2\sigma_k^2}}
$$
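To make the two-step generative process concrete, here’s a small
sketch that samples from a one-dimensional mixture Gaussian (the
particular $\pi_k$, $\mu_k$, $\sigma_k$ values are invented for
illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
pi    = np.array([0.5, 0.3, 0.2])   # mixture coefficients (sum to 1)
mu    = np.array([0.0, 5.0, 10.0])  # component means
sigma = np.array([1.0, 0.5, 2.0])   # component standard deviations

def sample_mixture(n):
    k = rng.choice(len(pi), size=n, p=pi)   # step 1: pick a component k with probability pi_k
    return rng.normal(mu[k], sigma[k])      # step 2: sample x from component k's Gaussian

x = sample_mixture(10000)   # a histogram of x shows three peaks of different widths
```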

soft K-means as a mixture probability model

Let’s stare at that, comparing it to soft K-means. Specifically,
consider what the posterior probability $P(k \mid x_i, \theta)$
looks like for a mixture Gaussian, compared to the responsibility
calculation.

First notice that the vector-valued term $\|x_i - \mu_k\|^2$ in the
spherical Gaussian (below) is the square of the Euclidean distance between the
point $x_i$ and the mean $\mu_k$: $\|x_i - \mu_k\|^2 = d(x_i, \mu_k)^2$.
Already we see a parallel between the $e^{-\|x_i - \mu_k\|^2 / 2\sigma_k^2}$ of the
Gaussian and the $e^{-\beta\, d(x_i, \mu_k)^2}$ of soft K-means.

But what about $\sigma_k$? A mixture Gaussian requires that we
specify a variance $\sigma_k^2$ for each component, and no such term
appeared in soft K-means. Maybe soft K-means is making assumptions
that make it go away?

Remember that our $x_i$ are vectors with dimensions $z = 1 \dots Z$, and so
are our $\mu_k$. We could specify some sort of an asymmetric
Gaussian, possibly with correlations along arbitrary axes using a
covariance matrix, but let’s keep it simple. Let’s specify symmetric
spherical Gaussian distributions for our components, with a single
$\sigma_k$ per component:

$$
P(x_i \mid \mu_k, \sigma_k) = \left( \frac{1}{\sqrt{2\pi\sigma_k^2}} \right)^{Z} \exp\left( -\frac{\|x_i - \mu_k\|^2}{2\sigma_k^2} \right)
$$

Let’s go a step further, and assume that there is a constant $\sigma_k =
\sigma$ for all spherical Gaussian components $k$.

Let’s go another step further, and assume that all components are
equiprobable: $\pi_k = \frac{1}{K}$.

Now you should be able to prove to yourself that after making these
assumptions, the mixture Gaussian posterior probability equation
reduces to:

$$
P(k \mid x_i, \theta) = \frac{ e^{-d(x_i, \mu_k)^2 / 2\sigma^2} }{ \sum_{k'} e^{-d(x_i, \mu_{k'})^2 / 2\sigma^2} }
$$

As if you haven’t already recognized the equation for the
responsibilities $r_{ik}$ in soft K-means, let’s go ahead and define
a stiffness $\beta = \frac{1}{2\sigma^2}$, and:

$$
r_{ik} = \frac{ e^{-\beta\, d(x_i, \mu_k)^2} }{ \sum_{k'} e^{-\beta\, d(x_i, \mu_{k'})^2} }
$$

implicit assumptions of K-means

Having derived the responsibility update equation for soft K-means
from first principles, we’ve laid bare its implicit assumptions:

the data are drawn from a mixture Gaussian distribution;

composed of spherical Gaussians;

each component has an identical spherical Gaussian variance
$\sigma^2$ – each component is the same “width”; and

each component has an equal expected number of data points
($\pi_k = \frac{1}{K}$).

It should not be surprising that K-means may tend to fail when any of
these assumptions is violated. Standard descriptions of K-means will
tell you it can fail when the individual data clusters are
asymmetrically distributed, have different variances, or have
different sizes in terms of number of assigned data points.

fitting mixture models with expectation maximization

Soft K-means, as a probability model, is a simplified example of an
important class of optimization algorithm called expectation
maximization (EM).

Expectation maximization is a common way of fitting mixture models,
and indeed other models with hidden variables. You recall I already
mentioned that kallisto and other RNA-seq quantitation methods use EM
to infer which transcript a read maps to, when the read is compatible
with multiple transcripts. That’s essentially a clustering problem too.

For example, here’s how you’d use expectation maximization to fit a
mixture Gaussian model to data. To simplify things, let’s assume that
there is a single known, fixed variance parameter $\sigma^2$, as in a
soft K-means stiffness. But let’s relax one of the other K-means
assumptions, and allow different mixture coefficients $\pi_k$. Then,
starting from an initial guess at the means $\mu_k$:

Expectation step: Calculate posterior probabilities
$P(k \mid x_i, \theta)$ for each data point $i$, given the
current means (centroids) $\mu_k$, fixed $\sigma$, and mixture
coefficients $\pi_k$.

Maximization step: Calculate new centroids $\mu_k$ and mixture
coefficients $\pi_k$ given the posterior probabilities $P(k \mid x_i, \theta)$.

We can follow the total log likelihood $\sum_i \log P(x_i \mid \theta)$ as we
iterate, and watch it increase and asymptote. When we start from
different starting conditions, a better solution has a better total
log likelihood.

When we calculate the updated mixture coefficients $\pi_k$, that’s
just the expected fraction of data points assigned to component $k$:

$$
\pi_k = \frac{1}{N} \sum_{i=1}^{N} P(k \mid x_i, \theta)
$$

If we had to fit more than just a mean parameter $\mu_k$ for each
component, we would have some way of doing maximum likelihood
parameter estimation given posterior-weighted observations. (For
example, if we were also trying to estimate $\sigma_k$ terms, we
could calculate the expected variance for each component $k$, not
just the expected mean.)
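Here’s a sketch of that EM iteration for a one-dimensional mixture
Gaussian with a single known $\sigma$, fitting the $\mu_k$ and $\pi_k$
(a minimal version of my own, under the assumptions stated above):

```python
import numpy as np
from scipy.stats import norm

def em_fit(x, K, sigma, n_iters=100, seed=None):
    """EM for a 1D mixture Gaussian with known, shared sigma."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False).astype(float)  # initial guess at the means
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: posterior P(k | x_i) for every point/component pair
        logw = norm.logpdf(x[:, None], loc=mu[None, :], scale=sigma) + np.log(pi)
        logw -= logw.max(axis=1, keepdims=True)     # shift for numerical stability
        post = np.exp(logw)
        post /= post.sum(axis=1, keepdims=True)     # (n, K), rows sum to 1
        # M step: posterior-weighted means; pi_k = expected fraction of points in k
        mu = (post * x[:, None]).sum(axis=0) / post.sum(axis=0)
        pi = post.sum(axis=0) / len(x)
    # total log likelihood, to watch it increase and asymptote across runs
    ll = np.log((pi * norm.pdf(x[:, None], mu, sigma)).sum(axis=1)).sum()
    return mu, pi, ll
```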

the negative binomial distribution

Single-cell RNA-seq data consist of a small number of counts $x_g$
for each gene (or transcript) $g$. Many analysis approaches model
these observed counts directly, rather than trying to infer
quantitation (in TPM).

If a single-cell RNA-seq experiment generated observed counts from a
mean expression level like pulling balls out of urns, then the
differences we see across replicates of the same condition would just
be technical variation due to finite count sampling error. We
would expect the observed counts $x_g$ to be Poisson-distributed
around a mean $\mu_g$.

The Poisson distribution only has one parameter $\mu$. The variance
of the Poisson is equal to its mean ($\sigma^2 = \mu$). This is a
strong assumption about the variance.

Real count data from biological replicates have
more variance than the Poisson predicts: they are
overdispersed. This shouldn’t surprise you. If the samples are
outbred organisms with different genotypes, for example, we get
expression variation due to genotype variation. We also get variation
from a host of other environmental and experimental factors that are
changing when we do biological replicates.

Just as a practical matter, we want a distribution form that allows us
to control the variance separately from the mean, so we can model this
overdispersion. Empirically, two different distributions have been
seen to fit RNA-seq count data reasonably well: the negative
binomial distribution and the lognormal distribution. The p-set
this week is based on the negative binomial distribution.

definition of the negative binomial

$$
P(x \mid \mu, \phi) = \frac{\Gamma(x + \phi^{-1})}{\Gamma(\phi^{-1})\; x!}
\left( \frac{\phi^{-1}}{\mu + \phi^{-1}} \right)^{\phi^{-1}}
\left( \frac{\mu}{\mu + \phi^{-1}} \right)^{x}
$$

Though this may look like gibberish, it’s strongly related to the
familiar binomial distribution. Remember, gamma functions $\Gamma(x)$
are just the continuous extension of the familiar
factorial, where $\Gamma(x+1) = x!$ for integers $x$, so that
mess of gammas is just a binomial coefficient
$\binom{x + \phi^{-1} - 1}{x}$ in continuous form. The
terms inside parentheses can be written as $1-p$ and $p$, if we
say $p = \frac{\mu}{\mu + \phi^{-1}}$, thus
$1-p = \frac{\phi^{-1}}{\mu + \phi^{-1}}$, so again familiar terms of a binomial distribution
appear. Suppose $x$ is the number of “successes” that occur with
probability $p$, and $r = \phi^{-1}$ is the number of “failures”
that are allowed before we stop. The geometric distribution is the
probability of seeing $x$ successes (with probability $p$) before
one failure (with probability $1-p$). The negative binomial
distribution is the probability of seeing $x$ successes before $r$
failures.

The mean of the NB is $\mu$. Its variance is $\mu + \phi\mu^2$.

The parameter $\phi$ is the dispersion. As $\phi \to 0$, the NB
asymptotes to the Poisson.

[Figure: examples of the negative binomial distribution, compared to
the Poisson.]

Once we assume the NB, then for each gene $g$, we need two
parameters to calculate a likelihood of its counts in a sample: its
mean number of counts $\mu_g$, and its dispersion $\phi_g$.

SciPy’s negative binomial distribution

I mentioned that the NB appears in various notations. This is a
fiddly detail that can bite you in this week’s p-set. The NB is
defined in different ways, so when you plug numbers into it, you have
to be careful you’re using the right numbers. You may have to do some
change-of-variables manipulation first.

You’re going to want to calculate NB log likelihoods in the p-set, so
you need a routine to calculate the NB log PMF (probability mass
function). (The NB is a discrete distribution, with outcomes defined
over integers $x \geq 0$.)

So you’d naturally head over to
scipy.stats.nbinom
to find scipy.stats.nbinom.logpmf(). It takes three arguments: the
observed data count $k$, and two parameters $n$ and $p$. These
are not $\mu$ and $\phi$, alas, so we have to do some thinking.

To map between different notations for the NB, it helps to think about
where the negative binomial arises most naturally, by comparison to
the binomial. Suppose we have some sort of random discrete event
(like flipping heads with a coin or rolling a one on a die) where we
can talk about a probability $p$ of success and $1-p$ of
failure. Now we look at a series of events (Bernoulli trials, in
the jargon), and we ask: if I’m allowed $r$ failures before I have
to stop, how many successes $x$ am I likely to see?

In a chain of $x + r$ events, I’m allowed any combination of $x$ successes
and $r - 1$ failures in the first $x + r - 1$ events, but I know the final
event is a failure, because
the $r$’th failure is what terminates the chain. So the probability
of the chain is:

$$
P(x \mid p, r) = \binom{x + r - 1}{x}\, p^{x} (1-p)^{r-1} \cdot (1-p)
$$

which is just

$$
P(x \mid p, r) = \binom{x + r - 1}{x}\, p^{x} (1-p)^{r}
$$

Now the next thing to notice about this is that one person’s parameter
is another person’s random variable – I could say “what’s the
probability of seeing $x$ successes before the $r$’th failure”, or
I could just as easily say “what’s the probability that there are $r$
failures in a chain that has exactly $x$ successes (and ends
with a failure)?” So now we have to pay attention to whether someone’s
NB PMF is calculating $P(x \mid r, p)$ or whether they’re calculating
$P(r \mid x, p)$.

Which means we have to read the documentation (and we have to check
that it’s right – trust but verify, as Reagan urged us). We see that
although
Wikipedia
shows the PMF in exactly the terms I defined it in, that’s not what
SciPy
does. SciPy says
it calculates:

$$
f(k; n, p) = \binom{k + n - 1}{n - 1}\, p^{n} (1-p)^{k}
$$

which (sigh) is the other way: $n$ is associated with $p$,
$k$ is associated with $1-p$, so if we consider $1-p$ to be the
failure probability, SciPy is calculating the PMF of the number of
failures $k$, not successes.

Fine, if that’s what SciPy does, let’s just do whatever change of
variables we need to make our equation for $P(x \mid \mu, \phi)$
map onto SciPy’s PMF. Provided that you remember that
$\binom{a}{b} = \binom{a}{a-b}$ for integer $a, b$, you should be able to prove to yourself
that you need:

$$
n = \phi^{-1} \qquad\qquad p = \frac{1}{1 + \mu\phi}
$$

and now when you call scipy.stats.nbinom.logpmf(x,n,p), it will give
you $\log P(x \mid \mu, \phi)$.
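A small wrapper does the change of variables for you (a helper of my
own, not a SciPy function):

```python
import numpy as np
from scipy.stats import nbinom

def nb_logpmf(x, mu, phi):
    """log P(x | mu, phi): NB log PMF in mean/dispersion form,
    via the change of variables n = 1/phi, p = 1/(1 + mu*phi)."""
    return nbinom.logpmf(x, 1.0 / phi, 1.0 / (1.0 + mu * phi))

# quick sanity check: samples drawn with mu=10, phi=0.5 should have
# mean ~ 10 and variance ~ mu + phi * mu^2 = 60
s = nbinom.rvs(1 / 0.5, 1 / (1 + 10 * 0.5), size=100_000, random_state=1)
print(s.mean(), s.var())
```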

fitting the negative binomial distribution to data

Suppose I give you a bunch of samples $x_1, \dots, x_N$ that have been sampled
from a negative binomial distribution with unknown parameters
$\mu, \phi$. How do I estimate optimal parameter values?

Maximum likelihood fitting of an NB turns out to be a bit fiddly. For
almost all practical uses, an alternative parameter estimation
strategy suffices, and it’s a strategy worth knowing about in general:
method of moments (MOM) estimation.

The idea is that since I know the relationship between the NB
parameters and the mean and variance (in the limit of having lots of
observed data):

$$
\text{mean} = \mu \qquad\qquad \text{variance} = \mu + \phi\mu^2
$$

so why not calculate the mean $\bar{x}$ and variance $s^2$ of the data, then just
solve for the parameters:

$$
\hat{\mu} = \bar{x} \qquad\qquad \hat{\phi} = \frac{s^2 - \bar{x}}{\bar{x}^2}
$$

We can often get away with MOM estimation. In the p-set, you can. And
in fact, you don’t even need to estimate $\phi$, because it’s given
to you. You only need to estimate the $\mu$ parameter of the NB –
and that’s just the sample mean of the count data.
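As a sketch, the whole fitting procedure is two lines of arithmetic on
the sample moments (a hypothetical helper, with the caveat that
$\hat{\phi}$ can come out non-positive if the data happen to be
underdispersed):

```python
import numpy as np

def nb_mom(x):
    """Method-of-moments estimates (mu_hat, phi_hat) for NB-distributed
    counts x, from the relations mean = mu, variance = mu + phi * mu^2."""
    xbar = x.mean()
    s2 = x.var(ddof=1)                   # sample variance
    return xbar, (s2 - xbar) / xbar**2   # phi_hat <= 0 signals under/equi-dispersion
```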