week 09: maximizing your expectations

reading for this week

Strengthened by weeks of Python and probabilistic inference, full of
confidence, we’re going to pull a bunch of pieces together. This
week is a big week - sort of a review, sort of a midterm, sort of a
summing up. We know enough now to understand the heart of an
important RNA-seq data analysis paper,
Li, Ruotti, Stewart, Thomson, and Dewey (2010),
RNA-seq with read mapping uncertainty. We’re prepared to
understand its Methods section - the generative model in section
2.1, and the expectation maximization algorithm in section 2.2.

We’ve already been warmed up on expectation-maximization algorithms
in
our week on k-means clustering. Now
we see EM again in a different context. To really hammer it home, in
the Thursday lecture, we’re going to work through yet another EM
problem in biological data analysis: the problem of identifying a
small consensus DNA motif sequence shared by a set of long-ish
unaligned sequences. Some relevant background there includes
Detecting subtle sequence signals: a Gibbs sampling strategy for
multiple alignment,
Lawrence et al (1993),
and Identifying protein-binding sites from unaligned DNA
fragments,
Stormo and Hartzell (1989).

probabilistic graphical models

Figure 1 from
Li et al. (2010)
is a probabilistic graphical model, also known as a Bayesian
network. A lot can be said about probabilistic graphical models –
one good doorstopping textbook is the one by Koller and Friedman
– but we only need to know some basic essentials right now, and the
basic essentials are straightforward.

[Figure: the directed probabilistic graphical model from Li et al. (2010). "Directed" means there are arrows representing dependencies as conditional probabilities, as opposed to "undirected" graphs, which only show dependencies as connections (undirected edges).]

We use a graphical model as an intuitively powerful way to state our
assumptions about what the random variables are, and where the
statistical dependencies are (and aren’t) in an inference problem.
Specifically, in a so-called directed graphical model, we use
nodes (random variables) and arrows (directed edges, representing
conditional probability dependencies) to think about how we’re going
to start with an (often nastily complex) joint probability
distribution over all our random variables, and factor that down into
a product of conditional probability distributions that we can work
with.

the biological problem: ambiguous read mapping

Let’s start with the specific example of
Li et al. (2010).
First let’s state the RNA-seq analysis problem they’re trying to deal
with, in terms of how we imagine that the observed data are generated.
That is, let’s state our generative probability model for the
RNA-seq data.

The problem that they’re trying to tackle is that just because I
observe a read sequence, I don’t necessarily know which
transcript isoform to assign that read to. Transcripts can share
identical stretches of sequence because of overlapping isoforms, or
because of repetitive elements. Multimapped reads are reads that
the read mapping software can’t place uniquely, instead mapping them
to two or more possible sources. Li and Dewey’s framing distinguishes
gene multireads that map to different genes (because of repetitive
elements) from isoform multireads that map to different isoforms of
the same locus (because of overlapping exons).

Making the problem worse, DNA sequencing isn’t 100% accurate. We have
to worry about similar matches, not just identical matches. Read
mappers have to tolerate a small number of mismatches, typically one
or two. Now a read might map to similar sequences, not just identical
ones; and now we also have useful probabilistic information about the
source. Depending on a sequencing error model that we could
parameterize, we could guess that the read’s true source is more
likely to be more similar (or identical), compared to a less similar
potential source. It’s even possible, with enough sequencing error,
that we could mismap a read only to false sources, and miss the
true one.

the generative model

Li et al. state their generative model for one observed read as
follows.

There are M possible transcript isoforms, with true expression
abundances. These are nucleotide abundances, the same quantity
we’ve worked with before, as opposed to transcript abundances in TPM.

Select a transcript with probability given by its abundance.

Given the selected transcript, which has a known length, select a
starting point for a fixed-length sequencing read. This probability
could simply be uniform over the possible start positions. A
more sophisticated model could try to capture the fact that read
positions are known to be nonuniformly distributed because of end
effects in making libraries (fewer reads at the 5’ and 3’ ends), and
sequence composition biases (GC-rich reads are often
underrepresented, and standard cDNA synthesis protocols use a
specific nonrandom pool of oligonucleotide primers). Li et al. call
this a read start position distribution. We’ll just stick with a
uniform distribution for our discussion here. Li et al. also do some
hand-waving to deal with an edge effect, that not all start
positions can give you a complete read: they say, hey, there’s an
added poly-A tail that’s typically longer than the read length, so
even a start position near the 3’ end works (um… not really, but
we’ll go with it).

In strand-nonspecific RNA-seq protocols, we’re going to observe this
sequence in either orientation (forward/reverse, Watson/Crick);
while in strand-specific RNA-seq protocols, we’re usually going to
observe this sequence only in one orientation. So Li et al. say that
we sample an orientation for the read, a binary value 0
or 1. (They make the orientation depend on the choice of transcript,
because of a detail in their model that I’m glossing over: they
include a “transcript” that models unmapped reads, and for
an unmapped read the orientation is irrelevant. For our purposes, we
could just think of the orientation as independent of the transcript.)

Given a true source sequence and an orientation, we generate an
observed read by modeling sequencing error. For example, if
our sequencing process gives a 1% error rate per position, we’d
generate the read by dirtying up the true sequence with 1% error at each
position. Li et al. generalize this, and imagine a position-specific
error matrix: the probability that we see a given residue at a given
position of a read, when the true underlying residue is some other
(or the same) residue. The probability of the whole read is a product
of these individual per-position terms.
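To make that generative story concrete, here’s a minimal simulation sketch in Python. This is not Li et al.’s code: the toy transcript sequences, abundances, read length, and error rate are made-up placeholders, and it uses the uniform start-position and uniform error models discussed above (ignoring the poly-A edge effect).

    import numpy as np

    rng = np.random.default_rng(42)

    # toy transcriptome: made-up sequences and true nucleotide abundances (placeholders)
    transcripts = ["ACGTACGTACGTACGTACGT",
                   "TTGACCAGTACGTACGATCC",
                   "GGGTACGTACGTTTTTACGA"]
    nu         = np.array([0.5, 0.3, 0.2])   # true abundances, sum to 1
    readlen    = 8
    error_rate = 0.01                        # 1% per-position sequencing error

    def revcomp(s):
        comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
        return "".join(comp[a] for a in reversed(s))

    def sample_read():
        # 1. choose a transcript i with probability nu[i]
        i = rng.choice(len(transcripts), p=nu)
        # 2. choose a start position uniformly (ignoring the poly-A edge effect)
        j = rng.integers(0, len(transcripts[i]) - readlen + 1)
        # 3. choose an orientation (strand-nonspecific: 50/50)
        k = rng.integers(0, 2)
        true_seq = transcripts[i][j:j + readlen]
        if k == 1:
            true_seq = revcomp(true_seq)
        # 4. corrupt each position with probability error_rate
        read = []
        for a in true_seq:
            if rng.random() < error_rate:
                read.append(rng.choice([b for b in "ACGT" if b != a]))
            else:
                read.append(a)
        return i, j, k, "".join(read)

    i, j, k, read = sample_read()
    print(f"hidden: transcript {i}, start {j}, orientation {k}; observed read: {read}")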

In any modeling problem, some of our variables are observed data,
some of our variables are unknown model parameters, some are
fixed model parameters, and some are hidden (or latent)
variables that represent the process by which the model generates
observed data, but are not directly observed themselves.

Here, the read sequences are observed data. The abundances are the
unknown model parameters. The process that gets us to observed data
requires specifying three hidden variables per read: the transcript source,
the start position, and the orientation.

The transcript lengths are an example of fixed model
parameters that we assume are known. We could keep them in our
conditional probability expressions below, to be very explicit, but
for conciseness, we’re not going to.

the likelihood of the model

The likelihood of the model is the probability that the model
generates the observed read data, given the model parameters. But with
our hidden variables in play, what we actually know how to write down
is the joint probability of the observed data and the unknown hidden
variables. To get the likelihood, we have to marginalize (sum out) the
unknown hidden variables.

Hidden variables are sometimes called nuisance variables:
variables that have to be specified in order to calculate a likelihood
as an actual number, but whose values are not relevant or interesting
to us (at least at the moment), so we marginalize over their
uncertainty to make them disappear.
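Putting symbols to that (my notation, not necessarily Li et al.’s: R for an observed read, G for its hidden source transcript, S for its hidden start position, O for its hidden orientation, and ν for the abundance parameters):

$$
P(R \mid \nu) \quad \text{(the likelihood we want)}
$$

$$
P(R, G, S, O \mid \nu) \quad \text{(the joint probability we know how to write down)}
$$

$$
P(R \mid \nu) \;=\; \sum_{G} \sum_{S} \sum_{O} P(R, G, S, O \mid \nu) \quad \text{(marginalizing out the hidden variables)}
$$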

factorizing and stating independence assumptions

If we literally needed to specify the full joint probability directly, we
would have to specify (or at least imagine) an enormous
multidimensional table of probabilities for all possible combinations of outcomes,
even for just one fixed setting of the parameters. Then
we’d have one of those tables for each possible setting of the parameters.
Since the abundances are a vector of continuous real-valued probabilities,
we’d have an infinite number of tables. This is not good.

This is why we stated the generative model as a succession of steps:

given the abundances, choose a transcript;

given the transcript, choose a start position (uniformly, say);

given the transcript, choose an orientation (50/50, say);

given the transcript, start position, and orientation, generate the read using the sequencing error
model.

So, let’s do this. Without making any assumptions (yet), we can
correctly factorize our joint probability into a product of
conditional probabilities in different ways, including one that
corresponds to the order that we’re thinking of our generative process
in.
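Written out in that same assumed notation, one exact chain-rule factorization that follows the generative order is:

$$
P(R, O, S, G \mid \nu) \;=\; P(R \mid O, S, G, \nu)\; P(O \mid S, G, \nu)\; P(S \mid G, \nu)\; P(G \mid \nu)
$$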

This gives us our joint probability in terms of a product of
conditional probabilities of our individual random variables.

Now we can walk through each conditional probability term and formally
state our independence assumptions, which make our model
tractable. From right to left:

The probability of selecting an isoform depends only on the abundances,
so keep that term as it is.

The probability of selecting a read start position depends only
on the transcript’s length, hence only on the choice of transcript. It’s
independent of the abundances.

The probability of selecting an orientation depends (at most) on the
choice of transcript, and is independent of the start position and the
abundances. (Or, more simply, we could even treat it as a 50/50 choice
independent of everything.)

The probability of generating the observed read depends on the transcript,
start position, and orientation (which determine the true source sequence),
but is independent of the abundances.
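With those independence assumptions, the factorization simplifies (same assumed notation) to:

$$
P(R, O, S, G \mid \nu) \;=\; P(R \mid O, S, G)\; P(O \mid G)\; P(S \mid G)\; P(G \mid \nu)
$$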

Now look at the Li et al. Figure 1 again. Its nodes and arrows show
exactly these conditionings (after dropping the conditioning on fixed
model parameters like the transcript lengths, because they’re fixed). The
label on a node is the stuff to the left of the conditioning bar, the probability
distribution of that random variable; the arrows that come into the
node are the stuff to the right of the bar, its conditional dependencies. A
directed graphical model is just a convenient pictorial representation
of the conditional dependencies we want to assume in our model.

inference

Given a probability model, there are lots of things I might want to do
with it. For example, given observed data, I might want to know
the posterior distribution of the unknown parameters. I’d
write down Bayes’ rule for getting that posterior distribution,
in terms of the marginal likelihood. I’d be
forced to specify a prior probability distribution over the parameters
to do this.
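For the record, that Bayes’ rule statement looks like this (writing D for the observed data and ν for the parameters):

$$
P(\nu \mid D) \;=\; \frac{P(D \mid \nu)\, P(\nu)}{P(D)} \;=\; \frac{P(D \mid \nu)\, P(\nu)}{\int P(D \mid \nu')\, P(\nu')\, d\nu'}
$$

with the prior P(ν) that we’d be forced to specify, and the marginal likelihood P(D) in the denominator.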

If we could specify the prior and brute-force enumerate or sample
the space of all possible parameters (like in
the Student’s game night pset,
where we used a coarse-grained discretization of two parameters),
we could get this posterior probability distribution
by calculating the likelihood times the prior for all possible
parameter choices and normalizing. But here the parameter is an M-dimensional
probability vector, for M on the order of tens of thousands, the
number of different transcript isoforms in the transcriptome. We’re
not going to enumerate all the choices, even if we did assume a
coarse-grained discretization of some sort. Even supposing we did some
sort of binary discretization like high/low, there’d still be 2^M
possible settings we’d have to evaluate.

So this posterior inference problem is still a high-class problem for
us – buzzwords that we’ll see for solving problems like that are
things like Gibbs sampling and Markov chain Monte Carlo – and
we’re going to sneak up on it instead. Suppose, instead of getting the
posterior distribution over the parameters, I was happy to just make a
point estimate of an optimal parameter setting.

The first thing we have to worry about is, what do you mean optimal?

maximum likelihood parameter estimation

We’ve seen that a standard definition of optimal is maximum
likelihood: find the values of the parameters that maximize
the likelihood of the observed data.

Suppose I told you the count of reads that truly map to each
transcript. Then you’d simply estimate each abundance as that
transcript’s fraction of the total counts. That’s the maximum
likelihood estimate – but I gave you the counts, which means I had
to give you the hidden assignments of all the reads to their true
source transcripts. If we want to be super formal, we can write these
unknown “true counts per transcript” as a sum over reads of a
so-called Kronecker delta function, often seen as a notational trick
in counting subsets of things: the delta takes the value 1 when its
condition is true, and 0 when its condition is false. Here, “count all
the reads that are assigned to transcript i; that sum is the true count
for transcript i”.
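Formally (in my notation: c_i for the true count for transcript i, G_n for the hidden source transcript of read n, and N reads total), the counts and the maximum likelihood estimate of the abundances are:

$$
c_i \;=\; \sum_{n=1}^{N} \delta(G_n = i),
\qquad
\hat{\nu}_i \;=\; \frac{c_i}{N}
$$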

Your problem is that you don’t know these counts, because the
assignments of reads to transcripts are hidden and unknown.

inference of hidden variables

But, given the observed reads, if I told you the abundances, you
could infer the hidden assignments. By manipulating the likelihood
with Bayes’ rule, you can solve for the posterior probability of each
assignment in terms of the abundances, where the denominator is just a
renormalization: a marginal sum of the numerator over all possible transcripts.

Alas, we don’t know how to put numbers directly to the probability of a
read given only its source transcript; our model is such that we can put
numbers to the probability of a read given its transcript plus the
additional nuisance variables, the read start position and the
orientation. We get the likelihood term we need by marginalizing over
those nuisance variables.

So the computation would be to just calculate the probability for each possible assignment of
an observed read to a transcript – now you have a vector of
likelihoods, one per transcript – and you renormalize them to get the
posterior distribution over the assignment.

This is your uncertainty about the assignment of an observed read
to its underlying transcript. Now what you could do is assign each
transcript a fraction of a counted read, weighted by that
uncertainty. You’d be calculating the expected value of the
counts, given your uncertainty about the mapping. And then you could
renormalize the expected counts to get your maximum likelihood
estimate of the abundances.
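In the same notation as before (R_n for observed read n, G_n for its hidden source transcript, c_i for the count for transcript i, and ν for the abundances), the posterior assignment and the expected counts just described are:

$$
P(G_n = i \mid R_n, \nu) \;=\; \frac{\nu_i\, P(R_n \mid G_n = i)}{\sum_{i'} \nu_{i'}\, P(R_n \mid G_n = i')},
\qquad
P(R_n \mid G_n = i) \;=\; \sum_{S} \sum_{O} P(R_n, S, O \mid G_n = i)
$$

$$
\mathbb{E}[c_i] \;=\; \sum_{n=1}^{N} P(G_n = i \mid R_n, \nu),
\qquad
\hat{\nu}_i \;=\; \frac{\mathbb{E}[c_i]}{N}
$$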

Which is all just great, except that you had to know the abundances in the
first place to get your expected counts, and the abundances are what we’re
trying to estimate!

expectation maximization

We have a chicken-and-egg problem. If I told you the unknown
counts, you could estimate the unknown abundances by maximum
likelihood. If I told you the unknown abundances, you could estimate
the unknown (expected) counts by probabilistic inference from the observed read
data. But both the counts and the abundances are unknown.

Amazingly enough, the following apparently dumb solution works:

Initialize a starting guess at the abundances to anything (even random)

Iterate:

(Expectation:) Infer expected counts, given the current abundances.

(Maximization:) Calculate updated abundances, given the expected counts.

until converged.
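Here’s a minimal sketch of that loop in Python, for the abstracted single-hidden-variable version: it assumes we’ve already computed a matrix L of likelihoods, L[n,i] = P(read n | transcript i), which is where the start positions, orientations, and sequencing error model would have been marginalized away. The function name, variable names, and toy numbers are mine, not Li et al.’s.

    import numpy as np

    def em_abundances(L, n_iter=100, tol=1e-9):
        """EM for isoform abundances.
        L[n,i] = P(read n | transcript i), precomputed (hidden start position and
        orientation already marginalized out). Returns the estimated abundances nu."""
        N, M = L.shape
        nu = np.full(M, 1.0 / M)               # starting guess: uniform abundances
        for _ in range(n_iter):
            # E step: posterior assignment of each read to each transcript
            w = L * nu                          # unnormalized: nu_i * P(read n | i)
            w /= w.sum(axis=1, keepdims=True)   # renormalize over transcripts, per read
            c = w.sum(axis=0)                   # expected counts per transcript
            # M step: maximum likelihood update of the abundances
            nu_new = c / N
            if np.max(np.abs(nu_new - nu)) < tol:
                nu = nu_new
                break
            nu = nu_new
        return nu

    # toy example: 3 transcripts, 5 reads, made-up likelihoods
    L = np.array([[1e-3, 1e-6, 1e-6],
                  [1e-3, 1e-3, 1e-6],   # a "multiread" compatible with transcripts 0 and 1
                  [1e-6, 1e-3, 1e-6],
                  [1e-6, 1e-6, 1e-3],
                  [1e-3, 1e-3, 1e-3]])  # compatible with everything
    print(em_abundances(L))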

As we’ve already seen in k-means week, this is
Expectation Maximization,
or EM for short. It was formally and generally introduced in a famous
paper by
Dempster, Laird, and Rubin (1977),
though as you might imagine from its apparent simplicity, it had shown
up in various guises before.

Generally, EM is a local optimizer. For many problems that it is
applied to, different starting guesses may yield different optima, if
the posterior probability surface for the parameters is rough. It
happens that in this case, Li et al. (2010) prove that there is only a
single optimum (“we prove that the likelihood function for our model
is concave with respect to the parameters”, in the jargon), so EM
reaches a global optimum.

motif identification by expectation maximization

In the Li et al. (2010) problem, we have three latent variables (the
transcript, the start position, and the orientation), but we only need
to infer counts for one of them (the transcript). We have to push the
other two around and get them marginalized out. In the
homework problem this week, we work through a
more abstracted version of the problem that gets rid of the
orientation and the sequencing error model,
reducing the problem to a single hidden variable (the unknown
transcript isoform that a read came from). We saw that
k-means clustering was also posed in terms of a single hidden variable (the
unknown cluster assignment for a data point). To beat this horse
thoroughly, it might be useful to see another example of an important
biological data analysis problem where there’s only one hidden
variable besides our observed data and our unknown model
parameters.

Here’s the most basic case of the motif identification problem:
we’re given a set of DNA sequences. Each individual DNA
sequence is a string of A/C/G/T residues.

Embedded somewhere in each DNA sequence is exactly one motif, of
length W. A motif might be a protein binding site, for example. A
DNA-binding protein prefers certain sequences, and many DNA-binding
proteins, including transcription factors, have distinctive
site-specific preferences for individual residues: for instance we
might have a protein that likes GaattCC, where it cares a lot more
about the G and CC than it cares about the aatt.

W is short: typically 6-10 residues for typical DNA-binding
proteins. The sequences are long-ish: experiments like ChIP-seq
(where we recover DNA fragments bound to a protein of interest, and
sequence the bound fragments) might give us sequences that are much
longer than the motif (hundreds of residues, say).

The game is to identify the W-length motif that all the sequences
share, even though the motif is a probabilistic pattern of
preferences, not merely the same identical subsequence everywhere.

A good approximate model of the preferences of a DNA-binding protein
is a position-specific weight matrix (PWM). The PWM is a
generative probability model for a sequence of length W. In a PWM,
we assume that each residue of the motif is generated independently, with
position-specific probabilities for A/C/G/T. So the PWM consists of
W x 4 parameters: 4 parameters for the probability of getting A, C, G, or
T at each of the W positions.
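In symbols (writing p_k(a) for the probability of residue a at motif position k), the probability that the PWM generates a particular W-long motif sequence a_1 a_2 ... a_W is:

$$
P(a_1 a_2 \cdots a_W \mid \mathrm{PWM}) \;=\; \prod_{k=1}^{W} p_k(a_k),
\qquad
\sum_{a} p_k(a) = 1 \ \text{at each position}\ k
$$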

We can specify the unknown location of the motif in each larger
sequence by its start position. The possible start
positions range from the first residue up to W-1 residues short of the
end (there’s an edge effect; you can’t start too close to the end and
still have a complete motif). So each sequence is composed of three
segments: the residues before the motif are random background, the W
residues of the motif are a sample from the PWM, and the
remaining residues after the motif are random background.

If we wanted to be super compulsive, we could now state the
probability of generating an observed sequence, by stating that
the random (non-motif) residues in it are generated with probability
0.25 each for A, C, G, and T. But you’ll notice that every placement of
the motif leaves the same number of random non-motif residues, so
really the sequence probability is just a constant times the probability
that the PWM assigns to the W-mer starting at the motif position. We can
even drop the constant and just deal with a proportionality, since it’ll
turn out that we’ll be renormalizing the likelihood for other reasons anyway.

So there’s our data likelihood for one sequence: the probability
of the sequence, given that it contains one motif at a particular position,
and given the PWM. The likelihood for a set of independent sequences
is the product over each of them.

If I told you the unknown motif positions for a data set, you
could extract the individual motifs, align them (put them in register,
in columns 1..W), and collect the observed counts of residues
for each motif position and each residue.
Given those counts, your maximum likelihood estimate for the PWM
is just to use the residue frequencies at each position.
That is, if you find the motifs in 100 sequences, and at motif
position 1 you see 80 A, 10 C, 10 G, and 0 T, then you estimate
probabilities 0.8, 0.1, 0.1, and 0.0 for A, C, G, and T at position 1.
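In symbols (writing c_k(a) for the observed count of residue a in motif column k):

$$
\hat{p}_k(a) \;=\; \frac{c_k(a)}{\sum_{a'} c_k(a')}
$$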

If instead I told you the unknown PWM, then you could
calculate your expected counts by inferring the unknown
position of the motif in each sequence, via Bayes’ rule: the posterior
probability of each position is proportional to the data likelihood
(which we already stated above) times the prior for that position. A
priori (before we’ve seen any sequence data), all positions are
equiprobable, so the prior is a constant; it is present in both the
numerator and all terms of the denominator, so it disappears.
(Similarly, the proportionality constant we talked about for our
likelihood is going to disappear.) The denominator is just the sum of
the numerator over all possible choices of position.

Which means, in pseudocode-ish words:

For each possible position, calculate the likelihood of the sequence given that this is the motif’s true position.

Sum over positions and renormalize.

Given a posterior distribution over the uncertain motif location
in each observed sequence, you collect expected
counts by apportioning one total motif count per sequence across the
possible positions, according to that uncertainty. Written formally
this involves a Kronecker delta again, and may look harder than it
really is. In words:

for each sequence:

calculate the posterior distribution over motif positions, as above;

for each possible motif position:

for each column k of the PWM:

Let a be the residue you see in column k of the PWM, when you’ve
got the PWM aligned at that position.

Add a fraction of a count, equal to the posterior probability of that position, to the count for residue a in column k.
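Here’s a minimal sketch of that E step in Python. The names and setup are my own illustration, not any particular motif finder’s code; it uses the proportionality above (the uniform 0.25 background cancels, so only the PWM product over the W-mer at each candidate position matters), and the M step is noted in a comment at the end.

    import numpy as np

    BASES = "ACGT"

    def posterior_positions(seq, pwm):
        """Posterior over motif start positions j in one sequence, given a W x 4
        PWM of probabilities: P(j | seq, pwm) proportional to prod_k pwm[k, seq[j+k]]."""
        W = pwm.shape[0]
        x = np.array([BASES.index(a) for a in seq])
        npos = len(seq) - W + 1
        like = np.array([np.prod(pwm[np.arange(W), x[j:j + W]]) for j in range(npos)])
        return like / like.sum()

    def expected_counts(seqs, pwm):
        """E step: apportion one motif count per sequence across possible positions."""
        W = pwm.shape[0]
        counts = np.zeros((W, 4))
        for seq in seqs:
            post = posterior_positions(seq, pwm)
            x = np.array([BASES.index(a) for a in seq])
            for j, pj in enumerate(post):
                for k in range(W):
                    counts[k, x[j + k]] += pj   # add a fraction of a count
        return counts

    # M step would then be: pwm = counts / counts.sum(axis=1, keepdims=True)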

The EM algorithm is now essentially the same as we saw above for the
RNA-seq inference problem:

Initialize a starting guess at the PWM to anything (even random)

Iterate:

(Expectation:) Infer expected counts for the motif, given the current PWM.

(Maximization:) Calculate an updated PWM, given the expected counts.

until converged.

bonus: sampling random probability vectors from the Dirichlet

An interesting probability sampling problem arose when we said things
like “initialize the starting guess to something random”. We
don’t need to sample our starting guess uniformly for the Li et al.
(2010) EM algorithm – any guess will do – but suppose we did want to
make a completely random guess? How do we choose a
uniformly random multinomial probability vector of size K? We’re
talking about sampling from a specified probability distribution over
probability vectors. We’re going to need to think about such things,
because in some Bayesian inference problems, we’re going to want to
specify nonuniform priors over model probability parameters. Thinking
about how to sample a probability vector uniformly is one place to get
started.

You might think you could just sample each individual element
uniformly on 0..1 (using Python’s np.random.uniform, for example),
but of course that doesn’t work, because the elements are constrained
to be probabilities that sum to one.

[Figure: one bad way to sample probability vectors uniformly, and
two good ways. The examples show 1000 samples, plotted
in so-called "barycentric" coordinates (also known as
a "ternary plot"), to show the 3D simplex in 2D.
The top panel (the bad way) shows sampling by drawing uniform variates
and renormalizing. The middle panel shows sampling from
a Dirichlet distribution with all alpha=1. The bottom panel
shows sampling uniform variates, taking their log, and
renormalizing.]

So then you might think, fine, I’ll sample the K elements uniformly and then
renormalize the vector so it sums to one. But this
doesn’t sample probability vectors uniformly, as shown by the
clustering of samples in the middle of the simplex in the top panel of
the figure to the right.

The right way to do it is to use the Dirichlet distribution. The
Dirichlet is a distribution over multinomial probability vectors, with
one parameter (alpha) per component, and a normalization factor
written in terms of Gamma functions. With all alpha = 1, all
probability vectors are sampled uniformly, because the alpha-dependent
term in the density becomes a constant for all values of the
probability vector.
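Written out in a standard form (θ is the probability vector, α the parameter vector, Γ the Gamma function):

$$
P(\theta \mid \alpha) \;=\; \frac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},
\qquad
Z(\alpha) \;=\; \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left( \sum_{i=1}^{K} \alpha_i \right)}
$$

With all α_i = 1, the θ-dependent product equals 1 for every θ, so the density is flat over the simplex.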

So if you go over to Python’s np.random.dirichlet, you’ll find that
you can do:

    import numpy as np
    theta = np.random.dirichlet(np.ones(K))

and you’ll get a uniformly sampled probability vector theta of
size K. That’s illustrated in the middle panel of the figure.

Then it turns out that in the special case of all alpha = 1,
you can prove that sampling uniformly from the Dirichlet is equivalent
to sampling uniform variates, taking their logarithms, and renormalizing.
This special case might be useful to know, for example if you’re programming
in a language other than Python that doesn’t put an
np.random.dirichlet so conveniently at your fingertips. The bottom
panel of the figure shows samples drawn using this method.
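A quick sketch of that trick, side by side with np.random.dirichlet (variable names are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    K = 3

    # uniform sample from the simplex via the Dirichlet
    theta_dirichlet = rng.dirichlet(np.ones(K))

    # the special-case trick: log of uniform variates, renormalized
    # (equivalently, -log(u) are Exponential(1) draws; the signs cancel in the ratio)
    u = rng.uniform(size=K)
    theta_logtrick = np.log(u) / np.sum(np.log(u))

    print(theta_dirichlet, theta_dirichlet.sum())
    print(theta_logtrick, theta_logtrick.sum())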

The Dirichlet distribution will turn out to be useful to us for other
reasons than just sampling uniformly-distributed probability vectors,
as the next section delves further into.

a bit more on the Dirichlet

David MacKay often talked about the Dirichlet using an analogy to a
“dice factory”: a Dirichlet generates
probability vectors that you could use as parameters of a
die roll.

The Dirichlet appears a lot in probabilistic inference when we’re
working with likelihoods that are multinomial distributions (rolling
dice, flipping coins, Laplace counting girl versus boy births in
Paris, choosing an RNA isoform by its abundance in an RNA
population) and we want to specify a prior. If we’ve got multinomial
probability parameters that generated observed counts,
and we want to specify the joint probability of everything,
then we’ll break that down into a likelihood times a prior. The
Dirichlet, with its alpha parameters, gives us a mathematically
convenient way to parameterize that prior.

There are other ways one might parameterize a prior over probability
parameters. The advantage of the Dirichlet is that it plays well with
the multinomial; it is the multinomial’s conjugate prior. The
multinomial likelihood is a product of the probability parameters,
each raised to the power of its observed count, divided by a
normalization constant (the reciprocal of the multinomial
coefficient). When you put that together with a Dirichlet prior to get
a joint probability, you get something with the same functional form
as the Dirichlet itself, which is just as easy to handle as the
multinomial alone when we need to do any integration or
differentiation.
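In symbols (my notation: counts c_i, probability parameters θ_i, Dirichlet parameters α_i), the product of the multinomial likelihood and the Dirichlet prior is:

$$
P(c \mid \theta)\, P(\theta \mid \alpha) \;\propto\; \left( \prod_i \theta_i^{c_i} \right) \left( \prod_i \theta_i^{\alpha_i - 1} \right) \;=\; \prod_i \theta_i^{c_i + \alpha_i - 1}
$$

which has the same functional form as a Dirichlet with parameters c_i + α_i; that is what "conjugate prior" means.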

The binomial distribution is a special case of the multinomial, so
you can also apply a Dirichlet prior to a binomial, and integrate the
result using Beta functions, just like Laplace integrated the binomial
by itself.