A Short Tutorial on the Expectation-Maximization Algorithm

Detlef Prescher, Institute for Logic, Language and Computation, University of Amsterdam, prescher@science.uva.nl

1 Introduction

The paper gives a brief review of the expectation-maximization algorithm (Dempster, Laird, and Rubin 1977) in the comprehensible framework of discrete mathematics. In Section 2, two prominent estimation methods, relative-frequency estimation and maximum-likelihood estimation, are presented. Section 3 is dedicated to the expectation-maximization algorithm and a simpler variant, the generalized expectation-maximization algorithm. In Section 4, two loaded dice are rolled. Enjoy!

2 Estimation Methods

A statistics problem is a problem in which a corpus¹ that has been generated in accordance with some unknown probability distribution must be analyzed and some type of inference about the unknown distribution must be made. In other words, in a statistics problem there is a choice between two or more probability distributions which might have generated the corpus. In practice, there are often an infinite number of different possible distributions – statisticians bundle these into one single probability model – which might have generated the corpus. By analyzing the corpus, an attempt is made to learn about the unknown distribution. So, on the basis of the corpus, an estimation method selects one instance of the probability model, thereby aiming at finding the original distribution. In this section, two common estimation methods, the relative-frequency and the maximum-likelihood estimation, are presented.

Footnote 1: Statisticians use the term sample, but computational linguists prefer the term corpus.

Definition 1 Let X be a countable set of types. A real-valued function f : X → R is called a corpus if f's values are non-negative numbers: f(x) ≥ 0 for all x ∈ X. Each x ∈ X is called a type, and each value of f is called a type frequency. The corpus size² is defined as

|f| = Σ_{x∈X} f(x)

Finally, a corpus is called non-empty and finite if

0 < |f| < ∞

In this definition, type frequencies are defined as non-negative real numbers. The reason for not taking natural numbers is that some statistical estimation methods define type frequencies as weighted occurrence frequencies (which are not natural but non-negative real numbers). Later on, in the context of the EM algorithm, this point will become clear. Note also that a finite corpus might consist of an infinite number of types with positive frequencies. The following definition shows that Definition 1 covers the standard notion of the term corpus (used in Computational Linguistics) and of the term sample (used in Statistics).

Definition 2 Let x1, ..., xn be a finite sequence of type instances from X. Each xi of this sequence is called a token. The occurrence frequency of a type x in the sequence is defined as the following count:

f(x) = |{ i | xi = x }|

Obviously, f is a corpus in the sense of Definition 1, and it has the following properties: The type x does not occur in the sequence if f(x) = 0; in any other case there are f(x) tokens in the sequence which are identical to x. Moreover, the corpus size |f| is identical to n, the number of tokens in the sequence.
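Definition 2 can be sketched in a few lines of Python; the token sequence below is invented for illustration.

```python
from collections import Counter

# Build a corpus (Definitions 1 and 2): a frequency function f over types,
# computed from a finite sequence of tokens.
tokens = ["a", "b", "a", "c", "b", "a"]  # hypothetical token sequence

f = Counter(tokens)            # f(x) = occurrence frequency of type x
corpus_size = sum(f.values())  # |f| = sum of all type frequencies

print(f["a"], f["b"], f["c"])  # 3 2 1
print(corpus_size)             # 6, identical to the number of tokens
```

As the definition states, the corpus size |f| coincides with n, the number of tokens in the sequence.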

Relative-Frequency Estimation

Let us first present the notion of probability that we use throughout this paper.

Definition 3 Let X be a countable set of types. A real-valued function p : X → R is called a probability distribution on X if p has two properties: first, p's values are non-negative numbers

p(x) ≥ 0 for all x ∈ X

and second, p's values sum to one

Σ_{x∈X} p(x) = 1

Readers familiar with probability theory will certainly note that we use the term probability distribution in a sloppy way (Duda et al. (2001), page 611, introduce the term probability mass function instead). Standardly, probability distributions allocate a probability value p(A) to subsets A ⊆ X, so-called events of an event space X, such that three specific axioms are satisfied (see e.g. DeGroot (1989)):

Axiom 1: p(A) ≥ 0 for any event A.
Axiom 2: p(X) = 1.
Axiom 3: p(A1 ∪ A2 ∪ ...) = Σ_i p(Ai) for every sequence A1, A2, ... of pair-wise disjoint events.

Footnote 2: Note that the corpus size |f| is well-defined: The order of summation is not relevant for the value of the (possibly infinite) series Σ_{x∈X} f(x), since the types are countable and the type frequencies are non-negative numbers.

Figure 1: The conceptual difference between maximum-likelihood estimation (input: a corpus of data and a probability model; output: an instance of the probability model maximizing the corpus probability) and relative-frequency estimation (input: a corpus of data; output: the probability distribution comprising the relative frequencies of the corpus types).

Now, however, note that the probability distributions introduced in Definition 3 induce rather naturally the following probabilities for events A ⊆ X:

p(A) := Σ_{x∈A} p(x)

Using the properties of p(x), we can easily show that the probabilities p(A) satisfy the three axioms of probability theory. So, Definition 3 is justified and thus, for the rest of the paper, we are allowed to put axiomatic probability theory out of our minds.

Definition 4 Let f be a non-empty and finite corpus. The probability distribution p̃ : X → [0, 1] where

p̃(x) = f(x) / |f|

is called the relative-frequency estimate on f.

The relative-frequency estimation is the most comprehensible estimation method and has some nice properties which will be discussed in the context of the more general maximum-likelihood estimation. For now, however, note that p̃ is well defined, since both |f| > 0 and |f| < ∞. Moreover, it is easy to check that p̃'s values sum to one:

Σ_{x∈X} p̃(x) = Σ_{x∈X} (1/|f|) · f(x) = (1/|f|) · Σ_{x∈X} f(x) = (1/|f|) · |f| = 1
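Definition 4 translates directly into code; the corpus below (the same numbers as in Example 1 later on) is hypothetical.

```python
from collections import Counter

def relative_frequency_estimate(f):
    """Definition 4: p~(x) = f(x) / |f| for a non-empty, finite corpus f."""
    size = sum(f.values())
    assert 0 < size < float("inf"), "corpus must be non-empty and finite"
    return {x: freq / size for x, freq in f.items()}

f = Counter({"a": 2, "b": 3, "c": 5})  # hypothetical corpus
p = relative_frequency_estimate(f)

print(p)                # {'a': 0.2, 'b': 0.3, 'c': 0.5}
print(sum(p.values()))  # the values sum to one, as checked above
```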

Maximum-Likelihood Estimation

Maximum-likelihood estimation was introduced by R. A. Fisher in 1912, and will typically yield an excellent estimate if the given corpus is large. Most notably, maximum-likelihood estimators fulfill the so-called invariance principle and, under certain conditions which are typically satisfied in practical problems, they are even consistent estimators (DeGroot 1989). For these reasons, maximum-likelihood estimation is probably the most widely used estimation method. Now, unlike relative-frequency estimation, maximum-likelihood estimation is a fully-fledged estimation method that aims at selecting an instance of a given probability model which might have originally generated the given corpus. By contrast, the relative-frequency estimate is defined on the basis of a corpus only (see Definition 4). Figure 1 reveals the conceptual difference of both estimation methods. In what follows, we will pay some attention to describing the single setting in which we are exceptionally allowed to mix up both methods (see Theorem 1). Let us start, however, by presenting the notion of a probability model.

Definition 5 A non-empty set M of probability distributions on a set X of types is called a probability model on X. The elements of M are called instances of the model M. The unrestricted probability model is the set M(X) of all probability distributions on the set of types:

M(X) = { p : X → [0, 1] | Σ_{x∈X} p(x) = 1 }

A probability model M is called restricted in all other cases:

M ⊂ M(X) and M ≠ M(X)

In practice, most probability models are restricted since their instances are often defined on a set X comprising multi-dimensional types such that certain parts of the types are statistically independent (see Examples 4 and 5). Here is another side note: We already checked that the relative-frequency estimate is a probability distribution, meaning in terms of Definition 5 that the relative-frequency estimate is an instance of the unrestricted probability model. So, from an extreme point of view, relative-frequency estimation might also be regarded as a fully-fledged estimation method exploiting a corpus and a probability model (namely, the unrestricted model). In the following, we define maximum-likelihood estimation as a method that aims at finding an instance of a given model which maximizes the probability of a given corpus. Later on, we will see that maximum-likelihood estimates have an additional property: They are the instances of the given probability model that have a "minimal distance" to the relative frequencies of the types in the corpus (see Theorem 2). So, indeed, maximum-likelihood estimates can be intuitively thought of in the intended way: They are the instances of the probability model that might have originally generated the corpus.

Definition 6 Let f be a non-empty and finite corpus on a countable set X of types. Let M be a probability model on X. The probability of the corpus allocated by an instance p of the model M is defined as

L(f; p) = Π_{x∈X} p(x)^f(x)

An instance p̂ of the model M is called a maximum-likelihood estimate of M on f, if and only if the corpus f is allocated a maximum probability by p̂:

L(f; p̂) = max_{p∈M} L(f; p)

(Based on continuity arguments, we use the convention that p^0 = 1 and 0^0 = 1.)


Figure 2: Maximum-likelihood estimation and relative-frequency estimation yield the same estimate for some "exceptional" probability models. These models are lightly restricted or even unrestricted models that contain an instance comprising the relative frequencies of all corpus types (left-hand side). In practice, however, most probability models will not behave like that. So, maximum-likelihood estimation and relative-frequency estimation yield different estimates in most cases. As a further and more serious consequence, the maximum-likelihood estimates then have to be searched for by genuine optimization procedures (right-hand side).

By looking at this definition, we recognize that maximum-likelihood estimates are the solutions of a quite complex optimization problem. So, some nasty questions about maximum-likelihood estimation arise:

Existence: Is there for any probability model and any corpus a maximum-likelihood estimate of the model on the corpus?

Uniqueness: Is there for any probability model and any corpus a unique maximum-likelihood estimate of the model on the corpus?

Computability: For which probability models and corpora can maximum-likelihood estimates be efficiently computed?

For some probability models M, the following theorem gives a positive answer.

Theorem 1 Let f be a non-empty and finite corpus on a countable set X of types. Then:

(i) The relative-frequency estimate p̃ is a unique maximum-likelihood estimate of the unrestricted probability model M(X) on f.

(ii) The relative-frequency estimate p̃ is a maximum-likelihood estimate of a (restricted or unrestricted) probability model M on f, if and only if p̃ is an instance of the model M. In this case, p̃ is a unique maximum-likelihood estimate of M on f.

Proof Ad (i): Combine Theorems 2 and 3. Ad (ii): "⇒" is trivial, since a maximum-likelihood estimate of M is by definition an instance of M; "⇐" follows from (i). q.e.d.

At first glance, proposition (ii) seems to be more general than proposition (i), since proposition (i) is about one single probability model, the unrestricted model, whereas proposition (ii) gives some insight about the relation of the relative-frequency estimate to a maximum-likelihood estimate of arbitrary restricted probability models (see also Figure 2). Both propositions, however, are equivalent. As we will show later on, proposition (i) is equivalent to the famous information inequality of information theory, for which various proofs have been given in the literature.

Example 1 On the basis of the following corpus

f(a) = 2, f(b) = 3, f(c) = 5

we shall calculate the maximum-likelihood estimate of the unrestricted probability model M({a, b, c}), as well as the maximum-likelihood estimate of the restricted probability model

M = { p ∈ M({a, b, c}) | p(a) = 0.5 }

The solution is instructive, but is left to the reader.
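For readers who want to check their solution numerically, here is a small sketch. By Theorem 1, the unrestricted maximum-likelihood estimate is the relative-frequency estimate (0.2, 0.3, 0.5). For the restricted model, maximizing 0.5² · p(b)³ · p(c)⁵ subject to p(b) + p(c) = 0.5 splits the free mass in proportion to the frequencies 3 and 5, giving p(b) = 0.5 · 3/8 = 0.1875 and p(c) = 0.5 · 5/8 = 0.3125. The grid search below is only a sanity check, not part of the paper.

```python
def likelihood(f, p):
    """Definition 6: L(f; p) = product over x of p(x)**f(x)."""
    result = 1.0
    for x, freq in f.items():
        result *= p[x] ** freq
    return result

f = {"a": 2, "b": 3, "c": 5}

# Unrestricted model: the MLE is the relative-frequency estimate (Theorem 1).
p_unrestricted = {"a": 0.2, "b": 0.3, "c": 0.5}

# Restricted model with p(a) = 0.5 fixed: the remaining mass 0.5 is split
# in proportion 3 : 5 between b and c.
p_restricted = {"a": 0.5, "b": 0.1875, "c": 0.3125}

# Sanity check: no instance of the restricted model beats p_restricted.
best = max(
    likelihood(f, {"a": 0.5, "b": 0.5 * t, "c": 0.5 * (1 - t)})
    for t in [i / 1000 for i in range(1, 1000)]
)
print(likelihood(f, p_restricted) >= best)                            # True
print(likelihood(f, p_unrestricted) >= likelihood(f, p_restricted))   # True
```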

Definition 7 Let p and q be two probability distributions on a countable set X of types. The relative entropy of p with respect to q is defined as

D(p || q) = Σ_{x∈X} p(x) · log (p(x)/q(x))

(Based on continuity arguments, we use the conventions that 0 · log(0/q) = 0, p · log(p/0) = ∞, and 0 · log(0/0) = 0. The logarithm is calculated with respect to base 2.)

Connecting maximum-likelihood estimation with the concept of relative entropy, the following theorem gives the important insight that the relative entropy of the relative-frequency estimate is minimal with respect to a maximum-likelihood estimate.

Theorem 2 Let p̃ be the relative-frequency estimate on a non-empty and finite corpus f, and let M be a probability model on the set X of types. Then: An instance p̂ of the model M is a maximum-likelihood estimate of M on f, if and only if the relative entropy of p̃ with respect to p̂ is minimal:

D(p̃ || p̂) = min_{p∈M} D(p̃ || p)

So, finally, minimizing the relative entropy D(p̃ || p) is equivalent to maximizing the corpus probability L(f; p).³ Together with Theorem 2, the following theorem, the so-called information inequality of information theory, proves Theorem 1. The information inequality simply states that the relative entropy is a non-negative number, which is zero if and only if the two probability distributions are equal.

Theorem 3 (Information Inequality) Let p and q be two probability distributions. Then

D(p || q) ≥ 0

with equality if and only if p(x) = q(x) for all x ∈ X.

Proof See, e.g., Cover and Thomas (1991), page 26.
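The information inequality can be checked numerically. The two distributions below are hypothetical, and the helper follows the continuity conventions stated with Definition 7.

```python
from math import log2

def relative_entropy(p, q):
    """D(p || q) = sum over x of p(x) * log2(p(x)/q(x)), with 0*log(0/q) = 0."""
    d = 0.0
    for x, px in p.items():
        if px > 0:
            qx = q.get(x, 0.0)
            if qx == 0:
                return float("inf")  # convention: p * log(p/0) = infinity
            d += px * log2(px / qx)
    return d

p = {"a": 0.2, "b": 0.3, "c": 0.5}      # hypothetical distributions
q = {"a": 1/3, "b": 1/3, "c": 1/3}

print(relative_entropy(p, q) >= 0)       # True (information inequality)
print(relative_entropy(p, p))            # 0.0 (equality iff p = q)
```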

*Maximum-Entropy Estimation

Readers only interested in the expectation-maximization algorithm are encouraged to omit this section. For completeness, however, note that the relative entropy is asymmetric. That means, in general,

D(p || q) ≠ D(q || p)

It is easy to check that the triangle inequality is not valid either. So, the relative entropy D(.||.) is not a "true" distance function. On the other hand, D(.||.) has some of the properties of a distance function. In particular, it is always non-negative, and it is zero if and only if p = q (see Theorem 3). So far, however, we aimed at minimizing the relative entropy with respect to its second argument, filling the first argument slot of D(.||.) with the relative-frequency estimate p̃. Obviously, these observations raise the question whether it is also possible to derive other "good" estimates by minimizing the relative entropy with respect to its first argument. So, in terms of Theorem 2, it might be interesting to ask for model instances p* ∈ M with

D(p* || p̃) = min_{p∈M} D(p || p̃)

For at least two reasons, however, this initial approach of relative-entropy estimation is too simplistic. First, it is tailored to probability models that lack any generalization power. Second, it does not provide deeper insight when estimating constrained probability models. Here are the details:

Footnote 3: For completeness, note that the perplexity of a corpus f allocated by a model instance p is defined as perp(f; p) = 2^H(p̃; p). This yields

perp(f; p) = (1/L(f; p))^(1/|f|)  and  L(f; p) = (1/perp(f; p))^|f|

as well as the common interpretation that the perplexity value measures the complexity of the given corpus from the model instance's view: the perplexity is equal to the size of an imaginary word list from which the corpus can be generated by the model instance, assuming that all words on this list are equally probable. Moreover, the equations state that minimizing the corpus perplexity perp(f; p) is equivalent to maximizing the corpus probability L(f; p).
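The equivalence of the two perplexity formulas in the footnote can be verified numerically; the corpus and the model instance below are invented for illustration.

```python
from math import log2, prod

f = {"a": 2, "b": 3, "c": 5}           # hypothetical corpus
p = {"a": 0.25, "b": 0.25, "c": 0.5}   # some hypothetical model instance
size = sum(f.values())                  # |f|

# Cross entropy H(p~; p) = -sum over x of p~(x) * log2 p(x), with p~(x) = f(x)/|f|
cross_entropy = -sum((fx / size) * log2(p[x]) for x, fx in f.items())

# Corpus probability L(f; p) = product over x of p(x)**f(x)
L = prod(p[x] ** fx for x, fx in f.items())

perp_entropy = 2 ** cross_entropy        # perp(f; p) = 2**H(p~; p)
perp_likelihood = (1 / L) ** (1 / size)  # perp(f; p) = (1/L(f; p))**(1/|f|)

print(abs(perp_entropy - perp_likelihood) < 1e-9)  # True: both formulas agree
```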

- A closer look at Definition 7 reveals that the relative entropy D(p || p̃) is finite only for those model instances p ∈ M that fulfill

p̃(x) = 0 ⇒ p(x) = 0

So, the initial approach would lead to model instances that are completely unable to generalize, since they are not allowed to allocate positive probabilities to at least some of the types not seen in the training corpus.

- Theorem 2 guarantees that the relative-frequency estimate p̃ is a solution to the initial approach of relative-entropy estimation whenever p̃ ∈ M. Now, Definition 8 introduces the constrained probability models Mconstr, and indeed, it is easy to check that p̃ is always an instance of these models. In other words, estimating constrained probability models by the approach above does not result in interesting model instances.

Clearly, all the mentioned drawbacks are due to the fact that the relative-entropy minimization is performed with respect to the relative-frequency estimate. As a remedy, we simply switch to a more convenient reference distribution, thereby formally generalizing the initial problem setting. So, as the final request, we ask for model instances p* ∈ M with

D(p* || p0) = min_{p∈M} D(p || p0)

In this setting, the reference distribution p0 ∈ M(X) is a given instance of the unrestricted probability model, and from what we have seen so far, p0 should allocate a positive probability to all types of interest, and moreover, p0 should not itself be an instance of the probability model M. Indeed, this request will lead us to the interesting maximum-entropy estimates. Note first that

D(p || p0) = H(p; p0) − H(p)

So, minimizing D(p || p0) as a function of the model instances p is equivalent to minimizing the cross entropy H(p; p0) and simultaneously maximizing the model entropy H(p). Now, simultaneous optimization is a hard task in general, and this gives reason to focus first on maximizing the entropy H(p) in isolation. The following definition presents maximum-entropy estimation in terms of the well-known maximum-entropy principle (Jaynes 1957). Sloppily formulated, the maximum-entropy principle recommends maximizing the entropy H(p) as a function of the instances p of certain "constrained" probability models.

Definition 8 Let f1, ..., fd be a finite number of real-valued functions on a set X of types, the so-called feature functions.⁴ Let p̃ be the relative-frequency estimate on a non-empty and finite corpus f. Then the constrained probability model Mconstr consists of those model instances whose expected feature values match the observed feature values:

Mconstr = { p ∈ M(X) | Σ_{x∈X} p(x) · fi(x) = Σ_{x∈X} p̃(x) · fi(x) for i = 1, ..., d }

Footnote 4: Each of these feature functions can be thought of as being constructed by inspecting the set of types, thereby measuring a specific property of the types x ∈ X. For example, if working in a formal-grammar framework, then it might be worthwhile to look (at least) at some feature functions fr directly associated with the rules r of the given formal grammar. The "measure" fr(x) of a specific rule r for the analyzes x ∈ X of the grammar might be calculated, for example, in terms of the occurrence frequency of r in the sequence of those rules which are necessary to produce x. For instance, Chi (1999) studied this approach for the context-free grammar formalism. Note, however, that there is in general no recipe for constructing "good" feature functions: Often, it is really an intellectual challenge to find those feature functions that describe the given data as well as possible (or at least in a satisfying manner).

Furthermore, a model instance p* ∈ Mconstr is called a maximum-entropy estimate of Mconstr if and only if

H(p*) = max_{p∈Mconstr} H(p)

It is well known that the maximum-entropy estimates have some nice properties. For example, as Definition 9 and Theorem 4 show, they can be identified as the unique maximum-likelihood estimates of the so-called exponential models (which are also known as log-linear models).

Definition 9 Let f1, ..., fd be a finite number of feature functions on a set X of types. The exponential model of f1, ..., fd is defined by

Mexp = { p ∈ M(X) | p(x) = (1/Zλ) · e^(λ1·f1(x) + ... + λd·fd(x)) with λ1, ..., λd, Zλ ∈ R }

Here, the normalizing constant Zλ (with λ as a short form for the sequence λ1, ..., λd) guarantees that p ∈ M(X), and it is given by

Zλ = Σ_{x∈X} e^(λ1·f1(x) + ... + λd·fd(x))
Theorem 4 Let f be a non-empty and finite corpus, and let f1, ..., fd be a finite number of feature functions on a set X of types. Then:

(i) The maximum-entropy estimates of Mconstr are instances of Mexp, and the maximum-likelihood estimates of Mexp on f are instances of Mconstr.

(ii) If p* ∈ Mconstr ∩ Mexp, then p* is both a unique maximum-entropy estimate of Mconstr and a unique maximum-likelihood estimate of Mexp on f.

Part (i) of the theorem simply suggests the form of the maximum-entropy or maximum-likelihood estimates we are looking for. By combining both findings of (i), however, the search space is drastically reduced for both estimation methods: We simply have to look at the intersection of the involved probability models. In turn, exactly this fact makes the second part of the theorem so valuable. If there is a maximum-entropy or a maximum-likelihood estimate, then it is in the intersection of both models, and thus, according to part (ii), it is a unique estimate, and even more, it is both a maximum-entropy and a maximum-likelihood estimate.

Figure 3: Maximum-likelihood estimation generalizes maximum-entropy estimation, as well as both variants of minimum relative-entropy estimation (where either the first or the second argument slot of D(.||.) is filled by a given probability distribution).

Proof See e.g. Cover and Thomas (1991), pages 266-278. For an interesting alternative proof of (ii), see Ratnaparkhi (1997). Note, however, that the proof of Ratnaparkhi's Theorem 1 is incorrect whenever the set X of types is infinite. Although Ratnaparkhi's proof is very elegant, it relies on the existence of a uniform distribution on X, which simply does not exist in this special case. By contrast, Cover and Thomas prove their Theorem 11.1.1 without using a uniform distribution on X, and so they indeed achieve the more general result.

Finally, we come back to our request of minimizing the relative entropy with respect to a given reference distribution p0 ∈ M(X). For constrained probability models, the relevant results do not differ much from the results described in Theorem 4. So, let

Mexp·ref = { p ∈ M(X) | p(x) = (1/Zλ) · e^(λ1·f1(x) + ... + λd·fd(x)) · p0(x) with λ1, ..., λd, Zλ ∈ R }

Then, along the lines of the proof of Theorem 4, it can also be proven that the following propositions are valid:

(i) The minimum relative-entropy estimates of Mconstr are instances of Mexp·ref, and the maximum-likelihood estimates of Mexp·ref on f are instances of Mconstr.

(ii) If p* ∈ Mconstr ∩ Mexp·ref, then p* is both a unique minimum relative-entropy estimate of Mconstr and a unique maximum-likelihood estimate of Mexp·ref on f.

All results are displayed in Figure 3.

3 The Expectation-Maximization Algorithm

The expectation-maximization algorithm was introduced by Dempster et al. (1977), who also presented its main properties. In short, the EM algorithm aims at finding maximum-likelihood estimates for settings where this appears to be difficult if not impossible. The trick of the EM algorithm is to map the given data to complete data on which it is well known how to perform maximum-likelihood estimation. Typically, the EM algorithm is applied in the following setting:

- Direct maximum-likelihood estimation of the given probability model on the given corpus is not feasible, for example, because the likelihood function is too complex (e.g., it is a product of sums).

- There is an obvious (but one-to-many) mapping to complete data, on which maximum-likelihood estimation can be easily done. The prototypical example is indeed that maximum-likelihood estimation on the complete data is already a solved problem.

Figure 4: Input and output of the EM algorithm (input: an incomplete-data corpus, a symbolic analyzer, and a complete-data model with a starting instance; output: a sequence of instances of the complete-data model aiming at maximizing the probability of the incomplete-data corpus).

Both relative-frequency and maximum-likelihood estimation are common estimation methods with a two-fold input, a corpus and a probability model (see Footnote 5) such that the instances of the model might have generated the corpus. The output of both estimation methods is simply an instance of the probability model, ideally, the unknown distribution that generated the corpus. In contrast to this setting, in which we are almost completely informed (the only thing that is not known to us is the unknown distribution that generated the corpus), the expectation-maximization algorithm is designed to estimate an instance of the probability model for settings in which we are incompletely informed. To be more specific, instead of a complete-data corpus, the input of the expectation-maximization algorithm is an incomplete-data corpus together with a so-called symbolic analyzer. A symbolic analyzer is a device assigning to each incomplete-data type a set of analyzes, each analysis being a complete-data type. As a result, the missing complete-data corpus can be partly compensated for by the expectation-maximization algorithm: The application of the symbolic analyzer to the incomplete-data corpus leads to an ambiguous complete-data corpus.

The ambiguity arises as a consequence of the inherent analytical ambiguity of the symbolic analyzer: the analyzer can replace each token of the incomplete-data corpus by a set of complete-data types (the set of its analyzes), but clearly, the symbolic analyzer is not able to resolve the analytical ambiguity. The expectation-maximization algorithm performs a sequence of runs over the resulting ambiguous complete-data corpus. Each of these runs consists of an expectation step followed by a maximization step. In the E step, the expectation-maximization algorithm combines the symbolic analyzer with an instance of the probability model. The result of

Footnote 5: We associate the relative-frequency estimate with the unrestricted probability model.


this combination is a statistical analyzer which is able to resolve the analytical ambiguity introduced by the symbolic analyzer. In the M step, the expectation-maximization algorithm calculates an ordinary maximum-likelihood estimate on the resolved complete-data corpus. In general, however, a sequence of such runs is necessary. The reason is that we never know which instance of the given probability model leads to a good statistical analyzer, and thus, which instance of the probability model shall be used in the E-step. The expectation-maximization algorithm provides a simple but somewhat surprising solution to this serious problem: At the beginning, a randomly generated starting instance of the given probability model is used for the first E-step. In further iterations, the estimate of the M-step is used for the next E-step. Figure 4 displays the input and the output of the EM algorithm. The procedure of the EM algorithm is displayed in Figure 5.

Symbolic and Statistical Analyzers

Definition 10 Let X and Y be non-empty and countable sets. A function A : Y → 2^X is called a symbolic analyzer if the (possibly empty) sets of analyzes A(y) ⊆ X are pair-wise disjoint, and the union of all sets of analyzes is complete:

X = ∪_{y∈Y} A(y)

In this case, Y is called the set of incomplete-data types, whereas X is called the set of complete-data types. So, in other words, the analyzes A(y) of the incomplete-data types y form a partition of the complete-data types X. Therefore, for each x ∈ X there exists a unique y ∈ Y, the so-called yield of x, such that x is an analysis of y:

y = yield(x) if and only if x ∈ A(y)

For example, if working in a formal-grammar framework, the grammatical sentences can be interpreted as the incomplete-data types, whereas the grammatical analyzes of the sentences are the complete-data types. So, in terms of Definition 10, a so-called parser (a device assigning a set of grammatical analyzes to a given sentence) is clearly a symbolic analyzer: The most important thing to check is that the parser does not assign a given grammatical analysis to two different sentences, which is pretty obvious if the sentence words are part of the grammatical analyzes.

Definition 11 A pair ⟨A, p⟩ consisting of a symbolic analyzer A and a probability distribution p on the complete-data types X is called a statistical analyzer. We use a statistical analyzer to induce probabilities for the incomplete-data types y ∈ Y:

p(y) := Σ_{x∈A(y)} p(x)

Even more important, we use a statistical analyzer to resolve the analytical ambiguity of an incomplete-data type y ∈ Y by looking at the conditional probabilities of its analyzes x ∈ A(y):

p(x|y) := p(x) / p(y)   where y = yield(x)

Figure 5: Procedure of the EM algorithm. An incomplete-data corpus, a symbolic analyzer (a device assigning to each incomplete-data type a set of complete-data types), and a complete-data model are given. In the E step, the EM algorithm combines the symbolic analyzer with an instance q of the probability model. The result of this combination is a statistical analyzer that is able to resolve the ambiguity of the given incomplete data. In fact, the statistical analyzer is used to generate an expected complete-data corpus fq. In the M step, the EM algorithm calculates an ordinary maximum-likelihood estimate of the complete-data model on the complete-data corpus generated in the E step. In further iterations, the estimates of the M-steps are used in the subsequent E-steps. The output of the EM algorithm consists of the estimates that are produced in the M steps.

It is easy to check that the statistical analyzer induces a proper probability distribution on the set Y of incomplete-data types:

Σ_{y∈Y} p(y) = Σ_{y∈Y} Σ_{x∈A(y)} p(x) = Σ_{x∈X} p(x) = 1

Of course, by defining p(x|y) = 0 for y ≠ yield(x), p(.|y) is even a probability distribution on the full set X of analyzes.

Input, Procedure, and Output of the EM Algorithm

Definition 12 The input of the expectation-maximization (EM) algorithm is:

(i) a symbolic analyzer, i.e., a function A which assigns a set of analyzes A(y) ⊆ X to each incomplete-data type y ∈ Y, such that all sets of analyzes form a partition of the set X of complete-data types:

X = ∪_{y∈Y} A(y)

(ii) a non-empty and finite incomplete-data corpus, i.e., a frequency distribution f on the set of incomplete-data types:

f : Y → R with f(y) ≥ 0 for all y ∈ Y and 0 < |f| < ∞

(iii) a complete-data model M ⊆ M(X), i.e., a probability model on the set of complete-data types.

(*) implicit input: an incomplete-data model M ⊆ M(Y) induced by the symbolic analyzer and the complete-data model. To see this, recall Definition 11. Together with a given instance of the complete-data model, the symbolic analyzer constitutes a statistical analyzer which, in turn, induces the following instance of the incomplete-data model:

p : Y → [0, 1] with p(y) = Σ_{x∈A(y)} p(x)

(Note: For both complete and incomplete data, the same notation symbols M and p are used. The sloppy notation, however, is justified, because the incomplete-data model is a marginal of the complete-data model.)

(iv) a (randomly generated) starting instance p0 of the complete-data model M. (Note: If permitted by M, then p0 should not assign a probability of zero to any x ∈ X.)

Definition 13 The procedure of the EM algorithm is:

(1) for each i = 1, 2, 3, ... do
(2)   q := p_{i−1}
(3)   E-step: compute the complete-data corpus fq : X → R expected by q:
          fq(x) := f(y) · q(x|y)   where y = yield(x)
(4)   M-step: compute a maximum-likelihood estimate p̂ of the complete-data model M on fq, and set p_i := p̂

In line (3) of the EM procedure, a complete-data corpus fq(x) has to be generated on the basis of the incomplete-data corpus f(y) and the conditional probabilities q(x|y) of the analyzes of y (conditional probabilities are introduced in Definition 11). In fact, this generation procedure is conceptually very easy: according to the conditional probabilities q(x|y), the frequency f(y) has to be distributed among the complete-data types x ∈ A(y). Figure 6 displays the procedure. Moreover, there exists a simple reversed procedure (summation of all frequencies fq(x) with x ∈ A(y)) which guarantees that the original corpus f(y) can be recovered from the generated corpus fq(x). Finally, the size of both corpora is the same:

|fq| = |f|

Figure 6: The E step of the EM algorithm. A complete-data corpus fq(x) is generated on the basis of the incomplete-data corpus f(y) and the conditional probabilities q(x|y) of the analyzes of y. The frequency f(y) is distributed among the complete-data types x ∈ A(y) according to the conditional probabilities q(x|y). A simple reversed procedure guarantees that the original incomplete-data corpus f(y) can be recovered from the generated corpus fq(x): sum up all frequencies fq(x) with x ∈ A(y). So the size of both corpora is the same: |fq| = |f|. Memory hook: fq is the q-omplete data corpus.

In line (4) of the EM procedure, it is stated that a maximum-likelihood estimate p̂ of the complete-data model has to be computed on the complete-data corpus fq expected by q. Recall for this purpose that the probability of fq allocated by an instance p ∈ M is defined as

L(fq; p) = Π_{x∈X} p(x)^fq(x)
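The distribution step of line (3), and the reversed recovery procedure, can be sketched in a few lines; the frequencies and conditional probabilities below are hypothetical.

```python
# E-step of line (3), sketched: distribute each f(y) among the analyses
# x in A(y) according to q(x|y).
f = {"y1": 6.0, "y2": 4.0}                                   # incomplete-data corpus
A = {"y1": ["x11", "x12"], "y2": ["x21", "x22"]}             # disjoint analysis sets
q_cond = {"x11": 0.75, "x12": 0.25, "x21": 0.5, "x22": 0.5}  # q(x|y)

fq = {x: f[y] * q_cond[x] for y in f for x in A[y]}
print(fq)  # {'x11': 4.5, 'x12': 1.5, 'x21': 2.0, 'x22': 2.0}

# Reversed procedure: summing fq over A(y) recovers f(y), so |fq| = |f|.
recovered = {y: sum(fq[x] for x in A[y]) for y in f}
print(recovered == f)                       # True
print(sum(fq.values()) == sum(f.values()))  # True: |fq| = |f|
```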

In contrast, the probability of the incomplete-data corpus f allocated by an instance p of the incomplete-data model is much more complex. Using Definition 12, we get an expression involving a product of sums:

L(f; p) = ∏_{y ∈ Y} ( Σ_{x ∈ A(y)} p(x) )^{f(y)}
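Both corpus probabilities are easy to compute in log-space, which avoids numerical underflow for realistically sized corpora. The following sketch assumes the same dictionary representation as before; all function names are illustrative, not from the paper.

```python
import math

def log_L_complete(fq, p):
    # log L(fq; p) = sum_x fq(x) * log p(x)
    return sum(freq * math.log(p[x]) for x, freq in fq.items())

def log_L_incomplete(f, analyses, p):
    # log L(f; p) = sum_y f(y) * log( sum_{x in A(y)} p(x) )
    return sum(freq * math.log(sum(p[x] for x in analyses[y]))
               for y, freq in f.items())

# Toy instance of the complete-data model on X = {a, b, c}
p = {"a": 0.2, "b": 0.3, "c": 0.5}
analyses = {"y1": ["a", "b"], "y2": ["c"]}
f = {"y1": 2.0, "y2": 1.0}
ll = log_L_incomplete(f, analyses, p)  # = 2*log(0.5) + 1*log(0.5)
```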

Nevertheless, the following theorem reveals that the EM algorithm aims at finding an instance of the incomplete-data model which possibly maximizes the probability of the incomplete-data corpus.


Theorem 5 The output of the EM algorithm is a sequence of instances of the complete-data model M, the so-called EM re-estimates,

p0, p1, p2, p3, ...

such that the sequence of probabilities allocated to the incomplete-data corpus is monotonically increasing:

L(f; p0) ≤ L(f; p1) ≤ L(f; p2) ≤ L(f; p3) ≤ ...

It is common wisdom that the sequence of EM re-estimates converges to a (local) maximum-likelihood estimate of the incomplete-data model on the incomplete-data corpus. As proven by Wu (1983), however, the EM algorithm will do this only under specific circumstances. Of course, it is guaranteed that the sequence of corpus probabilities (allocated by the EM re-estimates) must converge. However, we are more interested in the behavior of the EM re-estimates themselves. Intuitively, the EM algorithm might get stuck in a saddle point or even a local minimum of the corpus-probability function, while the associated model instances hop around uncontrolled (for example, on a circle-like path in the "space" of all model instances).

Proof See Theorems 6 and 7.

The Generalized Expectation-Maximization Algorithm

The EM algorithm performs a sequence of maximum-likelihood estimations on complete data, resulting in good re-estimates on incomplete data ("good" in the sense of Theorem 5). The following theorem, however, reveals that the EM algorithm might overdo it somehow: there exist alternative M-steps which can be performed more easily, and which result in re-estimates having the same property as the EM re-estimates.

Definition 14 A generalized expectation-maximization (GEM) algorithm has exactly the same input as the EM algorithm, but an easier M-step is performed in its procedure:

(4) M-step (GEM): compute an instance p̂ of the complete-data model M such that

L(fq; p̂) ≥ L(fq; q)

Theorem 6 The output of a GEM algorithm is a sequence of instances of the complete-data model M, the so-called GEM re-estimates, such that the sequence of probabilities allocated to the incomplete-data corpus is monotonically increasing.

Proof Various proofs have been given in the literature. The first one was presented by Dempster et al. (1977). For other variants of the EM algorithm, the book of McLachlan and Krishnan (1997) is a good source. Here, we present something along the lines of the original proof. Clearly, a proof of the theorem requires that we are able to express the probability of the given incomplete-data corpus f in terms of the probabilities of the complete-data corpora fq which are involved in the M-steps of the GEM algorithm (where both types of corpora are allocated a probability by the same instance p of the model M). A certain quantity, which we would like to call the expected cross-entropy on the analyses, plays a major role in solving this task. To be specific, the expected cross-entropy on the analyses is defined as the expectation of certain cross-entropy values H_{A(y)}(q; p), which are calculated on the different sets A(y) of analyses. The "expectation" is then calculated on the basis of the relative-frequency estimate p̃ of the given incomplete-data corpus:

H_A(q; p) = Σ_{y ∈ Y} p̃(y) · H_{A(y)}(q; p)

Now, for two instances q and p of the complete-data model, their conditional probabilities q(x|y) and p(x|y) form proper probability distributions on the set A(y) of analyses of y (see Definition 11). So, the cross-entropy H_{A(y)}(q; p) on the set A(y) is simply given by

H_{A(y)}(q; p) = − Σ_{x ∈ A(y)} q(x|y) log p(x|y)
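The cross-entropy on one analysis set is a one-liner; the sketch below also demonstrates the information inequality H_{A(y)}(q; p) ≥ H_{A(y)}(q; q) used in the proof. The distributions and names are toy assumptions, not values from the paper.

```python
import math

def cross_entropy(q_cond, p_cond):
    """H(q; p) = - sum_x q(x|y) * log2 p(x|y), on one analysis set A(y)."""
    return -sum(qx * math.log2(p_cond[x])
                for x, qx in q_cond.items() if qx > 0)

# Two conditional distributions on a two-element analysis set A(y)
q_cond = {"x1": 0.25, "x2": 0.75}
p_cond = {"x1": 0.5, "x2": 0.5}

# Information inequality: H(q; p) >= H(q; q), with equality iff p == q
assert cross_entropy(q_cond, p_cond) >= cross_entropy(q_cond, q_cond)
```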

Recalling the central task of this proof, a bunch of relatively straightforward calculations leads to the following interesting equation:⁶

L(f; p) = ( 2^{H_A(q; p)} )^{|f|} · L(fq; p)

Using this equation, we can state that

L(f; p) = ( 2^{H_A(q; p) − H_A(q; q)} )^{|f|} · L(f; q) · L(fq; p) / L(fq; q)

In what follows, we will show that, after each M-step of a GEM algorithm (i.e. for p being a GEM re-estimate p̂), both of the factors on the right-hand side of this equation are not less than one. First, an iterated application of the information inequality of information theory (see Theorem 3) yields

H_A(q; p) − H_A(q; q) = Σ_{y ∈ Y} p̃(y) · ( H_{A(y)}(q; p) − H_{A(y)}(q; q) )
                      = Σ_{y ∈ Y} p̃(y) · D_{A(y)}(q || p)
                      ≥ 0

So, the first factor is never (i.e. for no model instance p) less than one:

( 2^{H_A(q; p) − H_A(q; q)} )^{|f|} ≥ 1

⁶ This equation states that the probability of a complete-data corpus (generated by a statistical analyzer) is the product of the probability of the given incomplete-data corpus and |f| times the average probability of the different corpora of analyses (as generated for each of the |f| tokens of the incomplete-data corpus).


Second, by definition of the M-step of a GEM algorithm, the second factor is also not less than one:

L(fq; p̂) / L(fq; q) ≥ 1

So, it follows that

L(f; p̂) / L(f; q) ≥ 1

yielding that the probability of the incomplete-data corpus allocated by the GEM re-estimate p̂ is not less than the probability of the incomplete-data corpus allocated by the model instance q (which is either the starting instance p0 of the GEM algorithm or the previously calculated GEM re-estimate):

L(f; p̂) ≥ L(f; q)

Theorem 7 An EM algorithm is a GEM algorithm.

Proof In the M-step of an EM algorithm, a model instance p̂ is selected such that

L(fq; p̂) = max_{p ∈ M} L(fq; p)

So, in particular,

L(fq; p̂) ≥ L(fq; q)

and the requirements of the M-step of a GEM algorithm are met.

Example 4 We shall consider again Experiment 2, in which two loaded dice are rolled, but we shall now compute a maximum-likelihood estimate of the probability model which assumes that the numbers appearing on the first and second die are statistically independent. First, recall the definition of statistical independence (see e.g. Duda et al. (2001), page 613).

Definition 15 The variables x1 and x2 are said to be statistically independent given a joint probability distribution p on X if and only if

p(x1, x2) = p1(x1) · p2(x2)

where p1 and p2 are the marginal distributions for x1 and x2:

p1(x1) = Σ_{x2} p(x1, x2)
p2(x2) = Σ_{x1} p(x1, x2)

So, let M1/2 be the probability model which assumes that the numbers appearing on the first and second die are statistically independent:

M1/2 = { p ∈ M(X) | x1 and x2 are statistically independent given p }

In Example 2, we have calculated the relative-frequency estimate p̃. Theorem 1 states that p̃ is the unique maximum-likelihood estimate of the unrestricted model M(X). Thus, p̃ is also a candidate for a maximum-likelihood estimate of M1/2. Unfortunately, however, x1 and x2 are not statistically independent given p̃ (see e.g. p̃(1, 1) = 0.03790 and p̃1(1) · p̃2(1) = 0.0379493). This has two consequences for the experiment in which two (loaded) dice are rolled:

• the probability model which assumes that the numbers appearing on the first and second die are statistically independent is a restricted model (see Definition 5), and

• the relative-frequency estimate is in general not a maximum-likelihood estimate of the standard probability model assuming that the numbers appearing on the first and second die are statistically independent.

Therefore, we now follow Definition 6 to compute the maximum-likelihood estimate of M1/2. Using the independence property, the probability of the corpus f allocated by an instance p of the model M1/2 can be calculated as

L(f; p) = ( ∏_{x1=1,...,6} p1(x1)^{f1(x1)} ) · ( ∏_{x2=1,...,6} p2(x2)^{f2(x2)} ) = L(f1; p1) · L(f2; p2)

Definition 6 states that the maximum-likelihood estimate p̂ of M1/2 on f must maximize L(f; p). A product, however, is maximized if and only if its factors are simultaneously maximized. Theorem 1 states that the corpus probabilities L(fi; pi) are maximized by the relative-frequency estimates p̃i. Therefore, the product of the relative-frequency estimates p̃1 and p̃2 (on f1 and f2, respectively) might be a candidate for the maximum-likelihood estimate p̂ we are looking for:

p̂(x1, x2) = p̃1(x1) · p̃2(x2)

Now, note that the marginal distributions of p̂ are identical with the relative-frequency estimates on f1 and f2. For example, p̂'s marginal distribution for x1 is calculated as

p̂1(x1) = Σ_{x2} p̂(x1, x2) = p̃1(x1) · Σ_{x2} p̃2(x2) = p̃1(x1)

Example 5 We shall consider again Experiment 2, in which two loaded dice are rolled. Now, however, we shall assume that we are incompletely informed: the corpus of outcomes (which is given to us) consists only of the sums of the numbers appearing on the first and second die. Nevertheless, we shall compute an estimate for a probability model on the complete data (x1, x2) ∈ X. If we assume that the corpus which is given to us was calculated on the basis of the corpus given in Example 2, then the occurrence frequency of a sum y can be calculated as f(y) = Σ_{x1+x2=y} f(x1, x2). These numbers are displayed in the following table:

y       2     3      4      5      6      7      8     9     10    11    12
f(y)  3790  7508  10217  10446  12003  17732  13923  8595  6237  5876  3673

For example, f(4) = f(1, 3) + f(2, 2) + f(3, 1) = 1520 + 3794 + 4903 = 10217. The problem is now whether this corpus of sums can be used to calculate a good estimate of the outcomes (x1, x2) themselves. Hint: Examples 2 and 4 have shown that a unique relative-frequency estimate p̃(x1, x2) and a unique maximum-likelihood estimate p̂(x1, x2) can be calculated on the basis of the corpus f(x1, x2). However, right now, this corpus is not available! Putting the example in the framework of the EM algorithm (see Definition 12), the set of incomplete-data types is

Y = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

whereas the set of complete-data types is X. We also know the set of analyses for each incomplete-data type y ∈ Y:

A(y) = { (x1, x2) ∈ X | x1 + x2 = y }

As in Example 4, we are especially interested in an estimate of the (slightly restricted) complete-data model M1/2 which assumes that the numbers appearing on the first and second die are statistically independent. So, for this case, a randomly generated starting instance p0(x1, x2) of the complete-data model is simply the product of a randomly generated probability distribution p01(x1) for the numbers appearing on the first die, and a randomly generated probability distribution p02(x2) for the numbers appearing on the second die:

p0(x1, x2) = p01(x1) · p02(x2)

In this example, more EM iterations will result in exactly the same re-estimates. So, this is a strong reason to quit the EM procedure. Comparing p1584,1 and p1584,2 with the results of Example 3 (where we assumed that a complete-data corpus is given to us!), we see that the EM algorithm yields pretty similar estimates.
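The whole example can be reproduced with a short EM loop. The sketch below is not the paper's computation: for simplicity it uses a uniform starting instance rather than a random one (so it stays in the symmetric subspace p1 = p2), and the function names are ours. It does, however, verify the monotonicity guarantee of Theorem 5 on the corpus of sums.

```python
import math

# Incomplete-data corpus f(y) of sums, as given in Example 5
F_SUMS = {2: 3790, 3: 7508, 4: 10217, 5: 10446, 6: 12003, 7: 17732,
          8: 13923, 9: 8595, 10: 6237, 11: 5876, 12: 3673}

def log_likelihood(f_sums, p1, p2):
    # log L(f; p) = sum_y f(y) * log( sum_{x1+x2=y} p1(x1)*p2(x2) )
    return sum(freq * math.log(sum(p1[a] * p2[y - a]
                                   for a in range(1, 7) if 1 <= y - a <= 6))
               for y, freq in f_sums.items())

def em_two_dice(f_sums, iterations=50):
    p1 = {i: 1 / 6 for i in range(1, 7)}  # uniform start (the paper uses a random one)
    p2 = {i: 1 / 6 for i in range(1, 7)}
    history = []
    for _ in range(iterations):
        history.append(log_likelihood(f_sums, p1, p2))
        # E-step: distribute f(y) among the analyses (x1, x2) with x1 + x2 = y
        fq = {}
        for y, freq in f_sums.items():
            pairs = [(a, y - a) for a in range(1, 7) if 1 <= y - a <= 6]
            total = sum(p1[a] * p2[b] for a, b in pairs)
            for a, b in pairs:
                fq[(a, b)] = fq.get((a, b), 0.0) + freq * p1[a] * p2[b] / total
        # M-step: relative-frequency estimation on the marginal complete-data corpora
        n = sum(fq.values())
        p1 = {i: sum(v for (a, _), v in fq.items() if a == i) / n
              for i in range(1, 7)}
        p2 = {i: sum(v for (_, b), v in fq.items() if b == i) / n
              for i in range(1, 7)}
    return p1, p2, history

p1, p2, history = em_two_dice(F_SUMS)
# Theorem 5: the corpus log-likelihoods are monotonically increasing
assert all(a <= b + 1e-7 for a, b in zip(history, history[1:]))
```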

Acknowledgments

Parts of the paper cover parts of the teaching material of two courses at ESSLLI 2003 in Vienna. One of them was sponsored by the European Chapter of the Association for Computational Linguistics (EACL), and both were co-lectured by Khalil Sima'an and me. Various improvements of the paper have been suggested by Wietse Balkema, Gabriel Infante-Lopez, Karin Müller, Mark-Jan Nederhof, Breanndán Ó Nualláin, Khalil Sima'an, and Andreas Zollmann.