Probabilistic models, such as hidden Markov models or Bayesian networks, are commonly used to model biological data. Much
of their popularity can be attributed to the existence of efficient and robust procedures for learning parameters from observations.

Often, however, the only data available for training a probabilistic model are incomplete. Missing values can occur, for example, in medical diagnosis, where patient histories generally include results from a limited battery of tests.

Alternatively, in gene expression clustering, incomplete data arise from the intentional omission of gene-to-cluster assignments in the probabilistic model.

The Expectation-Maximization (EM) algorithm is a way to find maximum-likelihood estimates for model parameters when your data is incomplete, has missing data points, or has unobserved (hidden) latent variables.

It is an iterative way to approximate the maximum likelihood function.

While maximum likelihood estimation can find the “best fit” model for a set of data, it doesn’t work particularly well for incomplete data sets.

The more complex EM algorithm can find model parameters even if you have missing data. It works by choosing random values for the missing data points, and using those guesses to estimate a second set of data. The new values are used to create a better guess for the first set, and the process continues until the algorithm converges on a fixed point.

The key bit of intuition is that the greater the weight of a colour on a data point, the more the data point influences the next estimates for that colour's parameters. This has the effect of "pulling" the parameters in the right direction.

In [13]:

defestimate_mean(data,weight):returnnp.sum(data*weight)/np.sum(weight)defestimate_std(data,weight,mean):variance=np.sum(weight*(data-mean)**2)/np.sum(weight)returnnp.sqrt(variance)# new estimates for standard deviationnew_blue_std_guess=estimate_std(both_colours,blue_weight,blue_mean_guess)new_red_std_guess=estimate_std(both_colours,red_weight,red_mean_guess)# new estimates for meannew_red_mean_guess=estimate_mean(both_colours,red_weight)new_blue_mean_guess=estimate_mean(both_colours,blue_weight)

# estimates for the meanred_mean_guess=1.1blue_mean_guess=9# estimates for the standard deviationred_std_guess=2blue_std_guess=1.7y=[0.001foriinrange(len(both_colours))]plt.plot(both_colours,y,'mo')plt.ylim(-0.02,0.40)foriinrange(5):likelihood_of_red=stats.norm(red_mean_guess,red_std_guess).pdf(both_colours)likelihood_of_blue=stats.norm(blue_mean_guess,blue_std_guess).pdf(both_colours)likelihood_total=likelihood_of_red+likelihood_of_bluered_weight=likelihood_of_red/likelihood_totalblue_weight=likelihood_of_blue/likelihood_totalblue_std_guess=estimate_std(both_colours,blue_weight,blue_mean_guess)red_std_guess=estimate_std(both_colours,red_weight,red_mean_guess)red_mean_guess=estimate_mean(both_colours,red_weight)blue_mean_guess=estimate_mean(both_colours,blue_weight)print(red_mean_guess,blue_mean_guess)x=np.linspace(-3,12,100)ymaxr=max(mlab.normpdf(x,red_mean_guess,red_std_guess))/0.39ymaxb=max(mlab.normpdf(x,blue_mean_guess,blue_std_guess))/0.39plt.plot(x,mlab.normpdf(x,red_mean_guess,red_std_guess),'r',x,mlab.normpdf(x,blue_mean_guess,blue_std_guess),'b',alpha=0.2*i)plt.axvline(red_mean_guess,color='r',linestyle='--',ymin=0.067,ymax=ymaxr,alpha=0.2*i)plt.axvline(blue_mean_guess,color='b',linestyle='--',ymin=0.067,ymax=ymaxb,alpha=0.2*i)

# estimates for the meanred_mean_guess=1.1blue_mean_guess=9# estimates for the standard deviationred_std_guess=2blue_std_guess=1.7foriinrange(20):likelihood_of_red=stats.norm(red_mean_guess,red_std_guess).pdf(both_colours)likelihood_of_blue=stats.norm(blue_mean_guess,blue_std_guess).pdf(both_colours)likelihood_total=likelihood_of_red+likelihood_of_bluered_weight=likelihood_of_red/likelihood_totalblue_weight=likelihood_of_blue/likelihood_totalblue_std_guess=estimate_std(both_colours,blue_weight,blue_mean_guess)red_std_guess=estimate_std(both_colours,red_weight,red_mean_guess)red_mean_guess=estimate_mean(both_colours,red_weight)blue_mean_guess=estimate_mean(both_colours,blue_weight)y=[0.001foriinrange(len(both_colours))]plt.plot(both_colours,y,'mo')plt.ylim(-0.02,0.50)x=np.linspace(-3,12,100)ymaxr=max(mlab.normpdf(x,red_mean_guess,red_std_guess))/0.5ymaxb=max(mlab.normpdf(x,blue_mean_guess,blue_std_guess))/0.5plt.plot(x,mlab.normpdf(x,red_mean_guess,red_std_guess),'r',x,mlab.normpdf(x,blue_mean_guess,blue_std_guess),'b',alpha=1)plt.axvline(red_mean_guess,color='r',linestyle='--',ymin=0.06,ymax=ymaxr,alpha=1)plt.axvline(blue_mean_guess,color='b',linestyle='--',ymin=0.06,ymax=ymaxb,alpha=1)