Login

Implementing Bayesian Inference Using PHP: Part 2

While the first article in this series discussed building intelligent Web
applications through conditional probability, this Bayesian inference article
examines how you can use Bayes methods to solve parameter estimation problems.
Relevant concepts are explained in the context of Web survey analysis using PHP
and JPGraph. (This intermediate-level article was first published by IBM
developerWorks on April 12, 2004 at http://www.ibm.com/developerWorks).

Statistics consists of a set of classic problems and knowledge about the methods that might be used to solve those problems. Classic problems in statistics include parameter estimation, hypothesis testing, prediction, model selection, and decision making. Researchers in artificial intelligence have added classification, clustering, and learning to this list of inference problems.

Bayes inference methods are distinguished from other inference methods (such as least squares, maximum likelihood, maximum entrophy, minimum description length) by the fact that a Bayes inference method can be applied to every type of problem in this list and often represents the optimal method to use. Statisticians, philosophers, and computer scientists have proposed numerous guides to reliable inference, but Bayes inference methods reign supreme as the most versatile and often the most effective set of methods to use for a vast range of inductive inference problems.

In this article, I will discuss the classic problem of parameter estimation and develop Bayesian parameter estimation code using the popular Web scripting language PHP. I am specifically concerned with developing parameter estimation code that might be used in the analyses of simple Web surveys.

{mospagebreak title=Defining simple surveys}

Surveys come in many forms. You can present questions and solicit answers in a large variety of ways. I won’t be concerned with understanding how to analyze every possible type of survey; instead, I will try to be a bit more strategic by starting with the simplest possible types of surveys.

The first type of survey to examine is one in which all the survey questions require a boolean response (yes/no, agree/disagree, male/female, and so forth). Think of this as a multiple-choice survey where all questions only offer two mutually-exclusive response options, probably the simplest type of survey you can imagine constructing.

When participants take a Web survey, you need to store their answers in a form suitable for later analysis. For analysis purposes, the best way to store survey answers is in a database table dedicated to responses from a particular survey. The survey table should have columns devoted to recording the boolean-valued response for each question (denoted q1 to q3 in the following table):

Opinion poll surveys with binary response options are ideally collected in this format. Surveys with binary data collected in this format are referred to as binary surveys.

A survey that is constructed for the purposes of classifying participants differs from the above in that it requires at least one extra classification field (denoted c1 below) to record, for example, the employment status of the participant (coded as 0 = unemployed and 1 = employed).

Table 2. Extra field allows participants to be classified

participant

q1

q2

q3

c1

1

0

0

0

1

2

0

0

1

1

3

1

1

0

0

add more rows here

Note that records used for medical diagnostic testing are likely to have a similar format.

Surveys with binary data collected in this format are referred to as binary classification surveys.

When the adjective “simple” is used to describe a binary survey, this means that the survey consists of only one binary response per participant — which most people would refer to as a poll. It can also be viewed as the limiting case of a survey.

When the adjective “simple” is used to describe a binary classification survey, this means that the survey consists of only two binary responses per participant, one being a response to the test question q1 and one being a response to the classification question c1.

You will find the parameter estimation concepts that I discuss in this article to be useful for analyzing simple binary surveys. In my next article, I focus on concepts and code useful for analyzing simple binary classification surveys and multivariate binary classification surveys.

The range of binary surveys as defined here represent a distinct and interesting class of surveys to study. A cornucopia of literature and palette of analytic techniques is available to analyze binary data. Binary surveys are also interesting because responses coded as 0s and 1s are written in the native language of hardware-based computing. Fields as diverse as statistics, computer science, physics, medical diagnosis, data compression, and electrical engineering can be treated in a unified manner within the mathematics of binary data analysis and modeling.

{mospagebreak title=What is parameter estimation?}

Parameter estimation refers to the process of using sample data to estimate the value of a population parameter (for example, the mean, variance, or t score) or a model parameter (for example, a weight in a regression equation). Surveys are often used for the purposes of parameter estimation.

A consumer survey might ask about past spending decisions in various retail sectors. The purpose might be to estimate the parameter values to use in a mathematical model describing the complete consumer spending profile of the sampled population. (The surveys I use in this article are not so complex; I are more concerned with understanding the basic concepts involved.) To start I want to focus on a simple binary survey (a poll).

When estimating parameters, the most common inference technique used to estimate the “true value” of a parameter is the maximum likelihood technique. The manner in which a maximum likelihood estimator (MLE) is computed depends upon:

The level of measurement used (nominal, ordinal, interval, ratio) to record data.

In the case of a simple binary survey, the level of measurement is nominal because you use a category-based scale with two levels to measure participant responses to the question.

In what follows, you use the binomial sampling distribution to model the response distribution you obtained from a simple binary survey. You could also use a hypergeometric distribution if the fact that you are sampling without replacement is a concern (for example, when sampling from a small-finite population, such as an employee survey).

Now that you know the level of measurement and the appropriate sampling distribution to use, you can discuss the specifics of the maximum likelihood technique that should be used to estimate our population or model parameters. This discussion of maximum likelihood estimation is relevant to Bayesian inference because it demonstrates another technique that can be used to compute the likelihood terms in Bayes equation. This new technique involves using probability distributions to compute the likelihood (and prior terms) in Bayes formula. In the previous article, I computed the likelihood term by dividing the joint distribution by the relevant marginal distribution. Here, I will demonstrate how to compute the likelihood term using a theoretical sampling distibution.

Another reason why maximum likelihood techniques are relevant is because they are the main competition to using Bayes inference in parameter-estimation contexts.

{mospagebreak title=Computing the MLE}

In the context of surveys with questions having only binary response options, you can model the distribution of responses as a binomial random variable: a variable that can only take on one of two values. Given this probability distribution model, one of the parameters you want to estimate using your survey data is the probability of success for a given question where success (denoted as p) can be defined as the probability that participants will give a 1-coded response. The letter q can be used to denote “failure” (a 0-coded response) and is a probability value equal to 1 – p.

To see how to compute an MLE of p, imagine that you have the following survey data to base your estimate of p on (the probability of responding “yes”):

Table 3. Survey data basis of estimate of p

participant

q1

1

0

2

1

3

0

4

0

5

0

To estimate p for question 1 using maximum likelihood techniques, you need to try out various values of p to see which one maximizes the conditional probability of the observed results R:

P(R | pi)

The results R can be summarized as the proportion of successes k observed among n sample items k/n. From the table above, one in five (or 20 percent) of the participants responded with a 1-coded response. Therefore, to compute the MLE of p, you try out various values of p and see which one maximizes the conditional probability of k/n:

MLE = max( P(k/n | pi) )

Given that the distribution of question scores can be modeled as a binomial random variable, you can use the binomial distribution function to calculate the probability of the observed results. The binomial distribution function returns the likelihood of an event occurring $k times in $n attempts, in which the probability of success on a single attempt is $p:

$likelihood = binomial($n, $k, $p);

The equation for computing the binomial probability of a particular result k/n given a particular value of p is:

P(k/n | pi) = nCk * (p)k * (1 – p) (n – k)

A binomial probability is a product of three terms:

The number of ways of selecting k items from n items: nCk. T

The probability of success raised to an exponent equal to the number of success events involved in the outcome: pk.

The probability of failure raised to an exponent equal to the number of failure events involved in the outcome: 1 – p (n – k).

So to determine the value of $p that maximizes the observed results, you can simply construct a loop where you keep the values of $n and $k fixed on each iteration but vary the probability of success $p for a particular trial (as in the following listing):

You’ve computed the MLE of p by applying the max()function to the likelihood array (a more accurate root-finding method such as Newton-Raphson should be used when more accuracy is required). Selecting the max of the likelihood array only gives you a point estimate of what value of p is most likely given the results. Other values of p are also possible but less likely to produce the observed results.

To get a sense of how likely various values of p are, we should examine the relationship between different p values and their corresponding likelihood values (denoted as l(p) in the following graph).

JPGraph is the leading PHP-based package for creating professional graphs that can be displayed on the Web or in other media. The following code is used to create a graph of the likelihood distribution for p. One interesting feature of this code is that it demonstrates usage of the cubic spline function for interpolating values and creating smooth curves.

Listing 3. Creating the likelihood distribution graph

<?php/*** Script to graph likelihood distribution of the binomial * parameter p (for example, probability of success per trial). ** Much of the plotting code below has been adapted from * example code written by JPGraph author Johan Persson.** To get this working locally, you must install JPGraph* and set the PHP_MATH constant to the folder where the* JPGraph source code resides.* * @see http://www.aditus.nu/jpgraph/index.php*/

The y-axis represents the likelihood of p (denoted l(p)) and was computed using the binomial formula. The subtitle of the graph tells you that the likelihood achieves a maximum (technically where the derivative is 0) of .4096 when p = 0.20, which is equal to the observed proportion of sample successes.

Why do you need to do all this work to estimate p when we could have used common sense to arrive at the same result? The fact that the MLE procedure agrees with common sense helps to convince you that you can also use this technique when estimating parameters that are not so easy to determine through common sense. In those cases you can proceed by:

Finding a way to express the likelihood of the results as a function of the parameters.

Computing the likelihood of the result with respect to the parameter.

Selecting the parameter value that yields the maximum likelihood value.

The use of maximum likelihood principle is pervasive in statistical reasoning along with other principles such as the Bayesian principle of maximizing the posterior. Other notable principles include the least-squared error criterion, maximum entrophy, minimum description-length, variational inference, and various energy minimization principles.

High-level statistical reasoning is more about welding these principles effectively to estimate parameters, test hypothesis, and such, than it is about algebraic cleverness in deriving new formulas. A bit of algebraic cleverness, however, can come in handy.

{mospagebreak title=Algebraic cleverness}

When estimating p using maximum likelihood analysis, you use the binomial formula to compute the likelihood of p:

P( k/n | pi) = nCk (p)k (1 – p) (n – k)

The computation involved keeping the k and n parameters fixed while varying pi and then seeing which value of pi maximized the likelihood of the results k/n. If you examine the above equation, you should note that the value of nCk will remain constant as you vary pi. This implies that you can drop this term from the equation without affecting the shape of the likelihood distribution or the MLE. To confirm this, you can modify the likelihood graphing code by replacing this line:

$likelihoods[$i] = binomial($n, $k, $p);

with this line:

$likelihoods[$i] = pow($p, $k) * pow(1-$p, $n-$k);

This is simply the binomial formula without the combinations term. When you do this, you get the following graph:

Figure 2. The likelihood distribution graph (reduced formula)

Note that the MLE value is smaller than before, but that 0.20 is still the MLE of p. As the likelihood distribution is not a probability distribution, these reduced values are immaterial — the shape and maxima are all that really matters. From now on, you can use this reduced formula to compute the MLE of p.

P( k/n | pi) = (p)k * (1 – p) (n – k)

Another bit of algebraic cleverness involves eliminating the exponents and multiplications by taking the logarithm of each term appearing on the right-hand side. When you do so, we obtain the log likelihood formula (commonly denoted with a capital L):

Note that taking the log of a formula with exponents in it changes the exponents into multipliers. Also, terms that were multiplied are now added. It is often easier to find the derivative of 0 with the log version of the formula (to find the MLE).

To convince yourself that the log likelihood formula can be used to derive the MLE of p, you can modify the likelihood graphing code by replacing this line:

$likelihoods[$i] = pow($p, $k) * pow(1-$p, $n-$k);

with this line:

$log_likelihoods[$i] = $k * log($p) + ($n – $k) * log(1 – $p);

which results in this graph:

Figure 3. The likelihood distribution graph (log likelihood formula)

As you can see, the shape has changed somewhat but the MLE of p is still 0.20. Note also that the first graphed point does not start at 0 because the log of 0 produces an infinite value. The simple solution I adopted was to start plotting with a p value of 0.05 instead of 0.

It was necessary to show you these alternate formulas for computing the likelihood of p because statisticians often switch between them in different contexts based on the mathematical convenience of doing so. The log likelihood version is especially important because you often see abundant use of logarithms in the context of logistic regression which is often used to analyze multivariate surveys and experimental data having a binary dependent measure that one wants to predict and explain. Logistic regression uses maximum likelihood techniques to estimate(theta) on the basis of several explanatory response variables (for n explanatory variables):

= P( Y=1 | xi1, …, xin )

Logistic regression is used for estimation, prediction, and modeling purposes and is an important technique you should learn if you want to design and analyze multivariate binary surveys.

In the specific case of a simple binary survey, the sample results can be expressed as the number of success events k divided by the total number of events n:

R = k/n

The Bayes parameter estimation formula for poll data looks like this:

P(i | k/n) = P(k/n |i) * P(i) / P(k/n)

Recall that the numerator term P(k/n) plays a relatively insignificant normalizing role, so you can ignore it for the purposes of understanding how to compute the posterior distribution:

P(i | k/n) ~ P(k/n |i) * P(i)

In the last few sections, I have shown you how the likelihood term P(k/n |i) in the above formula can be computed using maximum likelihood techniques — in particular, the binomial formula for computing the probability of various values ofi (where p is replaced by the generic term denoting a parameter):

P(k/n |i) = nCk *ik * (1 -i) (n – k)

Now that you know how to compute the likelihood term in Bayes equation, how can you compute the prior term P(i)?

The key to computing P(i) is to first recognize thati represents the probability of a success event (like a 1-coded response) and as such, can only take on values in the 0 to 1 range. Each value ofi in this range will have a different probability of occurrence associated with it. The parameteri can assume an infinite number of values between 0 and 1 which means that you need to represent it with a continuous probability distribution (like the normal distribution) as opposed to a discrete probability distribution (like the binomial distribution).

In the case of a simple binary survey, the beta distribution is the appropriate continuous distribution to use to represent P(i) because:

The domain of your probability distribution function is between 0 and 1, and

The outcomes of your survey arise from a Bernoulli process.

A Bernoulli process:

consists of a series of independent, dichotomous trials where the possible events occurring on each trial are labeled “success” and “failure”,p is the probability of success on a given trial, and p remains unchanged from trial to trial. — Winkler and Hayes, Statistics: Probability, Inference and Decision, p 204.

The process that generates the observed response distribution for a particular binary question in the survey can be legitimately viewed as arising from a Bernoulli process as Winkler and Hayes defined. A process that can be modeled as a Bernoulli process gives rise to a Beta distribution for the parameter p (estimated using k/n). I’m ready now to discuss the beta distribution and the critical role it plays in computing the posterior parameter estimate P(i | R).

{mospagebreak title=Beta distribution sampling model}

A random variable is said to have the standard beta distribution with parameters a and b if its probability density function is:

f() =a – 1 * (1 -) b – 1 / B(a, b)

Rather than explain the formula by resorting to more mathematics, I will discuss PHP code that you can use to compute f() for various values of a and b. Towards this end I created a class called BetaDistribution.phpand added it to a probability distributions package I developed for a previous article (see Resources). This class supplies methods that accept the a, b, andparameters.

The class constructor is first called with the a and b parameters as follows:

Listing 4. Instantiating the BetaDistribution class

<?php

// Demonstration of how to instantiate Beta Distribution// class.

require_once “../BetaDistribution.php”;

$a = 1; // num successes$b = 4; // num failures

$beta = new BetaDistribution($a, $b);

?>

In this example, the number of success events (previously k) is denoted by a. The number of failure events is equal to n – k and is denoted by b. The a and b parameters jointly control the shape and location of the beta distribution curves.

Graphically speaking, the beta distribution refers to a large family of plotting curves that can differ substantially from one another depending upon the a and b parameter values. As you shall see, the a and b parameters can be used to represent the prior probability distribution that you feel is most appropriate for representing P(i).

Suppose you test your simple binary survey before going live. Select five people that you think are representative of the target sampling population and ask them to fill it out the survey online; then observe the following results:

One participant responds “yes” (a “success” event).

Four participants respond “no” (“failure” events).

The code in Listing 5 invokes the BetaDistributionwith the appropriate parameter values of a=1 and b=4 to represent the results of this survey. Once you instantiate the beta distribution constructor with the appropriate a and b parameter values, you can then use other methods in this class (depicted in the following) to compute standard probability distribution functions:

Listing 5. Using other methods to compute probability distribution functions

If the test has no glitches, you can go into your main experiment with BetaDistribution(1, 4) being used to represent your prior distribution P(). Note that the mean value (= p = k/n = a / a + b = .20) reported in the table is what you expect it to be from common-sense considerations (such as the expected value ofequal to the observed proportion of cases to date k/n).

To visualize your prior probability distribution, you can use the following code below to obtain the x and y coordinates to plot. The probability density function PDF()returns a “probability” value associated with a particularvalue — for instance, P[p = .20]. Given a contiguous range of p values, the PDF()method give you a corresponding range of probability values f(p) that you can use to graph the shape of the probability distribution for fixed a, b parameters and for a range of possiblevalues:

In the following graph, p = $parameters[$i], and f(p) = $pdf_vals[$i].

Figure 4. Prior distribution is not well defined; too few observations

The exact values of f(p) are of less concern than the overall shape and center of gravity for the prior distribution. What this graph shows is that your prior distribution is still not very well defined because it does not peak around a particular parameter estimate. This is as it should be when you only have a few observations to work with.

{mospagebreak title=Beta distribution source code}

The BetaDistributionclass depends upon methods supplied by the parent ProbabilityDistributionclass and functions supplied by the SpecialMath.php library of functions. The SpecialMath.php library does most of the heavy lifting since it implements the critical logBeta and incompleteBeta functions, along with some important gamma functions. All of these functions are ports of functions and classes that originally appeared as part of the open source JSci package (see Resources).

Updating through conjugate priors

Suppose that you go live with your simple binary survey and collect the following responses.

Four participants respond with a 1-coded answer (success events).

Sixteen participants respond with a 0-coded answer (failure events).

People in the live survey responded with the same success proportions (k/n = 4/20 = 1/5 = .20) as in the pilot survey (k/n = 1/5 = .20).

Look at the graph of the results expressed as a beta distribution with a=4 and b=16.

Figure 5. More data means that estimates rest more firmly within a range of values

As you can see, the graph is starting to sharply peak around the parameter estimate .20 and the standard deviation of the parameter estimate is decreasing. The beta distribution is representing the fact that you have more data to learn from and that your estimates can be more firmly placed within a range of values. Confidence intervals for your parameter estimate, also known in Bayesian statistics as credible intervals, can also be computed (but I’ll leave that as an exercise).

I will conclude this discussion by demonstrating how easy it is to update the Bayes parameter estimate with new information by using the concept of a conjugate prior. In particular, I will look at how to combine the parameter estimate obtained from the test survey with the parameter estimate obtained from the live survey. Don’t throw out the test survey data if it is representative of the population you want to draw inferences about and the test conditions remain the same.

Essentially, a conjugate prior allows you to represent the Bayes parameter estimation formula in incredibly simple terms using the beta parameters a and b:

aposterior = alive + atestbposterior = blive + btest

aposterior = 4 + 1bposterior = 16 + 4

Using the conjugate priors updating rule to combine test and live survey parameter estimates, you pass a=5 and b=20 into our BetaDistributionclass and plot the resulting probability distribution.

You can summarize the test survey results through the parameters atest=1 and btest=4 in a beta prior distribution (P() = Beta[1, 4]). You can summarize the live survey results through the parameters alive=4 and blive=16 in a beta likelihood distribution (that is, P(D |) = Beta[4, 16] ).

Adding these conjugate beta distributions (Beta[1, 4] + Beta[4, 16]) together amounts to adding together the a and b parameters from both beta distributions. Similarly, simple conjugate prior updating rules are available for Gaussian (Normal-Wishart family of distributions) and multinomial data (Dirichlet family of distributions) as well.

The concept of conjugate priors is attractive from the point of view of implementing Bayes networks and imagining how you might propagate information from parent nodes to child nodes. If several parent nodes use the beta a and b parameters to represent information about some aspect of the world, then you may be able to propogate this information to downstream child nodes by simply summing parent node beta weights.

Another attractive feature of the conjugate prior updating rule is that it is recursive and, in the limit, can be used to tell you how to update your posterior probabilities on the basis of a single new observation (another exercise for you to think about).

The use of conjugate priors is not, however, without its critics who argue that the mindless use of conjugate priors abrogates a Bayesian’s responsibility to use all information at his disposal to represent prior knowledge about a parameter. Just because the likelihood distribution can be represented using a beta sampling model does not mean that you also need to represent your prior knowledge with a beta distribution. Personally, I would discount these criticisms in the case of a simple binary survey because the beta sampling model appears to be an appropriate representation to use to depict the prior estimate of what value of p you might observe.

I conclude this section by comparing maximum likelihood estimators of p with Bayesian estimators of p. Both estimation techniques produce unbiased estimates of p and converge on p in the long run (they share similar asymptotic behavior). MLE estimators are generally simpler to compute and are often preferred by statisticians when doing parameter estimation.

Bayesian estimators and MLE estimators differ in their small sample behavior as estimators. You should study convergence rates and bias measures to get a practical sense of how they might differ. Bayesian methods allow more flexibility in terms of how you might incorporate external information (through the prior probability distribution) into the parameter estimation process.

{mospagebreak title=Conclusions}

This article demonstrated the use of Bayesian methods to solve parameter estimation problems. I discussed several important concepts relevant to using Bayesian methods to solve parameter estimation problems, including maximum likelihood estimators, binomial random variables, the Bernoulli process, the beta distribution, and conjugate priors. I hope the discussion provided you with a better general understanding of Bayesian inference techniques and also helped you to understand some concepts that play important roles in statistics and probability.

I highlighted the important role played by probability distributions in representing the likelihood and prior terms in Bayes theorem. Recall that in the first article of this series, I computed the posterior distribution without resorting to the use of theoretical probability distributions. You can conclude from this that full mastery of Bayes methods involves learning a more theoretically-oriented probability distribution approach to Bayesian inference along with a more empirically-oriented joint frequency approach.

In my next article, I will move from analyzing simple binary surveys to the concepts and techniques that are useful for analyzing simple binary classification surveys and multivariate classification surveys. In doing this, I will examine another classic inference problem that Bayes methods are particularly good at solving – – classification problems.

“Implement Bayesian inference using PHP, Part 3” solves classification problems in medical diagnostic testing and Web survey analysis as it applies Bayesian and conditional probability concepts to both building classifier systems and analyzing the accuracy of their output (developerWorks, May 2004)