\(\newcommand{\xb}{{\bf x}}
\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\zb}{{\bf z}}
\newcommand{\gammab}{\boldsymbol{\gamma}}\)

We have no choice but to choose

We make choices every day, and often these choices are made among a finite number of potential alternatives. For example, do we take the car or ride a bike to get to work? Will we have dinner at home or eat out, and if we eat out, where do we go? Scientists, marketing analysts, or political consultants, to name a few, wish to find out why people choose what they choose.

In this post, I provide some background about discrete choice models, specifically, the multinomial probit model. I discuss this model from a random utility model perspective and show you how to simulate data from it. This is helpful for understanding the underpinnings of this model. In my next post, we will use the simulated data to demonstrate how to estimate and interpret effects of interest.

Random utility model and discrete choice

A person confronted with a discrete set of alternatives is assumed to choose the alternative that maximizes his or her utility. Utilities are typically conceived of as the result of a function consisting of an observed deterministic part and an unobserved random part, because not all factors that may be relevant for a given decision can be observed. The frequently used linear random utility model is

\[U_{ij} = V_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,\ldots,J\]

where \(U_{ij}\) is the utility of the \(i\)th individual related to the \(j\)th alternative, \(V_{ij}\) is the observed component, and \(\epsilon_{ij}\) is the unobserved component. In the context of regression modeling, the observed part, \(V_{ij}\), is typically construed as some linear or nonlinear combination of observed characteristics of individuals and alternatives with corresponding parameters, and those parameters are estimated based on a model that makes certain assumptions about the distribution of the unobserved components, \(\epsilon_{ij}\).
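
The decision rule implied by the model, choosing the alternative with the highest total utility, can be sketched in a few lines of Python; the numbers below are made up for illustration:

```python
# Utility maximization: the person chooses the alternative j that
# maximizes U_j = V_j + e_j (values are made up for illustration).
V = [1.2, 0.7, 1.0]        # observed, deterministic components
eps = [-0.3, 0.9, 0.1]     # unobserved, random components
U = [v + e for v, e in zip(V, eps)]              # total utilities
choice = max(range(len(U)), key=lambda j: U[j])  # index of the maximum
```

Here the second alternative ends up with the highest total utility even though the first has the highest observed component, which is exactly the role the random part plays.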

Motivating example

Let’s take a look at an example. Suppose that individuals can enroll in one of three health insurance plans: Sickmaster, Allgood, and Cowboy Health. Thus we have the following set of alternatives:

\[\mathcal{C} = \{\text{Sickmaster}, \text{Allgood}, \text{Cowboy Health}\}\]

We would expect a person’s utility related to each of the three alternatives to be a function of both personal characteristics (such as income or age) and characteristics of the health care plan (such as its price).

We might sample individuals and ask them which health plan they would prefer if they had to enroll in one of them. If we collected data on the person’s age (in decades), the person’s household income (in $10,000), and the price of a plan (in $100/month), then our data might look something like the first three cases from the simulated data below:

Taking the first case (id==1), we see that the case-specific variables hhinc and age are constant across alternatives and that the alternative-specific variable price varies over alternatives.
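
A minimal Python sketch of this long-format layout, with made-up values rather than the actual simulated data from the post:

```python
# Long-format layout: one row per case-alternative pair.
# All values are made up for illustration.
alts = ["Sickmaster", "Allgood", "Cowboy Health"]
case = {"id": 1, "hhinc": 4.0, "age": 3.5}   # case-specific variables
prices = [2.1, 2.8, 1.5]                     # alternative-specific variable
U = [1.9, 0.4, -0.7]                         # latent utilities
best = max(range(3), key=lambda j: U[j])     # utility-maximizing alternative
rows = [{**case, "alt": alts[j], "price": prices[j], "U": U[j],
         "choice": 1 if j == best else 0} for j in range(3)]
```

Note that `hhinc` and `age` repeat across the three rows of the case, while `price`, `U`, and `choice` vary over alternatives.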

The variable alt labels the alternatives, and the binary variable choice indicates the chosen alternative (it is coded 1 for the chosen plan and 0 otherwise). Because this is a simulated dataset, we know the underlying utilities that correspond to each alternative; those are given in the variable U. The first respondent's utility is highest for the first alternative, so the outcome variable choice takes the value 1 for alt=="Sickmaster" and 0 otherwise. This is the marginal distribution of cases over alternatives:

As we will see below, a useful model for analyzing these types of data is the multinomial probit model.

Multinomial probit model

The multinomial probit model is a discrete choice model based on the assumption that the unobserved components \(\epsilon_{ij}\) come from a normal distribution. Different probit models arise from different specifications of \(V_{ij}\) and different assumptions about \(\epsilon_{ij}\). For example, with a basic multinomial probit model, as implemented in Stata's mprobit command (see [R] mprobit), we specify \(V_{ij}\) to be

\[V_{ij} = \xb_{i}\betab_{j}'\]

where \(\xb_{i}\) is a vector of individual-specific covariates, and \(\betab_{j}\) is the corresponding parameter vector for alternative \(j\). The random components \(\epsilon_{ij}\) are assumed to come from a multivariate normal distribution with mean zero and identity variance–covariance matrix. For example, if we had three alternatives, we would assume

\[\begin{bmatrix} \epsilon_{i1}\\ \epsilon_{i2}\\ \epsilon_{i3} \end{bmatrix} \sim N\left(\begin{bmatrix} 0\\ 0\\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}\right)\]

Specifying the above covariance structure means that the unobserved components, \(\epsilon_{ij}\), are assumed to be homoskedastic and independent across alternatives.

Independence implies that differences in utility between any two alternatives depend on those two alternatives but not on any of the others. This property is known as the independence from irrelevant alternatives (IIA) assumption. When the IIA assumption holds, it affords conveniences such as the ability to estimate the model using only a subset of alternatives (see Train [2009, 48]). However, IIA is a fairly restrictive assumption that might not hold.

Continuing with our health care plan example, suppose that Sickmaster and Allgood both favor people with health problems, while Cowboy Health favors people who only rarely see a doctor. In this case, we would expect the utilities corresponding to Sickmaster and Allgood to be positively correlated with each other and negatively correlated with the utility corresponding to Cowboy Health. We would then need a model that relaxes the IIA assumption and allows for correlated utilities across alternatives.
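
Such correlated unobserved components can be drawn from a multivariate normal distribution. In the numpy sketch below, the covariance matrix is an illustrative assumption, with positive correlation between the errors of the first two alternatives and negative correlation with the third:

```python
import numpy as np

# Draw correlated unobserved components: the first two alternatives'
# errors are positively correlated with each other and negatively
# correlated with the third (covariance values are illustrative).
rng = np.random.default_rng(12345)
Sigma = np.array([[1.0, 0.5, -0.4],
                  [0.5, 1.0, -0.4],
                  [-0.4, -0.4, 1.0]])
eps = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=100_000)
S = np.cov(eps, rowvar=False)   # sample covariance, close to Sigma
```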

Another potential limitation of our multinomial probit specification concerns the observed component \(V_{ij}\), which consists of a linear combination of individual-specific variables and alternative-specific parameters. In other words, we only consider observed variables that vary over persons but not over alternatives. To also accommodate variables that vary over alternatives, such as the price of a plan, we would use

\[V_{ij} = \xb_{i}\betab_{j}' + \zb_{ij}\gammab'\]

where \(\zb_{ij}\) are alternative-specific variables that vary both over individuals and alternatives and \(\gammab\) is the corresponding parameter vector. Combining this with our more flexible assumptions about the unobservables, we can write our model as

\[U_{ij} = \xb_{i}\betab_{j}' + \zb_{ij}\gammab' + \epsilon_{ij}, \hspace{5mm} \boldsymbol{\epsilon}_{i} \sim N(\mathbf{0}, \boldsymbol{\Omega})\]

As we will see later, we can fit this model in Stata with the asmprobit command; see [R] asmprobit for details about the command and implemented methods.

We said in our health plan example that the price individual \(i\) has to pay for a plan is important and that it varies both over individuals and alternatives. We can therefore write our utility model for three alternatives as

\[U_{ij} = \beta_{j,\mathtt{cons}} + \beta_{j,\mathtt{hhinc}}\,\mathtt{hhinc}_{i} + \beta_{j,\mathtt{age}}\,\mathtt{age}_{i} + \gamma\,\mathtt{price}_{ij} + \epsilon_{ij}, \hspace{5mm} j = 1,2,3\]

We can simulate data assuming the data-generating process given in the above model. We will specify the two case-specific variables, household income (hhinc) and age (age), and we will take the price of the plan (price) as the alternative-specific variable. The case-specific variables hhinc and age will be constant across alternatives within each individual, while the alternative-specific variable price will vary over individuals and within individuals over alternatives.
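
The post's Stata simulation code is not reproduced here, but a Python sketch of the data-generating process just described might look like the following. The parameter values are my assumptions: the base alternative's coefficients are set to zero so that the differences line up with the estimates quoted later in the post, and the error covariance matrix is purely illustrative:

```python
import numpy as np

# Sketch of the data-generating process described in the text.
# All parameter values and the error covariance are illustrative
# assumptions; base-alternative coefficients are normalized to zero.
rng = np.random.default_rng(42)
n, J = 1000, 3

hhinc = rng.uniform(1, 10, size=n)        # household income, in $10,000
age = rng.uniform(2, 8, size=n)           # age, in decades
price = rng.uniform(1, 5, size=(n, J))    # plan price, in $100/month

beta_cons = np.array([0.0, -5.0, 3.0])    # alternative-specific constants
beta_hhinc = np.array([0.0, -0.5, -2.0])  # coefficients on hhinc
beta_age = np.array([0.0, 2.0, 1.5])      # coefficients on age
gamma = -0.5                              # coefficient on price

# Observed utility components V_ij and total utilities U_ij
V = (beta_cons
     + np.outer(hhinc, beta_hhinc)
     + np.outer(age, beta_age)
     + gamma * price)
Sigma = np.array([[1.0, 0.5, -0.4],
                  [0.5, 1.0, -0.4],
                  [-0.4, -0.4, 1.0]])     # assumed error covariance
eps = rng.multivariate_normal(np.zeros(J), Sigma, size=n)
U = V + eps
choice = U.argmax(axis=1)                 # index of the chosen alternative
```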

We specify the following population parameters for \(\betab_{j}\) and \(\gamma\):

To allow for alternative-specific covariates, we will expand our data so that we have one observation for each alternative for each case, then create an index for the alternatives, and then generate our variable \({\tt price}_{ij}\):

Looking at the code above, you will notice that we included a factor to scale our specified population parameters. This is due to identification details related to our model that I explain further in the Identification section. One thing we need to know now, however, is that for the model to be identified, the utilities need to be normalized for level and scale. Normalizing for level is straightforward: because we are only interested in the utilities relative to each other, we can define a base alternative and take the differences of utilities with respect to that base. If we set the first alternative as the base, we can rewrite our model as follows:

\[U_{ij} - U_{i1} = \Delta\beta_{j,\mathtt{cons}} + \Delta\beta_{j,\mathtt{hhinc}}\,\mathtt{hhinc}_{i} + \Delta\beta_{j,\mathtt{age}}\,\mathtt{age}_{i} + \gamma\,(\mathtt{price}_{ij} - \mathtt{price}_{i1}) + (\epsilon_{ij} - \epsilon_{i1}), \hspace{5mm} j = 2,3\]

What is left to complete our simulated dataset is to generate the outcome variable that takes the value 1 if observation \(i\) chooses alternative \(k\), and 0 otherwise. To do this, we will first create a single variable for the utilities and then determine the alternative with the highest utility:
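
In Python, this step could be sketched as follows, with illustrative utilities standing in for the simulated variable U:

```python
import numpy as np

# From the latent utilities, pick the utility-maximizing alternative
# per case and code the 0/1 outcome for each case-alternative row.
rng = np.random.default_rng(3)
U = rng.normal(size=(5, 3))                # illustrative utilities, 5 cases
best = U.argmax(axis=1)                    # chosen alternative per case
choice = (np.arange(3) == best[:, None]).astype(int)  # 0/1 indicator matrix
```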

Looking at the above output, we see that the coefficient of the alternative-specific variable price is \(\widehat \gamma = -0.49\), which is close to our specified population parameter of \(\gamma = -0.50\). We can say the same about our case-specific variables. The estimated coefficients of hhinc are \(\widehat \Delta \beta_{2,\mathtt{hhinc}} = -0.50\) for the second and \(\widehat \Delta \beta_{3,\mathtt{hhinc}} = -1.99\) for the third alternative. The estimates for age are \(\widehat\Delta \beta_{2,\mathtt{age}} = 2.00\) and \(\widehat \Delta \beta_{3,\mathtt{age}} = 1.49\). The estimated differences in alternative-specific constants are \(\widehat \Delta \beta_{2,\mathtt{cons}} = -4.98\) and \(\widehat \Delta \beta_{3,\mathtt{cons}} = 3.04\).

Identification

Now let me shed more light on the identification details related to our model that we needed to consider when we simulated our dataset. An important feature of \(U_{ij}\) is that the level as well as the scale of utility is irrelevant with respect to the chosen alternative because shifting the level by some constant amount, or multiplying it by a (positive) constant, does not change the rank order of utilities and thus would have no impact on the chosen alternative. This has important ramifications for modeling utilities because without a set level and scale of \(U_{ij}\), there are an infinite number of parameters in \(V_{ij}\) that yield the same outcome in terms of the chosen alternatives. Therefore, utilities need to be normalized to identify the parameters of the model.
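
This invariance of the chosen alternative to level and scale is easy to check numerically; a quick sketch:

```python
import numpy as np

# Shifting all utilities by a constant or multiplying them by a
# positive constant does not change which alternative is chosen.
rng = np.random.default_rng(0)
U = rng.normal(size=(500, 3))              # illustrative utilities
transformed = 2.5 * U + 10.0               # positive scale and a level shift
same = (U.argmax(axis=1) == transformed.argmax(axis=1)).all()
```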

We already saw how to normalize for level. Normalizing for scale is a bit more difficult, though, because we assume correlated and heteroskedastic errors. Because of the heteroskedasticity, we need to set the scale for one of the variances and then estimate the other variances in relation to the set variance. We must also account for the nonzero covariance between the errors, which makes additional identifying restrictions necessary. It turns out that given our model assumptions, only \(J(J-1)/2-1\) parameters of our variance–covariance matrix are identifiable (see chapter 5 in Train [2009] for details about identifying restrictions in the context of probit models). To be concrete, our original variance–covariance matrix was the following:

Thus, because utilities are scaled by the standard deviation, they are divided by \(\sqrt{\nu/2}\). Now, getting back to our simulation, if we wish to recover our specified parameters, we need to scale them accordingly. We start from the variance–covariance matrix of error differences:

We see that our estimate is close to the true normalized covariance matrix.
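
To make the differencing step concrete, here is a small numpy sketch. The covariance matrix is an assumption for illustration (not the one used in the post), and dividing by the leading element of the differenced covariance is one possible scale normalization:

```python
import numpy as np

# With alternative 1 as the base, the differenced errors are
# (e2 - e1, e3 - e1), with covariance M Sigma M'.  Dividing by the
# leading element is one possible scale normalization.
Sigma = np.array([[1.0, 0.5, -0.4],
                  [0.5, 2.0, -0.6],
                  [-0.4, -0.6, 1.5]])       # illustrative covariance matrix
M = np.array([[-1.0, 1.0, 0.0],
              [-1.0, 0.0, 1.0]])            # differencing matrix
Sigma_diff = M @ Sigma @ M.T                # covariance of error differences
Sigma_norm = Sigma_diff / Sigma_diff[0, 0]  # normalized for scale
```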

Conclusion

I discussed multinomial probit models in a discrete choice context and showed how to generate a simulated dataset accordingly. In my next post, we will use our simulated dataset and discuss estimation and interpretation of model results, which is not as straightforward as one might think.