Note: In Stata 14, two new commands for modeling proportions, fracreg and betareg, were introduced.

How do you fit a model when the dependent variable is a proportion?

Title: Logit transformation

Authors: Allen McDowell, StataCorp
         Nicholas J. Cox, Durham University, UK

A traditional solution to this problem is to perform a logit transformation
on the data. Suppose that your dependent variable is called y and your
independent variables are called X. Then, one assumes that the model that
describes y is

y = invlogit(XB)

Because invlogit(XB) = exp(XB)/(1 + exp(XB)), the odds y / (1 - y) equal
exp(XB). If one then performs the logit transformation, the result is

ln( y / (1 - y) ) = XB

We have now mapped the original variable, which was bounded by 0 and 1, to
the real line. One can now fit this model using OLS or WLS, for example
by using regress.
Of course, one cannot perform the transformation on observations where the
dependent variable is zero or one; the result will be a missing value, and
that observation would subsequently be dropped from the estimation sample.
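
As a minimal sketch of the transformation route, one might type something
like the following. The variable names y, x1, x2, logit_y, xb_hat, and y_hat
are placeholders chosen for illustration; they do not come from any
particular dataset.

```stata
* Sketch only: y, x1, and x2 are hypothetical variable names.
* ln(y/(1-y)) evaluates to missing when y is 0 or 1.
generate double logit_y = ln(y / (1 - y))

* Count the observations lost to the transformation
count if missing(logit_y) & !missing(y)

* OLS on the transformed scale; observations with missing logit_y are dropped
regress logit_y x1 x2

* Linear predictions are on the logit scale; invlogit() maps them back to (0,1)
predict double xb_hat, xb
generate double y_hat = invlogit(xb_hat)
```

The count makes explicit how many zeros and ones the transformation
discards before any model is fit.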

A better alternative is to estimate using glm with family(binomial),
link(logit), and robust; this is the method proposed by Papke and Wooldridge
(1996). At the time that article was published, Stata’s glm command could not
fit such models, and this fact is noted in the article. glm has since been
enhanced specifically to deal with fractional response data.
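
A minimal sketch of the glm route, again assuming hypothetical variable
names y, x1, and x2, might look like this. In current versions of Stata the
robust option is usually written vce(robust); the fracreg line applies only
to Stata 14 or later, as mentioned in the note at the top.

```stata
* Sketch only: y, x1, and x2 are hypothetical variable names.
* y is a proportion and may include exact zeros and ones.
glm y x1 x2, family(binomial) link(logit) vce(robust)

* In Stata 14 or later, fracreg fits a comparable fractional logit model
fracreg logit y x1 x2
```

Unlike the transformation approach, this keeps observations with y equal to
zero or one in the estimation sample.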

In either case, there may well be a substantive issue of interpretation.
Let us focus on interpreting zeros: the same kind of issue may well arise
for ones. Suppose the y variable is the proportion of days workers spend off
sick. There are two extreme possibilities. The first extreme is that all
observed zeros are in effect sampling zeros: each worker has some nonzero
probability of being off sick, and it is merely that some workers were not,
in fact, off sick in our sample period. Here, we would often want to include
the observed zeros in our analysis and the glm route is attractive.
The second extreme is that some or possibly all observed zeros must be
considered as structural zeros: these workers will not ever report sick,
because of robust health and exemplary dedication. These are extremes, and
intermediate cases are also common. In practice, it is often helpful to
look at the frequency distribution: a marked spike at zero or one may well
raise doubt about a single model fitted to all data.
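
One way to inspect the frequency distribution is sketched below, assuming
the proportion is stored in a hypothetical variable named y; the bin width
of 0.05 is an arbitrary choice for illustration.

```stata
* Sketch only: y is a hypothetical variable name.
* Histogram of the proportion, with frequencies on the vertical axis
histogram y, width(0.05) start(0) frequency

* Count the observations exactly at each boundary
count if y == 0
count if y == 1
```

A tall bar at either end of the histogram, or boundary counts that are large
relative to the sample size, would be the kind of spike described above.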

A second example might be data on trading links between countries. Suppose
the y variable is the proportion of imports from a certain country. Here a
zero might be structural if two countries never trade, say on political or
cultural grounds. A single model fit to both the zeros and the nonzeros
might not be advisable, and a different kind of model should be considered.