
Thursday, March 24, 2011

Dummies for Dummies

The Devil made me do it! It just had to happen sooner or later, and no one else seemed to be willing to bite the bullet. So, I figured it was up to me. I've written the definitive addition to the Dummies series:

O.K., I'm (uncharacteristically) exaggerating just a tad. I think the cover looks good, though; and I've even started to assemble some of the core material, as you'll see below. So what brought on this fit of enthusiasm for what at first blush might be misinterpreted as the neuronically challenged?

Well, last week I gave a talk in the Statistics seminar series in the Math. & Stats. department here at UVic, and this week I gave a similar talk in my own department's brown-bag seminar. The first of those talks was titled "Interpreting Indicator Covariates in Semi-logarithmic Regression Models". The talk for the Economics department was more succinctly called "Dummies for Dummies". The content of the two talks was pretty much the same, but I had to take into account a couple of differences in the language used by econometricians and statisticians. We say "regressor", and they say "covariate". They say "indicator variable", and we say "dummy variable"... you get the picture. There's another difference too - statisticians don't need to be cajoled into attending seminars by giving the talk a provocative (and possibly insulting) title!

On this occasion the economists noticeably self-selected, and there was a healthy turnout of the curious and homeless. Regrettably we can't afford to hand out free lunches at seminars in the way our colleagues in the Business School purport to. Perhaps it's because we know that such things don't exist! People actually turned up in spite of this. Curiosity got the better of them.

These seminars were based on a recently completed research paper of mine (Giles, 2011a). The main point of that paper is to derive the exact sampling distribution of a particular statistic that arises naturally when estimating a log-linear regression model with one or more dummy variables as regressors. The paper also shows what can go wrong if you don't do the job properly when interpreting that statistic - but more on this below.

Dummy variables are quite alluring when it comes to including them in regression models. However, they're rather special in certain ways. So, here are four things that your mother probably never taught you, but which will form the cornerstones of the forthcoming tome, Dummies for Dummies. Meanwhile, you keen users of dummy variables may want to keep them in mind.

1. Dummies in Log-Linear Models:

Interpreting a dummy variable's coefficient when the dependent variable has been log-transformed has to be undertaken with care. Trust me, the literature is full of empirical applications where the authors get it wrong, and most of the standard text books are no better. The way to interpret the coefficient of a continuous regressor in a regression model, where the dependent variable has been log-transformed, can be seen by considering the following regression model:

ln(Y) = a + bX + cD + ε . (1)

Here, X is a continuous regressor, and D is a zero-one dummy variable. The interpretation of the coefficient, b, is that it is the partial derivative of ln(Y) with respect to X. So, 100b is the percentage change in Y for a small change in X (up or down), other things held equal.

Unfortunately, lots of people (who really should know better) then apply the same "reasoning" to the interpretation of c. The trouble is, of course, that D is not continuous, so we can't differentiate ln(Y) with respect to D. The way to get the percentage effect of D on Y is pretty obvious. Curiously enough, those same people who go about this the correct way when computing marginal effects in the case of Logit and Probit models just don't seem to do it right in the present context. All we have to do is take the exponential of both sides of equation (1), then evaluate Y when D = 0 and when D = 1. The difference between these two values, divided by the expression for Y based on the starting value of D, gives you the correct interpretation immediately:

If D switches from 0 to 1, the % impact of D on Y is 100[exp(c) - 1]. (2)

If D switches from 1 to 0, the % impact of D on Y is 100[exp(-c) - 1]. (3)
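For anyone who wants the intermediate steps spelled out, here is a short derivation of (2) and (3) from equation (1). The symbols Y_0 and Y_1 are my own shorthand (not used elsewhere in this post) for the value of Y when D = 0 and D = 1, with X and ε held fixed.

```latex
\begin{align*}
Y   &= \exp(a + bX + cD + \varepsilon) \\
Y_0 &= \exp(a + bX + \varepsilon), \qquad Y_1 = \exp(a + bX + c + \varepsilon) = \exp(c)\,Y_0 \\
100\,\frac{Y_1 - Y_0}{Y_0} &= 100\,[\exp(c) - 1] \qquad \text{(D switches from 0 to 1)} \\
100\,\frac{Y_0 - Y_1}{Y_1} &= 100\,[\exp(-c) - 1] \qquad \text{(D switches from 1 to 0)}
\end{align*}
```

The two cases differ only in which value of Y sits in the denominator, and that is where the asymmetry comes from.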

Notice the asymmetry of the impacts - unlike the case of the continuous regressor. Also notice that, in general, these values will be quite different from the 100c that some of our chums insist on using. For example, if c = 0.6, the naïve econometrician will conclude that there is a 60% impact; whereas it is really an 82.2% positive impact as D changes from 0 to 1, and a 45.1% negative impact as D goes from 1 to 0! (Recalling the formula for the Taylor series expansion of exp(c) will make it really transparent why and when things go wrong by using c itself.)
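If you'd like to check these numbers yourself, here is a minimal Python sketch of formulae (2) and (3). The function name pct_impact and the grid of values for c are just my choices for illustration. It also shows the Taylor series point: since exp(c) - 1 = c + c²/2 + ..., the naive 100c is a decent approximation only when c is small.

```python
import math

def pct_impact(c, switch_on=True):
    """Exact % impact on Y of a 0/1 dummy with coefficient c in a semi-log model.

    switch_on=True  -> D goes from 0 to 1, formula (2): 100[exp(c) - 1]
    switch_on=False -> D goes from 1 to 0, formula (3): 100[exp(-c) - 1]
    """
    return 100.0 * (math.exp(c if switch_on else -c) - 1.0)

# Compare the naive 100c with the exact impacts for a few illustrative values of c.
for c in (0.05, 0.1, 0.3, 0.6):
    print(f"c = {c:4.2f}  naive = {100 * c:5.1f}%  "
          f"0->1 = {pct_impact(c):6.1f}%  1->0 = {pct_impact(c, False):6.1f}%")
```

The gap between the naive and exact figures widens quickly as c moves away from zero.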

Let me hasten to spoil your day by assuring you that this is not breaking news. This little pearl of wisdom has been around in the mainstream economics/econometrics literature for at least 30 years. Hence the "Read Your History" byline on the cover of Dummies for Dummies. Moreover, even more care has to be taken when using an estimated value of c - say after fitting model (1) using OLS. You might be tempted to simply replace c with its estimate, c*, in the formulae in (2) and (3). Not a good plan, as we've known since at least 1981! The resulting estimator of the percentage impact is then biased, in a direction that you can figure out for yourself using Jensen's inequality. A nice practical solution - one that gives an almost-unbiased estimator of the % impact of the dummy on Y - was suggested by Kennedy (1981), assuming normal errors in (1). You just have to modify the formula in (2) to become 100[exp(c* - ½v*(c*)) - 1], where v*(c*) is the estimated variance of c* - i.e., it's the square of the standard error for c*. You make the corresponding adjustment to the formula in (3), though none of the writers back around 1980 (myself included) actually observed that there are these two separate cases. If you want to be really tricky, and use the exact minimum variance unbiased estimator, I derived the formula for this in Giles (1982). However, it's really messy, and in practice adds very little to Kennedy's estimator that I've just described. My colleague, Ken Stewart, has a nice discussion of this in his excellent book, Introduction to Applied Econometrics.
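Here is a small Python sketch of Kennedy's adjustment. The estimate c* = 0.6 and standard error 0.2 are made-up numbers, and the function name is hypothetical. For the 1-to-0 case I've assumed that the "corresponding adjustment" means 100[exp(-c* - ½v*(c*)) - 1]. That is my reading, based on the fact that exp(-c*) is biased upwards for exp(-c) in just the same way that exp(c*) is for exp(c).

```python
import math

def kennedy_pct_impact(c_hat, se, switch_on=True):
    """Kennedy-style almost-unbiased estimate of the % impact of a dummy on Y.

    c_hat : estimated dummy coefficient from the semi-log regression
    se    : its standard error, so se**2 is the estimated variance v*(c*)
    The 1 -> 0 branch (switch_on=False) is an assumed analogue of the 0 -> 1 formula.
    """
    v = se ** 2
    signed = c_hat if switch_on else -c_hat
    return 100.0 * (math.exp(signed - 0.5 * v) - 1.0)

c_hat, se = 0.6, 0.2  # hypothetical estimate and standard error
plug_in = 100.0 * (math.exp(c_hat) - 1.0)   # simply substituting c* into (2)
adjusted = kennedy_pct_impact(c_hat, se)    # with the half-variance correction

print(f"plug-in: {plug_in:.1f}%   Kennedy: {adjusted:.1f}%")
```

The correction always pulls the estimate down relative to the plug-in value, and it matters more the less precisely c is estimated.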

So, this is something to think about the next time you're fitting a log-linear regression. If you want to go further than this, and worry about matters beyond point estimation - such as confidence intervals and the like - then you'll be thrilled to know that the sampling distribution of Kennedy's almost unbiased estimator is nowhere near normal. So be even more careful in this case, and maybe even read the paper on which my seminars were based.

2. Dummies That Take Only One Non-Zero Value:

Alright, now here's another trap for young players. I'll keep it really brief. You probably know already that if you have a dummy variable that is zero for all but one of the sample values, then your OLS estimates of the regression model's coefficients will be identical to those that you'd get if you simply dropped the "special" observation (for which the dummy is non-zero) from the regression altogether. I often set the proof of this as an exercise for my students. In addition, the residual for that one special observation will be exactly zero.

So, be careful how you interpret your OLS results if you choose to use such a dummy variable! I'm not saying that you shouldn't do so. In fact, the standard error for the estimated coefficient on the dummy variable is of some interest. It enables you to test if that observation makes a significant contribution. You could use this information to test if an apparent "outlier" in the sample is having a statistically significant impact on your estimated model.
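If you'd rather see the OLS result in action than prove it, here is a minimal sketch in plain Python. The six data points are invented, and the little ols helper (normal equations solved by Gauss-Jordan elimination) is only a toy written for this illustration, not a recommended way to run regressions.

```python
def ols(rows, y):
    """OLS coefficients from the normal equations (X'X)b = X'y.

    Solved by Gauss-Jordan elimination with partial pivoting; fine for a toy
    example, not meant as a numerically robust general-purpose routine.
    """
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * w for u, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Invented data; the fifth observation looks like an outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 15.0, 12.1]
special = 4  # index of the observation that gets its own dummy

# Intercept, x, and a dummy that is 1 only for the special observation.
full_rows = [[1.0, xi, 1.0 if i == special else 0.0] for i, xi in enumerate(x)]
a_full, b_full, c_full = ols(full_rows, y)

# Same model without the dummy, after dropping the special observation.
keep = [i for i in range(len(x)) if i != special]
a_drop, b_drop = ols([[1.0, x[i]] for i in keep], [y[i] for i in keep])

# Residual for the special observation in the model that includes the dummy.
resid_special = y[special] - (a_full + b_full * x[special] + c_full)

print(a_full, a_drop)
print(b_full, b_drop)
print(resid_special)
```

Up to rounding error, the intercept and slope agree across the two regressions and the special observation's residual is zero.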

Did you know, however, that this same result holds for lots of other estimation methods, beyond least squares? You won't find it discussed in your textbook, but it's something that is proven and discussed in another recent paper of mine (Giles, 2011b). More specifically, the above result relating to the use of single-valued dummy variables also holds for GMM estimation; any generalized IV estimator (including 2SLS and LIML); the MLE for any of the standard count-data models, such as Poisson, Negative Binomial and Exponential; and even for quantile regression.

3. Inconsistency of OLS When the Number of Non-Zero Values is Fixed:

I'll bet you didn't know that for many of the situations where you estimate a regression model with a dummy variable in it, the estimator of that variable's coefficient is inconsistent. This has nothing to do with random regressors, measurement error or omitted variables. The model can meet all of the usual "textbook assumptions". Guess what else? The problem I'm alluding to arises not just with OLS estimation, but also with any generalized instrumental variables (IV) estimator. And that's not all! The estimator of that coefficient has a non-normal sampling distribution - even for an infinite sample size! The asymptotic distribution is horribly skewed to the right, so this is really going to cause strife if you try to construct confidence intervals or test hypotheses about the dummy's coefficient, but ignore this fact. Remember - this is an asymptotic result, so it doesn't get any better even if you have a huge sample of data.

What on earth is this all about, and why didn't your mom warn you?

Well, notice that I said "....for many of the situations...". So this problem doesn't always arise. Also notice that I was referring only to the coefficient(s) of the dummy variable regressor(s) - not to estimators of the coefficients of the "regular" (measured) regressors in the model. Everything is just fine in their case. So what are these "...many situations.."? You probably won't like the answer to this, because unfortunately these are situations you'll have met many, many times - they're really common, and rather interesting. In a nutshell any time that the dummy variable takes a non-zero (usually unit) value for a finite and fixed number of observations, then the usual asymptotics don't apply and you get the problems I've just mentioned. Of course, the situation of OLS estimation when there is just a single non-zero value for the dummy variable in the sample is a special example of this, and this case is discussed by Hendry and Santos (2005). It doesn't seem to be widely known, however. I provide the generalization from one observation to any finite number of observations; and from OLS to IV estimation in my recent paper, Giles (2011c).

So, consider the following situation, for example:

We want to fit a regression using a sample of data that covers the period 1940 to 1980, and we notice that there is an obvious structural break corresponding to the period of the 2nd World War - 1939 to 1945. So, when we estimate our regression model we include a dummy variable (either to shift the intercept, or multiplicatively to shift one or more of the slope parameters), and this dummy variable is zero except for the 7 years, 1939 to 1945 inclusive. Now, we can't re-write the history books, more's the pity. So, no matter how much more data were to become available, before 1940 or since 1980, our dummy variable will always have just 7 non-zero values. When we look at the coefficient of that dummy variable, the OLS estimator will still be "Best Linear Unbiased" (under our otherwise standard assumptions), but it will be inconsistent. It will be very unreliable even with an infinitely large sample size. We should also be really careful about constructing confidence intervals or tests relating to this coefficient, because of the non-normality of the sampling distribution for this particular OLS estimator, which persists even asymptotically.
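To get some intuition for the inconsistency, take the simplest case I can think of (my own simplification, not the general setting of the paper): a model with just an intercept and one dummy, y = a + cD + e, with i.i.d. errors of variance sigma2, where D = 1 for m observations out of n. The OLS estimator of c is then the difference between the two group means, so its variance is sigma2(1/m + 1/(n - m)). The sketch below, with hypothetical function names, shows that this doesn't go to zero as n grows when m stays fixed at 7.

```python
def var_dummy_coef(n, m, sigma2=1.0):
    """Variance of the OLS dummy coefficient in y = a + c*D + e.

    Assumes i.i.d. errors with variance sigma2 and a dummy equal to 1 for
    exactly m of the n observations (a deliberately simple special case).
    """
    return sigma2 * (1.0 / m + 1.0 / (n - m))

def var_sample_mean(n, sigma2=1.0):
    """Variance of an ordinary sample mean, shown for comparison."""
    return sigma2 / n

m = 7  # e.g. seven wartime years, however long the sample becomes
for n in (41, 100, 1_000, 1_000_000):
    print(f"n = {n:>9,}  var(c*) = {var_dummy_coef(n, m):.6f}  "
          f"var(mean) = {var_sample_mean(n):.6f}")
```

The variance of c* settles down at sigma2/m rather than shrinking to zero, which is why piling on more observations doesn't help.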

How many times have you seen empirical studies, perhaps using thousands of observations, where dummy variables of the type I've mentioned appear as regressors? Lots, I'll bet. Those large samples are not much help at all in this case, and you should be skeptical when the authors get all excited about the interpretation of the coefficients of their dummy variables. These numbers mean very little at all!

4. The Perils of Using Seasonal Dummy Variables:

Finally, ask yourself: "How many times have I estimated an OLS regression model using quarterly time-series data, and included seasonal dummy variables to deal with the observed seasonality in the dependent variable?" (Probably more times than you can recall.) Now ask yourself: "What on earth had I been inhaling?" (Don't answer that if you don't want to. Just send me an email and I promise - nudge, nudge - it won't go viral.)

Now, don't panic - I'm not about to launch into a boring little homily about the "dummy variable trap". Here's the thing. Do you recall the Frisch-Waugh Theorem? It was actually published in volume 1 of Econometrica, would you believe! In the context of our seasonal dummy variables this theorem tells us the following, as was pointed out by Lovell (1963). Suppose that we estimate the following regression model by OLS, where the Si's are the quarterly seasonal dummy variables:

Y = a + bX + c1S1 + c2S2 + c3S3 + e . (4)

Let b* be the OLS estimator of b.

Now, suppose that we decide to "seasonally adjust" the Y data by "explaining" the seasonal component in that variable using the seasonal dummies, and then eliminating that part of the series. So, we fit an OLS regression:

Y = a + c1S1 + c2S2 + c3S3 + v , (5)

and then treat the residuals as the seasonally adjusted Y series, Ysa. We do the same sort of thing to "seasonally adjust" the X series. We fit the OLS regression:

X = a' + c'1S1 + c'2S2 + c'3S3 + u , (6)

and treat these residuals as the seasonally adjusted series, Xsa. Finally, we regress Ysa on Xsa:

Ysa = a" + b"Xsa + e' . (7)

The Frisch-Waugh-Lovell Theorem tells us that the OLS estimator of b" in (7) will be identical to the OLS estimator of b, namely b*, in (4).

This is a purely algebraic result - it doesn't rely on any "statistics" per se, and it certainly doesn't rely on any assumptions about the random errors in any of the fitted models. In addition, it doesn't even require that OLS estimation be used throughout. I showed some years ago (Giles, 1984) that the same results emerge if you replace the OLS estimator with any IV estimator.
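Because the OLS version of the result is purely algebraic, it is easy to check numerically. Here is a minimal sketch in plain Python that runs regressions (4) to (7) on simulated quarterly data. The data-generating numbers, the seed, and the toy ols helper are all my own choices for illustration.

```python
import random

def ols(rows, y):
    """OLS coefficients from the normal equations, via Gauss-Jordan (toy solver)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * w for u, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def residuals(rows, y):
    """Residuals from an OLS regression of y on the columns in rows."""
    beta = ols(rows, y)
    return [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(rows, y)]

random.seed(0)
n = 40  # ten years of simulated quarterly data
# S1, S2, S3: dummies for the first three quarters (the fourth is the base).
season = [[1.0 if t % 4 == q else 0.0 for q in range(3)] for t in range(n)]
x = [0.5 * (t % 4) + random.gauss(0, 1) for t in range(n)]
y = [1.0 + 2.0 * x[t] + 0.8 * (t % 4) + random.gauss(0, 1) for t in range(n)]

# Regression (4): Y on a constant, X and the seasonal dummies.
b_star = ols([[1.0, x[t]] + season[t] for t in range(n)], y)[1]

# Regressions (5) and (6): "seasonally adjust" Y and X using the dummies alone.
dummies_only = [[1.0] + season[t] for t in range(n)]
y_sa = residuals(dummies_only, y)
x_sa = residuals(dummies_only, x)

# Regression (7): adjusted Y on a constant and adjusted X.
b_double_prime = ols([[1.0, xs] for xs in x_sa], y_sa)[1]

print(b_star, b_double_prime)
```

The two slope estimates agree to within rounding error, whatever seed or parameter values you pick.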

What you need to be aware of is that this is not just a rather quaint little result. The implications of what we've just seen are actually quite important. Let's see why this is. First, if we fit a regression with regular data and seasonal dummy variables, this is equivalent to "seasonally adjusting" all of the data (Y and X). Second, the variables have all been effectively "seasonally adjusted" in exactly the same way, which is totally unrealistic - this is not what happens when our statistical agencies seasonally adjust time-series using the Census X-12-ARIMA method (which you can download for free, and is a standard feature in EViews, if you use that package). Third, the data have not really been seasonally adjusted at all, because no account has been taken of the other components of the time-series, Y and X. In general, they will have trend and cyclical components that need to be taken into account, properly, and differently for each series, as is done when the X-12-ARIMA method is used.

So the bottom line is that including seasonal dummy variables makes sense only if: (a) you think that the dependent variable and all of the regressors in your model have a simple additive seasonal component; and (b) you don't think they have any trend or cyclical components! When could you last put your hand on your heart and swear that this was the case in practice?

Anyway, I hope that this sneak preview will whet your appetite somewhat, and I look forward to receiving the flood of orders for Dummies for Dummies when it rolls off the presses. You'll be the first to know - trust me!

Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.

Giles, D. E. (2011c). On the inconsistency of instrumental variables estimators for the coefficients of certain dummy variables. Econometrics Working Paper EWP1106, Department of Economics, University of Victoria.

This blog is an awesome help, but, probably due to the shallowness of my statistical knowledge, I can't find the answer to an urgent problem of mine: I've got a regression analysis in which one independent variable virtually works as a dummy variable. The variable is called "GDPleader", but since the USA is the permanent leader in this regression it's just a dummy for the USA. In the linear model the variable is positive and significant, but after the logit transformation it turns out to be negative and not significant anymore. The authors of the paper (Chinn and Frankel 2005) don't comment on this at all, but I really want to know what's happening. Thank you!

Luca - it's hard to tell without seeing the data. However, it's an interesting result because all too often we see people saying "I just used OLS because the results are basically the same as after a logistic transformation". In the case you cite, they are fundamentally different.

Are there similar traps we should be looking out for with interpreting coefficients on dummy variables in quantile regression models, with or without log dependent variables?

For example, suppose I'm doing median regression looking at a binary treatment T and have one binary covariate (e.g. male / female). If I run med(y|T,F) = b1*T + b2*F + b3*(TxF), can we do the usual additive thing and say that the effect on the median for women is b1+b3?

Bert - an interesting question. Yes, (b1 + b3) should be interpreted as you suggest in this case. This is still differences-in-differences, but the response is the median, not the mean.

In the second case, taking logs for the dependent variable could be motivated by a desire to have the usual regression coefficients measure RELATIVE changes, rather than level changes. Here, those relative changes would be with respect to the median, not the mean, and (b1+b3) would be interpreted as such.

Thank you for an excellent post. Only recently I became aware of the need to transform the dummy coefficients.

Maybe I missed some reference, but it seems that textbooks don't mention it: for example, Wooldridge's graduate textbook does not mention it, and his undergraduate textbook mentions only the "naive" (100[exp(c) - 1]) transformation. He also seems to argue that because the impact of going from D=0 to D=1 is different from that of going from D=1 to D=0, as you mentioned in the text, only the OLS beta should be reported or commented on. I found nothing on the topic in "Mostly Harmless Econometrics", or in Cameron and Trivedi's (2005) "Microeconometrics - Methods and Applications". It definitely was never mentioned in my graduate econometrics classes.

I'm curious to get an estimate of how often applied empirical papers make the transformation when interpreting the coefficients of dummy variables. From memory I could not think of many. For instance:

- Krueger's QJE 1993 paper on computers and the wage structure does the "naive" transformation (http://flash.lakeheadu.ca/~mshannon/Krueger_QJE93.pdf);
- in Fryer 2011, a recent Handbook of Labor Economics chapter on racial inequality in the US, no transformation is made when interpreting results.

Do you have statistics on this?

In any case, it seems that there is still little awareness of this.

There is this thread at the EJMR: http://www.econjobrumors.com/topic/reporting-dummy-coeficients-in-log-linear-models .

I would like to know your opinion on two topics:

1) Do you agree with the suggestion in the thread that the regression table should have the original OLS estimates and that the transformed coefficients should only be mentioned in the text?

2) The last post in the EJMR thread points to a literature on how OLS log-linear models can be biased if errors are heteroskedastic or heavy-tailed, which is quite common in the data. And in that case the Kennedy transformation does not hold (as you say, it requires normality). What to do in those cases?

Thanks for the comment and questions. I agree that this point is generally overlooked - one reason why I talked about it in this post! I certainly don't have any stats. along the lines you asked about. The thread you pointed me to was interesting. To answer your other questions:

1. I do agree with this suggestion.

2. I was not previously aware of the 3 articles referred to. But then, I never read "Journal of Health Economics". :-) I will certainly read the papers though. Perhaps after doing so I'll have some suggestions as to what to do in the case(s) you mention. Sorry I don't have a quick answer.

That's fine. They have in mind a model in which the error enters multiplicatively:

Y = A (X^B)(DUM^c)exp(e) ; where DUM = 1 or e

Then LnY = a + BLnX + cLn(DUM) + e, where a = Ln(A).

Notice that D = Ln(DUM) = 0 or 1, as required.

I am so thankful to have found your great blog, Prof Giles! I am doing a regression using panel data on 172 regencies from 2001-2012. Based on some arguments, I did the unit root test (because my time-series dimension is long enough). I found that one of my dummy variables wasn't stationary. Can I include it in my model, or should I drop it?

Would you please recommend me some references--journal or article? Thank you.

Hi Dave -- great post, thanks. I was wondering if I need to be careful in interpreting proportions variables, as well? I have three categories and the proportions of each sum to 1, so I include only two of the proportions. I also include squared terms for those two proportions. The model is log-linear. Any pitfalls I should be aware of?