Dummy Variables



Dummy variable coding is the technique of
using dichotomous variables (each coded 0 or 1)
to represent the separate categories of a
nominal-level measure.
The term “dummy” appears to refer to the
fact that the presence of the trait indicated by
the code of 1 represents a factor or collection
of factors that are not measurable by any
better means within the context of the
analysis.
Coding of dummy Variables

Take, for instance, the race of the respondent
in a study of voter preferences:

Race coded White (0) or Black (1)

A whole set of factors may differ, and is even likely
to differ, between voters of different races:

Income, socialization, experience of racial discrimination,
attitudes toward a variety of social issues, feelings of
political efficacy, etc.
Since we cannot measure all of those differences
within the confines of the study we are doing, we
use a dummy variable to capture these effects.
Multiple categories



Now picture race coded White (0), Black (1),
Hispanic (2), Asian (3) and Native American (4).
If we put this variable into a regression
equation, the results will be nonsense, because
regression treats the codes as numbers: it
assumes at least ordinal-level data, with
approximately equal differences between
adjacent categories.
Regression using a nominal variable with 3 (or
more) categories yields uninterpretable and
meaningless results.
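To sketch why this fails, suppose a single slope is estimated on the 0 to 4 code. The symbol $B$ below is a hypothetical coefficient used only for this illustration. The fitted value for each group would then be:

$$
\begin{aligned}
\hat{Y}_{\text{White}} &= a \\
\hat{Y}_{\text{Black}} &= a + B \\
\hat{Y}_{\text{Hispanic}} &= a + 2B \\
\hat{Y}_{\text{Asian}} &= a + 3B \\
\hat{Y}_{\text{Native American}} &= a + 4B
\end{aligned}
$$

The model forces the five groups onto evenly spaced steps in the order of their arbitrary code numbers, and recoding the categories in a different order would change the estimate of $B$.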
Creating Dummy variables

The simple case of race is already coded correctly

Black: coded 0 for white and 1 for black


Note the coding can be reversed and leads only to changes in
sign and direction of interpretation.
The complex nominal version turns into 5 variables:





White: coded 1 for Whites and 0 for non-Whites
Black: coded 1 for Blacks and 0 for non-Blacks
Hispanic: coded 1 for Hispanics and 0 for non-Hispanics
Asian: coded 1 for Asians and 0 for non-Asians
AmInd: coded 1 for Native Americans and 0 for non-Native Americans
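The recoding can be sketched in a few lines of Python. This is only an illustration: the respondent list is made up, and the names `race`, `categories`, and `dummies` are assumptions, not part of the original slides.

```python
# Hypothetical sample of respondents; the category labels follow the slide.
race = ["White", "Black", "Hispanic", "Asian", "AmInd", "Black", "White"]

categories = ["White", "Black", "Hispanic", "Asian", "AmInd"]

# Build one 0/1 indicator list per category.
dummies = {cat: [1 if r == cat else 0 for r in race] for cat in categories}

for cat in categories:
    print(f"{cat:<9}", dummies[cat])

# Each respondent belongs to exactly one category, so the five
# indicators for any respondent sum to 1.
row_sums = [sum(dummies[cat][i] for cat in categories) for i in range(len(race))]
```

Because every row sums to 1, the five indicators together carry the same information as a constant column, which is why one of them has to be left out of a regression that includes an intercept.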
Regression with Dummy Variables

The dummy variable is then added to the regression
model:
$Y_i = a + B_1 X_i + B_2 \text{Black}_i + e_i$

Interpretation of the dummy variable is usually quite
straightforward.


The intercept term represents the intercept for the omitted
category
The slope coefficient for the dummy variable represents the
change in the intercept for the category coded 1 (blacks)
Regression with only a dummy

When we regress a variable on only the
dummy variable, we obtain estimates of
the group means of the dependent variable.
$Y_i = a + B_1 \text{Black}_i + e_i$

a is the mean of Y for Whites and a + B1
is the mean of Y for Blacks.
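A minimal sketch with made-up numbers illustrates this. The data and the names `black`, `y`, `a`, and `b1` are hypothetical, and the fit uses the textbook one-predictor least-squares formulas rather than any particular statistics package.

```python
from statistics import mean

# Hypothetical data: black is the 0/1 dummy, y is the dependent variable.
black = [0, 0, 0, 1, 1, 1, 1]
y = [10.0, 12.0, 14.0, 5.0, 7.0, 6.0, 10.0]

# Ordinary least squares with a single predictor:
#   slope = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar) ** 2)
#   intercept = y_bar - slope * x_bar
x_bar, y_bar = mean(black), mean(y)
sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(black, y))
sxx = sum((x - x_bar) ** 2 for x in black)
b1 = sxy / sxx
a = y_bar - b1 * x_bar

# Group means computed directly, for comparison with the regression.
mean_white = mean([v for x, v in zip(black, y) if x == 0])
mean_black = mean([v for x, v in zip(black, y) if x == 1])

print("a      =", a, " mean for Whites =", mean_white)
print("a + b1 =", a + b1, " mean for Blacks =", mean_black)
```

Up to floating-point rounding, the intercept matches the mean for the group coded 0 and the intercept plus the slope matches the mean for the group coded 1.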
Omitting a category





When we have a single dummy variable, we have
information for both categories in the model.
Also note that
White = 1 - Black
Thus having both a dummy for Whites and one for Blacks is
redundant.
As a result, we always omit one category, whose
intercept is the model’s intercept.
This omitted category is called the reference category


In the dichotomous case, the reference category is simply the
category coded 0
When we have a series of dummies, the reference category is
likewise the category whose dummy is omitted from the model.
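One way to see the redundancy is to write out a hypothetical model that includes both dummies and then substitute White = 1 - Black. The coefficient $B_2$ on White is introduced only for this illustration.

$$
\begin{aligned}
Y_i &= a + B_1 \text{Black}_i + B_2 \text{White}_i + e_i \\
    &= a + B_1 \text{Black}_i + B_2 (1 - \text{Black}_i) + e_i \\
    &= (a + B_2) + (B_1 - B_2)\,\text{Black}_i + e_i
\end{aligned}
$$

Only the two combinations $a + B_2$ and $B_1 - B_2$ are determined by the data, so infinitely many values of $a$, $B_1$, and $B_2$ fit equally well. Dropping one dummy removes the ambiguity.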
Suggestions for selecting the
reference category



Make it a well-defined group; ‘other’ or an
obscure category (low n) is usually a poor choice.
If there is some underlying ordinality in the
categories, select the highest or lowest
category as the reference (e.g., among blue-collar,
white-collar, and professional, use blue-collar or
professional).
It should have an ample number of cases. The
modal category is also often a good choice.
Multiple dummy Variables

The model for the full dummy variable
scheme for race is:
$Y_i = a + B_1 X_i + B_2 \text{Black}_i + B_3 \text{Hispanic}_i + B_4 \text{Asian}_i + B_5 \text{AmInd}_i + e_i$

Note that the dummy for White has been
omitted, and the intercept a is the intercept
for Whites.
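Setting each dummy to 1 in turn, with all the others at 0, gives one fitted line per group. Written out as a sketch, with $\hat{Y}$ denoting the fitted value:

$$
\begin{aligned}
\text{White:} \quad & \hat{Y} = a + B_1 X \\
\text{Black:} \quad & \hat{Y} = (a + B_2) + B_1 X \\
\text{Hispanic:} \quad & \hat{Y} = (a + B_3) + B_1 X \\
\text{Asian:} \quad & \hat{Y} = (a + B_4) + B_1 X \\
\text{AmInd:} \quad & \hat{Y} = (a + B_5) + B_1 X
\end{aligned}
$$

The lines are parallel, sharing the slope $B_1$, and each dummy coefficient is the vertical distance between that group's line and the line for Whites.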
Tests of Significance


With dummy variables, the t test for a
coefficient tests whether that category is
different from the reference category, not
whether the category itself is different from 0.
Thus if a = 50 and B1 = -45, the
value for Blacks (a + B1 = 5) might not be
significantly different from 0, while the
value for Whites (a = 50) is significantly
different from 0.
Interaction terms


When the research hypotheses state that
different categories may have differing
responses to other independent variables, we
need to use interaction terms.
For example, race and income interact with
each other so that the relationship between
income and ideology is different (stronger or
weaker) for Whites than Blacks.
Creating Interaction terms

Creating an interaction term is easy:

Multiply the category dummy by the independent variable.
With Race coded 0 for Whites and 1 for Blacks, the full model is thus:
$Y_i = a + B_1 \text{Race}_i + B_2 \text{Income}_i + B_3 (\text{Race}_i \times \text{Income}_i) + e_i$

a is the intercept for Whites;
(a + B1) is the intercept for Blacks;
B2 is the slope for Whites; and
(B2 + B3) is the slope for Blacks.
The t tests for B1 and B3 test whether the intercept and slope
for Blacks differ from those for Whites (a and B2).
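The sketch below builds an interaction term and fits the full model on made-up, noise-free data, so the four coefficients can be checked against the values used to generate the data. Everything here is an assumption for illustration: the numbers, the variable names, and the small `solve` and `ols` helpers, which solve the normal equations in plain Python rather than calling a statistics package.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * v for row, v in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical, noise-free data:
#   Whites (race = 0): ideology = 2.0 + 0.5 * income
#   Blacks (race = 1): ideology = 3.0 + 0.2 * income
race = [0, 0, 0, 0, 1, 1, 1, 1]
income = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
ideology = [2.0 + 0.5 * x if r == 0 else 3.0 + 0.2 * x for r, x in zip(race, income)]

# The interaction term is the product of the dummy and the other variable.
interaction = [r * x for r, x in zip(race, income)]

# Design matrix columns: constant, Race, Income, Race * Income.
X = [[1.0, r, x, rx] for r, x, rx in zip(race, income, interaction)]
a, b1, b2, b3 = ols(X, ideology)

print("intercept for Whites (a):      ", a)
print("intercept for Blacks (a + b1): ", a + b1)
print("slope for Whites (b2):         ", b2)
print("slope for Blacks (b2 + b3):    ", b2 + b3)
```

Since the data contain no noise, the fit recovers the generating values up to floating-point rounding: the intercept and slope for Whites come from `a` and `b2` alone, while those for Blacks require adding `b1` and `b3`.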
Separating Effects



The literature is unclear on how to fully
interpret interaction effects.
There is multicollinearity between a
dummy and its interaction terms, and
also with the regular independent variable.
It is suggested that you do not use a
model with interaction terms and no
intercept!