
Working With Dummy Variables

Why use dummies?

Regression analysis is used with numerical variables. Results only have a valid interpretation if it makes sense to assume that having a value of 2 on some variable does indeed mean having twice as much of something as a 1, and having a 50 means having 50 times as much as a 1.

However, social scientists often need to work with categorical variables in which the different values have no real numerical relationship with each other. Examples include variables for race, political affiliation, or marital status.
If you have a variable for political affiliation with possible responses including Democrat, Independent, and Republican, it obviously doesn't make sense to assign values of 1 - 3 and interpret that as meaning that a Republican is somehow three times as politically affiliated as a Democrat.

The solution is to use dummy variables: variables with only two values, zero and one. It does make sense to create a variable called "Republican" and interpret it as meaning that someone assigned a 1 on this variable is Republican and someone with a 0 is not.
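
As a minimal sketch of this idea, the Python snippet below turns a political affiliation variable into a single "Republican" dummy. The responses and the variable names are made up for illustration; they are not from any real data set.

```python
# Hypothetical survey responses (illustrative values only).
affiliation = ["Democrat", "Republican", "Independent", "Republican"]

# The dummy is 1 if the respondent is Republican and 0 otherwise.
republican = [1 if a == "Republican" else 0 for a in affiliation]

print(republican)  # [0, 1, 0, 1]
```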

Nominal variables with multiple levels

If you have a nominal variable that has more than two levels, you need to
create multiple dummy variables to "take the place of" the original nominal
variable. For example, imagine that you wanted to predict depression from year
in school: freshman, sophomore, junior, or senior. "Year in school" has more
than two levels.

What you need to do is to recode "year in school" into a set of
dummy variables, each of which has two levels. The first step in this process is
to decide the number of dummy variables. This is easy; it's simply k-1, where k
is the number of levels of the original variable.

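You could also create dummy variables for all levels in the original variable, and simply drop one from each analysis. One must be dropped because, in a model with an intercept, the full set of dummies always sums to 1 for every case and is therefore perfectly collinear with the constant term.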

In this instance, we would
need to create 4-1=3 dummy variables. In order to create these variables, we are
going to take 3 of the levels of "year in school", and create a variable
corresponding to each level, which will have the value of yes or no (i.e., 1 or
0). In this instance, we can create variables called "sophomore," "junior," and
"senior." Each instance of "year in school" would then be recoded into a value
for "sophomore," "junior," and "senior." If a person were a junior, then
"sophomore" would be equal to 0, "junior" would be equal to 1, and "senior"
would be equal to 0.
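
The recoding just described can be sketched in a few lines of Python. This is only an illustration: the function name dummy_code and its reference argument are hypothetical, not part of any statistics package.

```python
LEVELS = ["freshman", "sophomore", "junior", "senior"]  # k = 4 levels


def dummy_code(value, levels=LEVELS, reference="freshman"):
    """Return the k-1 dummies for one observation as a dict.

    The reference level gets no dummy of its own, so an observation at
    the reference level is coded 0 on every dummy.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {lvl: int(value == lvl) for lvl in levels if lvl != reference}


print(dummy_code("junior"))
# {'sophomore': 0, 'junior': 1, 'senior': 0}
print(dummy_code("freshman"))
# {'sophomore': 0, 'junior': 0, 'senior': 0}
```

A freshman is identified by having a 0 on all three dummies, which is why no fourth variable is needed.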

Interpreting results

The decision as to which level is not coded is often
arbitrary. The level which is not coded is the category to which all other
categories will be compared. For this reason, the largest group is often chosen
as the not-coded category. For example, "Caucasian" will often be the not-coded
group if that is the race of the majority of participants in the sample. In that case,
if you have a variable called "Asian", the coefficient on the "Asian" variable in your regression will show the effect that being Asian rather than Caucasian has on your dependent variable.

In our example,
"freshman" was not coded so that we could determine if being a sophomore,
junior, or senior predicts a different depressive level than being a freshman.
Consequently, if the variable "junior" was significant in our regression, with
a positive beta coefficient, this would mean that juniors are significantly more
depressed than freshmen. Alternatively, we could have decided to not code
"senior," if we thought that being a senior is qualitatively different from
being of another year.
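
To make the comparison with the not-coded group concrete, here is a small Python sketch that fits an ordinary least squares regression of depression on the three dummies. The depression scores are invented for illustration, and the helper solve is a hypothetical hand-written routine rather than a library function. The sketch estimates coefficients only; it does not compute standard errors or significance tests, which a statistics package would report.

```python
# Invented depression scores for each year in school (illustration only).
scores = {
    "freshman": [10, 12, 14],   # mean 12
    "sophomore": [11, 13, 15],  # mean 13
    "junior": [15, 17, 19],     # mean 17
    "senior": [9, 12, 12],      # mean 11
}
dummies = ["sophomore", "junior", "senior"]  # "freshman" is not coded

# Build the design matrix: a constant term plus the three dummies.
X, y = [], []
for year, values in scores.items():
    for value in values:
        X.append([1] + [int(year == d) for d in dummies])
        y.append(value)


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Normal equations: (X'X) b = X'y.
k = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
coef = solve(xtx, xty)

for name, b in zip(["intercept"] + dummies, coef):
    print(f"{name}: {b:.2f}")
```

With only the dummies in the model, the intercept equals the mean of the not-coded group (12 for freshmen here), and each coefficient equals that group's mean minus the freshman mean. The coefficient on "junior" is 17 - 12 = 5, so in this made-up sample juniors score 5 points higher than freshmen.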