Introduction to Research Methods in Political Science:
The POWERMUTT* Project
(for use with SPSS)
*Politically-Oriented Web-Enhanced Research Methods for Undergraduates: Topics & Tools. Resources for introductory research methods courses in political science and related disciplines.

This topic describes regression analysis, which, for reasons that
will be explained shortly, is also called ordinary least squares (OLS). It is called regression analysis because Francis Galton
(1822-1911), a pioneer in the application of OLS to the behavioral sciences,
used it to study “regression toward the mean.”[2]
Regression analysis is a simple but extremely powerful technique with a wide
variety of applications. It also forms the basis for many other
techniques in intermediate and advanced research methods courses. To use regression analysis appropriately, all variables must be at least interval, though, as we will see, dichotomous variables constitute a special case that may seem to violate this rule but really does not.

To help us understand regression analysis,
we will try to explain why people in some states identify themselves more with
the Republican Party (and less with the Democratic Party) than do people in some other states. Our measure of party identification is a
scale derived from analysis of CBS/New York Times polls by Gerald Wright et al. The data used here are from 1999 through 2003. The scale has a theoretical range from 0
(a completely Democratic state) to 100 (an all GOP state).

Scatterplots

The philosopher and mathematician René
Descartes (1596-1650) famously wrote, “I think, therefore I am.” One of
the things he thought about was coordinate graphs. In his honor, the locations of points on such a graph are sometimes referred to as
“Cartesian coordinates.”[3] The graph consists of a horizontal
(X) axis, sometimes called the “abscissa,” and a vertical (Y) axis, sometimes
called the “ordinate.” You can think of the X axis as being similar to
the columns of a contingency table, and the Y axis as similar to the rows,
except that, in testing a hypothesis, the independent variable should always be placed on the X
axis, and the dependent variable should always be placed on the Y
axis. (When you are using a scatterplot for purely descriptive purposes, it doesn't matter which variable is on the horizontal and which is on the vertical axis. This would be the case, for example, if you wanted to compare two different measures of the voting record of members of Congress.) Each case is represented by a point on the graph based on its
values for X and Y. Taken
together, the points form a scatterplot (or scattergram, or scatter
diagram). In the following figure, each point represents a state. We will begin by examining the perhaps obvious hypothesis that the
more conservative the people of a state are, the more they will identify with the
Republican Party. The independent
variable, ideology, is also derived from Wright et al., and uses a scale on which 0 is most liberal and
100 is most conservative.

Regression
Equations

Notice that the points in the scatterplot
form a pattern. As the value
of the independent variable increases, the value of the dependent variable
tends to increase as well. Insofar as it
increases at a constant rate, the scatterplot will tend to form an upwardly
sloping straight line. Conversely,
insofar as the values of one variable decrease as the other increases, the scatterplot will tend to form a downwardly sloping straight line. The line of best fit (also called
the regression line) is the straight line
that, loosely speaking, passes through the "middle" of the scatterplot.
More precisely, it is the one with the smallest variance of points about the
line. Recall that variance is the mean squared deviation. The
best fitting line, in other words, is the one with the least sum of squared
deviations between the line and the points on the graph.

In high school geometry, you probably
learned that the general formula for a straight line is: Y = mX + b.

Statisticians use slightly different
symbols, using “a” instead of “b,” “b” instead of “m,” and reversing the order
of the two terms on the right side of the equation, thus producing: Y = a + bX.

In this formula, “a” is the Y intercept
(the value of Y when X = 0), and “b” is the slope. The former usually
has little or no theoretical importance, and would often lead to absurd
conclusions if interpreted. The “b” coefficient (also called the regression
coefficient) is very important. It tells us the nature of the linear
relationship between the dependent and independent variables: the increase or
decrease in the dependent variable that, all else being equal, can be expected from an increase of one unit in the value of the independent
variable.

This general equation describes any
straight line. Using appropriate formulas,[4] we solve for the values of “a” and “b”
that produce the least squares equation for our data. We obtain the results shown here:

We’ll return to the “model summary” and the “ANOVA” later. For now, we are interested only in the “B”
column under “unstandardized coefficients” in the “coefficients” table. The “constant” (20.219) is the Y intercept,
while the number underneath it (.493) is the slope. Rewriting these numbers in standard algebraic
form, we obtain: Y' = 20.219 + .493X. Note that the dependent
variable in the equation is shown as Y' (Y prime), the “predicted”[5] value of Y. This means that,
all else being equal, we would “predict” that a given case will fall exactly on
the line (the coordinates represented by multiplying a given value of X by
.493 and then adding 20.219). Arkansas, for example, has an ideology score of 62.92.
Plugging this value into the equation, we predict a party id score for
this state of 51.24 (that is, with Republicans enjoying a slight edge over Democrats).

In fact, the party identification score
for Arkansas is 39.06 (that is, showing a decided advantage
for the Democrats). The least squares
regression line, even though it is the best fitting straight line, is far from
a perfect fit to the data. If it were, all the points in the graph would
fall exactly on the line. (The deviations between the actual points and
where they would fall on the line are called the residuals.)
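The prediction and residual for Arkansas can be checked with a few lines of code. A quick sketch in Python, using the intercept and slope reported above:

```python
# Bivariate equation from the SPSS output: Y' = 20.219 + .493X
a, b = 20.219, 0.493

ideology_arkansas = 62.92              # Arkansas's ideology score
predicted = a + b * ideology_arkansas  # predicted party id

actual = 39.06                         # Arkansas's actual party id score
residual = actual - predicted          # negative: less Republican than predicted

print(round(predicted, 2))             # 51.24
print(round(residual, 2))              # -12.18
```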

The following figure repeats the scatterplot
shown above, but this time the regression line has been added.

In
the next figure, each point has been labeled with the name of the state, and points have been coded by region. Note that several southern states have
negative residuals: they seem to identify less with the GOP than we would
expect based on ideology. In other
words, despite Republican inroads into the once “solid South,” the South (notwithstanding its relatively conservative ideology) still
retains some of its traditional ties to the Democratic Party. On the other hand, a number of states in the Rocky Mountains and Great Plains are more Republican than we would expect based on ideology.

Pearson’s r2 (the Coefficient of Determination)

This then raises the question of how good
the best fitting line is. In guessing the value of the dependent
variable, how much does it help to know the value of the independent variable?[6]

For an interval or ratio variable, our
best guess as to the score of an individual case, if we knew nothing else about
that case, would be the mean. The total sum of the squared deviations
from the mean gives us a measure of the error we make by guessing the mean,
since the greater this sum, the less reliable a predictor the mean will be. From the analysis of variance (“ANOVA”) table
presented earlier in this topic, we see that in this case the total sum of squares is 1466.433.
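The total sum of squares is straightforward to compute by hand. A toy illustration in Python (with made-up scores, not the actual state data):

```python
from statistics import mean

scores = [2.0, 4.0, 6.0, 8.0]   # hypothetical values of a dependent variable
m = mean(scores)                # the mean is our "best guess": 5.0

# Sum the squared deviations of each score from the mean
total_ss = sum((y - m) ** 2 for y in scores)
print(total_ss)                 # 20.0
```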

How much less will our error be in
guessing the value of the dependent variable (in this case, party
identification) if we know the value of the independent variable
(ideology)? We can calculate the sum of squared deviations about the
least squares line in the same way as total variation is calculated, except
that, instead of subtracting the overall mean from each score, we subtract the
predicted value of Y. In the case of Arkansas, for example, we subtract the predicted value of
51.24 from the actual value of 39.06, leaving us with a deviation (or
residual) of -12.18. (In other words, Arkansas is about 12 points less Republican than would have been guessed based on its ideology score.) By doing this for each state, then squaring
and summing the results, we obtain the residual sum of squares (1263.552, from the ANOVA table above). We can then determine how
much less variation there is about the regression line than about the
mean. The formula:

r2 = (total sum of squares - residual sum of squares) / total sum of squares

provides us with the
familiar proportional reduction in error. (Note: as when computing eta2 in the previous topic, we don't need to divide each element in the equation by N, since it is the same in each instance.)

In this case:

r2= (1466.433 - 1263.552) / 1466.433 = .138

In other words, by knowing a state’s
ideology score, we can reduce the error we make in guessing its partisanship
score by about 13.8 percent. Pearson’s r2 thus belongs to the
same “PRE” family of measures of association as Lambda, Gamma, Kendall’s tau, eta2, and
others. Pearson’s r2 is also called the coefficient of
determination, because it tells us the proportion of the variance in the
dependent variable that is “determined” by (or “explained” by) its association
with the independent variable. Put
another way, it tells us how much closer the points in the scatterplot come to
the regression line than they do to the mean.
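Using the two sums of squares from the ANOVA table, the proportional reduction in error can be verified directly. A quick Python check:

```python
total_ss = 1466.433      # squared deviations about the mean
residual_ss = 1263.552   # squared deviations about the regression line

# Proportional reduction in error (Pearson's r squared)
r_squared = (total_ss - residual_ss) / total_ss
print(round(r_squared, 3))   # 0.138
```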

Pearson’s
r

Just as the standard deviation, rather
than the variance, is usually reported in measuring dispersion, Pearson’s r
(also called the correlation coefficient) is usually reported rather
than r2. Pearson’s r is the positive square root of r2 when the relationship (as indicated by the sign of the “b” coefficient) is
positive, and the negative square root when the relationship is negative.[7] It thus ranges from 0 (when there
is no relationship between the two variables) to ±1 (when indicating a perfect
relationship). In the case of the relationship between partisanship and
ideology, it is .372.
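Taking the square root of r2 and attaching the sign of the “b” coefficient gives Pearson's r. A sketch in Python, using the sums of squares reported earlier:

```python
import math

total_ss = 1466.433
residual_ss = 1263.552
r_squared = (total_ss - residual_ss) / total_ss

b = 0.493  # slope of the bivariate regression; its sign carries over to r
r = math.copysign(math.sqrt(r_squared), b)
print(round(r, 3))   # 0.372
```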

We can also perform a test for the statistical significance of the relationship. From the ANOVA table presented earlier, we see that the
relationship is significant at the .009 level (written “p = .009”). Note that this is a so-called "two-tailed" test, that is, one in which the hypothesis does not predict the direction (positive or negative) of the relationship. Since, in this and in most cases, we do in fact predict the direction (in this instance, positive), we can use a "one-tailed" test, which cuts the p-value in half (that is, p = .0045).

Deviant Case Analysis

One difference between political science
as a social science and political science as a humanity is that the latter
tends to focus on the unique person or event while the former tends to focus on
typical patterns in human behavior. This is not, however, a hard and fast
distinction. Moreover, the study of the unique and the typical are complementary, not
conflicting, pursuits. Regression analysis illustrates these points quite
well. Once we have discovered an overall pattern, we can then focus on
those cases that do not fit that pattern, that is, those with high
residuals. These unusual cases are “the exceptions
that prove the rule.” (The original meaning of this saying was that the
exception "tests" the rule, which actually makes a lot more sense.)

Earlier, we found that several southern states were a good deal less Republican
than we would have predicted. There are, on the other hand, some Rocky Mountain
and Great Plains states that were substantially more
Republican than their ideology would have led us to expect. Perhaps these areas have great potential for party building efforts (the South for Republicans and the Rocky Mountains and Great Plains for the Democrats).

Finding deviant cases may help us generate
additional hypotheses. In other words, just as finding a pattern helps us
focus on cases that do not fit the pattern, finding such cases may in turn help
us in looking for other patterns. Figure 3 shows that two New England states, New Hampshire and Vermont, are about as liberal as two others, Massachusetts and Rhode Island, but are far more Republican. Can you speculate as to why this might be the case (think demographic characteristics, such as religion and ethnicity)?

Multivariate Analysis

Regression can be extended to analysis
that includes more than one independent variable. There are limits to such analysis. Among these is multicollinearity. When two or more independent variables are highly correlated with one another, it may be impossible to separate the impact of each on the dependent variable. Despite this, regression provides a very powerful tool for creating more comprehensive models of political life.

Though not easily
represented graphically, the multiple regression equation is relatively straightforward:

Y' = a + b1X1 + b2X2 + . . . + bnXn

This equation, instead of describing the
least squares line in a two dimensional plane, describes the least squares
plane (or hyperplane) in a space with as many dimensions as there are
variables in the equation. Don't panic: the computer can handle the
calculations.
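The least squares plane can be found with any linear algebra library. A minimal sketch in Python with NumPy, using made-up data rather than the states file:

```python
import numpy as np

# Hypothetical data for two independent variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

# Construct y to lie exactly on the plane y = 1 + 2*x1 - 0.5*x2
y = 1 + 2 * x1 - 0.5 * x2

# Design matrix: a column of 1s (for "a") plus one column per X
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve for [a, b1, b2] by least squares
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(round(a, 3), round(b1, 3), round(b2, 3))   # 1.0 2.0 -0.5
```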

The equation just described is called the unstandardized regression equation, because each b coefficient is expressed in terms of the
original units of analysis. For example, if Y is measured in percents and
X1 in thousands of dollars, a value of -0.8 for b1 would mean that,
all else being equal, an increase of $1,000 in X1 will result in a decrease of
0.8 percent in Y.

There is also a standardized regression
equation, in which all relationships are expressed in standard scores. This equation takes the following general
form:

Y' = β1X1 + β2X2 + . . . + βnXn

If β1 (pronounced "beta sub 1"), for example, were to equal -0.8, it would mean that, all
else being equal, an increase of 1 standard deviation in X1 will
result in a decrease of .8 standard deviations in Y. In a standardized
regression equation, the “a” coefficient is always zero, and so drops out of
the equation.
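The connection between the two forms is simple: each standardized coefficient equals the unstandardized “b” multiplied by the ratio of the standard deviations of X and Y. A sketch in Python for the bivariate case, with toy data:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
y = np.array([2.0, 4.0, 5.0, 9.0, 11.0])

# Unstandardized fit: y' = a + b*x
b, a = np.polyfit(x, y, 1)

# Convert b to a standardized beta
beta = b * x.std() / y.std()

# Check: regressing z-scores gives the same beta, with intercept ~ 0
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
beta_z, a_z = np.polyfit(zx, zy, 1)
```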

Both standardized and unstandardized
regression equations are important. Because it expresses relationships in
terms of the original units of analysis, the unstandardized equation is often
easier to understand. It is also easier to use the unstandardized
equation to calculate the value of the predicted value of Y for any given case or
set of cases. On the other hand, because it expresses all
relationships in terms of standard scores, the standardized equation lets us
evaluate the relative importance of each independent variable: the higher the
value (ignoring sign) of, say, β1, the
bigger the change in Y produced by a change of one standard deviation in X1.

For either form of the equation, we can
obtain the multiple correlation coefficient and coefficient of determination,
written R and R2 respectively. R2 is a proportional
reduction in error measure that tells us how much more accurately we can guess
the value of the dependent variable by knowing the values of all the
independent variables in the equation. R2 should usually be
adjusted to take into account the number of variables in the equation.

When an independent
variable is a dichotomy, it can be entered into a regression equation like any
other variable. Called dummy variables in this context, dichotomies are usually coded
“0” and “1.” (Actually, any two numbers will do, but 0 and 1 will make the results easier to interpret.) Suppose that gender is an independent variable, with female
coded 1 and male coded 0. In an unstandardized regression equation in
which the dependent variable is a “feeling thermometer” for Hillary Clinton, a
“b” coefficient of 8.106 associated with gender would mean that, all else being
equal, women rate her a little more than 8 points higher than do men.
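With a 0/1 dummy, the “b” coefficient is just the difference between the two group means. A toy illustration in Python (hypothetical thermometer scores, chosen so the gap is 8.106):

```python
import numpy as np

gender = np.array([0, 0, 0, 1, 1, 1])   # 0 = male, 1 = female
score = np.array([55.0, 60.0, 65.0, 63.106, 68.106, 73.106])

# Least squares slope for the dummy
b, a = np.polyfit(gender, score, 1)

# The slope equals the difference in group means
diff = score[gender == 1].mean() - score[gender == 0].mean()
print(round(b, 3), round(diff, 3))   # 8.106 8.106
```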

A variable with more than two categories
can be converted into a series of dummy variables. If we have a region
variable with four categories (Northeast, Midwest, South, and West), we can create up to three dummy variables (such as
Northeast, South, and West) in which a case is coded 1 if it is located in the region, and 0 if it is not. We could not create a
fourth dummy variable, since if we specify the value of a case for three regions, we have in effect already specified its value for
the fourth. (If a respondent does not live in the Northeast, the South,
or the West, and there is only one other region, he or she must live in the Midwest.)
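Creating the three dummies from a four-category region variable can be sketched in a few lines of Python (hypothetical codes; Midwest is the omitted reference category):

```python
regions = ["South", "West", "Northeast", "Midwest", "South"]

dummies = [
    {
        "northeast": 1 if r == "Northeast" else 0,
        "south": 1 if r == "South" else 0,
        "west": 1 if r == "West" else 0,
    }
    for r in regions
]

# A Midwestern case is 0 on all three dummies
print(dummies[3])   # {'northeast': 0, 'south': 0, 'west': 0}
```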

Note that dummy variables cannot be used
as dependent variables in an OLS regression equation. There are other methods
available for such a purpose (notably logit and probit), but they are beyond the scope of this
topic.

Finally, we can perform a test for the statistical significance
of each independent variable in a multiple regression equation (called the “t-test”) and for the equation as a whole (called the “F ratio”).

To illustrate the notion of multiple regression, consider the following variables for the
American states:

Y = Party identification

X1 = Ideology

X2 = Percent of households that include a married couple

X3 = Dummy variable (South = 1; other states = 0)

Let’s begin by creating a correlation
matrix among all four of these variables. We can see that
there are moderately strong positive correlations between identification with the GOP
and the percent of households containing a married couple and with ideology (that is, with conservatism), and
a moderately strong negative correlation with being in the South. (Note that the
coefficients in this matrix are “r,” not “r2.”)

Now let’s carry out a regression analysis. The relevant portions of the output provided
by SPSS are as follows:

The "Adjusted R Square" (adjusted for the number of variables in the equation) for the model summary shows that all three independent variables taken together
explain about 59 percent of the variation among the states in party identification.

The ANOVA table shows that the residual sum of
squares (the sum of squared deviations from the least squares line) is
564.653, while the total sum of squares (the sum of squared deviations from
the mean) is 1466.433. Note that
(1466.433 – 564.653) / 1466.433 = .615. This is identical to the unadjusted
R Square
in the model summary. The “Sig” of .000
is the significance level (based on an “F ratio”). In other words, for the model as a whole, p < .001.
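The R Square and its adjustment can be reproduced from the ANOVA figures. A sketch in Python, assuming n = 50 states, k = 3 independent variables, and the usual adjustment formula:

```python
total_ss = 1466.433
residual_ss = 564.653

r_squared = (total_ss - residual_ss) / total_ss
print(round(r_squared, 3))   # 0.615

# Standard adjustment for the number of cases (n) and predictors (k)
n, k = 50, 3
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(round(adj_r_squared, 2))   # 0.59
```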

The “coefficients” table provides the regression
equations. Under “unstandardized
coefficients,” the “Constant” (-26.644) is the “a” coefficient. The remaining values in this column are the “b” coefficients. Rewriting this in
standard algebraic form, the unstandardized regression equation is:

Y' = -26.644 + .542X1 + .933X2 - 4.8X3 (with the coefficient for the South dummy rounded to one decimal place)
The unstandardized equation tells us that,
all else being equal, each additional point on the ideology (conservatism) scale
is associated with an increase of something over a half point on the party
identification (Republicanism) scale, that each additional percent of
households containing married couples is associated with an
increase of about 0.9 points, and that a Southern state will have a score about
4.8 points lower than a state outside the South. Note that each of these coefficients holds
constant the other variables in the equation. Thus, for example, the coefficient for the South indicates that we would
expect a Southern state to score about 4.8 points lower than a non-Southern
state with the same ideology score and the same proportion of married couples.

A limitation of the unstandardized
equation is that each variable is measured in terms of very different and hard
to compare units of analysis. For example, it
isn't obvious whether an increase of .542 points on the party id scale for
an increase of one point on the ideology scale constitutes a larger or a smaller
change than an increase of .933 points for an increase of one percent in
households including married couples.

The standardized equation makes it easier
to compare the relative importance of the different independent variables. In this case, it tells us that each independent variable has a moderately important impact on party identification, even when the other independent variables are held constant, with the percent of households that include married couples being a little more important than either of the other two variables. A disadvantage
of the standardized equation is that it's rather abstract. What does it mean,
for example, to say that an increase of one standard deviation in household
composition is associated with an increase of .469 standard deviations in
party id? Because unstandardized and
standardized equations each have their strengths and limitations, it is helpful
to have both.

Finally, the figures in the “Sig” column show that (based on t tests) the contribution
of each independent variable is statistically significant even when the other
variables in the equation are taken into account.

Curvilinear Relationships

Sometimes a scatterplot will show that
there is a relationship between two variables, but that the pattern forms a
curved rather than a straight line. There are techniques for dealing with so-called curvilinear patterns,
but they are beyond the scope of this topic.

1. Start SPSS, and
open the states.sav file. Open the states codebook. Repeat
the analysis described in this topic, but use either ideo
or percent voting for Bush (which you will need to compute) as your dependent
variable. Select independent variables
that you hypothesize influence your dependent variable.

An optional byproduct of the SPSS regression tool is the ability to save residual scores as a new variable. Choose this option. Using SPSS Data View, find the states with the highest
positive and negative residuals. Can you think of any reasons that would
explain these?

2. Start SPSS, and
open the senate.sav file and the senate codebook. Look at the measures of senators' voting records described in exercise 3 of the topic on Standard Scores and the Normal Distribution. Using the correlate procedure, see if these variables are all measuring more or less the same thing. Note that unity measures the degree to which a senator votes with his or her own party when majorities of the two parties are opposed. To make these measures comparable to the others, convert them so that they represent the degree to which a senator votes with the Republican Party. Note again that Sanders (I, Vermont) is treated as a Democrat for purposes of this variable.

3. What constituency variables can be used to explain a senator’s voting
record? Given a senator’s constituency, does knowing anything about the
senator as an individual (such as party or gender) provide a more complete
explanation? Are there some individual senators who are much more liberal
or much more conservative than predicted by your equation? (Note: Along with Angus King (Maine), Bernie Sanders of Vermont is coded as an independent for the party variable. To treat party as a dummy variable, either 1) recode to treat these two senators as Democrats (since they caucus with the Democratic Party), 2) go
to SPSS Variable View and make “3” a missing value for this variable, or 3) use select cases to exclude these senators from your analysis.)

4. Open the countries.sav file and the countries codebook.
Freedom House provides estimates of each country’s level of political rights and civil
liberties. Compute an additive index summing these two measures. What variables help explain the value of this index? Are
there countries that are either much more or much less democratic by this
measure than your equation predicts? Can you explain these “deviant
cases”?

Repeat this analysis, but use perceived political corruption as your dependent variable.

Lowry, Richard, “Introduction to Linear
Correlation and Regression,” Concepts and Applications of Inferential
Statistics, http://www.vassarstats.net/textbook/. (Go to Table of Contents, then to chapter 3.)

[1] The order in which concepts are introduced here
differs a bit from that found in most introductory texts, which calculate
Pearson’s r directly and then proceed to Pearson’s r2. I
cover r2 first because of its relationship to other PRE measures
discussed earlier and because r2 follows more directly from the
discussion of lines of best fit. For textbooks that employ a similar
approach, see William Buchanan, Understanding Political Variables, 4th edition (NY: Macmillan, 1988), chs. 18-19, and Susan Ann Kay, Introduction to the Analysis of
Political Data (Englewood Cliffs, NJ: Prentice
Hall, 1991), chs. 4-5.