Possibly the most important (and overlooked) part is to properly state the research problem.

For example, an intersection is plagued with lots of accidents, so the Chief decides to increase enforcement. Say that we are called in to determine whether doing so is helpful.

1. We must first identify the assumptions that must be true for the proposed solution (increased enforcement) to have the desired effect (fewer accidents). In this example there are at least two assumptions: (A) driver carelessness is at fault (increased enforcement can't help if accidents are actually being caused by, say, malfunctioning signal lights); (B) visible police presence acts as a deterrent (if drivers are oblivious to police, assigning more cops can't help).

How can we know if our assumptions are plausible? If we are called in before changes in enforcement take place, we can look for the presence of extraneous factors (such as road defects, obstructions to view or malfunctioning signals). We must also review the literature to determine whether the assumptions have scientific support. Have other researchers concluded that driver carelessness causes accidents, or that a visible police presence deters misconduct?

Once we are satisfied with the assumptions, they can be incorporated into the causal chain that logically takes us, step by step, from the proposed solution to its anticipated effect:

more police ==> less driver carelessness ==> fewer accidents

2. Next, we create a hypothesis. Hypotheses are scientific (meaning empirically verifiable) conjectures. They are normally expressed in the form of a conditional statement of cause and effect. By conditional we mean that "if" a specific change takes place (increased enforcement) "then" a corresponding change will follow (fewer accidents). Changes on both sides of a hypothesis (the "if" and "then", the cause and the effect) must be measurable. In this example, our hypothesis can simply be "increased enforcement leads to fewer accidents."

3. Why do we bother with a hypothesis? Why not simply collect all the data we can, then try to determine what is going on? As you recall, our answer had to do with homicide and cycles of the moon. The mere fact that certain characteristics seem to go together means little. In real life chance is always at work, so it is hazardous to take associations at face value. We cannot have confidence in associations unless they were predicted ahead of time (a priori), on the basis of prior findings, and the prediction was scientifically tested. That prediction is a hypothesis, and the prior findings come from a good literature review. (We will talk about "testing" later.) So now you know "the rest of the story."

4. Some points of clarification...

Theory v. hypothesis. What is the difference between a "theory" and a "hypothesis"? Like hypotheses, theories are scientific conjectures about cause and effect. But theories are something more. In actual practice, the "theory" label is attached to scientific conjectures that seek to explain fundamental principles. For example, Travis Hirschi's "control theory", which posits that delinquency is caused by a lack of bonds with conventional society. In contrast, hypotheses typically address narrow, practical problems, such as the proposition that arrest deters domestic violence. Hypotheses are also directly testable, while to evaluate theories we must test their underlying hypotheses. Can you think of any hypotheses that could be used to test Hirschi's control theory?

Concepts and constructs. Please note that some textbook writers get carried away distinguishing between "concepts" and "constructs". These are important issues in scientific philosophy, but for our purposes the above definition of a hypothesis will do. Hey, if it's good enough for Government work...

What happens next? That depends on our role. If called in ahead of time, we can design the study so that extraneous factors (such as changes in staffing or weather) do not interfere with our ability to assess the effects - if any - of changes in enforcement. Research design will be discussed during the next several sessions. But let's jump even farther ahead - to the end - when we analyze the data. Then, statistical methods will be employed to determine whether changes in enforcement correspond with changes in the number of accidents, in the anticipated direction (more enforcement, fewer accidents). If our math indicates that the degree of correspondence is significantly higher than what might happen by chance alone, we would conclude that increased enforcement had the desired effect.

Concepts and variables

In class we will discuss an example of CJ research. Its hypothesis, which the study's authors apparently confirmed to their satisfaction, is that youths who display poor demeanors during encounters with police are more likely to receive a severe disposition. But we found many things to quibble about:

The independent (causal) variable, demeanor, is a categorical variable with four discrete (mutually exclusive) levels. Another way of saying it is that "demeanor" can take on one - and only one - of four values. But the lesser three are quite similar. That poses two problems.

(1) Artificial distinctions can contaminate a studyīs findings. Our conclusions should represent how things "really" are - not just how researchers think they are.

(2) It may be impossible to operationalize the levels with sufficient clarity and precision so that coders can consistently assign the appropriate values to individual cases. This is the issue of reliability. Coder "A" might think that in the case of Billy Bobb, who gave lip to officer Hardnose, the disposition was "informal reprimand." Looking at the same case, Coder "B" might identify the disposition as "admonish and release." Really, it would have been better to collapse the four levels into three, yielding the following values: arrest (most severe), citation (intermediate), admonish and release (least severe). These distinctions mean something in the "real world". They should also be simple to operationalize and to reliably code.

We're not done yet. Here is the last sentence from the study's conclusions: "...He is a delinquent because someone in authority has defined him as one, often on the basis of the public face he has presented to officials rather than on the kind of offense he has committed." That may be a nice piece of writing. But is it supportable? Two issues:

1. With so many cells (intersections between levels of the independent and dependent variables) and so few cases (total of 66), the case counts in most cells are very small (only 2 cases in the arrested/cooperative cell). Typical practice calls for at least 100 cases overall and no fewer than 30 cases in each cell. Use your common sense. Would you trust the conclusions from such a small sample? (More about sampling next week.)

2. Might the reported change in the level of the dependent variable (severity of disposition) have been caused by another independent variable? Well, the authors of the study did consider one: nature of the offense. When police investigate a crime like murder or robbery, juvenile suspects will probably get arrested regardless of their attitude. To avoid confounding the findings, it is normal practice to exclude (control for) cases where other powerful causal forces, or determinants, may be at work. Controlling for independent variable "offense severity" levels the playing field so that the impact of independent variable "attitude" can be accurately assessed. (Of course, by excluding murder and robbery the study's findings cannot apply to police-juvenile encounters that involve serious crime.)

Might there be other independent variables at work? Hypotheses may not explicitly say so, but they are always preceded by "other factors being equal..." Scientific research requires that we identify and control every independent variable that could affect the level of the dependent variable. We find out about other potential causes from prior research, through literature reviews, and by carefully modelling our hypotheses and examining their assumptions. A key assumption of the present hypothesis is that officers can be verbally provoked into making inappropriately severe dispositions. (Look at it this way - if officers are completely immune to verbal provocation, there is no hypothesis at all.) But is it equally possible that officers process verbal provocations more objectively? Juveniles who give "lip" might be labelling themselves as dangerous and therefore more worthy of being arrested. Here is the most likely causal chain, with "estimate of dangerousness" placed as an intervening (independent) variable:

poor demeanor ==> higher estimate of dangerousness ==> more severe disposition

If this sequence is correct, how might the studyīs conclusions change?

Set this study aside for now. Assume that you have done a different study and concluded that changes in independent variable A are associated with changes in dependent variable B. Does that mean that A causes B? Maybe, and maybe not. Even if intervening variables are ruled out, you're not home free. Unbeknownst to you, there is an extraneous variable, X, which is associated with independent variable A and with dependent variable B. Had you known about X, you might have discovered:

1. Changes in X account for all of the change in B.

2. Independent variable A has no causal effect on B. It only seemed that way because A is associated with the true causal agent, X.

In this situation the "relationship" between A and B is not causal - it's spurious.

How do we get around the analytical difficulties posed by intervening and extraneous variables? It may be impossible to identify each and every potential cause in advance. And to simply throw out every situation where a confounding factor might be at work could leave us with such a small set of cases that our findings are trivial. Consider the study criticized above, whose conclusions can only be applied to circumstances that fit a very narrow profile (e.g., juvenile/officer encounters that do not involve serious crime, instead of all juvenile/officer encounters.)

Fortunately, there are ways to avoid losing cases while, at the same time, "automatically" controlling (holding constant) every independent variable other than the one of interest.

Research designs

NOTE: Texts do not always agree on what can be called "experimental" or "quasi-experimental". There are also differences in how non-experimental designs are defined. For the purposes of this course, please use the definitions given here and in the paragraphs below. However, you are still expected to read and understand issues about research designs discussed in the text.

Experimental

Experimental research designs are the best way to keep extraneous variables from confounding the results of a study.

Cases are randomly assigned to two or more groups. This makes the groups as similar as possible.

All groups are immediately measured for the independent and dependent variables.

One or more groups are designated as "experimental". These groups are exposed to an intervention. To "intervene" means to insert a new independent variable, or to change the level of an existing independent variable in the hypothesized direction. (Normally there is only one experimental group.)

Usually a "control" group is selected that goes about its business - its members receive no special attention. The levels of any existing independent variables are not changed.

All groups are then measured for the dependent variable. Any substantial difference between the experimental and control groups can be attributed to the intervention - the changing of the level of the independent variable.

In the Minneapolis domestic violence experiment, officers responding to a domestic violence call randomly exercised their power of arrest. This process randomly "assigned" the calls to either an arrest "group" or a non-arrest "group". Subsequent differences between the groups in the level of the dependent variable (recidivism) were attributed to the intervention.

Note: Since we leave the control group alone, any change in the level of its dependent variable must be caused by factor(s) extraneous to the experiment. It seems logical that these factor(s) would also affect the experimental group by a like amount. So we must subtract this quantity from any observed difference in scores for the experimental group when we calculate the effect of the intervention.
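The subtraction described in this note can be sketched in a few lines of Python. Every accident count below is hypothetical, invented only to illustrate the arithmetic:

```python
# Sketch: adjusting the experimental result for extraneous change,
# using the control group as the baseline. All numbers are made up.
exp_before, exp_after = 50, 30    # accidents in the experimental area
ctrl_before, ctrl_after = 48, 44  # accidents in the control area

raw_change = exp_after - exp_before           # -20
extraneous_change = ctrl_after - ctrl_before  # -4 (no intervention here)

# The control group's change is assumed to affect both groups alike,
# so it is subtracted from the experimental group's raw change.
effect_of_intervention = raw_change - extraneous_change  # -16

print(effect_of_intervention)
```

Without the subtraction we would have credited the intervention with all 20 fewer accidents, when 4 of them apparently would have happened anyway.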

For ethical and practical reasons, many criminal justice hypotheses cannot be evaluated with experimental designs. For example, to study the hypothesis that poverty causes crime, one would have to randomly assign children at birth to poor and rich families, then measure any differences in their criminality.

Quasi-experimental

Experimental, but with a serious limitation: either random assignment was not used, or there was no control group.

Without random assignment, it is possible that the groups differed in an important respect before the intervention. Perhaps this difference - rather than the intervention - caused any measured difference in the dependent variable. ("Matching" is occasionally used instead of random assignment. It is an imperfect remedy.)

Without a control group, it is possible that an extraneous event (a "history" effect) or natural change over time (a "maturation" effect) might have caused the change in the level of the dependent variable.

Still, in quasi-experimental designs the researcher supervises the application of an intervention and measures any effects on the dependent variable. The same ethical and practical issues that affect true experimental designs also apply.

Non-experimental designs are sometimes called "quasi-experimental" when the circumstances mimic an experiment. It may be known that the value of an independent variable was adjusted on a certain date, and there are measurements available of the dependent variable preceding and following that date.

Say that on a certain date a police department increased patrol in a certain area. Two years later we want to know whether doing so lowered crime. Certainly there would be records of the crime rate before and after the increase. We might even be able to "find" a control group, perhaps a similar area where there was no change in patrol.

With care, it is sometimes possible to retrospectively (looking back) collect data in a way that mimics a quasi-experimental design. For example, if an intervention took place, and if good records were kept about the independent variable before and after the intervention, then only the researcher's absence at the time keeps the design from being considered "quasi-experimental".

Non-experimental (aka "ex-post-facto")

Practical and ethical problems mean that much social science research falls into the third category, non-experimental. Researchers evaluate their hypotheses by retrospectively collecting data on the independent and dependent variables. For example, if the hypothesis is that poverty causes crime, researchers select a sample, then measure each case for income and number of prior arrests.

Non-experimental designs are inherently inferior:

Since an experimenter is not "there" to apply an intervention, causal order (cause precedes effect) cannot be assured. (Could crime cause poverty?)

Without random assignment or matched control groups, the possibly confounding effects of other independent variables on the level of the dependent variable cannot be discounted. (Could effects of poverty such as lack of education or living in a violent area be the true cause of crime?)

Sampling

To sample means to select cases for inclusion in a group. Random sampling is the best procedure because it gives each case an equal chance of being selected. This process minimizes the possibility of introducing bias and winding up with mismatched groups. For example, in a domestic violence experiment we do not want to accidentally pack a "yes-arrest" group with hardcore batterers, as doing so might lead us to underestimate the impact of arrest on recidivism.

Random sampling has other benefits. First, it allows us to economically estimate the characteristics ("parameters") of a large group, such as the average ("mean") rate of recidivism, using just a small set of members. We did that in class by randomly drawing chips from a bowl that represented a population of 200 inmates with a known mean sentence. (Most research involves samples, and the actual population parameters are usually unknown.) We demonstrated that our aggregate sample (n=30) yielded a mean score that was very close to the known population mean. Means of our individual, smaller samples (n=10) were generally less accurate. In social science research, a sample size of at least 30 helps assure that the sample statistic (such as the mean) closely approximates its corresponding population parameter.
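The chip-drawing demonstration can be mimicked in code. This sketch assumes a hypothetical population of 200 sentences (the actual values from the classroom bowl are not reproduced here), then draws one random sample of n = 30:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 200 inmates with sentences of 1 to 10 years.
population = [random.randint(1, 10) for _ in range(200)]
pop_mean = statistics.mean(population)  # the "known" population parameter

# One simple random sample of n = 30, drawn without replacement.
sample = random.sample(population, 30)
sample_mean = statistics.mean(sample)

# The sample mean should land close to the population mean.
print(pop_mean, sample_mean)
```

Re-running the draw with smaller samples (say, n = 10) will generally produce means that stray farther from the population mean, as in the classroom exercise.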

An even greater benefit of random sampling is that it allows us to use probability theory to determine whether a change in the score of a dependent variable is statistically significant. Without sampling it is difficult to counter the argument that what seems to be a substantial change is really a normal fluctuation that could have been produced by chance alone. Say that we measure officer cynicism and get a mean score of 5.0. We then give the officers sensitivity training and the mean cynicism score drops to 4.3. Could a change of .7 have taken place without the training? If we did not draw a random sample but simply tested every officer, we could never know. More will be said about this intriguing topic later in the course.

In simple random sampling, we select a sample of desired size directly from the population in a way that gives each member an equal chance of being included. This process requires that we list each member of the population in a sampling frame, essentially a handwritten or computerized list from which cases can be physically drawn. As you can imagine, this process is not always "simple". Some sampling frames can be very large (e.g., the population of the United States) or may be otherwise difficult to document (e.g., every car parked at Cal State Fullerton). So other ways to draw samples have been devised. Keep in mind that to ensure results can be projected to the population, all sampling methods must allow each member of the population an approximately equal chance of being included in the sample.

One popular alternative sampling method is cluster sampling, which the text covers in some detail. This procedure is particularly convenient because only cluster(s) need be drawn at random. Once clusters are selected, every element (case) within that cluster is included in the sample. Other than being randomly selected, clusters should also be equal in size. These and other limitations (also called assumptions) make cluster sampling more appropriate for use in large-scale surveys than for ordinary criminal justice research.

Another alternative is systematic sampling, which is the procedure we followed in the parking lot. If done correctly, it yields about the same results as random sampling.
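A minimal sketch of systematic sampling, assuming a hypothetical frame of 400 parked cars and a sampling interval of 10 (both numbers invented for illustration):

```python
import random

random.seed(1)

# Hypothetical sampling frame: 400 parked cars, numbered 0 to 399.
frame = list(range(400))

k = 10                       # sampling interval (frame size / sample size)
start = random.randrange(k)  # random starting point within the first interval
sample = frame[start::k]     # then every k-th case thereafter

print(len(sample))  # 400/10 = 40 cases
```

Because the starting point is random, every case in the frame has an equal chance of selection, which is why a properly done systematic sample behaves much like a simple random sample.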

Disproportionate sampling is an important and commonly-used technique in criminal justice research. As we discussed in class, simple random sampling may sometimes yield too few cases that possess a characteristic of interest. For example, say that we are conducting a study of officer attitudes about community policing for a 500-officer department. If we are interested in how the average officer feels, regardless of rank, our sampling frame would be a list of 500 officers. We then draw a random sample (say, 100 officers) and administer the survey instrument. Since the sample was randomly drawn, it would reflect the department as a whole and include about the same proportions of ranks and positions. For example, if the department has 50 Sergeants (50/500, or 10/100, or 10 percent) a simple, randomly drawn sample would yield about 10 sergeants (10/100). If the department has 400 patrol officers (400/500, or 80/100, or 80 percent) there would be about 80 patrol officers (80/100) in our sample. After calculating the proportion of responses favorable to community policing, we would announce the results. Since simple random sampling was used, the results could be assumed to represent the views of the department as a whole.

What if the literature review suggests that Sergeants and patrol officers have different attitudes? That's simple enough to check - just compare the survey responses for the 10 Sergeants and 80 patrol officers. Let's say that 8 (8/10) sergeants liked community policing and 60 patrol officers (60/80, or 75/100, or 75 percent) did not. You'll notice that the proportion for sergeants was not transformed into a percentage (it would be 80/100, or 80 percent) because the number of sergeants - 10 - is so small.

But there's a problem: a small sample may not accurately represent the characteristics of the population from which it's drawn. There should be at least 30 cases in each group being compared, with 100 being optimal (see the book, pg. 83). A sample of 10 sergeants cannot be assumed to accurately represent the views of all sergeants in the department.

How can we increase the number of sergeants without also drawing tons of officers? From the frame (list) of 400 patrol officers, draw an optimal number for social science research (100). From the frame of 50 sergeants, draw the minimum number for social science research (30). Dividing a population into groups (officers/sergeants) is called stratification. Drawing a sample that does not represent its group's (stratum's) proportion in the population is called disproportionate sampling.

Now examine the responses. Say that 25 sergeants and 25 patrol officers favor community policing. Transform these frequencies into percentages. For 25 sergeants, 25/30 ≈ 83/100, or 83 percent. For 25 officers, 25/100 = 25 percent. Compare the percentages directly - what do they tell you? That's the beauty of proportions - they allow us to compare the characteristics of strata (sergeants and officers) regardless of sample size.

Can we pool our samples to estimate how all uniformed officers in the department feel about community policing? Not if we sampled disproportionately! Look at it this way. You drew samples of 30 sergeants and 100 patrol officers (30/100 sergeants, or 3 for every 10 patrol officers). But in the department as a whole there are 50 sergeants and 400 patrol officers (50/400, or about 1.3 sergeants for every 10 patrol officers). Pooling disproportionately sampled strata gives undue weight to one side or the other. Had you done so in this example you would have biased the overall results in the direction of - in favor of - sergeants' attitudes about community policing.
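The pooling problem can be shown numerically with the figures from this example. The weighted estimate at the end is a standard statistical remedy, not one these notes themselves propose (they suggest drawing more cases into the under-represented stratum instead):

```python
# Figures from the example: 50 sergeants and 400 patrol officers in the
# department (450 uniformed members across the two strata); samples of
# 30 and 100; 25 in each sample favored community policing.
serg_rate = 25 / 30   # about 83 percent favorable
ofcr_rate = 25 / 100  # 25 percent favorable

# Naive pooling of the two samples over-weights the sergeants:
naive = (25 + 25) / (30 + 100)

# Weighting each stratum by its share of the population gives a
# defensible overall estimate:
weighted = (50 / 450) * serg_rate + (400 / 450) * ofcr_rate

print(round(naive, 3), round(weighted, 3))
```

The naive pooled figure (about 38 percent) sits well above the population-weighted figure (about 31 percent), which is exactly the bias in favor of sergeants' attitudes described above.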

Dividing a population into strata and sampling disproportionately can allow us to answer a question of interest (compare the attitudes of officers v. sergeants) without causing the size of the larger stratum to become enormous. If we should wish to report overall results (how uniformed officers feel, regardless of rank) we would simply draw more cases at random into the under-represented stratum (patrol officers) until proportionate representation is achieved. (For more on this topic see pp. 76-77 in our main text.)

Measurement

Descriptive statistics are used to summarize distributions. Two common statistics used for this purpose are the proportion (percentage) and the mean (arithmetic average).

Summarizing data: Say that we wish to describe a prison population to a visiting criminologist. Our visitor is interested in the age of prisoners and the length of their sentences. Of course we could go prisoner by prisoner, and report that prisoner A is 26 years old and is serving a 3 year term, prisoner B is 23 years old and serving a 1 year term, and so on. That might be ok if there are only two or three prisoners, but if there are more we need a handier way to provide a single, summary measure that describes how prisoners are distributed by age or sentence length.

If you read the book, you will find out that age and sentence are continuous variables, and that the distribution of a continuous variable can be summarized by using its mean. So we could calculate the mean age and sentence for all our prisoners and report to our visitor that their mean age is, say, 22 and that their mean sentence is, say, 4 1/2 years (I made these figures up).

But how do we summarize distributions for categorical (discrete) variables? For example, consider the hypothesis that faculty members drive nicer cars than students. We have two variables - employment status (student or faculty/staff parking area) and car value (5 categories, from economy to prestige). Our unit of analysis is a car. Each car is being measured along two variables - employment status (nominal) and car value (ordinal). How cars distribute by employment status, and how they distribute by car value, can be summarized using the summary statistic known as a proportion.

Proportions are usually expressed as a percentage, which is simply a fraction with a denominator of 100. Proportions are often used to summarize datasets with discrete variables. For example, instead of saying that 23 of the 60 cars in the student lots were of the lowest value, we could report that 38 percent of the student cars, and 21 percent of the faculty/staff cars, were of the lowest value. Please note that when the underlying numbers (n's) are relatively small, we must also report the actual frequency in each cell, since percentages can overstate the significance of a difference (3 is 300 percent higher than 1, while 6,000 is only 20 percent higher than 5,000).

Displaying relationships between variables: Summary statistics can be used to explore relationships between variables. Relationships between categorical (discrete) variables can be explored using a table such as the crosstabulation discussed below. Relationships between continuous variables require more complex measures. These topics will be covered later in the term.

Variability

Mean, median and mode can be used alone or in combination to "summarize" a dataset (set of data - can be a sample or population). As we discussed in class, the mean is best used as a summary statistic for approximately normal distributions.

A distribution is an arrangement of cases in a sample or population, in order of the value or score for the variable being measured. Here is a distribution of students in one of my classes, arranged by the continuous variable "age" (n = 12):

23 23 23 24 25 25 26 34 34 37 40 46

Distributions are usually displayed in a bar chart, with the values along the "x" (horizontal) axis, and the frequency (number of cases for each value) along the "y" (vertical) axis.

When a variable is measured at the interval or ratio levels, a distribution can be summarized by computing the mean (the average value). What is the mean of this distribution? Simply add up the values or scores for each case and divide by (n), the number of cases. A simple way to proceed is to multiply each value or score (23, 24, etc.) by the corresponding number of cases, then sum: (3 × 23) + 24 + (2 × 25) + 26 + (2 × 34) + 37 + 40 + 46 = 360, and 360/12 = 30.

Does a mean of 30 adequately represent this group? Means are good descriptors when a distribution approaches normality - the so-called "bell" curve. That is not what we have here. To better describe this distribution we should also report other measures, including the median (25.5) and the mode (23).
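The three summary statistics for this age distribution can be checked with Python's standard statistics module:

```python
import statistics

# The class age distribution from the notes (n = 12).
ages = [23, 23, 23, 24, 25, 25, 26, 34, 34, 37, 40, 46]

print(statistics.mean(ages))    # 30
print(statistics.median(ages))  # 25.5
print(statistics.mode(ages))    # 23
```

The gap between the mean (30) and the median (25.5) is itself a hint that the distribution is skewed rather than normal.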

Distributions can be described in other ways. A very powerful method is to measure the dispersion of scores or values, using the mean as a reference point. To report dispersion we can use two statistics: average deviation and its derivative, standard deviation (s).

Average deviation is simply that - the average distance between each score or value and the sample mean. Simply obtain the absolute distance (disregard + or -) between each score and the mean, add the distances and divide by n (the number of scores).

mean = 30, n = 12

Score   Distance to mean            Squared distance to mean
        (used for average           (used for standard
        deviation)                  deviation, s)
23       7                           49
23       7                           49
23       7                           49
24       6                           36
25       5                           25
25       5                           25
26       4                           16
34       4                           16
34       4                           16
37       7                           49
40      10                          100
46      16                          256
Sum     82                          686

The average deviation - the average distance between scores or values and the mean - is 6.83 (82/12). For this distribution, 6.83 seems like a fairly large number (the distance between the minimum score - 23 - and the maximum score - 46 - is 23.) Computing the average deviation quantified our suspicion about the descriptive value of this particular mean.

Now that you know all about average deviation, you need to learn about another measure of dispersion: standard deviation. To compute it requires a bit more work:

1. Each score's distance from the mean is squared (second column).

2. The squared distances are summed (total = 686).

3. Divide by the number of cases (686/12 = 57.17). This number is also known as the variance, or s².

4. We then take the square root of the result (square root of 57.17 = 7.56). This is the standard deviation, or s.
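The steps above, along with the average deviation, can be verified in code using the same twelve ages:

```python
import math

# The class age distribution from the notes (n = 12, mean = 30).
ages = [23, 23, 23, 24, 25, 25, 26, 34, 34, 37, 40, 46]
n = len(ages)
mean = sum(ages) / n  # 30.0

# Average deviation: mean absolute distance from the mean (82/12).
avg_dev = sum(abs(x - mean) for x in ages) / n

# Variance: mean squared distance from the mean (686/12).
variance = sum((x - mean) ** 2 for x in ages) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(round(avg_dev, 2), round(variance, 2), round(std_dev, 2))
```

Note that, following the notes, the variance divides by n; many texts divide a sample's squared distances by n - 1 instead, which gives a slightly larger result.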

Text fig. 7.14 (pg. 172) depicts a normal distribution as a bell shaped curve, with 50 percent of the cases having scores that fall above the mean, and 50 percent below. Other breakdowns are also depicted. For example, about 68 percent of the scores in a normal distribution fall within one standard deviation from the mean. (These rules do not apply to our age distribution, which is not normal.)

From Text Table B.1 (pg. 311) a score that yields a z of .42 is about 16 percent from the mean (column B, area between mean and z). Our score, 14, is higher than the mean (13). By adding 16 percent to the mean (50th percentile) we place this score at the 66th percentile. So 66 percent of the cases in this normally distributed dataset have a score of 14 or lower, and 34 percent of the cases have higher scores (column C, area beyond z).
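The percentile lookup can also be done without Table B.1, using the error function from Python's math module. This is a sketch, and like the table it assumes the scores are normally distributed:

```python
import math

def z_to_percentile(z):
    # Cumulative probability of a standard normal variable, via the
    # error function, expressed as a percentile.
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

# From the example above: score 14, mean 13, z = .42
print(round(z_to_percentile(0.42)))  # 66 - the 66th percentile
```

A z of 0 lands exactly at the 50th percentile (the mean), matching the 50/50 split described above.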

Well, that's fine. But the dataset in the text was purposely designed to be "normal". In the real world many datasets are not close to normal (we can only use the z table when the distribution of scores approximates normality.) An even better reason to bother with z, though, is that the means of repeated samples form a normal distribution. This gives us a way to determine the probability of obtaining any individual score, if it is part of a random sample. (More about this in the weeks to come.)

Using Crosstabulation to Analyze Relationships Between Categorical Variables

Descriptive statistics can be used to summarize data and to analyze and display possible relationships between variables. We are now concerned with the latter purpose.

Crosstabulations, also called contingency tables, are often used to display the relationship between two discrete or categorical variables. Consider, for example, the faculty/student car hypothesis:

Car value     Student lots (C+E)   Fac/staff lots (D+F)
5 (highest)     2                    1
4               4                    5
3               9 (15%)             17 (28%)
2              22 (36%)             24 (40%)
1 (lowest)     23 (38%)             13 (21%)
Total          n=60 (100%)          n=60 (100%)

Our hypothesis (A ==> B) is that income determines type of car. Since we cannot measure income directly, we use car lot (student or faculty) as a surrogate independent variable. The dependent variable, car value, is ordinal and has 5 categories. Are certain values of car lot associated with certain values of car value?

A table cell denotes the intersection of a specific value of the independent variable with a specific value of the dependent variable. The number inside - the cell frequency - counts every case that was coded with those values. In our above analysis we selectively combined (collapsed) car value categories and summed the cell counts. Whether and how one collapses categories can make results misleading, so care is important!
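The column percentages in the car-value table can be recomputed from the raw cell counts. Note that ordinary rounding may differ by a point from the handout, which appears to truncate its percentages:

```python
# Cell counts from the car-value crosstab (value 5 = highest).
student = {5: 2, 4: 4, 3: 9, 2: 22, 1: 23}
faculty = {5: 1, 4: 5, 3: 17, 2: 24, 1: 13}

def column_percents(column):
    # Each column sums to n = 60, the denominator for its percentages.
    total = sum(column.values())
    return {value: round(100 * count / total) for value, count in column.items()}

print(column_percents(student))  # e.g. value 1 -> 38 percent
print(column_percents(faculty))
```

Computing each column's percentages against its own n is what lets us compare the two lots directly even though the raw counts differ cell by cell.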

Other examples of bivariate (two variable) relationships are in the texts. They are easier to analyze as they show stronger relationships.

Unless we conduct an experiment we will usually be concerned with more than two variables. Experiments - particularly the "classic" type that incorporates a control group - make it less likely that an extraneous variable (an independent variable that we are not considering) is secretly affecting our results. But we cannot always experiment. To experimentally test a hypothesized relationship between income and car value we would have to randomly select two groups of people, make all members of one sample rich and the other sample poor, then see what kinds of cars they buy. Not!

As you know, the main weakness of a non-experimental research design is that changes in the level of the dependent variable may be affected by extraneous factors - independent variables that we know nothing of. (A good literature review might have alerted us, but maybe we did not do our homework!) Perhaps the relationship between car lot and car value would have seemed stronger (a greater difference between the values of student and faculty cars) if students had not brought family cars to school. If students had to furnish their own transportation, would a larger proportion of their cars be of the lowest value?

To test whether a third variable - car ownership (owner/non-owner) - might affect the results, we simply make up two car lot/car value tables. One table only includes cars owned by the driver, and the other cars that are not owned by the driver. Each is called a "first-order partial table" - partial because only a subset of cars is included, first order because we are "only" on our third variable - car ownership. Using tables to control for the effects of a test variable is known as elaboration analysis.
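The partial-table idea can be sketched in a few lines of Python. Everything here is illustrative - the cases, the category labels, and the owner flag are made up - but the mechanics of splitting on a test variable and tabulating each subset are the same:

```python
# Invented cases: (car lot, car value category, owned by driver?)
cases = [
    ("student", "low", True), ("student", "low", False),
    ("student", "high", False), ("faculty", "high", True),
    ("faculty", "high", True), ("faculty", "low", True),
]

def crosstab(subset):
    """Count cases at each lot/value combination."""
    table = {}
    for lot, value, _owner in subset:
        table[(lot, value)] = table.get((lot, value), 0) + 1
    return table

# One first-order partial table per value of the test variable:
owners_table = crosstab([c for c in cases if c[2]])
non_owners_table = crosstab([c for c in cases if not c[2]])
```

If the lot/value relationship looks the same in both partial tables, ownership is not an explanation; if it differs, the test variable matters.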

So far we have measured at the discrete or categorical level, with cells reporting the proportion (percentage) of cases that fall at certain values of each variable. Should one variable be continuous, we can use its mean score instead. This (fictitious) table, which displays the average sentence given to persons convicted of violent and property crimes, reports the mean score for the continuous variable at each level of the discrete variable.

                        Violent    Property

Mean sentence (years)      10          1
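A quick sketch of how such a table is produced, using invented sentence data chosen so the group means match the fictitious table above:

```python
# Invented (crime type, sentence in years) pairs.
sentences = [
    ("violent", 12), ("violent", 8), ("violent", 10),
    ("property", 2), ("property", 1), ("property", 0),
]

def mean_by_group(pairs):
    """Report the mean of the continuous variable at each level
    of the discrete variable."""
    groups = {}
    for group, value in pairs:
        groups.setdefault(group, []).append(value)
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}

means = mean_by_group(sentences)
```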

When both variables are continuous there is a more sophisticated way of analyzing their relationship. Stay tuned!

Using correlation and regression to analyze relationships between continuous variables

As demonstrated in class, we can visually display the relationship between two continuous variables on a chart, called a scattergram. If a hypothesis predicts causality - that change in an independent variable causes corresponding change in a dependent variable - we place a scale of values for the independent variable along the "x" (horizontal) axis and a scale of values for the dependent variable along the "y" (vertical) axis. If causality is not predicted and we simply wish to display a possible relationship, it does not matter where each variable is placed.

Before starting on the chart we must already have taken our measurements. We should have a pair of values for each case (one score for each variable). A single data point is plotted for each case, yielding a cloud of dots. A positive relationship means that scores for both variables rise and fall together. A negative relationship means that as scores for one variable rise, scores for the other variable fall. The strength of the relationship determines the shape of the cloud. A narrow cloud that closely fits a diagonal line indicates a strong positive or negative relationship. A diffuse cloud with no obvious pattern denotes a very weak relationship or none at all.

Most of the time it is difficult to use a scattergram to assess the nature and strength of a relationship. For that we turn to the r statistic, which is typically computed by keying the data for both variables into a software program. r can take on a range of values between +1 (perfect positive relationship) and -1 (perfect negative relationship), with a value of 0 denoting no relationship. The computer begins the process by calculating a "regression line", a straight line that best "fits" the data, or cloud of dots. It does so using the formula y = a + bx, where y is the predicted value of the dependent variable (the variable scaled along the y axis) for each value of the independent variable (the variable scaled along the x axis), a is the starting point for y when x is zero (the intercept), and b is the slope of the line. The computer uses our actual x and y values to derive a and b.
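For the curious, here is what the computer is doing under the hood, in a minimal Python sketch. The data are made up for illustration; only the least-squares formulas are standard:

```python
# Invented paired measurements: x = independent, y = dependent variable.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b (the slope) is the covariance of x and y divided by the variance of x;
# a (the intercept) is the value of y when x is zero.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

def predict(x):
    """The regression line: predicted y for a given x."""
    return a + b * x
```

Note that the fitted line always passes through the point (x-bar, y-bar), the two sample means.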

Relationships between variables are seldom perfect, so the line drawn by the computer will probably miss most of the dots we plotted using actual data. To finish producing the r statistic, the computer then calculates the "residual error of the estimate": the aggregate distance between its predictions for each y score (the line) and the actual locations of each y score (our dots). Since the r statistic is concerned with association between variables (not causality), it also adds in error in the other direction, meaning the distance between its predictions of x and the actual locations of x. Briefly, the smaller the aggregate error between the regression line and the dots (the narrower the cloud), the larger the r statistic (+ or -).

Given the inexactness of the social sciences, r's as small as .3 can signify a statistically significant relationship. (We'll get into tests of significance during a later class session.) If there is a hypothesis predicting cause-and-effect, we can square r. The result, known as the coefficient of determination, can be interpreted as the amount of variance (variation) in the dependent variable explained by the independent variable. (Don't expect large numbers. For example, the square of .3 is .09, meaning 9 percent.) This effect can be depicted visually. Look at the text, pg. 210, fig. 9.9. The bottom cloud is narrow, meaning a strong relationship between variables, while the top cloud is more diffuse, meaning a weak relationship. This can be confirmed by determining the range of scores in the dependent variable that are associated with any single score in the independent variable. In the bottom cloud each value of x is associated with a fairly narrow band of y scores, making x a pretty good predictor of y, while in the top cloud the range of y's for each value of x is much larger.
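The r statistic itself can be computed the same way. This Python sketch uses invented data; only the formula for Pearson's r is standard:

```python
import math

# Invented paired scores for two continuous variables.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# r is the covariance of x and y divided by the product of their
# dispersions (sums of squared deviations, square-rooted).
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))

r = cov / (sx * sy)
r_squared = r ** 2   # coefficient of determination: share of variance explained
```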

You probably remember our introduction of a third, "test" or control variable during elaboration analysis, which was used to analyze the relationship between discrete variables. Essentially the same process can be used in correlation analysis (it is called "partial" correlation). For example, if a third, test variable has no effect on the original bivariate (two-variable) correlation, we can say that our findings have been replicated. But if the original correlation changes, perhaps our hypothesis needs more work.

Remember the parking lot exercise? We tested the hypothesis that persons who make more money drive fancier cars. Our unit of measurement was cars. Each car was measured along two variables: whether it was parked in a faculty/staff lot or a student lot (nominal), and car value (ordinal). Since both variables were categorical, proportions were used to assess their relationship. A better, more direct way to conduct this exercise would have been to use persons as the unit of measurement. We could have simply asked a sample of people two questions: what is your income, and what is your car worth? Since both are interval-level variables, they could have been compared using correlation analysis. A strong positive relationship between the variables (persons who make little money drive cheap cars, and persons who make lots of dough drive fancy cars) would confirm the hypothesis.

Check it out: draw both axes of a scatterplot, with the independent variable, income, on the horizontal (x) axis and the dependent variable, car value, on the vertical (y) axis. Both variables start with values of zero at the lower left. Then make up realistic data for ten hypothetical respondents (or ask your family and friends). What shape does the regression line take? What kind of relationship between variables does it suggest?

Correlation and regression analysis can be used to determine more complex relationships, such as the proportionate contribution of each of numerous independent variables to variability in a single dependent variable.

Introduction to Inferential Statistics

By this point you should be able, with access to basic reference materials, to select representative samples, to design and administer survey instruments, to test simple assertions (hypotheses) and to design and evaluate commonplace projects. So why bother with inferential statistics?

As discussed in class, there comes a point when we must interpret findings. Say that the Chief is concerned about officer cynicism and wants you to evaluate Jay's super-duper (but very expensive) cynicism reduction course. Using the skills you learned in class, you draw a random sample of officers, pretest them, and get a mean cynicism score of 3.0. The officers then take the course, are retested, and now have a mean cynicism score of 2.5. You organize a lynch mob and go looking for Jay, since it looks like he sold you a bill of goods. But when you catch up with Jay at the local Alfa repair shop, he says he deserves a bonus. In his opinion a drop of 1/2 point in cynicism is very substantial. Should you set the noose or let him go?

Inferential statistics might - or might not - save Jay's bacon. They can tell you whether a difference between scores - between the pretest mean of 3.0 and the post-test mean of 2.5 - is statistically significant at a certain level of confidence. What is a level of confidence? By accepted convention, social scientists do not want to take a greater risk of being wrong than 5 chances in a hundred. This is called the .05 level. Running the appropriate statistical test can tell us whether "the difference between the pretest and post-test means is statistically significant at the .05 level". If it is not, string him up!

Some social scientists feel that results should not be deemed statistically significant until they reach the .01 level, meaning there is only one chance in a hundred that they were produced by chance alone.

Oops! What is this business of "being a product of chance alone"? Actually, it's quite simple. Say that we repeatedly draw random samples of 100 officers from a 1,000-officer department. We will find that the mean (average) cynicism score for each sample is different. These differences, which should be relatively minor, are caused by nothing except chance. The means will form a bell curve or "normal distribution", which has known, fixed properties (Text pg. 235, fig. 10.6).

Essentially the same thing took place when we flipped coins in class. On that occasion, our summary statistic was not a mean score - it was the proportion of heads achieved in each sample (or trial) of 10 coin tosses. After taking numerous samples of 10 tosses each, the proportions of heads distributed themselves normally (Text pg. 225, fig. 10.2).

Why didn't we get exactly five heads every time we tossed a coin ten times? If we repeatedly draw samples of 100 officers, why aren't their pretest score means exactly the same? Because of random "error" (chance fluctuation), that's why. Look at fig. 10.2 again - it's not just a distribution of sample proportions, it's also a distribution of error due to chance. What social scientists worry about - and with good reason - is that what may seem to be a significant difference between scores might simply be a chance fluctuation!

Fortunately, if we draw random samples, we know that summary statistics such as means and proportions will distribute themselves normally. This allows us to use the characteristics of the normal curve to determine the probability of obtaining a certain mean cynicism score for a sample of officers - or a certain proportion of heads in a sample of 10 coin tosses (Text pg. 235, fig. 10.6). Look at Figure 10.2 (pg. 225) - what is the probability of getting exactly five heads in a sample of 10 tosses? The answer is .246 - about 25 percent, or one out of four. What is the probability of getting five heads or less? By adding the probabilities we get .62 - 62 percent, or about three out of five. What is the probability of getting between no heads and ten heads? That's a probability of 1, of course, since you have to get between 0 and 10 heads. As long as the coin has not been tampered with, these will be the probabilities every time you take a sample of 10 tosses.
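These coin-toss probabilities are easy to verify. The sketch below recomputes the figures cited above from the binomial formula (the function name is our own):

```python
from math import comb

# Chance distribution for the number of heads in 10 fair coin tosses.
def p_heads(k, n=10):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n

p_five = p_heads(5)                                  # exactly 5 heads
p_five_or_less = sum(p_heads(k) for k in range(6))   # 0 through 5 heads
p_any = sum(p_heads(k) for k in range(11))           # 0 through 10 heads
p_ten = p_heads(10)                                  # all 10 heads
```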

Figure 10.2 depicts how error due to chance is distributed. Say that we flip a coin ten times and produce 10 heads. The chances that sampling error alone could produce ten heads in a sample of ten tosses are .001, or one in a thousand. For a social scientist, that's good enough to suggest you're a cheat - that your tosses are imperfect or that the coins have been tampered with. In effect, your tosses (the samples) are not random. What if you get nine heads? That probability is .01, or one in a hundred, so you're still on the hook. What about eight heads? The probability of getting this many heads by chance alone is .044, so if the cutoff is .05 you're still in trouble. What about getting three heads? Now you're free and clear.

This might seem like a silly example, and to some extent it is. Remember that before getting this far we would have formulated a hypothesis and done a thorough literature review. But the underlying principle - comparing our results to what might happen due to chance alone - is the engine that drives inferential statistics and makes interpreting results possible. This can only be accomplished if we use random samples, where the distribution of error due to chance enables us to calculate the probability of any particular outcome.

Soon we will be applying this knowledge to actual examples. First, however, we need to shift gears and learn how to calculate a confidence interval. A "confidence interval" is the predicted range of scores within which a population parameter will fall. It is always computed at a fixed level of probability, usually the "95 percent confidence interval", since we do not want to take more than 5 chances in a hundred that the real population parameter falls outside the range we predict.

As good as sampling is, we know that because of chance, summary statistics such as the mean fluctuate from sample to sample. So it cannot be said that a population mean will be the same as a sample mean. Usually we draw only one sample, so all we have available are two sample statistics: the mean, and s, the standard deviation, which depicts the extent to which scores are dispersed within the sample.

To draw the confidence interval we use the sample s, or standard deviation, to help calculate the standard error of the mean, essentially a "standard deviation" for the means of repeated samples. Since the means of repeated random samples disperse normally, we can use the probability characteristics of a normal curve. We use a mathematically derived variable, "z", whose values correspond to probability scores.

Fig. 10.6, pg. 235, is a "standard" normal curve, which distributes z scores according to their probability equivalents. We fix our sights on z's of +1.96 and -1.96 because these scores encompass 95 percent of all z scores in the distribution. In probability terms, the odds of obtaining z scores that high (or low) by chance alone are .05, or five in a hundred (both tails of the distribution are used, so .025 is doubled).

The remaining details for completing the calculation are in the text. Three final points:

1. We cannot use the sample s to represent the standard error of the mean. Dispersion between the means of repeated samples is always less than the dispersion of scores within a sample (Text Figs. 10.4 & 10.5, pp. 232-3).

2. Since larger samples have less variation due to chance, the number of cases in a sample (n) is used to correct s (see formula, pg. 233).

3. We know that means can be "pulled" by extreme scores. If a distribution is sufficiently skewed its mean may not be a good descriptor, so its use for any purpose is questionable. One tip-off of skewness is a high sample s, or standard deviation. And for now, that's as far as we will go on that subject.
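Putting these points together, here is a minimal Python sketch of a 95 percent confidence interval for a mean. The scores are invented; the steps (sample s, standard error, z of 1.96) follow the procedure described above:

```python
import math

# Invented sample of cynicism scores.
scores = [3.1, 2.8, 3.4, 3.0, 2.6, 3.3, 2.9, 3.2, 2.7, 3.0]

n = len(scores)
mean = sum(scores) / n

# Sample standard deviation s (dividing by n - 1).
s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

# The standard error of the mean corrects s by sample size:
# larger samples fluctuate less due to chance.
standard_error = s / math.sqrt(n)

# z = +/-1.96 brackets 95 percent of the standard normal curve.
lower = mean - 1.96 * standard_error
upper = mean + 1.96 * standard_error
```

We would then say, at the 95 percent confidence level, that the population mean falls between `lower` and `upper`.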

Testing Relationships for Statistical Significance

We have learned that we can use a sample mean to predict the likelihood that the mean of a population will fall within a certain range of scores. As discussed above, this process only works if the sample was randomly drawn. But this still doesn't answer the initial question - did Jay's program, which lowered cynicism scores by 1/2 point, actually work?

There are actually two issues.

1. If we repeatedly draw random samples from the same population, how much of a difference in their summary statistics - say, their means - can we expect by chance alone? Say we have a police department of 1,000 officers. We randomly draw six groups of ten officers each and give each officer a cynicism test. If that's all we do, how much could we expect the mean scores between groups to differ? (Since we're doing nothing special, and the samples are randomly drawn from the same population, the differences would be due to chance alone.)

Here we are interested in whether our observed difference between means is greater than what could happen by chance alone. Our basis for comparison is the "standard error of the difference between means", the estimated dispersion of differences between pairs of means drawn from repeated random samples (Text formula 11.3, p. 244). Since two samples are involved, we must take into account their combined dispersion, which is calculated by "pooling" their individual variances (Text formula 11.1, p. 243). (Contrast this to the procedure for setting a confidence interval, where the sample standard deviation - the square root of the variance - was used to estimate the standard error of the mean.)

The standard error of the difference between means represents the extent to which means can differ by chance alone. How do we compare this figure to the observed difference between means? Simple. We create a ratio with the observed difference as the numerator (on top) and the error as the denominator (on the bottom) (Text formula 11.4, pg. 244). The larger the observed difference relative to the chance error, the larger this ratio becomes. Sure - this ratio has a name: it's the "t" statistic.

Just like z, t has a probability distribution (Text Appendix C, pg. 317). The larger the t, the greater the likelihood that the difference between means is statistically significant. As you know, social scientists typically do not want to take a chance greater than five in 100 of being wrong, so to call a difference statistically significant the t must be at least as large as what the chart shows for the .05 level.
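The two-sample t computation can be sketched directly. The cynicism scores below are invented, but the steps follow the logic above: pool the variances, convert to a standard error of the difference, and divide the observed difference by it:

```python
import math

# Invented cynicism scores for two independent groups of officers.
group_a = [3.2, 2.9, 3.5, 3.1, 3.3]
group_b = [2.6, 2.4, 2.9, 2.5, 2.6]

def var(sample):
    """Sample variance (squared deviations divided by n - 1)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

n1, n2 = len(group_a), len(group_b)
mean_diff = sum(group_a) / n1 - sum(group_b) / n2

# Pool the two variances, then convert to the standard error of the
# difference between means.
pooled = ((n1 - 1) * var(group_a) + (n2 - 1) * var(group_b)) / (n1 + n2 - 2)
se_diff = math.sqrt(pooled * (1 / n1 + 1 / n2))

t = mean_diff / se_diff
df = n1 + n2 - 2   # degrees of freedom: combined n, minus 2
```

The resulting t would then be compared against the tabled value for the chosen significance level at df degrees of freedom.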

2. Well, that's interesting, but it didn't help us decide whether Jay's cynicism reduction program works. Jay's intervention (the program) was applied to a single sample of officers, reducing their cynicism by .5 on a 1-5 scale. We need to find out whether this reduction is greater than the "chance" reductions we would get by repeatedly testing the officers without putting them through the program.

The formula for this "one-sample, paired differences" t-test is not in our book (it's in a handout). Suffice it to say that the chance differences when repeatedly testing the same group will be considerably smaller than the chance differences between groups. A difference (such as .5) that is not significant between groups of officers can be highly significant if it represents before-and-after testing of the same group of officers.
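Since the formula is not in the book, here is one common version of the paired-differences t as a hedged Python sketch (the before/after scores are invented; check the details against the handout):

```python
import math

# Invented before/after cynicism scores for the same six officers.
before = [3.0, 3.2, 2.8, 3.1, 2.9, 3.0]
after  = [2.5, 2.8, 2.3, 2.6, 2.4, 2.4]

# Work with each officer's individual change, not the group means.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
mean_diff = sum(diffs) / n

# Standard deviation of the differences, then the usual one-sample t.
s_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
t = mean_diff / (s_diff / math.sqrt(n))
df = n - 1
```

Because each officer serves as his or her own control, the differences vary much less than scores between separate groups, which is why the same .5 drop can be far more significant here.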

A few more things:

1. You might have read about a "null hypothesis", or hypothesis of no significant difference between means. We begin by assuming that an observed difference between means does not exceed the difference that could be attributed to chance alone. We can only reject the null hypothesis if the value we obtain for t is sufficiently large to be significant at the .05 level.

2. Df, or "degrees of freedom". This is the combined sample n, or total number of cases, with a minor adjustment (Text formula 11.5, p. 244).

3. One-tailed or two-tailed? If our working hypothesis predicts that a difference in means will be in a certain direction (e.g., mean cynicism score will fall) we use only one-half of the probability distribution for t. If the working hypothesis predicts a difference but says nothing of the direction, then both "tails" of the distribution must be used. When the direction of a difference is predicted, smaller t's are required to obtain a statistically significant result.

Regardless of the kind of data, whenever a random sample has been drawn there is usually a statistical test of significance that can be applied. Each time the underlying logic is the same: could the results have been produced by chance alone? For example:

T-tests are appropriate when the independent variable is nominal-level (cynicism course - yes or no, gender - male or female) and the dependent variable is interval-level (cynicism score, confidence in the police score). An extension of the t test known as analysis of variance can be used when there are multiple independent variables.

When both variables are interval-level, the computer program that runs correlation analysis will report whether obtained coefficients (say, +.5) are statistically significant at the .05 and .01 levels.

When both variables are discrete, a statistical test known as Chi-square can be applied.
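As a sketch of the mechanics (not the full test, which also requires comparing the result to a critical value), here is the Chi-square statistic for an invented 2x2 table of two discrete variables:

```python
# Invented observed counts for a 2x2 crosstab (rows and columns are
# the categories of two discrete variables).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Chi-square compares each observed count to the count expected if the
# two variables were unrelated (row total * column total / grand total).
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_square += (obs - expected) ** 2 / expected
```

The larger the statistic, the further the observed table sits from what chance alone would produce; significance is then read from a Chi-square table at the appropriate degrees of freedom.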

By this point, you should know enough to get into a lot of trouble (just kidding!). Keep in mind that even "experts" frequently turn to others when conducting research. If you need to interpret data, go for it, but remember that you are not alone. First read the book, then use your networking skills to consult with colleagues and educators. That's how it's done in the "real world"!