Well-designed and implemented randomized controlled trials are considered
the "gold standard" for evaluating an intervention's effectiveness
in fields such as medicine, welfare and employment policy, and psychology.7
This section discusses what a randomized controlled trial is, and outlines evidence
indicating that such trials should play a similar role in education.

A. Definition: Randomized controlled trials are studies that randomly
assign individuals to an intervention group or to a control group, in order
to measure the effects of the intervention.

For example, suppose you want to test, in a randomized controlled trial,
whether a new math curriculum for third-graders is more effective than your
school's existing math curriculum for third-graders. You would randomly assign
a large number of third-grade students to either an intervention group, which
uses the new curriculum, or to a control group, which uses the existing curriculum.
You would then measure the math achievement of both groups over time. The
difference in math achievement between the two groups would represent the
effect of the new curriculum compared to the existing curriculum.
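
The procedure in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the student roster, the simulated test scores (normally distributed around 50), the 5-point true effect, and the function names `randomly_assign`, `estimated_effect`, and `simulate_trial` are all hypothetical and are not drawn from any actual study.

```python
import random
from statistics import mean


def randomly_assign(student_ids, seed=0):
    """Shuffle the roster and split it into intervention and control halves."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def estimated_effect(scores, intervention_ids, control_ids):
    """Mean outcome of the intervention group minus mean outcome of the control group."""
    intervention_mean = mean(scores[s] for s in intervention_ids)
    control_mean = mean(scores[s] for s in control_ids)
    return intervention_mean - control_mean


def simulate_trial(n_students=2000, true_effect=5.0, seed=0):
    """Simulate one trial with made-up scores (assumption: normal, mean 50, sd 10)."""
    rng = random.Random(seed)
    students = list(range(n_students))
    intervention_ids, control_ids = randomly_assign(students, seed=seed)
    treated = set(intervention_ids)
    # Each student's score is a random baseline, plus the effect if treated.
    scores = {
        s: rng.gauss(50, 10) + (true_effect if s in treated else 0.0)
        for s in students
    }
    return estimated_effect(scores, intervention_ids, control_ids)


print(round(simulate_trial(), 2))  # should land near the assumed true effect of 5
```

With a large number of students, the difference in group means should fall close to the assumed true effect; with only a handful of students, chance differences between the groups can easily swamp it.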

In a variation on this basic concept, sometimes individuals are randomly
assigned to two or more intervention groups as well as to a control group,
in order to measure the effects of different interventions in one trial. Also,
in some trials, entire classrooms, schools, or school districts - rather than
individual students - are randomly assigned to intervention and control groups.
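
Both variations can be sketched in Python as follows. The classroom roster, the arm labels, and the helper name `assign_units` are hypothetical, chosen only for illustration. The point of the sketch is that the unit being randomized is the classroom, and every student simply inherits the assignment of his or her classroom.

```python
import random


def assign_units(units, arms, seed=0):
    """Randomly assign each unit to one arm, keeping arm sizes as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the arms in turn, like dealing cards.
    return {unit: arms[i % len(arms)] for i, unit in enumerate(shuffled)}


# Hypothetical roster: six classrooms of two students each.
classrooms = {
    "room_a": ["s01", "s02"],
    "room_b": ["s03", "s04"],
    "room_c": ["s05", "s06"],
    "room_d": ["s07", "s08"],
    "room_e": ["s09", "s10"],
    "room_f": ["s11", "s12"],
}
# Two intervention groups plus a control group.
arms = ["intervention_1", "intervention_2", "control"]

# Randomize classrooms, not students; each student inherits the classroom's arm.
classroom_arm = assign_units(classrooms, arms)
student_arm = {
    student: classroom_arm[room]
    for room, students in classrooms.items()
    for student in students
}

for room in sorted(classroom_arm):
    print(room, classroom_arm[room])
```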

B. The unique advantage of random assignment: It enables you to
evaluate whether the intervention itself, as opposed to other factors, causes
the observed outcomes.

Specifically, the process of randomly assigning a large number of individuals
to either an intervention group or a control group ensures, to a high degree
of confidence, that there are no systematic differences between the groups
in any characteristics (observed and unobserved) except one - namely, the
intervention group participates in the intervention, and the control group
does not. Therefore - assuming the trial is properly carried out (per the
guidelines below) - the resulting difference in outcomes between the intervention
and control groups can confidently be attributed to the intervention and not
to other factors.

C. There is persuasive evidence that the randomized controlled trial,
when properly designed and implemented, is superior to other study designs
in measuring an intervention's true effect.

1. "Pre-post" study designs often produce erroneous results.

Definition: A "pre-post" study examines whether participants
in an intervention improve or regress during the course of the intervention,
and then attributes any such improvement or regression to the intervention.

The problem with this type of study is that, without reference to a control
group, it cannot answer whether the participants' improvement or decline
would have occurred anyway, even without the intervention. This often leads
to erroneous conclusions about the effectiveness of the intervention.

Example: A randomized controlled trial of Even Start - a
federal program designed to improve the literacy of disadvantaged families
- found that the program had no effect on the school readiness
of participating children at the 18-month follow-up. Specifically, there
were no significant differences between young children in the program and
those in the control group on measures of school readiness including the
Peabody Picture Vocabulary Test (PPVT) and PreSchool Inventory.8

If a pre-post design rather than a randomized design had been used
in this study, the study would have concluded erroneously that the program
was effective in increasing school readiness. This is because both
the children in the program and those in the control group showed improvement
in school readiness during the course of the program (e.g., both groups
of children improved substantially in their national percentile ranking
on the PPVT). A pre-post study would have attributed the participants' improvement
to the program whereas in fact it was the result of other factors, as evidenced
by the equal improvement for children in the control group.

Example: A randomized controlled trial of the Summer Training
and Education Program - a Labor Department pilot program that provided summer
remediation and work experience for disadvantaged teenagers - found that
the program's short-term impact on participants' reading ability was positive.
Specifically, while the reading ability of the control group members eroded
by a full grade-level during the first summer of the program, the reading
ability of participants in the program eroded by only a half grade-level.9

If a pre-post design rather than a randomized design had been used
in this study, the study would have concluded erroneously that the program
was harmful. That is, the study would have found a decline in participants'
reading ability and attributed it to the program. In fact, however, the
participants' decline in reading ability was the result of other factors
- such as the natural erosion of reading ability during the summer vacation
months - as evidenced by the even greater decline for members of the control
group.
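
The arithmetic behind the two conclusions can be laid out as a short Python sketch. The two changes in reading ability are the figures reported above; the function names `pre_post_estimate` and `randomized_estimate` are hypothetical labels for the two ways of reading the same data.

```python
def pre_post_estimate(participant_change):
    """A pre-post design credits the whole change among participants to the program."""
    return participant_change


def randomized_estimate(participant_change, control_change):
    """A randomized design subtracts the change the control group experienced anyway."""
    return participant_change - control_change


# Changes in reading ability over the first summer, in grade levels (from the text).
participant_change = -0.5
control_change = -1.0

print(pre_post_estimate(participant_change))                    # -0.5: program looks harmful
print(randomized_estimate(participant_change, control_change))  # 0.5: program helped
```

The same subtraction explains the Even Start result: when participants and the control group improve by the same amount, the randomized estimate is zero even though the pre-post estimate is positive.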

2. The most common "comparison group" study designs (also
known as "quasi-experimental" designs) also lead to erroneous
conclusions in many cases.

a. Definition: A "comparison group" study compares
outcomes for intervention participants with outcomes for a comparison
group chosen through methods other than randomization.

The following example illustrates the basic concept of this design.
Suppose you want to use a comparison-group study to test whether a new
mathematics curriculum is effective. You would compare the math performance
of students who participate in the new curriculum ("intervention
group") with the performance of a "comparison group" of
students, chosen through methods other than randomization, who do not
participate in the curriculum. The comparison group might be students
in neighboring classrooms or schools that don't use the curriculum, or
students in the same grade and socioeconomic status selected from state
or national survey data. The difference in math performance between the
intervention and comparison groups following the intervention would represent
the estimated effect of the curriculum.

Some comparison-group studies use statistical techniques to create a
comparison group that is matched with the intervention group in socioeconomic
and other characteristics, or to otherwise adjust for differences between
the two groups that might lead to inaccurate estimates of the intervention's
effect. The goal of such statistical techniques is to simulate a randomized
controlled trial.

b. There is persuasive evidence that the most common comparison-group
designs produce erroneous conclusions in a sizeable number of cases.

A number of careful investigations have been carried out - in the areas
of school dropout prevention,10 K-3 class-size reduction,11
and welfare and employment policy12 - to examine whether and
under what circumstances comparison-group designs can replicate the results
of randomized controlled trials.13 These investigations first compare
participants in a particular intervention with a control group, selected
through randomization, in order to estimate the intervention's impact
in a randomized controlled trial. Then the same intervention participants
are compared with a comparison group selected through methods other than
randomization, in order to estimate the intervention's impact in a comparison-group
design. Any systematic difference between the two estimates represents
the inaccuracy produced by the comparison-group design.

These investigations have shown that most comparison-group designs in
education and other areas produce inaccurate estimates of an intervention's
effect. This is because of unobservable differences between the members
of the two groups that differentially affect their outcomes. For example,
if intervention participants self-select into the intervention
group, they may be more motivated to succeed than their comparison-group
counterparts. Their motivation - rather than the intervention - may then
lead to their superior outcomes. In a sizeable number of cases, the inaccuracy
produced by the comparison-group designs is large enough to result in
erroneous overall conclusions about whether the intervention is effective,
ineffective, or harmful.
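
The logic of these investigations, and the self-selection problem they uncover, can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the unobserved "motivation" score, the rule that only above-average-motivation students volunteer, the outcome formula, and the function name `within_study_comparison`. The true effect is set to zero, so any nonzero estimate reflects error in the design rather than the intervention.

```python
import random
from statistics import mean


def within_study_comparison(n=4000, true_effect=0.0, seed=1):
    """Estimate one intervention's effect two ways, using the same participants."""
    rng = random.Random(seed)
    intervention, control, comparison = [], [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # unobserved by the researcher
        outcome = 50 + 5 * motivation + rng.gauss(0, 5)
        if motivation > 0:
            # Volunteers are randomly split into intervention and control groups.
            if rng.random() < 0.5:
                intervention.append(outcome + true_effect)
            else:
                control.append(outcome)
        else:
            # Non-volunteers serve as the non-randomized comparison group.
            comparison.append(outcome)
    experimental = mean(intervention) - mean(control)
    nonexperimental = mean(intervention) - mean(comparison)
    return experimental, nonexperimental


experimental, nonexperimental = within_study_comparison()
print("randomized estimate:      ", round(experimental, 2))
print("comparison-group estimate:", round(nonexperimental, 2))
print("design inaccuracy:        ", round(nonexperimental - experimental, 2))
```

Under these assumptions the randomized estimate should sit near zero, while the comparison-group estimate should be clearly positive, because it picks up the motivation gap between volunteers and non-volunteers rather than any effect of the intervention.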

Example from medicine. Over the past 30 years, more than
two dozen comparison-group studies have found hormone replacement therapy
for postmenopausal women to be effective in reducing the women's risk
of coronary heart disease, by about 35-50 percent. But when hormone therapy
was finally evaluated in two large-scale randomized controlled trials
- medicine's "gold standard" - it was actually found to do the
opposite: it increased the risk of heart disease, as well as stroke
and breast cancer.14

Medicine contains many other important examples of interventions whose
effect as measured in comparison-group studies was subsequently contradicted
by well-designed randomized controlled trials. If randomized controlled
trials in these cases had never been carried out and the comparison-group
results had been relied on instead, the result would have been needless
death or serious illness for millions of people. This is why the Food
and Drug Administration and National Institutes of Health generally
use the randomized controlled trial as the final arbiter of which medical
interventions are effective and which are not.

3. Well-matched comparison-group studies can be valuable in generating
hypotheses about "what works," but their results need to be confirmed
in randomized controlled trials.

The investigations, discussed above, that compare comparison-group designs
with randomized controlled trials generally support the value of comparison-group
designs in which the comparison group is very closely matched with
the intervention group in prior test scores, demographics, time period in
which they are studied, and methods used to collect outcome data. In most
cases, such well-matched comparison-group designs seem to yield correct
overall conclusions about whether an intervention is effective,
ineffective, or harmful. However, their estimates of the size of the intervention's
impact are still often inaccurate. As an illustrative example, a well-matched
comparison-group study might find that a program to reduce class size raises
test scores by 40 percentile points - or, alternatively, by 5 percentile
points - when its true effect is 20 percentile points. Such inaccuracies
are large enough to lead to incorrect overall judgments about the policy
or practical significance of the intervention in a nontrivial number of
cases.

As discussed in section III of this Guide, we believe that such well-matched
studies can play a valuable role in education, as they have in medicine
and other fields, in establishing "possible" evidence of an intervention's
effectiveness, and thereby generating hypotheses that merit confirmation
in randomized controlled trials. But the evidence cautions strongly against
using even the most well-matched comparison-group studies as a final arbiter
of what is effective and what is not, or as a reliable guide to the strength
of the effect.

D. Thus, we believe there are compelling reasons why randomized controlled
trials are a critical factor in establishing "strong" evidence of
an intervention's effectiveness.