This is the accessible text file for GAO report number GAO-09-911
entitled 'No Child Left Behind Act: Enhancements in the Department of
Education's Review Process Could Improve State Academic Assessments'
which was released on September 24, 2009.
This text file was formatted by the U.S. Government Accountability
Office (GAO) to be accessible to users with visual impairments, as part
of a longer term project to improve GAO products' accessibility. Every
attempt has been made to maintain the structural and data integrity of
the original printed product. Accessibility features, such as text
descriptions of tables, consecutively numbered footnotes placed at the
end of the file, and the text of agency comment letters, are provided
but may not exactly duplicate the presentation or format of the printed
version. The portable document format (PDF) file is an exact electronic
replica of the printed version. We welcome your feedback. Please E-mail
your comments regarding the contents or accessibility features of this
document to Webmaster@gao.gov.
This is a work of the U.S. government and is not subject to copyright
protection in the United States. It may be reproduced and distributed
in its entirety without further permission from GAO. Because this work
may contain copyrighted images or other material, permission from the
copyright holder may be necessary if you wish to reproduce this
material separately.
Report to the Chairman, Committee on Health, Education, Labor, and
Pensions, U.S. Senate:
United States Government Accountability Office:
GAO:
September 2009:
No Child Left Behind Act:
Enhancements in the Department of Education's Review Process Could
Improve State Academic Assessments:
GAO-09-911:
GAO Highlights:
Highlights of GAO-09-911, a report to the Chairman, Committee on
Health, Education, Labor, and Pensions, U.S. Senate.
Why GAO Did This Study:
The No Child Left Behind Act of 2001 (NCLBA) requires states to develop
high-quality academic assessments aligned with state academic
standards. Education has provided states with about $400 million for
NCLBA assessment implementation every year since 2002. GAO examined (1)
changes in reported state expenditures on assessments, and how states
have spent funds; (2) factors states have considered in making
decisions about question (item) type and assessment content; (3)
challenges states have faced in ensuring that their assessments are
valid and reliable; and (4) the extent to which Education has supported
state efforts to comply with assessment requirements. GAO surveyed
state and District of Columbia assessment directors, analyzed Education
and state documents, and interviewed assessment officials from
Maryland, Rhode Island, South Dakota, and Texas and eight school
districts in addition to assessment vendors and experts.
What GAO Found:
States reported their overall annual expenditures for assessments have
increased since passage of the No Child Left Behind Act of 2001
(NCLBA), which amended the Elementary and Secondary Education Act of
1965 (ESEA), and assessment development was the largest expense for
most states. Forty-eight of 49 states that responded to our survey said
that annual expenditures for ESEA assessments have increased since
NCLBA was enacted. Over half of the states reported that overall
expenditures grew due to development of new assessments. Test and
question--also referred to as item--development was most frequently
reported by states to be the largest ESEA assessment expense, followed
by scoring. State officials in selected states reported that alternate
assessments for students with disabilities were more costly than
general population assessments. In addition, 19 states reported that
assessment budgets had been reduced by state fiscal cutbacks.
Cost and time pressures have influenced state decisions about
assessment item type--such as multiple choice or open/constructed
response--and content. States most often chose multiple choice items
because they can be scored inexpensively within tight time frames
resulting from the NCLBA requirement to release results before the next
school year. State officials also reported facing trade-offs between
efforts to assess highly complex content and to accommodate cost and
time pressures. As an alternative to using mostly multiple choice, some
states have developed practices, such as pooling resources with other
states to take advantage of economies of scale, that let them reduce
costs and use more open/constructed response items.
Challenges facing states in their efforts to ensure valid and reliable
assessments involved staff capacity, alternate assessments, and
assessment security. State capacity to provide vendor oversight varied,
both in terms of number of state staff and measurement-related
expertise. Also, states have been challenged to ensure validity and
reliability for alternate assessments. In addition, GAO identified
several gaps in assessment security policies that could affect validity
and reliability but were not addressed in Education's review process
for overseeing state assessments. An Education official said that
assessment security was not a focus of its review. The review process
was developed before recent efforts to identify assessment security
best practices.
Education has provided assistance to states, but issues remain with
communication during the review process. Education provided assistance
in a variety of ways, and states reported that they most often used
written guidance and Education-sponsored meetings and found these
helpful. However, Education's review process did not allow states to
communicate with reviewers during the process to clarify issues, which
led to miscommunication. In addition, state officials were in some
cases unclear about what review issues they were required to address
because Education did not identify for states why its decisions
differed from the reviewers' written comments.
What GAO Recommends:
GAO recommends that Education (1) incorporate assessment security best
practices into its peer review protocols, (2) improve communication
during the review process, and (3) identify for states why its peer
review decisions in some cases differed from peer reviewers' written
comments. Education indicated that it believes its current practices
are sufficient regarding our first recommendation and agreed with GAO's
other two recommendations.
View [hyperlink, http://www.gao.gov/products/GAO-09-911] or key
components. For more information, contact Cornelia Ashby at (202) 512-
7215 or AshbyC@gao.gov.
[End of section]
Contents:
Letter:
Background:
States Reported That Assessment Spending Has Increased Since NCLBA Was
Enacted and Test Development Has Been the Largest Assessment Cost in
Most States:
States Have Considered Cost and Time in Making Decisions about
Assessment Item Type and Content:
States Faced Several Challenges in Their Efforts to Ensure Valid and
Reliable ESEA Assessments, including Staff Capacity, Alternate
Assessments, and Assessment Security:
Education Has Provided Assistance to States, but the Peer Review
Process Did Not Allow for Sufficient Communication:
Conclusions:
Recommendations for Executive Action:
Agency Comments and Our Evaluation:
Appendix I: Objectives, Scope, and Methodology:
Appendix II: Student Population Assessed on ESEA Assessments in School
Year 2007-08:
Appendix III: Validity Requirements for Education's Peer Review:
Appendix IV: Reliability Requirements for Education's Peer Review:
Appendix V: Alignment Requirements for Education's Peer Review:
Appendix VI: Item Types Used Most Frequently by States on General and
Alternate Assessments:
Appendix VII: Comments from the U.S. Department of Education:
Appendix VIII: GAO Contact and Staff Acknowledgments:
Table:
Table 1: Illustration of Depth of Knowledge Levels:
Figures:
Figure 1: Examples of Item Types:
Figure 2: State Expenditures for Assessment Vendors, 2007-08:
Figure 3: ESEA Assessment Activities That Received the Largest Share of
States' Total ESEA Assessment Costs, 2007-08:
Figure 4: The Number of States Reporting Changes in Item Type Use on
ESEA Assessments since 2002:
Figure 5: Number of FTEs Dedicated to ESEA Assessments in States, 2007-
08:
Abbreviations:
ARRA: The American Recovery and Reinvestment Act of 2009:
AYP: Adequate Yearly Progress:
CCSSO: Council of Chief State School Officers:
Education: U.S. Department of Education:
ESEA: The Elementary and Secondary Education Act:
FTE: full-time equivalent:
LEP: Limited English Proficiency:
NCLBA: The No Child Left Behind Act of 2001:
NECAP: The New England Common Assessment Program:
SFSF: State Fiscal Stabilization Fund:
TAC: Technical Advisory Committee:
[End of section]
United States Government Accountability Office:
Washington, DC 20548:
September 24, 2009:
The Honorable Tom Harkin:
Chairman:
Committee on Health, Education, Labor, and Pensions:
United States Senate:
Dear Mr. Chairman:
The No Child Left Behind Act of 2001 (NCLBA), which amended the
Elementary and Secondary Education Act of 1965 (ESEA), aims to improve
student achievement, particularly among poor and minority students. To
reach this goal, the law requires states to develop high-quality
academic assessments aligned with challenging state academic standards
that measure students' knowledge of reading/language arts, mathematics,
and science. Student achievement as measured by these assessments is
the basis for school accountability, including corrective actions such
as removing principals or implementing new curricula. NCLBA required
that states test all students in grades 3 through 8 annually in
mathematics and reading/language arts and at least once in one of the
high school grades by the 2005-06 school year. It also required that
states test students in science at least once in elementary, middle,
and high school by 2007-08. Education has provided states with about
$400 million for ESEA assessment[Footnote 1] implementation every year
since 2002. To ensure that assessments appropriately measure student
achievement, the law requires that assessments be valid and reliable
and that they measure higher-order thinking skills and understanding.
The U.S. Department of Education's (Education) guidance defines valid
assessments as those for which results accurately reflect students'
knowledge in a subject, and it defines reliable assessments as those
that produce similar results among students with similar levels of
knowledge. The law also directs states to assess all students,
including those with disabilities. For children with significant
cognitive disabilities, Education has directed states to develop
alternate assessments that measure achievement on alternate state
standards designed for these children.
States have primary responsibility for developing ESEA assessments and
ensuring their technical quality, and can work with private assessment
vendors that provide a range of assessment services, such as question
(item)[Footnote 2] development and scoring. Education provides
technical assistance and oversees state implementation of ESEA
assessment requirements through its standards and assessments peer
review process. In Education's peer review process, a group of experts-
-reviewers--examines whether states are complying with ESEA assessment
requirements, including requirements for validity and reliability, and
whether assessments cover the full depth and breadth of academic
standards.
NCLBA increased the number of assessments that states are required to
develop compared to prior years, and states have reported facing
challenges in implementing these new assessments. Little is known about
how federal, state, and local funds have been used for assessments, or
how states make key decisions as they implement ESEA assessments, such
as whether to use multiple choice or open/constructed response items.
To shed light on these issues and to assist Congress in its next
reauthorization of ESEA, the Chairman of the Senate Committee on
Health, Education, Labor, and Pensions requested that GAO provide
information on the quality and funding of student assessments.
Specifically, you asked GAO to examine the following questions: (1) How
have state expenditures on ESEA assessments changed since NCLBA was
enacted in 2002, and how have states spent funds? (2) What factors have
states considered in making decisions about item type and content of
their ESEA assessments? (3) What challenges, if any, have states faced
in ensuring the validity and reliability of their ESEA assessments? (4)
To what extent has Education supported state efforts to comply with
ESEA assessment requirements?
To conduct our work, we used a variety of methods, including reviews of
Education and state documents, a 50-state survey, interviews with
Education officials, and site visits in 4 states. We also reviewed
relevant federal laws and regulations. To learn whether state
expenditures for assessments have changed since NCLBA enactment, and if
so, how they have changed, and how states have spent these funds, we
analyzed responses to our state survey, which was administered to
assessment directors of the 50 states and the District of Columbia in
January 2009. We received responses from 49 states, for a 96 percent
response rate.[Footnote 3]
We also conducted site visits to four states--Maryland, Rhode Island,
South Dakota, and Texas--that reflect a range of population size and
results from Education's assessment peer review. On these site visits
we interviewed state officials, officials from two districts in each
state, and technical advisors to each state.
To gather information about factors states consider when making
decisions about the item type and content of their assessments, we
analyzed our survey and interviewed state officials and state technical
advisors from our site visit states. We reviewed studies from our site
visit states that evaluated the alignment between state standards and
assessments, including the level of cognitive complexity in
assessments, and spoke with representatives from four alignment
organizations--organizations that evaluate the alignment between state
standards and assessments--that states hire to conduct these studies.
These alignment organizations included the three organizations that
states most frequently hire to conduct alignment studies, and
representatives of a fourth alignment organization that was used by one
of our site visit states.
In addition, we interviewed four assessment vendors that were selected
because they work with a large number of states to obtain their
perspectives on ESEA assessments and the assessment industry. We used
our survey to collect information about challenges states have faced in
ensuring validity and reliability. We also reviewed state documents
from our site visit states, such as test security documentation for
peer review and assessment security protocols, and interviewed state
officials. We asked our site visit states to review a checklist created
by the Council of Chief State School Officers (CCSSO), an association
of state education agencies. A CCSSO official indicated that this
checklist is still valid for state assessment programs.
To address the extent of Education's support and oversight of ESEA
assessment implementation, we reviewed Education guidance, summaries of
Education assistance, and peer review protocols and training documents,
and interviewed Education officials in charge of the peer review and
assistance efforts.
We conducted this performance audit from August 2008 through September
2009 in accordance with generally accepted government auditing
standards. Those standards require that we plan and perform the audit
to obtain sufficient, appropriate evidence to provide a reasonable
basis for our findings and conclusions based on our audit objectives.
We believe that the evidence obtained provides a reasonable basis for
our findings and conclusions based on our audit objectives.
Background:
The ESEA was created to improve the academic achievement of
disadvantaged children.[Footnote 4] The Improving America's Schools Act
of 1994, which reauthorized ESEA, required states to develop state
academic content standards, which specify what all students are
expected to know and be able to do, and academic achievement standards,
which are explicit definitions of what students must know and be able
to do to demonstrate proficiency.[Footnote 5] In addition, the 1994
reauthorization required assessments aligned to those standards. The
most recent reauthorization of the ESEA, the No Child Left Behind Act
of 2001, built on the 1994 requirements by, among other things,
increasing the number of grades and subject areas in which states were
required to assess students.[Footnote 6] NCLBA also required states to
establish goals for the percentage of students attaining proficiency on
ESEA assessments that are used to hold schools and districts
accountable for the academic performance of students. Schools and
districts failing to meet state proficiency goals for 2 or more years
must take actions, prescribed by NCLBA, in order to improve student
achievement. Every state, district, and school receiving funds under
Title I, Part A of ESEA--the federal formula grant program dedicated to
improving the academic achievement of the disadvantaged--is required to
implement the changes described in NCLBA.
ESEA assessments may contain one or more of various item types,
including multiple choice, open/constructed response, checklists,
rating scales, and work samples or portfolios. GAO's prior work has
found that item type is a major factor influencing the overall cost of
state assessments and that multiple choice items are less expensive to
score than open/constructed response items.[Footnote 7] Figure 1
describes several item types states use to assess student knowledge.
Figure 1: Examples of Item Types:
[Refer to PDF for image: illustration]
Multiple choice item:
An item that offers students two or more answer choices from which to
select the best answer to the question or stem statement.
Open/constructed response item:
An item for which a student provides a brief response to a posed
question; the response may be a few words or longer.
Checklist:
Often used for observing performance in order to keep track of a
student's progress or work over time. This can also be used to
determine whether students have met established criteria on a task.
Rating scale:
Used to provide feedback of a student's performance on an assessment
based on pre-determined criteria.
Work samples or portfolios:
A teacher presents tasks for students to perform, and then rates and
records students' responses for each task. These ratings and responses
are recorded in student portfolios.
Source: GAO; images, Art Explosion.
[End of figure]
NCLBA authorized additional funding to states for these assessments
under the Grants for State Assessments program. Each year, each state
has received a $3 million base amount regardless of its size, plus an
additional amount based on its share of the nation's school-age
population. States must first use the funds to pay the cost of
developing the additional state standards and assessments. If a state
has already developed the required standards and assessments, NCLBA
allows these funds to be used to administer assessments or for other
activities, such as developing challenging state academic standards in
subject areas other than those required by NCLBA and ensuring that
state assessments remain valid and reliable. In years that the grants
have been awarded, the Grants for Enhanced Assessment Instruments
program (Enhanced Assessment grants) has provided between $4 million
and $17 million to several states. Applicants for Enhanced Assessment
grants receive preference if they plan to fund assessments for students
with disabilities or for Limited English Proficiency (LEP) students, or
if they are part of a collaborative effort among states. States may also use
other federal funds for assessment-related activities, such as funds
for students with disabilities, and funds provided under the American
Recovery and Reinvestment Act of 2009 (ARRA).[Footnote 8] ARRA provides
about $100 billion for education through a number of different
programs, including the State Fiscal Stabilization Fund (SFSF). In
order to receive SFSF funds, states must provide certain assurances,
including that the state is committed to improving the quality of state
academic standards and assessments. In addition, Education recently
announced plans to make $4.35 billion in incentive grants available to
states through SFSF on a competitive basis. These grants--referred to
by Education as the Race to the Top program--can be used by states for,
among other things, improving the quality of assessments.
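To make the allocation arithmetic concrete, the following sketch
illustrates how a base-plus-population-share formula of this kind
works. It is a simplified illustration with invented appropriation and
population figures, not Education's actual computation, which involves
additional statutory details.

    # Simplified sketch of a base-plus-population-share grant formula
    # like the Grants for State Assessments program: each state gets a
    # $3 million base, and the rest of the appropriation is divided in
    # proportion to each state's share of the national school-age
    # population. All figures below are invented for illustration.

    BASE_GRANT = 3_000_000

    def allocate(appropriation, school_age_pop):
        """Return a mapping of state name to grant amount in dollars."""
        remainder = appropriation - BASE_GRANT * len(school_age_pop)
        national_pop = sum(school_age_pop.values())
        return {state: BASE_GRANT + remainder * pop / national_pop
                for state, pop in school_age_pop.items()}

    # A $400 million appropriation split among three hypothetical states:
    grants = allocate(400_000_000,
                      {"State A": 5_000_000, "State B": 1_000_000,
                       "State C": 250_000})
    for state, amount in grants.items():
        print(f"{state}: ${amount:,.0f}")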
Like other students, those with disabilities must be included in
statewide ESEA assessments. This is accomplished in different ways,
depending on the effects of a student's disability. Most students with
disabilities participate in the regular statewide assessment either
without accommodations or with appropriate accommodations, such as
having unlimited time to complete the assessments, using large print or
Braille editions of the assessments, or being provided individualized
or small group administration of the assessments. States are permitted
to use alternate academic achievement standards to evaluate the
performance of students with the most significant cognitive
disabilities. Alternate achievement standards must be linked to the
state's grade-level academic content standards but may include
prerequisite skills within the continuum of skills culminating in grade-
level proficiency. For these students, a state must offer alternate
assessments that measure students' performance. For example, the
alternate assessment might assess students' knowledge of fractions by
splitting groups of objects into two, three, or more equal parts. While
alternate assessments can be administered to all eligible children, the
number of proficient and advanced scores from alternate assessments
based on alternate achievement standards included in Adequate Yearly
Progress (AYP)[Footnote 9] decisions generally is limited to 1 percent
of the total tested population at the state and district levels.
[Footnote 10] In addition, states may develop modified academic
achievement standards--achievement standards that define proficiency at
a lower level than the achievement standards used for the general
assessment population, but are still aligned with grade-level content
standards--and use alternate assessments based on those standards for
eligible students whose disabilities preclude them from achieving grade-
level proficiency within the same period of time as other students.
States may include scores from such assessments in making AYP decisions
but those scores generally are capped at 2 percent of the total tested
population.[Footnote 11]
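As a rough numerical illustration of how these caps operate, consider
the sketch below; the enrollment and score counts are invented, and
actual AYP determinations involve additional rules and exceptions.

    # Simplified sketch of the 1 percent and 2 percent caps on
    # proficient scores from alternate assessments counted in AYP
    # decisions. All numbers are invented for illustration.

    total_tested = 100_000  # total tested population in a state or district

    cap_alternate_std = int(total_tested * 0.01)  # alternate achievement standards
    cap_modified_std = int(total_tested * 0.02)   # modified achievement standards

    # Suppose 1,500 students scored proficient on alternate assessments
    # based on alternate achievement standards. Only 1 percent of the
    # tested population (1,000 scores) generally may count as proficient.
    proficient_scores = 1_500
    counted = min(proficient_scores, cap_alternate_std)
    print(counted)  # 1000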
States are also required to include LEP students in their ESEA
assessments. To assess these students, states have the option of
developing assessments in students' native languages. These assessments
are designed to cover the content in state academic content standards
at the same level of difficulty and complexity as the general
assessments.[Footnote 12] In the absence of native language
assessments, states are required to provide testing accommodations for
LEP students, such as providing additional time to complete the test,
allowing the use of a dictionary, administering assessments in small
groups, or providing simplified instructions.
By law, Education is responsible for determining whether or not states'
assessments comply with statutory requirements. The standards and
assessments peer review process used by Education to determine state
compliance began under the 1994 reauthorization of ESEA and is an
ongoing process that states go through whenever they develop new
assessments. In the first step of the peer review process, at least
three experts--peer reviewers--examine evidence submitted by the state
to demonstrate compliance with NCLBA requirements, identify areas for
which additional state evidence is needed, and summarize their
comments. The reviewers are state assessment directors,
researchers, and others selected for their expertise in assessments.
After the peer reviewers complete their review, an Education official
assigned to the state reviews the peer reviewers' comments and the
state's evidence and, using the same guidelines as the peer reviewers,
makes a recommendation on whether the state meets, partially meets, or
does not meet each assessment system critical element and on whether
the state's assessment system should be approved. A group of Education
officials from the relevant Education offices--including a
representative from the Office of the Assistant Secretary of Elementary
and Secondary Education--meets as a panel to discuss the findings. The
panel makes a recommendation about whether to approve the state, and
the Assistant Secretary makes the final approval decision. Afterwards,
a letter is sent to the state notifying it whether it has been
approved, and--if the state was not approved--Education's letter
identifies why. States also receive a copy
of the peer reviewers' written comments as a technical assistance tool
to support improvement.
Education has the authority to withhold federal funds provided for
state administration until it determines that the state has fulfilled
ESEA assessment requirements; it has taken this step with several
states since NCLBA was enacted. Education also provides states with
technical assistance in meeting the academic assessment requirements.
ESEA assessments must be valid and reliable for the purposes for which
they are intended and aligned to challenging state academic standards.
Education has interpreted these requirements in its peer review
guidance to mean that states must show evidence of technical quality--
including validity and reliability--and alignment with academic
standards. According to Education's peer review guidance, the main
consideration in determining validity is whether states have evidence
that their assessment results can be interpreted in a manner consistent
with their intended purposes. See appendix III for a complete
description of the evidence used by Education to determine validity.
A reliable assessment, according to the peer review guidance, minimizes
the many sources of unwanted variation in assessment results. To show
evidence of consistency of assessment results, states are required to
(1) make a reasonable effort to determine the types of error that may
distort interpretations of the findings, (2) estimate the likely
magnitude of these distortions, and (3) make every possible effort to
alert the users to this lack of certainty. As part of this requirement,
states are required to demonstrate that assessment security guidelines
are clearly specified and followed. See appendix IV for a full
description of the reliability requirements.
Alignment, according to Education's peer review guidance, means that
states' assessment systems adequately measure the knowledge and skills
specified in state academic content standards. If a state's assessments
do not adequately measure the knowledge and skills specified in its
content standards or if they measure something other than what these
standards specify, it will be difficult to determine whether students
have achieved the intended knowledge and skills. See appendix V for
details about the characteristics states need to consider to ensure
that their standards and assessments are aligned.
In its guidance and peer review process, Education requires that--as
one component of demonstrating alignment between state assessments and
academic standards--states demonstrate that their assessments are
as cognitively challenging as their standards. To demonstrate this,
states have contracted with organizations to assess the alignment of
their ESEA assessments with the states' standards. These organizations
have developed similar models of measuring the cognitive challenge of
assessment items. For example, the Webb model categorizes items into
four levels--depths of knowledge--ranging in complexity from level 1
(recall), which is the least difficult for students to answer, to level
4 (extended thinking), which is the most difficult for students to
answer. Table 1 provides an illustration, using the Webb model, of how
depth of knowledge levels may be measured.
Table 1: Illustration of Depth of Knowledge Levels:
Depth of knowledge level: Level 1 - Recall;
Description: Includes the recall of information such as a fact,
definition, term, or a simple procedure, as well as performing a simple
algorithm or applying a formula. Other key words that signify a Level 1
activity include "identify," "recall," "recognize," "use," and
"measure."
Depth of knowledge level: Level 2 - Skill/Concept;
Description: Includes the engagement of some mental processing beyond a
habitual response. A Level 2 assessment item requires students to make
some decisions as to how to approach the problem or activity. Keywords
that generally distinguish a Level 2 item include "classify,"
"organize," "estimate," "make observations," "collect and display
data," and "compare data." These actions imply more than one step.
Other Level 2 activities include noticing and describing non-trivial
patterns; explaining the purpose and use of experimental procedures;
carrying out experimental procedures; making observations and
collecting data; classifying, organizing, and comparing data; and
organizing and displaying data in tables, graphs, and charts.
Depth of knowledge level: Level 3 - Strategic Thinking;
Description: Requires reasoning, planning, using evidence, and a higher
level of thinking than the previous two levels. In most instances,
requiring students to explain their thinking is a Level 3. Activities
that require students to make conjectures are also at this level. The
cognitive demands at Level 3 are complex and abstract. The complexity
does not result from the fact that there are multiple answers, a
possibility for both Levels 1 and 2, but because the task requires more
demanding reasoning. Other Level 3 activities include drawing
conclusions from observations, citing evidence and developing a logical
argument for concepts, explaining phenomena in terms of concepts, and
using concepts to solve problems.
Depth of knowledge level: Level 4 - Extended Thinking;
Description: Requires complex reasoning, planning, developing, and
thinking most likely over an extended period of time. At Level 4, the
cognitive demands of the task should be high and the work should be
very complex. Students should be required to make several connections--
relate ideas within the content area or among content areas--and would
have to select one approach among many alternatives on how the
situation should be solved, in order to be at this highest level. Level
4 activities include developing and proving conjectures; designing and
conducting experiments; making connections between a finding and
related concepts and phenomena; combining and synthesizing ideas into
new concepts; and critiquing experimental designs.
Source: Norman L. Webb, Issues Related to Judging the Alignment of
Curriculum Standards and Assessments, April 2005.
[End of table]
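The keyword cues in table 1 suggest how one might make a rough
first-pass sort of item prompts by depth of knowledge. The sketch below
is only an illustration of that idea, with an invented keyword list;
actual alignment studies rely on expert judgment of the whole task, not
keyword matching.

    # Rough first-pass sort of item prompts into Webb depth-of-knowledge
    # levels using keyword cues drawn from table 1. Illustrative only;
    # real alignment reviews use expert judgment, not keyword matching.

    DOK_KEYWORDS = {
        1: ("identify", "recall", "recognize", "measure"),
        2: ("classify", "organize", "estimate", "compare", "observe"),
        3: ("explain", "conjecture", "draw conclusions", "cite evidence"),
        4: ("design an experiment", "synthesize", "critique", "prove"),
    }

    def first_pass_dok(prompt):
        """Return the highest level whose keywords appear in the prompt."""
        text = prompt.lower()
        levels = [level for level, words in DOK_KEYWORDS.items()
                  if any(word in text for word in words)]
        return max(levels) if levels else None

    print(first_pass_dok("Identify the longest river listed in the table."))  # 1
    print(first_pass_dok("Explain your reasoning using the data shown."))     # 3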
States Reported That Assessment Spending Has Increased Since NCLBA Was
Enacted and Test Development Has Been the Largest Assessment Cost in
Most States:
Assessment Expenditures Have Grown in Nearly Every State since 2002,
and Most States Reported Spending More for Vendors than State Staff:
State ESEA assessment expenditures have increased in nearly every state
since the enactment of NCLBA in 2002, and the majority of these states
reported that adding assessments was a major reason for the increased
expenditures. Forty-eight of 49 states that responded to our survey
said their states' overall annual expenditures for ESEA assessments
have increased, and over half of these 48 states indicated that adding
assessments to their state assessment systems was a major reason for
increased expenditures.[Footnote 13] Even states that were already
testing students in reading/language arts and mathematics in all of
the grades required when NCLBA was enacted reported that assessment
expenditures increased due to additional assessments. For
example, officials in Texas--which was assessing general population
students in all of the required grades at the time NCLBA was enacted--
told us that they created additional assessments for students with
disabilities.
In addition to the cost of adding new assessments, states reported that
increased vendor costs have also contributed to the increased cost of
assessments. On our survey, increasing vendor costs was the second most
frequent reason that states cited for increased ESEA assessment costs.
One vendor official told us that shortly after the 2002 enactment of
NCLBA, states benefited from increased competition because many new
vendors entered the market and wanted to gain market share, which drove
down prices. In addition, vendors were still learning about the level
of effort and costs required to complete this type of work.
Consequently, as the ESEA assessment market has stabilized and vendors
have gained experience pricing assessments, the cost of ESEA assessment
contracts has increased to reflect the true cost of vendor assessment
work. One assessment vendor that works with over half of the states on
ESEA assessments told us that vendor costs have also been increasing as
states have been moving toward more sophisticated and costly procedures
and reporting.
Nearly all states reported higher expenditures for assessment vendors
than for state assessment staff. According to our survey responses, 44
out of the 46 states that responded said that of the total cost of ESEA
assessments, much more was paid to vendors than to state employees. For
example, one state reported it paid approximately $83 million to
vendors and approximately $1 million to state employees in the 2007-08
school year. The 20 states that provided information for the costs of
both vendors and state employees in 2007-08 reported spending more than
$350 million for vendors to develop, administer, score, and report the
results of ESEA assessments--more than 10 times the amount they spent
on state employees.
State expenditures for ESEA assessment vendors, which were far larger
than expenditures for state staff, varied. Spending for vendors on ESEA
assessments in the 40 states that reported spending figures on our
survey ranged from $500,000 to $83 million, and in total all 40 states
spent more than $640 million for vendors to develop, administer, score,
and report results of the ESEA assessments in 2007-08. The average cost
in these 40 states was about $16 million. See figure 2 for the
distribution of state expenditures for vendors in 2007-08.
Figure 2: State Expenditures for Assessment Vendors, 2007-08:
[Refer to PDF for image: vertical bar graph]
Dollar amounts states spent on assessment vendors: Below $15 million;
Number of states: 26.
Dollar amounts states spent on assessment vendors: $15 million-$60
million;
Number of states: 12.
Dollar amounts states spent on assessment vendors: Over $60 million;
Number of states: 2.
Source: GAO survey.
[End of figure]
Over half of the states reported that the majority of their funding for
ESEA assessments--including funding for expenses other than vendors--
came from their state governments. Of the 44 states that responded to
the survey question, 26 reported that the majority of their state's
total funding for ESEA assessments came from state government funds for
2007-08, and 18 reported that less than half came from state funds. For
example, officials from one state that we visited, Maryland, reported
that 84 percent of their total funding for ESEA assessments came from
state government funds and that 16 percent of the state's funding for
ESEA assessments came from the federal Grants for State Assessments
program in 2007-08. In addition to state funds, all states reported
using Education's Grants for State Assessments for ESEA assessments,
and 17 of 45 states responding to the survey question reported using
other federal funds for assessments. One state reported that all of its
funding for ESEA assessments came from the Grants for State Assessments
program. The other federal funds used by states for assessments
included Enhanced Assessment grants.
The Majority of States Reported That Assessment Development Was the
Most Expensive Component of the Assessment Process; Development Has
Been More Challenging for Small States:
More than half of the states reported that assessment development was
more expensive than any other component of the student assessment
process, such as administering or scoring assessments.[Footnote 14]
Twenty-three of 43 states that responded to the question in our survey
told us that test and item development and revision was the largest
assessment cost for 2007-08. For example, Texas officials said that the
cost of developing tests is higher than the costs associated with any
other component of the assessment process. After test and item
development costs, scoring was most frequently cited as the most costly
activity, with 12 states reporting it as their largest assessment cost.
Similarly, states reported that test and item development was the
largest assessment cost for alternate assessments, followed by scoring.
See figure 3 for more information.
Figure 3: ESEA Assessment Activities That Received the Largest Share of
States' Total ESEA Assessment Costs, 2007-08:
[Refer to PDF for image: horizontal bar graph]
General ESEA assessment activities:
Number of states responding: Test development: 23;
Number of states responding: Scoring of test: 12;
Number of states responding: Test administration: 8.
Alternate assessment with alternate achievement standards:
Number of states responding: Test development: 23;
Number of states responding: Scoring of test: 15;
Number of states responding: Test administration: 4.
Alternate assessment with modified achievement standards:
Number of states responding: Test development: 7;
Number of states responding: Scoring of test: 2;
Number of states responding: Test administration: 1.
Source: GAO survey data.
[End of figure]
The cost of developing assessments was affected by whether states
release assessment items to the public.[Footnote 15] According to state
and vendor officials, development costs are related to the percentage
of items states release to the public every year because new items must
be developed to replace released items. According to vendor officials,
nearly all states release at least some test items to the public, but
they vary in the percentage of items that they release. In states that
release 100 percent of their test items each year, assessment costs are
generally high and steady over time because states must develop
additional items every year. However, some states release only a
portion of items. For example, Rhode Island state officials told us
that they release 20 to 50 percent of their reading and math assessment
items every year. State and vendor officials told us that despite the
costs associated with the release of ESEA assessment items, releasing
assessment items builds credibility with parents and helps policymakers
and the public understand how assessment items relate to state content
standards.
The cost of development has been particularly challenging for smaller
states.[Footnote 16] Assessment vendors and Education officials said
that the price of developing an assessment is fixed regardless of state
size and that, as a result, smaller states with fewer students usually
have higher per pupil costs for development. For example, state
assessment officials from South Dakota told us that their state and
other states with small student populations have the same development
costs as states with large assessment populations, regardless of the
number of students being assessed. In contrast to development costs,
administration and scoring costs vary based on the number of students
being assessed and the item types used. Although large and small states
face similar costs for development, each has control over some factors--
such as item type and releasing test items--that can increase or
decrease costs.
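Simple arithmetic shows why fixed development costs weigh more heavily
on small states; the dollar figures and enrollments in this sketch are
invented for illustration.

    # Why fixed development costs yield higher per pupil costs in small
    # states. Dollar figures and enrollments are invented.

    development_cost = 5_000_000  # roughly fixed regardless of state size

    for state, students in [("Large state", 2_000_000),
                            ("Small state", 100_000)]:
        print(f"{state}: ${development_cost / students:.2f} per pupil")

    # Large state: $2.50 per pupil
    # Small state: $50.00 per pupil
    # Administration and scoring, by contrast, scale with the number of
    # students tested and the item types used.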
Selected States Are Concerned about the Costs of Developing and
Administering Alternate Assessments for Students with Disabilities and
about Budget Cuts:
State officials from the four states we visited told us that alternate
assessments based on alternate achievement standards were far more
expensive on a per pupil basis than general assessments. In Maryland,
state officials told us that general assessments cost $30 per pupil,
and alternate assessments cost between $300 and $400 per pupil. Rhode
Island state officials also reported that alternate assessments cost
much more than general assessments. These officials also said that, in
addition to direct costs, the administration of alternate assessments
has resulted in significant indirect costs, such as professional
development for teachers. Technical advisors and district and state
officials told us that developing alternate assessments is costly on a
per pupil basis because the number of students taking these assessments
is small. See appendix VI for more information about states' use of
various item types for alternate assessments.
In light of recent economic conditions, many states have experienced
fiscal reductions, including within ESEA assessment budgets. As of
January 2009, 19 states said their state's total ESEA assessment budget
had been reduced as a result of state fiscal cutbacks. Fourteen states
said their state's total ESEA assessment budgets had not been reduced,
but 10 of these states also said they anticipated future reductions.
Half of the 46 states that responded to the question told us that in
developing their budget proposals for the next fiscal year they
anticipated a reduction in state funds for ESEA assessments. For
example, one state that responded to our survey said it had been asked
to prepare for a 15 percent reduction in state funds.
States Have Considered Cost and Time in Making Decisions about
Assessment Item Type and Content:
States Used Primarily Multiple Choice Items in Their ESEA Assessments
Because They Are Cost-Effective and Can Be Scored within Tight Time
Frames for Reporting Results:
States have most often chosen multiple choice items over other item
types on assessments. In 2003, we reported that the majority of states
used a combination of multiple choice and a limited number of open-
ended items for their assessments.[Footnote 17] According to our
survey, multiple choice items comprise the majority of unweighted score
points (points)--the number of points that can be earned based on the
number of items answered correctly--for ESEA reading/language arts and
mathematics general assessments administered by most responding states.
Specifically, 38 of 48 states that responded said that multiple choice
items comprise all or most of the points for their reading/language
arts assessments, and 39 states said that multiple choice items
comprise all or most of the points for mathematics assessments. Open/
constructed response items are the second most frequently used item
type for reading/language arts or mathematics general assessments. All
states that responded to our survey reported using multiple choice
items on their general reading/language arts and mathematics
assessments, and most used some open/constructed response items. See
appendix VI for more information about the types of items used by
states on assessments.
Some states also reported on our survey that, since 2002, they have
increased their use of multiple choice items and decreased their use of
other item types. Of the 47 states that responded to our survey
question, 10 reported increasing the use of multiple choice items on
reading/language arts general assessments, and 11 reported increasing
their use of multiple choice items on mathematics assessments. For
example, prior to the enactment of NCLBA, Maryland administered an
assessment composed entirely of open/constructed response items,
but state assessment officials told us that they have moved to an
assessment that is primarily multiple choice and plan to eliminate
open/constructed response items from assessments. However, several
states reported that they have decreased the use of multiple choice
items and/or increased the use of open/constructed response items. For
more information about how states reported changing the mix of items on
their assessments, see figure 4.
Figure 4: The Number of States Reporting Changes in Item Type Use on
ESEA Assessments since 2002:
[Refer to PDF for image: illustration]
Reading/language arts:
Upward changes, multiple choice: 10;
Upward changes, open/constructed response: 5;
Downward changes, multiple choice: 4;
Downward changes, open/constructed response: 11.
Mathematics:
Upward changes, multiple choice: 11;
Upward changes, open/constructed response: 5;
Downward changes, multiple choice: 4;
Downward changes, open/constructed response: 13.
Source: GAO.
[End of figure]
States reported that cost and the ability to score assessments quickly
were key considerations in choosing multiple choice items. In response
to our survey, most states reported considering
the cost of different item types and the ability to score the tests
quickly when making decisions about item types for ESEA assessments.
Officials from the states we visited reported choosing multiple choice
items because they can be scored inexpensively within challenging time
frames. State officials, assessment experts, and vendors told us that
multiple choice item types are scored electronically, which is
inexpensive, but that open/constructed response items are usually
scored manually, making them more expensive to score. Multiple scorers
of open/constructed response items are sometimes involved to ensure
consistency, but this also increases costs. In addition, state
officials said that training scorers of open/constructed response items
is costly. For example, assessment officials in Texas told us that the
state has a costly 3-week long training process for teachers to become
qualified to assess the open-ended responses. State assessment
officials also told us that they used multiple choice items because
they can be scored quickly, and assessment vendors reported that states
were under pressure to release assessment results to the public before
the beginning of the next school year in accordance with NCLBA
requirements. For example, assessment officials from South Dakota told
us that they explored using open/constructed response items on their
assessments but that they ultimately determined it would not be
feasible to return results in the required period of time. States also
reported considering whether item types would meet certain technical
considerations, such as validity and reliability. Texas assessment
officials said that using multiple choice items allows the state more
time to check test scores for reliability.
States Reported That the Use of Multiple Choice Items in Assessments
Has Limited the Content and Complexity of What They Test:
Despite the cost- and time-saving benefits to states, the use of
multiple choice items on assessments has limited the content included
in the assessments. Many state assessment officials, alignment experts,
and vendor officials told us that items possess different
characteristics that affect how amenable they are to testing various
types of content. State officials and their technical advisors told us
that they have faced significant trade-offs between their efforts to
assess highly cognitively complex content and their efforts to
accommodate cost and time pressures. All four of the states that we
visited reported separating at least a minor portion of standards into
those that are used for ESEA assessment and those that are for
instructional purposes only. Three of the four states reported that
standards for instructional purposes only included highly cognitively
complex material that could not be assessed using multiple choice
items. For example, a South Dakota assessment official told us that a
cognitively complex portion of the state's new reading standards could
not be tested by multiple choice; therefore, the state identified these
standards as for instructional purposes only and did not include them
in ESEA assessments. In addition to these three states, officials from
the fourth state--Maryland--told us that they do not include certain
content in their standards because it is difficult to assess. Many
state officials and experts we spoke with told us that multiple choice
items limit states from assessing highly cognitively complex content.
For example, Texas assessment officials told us that some aspects of
state standards, such as a student's ability to conduct scientific
research, cannot be assessed using multiple choice.
Representatives of the alignment organizations told us that it is
difficult, and in some cases not possible, to measure highly
cognitively complex content with multiple choice items. Three of the
four main groups that conduct alignment studies, including alignment
studies for all of our site visit states, told us that states cannot
measure content of the highest complexity with multiple choice and that
ESEA assessments should include greater cognitive complexity. Maryland
state officials said that before NCLBA was enacted the state
administered an assessment composed entirely of open/constructed
response items. Maryland technical advisors told us that because the
state faced pressure to return assessment results quickly, the state
changed its test to include mostly multiple choice items, but that this
had limited the content assessed in the test. According to an
independent alignment review performed in 2002, after the enactment of
NCLBA, about 94 percent of the 36 scorable items on one Maryland high
school mathematics assessment were rated at the two lowest of four
levels of cognitive demand.[Footnote 18], [Footnote 19]
Representatives of all four alignment groups told us that
multiple choice items can measure intermediate levels of cognitive
complexity, but it is difficult and costly to develop these items.
These alignment experts said that developing multiple choice items that
measure cognitively challenging content is more expensive and time-
consuming than developing less challenging multiple choice items.
Vendor officials had differing views about whether multiple choice
items assess cognitively complex content. For example, officials from
three vendors said that multiple choice items can address cognitively
complex content. However, officials from another vendor told us that it
is not possible to measure certain highly cognitively complex content
with multiple choice items. Moreover, two other vendors told us that
there are certain content and testing purposes that are more amenable
to assessment with item types other than with multiple choice items.
Several of the vendors reported that there are some standards that,
because of practical limitations faced by states, cannot be assessed on
standardized, paper-and-pencil assessments. For example, one vendor
official told us that performance-based tasks enabled states to assess
a wider variety of content but that the limited funds and quick
turnaround times required under the law require states to eliminate
these item types.
Although most state officials, state technical advisors, and alignment
experts said that ESEA assessments should include more open/constructed
response items and other item types, they also said that multiple
choice items have strengths and that there are challenges with other
types of items. For example, in 2008 a national panel of assessment
experts appointed and overseen by Education reported that multiple
choice items do not measure different aspects of mathematics competency
than open/constructed response items. Also, alignment experts said that
multiple choice items can quickly and effectively assess lower level
content, which is also important to assess. Moreover, open/constructed
response items do not always assess highly complex content, according
to an alignment expert. This point has been corroborated by several
researchers who have found that performance tasks, which are usually
intended to assess higher-level cognitive content, may inadvertently
measure low-level content.[Footnote 20] For example, one study
describes a project in which students were given a collection of
insects and asked to organize them for display. High-scoring students
were supposed to demonstrate complex thinking skills by sorting insects
based on scientific classification systems, rather than less complex
criteria, such as whether or not insects are able to fly. However,
analysis of student responses showed that high scorers could not be
distinguished from low scorers in terms of their knowledge of the
insects' features or of the scientific classification system.[Footnote
21]
The presence or absence of highly complex content in assessments can
impact classroom curriculum. Several research studies have found that
content contained in assessments influences what teachers teach in the
classroom. One study found that including open-ended items on an
assessment prompted teachers to ask students to explain their thinking
and emphasize problem solving more often.[Footnote 22] Assessment
experts told us that the particular content that is tested impacts
classroom curriculum. For example, one assessment expert told us that
the focus on student results, combined with the focus on multiple
choice items, has led to teachers teaching a narrow curriculum that is
focused on basic skills.
Under the federal peer review process, Education and peer reviewers
examined evidence that ESEA assessments are aligned with the state's
academic standards. Specifically, peer reviewers examined state
evidence that assessments cover the full depth and breadth of the state
academic standards in terms of cognitive complexity and level of
difficulty. However, consistent with federal law, it is Education's
policy not to directly examine a state's academic standards,
assessments, or specific test items.[Footnote 23] Education officials
told us that it is not the department's role to evaluate standards and
assessments themselves and that few at Education have the expertise
that would be required to do so. Instead, they explained that
Education's role is to evaluate the evidence provided by states to
determine whether the necessary requirements are met.
States Used Alternative Practices to Reduce Cost and Meet Quick
Turnaround Times while Attempting to Assess Complex Material:
As an alternative to using mostly multiple choice items on ESEA
assessments, states used a variety of practices to reduce costs and
meet quick turnaround times while also attempting to assess cognitively
complex material. For example, some states have developed and
administered ESEA assessments in collaboration with other states, which
has allowed these states to pool resources and use a greater diversity
of item types. In addition, some states administered assessments at the
beginning of the year that test students on material taught during the
prior year to allow additional time for scoring of open-response items,
or administered assessments online to decrease turnaround time for
reporting results. States have reported advantages and disadvantages
associated with each of these practices:
* Collaboration among states: All four states that we visited--
Maryland, Texas, South Dakota, and Rhode Island--indicated interest in
collaborating with other states in the development of ESEA reading/
language arts or mathematics assessments, as of March 2009, but only
Rhode Island was doing so. Under the New England Common Assessment Program
(NECAP), Rhode Island, Vermont, New Hampshire, and Maine share a
vendor, a common set of standards, and item development costs. Under
this agreement, the cost of administration and scoring are based on per
pupil rates. NECAP states use a combination of multiple choice, short
answer, and open/constructed response items. According to Rhode Island
assessment officials, more rigorous items, including half of their math
items, are typically embedded within open/constructed response items.
When asked about the benefits of working in collaboration with other
states to develop ESEA assessments, assessment officials for Rhode
Island told us that the fiscal savings are very apparent. Specifically,
they stated that Rhode Island will save approximately $250,000 per year
with the addition of Maine to the NECAP consortium because, as Rhode
Island assessment officials noted, Maine will take on an additional
share of item development costs. Also, officials said that with a multi-
state partnership, Rhode Island is able to pay more for highly skilled
people who share a common vision. Finally, they said that higher
standards are easier to defend politically as part of collaboration
because there are more stakeholders in favor of them. An assessment
expert from New Hampshire said that the consortium has been a
"lifesaver" because it has saved the state considerable funding and
allowed it to meet ESEA assessment requirements.
Assessment experts from Rhode Island and New Hampshire told us that
there are some challenges to working in collaboration with other states
to develop ESEA assessments. Because decisions are made by consensus
and the NECAP states have philosophical differences in areas such as
item development, scoring, and use of item types, decision-making is a
lengthy process. In addition, a Rhode Island official said that
assessment leadership in the states changes frequently, which also
makes decision-making difficult.
* Beginning of year test administration: NECAP states currently
administer assessments in the beginning of the year, which eases time
pressures associated with the scoring of open/constructed response
items. As a result, the inclusion of open/constructed response items on
the assessment has been easier because there is enough time to meet
NCLBA deadlines for reporting results. However, Rhode Island officials
said that there are challenges to administering tests at the beginning
of the year. For example, one official stated that coordinating testing
with the already challenging start of school is daunting. Specifically,
she said that state assessment officials are required to use school
enrollment lists to print school labels for individual tests, but
because enrollment lists often change in the beginning of the year,
officials are required to correct a lot of data. District assessment
officials also cited this as a major problem.
* Computerized testing: Of the states we visited, Texas was the only
one administering a portion of its ESEA assessments online, but
Maryland and Rhode Island were moving toward this goal. One assessment
vendor with whom we spoke said that many states are anticipating this
change in the not-too-distant future. Assessment vendors and state
assessment officials cited some major benefits of online assessment.
For example, one vendor told us that online test administration reduces
costs through automated scoring. The same vendor told us that states
are using online assessments to address cognitively complex
content in standards that are difficult to assess, such as scientific
knowledge that is best demonstrated through experiments. In addition,
assessment officials told us that online assessments are less
cumbersome and easier than paper tests to manage at the school level if
schools have the required technology and that they enable quicker
turnaround on scores. State and district assessment officials and a
vendor with whom we spoke also cited several challenges associated with
administering tests online, including security of the tests;
variability in students' computer literacy; strain on school computer
resources, computer classrooms/labs, and interruption of classroom/lab
instruction; and lack of necessary computer infrastructure.
States Faced Several Challenges in Their Efforts to Ensure Valid and
Reliable ESEA Assessments, including Staff Capacity, Alternate
Assessments, and Assessment Security:
States Varied in Their Capacity to Guide and Oversee Vendors:
State officials are responsible for guiding the development of the
state assessment program and overseeing vendors, but states varied in
their capacity to fulfill these roles. State officials reported that
they are responsible for making key decisions about the direction of
their states' assessment programs, such as whether to develop alternate
assessments based on modified achievement standards, or online
assessments. In addition, state officials said that they are
responsible for overseeing the assessment vendors used by their states.
However, state assessment offices varied in the measurement expertise
of their staff. About three-quarters of the 48 responding
states had at least one state assessment staff member with a Ph.D. in
psychometrics or another measurement-related field. Three states--
North Carolina, South Carolina, and Texas--each reported having five
staff with this expertise. However, 13 states did not have any staff
with this expertise. In addition, states varied in the number of full-
time equivalent professional staff (FTE) dedicated to ESEA assessments,
from 55 in Texas to 1 in Idaho and the District of Columbia. See figure
5 for more information about the number of FTEs dedicated to ESEA
assessments in the states.
Figure 5: Number of FTEs Dedicated to ESEA Assessments in States, 2007-
08:
[Refer to PDF for image: vertical bar graph]
Number of FTEs: 1 to 5;
Number of states: 11.
Number of FTEs: 6 to 15;
Number of states: 21.
Number of FTEs: 16 to 25;
Number of states: 6.
Number of FTEs: 26 and up;
Number of states: 6.
Source: GAO survey.
[End of figure]
Small states had less assessment staff capacity than larger states. The
capacity of state assessment offices was related to the amount of
funding spent on state assessment programs in different states,
according to state officials. For example, South Dakota officials told
us that they had tried to hire someone with psychometric expertise but
that they would need to quadruple the salary that they could offer to
compete with the salaries being offered by other organizations. State
officials said that assessment vendors can often pay higher salaries
than states and that it is difficult to hire and retain staff with
measurement-related expertise.
State officials and assessment experts told us that the capacity of
state assessment offices was the key challenge for states implementing
NCLBA. Greater state capacity allows states to be more thoughtful in
developing their assessment systems and to provide greater oversight
of their assessment vendors, according to state officials.
Officials in Texas and other states said that having high assessment
staff capacity--both in terms of number of staff and measurement-
related expertise--allows them to research and implement practices that
improve student assessment. For example, Texas state officials said
that they conduct research regarding how LEP students and students with
disabilities can best be included in ESEA assessments, which state
officials said helped them improve the state's assessments for these
students. In contrast, officials in lower capacity states said that
they struggled to meet ESEA assessment requirements and did not have
the capacity to conduct research or implement additional strategies.
For example, officials in South Dakota told us that they had not
developed alternate assessments based on modified achievement standards
because they did not have the staff capacity or funding to implement
these assessments.
Also, of the three states we visited that completed a checklist of
important assessment quality control steps,[Footnote 24] those with
fewer assessment staff addressed fewer key quality control steps.
Specifically, Rhode Island, South Dakota, and Texas reviewed and
completed a CCSSO[Footnote 25] checklist on student assessment, the
Quality Control Checklist for Processing, Scoring, and Reporting. These
states varied with regard to fulfilling the steps outlined by this
checklist. For example, state officials in Texas, which has 55 full-
time professional staff working on ESEA assessments, including multiple
staff with measurement-related expertise, reported that they fulfill 31
of the 33 steps described in the checklist and address the 2 other
steps in certain circumstances. Officials in Rhode Island, who told us
that the state has six assessment staff and works with other states in
its assessment consortium, said that they fulfill 27 of the
33 steps. South Dakota, which had three professional full-time staff
working on ESEA assessments--and no staff with measurement-related
expertise--addressed nine of the steps, according to state officials.
For example, South Dakota officials said that the state does not verify
the accuracy of answer keys in the data file provided by the vendor
using actual student responses, which increases the risk of incorrectly
scoring assessments. Because South Dakota does not have staff with
measurement-related expertise and has fewer state assessment staff,
there are fewer individuals to fulfill these quality control steps than
in a state with greater capacity, according to state officials.
Having staff with psychometric or other measurement-related expertise
improved states' ability to oversee the work of vendors. For example,
the CCSSO checklist recommends that states have psychometric or other
research expertise for nearly all of the 33 steps. Having staff with
measurement-related expertise allows states to know what key technical
questions to ask vendors and what data to request, according to state
officials; without this expertise, states would be more dependent on
vendors. State
advisors from technical advisory committees (TAC)--panels of assessment
experts that states convene to assist them with technical oversight--
said that TACs are useful, but that they generally only meet every 6
months. For example, one South Dakota TAC member said that TACs can
provide guidance and expertise, but that ensuring the validity and
reliability of a state assessment system is a full-time job. The TAC
member said that questions arise on a regular basis for which it would
be helpful to bring measurement-related expertise to bear. Officials
from assessment vendors offered differing views: several told us that
states do not need measurement-related expertise on staff, while others
said that they do.
Education's Office of Inspector General (OIG) found weaknesses in
management controls over state ESEA assessments that could affect
reliability.[Footnote 26]
Specifically, the OIG found that Tennessee did not have sufficient
monitoring of contractor activities for the state assessments such as
ensuring that individuals scoring open/constructed response items had
proper qualifications. In addition, the OIG found that the state lacked
written policies and procedures describing internal controls for
scoring and reporting.
States Have Faced Challenges in Ensuring the Validity and Reliability
of Alternate Assessments for Students with Disabilities:
Although most states have met peer review expectations for validity and
reliability of their general assessments, ensuring the validity of
alternate assessments for students with disabilities is still a
challenge. For example, our review of Education documents as of July
15, 2009, showed that 12 states' reading/language arts and mathematics
standards and assessment systems--which include general assessments and
alternate assessments based on alternate achievement standards--had not
received full approval under Education's peer review process and that
alternate assessments were a factor preventing approval in 11 of these
states.[Footnote 27]
In the four states[Footnote 28] where alternate assessments were the
only issue preventing full approval, technical quality (which includes
validity and reliability) or alignment was a problem. For example, in a
letter to Hawaii education officials dated October 30, 2007,
documenting steps the state must take to gain full approval of its
standards and assessments system, Education officials wrote that Hawaii
officials needed to document the validity and alignment of the state
alternate assessment.
States had more difficulty assessing the validity and reliability of
alternate assessments using alternate achievement standards than of
ESEA assessments for the general student population. In our survey,
nearly two-thirds of the states reported that assessing the validity
and reliability of alternate assessments with alternate achievement
standards was either moderately or very difficult. In contrast, few
states reported that ensuring either the validity or the reliability of
general assessments was moderately or very difficult.
We identified two specific challenges to the development of valid and
reliable alternate assessments with alternate achievement standards.
First, ensuring the validity and reliability of these alternate
assessments has been challenging because of the highly diverse
population of students being assessed. Alternate assessments are
administered to students with a wide range of significant cognitive
disabilities. For example, some students may only be able to
communicate by moving their eyes and blinking. As a result, measuring
the achievement of these students often requires greater
individualization. In addition, because these assessments are
administered to relatively small student populations, it can be
difficult for states to gather the evidence needed to demonstrate their
validity and reliability.
In addition, developing valid and reliable alternate assessments with
alternate achievement standards has been challenging for states because
there is a lack of research about the development of these assessments,
according to state officials and assessment experts. States have been
challenged to design alternate assessments that appropriately measure
what eligible students know and provide similar scores for similar
levels of performance. Experts and state officials told us that more
research would help them ensure validity and reliability. An Education
official agreed that alternate assessments are still a challenge for
states and said that there is little consensus about what types of
alternate assessments are psychometrically appropriate. Although there
is currently a lack of research, Education is providing assistance to
states and has funded a number of grants to help them implement
alternate assessments.
States that have chosen to implement alternate assessments with
modified achievement standards and native language assessments have
faced similar challenges, but relatively few states are implementing
these assessments. On our survey, 8 of the 47 states responding to this
question reported that in 2007-08 they administered alternate
assessments based on modified achievement standards, which are optional
for states, and several more reported being in the process of
developing these assessments. Fifteen states reported administering
native language assessments, which are also optional. States reported
mixed results regarding the difficulty of assessing the validity and
reliability of these assessments, with about two-thirds indicating that
each of these tasks was moderately or very difficult for both the
alternate assessments with modified achievement standards and native
language assessments. Officials in states that are not offering these
assessments reported that they lacked the funds, staff, or time
necessary to develop them.
States Have Taken Measures to Ensure Assessment Security, but Gaps
Exist:
The four states that we visited and districts in those states had taken
steps to ensure the security of ESEA assessments. Each of the four
states had a test administration manual that is intended to establish
controls over the processes and procedures used by school districts
when they administer the assessments. For example, the Texas test
administration manual covered procedures for keeping assessment
materials secure prior to administration, ensuring proper
administration, returning student answer forms for scoring, and
notifying administrators in the event of assessment irregularities.
States also required teachers administering the assessments to sign
forms saying that they would ensure security and had penalties for
teachers or administrators who violated the rules. For example, South
Dakota officials told us that teachers who breach the state's security
measures could lose their teaching licenses.
Despite these efforts, there have been a number of documented instances
of teachers and administrators cheating in recent years. For example,
researchers in one major city examined the frequency of cheating by
test administrators.[Footnote 29] They estimated that at least 4 to 5
percent of the teachers and administrators cheated on student
assessments by changing student responses on answer sheets, providing
correct answers to students, or illegitimately obtaining copies of
exams prior to the test date and teaching students using knowledge of
the precise exam items. Further, the study found that teachers' and
administrators' decisions about whether to cheat responded to
incentives. For example, when schools faced the possibility of being
sanctioned for low assessment scores, teachers were more likely to
cheat. In addition, the study found that teachers in low-performing
classrooms were more likely to cheat.
In our work, we identified several gaps in state assessment security
policies. For example, assessment security experts said that many
states do not conduct any statistical analyses of assessment results to
detect indications of cheating. Among our site visit states, one state--
Rhode Island--reported analyzing test results for unexpected gains in
schools' performance. Another state, Texas, had conducted an erasure
analysis to determine whether schools or classrooms had an unusually
high number of erased responses that were changed to correct responses,
possibly indicating cheating. Security experts described these types of
analyses as a key component of assessment security. In addition, we
identified one specific state assessment policy under which teachers
had an opportunity to change test answers. South Dakota's
assessment administration manual required classroom teachers to inspect
all student answers to multiple choice items and darken any marks that
were too light for scanners to read. Further, teachers were instructed
to erase any stray marks and ensure that, when a student had changed
an answer, the unwanted response was completely erased. This policy
provided teachers an opportunity to change answers and improve
assessment results. South Dakota officials told us that they had
considered taking steps to mitigate the potential for cheating, such as
contracting for an analysis that would identify patterns of similar
erasure marks that could indicate cheating, but that it was too
expensive for the state.
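To illustrate the kind of erasure analysis described above, the
following is a minimal sketch in Python. The per-classroom rates and
the flagging threshold are hypothetical illustrations, not any state's
actual data or method; analyses conducted by states and their
contractors typically use more formal statistical tests.
    import statistics
    # Hypothetical counts of wrong-to-right (WTR) erasures per student,
    # aggregated by classroom. Real analyses draw these from scanner data.
    wtr_rates = {
        "classroom_a": 0.8,
        "classroom_b": 1.1,
        "classroom_c": 0.9,
        "classroom_d": 6.4,  # unusually high; a candidate for review
        "classroom_e": 1.0,
    }
    # Flag classrooms whose rate far exceeds the statewide median. The
    # factor of 4 is an illustrative threshold, not a standard.
    median_rate = statistics.median(wtr_rates.values())
    flagged = {room: rate for room, rate in wtr_rates.items()
               if rate > 4 * median_rate}
    for room, rate in sorted(flagged.items()):
        print(f"{room}: {rate:.1f} WTR erasures per student "
              f"(statewide median {median_rate:.1f})")
Run on the hypothetical data above, the sketch flags only classroom_d,
whose rate of changed-to-correct responses stands far above its peers.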
States' assessment security policies and procedures were examined
during Education's standards and assessments peer review process.
According to Education's peer review guidance, which Education
officials told us contains the criteria used by peer reviewers to examine
state assessment systems, states must demonstrate the establishment of
clear criteria for the administration, scoring, analysis, and reporting
components of state assessment systems. One example of evidence of
adequate security procedures listed in the peer review guidance was
that the state uses training and monitoring to ensure that people
responsible for handling or administering state assessments properly
protect the security of the assessments. Education indicated that a
state could submit as evidence documentation that the state's test
security policy and consequences for violating the policy are
communicated to educators, and documentation of the state's plan for
training and monitoring assessment administration. According to
Education officials, similar indicators are included in Education's
ongoing efforts to monitor state administration and implementation of
ESEA assessment requirements.
Although test security was included as a component in the peer review
process, we identified several gaps in how the process evaluated
assessment security. The peer reviewers did not examine whether states
used any type of data analysis to review student assessment results for
irregularities. When we spoke with Education's director of student
achievement and school accountability programs--who manages the
standards and assessments peer review process--about how assessment
security was examined in the peer review process, he told us that
security was not a focus of peer review. The official indicated that
the review already required a great deal of time and effort by
reviewers and state officials and that Education had given a higher
priority to other assessment issues. In addition, the state policy
described above in which teachers darken marks or erase unwanted
responses was approved through the peer review process.
The Education official who manages the standards and assessments peer
review process told us that the peer review requirements, including the
assessment security portion, were based on the Standards for
Educational and Psychological Testing,[Footnote 30] which were
developed in 1999. The Standards provide general guidelines for
assessment security, such as that test users have the responsibility of
protecting the security of test materials at all times. However, they
do not provide comprehensive best practices for assessment security
issues. The Association of Test Publishers developed draft assessment
security guidelines in 2007. In addition, in spring 2010, the
Association of Test Publishers and CCSSO plan to release a guide for
state departments of education that is expected to offer best practices
for test security.
Education has made certain modifications to the peer review process but
does not plan to update the assessment security requirements. Education
updated the peer review protocols to address issues with the alternate
assessment using modified achievement standards after those regulations
were released. In addition, Education has made certain modifications to
the process that were requested by states. However, Education officials
indicated that they do not have plans to update the peer review
assessment security requirements.
Education Has Provided Assistance to States, but the Peer Review
Process Did Not Allow for Sufficient Communication:
Education Provided Technical Assistance with Assessments, including
Those for Students with Disabilities and LEP Students:
Education provided technical assistance to states in a variety of ways.
Education provided technical assistance through meetings, written
guidance, user guides, contact with Education staff, and assistance
from its Comprehensive Centers and Clearinghouses. In our survey,
states reported they most often used written guidance and Education-
sponsored meetings and found these helpful. States reported mixed
results in obtaining assistance from Education staff. Some reported
receiving consistently helpful support, while others reported staff were
not helpful or responsive. Relevant program offices within Education
provided additional assistance as needed. For example, the Office of
Special Education Programs provided assistance to states in developing
alternate assessments for students with disabilities and the Office of
English Language Acquisition, Language Enhancement, and Academic
Achievement for Limited English Proficient Students assisted states in
developing their assessments for LEP students. In addition, beginning
in 2002, Education awarded competitive Enhanced Assessment Grants to
state collaboratives working on a variety of assessment topics such as
developing valid and reliable assessments for students with
disabilities and LEP students. For example, one consortium of 14 states
and jurisdictions was awarded about $836,000 to investigate and provide
information on the validity of accommodations for future assessments
for LEP students with disabilities, a group of students with dual
challenges. States awarded grants are required to share the outcomes of
their projects with other states at national conferences; however,
since these are multi-year projects, the results of many of them are
not yet available.
Education's Peer Review Process Did Not Allow Direct Communication
between States and Reviewers to Quickly Resolve Problems:
Education's peer review process did not allow for direct communication
between states and peer reviewers that could have more quickly resolved
questions or problems that arose throughout the peer review process.
After states submitted evidence of compliance with ESEA assessment
requirements to Education, groups of three reviewers examined the
materials and made recommendations to Education. To ensure the
anonymity of the peer reviewers, Education did not permit communication
between reviewers and state officials. Instead, Education liaisons
periodically relayed peer reviewers' questions and comments to the
states and then relayed answers back to the peer reviewers. Education
officials told us the assurance of anonymity was an important factor in
their ability to recruit peer reviewers who may not have felt
comfortable making substantive comments on states' assessment systems
if their identities were known.
However, the lack of direct communication resulted in miscommunication
and prevented the quick resolution of questions that arose during the
peer review process. Both state officials and reviewers told us that
there was not enough communication between states and reviewers during
the process. For example, one state official reported on our
survey that the lack of direct communication with peer reviewers led to
misunderstandings that could have been readily resolved with a
conversation with peer reviewers. A number of the peer reviewers who we
surveyed provided similar information. For example, one said that the
process was missing direct communication, which would allow state
officials to provide immediate responses to the reviewers' questions.
The Education official who manages the standards and assessments peer
review process acknowledged that the lack of communication created
confusion, such as states not understanding how to interpret peer
reviewers' comments. Two experts we interviewed about peer review
processes in general said that communication between reviewers and
state officials is critical to having an efficient process that avoids
miscommunication and unnecessary work. State officials said that the
peer review process was extensive and that miscommunication made it
more challenging.
In response to states' concerns, Education has taken steps to improve
the peer review process by offering states the option of having greater
communication with reviewers after the peer review process is complete.
However, the department has not taken action to allow direct
communication between states and peer reviewers during the process to
ensure a quick resolution to questions or issues that arise, preferring
instead to continue relying on Education staff to relay information
between states and peer reviewers in order to protect the reviewers'
anonymity.
Reasons for Key Decisions Stemming from Education's Peer Review Process
Were Not Communicated to States:
In some cases, the final approval decisions made by Education, which
has final decision-making authority, differed from the peer reviewers'
written comments, but Education could not tell us how often this
occurred. Education's panels assessed each state's assessment system
using the same guidelines used by the peer reviewers, and agency
officials told us that peer reviewers' comments carried considerable
weight in the agency's final decisions. However, Education officials
said that--in addition to peer reviewers' comments--they also
considered other factors in determining whether a state should receive
full approval, including the time needed by the state to come into
compliance and the scope of the outstanding issues. Education and state
officials told us that, in some cases, Education reached different
decisions than the peer reviewers. For example, the Education official
who manages the standards and assessments peer review process described
a situation in which the state was changing its content standards and
frequently submitting new documentation for its mathematics assessment
as the new content standards were incorporated. Education officials
told us the peer reviewers were confused by the documentation, but
Education officials gave the state credit for the most recent
documentation. However, Education could not tell us how often the
agency's final decisions matched the written comments of the peer
reviewers because it did not track this information.
In cases in which Education's final decisions differed from the peer
reviewers' comments, Education did not explain to states why it reached
its decisions. Although Education released the official decision
letters describing reasons that states had not been approved through
peer review, the letters did not document whether Education's decisions
differed from the peer reviewers' comments or why. Because Education
did not communicate this to states, it was
unclear to states how written peer reviewer comments related to
Education's decisions about peer review approval. For example, in our
survey, one state reported that the comments provided to the state by
peer reviewers and the letters sent to the state by Education
describing its final decisions about approval status did not match.
State officials we interviewed reported confusion about what issues
needed to be addressed to receive full approval of their assessment
system. For example, some state officials reported confusion about how
to receive final peer review approval when the written summary of the
peer review comments differed from the steps necessary to receive full
approval that were outlined in the official decision letters from
Education. The Education official who manages the standards and
assessments peer review process said that in some cases the differences
between decision letters and peer reviewers' written comments led to
state officials being unclear about whether they were required to
address the issues in Education's decision letters, comments from peer
reviewers, or both.
Conclusions:
NCLBA set the lofty goal of having all students reach academic
proficiency by 2013-2014, and Congress has provided significant funding
to assist states. NCLBA required a major expansion
in the use of student assessments, and states must measure higher order
thinking skills and understanding with these assessments. Education
currently reviews states' adherence to NCLBA standards and assessment
requirements through its peer review process in which the agency
examines evidence submitted by each state that is intended to show that
state standards and assessment systems meet NCLBA requirements.
However, ESEA, as amended, prohibits federal approval or certification
of state standards. Education reviews the procedures that states use to
develop their standards, but does not review the state standards on
which ESEA assessments are based or evaluate whether state assessments
cover highly cognitively complex content. As a result, there is no
assurance that states include highly cognitively complex content in
their assessments.
Although Education does not assess whether state assessments cover
highly complex content, Education's peer review process does examine
state assessment security procedures, which are critical to ensuring
that assessments are valid and reliable. In addition, the security of
ESEA assessments is critical because these assessments are the key tool
used to hold schools accountable for student performance. However,
Education has not made assessment security a focus of its peer review
process and has not incorporated best practices in assessment security
into its peer review protocols. Unless Education takes advantage of
forthcoming best practices that include assessment security issues,
incorporates them into the peer review process, and places proper
emphasis on this important issue, some states may continue to rely on
inadequate security procedures that could affect the reliability and
validity of their assessment systems.
State ESEA assessment systems are complex and require a great deal of
time and effort from state officials to develop and maintain. Given the
size of these systems, the peer review process is extensive and also
demanded substantial time and effort from state officials. However,
because Education, in an attempt to maintain
peer reviewer confidentiality, does not permit direct communication
between state officials and peer reviewers, miscommunication may have
resulted in some states spending more time than necessary clarifying
issues and providing additional documentation. While Education
officials told us the assurance of anonymity was an important factor in
their ability to recruit peer reviewers, anonymity should not
automatically preclude communications between state officials and peer
reviewers during the peer review process. For example, technological
solutions could be used to retain anonymity while still allowing for
direct communications. Direct communication between reviewers and state
officials during the peer review process could reduce the amount of
time and effort required of both peer reviewers and state officials.
The standards and assessments peer review is a high-stakes decision-
making process for states. States that do not meet ESEA requirements
for their standards and assessments systems can ultimately lose federal
Title I, Part A funds. Transparency is a critical element for ensuring
that decisions are fully understood and peer review issues are
addressed by states. However, because critical Education decisions
about state standards and assessments systems sometimes differed from
peer reviewers' written comments and the reasons behind these
differences were not communicated to states, states were confused about
the issues they needed to address.
Recommendations for Executive Action:
To help ensure the validity and reliability of ESEA assessments, we
recommend that the Secretary of Education update Education's peer
review protocols to incorporate best practices in assessment security
when they become available in spring 2010.
To improve the efficiency of Education's peer review process, the
Secretary of Education should develop methods for peer reviewers and
states to communicate directly during the peer review process so
questions that arise can be addressed quickly. For example, peer
reviewers could be assigned a generic e-mail address that would allow
them to remain anonymous but still allow them to communicate directly
with states.
To improve the transparency of its approval decisions pertaining to
states' standards and assessment systems and help states understand
what they need to do to improve their systems, in cases where the
Secretary of Education's peer review decisions differed from those of
the reviewers, the Secretary should explain why they differed.
Agency Comments and Our Evaluation:
We provided a draft of this report to the Secretary of Education for
review and comment. Education's comments are reproduced in appendix
VII. In its comments, Education recognizes the value of test security
practices in maintaining the validity and reliability of states'
assessment systems. However, regarding our recommendation to
incorporate test security best practices into the peer review
protocols, Education indicated that it believes that its current
practices are sufficient to ensure that appropriate test security
policies and procedures are implemented. Education officials indicated
that states currently provide the agency with evidence of state
statutes, rules of professional conduct, administrative manuals, and
memoranda that address test security and reporting of test
irregularities. Education officials also stated that additional
procedures and requirements, such as security methods and techniques to
uncover testing irregularities, are typically included in contractual
agreements with test publishers or collective bargaining agreements and
that details on these additional provisions are best handled locally
based on the considerations of risk and cost. Furthermore, Education
stated that it plans to continue to monitor test security practices and
to require corrective action by states it finds to have weak or
incomplete test security practices. As stated in our conclusions, we
continue to believe that Education should incorporate forthcoming best
practices, including assessment security issues, into the peer review
process. Otherwise, some states may continue to rely on inadequate
security procedures, which could ultimately affect the reliability and
validity of their assessment systems.
Education agreed with our recommendations to develop methods to improve
communication during the review process and to identify for states why
its peer review decisions in some cases differed from peer reviewers'
written comments. Education officials noted that the agency is
considering the use of a secure server as a means for state officials
to submit questions, documents, and other evidence to strengthen
communication during the review process. Education also indicated that
it will conduct a conference call prior to upcoming peer reviews to
clarify why the agency's approval decisions in some cases differ from
peer reviewers' written comments. Education also provided technical
comments that we incorporated into the report as appropriate.
We are sending copies of this report to appropriate congressional
committees, the Secretary of Education, and other interested parties.
In addition, the report will be available at no charge on GAO's Web
site at [hyperlink, http://www.gao.gov]. Please contact me at (202) 512-
7215 if you or your staff have any questions about this report. Contact
points for our Offices of Congressional Relations and Public Affairs
may be found on the last page of this report. Other major contributors
to this report are listed in appendix VIII.
Sincerely yours,
Signed by:
Cornelia M. Ashby:
Director, Education, Workforce, and Income Security Issues:
[End of section]
Appendix I: Objectives, Scope, and Methodology:
The objectives of this study were to answer the following questions:
(1) How have state expenditures on assessments required by the
Elementary and Secondary Education Act of 1965 (ESEA) changed since the
No Child Left Behind Act of 2001 (NCLBA) was enacted in 2002, and how
have states spent funds? (2) What factors have states considered in
making decisions about question (item) type and content of their ESEA
assessments? (3) What challenges, if any, have states faced in ensuring
the validity and reliability of their ESEA assessments? (4) To what
extent has the U.S. Department of Education (Education) supported and
overseen state efforts to comply with ESEA assessment requirements?
To meet these objectives, we used a variety of methods, including
document reviews of Education and state documents, a Web-based survey
of the 50 states and the District of Columbia, interviews with
Education officials and assessment experts, site visits in four states,
and a review of the relevant federal laws and regulations. The survey
we used was reviewed by several external reviewers, and we incorporated
their comments as appropriate.
We conducted this performance audit from August 2008 through September
2009 in accordance with generally accepted government auditing
standards. Those standards require that we plan and perform the audit
to obtain sufficient, appropriate evidence to provide a reasonable
basis for our findings and conclusions based on our audit objectives.
We believe that the evidence obtained provides a reasonable basis for
our findings and conclusions based on our audit objectives.
Providing Information on How State Expenditures on Assessments Have
Changed Since the Enactment of NCLBA and How States Have Spent Funds:
To learn how state expenditures for ESEA assessments have changed since
NCLBA was enacted in 2002 and how states spent these funds, we analyzed
responses to our state survey, which was administered to state
assessment directors in January 2009. In the survey, we asked states to
provide information about the percentage of their funding from federal
and state sources, their use of contractors, and the cost and
availability of human resources, and to rank order the costs of
assessment activities. The
survey used self-administered, electronic questionnaires that were
posted on the Internet. We received responses from 49 states,[Footnote
31] for a 96 percent response rate. We did not receive responses from
New York and Rhode Island. We reviewed state responses and followed up
by telephone and e-mail with states for additional clarification and
obtained corrected information for our final survey analysis.
Nonresponse is one type of nonsampling error that could affect data
quality. Other types of nonsampling error include variations in how
respondents interpret questions, respondents' willingness to offer
accurate responses, and data collection and processing errors. We
included steps in developing the survey, and collecting, editing, and
analyzing survey data to minimize such nonsampling error. In developing
the Web survey, we pretested draft versions of the instrument with
state officials and assessment experts in various states to check the
clarity of the questions and the flow and layout of the survey. On the
basis of the pretests, we made slight to moderate revisions of the
survey. Using a Web-based survey also helped reduce error in our data
collection effort. By allowing state assessment directors to enter
their responses directly into an electronic instrument, this method
automatically created a record for each assessment director in a data
file and eliminated the need for and the errors (and costs) associated
with a manual data entry process. In addition, the program used to
analyze the survey data was independently verified to ensure the
accuracy of this work.
We also conducted site visits to four states--Maryland, Rhode Island,
South Dakota, and Texas--that reflect a range of population size and
results on Education's assessment peer review. On these site visits we
interviewed state officials, officials from two districts in each
state--selected in consultation with state officials to cover heavily-
and sparsely-populated areas--and technical advisors to each state.
Identifying Factors That States Have Considered in Making Decisions
about Item Type and Content of Their Assessments:
To gather information about factors states consider when making
decisions about the item type and content of their assessments, we
analyzed survey results. We asked states to provide information about
their use of item types, including the types of items they use for each
of their assessments (e.g., general, alternate, modified achievement
standards, or native language), and changes in their relative use of
multiple choice and open/constructed response items and factors
influencing their decisions on which item types to use for reading/
language arts and mathematics general assessments. We interviewed
selected state officials and state technical advisors. We also
interviewed officials from other states that had policies that helped
address the challenge of including cognitively complex content in state
assessments. We interviewed four major assessment vendors to obtain a
broad perspective on the views of the assessment industry. Vendors
were selected in consultation with the Association of American
Publishers because its members include the major assessment vendors
states have contracted with for ESEA assessment work. We reviewed
studies that our site visit states submitted as evidence for
Education's peer review approval process to document whether
assessments are aligned with academic content standards, including the
level of cognitive complexity in standards and assessments. We also
spoke with representatives from three alignment organizations that
states most frequently hire to conduct this type of study, and
representatives of a fourth alignment organization that was used by one
of our site visit states, who provided a national perspective on the
cognitive complexity of assessment content. In addition, applying GAO's
data reliability tests, we reviewed selected academic research studies
that examined the relationship between assessments and classroom
curricula. We determined that the results of these research
studies were sufficiently valid and reliable for the purposes of our
work.
Describing Challenges, If Any, That States Have Faced in Ensuring the
Validity and Reliability of Their ESEA Assessments:
To gather information about challenges states have faced in ensuring
validity and reliability, we used our survey to collect information
about state capacity and technical quality issues associated with
assessments. We conducted reviews of state documents, such as
assessment security protocols, and interviewed state officials. We
asked state officials from the states we visited to complete a CCSSO
checklist on student assessment--the Quality Control Checklist for
Processing, Scoring, and Reporting--to show which steps they took to
ensure quality control in high-stakes assessment programs. We used this
specific document created by CCSSO because, as an association of public
education officials, the organization provides considerable technical
assistance to states on assessment. We confirmed with CCSSO that the
document is still valid for state assessment programs and has not been
updated. We also interviewed four assessment vendors and assessment
security experts who were selected based on the extent of their
involvement in statewide assessments. We also reviewed summaries of the
peer review issues for states that have not yet been approved through
the peer review process, the portion of peer review protocols that
address assessment security, and the assessment security documents used
to obtain approval in our four site visit states.
Describing the Extent to Which Education Has Supported State Efforts to
Comply with ESEA Assessment Requirements:
To address the extent of Education's support of ESEA assessment
implementation, we reviewed Education guidance, summaries of Education
assistance, peer review training documents, and previous GAO work on
peer review processes. In addition, we analyzed survey results. We
asked states to provide information on the federal role in state
assessments, including their perspectives on technical assistance
offered by Education and Education's peer review process. We also asked
peer reviewers to provide their perspectives on Education's peer review
process. Of the 76 peer reviewers Education provided us, we randomly
sampled 20 and sent them a short questionnaire asking about their
perspectives on the peer review process. We obtained responses from
nine peer reviewers. In addition, we interviewed Education officials in
charge of the peer review and assistance efforts.
[End of section]
Appendix II: Student Population Assessed on ESEA Assessments in School
Year 2007-08:
General Assessments;
Approximate number of students assessed: 25 million in each of
reading/language arts and mathematics in 49 states reporting.
Alternate Assessments Using Alternate Achievement Standards;
Approximate number of students assessed: 250,000 in each of
reading/language arts and mathematics in 48 states reporting.
Alternate Assessments Using Modified Achievement Standards;
Approximate number of students assessed: 200,000 in each of
reading/language arts and mathematics in 46 states reporting.
Source: GAO.
[End of section]
Appendix III: Validity Requirements for Education's Peer Review:
Education's guidance describes four types of evidence that states
needed to provide during the peer review process:
1. Evidence based on test content (content validity). Content validity
is the alignment of the standards and the assessment.
2. Evidence of the assessment's relationship with other variables. This
means documenting the validity of an assessment by confirming its
positive relationship with other assessments or evidence that is known
or assumed to be valid. For example, if students who do well on the
assessment in question also do well on some trusted assessment or
rating, such as teachers' judgments, it might be said to be valid. It
is also useful to gather evidence about what a test does not measure.
For example, a test of mathematical reasoning should be more highly
correlated with another math test, or perhaps with grades in math, than
with a test of scientific reasoning or a reading comprehension test.
3. Evidence based on student response processes. The best opportunity
for detecting and eliminating sources of test invalidity occurs during
the test development process. Items need to be reviewed for ambiguity,
irrelevant clues, and inaccuracy. More direct evidence bearing on the
meaning of the scores can be gathered during the development process by
asking students to "think-aloud" and describe the processes they
"think" they are using as they struggle with the task. Many states now
use this "assessment lab" approach to validating and refining
assessment items and tasks.
4. Evidence based on internal structure. A variety of statistical
techniques have been developed to study the structure of a test. These
are used to study both the validity and the reliability of an
assessment. The well-known technique of item analysis used during test
development is actually a measure of how well a given item correlates
with the other items on the test. A combination of several statistical
techniques can help to ensure a balanced assessment, avoiding, on the
one hand, the assessment of a narrow range of knowledge and skills but
one that shows very high reliability, and on the other hand, the
assessment of a very wide range of content and skills, triggering a
decrease in the consistency of the results.
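To illustrate the item analysis technique described in point 4, the
following is a minimal sketch in Python that computes a corrected item-
total correlation--the correlation of each item with the total score on
the remaining items. The scored responses are invented for illustration
and are not drawn from the guidance or from any state assessment.
    import statistics
    def pearson(xs, ys):
        # Pearson correlation between two equal-length lists of scores.
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)
    # Rows are students; columns are items scored 0 (wrong) or 1 (right).
    responses = [
        [1, 1, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 0, 1, 1],
        [0, 0, 0, 0],
    ]
    for item in range(len(responses[0])):
        item_scores = [row[item] for row in responses]
        rest_totals = [sum(row) - row[item] for row in responses]
        r = pearson(item_scores, rest_totals)
        print(f"item {item + 1}: corrected item-total correlation = {r:.2f}")
An item with a low or negative correlation relative to the other items
would be a candidate for review during test development.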
In validating an assessment, the state must also consider the
consequences of its interpretation and use. States must attend not only
to the intended effects, but also to unintended effects. The
disproportional placement of certain categories of students in special
education as a result of accountability considerations rather than
appropriate diagnosis is an example of an unintended--and negative--
consequence of what had been considered proper use of instruments that
were considered valid.
Source: NCLB Standards and Assessments Peer Review Guidance.
[End of section]
Appendix IV: Reliability Requirements for Education's Peer Review:
The traditional methods of portraying the consistency of test results,
including reliability coefficients and standard errors of measurement,
should be augmented by techniques that more accurately and visibly
portray the actual level of accuracy. Most of these methods focus on
error in terms of the probability that a student with a given score, or
pattern of scores, is properly classified at a given performance level,
such as "proficient." For school-level or district-level results, the
report should indicate the estimated amount of error associated with
the percent of students classified at each achievement level. For
example, if a school reported that 47 percent of its students were
proficient, the report might say that the reader could be confident at
the 95 percent level that the school's true percent of students at the
proficient level is between 33 percent and 61 percent. Furthermore,
since the focus on results in a Title I context is on improvement over
time, the report should also indicate the accuracy of the year-to-year
changes in scores.
Source: NCLB Standards and Assessments Peer Review Guidance.
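The confidence interval in the reporting example above can be
reproduced with the normal approximation for a proportion. The
following minimal sketch in Python is not part of the guidance; the
school size (n = 49) is an assumption chosen so that the result matches
the 33 to 61 percent interval, since the guidance does not specify a
school size or a particular method.
    import math
    p_hat = 0.47  # reported proportion of students proficient
    n = 49        # hypothetical number of students tested
    # 95 percent confidence interval using the normal approximation.
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    print(f"95% CI: {p_hat - margin:.0%} to {p_hat + margin:.0%}")
    # Prints roughly: 95% CI: 33% to 61%
As the sketch suggests, even a moderately sized school can carry a wide
margin of error around its reported percent proficient.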
[End of section]
Appendix V: Alignment Requirements for Education's Peer Review:
To ensure that their standards and assessments are aligned, states need
to consider whether the assessments:
* Cover the full range of content specified in the state's academic
content standards, meaning that all of the standards are represented
legitimately in the assessments.
* Measure both the content (what students know) and the process (what
students can do) aspects of the academic content standards.
* Reflect the same degree and pattern of emphasis apparent in the
academic content standards (e.g., if the academic content standards
place a lot of emphasis on operations, then so too should the
assessments).
* Reflect the full range of cognitive complexity and level of
difficulty of the concepts and processes described, and depth
represented, in the state's academic content standards, meaning that
the assessments are as demanding as the standards.
* Yield results that represent all achievement levels specified in the
state's academic achievement standards.
Source: NCLB Standards and Assessments Peer Review Guidance.
[End of section]
Appendix VI: Item Types Used Most Frequently by States on General and
Alternate Assessments:
[Refer to PDF for image: series of horizontal bar graphs]
Subject studies: General reading/language arts:
Multiple choice:
Number of survey respondents:
Number of states that use this item type: 48;
Number of states that responded to the question: 48;
Number of states that did not respond or checked “no response”: 1.
Open/constructed response:
Number of survey respondents:
Number of states that use this item type: 38;
Number of states that responded to the question: 45;
Number of states that did not respond or checked “no response”: 4.
Work samples/portfolio:
Number of survey respondents:
Number of states that use this item type: 2;
Number of states that responded to the question: 41;
Number of states that did not respond or checked “no response”: 8.
Subject studies: General math:
Multiple choice:
Number of survey respondents:
Number of states that use this item type: 48;
Number of states that responded to the question: 48;
Number of states that did not respond or checked “no response”: 1.
Open/constructed response:
Number of survey respondents:
Number of states that use this item type: 34;
Number of states that responded to the question: 45;
Number of states that did not respond or checked “no response”: 4.
Other format[A];
Number of survey respondents:
Number of states that use this item type: 5;
Number of states that responded to the question: 38;
Number of states that did not respond or checked “no response”: 11.
Subject studies: Alternate assessment using alternate achievement
standards reading/language arts:
Multiple choice:
Number of survey respondents:
Number of states that use this item type: 9;
Number of states that responded to the question: 40;
Number of states that did not respond or checked “no response”: 9.
Rating scales:
Number of survey respondents:
Number of states that use this item type: 13;
Number of states that responded to the question: 38;
Number of states that did not respond or checked “no response”: 11.
Work samples/portfolio:
Number of survey respondents:
Number of states that use this item type: 26;
Number of states that responded to the question: 43;
Number of states that did not respond or checked “no response”: 6.
Subject studies: Alternate assessment using alternate achievement
standards math:
Multiple choice:
Number of survey respondents:
Number of states that use this item type: 9;
Number of states that responded to the question: 40;
Number of states that did not respond or checked “no response”: 9.
Rating scales:
Number of survey respondents:
Number of states that use this item type: 13;
Number of states that responded to the question: 38;
Number of states that did not respond or checked “no response”: 11.
Work samples/portfolio:
Number of survey respondents:
Number of states that use this item type: 26;
Number of states that responded to the question: 43;
Number of states that did not respond or checked “no response”: 6.
Source: GAO survey.
[A] Other format includes gridded response, performance event,
scaffolded multiple choice and performance events, and locally-
developed formats.
[End of figure]
[End of section]
Appendix VII: Comments from the U.S. Department of Education:
United States Department Of Education:
Office Of Elementary And Secondary Education:
The Assistant Secretary:
404 Maryland Ave., S.W.
WASHINGTON, DC 20202:
[hyperlink, http://www.ed.gov]
"Our mission is to ensure equal access to education and to promote
educational excellence throughout the nation."
September 8, 2009:
Ms. Cornelia M. Ashby:
Director:
Education, Workforce, and Income Security Issues:
U.S. Government Accountability Office:
441 G Street, NW:
Washington, DC 20548:
Dear Ms. Ashby:
I am writing in response to your request for comments on the draft
Government Accountability Office (GAO) report, "No Child Left Behind
Act: Enhancements in the Department of Education's Review Process Could
Improve State Academic Assessments" (GAO-09-911).
This report has three recommendations for the Secretary of Education.
Following is the Department's response.
Recommendation: Incorporate test security best practices into the peer
review protocols.
Response: The Department recognizes the value of this recommendation
and the importance of test security practices in maintaining the
validity and reliability of each State's assessment system. Currently,
as part of the peer review process, States do provide us with evidence
of State statutes, rules of professional conduct, administrative
manuals, and memoranda that address test security and reporting of test
irregularities. Other procedures and requirements (e.g., remedies for
teacher misconduct) are typically included in contractual agreements
with test publishers and other parties, or collective bargaining
agreements. The Department does not examine those additional provisions
because we believe that our current practices are sufficient to ensure
that appropriate test security policies and procedures are promulgated
and implemented at the State level. Details on these additional
provisions, such as security methods and techniques to discover testing
irregularities, are best handled locally based on consideration of risk
and cost factors. As the report mentions, the Department monitors the
implementation of State test security policies in its regularly
scheduled Title I monitoring visits to State and local educational
agencies. Department staff will continue to monitor test security
practices during the monitoring visits, issue findings to States with
weak or incomplete test security practices, and require corrective
action by States with monitoring findings.
Recommendation: Develop methods to improve communication during the
review process.
Response: The Department has made the following improvements over the
last year to strengthen communications with States during the peer review
process. First, peers and the Department staff member assigned to
review the State's assessment system typically call State assessment
officials and discuss the submission and the peers' concerns. This
occurs prior to the conclusion of the peer review, giving peers time to
correct any misconceptions before they complete their review. Through
this process, State officials have opportunities to ask questions and
obtain clarification regarding the peers' and Department's concerns.
Second, during the Technical Assistance Peer Review (May 2008), State
assessment professionals (individuals or teams) met directly with the
peers or peer team leader and the Department staff member assigned to
the State to thoroughly discuss the peers' comments and concerns. A
technical assistance review is conducted to help States understand
where further development is required before the system is ready for
review. The Department will continue this process.
Furthermore, the Department is looking into the possibility of using a
secure server as a means for State officials to submit questions,
documents, and other evidence that would only be viewed by the
reviewers, State officials, and Department staff. We believe that the
use of a secure server, in combination with the procedures already in
place, would strengthen the communication that takes place during the
peer review process.
Recommendation: Identify for States why the Department's peer review
decisions in some cases differed from peer reviewers' written comments.
Response: Peer notes sometimes address areas outside of the
Department's purview, offer recommendations to improve elements of the
system beyond the requirements of the law and regulations, or offer
opinions on technical matters. We do not use those recommendations in
judging the merits of the assessment system, but, as a professional
courtesy, we include them as technical assistance in the peer notes
provided to the States. Peer notes, and the deliberations they
document, are recommendations to the Assistant Secretary for Elementary
and Secondary Education, and on occasion, Department staff may disagree
with the peers' summary comments. The Assistant Secretary is presented
with these discrepancies after they have been discussed internally
among Department staff. These discrepancies usually deal with limits on
the range of evidence that is required to be provided to demonstrate
compliance with the applicable statutory and regulatory provisions and
the extent to which the Department has authority in judging the quality
of certain features of a State assessment system. For example, the
Department has no prerogative to deny approval of an assessment system
based on the substance of content standards, nor is the State required
to submit evidence on that issue. The Department and peers review only
the process used to develop a State's content standards, ensure broad
participation of stakeholders in the process, and ensure that a State
demonstrates the rigor of the standards. Hence, there are no peer-
review "decisions," only peer recommendations reflecting the
professional experience and perspectives of the reviewers. The
Assistant Secretary takes these recommendations under consideration,
along with those of Department staff, in making a decision regarding
the approval of a State's assessment system.
However, in response to this recommendation, Department staff will
conduct a conference call in advance of upcoming peer reviews to
clarify why the Department's decisions in some cases differ from peer
reviewers' written comments.
I appreciate the opportunity to share our comments on the draft report.
I hope that these comments are useful to you. In addition, we have
provided some suggested technical edits that should be considered to
add clarity to the report.
Sincerely,
Signed by:
Thelma Melendez de Santa Ana, Ph.D.
[End of section]
Appendix VIII: GAO Contact and Staff Acknowledgments:
GAO Contact:
Cornelia M. Ashby (202) 512-7215 or ashbyc@gao.gov:
Staff Acknowledgments:
Bryon Gordon, Assistant Director, and Scott Spicer, Analyst-in-Charge,
managed this assignment and made significant contributions to all
aspects of this report. Jaime Allentuck, Karen Brown, and Alysia
Darjean also made significant contributions. Additionally, Carolyn
Boyce, Doreen Feldman, Cynthia Grant, Sheila R. McCoy, Luann Moy, and
Charlie Willson aided in this assignment.
[End of section]
Footnotes:
[1] For purposes of this report, the term "ESEA assessments" refers to
assessments currently required under ESEA, as amended. The Improving
America's Schools Act of 1994 created some requirements for
assessments, and these requirements were later supplemented by the
requirements in NCLBA.
[2] For purposes of this report, we refer to test questions as "items."
The term "item" encompasses multiple choice, open/constructed response,
and various other types, while the term "question" connotes only items
posed with a question mark.
[3] New York and Rhode Island did not respond to the survey. For the
purposes of this report, we refer to the District of Columbia as a
state.
[4] Pub. L. No. 89-10.
[5] Pub. L. No. 103-382.
[6] Pub. L. No. 107-110.
[7] GAO, Title I: Characteristics of Tests Will Influence Expenses;
Information Sharing May Help States Realize Efficiencies, [hyperlink,
http://www.gao.gov/products/GAO-03-389] (Washington, D.C.: May 8, 2003).
[8] Pub. L. No. 111-5.
[9] Adequate Yearly Progress is a measure of year-to-year student
achievement under ESEA. AYP is used to make determinations about
whether or not schools or school districts have met state academic
proficiency targets. All schools and districts are expected to reach
100 percent proficiency by the 2013-14 school year.
[10] For the total number of students tested on each of the different
types of assessment in 2007-08, see appendix II.
[11] The 2 percent of scores that may be included in AYP using the
alternate assessment based on modified academic achievement standards
is in addition to the 1 percent of the student population included
with the alternate assessment based on alternate academic achievement
standards.
[12] LEP students may only take assessments in their native language
for a limited number of years.
[13] GAO's 2003 report (GAO-03-389) found that item type has a major
influence on overall state expenditures for assessments. However,
regarding the changes to state expenditures for assessments since the
enactment of NCLBA--which our survey examined--few states reported that
item type was a major factor.
[14] We asked states to rank the cost of test/item development,
scoring, administration, reporting test results, data management, and
all other assessment activities.
[15] Although GAO-03-389 found that item type was a key factor in
determining the overall cost of state ESEA assessments, these
differences were related to the cost of scoring assessments rather than
developing assessments. Our research did not find that item type
affected the cost of development.
[16] We defined small states as those states administering 500,000 or
fewer ESEA assessments in 2007-08. Reading/language arts and
mathematics assessments were counted separately.
[17] GAO, Title I: Characteristics of Tests Will Influence Expenses;
Information Sharing May Help States Realize Efficiencies, [hyperlink,
http://www.gao.gov/products/GAO-03-389] (Washington, D.C.: May 8,
2003).
[18] This does not necessarily indicate that state assessments were not
aligned to state standards. For example, if the content in standards
does not include the highest cognitive level, assessments that do not
address the highest cognitive level could be aligned to standards.
[19] The alignment review was conducted by Achieve, Inc., which was one
of the four alignment organizations that we interviewed.
[20] Committee on the Foundations of Assessment, James W. Pellegrino,
Naomi Chudowsky, and Robert Glaser, editors, Knowing What Students
Know: The Science and Design of Educational Assessment (Washington,
D.C.: National Academy Press, 2001) 194.
[21] Gail P. Baxter and Robert Glaser, "Investigating the Cognitive
Complexity of Science Assessments," Educational Measurement: Issues and
Practice, vol. 17, no. 3 (1998).
[22] Helen S. Apthorp, et al., "Standards in Classroom Practice
Research Synthesis," Mid-Continent Research for Education and Learning
(October 2001).
[23] For example, see 20 U.S.C. § 7907(c)(1) and 20 U.S.C. § 6575.
[24] Maryland did not complete this checklist.
[25] CCSSO is an association of public officials who head departments
of elementary and secondary education in the states, the District of
Columbia, the Department of Defense Education Activity, and five extra-
state jurisdictions. It provides advocacy and technical assistance to
its members. The CCSSO checklist describes 33 steps that state
officials should take to ensure quality control in assessment programs
that are used to make decisions with consequences for students or
schools. The checklist can be found at [hyperlink,
http://www.ccsso.org].
[26] U.S. Department of Education, Office of the Inspector General,
Tennessee Department of Education Controls Over State Assessment
Scoring, ED-OIG/A02I0034 (New York, N.Y.: May 2009).
[27] The 12 states that had not received full approval were California,
the District of Columbia, Florida, Hawaii, Michigan, Mississippi,
Nebraska, Nevada, New Hampshire, New Jersey, Vermont, and Wyoming. In
all of these states except California the alternate assessments based
on alternate achievement standards were a factor preventing full
approval.
[28] The four states were Florida, New Hampshire, New Jersey, and
Vermont.
[29] Brian A. Jacob and Steven D. Levitt, "Rotten Apples: An
Investigation of the Prevalence and Predictors of Teacher Cheating,"
The Quarterly Journal of Economics (August 2003).
[30] American Educational Research Association, American Psychological
Association, and National Council on Measurement in Education, Standards
for Educational and Psychological Testing (1999).
[31] In this report, we refer to the District of Columbia as a state.
[End of section]
GAO's Mission:
The Government Accountability Office, the audit, evaluation and
investigative arm of Congress, exists to support Congress in meeting
its constitutional responsibilities and to help improve the performance
and accountability of the federal government for the American people.
GAO examines the use of public funds; evaluates federal programs and
policies; and provides analyses, recommendations, and other assistance
to help Congress make informed oversight, policy, and funding
decisions. GAO's commitment to good government is reflected in its core
values of accountability, integrity, and reliability.
Obtaining Copies of GAO Reports and Testimony:
The fastest and easiest way to obtain copies of GAO documents at no
cost is through GAO's Web site [hyperlink, http://www.gao.gov]. Each
weekday, GAO posts newly released reports, testimony, and
correspondence on its Web site. To have GAO e-mail you a list of newly
posted products every afternoon, go to [hyperlink, http://www.gao.gov]
and select "E-mail Updates."
Order by Phone:
The price of each GAO publication reflects GAO's actual cost of
production and distribution and depends on the number of pages in the
publication and whether the publication is printed in color or black and
white. Pricing and ordering information is posted on GAO's Web site,
[hyperlink, http://www.gao.gov/ordering.htm].
Place orders by calling (202) 512-6000, toll free (866) 801-7077, or
TDD (202) 512-2537.
Orders may be paid for using American Express, Discover Card,
MasterCard, Visa, check, or money order. Call for additional
information.
To Report Fraud, Waste, and Abuse in Federal Programs:
Contact:
Web site: [hyperlink, http://www.gao.gov/fraudnet/fraudnet.htm]:
E-mail: fraudnet@gao.gov:
Automated answering system: (800) 424-5454 or (202) 512-7470:
Congressional Relations:
Ralph Dawn, Managing Director, dawnr@gao.gov:
(202) 512-4400:
U.S. Government Accountability Office:
441 G Street NW, Room 7125:
Washington, D.C. 20548:
Public Affairs:
Chuck Young, Managing Director, youngc1@gao.gov:
(202) 512-4800:
U.S. Government Accountability Office:
441 G Street NW, Room 7149:
Washington, D.C. 20548: