Purpose: With the introduction of the No Child Left Behind Act of 2001, the number of tests taken by U.S. grade school students has been on the rise. Many of these tests administer different test forms on different dates, so students who take the test on different dates do not answer exactly the same questions. To ensure fair scores on high-stakes exams, differences in the scores of students who test on different dates must reflect differences in the students' achievement levels, not differences in test difficulty. To achieve this goal, researchers employ equating, a statistical procedure that ensures scores from different test forms can be used interchangeably. Among the different equating designs, the Non-Equivalent groups with Anchor Test (NEAT) design is widely used for equating tests that help make high-stakes decisions, such as the SAT, Advanced Placement tests, and standardized state assessments. Three popular equating methods that can be used with a NEAT design are the poststratification equipercentile equating method (also called the frequency estimation method), the chain equipercentile equating method, and the item response theory observed score equating method. Each of these methods makes distinct assumptions about the missing data in the NEAT design, and these assumptions are usually neither explicit nor testable in typical equating situations. The proposed project examines the missing data assumptions and their implications for equating, with the aim of obtaining a deeper understanding of the comparative performance of the three above-mentioned equating methods.

Project Activities: In this project, the research team examines the three above-mentioned equating methods. First, the missing data assumptions for the three methods are described explicitly. Next, the research team compares the three methods using datasets from several operational tests that use the NEAT design for equating, including tests given to U.S. grade school students. For each dataset, the team examines how the three equating methods perform when the missing data satisfy the assumptions made by only one of the methods.

Products: The products of this project include a deeper understanding of the comparative performance of the three equating methods. This will assist practitioners who use the NEAT design to choose the most appropriate method for equating; this will in turn ensure reporting of fair scores in tests that employ a NEAT design.

Structured Abstract

Purpose: The purpose of this project is to assist practitioners using the NEAT design to choose the most appropriate method for equating. This will in turn ensure reporting of fair scores in tests that employ a NEAT design.

Research Design and Methods: In this project, the research team examines three popular equating methods that can be used with a NEAT design: the poststratification equipercentile equating method (also called the frequency estimation method), the chain equipercentile equating method, and the item response theory observed score equating method. First, the missing data assumptions for the three methods are described explicitly. Next, the research team compares the three methods using datasets from several operational tests that use the NEAT design for equating, including tests given to U.S. grade school students. For each dataset, the team examines how the three equating methods perform relative to one another when the missing data satisfy the assumptions made by only one of the methods.

Data Analytic Strategy: For data from each test, the following three-step procedure is used to compare the methods under different missing data assumptions:

The "true equating function" under each method is obtained by imposing the missing data assumptions inherent in that method. For example, a typical missing data assumption of the poststratification equating method is that, of all the examinees who obtained a score of, say, 10 on the anchor test, the proportion who obtained a score of, say, 30 on the test to be equated is the same regardless of whether the examinees belong to the new-form population or the old-form population.
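This assumption can be sketched numerically. In the sketch below (a hypothetical illustration; the function names, variable names, and synthetic data are ours, not the project's), the old-form population's unobserved total-score distribution is reconstructed by assuming that the conditional distribution of the total score given the anchor score is identical in the two populations:

```python
import numpy as np

def conditional_dist(total, anchor, a, n_points):
    """Empirical distribution of total score X among examinees
    whose anchor score A equals a."""
    x = total[anchor == a]
    counts = np.bincount(x, minlength=n_points)
    return counts / max(counts.sum(), 1)

def poststratify(total_p, anchor_p, anchor_weights_q, n_points):
    """Estimate the old-form population's (unobserved) distribution of X:
    assume P(X = x | A = a) is the same in both populations, then reweight
    the new-form conditional distributions by the old-form anchor weights."""
    dist = np.zeros(n_points)
    for a, w in enumerate(anchor_weights_q):
        dist += w * conditional_dist(total_p, anchor_p, a, n_points)
    return dist
```

The reconstructed distribution is a proper probability distribution whenever the anchor weights sum to one, which is what makes the (untestable) conditional-invariance assumption operationally convenient.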

The equating function for each method is then computed from the observed data. These are called the "observed equating functions"; they are the standard equating functions that, for example, a testing company employing these methods would compute from the data in an operational equating environment.

For each pair of true and observed equating functions, the difference between the two is computed at each score point. The differences are plotted in two-dimensional graphical displays and summarized using their weighted averages.
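A summary of this kind can be sketched as follows (the source does not specify the weighting scheme or whether signed or absolute differences are averaged, so the frequency weights and absolute differences below are our assumptions):

```python
import numpy as np

def weighted_mean_abs_diff(true_eq, obs_eq, freq):
    """Weighted average of absolute differences between a true and an
    observed equating function, weighting each score point by its
    relative frequency."""
    w = np.asarray(freq, dtype=float)
    w = w / w.sum()
    diff = np.abs(np.asarray(obs_eq, float) - np.asarray(true_eq, float))
    return float(np.sum(w * diff))
```

Frequency weighting downplays discrepancies at score points that few examinees actually obtain, which is a common choice when summarizing equating error.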

The above steps are carried out for several tests and under two alternative methods of continuizing the discrete score distributions: linear equating and kernel equating.
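The kernel approach to continuization can be sketched as follows (a simplified illustration: it replaces each discrete score point with a Gaussian, but omits the mean- and variance-preserving linear adjustment used in operational kernel equating, and the bandwidth value is purely illustrative):

```python
import numpy as np
from math import erf, sqrt

def kernel_cdf(freq, x, bandwidth=0.6):
    """Continuized CDF of a discrete score distribution, obtained by
    centering a Gaussian with the given bandwidth at each score point j
    and mixing the Gaussian CDFs with the score probabilities."""
    p = np.asarray(freq, dtype=float)
    p = p / p.sum()
    return sum(pj * 0.5 * (1.0 + erf((x - j) / (bandwidth * sqrt(2.0))))
               for j, pj in enumerate(p))
```

The resulting function is a smooth, strictly increasing CDF, so it can be inverted at any percentile rank; this is what makes equipercentile equating well defined for discrete score scales.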

** This project was submitted to and funded as an Unsolicited application in FY 2007.