Theory and Practice of Fit

The article by Douglas serves as a starting point for a
discussion of the theory of fit and its practice in Rasch
measurement. Although we often speak of fit as a unitary concept,
there are really two underlying questions being asked when fit is
discussed.

The first question concerns fit of the data to the model. If
the desirable properties of Rasch measurement are to hold, then the
data must approximate the model. This is important in calibrating
item sets, in equating test forms, in studies of bias and in
studies of the underlying definition of the variable. This
question must be answered for the data as a whole before further
analysis of the data is useful. It is similar to that asked about
any statistical analysis: are the specifications of the model
approximated by the data under study? In Rasch measurement there
are many global, item, and person fit statistics that have been
used to assess this question including the Wright-Panchapakesan
(1969) statistics.

The second question concerns the degree to which the total score
that an examinee earns on a test adequately summarizes the
examinee's total set of responses. This question of response fit
comes later in the analysis at the point when a decision must be
made about the individual examinee based on the results of his or
her particular examination performance: decisions such as admission,
assigning grades, promotion, graduation, or certification. A variety
of person fit or appropriateness techniques have been developed to
answer this question. This is not a question of the utility of the
data for analysis by the measurement model, but of the meaning
(validity) of the measure for the individual. It is possible, and
in practice inevitable, that a favorable answer to the question of
the utility of the data for analysis by a Rasch measurement model
does not guarantee a favorable answer to the question concerning
validity for every individual tested. No matter how hard we try to
construct potentially valid tests, there will always be individual
performances for which those tests were not valid.

To understand the first question it is necessary to understand
that models are abstractions designed to bring order to
observations. Real data can never fit any model perfectly. That
is why simulated data must be used to develop significance values
for fit indices. The relevant question is one of robustness. How
robust is a set of data to violations of the model's requirements?
Can the analysis extract useful information from the data? A vital
property of strong models, such as the Rasch models, is that the
information extracted from the data can be useful even when the
data do not fit the model very well. This is because the model
constructs a strong frame of reference against which the particular
properties of the data are revealed. Experience has shown that
data analyses guided by Rasch measurement models are quite robust
to violations of the model's requirements. In particular,
individual measurement disturbances seldom have tangible effects on
equating or bias studies.
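The role of simulation in developing significance values can be illustrated with a minimal sketch (the ability, difficulties, sample sizes, and 5% criterion are illustrative assumptions, not any published program's defaults): generate dichotomous responses under the Rasch model for persons who fit by construction, compute a fit index for each, and read the critical value off the empirical null distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
diffs = np.linspace(-2, 2, 20)   # assumed item difficulties (logits)

def outfit(x, theta, d):
    """Unweighted (outfit) mean-square: average squared standardized residual."""
    p = 1 / (1 + np.exp(-(theta - d)))       # Rasch success probabilities
    return (((x - p) ** 2) / (p * (1 - p))).mean()

# Simulate 2000 model-fitting persons of ability 0 and collect the
# empirical null distribution of the outfit mean-square.
p = 1 / (1 + np.exp(-(0.0 - diffs)))
sims = [outfit((rng.random(20) < p).astype(int), 0.0, diffs)
        for _ in range(2000)]
crit = np.quantile(sims, 0.95)   # empirical 5% critical value
```

Because real data never fit perfectly, a critical value obtained this way flags only departures larger than those the model itself generates by chance.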

The alert investigator can strengthen the "person-freed" item
calibration property by removing from the calibration sample, on a
priori grounds, individuals whose responses are likely to exhibit
measurement disturbances, without actually calculating person fit
indices. The
BICAL item calibration program, for example, made it easy for
investigators to omit extreme raw scores from item calibration,
i.e., performances with scores near or below the chance level where
"guessing" might occur or near perfect performances where
"carelessness" might occur. The BICAL program also produced, on
request, two sets of item calibrations, one with all misfitting
persons (based on total weighted fit) excluded and one with them
included. The alert investigator could compare these two
calibrations to determine the effect, if any, of the misfitting
persons on the item calibrations.

As more powerful data editors and word processors became
available, these features were dropped from subsequent Rasch
calibration programs. The fact remains, however, that misfit
editing of the data prior to final calibration often produces more
stable item calibrations.

When in doubt, run two calibrations, one with misfitting and/or low
and high scoring persons included and one with them excluded, and
study the differences.
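Such a comparison can be sketched as follows. The score cutoffs and the first-cut logit calibration (the initial step of a PROX-style estimate, before any spread correction) are illustrative assumptions of ours, not BICAL's actual procedure; the data are simulated purely to have something to calibrate.

```python
import numpy as np

def item_difficulties(responses):
    """First-cut item difficulties: centered logits of the item
    proportions-correct (the opening step of a PROX-style calibration)."""
    p = np.clip(responses.mean(axis=0), 1e-6, 1 - 1e-6)
    d = np.log((1 - p) / p)          # harder items -> higher logit difficulty
    return d - d.mean()              # center so difficulties sum to zero

def trim_extreme_scores(responses, lo_frac=0.25, hi_frac=0.95):
    """Drop persons with raw scores near the chance floor ("guessing")
    or near perfect ("carelessness"). Cutoff fractions are illustrative."""
    raw = responses.sum(axis=1)
    n_items = responses.shape[1]
    keep = (raw > lo_frac * n_items) & (raw < hi_frac * n_items)
    return responses[keep]

rng = np.random.default_rng(0)
theta = rng.normal(0, 1, 200)                    # simulated person abilities
diff = np.linspace(-2, 2, 10)                    # simulated item difficulties
p_true = 1 / (1 + np.exp(-(theta[:, None] - diff[None, :])))
data = (rng.random((200, 10)) < p_true).astype(int)
data[:10] = (rng.random((10, 10)) < 0.2).astype(int)   # hypothetical low scorers

d_all = item_difficulties(data)
d_trim = item_difficulties(trim_extreme_scores(data))
print(np.round(d_all - d_trim, 2))   # study the differences, item by item
```

The differences between the two calibrations show directly how much the suspect performances were influencing the item difficulties.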

With regard to the validity of the total score or ability
estimate for a person, a second set of concerns arises.
Investigators often assume that the fit indices contained in
calibration programs for items and persons are sufficient to
guarantee the validity of the measure for the individual against
all meaningful measurement disturbances. These global fit indices
may provide adequate information for answering the question as to
the utility of the data for analysis by the model. However, they
only begin to provide the information necessary to answer the
second question. Studies by Smith point to the need to use a
combination of total and between-fit statistics when investigating
the validity of person measures or item difficulties (Smith, 1986,
1988; Smith & Hedges, 1982). The extent of care and
thoroughness needed to validate person measures depends on the
importance of the decision to be made with the measures.

It has been implied that the size of most testing programs makes
it impractical to look closely at the validity of person measures.
But recent efforts by the College Board (for PSAT and SAT tests)
and the Australian Council for Educational Research (KIDMAPS for
grade-level achievement tests in New South Wales) show that the
statistical results of person fit analysis can be expressed in
terms that are accessible and useful to students and parents.

The primary tools for fit analysis in Rasch measurement have been
the standardized chi-square statistics based on the work of Wright
and Panchapakesan (1969) and further elaborated by Mead (1975),
Wright (1977), Wright and Stone (1979), and Wright and Masters
(1982). Since their inception these statistics have come under
criticism from several fronts. Initial criticism was based on the
fact that the squared differences between observed and predicted
responses for item/person interactions were only approximately
chi-square. Since, however, real data never fit any ideal model,
all applications of chi-square are approximations.

Later criticism was that the true distributional properties of
these approximate chi-squares or their transformations were
unknown. A variety of alternatives have been proposed (Andersen,
1973; Van den Wollenberg, 1982; Yen, 1981). But study and practice
have shown that these other statistics offer no useful advantage
over the Wright-Panchapakesan statistics. Work by Smith on the
distribution of standardized residuals and the null distributions
of standardized fit statistics has shown that even though these
statistics are not "true" chi-squares, they are regular enough to
identify outliers reliably.
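For one person's dichotomous responses, the standardized residuals, the unweighted (outfit) mean-square, and its cube-root (Wilson-Hilferty) standardization can be sketched as follows. The formulas are those given in Wright and Masters (1982); the function name and example values are our own.

```python
import numpy as np

def outfit_zstd(x, theta, difficulties):
    """Outfit mean-square and its Wilson-Hilferty standardization for one
    person's dichotomous responses (Wright & Masters, 1982)."""
    p = 1 / (1 + np.exp(-(theta - difficulties)))  # Rasch success probabilities
    w = p * (1 - p)                                # response variances
    z2 = (x - p) ** 2 / w                          # squared standardized residuals
    n = len(x)
    msq = z2.mean()                                # outfit mean-square; expectation 1
    q = np.sqrt((1 / w - 4).sum() / n**2)          # model sd of the mean-square
    zstd = (msq ** (1 / 3) - 1) * (3 / q) + q / 3  # cube-root normalization
    return msq, zstd

diffs = np.linspace(-2, 2, 9)                      # assumed item difficulties
consistent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0]) # Guttman-like pattern
erratic    = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1]) # misses easy, gets hard
m1, t1 = outfit_zstd(consistent, 0.0, diffs)
m2, t2 = outfit_zstd(erratic, 0.0, diffs)
print(f"consistent: msq={m1:.2f} z={t1:.2f}; erratic: msq={m2:.2f} z={t2:.2f}")
```

The erratic pattern, which succeeds on hard items while failing easy ones, produces a mean-square well above 1 and a large positive standardized value; the consistent pattern falls below 1. It is in this sense that the tails of the distribution, though not "true" chi-square, identify outliers.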

The most recent suggestion for an alternative fit statistic,
based on the exact probabilities of a given person response pattern
(Molenaar and Hoijtink, 1990), is discussed in the Douglas paper.
The Wright-Panchapakesan statistics are computationally simpler
than the Molenaar-Hoijtink statistic, are highly correlated with
the exact probabilistic results, and can be summarized to answer a
priori hypotheses that are inaccessible with the Molenaar-Hoijtink
statistic.

The Wright-Panchapakesan (WP) statistics and their derivatives
have offered an efficient and practical way to evaluate fit to the
Rasch measurement models for 20 years. The WP approximations stand
up well in comparison with possibly more precise tests such as
likelihood-ratio chi-squares and the Molenaar-Hoijtink statistic.
Studies of the distributional properties of WP statistics show that
the tails of their distributions are regular enough to identify
outliers reliably. There is no practical reason to use anything
more complicated.

The Rasch Measurement SIG (AERA) thanks the Institute for Objective Measurement for inviting the publication of Rasch Measurement Transactions on the Institute's website, www.rasch.org.