Observations are Always Ordinal;
Measurements, however, Must be Interval

ABSTRACT. Quantitative observations are based on counting observed
events or levels of performance. Meaningful measurement is based on
the arithmetical properties of interval scales. The Rasch
measurement model provides the necessary and sufficient means to
transform ordinal counts into linear measures. Imperfect
unidimensionality and other threats to linear measurement can be
assessed by means of fit statistics. The Rasch model is being
successfully applied to rating scales.

Merbitz and associates[5] provide a sensitive and useful
explanation of the hazards encountered when data are treated
improperly. This misconstruction of data can be understood as the
result of a confusion as to the relationship between observation
and measurement - a confusion which can be speedily resolved with
a little clarification.

Data are Always Ordinal

All observations begin as ordinal, if not nominal, data.
Quantitative science begins with identifying conditions and events
which, when observed, are deemed worth counting. This counting is
the beginning of quantification. Measurement is deduced from well-
defined sets of counts. The most elementary level is to count the
presence, "1," or absence, "0," of the defined condition or
event.

More information can be obtained when the conditions that
identify countable events are ordered into successive categories
which increase (or decrease) in status along some intended
underlying variable. It then becomes possible to count, not just
the presence (versus absence) of an event, but the number of steps
up the ordered set of categories which the particular category
observed implies.

When, for example, a rating scale is labeled: "none," "plenty,"
"nearly all," "all," the inarguable order of these labels from less
to more can be used to represent them as a series of steps. The
observation of "none" can be counted as zero steps up this rating
scale, "plenty" as one step up, "nearly all" as two steps up and
"all" as three. This counting has nothing to do with any numbers or
weights with which the categories might have been tagged in
addition to or instead of their labels. For instance, "plenty"
might also have been labeled as "20" or "40" by the instrument
designer, but the assertion of such a numerical category label
would not alter the fact that, on this scale, "plenty" is just one
step up the scale from "none."

All classifications are qualitative. Some classifications[1],
like those above, can be ordered and so are more than nominal.
Other classifications, such as one based on race, can usually not
be ordered, though there may be perspectives from which an ordering
becomes useful. This does not mean that nominal data, such as race
or gender, cannot have powerful explanatory or diagnostic power.
Nonparametric statistical techniques[3] can be useful in such
cases. But it does mean that they are not measurement in the
accepted sense of the word.

As Merbitz and colleagues[5] emphasize, this counting of steps
says nothing about distances between categories, nor does it
require that all test items employ the same rating scale. Whenever
four category labels share the same ordering, however else they may
differ in implied amounts, they can only be represented by exactly
the same step counts, even though, after analysis, their
calibrations may well differ. It would not make any difference to
the method of step counting if the four ordered categories were
labeled quite differently for another item, say, "none," "almost
none," "just a little," "all." Even though the relative meanings
and the intended amounts corresponding to the alternative sets of
category labels are conspicuously different, their order is the
same and so their step counts can only be the same (0, 1, 2, 3).
This is always so no matter how the four ordered categories might
be labeled.
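
The step counting described above can be sketched in a few lines of Python. This is only an illustration: the helper name step_counts is hypothetical, and the two label sets are the ones quoted in the text.

```python
def step_counts(ordered_labels):
    """Map each label to the number of steps up from the lowest category."""
    return {label: steps for steps, label in enumerate(ordered_labels)}

# Two items with conspicuously different labels but the same ordering.
item_a = step_counts(["none", "plenty", "nearly all", "all"])
item_b = step_counts(["none", "almost none", "just a little", "all"])

print(item_a["plenty"])        # steps up from "none" on the first item
print(item_b["almost none"])   # steps up from "none" on the second item

# The labels differ, yet the step counts available to both items are identical.
print(list(item_a.values()) == list(item_b.values()))
```

Only the position of a category in the ordering enters the count. Any numerical tag attached to a label by the instrument designer is ignored.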

Measures are Always Interval/Ratio

What every scientist and layman means by a "measure" is a number
with which arithmetic (and linear statistics) can be done, a number
which can be added and subtracted, even multiplied and divided, and
yet with results that maintain their numerical meaning. The
original observations in any science are not yet measures in this
sense. They cannot be measures because a measure implies the
previous construction and maintenance of a calibrated measuring
system with a well-defined origin and unit which has been shown to
work well enough to be useful. Merbitz and his coworkers stress the
importance of linear scales as a prerequisite to unequivocal
statistical analysis. They are saying that something must be done
with counts of observed events to build them into measures. A
measurement system must be constructed from a related set of
relevant counts and its coherence and utility established.

Confusing Counts with Measures

It is true that counts of concrete events are on a kind of ratio
scale. They have an obvious origin in "none" and the counted events
provide a raw unit of "one more event." The problem is that the
events that are counted are specific rather than general, concrete
rather than abstract, and varying rather than uniform in their
import. Sometimes the next "one more event" implies, according to
the labels assigned, a small increment as in the step up from
"none" to "almost none." Sometimes the next event implies a big
increment as in the step up from "none" to "plenty." Since, in
either case, all that we can do at this stage is to count one more
step, our raw counts, as they stand, are insensitive to any
differing implications of the steps taken. To get at these implied
step sizes we must construct a measuring system based on a
coordinated set of observed counts. This requires a measurement
analysis of the inevitably ordinal observations which always
comprise the initial data in any science.

Even those counts which seem to be useful measures in one
context may not be measures in another[8]. For example, "seconds"
would seem always to be a linear measure of time. But, surprising
as it may seem at first, counting the number of seconds it takes a
patient to walk across a room does not necessarily provide a linear
measure of "patient mobility." For that, the "seconds" counted are
just the raw data from which a measuring system has still to be
constructed. It is naive to believe that a seemingly universal
counter like "seconds," that is so often linear in physics and
commerce, will necessarily also be linear in the measurement of
patient mobility. To construct a linear measure of patient mobility
based on elapsed time we must first count the seconds taken by a
relevant sample of patients of varying mobility to cover a variety
of relevant distances of varying magnitudes. Then we must analyze
these counting data to discover whether a linear measure of
"mobility" can be constructed from them and if so what its relation
to "seconds" may be.

The Step from Observation to Measurement

Realization of the necessity of a progression from counting
observations to measurement is not new. Serious recognition of the
need to transform observations into measures goes back to the turn
of the century. Edward Thorndike[10] called for it 80 years ago.
Louis Thurstone[11] invented techniques which partially solved the
problem in the 1920s. Finally, in 1953, Georg Rasch[7] devised a
complete solution which has since been shown to be not only
sufficient but also necessary for the construction of measures in
any science. The phrase "in any science" is notable here since the
Rasch relationship has been shown to be just as fundamental to the
construction of a surveyor's yardstick as it is to the construction
of other less familiar and more subtle measures.

Rasch's insight into the problem was simple and yet profound.
First, he realized that, to be of any use at all, a measure must
retain its quantitative status, within reason, regardless of the
context in which it occurs. For a yardstick to be useful for
measurement, it must maintain its length calibrations irrespective
of what it is measuring. So too, each test or rating scale item
must maintain its level of difficulty, regardless of who is
responding to it. It also follows that the person measured must
retain the same level of competence or ability regardless of which
particular test items are encountered, so long as whatever items
are used belong to the calibrated set of items which define the
variable under study. The implementation of this essential concept
of invariance or objectivity has been successfully extended in the
past decade to the leniency (or severity) of raters and to the step
structure of rating scales.

Second, Rasch recognized that the outcome of an interaction
between an object-to-be-measured, such as a person, and a
measuring-agent, such as a test item, cannot, in practice, be fully
predetermined but must involve an additional, unavoidably
unpredictable, component. This realization changes the way we can
usefully specify what is supposed to happen when a person responds
to an item from an "absolute" outcome to a "likely" outcome. The
final measuring system requirements become: the more able the
person, the more likely a success on any relevant
item. The more difficult the item, the less likely a
success for any relevant person.

From just these, in retrospect rather obvious, requirements,
Rasch deduced a mathematical model which specifies exactly how to
convert observed counts into linear (and ratio) measures. The model
also specifies how to find out the extent to which any particular
conversion has been successful enough to be useful. This "Rasch"
model has since been demonstrated to be the one and only possible
mathematical formulation for performing this essential
function.
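
For dichotomous data the model is conventionally written so that the log-odds of success equal the person's ability minus the item's difficulty, both expressed in log-odds units ("logits"). The sketch below assumes that standard form. The function names and the numerical values are illustrative and do not come from the text.

```python
import math

def p_success(ability, difficulty):
    """Dichotomous Rasch model: probability of success, arguments in logits."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def log_odds(p):
    """Log-odds corresponding to a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

# Illustrative item difficulties and person abilities (assumed values).
easy_item, hard_item = -1.0, 1.5
less_able, more_able = -0.5, 2.0

# The more able the person, the more likely a success on any item.
print(p_success(more_able, easy_item) > p_success(less_able, easy_item))
# The more difficult the item, the less likely a success for any person.
print(p_success(less_able, hard_item) < p_success(less_able, easy_item))

# Invariance: the log-odds gap between two persons is the same on every item.
gap_on_easy = log_odds(p_success(more_able, easy_item)) - log_odds(p_success(less_able, easy_item))
gap_on_hard = log_odds(p_success(more_able, hard_item)) - log_odds(p_success(less_able, hard_item))
print(round(gap_on_easy, 6), round(gap_on_hard, 6))
```

The last line shows the invariance discussed earlier: the comparison between the two persons does not depend on which item is used to make it.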

Rasch's introduction of his discovery appears in his innovative
1960 book[7]. Detailed, elementary explanations of why, when and
how to apply Rasch's idea to dichotomous (right/wrong, yes/no,
present/absent) data are provided by Wright and Stone[14]. The
extension of this to rating scales and other observations embedded
in ordered categories is developed and explained in Wright and
Masters[13].

This conversion from counts to measures is greatly facilitated
by the use of a computer. Rasch analysis computer programs have been
available since 1965. The two most recent and most versatile are
BIGSCALE[12] and FACETS[4]. These programs analyze the original
data for the possibility of a single latent variable along
which the intended measuring agents, the items, can be calibrated
and the intended objects of measurement, the subjects, can be
measured. The programs then report: 1) the best possible
unidimensional calibrations and measures which these data can
support, 2) the reliabilities of these calibrations and measures in
terms of their standard errors and 3) their internal validities in
terms of detailed fit statistics.
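
To give a feel for what such a program does for one person, here is a minimal sketch of maximum-likelihood estimation of a person measure from a raw score, assuming dichotomous items whose difficulties are already calibrated. It is not the algorithm of BIGSCALE or FACETS. The function names and the item difficulties are hypothetical.

```python
import math

def expected_score(ability, difficulties):
    """Expected raw score on dichotomous items (all values in logits)."""
    return sum(1.0 / (1.0 + math.exp(d - ability)) for d in difficulties)

def estimate_measure(raw_score, difficulties, tol=1e-6, max_iter=100):
    """Return (measure, standard error) in logits for a non-extreme raw score."""
    n = len(difficulties)
    if not 0 < raw_score < n:
        raise ValueError("extreme scores have no finite measure")
    ability = math.log(raw_score / (n - raw_score))  # log-odds starting value
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - ability)) for d in difficulties]
        information = sum(p * (1.0 - p) for p in probs)
        # Newton-Raphson step, limited to one logit per iteration for stability.
        change = (raw_score - sum(probs)) / information
        change = max(-1.0, min(1.0, change))
        ability += change
        if abs(change) < tol:
            break
    probs = [1.0 / (1.0 + math.exp(d - ability)) for d in difficulties]
    information = sum(p * (1.0 - p) for p in probs)
    return ability, 1.0 / math.sqrt(information)

difficulties = [-1.0, -0.5, 0.5, 1.0]  # illustrative item calibrations
for score in (1, 2, 3):
    measure, standard_error = estimate_measure(score, difficulties)
    print(score, round(measure, 2), round(standard_error, 2))
```

The standard error grows as the score moves away from the center of the test, which is one reason measures are reported together with their precision.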

Choosing an Origin

The concept of measurement implies a count of some well-defined
unit from a well-defined starting point usually called "none" or
"zero." This implication can be visualized as a distance between
two points on a line. To be useful, measures must be set up to
begin counting their standard units from some convenient reference
point defined to be their standard origin. The location of this
origin is fundamentally arbitrary, although there are often frames
of reference, or theories, for which a particular position is
especially convenient. Consider temperature. The Celsius,
Fahrenheit and Kelvin scales have different zero points. Each
choice was made for good theoretical reasons. Each has been
convenient for particular applications. But no one of them is
universally superior, despite the exhortations of molecular
thermodynamicists. It is the same for psychometric scales. Each
origin is chosen for the convenience of its users. Should two users
choose different origins, then, as with temperature, it must be a
simple linear operation to transform measures relative to one
origin into measures relative to another, or they are not talking
about the same variable. However intriguing it may be
theoretically, there is no measurement requirement to locate an
absolute point of minimum intensity or to extrapolate a point such
as that of "zero mobility."

A ratio scale does have a clear origin. But that origin is
usually of more theoretical interest than practical utility. It is
a simple arithmetical operation to convert measures from an
interval scale to a ratio scale and vice versa. When interval
scales are exponentiated, their arbitrary origins become the unit
of the resulting ratio scale and their minus infinity becomes this
ratio scale's origin. This mathematical result, by the way, reminds
us that the seemingly unambiguous origins of ratio scales, however
intriguing they may be theoretically, are necessarily unrealizable
abstractions (see also "What is a Ratio Scale").
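
The exponentiation described above can be illustrated numerically. This is a sketch under the assumption that the interval measures are in natural-log units. The function name to_ratio and the values are illustrative.

```python
import math

def to_ratio(interval_measure, origin=0.0):
    """Exponentiate an interval measure taken relative to a chosen origin."""
    return math.exp(interval_measure - origin)

a, b = 1.0, 3.0  # illustrative interval measures

# The arbitrary interval origin becomes the unit of the ratio scale.
print(to_ratio(0.0))

# A difference on the interval scale becomes a ratio on the ratio scale.
print(to_ratio(b) / to_ratio(a), math.exp(b - a))

# Moving the origin rescales the unit but leaves every ratio unchanged.
print(to_ratio(b, origin=2.0) / to_ratio(a, origin=2.0))

# The ratio-scale zero is approached only as the interval measure goes to
# minus infinity; any finite measure maps to a value above zero.
print(to_ratio(-50.0) > 0.0)
```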

The practical convenience of being able to measure length from
some arbitrary origin, like the end of a yardstick, far outweighs
the abstract benefit of measurement from some theoretically
interesting "absolute" origin, such as the center of the universe.
With an interval scale, once it is constructed from relevant
counts, we can always answer questions such as "Is the distance
from 'wheelchair' to 'unaided' more than twice as far as the
distance from 'cane' to 'unaided'?" The convenient origin for this
kind of question is the shared category 'unaided' rather than some
abstract point tagged "complete mobility" or "complete immobility."

Why Treating Raw Scores as Measures Sometimes Seems to Work

In view of the clear difference between counts and measures, why do
regressions and other interval-level statistical analyses of raw
score counts and numerical category labels so often seem to work?
Examples mentioned include Miller's "100 point" scale[6], the LORS-
IIB[6], the FIM[2], and the Barthel Index[2]. This paradox is due
to the monotonic relationship between scores and measures when data
are complete and unedited. This guarantees that correlation analyses
of scores and the measures they may imply will be quite similar.
Further, the relationship between scores and measures is necessarily
ogival because the closed interval between the minimum possible
score and the maximum possible score must be extended to an open
interval of measures from minus infinity to plus infinity. Toward
the center of this ogive the relationship between score and measure
is nearly linear.

But the monotonicity between score and measure holds only when data
are complete, that is, when every subject encounters every item,
and no unacceptably flawed responses have been deleted. This kind
of completeness is inconvenient and virtually impossible to
maintain, since it permits no missing data and prevents tailoring
item difficulties to person abilities. It is also no more necessary
for measurement than it would be to require that all children be
measured with exactly the same particular yardstick before we could
analyze their growth. Further, the approximate linearity between
central scores and their corresponding measures breaks down as
scores approach their extremes and is strongly influenced by the
step structure of the rating scale.
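
The ogival relation between score and measure can be made concrete with a deliberately simplified case. Assume a hypothetical test of 20 dichotomous items that are all equally difficult. The measure implied by a raw score is then just its log-odds, relative to the common item difficulty. The function name and the test length are assumptions made for illustration.

```python
import math

def score_to_logit(score, max_score):
    """Log-odds of a non-extreme raw score on equally difficult items."""
    return math.log(score / (max_score - score))

max_score = 20
for score in (1, 2, 9, 10, 11, 18, 19):
    print(score, round(score_to_logit(score, max_score), 2))

# One more score point near the center versus near the top of the test.
central_step = score_to_logit(11, max_score) - score_to_logit(10, max_score)
extreme_step = score_to_logit(19, max_score) - score_to_logit(18, max_score)
print(round(central_step, 2), round(extreme_step, 2))
```

Equal increments in raw score correspond to nearly equal increments in measure near the center, but to much larger increments near the extremes. Scores of 0 and 20 have no finite measure at all.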

Consequently, as Merbitz and associates warn, it is foolish to
count on raw scores being linear. It is always necessary to verify
that any particular set of raw scores does, in fact, closely
correspond to linear measures before subjecting them to statistical
analysis[2]. Whatever the outcome of such a verification, it is
clearly preferable to convert necessarily nonlinear raw scores to
necessarily linear measures and then to perform the statistical
analyses on these measures.

Unidimensionality

An occasional objection to Rasch measurement is its imposition on
the data of a single underlying unidimensional variable. This
objection is puzzling because unidimensionality is exactly what is
required for measurement. Unidimensionality is an essence of
measurement. In fact the importance of the Rasch model as
the method for constructing measures is due, in part, to
its deduction from the requirement of unidimensionality.

In actual practice, of course, unidimensionality is a
qualitative rather than quantitative concept. No actual test can
ever be perfectly unidimensional. No empirical situation can meet
exactly the requirements for measurement which generate the Rasch
model. This fact of life is encountered by every science. Even
physicists make corrections for unavoidable multidimensionalities
an integral part of their experimental technique. Nevertheless, the
ideal of unidimensional measures must be approximated if
generalizable results are to be obtained.

If a test comprising a mixture of medical and law items is used
to make a single pass/fail decision, then the examination board,
however inadvertently, has decided to use this mixed test as though
it were unidimensional. This is regardless of any qualitative or
quantitative arguments which might "prove" multidimensionality.
Further, their practical decision does not make medicine and law
identical or exchangeable anywhere but in their pass/fail actions.
But their "unidimensional" behavior does testify that they are
making medicine and law exchangeable for these pass/fail decisions.
Unless each test item is to be treated as a test in itself, every
test score is a compromise between the essential ideal of
unidimensionality and the unavoidable exigencies of practice. The
Rasch model fit statistics are there in order to evaluate the
success of that compromise in each instance. It is the
responsibility of test developers and test managers to use these
validity statistics to identify the extent of the compromises they
are making and to minimize their effects on practice.

The pursuit of approximate unidimensionality is undertaken at
two levels. First, the test constructor makes every effort to
produce a useful set of observable categories (rating scales) which
are intended and expected to work together to gather unambiguous
information along a single, useful underlying dimension. Test
items, tasks, observation techniques and other aspects of the
testing situation are organized to realize, as perfectly as
possible, the variable which the test is intended to measure.
Second, the test analyst collects a relevant sample of these
carefully defined observations and evaluates the practical
realization of that intention.

Before observations can be used to support any quantitative
research or substantive decisions, the observations must be
examined to see how well they fit together to define the intended
underlying variable on a linear scale[9]. Rasch provides theory and
technique. But the extent to which a particular set of observations
is in accord with this theory is, indeed, an "empirical matter"[3].
Merbitz and coworkers[5] caution us against blindly accepting any
total score without verifying that its meaning is in accord with
the meanings of the scores on its component items. Assistance in
doing this is provided by fit statistics which report the degree to
which the observations match the specifications necessary for
measurement. Misfitting items can be redesigned. Misfitting
populations can be reassessed. Once the quality of the measures has
been determined, the analyst, test constructor, and examination
board are then, and only then, in a position to make informed
decisions concerning the quantitative significance of their
measures.
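
One common family of fit statistics summarizes the squared differences between observed responses and their model expectations as mean-squares, with values near 1 indicating data that accord with the model. The sketch below computes two such summaries for a single dichotomous item. The names "outfit" and "infit" are the conventional ones for these summaries and are not used in the text above. The abilities, difficulty and responses are invented for illustration.

```python
import math

def p_success(ability, difficulty):
    """Dichotomous Rasch model: probability of success, arguments in logits."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def item_fit(responses, abilities, difficulty):
    """Return (outfit, infit) mean-squares for one dichotomous item."""
    squared_residuals, variances = [], []
    for observed, ability in zip(responses, abilities):
        p = p_success(ability, difficulty)
        squared_residuals.append((observed - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted mean-square.
    infit = sum(squared_residuals) / sum(variances)
    return outfit, infit

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]   # illustrative person measures
orderly = [0, 0, 1, 1, 1]                 # more able persons succeed
contrary = [1, 1, 0, 0, 0]                # more able persons fail

print([round(v, 2) for v in item_fit(orderly, abilities, 0.0)])
print([round(v, 2) for v in item_fit(contrary, abilities, 0.0)])
```

The contrary response pattern yields mean-squares well above 1, flagging an item (or a sample) that needs to be re-examined before its scores are treated as contributing to a measure.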

The process of test evaluation is never finished. Every time we
use our measuring agents, questions, or items to collect new
information from new persons in order to estimate new measures, we
must verify in those new data that the unidimensionality
requirements of our measuring system have once again been
sufficiently well approximated to maintain the quantitative utility
of the measures produced. Whether a particular set of data can be
used to initiate or to continue a unidimensional measuring system
is an empirical question. The only way it can be addressed is to 1)
analyze the relevant data according to a unidimensional measurement
model, 2) find out how well and in what parts these data do conform
to our intentions to measure and, 3) study carefully those parts of
the data which do not conform, and hence cannot be used for
measuring, to see if we can learn from them how to improve our
observations and so better achieve our intentions.

Once interval scale measures have been constructed, it is then
reasonable to proceed with statistical analysis in order to
determine the predictive validity of the measures from a particular
test. We can also then compare the measures produced by different
test instruments, such as the FAS subscales, to see if they are
measures of the same thing, like inches and centimeters, or
different things, like inches and ounces.

Rasch Analysis and the Practice of Measurement

The Rasch measurement model has been successfully applied to
testing in schools since 1965, with large scale implementations in
Portland (OR), Detroit, Chicago and New York. Many medical
specialty boards, including the National Board of Medical
Examiners[9], employ it in their certification examinations. Pilot
research at the Veterans Administration and Marianjoy
Rehabilitation Center has demonstrated that useful measures of the
degree of impairment can be constructed from ratings of the
performance of handicapped individuals. New applications of the
Rasch model are continually emerging; judge-awarded ratings are
currently an area of active interest for the Board of Registry of
the American Society of Clinical Pathologists and for a national
group of occupational therapists centered at the University of
Illinois.

We are grateful to Merbitz and colleagues[5] for raising the
important topic of ordinal scales and inference and so permitting
us to discuss this often misunderstood concept of measurement.