What's this about?

Many researchers study cognitive abilities, personality traits, attitudes,
quality of life, patient satisfaction, and other attributes that cannot be
measured directly. To quantify these types of latent traits,
a researcher often develops an instrument—a
questionnaire or test consisting of binary, ordinal, or categorical items—to
determine individuals' levels of the trait.

Item response theory (IRT) models can be used to evaluate the relationships
between the latent trait of interest and the items intended to measure the
trait. With IRT, we can also determine how the instrument as a whole relates
to the latent trait.

IRT is used when developing new instruments, when analyzing and scoring data
collected from these instruments, when comparing instruments that measure the
same trait, and more.

For instance, when we develop a new instrument, we have a set of items that we
believe to be good measurements of our latent trait. We can use IRT models to
determine whether these items, or a subset, can be combined to form a good
measurement tool. With IRT, we evaluate the amount of information each item
provides. If some items do not provide much information, we may eliminate
them. IRT models also estimate the difficulty of each item, which tells us the
level of the trait the item assesses. We want
items that provide information across the full continuum of the latent trait
scale. We can also ask how much information an instrument, as a whole,
provides for each level of the latent trait. If there are ranges of the
latent trait for which little information is provided, we may add items to the
test.

Let's see it work

Suppose we have a test designed to assess mathematical ability based on eight
questions that are scored 0 (incorrect) or 1 (correct). We fit a
one-parameter logistic (1PL) model, which estimates only the difficulty of each of
our eight items, by typing (assuming the responses are stored in variables
q1 through q8)
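
    . irt 1pl q1-q8    // q1-q8 are the assumed item variables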

Coefficients labeled "Diff" report difficulty. Based on this model, question 8
is the easiest with a coefficient of −2.328. Question 5 is the most difficult
with a coefficient of 1.595.

We have only eight questions in our example. If we had 50 questions, it would
not be as easy to spot those that correspond to a particular difficulty level.
We can use estat report to sort the questions by difficulty.
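
For example, continuing with the model above,

    . estat report, sort(b) byparm

arranges the output by parameter and lists the items in ascending order of
estimated difficulty.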

We can visualize the relationship between questions and mathematical
ability—between items and latent trait—by graphing the item
characteristic curves (ICCs) using irtgraph icc.
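
Continuing with the same fit, we type

    . irtgraph icc

to plot the ICCs for all eight items on one graph.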

We made the easiest question blue and the hardest red. The probability of
succeeding on the easiest question is higher than the probability of succeeding
on any other question. Because a 1PL model constrains all items to share the
same discrimination, the ICCs are parallel and never cross, so this ordering
holds at every level of ability.

irtgraph tif graphs the test information function.
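
For our model, we type

    . irtgraph tif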

The hump in the middle shows that
this test provides the most information for average mathematical
ability levels.

When we have binary items, we can fit a 1PL, 2PL, or 3PL model. The
irt 2pl command fits a 2PL model and allows items to have different
difficulties and different abilities to discriminate between high and low
levels of the latent trait. Visually, differing discriminations mean that
the slopes of the ICCs differ across items. The irt 3pl command
extends the 2PL model to allow for the possibility of guessing correct
answers.
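
As a sketch of how the parameters enter, the 3PL model specifies

\[
\Pr(Y_i = 1 \mid \theta)
  = c_i + (1 - c_i)\,
    \frac{\exp\{a_i(\theta - b_i)\}}{1 + \exp\{a_i(\theta - b_i)\}},
\]

where θ is the latent trait and b_i, a_i, and c_i are the difficulty,
discrimination, and pseudoguessing parameters of item i. The 2PL model sets
c_i = 0, and the 1PL model further constrains every a_i to a common value a.
With our assumed items, the commands are

    . irt 2pl q1-q8
    . irt 3pl q1-q8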

Let's see it work with ordinal items

IRT models can be fit to ordinal and categorical items, too. Here we have
a new test, also with eight questions. Individuals are expected to show their
work as they solve each problem. Responses are scored as 0 (incorrect), 1
(partially correct), or 2 (correct).

With ordinal data, we could fit a graded response model, a partial credit
model, or a rating scale model. These models make different assumptions about
how the ordered scores relate to the latent trait. Here we fit a graded
response model by typing (again assuming items stored as q1 through q8)
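
    . irt grm q1-q8    // graded response model; q1-q8 are assumed variable names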

One way to evaluate how an individual item, say, q3, relates to mathematical
ability is to look at the category characteristic curves produced by irtgraph
icc.
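
For example,

    . irtgraph icc q3

plots one curve for each of the three response categories of q3.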

Respondents with mathematical ability levels below −1.3 are most
likely to give a completely incorrect answer to q3, those with
levels between −1.3 and −0.15 are most likely to give a partially
correct answer, and those with ability levels above −0.15 are most
likely to give a completely correct answer.

From the test characteristic curve produced by irtgraph tcc, we see
how the expected total test score relates to mathematical ability levels.
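
For example, we can add a reference line at average ability:

    . irtgraph tcc, thetalines(0)

The thetalines() option marks the expected score at the specified values of
the latent trait.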

Out of a possible 16 points on the test, a person with above-average
mathematical ability (above 0) is expected to score above 7.94 or, because
observed scores are whole numbers, 8 or higher.

Not interested in standardized testing?

IRT models can be used to measure many types of latent traits. For example,

attitudes

personality traits

health outcomes

quality of life

Use IRT to analyze any unobservable characteristic for which binary,
ordinal, or categorical measurements are observed.