Breaking Down Commonly Used Psychometric Terms: Validity, Reliability and Fairness

Chad W. Buckendahl

A psychometrician is someone involved with psychological measurement. Although they may fulfill different roles related to design, development, analysis, and evaluation, they are partners in ensuring test results are fair and follow the appropriate standards. If you work with a psychometrician, it’s important to understand their language.

This article, the first in a series on the topic, explores the terms psychometricians use most often and explains what they mean for non-psychometric audiences.

Psychometric concepts are often organized around three larger, interrelated concepts: validity, reliability, and fairness.

Within the psychometric community, validity is evaluated based on the intended interpretations and uses of scores. This means that a test itself is neither valid nor invalid; rather, it is how we interpret and use its scores that needs to be supported by evidence. Conceptually, validity is an overarching concept under which multiple sources of evidence are subsumed.

In this graphic, we show validity as an umbrella protecting our program’s sources of evidence supporting intended uses from threats or sources of error. For credentialing programs, some of these important sources of evidence include job-related content, appropriate level of cognitive complexity, reliability of decisions, appropriate passing standards, equating, consistent administration, security, and fairness for all candidates. To illustrate potential threats to validity, the graphic shows dark clouds of unfair treatment of candidates and lightning bolts of inappropriate content and uncertainty of decisions, with raindrops representing additional threats.

The collection of evidence supporting the intended use of scores contributes to validity. Although reliability is one of the sources of evidence needed for validity, some psychometricians discuss it as a separate concept. Conceptually, reliability is an estimate of measurement error that may be derived from different sources, including scores, human scorers, and decisions. These estimates are generally statistical in nature and are interpreted against expected thresholds. Psychometricians often characterize reliability as necessary, but not sufficient, as an indicator of validity. This means that a program can have evidence that its scores or scorers are reliable and still not be hitting the mark.

The following two graphics illustrate this potential disconnect. In the image below, the illustration on the right shows a tight, consistent cluster of arrows that has reached the intended target but landed far from the bullseye. We would say that this is evidence of reliability but, because the arrows missed the bullseye, not of validity. In contrast, the illustration on the left shows the same tight cluster of arrows, but this time the cluster has also hit the bullseye. Credentialing programs need evidence of reliability in combination with validity evidence to support the use of scores.
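For readers curious about what a statistical reliability estimate can look like in practice, here is a minimal sketch. It assumes one common coefficient, Cronbach's alpha, which this article does not name; a given program may rely on a different estimate (for example, one focused on scorers or on pass/fail decisions). The function name and the candidate data are hypothetical and made up purely for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Estimate internal-consistency reliability (Cronbach's alpha).

    `scores` is a list of rows, one per candidate; each row holds that
    candidate's score on every item. This is an illustrative assumption
    about the data layout, not a prescribed format.
    """
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least two candidates and two items")

    # Sample variance of each item (column) across candidates.
    item_variance_sum = sum(variance(column) for column in zip(*scores))

    # Sample variance of each candidate's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: five candidates, four items scored 0 (wrong) or 1 (right).
example_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"Estimated alpha: {cronbach_alpha(example_scores):.2f}")
```

The resulting number would then be compared against whatever threshold the program has adopted. As the archery illustration suggests, a high value on its own says only that the arrows cluster tightly, not that they hit the bullseye.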

Finally, psychometricians often call out evidence of fairness separately as well, because it sits at the intersection of the policies and practices that ensure candidates are treated fairly. Evidence of fairness within a program may come from sources such as judgmental bias review, comparability of test forms, security of examination content, and standardized test administration practices. Because a lack of fairness is a threat to validity, psychometricians contribute to this work; however, because fairness also involves policy and legal considerations, it extends beyond psychometric input alone.

In future articles in this series, we’ll explore a few of these terms in greater detail.