Validity

Validity denotes the meaning of a test score or assessment result. Although historical notions of validity have suggested there are multiple forms of validity, contemporary views consider it to be a unitary construct supported by distinct forms of evidence. Contrary to popular belief, validity is neither obvious nor intuitive. Therefore, the validity of any test or assessment, from teacher-made quizzes, tests, and assignments to published tests and procedures, should be established by collecting relevant forms of evidence so that educators can draw appropriate inferences from assessment results.

Until the twentieth century, the validity of a test or assessment was determined primarily by the content of the assessment process. That is, a test was deemed to reflect mathematical skills if the questions on the test were primarily mathematical in nature; likewise, a test composed of questions about Elizabethan England was viewed as revealing a student's knowledge of history. However, advances in testing and assessment during the early twentieth century challenged these assumptions, in large part because the inferences about what tests were intended to measure changed from direct inferences (e.g., the student knows or does not know trigonometry) to less direct inferences (e.g., the student has strong or weak aptitude for quantitative reasoning). Therefore, the meaning of tests and assessments could not be directly inferred from test content. For example, the answer to the question “Who wrote Romeo and Juliet?” could reflect a student's understanding of a unit on British literature, or it could reflect a student's ability to acquire knowledge incidentally from the environment (especially if the student had not yet had a course in British literature). Content analysis alone cannot determine whether the question measures literary knowledge or student ability.

By the 1980s, assessment experts generally agreed that there were three types of validity: content, construct, and criterion (AERA, APA, NCME, 1985). The meaning of a test result was determined in part by the test's content, but also by evidence that the test result behaved in ways consistent with its theoretical construct (e.g., raw scores on a test of intelligence should increase with age up to the mid- to late teens) and predicted socially valued criteria (e.g., a mechanical aptitude test should predict grades in an industrial arts class). Test users could expect to have all three forms of evidence (content, construct, and criterion) available for published tests, while also recognizing that some forms of evidence were more important for some test uses than others.

However, by the end of the twentieth century, professional standards redefined validity so that there was only one “type” of validity: the meaning of a test score or assessment result (AERA, APA, NCME, 1999). The standards instead outlined five forms of evidence needed to determine test score meaning: (1) content, (2) response processes, (3) internal structure, (4) relationships to other variables, and (5) test consequences. Test developers (and users) are expected to consider the forms of evidence most relevant to determining the meaning of a test result, and to collect and provide relevant evidence to define the meaning of the test score or assessment result. Each of these forms of evidence is described below to help those who use assessment results (e.g., teachers, psychologists, administrators) understand, demand, and collect such evidence.

Content. Primary evidence in this domain includes the specification of the intended content and expert judgment of test items regarding adherence to those specifications. For example, tests of state standards (required by the No Child Left Behind Act of 2001) are validated in part by specifying what students are supposed to know and do in a given domain, the degree to which these skills should be represented in a given test, and the cognitive complexity and type of item to be used to assess the skill. A reading test at third grade might therefore have more items devoted to phonological skills (e.g., grapheme/phoneme correspondence), word attack, and vocabulary than an eighth-grade test of reading, which might have fewer items for phonological and word attack skills, but more items reflecting inferential comprehension. For domains with less clearly specified content (e.g., intelligence), judgments regarding content are more dependent on theory than on exact specification of intended domains. In all cases, judges evaluate the test content against the intended interpretation of the test to evaluate the degree to which the test content represents the intended meaning of the test. Additionally, panels of judges with expertise in diversity (e.g., linguistic, ethnic, gender, religious) may also evaluate the content to identify any content that may be problematic for diverse groups.

Response Processes. The psychological processes test takers use when responding to an assessment are known as response processes. Evidence that a test elicits the processes it intends to measure, and only those processes, is useful for establishing the meaning of the test score. One cannot assume that a test item will elicit the intended process; for example, the item “Use a barometer to calculate the height of a tall apartment building” could elicit use of air pressure to estimate altitude, but it might elicit social knowledge (e.g., “Offer to give the barometer to the building superintendent in return for telling you the height of the building”). Test takers may get an item right (or wrong) without ever using the process intended to be elicited by the item. Forms of evidence supporting response processes are direct (e.g., ask test-takers how they solved items), and indirect (e.g., analysis of eye gaze, brain activity, or error patterns).

Internal Structure. Simply put, tests should behave the way the developers expect them to behave—meaning items purporting to measure the same thing should be more related to each other than items purporting to measure something different. For example, a mathematics test might intend to measure probability and geometry; if so, the items and subscales measuring probability ought to be more related to each other than items and subscales measuring geometry. The primary forms of evidence, then, are internal consistency (items that measure the same thing should correlate with each other), and factor analysis (items should aggregate into common scales, and scales into expected patterns).
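As a concrete sketch (the item scores below are invented for illustration, not data from the source), internal consistency can be estimated with Cronbach's alpha: items that measure the same thing should correlate with each other, which drives the coefficient upward.

```python
# Illustrative sketch (hypothetical item scores): estimating internal
# consistency with Cronbach's alpha. Items tapping the same construct
# should correlate, pushing alpha toward 1.

def cronbach_alpha(items):
    """items: one list of scores per item, aligned by test-taker."""
    k = len(items)                       # number of items
    n = len(items[0])                    # number of test-takers
    mean = lambda xs: sum(xs) / len(xs)
    var = lambda xs: sum((x - mean(xs)) ** 2 for x in xs) / (len(xs) - 1)
    item_variance_sum = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - item_variance_sum / var(totals))

# Three probability items scored 0-5 for five test-takers; the items
# rank the test-takers almost identically, so alpha is high.
prob_items = [
    [4, 2, 5, 1, 3],
    [5, 2, 4, 1, 3],
    [4, 1, 5, 2, 3],
]
alpha = cronbach_alpha(prob_items)
```

With items that barely correlate, alpha falls toward zero, flagging a scale whose items may not measure a single construct.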

Relationships to Other Variables. Assessment results ought to relate to measures outside of the assessment in ways that are consistent with the intended meaning. Evidence is often framed as convergent (i.e., the assessment should relate to things expected to be related) and divergent (i.e., the assessment should not relate to things expected to be unrelated). Forms of convergent evidence include grades given by teachers or scores on tests of similar constructs; a form of divergent evidence would be the proposed independence of creativity from general intelligence (an independence that is rarely observed, calling into question whether creativity is really distinct from general intelligence).
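A small sketch can make the convergent/divergent distinction concrete (all scores below are invented for illustration): a hypothetical math assessment should correlate strongly with math grades, and only weakly with a variable no one expects it to measure.

```python
# Illustrative sketch (all numbers are invented for illustration):
# convergent evidence is a strong correlation with a related measure;
# divergent evidence is a weak correlation with a supposedly distinct one.

def pearson_r(x, y):
    """Pearson correlation between two aligned score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

math_test   = [55, 62, 70, 78, 85, 91]   # hypothetical math assessment
math_grades = [58, 60, 72, 75, 88, 90]   # related criterion (convergent)
shoe_size   = [7, 10, 8, 11, 7, 9]       # unrelated variable (divergent)

r_convergent = pearson_r(math_test, math_grades)  # strong (near +1)
r_divergent  = pearson_r(math_test, shoe_size)    # weak (near 0)
```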

Test Consequences. Assessments are intended to benefit the test-giver and the test-taker. Therefore, test developers (and users) should collect and provide evidence that tests achieve their intended benefits, and avoid unintended consequences. For example, some argue testing mandates in No Child Left Behind benefit students by holding schools accountable for student achievement; critics argue that tests demoralize students and educators. The evidence needed to determine whether benefits are realized, or whether unintended consequences outweigh benefits, is often lacking from tests in educational settings. For example, in their 2005 study Braden and Niebling note that many popular cognitive ability tests claim to help educators match programs and instruction to student needs, but those promoting the tests do not actually provide supporting evidence.

A fundamental assumption is that no one form of evidence is sufficient to establish validity; rather, multiple forms of evidence must be presented and evaluated to support (or reject) an interpretation of a test score. Therefore, test users should obtain and weigh multiple forms of evidence to determine the validity of a test score or assessment result.

There are a number of issues that are particularly relevant to educators and educational settings. The following is a brief selection of critical issues.

Distinguishing Validity from Reliability. Validity refers to the meaning of a test score or assessment result, whereas reliability is the consistency of a score or result. Consistency of assessment results is established over time (i.e., test-retest reliability or “stability”), agreement among test items (internal consistency), and agreement between raters (i.e., inter-rater reliability). A test that is not reliable cannot be valid; therefore, educators should evaluate reliability evidence before they even consider validity evidence. Reliability is most often expressed as a number, which is somewhat analogous to a percentage; reliability indexes of .80 (i.e., 80%) or higher are generally considered to be adequate for use in making educational decisions.
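The reliability check described above can be sketched as a simple gate (the scores below are hypothetical): estimate stability by correlating two administrations of the same test, then compare the result to the .80 rule of thumb before weighing any validity evidence.

```python
# Illustrative sketch (hypothetical scores, not data from the source):
# checking test-retest stability before considering validity evidence,
# using the .80 rule of thumb described in the text.

def pearson_r(x, y):
    """Pearson correlation between two aligned score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Same five students tested twice, two weeks apart:
time1 = [70, 75, 80, 85, 90]
time2 = [72, 74, 82, 83, 91]

stability = pearson_r(time1, time2)   # test-retest reliability
adequate = stability >= 0.80          # meets the common threshold
```

If `adequate` were false, validity evidence would be moot: an unreliable score cannot carry a stable meaning.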

Causes of Test Score Invalidity. In addition to poor reliability, Messick's 1995 study identified two causes of invalidity: construct under-representation and construct-irrelevant variance. Construct under-representation means the test inadequately taps what it intends to measure. For example, many multiple choice tests intend to measure skills ranging from lower-order skills (e.g., recognition, recall) to higher-order skills (e.g., analysis, synthesis, application, evaluation). However, it takes time, energy, and expertise to create multiple choice items that tap higher-order thinking skills, and so many multiple choice tests over-represent lower-order skills and under-represent higher-order skills. If a test purports to measure a broad range of cognitive skills, but has few items tapping evaluation and application, its scores will be invalid because they under-represent higher-order thinking in the domain.

Construct-irrelevant variance occurs when an assessment demands skills it does not intend to measure. For example, most reading comprehension tests intend to measure the test-taker's ability to decode and comprehend written text. However, such tests also demand that the test-taker have the visual acuity to discriminate the letters and words in the text. If the test-taker does not have that ability (e.g., the test-taker has an uncorrected visual impairment), then the test score will not reflect reading comprehension; instead, it will reflect the test-taker's skill in an unintended domain (i.e., visual acuity). When an assessment's demands on a construct-irrelevant skill are high and the test-taker lacks that skill, the resulting score will be invalid.

Test Accommodations. The principles of construct under-representation and construct-irrelevant variance can help identify ways in which tests can (and cannot) be changed to accommodate test takers. Essentially, changes that reduce construct-irrelevant variance without creating construct under-representation are valid accommodations; those that either fail to reduce construct-irrelevant variance, or that reduce construct representation, are invalid accommodations. Returning to the example of a reading comprehension test, enlarging the text print, or allowing the use of eyeglasses, reduces construct-irrelevant variance while maintaining construct representation, and is therefore a valid accommodation. However, reading the text to a person with a visual impairment is invalid, because it reduces construct representation (i.e., it under-represents decoding of text). It is important to note that test standards place the burden of proof for accommodation validity on the party recommending the change; the default assumption is that, in the absence of additional evidence, any changes to standardized assessment procedures and materials reduce the validity of the assessment result.

Assessment Bias. Assuming tests are equally reliable across groups (which is typically true), bias occurs when the meaning of the test score is different for one group relative to another (i.e., the validity is not consistent across groups). To determine if an assessment is biased, multiple forms of evidence must be presented showing different meanings for different groups. Simply looking at test content (e.g., claiming that a test is biased because it presumes cultural knowledge more prevalent in one group than in another) is insufficient to demonstrate bias. Most published assessments used in educational settings are not biased for groups whose native language is English; that is, the test scores represent the same thing for all groups of test-takers across ethnic, gender, and socioeconomic groups. Or to put it another way, tests generally reveal, rather than create, differences between groups. However, the validity evidence for test takers whose native language is not English is often limited, and less consistent, leaving open the possibility that assessment results have different meanings for native English versus non-English speakers. Simply noting that one group has lower scores than another is not in itself evidence of bias (i.e., groups can, and often do, differ).

Constructed versus Selected Response Assessments. Many educators and advocacy groups (e.g., FairTest) are critical of selected-response or multiple choice tests. Critics contend that multiple choice tests emphasize low-level cognitive skills. Advocates of selected-response tests argue such claims are not supported by validity evidence, and they argue selected-response tests are reliable and cost effective. The bulk of evidence across the five validity domains suggests carefully developed assessments can overcome most limitations associated with item formats. For example, development of rubrics and rater training can produce adequate reliability in constructed-response assessments (e.g., performances, portfolios), and carefully constructed selected-response tests can elicit higher-order cognitive processes. However, the greater costs of administering, scoring, and storing constructed-response assessments are not offset by better validity evidence, so most educational assessment programs primarily use selected-response assessments.

Teacher-developed Assessments. Most teachers have limited assessment training and even less time to develop and evaluate their own tests. Yet teachers routinely develop, administer, and score tests without considering validity issues. Some practical steps teachers can take to ensure student test scores reflect intended meanings include the following:

Develop student assessments before developing instructional units. By deciding in advance what students should know and do, teachers ensure content validity (e.g., breadth and depth of coverage), and can also align their instruction to ensure students learn the intended materials.

Specify assessment content and format with an item specification table. For example, a reading test might target 10% phonemic awareness, 20% alphabetic principle/phonics, 30% fluency, 15% vocabulary, and 25% comprehension distributed so that 50% of items are selected-response, 30% are fill-in-the-blank, and 20% are short answer. The resulting two-dimensional table helps teachers ensure content coverage is distributed as they intended, and also helps ensure that methods of assessment are distributed across content domains.
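The specification table above can be sketched programmatically; the content and format weights are those in the example, while the 40-item test length is an assumption for illustration.

```python
# Illustrative sketch: building the two-dimensional item specification
# table from the reading-test example. Weights come from the text; the
# 40-item total is a hypothetical choice for illustration.

content = {                 # share of items per content domain
    "phonemic awareness": 0.10,
    "phonics": 0.20,
    "fluency": 0.30,
    "vocabulary": 0.15,
    "comprehension": 0.25,
}
formats = {                 # share of items per assessment format
    "selected-response": 0.50,
    "fill-in-the-blank": 0.30,
    "short answer": 0.20,
}

def spec_table(total_items):
    """Allocate item counts by crossing content and format weights."""
    return {
        (domain, fmt): round(total_items * c_w * f_w)
        for domain, c_w in content.items()
        for fmt, f_w in formats.items()
    }

table = spec_table(40)
# e.g., comprehension x selected-response: 40 * 0.25 * 0.50 = 5 items
comprehension_sr = table[("comprehension", "selected-response")]
```

Each cell tells the teacher how many items of a given format to write for a given domain, making it easy to verify that both dimensions are distributed as intended.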

Provide multiple opportunities for students to demonstrate knowledge. Large-scale projects or final exams with a single score are less likely to be reliable, and therefore less likely to be valid, than a series of smaller, more frequent assessments.

Look for evidence that test scores might not convey intended meanings. Even tests developed by experts have unexpected problems; teacher-made tests are no exception. Teachers should look for evidence of unexpected outcomes (e.g., students who do well on quizzes do poorly on the unit test; items that nobody passes; evidence that groups differ in unusual ways on particular tests or items), and adjust tests if they conclude that the test did not accurately reflect its intended meaning.

Weighing the Evidence. Test makers and users want as much evidence as possible to understand what test scores mean. However, collecting and reporting evidence costs time and money, and so evidence is often incomplete. Test users must therefore weigh evidence to evaluate the degree to which it supports the validity, or intended meanings, of the test. Test users should invoke a four-step process to weigh evidence: (1) identify what the test purports to measure; (2) evaluate the reliability of the test scores; (3) identify the forms of validity evidence most relevant to the intended use; and (4) judge the degree to which the available evidence supports the intended interpretation.

For example, suppose a test purports to measure mathematical knowledge and skills (step 1). Test users should decide what forms of consistency are most important (e.g., inter-rater agreement, stability), and then determine whether the test scores have reliability values greater than or equal to 0.80 (step 2). If the test has reasonable evidence of reliability, users should decide which forms of evidence are most salient. Achievement tests should have substantial evidence of test content (e.g., items should represent domains identified by the National Council of Teachers of Mathematics), internal structure (e.g., items purporting to measure common subdomains actually relate to each other), and relationships to other variables (e.g., the test correlates with other math tests). If the test also claims intended consequences (e.g., it will help plan educational interventions), then evidence showing how test scores enhance intervention selection, implementation, or outcomes should be provided. It should be noted that some domains (e.g., content) are more important than other domains (e.g., response processes). Finally, users should assess the evidence provided to judge the degree to which evidence supports the claims. Test reviews can be helpful, but test standards mandate that test users are personally responsible for ensuring that the tests they use have appropriate supporting evidence.

BIBLIOGRAPHY

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing (2nd ed.). Washington, DC: American Psychological Association.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing (3rd ed.). Washington, DC: American Educational Research Association.