Part I: Math Online Executive Summary

The Math Online (MOL) study is one of three field investigations in the National Assessment of Educational Progress (NAEP) Technology-Based Assessment Project, which explores the use of new technology in administering NAEP. The MOL study addresses issues related to measurement, equity, efficiency, and operations in online mathematics assessment. The other two studies focus on the use of computers in assessing writing and problem solving.

In the MOL study, data were collected in spring 2001 from more than 100 schools at each of two grade levels. Over 1,000 students at grade 4 and 1,000 at grade 8 took a test on a computer, either via the World Wide Web or on laptop computers brought into schools. At both grades 4 and 8, the study collected background data concerning students’ access to computers, use of them, and attitudes toward them. In addition, students were administered hands-on exercises designed to measure input skill.

Over 2,700 students at grade 8 took comparable paper-and-pencil tests. The students taking paper-and-pencil tests were assigned randomly to one of three forms. One paper-and-pencil form, which presented the same items as the grade 8 computer-based test, provides the main comparisons for the effect of computer delivery vs. paper delivery. The other two paper-and-pencil forms were used to study psychometric questions related to the automatic generation of test items.

A priori and empirical analyses were performed to explore the implications of technology-based assessment for measurement, equity, efficiency, and operations. A review of findings in these categories follows.

Measurement

In general, eighth-grade NAEP mathematics items appear suitable for computer delivery. Content review of the questions from the 2000 mathematics assessment suggested that most questions could be computer-delivered with no or only moderate difficulty.

At grade 8, mean scale scores on the computerized test were about 4 points lower than on the paper version, a statistically significant difference.

At the item level, there was a mean difficulty difference of .05 on the proportion-correct scale between the computer and paper tests, meaning that on average 5 percent more students answered the items correctly on paper than on computer. Also, on average, the differences appeared to be larger for constructed-response items than for multiple-choice questions.
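The proportion-correct comparison above can be sketched as follows. This is a minimal illustration with invented per-item proportions, not the study's actual data:

```python
# Illustrative sketch (hypothetical data): comparing item difficulty across
# delivery modes on the proportion-correct scale. Each list holds, per item,
# the proportion of students answering that item correctly.

paper_p = [0.72, 0.55, 0.63, 0.48, 0.81]      # hypothetical paper p-values
computer_p = [0.66, 0.51, 0.58, 0.42, 0.77]   # hypothetical computer p-values

# Mean mode effect: positive values mean items were easier on paper.
diffs = [p - c for p, c in zip(paper_p, computer_p)]
mean_diff = sum(diffs) / len(diffs)
print(round(mean_diff, 3))  # a value of 0.05 means 5 percent more correct on paper
```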

Equity

At grade 8, no significant difference in performance on the computer test vs. the paper test was detected for the NAEP reporting groups examined (gender, race/ethnicity, parents’ education level, region of the country, school location, and school type), except for students reporting that at least one parent graduated from college. These students performed better on paper than on computer tests.

Background data suggest that the majority of fourth- and eighth-grade students have some familiarity with using a computer. For example, 85 percent of fourth-graders and 88 percent of eighth-graders reported that they use a computer at home.

Use of computers by students at school also appears to be common. Eighty-six percent of fourth-graders and 80 percent of eighth-graders reported using a computer at school.

To explore the possibility that, for some students, lack of computer familiarity impeded online test performance, both self-reported and hands-on indicators of computer familiarity were used to predict online test performance. At both grades, results suggested that performance on computer-delivered mathematics tests depended in part on how familiar a student was with computers.

Efficiency

On the basis of a content analysis, about three-quarters of the items used on the NAEP 2000 mathematics assessment appear amenable to automatic generation. Geometry and Spatial Sense was the only framework content area for which the majority of the items could not be automatically generated.

The degree to which the item-parameter estimates from one automatically generated item could be used for related automatically generated items was also investigated. Results suggested that, while the item-parameter estimates varied more than would be expected from chance alone, this added variation would have no statistically significant impact on NAEP scale scores.

Eight of the nine constructed-response items included in the computer test at each grade were scored automatically. For both grades, the automated scores for the items requiring simple numeric entry or short text responses generally agreed as highly with the grades assigned by two human raters as the raters agreed with each other. Questions requiring more extended text entry were also scored automatically, but with less agreement with the human raters' grades.

Based on an analysis of typical test development cycles, it is estimated that moving NAEP assessments to the computer would not have any significant short-term effect on the pilot stage of the NAEP development cycle but could possibly shorten the operational stage somewhat by requiring fewer steps.

Operations

Although most tests were administered via laptop computers brought into schools by NAEP administrators (80 percent of students at fourth grade and 62 percent at eighth grade), a portion of schools tested some or all of their students via the Web (25 percent of the schools at grade 4 and 46 percent of schools at grade 8).

Most administrations went smoothly, but technical problems caused some tests to be interrupted. Interrupted test sessions were associated with lower test scores by a statistically significant, but small, amount.

Perhaps due in part to experiencing more frequent technical problems, eighth-grade students taking tests on NAEP laptops scored significantly lower than those taking tests on school computers, thereby contributing to the lack of comparability found between computer and paper tests.

Implications of Findings

The authors believe that these findings have several implications for NAEP:

Most NAEP mathematics items could be computer delivered, arguably improving the measurement of some content areas specified by the mathematics framework. At the same time, conventional delivery may be needed for other items, especially those that require the manipulation of a real (as opposed to a simulated) physical object.

Although the computerized test was somewhat more difficult than its paper counterpart for the population as a whole, it may be possible in future assessments to put tests given in the two modes on the same scale by administering a subset of common items in each mode to different randomly assigned groups of students.
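The common-item approach described above can be illustrated with a simple mean-sigma linking. The sketch below assumes a linear linking method and invented scores; the study does not specify which linking procedure would be used:

```python
# Hypothetical sketch of mean-sigma linking: using a set of common items
# administered in both modes to place computer-mode scores on the paper
# scale. All numbers are invented for illustration.
import statistics

paper_common = [281, 295, 260, 310, 274]     # hypothetical scores, paper group
computer_common = [277, 290, 256, 305, 270]  # hypothetical scores, computer group

# Linear transformation matching the computer group's mean and SD on the
# common items to the paper group's mean and SD.
a = statistics.stdev(paper_common) / statistics.stdev(computer_common)
b = statistics.mean(paper_common) - a * statistics.mean(computer_common)

def to_paper_scale(computer_score):
    """Map a computer-mode score onto the paper scale."""
    return a * computer_score + b
```

Because the two groups are randomly assigned, any difference on the common items is attributed to delivery mode, and the transformation absorbs it.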

Even though most students reported some familiarity with technology, differences in computer proficiency may introduce irrelevant variance into performance on NAEP mathematics test items presented on computer, particularly on tests containing constructed-response items. For the near term, NAEP should exercise particular caution in delivering computer mathematics tests, especially when they include constructed-response items or where students have limited experience with technology.

In the not-too-distant future, constructed-response mathematics tests may be feasible as keyboarding skills become pervasive, improved computer interfaces offer simpler means of interaction, and designers become more proficient in their renditions of open-ended items. When that occurs, automated scoring may help reduce NAEP’s costs, increase speed of reporting, and improve scoring consistency across trend years.

Automatic item generation might help to increase NAEP’s efficiency, security, and depth of content coverage. Item variants could offer the opportunity to cover framework content areas more comprehensively, permit generation of precalibrated replacements for questions that have been disclosed, and allow the creation of item blocks as the assessment is administered.

NAEP should expect the transition and near-term operating costs for electronic assessment to be substantial. However, the program may still need to deliver some assessments via computer despite higher cost. As students do more of their academic work on computers, NAEP may find it increasingly hard to justify documenting their achievement with paper tests.

For the foreseeable future, occasional equipment problems and difficulties with internet connectivity are likely to cause interruptions in testing for some students or for some schools. Options for dealing with these events include discarding the data and reducing the representativeness of samples, retaining the data and possibly introducing bias into results, or conducting make-up sessions that could add considerable expense for NAEP.

School technology infrastructures may not yet be advanced enough for national assessments to be delivered exclusively via the Web to school computers. However, if assessment blocks are initially composed solely of multiple-choice items and short constructed-response items, with more complex constructed-response questions left for paper blocks, web delivery may be possible for most schools.

Future research should examine several factors related to irrelevant variation in online test scores. These factors include the impact of using laptop vs. school computers, the effectiveness of methods that attempt to compensate for differences in the operating characteristics of school machines, the effect of test interruptions on performance and comparability, the impact of constructed-response questions requiring different degrees of keyboard activity, the extent to which repeated exposure to tutorials and online practice tests might reduce variation in performance due to computer familiarity, and the impact of typed vs. handwritten responses on human grading.

Part II: Writing Online Executive Summary

The 2002 Writing Online (WOL) study is the second of three field investigations in the Technology-Based Assessment project, which explores the use of new technology in administering the National Assessment of Educational Progress (NAEP).1 The study addresses issues related to measurement, equity, efficiency, and operations in a computer-based writing assessment.

This report describes the results of testing a national sample of eighth-grade students on computer. The WOL study was administered to students on school computers via the World Wide Web or on NAEP laptop computers brought into schools. Both writing tasks (herein referred to as “essays”) used in the WOL study were taken from the existing main NAEP writing assessment and were originally developed for paper administration.

During April and May 2002, data were collected from more than 1,300 students in about 160 schools. Student performance on WOL was compared to that of a national sample that took the main NAEP paper-and-pencil writing assessment between January and March 2002. For the samples taking WOL, background information concerning access to, use of, and attitudes toward computers was also collected. In addition, exercises designed to measure computer skills were administered. Results are considered to be statistically significant if the probability of obtaining them by chance alone does not exceed the .05 level.

Measurement

Performance on computer versus a paper test was measured in terms of essay score, essay length, and the frequency of valid responses. Results showed no significant difference in essay scores or essay length between the two delivery modes. However, for the second of the two essays in the test, delivery mode did significantly predict response rate, with roughly 1 percent more students responding to the test on paper than on computer.

Equity

Performance on paper and computer versions of the same test was evaluated separately for the categories of gender, race/ethnicity, parents’ education level, school location, eligibility for free/reduced-price school lunch, and school type. With one exception, there were no significant differences for the NAEP reporting groups examined between the scores of students who wrote their essays on paper and those who responded on computer. The exception was for students from urban fringe/large town locations, who performed higher on paper than on computer tests by about 0.15 standard deviation units.

The effect of delivery mode on performance was also evaluated for gender groups in terms of response length and frequency of valid responses. For the second essay, males wrote significantly fewer words on paper than on computer. Also for that second essay, a significantly higher percentage of females responded on paper than on computer. The difference in percent responding was about 2 percentage points.

The impact of assignment to a NAEP laptop versus a school computer was evaluated in two analyses. Results from the two analyses were not completely consistent. In an experimental substudy in which a small number of students were randomly assigned to computer type, those who took the test on NAEP laptops scored significantly lower than students taking the test on school computers, but for only one of the two essays. In a quasi-experimental analysis with larger sample sizes, however, only female students performed significantly lower on the NAEP laptops, but this group did so for both essays.

To determine if computer familiarity affected online test performance, students’ self-reported computer experience and hands-on measures of keyboarding skill were used to predict online writing performance, after controlling for their paper writing score. Hands-on skill was significantly related to online writing assessment performance, such that students with greater hands-on skill achieved higher WOL scores when holding constant their performance on a paper-and-pencil writing test. Computer familiarity added about 11 percentage points over paper writing score to the prediction of WOL performance.

Efficiency

With respect to timeliness, it is anticipated that delivering assessments via computer would not have any significant short-term effect on the pilot stage of the NAEP assessment cycle, but could possibly shorten the operational stage appreciably by requiring fewer steps.

Assuming similar levels of effort for current NAEP writing assessments, the costs for an online test should be similar for test development, similar or higher for assessment delivery and administration, and similar or lower for scoring.

Results showed that the automated scoring of essay responses did not agree with the scores awarded by human readers. The automated scoring produced mean scores that were significantly higher than the mean scores awarded by human readers. Second, the automated scores agreed less frequently with the readers in level than the readers agreed with each other. Finally, the automated scores agreed less with the readers in rank order than the readers agreed with one another.

Operations

Because the WOL delivery software supported only the Windows operating system and required broadband connections that were not available at some schools, 65 percent of students (and 59 percent of schools) were tested on laptop computers provided by NAEP administrators. The remainder were tested on school computers via the Web. Both web and laptop administrations ran very smoothly, with only minimal problems overall and almost no problems with computer hardware.

Implications of Findings

The authors believe these results have important implications for NAEP:

Aggregated scores from writing tests taken on computer do not appear to be measurably different from ones taken on paper for the eighth-grade population as a whole, as well as for all but one of the NAEP reporting groups examined.

Scores for individual students may not be comparable, however. Even after controlling for their level of paper writing skill, students with more hands-on computer facility appear to get higher scores on WOL than do students with lower levels of keyboard proficiency.

Because scores for individuals on paper and computer writing tests do not appear to be comparable, relationships of certain demographic variables to writing proficiency may change, depending upon the mode in which that proficiency is measured.

NAEP should expect the transition and near-term costs for conducting an electronic writing assessment to be considerable. NAEP will likely need to supplement web delivery by bringing laptop computers into some schools.

Delivering writing assessments on computer may allow responses to be automatically scored, which could help NAEP reduce costs and speed up reporting. Although automated scores did not agree highly enough with the scores awarded by human readers to consider the two types of scoring interchangeable, this technology has been found to work effectively in some studies, is evolving rapidly, and may soon be ready for operational use.

Future research should address the generalizability of this study’s findings to other grades and other types of essay tasks, and investigate the impact of differences in equipment configuration on NAEP population estimates. Finally, in this study, WOL readers scored student responses with lower levels of agreement than did the main NAEP readers. Future research should attempt to more effectively minimize differences in reader reliability across modes that can potentially affect the precision of scores and the meaning of results.

1The initial project in the series was the 2001 Math Online study, an investigation of the implications of delivering NAEP mathematics assessments on computer. The third project in the series is the 2003 Problem Solving in Technology-Rich Environments study, an investigation of how computers might be used to measure skills that cannot be measured in a paper test.
