Validity and
Utility of Computer-Based Test Interpretation

James N. Butcher, Julia N. Perry, and Mera M. Atlis
Department of Psychology, University of Minnesota, Twin Cities Campus

ABSTRACT

Computers have been important to applied psychology since their introduction,
and in recent decades the application of computerized methods has broadened in
both scope and depth. This
article explores the most recent uses of computer-based assessment methods and
examines their validity. The comparability between computer-administered tests
and their pencil-and-paper counterparts is discussed. Basic decision making in
psychiatric screening, personality assessment, neuropsychology, and personnel
psychology is also investigated. Studies on the accuracy of computerized
narrative reports in personality assessment and psychiatric screening are then
summarized. Research thus far appears to indicate that computer-generated
reports should be viewed as valuable adjuncts to, rather than substitutes for,
clinical judgment. Additional studies are needed to support broadened
computer-based test usage.

How can such a complex human cognitive
process as test interpretation be performed by an electronic device? A
theoretical rationale for computerized evaluation was provided by the introduction
of the concept of actuarial interpretation. In 1954,
Meehl published a monograph in which he debated the merits of
actuarial/statistical (objective) decision-making methods versus clinical
(subjective) ones. His analysis led to the conclusion that decisions based on
objectively derived information ultimately were more valid than judgments
rooted in decision-making rules that were derived and applied subjectively. As
a result of this and his subsequent collaborative works (Dawes, Faust, &
Meehl, 1989; Grove & Meehl, 1996), objective assessment procedures generally
are considered to be equal or superior to subjective methods. A recent
meta-analysis of 136 studies by Grove, Zald, Lebow, Snitz, and Nelson (2000;
this issue) concluded that statistical prediction was, on average, about 10%
more accurate than clinical prediction.

Despite their common roots and comparable rationales, however, actuarial
assessment and computerized assessment are not strictly equivalent
concepts. Therefore, establishing the validity of the former does not mean that
the veracity of the latter can be assumed in all cases. A computer-based test
interpretation (CBTI) system is actuarial only if its interpretive output is
wholly determined by statistical rules that have been demonstrated empirically
to exist between the output and the input data ( Sines, 1966 ).
Other types of CBTI systems, which are not actuarial in nature, draw
conclusions on the basis of work of a clinician who creates interpretations
using published research and clinical experience (Wiggins, 1973). Because of
such distinctions, the validity of each type of computerized
assessment must be examined in its own right, particularly in light of
computerized procedures' standard use by modern-day clinicians and the degree
to which the clinicians' results are relied on to make crucial health care
decisions.

In 1984, the American Psychological Association Committee on
Professional Standards cautioned psychologists who used interpretive
reports in business and school settings against using narrative summaries in
the absence of adequate data to validate their accuracy. This caveat is
important for practitioners in all areas of psychology, and it necessitates
exploration of the validity of computerized assessment procedures. In addition
to examining computer-based
reports, this review also investigates computer interviews that have been used
to make diagnoses.

Previous reviews of research in this area have pointed to the limitations of
computer-based interpretations ( Maddux & Johnson, 1998 ).
For example, Moreland (1985) reviewed validity studies of
computer-based interpretations and pointed to the fact that conclusions drawn
from the research must be evaluated in light of several problems. Among those
he discussed were the following: small sample sizes; inadequate external
criterion measures with which to compare the CBTI statements; a lack of
information regarding the reports' base-rate accuracy; failure to assess the
ratings' reliability across time or across raters; failure to investigate the
internal consistency of the reports' interpretations; and issues pertaining to
the report raters, such as lack of familiarity with the interpretive system
used, lack of expertise in the area of interest, and possible bias secondary to
the theoretical orientation held by the rater. For precisely these types of
reasons, Matarazzo cautioned that computerized testing must be subjected to the
"checks and balances of democratic society" ( 1986 ,
p. 14) in order to preserve the integrity of psychological assessment. He also
provided several key criticisms of his own with respect to computerized test
interpretations, including the interpretations' failure to recognize test
takers' uniqueness, the tendency for the interpretations to be unsigned
(ostensibly leaving no one directly accountable for the interpretations'
contents), and the tendency to be viewed as an end rather than as means to an
end. Even more recently, Snyder, Widiger, and Hoover (1990) voiced
their own concerns with CBTIs, concluding that "[w]hat the [CBTI]
literature lacks (and desperately needs) are rigorously controlled experimental
studies examining methodological issues in CBTI validation per se" and
recommending specifically that future studies "include representative
samples of both [CBTI] consumers and test respondents" and "use
characteristics of each as moderator variables in analyzing reports'
generalizability" (p. 476).

Throughout the past decade, the areas that
computer programs have addressed have been as varied as the individuals using
them. At the most fundamental level, research can examine the validity of
computerized tests in general and the more specific issue of the tests'
equivalence to paper-and-pencil measures. Research on this topic is
particularly important because, as noted by Hofer (1985) ,
there are several factors specific to computerized test administration that
could be instrumental in yielding noncomparable results. Foremost among these
may be some individuals' discomfort with computers and consequent awkwardness
when dealing with them. In addition, there are such factors as the type of
equipment used and the nature of the test material. Regarding the latter, Hofer
noted that when item content deals with sensitive and personal information,
respondents may be more willing to reveal their true feelings to a computer
than to a human being, which may lead to atypical results when computerized
assessment is used. Computer administration has other advantages over
traditional methods as well. These include potential time savings, elimination
of the possibility of respondents making errors while filling out handwritten
answer sheets, and elimination of the possibility of clinicians and technicians
making errors while hand-scoring items ( Allard, Butler, Faust,
& Shea, 1995 ). However, there are disadvantages as well. Chief among
the disadvantages is the possibility of excessive generality of results and the
high potential for the results' misuse because of their increased availability
( Butcher, 1987 ). Whether the distinctions between
computerized and traditional assessment methods seem positive or negative,
however, they leave room for questions regarding comparability and, ultimately,
validity. Below, we review studies comparing computer-administered tests with
paper-and-pencil tests or other traditional methods of data collection.

Psychiatric Screening

In recent years, computerized assessment frequently has been used for
psychiatric screening. This is in part due to research that demonstrates mental
health clients' comfort with (and even preference for) automated assessments
(e.g., Hile & Adkins, 1997 ). Because automated
procedures have become common components of psychiatric assessment batteries,
their validity requires careful evaluation.

Several recent studies have investigated the accuracy with which
computerized assessment programs were able to diagnose the presence of
behavioral problems. In one, Ross, Swinson, Larkin, and Doumani
(1994) examined the Computerized Diagnostic Interview Schedule (C-DIS). In
their study, 173 substance abusers were assessed using both the C-DIS and a
clinician-administered Structured Clinical Interview for the Diagnostic and
Statistical Manual of Mental Disorders (3rd ed., rev.; DSM—III—R; American
Psychiatric Association, 1987). The diagnostic agreement between the two
instruments was then examined. Results showed that, with the exceptions of
substance abuse disorders and antisocial personality disorder, levels of
agreement (assessed using kappa coefficients) between the C-DIS and the
Structured Clinical Interview for DSM—III—R were poor. For the alcohol,
cocaine, and opiate diagnoses, for example, the kappa values were .71, .75,
and .81, respectively. In addition, the C-DIS was able to rule out the
presence of comorbid disorders in the sample with approximately 90% accuracy.
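
To make the agreement statistic reported throughout these studies concrete, the following is a minimal sketch (in Python, with invented diagnoses) of Cohen's kappa, the chance-corrected index of agreement between two sets of categorical judgments such as computer-generated and clinician-generated diagnoses. It is illustrative only and is not taken from any of the studies reviewed.

```python
# Minimal sketch of Cohen's kappa; the diagnoses below are hypothetical.

def cohens_kappa(rater_a, rater_b):
    """Compute Cohen's kappa for two equal-length lists of categorical judgments."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)

    # Observed proportion of agreement.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal rates.
    p_expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: presence or absence of a diagnosis according to an
# automated interview versus a clinician-administered interview.
computer  = ["present", "present", "absent", "absent", "present", "absent"]
clinician = ["present", "absent",  "absent", "absent", "present", "absent"]
print(round(cohens_kappa(computer, clinician), 2))  # prints 0.67
```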

Using a somewhat more complex design, Jemelka, Wiegand,
Walker, and Trupin (1992) administered a brief mental health and personal
history interview, the group form of the Minnesota Multiphasic Personality
Inventory (MMPI), the Revised Beta IQ Examination, the Suicide Probability
Scale, the Buss—Durkee Hostility Inventory, the Monroe Dyscontrol Scale, and
the Veteran's Alcohol Screening Test to 100 incarcerated male felony offenders.
Algorithms from a CBTI system were then used to assign rankings to each
participant, on the basis of potential for violence, substance abuse, suicide,
and victimization. The algorithms were also used to identify the presence of
clinical diagnoses on the basis of the DSM—III—R. Clinical
interviewers then rated the members of the sample on the same five dimensions.
An additional 109 participants were administered eight sections of the
Diagnostic Interview Schedule (DIS). Agreement between the CBTI ratings and the
clinician ratings was fair, and the CBTI's algorithms for violence and suicide
potential were judged to be too sensitive. There was high agreement between CBTI-
and clinician-diagnosed DSM—III—R disorders, with an overall concordance
rate of 82%. The researchers suggested that a more accurate prediction could be
achieved using algorithms that produced continuous scale scores rather than
Likert-type ratings.

Farrell, Camplair, and McCullough (1987) evaluated the
ability of a computerized interview to identify the presence of a group of
target complaints. A face-to-face, unstructured intake interview and the
interview component of a computerized mental health information system, the
Computerized Assessment System for Psychotherapy Evaluation and Research, were
both administered to 103 adult clients requesting outpatient psychological
treatment. Response formats included Likert-type scales, behavioral
frequencies, and multiple-choice options. Paper-and-pencil assessment measures
were also completed by the participants. Results showed that there was fairly
poor agreement (mean r = .33) between target complaints as reported by
clients on computer and as identified by therapists in interviews. However, the
complaints identified in the computerized interview correlated adequately with
measures of global functioning, with 9 of the 15 correlations significant at
the .0009 level; the correlations themselves ranged from .04 to .54.

Personality Assessment

A number of studies related to the assessment of personality focused on
evaluating the comparability of computer and booklet administrations of various
instruments. For example, a study by F. R. Wilson, Genco, and
Yager (1985) used a test-attitudes screening instrument as representative of
comparable assessment instruments administered by computer. Ninety-eight
female college freshmen were administered the Test Attitude
Battery (TAB), which came in both paper-and-pencil and computer-administered
formats. They were also administered the Test Anxiety Questionnaire, whose sole
format was paper and pencil. Each participant was assigned to one of three
conditions, which corresponded with the three possible pairings of the three
instrument formats, and the subsequent order of test administration was
counterbalanced. Means, variances, and correlations with construct validating
variables were comparable for paper-and-pencil and computerized test versions,
demonstrating that the two versions satisfied criteria for serving as
alternative forms and could also be considered parallel forms. Correlations
between the paper-and-pencil and computer-administered versions of the TAB
were all significant at the .001 level and ranged from .71 to .95. Despite
these findings,
however, the authors advised that caution be exercised in generalizing the
results too broadly beyond the scope of their study, particularly with regard
to clinical samples and less straightforward measures.

Much of the research investigating the comparability of computer-based
personality assessment measures has used the MMPI or the restandardized MMPI-2.
For example, several studies examined possible differences between
paper-and-pencil and computerized testing formats. The findings of Lambert,
Andrews, Rylee, and Skinner (1987), Schuldberg (1988), and Watson, Juba,
Anderson, and Manifold (1990) all
indicate that the differences are generally few and small in magnitude, leading
to between-forms correlations of .68 to .94 ( Watson et al.,
1990 ). Other researchers have obtained close ( Sukigara,
1996 ) or "near perfect" (i.e., 92% to 97%) agreement between
computer and booklet administrations ( Pinsoneault, 1996 ). Honaker, Harrell, and Buffaloe (1988) investigated the
equivalency of Microtest computer MMPI administration among 80 community
volunteers. They found no significant differences between various computer
formats for means and standard deviations for validity, clinical, and 27
additional scales. However, as in a number of studies investigating the
equivalency of computer and booklet forms of the MMPI, their statistical
analyses lacked the power to provide conclusive evidence regarding the
equivalency of the paper-and-pencil and computerized administration formats
(Honaker et al., 1988).

Most recently, a comprehensive meta-analysis on the topic of psychometric
comparability between computerized and standard MMPI testing formats was
conducted by Finger and Ones (1999), who included all research available to
them. They meta-analyzed the results of 14 studies conducted between 1974 and
1996, all of which reported on the use of computerized formats of the MMPI or
MMPI-2. Across studies, the differences in T-score means and standard
deviations between test formats were negligible. Corrected mean cross-form
correlations were consistently near 1.00. On the basis
of these findings, Finger and Ones (1999) concluded that
" ... the evidence is strong that there is little effect on MMPI scale
scores when the test is administered via computer as opposed to when the test
is administered via booklet" (p. 19).

In addition to providing a viable alternative format to traditional
paper-and-pencil testing, computerized methods offer the option of adapting the
test to suit the needs and preferences of the test taker. The comparability and
validity of this method (known as computerized adaptive testing) were examined
by Roper, Ben-Porath, and Butcher (1995) in their study of
the MMPI-2. Five hundred seventy-one undergraduate psychology students were
administered versions of the MMPI-2 drawn from three formats: a booklet
version, an adaptive computerized version, and a conventional computerized
version. Each participant either took the same format twice, took the booklet
and adaptive computerized versions (in counterbalanced order), or took the
conventional and adaptive computerized versions (again, in counterbalanced
order). There were few statistically significant differences
in the resulting mean scale scores between the booklet and adaptive
computerized formats. Men scored significantly lower on the adaptive
computerized version on clinical Scale 0 (Social Introversion) and on the
content scales Bizarre Mentation (BIZ) and Depression (DEP). Women scored
significantly lower on the adaptive computerized version on clinical Scale 0
(Social Introversion) and the content scales Fears (FRS) and Family Problems
(FAM). They scored significantly higher on the adaptive computerized version
on clinical Scale 5 (Masculinity—Femininity) and the content scale Type A
(TPA). Regarding the conventional
computerized versus adaptive computerized comparisons, there were no
significant differences for either men or women. In terms of criterion-related
validity, there were no significant differences between the correlations of
criterion measures (viz., the Beck Depression Inventory, Trait Anger and Trait
Anxiety from the State—Trait Personality Inventory, and nine scales from the
Symptom Checklist–Revised) with MMPI-2 scales from the adaptive computerized
administration and those with scales from the booklet version.
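
The adaptive administration examined in this study can be illustrated with a countdown-style stopping rule, in which a scale's items stop being presented once the scale either reaches a cutoff or can no longer reach it. The sketch below is a simplified, hypothetical illustration of that general strategy (invented item labels, canned responses, and an arbitrary cutoff), not the actual MMPI-2 adaptive algorithm.

```python
# A minimal sketch of a countdown-style adaptive administration strategy for a
# single true-false scale. Items and the cutoff are hypothetical.

def administer_scale_adaptively(items, cutoff, ask):
    """Present items until the scale either reaches the cutoff or can no longer
    do so; returns (raw_score, number_of_items_administered)."""
    score = 0
    administered = 0
    for administered, item in enumerate(items, start=1):
        remaining = len(items) - administered
        if ask(item):                    # True if the respondent endorses the item
            score += 1
        if score >= cutoff:              # elevation already established
            break
        if score + remaining < cutoff:   # elevation no longer attainable
            break
    return score, administered

# Hypothetical usage with canned responses standing in for a test taker.
responses = iter([True, False, True, False, False, False, False, False])
score, n_items = administer_scale_adaptively(
    items=[f"item_{i}" for i in range(1, 9)],
    cutoff=5,
    ask=lambda item: next(responses),
)
print(score, n_items)  # 2 of 8 items endorsed; administration stopped after 6 items
```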

Holden and Hickman (1987) investigated computerized and
paper-and-pencil versions of the Jenkins Activity Survey, a measure that
assesses behaviors related to the Type-A personality. Sixty male undergraduate
students were assigned to one of the two administration formats. The stability
of scale scores was comparable for both formats, as were mean scores,
variances, reliabilities, and construct validities, and none of them
demonstrated statistically significant differences.

The 72 German-speaking participants in Merten and Ruch's
(1996) study of German versions of the Eysenck Personality
Questionnaire–Revised and the Carroll Rating Scale for Depression completed
half of each instrument using paper and pencil and the other half on computer
(with orders counterbalanced). They compared the results from the two formats
to one another as well as to data from another German sample, consisting of
individuals who were administered only the paper-and-pencil version of the
Eysenck Personality Questionnaire–Revised. Once again, means and standard
deviations were comparable across computerized and more traditional formats.

Neuropsychology

Pellegrino, Hunt, Abate, and Farr's (1987) study
comparing a battery of 10 computerized tests of spatial abilities with
traditional paper-and-pencil counterparts suggested that computer-administered
static spatial reasoning tasks can augment current paper-and-pencil procedures.
The study indicated that "computer-based static tasks and measures
encompass the variance contained in traditional paper-and-pencil tests and also
provide measures of unique variance associated with speed versus accuracy of
processing" (p. 235). Choca and Morris (1992) used a
sample of 46 neurologically impaired men to compare a computerized version of
the Halstead Category Test to its standard counterpart. All participants were
administered both versions of the test. Results generally supported the
measures' comparability, as the difference between the mean number of errors
obtained on each version (85.22 for the standard vs. 90.22 for the
computerized) was nonsignificant.

The comparability of a computerized neuropsychological test with the
original measure can be influenced by the differences between standard and
computerized administration procedures. For example, research conducted by French and Beaumont (1990) investigated the validity of
computerized versions of the Mill Hill Vocabulary Test and the Standard
Progressive Matrices Test by administering computerized and standard versions
of the former to 274 patients and computerized and standard versions of the
latter to an additional 184 patients. On the Mill Hill Vocabulary Test, there
were no significant score differences between the computerized and standard
versions. However, participants obtained significantly lower scores on the
computerized version than on the standard version of the Standard Progressive
Matrices Test, indicating that these two measures cannot be used
interchangeably. The researchers noted that the poor resolution of the computer
graphics system was a likely explanation of the difference. French
and Beaumont (1990) also found that for the Mill Hill test, participants
expressed a clear preference for keyboard over touch-screen responding.
Additionally, those participants who took the computer version first reported
that they would be more likely to take another similar test.

Personnel Psychology

In the field of personnel psychology, automated procedures typically are
used for such purposes as personnel selection and job performance prediction. McKee and Levinson's (1990) review addressed the adaptation of
paper-and-pencil measures to computerized formats in general and included a
specific discussion of the computerized form of the Self-Directed Search, an
instrument commonly used in career counseling. They cited the sole study
investigating the computerized measure's equivalence to the standard version ( Reardon & Loughead, 1988 ) and advised that additional
research (particularly with respect to reliability and validity) be conducted.

Some of the recent research in this domain addressed the role of
computerized assessment in the selection and classification of military
personnel. Carretta (1989) examined the usefulness of the
computerized test battery, Basic Attributes Battery, for selecting and
classifying United States Air Force pilots. A total of 478 Air Force officer
candidates completed a paper-and-pencil qualifying test and the Basic
Attributes Battery, and they were also judged on the basis of undergraduate
pilot training performance. The validation of the Basic Attributes Battery was
ongoing at the time of the report, but the data then available demonstrated
that it was adequately assessing abilities and skills related to flight
training performance, as measured by its correlations with the other military
assessment procedures, such as dot estimation, memory for digits, encoding
speed, and self-crediting word knowledge. However, although many of the
multiple correlations were significant at the .001 level, they ranged in
magnitude from only .062 to .498.

Conclusions

The comparability of computerized and standard administration of various
measures appears to vary, with the most promising results in the area of
personality assessment. Studies examining a variety of instruments and models
have demonstrated that scores from these computerized measures, and decisions
made using them, are typically equivalent to more traditional methods whose
validity has already been established (although there are some notable
exceptions). However, there remains a need to investigate each measure on a
case-by-case basis ( Hofer & Green, 1985 ) and to
establish the repeatability of the resulting data, particularly within the
areas of psychiatric screening and neuropsychology. Additional pertinent
issues, such as the impact of ergonomic factors on the computer administration
process, also warrant further consideration.

Studies of validity of CBTI systems
frequently highlight the ease with which computerized assessment procedures can
provide extensive interpretations of findings. Numerous computer programs that
are now in use not only provide assessment and rudimentary interpretation but
also incorporate test findings into extensive narrative reports. It is the
validity of these applications to which we turn.

Narrative reports generated by automated
assessment appear to fall into two broad categories: descriptive and
consultative. Descriptive reports (e.g., those for the Sixteen Personality Factor Questionnaire [16PF])
are generated for each individual scale without being moderated by scores on
other scales. Consultative reports (e.g., those for the MMPI and Decision Tree
[DTREE]) provide detailed analyses of all test data; they are designed to mimic
as closely as possible an expert human consultant.
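
The distinction can be illustrated schematically. In the hypothetical sketch below (invented scale names, cut scores, and interpretive text), a descriptive generator emits a canned statement for each elevated scale in isolation, whereas a consultative generator applies a configural rule that considers the pattern across scales before falling back on scale-by-scale statements. Neither reproduces any actual commercial interpretive system.

```python
# Hypothetical contrast between descriptive and consultative report generation.

DESCRIPTIVE_LIBRARY = {
    "anxiety":    "Reports a high level of tension and worry.",
    "depression": "Reports sadness and low energy.",
}

def descriptive_report(t_scores, cutoff=65):
    # Each elevated scale is interpreted on its own, unmoderated by the others.
    return [DESCRIPTIVE_LIBRARY[scale]
            for scale, t in t_scores.items()
            if t >= cutoff and scale in DESCRIPTIVE_LIBRARY]

def consultative_report(t_scores, cutoff=65):
    # Configural rule: the pattern of elevations, not any single scale,
    # drives the narrative (here, a simple two-point code type).
    elevated = sorted(
        (s for s, t in t_scores.items() if t >= cutoff),
        key=lambda s: t_scores[s],
        reverse=True,
    )
    if elevated[:2] == ["depression", "anxiety"]:
        return ["An anxious-depressed pattern; further evaluation of mood is suggested."]
    return descriptive_report(t_scores, cutoff)

profile = {"anxiety": 70, "depression": 78, "somatic": 55}
print(descriptive_report(profile))   # two independent scale statements
print(consultative_report(profile))  # one configural, code-type statement
```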

The validity of narrative CBTIs has been
investigated extensively in the fields of personality assessment and
psychiatric screening. In personality assessment, the validity of MMPI
narratives has been described most extensively (for a detailed review of this
topic, see Moreland, 1987). However, there also has
been research aimed at exploring the accuracy of narrative reports for other
computerized personality tests, such as the 16PF (e.g., Guastello
& Rieke, 1990 ; O'Dell, 1972 ), the Millon Clinical
Multiaxial Inventory (MCMI; e.g., Moreland & Onstad, 1987 ),
and the Rorschach Inkblot Test (e.g., Harris et al., 1981 ; Prince & Guastello, 1990 ). In psychiatric screening,
studies of the computer outputs of the DIS (e.g., First, 1994 )
are most common.

Narrative Reports in Personality Assessment

Moreland (1987) summarized a number of studies investigating
the accuracy ratings for computer-generated MMPI narrative reports. A few
studies compared computer-generated narrative interpretations with criterion
reports generated by human interpreters of the MMPI. One methodological
limitation of such studies arises from the possibility that the validity of a
clinician's interpretation is often low enough so that a CBTI can have low
validity yet can still be judged as accurate ( Moreland, 1987 ).
For example, Labeck, Johnson, and Harris (1983) asked three
clinicians (each with at least 12 years of clinical experience) to rate
independently the quality and the accuracy of code-type interpretations and
associated rules using an automated MMPI program (the clinicians did not rate
the fit of a narrative to a particular patient, however). Results indicated
that the MMPI code-type, diagnostic, and overall profile interpretive
statements were consistently rated highly by the expert judges. The narratives
generated by automated MMPI programs were judged to be substantially better
than average when compared with the "blind" interpretations of
similar profiles that were produced by the expert clinicians. Unfortunately,
the report did not state how the researchers judged the quality of the blind
interpretation. Further, researchers did not investigate the possibility that
statements in the blind interpretation could have been so brief and general
(especially when compared to a two-page narrative CBTI) that they could have artificially
inflated the ratings of the CBTI reports. Despite these limitations, however,
this research design was useful in evaluating the congruence of
computer-generated decision and interpretation rules.

In most published studies of CBTI validity, the CBTIs' recipients (usually
clinicians) rate the accuracy of computer interpretations on the basis of the
clinicians' knowledge of test respondents (refer to Moreland,
1987 , p. 42, for a summary of these studies). A study by Butcher
et al. (1998) explored the utility of computer-based MMPI-2 reports in
Australia, France, Norway, and the United States. In all four countries,
clinicians administered the MMPI-2 to their patients (whom they were seeing for
psychological evaluation or therapy) using a booklet format, in the language of
each country. Item responses for each patient were analyzed by the Minnesota
Report using the United States norms. Each participating clinician then rated the
amount of information in each narrative section as insufficient, some,
adequate, more than adequate, or extensive. For each report,
clinicians also indicated the percentage of accurate descriptions of the
patient and were asked to respond to open-ended questions regarding ways to
improve the report. Relatively few raters found the reports inappropriate or
inaccurate. In all four countries, the Validity Considerations, Symptomatic
Patterns, and Interpersonal Relations sections of the Minnesota Report were
found to be the most useful sections in providing detailed information about
the patients. Over two thirds of the records were found to be highly accurate,
which indicated that clinicians judged 80% to 100% of the computer-generated
narrative statements in them to be appropriate. In 87% of the reports, at least
60% of the computer-generated narrative statements were believed to be
appropriate.

In one of the few studies comparing clinical and computer-based assessment
decisions, Butcher (1988) investigated the usefulness of
computerized MMPI assessment for screening in personnel settings. In his study,
262 airline-pilot applicants were evaluated by both expert clinicians and
computer-based decision rules. Each applicant's overall level of adjustment was
rated by experts (using only an MMPI profile) on a Likert-type scale with three
categories: adequate, problems possible, and problems likely. The
computer-based decision rules were also used to make determinations about the
applicants. Here, the categories of excellent, good, adequate, problems
possible, and poor were used. The study showed high agreement
between the computer-based decisions and those made by clinicians in rating
overall adjustment. Over 50% of individuals falling into the adequate category
on the basis of the computer-based rules were given ratings of adequate by
the clinicians. There was agreement between the computer rules and clinician
judgment on the possibility of problems being present in 27% of cases. Over 60%
of individuals rated as poor by the computer rules were given problems
likely ratings by the clinicians. No kappa values were calculated. This
study showed strong agreement between clinicians and the computer; however, no
external criteria were available to allow for an assessment of the relative
accuracy of each method, thereby leaving in question the validity of both the
computer-generated and the clinician-generated ratings.

One methodological problem involving accuracy ratings that is commonly found
in field studies is that estimates of interrater reliability are difficult, if
not impossible, to obtain. In addition, clinicians are frequently asked to
judge the accuracy of computer interpretations for their patients without
providing descriptions of how these judgments were made and without verifying
the appropriateness of such judgments with information from the patients
themselves and from other sources (such as physician reports or patients'
family members). Furthermore, to study the validity of computer-generated
narrative reports, raters should evaluate individual interpretive statements,
because global accuracy ratings may limit the usefulness of ratings in
developing the CBTI system ( Moreland, 1985 ).

Moreland's (1987) suggestions for evaluating the validity
of narrative CBTIs were adopted by Eyde, Kowal, and Fishburne in their 1991
investigation of the comparative validity of the narrative outputs for CBTI
systems. These researchers used case histories and self-report questionnaires as
a criterion against which narrative reports generated by seven MMPI computer
interpretation systems were evaluated. Each of the clinicians rated six
protocols on the basis of profile type. Some of the cases were assigned to all
raters; they consisted of a pair of Black and White patients who were matched
for 7-2 (Psychasthenia—Depression) code-type and a pair of Black and White
soldiers who had all clinical scales in the subclinical range ( T <
70). Clinicians rated the relevance of each sentence in the narrative CBTI as
well as the global accuracy of each report. Some CBTI systems studied showed a
high degree of accuracy. However, the overall results indicated that the
validity of the narrative outputs varied, with the highest accuracy ratings
being associated with narrative lengths in the short-to-medium range. For
different CBTI systems, results for both sentence-by-sentence and global
ratings were consistent, but they differed for the clinical and subclinical
normal profiles. For instance, the subclinical normal cases presented to all
clinicians had a high percentage ( Mdn = 50%) of unratable sentences,
whereas the 7-2 profiles had a low percentage ( Mdn = 14%) of sentences
that could not be rated. A potential explanation for such differences may come
from the fact that the clinical cases were inpatients for whom more detailed
case histories were available. In addition, since the length of time between
the preparation of the case histories and the administrations of the MMPI
varied, it was impossible to control for changes (i.e., some patients might have
gotten better or worse) in acute symptoms over time ( Eyde,
Kowal, & Fishburne, 1991 ).

The research by Eyde et al. (1991) represents a group of
studies in which participants rated the accuracy of their narrative feedback
after they had taken a test. A possible shortcoming of this design is that
there was a failure to control for a phenomenon known alternatively as the P. T.
Barnum effect (e.g., Meehl, 1956) and the Aunt Fanny effect (e.g., Tallent,
1958), which holds that a narrative report
may contain high base-rate descriptions that will apply to virtually anybody.
Research on the Barnum effect has shown that participants can detect the nature
of the overly general feedback if asked the appropriate questions about it ( Furnham & Schofield, 1987 ; Layne, 1979 ).
However, research also has demonstrated that people typically are more
accepting of favorable Barnum feedback than of unfavorable feedback ( Dickson & Kelly, 1985 ; Furnham &
Schofield, 1987 ; Snyder & Newburg, 1981 ). Further,
people have been shown to perceive favorable descriptions as more appropriate
for themselves than for people in general ( Baillargeon &
Danis, 1984 ). Dickson and Kelly's (1985) research
demonstrated that test situations, such as the type of assessment used, can be
significant in eliciting acceptance of Barnum statements, although Baillargeon
and Danis (1984) found no interaction between the type of assessment device
and the favorability of statements. Research also demonstrates that people are
more likely to personalize Barnum descriptions delivered by persons of
authority or expertise ( Lees-Haley, Williams, & Brown, 1993
). This may be because in group settings, feedback from people of high
status might be better accepted and perceived as more accurate ( Snyder
& Newburg, 1981 ). Personality-related variables such as extraversion,
introversion, and neuroticism ( Furnham, 1989 ), as well as
the extent of private self-consciousness ( Davies, 1997 ),
also have been found to be connected to individuals' acceptance of Barnum
feedback.

In an attempt to deal with Barnum effects on narrative CBTIs, some studies
have compared the accuracy ratings of genuine CBTIs with ratings of reports
describing a stereotypical patient or an average subject. Researchers have
also tried dealing with Barnum-effect
problems by using multireport—multirating intercorrelation matrices ( Moreland, 1987 ) and by looking at differences in perceived
accuracy between bogus and real reports ( Moreland & Onstad,
1987 ; O'Dell, 1972 ).

Some studies that compared bogus and real reports found the reports to be
statistically significantly different. For example, Guastello,
Guastello, and Craft (1989) asked 64 undergraduate students to complete
the Comprehensive Personality Profile Compatibility Questionnaire. One group of
students rated the real computerized test interpretation of the Comprehensive
Personality Profile Compatibility Questionnaire, and another group rated a
bogus CBTI. The difference between the accuracy ratings for the bogus and real
profiles (58% and 75%, respectively) was statistically significant. In a study
by Guastello and Rieke (1990) , 54 undergraduate students
enrolled in an industrial psychology class evaluated a real computer-generated
Human Resources Development Report of the 16PF and a bogus report generated
from the average 16PF profile of the entire class. Results indicated no statistically
significant difference between the ratings for the real reports and the bogus
reports (which had mean accuracy ratings of 71% and 71%, respectively).
However, when analyzed separately, four out of five sections of the real 16PF
output had significantly higher accuracy ratings than did the bogus report.
Unlike the aforementioned groups of researchers, Prince and
Guastello (1990) found no statistically significant differences between
bogus and real CBTI interpretations when they investigated a computerized Exner
Rorschach interpretation system. In this study, four psychiatrists evaluated
real and bogus computer-generated outputs for 12 psychiatric outpatients.
Analyses indicated that the discriminant power of the CBTI for any single
patient was 5%, with 60% of interpretive statements describing only
characteristics of the "typical" outpatient.

In a study of the validity of Millon's Computerized Interpretation System
for the MCMI, Moreland and Onstad (1987) asked eight
clinical psychologists to rate real MCMI computer-generated reports and
randomly generated reports. The judges rated the accuracy of the reports as a
whole as well as the global accuracy of each section (viz., Axis I narrative,
Axis II narrative, Axis I diagnoses, Axis II diagnoses, Axis IV stressors,
Severity, and Therapeutic Implications). When considered one at a time, five
out of seven sections of the report exceeded chance accuracy. Axis I and Axis
II sections demonstrated the highest incremental validity. There was no
difference in accuracy between the real reports and the randomly selected
reports for the Axis IV psychosocial stressors section. The overall pattern of
findings suggests that the MCMI computer-generated reports can exceed chance accuracy. However, further
research is needed to determine the conditions that limit its accuracy ( Cash, Mikulka, & Brown, 1989 ; Moreland
& Onstad, 1987 , 1989 ).

The recent studies largely uphold the findings of their predecessors.
Research into computer-generated narrative reports for personality assessment
generally has revealed that the interpretive statements contained in the
reports are comparable to clinician-generated statements. Research also points
to the importance of controlling for the degree of generality of the reports'
descriptions, because of the confounding influence of the Barnum effect.

Computerized Structured Interviews

The work in computer-assisted psychiatric screening has focused mostly on
the development of statistical and logic-tree decision models ( Erdman,
Klein, & Greist, 1985 ). Computer outputs based on statistical models
such as Bayesian probability models and discriminant function analysis (Hirshfeld, Spitzer, & Miller, 1974) usually provide a list
of probable diagnoses and do not include narrative descriptions of the
decision-making process. Logic-tree systems, on the other hand, are designed to
establish the presence of symptoms specified in diagnostic criteria and arrive
at a particular diagnosis ( First, 1994 ). These systems
adopt an expert system approach, in which reasons for asking questions and the
diagnostic significance of the criterion in question are available after the
assessment process. For instance, DTREE is a recent program designed to guide
the clinician through the diagnostic process ( First, Williams,
& Spitzer, 1989 ). It provides diagnostic consultation both during and
after the assessment process. The narrative report from DTREE includes DSM—III—R
diagnoses and the ruled-out diagnoses, as well as an extensive narrative
explaining the reasoning behind the diagnostic decisions.
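
The logic-tree approach can be sketched as follows. The example below is a hypothetical, simplified branch with invented criteria and wording; it is meant only to illustrate how such a system walks through criteria and records the reasoning behind a diagnostic decision, not to reproduce DTREE or the DSM criteria.

```python
# Hypothetical logic-tree (expert-system) diagnostic branch with invented criteria.

def logic_tree_panic(answers):
    """Return (decision, trace) for a single illustrative diagnostic branch."""
    trace = []

    trace.append("Criterion A: recurrent unexpected panic attacks?")
    if not answers["recurrent_attacks"]:
        return "ruled out: no recurrent attacks", trace

    trace.append("Criterion B: a month of persistent concern after an attack?")
    if not answers["persistent_concern"]:
        return "ruled out: no persistent concern", trace

    trace.append("Exclusion: attacks better explained by a substance or medical condition?")
    if answers["organic_cause"]:
        return "ruled out: organic cause suspected", trace

    return "criteria met: panic branch positive", trace

decision, trace = logic_tree_panic(
    {"recurrent_attacks": True, "persistent_concern": True, "organic_cause": False}
)
print(decision)          # final decision for this branch
print("\n".join(trace))  # the reasoning recorded along the way
```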

Studies assessing the validity of logic-tree programs frequently compare
diagnostic decisions made by a computer and diagnostic decisions made by
clinicians. For example, a pilot study by First et al. (1993) evaluated
DTREE in an inpatient setting by comparing results of expert clinicians' case
conferences with DTREE output. Twenty inpatients were evaluated by a consensus
case conference and by five treating psychiatrists who used DTREE software. On
the primary diagnosis, perfect agreement was reached between the DTREE and the
consensus case conference in 75% of cases ( N = 15). However, the number
of cases within each of the diagnostic categories was small. Also, the
agreement was likely to be inflated because some of the treating psychiatrists
participated in both the DTREE evaluation and the consensus case conference.
Despite these limitations, this preliminary analysis suggests that DTREE might
be useful in education and in evaluation of diagnostically challenging clients
( First et al., 1993 ).

Another logic-tree program is a computerized version of the World Health
Organization Composite International Diagnostic Interview (CIDI-Auto). Peters and Andrews (1995) have investigated the procedural validity
of the CIDI-Auto in the DSM—III—R diagnoses of anxiety disorders. Prior
to entering an anxiety disorders clinic, 98 patients were interviewed by the
first clinician in a brief clinical intake interview. The patients returned for
the second session to self-administer a CIDI-Auto and to be interviewed by the
second clinician. The order in which CIDI-Auto was completed varied, depending
on the availability of the computer and the second clinician. At posttreatment,
clinicians reached consensus about the diagnosis in each individual case ( k
= .93). When such agreement could not be reached, diagnoses were not
recorded as the longitudinal, expert, and all data (LEAD) standard against
which CIDI-Auto results were evaluated. The agreement between clinicians and
CIDI-Auto ranged from poor (k = .002) for generalized anxiety disorder to
excellent (k = .88) for obsessive—compulsive disorder. The modest overall
agreement (k = .40) is consistent with
paper-and-pencil versions of CIDI as well as studies assessing various
computerized diagnostic interviews. The LEAD diagnostic procedure yielded 1.56
diagnoses per patient, whereas the CIDI-Auto resulted in 3.05 diagnoses per
patient. Peters and Andrews (1995) suggested that the CIDI overdiagnosis might
have occurred because clinicians applied stricter diagnostic rules,
particularly with respect to the duration criteria for symptoms.

In another international study, 37 inpatients from the general wards of a
London hospital completed a structured computerized interview
assessing their psychiatric history ( Carr, Ghosh, & Ancill,
1983 ). Case records and clinician interview confirmed 90% of the
information elicited by the computer. Most patients (88%) found the computer
interview no more demanding than a traditional interview, and 33% found the
computer interview to be easier. Problems with reading English occurred in 8%
of the sample. Some patients felt that their responses to the computer were
more accurate because they did not have to make a clinician wait while they
deliberated their answers. The computer program in this study elicited an
average of 5.5 items per patient that were unrecorded by other assessment
methods; in other words, 10% of computer-gathered items were unknown to a
clinician. Carr et al. (1983) indicated that this finding
should not be viewed as a criticism of clinical care because patients vary in
how much and what kind of information they are willing to reveal to a clinician
as compared to a computer. The researchers also noted that, during the 4 weeks,
the patients in this study were assessed by a registrar for several hours and
by a consultant, senior registrar, and a social worker on at least one
occasion. Because the computerized interview was conducted after the patients
were admitted to the hospital, it is possible that it provided an opportunity
to reveal forgotten material.

One of the advantages of such logic-tree programs as DTREE and CIDI-Auto is
that they potentially can make use of an unlimited number of diagnostic
categories. However, unlike statistical models, which are frequently limited
to the samples from which they are derived, logic-tree models cannot assign a
numeric likelihood to a given diagnosis for an individual patient (Altman,
Evenson, & Cho, 1976). Furthermore, a number of
researchers have cautioned against using clinician diagnoses as a criterion
against which another diagnostic instrument is validated (e.g., Altman et al.,
1976; Peters & Andrews, 1995), because the imperfect reliability of clinician
diagnoses is likely to reduce the upper level of agreement between the new
diagnostic instrument and the clinician diagnoses.

In psychiatric screening research, evaluations of computer-administered versions
of the DIS appear to be the most common. Research has shown that patients tend
to hold favorable attitudes toward DIS systems, though diagnostic validity and
reliability are questioned when such programs are used alone ( First,
1994 ). Table 1 lists studies that investigate the
various administration formats for this instrument. It includes the kappa
coefficients associated with a variety of DSM—III diagnoses and demonstrates
that overall reliabilities for computerized administrations are variable,
ranging from .49 to .68.

Differences in diagnostic style, method of data analysis, and diagnostic
system used to arrive at a decision all can potentially influence agreement
between computers and clinicians. For instance, as shown in Table
1 , Wyndowe (1987) found little diagnostic agreement
between the computerized DIS and the nonstructured clinical interview. Another
study of the computerized DIS, by Mathisen, Evans, and Meyers
(1987) , revealed that the overall agreement between the psychiatrist and
the computerized DIS diagnosis for all disorders was low, as indicated by a
kappa coefficient of .34. The overall kappa coefficient increased to .48 when
the researchers limited their analysis to primary diagnoses. When the analysis
was limited to primary diagnosis among patients who received both a clinical
diagnosis and a positive computer diagnosis, the kappa coefficient was .67, a
figure comparable to the clinician agreement reported in DSM—III trials.

Table 2 provides a summary of several studies
investigating the C-DIS. Again, the results of this research vary. Diagnostically,
the C-DIS reports generally were regarded as accurate. However, the C-DIS
yielded more diagnoses per patient than a clinician-administered interview,
and, compared with clinicians, it appears to be more likely to fail to assign
any diagnoses to the interviewee (e.g., Mathisen et al., 1987). Also, the
C-DIS administration takes longer (on average) than a more traditional
face-to-face interview, although patients' perceptions of their comparative
lengths vary (e.g., Erdman et al., 1992 ; Greist
et al., 1987 ).

Neuropsychological Assessment

Another area in which computerized testing batteries have been used to
assist in assessment decisions is that of neuropsychology. The 1960s marked the
beginning of the investigations into the applicability of computerized testing
to this field (e.g., Knights & Watson, 1968 ).
Traditionally, the belief has been that neuropsychology programs have not
"produced results that are either satisfactory or equal in accuracy to
those achieved by human clinicians" ( Adams & Heaton,
1985 , p. 790; see also Golden, 1987 ). Garb
and Schramke's (1996) narrative review and meta-analysis included a section
that discussed the utility of computer programs for improving
neuropsychological assessment. Although they characterized automated assessment
as "promising," they also concluded that there is room for improvement.
Specifically, they suggested that new programs be created and prediction rules
be modified to include such data as patient history and clinician observation,
in addition to the psychometric and demographic data that are more customarily
incorporated into the prediction process.

A recent review by Russell (1995) indicated that computerized testing
procedures detect and locate brain damage quite accurately, though not as
accurately as clinical judgment. A number of recent
studies conducted in the area of neuropsychology investigated specific assessment
measures. For example, the Right Hemisphere Dysfunction Test and Visual
Perception Test were the focus of one study ( Sips,
Catsman-Berrevoets, van Dongen, van der Werff, & Brook, 1994 ). These
computerized measures were created for the purpose of assessing
right-hemisphere dysfunction in children and were intended to have the same
validity as the Line Orientation Test and Facial Recognition Test had for
adults. Fourteen children with acquired cerebral lesions were administered all
four tests. Findings indicated that the Right Hemisphere Dysfunction Test and
Visual Perception Test together were sensitive (at a level of 89%) to
right-hemisphere lesions, had relatively low specificity (40%), had high
predictive value (72%), and accurately located the lesion in 71% of cases. Fray, Robbins, and Sahakian (1996) reviewed findings regarding
a computerized assessment program, the Cambridge Neuropsychological Test
Automated Battery (CANTAB). Although specificity and sensitivity were not
reported, the reviewers concluded that CANTAB can detect the effects of
progressive, neurodegenerative disorders sometimes before other signs manifest
themselves. In particular, CANTAB has been found successful in detecting early signs
of Alzheimer's, Parkinson's, and Huntington's diseases ( Fray et
al., 1996 ).

On the basis of the findings of the various
studies discussed, several conclusions can be drawn. For one, research has
supported the position that, by and large, computer-administered tests are
essentially equivalent to booklet-administered instruments. There are some
reported inconsistencies between computerized and other modes of test
administration, but the nonequivalence issues are typically small or
nonexistent for most computerized adaptations of paper-and-pencil tests (Finger
& Ones, 1999; Moreland, 1987; see also Hofer & Green, 1985; and Honaker, 1988).
However, the marriage between computers and psychological test interpretation
has not been a perfect union, largely because past efforts at computerized
assessment have not made optimal use of the flexibility and power of computers
for making complex decisions. To date, CBTIs largely perform a "look up and
list out" function. That is, a broad range of interpretations is stored in the
computer for various test indexes, and the computer simply lists out the
stored information for the appropriate scale score levels; computers are not
much involved in actual decision making.

Computerized applications are limited, to some extent, by the available
technology. To date, computer—human interactions are confined to written
material. Consequently, potentially critical nonverbal cues, such as speech
patterns, vocal tone, and facial expressions, cannot be accounted for in CBTIs
as they currently exist. Furthermore, the verbal choices are usually provided
to the test taker in a fixed format (e.g., true—false on the MMPI-2). Temporal
relationships relating to the course of a disorder, as well as quantitative
relationships between various clusters of symptoms, are crucial in defining
disorders and are quite difficult for computers to evaluate ( First,
1994 ). As suggested by Sawyer (1966) , the best
predictions are made when clinicians are allowed to gather information and
decide how to use it, and computers are programmed to provide statistical summaries
and compare results with a data bank.

Extensive research on computer-assisted decisions in several applied
contexts (such as neuropsychology, personnel screening, and clinical diagnosis)
has shown that automated procedures have the capacity to provide more or less
valid and accurate descriptions and predictions. Research on the accuracy of
some computer-based systems (particularly those based on the MMPI—MMPI-2, which
have been subjected to more studies) has shown promising results with respect
to accuracy. However, instruments and situations vary in their reliability and
utility, as illustrated by Eyde et al. (1991) in their
extensive study of the accuracy of CBTIs. Consequently, each computer-based
application needs to be evaluated carefully. Simply put, the fact that a report
comes from a computer does not necessarily mean that it is valid. The caution
required in assessing the utility of computer-based applications brings about a
distinct need for specialized training in their evaluation. It is apparent that
some sort of instruction in the use (and avoidance of misuse) of CBTIs is
essential for all professionals who use them ( Hofer &
Green, 1985 ). There is also a need for further research focusing on the
accuracy of the information contained in the reports.

Moreover, even though computer-based reports have been validated in some
settings, this does not guarantee their validity and appropriateness for all
applications. In their discussion of the misuse of psychological tests, Wakefield and Underwager (1993) cautioned against using
computerized test interpretations of the MCMI and MCMI-II, designed for
clinical populations, in other settings, such as for forensic evaluations. If a
test has not been developed for or validated in a particular setting, then
computer-based applications of it in that context are not warranted. The danger
of misusing data applies to all psychological test formats, but the risk seems
particularly high when one considers the convenience of computerized outputs
and (as noted by Garb, 1998 ) the fact that some of the
consumers of CBTI services are nonpsychologists who are unlikely to be familiar
with the validation research. It is important for scoring and interpretation
services to provide computer-based test results only to qualified users
(however they are defined).

Even though decision-making and interpretation procedures may be automated
with computerized testing, personal factors must still be considered in some
way. Styles's (1991) research investigated the importance of
a trained psychologist during computerized testing with children. Her study of
Raven's Progressive Matrices demonstrated the need for the psychologist to
maintain rapport and interest prior to, during, and after testing. These
factors were found to have important effects on the reliability and validity of
the test data, insofar as they affected test-taking attitudes, comprehension of
test instructions, on-task behavior, and demeanor. Carson (1990)
has also argued for the importance of a "sound clinicianship,"
both in the development of psychological test systems and in their use.

All of this points to the conclusion that computer-generated reports should
be viewed as valuable adjuncts to clinical judgment rather than
"stand-alone" substitutes for a skilled clinician ( Fowler,
1969 ). It should be noted that attempting to tailor test results to unique
individual characteristics is a complex process and may not always increase
their validity. In addition, it bears noting that CBTIs do not necessarily make
clinicians' lives easier (as they ostensibly intend to), because in real-life clinical
practice, the evaluation of computerized outputs, especially in narrative
formats, requires special expertise and extra time. This consideration may help
to explain why a number of studies indicate that clients' attitudes toward the
use of computers in clinical activities are not only favorable, but are more
favorable than attitudes among the professionals who serve them ( R.
Wilson, Omeltchenko, & Yager, 1991 ). However, an alternative
explanation may be that professionals are likely to have more knowledge
regarding the limitations of CBTIs and, thus, are more skeptical about their
use than are nonprofessionals.

It is apparent that whatever else computerized assessment has done for the
field of psychology, it clearly has focused attention on accurate assessment in
the fields of clinical evaluation and diagnosis. Whether decisions are based on
computer interpretation strategies or clinical judgment, psychological
assessment, as a discipline, requires further evaluation and continual
refinement.

Correspondence may be addressed to James N. Butcher, Elliott Hall,
University of Minnesota, 75 East River Road, Minneapolis, Minnesota, 55455.
Electronic mail may be sent to butch001@tc.umn.edu
Received: July 10, 1998
Revised: March 22, 1999
Accepted: March 29, 1999