Name: Exceptional Children
Publisher: Council for Exceptional Children
Audience: Academic; Professional
Format: Magazine/Journal
Subject: Education; Family and marriage
Copyright: COPYRIGHT 1988 Council for Exceptional Children
ISSN: 0014-4029
Date: Sept, 1988
Source Volume: v55
Source Issue: n1
Accession Number: 6707902

Full Text:

Direct Observation: Factors Affecting the Accuracy of Observers

Direct observation is a method of data collection in which the
target behavior is observed and recorded as it occurs. Although
thousands of studies have used direct observation, research on the
accuracy of the observers who provide data is not commensurate with the
widespread use of observation methodologies. There exists a general
presumption that observers are collecting accurate data, as well as a
belief that adequate reliability scores are synonymous with adequate
levels of accuracy. Some research, however, has suggested that these
views are not necessarily correct (e.g., DeMaster, Reid, &
Twentyman, 1977).

The first purpose of this article is to review research on the
accuracy of observers. The review is organized around seven major
factors that may affect accuracy: (a) reactivity; (b)
observer drift; (c) the recording procedure; (d) location of the
observation; (e) reliability; (f) expectancy and feedback; and (g)
characteristics of subjects, observers, and settings. The second
purpose is to offer recommendations for increasing the accuracy of
observers.

FACTORS AFFECTING OBSERVER ACCURACY

Reactivity

In direct observation, we concurrently observe and record the
behaviors of interest (Repp, 1983), and in many of these situations, the
observer's presence is known to the subject. A common presumption
is that the subject's behavior is the same as it
would be if the observer were not present. However, reactivity surely
occurs in some of these cases, as subjects respond to the presence of
observers by changing their behaviors. Haynes and Horn (1982) presented
a comprehensive review of studies related to reactivity and suggested
that behaviors may be increased, decreased, made more variable, or not
be affected at all. For example, some subjects may present themselves
in their "best light," and socially desirable behaviors may
increase in probability as a function of observer presence; other
subjects may have the opposite reaction. When such reactivity occurs,
the study's internal validity is threatened, because the effects of
reactivity cannot be separated from any effects of the
experimental variable. External validity, or the extent to which the
findings of a study can be generalized, may also be affected, and two
concerns arise (Kazdin, 1980). One is whether the findings of a study
in which reactivity occurred apply to nonreactive situations. A second
is pretest sensitization, wherein reactivity during baseline may
sensitize subjects to the intervention and make them either more or less
receptive to the experimental variable. Therefore, the results may not
apply to individuals who are not similarly sensitized.

Whereas most research has focused on subject reactivity, some has
focused on the effect that observation has on the behavior of the
observers. This effect may be termed observer reactivity, and it has
been demonstrated in two studies. In one (Hay, Nelson, & Hay,
1977), teachers who were instructed to record the behavior of students
in their classrooms began giving more prompts to the observed students.
In another (Hay, Nelson, & Hay, 1980), one of four teachers acting
as observers increased her rate of instructing and of giving positive
feedback.

Observer Drift

Drift is a cognitive phenomenon that involves a gradual shift by
the observer from the original response definition, and it results in
behavior being inconsistently recorded (Hersen & Barlow, 1976). As
Lipinski and Nelson (1974) noted, it is relevant to both between-group
and within-group designs and thus may warrant considerable concern.
When drift occurs, the data collected are no longer directly comparable
across conditions, because they no longer quantify the same precise
response. For example, an intervention study cannot be properly
evaluated if the target response has been defined differently in the
baseline and treatment phases.

This phenomenon also points out the distinction between observer
agreement and observer accuracy. The former is obtained by comparing
the scores of observers, whereas the latter is obtained by comparing
these scores with a previously established criterion (Mash &
McElwee, 1974). These two measures may differ: high observer agreement,
for example, may be gained at the expense of accuracy. Kent,
O'Leary, Diament, and Dietz (1974) conducted a study of observer
variables and found a consistent difference between observation pairs in
their use of the behavioral code. Members of pairs developed agreement
between themselves, but there was far less agreement between pairs.
Thus each pair had high agreement scores, but at least one pair had to
have inaccurate data. This condition, in which both members of an
observation pair similarly change definitions, has been termed
consensual observer drift (Johnson & Bolstad, 1973).
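
The distinction between agreement and accuracy can be made concrete with a small computation. The following Python sketch uses invented interval records; the data, the names criterion, observer_a, and observer_b, and the helper proportion_matching are illustrative assumptions and are not taken from any of the studies cited. It shows how two observers who have drifted in the same direction can agree perfectly with each other while both depart from a criterion record.

```python
def proportion_matching(record_1, record_2):
    """Proportion of intervals in which two interval-by-interval records match."""
    if len(record_1) != len(record_2) or not record_1:
        raise ValueError("records must be non-empty and cover the same intervals")
    matches = sum(1 for x, y in zip(record_1, record_2) if x == y)
    return matches / len(record_1)


# Hypothetical 10-interval records (1 = behavior scored, 0 = not scored).
criterion  = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # previously established standard
observer_a = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]  # has drifted to a looser definition
observer_b = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]  # has drifted in the same way

agreement  = proportion_matching(observer_a, observer_b)  # observer vs. observer
accuracy_a = proportion_matching(observer_a, criterion)   # observer vs. criterion
accuracy_b = proportion_matching(observer_b, criterion)

print(f"agreement between observers: {agreement:.2f}")
print(f"accuracy of observer A:      {accuracy_a:.2f}")
print(f"accuracy of observer B:      {accuracy_b:.2f}")
```

In this invented case the observers match each other in every interval but match the criterion in only 7 of 10, which is the pattern that consensual drift would be expected to produce.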

Recording Procedure

The recording procedure itself can probably bring more error to the
data than would most conscientious observers, and there are no
experiments to show whether observer-contributed errors are greater
under one procedure than under another. In behavioral research using
direct observation, data are collected either through continuous
recording or one of three time-sampling procedures: whole interval
recording, partial interval recording, and momentary time sampling.
Time-sampling procedures are appealing when multiple behaviors are
observed and the equipment necessary to record them continuously is
unavailable.

The accuracy of the data produced by these procedures, however, has
been seriously questioned. A small body of literature has examined how
closely time-sampling methods can approximate continuous measurement
(e.g., Brulle & Repp, 1984; Repp, Roberts, Slack, Repp, &
Berkler, 1976). In general, these studies have found that (a) partial
interval overestimates the continuous measure; (b) whole interval
underestimates; (c) momentary time sampling is preferred, because it
randomly overestimates and underestimates the continuous measure and
thus produces a fairly accurate average; and (d) smaller observation
intervals produce far more accurate data than do large intervals.
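
These findings can be illustrated with a small simulation. The Python sketch below is not a reanalysis of any cited study: the session data and all function names are hypothetical, and the behavior is assumed to be coded second by second as 1 (occurring) or 0 (not occurring). Each time-sampling procedure is then applied to the same record and compared with the continuous measure.

```python
def split_intervals(stream, interval_len):
    """Cut a second-by-second record into consecutive full-length intervals."""
    usable = len(stream) - len(stream) % interval_len  # drop a trailing partial interval
    return [stream[i:i + interval_len] for i in range(0, usable, interval_len)]


def continuous_percent(stream):
    """Percentage of seconds in which the behavior was occurring."""
    return 100.0 * sum(stream) / len(stream)


def partial_interval_percent(stream, interval_len):
    """Score an interval if the behavior occurred at any time during it."""
    intervals = split_intervals(stream, interval_len)
    return 100.0 * sum(1 for iv in intervals if any(iv)) / len(intervals)


def whole_interval_percent(stream, interval_len):
    """Score an interval only if the behavior occurred throughout it."""
    intervals = split_intervals(stream, interval_len)
    return 100.0 * sum(1 for iv in intervals if all(iv)) / len(intervals)


def momentary_percent(stream, interval_len):
    """Score an interval from the last second alone (momentary time sampling)."""
    intervals = split_intervals(stream, interval_len)
    return 100.0 * sum(iv[-1] for iv in intervals) / len(intervals)


# Hypothetical 60-second session: 4-second bouts separated by 8-second pauses.
session = ([1] * 4 + [0] * 8) * 5

print(f"continuous: {continuous_percent(session):.1f}%")
for interval_len in (10, 2):
    print(
        f"{interval_len:>2}-s intervals: "
        f"partial {partial_interval_percent(session, interval_len):.1f}%, "
        f"whole {whole_interval_percent(session, interval_len):.1f}%, "
        f"momentary {momentary_percent(session, interval_len):.1f}%"
    )
```

With these invented data the behavior occupies one third of the session. At 10-second intervals, partial interval recording scores every interval and whole interval recording scores none; at 2-second intervals both estimates equal the continuous measure. The exact match at the shorter length is an artifact of this tidy pattern, but the direction of the effect is the one the studies report.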

Much of the observational research employs partial interval
recording at 10-second intervals, yet this interval size has been shown
to produce unrepresentative data. Researchers who use time sampling and
are interested in accurate data should use extremely small intervals
(cf. Sanson-Fisher, Poole, & Dunn, 1980). Ideally, with this
procedure, one should first sample responding, select an interval length
such that only one response can occur per interval, and then begin
formal data collection (Repp, 1983; Repp et al., 1976).
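
One possible reading of this recommendation is sketched below in Python. It assumes that a pilot sample has been recorded as response onset times in seconds, and it takes the shortest gap between successive onsets as an upper bound on the interval length. The helper name and the pilot data are hypothetical, and the bound holds only for the responses actually sampled; a longer or more varied pilot sample could yield a shorter bound.

```python
def max_interval_length(onset_times):
    """Shortest gap between successive response onsets in a pilot sample.

    An observation interval no longer than this gap can contain at most one
    onset from the pilot sample, provided each interval includes its start
    time but excludes its end time.
    """
    ordered = sorted(onset_times)
    if len(ordered) < 2:
        raise ValueError("at least two response onsets are needed")
    return min(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical pilot sample: onset times (in seconds) of five responses.
pilot_onsets = [3.0, 9.5, 14.0, 30.0, 32.5]
upper_bound = max_interval_length(pilot_onsets)
print(f"use intervals no longer than {upper_bound} seconds")
```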

Location of the Observation

Although most of the data from direct observation are collected in
situ, some are collected from audio or videotapes in an effort to reduce
the obtrusiveness of observations (e.g., Schoggen, 1964). Though the
devices used may cause some reactivity initially, studies have suggested
that this effect is at most ephemeral.

For example, in the first of a two-part study, Christensen and
Hazzard (1983) questioned whether families who were being audiotaped
would change their rate of positive and negative interactions over time,
and they found no systematic changes over 16 sessions. In the second
part of their study, they compared conditions in which families were
either aware or unaware of their conversations being taped. Results for
two of the three families studied showed no effects; results for the third
showed an initial but short-lived change.

Additional studies (e.g., Fulton & Rupiper, 1962; Kent,
O'Leary, Deitz, & Diament, 1979) suggested that there is little
difference for most behaviors between data collected in the natural
setting and that collected from videotape, although some behaviors
(e.g., vocalization) may show a difference.

Reliability

In this article, the term reliability is used specifically to refer
to interobserver agreement, or the degree to which two observers agree
that responding has occurred. Although observers can be trained to
evaluate themselves, Boykin and Nelson (1981) cautioned experimenters to
perform the calculations themselves, lest observers reach high agreement
scores at the expense of accuracy.

In addition, observer awareness during reliability checks has been
shown to affect both observer accuracy and reliability. Reid (1970),
for example, found that observers were more accurate when they believed
they were being monitored; and Romanczyk, Kent, Diament, and
O'Leary (1973) found that interobserver agreement was higher when
observers believed reliability was assessed. The concept of reactivity
discussed earlier is thus also applicable here.

The question of how to calculate interobserver agreement has
generated considerable discussion and different formulas (Hartmann, 1977;
Hawkins, 1979; Hopkins, 1979; Kratochwill, 1979; Repp, 1983; Repp,
Deitz, Boles, Deitz, & Repp, 1976; Rojahn & Schroeder, 1983).
Much of the discussion has revolved around the contribution of chance to
interobserver agreement scores. For example, high-frequency behaviors
inflate percentage agreement on occurrence, whereas low-frequency
behaviors inflate percentage agreement on nonoccurrence (Rojahn &
Schroeder, 1983). Because there is a relationship between the rate of
behavior and the formula used to calculate reliability, no single
standard of acceptable agreement levels has been adopted.
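
The dependence of agreement scores on the rate of behavior can be shown with a short computation. The Python sketch below uses invented interval records and a hypothetical function name. It computes overall percentage agreement, agreement on occurrence, and agreement on nonoccurrence, and it adds Cohen's kappa as one widely used chance-corrected index; kappa is included here as an illustration and is not presented as the formula favored by any of the authors cited.

```python
def agreement_indices(obs1, obs2):
    """Interval-by-interval agreement indices for two observers' 0/1 records."""
    if len(obs1) != len(obs2) or not obs1:
        raise ValueError("records must be non-empty and cover the same intervals")
    n = len(obs1)
    both = sum(1 for a, b in zip(obs1, obs2) if a and b)             # both scored
    neither = sum(1 for a, b in zip(obs1, obs2) if not a and not b)  # neither scored
    disagree = n - both - neither

    overall = (both + neither) / n
    # Occurrence agreement ignores intervals that neither observer scored.
    occurrence = both / (both + disagree) if both + disagree else None
    # Nonoccurrence agreement ignores intervals that both observers scored.
    nonoccurrence = neither / (neither + disagree) if neither + disagree else None

    # Cohen's kappa: agreement beyond that expected from each observer's base rate.
    p1 = sum(1 for a in obs1 if a) / n
    p2 = sum(1 for b in obs2 if b) / n
    chance = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (overall - chance) / (1 - chance) if chance < 1 else None

    return {
        "overall": overall,
        "occurrence": occurrence,
        "nonoccurrence": nonoccurrence,
        "kappa": kappa,
    }


# Hypothetical low-rate behavior: each observer scores 2 of 20 intervals.
observer_1 = [1 if i in (2, 10) else 0 for i in range(20)]
observer_2 = [1 if i in (2, 15) else 0 for i in range(20)]

indices = agreement_indices(observer_1, observer_2)
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

With these invented records the observers agree in 90% of intervals overall, yet they agree in only one of the three intervals in which either scored the behavior, and kappa is about .44. The high overall and nonoccurrence figures largely reflect the many intervals in which a low-rate behavior simply did not occur.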

The purpose of observation is also a factor. Data that are used to
diagnose and place children must be highly reliable and accurate, but a
teacher who is collecting data to determine if recess is a reinforcer
for a child may tolerate more error. In sum, reliability scores and the
methods used to calculate them should continue to be provided to
consumers of research so they can make their own evaluations of the
data.

Observer Expectancy and Feedback

Observers may be biased through expectations of subject performance
that are based on factors such as sex, the behavior of peers, or the
purpose of the intervention. Since bias in a study weakens any possible
conclusions concerning independent variables, its presence is serious.
O'Leary and Kent (1977), in a review of their own studies in
observation, reported that global evaluations can be influenced by
expectations alone. For example, if observers are informed that an
intervention to reduce stereotypic behavior is occurring, they are
likely to report at the end of the intervention that the target behavior
decreased. However, research shows that error-producing bias can be
substantially reduced through the use of observers who are trained in
systematic direct observation methods (Kent et al., 1974; Redfield &
Paul, 1976).

Experimenter feedback may also affect the behavior of observers.
For example, O'Leary, Kent, and Kanowitz (1975) showed that
observer expectations alone were insufficient to affect the behaviors of
observers using systematic procedures such as time sampling. They found
other factors at work: expectation of behavior change and contingent
experimenter feedback on whether the data supported the
researcher's hypotheses. In this study feedback was provided by
the experimenter; however, in other instances, feedback could be from
events in the setting. For example, if a teacher has just praised a
child for being on task, an observer might presume (cf. Strain, Lambert,
Kerr, Stagg, & Lenkner, 1983) that the child should have just been
scored as being on task. This issue was addressed by Harris and
Ciminero (1978), who found that some observers increased their scoring
of a behavior when they witnessed a consequence for it, but did not
increase their scoring of another behavior for which they did not
witness a consequence when neither behavior occurred.

Subject and Setting Variables

Sex of Subjects. The majority of the research on demographic
characteristics of observed subjects has been on the variable of sex.
Yarrow and Waxler (1979) noted that separate analyses are provided for
males and females in most child development studies and that a
significant finding often appears for only one sex. They suggested that
these differences may be either genuine or due to differences in the
way observers score behaviors for the sexes. Their data indicated that
for many behaviors, equal reliability was attained for both sexes; for
other behaviors, however, significant differences were found.

In two studies, for instance, Yarrow and Waxler found that
observers scored aggressive behavior more reliably for males than for
females. Other studies have reported the presence of an interaction
between the sex of the subject and the sex of the observer. One study
(Gurwirz & Dodge, 1975) found that adult observers tended to rate
opposite-sex children more positively. However, another study (Horn
& Haynes, 1981) found that training observers in an objective coding
method that focuses their attention on overt, operationally defined
responses may provide a way to reduce sex bias. In this study, male and
female observers were trained to code disruptive behaviors in children
and then were asked to rate the subjects along 12 subjective dimensions.
Results showed no sex differences among the behavioral ratings and a
difference in the subjective ratings on only one dimension.

Another study (Moss & Jones, 1977) examined the demographic
variable of socioeconomic status. This study found that reliability was
significantly higher when observers were scoring the behavior of middle
class mothers than when scoring that of lower class mothers. Since all
observers were middle class, the authors suggested that this finding may
reflect an interaction between social class of observer and subject.

Subject Behavior Patterns. Subject characteristics also include
the nature of behavior patterns. The effect of behavioral complexity on
observer accuracy was studied by Jones, Reid, and their colleagues
(Jones, Reid, & Patterson, 1975; Taplin & Reid, 1973). These
authors defined behavioral complexity as the number of discriminations
required during an observation session, as measured by the number of
different categories rated. In a series of studies, they consistently
found negative correlations between the complexity of observed behavior
and reliability coefficients. They thus concluded that observation is
more difficult when there is a broad range of responses to code. Along
these lines, reliability coefficients will be misleading if behavioral
complexity differs systematically between reliability and nonreliability
sessions (Jones et al., 1975). Interestingly, complexity levels have
been found to be lower during reliability sessions (Jones et al., 1975).

Predictability of Subject Responses. Behavior may be predictable
because it occurs in a sequence with one response always following
another, or because it occurs either frequently or infrequently each
session. Mash and McElwee (1974) hypothesized that behaviors which
often occurred in predictable sequences would be more easily scored than
those in unpredictable sequences. Other researchers have suggested that
rate may be a factor in observer accuracy (Johnson & Bolstad, 1973;
Thomas, Loomis, & Arrington, 1983). In our own training of large
numbers of undergraduates, we have found that certain categories of
behavior, when occurring more than 80% of the time, tend to be scored in
every recording interval whether they are occurring or not. This seems
to be especially true when a behavior occurs frequently in the earlier
part of the session and seldom in the later part. We have also found
that additional training is needed to reach acceptable levels of accuracy
on low-rate behaviors, which are more often overlooked.

Familiarity with Setting or Subjects. Finally, several authors
have suggested that setting characteristics may affect the accuracy of
observers. For example, familiarity with the setting may make
observation easier and thereby increase observer accuracy. Kent and
Foster (1977) noted that reliability seems to be lower when observers first
enter a new setting, but then increases after practice in the setting.
Similarly, familiarity with the subject population may make observation
easier. Other setting characteristics, such as a high activity and
noise level, may make observation more difficult (Wasik & Loven,
1980).

INCREASING THE ACCURACY OF OBSERVERS

The second purpose of this article is to offer recommendations for
increasing the accuracy of observers. Of course, observer accuracy is a
moot point if the selected target behaviors do not reflect the questions
and concerns of clients, staff, parents, and so forth. The first step
is thus to select what behaviors are to be measured. Kazdin (1980)
suggested that several dependent measures be used, because many of the
behaviors assessed are multifaceted and complex. For example, if a
child is referred to special education because of academic deficits and
noncompliance, many academic and social behaviors may be assessed. In
addition, teacher behaviors such as instructional antecedents and the
delivery of reinforcers may be observed. This is not to suggest,
however, that all the measures are likely to converge on a single
conclusion; because of the complexity of behavior and settings, such
convergence would be the exception rather than the rule. Disagreements
among multiple measures should be regarded not as introducing ambiguity
into the process but as a means of
elaboration (Kazdin, 1980).

After the behaviors to be measured are selected and defined, they
must be observed accurately. Table 1 provides a summary of five
recommendations for increasing observer accuracy, along with the threats
to accuracy that each addresses. The recommendations are explained and
illustrated as follows.

Well-Trained Observers

Hartmann and Wood (1982) provided a model for training observers
that includes learning the observation manual, practice sessions,
retraining and recalibration sessions, and postinvestigation debriefing.
Observation is thus a skill to be taught systematically, and extensive
practice is advised before beginning a study (Hersen & Barlow,
1976). Recalibration and retraining throughout a study serve to
maintain this skill and guard against observer drift (O'Leary
& Kent, 1977).

In addition, there are other precautions that may be taken to
increase the level of observer performance. First, the observation
codes used should not be more complex than is necessary, and observation
schedules should be reasonable in order to reduce fatigue-related
errors. Second, both sexes should be used equally in the study, and
observers should be balanced across sessions by experimental conditions.
A possible exception to this is when phase changes would signal the
hypothesis; in this case, new observers should be brought in, for they
cannot be biased by what has already occurred. Third, as alluded to,
all observers should be blind to the experimental hypothesis, and they
should be praised for accuracy rather than for obtaining the desired
results. Fourth, consensual observer drift can be avoided by
eliminating interaction between observers and by giving feedback on
interobserver agreement only after the study is done. Experimenters
should perform reliability calculations themselves (Boykin & Nelson,
1981). Finally, where possible the data should be compared with a
standard. For example, some sessions can be videotaped, coded by
experienced observers, and used as a criterion for novice observers.

Adaptation Period

A potential correction for reactivity is the use of adaptation
periods, wherein both subjects and observers can become familiar with
the observation process (Sulzer-Azaroff & Mayer, 1977). For
example, Barkley (1981) discussed the use of observational methods in
the diagnosis of hyperactivity and suggested that observations can be
made in clinic playrooms equipped with a sound system and a one-way
mirror. He further suggested that children be given at least an hour to
adapt to the playroom. In some cases, the length of the adaptation
period can be defined empirically; i.e., when certain behaviors (e.g.,
looking at the observer) decrease, or when behavior becomes more stable.
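
An empirical criterion of this kind could take many forms. The Python sketch below shows one hypothetical stability rule: adaptation is treated as complete at the first session that closes a run of consecutive sessions whose values all lie within a set proportion of the run's mean. The function name, the window of three sessions, the 25% tolerance, and the data are assumptions chosen for illustration; they are not drawn from Barkley (1981) or the other sources cited.

```python
def first_stable_session(rates, window=3, tolerance=0.2):
    """Return the 1-based number of the session that completes the first stable run.

    A run of `window` consecutive sessions is called stable when every value
    lies within `tolerance` (as a proportion) of the run's mean.
    Returns None if no run meets the criterion.
    """
    for end in range(window, len(rates) + 1):
        recent = rates[end - window:end]
        mean = sum(recent) / window
        if all(abs(r - mean) <= tolerance * abs(mean) for r in recent):
            return end
    return None


# Hypothetical counts of glances at the observer in seven successive sessions.
glances = [14, 9, 6, 4, 4, 4, 3]
stable_at = first_stable_session(glances, window=3, tolerance=0.25)
print(f"criterion first met at session {stable_at}")
```

A stricter tolerance or a longer window would lengthen the adaptation period; the appropriate values would depend on the behavior and the purpose of the observation.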

Unobtrusive Observation

Observing unobtrusively means taking steps to ensure that subjects
are relatively unaware of assessment and that observers are unaware of
reliability evaluations. Of course, both parties must agree to the fact
of observation for ethical reasons, but they must also agree to be
unaware of the exact observation schedules in order to reduce reactivity
and increase reliability. For instance, an observer coding interactions
on the playground can sit indoors by a window facing the play area,
rather than on the playground itself.

Permanent Products

Permanent products of behavior, such as written responses or
projects completed, are invaluable; they can be coded after behavior has
taken place and as many times as necessary to achieve accuracy and
reliability. Audiotapes and videotapes that capture more fleeting
responses, such as talking, can be scored repeatedly to avoid bias and
error. For example, pages from a child's daily workbooks can be
used as natural samples of academic task performance. A cassette player
can be used to record interactions among adolescents in a group
discussion in order to facilitate coding specific social skills, such as
waiting for another to complete a statement before expressing one's
own point of view.

Systematic and Frequent Observation

Objective rather than subjective methods of recording increase
accuracy by providing, along with explicit response definitions, rules
for scoring behavior. These are more likely to reduce biases
contributed by characteristics of the subjects, observers, or setting.
In cases where time sampling is preferred over continuous recording for
practical reasons, extremely small and numerous intervals are best
(Repp, 1983). Portable lap computers, programmed for the entry and
storage of continuous data (e.g., Repp, Harmon, & Felce, 1984), may
serve to increase the feasibility of continuous measures and may be used
to gather data for use in developing and monitoring Individual
Educational Plans (Olinger & Brusca, 1985).

CONCLUSION

The results of the research on the accuracy of observers suggest
that more caution might be exercised in conducting observational studies
than is generally evidenced. A number of factors that contribute to
observer error emerge; fortunately many are correctable. Formally
training observers, using an adaptation period, observing unobtrusively,
using permanent products of behavior, and observing frequently and
systematically are all good practices to follow.

REFERENCES

Brulle, A. R., & Barton, L. E. (1980). The accuracy of
momentary time sampling procedures when used in applied setting. Paper
presented at the Annual Meeting of the Association for Behavior
Analysis, Dearborn, Michigan.

Brulle, A. R., & Repp, A. C. (1984). An investigation of the
accuracy of momentary time sampling procedures with time series data.
British Journal of Psychology, 75, 481-485.

Hawkins, R. P. (1979). The functions of assessment: Implications
for selection and development of devices for assessing repertoires in
clinical, educational, and other settings. Journal of Applied Behavior
Analysis, 12, 501-516.