Over 2,000 research studies on student ratings of instruction have been published. For
those interested, researchers have published several major reviews of this body of
literature (Aleamoni, 1987; Arreola, 1995; Braskamp & Ory, 1994; Cashin, 1988, 1995;
Centra, 1993; Costin, Greenough, & Menges, 1971; Davis, 1993; Marsh, 1987;
McKeachie, 1994, 1997). Meta-analysts have provided quantitative summaries of the
relationship between student ratings and student learning (Cohen, 1981, 1982, 1983; Dowell
& Neal, 1982; Abrami, 1984; McCallum, 1984; all cited in d'Apollonia &
Abrami, 1997). More than 25 years of published research evidence supports the conclusion
that there is "a moderate to large association between student ratings and student
learning, indicating that student ratings of general instructional skill are valid
measures of instructor-mediated learning in students" (d'Apollonia & Abrami,
1997, p. 1202). McKeachie (1997) summarized the most recent research on the validity of
student ratings, stating that "student ratings are the single most valid source of
data on teaching effectiveness" (p. 1219). Moreover, according to reviews of the
literature conducted by Aleamoni (1987) and Arreola (1995), well-developed and tested
student rating forms of teaching effectiveness exhibit both reliability and validity.

This report assesses a group of published student ratings forms currently available for
use at an institutional level. This group includes the Educational Testing Service's
(ETS's) Student Instructional Report II (SIR-II), the Instructional Development and
Effectiveness Assessment (IDEA), the University of Arizona's Arizona Teacher-Course
Evaluation Questionnaire (AZTEQ), and the Purdue Cafeteria System. The IDEA and the AZTEQ
are available in both long and short forms. This report addresses only the long forms, as
they are designed to be useful for both administrative/personnel evaluation and improving
teacher effectiveness. The following criteria for assessment of these instruments have
emerged from the literature: the multidimensionality of the teaching effectiveness
construct; the reliability and validity of the measure; and cost.

Multidimensionality of the Teaching Effectiveness Construct

Forms must distinguish among the various items and their dimensions to ensure that
instructors receive ratings on all of the appropriate dimensions. The importance of
addressing the various dimensions cannot be overemphasized when the purpose is to improve
teaching. However, one or a few global or summary items might provide sufficient
student rating data for personnel decisions (Abrami, 1989a; Abrami & d'Apollonia,
1991). In either case, the simple averaging of dissimilar items is inappropriate, as the
sketch below illustrates.
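
As a concrete illustration, here is a minimal sketch (in Python, with invented items and
ratings that do not come from any of the instruments reviewed) contrasting a single grand
mean over dissimilar items with per-dimension means:

    import numpy as np

    # Invented example: six items drawn from two dissimilar dimensions.
    dimensions = {
        "organization": [4.6, 4.4, 4.5],
        "workload":     [2.1, 2.3, 2.0],
    }

    # A single grand mean over all items hides the instructor's profile...
    all_ratings = [r for items in dimensions.values() for r in items]
    print(f"grand mean: {np.mean(all_ratings):.2f}")  # ~3.3, true of neither dimension

    # ...whereas per-dimension means preserve the diagnostic detail.
    for name, items in dimensions.items():
        print(f"{name}: {np.mean(items):.2f}")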

The SIR-II revised and updated the original five scales of the SIR (course organization
& planning; communication; faculty/student interaction; assignments, exams, &
grading; course difficulty, workload, & pace) and added three new scales to reflect
new emphases in learning research (course outcomes, student effort & involvement,
methods of instruction). The IDEA assesses three major constructs: student learning;
difficulty and workload; and teaching methods. The teaching methods construct is further
delineated in terms of communicating content and purpose, involving students, creating
enthusiasm, and preparing exams. The AZTEQ was built around four dimensions: instructor's
presentation and delivery, instructor's interaction and feedback, course components
and integration, and workload and difficulty. Clearly the SIR-II, the IDEA, and the AZTEQ
cover most of the major components of the construct of teaching effectiveness. Literature
on the Purdue Cafeteria System does not specifically discuss the teaching dimensions it
assesses; however, because instructors, departments, and institutions select rating-scale
items from a catalog to meet their specific needs, the authors posit that it provides
diagnostic evaluation along several dimensions.

Reliability of the Measure

Reliability indicates whether or not a set of items consistently measures a particular
construct or set of constructs. Reliability is a pre-condition for validity. This paper
focuses on three types of reliability. Consistency across raters (inter-rater reliability
or agreement) refers to agreement of all student ratings of one instructor or course.
Stability or consistency across time (test-retest reliability) refers to whether or not an
instructor receives similar ratings every semester. Generalizability reflects how well the
data assess the instructor's general teaching effectiveness, not just instructor
effectiveness in a particular course or term.

Inter-rater reliability provides the most common and appropriate indication of the
reliability of student rating forms (Marsh & Roche, 1997). Reliability varies with
the number of raters; accordingly, Cashin (1995) recommends a minimum of 10 raters to
achieve an acceptable reliability of .70 or better. Aleamoni (1987) cites several
rating forms with an inter-rater reliability of .9 or better, based on an average class
size of 25. These include the SIR-II, the IDEA, and the AZTEQ. No reliability studies on
the Purdue Cafeteria System have been conducted since the early 1970s, and those were
unpublished.
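
The dependence of group reliability on the number of raters follows the Spearman-Brown
prophecy formula, r_k = k*r1 / (1 + (k - 1)*r1). A minimal sketch in Python; the
single-rater reliability of .25 is an assumed, illustrative value rather than a figure
reported for any of these instruments:

    # Spearman-Brown prophecy formula for the reliability of the mean of k raters.
    # r1 = 0.25 is an illustrative single-rater reliability, not a reported value.
    def group_reliability(r1: float, k: int) -> float:
        return k * r1 / (1 + (k - 1) * r1)

    for k in (10, 25):
        print(f"{k} raters: {group_reliability(0.25, k):.2f}")
    # 10 raters -> 0.77 and 25 raters -> 0.89, roughly matching the
    # benchmarks cited by Cashin (1995) and Aleamoni (1987).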

Validity of the Measure

The validity of a measure indicates the extent to which student-rating items measure some
aspect of teaching effectiveness. Validity coefficients are interpreted differently from
reliability coefficients, as seen below:

.00-.29: not practically useful, even when statistically significant

.30-.49: practically useful

.50-.70: very useful; coefficients this high are not common when studying complex phenomena

The following discussion of validity covers content, construct, and criterion
validity, as well as the evaluation and control of potential bias.

Content Validity. Content validity incorporates estimates of the extent to which
the content of an instrument relates to what it is designed to measure. The items and
scales of the SIR-II, IDEA, AZTEQ, and the Purdue Cafeteria System were all designed to
reflect the content of what many sources (teachers, students, administrators, conferences,
and publications) define as effective teaching. Consequently, they clearly demonstrate
high content validity.

Construct Validity. Construct validity evaluates the degree to which the scores
from an instrument correspond to other measures of the underlying theoretical trait. In
this case, the student ratings should correspond with the dimensionally specific scales
chosen to represent teaching effectiveness. Researchers use factor analysis as one
approach to studying construct validity. Factors produced from student ratings should
closely duplicate the scales employed. This type of analysis was used in the development
of the SIR-II (Centra, 1998). The six scales subjected to factor analysis accounted for
88% of the variance among the scales included in the SIR-II, demonstrating the high
construct validity of the measure. The AZTEQ was also subjected to factor analysis,
revealing that the four factors represented by the questionnaire items accounted for 75%
of the total variance of the items, demonstrating the construct validity of this
instrument as well. The IDEA demonstrates construct validity differently: it relies
upon student ratings of their own learning on objectives chosen for evaluation by the
instructor. Thus, a significant correlation (.25) between student learning on course
objectives and instructor-designated importance of those objectives (in contrast to .02
for learning and importance on unrelated objectives) supports its claim to construct
validity (Hoyt, 1973). The Purdue Cafeteria System has not published or provided validity
studies addressing this issue. All of these forms offer the selection of items, or of
additional items, for a more personalized and relevant evaluation, which can increase
both construct validity and flexibility.
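
As a rough illustration of how such variance-explained figures arise, the sketch below
simulates item ratings and uses the eigenvalues of the item correlation matrix as a simple
stand-in for a full factor analysis; all numbers are invented and do not reproduce the
published SIR-II or AZTEQ analyses.

    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated ratings: 200 students x 12 items driven by 4 latent dimensions
    # (purely illustrative; not data from the SIR-II, IDEA, or AZTEQ).
    loadings = rng.uniform(0.5, 0.9, size=(12, 4))
    factors = rng.normal(size=(200, 4))
    items = factors @ loadings.T + 0.5 * rng.normal(size=(200, 12))

    # Proportion of item variance captured by the first four components.
    corr = np.corrcoef(items, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    print(f"variance explained by 4 factors: {eigvals[:4].sum() / eigvals.sum():.0%}")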

The correlation of scores with external variables provides alternative evidence of
construct validity. Specifically, low correlation with variables outside the
construct indicates lack of bias, or discriminant validity. The original SIR showed little
relationship with class size, subject area, course type, expected student grade, class
level, and a variety of other studied variables (Centra, 1976). Research on the specific
effect of potential biases has not been addressed in the literature on the IDEA, the
AZTEQ, or the Purdue Cafeteria System.

Criterion Validity. Criterion validity represents performance in relation to
particular tasks or discrete cognitive or behavioral objectives. There are two measures of
criterion validity. The first is a measure of predictive validity - the degree to which
scores predict performance. Reviews (Cohen, 1982; Feldman, 1989b) show that student
learning, as represented by scores on an external final exam (across all instructors of
the same course), has a moderate to high correlation with student ratings. Classes that
on average gave the instructor higher ratings also on average scored higher on the exam;
that is, they learned more. Only the literature on the SIR-II attempts to
specifically address predictive validity. The original SIR (Centra, 1976) demonstrated
that learning gains were related to the students' overall evaluation of the
instructor as well as to some of the scale scores. ETS expects similar validity for the
SIR-II, but suggests that additional research validate the revisions and additions in the
new form.
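
A minimal sketch of the multisection design behind such reviews: the section is the unit
of analysis, and class-mean ratings are correlated with class-mean scores on a common
final exam. All values below are invented for illustration.

    import numpy as np

    # Hypothetical multisection design: one course, eight sections, a
    # common final exam (all values invented).
    mean_rating = np.array([3.1, 3.8, 4.2, 2.9, 3.5, 4.0, 3.3, 4.4])
    mean_exam = np.array([68.0, 74.0, 81.0, 65.0, 72.0, 78.0, 70.0, 83.0])

    r = np.corrcoef(mean_rating, mean_exam)[0, 1]
    print(f"section-level validity coefficient: r = {r:.2f}")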

The second measure of criterion validity focuses on concurrent validity: the
degree to which scores on two or more measures of the same thing correspond. The
literature presents a variety of measures of teaching effectiveness as parallel to student
ratings. Student ratings of effective teaching have moderate to high correlations
with instructor self-ratings (Feldman, 1989a; Marsh, Overall, & Kesler, 1979; and
Marsh & Dunkin, 1992), as well as evaluations of teacher effectiveness by
colleagues/faculty and administrators (Kulik & McKeachie, 1975; Feldman, 1989a), and
alumni (Overall & Marsh, 1980; Braskamp & Ory, 1994). Student ratings on overall
instructor effectiveness are also highly correlated with their responses to open ended
questions on instructor effectiveness (Ory, Braskamp, & Pieper, 1980; Braskamp, Ory,
& Pieper, 1981).

The AZTEQ makes indirect claims of concurrent validity based on its intentional
resemblance to "other validated instruments." However, published research on the
instruments evaluated here indicates that concurrent validity has not been empirically
investigated. Thus none of the instruments assessed in this report has thoroughly
demonstrated the concurrent aspect of criterion validity.

Evaluation and control of potential bias. Researchers disagree on the appropriate
definition of "bias" (Cashin, 1988, 1995; Marsh, 1984). Some writers have
suggested that bias is "anything not under the control of the instructor"
(Cashin, 1995). Marsh (1984) argued against this broad definition, stating instead that
bias "should be restricted to variables not related to teaching effectiveness."
Differing definitions of bias have served to confuse the literature; thus, discussion of
bias is often more clearly organized in terms of specific variables that do or do not
require control (Cashin, 1995).

According to the Marshs (1984) definition of bias, only those variables that are
related to student ratings require control. The literature allows for the exclusion of
many variables as not significantly correlated with student ratings and therefore
non-biasing. The first group of such variables concern instructor characteristics and
include: age and teaching experience (Marsh & Hocevar, 1991), gender (Costin, 1971;
Feldman, 1992; Marsh & Roche, 1997), race (Li, 1993, cited in Cashin, 1995),
personality (Aleamoni, 1987; Braskamp & Ory, 1994; Centra, 1993; Murray, 1983),
faculty rank (Arreola, 1995; Marsh & Roche, 1997) and research productivity (Arreola,
1995; Centra, 1993; Feldman, 1987). The second group of variables cover student
characteristics and consists of student age (Centra, 1993), gender (Costin, 1971; Feldman,
1977, 1993; Marsh & Roche, 1997), level  e.g. freshman (Mc Keachie, 1979), GPA
(Feldman, 1976a) and personality (Abrami, Perry, & Leventhal, 1982). The final set of
non-biasing variables includes: class size (aleamoni, 1987;Feldman, 1984) or time of day
(Aleamoni, 1987; Feldman 1978); and the time during the second half of the term when
ratings are collected (Feldman, 1979). Variables that are correlated with student ratings
but enhance learning, such as instructor enthusiasm or expressiveness (Aleamoni, 1987;
Marsh & Roche, 1997; Marsh & Ware, 1982) and workload or course difficulty (Marsh,
1987; Marsh & Roche, 1997) are also considered non-biasing and do not require control.

Research has also pointed to several variables requiring control. Student motivation is
the most prominent of these variables. The literature supports the belief that instructors
of elective courses receive higher ratings than instructors of required courses (Arreola,
1995; Marsh & Roche, 1997), and that prior interest in course subject matter
contributes to higher ratings (Marsh & Dunkin, 1992). Academic field also affects
student ratings. According to Marsh and Roche (1997) and others (Braskamp & Ory, 1994;
Cashin, 1990; Centra, 1993; Feldman, 1978; Marsh & Dunkin, 1992), instructors of
courses in the sciences tend to be rated lower than instructors of courses in the
humanities. This may or may not constitute a biasing influence. Cashin (1990) points out
that the lower ratings in courses requiring more quantitative reasoning skills may be
associated with reduced student competency in those areas, necessitating control of
academic field. But control is not appropriate if classes within particular fields are
simply poorly taught. The final variable requiring control concerns course level.
Higher-level courses, especially graduate-level courses, tend to receive higher ratings
(Aleamoni & Hexner, 1980; Braskamp & Ory, 1994; Feldman, 1978; Marsh, 1997), but
these differences are small and less relevant to this discussion at a two-year institution.

The effect of expected grades or grading leniency is perhaps the most controversial and
most researched of the potential biases to student ratings (Arreola, 1995). To the degree
that higher grades reflect greater learning, a positive relationship between grades and
ratings is appropriate and should be expected. Research on the grading leniency effect
indicates that the effect is both weak and insubstantial (Braskamp & Ory, 1994;
Feldman, 1976a; Marsh & Dunkin, 1992; Marsh & Roche, 1997). Most of the
correlation between grades and ratings can be accounted for by self-reported student
learning (Howard & Maxwell, 1980, 1982), which supports the hypothesis that teaching
effectiveness influences both grades and ratings, and therefore that student ratings are
valid. However, other hypotheses have been posed to explain this association (Cashin,
1995; Greenwald & Gillmore, 1997): (1) student motivation (general or course-specific)
influences both learning and ratings, and should be controlled for statistically; or (2)
students give high ratings in appreciation for lenient grading. Rather than statistical
control for possible leniency, Cashin (1995) recommends peer review of the course
material, exams, graded samples of essays and projects, etc., to detect grade inflation.
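
The "accounted for by self-reported learning" finding is, in effect, a partial-correlation
argument: once learning is held constant, little association between grades and ratings
remains. A minimal sketch with invented correlations (not the values reported by Howard
& Maxwell):

    import math

    # Invented correlations among grades (g), ratings (r), and
    # self-reported learning (l); illustrative only.
    r_gr, r_gl, r_rl = 0.30, 0.55, 0.50

    # First-order partial correlation of grades and ratings, controlling
    # for self-reported learning.
    partial = (r_gr - r_gl * r_rl) / math.sqrt((1 - r_gl**2) * (1 - r_rl**2))
    print(f"r(grades, ratings | learning) = {partial:.2f}")  # ~0.03, near zero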

Comparative data provide an alternative to statistical control of the above variables. If
statistical control is used, Cashin (1995) suggests that course level and academic field
be controlled only if these variables show significant differences after controlling
for student motivation. Under these circumstances, it would be necessary to develop
level- or field-specific comparative data for reference and appropriate interpretation
of results.
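
Statistical control of this kind typically amounts to regression adjustment: regress
ratings on the potential biasing variables and report a score with those influences
removed. The sketch below is a generic, invented illustration; it does not reproduce any
instrument's actual adjustment model.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 120  # hypothetical course sections (all data below are invented)

    motivation = rng.normal(0.0, 1.0, n)            # mean student motivation
    elective = rng.integers(0, 2, n).astype(float)  # 1 = elective course
    rating = 3.5 + 0.30 * motivation + 0.15 * elective + rng.normal(0.0, 0.3, n)

    # Least-squares fit, then subtract the estimated bias terms.
    X = np.column_stack([np.ones(n), motivation, elective])
    beta, *_ = np.linalg.lstsq(X, rating, rcond=None)
    adjusted = rating - X[:, 1:] @ beta[1:]
    print(f"mean raw rating: {rating.mean():.2f}, mean adjusted: {adjusted.mean():.2f}")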

The four forms assessed in this report each use a different method of handling
potential biases. The SIR-II controls for the influence of potential bias or confounding
variables through reference to appropriate comparative data. Ideally, for our purposes,
this would consist of data from two-year institutions. The most recently published SIR-II
comparative data come from two-year colleges and universities surveyed from 1995 to 1997
(ETS, 1998). ETS also encourages institutions using the SIR-II to collect data on local
norms, which serve as an additional reference in the interpretation of evaluations.
Studies involving the original SIR (Centra, 1976) indicate that potential biases
influenced ratings only weakly, even when the influence was statistically significant.
Student motivation was the variable most highly correlated with student ratings, and ETS
recommends that it be taken into account in the comparative interpretation of SIR data.
The IDEA provides both unadjusted student rating scores and adjusted scores, which reflect
the statistical control of ratings for variables that may bias results (including class
size, student motivation, course difficulty, student effort, and other motivational
influences). Although IDEA does not maintain a comparative database, it does assist in the
collection of data for the purpose of establishing local norms. The AZTEQ technical report
indicates that biasing factors present in the research literature have a "small
magnitude" of effect on their instrument so long as comparisons among instructors or
courses include the control of the following variables: course discipline and content,
class size, course level, and the course as a requirement versus as an elective. For the
purposes of institutional reporting, the AZTEQ provides for the collection of local
comparative data. The Purdue Cafeteria System software can be purchased and includes a
normative data file for its items. Item norms are based on the performance evaluations of
all Purdue faculty who have used a Cafeteria System item since 1974. The system also
provides a routine that collects local data. Comparisons are made on the basis of local
or system norms; control of biases is not discussed. Thus either the SIR-II or the IDEA
allows relatively unbiased conclusions to be drawn, a possibility over time for the AZTEQ
as well.
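
Norm-referenced interpretation of this kind reduces to locating an instructor's score
within the comparative distribution. A minimal sketch with invented local norms:

    import numpy as np

    rng = np.random.default_rng(2)
    # Hypothetical local norms: mean overall ratings for 60 comparable
    # courses at the same institution (values invented).
    norms = rng.uniform(2.8, 4.8, 60)

    score = 4.1  # one instructor's mean overall rating
    percentile = (norms < score).mean() * 100
    print(f"local percentile: {percentile:.0f}")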

Cost of Measurement

The costs of using the instruments discussed in this report vary widely
across a number of variables, including the number of forms, the number of classes, the
customization of forms, the volume of processing, the number and type of reports
requested, the provision of data discs, and the purchase of system software and technical
support. Table 2 summarizes current price information.

Limitations of Student Ratings, Students as Raters, and Application of Ratings Information

Reviewers and meta-analysts clearly agree that evidence supports the assertion that
student ratings are related to teaching effectiveness. In fact, "student ratings are
the single most valid source of data on teaching effectiveness" (McKeachie, 1997, p.
1219). In addition, well-developed and tested student rating forms of teaching
effectiveness, such as those discussed in this report, exhibit both reliability and
validity. While these forms provide valuable, useful, and reliable information, some of
the limitations of student ratings, of students as raters, and of the application of
ratings information bear further discussion.

Ratings in general are inherently subject to two weaknesses, the "error of central
tendency" and the "halo effect," both of which tend to reduce discrimination among
individuals and yield subdued estimates of effect (Anastasi & Urbina, 1997). The
error of central tendency occurs because most people tend to avoid the extremes in rating,
so ratings tend to accumulate in the center of the scale. Thus ratings of "moderately
effective" or "somewhat ineffective" may present more modest estimates of
teaching effectiveness than are justified. The "halo effect" refers to the
tendency of raters to be unduly influenced by a favorable or unfavorable general opinion
of the person being rated, and then to let that opinion color all specific ratings. The
halo effect causes raters to make less differentiation between the specific strengths and
weaknesses of instructors or courses than is warranted, as the simulation below illustrates.
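
A stylized simulation of the halo mechanism (all parameters invented): each rater's
general impression adds shared variance to every item, inflating inter-item correlations
even when the underlying item-specific judgments are independent.

    import numpy as np

    rng = np.random.default_rng(3)
    n_raters, n_items = 500, 6

    # Independent item-specific judgments, plus a rater-level "halo" term.
    specific = rng.normal(size=(n_raters, n_items))
    halo = rng.normal(size=(n_raters, 1))
    observed = specific + 1.5 * halo  # the general impression colors every item

    # Items correlate strongly despite independent specific judgments.
    corr = np.corrcoef(observed, rowvar=False)
    off_diag = corr[~np.eye(n_items, dtype=bool)]
    print(f"mean inter-item correlation: {off_diag.mean():.2f}")  # ~0.7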

Students ratings of their own learning and of the instructors techniques
(after adjustment for known confounds) have acceptable validity. However, Cashin (1989)
concluded that students are not qualified to judge a number of other factors that
characterize exemplary instruction:

The appropriateness of the instructor's objectives

The relevance of assignments or readings

The degree to which subject matter content was balanced and up-to-date

The degree to which grading standards were unduly lax or severe

Although these issues can form essential components of a comprehensive evaluation of
teaching effectiveness, they may require methods other than student ratings to address
them.

Student ratings are valuable indicators of teaching effectiveness. They provide
constructive information to help guide the improvement efforts of instructors,
departments, and institutions. Meta-analysis (Cohen, 1980) shows that ratings feedback is
related to improved teaching. However, the greatest increases in teaching effectiveness
were found when instructors received not ratings feedback alone but ratings feedback
combined with consultation (the type of consultation varied across the studies in the
meta-analysis). Thus student ratings provide the most help when combined in a
comprehensive program that includes a variety of evaluation tools and systematic faculty
development.

References

Abrami, P. C. (1989a). How should we use student ratings to evaluate teaching? Research
in Higher Education, 30, 221-227.

Aleamoni, L. M., & Hexner, P. Z. (1980). A review of the research on student
evaluation and a report on the effect of different sets of instructions on student course
and instructor evaluation. Instructional Science, 9, 67-84.

Feldman, K. A. (1976a). Grades and college students' evaluations of their courses
and teachers. Research in Higher Education, 4, 69-111.

Feldman, K. A. (1976b). The superior college teacher from the students' view. Research
in Higher Education, 5, 243-288.

Feldman, K. A. (1977). Consistency and variability among college students in rating
their teachers and courses: A review and analysis. Research in Higher Education, 6,
233-274.

Feldman, K. A. (1978). Course characteristics and college students' ratings of
their teachers: What we know and what we don't. Research in Higher Education, 9,
199-242.

Feldman, K. A. (1979). The significance of circumstances for college students'
ratings of their teachers and courses. Research in Higher Education, 10, 149-172.

Feldman, K. A. (1984). Class size and college students' evaluations of teachers
and courses: A closer look. Research in Higher Education, 21, 45-116.

Feldman, K. A. (1987). Research productivity and scholarly accomplishment of college
teachers as related to their instructional effectiveness: A review and exploration.
Research in Higher Education, 26, 227-298.

Feldman, K. A. (1989b). The association between student ratings of specific
instructional dimensions and student achievement: Refining and extending the synthesis of
data from multisection validity studies. Research in Higher Education, 30, 583-645.

Feldman, K. A. (1992). College students' views of male and female college
teachers: Part I - Evidence from the social laboratory and experiments. Research in
Higher Education, 33, 317-375.

Feldman, K. A. (1993). College students' views of male and female college
teachers: Part II - Evidence from students' evaluations of their classroom teachers.
Research in Higher Education, 34, 151-211.

Student Instructional Report II (SIR-II)

This questionnaire gives students the chance to comment anonymously about a particular
course and the way it was taught. Using the rating scale below, students mark the one
response for each statement that is closest to their view. (Bubble forms are provided for
administration of this questionnaire.)

(5) = Very Effective
(4) = Effective
(3) = Moderately Effective
(2) = Somewhat Effective
(1) = Ineffective
(0) = Not Applicable, not used in the course, or you don't know. In short, the statement
does not apply to the course or instructor.

As students respond to each statement, they are asked to think about each practice as it
contributed to their learning in the course evaluated.

A. Course Organization and Planning
1. The instructor's explanation of course requirements.
2. The instructor's preparation for each class period.
3. The instructor's command of the subject matter.
4. The instructor's use of class time.
5. The instructor's way of summarizing or emphasizing important points in class.

B. Communication
6. The instructor's ability to make clear and understandable presentations.
7. The instructor's command of spoken English (or the language used in the course).
8. The instructor's use of examples or illustrations to clarify course material.
9. The instructor's use of challenging questions or problems.
10. The instructor's enthusiasm for the course material.

C. Faculty/Student Interaction
11. The instructor's helpfulness and responsiveness to students.
12. The instructor's respect for students.
13. The instructor's concern for student progress.
14. The availability of extra help for this class (taking class size into account).
15. The instructor's willingness to listen to student questions and opinions.

D. Assignments, Exams, and Grading
16. The information given to students about how they would be graded.
17. The clarity of exam questions.
18. The exams' coverage of important aspects of the course.
19. The instructor's comments on assignments and exams.
20. The overall quality of the textbook(s).
21. The helpfulness of assignments in understanding course material.
Many different teaching practices can be used during a course. In this section (E),
students rate only those practices that the instructor included as a part of the course
evaluated. Students are asked to rate the effectiveness of each practice used as it
contributed to their learning.

E. Supplementary Instructional Methods
22. Problems or questions presented by the instructor for small group discussions.
23. Term paper(s) or project(s).
24. Laboratory exercises for understanding important course concepts.
25. Assigned projects in which students worked together.
26. Case studies, simulations, or role playing.
27. Instructor's use of computers as aids in instruction.

For the next two sections (F and G), students use the rating scale below. They are asked
to mark the one response for each statement that is closest to their view.
(5) = Much More Than most courses
(4) = More Than most courses
(3) = About the Same as other courses
(2) = Less Than most courses
(1) = Much Less Than most courses
(0) = Not Applicable, not used in the course, or you don't know. In short, the statement
does not apply to the course or the instructor.

F. Course Outcomes
29. My learning increased in this course.
30. I made progress toward achieving course objectives.
31. My interest in the subject area has increased.
32. This course helped me to think independently about the subject matter.
33. This course actively involved me in what I was learning.

G. Student Effort and Involvement
34. I studied and put effort into the course.
35. I was prepared for each class (writing and reading assignments).
36. I was challenged by this course.

H. Course Difficulty, Workload, and Pace
37. For my preparation and ability, the level of difficulty of this course was:
Very Elementary
Somewhat elementary
About right
Somewhat difficult
Very difficult
38. The work load for this course in relation to other courses of equal credit was:
Much lighter
Lighter
About the same
Heavier
Much heavier
39. For me, the pace at which the instructor covered the material during the term was:
Very slow
Somewhat slow
Just about right
Somewhat fast
Very fast

I. Student Information
40. Which one of the following best describes this course for you?
A major/minor requirement
A college requirement
An elective
Other
41. What is your class level?
Freshman/1st year
Sophomore/2nd year
Other
42. Sex: Female / Male
43. What grade do you expect to receive in this course? (Range from A to Below C)
44. Do you communicate better in English or in another language?
Better in English
Better in another language
Equally well in English and another language

J. Overall Evaluation
45. Rate the quality of instruction in this course as it contributed to your learning (try
to set aside your feelings about the course content). Scale: Ineffective, Somewhat
ineffective, Moderately effective, Effective, and Very effective.

K. Supplementary Questions
The SIR-II provides space for instructors to add up to 10 supplementary questions in
this section of the questionnaire.

L. Student Comments
If you would like to make additional comments about the course or instruction, use a
separate sheet of paper. You might elaborate on the particular aspects you liked most as
well as those you liked least, and how the course or the way it was taught can be
improved. An additional form may be provided for your comments. Please give these comments
to the instructor.