Effects of Instructor Gender and Ethnicity on FCQ Ratings Given by Students
Perry Sailor, October 2002

Summary of major findings
The study was designed to look for possible effects of instructor gender and ethnicity on FCQ ratings
(student ratings of courses and instructors - see
an online version of the instrument), after statistically controlling the effects of class level (graduate
vs. undergraduate), size, and department. We examined tenured and tenure-track (TTT) and non-TTT instructors
separately, and excluded TAs. The major findings of the study are:

Gender differences were inconsistent and for the most part exceedingly small; all reported results
below are therefore combined across genders.

The only large effect of instructor ethnicity on ratings is limited to non-TTT Asians (a category which
may include both Asian-Americans and non-citizens native to Asia), who are rated much lower than whites.

When the population studied is restricted to instructors who taught at least three sections in the
three-year period studied, that effect is largely eliminated and little or no effect of ethnicity on ratings can be detected. To the extent that
there is an effect for this limited group, it involves Asians, both TTT and non-TTT (rated .36 and .31 standard deviations, respectively, lower than
whites).

Details
We looked at FCQ ratings for all fall/spring terms from 3 academic years - 1999-00, 2000-01, and 2001-02.
Each observation on the data file (N=14,677) represented an instructor/section combination, with a mean
rating from students in the section on FCQ items 11 (global instructor rating) and 12 (global course rating).
The ratings were restricted to those for lectures, recitations, labs, and seminars, and to
instructor groups A (tenured and tenure-track, or TTT) and B (other primary instructors, not TTT).
Group C (teaching assistants) was excluded.

We statistically removed from each rating the effects of class size, level (undergrad vs. grad), and
individual department. This was done by entering these variables as predictors into a SAS regression
modeling procedure called GLM (general linear models), and obtaining a predicted rating based on them.
This predicted rating was then subtracted from the actual rating, yielding a residual. After converting
each section mean rating to a residual in this fashion, a single mean residual for each
instructor was then calculated by averaging the instructor's residuals across all sections taught.
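
The adjustment described above can be sketched in a few lines of Python. This is only an illustration, not the SAS GLM run used in the study: the section records are invented, class size is assumed to enter the model linearly, and department is dummy-coded with the first department as the reference level. The names SECTIONS, solve, design_row, section_residuals, and instructor_means are hypothetical.

```python
from collections import defaultdict

# Hypothetical section-level records (NOT study data):
# (instructor, department, is_grad, class_size, mean_rating)
SECTIONS = [
    ("inst1", "MATH", 0, 120, 2.9),
    ("inst1", "MATH", 0, 35, 3.2),
    ("inst2", "MATH", 1, 12, 3.6),
    ("inst2", "MATH", 0, 80, 3.0),
    ("inst3", "HIST", 0, 60, 3.3),
    ("inst3", "HIST", 1, 15, 3.7),
    ("inst4", "HIST", 0, 150, 2.8),
    ("inst4", "HIST", 1, 10, 3.5),
]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def design_row(dept, is_grad, size, depts):
    """Intercept, class size, level, and one dummy per department except the reference."""
    dummies = [1.0 if dept == d else 0.0 for d in depts[1:]]
    return [1.0, float(size), float(is_grad)] + dummies


def section_residuals(sections):
    """Fit ordinary least squares and return actual minus predicted rating per section."""
    depts = sorted({s[1] for s in sections})
    X = [design_row(s[1], s[2], s[3], depts) for s in sections]
    y = [s[4] for s in sections]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]


def instructor_means(sections, residuals):
    """Average each instructor's section residuals into one mean residual."""
    by_instructor = defaultdict(list)
    for s, res in zip(sections, residuals):
        by_instructor[s[0]].append(res)
    return {inst: sum(v) / len(v) for inst, v in by_instructor.items()}


residuals = section_residuals(SECTIONS)
per_instructor = instructor_means(SECTIONS, residuals)
for inst, mean_residual in sorted(per_instructor.items()):
    print(f"{inst}: {mean_residual:+.3f}")
```

Because the model includes an intercept, the section-level residuals sum to zero. A positive instructor mean therefore reads as "rated above what class size, level, and department alone would predict."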

The study included ratings on FCQ item 12, course ratings, as well as item 11, instructor ratings.
However, because course ratings and instructor ratings were so highly correlated (r=.91), the remainder
of this report will discuss only instructor ratings. All statements and differences reported below that
apply to one apply to the other also.

Results
All results below are stated in terms of residual scores. The mean of all the section-level residual scores across
the entire population is zero, by definition (see technical note below). This means that a residual of,
for example, 0.1 can be interpreted as "0.1 points above average, after adjusting for class size, level,
and department." The magnitude of a mean residual, as with any mean score, can be evaluated by comparing
it to the standard deviation.

The table below summarizes the results from the study:

Residual Ratings - All Instructors

                      TTT                      Not TTT
Ethnic Group      N    Mean     SD         N    Mean     SD
African-Am.      27     .01    .35        21    -.04    .40
Asian            74    -.09    .35        47    -.33    .76
Hispanic         55    -.06    .40        57    -.12    .66
Native Am.        6    -.02    .36         5     .17    .15
Unknown          51    -.07    .43       143    -.16    .55
White           944     .01    .38     1,118    -.04    .51
All           1,157     .00    .38     1,391    -.07    .53

Among TTT instructors, other ethnic groups differed little from whites; the largest difference was
the .10 lower rating for Asians, about .26 standard deviation units
(.10/.38 = .26), which is fairly small. However, among non-TTT instructors, the difference between ratings
of Asians and whites was considerably larger - the mean rating for Asians was .29 below that for whites,
a difference of over half a standard deviation. Non-TTT Asian instructors were scattered across 24
different departments, with only two departments having more than three; furthermore, these two departments
- East Asian Languages and Literature, and Economics, with 6 non-TTT Asian instructors each - did NOT
contribute much to the extremely low overall mean, since their non-TTT Asian instructors' mean ratings
were .02 and -.12, respectively.
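
The standardized differences quoted in this report can be checked with a short calculation. The sketch below assumes the denominator is the white group's standard deviation; the report does not say which SD was used, but this choice reproduces the .26 above and the .36 and .31 quoted in the summary for the minimum-three-sections analysis. The names COMPARISONS and standardized_difference are hypothetical.

```python
# (Asian mean, white mean, white SD) as read from the report's two tables.
# Using the white SD as the denominator is an assumption, not stated in the report.
COMPARISONS = {
    "all instructors, TTT": (-0.09, 0.01, 0.38),
    "all instructors, non-TTT": (-0.33, -0.04, 0.51),
    "min. 3 sections, TTT": (-0.11, 0.02, 0.36),      # from the restricted-sample table
    "min. 3 sections, non-TTT": (-0.13, 0.00, 0.42),  # from the restricted-sample table
}


def standardized_difference(group_mean, reference_mean, reference_sd):
    """Difference between two group means, expressed in reference-group SD units."""
    return (group_mean - reference_mean) / reference_sd


effects = {
    label: standardized_difference(*values)
    for label, values in COMPARISONS.items()
}
for label, d in effects.items():
    print(f"{label}: {d:+.2f} SD")
```

Dividing the non-TTT difference by the SD for all non-TTT instructors (.53) instead of the white SD (.51) gives a slightly smaller value, but it is still over half a standard deviation.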

A separate analysis was done after eliminating from the population instructors who taught only 1
or 2 sections across the 6 terms. Restricting the analysis to instructors who taught at least three
sections sharply attenuated the large negative difference between Asian non-TTT instructors and other
groups, and also resulted in a large drop in standard deviation for that group, indicating that most
of the negative effect seen in the above table was due to a few instructors who taught only one or two
sections each and received extremely low ratings. Perhaps the low ratings they received are the reason
they taught only one or two sections - their departments realized they were ineffective instructors and
gave them no more teaching assignments. This is just speculation, however.

Residual Ratings - Minimum 3 Sections Taught

                      TTT                      Not TTT
Ethnic Group      N    Mean     SD         N    Mean     SD
African-Am.      25    -.01    .36         9     .05    .32
Asian            67    -.11    .36        21    -.13    .40
Hispanic         47    -.04    .39        38    -.07    .59
Native Am.        4     .06    .35         3     .07    .07
Unknown          30    -.11    .36        53    -.08    .42
White           827     .02    .36       646     .00    .42
All           1,000     .00    .37       770    -.01    .43

Technical Note:
Because the residual values were calculated on the individual section mean ratings, before the reduction to
one mean score per instructor, the overall mean across instructors, collapsed across sections, will not
necessarily be 0; in fact, in this dataset it is -.04 for instructor ratings, -.03 for course ratings.
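
A toy example, using invented numbers rather than study data, shows why the two means can differ. When instructors teach different numbers of sections, averaging within instructor first gives every instructor equal weight, so a low-rated instructor with a single section pulls the instructor-level mean down more than that one section pulls the section-level mean. The variable names are hypothetical.

```python
# Hypothetical section residuals, chosen so that they average to zero across sections.
section_residuals = {
    "instructor_a": [0.2, 0.2, 0.2, 0.2],  # taught four sections
    "instructor_b": [-0.8],                # taught one section
}

# Mean over all five sections: each section counts once.
all_sections = [r for values in section_residuals.values() for r in values]
section_level_mean = sum(all_sections) / len(all_sections)

# Mean over the two instructors: each instructor counts once.
instructor_means = [sum(v) / len(v) for v in section_residuals.values()]
instructor_level_mean = sum(instructor_means) / len(instructor_means)

print(f"section-level mean:    {section_level_mean:.2f}")
print(f"instructor-level mean: {instructor_level_mean:.2f}")
```

Here the section-level mean is zero, while the instructor-level mean is -0.3, because the single low-rated section carries half the weight once the data are reduced to one score per instructor.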

Other Studies in the Literature

We have not done a systematic search of the higher education literature for other studies in this area.
However, a recent study by Centra and Gaubatz (2000) that specifically looked at gender effects (student,
instructor, and the interaction between them) reported that results of past studies were inconclusive, with
some studies finding no or exceedingly small effects, and a few finding that male students may rate female
instructors lower than male instructors.

Centra and Gaubatz's own study of gender bias used data from 741 classes from a variety of institutions, all
using a common evaluation form developed by the Educational Testing Service. In their analyses of students in
the same class rating either a female or male instructor, they found that female instructors received higher
ratings from female than from male students on 6 of 8 scales, including a global rating. The differences were
statistically significant but very small (about a quarter of a standard deviation), and thus of little practical
importance. Male instructors received the same ratings from male and female students.

In comparisons across classes, female students rated female instructors higher on some scales, male students
rated male instructors higher on some others, but global ratings did not differ by instructor or student gender.
And the differences were again very small, on the order of a quarter of a standard deviation, and thus of no
practical effect.