What are the concerns with only using student evaluations to assess teaching?

Here is a quote from a comprehensive and conservative review of Student Evaluations of Teaching (SET), published in Review of Educational Research in 2013: “This review of the state of the art in the literature has shown that the utility and validity ascribed to SET should continue to be called into question. … many types of validity of SET remain at stake. Because conclusive evidence has not been found yet, such evaluations should be considered fragile, as important stakeholders (i.e., the subjects of evaluations and their educational performance) are often judged according to indicators of effective teaching (in some cases, a single indicator), the value of which continues to be contested in the research literature.”

Toftness et al. (2017) found that “Instructor fluency leads to higher confidence in learning, but not better learning” due to an “illusion of learning” associated with lecture-based learning.

Here is a striking study with a small ‘n’ but big differences in how students rated online instructors, depending simply on whether they were told the instructor was male or female: “In promptness, for example, the instructors matched their grading schedules so that students in all groups received feedback at about the same rate. The instructor whom students thought was male was graded a 4.35 out of 5 for promptness, while the instructor perceived to be female received a 3.55.”

To explore “gendered language” in teaching reviews, this interactive chart lets you see the frequency of various words used to describe male and female teachers in about 14 million reviews from RateMyProfessor.com. http://benschmidt.org/profGender

Purdue’s Senate Faculty Affairs committee made the following recommendation: “Academic units are strongly encouraged not to use student responses to these questions for summative evaluation purposes, i.e. for promotion and tenure decisions.”

University of Michigan strongly recommends using teaching portfolios; they also recommend putting student evaluation numbers into context and not using individual student letters.

Student ratings should be only one of multiple measures of teaching: The most common additional sources of data about the faculty member’s teaching include written student feedback, peer and administrator observations, internal or external reviews of course materials, and more recently, teaching portfolios and teaching scholarship (instructor assessment of teaching effectiveness). Data collection for each of these additional data sources should be systematic rather than informal.

A faculty member’s complete history of student ratings should be considered, rather than a single composite score.

Small differences in mean (average) ratings are common and not necessarily meaningful: Variations of up to 0.4 points within a course are not unusual, although what counts as a small difference depends on the rating scale.

Examine the distribution of scores across the entire scale, as well as the mean: The median or the mode is a better measure of central tendency in skewed distributions.
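As a rough illustration of why the full distribution matters, the sketch below summarizes one set of ratings. The ratings are made-up example data on an assumed 1-to-5 scale, and `summarize_ratings` is a hypothetical helper, not part of any evaluation system.

```python
from collections import Counter
from statistics import mean, median, mode


def summarize_ratings(ratings, scale=(1, 2, 3, 4, 5)):
    """Return the count at each scale point plus mean, median, and mode."""
    counts = Counter(ratings)
    return {
        # Include scale points nobody chose, so gaps in the distribution are visible.
        "distribution": {point: counts.get(point, 0) for point in scale},
        "mean": mean(ratings),
        "median": median(ratings),
        "mode": mode(ratings),
    }


# Hypothetical ratings: most students chose 5, and two chose 1.
ratings = [5, 5, 5, 5, 5, 5, 4, 4, 1, 1]
summary = summarize_ratings(ratings)

for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up example the mean is 4.0 while the median and mode are both 5, because two low ratings pull the average down. Reporting only the mean would hide the fact that most respondents gave the top rating.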

Avoid comparing faculty to each other or to a unit average in personnel decisions: Student ratings instruments are not designed to gather comparative data about faculty. The faculty who are most likely to be negatively impacted by faculty-faculty comparisons are those who do not fit common stereotypes about the professoriate—typically women and faculty of color.

Focus on the most common ratings and comments rather than emphasizing one or a few outlier ratings or comments. Too often, faculty and administrators seem to focus their attention on rare comments, possibly because they are typically the most vehement or the most negative. Evaluators need to be particularly vigilant and self-aware when they are reading or summarizing students’ comments. One of the best ways to ensure that summaries of comments represent students’ views is to sort student comments into groups based on similarity and label the group with a theme, then rank the themes based on the frequency of comments in each. Some common themes include: Labs, Homework, Teamwork, Lecture, Availability, Textbook, and Exams.
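The ranking step described above can be sketched in a few lines, assuming a reader has already sorted each comment into a theme by hand. The comments, theme labels, and the `rank_themes` helper are hypothetical examples, not real student data or an established tool.

```python
from collections import Counter


def rank_themes(coded_comments):
    """Rank themes by how many comments were assigned to each.

    coded_comments is a list of (comment_text, theme) pairs in which a
    human reader has already chosen the theme for every comment.
    Returns (theme, count, share_of_all_comments) tuples, most frequent first.
    """
    counts = Counter(theme for _, theme in coded_comments)
    total = sum(counts.values())
    return [(theme, n, n / total) for theme, n in counts.most_common()]


# Hypothetical comments, each already labeled with a theme.
coded_comments = [
    ("Lectures were clear and well organized.", "Lecture"),
    ("The lectures moved too fast for me.", "Lecture"),
    ("I liked the examples used in lecture.", "Lecture"),
    ("Homework took far longer than expected.", "Homework"),
    ("Homework problems matched the exams well.", "Homework"),
    ("The exam was unfair.", "Exams"),
]

ranked = rank_themes(coded_comments)
for theme, count, share in ranked:
    print(f"{theme}: {count} comments ({share:.0%})")
```

Listing every theme with its count keeps a single vehement comment (here, the one about exams) in proportion to the rest of the feedback.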

Consider putting scores into context by comparing them to scores for similar courses taught by other instructors.

Consider indicating the faculty member’s demographics, so that evaluators can take possible biases into account.

Resources from CEILS Teaching Evaluation Symposium

CEILS hosted a symposium at UCLA on June 12, 2018, called “Exploring Practical Ways to Inspire and Reward Teaching Effectiveness and Instructional Innovation”. The event details can be found here. Several visiting speakers, including Emily Miller, Associate Vice President for Policy at AAU, Sierra Dawson, Associate Vice Provost for Academic Affairs at the University of Oregon, and Diane O’Dowd, Vice Provost for Academic Personnel at UC Irvine, shared resources on student ratings of instruction, peer teaching observations, and self-assessment of teaching practices, among others. Many thought leaders from the UCLA community also took part as panelists, moderators, and attendees throughout the day. Please explore the resources shared by our colleagues.

Click here to access the UCLA Box folder with handouts, rubrics, guidelines, and other materials shared during the symposium. A password is required to access the Box folder. Please email us at media@ceils.ucla.edu to request the password.

Click here to view the spreadsheet with a list of the documents and Box folder locations.

CEILS also hosted visiting Scientific Teaching Scholar Philip Stark, Professor of Statistics and Associate Dean of Mathematical and Physical Sciences at UC Berkeley, who gave a talk on November 2, 2018, entitled “Student Evaluations of Teaching: Managing Bias and Increasing Utility”. Resources shared at this event can be downloaded from the event page found here; these include slides from his talk, UC Berkeley’s guide for documenting teaching effectiveness, and their guide to peer review of course instruction. We encourage you to check out these and our growing list of resources.