Crowdsourcing Comments

Meet Caesar: the computer system that, like its Roman namesake, divides and conquers.

What Caesar divides is large amounts of code submitted by students in professor Rob Miller’s “Elements of Software Construction” course at the Massachusetts Institute of Technology. In doing so, it helps Miller and his graders conquer the problem of getting students timely, useful feedback by combining peer review with crowdsourcing.

“Before, [students] would hand in a problem set and the graders would start working on it, but the students would have handed in another problem set, and maybe another, before they started to get feedback about what they had been doing wrong,” Miller said.

So, Miller, a principal investigator at the MIT Computer Science and Artificial Intelligence Lab, developed Caesar.

When a problem set is due, Miller’s 200 students submit their code to Caesar, which designates chunks of code for review, based on heuristics built into the system and on input from graders, who can provide Caesar with a list of file names that definitely should or definitely should not be reviewed.
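The selection step described above can be sketched roughly as follows. This is only an illustration: the article does not say what Caesar's heuristics are, so the `min_lines` threshold, the `Chunk` type, the `select_chunks` function, and the file names are all hypothetical stand-ins. The sketch shows only how grader-supplied "always review" and "never review" lists might override a built-in heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A reviewable piece of one student's submission (hypothetical model)."""
    file_name: str
    line_count: int


def select_chunks(chunks, must_review, must_skip, min_lines=10):
    """Return the chunks to send out for review.

    must_review / must_skip stand in for the graders' file-name lists.
    min_lines is a placeholder heuristic, not Caesar's actual rule.
    """
    selected = []
    for chunk in chunks:
        if chunk.file_name in must_skip:
            continue  # graders said this file should definitely not be reviewed
        if chunk.file_name in must_review or chunk.line_count >= min_lines:
            selected.append(chunk)
    return selected


# Made-up file names, for illustration only.
chunks = [
    Chunk("Solver.java", 40),
    Chunk("Util.java", 4),
    Chunk("Generated.java", 300),
    Chunk("Main.java", 6),
]
picked = select_chunks(
    chunks,
    must_review={"Main.java"},
    must_skip={"Generated.java"},
)
```

In this sketch the graders' lists take priority over the heuristic, so a short file can still be reviewed and a long one can still be skipped.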

Breaking a long stretch of code into small portions allows each section to be reviewed by multiple people and makes the process go faster, according to Miller. It also helps ensure that the student ultimately gets valuable, correct feedback: even if one reviewer suggests an incorrect change, someone else can set that person straight.

Once the reviewable chunks of code have been selected, Caesar sends each chunk to a diverse set of reviewers, selected by algorithms. Reviewers are not told who submitted each chunk of code, though students can choose to reveal themselves later in the process.

Here, the system differs from typical peer-review processes, including those used by some massive open online courses. Rather than randomly selecting reviewers for each section of code, Caesar considers a reviewer’s role – the reviewing pool for Miller’s class includes current students, alumni, and graders – and reputation. Each reviewer has a reputation score based on the quality of their comments, as judged by how often the comments get a “thumbs up” or “thumbs down.” Miller envisions that in the future, the score could also reflect a more direct measure of comment quality, with graders occasionally assessing reviewers’ work.

Caesar, then, might assign a chunk of code to someone with a particularly high reputation score, so the student who submitted the code receives valuable feedback, and to someone else with a low reputation score, so that person can learn from the comments left by the more skilled reviewer.
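A minimal sketch of that pairing idea, under stated assumptions: the article says reputation reflects how often comments get a thumbs up or thumbs down but gives no formula, so the up-vote fraction used here is an assumption, as are the `Reviewer` class, the `pick_reviewers` function, and the neutral default of 0.5 for reviewers with no votes yet. Reviewer roles (student, alumnus, grader) are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Reviewer:
    """Hypothetical reviewer record with vote tallies on past comments."""
    name: str
    thumbs_up: int = 0
    thumbs_down: int = 0

    def reputation(self) -> float:
        # Assumed formula: share of votes that were thumbs-up.
        total = self.thumbs_up + self.thumbs_down
        return self.thumbs_up / total if total else 0.5


def pick_reviewers(author, reviewers):
    """Pair the highest- and lowest-reputation reviewers for one chunk.

    The author is excluded so nobody reviews their own code.
    """
    candidates = sorted(
        (r for r in reviewers if r.name != author),
        key=lambda r: r.reputation(),
    )
    if len(candidates) < 2:
        return candidates
    return [candidates[-1], candidates[0]]  # strongest first, then weakest


pool = [
    Reviewer("ana", thumbs_up=9, thumbs_down=1),
    Reviewer("ben", thumbs_up=1, thumbs_down=9),
    Reviewer("cy", thumbs_up=5, thumbs_down=5),
    Reviewer("dee"),
]
assigned = pick_reviewers("cy", pool)
```

Here the strong reviewer supplies reliable feedback for the author, while the weaker reviewer sees the same chunk and can learn from the stronger reviewer's comments.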

“We want to try to provide some diversity of viewpoint and feedback to the students,” Miller said. “This is one way this is potentially superior to having a single grader.”

The system also tries to ensure that selections from one student’s problem set are distributed to different people. By the end of the semester, Miller said, a student who submitted four problem sets ideally has received feedback from 50 or 60 different people.

On the reviewer’s end, Caesar focuses on similarity, rather than diversity. With a typical problem set, a student would receive between 8 and 10 files to review. (Alumni reviewers work with fewer files, since their participation is entirely on a volunteer basis.) The system aims to give each reviewer clusters of two or three files that cover similar parts of the program, so the reviewer can compare and can move more quickly through the process. Most student reviewers, Miller notes, spend about 45 minutes to an hour giving feedback on other students’ work.
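One simple way to picture the reviewer-side grouping is shown below. The article does not explain how Caesar measures similarity, so this sketch assumes, purely for illustration, that files with the same name from different students cover the same part of the program. The `cluster_by_file` function, the student labels, and the file names are hypothetical.

```python
from collections import defaultdict


def cluster_by_file(submissions, cluster_size=3):
    """Group (student, file_name) pairs into clusters of similar files.

    Similarity is approximated by identical file names, which is an
    assumption of this sketch. Each cluster holds at most cluster_size
    files; a leftover file forms a smaller cluster.
    """
    by_file = defaultdict(list)
    for student, file_name in submissions:
        by_file[file_name].append((student, file_name))

    clusters = []
    for file_name in sorted(by_file):
        group = by_file[file_name]
        for start in range(0, len(group), cluster_size):
            clusters.append(group[start:start + cluster_size])
    return clusters


submissions = [
    ("s1", "Parser.java"),
    ("s2", "Lexer.java"),
    ("s3", "Parser.java"),
    ("s4", "Parser.java"),
    ("s5", "Parser.java"),
]
clusters = cluster_by_file(submissions)
```

A reviewer handed one of these clusters sees several students' takes on the same part of the assignment side by side, which is what makes comparison quick.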

With Caesar, students now get comments on their work within three days. Though graders still review every student’s work, the process goes much faster, Miller said, because they are only looking at certain sections, chosen by Caesar, and because they can simply give a thumbs-up to a comment, rather than starting from scratch. Although the identity of the student who submitted the code remains hidden, comments and thumbs-up or thumbs-down ratings include the reviewer’s name, so it’s easy for a student to distinguish a grader’s note from another student’s.

Though Caesar was initially created to solve the problem of slow feedback, Miller sees plenty of other benefits. Students get to hear different viewpoints and have conversations about the material through the Caesar interface. Course alumni, meanwhile, get to help current students and keep their reviewing skills sharp. Eventually, Miller envisions the process could even be a recruiting tool for alumni working in the programming industry, and he hopes to make a concerted push for greater alumni involvement.

A Tool for MOOCs

MIT is also looking at introducing Caesar to some of its courses on edX, the MOOC provider started by MIT and Harvard University. Some MOOCs, mostly those in the humanities, have experimented with peer grading, though users and professors have raised concerns about the quality of the feedback. Miller believes Caesar could help address problems MOOCs have struggled with by bringing in more reviewers and allowing for a conversation, not just static comments.

Scaling Caesar shouldn’t be a problem, Miller said, since the bulk of the work is done by the program’s algorithms. He acknowledges that the quality of reviewers in a MOOC might be more varied, and that some simply might neglect the task, but he says there are ways to deal with that.

“This is a common problem in crowdsourcing,” Miller said. “One solution is making the reputation mean something. If you have to get thumbs-ups on your reviews from a variety of people in order to get a good grade in the class, that’s one way to motivate people.”

Though Miller does not use Caesar as a grading tool right now – only as a feedback mechanism – he believes it could easily be used for grading within a MOOC.

Miller also hopes that using Caesar in a MOOC will help keep course alumni engaged, something Coursera, another MOOC provider, has been focusing on recently.

Whether Caesar will be useful outside of programming courses remains to be seen, and it’s something Miller and his team are investigating.

“Software is interesting, because you really try to design it so there are independent modules,” he said. “Writing is a little bit similar in the sense that we do break down arguments into sections and paragraphs, and you can think about giving low-level feedback at least on those individual pieces. But for providing global kinds of feedback, that’s something the Caesar approach does not target very well right now.”