So how do you remove as much of the luck, noise, and bias from candidate review as you can? Structured interviews are a great way to consistently ask candidates the same questions, but we wondered whether the way we review candidates’ answers was as consistent as it could be. We thought that instead of grading all of one candidate’s responses and then moving on to the next candidate, we should “chunk” responses, reviewing different candidates’ responses to the same question together (a trick school teachers grading tests have used for years). So we ran an experiment to see whether these quirky ordering effects impacted hiring decisions, and whether our chunking method actually succeeded in reducing bias.

Using an online experimental platform, we asked 150 reviewers to rate 100 unique responses to four work-related challenges, drawn from real applications to the Behavioural Insights Team. Responses were anonymized, chunked by question, and their order within each question was randomized across every reviewer. So each reviewer scored a batch of randomly ordered responses to question 1 on a scale of 1 (unsatisfactory) to 5 (exceptional), then a batch of randomly ordered responses to question 2, and so on. We then compared each reviewer’s score for a response to a benchmark: the average of every reviewer’s score for that response.
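As a rough sketch of that design (every name, and the noisy `score()` stand-in for a human judgement, is our illustrative assumption, not the platform’s actual code), the per-reviewer shuffling and the benchmark average look like this:

```python
import random
import statistics

# Illustrative sketch of the review design: each reviewer scores every
# response to a question in an independently shuffled order, and the
# benchmark for a response is the mean score across all reviewers.
N_REVIEWERS = 150
N_RESPONSES = 100

random.seed(1)

# Stand-in for human judgement: a latent "true" quality plus reviewer noise.
true_quality = [random.uniform(1, 5) for _ in range(N_RESPONSES)]

def score(response):
    noisy = true_quality[response] + random.gauss(0, 0.7)
    return min(5, max(1, round(noisy)))  # clamp to the 1-5 scale

# Each reviewer sees the responses in their own random order (the "chunk").
all_scores = {}
for reviewer in range(N_REVIEWERS):
    order = random.sample(range(N_RESPONSES), N_RESPONSES)
    all_scores[reviewer] = {resp: score(resp) for resp in order}

# Benchmark: every reviewer's score of that response, averaged.
benchmark = [
    statistics.mean(all_scores[r][resp] for r in range(N_REVIEWERS))
    for resp in range(N_RESPONSES)
]
```

Because each reviewer’s order is shuffled independently, position in the batch is uncorrelated with response quality, which is what lets ordering effects be measured cleanly.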

The results were clear: context matters. Specifically, we found three ordering effects at play in the candidate review process:

Reviews get more accurate over time. We tested whether the score given to a particular response differed depending on whether the reviewer read it first in the batch, 9th, 17th, or last, by looking at the deviation between the reviewer’s score and the average score across all reviewers. The group’s average deviation fell by roughly 17% from the first to the last rated response for the first question, a highly statistically significant result.
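A minimal sketch of that accuracy measure, using simulated ratings in which reviewer noise shrinks with position (the noise schedule and every variable name are invented for illustration, and the latent true quality stands in for the benchmark average):

```python
import random
import statistics

# Simulated illustration of the deviation-by-position measure. Reviewer
# noise is made to shrink with position (a built-in "calibration" effect);
# the true quality stands in for the all-reviewer benchmark.
N_REVIEWERS, N_RESPONSES = 150, 100

random.seed(2)
true_quality = [random.uniform(1, 5) for _ in range(N_RESPONSES)]

def deviation_by_position():
    # position in the batch -> list of |reviewer score - benchmark| gaps
    gaps = {pos: [] for pos in range(N_RESPONSES)}
    for _ in range(N_REVIEWERS):
        order = random.sample(range(N_RESPONSES), N_RESPONSES)
        for pos, resp in enumerate(order):
            noise_sd = 1.0 - 0.3 * pos / (N_RESPONSES - 1)  # calibration
            given = true_quality[resp] + random.gauss(0, noise_sd)
            gaps[pos].append(abs(given - true_quality[resp]))
    return {pos: statistics.mean(g) for pos, g in gaps.items()}

avg_gap = deviation_by_position()
# Later positions show a smaller average gap to the benchmark.
```

Grouping the gaps by position rather than by reviewer is what surfaces the calibration trend: any one reviewer is noisy, but averaging 150 reviewers at each position isolates the positional effect.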

It helps to be first in line. The average rating across all candidates is 3.35, but being reviewed first raises that to 3.52. Over multiple questions, this can turn out to be decisive. Interestingly, the positive effect of being first appeared not only for the first question but also for each subsequent one. Combined with finding 1, this suggests that reviewers go through some general calibration when they start out, but also recalibrate slightly within each question.

It matters who comes before you. We took the top 10 and bottom 10 candidates on each question (as rated by all reviewers) and looked at their impact on the scores of the next few candidates. We found strong evidence that the score a candidate received was affected by the strength or weakness of the candidate reviewed immediately before, even after controlling for our earlier findings. An average candidate gets a lower score if they come after a phenomenal candidate, but a higher score if they come after a poor one. These ‘spillover’ effects are more extreme in the latter case, and they affect more than just the next candidate: candidates two and three down the line also benefit from having a poor candidate reviewed recently!
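One way to quantify such spillover is to regress the deviation of each score on the quality of the previously reviewed candidate. Here is a toy simulation with an invented contrast effect built in (the -0.15 coefficient and all names are our assumptions, not the study’s numbers):

```python
import random
import statistics

# Toy illustration of the spillover analysis: scores dip after a strong
# candidate and rise after a weak one, via an invented -0.15 contrast
# coefficient, which we then recover with a least-squares slope.
N_REVIEWERS, N_RESPONSES = 150, 100

random.seed(3)
true_quality = [random.uniform(1, 5) for _ in range(N_RESPONSES)]

rows = []  # (previous candidate's quality, deviation of current score)
for _ in range(N_REVIEWERS):
    order = random.sample(range(N_RESPONSES), N_RESPONSES)
    for pos in range(1, N_RESPONSES):
        prev_q = true_quality[order[pos - 1]]
        contrast = -0.15 * (prev_q - 3.0)  # built-in spillover effect
        given = true_quality[order[pos]] + contrast + random.gauss(0, 0.5)
        rows.append((prev_q, given - true_quality[order[pos]]))

# Least-squares slope of score deviation on previous candidate's quality.
xs = [x for x, _ in rows]
mx = statistics.mean(xs)
my = statistics.mean(y for _, y in rows)
slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
# A negative slope reflects the contrast effect: the stronger the
# predecessor, the lower the score the current candidate receives.
```

Because each reviewer’s ordering is random, the predecessor’s quality is uncorrelated with the current candidate’s quality, so a nonzero slope can only come from spillover rather than from which candidates happen to be adjacent.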