Recommendation Letter Bias: Quantify Instead?

I recently attended a presentation by ETS (Educational Testing Service, the company that runs the GRE, TOEFL, and Praxis tests) about their most recent changes to the GRE. The audience was those of us at the university involved in the graduate admissions process, so that we could better understand the new scores coming in this fall, what exactly the GRE is evaluating, and also a bit about the TOEFL and how it is administered. In addition to addressing the GRE and TOEFL, the company’s representative also introduced us to the new Personal Potential Index (PPI), which ETS is offering as a “stand-alone” product.

The PPI is intended to measure non-cognitive skills that have been shown to be correlated with success in graduate study: “knowledge and creativity, resilience, communication skills, planning and organization, teamwork, and ethics and integrity.” At first I thought, is this some kind of “personality test,” where people are just going to answer the way that they think people want to hear? Actually, it’s essentially a website where you choose your evaluators (just like you would choose your letter of recommendation writers – seems like they’d be the same people), and they rate you on 24 statements, which you can read here.

For each statement, students are rated on a 5-point scale. Of course, it’s a bit “top-heavy,” since (hopefully) few people agreeing to recommend a student would rank them poorly. The choices are: below average, average, above average, outstanding (top 5%), and truly exceptional (top 1%). Evaluators may also choose “Insufficient Opportunity to Evaluate,” and they can add comments for each category.

For the student: if they waive their FERPA rights, they do not see the results, but they can still choose which of their reviewers to include in the report (though it is not clear to me why, if you cannot see the results, you would exclude someone’s evaluation). The report can be sent to 4 schools if you’ve taken the GRE (and if you’re using this system, you’re probably applying to grad school and have already paid for the GRE), and to additional schools for $20 each. I am not a big fan of increasing financial barriers to grad school, but at least including 4 reports in the price of the GRE helps.

That’s how this new product works. No one at our university said they used it, and the representative from ETS named maybe two or three universities with at least one program requiring it. So clearly, it’s not a “big thing” yet, but perhaps it will be in the future.

The question is, should it be? Like many in the audience (at least, the very vocal ones behind me), I was pretty skeptical about the PPI. Can you really narrow down a person’s “potential” to succeed with a simple number? Isn’t that half of my beef with the GRE and other standardized tests anyway? I also couldn’t shake the suspicion that every evaluator is just going to rank their student in the top 1% anyway, so students are going to be stuck paying $20 to distribute a report whose scores look just as high as everyone else’s.

But then I started thinking about this in the context of recommendation letters, which this service is meant to – well, I’m not sure if it’s meant to supplement or replace them. And I remembered how unconscious bias has been shown to come into play often with recommendation letters, and that this can unfairly (and unintentionally) harm the career prospects of women. (I would also be interested to hear if such bias has been found with other cultural groups, especially if there is a disconnect in understanding between the writer and the student.) For those not familiar, I’m referring to the 2003 study by Trix and Psenka and the 2009 study by Madera, Hebl, and Martin. MHM09 found “(a) that women were described as more communal and less agentic than men (Study 1) and (b) that communal characteristics have a negative relationship with hiring decisions in academia that are based on letters of recommendation (Study 2).” TP03’s abstract states:

Letters written for female applicants were found to differ systematically from those written for male applicants in the extremes of length, in the percentages lacking in basic features, in the percentages with doubt raisers (an extended category of negative language, often associated with apparent commendation), and in frequency of mention of status terms. Further, the most common semantically grouped possessive phrases referring to female and male applicants (‘her teaching,’ ‘his research’) reinforce gender schema that tend to portray women as teachers and students, and men as researchers and professionals.

When letters focus more on the “concrete” signs of success (publications, a take-charge attitude), as they do for male applicants, they tend to be valued more highly than letters that address more social or personal aspects, as recommenders tend to do for women. Remember, these are subtle differences being analyzed in the letters, and they’re not because the women had no accomplishments to talk about (these were all letters for successful applicants). It’s all about what first comes to mind, how you choose to phrase things, what you emphasize, etc.

Given that letters of recommendation can be subject to bias, could a well-researched, quantifiable, and standardized recommendation system (still with room for personal comments and insight, of course) be more fair? One issue with standard letters of recommendation is that they are so open-ended, and it’s in those situations that unconscious bias and schemas are most likely to come into play. The University of Michigan’s STRIDE group (Committee on Strategies and Tactics for Recruiting to Improve Diversity and Excellence) emphasizes focusing on multiple, specific criteria during evaluation (in the specific case of hiring). They offer a “Candidate Evaluation Tool,” kind of a worksheet, where evaluators have to specifically mark the candidate’s potential in a variety of specific areas. This avoids “global judgements,” where biases are more likely to creep in. Forcing an evaluator to sit down and analyze based on specific criteria has helped candidates who might otherwise have been harmed by “snap judgements.”

Are letters of recommendation the “global judgements,” while the PPI can serve much like STRIDE’s Candidate Evaluation Tool? Not only does this ensure that important aspects of an applicant’s success aren’t forgotten, but it also means flaws can be honestly pointed out (a letter writer probably won’t mention a candidate’s poor ability to take criticism, but could honestly mark them as “average” when specifically asked). Not all bias can be removed; if, over years of interaction, evaluators still focus more on communal aspects with their female students, those impressions will carry into the ratings. But it’s a step that forces evaluators to think about specific aspects of personality, one question at a time.

On the other hand… changes aren’t implemented in a vacuum. If you implement any change in hiring or evaluation, it’s important to train the people using it. Those responsible for admissions, mostly research professors, will probably not receive such training in the PPI if it were to become a new standard. I worry that some would just focus on scores for the statement “Is among the brightest persons I know” and forget about teamwork, integrity, and resilience. Remember, that was one of the problems mentioned in MHM09: that these aspects were not particularly valued (that study was specifically in medicine). Will the PPI be used as intended, or largely ignored as “irrelevant?”

Even after the amount of thinking I had to put into this issue just to write this post, I’m still on the fence. On the one hand, this is just another product that may cost applicants more money; on the other hand, the idea is very consistent with my understanding of best practices in evaluation. The more specific the evaluation criteria, the less likely schemas and unconscious bias are to creep in, as they do with open-ended evaluations. If the criteria are not valued by those using the evaluations, however, the PPI will not have gained anything for graduate school applicants.

I’m curious what others think. Could the PPI be helpful, harmful, or neither? If the idea is on track but the implementation is lacking, what could be done to promote its usefulness?