Month: February 2018

I’ve been noodling on writing this post for quite a while. Mostly because “performance” is such a big topic to unpack, and partly because I’m still working on my own holistic answer. But it’s time to remind myself that “perfect is the enemy of good” and there’s value in trying to summarize my point of view on this topic even if it’s still a work-in-progress.

A good starting point for this conversation is getting on the same page on how we got to where we are today. The Performance Management Revolution provides a great recap of the evolution of performance reviews and the environmental changes that drove it over the last 80(!) years.

One of the key challenges in the debate around performance reviews is that in its 80 years of history the term has gone through what Martin Fowler refers to as “semantic diffusion”:

Semantic diffusion occurs when you have a word that is coined by a person or group, often with a pretty good definition, but then gets spread through the wider community in a way that weakens that definition. This weakening risks losing the definition entirely — and with it any usefulness to the term

Today there is no clear definition of the attributes that turn a particular set of conversations into a “performance review”. Many articles use this term without providing their own definition, assuming that we’re all talking about the same thing.

If we were to boil down performance reviews to a shared core definition, it would be something along the lines of: “a program to facilitate periodic feedback conversations”. Hardly anybody objects to the argument that having periodic feedback conversations is a valuable organizational practice. Most calls for “abolishing the performance review” actually target specific program design elements that the critics consider to be part of the core definition of what a performance review is.

It is a fair statement that most performance review programs are designed in ways that ignore some human aspects of the interaction, and especially lessons from the last 30 years of research in the fields of psychology, sociology and neuroscience. This in turn leads many of them to have unintended or sub-optimal results. But this also makes it clear, in my mind at least, that the solution here is to integrate those lessons into the program design rather than get rid of the program altogether.

Humanistic performance review principles

With that in mind, I’d like to highlight some of the interim insights that should be taken into account when designing such programs:

Reduce functional overloading: many programs today suffer from “functional overloading”. We try to do too many things with the same program and end up in a “jack of all trades, master of none” situation, since a program element optimized for one need often causes harm to another. For example, using a performance program to generate documentation of poor performance, in order to minimize legal risk in performance-based terminations, will likely limit the effectiveness of the developmental feedback that it provides. Deconstructing monolithic performance programs and decoupling the components that serve different organizational needs is a good step towards starting to address that challenge.

More frequent, but not too frequent: we are all prone to “recency bias”. Our memory is far from perfect and we tend to overweight the importance of things that have happened more recently in formulating our judgment. This suggests that an annual review cycle is probably too long. But the solution is not real-time/continuous feedback either. There is a lower bound on how short the cycle can be, since we need to give the changes we’ve made in the last cycle enough time to impact the outcomes. Otherwise, we’re just introducing thrash. Furthermore, giving good feedback typically requires a period of reflection and thoughtful composition, which we would not be able to do if we were to give it “on-the-fly”.

Minimize subjectivity: while subjectivity cannot be eliminated altogether, it can certainly be reduced. On the receiving end, this means accounting for the overconfidence effect; on the evaluating end, it means reducing the idiosyncratic rater effect.

Forward-looking: rather than focus on what happened in the past, the conversation should focus on what should be sustained or changed going forward.

Maximize credibility: the credibility of the person providing feedback affects our motivation to take action based on it. Three key levers can be addressed structurally in the program design: a) a healthy mix of sustain (“positive”) and change (“negative”) feedback; b) structuring the feedback in a way that separates facts from interpretations; c) making a request to change while, hand-in-hand, taking responsibility for one’s own interpretations.

A couple of harder questions

The design principles listed above will go a long way in helping to design more effective performance review programs. However, while far from trivial or easy, they are not the hardest part of this challenge.

To use Ronald Heifetz’s distinction, I believe they capture many of the “technical” aspects of the challenge, but it’s the “adaptive” ones, the ones that have to do with the values underlying the system, that are the most difficult to address. And those will change from organization to organization.

A couple of harder, more adaptive questions come to mind:

How do we define “performance”? A different definition will lead to different forms of measurement and evaluation: Do we take into account efforts, or just results? Do we account for factors that were outside of our control but influenced the results? How do we deal with the relationship between individual performance and group performance? What about investments that haven’t yielded results just yet? How do we account for intangibles?

What is the role of power in the evaluation of performance? In High Output Management, Andy Grove offers the following:

The review process also represents the most formal type of institutionalized leadership. It is the only time a manager is mandated to act as judge and jury: we managers are required by the organization that employs us to make a judgment regarding a fellow worker, and then to deliver that judgment to him face-to-face.

…

“This is what I, as your boss, am instructing you to do. I understand that you do not see it my way. You may be right or I may be right. But I am not only empowered, I am required by the organization for which we both work to give you instructions, and this is what I want you to do…”

Some organizations may agree with this definition. Some may not. Their performance review programs will be fundamentally different as a result…

It’s a short book that I’d highly recommend, but it wasn’t an easy read for me. The highly spiritual context in which it is set constantly conflicted with my extremely rational worldview and required a very deliberate process of parsing out the highly insightful pieces, instead of just writing it all off as “spiritual mumbo-jumbo”. The effort was well worth it.

In a nutshell, the four agreements are:

Be impeccable with your word

Don’t take anything personally

Don’t make assumptions

Always do your best

Certainly good ideals to aspire to even though they’d always be just a little out-of-reach. But the thing that stuck with me most from the book was not the agreements themselves, but an underlying metaphor that Ruiz constantly uses. The metaphor of the Dream:

And he came to the conclusion that human perception is merely light perceiving light. He also saw that matter is a mirror — everything is a mirror that reflects light and creates images of that light — and the world of illusion, the Dream, is just like smoke which doesn’t allow us to see what we really are.

…

He had discovered that he was a mirror for the rest of the people, a mirror in which he could see himself. “Everyone is a mirror,” he said. He saw himself in everyone, but nobody saw him as themselves. And he realized that everyone was dreaming, but without awareness, without knowing what they really are. They couldn’t see him as themselves because there was a wall of fog or smoke between the mirrors. And that wall of fog was made by the interpretation of images of light — the Dream of humans.

The Dream metaphor is a beautiful one-word summary of the fact that we all experience reality in our own subjective way, and our actions are the result of the way we subjectively interpret that reality.

Not taking things personally is still a big area of growth for me. Here’s how the Dream metaphor can help in that context:

Nothing other people do is because of you. It is because of themselves. All people live in their own dream, in their own mind; they are in a completely different world from the one we live in. When we take something personally, we make the assumption that they know what is in our world, and we try to impose our world on their world.

…

When you take things personally, then you feel offended, and your reaction is to defend your beliefs and create conflicts. You make something big out of something so little, because you have the need to be right and make everybody else wrong. You also try hard to be right by giving them your own opinions. In the same way, whatever you feel and do is just a projection of your own personal dream, a reflection of your own agreements. What you say, what you do, and the opinions you have are according to the agreements you have made — and these opinions have nothing to do with me.

I found the Dream to be a powerful mnemonic that helps me catch myself when I impulsively take things too personally, defuse from that perception, and create the capacity to take more deliberate action instead.

The clear “winner” in its ability to predict job performance on a standalone basis, according to Schmidt’s analysis, is the “General Mental Ability” (GMA) test, such as the O*NET Ability Profiler, the Slosson Intelligence Test and the Wonderlic Cognitive Ability Test. These tests have an average predictive validity of 65% (that is, a correlation of .65 with job performance). This represents a 14-point increase in their predictive ability compared to the ’98 data, unseating the “work-sample test” (54% in ’98, 33% in ’16). The average here tells only part of the story, as more refined analysis suggests a significant difference in predictive ability depending on job type: 74% for professional and managerial jobs, and 39% for unskilled jobs.

Source: Schmidt (2016)

Interestingly, no organization I’ve ever worked for or heard of seems to be using GMA tests. One reason might be that the consistency and precision of the method, coupled with the large sample sizes, make it easier to prove that these tests introduce both gender and racial bias. This seems unfortunate, since none of the other evaluation methods are bias-free; their bias is just harder to measure. Being able to measure bias precisely allows us to correct for it: post hoc in the short term, and through better test design in the long term.

Next up are employment interviews (58%), where “structured interviews” refer to interviews in which both the questions and the criteria for evaluating answers are consistent across candidates. The MSA and PSQ questions I discussed here are a good example of structured interview questions. The list goes down from there all the way to graphology and age, which have little to no predictive power. While structured and unstructured interviews don’t seem to differ in predictive power, unstructured interviews are certainly more bias-prone.

Since GMA seems to be the best measure for making hiring decisions, Schmidt looks at all other measures relative to it, asking the following question:

When used in a properly weighted combination with a GMA measure, how much will each of these measures increase predictive validity for job performance over the .65 that can be obtained by using only GMA?

In this case, the focus shifts from looking solely at each measure’s standalone predictive ability to also taking into account its covariance with GMA (smaller covariance = better).
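To make the “smaller covariance = better” point concrete, here is a minimal Python sketch. It uses the standard textbook formula for the multiple correlation of one outcome with two optimally weighted predictors; it is not taken from Schmidt’s paper, and the function name, the second measure’s validity of .35 and the overlap values are all made-up assumptions for illustration. Only the .65 for GMA comes from the text above.

```python
import math


def combined_validity(r_gma: float, r_other: float, r_between: float) -> float:
    """Multiple correlation R between job performance and two predictors.

    r_gma     -- validity (correlation with job performance) of the GMA measure
    r_other   -- validity of the second measure
    r_between -- correlation between the two measures (their overlap)

    Assumes the three values form a valid correlation matrix.
    """
    if not -1.0 < r_between < 1.0:
        raise ValueError("r_between must be strictly between -1 and 1")
    r_squared = (
        r_gma**2 + r_other**2 - 2 * r_gma * r_other * r_between
    ) / (1 - r_between**2)
    return math.sqrt(r_squared)


R_GMA = 0.65    # GMA-only validity quoted in the post
R_OTHER = 0.35  # hypothetical validity of a second measure (illustration only)

# Hypothetical overlap values: the more the second measure overlaps with GMA,
# the less it adds on top of GMA alone.
for r_between in (0.0, 0.2, 0.4):
    combined = combined_validity(R_GMA, R_OTHER, r_between)
    print(f"overlap={r_between:.1f}  combined={combined:.3f}  gain={combined - R_GMA:+.3f}")
```

With these illustrative inputs, the gain over GMA alone is largest when the second measure is uncorrelated with GMA and shrinks as the overlap grows, which is why a measure with modest standalone validity can still be a valuable complement.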

The more extensive summary table is shown below, but the bottom line is this:

Overall, the two combinations with the highest multivariate validity and utility for predicting job performance were GMA plus an integrity test (mean validity of .78) and GMA plus a structured interview (mean validity of .76)

Source: Schmidt (2016)

While employment interviews maintain their position at the top of the list, integrity tests such as the Stanton Survey, Reid Report and PSI take the #1 spot. Again, not a tool commonly used today.

So where does all of this leave us? In my opinion, the pendulum in recruiting may have swung too far from the quantitative assessment pole to the qualitative assessment pole. It seems like we’d get much better outcomes from our recruiting efforts if GMA and integrity assessments replaced some of our structured interviews, all while we work diligently to remove bias from our recruiting efforts, regardless of the assessment methods we use.