Why I don't believe in Evidence-Based Management (3 of 6): Problems with 'Evidence'

Evidence-Based Management (EBM) is an emerging movement that aims to improve the quality of decision-making by urging managers to use 'the best available evidence' to support their decisions. Recently, Scrum.org has picked up on EBM and is now actively promoting it as the next logical step in improving organizational agility within the context of software development. In my first post, I introduced the series and summarized my objections. In my previous post, I discussed the history of EBM and how Scrum.org applies it to software development. In this post, I will discuss one of the strongest objections that I have to EBM: its definition of evidence.

If Evidence-Based Management is about using the 'best available evidence' to support a decision, we have to investigate what it means to consider something 'Evidence'. The central assumption behind EBM is that the quality of management decisions improves if we shift our attention from subjective to objective evidence. Subjective evidence includes gut feelings, experience, intuitions, hopes and interpretations and is highly dependent on our perspective. Its accuracy can't be measured, nor can it be evaluated by others. On the other end of the continuum lies objective evidence, which exists outside of ourselves and can be appraised by others. This evidence is direct, clear and indisputable and includes metrics, scientific research and other 'hard' data. If we shift our decisions from subjective to objective evidence we effectively rationalize our process of making decisions and eliminate personal biases. In this ideal scenario, similar decisions should be reached by others using the same data.

Setting the stage

There are two key assumptions behind Evidence-Based Management. The first is that we are able to distinguish between subjective and objective evidence, or 'circumstantial' and 'direct' evidence, or ‘e’ and ‘E’ evidence (Rousseau, 2005). The second is that it is possible to gather objective, 'direct' or 'E' evidence in the context of managerial decision-making, either from local or scientific research, that can be used as 'clear and indisputable proof or contradiction of the truth ...' (Scrum.org, 2014a). Below, I will show that these key assumptions don't hold. Not only is the qualitative distinction between 'e' and 'E' a gross oversimplification of how evidence is treated in the scientific method (e.g. Kanazawa, 2008), it also ignores how hard it is to actually gather evidence that comes close to being direct, 'E' or objective, let alone in the context of managerial decisions. I will show that any evidence from organizational research (either local or scientific) is more or less subjective at best, also when compared to the standards of Evidence-Based Practices. From these arguments I will build my primary objection against EBM and how it is being applied: it persuades decision-makers to believe that they are making 'objective' decisions, even though all the personal biases, intuitions and interpretations are still ever present.

In order to build my objection we first need to dive into a working definition of 'Evidence' and how it is being gathered by the scientific method.

So, what's Evidence?

In science, evidence is derived through empirical research and logical inference (Longino, 1979). This evidence consists of verified assertions about causality between variables as hypothesized by some theory, or the rejection thereof. For example, ‘public health policy saves lives’, ‘steroid treatment works for Bell’s Palsy’ or ‘doppler probes reduce complications amongst patients undergoing surgery’ (Academy of Medical Royal Colleges, 2013). To verify these assertions we need to analyse certain variables and outcomes (such as the number of complications, the death toll, etcetera). It is important to distinguish between measurements, outcomes and metrics on the one hand and evidence on the other. Although evidence is generally derived through measurement, a measurement itself is not evidence without an assertion to provide meaning. Within the context of software development, Scrum.org presents ‘Current Value’, ‘Time to Market’ and ‘Ability to Innovate’ as important categories of outcomes that can be measured to evaluate the effectiveness of practices and decisions (Scrum.org, 2014b). But these outcomes are not evidence of anything without an assertion (like 'Time to market has increased this year').

So, ‘evidence’ should be understood as a verified causal assertion, such as the assertion that ‘pair coding improves code quality’ (Hannay, Arisholm & Sjøberg, 2009) or that ‘gradual implementation of downsizing implementations increases organizational improvement’ (Cameron, Freeman & Mishra, 1995), or that ‘using Agile methodologies increases the chance of project success by 30%’ (Standish Group, 2013). So these are sensible practices since there is scientific evidence to support their use (also read my next post if you want to know how strong this evidence really is). But the quality of the evidence depends on the quality of the methods employed to verify the assertion. And that's where it gets hard.

Is all evidence of equal quality?

It isn't. And this requires that decision-makers are able to assess the quality of evidence. Interestingly, EBM is of little help here. But in Evidence-Based Medicine there are several taxonomies, like GRADE (Guyatt et al., 2010), that help to classify research according to statistics, methodology, publication bias, reliability, validity, risk, consistency, etc. The bottom line here is that good evidence originates from aggregations (or so-called meta-analyses) of randomized controlled design experiments (RCDs) (Sackett, Strauss & Richardson, 2000). Evidence from cohort research, case studies, uncontrolled or non-repeated experiments is considered to be of low quality (ibid). It should be noted, however, that these taxonomies allow for far more nuance than just ‘weak’ and ‘strong’ or ‘direct’ and ‘circumstantial’.

Evidence-Based Practices often herald the use of 'Randomized Controlled Design' experiments (RCDs) as the 'gold standard' for evidence, much like the scientific method. But what does this entail, and why is this evidence stronger than other evidence? I will explain this shortly, but first we need to foray into scientific methodology to gain a better understanding of RCDs. You can skip the next section if you're mostly interested in the conclusions.

Why you need randomized, controlled designs for ‘strong’ evidence

If you want to test a hypothesized causal relation between a bunch of variables (say ‘coding quality’ and ‘the use of pair coding’), you need to ‘freeze’ (control) all other variables that might influence the relationship. This is what happens, methodologically speaking, in an ‘experiment’: a select number of variables are independently manipulated while a number of outcomes are measured. All variables that are likely to impact the relationship (called ‘confounders’) are controlled (frozen). So if you’re testing a new treatment, you’re going to work with at least two groups. One group gets the treatment (the experimental group), the other doesn’t (the control group, they usually get a placebo). If you measure mortality, you could hypothesize that the experimental group will have lower mortality because you theoretically expect your treatment to work.

Suppose that we find that mortality is lower in the experimental group. Have we now proven the effectiveness of the treatment? No. It is possible that participants in the control group were less healthy to begin with. Or older, which usually doesn’t bode well for mortality. Or maybe the control group contained people with other illnesses, or more risky lifestyles. Or maybe it was just dumb luck. The point is that we have to rule out all other explanations by implementing tight controls.

The first control is that both groups require a similar distribution in (say) age, general health, sex, lifestyle and all other factors that might impact the outcomes. The experiment also has to be performed in exactly the same way for both groups, with the same procedure, the same instructions and the same length. This is part of what ‘double-blind protocols’ are for: neither subjects nor testers know which group someone belongs to. This avoids all kinds of biases, like observer bias (observers see effects because they wish to see them) or participant bias (participants behave differently because they know they’re in the control or experimental group).

All this scientific rigor is required to rule out alternative explanations and arrive at good evidence. But we’re not there yet. We also need to rule out the effects of randomness and noise. Although we may have controlled related variables, there are countless variables that we haven’t controlled for (called ‘unknown confounders’). This causes noise in our findings. An effect might be there for one subject, but it might not be there for the next (after all, 100% effectiveness is highly unlikely). This means that you have to gather a sufficiently large number of subjects in both groups. And this is where statistics come into play. A statistical law is that the chance of detecting a real ‘signal’ in ‘noise’ (the statistical power) increases with the number of subjects, as a larger sample reduces the influence of noise (all uncontrolled variables).
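To make the relation between sample size and statistical power a bit more tangible, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the scores are simulated, the true effect is set to half a standard deviation, and the significance test uses a simple normal approximation rather than the test a real study would pick.

```python
import random
import statistics
from math import sqrt
from statistics import NormalDist

def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def estimated_power(n_per_group, true_effect=0.5, trials=1000, alpha=0.05, seed=1):
    """Share of simulated experiments that detect a real effect at this sample size."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Hypothetical scores: same noise in both groups, treated group shifted.
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        if two_sample_p_value(treated, control) < alpha:
            hits += 1
    return hits / trials

power_by_n = {n: estimated_power(n) for n in (10, 40, 160)}
for n, power in power_by_n.items():
    print(f"{n:>4} subjects per group: effect detected in {power:.0%} of experiments")
```

The effect is real in every simulated experiment, yet with only a handful of subjects per group most experiments miss it. Only the larger groups find it reliably.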

Contrary to popular belief, there are no strong guidelines as to how large your sample should be. It’s not like you’re automatically in the safe zone if you have 100 subjects. It all depends on the variances of scores within your groups. That is why scientists often perform what are called ‘significance tests’. These calculations determine if the difference in average scores between groups is strong enough to be considered a signal in noise.
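A rough sketch of why the variance matters so much: the same difference between two groups of 100 subjects can be a clear signal or indistinguishable from noise, depending on how much the scores vary within the groups. The numbers are hypothetical, and the calculation assumes equal group sizes, equal standard deviations and a normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def p_value_for_difference(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in group means (normal approximation).

    Assumes both groups have the same size and the same standard deviation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: 100 subjects per group, a 5-point difference in means.
p_low_variance = p_value_for_difference(mean_diff=5, sd=10, n_per_group=100)
p_high_variance = p_value_for_difference(mean_diff=5, sd=40, n_per_group=100)

print(f"sd=10: p = {p_low_variance:.4f}")
print(f"sd=40: p = {p_high_variance:.4f}")
```

With little variation within the groups, the 5-point difference falls well below the usual 5% threshold. With a lot of variation, the very same difference in the very same sample size does not.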

But even in the most tightly controlled experiments, there is still sufficient room for noise. So usually effects are considered significant when there is less than a 5% chance that a difference this large would show up through randomness alone. However, even if an effect is statistically significant it is not automatically relevant. With large enough groups, even very tiny effects can become significant. Maybe mortality is significantly different between the control group (51%) and the experimental group (50%). Taking the treatment has such a small effect that it’s probably not worth the cost. This is why there is a growing movement of scientists and statisticians that prefer to report the effect size over significance, even though this is rarely done. Also, we have a problem called ‘Type I errors’ (false positives). By definition, all statistical tests have a chance of finding a signal in noise that isn’t a signal. This chance is equal to the aforementioned 5% (or whatever threshold was chosen). This is hardly a risk with experiments that only perform one significance test. But in modern research studies many significance tests are often performed. If 20 tests are performed (for example, 20 groups are compared), you should expect about one test to return a signal when there isn’t one. This, among other reasons, is why scientific research often involves the replication and validation of studies by other research teams.
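Both points can be checked with a few lines of arithmetic. The sketch below is illustrative only: the group sizes are made up, the mortality comparison uses a pooled z-test with equal group sizes, and the false-positive calculation assumes that the 20 tests are independent of each other.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Two-sided p-value for a difference between two rates (pooled z-test)."""
    pooled = (rate_a + rate_b) / 2  # equal group sizes assumed
    standard_error = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (rate_a - rate_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 1-point difference in mortality (51% vs 50%), different group sizes.
p_small_groups = two_proportion_p_value(0.51, 0.50, n_per_group=1_000)
p_huge_groups = two_proportion_p_value(0.51, 0.50, n_per_group=100_000)
risk_difference = 0.51 - 0.50  # the effect size stays the same either way

# Chance of at least one false positive across 20 independent tests at 5%.
alpha, tests = 0.05, 20
expected_false_positives = alpha * tests
chance_of_any_false_positive = 1 - (1 - alpha) ** tests

print(f"1,000 per group:   p = {p_small_groups:.3f}")
print(f"100,000 per group: p = {p_huge_groups:.6f}")
print(f"Expected false positives in {tests} tests: {expected_false_positives:.1f}")
print(f"Chance of at least one: {chance_of_any_false_positive:.0%}")
```

The one-percentage-point difference only becomes 'significant' because the groups are huge. It is no more relevant than it was before. And under these assumptions, the chance of at least one false positive in 20 tests comes out at roughly 64%.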

Replication by other research teams is one way to validate research findings, but another is the use of ‘peer review’. A common practice in scientific research is that findings are only published after being reviewed by a (usually anonymous) panel of experienced peers. This weeds out experiments that are badly designed, incorrectly analyzed or where the findings are not sufficiently supported by the gathered evidence.

Taken together, evidence is at its strongest when the assertion has been verified by multiple randomized controlled experiments (RCDs) and has been published in peer-reviewed scientific journals. This kind of evidence is considered the 'gold standard' in the scientific method. It is also considered the standard for Evidence-Based Practices (e.g. Guyatt et al., 2010; Thompson, 2005; USPSTF, 1989).

Why ‘gold standard’ evidence is a fantasy in the context of EBM

Up to this point, I’ve been talking mostly about how strong evidence is gathered in medicine. This kind of research is very costly, labor intensive and difficult to do well. And while it’s difficult in medicine, it’s perhaps even harder in the social sciences (which include all kinds of organizational research).

The biggest problems lie in the even larger number of variables that need to be controlled, and the difficulty and practicality of doing so. Take a practical example. How are you going to measure the success of Scrum in your organization if you only have one team? Even if the team is performing better with Scrum, the success might be caused by other factors (trying something new, experience, higher morale, different team composition, different project, pure luck). You could work with a control group, but you can’t do a ‘blind experiment’. Unless you know how to convince a team they’re doing Scrum, while they actually aren’t (a placebo). And even if you have two teams, it’s unlikely you’ll be able to rule out simple luck (randomness) as statistical tests with 2 subjects (2 teams) are pointless. The solution is to increase the size of the sample, but as the number of teams grows, so does the number of confounding variables. Teams are probably from different departments and different locations, with slightly different corporate cultures, different leaders and management styles, different HR practices, different age groups and skillsets, and the list goes on. You can’t possibly control for all of these confounders in a living organization. And it gets even harder if you want to measure the effect of a controlled change over time (like introducing a new way of working); a necessary part of true experimental design. Longitudinal designs, as they are called, also require that confounding variables are frozen over time. Otherwise you can’t rule out time-based effects like learning, attrition or changes in the team or its environment that may explain the effect you find or drown it out.

And there’s another problem in organizational research, and that is participants ‘gaming the system’. If you implement a number of outcomes to measure (like bug count, code coverage, team morale, etc.), how do you prevent teams or managers that have ‘skin in the game’ from influencing the result in a manner they prefer?

Why evidence from organizational research is almost always ‘weak’ evidence

Therefore, experimental research is of very limited use for making hard causal claims within organizations. And even if experiments are possible, the quality of evidence will be very low. Unless you work for a massive company and you have hundreds of teams to work with. And even if you do, you’ll require the use of very advanced statistics (like regression modelling, structural equation modelling or multilevel analysis) in order to find reliable evidence in data that is muddied by countless known and unknown confounding variables. Because experimental research is practically impossible, scientists in the social sciences usually resort to correlational research.

In this kind of research, the ‘experimental’ part of a study is let go. Instead of introducing a new way of working to one team and measuring some outcomes over time, you simply compare a team (A) that works with Scrum with a team (B) that doesn’t work with Scrum at a single moment in time. In this case, you could find that the performance of team A is 140%, while team B scores 120%. In correlational studies, potential (known) confounding variables (like age, skill level, project complexity) are also measured. When the measurement is done, correlations between variables are calculated and tested for significance to rule out the effects of pure chance or luck. Although this can yield some interesting results, it can’t tell you anything definite about causality. You may have heard the saying ‘correlation is not causation’, and this is where it applies. Here’s why: there may be a strong correlation between Scrum-use and the value delivered by a team, but the effect can be caused by something else (called a ‘spurious correlation’). For example, maybe teams that work with Scrum are more likely to try out new things (like Scrum) and are more effective because they innovate. Or teams that use Scrum are generally younger, less resistant to change or more interested in and aware of their process. Or maybe Scrum teams work on very different kinds of projects (like new product development or smaller projects), and the success of these kinds of projects is simply higher. Because a correlational study does not involve experimentation (where one variable is changed while others are measured), it’s quite impossible to determine causality. All causal inferences from correlational data are necessarily interpretations, and therefore certainly refutable.
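A spurious correlation is easy to produce in a simulation. In the sketch below, Scrum has no effect on performance whatsoever. A hidden trait ('innovative') makes teams both more likely to adopt Scrum and better performers. All teams, probabilities and scores are invented for the sake of illustration.

```python
import random
from statistics import mean

rng = random.Random(42)
teams = []
for _ in range(2000):
    innovative = rng.random() < 0.5
    # Innovative teams are more likely to pick up Scrum (assumed probabilities).
    uses_scrum = rng.random() < (0.8 if innovative else 0.2)
    # Performance depends on the hidden trait only, never on Scrum itself.
    performance = 100 + (20 if innovative else 0) + rng.gauss(0, 5)
    teams.append((innovative, uses_scrum, performance))

def scrum_gap(rows):
    """Average performance of Scrum teams minus that of non-Scrum teams."""
    with_scrum = [p for _, s, p in rows if s]
    without = [p for _, s, p in rows if not s]
    return mean(with_scrum) - mean(without)

naive_gap = scrum_gap(teams)
gap_innovative = scrum_gap([t for t in teams if t[0]])
gap_other = scrum_gap([t for t in teams if not t[0]])

print(f"All teams:            Scrum gap = {naive_gap:+.1f}")
print(f"Innovative teams:     Scrum gap = {gap_innovative:+.1f}")
print(f"Non-innovative teams: Scrum gap = {gap_other:+.1f}")
```

Compared naively, Scrum teams clearly outperform the rest. Compare like with like, and the gap all but vanishes. The catch is that in real organizational research you rarely know which hidden trait to split on, or whether you measured it at all.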

This observation is confirmed by advocates of EBM. At the very best, ‘definitive causal conclusions in quantitative research can only be reached on the basis of true randomized trials’ (Thompson, 2005). And even in that case, it’s very hard. Thompson (ibid) emphasizes ‘that correlational evidence can at least tentatively inform evidence-based practices when sophisticated causal modeling (e.g. regression discontinuity analyses) or exclusion methods are employed’. The US-based Coalition for Evidence-Based Policy takes an equally strong stance and implores decision-makers to use evidence from ‘preferably randomized controlled trials or, if not feasible, prospective, closely-matched comparison-group studies’ (2006). In other words, conclusions should be drawn by people with knowledge of statistics and methodology who are capable of determining the quality of research. And even then, it’s still only circumstantial evidence because of methodological constraints.

So what’s the problem, really?

The (lengthy) point I’m trying to make here is twofold. First, there is absolutely no such thing as ‘direct evidence’ or ‘big E evidence’ in the social sciences. This was recently aptly illustrated by a large-scale replication of 27 of the strongest and most taken-for-granted effects in (social) psychology that failed to replicate at least 10 of them (Nosek & Lakens, 2014). This limitation in the strength of evidence that can be collected applies to research done by experienced scientific teams, and most certainly to research that is done locally in your own organization. Due to methodological, practical and statistical limitations - which are even more pressing in local research - you always end up with ‘evidence’ that can be interpreted and explained in many different ways and is considered ‘weak’ by scientific standards, such as those of Evidence-Based Medicine (e.g. Guyatt et al., 2010; USPSTF, 1989). In the end, how you interpret the evidence is largely a matter of assumptions, perspective, experience and, frankly, your personal agenda.

Of course some circumstantial evidence is stronger than other evidence, which leads into my second point. As you may have realized by reading this post, appraising the quality of evidence is not as simple as putting it in the ‘direct’ or ‘circumstantial’ categories. Instead, it requires, among many other things, critical reading of the research, the applied methodology and the statistics. Not only is this very time-consuming, it also requires a lot of in-depth knowledge. This is a fact that is acknowledged by advocates of EBM (e.g. Rousseau, 2007; Pfeffer & Sutton, 2006). It is extremely challenging even for trained scientists, let alone managers and people that have not been trained at all. The biggest issue in some applications of EBM is that it persuades decision-makers into believing that they are making objective decisions by using some tool or framework. Even though they're not.

But underneath these challenges, there is a more profound one. EBM rests on the principle that managers should make use of evidence, preferably from external scientific research (Pfeffer & Sutton, 2005). But if this evidence is only circumstantial at best, how is it ethically acceptable for a manager to apply this ‘evidence’, probably even in an organization that is entirely different from the research sample? Especially if this ‘evidence’ is used to support decisions that impact the jobs of many employees. Is it really so much better than using experience, advice from colleagues or just plain intuition? This problem is amplified in the case of local research. Although the problems of generalization are presumably less relevant here, there are more limitations on how the research was performed (statistics, methodology, controls). Therefore, the evidence is likely to be even more circumstantial, and the ethical questions become even more pressing.

In my next post, I will continue in this vein. If circumstantial evidence is all we have - evidence that is by definition refutable - why do we call it ‘Evidence-Based’ Management? Doesn’t this open the door to blatant manipulation by picking just the evidence that suits one’s personal agenda, and presenting it as ‘direct evidence’? Doesn’t this result in managers using the 'direct' evidence (that isn't) to dissolve dissent and resistance under the guise of the scientific method? More in my next post.