The random risks of randomised trials

‘There are perils to treating patients not as human beings but as means to some glorious end’

The backlash against randomised trials in policy has begun. Randomised controlled trials (RCTs) are widely accepted as the foundation for evidence-based medicine. Yet a decade ago, they were extremely rare in other contexts such as economics, criminal justice or social policy. That is changing.

In the UK, Downing Street’s newly privatised Behavioural Insights Team has made it cool to test new policy ideas by running experiments in which many thousands of participants receive various treatments at random. The Education Endowment Foundation, set up with £125m of UK government money, has begun 59 RCTs involving 2,300 schools. In the aid industry, RCTs have been popularised by MIT’s Poverty Action Lab, which celebrated its 10th anniversary last summer – one estimate is that 500 RCTs are under way in the field of education policy alone.

With such a dramatic expansion of the use of randomised trials, it’s only right that we ask some hard questions about how they are being used. The World Bank’s development impact blog has been hosting a debate about the ethics of these trials; they have been criticised in The New York Times and in an academic article by economists Steve Ziliak and Edward Teather-Posadas.

Objections to the idea of randomisation aren’t new. The great epidemiologist Archie Cochrane once ran an RCT of coronary care units, with the alternative treatment being care at home. He was vigorously attacked by cardiologists: how could he justify randomly denying treatment to patients? The counterargument is simple: how could we justify prescribing treatments without knowing whether or not they work?

Yet that should not give carte blanche for evaluators to do whatever they like. Hanging in the background of this debate are awful abuses such as the “Tuskegee Study of Untreated Syphilis in the Negro Male”, which began in 1932. Researchers went to extraordinary lengths to ensure 400 African-American men with syphilis went untreated, although a proven treatment was available from 1947. When the experiment ended in 1972, many men were dead, 40 wives had been infected and 19 children with congenital syphilis had been born.

The Tuskegee study was not a randomised trial, but it demonstrates the perils of treating patients not as human beings but as means to some glorious end. This topic is rightly sensitive in development aid, as there is a clear power imbalance between the agencies who pay for new interventions and the poverty-stricken citizens on the receiving end.

In a perfect world, everyone involved in a trial would give informed consent, and everyone in the control group would receive the best available alternative to the approach being tested. (These are the basic guidelines laid out for medical trials by the World Medical Association’s “Helsinki” declaration.)

Yet compromises are common. Dean Karlan is professor of economics at Yale and founder of Innovations for Poverty Action, which evaluates development projects using randomisation. He points out that telling participants too much about the trial destroys the validity of the results by changing everyone’s behaviour.

Then there is the question of who consents. Camilla Nevill of the Education Endowment Foundation says that trials are often agreed to and conducted by schools. Trying to persuade every parent to agree explicitly to the trial “decimates” the number of participants, she says.

Is this ethically troubling? At first glance, yes. But there is a risk of a double standard. Without the EEF funding, some schools would adopt the new teaching approach anyway. It is only when a researcher proposes a meaningful evaluation that suddenly there is talk of informed consent.

Ben Goldacre, an epidemiologist and author of Bad Pharma, says “it’s reasonable to hold researchers to a higher standard” if only to protect the reputation of rigorous research. But how high a standard is high enough?

Steve Ziliak, a critic of RCTs, complains about one conducted in China in which some visually impaired children were given glasses while others received nothing. The case against the trial is that we no more need a randomised trial of spectacles than we need a randomised trial of the parachute.

The case for the defence is that we know that spectacles work but we don’t know how important it might be to pay for spectacles rather than, say, textbooks or vitamin supplements. None of these children was in line to receive glasses anyway, so what harm have the researchers inflicted?

I should leave the final word to Archie Cochrane. In his trial of coronary care units, run in the teeth of vehement opposition, early results suggested that home care was at the time safer than hospital care. Mischievously, Cochrane swapped the results round, giving the cardiologists the (false) message that their hospitals were best all along.

“They were vociferous in their abuse,” he later wrote, and demanded that the “unethical” trial stop immediately. He then revealed the truth and challenged the cardiologists to close down their own hospital units without delay. “There was dead silence.”

The world often surprises even the experts. When considering an intervention that might profoundly affect people’s lives, if there is one thing more unethical than running a randomised trial, it’s not running the trial.

7 Comments

Nick says:

The trouble is almost certainly in the presentation. My favourite example is the way that people are forever moaning about “postcode lotteries” in services, and yet, if the national lottery randomly gave some people free numbers, who would complain? And yet that is essentially the same process.

I detest the postcode lottery discussions. The alternative to a postcode lottery is every service being the same everywhere, which tends to mean bringing the good services down as well as the poor ones up, and running with rigid rules. Do we want the same rigid rules used in all services in all areas regardless of need?

I totally agree, Chris, but my point is that for some reason people see it as a possible loss rather than a possible win. They don’t see it that way for the actual lottery, so somehow we need to flip the point of view.

My response to this on ft.com did not include my name, so here it is again:

There is no doubt that when RCTs produce statistically significant effects, they produce very strong evidence for a causal relationship. If we can do RCTs in education, we should probably do so. However, if, for whatever reason, RCTs are not possible, should we say that we simply do not know anything, or should we investigate what we can say? There are some who would hold to the former view, but as Robert Solow has pointed out, this is tantamount to saying that because perfectly sterile conditions in an operating theater are impossible, “one might as well do surgery in a sewer.” (Solow, 1970 p. 101)

In this context, it is worth noting that RCTs were not required to establish that smoking causes cancer. If we truly wanted “gold standard” evidence that smoking causes cancer, we would have to solicit volunteers for an experiment, divide them into two groups at random, prevent one group from smoking, and ensure that all the members of the other group smoked a certain number of cigarettes per day for a significant length of time (say around 20 years) and then compare the prevalence of cancer in the two groups. Needless to say, this was not the approach adopted. Instead, researchers looked for a way of establishing a causal relationship without an RCT (Hill, 1965).

Even where RCTs are possible, there are a number of factors that make their use in education problematic. The first is to do with clustering effects. If we wanted to explore the impact of financial rewards for students, then randomization at the student level might be quite possible. Some students are given financial rewards and others are not, and we see the impact on their outcomes. However, where we wish to investigate the impact of a particular instructional program on student achievement, the appropriate unit of analysis is likely to be the class rather than the individual, because the way the program is implemented depends on the individual teacher. Indeed, given the fact that teachers in the same school talk to each other, it would probably be prudent to assume that the unit of analysis should be the school. The experiment would therefore need to be very large to produce a statistically significant effect.
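To give a rough sense of the arithmetic, here is a minimal sketch in Python. The cluster size (25 students per class) and intraclass correlation (ICC = 0.2) are illustrative assumptions, not figures from the studies discussed here, though school-level ICCs for achievement are often quoted in the 0.1 to 0.2 range; the sample-size formula is the standard normal approximation for a two-arm trial at 80% power and 5% two-sided significance.

```python
# A rough sketch of why cluster randomization makes education trials so
# large. The cluster size and ICC below are illustrative assumptions.

def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation when we randomize clusters (classes or schools)
    rather than individual students: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def n_per_arm(effect_size: float) -> float:
    """Students per arm for a two-arm trial under individual randomization,
    using the normal approximation (80% power, 5% two-sided alpha)."""
    z_alpha, z_beta = 1.96, 0.84
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

d = 0.2                          # a "small" effect in Cohen's terms
n_ind = n_per_arm(d)             # about 392 students per arm
deff = design_effect(25, 0.2)    # 5.8 for classes of 25 at ICC 0.2
n_clu = n_ind * deff             # about 2,274 students per arm

print(f"individual randomization: {n_ind:,.0f} students per arm")
print(f"cluster randomization:    {n_clu:,.0f} students per arm (DEFF {deff:.1f})")
```

Even under these mild assumptions, randomizing classes rather than students multiplies the required sample nearly sixfold; randomizing whole schools, with larger clusters, inflates it further still.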

A second problem with educational interventions is that because the range of achievement within a single group of students is so large, the differences between students receiving an intervention and those not receiving the intervention tend to be relatively small in comparison. The most common way of reporting the magnitude of the impact of interventions is the standardized effect size, defined as the difference in mean achievement between the treatment and control groups, divided by the population standard deviation (Cohen, 1988). While Cohen and others have suggested that effect sizes below 0.3 should be regarded as “small,” Lenth (2006) has pointed out that an effect size needs to be interpreted in context, and here guidelines that might work well in psychology work rather poorly in education, because the magnitude of the effect size of an intervention depends on the ages of the students under study. This is because, as students get older, the range of achievement tends to increase (Black & Wiliam, 2007), and since the standard deviation of achievement is greater for older students, the denominator in the effect size calculation is larger, and the effect size is therefore smaller. For example, an intervention that increased the rate of student learning by 50% (so that students receiving the intervention learned in eight months what those in the control group learned in a year) would equate to an effect size of 0.75 for 6-year-olds, but to an effect size of only 0.1 for 15-year-olds (Bloom, Hill, Black, & Lipsey, 2008). An intervention with an effect size of 0.01 on the learning of secondary school students would be hard to detect (an RCT sufficiently powerful to produce a statistically significant result would require thousands of schools to participate), and yet would be worth around £1 billion per year in the United Kingdom.
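For what it is worth, the arithmetic behind the 0.75 versus 0.1 comparison can be sketched in a few lines of Python. The annual-growth figures (one year’s learning expressed in population standard deviations) are rough, illustrative approximations in the spirit of the values Bloom et al. (2008) report, not exact numbers from that paper.

```python
# The standardized effect size is (treatment mean - control mean) / SD.
# Annual growth, already expressed in SD units, shrinks with age because
# the spread of achievement widens as students get older. The figures
# below are illustrative approximations, not exact published values.

annual_growth_in_sd = {6: 1.5, 15: 0.2}   # one year's learning, in SDs
rate_gain = 0.5                           # intervention: 50% faster learning

for age, growth in annual_growth_in_sd.items():
    # Extra achievement after one year is 50% of a normal year's growth;
    # since growth is already in SD units, this is the effect size itself.
    d = rate_gain * growth
    print(f"age {age:>2}: effect size = {d:.2f}")
# age  6: effect size = 0.75
# age 15: effect size = 0.10
```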

There are many other reasons that randomized control trials are difficult to do well in education. For one thing, it turns out to be quite difficult to get people to implement the programs as designed. A randomized control study of the Compass Learning Odyssey Math program found that only one out of the 60 participating teachers used the program for the 60 minutes specified each week—the average usage was around 38 minutes per week (Wijekumar, Hitchcock, Turner, Lei, & Peck, 2009). Similarly, an evaluation of Classroom Assessment for Student Learning found that teachers participating in the trial received only around half of the training specified in the program (Randel, Beesley, Apthorp, Clark, Wang, Cicchinelli, & Williams, 2010). The fact that neither of these evaluations found a significant impact on student achievement shows merely that if you do not implement a program, you are unlikely to get its benefits.

Of course, if an intervention is so cumbersome to implement that it is routinely implemented badly, or implementation requires levels of teacher skill that are not commonly found, this would raise questions about the usefulness of the intervention, at least as a way of improving education at scale. On the other hand, if the intervention can be implemented faithfully, and has the potential to substantially improve students’ achievement, but the nature of the intervention is such that randomized control trials are difficult, or even impossible, to conduct, then given the substantial lifelong benefit of higher achievement (e.g., Crawford & Cribb, 2013), there would appear to be a clear moral imperative for researching education even when RCTs are not possible; as Robert Slavin once observed, “Do we really know nothing until we know everything?” (Slavin, 1987 p. 347).

Finally, a randomized control trial of an intervention might be successful because of the presence of factors that are not present in all educational settings, so generalizability to other settings would not be warranted. This suggests that even where randomized trials can be conducted, for their results to be interpretable, they usually need to be accompanied by careful theorization, which often benefits from qualitative observations of the phenomena under study. More developed theorizations of interventions also permit interventions to be optimized, by removing aspects of the intervention that prove to be unnecessary or less effective.

None of the foregoing is intended to suggest that randomized control trials are a bad idea. Rather, the discussion highlights that if we rely only on such experimental designs, we end up not being able to say very much, and that even when we do conduct such experiments, they benefit from research designs that include complementary approaches to inquiry. As the physicist Arthur Eddington (1935) said:

“But are we sure of our observational facts? Scientific men are rather fond of saying pontifically that one ought to be quite sure of one’s observational facts before embarking on theory. Fortunately those who give this advice do not practice what they preach. Observation and theory get on best when they are mixed together, both helping one another in the pursuit of truth. It is a good rule not to put overmuch confidence in a theory until it has been confirmed by observation. I hope I shall not shock the experimental physicists too much if I add that it is also a good rule not to put overmuch confidence in the observational results that are put forward until they have been confirmed by theory” (p. 211; italics in original).