Life-or-Death Algorithms: Avoiding the Black Box of AI in Medicine

When it comes to applications for machine learning, few can be more widely hyped than medicine. This is hardly surprising: it’s a huge industry that generates a phenomenal amount of data and revenue, where technological advances can improve or save the lives of millions of people. Hardly a week passes without a study that suggests algorithms will soon be better than experts at detecting pneumonia, Alzheimer’s, or diseases in complex organs ranging from the eye to the heart.

The problems of overcrowded hospitals and overworked medical staff plague public healthcare systems like Britain’s NHS and lead to rising costs for private healthcare systems. Here, again, algorithms offer a tantalizing solution. How many of those doctor’s visits really need to happen? How many could be replaced by an interaction with an intelligent chatbot—especially if it can be combined with portable diagnostic tests, utilizing the latest in biotechnology? That way, unnecessary visits could be reduced, and patients could be diagnosed and referred to specialists more quickly without waiting for an initial consultation.

As ever with artificial intelligence algorithms, the aim is not to replace doctors, but to give them tools to reduce the mundane or repetitive parts of the job. With an AI that can examine thousands of scans in a minute, the “dull drudgery” is left to machines, and the doctors are freed to concentrate on the parts of the job that require more complex, subtle, experience-based judgement of the best treatments and the needs of the patient.

High Stakes

But, as ever with AI algorithms, there are risks involved with relying on them—even for tasks that are considered mundane. The problems of black-box algorithms that make inexplicable decisions are bad enough when you’re trying to understand why that automated hiring chatbot was unimpressed by your job interview performance. In a healthcare context, where the decisions made could mean life or death, the consequences of algorithmic failure could be grave.

A new paper in Science Translational Medicine, by Nicholson Price, explores some of the promises and pitfalls of using these algorithms in the data-rich medical environment.

Neural networks excel at churning through vast quantities of training data and making connections, absorbing the underlying patterns or logic for the system in hidden layers of linear algebra, whether it’s detecting skin cancer from photographs or learning to write in pseudo-Shakespearean script. They are terrible, however, at explaining the underlying logic behind the relationships that they’ve found: there is often little more than a string of numbers, the statistical “weights” between the layers. They also struggle to distinguish between correlation and causation.
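To make the “string of numbers” point concrete, here is a minimal sketch of a toy network in plain Python. Everything in it is a hypothetical assumption for illustration: the weights are invented, the three input features stand for nothing in particular, and no real medical model is this small. The point is only that the model’s entire “reasoning” lives in the numeric tables at the top.

```python
import math

# Hypothetical, hand-picked weights for a toy network with
# 3 inputs, 2 hidden units, and 1 output. A trained network stores
# its "knowledge" in tables like these, only vastly larger.
W_HIDDEN = [[0.8, -1.2, 0.5], [-0.4, 0.9, 1.1]]
B_HIDDEN = [0.1, -0.3]
W_OUT = [1.5, -2.0]
B_OUT = 0.2


def sigmoid(x: float) -> float:
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(features: list[float]) -> float:
    """Forward pass: weighted sums, a ReLU hidden layer, a sigmoid output."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(W_HIDDEN, B_HIDDEN)
    ]
    return sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)


# Illustrative input only; the features have no real-world meaning here.
score = predict([1.0, 0.5, 2.0])
print(f"score = {score:.3f}")
print("the only 'explanation' available:", W_HIDDEN, W_OUT)
```

The network returns a score between 0 and 1, but inspecting the weights tells you nothing about why. That gap only widens as networks grow to millions of weights.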

This raises interesting dilemmas for healthcare providers. The dream of big data in medicine is to feed a neural network on “huge troves of health data, finding complex, implicit relationships and making individualized assessments for patients.” What if, inevitably, such an algorithm proves to be unreasonably effective at diagnosing a medical condition or prescribing a treatment, but you have no scientific understanding of how this link actually works?

Too Many Threads to Unravel?

The statistical models that underlie such neural networks often assume that variables are independent of each other, but in a complex, interacting system like the human body, this is not always the case.

In some ways, this is a familiar concept in medical science: there are many phenomena and links which have been observed for decades but are still poorly understood on a biological level. Paracetamol is one of the most commonly prescribed painkillers, but there’s still robust debate about how it actually works. Medical practitioners may be keen to deploy whatever tool is most effective, regardless of whether it’s based on a deeper scientific understanding. Fans of the Copenhagen interpretation of quantum mechanics might spin this as “Shut up and medicate!”

But as in that field, there’s a debate to be had about whether this approach risks losing sight of a deeper understanding that will ultimately prove more fruitful—for example, for drug discovery.

Away from the philosophical weeds, there are more practical problems: if you don’t understand how a black-box medical algorithm is operating, how should you approach the issues of clinical trials and regulation?

Price points out that, in the US, the “21st-Century Cures Act” allows the FDA to regulate any algorithm that analyzes images, or that doesn’t allow a provider to review the basis for its conclusions. This could completely exclude “black-box” algorithms of the kind described above from use.

Transparency about how the algorithm functions—the data it looks at, and the thresholds for drawing conclusions or providing medical advice—may be required, but could also conflict with the profit motive and the desire for secrecy in healthcare startups.

One solution might be to screen algorithms that can’t explain themselves, or don’t rely on well-understood medical science, from use before they enter the healthcare market. But this could prevent people from reaping the benefits that they can provide.

Evaluating Algorithms

New healthcare algorithms will be unable to do what physicists did with quantum mechanics and point to a track record of success, because they will not yet have been deployed in the field. And, as Price notes, many algorithms will improve the longer they are deployed, as they harvest and learn from real-world performance data. So how can we choose between the most promising approaches?

Creating a standardized clinical trial and validation system that’s equally valid across algorithms that function in different ways, or use different input or training data, will be a difficult task. Clinical trials that rely on small sample sizes, such as for algorithms that attempt to personalize treatment to individuals, will also prove difficult. With a small sample size and little scientific understanding, it’s hard to tell whether an algorithm’s success or failure reflects how good it is at its job or is simply down to chance.
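The small-sample problem can be illustrated with a little arithmetic. The sketch below is not a method from Price’s paper. It assumes, purely for illustration, a yes/no diagnosis where guessing is right half the time, and asks how likely it is to see a given hit rate by luck alone. The trial sizes and scores are made-up numbers.

```python
from math import comb


def binomial_tail(successes: int, trials: int, chance_rate: float) -> float:
    """Probability of at least `successes` correct calls out of `trials`
    if every call were right with probability `chance_rate` (pure luck)."""
    return sum(
        comb(trials, k) * chance_rate**k * (1 - chance_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical trials: the same 80% hit rate at two different sizes,
# compared against an assumed 50% chance baseline.
small_trial = binomial_tail(8, 10, 0.5)       # 8 of 10 patients
large_trial = binomial_tail(800, 1000, 0.5)   # 800 of 1,000 patients

print(f"8/10 by luck:     {small_trial:.4f}")   # about 0.0547
print(f"800/1000 by luck: {large_trial:.3e}")
```

Under these assumptions, a coin flip would score 8 out of 10 or better roughly one time in eighteen, so a ten-patient trial cannot rule out luck. The same hit rate across a thousand patients is vanishingly unlikely to be a fluke.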

Add learning into the mix and the picture gets more complex. “Perhaps more importantly, to the extent that an ideal black-box algorithm is plastic and frequently updated, the clinical trial validation model breaks down further, because the model depends on a static product subject to stable validation.” As Price describes, the current system for testing and validation of medical products needs some adaptation to deal with this new software before it can successfully test and validate the new algorithms.

Striking a Balance

The story in healthcare reflects the AI story in so many other fields, and the complexities involved perhaps illustrate why even an illustrious company like IBM appears to be struggling to turn its famed Watson AI into a viable product in the healthcare space.

A balance must be struck in our rush to exploit big data and the eerie power of neural networks, and to automate thinking. We must be aware of the biases and flaws of this approach to problem-solving, and realize that it is not a foolproof panacea.

But we also need to embrace these technologies where they can be a useful complement to the skills, insights, and deeper understanding that humans can provide. Much like a neural network, our industries need to train themselves to enhance this cooperation in the future.

Thomas Hornigold is a physics student at the University of Oxford. When he's not geeking out about the Universe, he hosts a podcast, Physical Attraction, which explains physics - one chat-up line at a time.