Safely deploying machine learning in healthcare 1/2

10 May 2017

Updated 25th June 2017. Edited to focus on regulatory requirements and to improve readability.

There is considerable excitement (and hype) relating to the potential benefits of machine learning (ML) and artificial intelligence (AI) in healthcare. However, while increases in computer processing power have allowed more sophisticated ML in recent years, many types of machine learning have been in use for decades. The difference is that it is now becoming practical to apply the results of ML to create usable AI within healthcare.

Such AI will act as software as a medical device and, as a result, there are important regulatory requirements to consider before deploying this technology.

Looking “inside the box”

The problem with inferences made independently by AI is that we may not be able to, figuratively speaking, open the box and see the logical, sequential arguments and premises used to reach a conclusion. Much as in legal argument or any other form of logical reasoning, carefully crafted reasoning and argument are built upon a framework of assumptions and known evidence. In such reasoning, one can craft inferences upon inferences and justify the final conclusions by examination of each step.

Human decision making is fallible and frequently non-deterministic, with human error often the source of important errors in healthcare. Indeed, systems designed to reduce the frequency and impact of human error, such as checklists, redundancy, guidelines and protocols, are attempts to make human decision making more deterministic and less at risk of bias. Many systems work by breaking down a larger decision or process of care into smaller discrete steps.

If I see a patient and start them on a new treatment, such as pyridostigmine, you can ask me to justify why I recommended starting this treatment for that patient in clinic. I will be able to give you a logical, reasoned argument built upon evidence and inference that will include the presence or absence of particular characteristics. In this case, the characteristics may include knowledge of the diagnosis of myasthenia gravis and the rationale for that diagnosis, such as the positive finding of acetylcholine receptor antibodies in the patient’s serum. I may explain how, in arriving at that decision, I balanced symptomatic treatment such as pyridostigmine against treatments of greater efficacy but greater risk of adverse effects.

Essentially, I can attempt to demonstrate the correctness of my final decision by a chain of reasoning. Similarly, someone else might be able to conclude I made the wrong decision by examining that chain of reasoning.

So the logical questions are:

what are the regulatory requirements within which we must work in order to make effective use of AI in healthcare?

do we need to understand how machines arrive at their conclusions, so we can justify using those conclusions in clinical practice?

won’t designing information systems that arrive at their conclusions without justification risk replicating the fallibility of human decision making?

must we be able to see inside the box?

can augmented AI decision support processes be decomposed into discrete steps with the results of one semi-deterministic “black-box” feeding the action of another? Does such decomposition result in creating a partial ‘window’ to allow us to understand the chain of reasoning created by AI devices?

and finally, how can we introduce machine learning in a safe and incremental manner in healthcare?

“Medical devices”

The Medicines and Healthcare products Regulatory Agency (MHRA) provides helpful advice on the certification of software used in healthcare. Its interactive PDF will take you step-by-step through a decision aid in order to determine whether your software meets the definition of a “medical device”, and needs certification and accreditation with a declaration of conformity (“CE” marking).

The guidance is logically organised and intuitive, except that the section on “medical purpose” needs to be read carefully. Many things that you might think have a “medical purpose” are not included in their definition. This is important because, if your software has a “medical purpose”, it essentially needs assessment and certification as a medical device.

For example, software which independently makes a diagnosis based on imaging data would satisfy the test for “medical purpose” and therefore be labelled a medical device. However, software which displays data and makes general recommendations is not a medical device. Software or devices which monitor sport or fitness are not medical devices, but would be considered as such if they were intended for use to monitor physiological parameters that will affect the treatment of a patient.

If software claims to do any of the following, then it does have a “medical purpose”:

make recommendations to seek further advice based on user-entered data.

indicate the risk that a specific patient has of developing a disease based on entered data for that patient.

allow remote access to information on physical monitors and apply user-defined filtering rules to any alarms generated by the original device.

monitor a patient and collect information (entered by the user, measured automatically by the app or collected by a point-of-care device), where the output is intended to affect the treatment of an individual.

automate the treatment pathway for an individual patient.

provide clinical decisions.

support therapeutic drug monitoring.

However, if software has only purposes from the following list, then it is not considered to have a “medical purpose”:

Patient medical education

Monitors fitness/health/wellbeing

Professional medical education

Stores or transmits medical data without change

Software that is used to book an appointment, request a prescription or have a virtual consultation is also unlikely to be considered a medical device if it only has an administrative function.

Software that provides reference information to help a Healthcare Professional to use their knowledge to make a clinical decision.

Data or databases for storing data
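The two lists above can be read as a simple rule: a single “medical” claim is enough to give the software a “medical purpose”. The following is a minimal sketch of that reading only, and is not the MHRA decision aid itself. The purpose labels are hypothetical identifiers that I have invented for illustration, and anything that is not on either list is deliberately rejected so that a human has to read the guidance.

```python
# Purposes paraphrased from the two lists above. The identifiers are
# hypothetical labels invented for this sketch, not MHRA terminology.
MEDICAL_PURPOSES = {
    "recommend_seeking_advice",
    "patient_specific_risk",
    "remote_alarm_filtering",
    "monitoring_affecting_treatment",
    "automate_treatment_pathway",
    "clinical_decisions",
    "therapeutic_drug_monitoring",
}

NON_MEDICAL_PURPOSES = {
    "patient_education",
    "fitness_wellbeing_monitoring",
    "professional_education",
    "store_or_transmit_unchanged",
    "administrative",
    "reference_information",
    "data_storage",
}


def has_medical_purpose(claimed: set) -> bool:
    """Return True if any claimed purpose appears in the 'medical' list."""
    unknown = claimed - MEDICAL_PURPOSES - NON_MEDICAL_PURPOSES
    if unknown:
        # Anything on neither list needs a careful reading of the guidance.
        raise ValueError(f"unclassified purposes: {sorted(unknown)}")
    return bool(claimed & MEDICAL_PURPOSES)


print(has_medical_purpose({"patient_education", "data_storage"}))
print(has_medical_purpose({"patient_education", "clinical_decisions"}))
```

A module that only stores data and educates patients falls outside the definition, while adding clinical decisions to the same module brings it inside.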

Implications for information technology

Confirming that a product conforms with the requirements for a “medical device” means confirming that it passes clinical evaluation. This process is designed to verify that a device performs as intended, requiring research to demonstrate efficacy and adverse effect risk, much like the evaluation of a pharmaceutical product. There is a lot of information from MHRA on the clinical evaluation of medical devices available online. In addition, any device needs ongoing surveillance of its functionality in order to confirm that it continues to function as it was intended and has had no untoward effects not found during clinical evaluation.

Most information technology systems are made up of multiple interoperating modules and it is unlikely that all such modules would meet the criteria to become medical devices. As a result, accreditation may be needed for only a single module rather than for the whole system.

The natural conclusion is to plan for two parallel streams of work in relation to machine learning.

The first stream is to support the ascertainment, aggregation and presentation of data relating to patients in order to support clinical care. Such data might range from the results from laboratory tests to links to guidelines that might apply in that general clinical context. This stream of work will not meet the criteria for a “medical device” and yet AI offers opportunities to improve how information is recorded, retrieved and shown. Importantly, I’d argue that the graphical display of data is under-used in healthcare and yet is a powerful way of summarising information without inadvertently creating a device that needs accreditation as a medical device.

The second stream is the development of software that can independently make inferences and judgements and present those results in order to influence or change patient management. Such software would be categorised as a “medical device”, and as a result, this stream of work must be undertaken much as the development of a new pharmacological product, via a phased and iterative structured research programme. The four phases for pharmaceutical products are:

Phase I studies the safety of the product, with testing on small numbers of healthy volunteers.

Phase II studies the efficacy of the product usually by randomising individuals to either placebo or the pharmaceutical product and assessing the difference. In clinical situations in which there is an established treatment, the control arm of the study would typically involve the “current standard of care or treatment” rather than placebo.

Phase III is essentially a large scale version of Phase II but patients are almost always randomised. Phase III results can be used to seek MHRA approval for the use of that product.

Phase IV is the post-marketing surveillance, monitoring the real-world use of the product.

Clinical research for medical devices

However, this scheme is modified for medical devices which need to prove safety and performance (as claimed) rather than efficacy. This distinction is made in the UK’s National Patient Safety Agency (NPSA) National Research Ethics Service (NRES) guidance from 2008. Assessment of safety requires studies to determine whether there are any “undesirable side effects under normal conditions of use and to allow an informed clinical opinion to assess whether these are acceptable when weighed against the benefits in relation to the intended performance of the device.” Assessment of performance requires studies “to verify that under normal conditions of use the performance characteristics of the device are those intended by the manufacturer.”

Although a study is required to demonstrate safety and performance, the nature of AI devices is such that these studies need not necessarily be performed prospectively. Indeed, a simple AI device that is claimed to, for example, recognise deteriorating renal function may simply require a study of performance against retrospective clinical data in order to verify that it is performing as required; this is an in-silico trial. It is important to note that such a process would need to be repeated if the device or its logic changed, but this would be straightforward if the device to be tested simply needs to process the validation dataset and have its results checked. Indeed, this is the very essence of test-driven development, in which a set of tests is written to validate that a piece of software is correct. The logical conclusion that follows from the need to perform both ML training and subsequent validation is that large-scale real clinical datasets must be made available.
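To make the idea of an in-silico trial concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Case` fields, the stand-in device `flags_deterioration` with its 1.5-fold creatinine ratio, and the tiny synthetic validation set. A real study would use curated retrospective clinical data and pre-specified acceptance criteria.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One retrospective case; all field names are hypothetical."""
    baseline_creatinine: float  # micromol/L
    latest_creatinine: float    # micromol/L
    deteriorated: bool          # reference label from clinical review


def flags_deterioration(case: Case) -> bool:
    """Stand-in for the device under test. The 1.5x ratio is illustrative only."""
    return case.latest_creatinine >= 1.5 * case.baseline_creatinine


def in_silico_trial(device: Callable[[Case], bool], cases: Sequence[Case]) -> dict:
    """Run the device over labelled cases and report sensitivity and specificity."""
    tp = fp = tn = fn = 0
    for case in cases:
        flagged = device(case)
        if case.deteriorated:
            tp, fn = tp + flagged, fn + (not flagged)
        else:
            fp, tn = fp + flagged, tn + (not flagged)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("validation set needs both deteriorated and stable cases")
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


# Synthetic validation set, invented for this example.
VALIDATION_CASES = [
    Case(80, 85, False),
    Case(80, 130, True),
    Case(100, 140, True),
    Case(90, 95, False),
    Case(60, 95, False),
    Case(70, 120, True),
]

results = in_silico_trial(flags_deterioration, VALIDATION_CASES)
print(results)
```

Because the device is passed in as an ordinary function, the same validation set can be re-run unchanged whenever the device or its logic changes.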

However, is it possible to reduce the assessment of a medical device to a process of validation of the outputs given the specified inputs?

In many cases it is, particularly where the scope of the problem to be solved is limited or where a more complex problem has been decomposed manually into a series of steps, each of which can be tested independently. It really doesn’t matter if the process performed in each step is opaque as long as we can validate the output given a range of inputs. To answer one of our questions: we don’t have to “see inside the box”. I’d regard these as bounded problems in which human supervision has carefully curated the data used as input. In essence, the ML is a supervised process and is functioning much as one would apply regression to a set of data in order to determine the covariates that significantly affect the outcome.
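As a sketch of what testing each step independently might look like, the example below treats two steps as opaque functions and checks each one only against its own input and output pairs. The two steps, the free-text report format and the threshold of 150 for a low platelet count are all assumptions made for illustration.

```python
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]


def validate_step(step: Step, examples: List[Tuple[Any, Any]]) -> list:
    """Return (input, expected, actual) for every example the step gets wrong."""
    failures = []
    for given, expected in examples:
        actual = step(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures


def extract_platelet_count(report: str) -> int:
    """Hypothetical step 1: pull a platelet count out of a 'name: value' report."""
    for line in report.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "platelets":
            return int(value.strip())
    raise ValueError("no platelet count found")


def classify_platelets(count: int) -> str:
    """Hypothetical step 2: the threshold of 150 is illustrative only."""
    return "low" if count < 150 else "not low"


# Each step has its own validation examples and is treated as a black box.
STEP_EXAMPLES = [
    (extract_platelet_count, [("Hb: 130\nPlatelets: 70", 70), ("Platelets: 250", 250)]),
    (classify_platelets, [(70, "low"), (150, "not low"), (250, "not low")]),
]

failures = {step.__name__: validate_step(step, examples) for step, examples in STEP_EXAMPLES}
print(failures)
```

Only once every step passes its own checks is the output of one step fed into the next, which gives a partial window onto the chain without opening any individual box.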

However, what if the inputs and outputs are much more complex and less bounded? In complex medical environments, there may be data relating to a range of parameters such as co-morbidities, patient location, previous admissions, laboratory test results etc. Any or all of these data may potentially be used as input for deep-learning training, and the resulting model may unexpectedly produce random and sometimes bizarre outputs in a minority of situations. The ‘black-box’ inferences resulting from a hypothesis-free, unbounded, unsupervised training paradigm using large amounts of clinical data cannot be explained or validated easily. In these circumstances, a model may use unexpected pieces of data made available as part of training in order to derive the outputs. This is particularly important for data that is not a predictive factor for an outcome, but a result of that outcome.

For example, this might result in inappropriate inferences being made, such as flagging up a patient on the intensive care unit (ICU) as needing intervention more than a patient on the ward. The training data would show the patient on the ICU to be at greater risk of death but, of course, the presence of the patient on the ICU is a result of clinicians identifying that patient and bringing them to the ICU. Such logic may result in paradoxical inferences that ignore a deteriorating patient on a surgical ward.
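The ICU example can be reproduced with a deliberately naive model. In the sketch below, the training records are synthetic and invented purely for illustration, and the “model” is nothing more than the observed death rate for each location, standing in for whatever a more complex system might learn from the same feature.

```python
from collections import defaultdict

# Synthetic training records, invented for illustration: (location, died).
TRAINING = (
    [("ICU", True)] * 30
    + [("ICU", False)] * 70
    + [("ward", True)] * 5
    + [("ward", False)] * 95
)


def fit_rate_by_location(records):
    """Learn the observed death rate for each location."""
    deaths = defaultdict(int)
    totals = defaultdict(int)
    for location, died in records:
        totals[location] += 1
        deaths[location] += died
    return {location: deaths[location] / totals[location] for location in totals}


risk = fit_rate_by_location(TRAINING)

# The model only sees location, so a stable patient on the ICU always outranks
# a deteriorating patient on the ward, whatever their physiology.
stable_icu_score = risk["ICU"]
deteriorating_ward_score = risk["ward"]
print(stable_icu_score, deteriorating_ward_score)
```

The location is a consequence of clinicians having already recognised a sick patient, so a model that relies on it tells us who has been identified rather than who needs to be.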

Do we have a framework that can certify such logic as being fit-for-purpose? Much like a pharmaceutical product, our more complex AI may have unintended consequences that are not easily demonstrable. In essence, the introduction of AI into a clinical environment becomes an intervention in a similar way to a pharmaceutical product. So, does that mean we must run randomised controlled trials of AI solutions?

Healthcare technology deployment

There are important differences between the deployment of technology within healthcare compared to the use of a pharmacological product. For example, a placebo may easily be substituted in place of an actual drug as a part of a randomised trial. If I start using a new treatment for hypertension, it is simply another prescription among many, rather than a change in working practice. However, if I deploy a new electronic patient record system, how can we determine it is effective and safe?

Similarly, we have little data on the effect of machine learning on the behaviour and actions of clinical users. In particular, do we know what happens when users assume technology will ensure that they are not making a mistake? Bob Wachter’s 2015 book, “The Digital Doctor”, explains in detail the catalogue of errors that resulted in a patient being given a 39-fold overdose; in many of those errors, the fundamental issue was inappropriate confidence and trust in the computer.

However, we do not routinely run clinical trials when deploying new software systems within healthcare. When we introduce a system for electronic prescribing, patients are not randomised into two groups. There is not a group of patients whose prescriptions are written electronically while another continues to have their medications prescribed on paper. Instead, a system might be deployed incrementally and rolled out piecemeal across an organisation, or the system is procured because there are safety data relating to the product or the type of product from a different organisation. If we know errors in prescription are common when using paper and they were reduced when an electronic system was deployed in another organisation, is this sufficient evidence of its performance?

Would it be possible to run clinical trials of technology with the same rigour as those used for pharmaceutical products, not simply assessing performance against that claimed but looking for genuine efficacy? Certainly there would be considerable practical difficulties, not least in trying to randomise patients to the intervention (in this case, a new AI system) or to placebo. Most organisations recognise that it is not possible to run randomised controlled trials of changes in their services, structure or indeed, information technology. Instead, they should adopt a continuous improvement methodology in which they continuously measure, analyse and monitor their services and subsequently effect improvements and repeat.

Conclusions for AI

In conclusion, health professionals and patients are ‘information-poor’ and information technology must be used to improve access to information. However, there are risks in such an approach as important data may be lost within a large amount of irrelevant information and there will be an increasing need to curate, summarise and infer from these data in order to support clinical decision making.

As such, my first suggestion is that we should be exploring the use of technology and AI to record and display the right information at the right time. It is possible for me to contrive what information I’d like to see when I see a patient with multiple sclerosis, but how can we do this at scale? Can we train deep learning systems to learn what information to show by giving those systems data on what information a clinician clicks on and reads, and subsequently use those insights and heuristics to learn in real-time? Such a learning system should improve over time in response to the navigation choices made by clinicians in different contexts. As a result, instead of showing me a chronological list of investigations, I might be shown the most recent platelet count (dropped to 70 from a previous normal value) along with a graph.
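As a very rough sketch of the feedback loop I have in mind, the example below ranks what to display using nothing more than click counts per clinical context. This is a deliberate simplification and not a deep learning system: the class `DisplayRanker`, its methods and the context and item names are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import List


class DisplayRanker:
    """Rank items to display by how often clinicians open them in a context."""

    def __init__(self) -> None:
        self._clicks = defaultdict(Counter)

    def record_click(self, context: str, item: str) -> None:
        """Record that a clinician opened `item` while working in `context`."""
        self._clicks[context][item] += 1

    def top_items(self, context: str, n: int = 3) -> List[str]:
        """Return up to `n` items, most frequently opened first."""
        return [item for item, _ in self._clicks[context].most_common(n)]


ranker = DisplayRanker()
for item, times in [("MRI brain report", 3), ("platelet count", 2), ("chest X-ray", 1)]:
    for _ in range(times):
        ranker.record_click("multiple sclerosis", item)

suggested = ranker.top_items("multiple sclerosis", n=2)
print(suggested)
```

Counts like these update with every click, so the ranking changes as clinicians’ navigation choices change; a more sophisticated learner could replace the counter without altering the interface.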

Such a solution will be made much more powerful by ensuring that the AI systems have access to a wide-range of structured clinical information, such as diagnoses and procedures. My second suggestion is therefore to work on the development of a range of clinically useful tools to facilitate the routine, systematic and real-time collection of structured data.

Thirdly, in order to look back at AI ‘decisions’, we must be able to see a snapshot of the data used by AI at the time of the decision. This means that it must be possible to either store or recreate what the algorithm saw at the time of making any inferences as well as a record of those inferences and the action taken.
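One way to keep such a snapshot is to write an immutable record at the moment of each inference. The sketch below assumes the inputs can be represented as JSON; the `InferenceRecord` fields, the `snapshot` helper and the example values are hypothetical, and a real system would also need secure, durable storage.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class InferenceRecord:
    """What the algorithm saw, what it inferred and what was then done."""
    model_version: str
    inputs: dict
    inference: str
    action_taken: str
    recorded_at: str  # ISO 8601 timestamp in UTC

    def to_json(self) -> str:
        """Serialise with sorted keys so the same record always gives the same text."""
        return json.dumps(asdict(self), sort_keys=True)

    def digest(self) -> str:
        """SHA-256 of the serialised record, to help detect later alteration."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


def snapshot(model_version: str, inputs: dict, inference: str, action_taken: str) -> InferenceRecord:
    """Copy the inputs at decision time so later edits to the source cannot change them."""
    frozen_inputs = json.loads(json.dumps(inputs))  # deep copy via JSON
    return InferenceRecord(
        model_version=model_version,
        inputs=frozen_inputs,
        inference=inference,
        action_taken=action_taken,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


live_data = {"platelets": 70, "previous_platelets": 250}
record = snapshot("demo-0.1", live_data, "platelet count has fallen", "flagged to clinician")
live_data["platelets"] = 180  # the live record moves on; the snapshot does not
print(record.to_json())
```

Storing the model version alongside the inputs means that an inference can be recreated later with the same logic, even after the device has been updated.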

Fourthly, the introduction of AI within healthcare should aim to surpass the formal regulatory requirements. Any software deployment should adopt a continuous improvement methodology in which there is an attempt to measure its performance and safety in a real-life clinical environment. All AI systems should have training and subsequent validation using independent data via in silico clinical trials and overseen by appropriate governance arrangements.

In particular, any AI system that meets the requirements of a medical device requires additional safeguards and I would argue that the current requirement to simply demonstrate its performance compared to what is claimed is insufficient. Instead, I suggest that phase III clinical trials of medical devices containing artificial intelligence (AI) must be performed in a way that permits us as a health community to understand the value and the risk from the introduction of artificial intelligence over and above the other technology required.

A controlled trial of AI might be possible within a single organisation but I would envisage that a trial across multiple organisations would be needed in order to compare the efficacy and safety between those using AI and those who are not. For such a study, we would try to isolate the use of machine learning as the only difference between the control and intervention arms. As a result, the underlying technology would need to be deployed across all participating organisations, with the only difference being the use of AI itself.