model, subquestion strategy, and probabilistic inference engines, such as the ones used in WatsonPaths,
can add significant value to scenario-based question
answering.

Discussion
WatsonPaths has some key features that drive the
performance improvement over Watson. The first
and most important is that WatsonPaths has the ability to engage in inference. Watson does well on short
diagnostic questions where one phrase is strongly
associated with the correct diagnosis. WatsonPaths
does better than Watson when there is another factor
that rules out a diagnosis that would otherwise be
likely, or a second symptom that is not explained by
that diagnosis. Holistically performing inference over
the information contained in the question is often
necessary to answer such questions well.
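To make the contrast concrete, the toy sketch below shows how combining evidence multiplicatively lets a single rule-out factor suppress a diagnosis that a one-phrase association would favor. Everything in it is hypothetical: the findings, weights, and scoring scheme are invented for illustration and are not the WatsonPaths inference engine.

```python
# Toy illustration of holistic evidence combination (not the WatsonPaths
# engine): a single strong rule-out factor suppresses a diagnosis that a
# one-phrase association would favor. All findings/weights are hypothetical.

FINDING_WEIGHTS = {
    # (finding, diagnosis) -> multiplicative evidence weight:
    # >1 supports the diagnosis, <1 counts against it, near 0 rules it out.
    ("chest pain", "myocardial infarction"): 5.0,
    ("chest pain", "costochondritis"): 2.0,
    ("normal troponin", "myocardial infarction"): 0.05,  # rule-out factor
    ("tender chest wall", "costochondritis"): 4.0,
}

def score(diagnosis, findings, prior=1.0):
    """Combine every finding in the scenario, not just the strongest match."""
    s = prior
    for f in findings:
        s *= FINDING_WEIGHTS.get((f, diagnosis), 1.0)  # unknown pairs are neutral
    return s

findings = ["chest pain", "normal troponin", "tender chest wall"]
for dx in ("myocardial infarction", "costochondritis"):
    print(dx, score(dx, findings))
# "chest pain" alone favors infarction (5.0 vs. 2.0), but the normal
# troponin rules it out (0.25 vs. 8.0 once all evidence is combined).
```

The product over all findings is just one way to realize the holistic combination described above; the essential point is that the rule-out factor participates in the score at all, which a strongest-phrase match would miss.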

Development of the inference capabilities of WatsonPaths has been driven by empirical results: We continued to add sophistication to inference as long as we detected a statistically significant improvement in accuracy on the development set. The framework supports many more kinds of inference. Some types of inference (for instance, causal inference) already have implementations, but have not yet shown a statistically significant impact. For some of these types, we suspect that improvements in the underlying Watson question answering will be necessary before this impact emerges. For some other kinds of inference, such as reasoning about events in time, we are not convinced that such reasoning will be necessary to do well on the United States Medical Licensing Examination (USMLE) test set. Finally,
some kinds of inference are not well supported by
WatsonPaths. For instance, as we mentioned, statements in WatsonPaths must be either true or false.
Thus explicit reasoning about entities and events,
and the relations between them, would require a
major extension to WatsonPaths. Overall, because
the primary impact of WatsonPaths appears to be its
more advanced inference, and further advancing
inference depends on the quality of Watson’s results,
we believe that the biggest gains in the performance
of WatsonPaths will come from improvements in
Watson.

Another factor contributing to the impact of WatsonPaths is that WatsonPaths seems less likely than Watson to get overwhelmed by irrelevant details in long questions. While Watson tries to weight the relevance of various phrases in the question, its baseline assumption is that all the text is potentially important. Thus irrelevant text can water down the score of a candidate hypothesis that would otherwise get a high score. In WatsonPaths, we ask many subquestions using different ways of breaking down the scenario. For instance, we ask questions about sentences, factors, and combinations of factors. This increases the chances that some set of words will produce a strong inference chain that connects to a hypothesis. In contrast, irrelevant text will be unlikely to produce inference chains to the hypotheses. This property is important, as many real-world applications are not as concise as trivia questions. For instance, medical records often contain large amounts of detail, much of which is irrelevant to a particular question.
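The sketch below illustrates this decomposition-and-aggregation idea with a toy question-answering lookup standing in for Watson. The cue phrases, confidences, and max-based aggregation are invented for illustration and are not the actual WatsonPaths pipeline.

```python
# Illustrative sketch of the subquestion strategy described above. The
# decomposition, the toy QA lookup, and the medical associations are all
# hypothetical simplifications, not the actual WatsonPaths pipeline.
from itertools import combinations

# Stand-in for asking Watson a subquestion: a cue phrase found in the
# subquestion text yields a (hypothesis, confidence) pair.
TOY_QA = {
    "photophobia": ("migraine", 0.7),
    "aura": ("migraine", 0.8),
    "photophobia and aura": ("migraine", 0.95),
    "neck stiffness": ("meningitis", 0.9),
}

def decompose(sentences, factors):
    """Subquestions at several granularities: whole sentences, single
    factors, and combinations (here, pairs) of factors."""
    units = list(sentences) + list(factors)
    units += [f"{a} and {b}" for a, b in combinations(factors, 2)]
    return units

def aggregate(units, hypotheses):
    """A hypothesis gets credit only when some unit produces a confident
    answer connecting to it; units built from irrelevant text produce no
    chain at all, so they cannot dilute the score."""
    scores = {h: 0.0 for h in hypotheses}
    for u in units:
        for cue, (hyp, conf) in TOY_QA.items():
            if cue in u and hyp in scores:
                scores[hyp] = max(scores[hyp], conf)
    return scores

sentences = ["Patient reports headache with photophobia and aura.",
             "The patient recently adopted a cat."]  # irrelevant detail
print(aggregate(decompose(sentences, ["photophobia", "aura"]),
                ["migraine", "meningitis"]))
# -> {'migraine': 0.95, 'meningitis': 0.0}; the irrelevant sentence adds
# nothing and, unlike in a bag-of-words model, subtracts nothing.
```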

Related Work
Clinical decision support systems (CDSSs) have had a long history of development, starting from the early days of artificial intelligence. These systems vary widely in their knowledge representations, reasoning processes, system architectures, medical domain scope, and types of decision (Musen, Middleton, and Greenes 2014). Although several studies have reported on the success of CDSS implementations in improving clinical outcomes (Kawamoto et al. 2005; Roshanov et al. 2013), widespread adoption and routine use are still lacking (Osheroff et al. 2007).

The pioneering Leeds abdominal pain system (De
Dombal et al. 1972) used structured knowledge in the
form of conditional probabilities for diseases and
their symptoms. Using Bayesian reasoning, its diagnostic accuracy was comparable to that of experienced clinicians at the Leeds hospital where it was developed. But it did not adapt successfully to other hospitals or regions, indicating the brittleness of some systems when they are separated from their original developers. A recent systematic review of 162 CDSS implementations shows that success in clinical trials is significantly associated with systems that were evaluated by their own developers (Roshanov et al. 2013).
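The Leeds system's conditional-probability tables support a straightforward Bayes'-rule computation over symptoms. The sketch below shows that style of scoring under a naive-Bayes independence assumption; the diseases, priors, and likelihoods are invented for illustration and are not De Dombal's actual figures.

```python
# A minimal sketch of Bayes'-rule diagnosis over conditional symptom
# probabilities, in the spirit of the Leeds system. The diseases, priors,
# and likelihoods are invented; they are not De Dombal's actual figures.
from math import prod

PRIORS = {"appendicitis": 0.2, "cholecystitis": 0.1, "nonspecific pain": 0.7}

# P(symptom | disease), with symptoms treated as independent given the
# disease -- the naive-Bayes assumption typical of such early systems.
LIKELIHOODS = {
    "appendicitis":     {"rlq pain": 0.8, "nausea": 0.7},
    "cholecystitis":    {"rlq pain": 0.1, "nausea": 0.6},
    "nonspecific pain": {"rlq pain": 0.2, "nausea": 0.3},
}

def posterior(symptoms):
    """P(disease | symptoms) via Bayes' rule with normalization."""
    unnorm = {d: PRIORS[d] * prod(LIKELIHOODS[d].get(s, 0.5) for s in symptoms)
              for d in PRIORS}
    z = sum(unnorm.values())
    return {d: round(v / z, 3) for d, v in unnorm.items()}

print(posterior(["rlq pain", "nausea"]))
# Unnormalized scores: 0.112, 0.006, 0.042 -> appendicitis wins, P ~ 0.70.
```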
MYCIN (Shortliffe 1976) was another early system that used a structured
representation in the form of production rules. Its
scope was limited to the treatment of infectious diseases and, like other systems with structured knowledge bases, it required expert humans to develop and maintain these production rules. This manual
process can prove to be infeasible in many medical
specialties where active research produces new diagnosis and treatment guidelines and phases out older
ones. Many CDSS implementations mitigate this limitation by focusing their manual decision logic development effort on clinical guidelines for specific diseases or treatments, for example, hypertension
management (Goldstein et al. 2001). But such systems lack the ability to handle patient comorbidities
and concurrent treatment plans (Sittig et al. 2008).
Another notable system that used structured knowledge was Internist-1. The knowledge base contained
disease-to-finding mappings represented as conditional probabilities (of disease given finding, and of
finding given disease) mapped to a 1–5 scale. Despite
initial success as a diagnostic tool, its design as an
expert consultant was not considered to meet the
information needs of most physicians. Eventually, its