The ensemble is trained on the same training set
that the individual closed-form inference models use.
To avoid giving excess weight to inference models
that have overfit the training set, we use a common
technique from stacking ensembles (Breiman 1996).
The training set is split into five folds, each leaving
out 20 percent of the training data, as though for
cross validation. Each closed-form inference model is
trained on each fold. When the ensemble gathers an
inference model’s confidence as a feature for an
instance, the inference model uses the learned
parameters from the fold that excludes that instance.
In this way, each inference model’s confidences on the
training instances are test-like, and the ensemble model
does not overly trust overfit models.
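
The fold-based feature generation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the WatsonPaths implementation: the function names, the toy frequency model standing in for a closed-form inference model, and the toy data are all hypothetical.

```python
import numpy as np

def five_fold_indices(n_instances, n_folds=5, seed=0):
    """Partition instance indices into n_folds disjoint held-out sets."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_instances), n_folds)

def out_of_fold_confidences(fit, predict, X, y, n_folds=5, seed=0):
    """Confidence for each training instance, produced by the parameters
    learned on the fold that excludes that instance."""
    n = len(y)
    confidences = np.empty(n)
    for held_out in five_fold_indices(n, n_folds, seed):
        train = np.setdiff1d(np.arange(n), held_out)
        params = fit(X[train], y[train])  # learned without the held-out 20 percent
        confidences[held_out] = predict(params, X[held_out])
    return confidences

# Hypothetical stand-in for an inference model: smoothed label frequency
# per discrete input value (assumption, for illustration only).
def fit_frequency_model(X, y):
    table = {}
    for value in np.unique(X):
        labels = y[X == value]
        table[value] = (labels.sum() + 1.0) / (len(labels) + 2.0)
    return table

def predict_frequency_model(table, X):
    # Unseen input values fall back to a neutral confidence of 0.5.
    return np.array([table.get(value, 0.5) for value in X])

X = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1])
oof = out_of_fold_confidences(fit_frequency_model, predict_frequency_model, X, y)
```

Each entry of `oof` comes from parameters that never saw the corresponding instance, which is what keeps the ensemble's training features comparable to the features it sees at test time.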

The ensemble is a binary logistic regression per
answer hypothesis using three features from each
inference model. The features used are: the probability of the hypothesis, the logit of the probability, and
the rank of the answer among the multiple-choice
answers. Using the logit of the probability ensures
that selecting a single inference model is in the
ensemble’s hypothesis space, achieved by simply setting the weight for that model’s logit feature to one
and all other weights to zero.
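
The three features and the single-model argument can be illustrated with a short sketch. It assumes that rank 1 denotes the answer a model finds most probable and that probabilities are clipped away from 0 and 1 before the logit is taken; both conventions, along with all names and numbers below, are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def logit(p, eps=1e-6):
    # Clip to avoid infinite values at exactly 0 or 1 (assumed convention).
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ensemble_features(probabilities):
    """probabilities: shape (n_models, n_answers) for one question.
    Returns shape (n_answers, 3 * n_models), with the columns for each
    model ordered as probability, logit, rank."""
    probabilities = np.asarray(probabilities, dtype=float)
    n_models, n_answers = probabilities.shape
    # Rank 1 is the most probable answer under that model (assumption).
    order = np.argsort(-probabilities, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n_models)[:, None]
    ranks[rows, order] = np.arange(1, n_answers + 1)
    per_model = np.stack([probabilities, logit(probabilities), ranks], axis=2)
    return per_model.transpose(1, 0, 2).reshape(n_answers, -1)

# Two hypothetical inference models scoring three answer choices.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.30, 0.45, 0.25]])
features = ensemble_features(probs)

# Weight one on the second model's logit feature, zero elsewhere, no bias.
weights = np.zeros(features.shape[1])
weights[3 * 1 + 1] = 1.0
recovered = sigmoid(features @ weights)
```

Because the sigmoid inverts the logit, `recovered` matches the second model's probabilities, so a logistic regression over these features can reproduce any single inference model exactly.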

Each closed-form inference model is also trained
on the full training set. These versions are applied at
test time to generate the features for the ensemble.

Evaluation

For the automatic evaluation of WatsonPaths, we used a set of medical test preparation questions from Exam Master and McGraw-Hill, which are analogous to the examples we have used throughout this article. These questions consist of a paragraph-sized natural language scenario description of a patient case, optionally accompanied by a semistructured tabular structure. The paragraph description typically ends with a punch line question and a set of multiple-choice answers (average 5.2 answer choices per question). We excluded from consideration questions that require image analysis or whose answers are not text segments.

The punch line questions may simply be seeking the most likely disease that caused the patient’s symptoms (for example, “What is the most likely diagnosis in this patient?”), in which case the question is classified as a diagnosis question. The diagnosis question set reported in this evaluation was identified by independent annotators. Nondiagnosis punch line questions may include appropriate treatments, the organism causing the disease, and so on (for example, “What is the most appropriate treatment?” and “Which organism is the most likely cause of his meningitis?” respectively). We observed that most questions that did not directly ask for a diagnosis nonetheless required a diagnosis as an intermediate step. For this reason, we decided that focusing initially on diagnosis questions was a good strategy.

We split our data set of 2190 questions into a training set of 1000 questions, a development set of 690
questions, and a blind test set of 500 questions. The
development set was used to iteratively drive the
development of the scenario analysis, relation generation, and belief engine components, and for parameter tuning. The training set was used to build models used by the learning component.

As noted earlier, our learning process requires subquestion training data to consolidate groups of question-answering features into smaller, more manageable sets of features. We do not have robust and
comprehensive ground truth for a sufficiently large
set of our automatically generated subquestions.
Instead, we use a preexisting set of simple factoid
medical questions as subquestion training data: the
Doctor’s Dilemma (DD) question set.¹ DD is an established benchmark used to assess performance in factoid medical question answering. We use 1039 DD
questions (with a known answer key) as our subquestion training data. Although the Doctor’s Dilemma questions do have some basic similarity to the
subquestions we ask in assertion graphs, there are
some important differences: (1) In an assertion graph
subquestion, there is usually one known entity and
one relation that is being asked about. For DD, the
question may constrain the answer by multiple entities and relations. (2) An assertion graph subquestion
like “What causes hypertension?” has many correct
answers, whereas DD questions have a single best
answer. (3) There may be a mismatch between how
confidence for DD is trained and how subquestion
confidence is used in an inference method. The DD
confidence model is trained to maximize log-likelihood on a correct/incorrect binary classification task.
In contrast, many probabilistic inference methods
use confidence as something like strength of indication or relevance.
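
For reference, the log-likelihood objective for a correct/incorrect binary classification task, as mentioned in point (3), is conventionally written as below. The notation is ours and is not taken from the article: $x_i$ is the feature vector of candidate answer $i$, $y_i \in \{0, 1\}$ marks whether it is correct, and $p_\theta(x_i)$ is the model's confidence that it is correct.

```latex
\mathcal{L}(\theta) = \sum_{i=1}^{N} \Bigl[ y_i \log p_\theta(x_i) + (1 - y_i) \log\bigl(1 - p_\theta(x_i)\bigr) \Bigr]
```

A confidence trained this way estimates the probability that an answer is correct, which need not coincide with the strength of indication or relevance that an inference method expects on an edge.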

For all these reasons, DD data might seem poorly
suited to training a complete model for judging edge strength for subquestion edges in WatsonPaths. In
practice, however, we have found that DD data is useful as subquestion training data because it is easier to
obtain than subquestion ground truth, and so far
shows improved performance over the limited subquestion ground truth we have constructed.² In our
hybrid learning approach, we use 1039 DD questions
for consolidating question-answering features and
then use the smaller, consolidated set of features as
inputs to the inference models that are trained on the
1000 medical test preparation questions.

Metrics and Baseline

As a baseline, we attempted to answer questions from our test set using a simple information-retrieval strategy. It used as much as possible the same corpus and starting point used by WatsonPaths. In this base-