line, we took each sentence in the question and generated an Indri query by removing stop words. We then ran this query over our medical corpus, returning a set of 100 passages. (We also tried different numbers of passages on our development set; 100 passages gave the best results.) The score for each candidate answer was simply the number of times the candidate text appeared in any of these passages. Confidence in each answer was generated by normalizing the scores. For instance, if answer A appeared 4 times in the passages and answer B appeared 1 time, the confidence in answer A would be 80 percent.
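Concretely, the baseline's scoring and normalization can be sketched as follows. This is an illustrative reconstruction rather than the actual implementation; the retrieved passages and candidate strings are assumed inputs.

```python
from collections import Counter

def score_candidates(passages, candidates):
    """Count occurrences of each candidate answer in the retrieved passages,
    then normalize the counts into confidences (illustrative sketch only)."""
    counts = Counter()
    for cand in candidates:
        counts[cand] = sum(passage.lower().count(cand.lower()) for passage in passages)
    total = sum(counts.values())
    if total == 0:
        # No evidence found: fall back to a uniform distribution over answers.
        return {cand: 1.0 / len(candidates) for cand in candidates}
    return {cand: counts[cand] / total for cand in candidates}

# Example mirroring the text: answer A appears 4 times, answer B once.
passages = ["... A ... A ...", "... A ...", "... A ... B ..."]
print(score_candidates(passages, ["A", "B"]))  # {'A': 0.8, 'B': 0.2}
```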

We also evaluated the performance of the Watson
question-answering system adapted for the medical
domain (Ferrucci et al. 2013). We ran this factoid-based pipeline on our scenario-based questions in
order to evaluate the value added by our scenario-based approach. Watson takes the entire scenario as
input and evaluates each multiple-choice answer
based on its likelihood of being the correct answer to
the punch line question. This one-shot approach to
answering medical scenario questions contrasts with
the WatsonPaths approach of decomposing the scenario, asking questions of atomic factors, and performing probabilistic inference over the resulting
graphical model. Note that Watson is the same system
that WatsonPaths uses as a subcomponent. It has
been developed and improved along with WatsonPaths.
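For illustration only, the one-shot application of Watson to a scenario question might be organized as in the sketch below; watson_answer is a stand-in for the actual question-answering call and is not part of any published API.

```python
def one_shot_answer(scenario_text, punch_line_question, choices, watson_answer):
    """Score each multiple-choice option against the full scenario in one pass
    (hypothetical sketch; watson_answer is an assumed confidence-scoring function)."""
    # Concatenate the whole scenario with the final ("punch line") question.
    query = scenario_text + " " + punch_line_question
    scored = [(choice, watson_answer(query, choice)) for choice in choices]
    # Rank the choices by the system's confidence and return the top one.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[0]
```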

We tuned various parameters in the WatsonPaths
system on the development set to balance speed and
performance. The system performs one iteration each
of forward and backward relation generation. The
minimum confidence threshold for expanding a
node is 0.25, and the maximum number of nodes
expanded per iteration is 40. In the relation generation component, the Watson medical question-answering system returns all answers with a confidence above 0.01.
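The tuned settings described above can be collected in a small configuration sketch; the field names are invented for illustration and are not WatsonPaths' actual configuration keys.

```python
# Hypothetical configuration mirroring the tuned values reported in the text.
WATSONPATHS_CONFIG = {
    "forward_relation_iterations": 1,        # one iteration of forward relation generation
    "backward_relation_iterations": 1,       # one iteration of backward relation generation
    "min_node_expansion_confidence": 0.25,   # nodes below this confidence are not expanded
    "max_nodes_expanded_per_iteration": 40,
    "min_answer_confidence": 0.01,           # Watson answers below this confidence are discarded
}
```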

We evaluate system performance both on the full test set and on the diagnosis subset only. The reason for evaluating the diagnosis subset separately is that most questions that do not directly seek a diagnosis in the punch line depend on a correct diagnosis along the way. Thus progress on the diagnosis subset may be a step toward better performance on multistep questions. We use the full 1000 questions in the training set to learn the models for both the baseline system and the WatsonPaths system. As noted earlier, Doctor's Dilemma training data is used to consolidate question-answering features in the WatsonPaths system. In the Watson system that was not part of WatsonPaths, we did not use Doctor's Dilemma training data for any purpose.

Results
Table 1 shows the results of our evaluation on a set of
500 blind questions, of which 156 were identified as diagnosis questions by annotators.

We report results on our blind evaluation data,
using two metrics. Accuracy simply measures the percentage of questions for which a system ranks the
correct answer in top position. Confidence weighted
score is a metric that takes into account both the
accuracy of the system and its confidence in producing the top answer (Voorhees 2003). We sort all
<question, top answer> pairs in an evaluation set in
decreasing order of the system’s confidence in the top
answer and compute the confidence weighted score
as

CWS = (1/n) Σ_{i=1}^{n} (number of correct top answers among the i most confident questions) / i,

where n is the number of questions in the evaluation
set. This metric rewards systems for more accurately
assigning high confidences to correct answers, an
important consideration for real-world question-answering and medical diagnosis systems.
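A small illustrative implementation of this metric, under the assumption that each question is represented by the system's confidence in its top answer and a flag for whether that answer is correct (the variable names are ours, not the paper's):

```python
def confidence_weighted_score(results):
    """Compute confidence weighted score (Voorhees 2003).
    `results` is a list of (confidence, is_correct) pairs, one per question."""
    # Sort question/top-answer pairs by decreasing system confidence.
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    n = len(ranked)
    correct_so_far = 0
    total = 0.0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct_so_far += int(is_correct)
        # Precision over the i most confident questions.
        total += correct_so_far / i
    return total / n

# Example: correct answers given with high confidence are rewarded most.
print(confidence_weighted_score([(0.9, True), (0.8, True), (0.4, False)]))
```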

Because most questions have five or six multiple-choice answers, chance performance on our test set was approximately 19.8 percent. (Uniform guessing yields 20 percent on a five-choice question and about 16.7 percent on a six-choice question; averaging over the mix of questions in the test set gives roughly 19.8 percent.)

Results show that in terms of accuracy, WatsonPaths outperforms both the baseline system and Watson on both the full set and the diagnosis subset. We
used a significance level of p < 0.05. In terms of confidence weighted score, WatsonPaths significantly
outperforms the baseline system on both sets, and
significantly outperforms Watson on the full set. For
the diagnosis subset, the difference between Watson
and WatsonPaths on confidence weighted score was
not statistically significant, despite a 6+ percent score
increase. This is likely due to the small size of the diagnosis subset, which contains only 156 questions.

Overall, these results suggest that WatsonPaths
adds significant value to scenario-based question
answering, over and above a simple information-retrieval baseline, and also over and above a factoid-type question-answering approach. This is true even
when comparing WatsonPaths to the same Watson
system that was developed as a subcomponent for
WatsonPaths. These results suggest that the graphical