Question Answering Collections

Question answering systems return an actual answer, rather than
a ranked list of documents, in response to a question. TREC
has had a question answering track since 1999; in each track
the task was defined such that the systems were to retrieve
small snippets of text that contained an answer for
open-domain, closed-class questions (i.e., fact-based, short-answer
questions that can be drawn from any domain).
This page summarizes the tasks that were used in the
QA tracks and describes the available data sets. For more
details about a particular task, see the question answering
track overview papers in the TREC proceedings.

Creating the true equivalent of a standard retrieval test
collection is an open problem. In a retrieval test collection,
the unit that is judged, the document, has a unique identifier,
and it is trivial to decide whether a document retrieved in a
new retrieval run is the same document that has been judged.
For question answering, the unit that is judged is the entire string
that is returned by the system. Different QA runs very seldom
return exactly the same answer strings, and it is very difficult
to determine automatically whether the difference between a new
string and a judged string is significant with respect to the
correctness of the answer. A partial solution to this problem
is to use so-called answer patterns and accept a string as correct
if it matches an answer pattern for the question. A description
of the use of answer patterns appeared in the SIGIR-2000 proceedings.
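The idea behind answer patterns can be sketched as follows: each question is associated with one or more regular expressions, and a returned string is accepted as correct if any of them matches. This is a minimal illustration only; the question text and pattern shown are hypothetical, and the actual pattern files distributed with the collection define the real syntax.

```python
import re

# Hypothetical example: a question's answer patterns are regular
# expressions; a response string is accepted as correct if any
# pattern matches it. (Illustrative only; see the collection's
# pattern files for the actual patterns.)
patterns = {
    "What is the capital of France?": [r"\bParis\b"],
}

def matches_pattern(question: str, answer_string: str) -> bool:
    """Return True if the answer string matches any pattern for the question."""
    return any(re.search(p, answer_string) for p in patterns.get(question, []))

matches_pattern("What is the capital of France?", "the city of Paris")  # True
```

As the text notes, this is only a partial solution: a pattern can confirm that the right string is present, but not whether the cited document actually supports it.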

A submission for the (main) QA task in each TREC was a ranked
list of up to 5 responses per question. The format of a response was

qid Q0 docno rank score tag answer-string

where

   qid            is the question number
   Q0             is the literal "Q0"
   docno          is the id of a document that supports the answer
   rank           (1-5) is the rank of this response for this question
   score          is a system-dependent indication of the quality of
                  the response
   tag            is the identifier for the system
   answer-string  is the text snippet returned as the answer; the
                  answer-string (only) may contain any embedded white
                  space except a newline
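Because only the answer string may contain embedded white space, a response line can be split on its first six whitespace-delimited fields, leaving the answer string intact. A minimal parsing sketch (the docno and tag values below are made up for illustration):

```python
# Sketch of parsing one response line in the submission format above.
# Splitting on the first six whitespace-delimited fields leaves the
# answer string whole, since only it may contain embedded white space.
def parse_response(line: str) -> dict:
    qid, q0, docno, rank, score, tag, answer = line.split(None, 6)
    return {
        "qid": qid,
        "docno": docno,           # document supporting the answer
        "rank": int(rank),        # 1-5
        "score": float(score),    # system-dependent quality estimate
        "tag": tag,               # system identifier
        "answer_string": answer,
    }

resp = parse_response("201 Q0 AP890101-0001 1 0.87 mysys the city of Paris")
```

Note that this simple split assumes a non-empty answer string; the TREC 2001 "NIL" responses, which carry an empty answer string, would need a special case.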

The judgment set for a task contains all unique [docno,answer-string]
pairs from all submissions to the track. For TREC 2001 only,
the docno may also be the string "NIL", in which case the answer-string
is empty and the response indicates the system's belief that
there is no correct response in the document collection.

The format of a judgment set is

qid docno judgment answer-string

where the fields are as in the submissions and judgment is -1 for wrong,
1 for correct, and 2 for unsupported. "Unsupported" means
that the string contains a correct response, but the document returned
with that string does not allow one to recognize that it is a correct response.
TREC-8 runs were judged only correct (1) or incorrect (-1).
A very detailed description of how answer strings were judged is given
in the paper "The TREC-8 Question Answering Track Evaluation"
in the TREC-8 proceedings.
The "NIL" responses are not included in the judgment set for TREC 2001.
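The judgment values and the strict/lenient distinction described above can be sketched as a simple lookup over judged [docno, answer-string] pairs. The docnos and strings here are invented for illustration, and treating an unjudged pair as wrong is an assumption of this sketch:

```python
# Judged [docno, answer-string] pairs map to a judgment:
# -1 = wrong, 1 = correct, 2 = unsupported (correct string, but the
# cited document does not allow one to recognize it as correct).
judgments = {
    ("AP890101-0001", "the city of Paris"): 1,
    ("AP890101-0002", "Paris"): 2,  # unsupported
}

def is_correct(docno: str, answer_string: str, lenient: bool = False) -> bool:
    """Strict evaluation counts unsupported (2) as wrong;
    lenient evaluation counts it as correct. Unjudged pairs are
    treated as wrong here, an assumption of this sketch."""
    j = judgments.get((docno, answer_string), -1)
    return j == 1 or (lenient and j == 2)
```

Under strict evaluation the second pair above is wrong; under lenient evaluation it is correct.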

Some potential participants in the TREC QA tracks do not have ready access
to a document retrieval system. To facilitate participation by these
groups, TREC provided a ranking of the top X (X=200 for TREC-8, and
1000 otherwise) documents as ranked by either the AT&T version of SMART
(TRECs 8,9) or PRISE (TREC 2001) when using the question as a query.
TREC also provided the full text of the top Y (Y=200 for TREC-8, 50 otherwise)
documents using this same ranking. These rankings were provided as
a service only; the documents containing a correct answer were not always
contained in the ranking. Links to both the rankings and the document texts
are provided below. The document texts are password-protected to
comply with the licensing agreements with the document providers.
To gain access to the texts, send details of when you obtained the
TREC and TIPSTER disks to the TREC manager trec@nist.gov asking for
the access sequence to the document texts.

The QA task runs were evaluated using mean reciprocal rank (MRR).
The score for an individual question was the reciprocal of the
rank at which the first correct answer was returned, or 0 if no correct
response was returned. The score for the run was then the mean
over the set of questions in the test. The number of questions for which
no correct response was returned was also reported. Starting in TREC-9,
two versions of the scores were reported: "strict" evaluation
where unsupported responses were counted as wrong, and "lenient"
evaluation where unsupported responses were counted as correct.
A Perl script that uses the answer patterns described above to judge responses
and then calculates MRR and number of questions with no correct
response returned is included with the collection data (perl script not
yet available for 2001). (There is no distinction between strict or
lenient evaluation with pattern-judged runs since the patterns cannot
detect unsupported answers.)
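The MRR computation described above can be sketched directly: each question contributes the reciprocal of the rank of its first correct response (or 0 if none of the up-to-five responses is correct), and the run score is the mean over all questions. The input representation here is an assumption of this sketch:

```python
# Mean reciprocal rank as described above. Input: for each question,
# a list of booleans giving the correctness of each ranked response
# (up to 5 entries). Returns the MRR and the number of questions with
# no correct response.
def mrr(runs: dict) -> tuple:
    total, no_correct = 0.0, 0
    for qid, correct in runs.items():
        # Rank (1-based) of the first correct response, if any.
        rank = next((i + 1 for i, c in enumerate(correct) if c), None)
        if rank is None:
            no_correct += 1
        else:
            total += 1.0 / rank
    return total / len(runs), no_correct

score, missed = mrr({"1": [False, True], "2": [False] * 5, "3": [True]})
# question 1 scores 1/2, question 2 scores 0, question 3 scores 1
```

For the three questions above the run scores (0.5 + 0 + 1) / 3 ≈ 0.5, with one question having no correct response.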