Corpus Filtering Details

As of the second year (2014) of the track, a pre-filtered version of the corpus was provided to reduce the filtering load on participants. For the 2015 track, a similar filtered corpus was provided, along with a pre-filtered version of the corpus for the original 2013 track. This page summarizes how these filtered corpora were created.

TREC-TS-2014F

The TREC-TS-2014F dataset is a filtered version of the KBA 2014
corpus. It is stored in the same format, follows the same file
structure (ordered into per-hour folders) and is encrypted with the
same GPG key (see above). To create this corpus, two levels of
filtering were performed. First, any documents that were published
outside the time periods of the 15 events from the TREC-TS 2014 track
topics were removed, i.e. only documents with timestamps between the start and
end tag for one or more TREC-TS 2014 topics were kept. Second, we filtered the
remaining documents, keeping only those which were likely to contain one or
more sentences relevant to an event. This filtering was performed as
follows:

For each hour within the time period of each event, all documents
from the KBA 2014 corpus that were published within that hour were
indexed using the open source Terrier IR platform v4.0 (see
terrier.org). The title of each document (if available), and any text
within the body sentences were indexed. Terrier's stopword list and
Porter stemming were applied.

The TREC organisers manually identified a set of queries
representing the topics of interest relating to each of the 15
events, creating event-query pairs. (These will not be released until
after the final submission of runs.)

For each event-query pair, Terrier was used to retrieve the top
1000 documents for each query incrementally from each hour index (for
the hours belonging to the associated event). The retrieval model
used was BM25 with default parameters. In this way, we aimed to create a
high-recall set of documents for participants to summarise each event
from.

Documents that were not retrieved for any query were
then filtered out, forming the final TREC-TS-2014F dataset.
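
The two filtering levels described above can be sketched in a few lines of Python. This is a minimal illustration rather than the organisers' actual pipeline: the function names (`hour_bucket`, `overlap_score`, `filter_corpus`), the in-memory document tuples, and the toy term-overlap scorer standing in for Terrier's BM25 are all assumptions made for the example. It also assumes that a document sharing no terms with a query is not retrieved for that query.

```python
from collections import defaultdict
from datetime import datetime


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (one 'hour index' per bucket)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def overlap_score(query: str, text: str) -> int:
    """Toy stand-in for BM25 (assumption): number of query terms present in the text."""
    terms = set(text.lower().split())
    return sum(1 for t in query.lower().split() if t in terms)


def filter_corpus(docs, events, queries, score, cutoff=1000):
    """Return the ids of documents kept by the two-level filter.

    docs:    list of (doc_id, timestamp, text) tuples
    events:  {event_id: (start, end)} time period of each event
    queries: {event_id: [query, ...]} event-query pairs
    score:   callable(query, text) -> number, higher is better
    cutoff:  rank cutoff applied per query and per hour
    """
    kept = set()
    for event_id, (start, end) in events.items():
        # Level 1: keep only documents published within the event's time period,
        # grouped by the hour in which they were published.
        by_hour = defaultdict(list)
        for doc_id, ts, text in docs:
            if start <= ts <= end:
                by_hour[hour_bucket(ts)].append((doc_id, text))

        # Level 2: for each hour and each query, keep the top `cutoff` documents.
        for hour_docs in by_hour.values():
            for query in queries.get(event_id, []):
                scored = [(score(query, text), doc_id) for doc_id, text in hour_docs]
                # Documents with a zero score are treated as not retrieved.
                retrieved = [pair for pair in scored if pair[0] > 0]
                retrieved.sort(key=lambda pair: pair[0], reverse=True)
                kept.update(doc_id for _, doc_id in retrieved[:cutoff])
    return kept
```

A document survives if it is ranked within the cutoff for at least one query in at least one hour of an event, so the result is the union over all event-query pairs. The `cutoff` argument is the only knob changed when a different rank cutoff is used.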

TREC-TS-2015F

The TREC-TS-2015F dataset is a filtered version of the KBA 2014
corpus for the TREC-TS 2015 topics. The filtering methodology is identical to that used for the TREC-TS-2014F dataset, except that the rank cutoff was 100 rather than 1000. This smaller rank cutoff was chosen because it was observed that most of the relevant content was available in the top-ranked documents. As a result, the 2015 dataset is smaller than the 2014 dataset.

TREC-TS-2013F

The TREC-TS-2013F dataset was released in 2015 for participants who wanted to train their systems using the 2013 topics. Importantly, the filtering methodology used to create this dataset is not the same as that used for the other filtered versions. In particular, TREC-TS-2013F is a filtered version of the KBA 2013
corpus for the TREC-TS 2013 topics that was originally created by a participant in the 2014 TREC track. To create this corpus, two levels of
filtering were performed. First, any documents that were published
outside the time periods of the 9 events from the TREC-TS 2013 track
topics were removed and only documents from the 'news' subset were considered. Second, the
remaining documents were subject to a machine-learned document classifier trained on hand-annotated documents collected from the Reuters news agency for other events. This classifier uses basic distance metrics between the document and the initial event representation (query).
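
The specific distance metrics used by that classifier are not listed here. Purely as an illustration of what document-to-query distance features might look like, the sketch below computes cosine and Jaccard distances over whitespace tokens. The choice of these two metrics, the tokenisation, and the function names are all assumptions, not a description of the participant's system.

```python
import math
from collections import Counter


def tokens(text: str) -> list:
    """Lowercase whitespace tokenisation (assumption; no stemming or stopwords)."""
    return text.lower().split()


def cosine_distance(a: str, b: str) -> float:
    """1 minus the cosine similarity of the two term-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the ratio of shared terms to all distinct terms."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 1.0


def distance_features(document: str, query: str) -> list:
    """Hypothetical feature vector that a document classifier could be trained on."""
    return [cosine_distance(document, query), jaccard_distance(document, query)]
```

Features of this kind depend only on the document and the query, which is consistent with a classifier trained on annotated documents from other events being applied to new ones.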