ARC

The AI2 Reasoning Challenge (ARC)
dataset is a question answering dataset containing 7,787 genuine grade-school level, multiple-choice science questions.
The dataset is partitioned into a Challenge Set and an Easy Set. The Challenge Set contains only questions
answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. Models are evaluated
based on accuracy.
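
The partition rule and the accuracy metric can be illustrated with a minimal sketch. The record layout (a dict with an `answer` index) and the two solver callables are hypothetical stand-ins for illustration; they are not the baselines used to build the dataset.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: {"stem": str, "choices": List[str], "answer": int}
Question = Dict[str, object]
# A solver returns the index of the choice it predicts.
Solver = Callable[[Question], int]


def partition(
    questions: List[Question],
    retrieval_solver: Solver,
    cooccurrence_solver: Solver,
) -> Tuple[List[Question], List[Question]]:
    """Split questions into (challenge, easy).

    A question goes to the challenge split only if BOTH baseline
    solvers answer it incorrectly; everything else is easy.
    """
    challenge: List[Question] = []
    easy: List[Question] = []
    for q in questions:
        both_wrong = (
            retrieval_solver(q) != q["answer"]
            and cooccurrence_solver(q) != q["answer"]
        )
        (challenge if both_wrong else easy).append(q)
    return challenge, easy


def accuracy(questions: List[Question], solver: Solver) -> float:
    """Fraction of questions for which the solver picks the gold choice."""
    if not questions:
        return 0.0
    return sum(solver(q) == q["answer"] for q in questions) / len(questions)
```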

ShARC

Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader’s background knowledge. We formalise this task and introduce the challenging ShARC dataset with 32k task instances.

The goal is to answer questions by possibly asking follow-up questions first. We assume that the question does not provide enough information to be answered directly. However, a model can use the supporting rule text to infer what needs to be asked in order to determine the final answer. Concretely, the model must decide whether to answer with “Yes”, “No”, or “Irrelevant”, or to generate a follow-up question, given rule text, a user scenario and a conversation history. Performance is measured with Micro and Macro Accuracy for the “Yes”/“No”/“Irrelevant”/“More” classification, and the quality of follow-up questions is measured with BLEU.
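
The difference between the two accuracy figures can be sketched as follows. This assumes that Micro Accuracy is the overall fraction of correct decisions and that Macro Accuracy is the unweighted mean of per-class accuracy over the gold classes; the function name is illustrative, and the official evaluation script may differ in detail.

```python
from collections import defaultdict
from typing import Sequence, Tuple

LABELS = ("Yes", "No", "Irrelevant", "More")


def micro_macro_accuracy(
    gold: Sequence[str], pred: Sequence[str]
) -> Tuple[float, float]:
    """Return (micro, macro) accuracy for the four-way decision.

    Assumption: macro accuracy averages per-class accuracy over the
    classes that actually occur in the gold labels.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)

    micro = sum(correct.values()) / len(gold)
    per_class = [correct[label] / total[label] for label in total]
    macro = sum(per_class) / len(per_class)
    return micro, macro
```

With an imbalanced label distribution the two numbers diverge: a model that is always right on a frequent class but always wrong on a rare one scores well on micro accuracy and poorly on macro accuracy.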

The public data, further task details and public leaderboard are available on the ShARC Website.

Reading comprehension

Most current question answering datasets frame the task as reading comprehension, where the question is about a paragraph
or document and the answer is often a span in the document. The Machine Reading group
at UCL also provides an overview of reading comprehension tasks.

CliCR

The CliCR dataset is a gap-filling reading comprehension dataset consisting of around 100,000 queries and their associated documents. The dataset was built from clinical case reports, requiring the reader to answer the query with a medical problem/test/treatment entity. The abilities to perform bridging inferences and track objects have been found to be the most frequently required skills for successful answering.

The instructions for accessing the dataset, the processing scripts, the baselines and the adaptations of some neural models can be found here.

Example:

| Document | Question | Answer |
| --- | --- | --- |
| We report a case of a 72-year-old Caucasian woman with pl-7 positive antisynthetase syndrome. Clinical presentation included interstitial lung disease, myositis, mechanic’s hands and dysphagia. As lung injury was the main concern, treatment consisted of prednisolone and cyclophosphamide. Complete remission with reversal of pulmonary damage was achieved, as reported by CT scan, pulmonary function tests and functional status. […] | Therefore, in severe cases an aggressive treatment, combining ____ and glucocorticoids as used in systemic vasculitis, is suggested. | |

CNN / Daily Mail

The CNN / Daily Mail dataset is a Cloze-style reading comprehension dataset
created from CNN and Daily Mail news articles using heuristics. Cloze-style
means that a missing word has to be inferred. In this case, “questions” were created by replacing entities
from bullet points summarizing one or several aspects of the article. Coreferent entities have been replaced with an
entity marker @entityn, where n is a distinct index.
The model is tasked with inferring the missing entity
in the bullet point based on the content of the corresponding article, and models are evaluated based on
their accuracy on the test set.
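
The construction described above can be sketched in a few lines. This is a simplified illustration, not the original preprocessing pipeline: it assumes the entity strings are already known and uses plain string replacement, whereas the real dataset relies on entity recognition and coreference resolution. The function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def build_entity_map(entities: List[str]) -> Dict[str, str]:
    """Assign each distinct entity string a marker @entity0, @entity1, ..."""
    mapping: Dict[str, str] = {}
    for name in entities:
        mapping.setdefault(name, f"@entity{len(mapping)}")
    return mapping


def apply_entity_map(text: str, mapping: Dict[str, str]) -> str:
    """Replace entity mentions by their markers, longest names first."""
    for name in sorted(mapping, key=len, reverse=True):
        text = text.replace(name, mapping[name])
    return text


def make_cloze(
    article: str, bullet: str, entities: List[str], answer: str
) -> Tuple[str, str, str]:
    """Return (passage, question, answer_marker) for one bullet point."""
    mapping = build_entity_map(entities)
    passage = apply_entity_map(article, mapping)
    marker = mapping[answer]
    anonymized_bullet = apply_entity_map(bullet, mapping)
    # Blank out the first mention of the answer entity. The lookahead
    # keeps @entity1 from matching the prefix of @entity10.
    question = re.sub(
        re.escape(marker) + r"(?!\d)", "@placeholder", anonymized_bullet, count=1
    )
    return passage, question, marker
```

Because both the passage and the question use the same markers, a model cannot rely on world knowledge about the entity names and has to read the passage to recover the marker hidden behind @placeholder.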

| | CNN | Daily Mail |
| --- | --- | --- |
| # Train | 380,298 | 879,450 |
| # Dev | 3,924 | 64,835 |
| # Test | 3,198 | 53,182 |

Example:

| Passage | Question | Answer |
| --- | --- | --- |
| ( @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel “ @entity11 “ will feature a capable but flawed @entity13 official named @entity14 who “ also happens to be a lesbian . “ the character is the first gay figure in the official @entity6 – the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 – according to @entity24 , editor of “ @entity6 “ books at @entity28 imprint @entity26 . | characters in “ @placeholder “ movies have gradually become more diverse | |

CoQA

CoQA is a large-scale dataset for building Conversational Question Answering systems.
CoQA contains 127,000+ questions with answers collected from 8,000+ conversations.
Each conversation is collected by pairing two crowdworkers to chat about a passage in the form of questions and answers.

HotpotQA

HotpotQA is a dataset with 113k Wikipedia-based question-answer pairs. Questions require
finding and reasoning over multiple supporting documents and are not constrained to any pre-existing knowledge bases.
Sentence-level supporting facts are available.

MultiRC

MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph.
We have designed the dataset with three key challenges in mind:

The number of correct answer-options for each question is not pre-specified. This removes the over-reliance of current approaches on answer-options and forces them to decide on the correctness of each candidate answer independently of others. In other words, unlike previous work, the task here is not to simply identify the best answer-option, but to evaluate the correctness of each answer-option individually.

The correct answer(s) need not be a span in the text.

The paragraphs in our dataset come from 7 different domains, such as news, fiction and historical text, and hence are expected to be more diverse in their contents than those of single-domain datasets.

NewsQA

The NewsQA dataset is a reading comprehension dataset of over 100,000
human-generated question-answer pairs from over 10,000 news articles from CNN, with answers consisting of spans of text
from the corresponding articles.
Some challenging characteristics of this dataset are:

Answers are spans of arbitrary length;

Some questions have no answer in the corresponding article;

There are no candidate answers from which to choose.

Although very similar to the SQuAD dataset, NewsQA offered a greater challenge to existing models at the time of its
introduction (e.g., the paragraphs are longer than those in SQuAD). Models are evaluated based on F1 and Exact Match.
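
The two metrics can be sketched as follows. The normalization (lowercasing, stripping punctuation and the articles a/an/the) follows the convention commonly used for SQuAD-style span evaluation and is an assumption here; the official NewsQA evaluation may differ in detail.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact Match only rewards a prediction that reproduces the gold span after normalization, while F1 gives partial credit for overlapping tokens, which matters when answers are spans of arbitrary length.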

Example:

| Story | Question | Answer |
| --- | --- | --- |
| MOSCOW, Russia (CNN) – Russian space officials say the crew of the Soyuz space ship is resting after a rough ride back to Earth. A South Korean bioengineer was one of three people on board the Soyuz capsule. The craft carrying South Korea’s first astronaut landed in northern Kazakhstan on Saturday, 260 miles (418 kilometers) off its mark, they said. Mission Control spokesman Valery Lyndin said the condition of the crew – South Korean bioengineer Yi So-yeon, American astronaut Peggy Whitson and Russian flight engineer Yuri Malenchenko – was satisfactory, though the three had been subjected to severe G-forces during the re-entry. […] | | |

QAngaroo

QAngaroo is a set of two reading comprehension datasets
that require multiple steps of inference combining facts from multiple documents. The first dataset, WikiHop,
is open-domain and focuses on Wikipedia articles. The second dataset, MedHop, is based on paper abstracts from
PubMed.

QuAC

Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information seeking dialog.
Data instances consist of an interactive dialog between two crowd workers:
(1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text,
and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.

RACE

The RACE dataset is a reading comprehension dataset
collected from English examinations in China, which are designed for middle school and high school students.
The dataset contains more than 28,000 passages and nearly 100,000 questions and can be
downloaded here. Models are evaluated based on accuracy
on middle school examinations (RACE-m), high school examinations (RACE-h), and on the total dataset (RACE).

SQuAD

The Stanford Question Answering Dataset (SQuAD)
is a reading comprehension dataset, consisting of questions posed by crowdworkers
on a set of Wikipedia articles. The answer to every question is a segment of text (a span)
from the corresponding reading passage. Recently, SQuAD 2.0
has been released, which includes unanswerable questions.

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.

NarrativeQA

NarrativeQA is a dataset built to encourage deeper comprehension of language. It requires reasoning over entire books or movie scripts and contains approximately 45K question-answer pairs in free-form text. The dataset has two modes: (1) reading comprehension over summaries and (2) reading comprehension over entire books/scripts.

DuoRC

Several of the questions in DuoRC, while seeming relevant, cannot actually be answered from the given passage. This requires the model to detect that a question is unanswerable, an ability that is particularly important in industrial settings.

DROP

DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.

Quasar

Quasar is a dataset for open-domain question answering. It includes two parts: (1) The Quasar-S dataset consists of 37,000 cloze-style queries constructed from definitions of software entity tags on the popular website Stack Overflow. (2) The Quasar-T dataset consists of 43,000 open-domain trivia questions and their answers obtained from various internet sources.

SearchQA

SearchQA was constructed to reflect a full pipeline of general question-answering. SearchQA consists of more than 140k question-answer pairs, with each pair having 49.6 snippets on average. Each question-answer-context tuple in SearchQA comes with additional meta-data such as the snippet’s URL.

Knowledge Base Question Answering

Knowledge Base Question Answering is the task of answering natural language questions based on a knowledge base/knowledge graph such as DBpedia or Wikidata.

QALD-9

QALD-9, published in 2018, is a manually curated superset of the previous eight editions of the Question Answering over Linked Data (QALD) challenge. It was constructed by human experts to cover a wide range of natural language to SPARQL conversions based on the DBpedia 2016-10 knowledge base. Each question-answer pair has additional meta-data. QALD-9 is best evaluated using the GERBIL QA platform for repeatability of the evaluation numbers.

| Annotator | Macro P | Macro R | Macro F1 | Error Count | Average Time/Doc (ms) | Macro F1 QALD | Paper (including links to webservices/source code) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Elon (WS) | 0.049 | 0.053 | 0.050 | 2 | 219 | 0.100 | |
| QASystem (WS) | 0.097 | 0.116 | 0.098 | 0 | 1014 | 0.200 | |
| TeBaQA (WS) | 0.129 | 0.134 | 0.130 | 0 | 2668 | 0.222 | |
| wdaqua-core1 (DBpedia) | 0.261 | 0.267 | 0.250 | 0 | 661 | 0.289 | Diefenbach, Dennis, Kamal Singh, and Pierre Maret. “Wdaqua-core1: a question answering service for rdf knowledge bases.” Companion of The Web Conference 2018 on The Web Conference 2018. International World Wide Web Conferences Steering Committee, 2018. |
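
The Macro P, Macro R and Macro F1 columns are averages of per-question scores over answer sets. The sketch below shows one common way to compute them; the handling of empty answer sets is an assumption, the function names are illustrative, and the code does not attempt to reproduce GERBIL's exact conventions or the separate Macro F1 QALD column.

```python
from typing import Iterable, Set, Tuple


def precision_recall_f1(gold: Set[str], pred: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for a single question."""
    if not gold and not pred:
        # Assumption: an empty answer to a question with no gold
        # answers counts as fully correct.
        return 1.0, 1.0, 1.0
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def macro_scores(
    examples: Iterable[Tuple[Set[str], Set[str]]]
) -> Tuple[float, float, float]:
    """Average per-question (P, R, F1) over (gold, predicted) answer sets."""
    scores = [precision_recall_f1(gold, pred) for gold, pred in examples]
    if not scores:
        raise ValueError("at least one example is required")
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )
```

Note that under this definition the macro F1 is the mean of per-question F1 values, so it is generally not equal to the harmonic mean of Macro P and Macro R.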