Project Debater Datasets

The development of an automatic debating system naturally involves advancing research in a range of artificial intelligence fields. This page presents several annotated datasets developed as part of Project Debater to facilitate this research. It is organized by the research sub-fields explained below.

Argument Mining is a prominent research frontier. Within this field, we distinguish between Argument Detection (the detection and segmentation of argument components such as claims and evidence) and Argument Stance Classification (determining the polarity of an argument component with respect to a given topic).

Beyond argument mining, a debating system must face the challenge of interactivity, i.e., the ability to understand and rebut the text of the opponent's speech. Debate Speech Analysis is a new research field that focuses on this challenge.

Another important aspect of a debating system is the ability to interact with its surroundings in a human-like manner. Namely, it should be able to articulate arguments and listen to arguments made by others. Regarding the former, the Text to Speech system must demonstrate human-like expressiveness to keep human listeners engaged. The latter may call for Speech-to-text systems that are specially designed for a debating scenario.

Finally, a debating system should naturally rely on more fundamental NLP capabilities. One example is the ability to assess the semantic relatedness of various pieces of text and glue these into a coherent narrative. The system should also be able to identify the basic concepts mentioned in the text. The corresponding benchmark data we have released thus far in this context are described in the section on Basic NLP.

Argument Detection

The various argument detection datasets differ in size (e.g., number of topics), type of element detected (claims, claim sentences, or evidence), and method used for detection (pre-selected articles vs. automatic retrieval). The datasets and their characteristics are listed below:

IBM Debater® - Evidence Sentences
5,785 pairs of a topic and a sentence with a binary annotation indicating whether the sentence is valid evidence relevant to the topic. The sentences were extracted from Wikipedia, and the prior for a positive instance is 41%.
The data set includes 118 diverse topics, from domains such as politics, science and education. Each topic generally deals with one clearly identifiable concept.
The data set is split into two sets: 83 topics for training (4,066 sentences), and 35 topics for testing (1,719 sentences).
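As a rough illustration of working with such a release, the snippet below computes the positive-instance prior from a (topic, sentence, label) CSV. The column names and the inline sample rows are assumptions for the sketch, not taken from the actual release files:

```python
import csv
import io

# Toy rows mimicking an assumed (topic, sentence, label) layout;
# the real release may use different file and column names.
SAMPLE = (
    "topic,sentence,label\n"
    "We should ban smoking,Smoking causes lung cancer,1\n"
    "We should ban smoking,The weather was nice,0\n"
    "We should subsidize space exploration,Satellites enable GPS,1\n"
)

def positive_prior(csv_text):
    """Fraction of sentences labeled as valid evidence."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    positives = sum(int(r["label"]) for r in rows)
    return positives / len(rows)

print(round(positive_prior(SAMPLE), 2))  # → 0.67
```

Running the same computation over the full training and test files should reproduce the 41% prior reported above.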

IBM Debater® - Claims and Evidence
2,294 labeled claims and 4,690 labeled evidence items for 58 different topics, published by Rinott et al. at EMNLP 2015. This dataset is an extension of the CE-ACL-2014 data.

The dataset includes:
- Two CSV files containing, for each topic, the claims and evidence that were identified for it in relevant Wikipedia articles.
- The original Wikipedia articles - from the April 2012 Wikipedia dump - in the form of text files, cleaned of any wiki syntax or HTML markup.

IBM Debater® - Claims and Evidence
1,392 labeled claims for 33 different topics, and 1,291 labeled evidence for 350 distinct claims in 12 different topics. These data were published by Aharoni et al. in the First Workshop on Argumentation Mining at ACL-2014.

The dataset includes:
- Two CSV files containing, for each topic, the claims and evidence that were identified for it in relevant Wikipedia articles.
- The original Wikipedia articles - from the April 2012 Wikipedia dump - in the form of text files, cleaned of any wiki syntax or HTML markup.

Argument Stance Classification and Sentiment Analysis

A debating system must distinguish between arguments that support its side in the debate and those supporting the opponent’s side. The following datasets were developed as part of the work on Project Debater’s stance classification engine.

Claim Stance

The claim stance dataset includes stance annotations for claims, as well as auxiliary annotations for intermediate stance classification subtasks.

IBM Debater® - Claim Stance Dataset
2,394 labeled claims for 55 topics. The dataset includes the stance (Pro/Con) of each claim towards the topic, as well as fine-grained annotations, based on the semantic model of Bar-Haim et al. [EACL 2017] (topic target, topic sentiment towards its target, claim target, claim sentiment towards its target, and the relation between the targets).

The dataset includes:
- A UTF-8 JSON file containing the topics and the claims found for these topics in Wikipedia articles. Topics and claims are annotated as described above.
- A UTF-8 CSV file containing the same information as the JSON file.
- The original Wikipedia articles - from the April 2012 Wikipedia dump - in the form of text files. For each article, we provide both the original (raw) version and a clean version, in which any wiki syntax and HTML markup is removed.
- A CSV index file containing the article title and Wikipedia URL for each article.

The dataset is divided into a training set (25 topics, 1,039 claims) and a test set (30 topics, 1,355 claims).
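The fine-grained annotations compose into an overall stance: roughly, the claim's stance toward the topic follows from the claim's sentiment toward its target, the topic's sentiment toward its target, and the relation between the two targets (consistent or contrastive). The sign-product sketch below illustrates this idea; the exact formulation of Bar-Haim et al. may differ:

```python
def compose_stance(topic_sentiment, claim_sentiment, targets_relation):
    """Compose fine-grained sentiment signs into a claim's stance.

    Each argument is +1 or -1:
      topic_sentiment  - topic's sentiment towards its target
      claim_sentiment  - claim's sentiment towards its target
      targets_relation - +1 if the targets are consistent, -1 if contrastive
    """
    product = topic_sentiment * claim_sentiment * targets_relation
    return "PRO" if product > 0 else "CON"

# Topic: "We should abolish the death penalty" (negative sentiment towards
# "the death penalty"); claim: "capital punishment deters crime" (positive
# sentiment towards "capital punishment", a target consistent with the
# topic's target). The claim therefore opposes the topic.
print(compose_stance(-1, +1, +1))  # → CON
```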

Sentiment Analysis

Sentiment analysis is an important sub-component of our stance classification engine. The following two resources address sentiment analysis of complex expressions, which goes beyond simple aggregation of word-level sentiments. The first resource is a sentiment lexicon of idiomatic expressions, like “on cloud nine” and “under fire”. The second resource addresses sentiment composition – predicting the sentiment of a phrase from the interaction between its constituents. For example, in the phrases “reduced bureaucracy” and “fresh injury”, both “reduced” and “fresh” are followed by a negative word. However, “reduced” flips the negative polarity, resulting in a positive phrase, while “fresh” propagates the negative polarity to the phrase level, resulting in a negative phrase. Accordingly, “reduced” is part of our “reversers” lexicon, and “fresh” is part of the “propagators” lexicon.
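The reverser/propagator distinction can be sketched as a tiny composition rule. The lexicon entries below are just the two examples from the text; the released lexicons are far larger:

```python
# Toy sentiment-composition sketch using the examples from the text.
REVERSERS = {"reduced"}      # flip the polarity of the following word
PROPAGATORS = {"fresh"}      # pass the polarity through unchanged
WORD_POLARITY = {"bureaucracy": -1, "injury": -1}

def phrase_polarity(modifier, noun):
    """Predict the polarity of a two-word phrase from its constituents."""
    polarity = WORD_POLARITY[noun]
    if modifier in REVERSERS:
        return -polarity
    return polarity  # propagators (and unknown modifiers) keep the noun's polarity

print(phrase_polarity("reduced", "bureaucracy"))  # → 1 (positive phrase)
print(phrase_polarity("fresh", "injury"))         # → -1 (negative phrase)
```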

IBM Debater® - Sentiment Lexicon of IDiomatic Expressions (SLIDE)
5,000 frequently occurring idioms with sentiment annotations. The idioms were selected from Wiktionary, and over 40% of them were found to be sentiment-bearing via crowdsourced labeling.
The dataset includes the idioms, their sentiment labels, and the distribution of sentiment annotations from the crowdsourced labels.

Expert Stance

Expert evidence (premise) is a commonly used type of argumentation scheme. Prior knowledge about the expert’s stance towards the debate topic can help predict the polarity of such arguments. For example, an argument made by Richard Dawkins about atheism is likely to have a PRO stance, since Dawkins is a well-known atheist. Such information can be extracted from Wikipedia categories: Dawkins, for instance, is listed under “Antitheists”, “Atheism activists”, “Atheist feminists” and “Critics of religions”. The Wikipedia Category Stance dataset contains stance annotations of Wikipedia categories towards Wikipedia concepts representing controversial topics.
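A toy illustration of the lookup this enables: given an expert's Wikipedia categories and a category-stance table (the table entries here are invented for the sketch, not taken from the dataset), the expert's likely stance toward a topic concept follows directly:

```python
# Hypothetical (category, topic concept) -> stance entries, for illustration
# only; the released dataset provides real annotations of this shape.
CATEGORY_STANCE = {
    ("Atheism activists", "Atheism"): "PRO",
    ("Critics of religions", "Religion"): "CON",
}

def expert_stance(categories, topic_concept):
    """Return the first stance implied by the expert's categories, if any."""
    for category in categories:
        stance = CATEGORY_STANCE.get((category, topic_concept))
        if stance is not None:
            return stance
    return None

dawkins = ["Antitheists", "Atheism activists", "Atheist feminists"]
print(expert_stance(dawkins, "Atheism"))  # → PRO
```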

IBM Debater® - Wikipedia Category Stance
4,603 Wikipedia categories and lists annotated for stance (Pro/Con) towards a concept, for a set of 132 concepts. The data were published by Toledo-Ronen et al. at the ACL-2016 Workshop on Argument Mining.

Debate Speech Analysis

In order to respond to an opponent’s speech, the system must process the opponent’s voice and ‘understand’ its content. The provided datasets focus on the Automatic Speech Recognition (ASR) component and on the upstream tasks related to understanding the opponent’s speeches.

IBM Debater® - Recorded Debating Dataset - Release #1 (Full version)
60 speeches recorded by professional debaters about controversial topics, and their manual and automatic transcripts, in both raw and cleaned (processed) versions.

IBM Debater® - Recorded Debating Dataset - Release #1 (Compressed audio files)
60 speeches recorded by professional debaters about controversial topics, and their manual and automatic transcripts, in both raw and cleaned (processed) versions.

IBM Debater® - Recorded Debating Dataset - Release #1 (Light version - no audio files)
60 speeches recorded by professional debaters about controversial topics, and their manual and automatic transcripts, in both raw and cleaned (processed) versions.

The dataset includes:
- Manual and automatic transcripts of the speeches, raw and cleaned versions

IBM Debater® - Recorded Debating Dataset - Release #2 (Full version) + Annotated arguments - parts 1+2
200 speeches recorded by professional debaters about controversial topics (with their manual and automatic transcripts) and 756 arguments annotated as mentioned/not mentioned in these speeches.
Note: the 60 speeches from release #1 are not included in this dataset.

IBM Debater® - Recorded Debating Dataset - Release #2 (Compressed audio files) + Annotated arguments
200 speeches recorded by professional debaters about controversial topics (with their manual and automatic transcripts) and 756 arguments annotated as mentioned/not mentioned in these speeches.
Note: the 60 speeches from release #1 are not included in this dataset.

IBM Debater® - Recorded Debating Dataset - Release #2 (Light version - no audio files) + Annotated arguments
200 speeches recorded by professional debaters about controversial topics (with their manual and automatic transcripts) and 756 arguments annotated as mentioned/not mentioned in these speeches.
Note: the 60 speeches from release #1 are not included in this dataset.

The dataset includes:
- Manual and automatic transcripts of the speeches, in both raw and cleaned (processed) versions.
- 756 annotated arguments

Expressive Text to Speech

The emphasized words dataset was created to train and evaluate a system that receives a written argumentative speech and predicts which words should be emphasized by the Text-to-Speech component.

IBM Debater® - Labeled Emphasized Words in Speech
The dataset contains 2,485 paragraphs, comprising 4,002 sentences from audience-addressed speeches that were annotated for emphasized words.
The dataset can be used to train models to predict which words should be emphasized in expressive TTS.
It is a subset of the benchmark data used in the InterSpeech'18 paper "Word Emphasis Prediction for Expressive Text to Speech".

Basic NLP Tasks

The following datasets relate to basic NLP tasks, addressed as part of Project Debater.

Semantic Relatedness

Predicting semantic relatedness between texts is a basic NLP problem with a wide variety of applications. Relatedness can be measured between several types of texts, ranging from words to documents. The relatedness datasets listed below differ in the type of elements considered (words, multi-word-terms, and concepts), number of topics from which the pairs were extracted, and number of annotated pairs.

Mention Detection

The goal of Mention Detection is to map entities/concepts mentioned in text to the correct concept in a knowledge base. This process involves segmenting the text (as some concepts span multiple words) and disambiguating terms with more than one meaning.
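Systems for this task are typically scored at the mention level: a prediction counts only if the span boundaries and the linked concept both match the gold annotation. A minimal scoring sketch (the mention representation as (start, end, concept) character-offset tuples is an assumption, not the benchmark's actual file format):

```python
def mention_prf(gold, predicted):
    """Span-level precision/recall/F1 over (start, end, concept) mentions."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "Jaguar sales rose" - gold links chars 0-6 to the car maker.
gold = [(0, 6, "Jaguar Cars")]
pred = [(0, 6, "Jaguar")]  # right span, but disambiguated to the animal
print(mention_prf(gold, pred))  # → (0.0, 0.0, 0.0)
```

A correct span with the wrong concept scores zero, which is what makes disambiguation part of the task rather than a post-processing detail.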

IBM Debater® - Mention Detection Benchmark
3,000 sentences annotated with Mentions. The sentences are taken from Wikipedia and from professional speakers discussing several topics.
There are about 20,000 annotated Mentions in those sentences.

The benchmark includes:
- The topics of discussion
- The sentences related to those topics
- The Mentions annotated on those sentences

The sources are a mix of Wikipedia articles and ASR/manual transcripts of speeches by expert debaters.

Text Clustering

Text clustering is a widely-studied NLP problem. Clustering can be applied to texts at different levels, from single words to full documents, and can vary with respect to the clustering goal. In thematic clustering, the aim is to cluster texts based on thematic similarity between them, namely grouping together texts that discuss the same theme.

Thematic clustering of sentences is important for various use cases. For example, in multi-document summarization, one often extracts sentences from multiple documents that should be organized into meaningful sections and paragraphs. Similarly, within the emerging field of computational argumentation, arguments may be found in a widespread set of articles, which further require thematic organization to generate a compelling argumentative narrative.

Evaluation of thematic clustering methods requires a ground truth dataset of sentence clustering. Unfortunately, sentence clustering is considered a very difficult task for humans. As a result, there is no standard human annotated sentence clustering dataset.

In the dataset “Thematic Clustering of Sentences”, sentences are annotated with their thematic clusters. This annotation enables the evaluation of thematic clustering methods. The dataset was generated automatically by leveraging the partition of Wikipedia articles into sections. The underlying assumption is that the section structure of a Wikipedia article can serve as ground truth for the thematic clustering of its sentences. Details about the way this dataset was generated can be found in the article.
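With section titles as ground-truth cluster labels, a system's clustering can be scored with any standard clustering metric; pairwise F1 (agreement on whether pairs of sentences share a cluster) is one common choice, sketched below as an illustration rather than the benchmark's official metric:

```python
from itertools import combinations

def pairwise_f1(gold_labels, pred_labels):
    """Pairwise F1: agreement on whether sentence pairs share a cluster."""
    pairs = list(combinations(range(len(gold_labels)), 2))
    true_pos = sum(1 for i, j in pairs
                   if gold_labels[i] == gold_labels[j]
                   and pred_labels[i] == pred_labels[j])
    gold_pos = sum(1 for i, j in pairs if gold_labels[i] == gold_labels[j])
    pred_pos = sum(1 for i, j in pairs if pred_labels[i] == pred_labels[j])
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Gold: section titles of the article; pred: a system's cluster assignments.
gold = ["History", "History", "Design", "Design"]
pred = ["A", "A", "A", "B"]
print(round(pairwise_f1(gold, pred), 2))  # → 0.4
```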

IBM Debater® - Thematic Clustering of Sentences Dataset
A benchmark of sentence-clustering created automatically based on the partition of Wikipedia articles into sections. The dataset contains 692 Wikipedia articles; each one corresponds to a different clustering task. The dataset includes the article name, the sentence text, and the name of the cluster to which the sentence belongs, which is the title of the section from which the sentence was extracted. A detailed description of the generation process of this dataset can be found in the article.

Concept Abstractness

In recent decades, the influence of psycholinguistic properties of words on cognitive processes has become a major topic of scientific inquiry. Among the most studied psycholinguistic attributes are concreteness, familiarity, imagery, and average age of acquisition. Abstractness quantifies the degree to which an expression denotes an entity that cannot be directly perceived by the human senses. As an example, the word "feminism" is usually perceived as abstract, but the word "screwdriver" is associated with a concrete meaning.

We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The released dataset contains 300K Wikipedia concepts automatically rated for their degree of abstractness.

IBM Debater® - Concept Abstractness
300,000 Wikipedia titles (100,000 each of unigrams, bigrams, and trigrams) rated for their degree of abstractness.
A detailed description of the generation process can be found in the article, and a description of the dataset format can be found in the attached README file.

Debater Datasets - Licensing Notice

Each copy or modified version that you distribute must include a licensing notice stating that the work is released under CC-BY-SA and either a) a hyperlink or URL to the text of the license or b) a copy of the license. For this purpose, a suitable URL is: http://creativecommons.org/licenses/by-sa/3.0/.