Data Set Description

DER STANDARD is an Austrian daily broadsheet newspaper. On the newspaper’s website, each news article has a discussion section below it where readers can post comments. The data set contains a selection of user posts from the 12-month period from 2015-06-01 to 2016-05-31. It comprises 11,773 labeled and 1,000,000 unlabeled posts. The labeled posts were annotated by professional forum moderators employed by the newspaper.

The data set contains the following data for each post:

Post ID

Article ID

Headline (max. 250 characters)

Main Body (max. 750 characters)

User ID (the user names used by the website have been re-mapped to new numeric IDs)

Detailed descriptions of the post selection and annotation procedures are given in the paper.

Annotated Categories

Potentially undesirable content:

Sentiment (negative/neutral/positive)
An important goal is to detect changes in the prevalent sentiment in a discussion, e.g., the location within the fora and the point in time where a turn from positive/neutral sentiment to negative sentiment takes place.

Off-Topic (yes/no)
Posts which digress too far from the topic of the corresponding article.

Statistics

The following table contains some relevant statistics for the data set.

Statistic                                              Value
Total number of posts                                  1,011,773
Number of unlabeled posts                              1,000,000
Number of labeled posts                                11,773
Number of category annotation decisions                58,568
Number of posts taken offline by moderators            62,320
Min/Median/Max post length (words)                     0 / 21 / 500
Vocabulary size (≥ 5 occurrences)                      129,070
Number of articles                                     12,087
Number of article topics                               1,229
Number of users                                        31,413
Min/Median/Max number of posts per article             1 / 22 / 3,656
Min/Median/Max number of posts per topic               1 / 142 / 44,329
Min/Median/Max number of posts per user                1 / 5 / 4,682
Min/Median/Max number of users per article             1 / 15 / 1,371
Min/Median/Max number of users per topic               1 / 78 / 6,874
Number of pos./neg. community votes                    3,824,806 / 1,096,300

The following table gives the number of labeled examples per category.

Category              Does Apply   Does Not Apply   Total   Percentage
Sentiment Negative          1691             1908    3599         47 %
Sentiment Neutral           1865             1734    3599         52 %
Sentiment Positive            43             3556    3599          1 %
Off-Topic                    580             3019    3599         16 %
Inappropriate                303             3296    3599          8 %
Discriminating               282             3317    3599          8 %
Possibly Feedback           1301             4737    6038         22 %
Personal Stories            1625             7711    9336         17 %
Arguments Used              1022             2577    3599         28 %
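
The Percentage column is the share of labeled posts for which a category applies, i.e. Does Apply divided by Total. A minimal sketch that reproduces a few of the values above; the function name and the dictionary layout are illustrative assumptions, not part of the data set's tooling:

```python
def share_percent(does_apply: int, total: int) -> int:
    """Share of labeled posts where the category applies, rounded to a whole percent."""
    return round(100 * does_apply / total)


# (Does Apply, Total) pairs copied from the table above.
counts = {
    "Sentiment Negative": (1691, 3599),
    "Possibly Feedback": (1301, 6038),
    "Personal Stories": (1625, 9336),
}

percentages = {name: share_percent(a, t) for name, (a, t) in counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct} %")
```

Note that the totals differ between categories, so the percentages are not directly comparable as absolute counts.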

In the first annotation round, three moderators annotated 1,000 randomly selected posts. From this subset of the data, we can estimate the category distributions. The following bar chart shows the distribution for the three annotators (A, B, C) and after a majority vote per post.
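
A majority vote over three annotators can be sketched as follows. This is an illustration on made-up toy data, not the procedure used to produce the chart; the post IDs and the dictionary layout are assumptions:

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(votes: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical toy data: one yes/no decision per annotator (A, B, C)
# for a single category.
votes_per_post = {
    101: (True, True, False),
    102: (False, False, False),
    103: (True, False, True),
}

consensus = {post_id: majority_vote(v) for post_id, v in votes_per_post.items()}

# Estimated share of posts for which the category applies after the vote.
share_applies = sum(consensus.values()) / len(consensus)
```

With three annotators and a yes/no decision, a strict majority always exists; the `None` case only arises with more than two labels or an even number of annotators.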

Furthermore, this subset can be used to compute Cohen’s Kappa values to quantify the inter-rater agreement. The following bar chart shows the agreement for each pair of moderators and the average across all pairs for each category.
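
For reference, Cohen’s Kappa for two raters is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. A minimal sketch of the pairwise and averaged computation, using made-up toy labels rather than the actual annotations; the function and variable names are assumptions:

```python
from collections import Counter
from itertools import combinations
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both raters used one and the same label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels for one category (1 = applies, 0 = does not apply).
raters = {
    "A": [1, 1, 0, 0, 1, 0, 1, 0],
    "B": [1, 0, 0, 0, 1, 0, 1, 1],
    "C": [1, 1, 0, 0, 0, 0, 1, 0],
}

pairwise = {
    (r1, r2): cohen_kappa(raters[r1], raters[r2])
    for r1, r2 in combinations(raters, 2)
}
average_kappa = sum(pairwise.values()) / len(pairwise)
```

In practice, a library routine such as scikit-learn's `cohen_kappa_score` could be used instead of the hand-written function.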

The distribution of posts to topic paths is illustrated below (regular scale on the left, log scale on the right).

License

Citation

Dietmar Schabus, Marcin Skowron, Martin Trapp
One Million Posts: A Data Set of German Online Discussions
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)
pp. 1241-1244
Tokyo, Japan, August 2017

Furthermore, this data set was used in a follow-up paper presented at LREC 2018:

Dietmar Schabus and Marcin Skowron
Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website
Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018)
pp. 1602-1605
Miyazaki, Japan, May 2018

@InProceedings{Schabus2018,
author = {Dietmar Schabus and Marcin Skowron},
title = {Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website},
booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)},
year = {2018},
address = {Miyazaki, Japan},
month = may,
pages = {1602-1605},
abstract = {This paper describes an approach and our experiences from the development, deployment and usability testing of a Natural Language Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website. We highlight some of the differences between industry-oriented and academic research settings and their influence on the decisions made in the data collection and annotation processes, selection of document representation and machine learning methods. We report on classification results, where the problems to solve and the data to work with come from a commercial enterprise. In this context typical for NLP research, we discuss relevant industrial aspects. We believe that the challenges faced as well as the solutions proposed for addressing them can provide insights to others working in a similar setting.},
url = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/8885.html},
}

Experiments

Acknowledgments

This research was partially funded by the Google Digital News Initiative. We thank DER STANDARD and their moderators for the interesting collaboration and the annotation of the presented corpus. The GPU used for this research was donated by NVIDIA.