NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Text classification

Text classification is the task of assigning a sentence or document an appropriate category.
The categories depend on the chosen dataset and can be, for example, news topics or question types.

AG News

The AG News corpus
consists of news articles from AG's corpus of news articles on the web,
restricted to the 4 largest classes. The dataset contains 30,000 training examples
and 1,900 test examples for each class. Models are evaluated based on error rate (lower is better).
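
As a minimal sketch of the metric, error rate is the fraction of test examples whose predicted label differs from the gold label. The function name `error_rate` and the toy labels below are illustrative assumptions, not part of any official evaluation script.

```python
from typing import Sequence


def error_rate(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of examples whose prediction differs from the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute an error rate on an empty test set")
    errors = sum(g != p for g, p in zip(gold, predicted))
    return errors / len(gold)


# Hypothetical toy labels for illustration only (not real model output).
gold = ["World", "Sports", "Business", "Sci/Tech"]
predicted = ["World", "Sports", "Sci/Tech", "Sci/Tech"]

rate = error_rate(gold, predicted)  # one mistake out of four examples
print(f"error rate: {100 * rate:.2f}%")
```

Error rate is simply one minus accuracy, so results reported under either metric can be converted to the other.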

DBpedia

The DBpedia ontology
dataset contains 560,000 training samples and 70,000 testing samples in total, i.e. 40,000 training and 5,000 testing samples for each of 14 non-overlapping classes from DBpedia.
Models are evaluated based on error rate (lower is better).

TREC

The TREC dataset is a dataset for
question classification consisting of open-domain, fact-based questions divided into broad semantic categories.
It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples,
but TREC-50 has finer-grained labels. Models are evaluated based on accuracy.
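
The relationship between the two label sets can be sketched as follows. This example assumes fine-grained labels are written as `COARSE:fine` (for example `LOC:city`), as in commonly distributed copies of the data, so that a coarse label is obtained by dropping the suffix. The helper names `accuracy` and `to_coarse` and the toy predictions are hypothetical.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of examples whose prediction matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def to_coarse(label: str) -> str:
    """Map an assumed 'COARSE:fine' label to its coarse class, e.g. 'LOC:city' -> 'LOC'."""
    return label.split(":", 1)[0]


# Hypothetical toy labels for illustration only (not real model output).
gold_fine = ["LOC:city", "HUM:ind", "NUM:date", "DESC:def"]
pred_fine = ["LOC:country", "HUM:ind", "NUM:date", "ENTY:animal"]

fine_acc = accuracy(gold_fine, pred_fine)
coarse_acc = accuracy(
    [to_coarse(g) for g in gold_fine],
    [to_coarse(p) for p in pred_fine],
)
print(f"fine-grained accuracy: {fine_acc:.2f}")
print(f"coarse accuracy: {coarse_acc:.2f}")
```

A prediction such as `LOC:country` for a `LOC:city` question counts as wrong under the fine-grained labels but right under the coarse ones, which is why scores on the two versions are not directly comparable.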