Datasets

Various corpus resources have been developed at CLiPS. Many of these are available to a wider audience upon request. The resources are listed below.

TwiSty Corpus

Description

TwiSty is a corpus developed for research in author profiling. It contains personality (MBTI) and gender annotations for a total of 18,168 authors spanning six languages. We distribute the Twitter ids of these authors as well as the ids of their tweets that were available at the time of corpus development. The tweets have undergone language identification and are divided into two categories: Confirmed (identified as belonging to the language in which the author is situated) and Other.

ISLRN

License

Download

Acknowledgement

This research is supported by a doctoral grant from the FWO Research Council - Flanders for the first author. We thank Guy De Pauw and Tom De Smedt for technical support. Part of this research was carried out in the framework of the AMiCA (IWT SBO-project 120007) project, funded by the Flemish government agency for Innovation by Science and Technology (IWT).

CLiPS Stylometry Investigation (CSI) Corpus

Description

The CSI corpus is a corpus of student texts in two genres, essays and reviews, that is expanded yearly. It is intended primarily for stylometric research, but other applications are possible. Extensive metadata is available, both on the author (gender, age, sexual orientation, region of origin, personality profile) and on the document (timestamp, genre, veracity, sentiment, grade). The current version of the corpus was assembled in February 2016. Previous versions of the corpus are available from the authors via e-mail request.

ISLRN

License

Download

Acknowledgement

We would like to express our gratitude to Katrien Verreyken, Shanti Verellen, Sarah Van Hoof, Dominiek Sandra and Reinhild Vandekerckhove (University of Antwerp) for their help in collecting all the data. This corpus was first constructed within the framework of the AMiCA project, funded by the Flemish Agency for Innovation through Science and Technology (IWT), but its further development is supported by a PhD grant of FWO - Research Foundation - Flanders for the first author.

AuCoPro Semantics

Description

The AuCoPro-Semantics dataset supports the automatic semantic analysis of compounds. It contains semantically annotated noun-noun compounds (NN) from Dutch and Afrikaans, split into two annotation rounds per language. The semantic annotation was performed with annotation guidelines based on those of Ó Séaghdha (2008). Another part of the dataset contains other nominal compounds (XN) in Dutch, which were annotated using a newly developed annotation scheme.

License

Download

Acknowledgement

This dataset was created within the 'Automatic Compound Processing (AuCoPro)' project that was funded by the Dutch Language Union (Nederlandse Taalunie), the Department of Arts and Culture (DAC) of South Africa and the National Research Foundation (NRF) of South Africa.

deLearyous

Description

The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations that were annotated at the sentence level with their position on Leary's Rose, expressed in terms of its two defining dimensions: "dominance" and "affinity". In addition to discrete class labels (8 octants and a "neutral" class), the dataset also contains fine-grained annotations, with continuous values along the two defining axes.
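
To illustrate how continuous values on the two axes relate to the discrete classes, the following minimal sketch maps a (dominance, affinity) pair to one of 8 octants or to "neutral". It is an illustration only, not the annotation procedure used for the dataset: the function name `octant_of`, the generic labels `octant_0` to `octant_7`, the choice of affinity as the horizontal axis, the counter-clockwise numbering, and the `neutral_radius` threshold are all assumptions.

```python
import math


def octant_of(dominance: float, affinity: float, neutral_radius: float = 0.1) -> str:
    """Map a point on the circumplex to a hypothetical octant label.

    Assumptions (not taken from the dataset documentation):
    - affinity is the horizontal axis, dominance the vertical axis;
    - octants are numbered 0..7 counter-clockwise from the positive affinity axis;
    - points closer to the origin than `neutral_radius` count as "neutral".
    """
    # Points near the centre carry no clear interpersonal stance.
    if math.hypot(dominance, affinity) < neutral_radius:
        return "neutral"

    # Angle in [0, 2*pi), measured from the positive affinity axis.
    angle = math.atan2(dominance, affinity) % (2 * math.pi)

    # Each octant spans 45 degrees (pi / 4 radians).
    index = int(angle // (math.pi / 4))

    # Guard against floating-point rounding pushing the angle to exactly 2*pi.
    return f"octant_{min(index, 7)}"
```

With these assumptions, a sentence annotated with high dominance and moderate affinity falls in the second octant, while a point at the origin is labelled "neutral". The actual octant names and axis orientation should be taken from the dataset's own documentation.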

License

Download

Acknowledgement

This dataset was created in the context of the IWT-TETRA project deLearyous (2010-2012).

Personae Corpus

Description

The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level. We make available the original texts, a syntactically annotated version of the texts, and the metadata.

Language

Dutch

Creator(s)

CLiPS Research Center, University of Antwerp; Kim Luyckx, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper: