Datasets

Please join our mailing list for announcements about new data releases and updates.

OPP-115 Corpus (ACL 2016)

The OPP-115 Corpus (Online Privacy Policies, set of 115) is a collection of website privacy policies written in natural language, with annotations that specify the data practices described in the text. Each privacy policy was read and annotated by three graduate students in law.

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

If you use this dataset as part of a publication, you must cite the following paper:

Opt-out Choice Dataset (EMNLP 2017)

We created a corpus of website privacy policies written in natural language to train machine learning and natural language processing models to identify choices (e.g., opt-outs from behavioral advertising).

The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.

If you use this dataset as part of a publication, you must cite the following paper:

ACL/COLING 2014 Dataset

We created a corpus of 1,010 privacy policies from the top websites ranked on Alexa.com. The privacy policies in the dataset were retrieved in December 2013 and January 2014.

This dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.