Reddit submission dataset

Description

This Reddit dataset consists of specific metadata for all submissions posted to Reddit from the beginning of Nov. 2007 to the end of July 2013. The metadata of each submission (e.g., its score)
was collected around 1-2 months after the submission was posted (i.e., once voting on it had been closed), as the values have most likely settled by then. The dataset is available in JSON format and is zipped.
Concretely, the following information is available for each submission:

Number of upvotes

Number of downvotes

Score

Number of comments

Subreddit the submission was posted to

Domain of the submission (i.e., the top-level domain for link submissions, or "self.subreddit" for self submissions)
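
As a rough illustration of how the fields listed above might be read, the following Python sketch iterates over a zipped JSON file and separates self submissions from link submissions. It rests on several assumptions that this description does not confirm: that each file in the archive holds one JSON object per line, and that the keys are named "ups", "downs", "score", "num_comments", "subreddit" and "domain" (modelled on Reddit's API). The function names are hypothetical; adjust keys and parsing to the actual files.

```python
import io
import json
import zipfile
from collections import Counter

# ASSUMPTION: key names modelled on Reddit's API; the dataset may use others.
FIELDS = ("ups", "downs", "score", "num_comments", "subreddit", "domain")


def iter_submissions(zip_source):
    """Yield one dict per submission from every member of a zip archive.

    ASSUMPTION: each member is a UTF-8 text file with one JSON object per line.
    `zip_source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            with archive.open(name) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:  # skip blank lines
                        yield json.loads(line)


def is_self_submission(submission):
    """Return True if the domain has the "self.<subreddit>" form."""
    return submission["domain"].startswith("self.")


def count_by_subreddit(submissions):
    """Count submissions per subreddit."""
    return Counter(s["subreddit"] for s in submissions)
```

Because the generator yields one submission at a time, the whole archive never has to be held in memory, which matters for a dataset that covers several years of submissions.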

Furthermore, we make available our manual categorization of top-level domains into content types.
Together, these two datasets allow the results presented in the following paper to be reproduced:

Philipp Singer, Fabian Flöck, Clemens Meinhart, Elias Zeitfogel and Markus Strohmaier, Evolution of Reddit: From the Front Page of the Internet to a Self-referential Community?,
Web-Science Track at the 23rd International World Wide Web Conference, Seoul, South Korea, 2014 [PDF]

Anonymity

We have limited the metadata in the Reddit dataset to the information necessary to reproduce our scientific results. The rest of the metadata has been removed in order to preserve the anonymity of Reddit users.

Accessibility

Please use the dataset for scientific purposes only and follow general ethical rules. If you publish results obtained using this dataset, please cite:

Philipp Singer, Fabian Flöck, Clemens Meinhart, Elias Zeitfogel and Markus Strohmaier, Evolution of Reddit: From the Front Page of the Internet to a Self-referential Community?,
Web-Science Track at the 23rd International World Wide Web Conference, Seoul, South Korea, 2014 [PDF]

Acknowledgements

We sincerely thank Jason Baumgartner (aka u/stuck_in_the_matrix) for conducting the data collection and providing us with initial access to the data.