The content that people post publicly on social media websites provides a rich, up-to-the-minute picture of what they are talking about, what they are reading, and how they are connecting. Example datasets that researchers have studied to understand people's public behavior include:

Although public behavioral data is readily available and can be captured through an API or scraped, each source presents unique challenges, from effectively sampling the stream to complying with the Terms of Service. Another concern with these datasets is that they introduce a streetlight effect into our research. Like the drunk man who looks for his keys at night near the streetlight because it is the only place he can see, researchers are sometimes criticized for trying to understand human behavior using only Twitter because it is the only dataset we can see.

Community Generated Logs

The research community has also tried to create its own logs for studying computer-mediated behavior that is not public. For example, the purpose of the Lemur Community Query Log Project was to create a query log that could be used by the information retrieval research community. Participants in the project were asked to install a toolbar and consent to having their queries and clicks collected. The plan was that the queries collected from all participants would be released to researchers in a controlled manner. Unfortunately, despite significant community interest in public search datasets, almost nobody installed and used the toolbar. The project eventually ended because, after a year of data collection, it had gathered only as much data as Google collects in 6 seconds.

Publicly Released Private Logs

Finally, there are a handful of publicly released private logs available for study. These include:

Most of these datasets were released by the companies that collected them in order to support research. The exception is the Enron corpus, which was purchased by Andrew McCallum for $10,000 after the company went bankrupt. Many of these datasets have since been redacted due to privacy concerns, but that is a discussion for a future blog post.