Web Archive Data Sets

Explore content from the Library's web archives

To better facilitate research and to better understand the needs of users interested in the Library of Congress Web Archives, the Library's Web Archiving Team is working to create a number of derivative data sets for users to download, re-use and explore. After participating in the Library's pilot to explore how we can better enable digital scholarship, beyond basic browsing of records and URL search, the team began exploring ways to process our archives, the scale of which can be overwhelming to digital humanities researchers and other users. From this experimentation, the team created two derivative data sets that give researchers smaller sets of data to help them engage with and learn more from our archives.

We plan to launch additional data sets with other types of content in the future, so keep an eye on this page and follow The Signal blog for updates.

Data Sets

Our initial offering is two data sets that consist of content found in the American Folklife Center's Web Cultures Web Archive. As part of an effort to review the coverage of our web crawls of GIPHY and Meme Generator, the Web Archiving Team queried our web archives data, first finding the number of unique GIFs and distinct meme instances in the two web archives.

Meme Generator Data Set Download - The Meme Generator data set was generated from content harvested from Meme Generator and includes 57,652 unique meme instances derived from base memes (meme images without text, waiting to be fashioned into meme instances). The data set includes some minimal metadata for these memes. The data set does not include the memes themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.

GIPHY Data Set Download - The GIPHY data set was generated from content harvested from GIPHY and includes 10,972 unique GIFs. The data set includes some minimal metadata for these GIFs. The data set does not include the GIFs themselves; however, it does provide links to where you can access their web archive copies within the Library's web archive.
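
As a rough illustration of working with this kind of metadata, the sketch below counts unique records and lists their archive links. It assumes the metadata is available as CSV, which this page does not confirm, and the column names `identifier` and `archive_url`, the sample rows, and the example.org links are all hypothetical placeholders rather than the actual layout of either data set.

```python
import csv
import io

# Hypothetical sample rows. The real data sets may use a different file
# format and different column names; check the downloaded files first.
SAMPLE = """identifier,archive_url
a1,https://example.org/archive/a1
a2,https://example.org/archive/a2
a1,https://example.org/archive/a1
"""


def unique_records(csv_file, key="identifier"):
    """Return the first row seen for each distinct value of `key`."""
    seen = {}
    for row in csv.DictReader(csv_file):
        # setdefault keeps the first row and ignores later duplicates.
        seen.setdefault(row[key], row)
    return list(seen.values())


records = unique_records(io.StringIO(SAMPLE))
print(f"{len(records)} unique records")
for record in records:
    print(record["archive_url"])
```

To try this against a downloaded file, replace `io.StringIO(SAMPLE)` with an opened file object and set `key` to whichever column identifies an item in that file.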

Documentation

The Library's Web Archiving program page provides background information, including details about our collection policies, our technical approach, and how to search and browse the entire web archive. We post about web archiving and new collections on The Signal blog.

We need you

Our analysis of this data has helped our team identify some potential ways to enhance our crawls of these sites and harvest more of their content. At the same time, we are interested in exploring how making this kind of derivative data available to users might help spur further use of and engagement with the collections. We are interested in hearing from users of these data sets: what you did with them, feedback on the documentation, what other derivative data sets might be of interest, and any other feedback or comments you have for us. Please write to us via our Contact Us form. We'd love to hear from you!