The NDSA Web Archiving Survey

The following is a guest post by Jefferson Bailey, Fellow at the Library of Congress’s Office of Strategic Initiatives.

In a previous post on The Signal, we examined some of the themes that emerged from the survey of organizations in the United States that are actively involved in, or planning to start, programs to archive content from the web. The survey was conducted by the Content Working Group of the National Digital Stewardship Alliance from October 3 through October 31, 2011.

The goal of the survey was to gain a better understanding of the landscape of web archiving activities in the United States. The survey garnered 77 unique responses to 28 questions about current web archiving activities.

Rather than reiterate content that is in the report, we wanted to pull out some interesting or relevant charts and statistics, provide additional explication of the summary themes, and highlight survey results not featured in the previous blog post.

Policies for Web Archiving

One emergent theme discussed in the previous post was the lack of consistency around incorporating web archiving into institutional policy. To provide some additional detail to that idea, the following chart shows how institutions with active, testing, or planned web archiving programs are incorporating web archiving into their collection or selection policies:

Another survey question that elicited an interesting result concerned respecting robots.txt files. Robots.txt is a file placed on a web server that tells web-crawling robots which parts of the website, if any, they should not visit or harvest. (More information on robots.txt here.) How institutions handle robots.txt files has a direct impact on their ability to acquire web content. In the recent post, Legal Issues in Web Archiving, Abbie Grotke addressed in detail the challenges of robots.txt files.

Chart 2: Policies towards respecting robots.txt files
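
To make the robots.txt policy choice concrete, here is a minimal sketch in Python of how a harvesting tool might decide whether to collect a page. It is an illustration only, not a description of any tool named in the survey: the sample robots.txt rules, the crawler name "example-archive-bot", the example.org URLs, and the may_harvest helper are all assumptions made up for this example. The parsing itself uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration.
# It asks all crawlers to stay out of the /private/ directory.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_harvest(robots_txt: str, user_agent: str, url: str,
                respect_robots: bool = True) -> bool:
    """Return True if a crawler following the given policy would harvest url.

    respect_robots models the institutional policy choice: when False,
    the crawler ignores robots.txt and harvests everything.
    """
    if not respect_robots:
        return True
    parser = RobotFileParser()
    # parse() accepts the lines of a robots.txt file that was already fetched.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    bot = "example-archive-bot"  # hypothetical crawler name
    for page in ("https://example.org/index.html",
                 "https://example.org/private/report.html"):
        print(page, may_harvest(SAMPLE_ROBOTS_TXT, bot, page))
```

Under these assumed rules, an institution that respects robots.txt would skip anything under /private/, while one that ignores the file would capture it. That difference is the direct impact on acquisition described above.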

Another topic examined in the report is the types of access that institutions are providing to web archives. Chart 3 provides some examples and percentages, though other means of access are listed in the full report.

Chart 3: Types of access being provided to web archives

Tools for Archiving the Web

The web archiving survey also sought to gain a better understanding of the specific tools being used both to collect content and to display archival collections. The Library of Congress does not endorse these tools, but merely provides this information as a resource for understanding what tools are being used by the web archiving community. Of the 63 respondents indicating their tools for harvesting web materials:

60% (38) were using an external service for acquisition

26% (16) were using an in-house method for acquisition

14% (9) were using both in-house and external services for acquisition

Chart 4 and Chart 5 document some of the services and tools being used to build archives of web content.

Chart 4: External services currently being used for web archiving

Chart 5: In-house tools and software currently being used for web archiving

Conclusion

Beyond the charts and statistics offered here, the themes discussed in the previous blog post on the Web Archiving Survey merit repeating. The inconsistent custodianship of web archives and the policy and technical challenges of harvesting web content have not dampened the dramatic increase in the number of active web archiving programs. These issues also have not impeded overall efforts to preserve and provide access to a rich, diverse body of web-born content, as is seen in the large number of programs initiated within the last five years. Web archiving is increasingly a core function of collection development for many institutions and, as the survey documents, the web archiving community has demonstrated a keen interest in collaborative activities, knowledge sharing, and joint efforts on conducting research and determining best practices. Groups such as the NDSA and the IIPC continue to offer an open, cooperative space to support institutions working to archive the web.

Many presentations from the recent IIPC 2012 General Assembly can be found here.
