The Chesapeake Digital Preservation Group has completed its fifth annual investigation of link rot among the original URLs for online law- and policy-related materials archived through the group's efforts.

The Chesapeake Digital Preservation Group is a collaborative digital preservation program for legal materials, reports, and documents posted to the web. The group comprises four member libraries: two academic law libraries (the Georgetown Law and Harvard Law School Libraries) and the State Law Libraries of Maryland and Virginia. It is part of the Legal Information Archive.

Access to web-published content can be lost as websites are routinely updated, reorganized, or deleted over time. In the five years since the program began, the Chesapeake Group has built a digital archive collection comprising more than 8,600 digital items and 3,700 titles, almost all originally posted to the web but captured and preserved within the group's digital archive.

Every year, the Chesapeake Group investigates whether or not the documents in the archive can still be found at the original web addresses from which they were captured. The group analyzes two samples of web addresses, or URLs, pulled from the archive's records.

The first sample includes 579 original URLs for content captured from 2007-2008. This sample is revisited every year to document link rot and explore how it changes over time.

The other sample is new and represents the full content of the archive at the time the study is conducted. This second sample provides an up-to-date snapshot of link rot among the original URLs for all the content currently in the archive. In 2012, this sample included 830 original URLs for materials captured from 2007-2012.

In 2012, 218 out of 579 URLs in the 2007-2008 sample no longer provide access to the content that was originally selected, captured, and archived by the Chesapeake Group. In other words, link rot in this sample has increased to 37.7 percent within five years.

In 2008, the sample was analyzed for the first time as part of an evaluation of the archiving program, and link rot was found to be present in 48, or 8.3 percent, of the 579 URLs comprising the sample. At the time, a total of 1,266 web-based titles had been captured and archived. A random sample of 579 titles from the archive was generated for the analysis, ensuring results at a 95 percent confidence level with a margin of error of +/- 3 percentage points.

One year later, in 2009, the sample was analyzed a second time. Link rot was found to be present in 83 out of the original sample of 579 URLs. Within two years of capture, 14.3 percent of the archived titles had disappeared from their original URLs.

By the third year, in 2010, the prevalence of link rot had increased to 160 out of 579 URLs, to a whopping 27.9 percent. Link rot continued to increase in 2011, but at a slower pace, reaching 30.4 percent by the fourth year. The new 2012 data show an increase of 7.3 percentage points compared to 2011, to 37.7 percent, more in line with the annual increase we found between 2008 and 2009.
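The year-over-year figures above are differences between the reported rates, so they are percentage points rather than relative growth. The following minimal sketch works through that arithmetic using only the rates quoted in the text; the function and variable names are illustrative and not part of the study.

```python
# Link rot rates (percent of the 579-URL sample) as reported in the text.
reported_rates = {2008: 8.3, 2009: 14.3, 2010: 27.9, 2011: 30.4, 2012: 37.7}


def yearly_changes(rates):
    """Map each year to (percentage-point change, relative change in percent)
    compared with the previous year in the series."""
    years = sorted(rates)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        diff = rates[curr] - rates[prev]
        point_change = round(diff, 1)
        relative_change = round(100 * diff / rates[prev], 1)
        changes[curr] = (point_change, relative_change)
    return changes


for year, (points, relative) in yearly_changes(reported_rates).items():
    print(f"{year}: +{points} points ({relative}% relative increase)")
```

Keeping the two measures separate avoids ambiguity: the 2011 to 2012 change is 7.3 percentage points, which is a considerably larger number when expressed as growth relative to the 2011 rate.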

Increases in link rot from 2008 through 2012 are illustrated in Figure 1 and Table 1, below.

More than 90 percent of the top-level domains in the sample are state-government (state.[state code].us), organization (.org), and government (.gov) URLs, representing approximately 41 percent, 32 percent, and 17 percent of the sample, respectively. Other top-level domains include .edu, .com, and .net, which represent 2.9, 2.2, and 1.9 percent of the sample, respectively, or approximately 7 percent combined. Less than 3 percent of the sample consists of .mil, .us, .info, .uk, .au, .ca, and .int top-level domains. The sample also includes one IP address.

In 2012, the content at .org domains showed the largest increase in link rot. More than 43 percent of the materials posted to organization domains have disappeared from the original documented web addresses. Link rot on government web pages also increased in 2012: up to 36 percent at .gov domains and nearly 34 percent at state.[state code].us domains. Education (.edu) domains also showed an increase to more than 41 percent in 2012 after decreasing slightly in 2011, and link rot at network (.net) domains rose to more than 36 percent.

A list of all top-level domains found in the sample, along with link rot detected in 2008, 2009, 2010, and 2011 is available in Table 2.

For the present analysis, a new, separate sample of URLs was generated. In 2012, the collection included 8,627 digital items and 3,734 titles. To ensure statistically relevant results at a 95 percent confidence level with a margin of error of +/- 3 percentage points, a random sample of 830 titles was selected for the 2012 study. Three of the titles selected for the sample were discarded because they were directly deposited by the content creators and therefore had no original web addresses; as the Chesapeake Group has increased contact with content producers over the years, a small fraction of the content archived is now deposited by the creators for archiving, rather than posted to the web for capture.
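The report does not say how the sample sizes were calculated. As an assumption, the sketch below applies the standard sample-size formula for a proportion with a finite population correction, taking z = 1.96 for 95 percent confidence, the most conservative proportion of 0.5, and rounding to the nearest whole title. Under those assumptions it reproduces both sample sizes quoted in the text; the function name and parameters are illustrative.

```python
def sample_size(population, margin=0.03, z=1.96, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumptions (not stated in the report): z = 1.96 for a 95 percent
    confidence level, p = 0.5 as the worst-case proportion, and rounding
    to the nearest integer.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks the requirement for small archives.
    return round(n0 / (1 + (n0 - 1) / population))


print(sample_size(1266))  # archive size in 2008
print(sample_size(3734))  # archive size in 2012
```

With 1,266 titles this gives 579, and with 3,734 titles it gives 830, matching the figures used in the 2008 and 2012 analyses.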

Out of the 827 titles in the sample that were captured from the web, link rot was found to be present in 214, nearly 26 percent (25.9%), of the original URLs. The ratio of working URLs to those with link rot for 2012 is illustrated in Figure 2 below, compared to samples studied in 2008, 2009, 2010, and 2011.
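The link rot percentages in this report are simple ratios of inactive URLs to URLs examined. The short sketch below recomputes them from the counts given in the text as a consistency check; the helper name and the labels in the dictionary are illustrative only.

```python
def link_rot_rate(inactive, total):
    """Percentage of URLs with link rot, rounded to one decimal place."""
    return round(100 * inactive / total, 1)


# (inactive URLs, URLs examined) as reported in the text.
reported_counts = {
    "2007-2008 sample, checked in 2008": (48, 579),
    "2007-2008 sample, checked in 2009": (83, 579),
    "2007-2008 sample, checked in 2012": (218, 579),
    "2012 sample": (214, 827),
}

for label, (inactive, total) in reported_counts.items():
    print(f"{label}: {link_rot_rate(inactive, total)} percent")
```

Note that the 2012 sample uses 827 as its denominator, not 830, because the three directly deposited titles had no original URL to check.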

In 2012, the number of titles in the archive with URLs from organization (.org) top-level domains surpassed those from government (.gov) and state government (state.[state code].us) domains. Roughly 87 percent of the top-level domains in the sample were organization, state-government, and government URLs, which represented 38.1 percent, 26 percent, and 22.7 percent of the sample, respectively. Of these three top-level domains, link rot was present in 25.7 percent of URLs with organization top-level domains, 32.6 percent of URLs with state top-level domains, and 23.9 percent of URLs with government top-level domains.

URLs with .edu, .com, and .net top-level domains, combined, represented 10 percent of the sample, and were found to have inactivity levels of 13, 19.2, and 9.1 percent, respectively. Table 3 provides a comparison of all top-level domains found in the 2012 sample, as well as previous years' samples, along with their inactivity rates.

For the first time, the Chesapeake Group documented the year of capture for all of the URLs in our 2012 sample. Not surprisingly, the data show that a URL's risk of link rot increases with time; 40 percent of URLs captured in 2007 have succumbed to link rot, while all of the materials captured in the first few months of 2012 remain at their original web addresses. In fact, the pattern of link rot by year of capture in our 2012 sample is strikingly similar to the increase in link rot documented by our annual findings for the original 2007-2008 sample. See Tables 4 and 5 below for comparison.