Link Rot: Georgetown Law Library Finds 38 Percent of Online Documents Disappear from Web Pages Within Five Years

The Georgetown Law Library and members of the Chesapeake Digital Preservation Group have released new data showing that, within a five-year period, nearly 38 percent of the online legal reports and web pages preserved through the library’s efforts have disappeared from their original web addresses.

“The results of this study demonstrate the importance of web preservation efforts for law scholars, students, practitioners and researchers,” said Georgetown Law Librarian Michelle M. Wu. “And the release is timely during this celebration of National Preservation Week 2012.”

For the past five years, the Georgetown Law Library and the Chesapeake Digital Preservation Group have revisited a single sample of web addresses, or URLs, to examine their stability over time. The prevalence of link rot in the sample has increased every year since 2008, when it was found to be 8.3 percent.

The 2012 analysis reveals that 37.7 percent of the online publications in the sample have disappeared from their original web addresses. However, due to the Georgetown Law Library and the Chesapeake Digital Preservation Group’s web archiving efforts, all of these publications have been preserved and can be found online at http://legalinfoarchive.org/.

The sample used in the study includes 579 original URLs for law- and policy-related materials that have been captured from the web and archived. The majority of the URLs in the sample are from government (.us or .gov) and organization (.org) web domains.

In 2012, 218 out of 579 URLs in the sample no longer provide access to the content that was originally selected, captured, and archived by the Chesapeake Group. In other words, link rot has increased to 37.7 percent within five years.

In 2008, the sample was analyzed for the first time as part of an evaluation of the archiving program, and link rot was found to be present in 48, or 8.3 percent, of the 579 URLs comprising the sample. At the time, a total of 1,266 web-based titles had been captured and archived. A random sample of 579 titles from the archive was generated for the analysis, ensuring results at a 95 percent confidence level with a confidence interval of +/- 3 percentage points.
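The sampling figures above can be sanity-checked with a short calculation. The release does not say which formula was used, so the following is only a sketch that assumes the common margin-of-error formula for a proportion (worst-case p = 0.5, z = 1.96 for 95 percent confidence) with a finite population correction. The function name and its parameters are illustrative and do not come from the study.

```python
import math

def margin_of_error(sample_size, population_size, z=1.96, p=0.5):
    """Margin of error for a proportion, with finite population correction.

    Assumed formula (not stated in the release): z * sqrt(p(1-p)/n) * FPC.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # The correction shrinks the error when the sample is a large share of the population.
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

# 579 titles sampled from the 1,266 titles archived at the time
moe = margin_of_error(579, 1266)
print(f"{moe * 100:.1f} percentage points")  # prints "3.0 percentage points"
```

Under these assumptions, a sample of 579 drawn from 1,266 titles gives a margin of about 3 percentage points, which is consistent with the figure quoted above.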

One year later, in 2009, the sample was analyzed a second time. Link rot was found to be present in 83 out of the original sample of 579 URLs. Within two years of capture, 14.3 percent of the archived titles had disappeared from their original URLs.

By the third year, in 2010, the prevalence of link rot had increased to 160 out of 579 URLs, to a whopping 27.9 percent. Link rot continued to increase in 2011, but at a slower pace, reaching 30.4 percent by the fourth year. The new 2012 data show an increase of 7.3 percentage points compared to 2011, to 37.7 percent, more in line with our findings of annual increases from 2008 and 2009.
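The yearly rates can be recomputed from the broken-URL counts given in the text. The sketch below uses only the counts the release reports (no count is given for 2011, so that year is left out). The variable and function names are illustrative. Note that the 2010 count of 160 works out to about 27.6 percent, slightly below the 27.9 percent quoted above; the release does not explain the difference.

```python
SAMPLE_SIZE = 579

# Broken-URL counts as reported in the text; the 2011 count is not given.
broken_by_year = {2008: 48, 2009: 83, 2010: 160, 2012: 218}

def link_rot_percent(broken, total=SAMPLE_SIZE):
    """Share of sampled URLs that no longer resolve to the archived content."""
    return round(100 * broken / total, 1)

rates = {year: link_rot_percent(count) for year, count in broken_by_year.items()}
for year, rate in rates.items():
    print(year, rate)
# 2008 8.3
# 2009 14.3
# 2010 27.6
# 2012 37.7
```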

Gary Price (gprice@mediasourceinc.com) is a librarian, writer, consultant, and frequent conference speaker based in the Washington, D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006 to 2009, he was Director of Online Information Services at Ask.com, and he is currently a contributing editor at Search Engine Land.