How Much Of The Internet Does The Wayback Machine Really Archive?

The Internet Archive turns 20 years old next year, having archived nearly two decades and 23 petabytes of the evolution of the World Wide Web. Yet, surprisingly little is known about what exactly is in the Archive’s vaunted Wayback Machine. Beyond saying it has archived more than 445 billion webpages, the Archive has never published an inventory of the websites it archives or the algorithms it uses to determine what to capture and when. Given the Archive’s recent announcements of new efforts to make its web archive accessible to scholarly research, it is critically important to understand what precisely makes up this 445-billion-page archive and how that composition might affect the kinds of research scholars can perform with it.

Regular users of the Wayback Machine are familiar with the myriad oddities of its holdings. For example, despite CNN.com launching in September 1995, the Archive’s first snapshot of its homepage does not appear until June 2000. In contrast, the BBC’s website has been archived since December 1996, but the volume of snapshots has ebbed and flowed in fits and starts through 2012. To truly understand the Archive, it is clear we must move beyond casual anecdotes to a systematic assessment of the collection’s holdings.

Since the Archive does not publish a master inventory of the domains preserved in the Wayback Machine, this analysis used the Alexa ranking of the world’s top one million most popular websites, which is compiled from browsing activity in more than 70 countries. The complete history of all snapshots the Archive has ever recorded for each website’s homepage, through November 5, 2015, was requested using the Wayback CDX Server API. While this reflects only snapshots of homepages, rather than of sites as a whole, it nonetheless captures a key metric of how often the Archive crawls each site.
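As a rough sketch of this kind of query, the snippet below builds a Wayback CDX Server request for a homepage’s full snapshot history and counts the rows of a sample response. The parameter names follow the public CDX API documentation, but the sample response here is invented for illustration, not real Archive data.

```python
import json
from urllib.parse import urlencode

def cdx_query_url(domain, to="20151105"):
    """Build a CDX Server query for every snapshot of a homepage."""
    params = urlencode({
        "url": domain,                        # homepage only, no wildcard
        "output": "json",                     # JSON array, first row is a header
        "fl": "timestamp,statuscode,length",  # fields to return
        "to": to,                             # snapshots through Nov 5, 2015
    })
    return "https://web.archive.org/cdx/search/cdx?" + params

# Invented sample of what a (tiny) CDX JSON response looks like.
sample_response = json.dumps([
    ["timestamp", "statuscode", "length"],
    ["20000620180259", "200", "12345"],
    ["20000815093011", "200", "12890"],
])

def count_snapshots(raw_json):
    """Total snapshots = number of data rows, excluding the header."""
    return len(json.loads(raw_json)) - 1

print(cdx_query_url("cnn.com"))
print(count_snapshots(sample_response))  # 2
```

Running one such query per domain, across one million domains, is what produced the 240-million-snapshot totals discussed below.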

The enormous technical resources required to crawl and archive the open web can be seen in this data. In all, the homepages of the top one million Alexa sites have been snapshotted by the Internet Archive just over 240 million times since 1996. Just over 2 terabytes of bandwidth was consumed downloading those homepages, with more than 307 gigabytes required in 2015 alone.

Yet that crawling attention is spread in surprising ways. Of the top 15 websites with the most snapshots taken by the Archive thus far this year, one is an alleged former movie-pirating site, one is a Hawaiian hotel, two are pornography sites and five are online shopping sites. The second-most-snapshotted homepage belongs to a Russian autoparts website, and the eighth-most-snapshotted site is a parts supplier for trampolines.

Looking in more detail at the Wayback’s archive of the Lithuanian loans website savy.lt, it can be seen that the Archive crawled the site sporadically from January 1999 to May 2003, then did not return for more than a decade. In 2015 it crawled the site heavily in late March and April, very heavily in May and June, a few times on July 1, and never again in the following four months. In all, the Archive’s crawlers accessed savy.lt a total of 203,945 times over this period, most of them in a single massive burst of crawling. Yet the public Wayback profile of the site asserts it has been crawled only 868 times.

The reason for this is that the public-facing Wayback website reports the number of hours with at least one snapshot, rather than the actual total number of snapshots, which is why it reports a maximum of 24 captures per day rather than the thousands of captures per day it actually records for some websites. Unfortunately, the Archive does not clarify this on its website, instead mentioning it only deep within the technical documentation for its CDX Server API on GitHub.
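The hourly deduplication described above can be sketched in a few lines: collapsing each 14-digit CDX timestamp (yyyyMMddhhmmss) to its first 10 digits (yyyyMMddhh) yields the "hours with at least one snapshot" figure. The CDX server’s documented collapse=timestamp:10 parameter performs the equivalent filtering server-side; the timestamps below are invented for illustration.

```python
# Five raw snapshots, three of them within the same clock hour.
timestamps = [
    "20150501120001", "20150501120417", "20150501125959",  # same hour
    "20150501130210",
    "20150502091144",
]

# Raw CDX count vs. the public Wayback UI's deduplicated count.
total_snapshots = len(timestamps)
distinct_hours = len({ts[:10] for ts in timestamps})  # truncate to yyyyMMddhh

print(total_snapshots)  # 5
print(distinct_hours)   # 3
```

This is exactly the gap between savy.lt’s 203,945 raw crawls and its publicly reported 868 captures.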

Reranking the top one million sites by the number of distinct hours with at least one snapshot, and calculating the percentage of hours since 12:01AM January 1, 2015 with a snapshot, the top 15 sites are: myspace.com (93%), yahoo.com (86%), cnn.com (80%), youtube.com (78%), msn.com (76%), twitter.com (76%), facebook.com (72%), msnbc.com (70%), abcnews.go.com (70%), today.com (69%), nbcnews.com (67%), cbsnews.com (65%), infoseek.co.jp (65%), cnbc.com (63%), and tinypic.com (58%). Nine of the top 15 websites by hourly snapshots are news websites, offering what appears to be a more reasonable ranking. Indeed, news websites account for many of the domains in the top 50.

Yet a closer look at this ranking also reveals a number of anomalies. The site walb.com has an Alexa ranking of 100,803, yet ranks 24th by hours with at least one snapshot, while mountvernonnews.com ranks 363,013 in Alexa but 43rd by snapshot hours. This appears to be a general trend, with no noticeable connection between a site’s Alexa rank and the number of times, or number of hours, its homepage has been snapshotted.

In fact, the total number of snapshots and the total number of hours with at least one snapshot are only weakly correlated at r=0.35. Alexa rank and number of snapshots are not meaningfully correlated at r=-0.03, while Alexa rank and number of distinct hours with snapshots are inversely correlated at r=-0.15. Put simply, a site’s number of snapshots and its number of hours with at least one snapshot are largely unrelated to its Alexa ranking: more popular sites do not have more snapshots than less popular ones. On the one hand, this might make sense, since the popularity of a site is not necessarily indicative of how frequently it updates. Yet on the circa-2015 web, highly popular sites tend to update constantly with new content; a site that is updated once every few years will likely draw little traffic. Thus, one could argue that a site’s content update rate and its popularity are at least somewhat related.
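For readers wanting to reproduce this kind of measurement, a hand-rolled Pearson correlation is sketched below on invented data; the article’s actual inputs (the snapshot counts for the full Alexa top million) are not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance over product of std devs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented illustration: snapshot counts that are bursty and unrelated
# to rank, mimicking the weak correlations reported above.
alexa_rank = [1, 2, 3, 4, 5, 6]
snapshots  = [120, 80000, 45, 3000, 90, 210000]

print(round(pearson_r(alexa_rank, snapshots), 2))
```

On the real data, the same computation is run three ways: snapshots vs. hours, rank vs. snapshots, and rank vs. hours.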

Looking across years, the correlation of Alexa rank with hours and snapshots is remarkably consistent from 2013 to 2015, varying from -0.15 to -0.17 for hours and -0.03 to -0.04 for snapshots. However, the correlation between hours and snapshots varies considerably, from 0.35 in 2015 to 0.29 in 2014 to 0.46 in 2013 to 0.38 in 2012. The constancy of the Alexa-rank correlations over the last three years suggests that the Archive does not factor popularity into its crawling behavior. On the other hand, the considerable change in the correlation of total snapshots with snapshot hours suggests that the Archive’s recrawl behavior is constantly changing, which will have a profound effect on research using the Archive as a dataset to study the web’s evolution.

News outlets represent a special kind of website that combines a high update rate of new content with considerable societal importance from the standpoint of archival. To examine how well the Archive has been preserving online news, the top 20,000 news websites by volume monitored by the GDELT Project were selected and the country of origin for each outlet identified. The total number of snapshot hours were summed for all news outlets from each country for 2013, 2014, and 2015, and divided by the total number of monitored outlets from each country, yielding the following maps of the average number of snapshot hours per news outlet in each country by year.
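The per-country averaging described above amounts to a simple group-and-divide. A minimal sketch follows, with invented outlet names and hour totals standing in for the GDELT-derived data.

```python
from collections import defaultdict

# Invented stand-ins for the monitored outlets: each record carries the
# outlet's country and its total snapshot hours over the study period.
outlets = [
    {"domain": "example-news.com",     "country": "US", "snapshot_hours": 5200},
    {"domain": "daily-sample.com",     "country": "US", "snapshot_hours": 1800},
    {"domain": "beispiel-zeitung.de",  "country": "DE", "snapshot_hours": 900},
]

# country -> [summed hours, outlet count]
totals = defaultdict(lambda: [0, 0])
for outlet in outlets:
    totals[outlet["country"]][0] += outlet["snapshot_hours"]
    totals[outlet["country"]][1] += 1

# Average snapshot hours per outlet, per country.
avg_hours = {c: h / n for c, (h, n) in totals.items()}
print(avg_hours)  # {'US': 3500.0, 'DE': 900.0}
```

These per-country averages are what the three maps below visualize for 2013, 2014, and 2015.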

Average number of hours with at least one snapshot by outlet for online news outlets by country in 2013 (Credit: Kalev Leetaru)

Average number of hours with at least one snapshot by outlet for online news outlets by country in 2014 (Credit: Kalev Leetaru)

Average number of hours with at least one snapshot by outlet for online news outlets by country in 2015 (Credit: Kalev Leetaru)

Clearly visible in this sequence of maps is a strong centralization of the Archive’s crawling resources towards a relatively small number of countries in terms of snapshot hours. In 2013 there were just a few outliers, with most countries having relatively similar hours per outlet. Over the three years there has been a steady reorientation towards a more uneven breakdown of archival resources. The significant geographic change over time adds further evidence that the behavior of the Archive’s crawlers is constantly changing in profound and undocumented ways.

Taken together, these findings suggest that far greater understanding of the Internet Archive's Wayback Machine is required before it can be used for robust, reliable scholarly research on the evolution of the web. Historical documentation on the algorithms and inputs of its crawlers is absolutely imperative, especially the workflows and heuristics that control its archival today. One possibility would be for the Archive to create a historical archive preserving every copy of the code and workflows powering the Wayback Machine over time, making it possible to look back at the crawlers of 1997 and compare them to those of 2007 and 2015.

More detailed logging data is also clearly a necessity, especially of the kinds of decisions that lead to situations like the extremely bursty archival of savy.lt or why the CNN.com homepage was not archived until 2000. If the Archive simply opens its doors and releases tools to allow data mining of its web archive without conducting this kind of research into the collection's biases, it is clear that the findings that result will be highly skewed and in many cases fail to accurately reflect the phenomena being studied.

What can we learn from all of this? Perhaps the most important lesson is that, like so many of the massive data archives that define the “big data” world, we have precious little understanding of what is actually in the data we use. Few researchers stop to ask the kinds of questions explored here and even fewer archives make any kind of detailed statistics available about their holdings. Instead, the “big data” era is unfortunately being increasingly defined by headline-grabbing results computed from datasets being plucked off the shelf with little attempt to understand their inner biases.

Another theme is that of unexpected discovery. This analysis originally began as a study of online news archiving practices of the Internet Archive, intended to explore whether it archived Western outlets more frequently than those of other countries. The original expectation was that the Archive’s holdings would reflect popularity and rate-of-change, with language and geographic location being the primary differentiators. However, once the data was examined, it was clear the archival landscape of the Wayback Machine was far more complex.

The interfaces we use to access these vast archives often silently transform them in ways that are neither apparent nor visibly documented, but that can have profound impacts on our understanding of the results we obtain from them. For example, neither the Wayback homepage nor the detailed FAQ informs users that the snapshot counts in the web interface report the number of distinct hours with at least one snapshot, rather than the actual number of times the Archive crawled a page. This fact is available only buried deep within a technical API reference page on GitHub.

In my opening keynote address at the 2012 IIPC General Assembly at the Library of Congress, I noted that for scholars to be able to use web archives for research, we needed far greater information on how those archives were being constructed. Three and a half years later few major web archives have produced such documentation, especially relating to the algorithms that control what websites their crawlers visit, how they traverse those websites, and how they decide what parts of an infinite web to preserve with their limited resources. In fact, it is entirely unclear how the Wayback Machine has been constructed, given the incredibly uneven landscape it offers of the top one million websites, even over the past year.

The findings above demonstrate how critical this kind of insight is. When archiving an infinite web with finite resources, countless decisions must be made as to which narrow slices of the web to preserve. At the most basic level, one can choose completely random archival (selecting pages without regard to any other factors), archival prioritized by rate of change (archiving more frequently those pages that change most often, though this tends to emphasize dynamically generated sites), or archival prioritized by popularity (emphasizing the pages most people use today, but risking the failure to preserve relatively unknown pages that may become important in the future). Human input can also play a critical role, as with the Archive's specialized Archive-It program.

Each approach has distinct benefits and risks. One might reasonably ask: 20 years from now, which are we more likely to want to look back at, a Lithuanian loan website, a trampoline parts supplier, or the breaking news homepage of a major news outlet like CNN? Decisions as critical as what to preserve for the future require far greater input from the community, especially the scholars who rely on these collections. Given the current state of the Archive’s holdings, it is clear that far greater visibility into its algorithms is needed, along with critical engagement with the broader scholarly community. We simply cannot leave something as important as the preservation of the online world to blind algorithms whose inner workings we do not understand.

Indeed, just as libraries have spent thousands of years formalizing how they make acquisition and collection decisions based on community engagement, it is clear that web archives must adopt similar processes and partner with a wide range of organizations to help them do so. Given that up to 14% of all online news monitored by the GDELT Project is no longer accessible after two months, the web is disappearing before our very eyes. It is imperative that we do a better job of archiving the online world, and that we do it before this material is lost forever.