What does the web remember of its deleted past?

[Update (Jan 2017): this research has recently been published in New Media & Society. A free version is available on Academia.edu.]

On March 30, 2010, the country-code top-level domain of the former Yugoslavia, .yu, was deleted from the Internet. It is said to have been the largest ccTLD ever removed. In terms of Internet governance, the domain lost any entitlement to be part of the Internet’s root zone once Yugoslavia dissolved. With the exception of Kosovo, all former Yugoslav republics received new ccTLDs. Technically, it was neither necessary nor possible to keep a domain of a country that no longer exists.

The consequence of the removal of the domain, which at its peak hosted about 70,000 websites, is the immediate deletion of any evidence that it was part of the Web. The oblivious live Web has simply rerouted around it. Since the .yu ccTLD is no longer part of the DNS, even if .yu websites are still hosted somewhere on a forgotten server, they cannot be recalled; search engines do not return results to queries for Websites in the .yu domain; references to old URLs on Wikipedia are broken.

My recent research uses the case of the deleted .yu domain to problematize the ties between the live and archived Web, and to both question and demonstrate the utility of Web archives as a primary source for historiography. The first problem I address relates to the politics of the live Web, which, arguably, create a structural preference for sovereign and stable states. The DNS enforces ICANN’s domain delegation policy, which is derived from the ISO 3166 list of countries and territories officially recognized by the United Nations. Countries and territories recognized by the UN are therefore delegated ccTLDs, but unstable, unrecognized, dissolving or non-sovereign states cannot enjoy such formal presence on the Web, marked by the national country-code suffix. It is for this reason that the former republics of Yugoslavia (Bosnia, Macedonia, Slovenia, Croatia, Serbia and Montenegro) received new ccTLDs, but Kosovo, which is not recognized by the United Nations, did not.

While such a policy influences the Web of the present, it also denies unstable and non-sovereign countries the possibility of preserving evidence of their digital past. To illustrate my point, consider an imaginary scenario whereby the top-level domain of a Western and wealthy state – say Germany, or the UK – is to be removed from the DNS in two years. It is difficult to imagine that a loss of digital cultural heritage on such a scale would go unnoticed. To prevent such imaginary scenarios from taking place, national libraries around the world work tirelessly to preserve their countries’ national Webs. Yet for non-sovereign states, or in the case of war-torn states that once existed but have since dissolved, such as the former Socialist Federal Republic of Yugoslavia, the removal of the country’s domain is not treated in terms of cultural heritage and preservation, but instead as a bureaucratic and technical issue.

Technically, the transition from .yu to the Serbian .rs and the Montenegrin .me was perfectly coordinated between ICANN, Serbia and Montenegro. In 2008, a two-year transitional phase was announced to allow webmasters ample time to transfer their old .yu websites to the new national domains. It is reported that migration rates were rather high. But what about the early days of the .yu domain – the websites that describe important historical events such as the NATO bombing, the Kosovo War, the fall of Milosevic? What about the historical significance of the mailing lists and newsgroups that contributed for the first time to online reporting of war from the ground? The early history of the .yu domain – the domain that existed prior to the establishment of Serbia and Montenegro as sovereign states – was gone forever.

Almost.

Thankfully, the Internet Archive has kept snapshots of the .yu domain throughout the years. However, a second problem hinders historians from accessing the rare documents that can no longer be found online. That second problem relates to the structural dependence of Web archives on the live Web. Despite some critical voices in the Web archiving community, most Web archiving initiatives and most researchers still assume that the live Web is the primary access point that leads to the archive. The Wayback Machine’s interface is an example of that: one has to know a URL in order to view its archived version. The archive validates the existence of URLs of the live Web, and allows for examining their history. However, if all URLs of a certain domain are removed from the live Web and leave no trace, what could lead historians, researchers, or individuals to the archived snapshots of that domain?

Taking both problems into account, I set out to reconstruct the history of the .yu domain from the Internet Archive. The challenge is guided by a larger question about the utility of Web archives for historiography. Can the Web be used as a primary source for telling its own history? What does the Web remember of its deleted past? If the live Web has no evidence of the past existence of any .yu URL, would I be able to find the former Yugoslav Web in the Internet Archive, demarcate it, and reconstruct its networked structure?

I began digging. Initially, I used various advanced search techniques to find old Websites that may contain broken links to .yu Websites. I also scraped online aggregators of scholarly articles to find old references to .yu Websites in footnotes and bibliographies. My attempts yielded about 200 URLs, certainly not enough to reconstruct the history of the entire domain from the Internet Archive.

The second option was to use offline sources – newspaper archives, printed books, and physical archives. But doing so would mean no longer relying on the Web as a primary source for narrating its own history.

My digging eventually led me to old mailing lists. In one of them I found a treasure. On 17 February 2009, Nikola Smolenski, a Wikipedian and a Web developer, posted a message to Wikimedia’s Wikibots-L mailing list, asking fellow Wikipedians to help him replace all references to .yu URLs in the various pages of the Wikimedia projects. The risk, wrote Smolenski, was ‘that readers of Wikimedia projects will not be able to access information that is now available to them’, and that ‘with massive link loss, a large number of references could no longer be evaluated by the readers and editors’. He used a Python script to generate a list of 46,102 URLs in the .yu domain that were linked from Wikimedia projects and that had to be replaced. A day before the removal of the domain, he also systematically queried Google for all URLs in the .yu domain per sub-domain, which yielded several thousand results. Smolenski’s lists are a last snapshot of the presence of the Yugoslav domain on the live Web. The day after he conducted the search, the .yu ccTLD was no longer part of the Internet root, resulting in the link loss he had anticipated.

Smolenski kindly agreed to send me the lists he generated in 2010. Using the URLs in the lists as seeds, my research assistant Adam Amram and I built another Python script to fetch the URLs from the Internet Archive, extract all the outlinks from each archived resource, and keep from that set of links those that belonged to the .yu domain. We repeated the method four times, until no new .yu content was found. Our dataset now contains 1.92 million unique pages that were hosted in the .yu domain between 1996 and 2010.
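The iterative method described above can be sketched roughly as follows. This is an illustration only, not the script we actually ran: the function names (`is_yu`, `yu_outlinks`, `snowball`) and the `fetch_archived` callable are hypothetical, and the sketch assumes that `fetch_archived` returns the archived HTML of a URL with its original, unrewritten links, or `None` when no snapshot exists.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_yu(url):
    """True if the URL's host name ends in the .yu ccTLD."""
    host = urlsplit(url).hostname or ""
    return host.lower().endswith(".yu")


def yu_outlinks(page_url, html):
    """Return the set of absolute .yu URLs linked from one archived page."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(page_url, href) for href in parser.hrefs)
    return {url for url in absolute if is_yu(url)}


def snowball(seeds, fetch_archived, max_rounds=4):
    """Expand a seed list round by round until no new .yu URL turns up.

    fetch_archived is a hypothetical callable: URL -> archived HTML or None.
    """
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        discovered = set()
        for url in frontier:
            html = fetch_archived(url)
            if html is None:  # no snapshot for this URL
                continue
            discovered |= yu_outlinks(url, html)
        frontier = discovered - seen  # only URLs not seen in earlier rounds
        if not frontier:
            break
        seen |= frontier
    return seen
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular archive API, so the same loop can be tried against a small dictionary of saved pages before being pointed at real snapshots.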

While the full analysis of our data is beyond the scope of this blog post, I would like to present the following visualization of the rise and fall of the networked structure of the .yu domain over time. The figure below shows the evolution of the linking structure of .yu websites in the entire reconstructed space from 1996 to 2010. Websites in the .yu domain are marked in blue, websites in all other domains are marked in gray, and the visualization shows the domain’s hyperlinked structure per year.

As can be clearly seen, the internal linking structure of the domain became dense only after the end of the Milosevic regime in 2000. It is only after the final split between Serbia and Montenegro in 2006 that the .yu domain stabilized, both in terms of the number of websites and in terms of network density. Shortly afterwards the network thinned out, in preparation for the replacement of the .yu domain with the new ccTLDs .rs and .me. In other words, the intra-domain linking patterns of the .yu domain are closely tied with stability and sovereignty.
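This post does not go into how density is measured. As a minimal sketch, assuming the common definition for a directed graph (the number of links among a set of sites divided by the number of possible links among them), the intra-domain density for one year could be computed along these lines. The function name `directed_density` and the sample data are hypothetical.

```python
def directed_density(nodes, edges):
    """Density of the directed graph induced on `nodes`.

    nodes: set of site identifiers (e.g. .yu host names seen in one year)
    edges: iterable of (source, target) link pairs observed in that year
    Links to or from sites outside `nodes`, and self-links, are ignored.
    """
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = {
        (source, target)
        for source, target in edges
        if source in nodes and target in nodes and source != target
    }
    return len(internal) / (n * (n - 1))


# Hypothetical data: three .yu hosts, one link leaving the domain.
hosts = {"a.co.yu", "b.org.yu", "c.ac.yu"}
links = [
    ("a.co.yu", "b.org.yu"),
    ("b.org.yu", "c.ac.yu"),
    ("a.co.yu", "example.com"),
]
density = directed_density(hosts, links)
```

Computing this value separately for each year's set of sites and links would give a single number per year to set beside the visualization.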

As time goes by, Web archives are likely to hold more treasures of our deleted digital pasts. This makes Web archives all the more intriguing and important primary sources for historical research, despite the structural problems of the oblivious medium that they attempt to preserve.