The Internet Archive

"This webpage is no longer available": does that sentence sound familiar? This is no longer a problem, since a complete archive of webpages is now available. Through the Internet Archive, you can retrieve expired webpages, trace the development of websites, and go back to events that have shaken the world.

What is the Internet Archive?
The Internet Archive is a complete snapshot of all webpages on every website from 1996 until today.

Where is all this data stored?

Following an agreement with the San Francisco team, the Internet Archive at the Bibliotheca Alexandrina (BA) incorporated the second-generation machine for web archiving, the Petabox (http://www.petabox.org). This is a machine designed to safely store and process one petabyte (a million gigabytes) of data. The machine features low power consumption, support for multiple operating systems, easy maintenance and software to automate mirroring. The machines assembled in San Francisco accommodated 1.5 petabytes of data in 23 racks. The machines were installed in the BA during 2006, holding the web collections of 1996 through 2007. Machines for web collections beyond 2006 as well as other types of collections were designed and manufactured locally at the BA with a storage capacity of 2.2 petabytes hosted on 20 racks.
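As a quick check on the figures above, the short Python sketch below adds up the two groups of machines and works out the average storage per rack. It uses only the numbers quoted in the text and takes one petabyte as 1,000 terabytes; the per-rack averages are derived values, not published specifications.

```python
# Capacity figures quoted in the text above (decimal units: 1 PB = 1,000 TB).
TB_PER_PB = 1000

clusters = {
    "assembled in San Francisco": {"petabytes": 1.5, "racks": 23},
    "built locally at the BA": {"petabytes": 2.2, "racks": 20},
}


def summarize(groups):
    """Return (total petabytes, total racks) over all machine groups."""
    total_pb = sum(g["petabytes"] for g in groups.values())
    total_racks = sum(g["racks"] for g in groups.values())
    return total_pb, total_racks


def tb_per_rack(group):
    """Average terabytes per rack for one machine group (a derived figure)."""
    return group["petabytes"] * TB_PER_PB / group["racks"]


total_pb, total_racks = summarize(clusters)
for name, group in clusters.items():
    print(f"{name}: about {tb_per_rack(group):.0f} TB per rack")
print(f"total: {total_pb:.1f} PB on {total_racks} racks")
```

The total comes to 3.7 petabytes, which agrees with the overall capacity quoted for the archive.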

Cluster synchronization software was developed at the BA, and, as a test run of the synchronization process, a subset of the web collection of 2007 was transferred over the BA's 155-Mbps high-speed Internet link and made available through the Wayback Machine. Synchronization of further web archive data is underway.
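To give a feel for what synchronization over a 155-Mbps link involves, here is a rough Python estimate of transfer times. The size of the 2007 subset is not stated, so the data sizes in the loop are purely illustrative, and the `efficiency` parameter is a hypothetical allowance for protocol overhead and shared use of the link.

```python
LINK_MBPS = 155  # link speed from the text, in megabits per second


def transfer_days(size_terabytes, link_mbps=LINK_MBPS, efficiency=1.0):
    """Days needed to move size_terabytes (decimal TB) over the link.

    efficiency is a hypothetical fraction of the nominal bandwidth
    that the transfer actually achieves (1.0 means full line rate).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = size_terabytes * 1e12 * 8              # terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400                       # seconds to days


# Illustrative sizes only; none of these is the size of the actual subset.
for size_tb in (1, 10, 100):
    print(f"{size_tb:>4} TB: {transfer_days(size_tb):.1f} days at full line rate")
```

Even at full line rate, a hundred terabytes takes about two months, which suggests why a subset of a single year was used as the test run.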

The archive at the Bibliotheca Alexandrina (BA) now includes 70 billion webpages covering the period 1996–2007, 2,000 hours of Egyptian and US television broadcasts, 1,000 archival films and 25,000 digitized books acquired through the Open Content Alliance (OCA) consortium. It is capable of storing 3.7 petabytes of data on 1,636 computers. The archive is fully operational and the collection is widely accessed by national, regional and international users through the BA website, http://archive.bibalex.org, via the Wayback Machine, with over 31 million hits yearly.

The BA Internet Archive is the first center of its kind established outside US borders. It is designed not only as a backup for the mother archive in San Francisco, but also as a hub for Africa and the Middle East.

The Wayback Machine
Imagine being able to go back in time and surf the Net as it used to be! The Wayback Machine is a service that allows you to visit archived versions of websites. All you have to do is type the URL you require into the Wayback Machine, and it will take you to a list of all archived versions of the site.
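The same lookup can be written as a link. The Python sketch below builds Wayback Machine addresses using the URL layout of the Internet Archive's public service at web.archive.org; whether the BA mirror uses the same layout is an assumption, so the base address is a parameter. The function names are illustrative, not part of any official library.

```python
from datetime import datetime

# Assumption: the layout used by the Internet Archive's public Wayback Machine.
# A mirror may use a different base address or path layout.
WAYBACK_BASE = "https://web.archive.org/web"


def capture_list_url(url, base=WAYBACK_BASE):
    """Address of the page listing all archived versions of url."""
    return f"{base}/*/{url}"


def capture_url(url, when, base=WAYBACK_BASE):
    """Address of the archived version of url closest to the moment `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit timestamp: YYYYMMDDhhmmss
    return f"{base}/{stamp}/{url}"


print(capture_list_url("http://www.bibalex.org/"))
print(capture_url("http://www.bibalex.org/", datetime(2007, 1, 1)))
```

Opening the first address in a browser shows the list of archived versions; the second asks for the version nearest to the given date.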

Further technical details

Throughout previous operations, data failure rates and recovery methods have been analyzed in order to better maintain the preserved digital material. This analysis is shared with the San Francisco team and incorporated in the design of new machines. The web collection data on the Petabox was compared with that on the old 100-terabyte Internet Archive cluster, and a set of unique archival files was identified and moved off the old machines, which have since been decommissioned. Enhancements to the system in the areas of cluster management and security have been researched and are being implemented. Work is in progress to invite researchers to work on the available wealth of data and build special collections reflecting the interests of the BA patrons.
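The comparison between the two clusters can be pictured as a set difference. The text does not describe how it was actually carried out, so the Python sketch below is only one plausible approach: it assumes each cluster can produce a manifest mapping file names to checksums, and it reports the files on the old cluster whose content is not found on the new one. The file names and checksums are made up for illustration.

```python
def unique_to_old(old_manifest, new_manifest):
    """Return names of files whose content exists only on the old cluster.

    Each manifest maps a file name to a checksum string. Files are matched
    by checksum, so a file stored under a different name on the new
    cluster is not reported as unique.
    """
    new_checksums = set(new_manifest.values())
    return sorted(
        name
        for name, checksum in old_manifest.items()
        if checksum not in new_checksums
    )


# Hypothetical manifests; real archival files and checksums would differ.
old_cluster = {"crawl-1999-a.arc.gz": "aa11", "crawl-2001-b.arc.gz": "bb22"}
new_cluster = {"crawl-1999-a.arc.gz": "aa11", "crawl-2005-c.arc.gz": "cc33"}

print(unique_to_old(old_cluster, new_cluster))
```

Only the files reported here would need to be copied before the old machines are retired.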

The collaborative efforts of the International School of Information Science (ISIS), one of the BA’s research centers, and the Internet Archive in San Francisco have enormously enriched the content of the BA’s universal digital library. Moreover, they have opened new horizons for regional and international partnerships to access this precious resource.

In September 2006, three software engineers visited the Internet Archive in San Francisco and received training in maintaining large numbers of data files of various types (audio, video, images) and in handling the new Petabox hardware. In January 2007, an engineer from the Internet Archive in San Francisco in turn visited the BA and helped set up the Petabox.

The BA Internet Archive team has been working on carrying out its own crawls, taking the work one step beyond data synchronization with the Internet Archive in San Francisco. In this context, the BA joined the International Internet Preservation Consortium (IIPC) in October 2010, opening up new avenues for collaboration and for exchanging experience in the field of web archiving with other institutions worldwide. The team's current crawling interests are Arab League top-level domains and post-revolution Egyptian news and politics.

Future Work
The ISIS team is currently studying new ways of making the hardware and software more efficient. Data researchers (computer engineers, linguists, etc.) are invited to work on the collection. Special collections that reflect the interests of Bibliotheca Alexandrina’s patrons are also being built.