Wayback Goes Way Back on Web

Imagine being able to travel back in time to an era when the digital publishing euphoria had just begun and the dot-com boom was in full swing.

Now that may be possible with a new digital library tool called the Wayback Machine, which goes "way back" in Internet time to locate archived versions of over 10 billion Web pages dating to 1996.

The Internet Archive and Alexa Internet recently unveiled the free service, which provides digital snapshots from the archive, revealing the origins of the Internet and how it has evolved over the past five years.

"This will help make use of the cultural artifacts of our day," said Brewster Kahle, founder of The Internet Archive. "It will help people make sense of the world and give accountability to what's been published before."

Archivists are attempting to create permanent, reliable access to websites that otherwise might be lost.

"It's preserving a record of something that otherwise literally vanishes," said Paul Grabowicz, assistant dean at University of California in Berkeley's Graduate School of Journalism. "That is one of the frustrations about the Web."

Avid politicos can find early whitehouse.gov Web pages from 1996, including the Clinton-Gore administration's statement on airport safety and terrorism.

Others can find versions of the original Heaven's Gate website before its members committed mass suicide in 1997 as the comet Hale-Bopp approached Earth.

The project is funded by the Library of Congress, the National Science Foundation, the Smithsonian Institution and Compaq.

With over 100 terabytes of data, growing at a rate of 12 terabytes per month, The Internet Archive's digital library is the largest known database in the world, eclipsing the amount of data held by any library, including the Library of Congress.

However, keeping pace with the rapidly evolving digital landscape is a formidable task.

The average life of a Web page is 100 days, Kahle said. At that rate, "a lot of the best Web pages are out of print."

"It's relatively difficult, technologically, to do this. But it's a drop in the bucket compared to what traditional libraries attempt to do."

While the project attempts to archive the entire publicly available Web, some sites may not be included because they are password-protected or otherwise inaccessible to automated crawlers.

Those who don't want their Web pages included in the archive can place a robots.txt file on their site that disallows crawling; the archive's crawlers will then skip the site and mark all previously archived pages from it as inaccessible.
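As a rough sketch, such an exclusion request follows the standard robots.txt format. The user-agent name ia_archiver is the one commonly attributed to the Internet Archive's crawler, though site owners should confirm the exact name before relying on it:

    # Ask the Internet Archive's crawler not to index any part of this site
    User-agent: ia_archiver
    Disallow: /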

The archive has been crawling faster over the years, and technology is getting cheaper over time, Kahle said. However, the project is still very much a work in progress.

"We don't know what the right things are to be collecting," Kahle admits. "By making this collection available, we're hoping to find out what we should be collecting to create a library that is of enduring value."

"It's an incredible challenge on a variety of levels," Grabowicz agreed. "Being able to sweep sites on a regular basis, taking snapshots not just once every couple months – that's a huge challenge. How much of the Net do you try to catalog?"

"What lies ahead, as the Internet grows and grows, is more difficult to keep up with."