The Internet Archive (thing)

I recently attended the NotCon '04 event, at which various novel technology projects were discussed.
Brewster Kahle spoke very impressively about the Internet Archive project. This scheme was inspired by the ancient Library of Alexandria, which was said to contain all human knowledge. Kahle and his cohorts have built an extensive server farm that is archiving all sorts of information as fast as volunteers submit it.

Books and other literature are being scanned page by page, indexed and archived. Most of this activity has been outsourced to India: the Archive sends out crates of print material to the subcontinent, where the cost of scanning is about $10 per volume. The Million Book Project is aiming to archive that number of volumes by 2005. A pending court case, Kahle v. Ashcroft, may allow access to the huge range of out-of-print books.

Musical recordings, especially live concerts, are another target. Over 12,000 concerts have been digitised, losslessly compressed and archived so far. Most of them feature "jam bands", like The Grateful Dead. Many public domain films have received similar treatment: the Archive contains everything from government information reels to old Hollywood movies. I was able to access the old "Duck and Cover" film, and "Night of the Living Dead". Twenty television channels are being continuously digitised and archived in DVD quality, which by itself accounts for around 20 terabytes of data per month. Old software packages are solicited, especially if they're boxed with manuals.

Finally, the Archive has deployed spider programmes onto the internet to archive webpages. A significant portion of the World Wide Web dating back to 1996 has been archived and is available via the Wayback Machine. The Archive webpage offers graphs of how often a phrase has been used on the web over time; look out for neologisms like "Homeland Security".

The Archive is approaching 1 petabyte (about one million gigabytes) of data. Kahle pledged that the Archive will offer "unlimited storage, unlimited bandwidth, forever, for free" to any suitable content. Much of the currently archived material is available for download or viewing via the webpage at www.archive.org.

Apart from academic interest from future generations and researchers, the Archive has the potential to improve literacy and access to literature. A recent side-project involved specially equipped vans that offered children the opportunity to print and bind books from the Archive. Kahle claimed that lending out and returning a book costs a traditional library around $2, whereas printing and binding a book in this way costs about $1.

The Archive is attracting attention from technologists interested in the reliability of long-term data storage. On average, 6% of their hard disks fail every year, and they occasionally lose data. Their servers are based uncomfortably close to the San Andreas Fault, and so major efforts are now underway to find organisations willing to host duplicate Archives. The first of these is currently being set up in Amsterdam. The Archive has designed a Linux-based server configuration that offers 100 terabytes of the archive and all the controlling software in a single rack, commoditising the distribution of the whole Archive to backup sites.
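To put those figures in perspective, here is a rough back-of-envelope sketch in Python. It only uses the numbers quoted above (6% annual disk failure, roughly 1 petabyte of data, 100 terabytes per rack). The disk count is a made-up figure for illustration, and the survival estimate assumes that failures are independent and that the rate stays constant, neither of which Kahle claimed.

```python
import math

# Figures quoted in the writeup.
ANNUAL_DISK_FAILURE_RATE = 0.06  # 6% of hard disks fail every year
ARCHIVE_SIZE_TB = 1000           # about 1 petabyte, taking 1 PB = 1,000 TB
RACK_CAPACITY_TB = 100           # one rack holds 100 terabytes

# Hypothetical figure, NOT from the talk: chosen only to make the sums concrete.
HYPOTHETICAL_DISK_COUNT = 1000


def racks_needed(archive_tb, rack_tb):
    """Number of whole racks required to hold one full copy of the archive."""
    return math.ceil(archive_tb / rack_tb)


def expected_failures(disk_count, annual_rate, years=1):
    """Expected number of disk failures (replacements) over the given period."""
    return disk_count * annual_rate * years


def survival_probability(annual_rate, years):
    """Chance that a single disk lasts the whole period.

    Assumes independent failures at a constant annual rate (a simplification).
    """
    return (1 - annual_rate) ** years


if __name__ == "__main__":
    print("Racks for one copy:", racks_needed(ARCHIVE_SIZE_TB, RACK_CAPACITY_TB))
    print("Expected failures per year:",
          expected_failures(HYPOTHETICAL_DISK_COUNT, ANNUAL_DISK_FAILURE_RATE))
    print("Chance a disk lasts 5 years: %.2f"
          % survival_probability(ANNUAL_DISK_FAILURE_RATE, 5))
```

On these assumptions a full copy of a one-petabyte Archive fits in about ten racks, which is what makes shipping duplicates to other sites plausible, and a thousand-disk installation would expect to replace a disk roughly every week.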

A video of Brewster Kahle's talk is available here: http://quernstone.com/notcon04/ . Or, by the magic of the Internet Archive's Freecache system, here: http://freecache.org/http://quernstone.com/notcon04/ . (Try this with all the big media files you want to download from the web: putting http://freecache.org/ in front of the address forces the file into the archive, and lets you download it from a distributed cache system.) Audio of the talk is available here: http://www.ejhp.net/notcon/t10.ogg and here: http://www.ejhp.net/notcon/t10.mp3 .
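The Freecache trick above amounts to nothing more than gluing a prefix onto the original address. A minimal sketch in Python, assuming the service works exactly as the example link suggests; the function name freecache_url is my own invention, not part of any Archive tool, and the code makes no network requests.

```python
# Prefix taken from the example link in the writeup.
FREECACHE_PREFIX = "http://freecache.org/"


def freecache_url(original_url):
    """Return the Freecache form of a plain http:// address.

    Hypothetical helper: it only builds the string, mirroring the
    pattern http://freecache.org/http://example.com/file .
    """
    if not original_url.startswith("http://"):
        raise ValueError("expected a full address starting with http://")
    return FREECACHE_PREFIX + original_url


if __name__ == "__main__":
    print(freecache_url("http://quernstone.com/notcon04/"))
```

Run against the video address, this reproduces the Freecache link given above.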

The Internet Archive is a not-for-profit corporation, supported by various companies and institutions.