British Library develops web crawling system for preserving webpages

The internet is an ever-changing repository of information, where pages disappear as quickly as new ones appear. The British Library is part of a group of institutions working to preserve digital data presented on the internet for future generations and to make it available for research.

The British Library and the National Library of New Zealand co-funded the development of a web harvesting management system called the Web Curator Tool. It uses software to crawl through a specified section of the web and gather snapshots of various sites.

The system lets users select, describe, and harvest online publications, and then organise the data with an intuitive workflow and management system. It effectively allows users without in-depth technical knowledge to put in place a system to store and manage snapshots of webpages for future reference.

At the moment, the tool is intended for selective harvesting of nominated websites, which are kept in the national archive with the permission of the copyright holders. Philip Beresford, Project Manager of the British Library’s Web Archiving Project, explains, “The British Library is collaborating with five other institutions in the UK Web Archiving Consortium on archiving selected websites of likely research value”.

“The British Library expects to continue to archive a limited number of selected sites, even after we’ve started to harvest snapshots of the whole .uk domain – at much greater breadth but less depth.”

Beresford adds, “Access to the archived material is subject to provisions in legal deposit legislation and copyright agreements with content providers, yet to be finalised”.

The libraries will integrate the tool into their preservation programmes and release it as open source software for other organisations to implement. At the moment, Beresford sees it as most interesting for libraries and archives, but other organisations may also find that they need to preserve digital heritage.

As the internet grows and evolves, these archives will provide important historical context for changes in digital media and information.