Innovation and best practices for the Web

About this Blog

The blog is written by Brian Kelly. Brian is the Innovation Advocate based at CETIS, University of Bolton.

This blog functions as an open notebook which provides personal thoughts, reflections and observations on the role of the Web in higher and further education which I hope will inform readers and stimulate discussion and debate, both on this blog and elsewhere, including on Twitter.

Archive for September 7th, 2011

In this guest blog post Maureen Pennock, the Web Archive Engagement & Liaison Manager at the British Library, explores some possible approaches to exploiting the scholarly value of web archives.

Web archives: more useful than just a ‘historical snapshot’

The importance of the internet for research is well known. As a constantly growing and evolving information source, the web contains vast amounts of information not available or published elsewhere. It is also a unique record of life and society in this technological age. Scholars rarely carry out their research these days without going online, and the research value of the web is undeniable.

Web archives seek to capture this value and uniqueness by harvesting websites so that they may be re-used in the future even when they are no longer available on the live web. Over the past decade, numerous web archives have been established and have grown, including the UK Web Archive. With almost 10 terabytes of data, over 9,300 websites and 38,000 instances of archived sites, the UK Web Archive is a unique selective web archive that reflects the collection policies of the participating institutions.

Use of the web archive is steady. However, as recent reports have identified, there remains a gap between the potential community of researchers who could exploit the content and those who actually do so. To address this, we are collaborating with researchers to explore different ways in which they may use the web archive and exploit the data it contains. As a first step, we have developed and released a number of visualisation tools:

the 3D Visualisation Wall, which provides a high-level, more dynamic presentation of search results and special collections;

the N-Gram search, which encourages users to consider the web archives as data as well as websites, enabling visualisation and comparisons of term frequency;

the General Election 2005 Tag Cloud, which visualises the most frequently used words and word pairs in the websites related to key political parties during the 2005 election campaign.
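To give a feel for the kind of counting that sits behind a tag cloud of single words and word pairs, here is a minimal sketch in Python. It is not the code behind the UK Web Archive tools: the function names `tokenize` and `count_words_and_pairs` and the sample text are hypothetical, and a real tool would also need to handle stop words, HTML markup and much larger volumes of data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_words_and_pairs(texts):
    """Count single words and adjacent word pairs across a set of page texts.

    `texts` is assumed to be an iterable of plain-text strings already
    extracted from archived pages (a hypothetical input format).
    """
    words = Counter()
    pairs = Counter()
    for text in texts:
        tokens = tokenize(text)
        words.update(tokens)
        # Pair each token with the one that follows it on the same page.
        pairs.update(zip(tokens, tokens[1:]))
    return words, pairs


# Illustrative, made-up page text.
sample_pages = [
    "Vote for change. Vote for schools.",
    "Schools and hospitals need change.",
]
word_counts, pair_counts = count_words_and_pairs(sample_pages)
print(word_counts.most_common(4))
print(pair_counts.most_common(1))
```

The most frequent entries in each counter are the ones a tag cloud would draw largest.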

Analysis shows that our single most popular site is the One & Other site, otherwise known as the Fourth Plinth, the website of a 2009 public arts project by artist Antony Gormley. The site is no longer available on the live web. This type of usage, where users browse websites in order to access content that was available at a given point in time but is no longer accessible, is a widely accepted, original user scenario. It is based largely on original user experiences and early interactions with the live web. But there are other ways in which a web archive may be used, aside from visiting sites as they were captured at a given date and time. For example:

Resource citation. Researchers typically use the live web for research and cite live web resources with the date last visited. Why? Because content changes over time and they want to indicate when the content was available on the website. But if the content changes – and web pages are frequently updated or refreshed without archiving old versions – then there is no proof that the content cited actually existed. The web archive provides a more reliable and persistent citation than the live web.

Data exploitation. Web archives enable automatic identification of social trends over time (automated temporal trend research). The tools available will shape the type of research that can be undertaken. This is a chicken-and-egg scenario: we rely to an extent on users to tell us what tools they want, but users need some direction on what might be possible with the data available. We need to work together to further develop the archive and support the emerging research needs of our users.
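As an illustration of what automated temporal trend research might look like at its simplest, the sketch below computes how often a term appears per 1,000 words in each year of capture. The function name `term_trend`, the `(year, text)` input format and the sample snapshots are assumptions made for this example; they do not describe how the UK Web Archive stores or exposes its data.

```python
from collections import defaultdict


def term_trend(snapshots, term):
    """Return {year: occurrences of `term` per 1,000 words}, ordered by year.

    `snapshots` is assumed to be an iterable of (year, plain_text) tuples,
    one per archived page capture (a hypothetical input format).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    term = term.lower()
    for year, text in snapshots:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        # Strip surrounding punctuation before comparing with the term.
        hits[year] += sum(1 for t in tokens if t.strip(".,;:!?\"'()") == term)
    return {
        year: 1000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Illustrative, made-up captures.
captures = [
    (2005, "Broadband is new."),
    (2009, "Broadband, broadband everywhere now."),
]
for year, rate in term_trend(captures, "broadband").items():
    print(year, round(rate, 1))
```

Normalising by the total number of words per year matters because the amount of archived content usually differs from year to year, so raw counts alone would mostly reflect how much was harvested.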

Intelligent querying, of the Q&A sort. Given the amount of data available in the web archive, it’s not inconceivable that future users will expect a more intelligent query mechanism than simple search and result presentation. More complex questions, such as ‘tell me about the competing interests of oil companies in the late twentieth century’, are still the stuff of sci-fi, but answering them would rely upon an extensive historical database such as a web archive.

Of course, the characteristics of a web archive inevitably affect how viable these different scenarios may be. For example, a selective web archive with limited scope but rich resource description will support research differently to a broad domain or international archive with minimal accompanying metadata. The age of the web archive may be another factor. These factors must be recognised when developing tools and functionality.

Increasing usage and responding to researcher needs are important elements of our growth strategy for the UK Web Archive over the next five years. If you use the web archive for research and/or have ideas about tools or functionality to support specific types of research, we’d really like to hear from you. You can get in touch with us by email, on Twitter, or by leaving a comment below.