Web-Archiving: Reconciling Two Curation Methods

One of the first things I did in digital preservation was a long-term web-archiving project, so I have long felt quite close to the subject. I was very pleased to attend the 2017 IIPC conference in Senate House, London, which this year combined to great effect with the RESAW conference, ensuring wide coverage and maximum audience satisfaction in the papers and presentations.

In this short series of blog posts, I want to look at some of the interesting topics that reflect some of my own priorities and concerns as an archivist. I will attempt to draw out the wider lessons as they apply to information management generally, and readers may find something of interest that puts another slant on our orthodox notions of collection, arrangement, and cataloguing.

Andy Jackson at the British Library is facing an interesting challenge as he attempts to build a technical infrastructure to accommodate a new and exciting approach to collections management.

The British Library has traditionally had custodial care of official Government papers. They’ve always collected them in paper form, but more recently two separate curation strands have emerged.

The first has been through web-archiving, where as part of the domain-wide crawls and targeted collection crawls, the BL has harvested entire government websites into the UK Web Archive. These harvests can include the official publications in the form of attached PDFs or born-digital documents.

The second strand involves the more conventional route followed by the curators who add to The Catalogue, i.e. the official BL union catalogue. It’s less automated, but more intensive on the quality control side; it involves manual selection, download, and cataloguing of the publication to MARC standards.

Currently, public access to the UK Web Archive and to The Catalogue runs through two different routes. My understanding is that the BL are aiming to streamline this into a single collection discovery point, enabling end users to access digital content regardless of where it’s from, or how it was catalogued.

Andy’s challenges include the following:

The two curation methods involve thinking about digital content in quite different ways. The first is more automated, and allows the possibility of data reprocessing. The second has its roots in a physical production line, with clearly defined start and end points.

Because of its roots in the physical world, the second method has a form of workflow management which is closely linked to the results in the catalogue itself. It seems there are elements in the database which indicate sign-off and completion of a particular stage of the work. Web crawling, conversely, resembles a continual process, and the cut-off point for completion (if indeed there is one) is harder to identify.

Some duplication is known to be taking place, of both effort and content; to put it another way, PDFs known to be in the web archive are also being manually uploaded to the catalogue.
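To make the duplication problem concrete: where the web archive and the catalogue each hold a byte-identical copy of the same PDF, matching on a content hash is one simple way to detect the overlap. The holdings, identifiers, and bytes below are entirely hypothetical; this is a minimal sketch, not the BL’s actual process.

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical holdings: identifier -> raw bytes of the stored document.
web_archive = {"wa/gov-report.pdf": b"%PDF-1.4 annual report..."}
catalogue = {"BLL01-123456": b"%PDF-1.4 annual report..."}

# Index the web archive by content hash, then check catalogue items against it.
wa_index = {sha256_digest(data): doc_id for doc_id, data in web_archive.items()}

duplicates = {
    cat_id: wa_index[sha256_digest(data)]
    for cat_id, data in catalogue.items()
    if sha256_digest(data) in wa_index
}
# duplicates maps each catalogue copy to its counterpart in the web archive
```

Hash matching only finds byte-identical copies; re-downloaded or re-generated PDFs with trivial differences would still need fuzzier matching.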

In response to this, Andy has been commissioned to build an over-arching “transformation layer” model that encompasses these strands of work. It’s a difficult task: there’s a need to get away from a traditional workflow, there are serious synchronisation issues, and the sheer volume of content is considerable.

I’m sure the issues of duplication will resonate with most readers of this blog, but there are also interesting questions about reconciling traditional cataloguing with new ways of gathering and understanding digital content. One dimension to Andy's work is the opportunity for sourcing descriptive metadata from outside the process; he makes use of external Government catalogues to find definitive entries for the documents he finds on web pages in PDF form, and is able to fold this information into the process. What evidently appeals to him is the use of automation to save work.
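As a rough illustration of that external-metadata step, a harvested record can be matched to a definitive external entry via a shared identifier, and the two merged. The catalogue entries, identifiers, URL, and field names below are all invented for the sketch:

```python
# Hypothetical external Government catalogue, keyed by a persistent identifier.
external_catalogue = {
    "978-0-10-123456-7": {"title": "Annual Report 2016", "publisher": "HMSO"},
}

# A harvested document with whatever sparse metadata the crawl could recover.
harvested = {
    "url": "https://www.gov.uk/government/publications/report.pdf",
    "isbn": "978-0-10-123456-7",
}

def enrich_from_catalogue(record: dict, catalogue: dict) -> dict:
    """Merge the definitive catalogue entry into the harvested record, if one exists."""
    entry = catalogue.get(record.get("isbn"))
    return {**record, **entry} if entry else dict(record)

enriched = enrich_from_catalogue(harvested, external_catalogue)
# enriched now carries the crawl URL plus the catalogue's title and publisher
```

The hard part in practice is the matching itself: harvested PDFs rarely carry a clean identifier, so real pipelines fall back on titles, dates, and fuzzier heuristics.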

My view as an archivist (and not a librarian) would involve questions such as:

Is MARC cataloguing really suitable for this work? This isn't meant as a challenge to professional librarians; I’d level the same question at ISAD(G), a standard with many deficiencies when it comes to describing digital content adequately. On the other hand, end-users know and love MARC, and are still evidently wedded to accessing content in a subject-title-author manner.

The issue of potential duplication bothers me because (a) it’s wasteful and (b) it increases ambiguity as to which of several copies is the correct one. I’m also interested, as an archivist, in context and provenance; there could be additional valuable contextual information stored in the HTML of the web page and embedded in the PDF properties, and neither of these is guaranteed to be found, or catalogued, by the MARC method. But this raises a question Andy is well aware of: what constitutes a publication?
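As an illustration of the kind of context that lives in the HTML around a PDF, the anchor text of the link pointing at the document can be captured with Python’s standard-library HTML parser. This is only a sketch, with invented page markup and paths:

```python
from html.parser import HTMLParser

class LinkContextParser(HTMLParser):
    """Collect the anchor text of links pointing at PDF documents."""

    def __init__(self):
        super().__init__()
        self._in_pdf_link = False
        self._current_href = None
        self.contexts = {}  # href -> anchor text surrounding the link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.lower().endswith(".pdf"):
                self._in_pdf_link = True
                self._current_href = href

    def handle_data(self, data):
        if self._in_pdf_link:
            existing = self.contexts.get(self._current_href, "")
            self.contexts[self._current_href] = (existing + data).strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_pdf_link = False

page = '<p>Latest release: <a href="/docs/report-2016.pdf">Annual Report 2016</a></p>'
parser = LinkContextParser()
parser.feed(page)
# parser.contexts now pairs the PDF path with its human-readable link text
```

The same idea extends to surrounding paragraphs, headings, and the PDF’s own embedded properties; none of that context survives if only the file itself is catalogued.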

I can see how traditional cataloguers, including my fellow archivists, might find it hard to grasp the value of “reprocessing” in this context. Indeed, it might even seem to cast doubt on the integrity of a web harvest if there’s all this indexing and re-indexing taking place on a digital resource. I would encourage any doubters to see it as a process not unlike “metadata enrichment”, a practice which is gaining ground as we try to archive more digital material; we simply can’t get it right first time, and it’s within the rules to keep adding metadata (be it descriptive or technical, hand-written or automated) as our understanding of the resource deepens and the tools we can use keep improving.
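The idea of enrichment as repeated, recorded passes can be sketched very simply: each pass adds fields to a record and logs its own provenance, so later readers can see which tool or person contributed what, and when. All field and source names here are hypothetical:

```python
from datetime import date

def enrich(record: dict, new_fields: dict, source: str) -> dict:
    """Add metadata to a record, logging which pass supplied which fields."""
    history = record.setdefault("enrichment_history", [])
    history.append({
        "source": source,
        "date": date.today().isoformat(),
        "fields": sorted(new_fields),
    })
    record.update(new_fields)
    return record

record = {"url": "https://example.gov.uk/report.pdf"}
enrich(record, {"mime_type": "application/pdf"}, source="format-identification")
enrich(record, {"title": "Annual Report 2016"}, source="manual-cataloguing")
# record now holds both passes' fields plus a two-entry enrichment history
```

Keeping the history alongside the description means re-indexing never has to erase the trail of how the resource came to be understood.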