Thursday, August 12, 2010

One of the many challenges of our increasingly digital world is that of establishing effective ways of preserving digital information — which is far more fragile than printed material. What are the implications of this for the scholarly record, and where does Open Access (OA) fit into the picture?

In a 1999 report for the Council on Library and Information Resources, Jeff Rothenberg, a senior research scientist at the RAND Corporation, pointed out that while we were generating more and more digital content each year, no one really knew how to preserve it effectively. If we didn't find a way of doing it soon, he warned, "our increasingly digital heritage is in grave risk of being lost."

In launching the UK Web Archive earlier this year, British Library chief executive Dame Lynne Brindley estimated that the Library would only be able to archive about one per cent of the 8.8 million .co.uk domains expected to exist by 2011. The remaining 99 per cent, she said, was in danger of falling into a "digital black hole".

In the context of Rothenberg's earlier warning Brindley's comment might seem to suggest that very little has changed in the past eleven years so far as digital preservation is concerned. But that would be the wrong conclusion to reach. Rather, it draws attention to the fact that digital preservation is not just a technical issue.

As it happens, many of the technical issues associated with digital preservation have now been resolved. In their place, however, a bunch of other issues have emerged — including legal, organisational, social, and financial issues.

What concerns Brindley, for instance, are not the technical issues associated with archiving the Web, but the undesirable barrier that today's copyright laws impose on anyone trying to do so. Since copyright requires obtaining permission from the owner of every web site before archiving it, the task is time-consuming, expensive, and quite often impossible.

Clearly there are implications here for the research community.

State of play

So what is the current state of play so far as preserving the scholarly record is concerned?

First we need to distinguish between two different categories of digital information. There is retro-digitised material, which in the research context consists mainly of data created as a result of research libraries digitising their print holdings — journals, books, theses, special collections etc. Then there is born-digital material — which includes ejournals, eBooks and raw data produced during the research process.

It is worth noting that the quantities of raw data generated by Big Science can be mind-boggling. In the case of the Large Hadron Collider, for instance, CERN expects that it will generate 27 terabytes of raw data every day when it is running at full throttle — plus 10 terabytes of "event summary data".

To cater for this deluge CERN has created a bespoke computing grid called the WLCG. While the costs associated with the WLCG will be shared amongst 130 computing centres around the world, the personnel and materials costs to CERN alone reached 100 million Euros in 2008, and CERN's budget for the grid going forward is 14 million Euros per annum.

Of course, these figures by no means represent preservation costs alone, and they are not typical — but they provide some perspective on the kind of challenges the science community faces.

So far as retro-digitisation is concerned, the Report points out that funding is limited and "the quantity of non-digitised material is huge". Even so, it adds, there is general concern about "the sustainability of hosting" the data that has been generated from digitisation. This is a particular concern for small and medium-sized institutions.

With regard to born-digital material the Report found that the largest gaps are currently in the "provision for perpetual access for e-journals".

The situation with regard to eBooks and databases is less clear since, as the Report points out, "experience in digital preservation with these content types is currently more limited."

While the Report focused on the situation in Germany, the international nature of today's research environment suggests the situation will be similar in all developed nations (although Germany does have two unique mass digitisation centres).

We should not be surprised that the German Report found the largest gap to be in the preservation of journal content. As we shall see, the migration from a print to a digital environment has disrupted traditional practices and responsibilities, and led to some uncertainty about who is ultimately responsible for preserving the scholarly record.

We should also point out that one important area that the German Report did not look at is the growing trend for scholars to make use of blogs, wikis, open notebooks and other Web 2.0 applications. Should this data not be preserved? If it should, whose responsibility is it to do it, and what peculiar challenges does it raise? As we have seen, for instance, preserving web content is not a technical issue alone. Amongst other things there are copyright issues (although as the research community starts to use more liberal copyright licences these difficulties should ease somewhat).

Another recently published report did look at the issue of web-created scholarly content, but reached no firm conclusion. Produced by the Blue Ribbon Task Force, this Report concluded: "[I]n scholarly discourse there is a clear community consensus about the value of e-journals over time. There is much less clarity about the long-term value of emerging forms of scholarly communication such as blogs, products of collaborative workspaces, digital lab books, and grey literature (at least in those fields that do not use preprints). Demand may be hypothesised — social networking sites should be preserved for future generations — but that does not tell us what to do or why."

Open Access

One issue likely to be of interest to OA advocates is whether institutional repositories should be expected to play a part in preserving research output.

Evidence cited by the German Report suggests that repositories are not generally viewed as preservation tools. It pointed out, for instance, that the Dutch National Library's KB e-Depot currently archives the content hosted in 13 institutional repositories in the Netherlands.

The Blue Ribbon Report, by contrast, appears to believe that repositories do have a long-term archiving role. It suggests, for instance, that self-archiving mandates should always be accompanied by a "preservation mandate".

The Report goes on to suggest that the inevitable additional costs associated with repository preservation should be taken out of the institution's Gold OA fund (where such a fund exists).

##

If you wish to read the rest of this introduction, and the interview with preservation specialist Neil Beagrie, please click on the link below. I am publishing it under a Creative Commons licence, so you are free to copy and distribute it as you wish, so long as you credit me as the author, do not alter or transform the text, and do not use it for any commercial purpose.

5 comments:

The trouble with universities (or nations) treating digital preservation (which is a genuine problem, and a genuine responsibility) as a single generic problem -- covering all the university's (or nation's) "digital output," whether published or unpublished, OA or non-OA -- is not only that adding an additional preservation cost and burden where it is not yet needed (by conflating Green OA self-archiving mandates with preservation mandates and their funding demands) makes it even harder to get a Green OA self-archiving mandate adopted at all. Taking an indiscriminate, scattershot approach to the preservation problem also disserves the digital preservation agenda itself.

As usual, what is needed is to sort out and understand the actual contingencies, and then to implement the priorities, clearly and explicitly, in the requisite causal order. The priorities here are to focus university (or national) preservation efforts and funds on what needs to be preserved today. And -- as far as universities' own institutional repositories (IRs) are concerned -- that does not include the publisher's official version-of-record for that university's (or nation's) journal article output. Preserving those versions-of-record is a matter to be worked out among deposit libraries and the publishers and institutional subscribers of the journals in question. The university's IR is for providing OA to the author's final, refereed draft of those articles, for those users worldwide who do not have subscription access to the version-of-record. The author's draft needs preservation too, but that's not the same problem as the problem of preserving the published version-of-record (nor is it the same document!). (Continued in Part 2, below.)

Perhaps one day universal Green OA mandates will cause journal subscriptions to become unsustainable, because the worldwide users of journal articles will be fully satisfied with the author's final drafts rather than needing the publisher's version-of-record, and hence journal subscriptions are cancelled. If and when we ever reach that point, the version-of-record will no longer be produced by the publisher, because the authors' drafts will effectively become the version-of-record. Publishers will then convert to Gold OA, with what remains of the cost of publication paid for by institutions, per individual article published, out of their windfall subscription cancellation savings. (Some of those savings can then also be devoted to digital preservation of the institutional version-of-record.)

But conflating the (nonexistent) need to pay for this hypothetical future contingency today with either universities' (or nations') digital preservation agenda or their OA IR agenda is not only incoherent but counterproductive.

Let's keep the agendas distinct: IRs can archive many different kinds of content. Work to preserve it all, but do not mistake that preservation function for journal article preservation or OA. For journal articles, worry about preserving the version-of-record -- and that has nothing to do with what is being deposited in IRs today. For OA, worry about mandating deposit of the author's version -- and that has nothing to do with digital preservation of the version-of-record. Nor should the need to mandate depositing the author's version be in any way hamstrung with extra expenses that concern the publisher's version-of-record, or the university's IR, or OA. (Exactly the same thing is true, mutatis mutandis, at the national preservation level, insofar as journal articles are concerned: Journal contents do not all come from one institution, nor from one nation.)

And, while we're at it, let's also keep university (or national) funding of Gold OA publishing costs distinct from the Green OA mandating agenda too. First things first. Needlessly over-reaching (for Gold OA funds or preservation funds) simply delays getting what is already fully within universities' (and nations') grasp, which is to provide OA to the authors' drafts of all their refereed journal articles by requiring them to be deposited in their OA IRs -- not to reform journal publishing, or solve the digital preservation problem.

We may have witnessed a golden age of digital preservation tools, and some of these have been built into repository software interfaces. To explore the practical application for repositories, see our structured and fully documented KeepIt course on digital preservation tools for repository managers:

Without commenting on priorities here, IRs are much wider than OA papers. For IR preservation it's this broad scope that matters, then how policy deals with the specifics, rather than simply OA concerns.

As one of the members of the Blue Ribbon Task Force, I would agree that repositories have a wider role than Harnad envisages. Generally, the Task Force took a broader view of preservation than the "digital preservation" community often does. We aimed at ensuring future access (timescale not defined) rather than ensuring true fidelity in perpetuity. The first key to future access is not to lose the stuff in the first place! Repositories have a key role here. You don't place stuff in repositories knowing they will throw it away! In fact, a key argument why people should put their stuff in repositories rather than on their personal web site (where it is equally accessible... at least for now) is that the repository has greater persistence. So repositories have a de facto preservation role even if they do not attempt the whole OAIS compliance shenanigans.

If you run a "not for profit" organisation, a key point to remember is that it is "not for loss" (otherwise you go bust). Likewise, repositories are "not for loss". And this is the first step on the road to preservation. You can't preserve what doesn't exist in the first place.

It is neither the importance of digital preservation nor the "width" of repositories' role that I questioned. It was the somnambulistic conflation of (1a) the goal of preserving the publisher's version-of-record (not in the institutional repository!) with (1b) the goal of preserving the author's final draft (in the institutional repository) -- as well as the conflation (same dreamscape) of (2a) the role of the institutional repository in providing access to the target content of the open access movement -- each institution's own peer-reviewed journal article output -- with (2b) the role of the institutional repository in preserving its own digital output of any description.