Monthly Archives: October 2017

Cambridge Policy & Planning Fellow, Somaya, writes about her paper and presentation from the Digital Culture Heritage Conference 2017. The conference paper, Planning for the End from the Start: an Argument for Digital Stewardship, Long-Term Thinking and Alternative Capture Approaches, looks at considering digital preservation at the start of a digital humanities project and provides useful advice for digital humanities researchers to use in their current projects.

I presented the Friday morning Plenary session on Planning for the End from the Start: an Argument for Digital Stewardship, Long-Term Thinking and Alternative Capture Approaches. Otherwise known as: ‘planning for your funeral when you are conceived’. This is a presentation that represents challenges faced by both Oxford and Cambridge and the thinking behind this has been done collaboratively by myself and my Oxford Policy & Planning counterpart, Edith Halvarsson.

We decided it was a good idea to present on this topic to an international digital cultural heritage audience, who are likely to also experience similar challenges as our own researchers. It is based on some common digital preservation use cases that we are finding in each of our universities.

The Scenario

A Digital Humanities project receives project funding and develops a series of digital materials as part of the research project, and potentially some innovative tools as well. For one reason or another, ongoing funding cannot be secured and so the PIs/project team need to find a new home for the digital outputs of the project.

Example Cases

We have numerous examples of these situations at Cambridge and Oxford. Many projects containing digital content that needs to be ‘rehoused’ are created in the online environment, typically as websites. Some examples include:

Holistic Thinking

We believe that thinking holistically right at the start of a project can provide options further down the line, should an unfavourable funding outcome be received.

So it is important to consider holistic thinking, specifically a Digital Stewardship approach (incorporating Digital Curation & Digital Preservation).

Models for Preservation

Digital materials don’t necessarily exist in a static form and often they don’t exist in isolation. It’s important to think about digital content as being part of a lifecycle and managed by a variety of different workflows. Digital materials are also subject to many risks so these also need to be considered.

Documentation

It is incredibly important to document your project and when handing over the responsibility of your digital materials and data, also handing over documentation to someone responsible for hosting or preserving your digital project will need to rely on this information. Also ensuring the implementation of standards, metadata schemas and persistent identifiers etc.

Borrowing from Other Disciplines

Rather than having to ‘rebuild the wheel’, we should also consider borrowing from other disciplines. For example, borrowing from the performing arts we might provide similar documents and information such as:

Technical Rider (a list of requirements for staging a music gig and theatre show)

Stage Plots (layout of instruments, performers and other equipment on stage)

Input Lists (ordered list of the different audio channels from your instruments/microphones etc. that you’ll need to send to the mixing desk)

For digital humanities projects and other complex digital works, providing simple and straight forward information about data flows (including inputs and outputs) will greatly assist digital preservationists in determining where something has broken in the future.

Approaches

Here are some approaches to consider in regards to interim digital preservation of digital materials:

Bundling & Bitstream Preservation

The simplest and most basic approach may be to just zip up files and undertake bitstream preservation. Bitstream preservation only ensures that the zeroes and ones that went into a ‘system’ come out as the same zeroes and ones. Nothing more.

Exporting / Migrating

Consider exporting digital materials and/or data plus metadata into recognised standards as a means of migrating into another system.

For databases, the SIARD (Software Independent Archiving of Relational Databases) standard may be of use.

Hosting Code

Consider hosting code within your own institutional repository or digital preservation system (if your organisation has access to this option) or somewhere like GitHub or other services.

Packing it Down & ‘Putting on Ice’

You may need to consider ‘packing up’ your digital materials and doing it in a way that you can ‘put it on ice’. Doing this in a way that – when funding is secured in the future – it can be somewhat simply be brought back to life.

An example of this is the the work that Peter Sefton, from the University of Sydney in Australia, has been trialling. Based on Omeka, he has created a version of code called OzMeka. This is an attempt at a standardised way of being able to handle research project digital outputs that have been presented online. One example of this is Dharmae.

Alternatively, the Kings Digital Lab, provide infrastructure for eResearch and Digital Humanities projects that ensure the foundations of digital projects are stable from the get-go and mitigates risks regarding longer-term sustainability of digital content created as part of the projects.

Maintaining Access

This could be done through traditional web archiving approaches, such as using tools Web Archiving Tools (Heritrix or HTTrack) or downloading video materials using Video Download Helper for video. Alternatively, if you are part of an institution, the Internet Archive’s ArchiveIt service may be something you want to consider and can work with your institution to implement this.

Hosted Infrastructure Arrangements

Finding another organisation to take on the hosting of your service. If you do manage to negotiate this, you will need to either put in place a contract or Memorandum of Understanding (MOU) as well as handing over various documentation, which I have mentioned earlier.

Video Screen Capture

A simple way of attempting to document a journey through a complex digital work (not necessarily online, this can apply to other complex interactive digital works as well), may be by way of capturing a Video Screen Capture.

Kymata Atlas – Video Screen Capture still

Alternatively, recording a journey through an interactive website using the Webrecorder, developed by Rhizome, which will produce WARC web archive files.

Documenting in Context

Another means of understanding complex digital objects is to document the work in the context in which it was experienced. One example of this is the work of Robert Sakrowski and Constant Dullart, netart.database.

An example of this is the work of Dutch and Belgian net.artists JODI (Joan Heemskerk & Dirk Paesmans) shown here.

JODI – netart.database

Borrowing from documenting and archiving in the arts, an approach of ‘documenting around the work‘ might be suitable – for example, photographing and videoing interactive audiovisual installations.

Web Archives in Context

Another opportunity to understand websites – if they have been captured by the Internet Archive – is viewing these websites using another tool developed by Rhizome, oldweb.today.

An example of the Cambridge University Library website from 1997, shown in a Netscape 3.04 browser.

Cambridge University Library website in 1997 via oldweb.today

Conclusions

While there is no one perfect solution and each have their own pros and cons, using an approach that combines different methods might make your digital materials available post the lifespan of your project. These methods will help ensure that digital material is suitably documented, preserved and potentially accessible – so that both you and others can use the data in an ongoing manner.

Consider:

How you want to preserve the data?

How you want to provide access to your digital material?

Developing a strategy including several different methods.

Finally, I think this excerpt is relevant to how we approach digital stewardship and digital preservation:

“No man is an island entire of itself; every man is a piece of the continent, a part of the main” – Meditation XVII, John Donne

We are all in this together and rather than each having to troubleshoot alone and building our own separate solutions, it would be great if we can work to our strengths in collaborative ways, while sharing our knowledge and skills with others.

Edith, Policy and Planning Fellow at Bodleian Libraries, writes about her favourite features in ePADD (an open source software for email archives) and about how the tool aligns with digital preservation workflows.

At iPres a few weeks ago I had the pleasure of attending an ePadd workshop ran by Josh Schneider from Stanford University Libraries. The workshop was for me one of the major highlights of the conference, as I have been keen to try out ePADD since first hearing about it at DPC’s Email Preservation Day. I wrote a blog about the event back in July, and have now finally taken the time to review ePADD using my own email archive.

ePADD is primarily for appraisal and delivery, rather than a digital preservation tool. However, as a potential component in ingest workflows to an institutional repository, ensuring that email content retains integrity during processing in ePADD is paramount. The creators behind ePADD are therefore thinking about how to enhance current features to make the tool fit better into digital preservation workflows. I will discuss these features later in the blog, but first I wanted to show some of the capabilities of ePADD. I can definitely recommend having a play with this tool yourself as it is very addictive!

ePADD: Appraisal module dashboard

Josh, our lovely workshop leader, recommends that new ePADD users go home and try it on their own email collections. As you know your own material fairly well it is a good way of learning about both what ePADD does well and its limits. So I decided to feed in my work emails from the past year into ePADD – and found some interesting trends about my own working patterns.

ePADD consists of four modules, although I will only be showing features from the first two in this blog:

Module 1: Appraisal (Module used by donors for annotation and sensitivity review of emails before delivering them to the archive)

Module 2: Processing (A module with some enhanced appraisal features used by archivist to find additional sensitive information which may have been missed in the first round of appraisal)

Module 4: Delivery (This module provides more enhanced viewing of the content of the email archive – including a gallery for viewing images and other document attachments)

Note that ePADD only support MBOX files, so if you are an Outlook user like myself you will need to first convert from PST to MBOX. After you have created an MBOX file, setting up ePADD is fairly simple and quick. Once the first ePADD module (“Appraisal”) was up and running, processing my 1,500 emails and 450 attachments took around four minutes. This time includes time for natural language processing. ePADD recognises and indexes various “entities” – including persons, places and events – and presents these in a digestible way.

ePADD: Appraisal module processing MBOX file

Looking at the entities recognised by ePADD, I was able to see who I have been speaking with/about during the past year. There were some not so surprising figures that popped up (such as my DPOC colleagues James Mooney and Dave Gerrard). However, curiously I seem to also have received a lot of messages about the “black spider” this year (turns out they were emails from the Libraries’ Dungeons and Dragons group).

ePADD entity type: Person (some details removed)

An example of why you need to look deeper at the results of natural language processing was evident when I looked under the “place entities” list in ePADD:

ePADD entity type: Place

San Francisco comes highest up on the list of mentioned places in my inbox. I was initially quite surprised by this result. Looking a bit closer, all 126 emails containing a mention of San Francisco turned out to be from “Slack”. Slack is an instant messaging service used by the DPOC team, which has its headquarters in San Francisco. All email digests from Slack contains the head office address!

Another one of my favourite things about ePADD is its ability to track frequency of messages between email accounts. Below is a graph showing correspondence between myself and Sarah Mason (outreach and training fellow on the DPOC project). The graph shows that our peak period of emailing each other was during the PASIG conference, which DPOC hosted in Oxford at the start of September this year. It is easy to imagine how this feature could be useful to academics using email archives to research correspondence between particular individuals.

ePADD displaying correspondence frequency over time between two users

The last feature I wanted to talk about is “sensitivity review” in ePADD. Although I annotate personal data I receive, I thought that the one year mark of the DPOC project would also be a good time to run a second sensitivity review of my own email archive. Using ePADD’s “lexicon hits search” I was able to sift through a number of potentially sensitive emails. See image below for categories identified which cover everything from employment to health. These were all false positives in the end, but it is a feature I believe I will make use of again.

ePADD processing module: Lexicon hits for sensitive data

So now on to the Digital Preservation bit. There are currently three risks of using ePADD in terms of preservation which stands out to me.

1) For practical reasons, MBOX is currently the only email format option supported by ePADD. If MBOX is not the preferred preservation format of an archive it may end up running multiple migrations between email formats resulting in progressive loss of data

2) There are no checksums being generated when you download content from an ePADD module in order to copy it onto the next one. This could be an issue as emails are copied multiple times without monitoring of the integrity of the email archive files occurring

3) There is currently limited support for assigning multiple identifiers to archives in ePADD. This could potentially become an issue when trying to aggregate email archives from different intuitions. Local identifiers could in this scenario clash and other additional unique identifiers would then also be required

Note however that these concerns are already on the ePADD roadmap, so they are likely to improve or even be solved within the next year.

To watch out for ePADD updates, or just have a play with your own email archive (it is loads of fun!), check out their:

Bodleian Digital Library Systems and Services’ Digital Curator, Emma Stanford, guest blogs for the DPOC project this week. Emma writes about what she is doing to close some of the 6-million-image gap between what’s in our tape archive and what’s available online at Digital.Bodleian. It’s no small task, but sometimes Emma finds some real gems just waiting to be made available to researchers. She also raises some good questions about what metadata we should make available to researchers to interpret our digitized image. Read more from Emma below.

Thanks to Edith’s hard work, we now know that the Bodleian Imaging Services image archive contains about 5.8 million unique images. This is in addition to various images held on hard drives and other locations around the Bodleian, which bring the total up to almost 7 million. Digital.Bodleian, however, our flagship digital image platform, contains only about 710,000 unique images–a mere tenth of our total image archive. What gives?

That 6-million-image gap consists of two main categories:

Images that are online elsewhere (aka the migration backlog). In the decades before Digital.Bodleian, we tried a number of other image delivery platforms that remain with us today: Early Manuscripts at Oxford University, the Toyota City Imaging Project, the Oxford Digital Library, Luna, etc., etc. Edith has estimated that the non-Digital.Bodleian content comprises about 1.4 million images. Some of these images don’t belong in Digital.Bodleian, either because we don’t have rights to the images (for example, Queen Victoria’s Journals) or because they are incomplete selections rather than full image sets (for example, the images in the Bodleian Treasures exhibition). Our goal is to migrate all the content we can to Digital.Bodleian and eventually shut down most of the old sites. We’ve been chipping away at this task very slowly, but there is a lot left to do.

Images that have never been online. Much of Imaging Services’ work is commercial orders: shooting images for researchers, publishers, journalists, etc. We currently store all these images on tape, and we have a database that records the shelfmark, number of images, and list of captured pages, along with information about when and how the images were captured. Searching through this archive for Digital.Bodleian-appropriate images is a difficult task, though. Shelfmark notation isn’t standardized at all, so there are lots of duplicate records. Also, in many cases, just a few pages from a book or manuscript were captured, or the images were captured in black-and-white or grayscale; either way, not suitable for Digital.Bodleian, where we aim to publish fully-digitized works in full colour.

I’m working on extracting a list of complete, full-colour image sets from this database. In the meantime, we’ve started approaching the problem from the other direction: creating a list of items that we’d like to have on Digital.Bodleian, and then searching the archive for images of them. To do this, we asked the Bodleian’s manuscript and rare book curators to share with us their lists of “greatest hits”: the Bodleian’s most valuable, interesting, and/or fragile holdings, which would benefit most from online surrogates. We then began going through this list searching for the shelfmarks in the image archive. Mostly, we’ve found only a few images for each shelfmark, but occasionally we hit the jackpot: a complete, full-colour image set of a 13th-century bestiary or a first edition of a Shakespeare play.

Going through the archives in this way has underlined for me just how much the Bodleian’s imaging standards have changed in the last two decades. File size has increased, of course, as higher-resolution digital scanning backs have become available; but changes in lighting equipment, book cradles, processing software, rulers and colour charts have all made their mark on our images too. For me, this has raised the question of whether the technical metadata we’re preserving in our archives, about when and how the images were captured, should also be made available to researchers in some way, so that they can make an informed choice about how to interpret the images they encounter on sites like Digital.Bodleian.

In the meantime, here are some of the image sets we’ve pulled out of the archive and digitized so far:

An issue with thinking about digital ‘stuff’ too much in terms of tangible objects is that opportunities related to the fact the ‘stuff’ is digital can be missed. Matt Zumwalt highlighted one such opportunity in Data together: Communities & institutions using decentralized technologies to make a better web when he introduced ‘content addressing’: using cryptographic hashing and Directed Acyclic Graphs (in this case, information networks that record content changing as time progresses) to manage many copies of ‘stuff’ robustly.

This addresses some of the complexities of preserving digital ‘stuff’, but perhaps thinking in terms of ‘copies’, and not ‘branches’ or ‘forks’ is an over simplification? Precisely because digital ‘stuff’ is rarely static, all ‘copies’ have the potential to deviate from the ‘parent’ or ‘master’ copy. What’s the ‘version of true record’ in all this? Perhaps there isn’t one? Matt referred to ‘immutable data structures’, but the concept of ‘immutability’ only really holds if we think it’s possible for data to ever be completely separated from its informational context, because the information does change, constantly. (Hold that thought).

The need for greater simplicity in preservation was further emphasised by Mathieu Giannecchini’s The Eclair Archive cinema heritage use case: Rising to the challenges of complex formats at large scale. Again – space precludes me from getting into detail, but the key takeaway was that Mathieu has 2 million reels of film to preserve using the Digital Cinema Distribution Master (DCDM) format, and after lots of good work, he’s optimised the process to preserve 8tb a day, (with a target of 15tb). Now, we don’t know how much film is on each reel, but assuming a (likely over-) estimate of 10 minutes per reel, that’s roughly 180,000 films of 1 hour 50 mins in length. Based on Mathieu’s own figures, it’s going to take many decades, perhaps even a few hundred years, to get through all 2 million reels… So further, major optimisations are required, and I suspect DCDM (a format with a 155-page spec, which relies on TIFF, a format with a 122-page spec) might be one of the bottlenecks.

Of course, the trade-off with simplifying formats is that data will likely be ‘decontextualised’, so there must be a robust method for linking data back to context… Thoughts on this were triggered by Developing and applying principles for discovery and access for the UK Data Service by Katherine McNeill from the UK Data Archive, as Katherine discussed production of a next-generation access system based on a linked-data model with which, theoretically, single cells’ worth of data could be retrieved from research datasets.

William’s PASIG talk Sustainable digital futures was also one of two that got closer to what we know are the root of the preservation problem; economics. The other was Aging of digital: Managed services for digital continuity by Natasa Milic-Frayling, which flagged-up the current “imbalance in control and empowerment” between tech providers and content producers / owners / curators, an imbalance that means tech firms can effectively doom our digital ‘stuff’ into obsolescence, and we have to suck it up.

I think this imbalance in part exists because there’s too much technical context related to data, because it’s generally in the tech providers’ interests to bloat data formats to match the USPs of their software. So, is a pure ‘preservation format’ one in which the technical context of the data is generalised to the point where all that’s left is commonly-understood mathematics? Is that even possible? Do we really need 122-page specs to explain how raster image data is stored? (It’s just an N-dimensional array of pixel values…, isn’t it…?) I think perhaps we don’t need all the complexity – at the data storage level at least. Though I’m only guessing at this stage: much more research required.