Friday, 20 April 2012

Earlier this week I attended a three-day mashup event in
Glasgow, organised as part of the SPRUCE project. SPRUCE aims to enable
Higher Education Institutions to address preservation gaps and articulate the
business case of digital preservation, and the mashup serves as a way to bring
practitioners and developers together to work on these problems. Practitioners
took along a collection which they were having issues with, and were paired off
with a developer who could work on a tool to provide a solution.

Day 1

After some short presentations on the purpose of SPRUCE and
the aims of the mashup, we practitioners gave lightning talks on our collections and problems. These included dealing with email attachments, preserving content from Facebook, software emulation, black areas in scanned images, and identifying
file formats with incorrect extensions, amongst others. I took along some disk images, as we find it very time-consuming to find out date ranges, file types
and content of the files in the disk image, and we wanted a more efficient way to get this metadata. More information on the
collections and issues
presented can be found at the wiki.

After a short break for coffee (and excellent cakes and
biscuits) we were sorted into small groups of collection owners and developers
to discuss our issues in more detail. In my
group this led to conversations about natural language processing, and the
possibilities of using predefined subjects to identify files as being about a
particular topic, which we thought could be really helpful but nearly
impossible to create in a couple of days! We were then allocated our
developers. As there were a few of us with problems with file identification,
we were assigned to the same developer, Peter May from the BL. The day ended
with a short presentation from William Kilbride on the value of digital collections and Neil Beagrie's benefits framework.

Day 2

The developers were packed off to another room to
work on coding, while we collection owners started to look into the business
case for digital preservation. We used Beagrie’s framework to consider
the three dimensions of benefits (direct or indirect, near- or long-term, and
internal or external), as they apply to our institutions. When we reported back, it was interesting to see how different organisations
benefit in different ways. We also looked at various stakeholders and how
important or influential they are to digital preservation. Write-ups of these
sessions are also available at the wiki.

The developers came back at several points throughout the
day to share their progress with us, and by lunchtime the first solution had
been found! The first steps towards solving our problem were also being made: Peter had
found a program, Apache Tika, which can parse a file and extract metadata, and had written a script that works through a directory
of files and outputs the information to a CSV spreadsheet. This was a
really promising start, especially given the amount of
metadata that could potentially be extracted (provided it exists within the
file), and the ability to identify the content type of files with incorrect extensions.
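
For anyone curious what a script of this general shape might look like, here is a minimal sketch in Python. It is not Peter's script and it does not use Apache Tika: as a stand-in, it recognises only a handful of well-known file signatures by reading the first bytes of each file, which is just enough to show how a type can be detected regardless of the extension. The function names, the column names and the short signature list are all my own assumptions for illustration.

```python
import csv
import os
from datetime import datetime, timezone

# A few well-known file signatures ("magic numbers"). This short list is an
# illustrative stand-in; a real tool such as Apache Tika recognises far more.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF8", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_type(path):
    """Guess a content type from the first bytes, ignoring the extension."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, content_type in SIGNATURES:
        if header.startswith(magic):
            return content_type
    return "application/octet-stream"  # unknown


def survey_directory(root, csv_path):
    """Walk root and write one CSV row per file; return the number of files.

    csv_path should sit outside root, otherwise the CSV itself gets listed.
    Column names are hypothetical, chosen for this sketch.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(
            ["path", "extension", "detected_type", "size_bytes", "modified"]
        )
        for folder, _subdirs, names in os.walk(root):
            for name in sorted(names):
                full = os.path.join(folder, name)
                info = os.stat(full)
                modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                writer.writerow([
                    os.path.relpath(full, root),
                    os.path.splitext(name)[1].lower(),
                    sniff_type(full),
                    info.st_size,
                    modified.date().isoformat(),
                ])
                count += 1
    return count
```

Calling `survey_directory("some_folder", "survey.csv")` would produce a spreadsheet in which a PDF saved as `report.txt` shows `.txt` in one column and `application/pdf` in the next, which is the kind of mismatch we wanted to be able to spot.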

Day 3

We had another catch-up with the developers to hear about their
overnight progress. Peter had written a script that took the
information from the CSV file and summarised it into
one row, so that it would fit into the spreadsheets we use at BEAM. Unfortunately,
mounting the ISO image to check it with Apache Tika was slightly more
complicated than anticipated, so our disk images couldn't be checked this way without further work.
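
To give an idea of what summarising a per-file CSV into a single row can involve, here is a rough Python sketch. It is my own illustration rather than Peter's script, and it assumes a CSV with hypothetical columns named `detected_type` and `modified`, where `modified` holds dates in YYYY-MM-DD form (so they sort correctly as plain text). The summary fields are likewise assumptions, based on the date ranges and file types we wanted to capture.

```python
import csv
from collections import Counter


def summarise(csv_path):
    """Collapse a per-file CSV into one summary row (returned as a dict).

    Assumes hypothetical columns 'detected_type' and 'modified', with
    'modified' as an ISO date (YYYY-MM-DD) or empty when unknown.
    """
    dates = []
    types = Counter()
    with open(csv_path, newline="", encoding="utf-8") as source:
        for row in csv.DictReader(source):
            types[row["detected_type"]] += 1
            if row["modified"]:
                dates.append(row["modified"])
    return {
        "file_count": sum(types.values()),
        # ISO dates compare correctly as strings, so min/max give the range.
        "earliest": min(dates) if dates else "",
        "latest": max(dates) if dates else "",
        "types": "; ".join(
            f"{name} ({number})" for name, number in sorted(types.items())
        ),
    }
```

The returned dictionary could then be written out as a single line with `csv.DictWriter` and pasted into an existing spreadsheet.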

While the developers set about finalising their solutions,
we continued to work on the business case, doing a skills gap analysis to consider whether our institutions had the skills and resources to carry out
digital preservation. Reporting back, we had a very interesting discussion on
skills gaps within the broader archives sector, and the need to provide digital
preservation training to students as well as existing professionals. We then
had to prepare an ‘elevator pitch’ for those occasions when we find ourselves
in a lift with senior management, which neatly brought together all the things
we had discussed, as we had to explain the specific benefits of digital preservation to
our institution and our goals in about
a minute.

To wrap up, the developers presented their solutions, which solved many of the problems we had arrived with. A last-minute breakthrough in mounting ISO images using
WinCDEmu and running scripts on them meant
that we are now able to use the Tika script on our disk images. However, because we were so short
on time, there are still some small problems that need addressing. I'm really happy with our solution, and I was very impressed by all the developers and how much
they were able to get done in such a short space of time.

I felt that this event was a very useful way to start thinking about the business case for what we do, and to see what other
people within the sector are doing and what problems they are facing. It was
also really helpful as a non-techie to talk with developers and get an idea of what kinds of tools it is
possible to build (and to get them made!). I would definitely
recommend this type of event – in fact, I’d love to go along
again if I get the opportunity!
