Sunday, 11 March 2012

Crowdsourcing transcription of handwritten archives

One of the big differences between libraries and archives is that libraries
tend to have more of ‘the printed word’ whilst archives have vast amounts of
‘the handwritten record’. While some libraries are getting up to speed with
mass digitisation of books and journals, and can then offer users full-text
searchable digitised items, this is still a distant dream for most archives.
Some archives are undertaking mass digitisation, but the second step, making
handwritten records full-text searchable, is a massive challenge. The reason
for this lies in the technology and processing steps.

After a ‘printed word’ page is scanned into an image file, a piece of software
called Optical Character Recognition (OCR) converts the image into searchable
text. OCR works best with clean, clear, black and white typeface such as a
Word document or a book, not quite so well on old books and journals, and very
poorly on old newspapers. When it comes to converting handwriting it fails
miserably. It just can’t distinguish and convert handwriting to text in the
way the human eye can. Therefore archives can’t easily automate the second
part of the digitisation process using OCR software like libraries can for the
printed word.

If you at least get some OCR text from print that is readable, and therefore
searchable, you can offer users a service to full-text search the books or
journals, as Google does. If the OCR text is poor there are some things you
can do to improve it. You can encourage users of your service to correct the
OCR text with a text correction tool so that the searching is improved, as
Trove does with the Australian Newspapers.
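
To see why corrected text improves searching, here is a minimal sketch in Python. It is only an illustration and not how Trove or any particular service works: the page text, the OCR misreads and the function names `build_index` and `search` are all made up for the example. It builds a simple word index over a page and runs the same query against the raw OCR text and the corrected text.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted ids of pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical raw OCR output for one newspaper page, with typical misreads.
raw_ocr = {"page-1": "The Sydnev Morn1ng Herald reports a flre in the harbour"}

# The same hypothetical page after a volunteer has corrected the text.
corrected = {"page-1": "The Sydney Morning Herald reports a fire in the harbour"}

print(search(build_index(raw_ocr), "sydney fire"))    # prints []
print(search(build_index(corrected), "sydney fire"))  # prints ['page-1']
```

The raw OCR page is never found because the misread words ‘Sydnev’ and ‘flre’ do not match the query; once the text is corrected, the same search finds the page.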

Unfortunately the only viable option open to archives to convert digital
images into full-text searchable text is to use a manuscript transcription
tool, in combination with harnessing the power of a crowd to do the
transcription work. The transcription work for handwritten records is much
harder than, for example, text correcting old newspapers because the
handwriting is often difficult to read, old fashioned, barely legible and not
necessarily structured in lines or columns. There is often nothing to go on.

I recently stumbled across a blog all about manuscript transcription tools
that is written by a software developer, Ben W Brumfield, in Texas. Ben
developed his own software to transcribe his great-great grandmother’s
journal. ‘FromThePage’ is now being used by archives because Ben has made it
available open source.

A year ago he wrote an in-depth blog post that covered manuscript
transcription tools under development and manuscript transcription projects in
archives, and made some predictions for future directions of manuscript
transcription. I am not going to repeat what he said here; I suggest you read
the post in full. He notes that software development in this area is still
fragmented and young, with no particular tool taking dominance. Most developed
applications are being made available open source. A standout is ‘Scribe’ from
the Zooniverse team, currently being used both by the ‘Old Weather’ project to
transcribe maritime weather records and by the ‘What’s the score’ project to
transcribe music scores at the Bodleian Library, Oxford.

Before an archive implements a manuscript transcription tool it needs to find
out what its users would most like to be easily full-text searchable from the
vast vaults of content it holds. It is important to find this out, because the
crowd will only be motivated and swell in numbers if they really feel that
what they are doing is interesting, is very important to a broad group of
people, and really matters either right now or in the long term. They have to
feel this before they will join in. Once they have joined in there are other
motivational techniques you can use to keep them going. Just implementing a
manuscript transcription tool is simply not enough. You need to engage, watch,
understand and learn from your crowd, for they hold the passion and power in
their hands to make your project successful or not.

About Me

The views expressed on this blog are my personal views and do not represent the official views of any institution or organisation. I have been working in GLAMs (Galleries, Libraries, Archives and Museums) for 30 years. For the last 16 years I have been project managing large collaborative digital projects in New Zealand and Australia, including the Australian Newspapers Digitisation Program and Trove. I have a particular interest in crowdsourcing. I currently work at UNSW. My blog is dedicated to the memory of Paul Reynolds http://www.peoplepoints.co.nz/ who encouraged and inspired me in my digital endeavours.