Archive for April, 2013

This is an outline of how we here at Notre Dame have been making digitized versions of our Catholic pamphlets available on the Web — a workflow:

Save PDF files to a common file system – This can be as simple as a shared hard disk or removable media.

Ingest PDF files into Fedora to generate URLs – The PDF files are saved in Fedora for the long haul.

Create persistent URLs and return a list of system numbers and… URLs – Each PDF file is given a PURL for the long haul. Output a delimited file containing system numbers in one column and PURLs in another. (Steps #2 and #3 are implemented with a number of Ruby scripts: batch_ingester.rb, book.rb, mint_purl.rb, purl_config.rb, purl.rb, repo_object.rb.)
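The delimited output of step #3 might look like the following sketch. The `mint_purl` method and the resolver URL here are stand-ins for the real PURL-server registration done by mint_purl.rb, not the production code:

```ruby
PURL_BASE = 'http://purl.example.edu/pamphlets'.freeze  # hypothetical resolver

# Stand-in for the real minting call in mint_purl.rb; the production script
# registers each Fedora URL with a PURL server rather than building a string.
def mint_purl(system_number)
  "#{PURL_BASE}/#{system_number}"
end

# Pair each Aleph system number with its PURL and write a tab-delimited list.
system_numbers = %w[000123456 000123457 000123458]
lines = system_numbers.map { |sysno| "#{sysno}\t#{mint_purl(sysno)}" }
File.write('purls.tsv', lines.join("\n") + "\n")
puts lines.first
```

The resulting purls.tsv is exactly the two-column file the later database-update and MARC-editing steps consume.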

Update Filemaker database with URLs for quality assurance purposes – Use the PURLs from the previous step and update the local database so we can check the digitization process.

Start quality assurance process and cook until done – Look at each PDF file to make sure it has been digitized correctly and thoroughly. Return poorly digitized items to the digitization process.

Use system numbers to extract MARC records from Aleph – The file names of each original PDF document should be an Aleph system number. Use the list of numbers to get the associated bibliographic data from the integrated library system.
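Because this step relies on the convention that each PDF's file name is its Aleph system number (e.g. 000123456.pdf), recovering the list of numbers is a one-liner. The directory and file names below are made up for illustration:

```ruby
# Derive Aleph system numbers from PDF file paths, assuming the
# file-name-is-system-number convention described above.
def system_numbers(pdf_paths)
  pdf_paths.map { |path| File.basename(path, '.pdf') }
end

pdfs = ['scans/000123456.pdf', 'scans/000123457.pdf']
puts system_numbers(pdfs)
```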

Edit MARC records to include copyright information and URLs to PDF files – Update the bibliographic records using scripts called list-copyright.pl and update-marc.pl. The first script outputs a list of copyright information that is used as input for the second script, which adds the copyright information as well as pointers to the PDF documents.
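Conceptually, the hand-off between the two scripts is a merge keyed on system number: the copyright list from list-copyright.pl joined with the PURLs so both can be written into each record. All of the values in this toy version are invented:

```ruby
# Toy merge of copyright statements and PURLs, keyed by system number;
# the real update is performed against MARC records by update-marc.pl.
copyrights = {
  '000123456' => 'Copyright University of Notre Dame',
  '000123457' => 'Public domain'
}
purls = {
  '000123456' => 'http://purl.example.edu/pamphlets/000123456',
  '000123457' => 'http://purl.example.edu/pamphlets/000123457'
}

updates = copyrights.map do |sysno, statement|
  { sysno: sysno, copyright: statement, url: purls[sysno] }
end
puts updates.length
```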

Duplicate MARC records and edit them to create electronic resource records – Much of this work is done using MARCEdit.

Check records for correctness – Given enough eyes, all bugs are shallow.

Put newly edited records into Aleph production – Make the newly created records available to the public.

Extract newly created MARC records with new system numbers – These numbers are needed for the concordance program — a way to link back from the concordance to the full bibliographic record.

Update concordance database and texts – Use something like pdftotext to extract the OCR from the scanned PDF documents. Save the text files in a place where the concordance program can find them. Update the concordance’s database linking keys to bibliographic information as well as locations of the text files. All of this is done with a script called extract.pl.
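The OCR-extraction half of this step (the real work lives in extract.pl) amounts to shelling out to pdftotext once per pamphlet, writing each plain-text file where the concordance expects it. The paths here are assumptions:

```ruby
require 'shellwords'

# Build a pdftotext command that writes scans/NNN.pdf to text_dir/NNN.txt.
# Shellwords.escape guards against odd characters in file names.
def extract_command(pdf, text_dir)
  txt = File.join(text_dir, File.basename(pdf, '.pdf') + '.txt')
  "pdftotext #{Shellwords.escape(pdf)} #{Shellwords.escape(txt)}"
end

puts extract_command('scans/000123456.pdf', 'texts')
# system(extract_command('scans/000123456.pdf', 'texts')) would run it,
# provided pdftotext (from Poppler/Xpdf) is installed.
```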

Create Aleph Sequential File to add concordance links – This script (marc2aleph.pl) will output something that can be used to update the bibliographic records with concordance URLs — an Aleph Sequential File.
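One line of such a file pairs a nine-digit system number with a field tag, indicators, the literal "L", and $$-prefixed subfields. The sketch below is a guess at the shape of marc2aleph.pl's output; the 856 indicators, URL pattern, and link text are illustrative only, not the production values:

```ruby
# Emit one Aleph Sequential line adding an 856 (electronic location)
# field that points at the concordance. Values are illustrative.
def aleph_seq_line(sysno, url)
  format('%09d 85641 L $$u%s$$zSearch the full text', sysno.to_i, url)
end

puts aleph_seq_line('123456', 'http://concordance.example.edu/?id=123456')
```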

In the Fall the Libraries will be opening a thing tentatively called The Hesburgh Center for Digital Scholarship. The purpose of the Center will be to facilitate learning, teaching, and research across campus through the use of digital technology.

For the past few months I have been visiting other centers across campus in order to learn what they do, and how we can work collaboratively with them. These centers included the Center for Social Research, the Center for Creative Computing, the Center for Research Computing, the Kaneb Center, Academic Technologies, as well as a number of computer labs/classrooms. Since we all have more things in common than differences, I recently tried to build a bit of community through a grilled cheese lunch. The event was an unqualified success, and pictured are some of the attendees.