Converting Spreadsheets into MODSXML using Open Refine

Submitted by dsuadmin on Fri, 2014-06-27 13:21

by Sara Allain

Lately we've been trying to come up with a better way to create metadata for batch ingestion into Islandora. We just started preparing the UTSC Photographic Services Collection to go online - our lovely Young Canada Works summer student, Rachel, has been diligently selecting a few hundred candidates for the first phase of digitization - and it makes sense to start creating the metadata as well so that once we have digital surrogates we can bundle it all into Islandora via the batch ingest quickly. Since metadata creation/manipulation takes up a lot of my day, I started thinking about the most effective way to create XML using a workflow that would be optimal for our students, our systems, and me.

We often work with faculty and other people outside of the unit to create metadata for the various digital scholarship projects that we steward. Spreadsheets are an easy and accessible way for faculty, students, researchers - whomever - to come to grips with structured data. Things are tidy, they're easy to manipulate, we can derive CSV files - but most importantly, our project collaborators are familiar with how they work. There's no learning curve. We use a range of products from Excel to LibreOffice to Google Drive to do this - whatever's most suited to the project.

Step 1 - Set Up Your Spreadsheet

We're using MODS for all generic content going forwards - in past we used Dublin Core, but Islandora natively prefers MODS and it's more flexible for complex objects. (We may use other schemas for subject-specific content in the future, like Darwin Core for biodiversity data, which will be an interesting blog post in itself.) I set up a Google spreadsheet that uses human-friendly versions of the smallest child elements in MODS as column headers; that specific spreadsheet doesn't reflect all the fields in MODS that are available, so think of it as an infinitely extensible collection mechanism. In truth, it doesn't even matter what the headers are, as long as they map easily to MODS and the content is consistent.

Step 2 - Add Some Metadata

This step is pretty simple. We have generic guidelines for creating metadata - things like "Transcribe title from the object or create a title that describes the object." or "Use the format YYYY-MM-DD." Our goal in the DSU is to intervene as little as possible into this process. Usually all we'll do is a bit of clean-up before making it publicly available. You can see the instructions that we provide for users as comments if you hover over the column headers on the spreadsheet.

Step 3 - Import into Google Refine

Open Refine (also called Google Refine) allows you to perform sophisticated manipulations on tabular data. It supports regular expressions and a host of other ways to mash up your info. Once you have the program installed, it works in the Chrome browser. One word of warning, though - a desktop install can only handle so many rows of content before it will die on you. It's possible to allocate more memory if the program is having trouble parsing the data that you import.

The import process is simple - export the spreadsheet from Google as .xls, then import into Google Refine using the Create Project function. It looks like this:

Make sure that your data is rendering properly in the preview window and click on Create Project. You'll end up with - surprise! - another spreadsheet, this time in Open Refine.

Step 4 - Refine the Data

You might want to take this time to refine your data, since that's the whole point of Open Refine. You can do things like removing trailing spaces or splitting columns as needed. In the Google spreadsheet, for example, the Subject field includes multiple entities delimited by semicolons; Open Refine will do the work of isolating each of these into a separate column for you, if you should so desire. As mentioned above, it support regular expressions and is very powerful at manipulating data.

Step 5 - Export as MODS

This is the trickiest part, and by "trickiest" I mean surprisingly simple once you've figured it out. Open Refine has several options for exporting data; the one I use to export as MODS is Templating. When you click on it, you get it a form that looks like this:

Within the exporter, you can build any schema you desire. On the left is the editable template and on the right is a preview of how your file will look once it's exported. In this case we want MODS, which was easy to model. You simply need to add the proper tags around the jsonize tags. Here is a template for Open Refine that will show you exactly what to put where - the only thing that might need to be changed is the content within the square brackets in the jsonize tag - the bolded word here: {{jsonize(cells["Title"].value)}} (this is the column header from your spreadsheet). The exporter with the MODS template applied looks like this:

Click export and you'll get a big .txt file of structured data that you can work with - one you save it as .xml it will be valid MODSXML. I like to split that huge file using xml_split, part of the XML::Twig package, but there are any number of different ways of doing it. Zip your individual MODS records up with your objects and everything is ready to batch ingest into Islandora!