Thursday, April 26, 2012

My most recent project has been building an editing system for APIS (the Advanced Papyrological Information System—I know, I know) data as part of the Papyrological Editor (PE). APIS records are now TEI files. They started out using a text-based format that was MARC-inspired (fields, not structure), but I converted them to TEI a while back to bring them in line with the other papyri.info data. HGV records (also TEI) already had an editing system in the PE, so it seemed like the task of adding an APIS editor, which would be a simpler beast, wouldn't be that hard, just a matter of extending the existing one. In fact, it was an awful struggle.

Partly this is my fault. The PE is a Rails application, running on JRuby, and my Rails knowledge was pretty rusty. I loved it when I was working on it full time a few years ago, but coming back to it now, and working on an application someone else built, I found it obtuse and mentally exhausting. I had a hard time keeping in my head the different places business logic might be found, and figuring out where to graft on the pieces of the new editor was a continual fight against coder's block. The HGV editor relies on a YAML-based mapping file that documents where the editable information in an HGV document is to be found. This mapping is used to build a data structure that is used by the View to populate an editing form, and upon form submission, the process is reversed.

It's not at all unlike what XForms does, and in fact I was repeatedly saying to myself "Why didn't they just use XForms?" I got annoyed enough that I took some time to look at what it would take to just plug XForms into the application and use that for the APIS editor. The reluctant conclusion I came to was that there just aren't any XForms frameworks out there that I could do this with. And the XForms tutorials I was looking at didn't deal with data that looked like mine at all. TEI is flexible, but not all that regular, and every example I saw dealt with very regular XML data. Moreover, the only implementation I could find that wasn't a server-side framework (and I wasn't going to install another web service just to handle one form) is XSLTForms. The latter is impressive, but relies on in-browser XSLT transforms, which is fine if you have control of the whole page, but inconvenient for me, because I've already got a framework producing the shell of my pages. I just wanted something that would plug into what I already had. A bit sadder but wiser, I decided the team who built the HGV editor had done what they had to given what they had to work with.

Then, about a month ago, I got sick. Like bedridden sick. I actually took 3 days off work, which is probably the most sick leave I've taken since my children were born. While I was lying in bed waiting for the antibiotics to kick in, I got bored and thought I'd have a crack at rewriting the APIS editor. In Javascript. There was already a perfectly good pure-XML editing view, where you can open your document, make changes, and save it back (having it validated etc. on the server), so why not build something that could load the XML into a form in the browser, and then rebuild the XML with the form's contents and post it back? Doesn't sound that hard. And so that's what I did. I should say, this seemed like a potentially very silly thing to do, adding another, non-standard, editing method that duplicates the functionality of an existing standard to a system that already has a working method. I'd spent a lot of time on figuring out how to refactor the existing editor because that seemed like the Right Thing To Do. Being sick, bored, and off the clock gave me the scope to play around a bit.

Let me talk a bit more about the constraints of the project. Data is stored in XML files in a Git repository. This means that not only do I want my edits to be sending back good XML, I want it to look as much like the XML I got in the first place, with the same formatting, indentation, etc., so that Git can make sensible judgements about what has actually changed. I might want some data extracted from the XML to be processed a bit before it's put into the form, and vice versa. For example, I have dates with @when or @notBefore/@notAfter attributes that have values in the form YYYY-MM-DD, but I want my form to have separate Year, Month, and Day fields. Mixed (though only lightly mixed) content is possible in some elements. I need my form to be able to insert elements that may not be there at all in the source. I need it to deal with repeating data. I need it to deal with structured data (i.e. an XML fragment containing several pieces of data should be treated as a unit). And of course, it needs to be able to put data into the XML document in such a way that it will validate, so the order of inserted elements matters. Moreover, I need to build the form the way the framework expects it to be built, as a Rails View.

So the tool needs to know about an instance document:

how to get data out of the XML, maybe running it through a conversion function on the way out

how to map that data to one or a group of form elements, so there needs to be a naming convention

how to deal with repeating elements or groups of elements

how to structure the data coming out of the form, possibly running it through a function, and being able to deal with optional sub-structures

how to put the form data into the correct place in the XML, so some knowledge of the content model is necessary

how to add missing ancestor elements (maybe with attributes) for new form data

Basically, it needs to be able to deal with XML as it occurs in the wild: nasty, brutish, and (one hopes) short.

Form-based XML editing is one of those things that sounds fairly easy, but is in fact fraught with complications. It's easy to get data out of XML (XPath!), and it's easy to manipulate an XML DOM, removing and adding elements, attributes, and text nodes. But it's actually quite hard to get everything right, and to make the XML's formatting stay consistent. In my next post, I'll talk about how I did it.