Checking in on the PKP XML Parsing Service

You may not know that PKP has a pretty storied history with automated article parsing. Many, many years ago (circa 2008), we had a project called Lemon8-XML which worked as a complementary application to OJS. With Lemon8, journal editors could upload a Word (or Word-compatible) document into a web interface that would attempt to parse the document into sections and then let them rearrange those sections, confirm the document’s metadata, and validate citations.

It was a good start, but the project was dropped soon after its first release. A change in staffing at our end was the immediate cause, and the relatively narrow range of documents it could parse, along with the restrictiveness of the web editing interface, probably prevented it from going further than it did.

Before and since then, OJS has never had (m)any provisions for transforming articles from the format they’re written and edited in (almost always Word) to a format in which they can be easily read in a browser. Some journals outsource the markup of their documents to publishing houses which specialize in the painstaking, manual transformation of Word to LaTeX or Markdown or XML formats that can then be rendered into HTML and indexed properly by sites such as PubMed Central. Others perform this work in house at significant cost. The vast majority of OJS journals simply use the “Print to PDF” function in Word and call it a day.

For the past couple of years, thanks to the generous support of MediaX at Stanford University and the Canadian Internet Registration Authority (CIRA), we’ve been working on a proper successor to Lemon8 (and then some), in the form of our new XML Parsing Service. It wraps over a dozen other parsing tools, many of them developed in the seven years since Lemon8 (including ParsCit, Pandoc, LibreOffice, Exiftool, and others), to provide a full stack for automated parsing of Word documents and conversion to National Library of Medicine JATS XML, as well as HTML, PDF, and ePub copies for readers.
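To give a rough feel for what "wrapping" a tool like Pandoc means, here is a minimal sketch in Python. It is not the service's actual code: the function names (`build_pandoc_command`, `convert`) and the file-naming scheme are made up for illustration, and it assumes a Pandoc build recent enough to include the `jats` output format. The real stack chains many more tools than this.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical mapping from Pandoc output format to file extension.
OUTPUT_SUFFIXES = {"jats": ".xml", "html": ".html", "epub": ".epub"}


def build_pandoc_command(source: Path, target_format: str) -> list[str]:
    """Build the Pandoc command line for one Word-to-X conversion."""
    output = source.with_suffix(OUTPUT_SUFFIXES[target_format])
    return [
        "pandoc", str(source),
        "--from", "docx",
        "--to", target_format,
        "--output", str(output),
    ]


def convert(source: Path, formats=("jats", "html", "epub")) -> list[Path]:
    """Run Pandoc once per requested format and return the output paths."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    outputs = []
    for fmt in formats:
        command = build_pandoc_command(source, fmt)
        # check=True raises CalledProcessError if Pandoc exits non-zero.
        subprocess.run(command, check=True)
        outputs.append(Path(command[-1]))
    return outputs
```

Calling `convert(Path("article.docx"))` would, under those assumptions, leave `article.xml`, `article.html`, and `article.epub` next to the original file. Keeping the command construction separate from the subprocess call makes the wrapper easy to test without Pandoc installed.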

While we’re improving the basic quality of the parsing all the time through the addition of new modules and upstream contributions to the modules themselves, this year we’ve been focusing on fleshing out some of the other value propositions. We’re creating a 1,000-document test corpus which we’ll be releasing publicly to demonstrate the performance of our automated parsing on real-world articles, and we’ve rigged up a test suite that performs nightly run-throughs of this corpus, with results also publicly available. We’ve also undertaken an efficiency study with development partner Érudit at the University of Montreal to assess real differences in cost-per-page when performing manual markup *after* our stack has already run, which we’ll be reporting on early next year.