Translating reliably between XML and JSON (xml2json)

Last week I was assigned to work on a simple yet interesting problem. MongoDB stores data as JSON. But it turns out we often have customers, especially in the important financial services market segment, whose data is in XML. (Yes, SOAP still exists too!) To store that data in MongoDB, we need to transform it into JSON.

Storing XML as a text field

There are various ways one could do this. For example, one could store the whole XML document in a single text field, and additionally extract a few parts of the XML into individual JSON keys. The benefit is that those extracted keys can then be indexed and used in queries.
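For illustration, such a document might look like this (a sketch with made-up field names, not a real customer schema):

```json
{
  "_id" : 1,
  "xml" : "<person><firstName>John</firstName><lastName>Doe</lastName><age>33</age><shoeSize>10</shoeSize><height>180</height></person>",
  "firstName" : "John",
  "lastName" : "Doe",
  "age" : 33
}
```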

The above MongoDB document could be searched by firstName, lastName and age, and such a query can return the original XML document. But you could not index or query shoeSize or height, because they exist only inside the XML text and MongoDB is completely ignorant of them.

Translating XML into equivalent JSON

In the project I was assigned to, however, we wanted to go further. We decided to translate all of the XML into a JSON document and store that in MongoDB as is. We also want to translate that JSON back into XML. That part will be implemented later, but it implies a requirement: the XML->JSON translation must be lossless, so that we can later recover the original XML document.

Looking at existing xml2json libraries, I found that most of them approach the problem quite similarly. I chose the last one as the basis for my work, as it was a simple yet robust code base.

The devil is in the details

XML and JSON fundamentally do the same thing: you have a hierarchy of nodes, which ultimately contain data in leaf nodes. As such it is clearly possible to transform from one syntax to an equivalent representation in the other.

The fun part comes with the realization that XML is more convoluted than JSON. (Eliot would say it is "too expressive".) XML elements have attributes and child elements. Now, attributes too are children of their parent element, but they somehow live in their own dimension, separately from the actual child elements.

Historically, in the HTML language that is used to format text documents, this distinction did make sense. The content itself is "data" in the child elements. Things that relate to the structure or formatting of the document are "not data", so they are attributes. For example:

<p id="e12345" style="font-size: 12pt">Hello World</p>

But once we started to use XML as a generic serialization format for arbitrary programmatic data, this distinction no longer makes sense. Semantically, all of the below XML snippets mean the same thing.
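For example (my reconstruction of the kind of snippets meant here), both of the following express the same fact, once as an attribute and once as a child element:

```xml
<person firstName="John"/>

<person>
  <firstName>John</firstName>
</person>
```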

The common convention is to prefix attribute names with "@" in the JSON output. This works fine, because XML element names cannot contain a "@", so there's no risk of name collisions, and attributes remain easy to distinguish.
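For example, an attribute-carrying element maps like this (illustrative snippet):

```json
{ "p" : { "@id" : "e12345", "@style" : "font-size: 12pt", "#text" : "Hello World" } }
```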

Below I will focus on the improvements I made to the "state of the art". If you want to read more background first, this XML.com article by Stefan Gössner covers these topics more broadly.

My 2 improvements

There were 2 things I had to improve on to achieve what we wanted to do.

All of the libraries I looked at make a similar trade-off between correctness and readability. As seen above, an XML text node might end up under a #text key in the equivalent JSON document. This comes directly from the libxml parser, which returns #text as the value of the .nodeName property. However, for most simple documents that is not very elegant; a stylistically more "correct" translation is to simply put the text as a string value:

<e>Hello</e> -> { "e" : "Hello" }

On the other hand, should the XML element "e" also have some attributes, this is of course not possible, so we fall back to using #text as the key:
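For example (the attribute name here is my illustration):

```json
{ "e" : { "@id" : "x1", "#text" : "Hello" } }
```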

The problem with this is the ambiguity. If we want to reference the string "Hello" in MongoDB, for example to retrieve it in a query, that's impossible if we cannot reliably know its path. The first form can be found simply with { "e" : 1 } while the latter requires { "e.#text" : 1 }.

This is clearly a mess! In the real world this is going to be a problem. Maybe the id attribute is optional, so some documents have it and others don't. Or maybe it was introduced in a later version of the application, so that old documents lack an id attribute but newer ones have it. Such ambiguity would make our MongoDB database unusable.

To solve this problem I modified the xml2json translation so that attributes are output as siblings to their "parent" node:
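The idea, sketched on the earlier example (the exact key syntax is my reconstruction of the attributes-as-siblings approach, not necessarily the library's literal output):

```json
{ "e" : "Hello", "e@id" : "x1" }
```

Now "e" is always found at the same path, whether or not attributes are present.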

My second change deals with ordering. In XML the children are ordered:

<html>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</html>

(Note that attributes, on the other hand, are not ordered.)

In JSON the keys of a document are not in any particular order, and are not even guaranteed to maintain the order you think they have. The only JSON structure that maintains order is the list. So to translate the above XML into JSON, we'd need a list. Interestingly, the popular way to do this renders the following translation:

{ html : { p : ["First paragraph", "Second paragraph"] } }

That looks good. Except it's a fairly limited solution! For example the following caused problems for the libraries I looked at:

<p>I feel <strong>fine</strong> today!</p>

The Node/NPM library would happily translate this as:

{ p : { strong : "fine", "#text" : "I feel today!" } }

Stefan Gössner's library at least behaved robustly: it detected such "mixed" children and left them untranslated:

{ p : "I feel <strong>fine</strong> today!" }

Neither of the above were satisfactory for my purposes. The solution is to employ JSON list structures when needed:

{ p : ["I feel ", { strong : "fine" }, " today!" ] }

Note that again we have a style vs correctness tradeoff. The correct thing would be to always use a JSON list, so the ordering of child elements is always kept intact. But in most cases the application probably doesn't care about the ordering, so we assume it is ok to use lists only when actually needed. Most of the time the trigger is having two children with the same nodeName: in this case there were two #text nodes, and above we also had two p elements.
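The rule of thumb above can be sketched in JavaScript. This is my own illustration of the decision, not the library's actual code; childrenToJson and the [nodeName, value] input shape are made up for the example:

```javascript
// Illustrative sketch: decide whether the children of an element can
// become plain object keys, or need a JSON list to keep their order.
// Each child is a [nodeName, value] pair; text nodes use "#text".
function childrenToJson(children) {
  const names = children.map(([name]) => name);
  const needsList = new Set(names).size !== names.length;
  if (!needsList) {
    // All nodeNames unique: a plain object is unambiguous,
    // and ordering doesn't matter.
    const obj = {};
    for (const [name, value] of children) obj[name] = value;
    return obj;
  }
  // Duplicate nodeNames (e.g. two <p> elements, or mixed #text and
  // element children): fall back to a list so the ordering survives.
  return children.map(([name, value]) =>
    name === "#text" ? value : { [name]: value }
  );
}
```

With the mixed-content example, childrenToJson([["#text", "I feel "], ["strong", "fine"], ["#text", " today!"]]) yields ["I feel ", { strong: "fine" }, " today!"], while a single unique child stays a plain object key.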

It's on GitHub!

I published my improved xml2json library on GitHub. It's designed to work in modern browsers and uses the DOMParser for the XML, but I did successfully use it in Node.js together with the libxml parser. For Node.js you need to change the code a little to integrate with the module.exports system and to use the libxml parser instead of window.DOMParser; other than that it works as is. I'll try to integrate those changes into the GitHub code one day too.

(Note that I have not yet made the corresponding changes to json2xml, so you can't actually translate backwards and get your original XML document back.)

One drawback of the "sometimes use an array" approach is that parsers need to be able to handle both. If there is only one element in the set, it gets represented one way, and with two or more elements in the set, the data type changes from scalars to arrays.

EJ, that's correct. Or to be precise, for json2xml conversion we would in any case want to support (almost) all possible JSON documents. But it did increase the complexity of the output phase of xml2json.

Still, it's not that bad and well worth the effort. For example, a MongoDB multikey index is really cool, but it only works with a single list; you cannot use it if you have multiple nested lists. So avoiding the list structure when not necessary is both useful and good style.

Even then, it confirms how different these two worlds are. Even if you master the subtleties well, it comes across as if you wrote JXON from an XML background. For example, requiring a schema to get started seems odd from a JSON point of view. In JSON, having a schema is a best case; most of the time you don't have one.