Safe Ampersand Parsing in XML Files

The standard library ElementTree XML Parser is one of those packages that makes using Python such a dream. Sometimes, though, when you’re dealing with XML, you won’t always be able to ensure that what you’re pulling in is parseable. Having angle brackets where they should be and no illegal characters in names is fairly straightforward. Where most people run into trouble is with special characters, specifically the ampersand (&).

If an ampersand isn’t escaped (&amp;) and isn’t the start of another entity or character reference (&lt; &gt; &quot; &apos;, or a numeric reference such as &#38;), the document is not well-formed XML. However, due to finagling elsewhere in a system, there might not be any red flags raised until it comes time to parse it. That’s what happened to me today. The system which directly used the XML file was a black box, and the config scraper from which I was pulling the XML file (for a tangential purpose) was also a black box. Not only was the primary system able to parse the unescaped ampersand, but the config files could be updated at any time and pulled in, but not validated, by the basic config scraper. This left me with a ParseError whenever I tried to parse an XML file that had an unescaped ampersand, and few options for dealing with it. I couldn’t skip these files, as it was virtually guaranteed that they were valid (according to the primary system) and parsing them was integral to the overall operation of my new program.
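For reference, here is a minimal sketch of the failure. The XML snippet is a made-up stand-in for one of the config files, not the real thing:

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragment containing a bare ampersand.
raw = "<config><name>Fish & Chips</name></config>"

try:
    ET.fromstring(raw)
    parsed = True
except ET.ParseError as err:
    # ElementTree rejects the bare ampersand as not well-formed.
    parsed = False
    print(f"ParseError: {err}")
```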

The sane, pragmatic solution here is to escape the source as required and enforce that by xmllinting files before they are loaded by the primary application. Then, assuming the config scraper remains a (dumb) black box, I could update my application to lint prior to attempting to parse a file and ignore that file if the operation throws an error. In the future, it’s likely that I will enforce proper syntax in the primary application’s loader (instead of whatever black magic it uses to parse the unescaped ampersand), but even if I could fast-track the change, there will still be plenty of deployed old instances of the primary application so I’ll still need to find a way to load the files with unescaped ampersands.

Okay, so why not just run a str.replace('&', '&amp;') on each line and call it a day?

Simply because it isn’t a robust solution.

The problem with replacing the char is that there could be a mix of already-escaped ampersands alongside unescaped ampersands. &amp; would become &amp;amp;. In addition to wrecking those, the primitive approach will also invalidate other legitimate escape sequences such as &lt; and &gt;, turning them into &amp;lt; and &amp;gt;.
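A quick sketch of the double-escaping problem, using an invented sample string:

```python
# Hypothetical text mixing a bare ampersand with already-escaped sequences.
raw = "Fish & Chips &amp; Peas &lt;hot&gt;"

# The naive approach escapes every ampersand, including the ones
# that already belong to an entity reference.
naive = raw.replace("&", "&amp;")
print(naive)
```

Only the first ampersand needed escaping; the other three sequences are now double-escaped and will come out of a parser as literal "&amp;", "&lt;" and "&gt;" text.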

The standard xml library has a set of utilities for creating SAX applications (SAX stands for Simple API for XML). Its xml.sax.saxutils module has two nifty functions, unescape and escape, that can be used in tandem. By calling unescape first and then escape, it should be able to handle a mix of escaped and unescaped characters. Except it, rightfully so, can’t recognize structured xml tags, so the result is a mess: every angle bracket in the markup gets escaped as well.
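A minimal sketch of that round trip, run against a made-up element rather than one of the real config files:

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical element with one bare and one escaped ampersand.
raw = '<item name="a">Fish & Chips &amp; Peas</item>'

# unescape() normalises &amp; back to &, then escape() re-escapes
# every &, < and > it sees, including the ones that form the tags.
roundtrip = escape(unescape(raw))
print(roundtrip)
```

The ampersands come out consistent, but the tags themselves have been turned into text, so the string is no longer an XML element at all.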

You can instead run the replace function after you unescape the characters, but what’s the fun in that?

HTMLTidy looks like it might be able to save the day here because of its xml parsing feature. It’s intelligent enough to recognize your tags so it won’t mess with your angle brackets. It knows when an ampersand needs to be escaped and when it doesn’t. Problem solved? Not so fast. In one notable edge case, it fails when you have an ampersand in a tag header that has no counterpart on that line. It will create a closing tag on that line, pull a Houdini act on what’s in-between, and then delete the orphaned closing tag later on. There’s no way to change this behavior.

What this conundrum reveals is that mindlessly trawling Stack Overflow (like I did) can result in brittle, potentially dangerous code. If you were to look up questions related to ‘unescaped ampersand parsing in xml’ and ‘xml ampersand python ParseError’, you’d find yourself misled again and again. You can’t expect answers to contain the most robust code, especially for such esoteric use cases, but the scenario I’m writing about is far from make-believe.

It comes down to how the question is framed. Intuition is misleading: it says there should be a library function for this. But it turns out that I didn’t need an xml library function at all. A regex replace is the best solution! A coworker suggested it and I was dumbfounded by how much I had overthought the solution. If you only need to escape ampersands and nothing else, a single substitution that skips ampersands already belonging to an escape sequence is all it takes.
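Here is one way such a regex replace might look. This is a sketch under my own assumptions, not necessarily the exact pattern my coworker suggested: the names BARE_AMPERSAND and escape_bare_ampersands are invented for illustration, and the pattern only recognizes the five predefined XML entities plus numeric character references, so a custom entity declared in a DTD would get escaped too.

```python
import re
import xml.etree.ElementTree as ET

# Match an ampersand that does NOT begin a predefined entity
# (&amp; &lt; &gt; &quot; &apos;) or a numeric character reference.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)"
)

def escape_bare_ampersands(text):
    """Escape only the ampersands that are not already part of a reference."""
    return BARE_AMPERSAND.sub("&amp;", text)

# Hypothetical config fragment mixing bare and escaped ampersands.
raw = "<config><name>Fish & Chips &amp; Peas &lt;hot&gt; &#38; salted</name></config>"

fixed = escape_bare_ampersands(raw)
root = ET.fromstring(fixed)  # parses cleanly now
print(root.find("name").text)
```

The negative lookahead leaves the tags and the existing escape sequences untouched, and running the function a second time changes nothing, so it is safe to apply to files that are already valid.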