Resist Bad Data, Part 2: How to Filter Incomplete XHTML Entities and Encodings

So, as stated in the piece before, dealing with poorly formed XML is a necessary (though infuriating) part of the day for some people, including me. In that article, I described which regular expression to use (with Perl) in order to cleanse your files of such garbage. Which is fine, especially when you’re scripting…but these days, I’m fairly sure that the clear majority of systems are less about batch jobs and more are programming-oriented (distributed architectures, microservices, etc.). So, obviously, it’d be nice to have a solution that could be part of a platform with a more robust programming language.

So, how do we do it in Java? Easy enough. We can just make use of the same regular expression mentioned last time, since Java has a fairly straightforward approach that mimics other languages (like Perl):

And what about C#? Well, as I’ve pointed out again and again, the .NET platform isn’t exactly a source of inspiration when it comes to handling XML. And, of course, it has to be different when it comes to everything, including regular expressions. So, as my personal recommendation, I would make use of regular expressions for pattern matching and make use of the callbacks for actual filtering: