Web Development

Fast XML Parsing in Ruby

By Jeffrey L. Taylor, November 14, 2011

Sometimes the fastest way to parse a known XML stream is to write your own lean parser. Even in Ruby, this gives remarkable performance benefits

Most programming languages have their own XML parser libraries. And many of those use the DOM (Document Object Model) API. DOM is good for general-purpose XML processing: The input is parsed into a tree structure that can be modified and written back out. Often whitespace is preserved so the output is identical to the input if it is not modified. This generality comes at a cost: large memory requirements (often more than double the input size) and slow read and write. For applications that read the XML data for only specific pieces of information, there are better alternatives.

Other libraries use SAX (Simple API for XML), and a pull parser. SAX is callback-based and I find it difficult to work with, but it allows breaking your code into reasonable size pieces. I use pull parsers, but my code ends up in one or two large functions. Pull parsers have a simple internal structure and a lot of little snippets of code for each data type of interest.

Neither method in mainstream libraries deliver the speed and simplicity I want, however. Let's see how to improve them. The examples in this article start with a simple OPML (Outline Processor Markup Language, an XML language) subset. It is suitable for the RSS export format used by most RSS readers (such as Google Reader or Thunderbird). The second example is a more robust version that handles nested feeds (for example, Google Reader's Categories). The final version is possibly overkill for OPML, but it illustrates techniques suitable for larger parsers such as RSS). The complete OPML programs with test suites and sample input files are on GitHub.

RSS (originally, RDF Site Summary, often dubbed "Really Simple Syndication") is an XML language for publishing news and blog summaries. When I first started working with RSS feeds for a Ruby on Rails application, I used FeedTools. It is a solid package that supports reading, searching, modifying, and writing RSS feeds. But, it doesn't scale to reading thousands of feeds per hour.

The developer of FeedTools lamented in his blog about choosing to use REXML (part of the Ruby standard libraries) and that he was too burned out to redo it with a faster library. Nokogiri and Hpricot are newer XML libraries and an RSS parser using them runs 10% faster than FeedTools. Both use the libxml2 library (in C, part of the GNOME project). An RSS parser written with libxml2 is an additional 10% faster. However, I wanted 10 times faster, not 10% or even 20% faster. The current benchmark results for these parsers and my lean alternative are near the end of the article.

The libxml-ruby docs are much more readable than the libxml2 docs. Digging around in them I found the two additional APIs: SAX and pull parsers. I did not have much success with the SAX API. Pull parsers I understand. An RSS pull parser is over a hundred times faster than FeedTools! The key to the speed is the API, not the language it is written in. The REXML pull parser (pure Ruby) is trivally slower (approximately 2%) than libxml (in C with libxml-ruby Ruby wrapper). Interestingly, the REXML DOM API is 20% slower than the libxml2 DOM API. The RSS parser I use in production is available here.

How To Do It

Let's look at the feed and how its speed was accelerated. A simplified OPML file with one feed looks like this:

Pull parsers loop through the input. Within the loop, an outer case statement handles the markup types. Only tag start, tag data, and tag end markup are needed for OPML and RSS. Inner case statements switch on the path  the nesting of tag names. For example, the title in Example 1 has the path "opml>head>title" (using ">" to separate tag names is my own convention).

The code to parse it looks like Listing One (in the FLAT branch on GitHub). A case could be made for declaring a class or two to store the results. In the interests of simplicity and for teaching purposes, I've chosen to store the feeds as an array (reasonable) of hashes (a class with just title and URL might be better). Using built-in types simplifies the test data (in test/opml on GitHub). Each OPML file has a similarly named YAML file (YAML Ain't Markup Language) with the expected parse result. By sticking to YAML standard types: sequences (arrays in Ruby) and mappings (hashes in Ruby), the test data is portable across implementation languages that have support for those data types (e.g., Python).

The code to parse Example 2 must keep track of the nesting depth. Note also that regular expressions are used in some case statements to match the path for nested elements. The complete program is in the NESTED branch on GitHub.

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!