How to Parse XML Documents in Elixir

As software engineers, we often have to deal with different document formats. One example of such a document format is XML. This blog post will be a short introduction on how to parse XML documents with Elixir, and it includes an example of how it’s done.

Example

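First, we need to create the example. Run mix new xml_example to create a new Elixir project called xml_example, and navigate into the project’s directory with cd xml_example. Then create a file called example.xml in the project’s root directory with the following content: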

<todos>
  <todo id="1">
    <body>This is the body of to-do item #1</body>
    <priority>3</priority>
  </todo>
  <todo id="2">
    <body>This is the body of to-do item #2</body>
    <priority>1</priority>
  </todo>
  <todo id="3">
    <body>This is the body of to-do item #3</body>
    <priority>3</priority>
  </todo>
</todos>

This XML document consists of a root node, todos, which models a to-do list. Each todo element carries the item’s id as an attribute and has two child nodes: body, which holds the text of the to-do, and priority.

After creating this example document, we can start Elixir’s interactive shell (for example, with iex -S mix) so that we can play around with it. There, we read the file and parse it with xmerl.

To parse it, we convert the content of the document we previously read to a charlist with :binary.bin_to_list/1 and then pass it to :xmerl_scan.string/1. This function returns a tuple whose first element is the parsed content, represented as xmerl records (Erlang records defined in xmerl/include/xmerl.hrl). The second element is a charlist holding the rest of the input that could not be parsed. In our case, :xmerl_scan.string/1 should be able to parse the entire input, which is why we match the second element of the returned tuple against an empty list. The first element is an xmlElement record for the root node, todos, which appears in Elixir as a large tuple tagged with the atom :xmlElement.
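The read-and-parse step described above can be sketched as follows. To keep the snippet self-contained, it embeds a shortened copy of the document in a string (the variable name xml is an assumption) instead of reading example.xml from disk, and it inspects only the record tag and the root node's name rather than the full return value.

```elixir
# A shortened copy of the example document, embedded so the snippet runs on its own.
# In the project, this binary would come from File.read!("example.xml").
xml = """
<todos>
  <todo id="1">
    <body>This is the body of to-do item #1</body>
    <priority>3</priority>
  </todo>
</todos>
"""

# xmerl expects a charlist, so convert the binary first, then parse it.
# Matching on [] asserts that the whole input was consumed.
{doc, []} =
  xml
  |> :binary.bin_to_list()
  |> :xmerl_scan.string()

# The parsed document is an :xmlElement record: a tuple whose first
# element is the record tag and whose second element is the node name.
IO.inspect(elem(doc, 0))
IO.inspect(elem(doc, 1))
```

The two IO.inspect calls should show the :xmlElement tag and the root name :todos; the remaining tuple fields hold attributes, child content, and position information.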

Manually Parse xmerl Records

Because the returned data structure consists of Erlang records, which are defined in xmerl/include/xmerl.hrl, we have to convert them to Elixir records to be able to use them conveniently in Elixir. For this, we use the Record.defrecord macro together with Record.extract, which reads a record definition from an Erlang header file. Record.defrecord has to be called inside a module. We call this module XML, and we also give it a few helper functions for walking the parsed document: get_child_elements, find_child, and get_text.
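A minimal sketch of what such a module might look like is shown below. The module name XML and the helper names come from the code in this post, but the function bodies are assumptions written to match how the helpers are called; the short document at the end is a hypothetical stand-in for example.xml.

```elixir
defmodule XML do
  require Record

  # Import the Erlang record definitions from xmerl's header file.
  Record.defrecord(:xmlElement, Record.extract(:xmlElement, from_lib: "xmerl/include/xmerl.hrl"))
  Record.defrecord(:xmlText, Record.extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl"))

  # Returns the child elements of a node, skipping text nodes such as
  # the whitespace between tags.
  def get_child_elements(element) do
    element
    |> xmlElement(:content)
    |> Enum.filter(&Record.is_record(&1, :xmlElement))
  end

  # Finds the first element with the given tag name in a list of elements.
  def find_child(children, name) do
    Enum.find(children, fn child -> xmlElement(child, :name) == name end)
  end

  # Returns the text content of an element as a single charlist.
  def get_text(element) do
    element
    |> xmlElement(:content)
    |> Enum.filter(&Record.is_record(&1, :xmlText))
    |> Enum.flat_map(&xmlText(&1, :value))
  end
end

# Hypothetical one-item document, used only to exercise the helpers.
{doc, []} =
  :xmerl_scan.string(~c"<todos><todo id=\"1\"><body>Buy milk</body></todo></todos>")

[todo] = XML.get_child_elements(doc)

body_text =
  todo
  |> XML.get_child_elements()
  |> XML.find_child(:body)
  |> XML.get_text()

# body_text is the charlist for "Buy milk"
IO.inspect(body_text)
```

Because xmlElement and xmlText are macros, using them from outside the module (for example, in iex) requires calling require XML first.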

# Collect the body text of every to-do by walking the tree manually:
doc
|> XML.get_child_elements()
|> Enum.map(fn todo ->
  XML.get_child_elements(todo)
  |> XML.find_child(:body)
  |> XML.get_text()
end)
# Returns:
# ['This is the body of to-do item #1',
#  'This is the body of to-do item #2',
#  'This is the body of to-do item #3']

Although it is possible to get the data we want out of the document using the above, this looks like a lot of code. There must be a more convenient way to query data out of xmerl’s data structure.

XPath

XPath is a query language for selecting nodes in an XML document, and xmerl supports querying a parsed document with it. Using :xmerl_xpath.string/2 with the query /todos/todo/body/text(), we can rewrite our code to fetch the text of all to-do bodies.
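A minimal sketch of such a query might look like this. The module name XPathExample, the function todo_bodies/1, and the shortened two-item document are assumptions for illustration; only :xmerl_xpath.string/2 and the xmlText record come from xmerl itself.

```elixir
defmodule XPathExample do
  require Record

  # Only the xmlText record is needed to read the query results.
  Record.defrecord(:xmlText, Record.extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl"))

  # Returns the text of every to-do body as a list of charlists.
  def todo_bodies(doc) do
    ~c"/todos/todo/body/text()"
    |> :xmerl_xpath.string(doc)
    |> Enum.map(&xmlText(&1, :value))
  end
end

# Hypothetical two-item document standing in for example.xml.
xml =
  "<todos>" <>
    "<todo id=\"1\"><body>First</body><priority>3</priority></todo>" <>
    "<todo id=\"2\"><body>Second</body><priority>1</priority></todo>" <>
    "</todos>"

{doc, []} = xml |> :binary.bin_to_list() |> :xmerl_scan.string()

bodies = XPathExample.todo_bodies(doc)

# bodies holds one charlist per to-do, in document order
IO.inspect(bodies)
```

The query returns xmlText records, so the final Enum.map step extracts the :value field from each one.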

With XPath, we no longer need the helper functions we defined in our XML module, because the query selects the nodes for us instead of us walking the document manually. Only the record definitions remain in use: XML.xmlText reads the value of each text node the query returns.

Using XPath, we can also select nodes with more complex queries, such as matching on attributes or on child nodes. For example, we can get the body text of all to-dos whose id attribute is 1 with the predicate [@id="1"], or of all to-dos whose priority child node has the value 3:

# Body text of all to-dos that have a priority child node with the value 3:
:xmerl_xpath.string('/todos/todo[priority="3"]/body/text()', doc)
|> Enum.map(&XML.xmlText(&1, :value))
# Returns:
# ['This is the body of to-do item #1',
#  'This is the body of to-do item #3']
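The attribute-based variant can be sketched the same way. The module TodoQuery and the function bodies_by_id/2 are hypothetical names, and the two-item document is a shortened stand-in for example.xml. The id is restricted to integers so that interpolating it into the query cannot change the query's structure.

```elixir
defmodule TodoQuery do
  require Record

  Record.defrecord(:xmlText, Record.extract(:xmlText, from_lib: "xmerl/include/xmerl.hrl"))

  # Returns the body text of every to-do whose id attribute equals `id`.
  def bodies_by_id(doc, id) when is_integer(id) do
    # Build the query as a binary, then convert it, since xmerl expects a charlist.
    "/todos/todo[@id='#{id}']/body/text()"
    |> String.to_charlist()
    |> :xmerl_xpath.string(doc)
    |> Enum.map(&xmlText(&1, :value))
  end
end

# Hypothetical two-item document standing in for example.xml.
xml =
  "<todos>" <>
    "<todo id=\"1\"><body>First</body><priority>3</priority></todo>" <>
    "<todo id=\"2\"><body>Second</body><priority>1</priority></todo>" <>
    "</todos>"

{doc, []} = xml |> :binary.bin_to_list() |> :xmerl_scan.string()

first = TodoQuery.bodies_by_id(doc, 1)
missing = TodoQuery.bodies_by_id(doc, 42)

IO.inspect(first)
IO.inspect(missing)
```

A query that matches nothing returns an empty list rather than raising, so callers can treat "not found" as an ordinary result.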

SweetXML

Another option for parsing XML documents with Elixir is using the SweetXML library, which is a small wrapper around xmerl that reduces the boilerplate code we have to write for converting the input to a charlist or converting the xmerl records.

To use the SweetXML library, we have to quit our interactive shell session and add SweetXML to the dependencies list in mix.exs. Then we run mix deps.get in the project’s root directory to download the dependencies and start a new interactive shell session.
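As a rough sketch, the dependency entry in mix.exs might look like the fragment below. The package name on Hex is sweet_xml; the version requirement shown here is an assumption, so check hex.pm for the current release.

```elixir
# mix.exs (fragment): the deps/0 function generated by `mix new`
defp deps do
  [
    # The version requirement is an assumption; adjust it to the current release.
    {:sweet_xml, "~> 0.7"}
  ]
end
```

After mix deps.get has finished, starting the shell again with iex -S mix makes the SweetXml module available.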

import SweetXml

{:ok, xmldoc} = File.read(Path.expand("./example.xml"))

# get body text of all to-dos where the ID attribute is `1` as a list:
xmldoc |> xpath(~x"/todos/todo[@id=\"1\"]/body/text()"l)
# Returns ['This is the body of to-do item #1']

# get the body text of all to-dos where a child node "priority" with the value `3` exists as a list:
xmldoc |> xpath(~x"/todos/todo[priority=\"3\"]/body/text()"l)
# Returns ['This is the body of to-do item #1', 'This is the body of to-do item #3']

SweetXML defines a custom ~x sigil, which lets you specify the return type in addition to the XPath query. The syntax looks like ~x"/some/xpath/query"l, where the trailing character (l in our example) is a modifier that determines the return value. l stands for list, but you can also choose other modifiers, like i for integer or s for string, and more. Take a look at the SweetXML documentation to see all available options.
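To illustrate the modifiers, here is a small sketch that runs as a standalone script. It assumes Elixir 1.12 or later and network access, because it uses Mix.install/1 to fetch the dependency; inside the xml_example project, the entry in mix.exs does that job instead. The version requirement and the shortened two-item document are assumptions.

```elixir
# Fetch SweetXml for this standalone script (the version requirement is an assumption).
Mix.install([{:sweet_xml, "~> 0.7"}])

import SweetXml

# Hypothetical two-item document standing in for example.xml.
xml = """
<todos>
  <todo id="1"><body>First</body><priority>3</priority></todo>
  <todo id="2"><body>Second</body><priority>1</priority></todo>
</todos>
"""

# s: return the first match as a string instead of a charlist
body = xpath(xml, ~x"/todos/todo[@id='1']/body/text()"s)

# i: return the first match converted to an integer
priority = xpath(xml, ~x"/todos/todo[@id='1']/priority/text()"i)

# l and s combined: return all matches as a list of strings
bodies = xpath(xml, ~x"/todos/todo/body/text()"ls)

IO.inspect({body, priority, bodies})
```

Choosing the modifier at the query site keeps the type conversion next to the XPath expression, so no separate mapping step over xmerl records is needed.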

Conclusion

Although xmerl can be used directly from Elixir, I would recommend SweetXML for parsing XML documents, because it’s easier to use. The custom Sigil to define the return value is also a nice feature. All in all, it’s nice to have such a mature ecosystem provided by Erlang.