Introducing MicroXML, Part 2: Process MicroXML with microxml-js

Experiment with a JavaScript MicroXML parser

MicroXML is a simplification of XML that is compatible with
earlier versions. Part 1 of this two-article series covers the basic principles of
MicroXML. MicroXML is designed with a straightforward grammar that can be processed
with many modern general-purpose parsing tools. James Clark, who led the original
push for MicroXML, is among those who have developed a parser for the community
specification. Learn how to use Clark's JavaScript MicroXML parser to experiment with the format.

Editor's note: This two-article series, originally published in 2012, was
revised to reflect subsequent important updates to the MicroXML specification.

Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm that specializes in the next generation of web technologies. Mr. Ogbuji is the lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is a computer engineer and writer who was born in Nigeria and lives and works in Boulder, Colorado, US. You can find more about Mr. Ogbuji at his Weblog Copia, or on Twitter.

MicroXML, a simplification of XML that is compatible with
earlier versions, is an emerging specification under the W3C's Community Group process. In Part 1 of this series, "Explore the basic principles of MicroXML,"
you learned the basics of MicroXML and how it differs from XML 1.x and related standards.

MicroXML was proposed by James Clark and advanced by John Cowan, who also created its
first parser, MicroLark (open source, Apache 2.0 license). MicroLark is written in
the Java™ language and implements several modes of parsing: pull mode, push
mode, and tree mode. Cowan has not yet updated MicroLark to conform to the latest
community specification, but other emerging implementations include a project by James Clark with JavaScript and Java parser implementations.

In this article, learn to parse the MicroXML format using James Clark's JavaScript parser (microxml-js) in the browser.

Getting started

To follow along with the examples in this article, download microxml-js (see Resources). You can either clone the repository with Git or click ZIP on the microxml-js GitHub page.

Unpack the downloaded files, navigate with your browser to the location where you saved
them, and open the test.html file to display a page like the one in Figure 1:

Figure 1. Initial HTML test page from microxml-js

The main feature of the page is the text area that includes the <doc></doc> content. To exercise the parser, type or
paste MicroXML into that text area and click Parse. The parser
converts the MicroXML into a JSON object according to the informational JSON rendition
of the MicroXML data model. The resulting JSON code then displays in the JSON data model section. Figure 2 shows the result of parsing the MicroXML <doc></doc> line:

Figure 2. HTML test page from microxml-js after a test parse

In Figure 2, I highlighted the resulting JSON code in a yellow oval
(itself not part of the browser display). The JSON code reads:

Figure 3. HTML test page from microxml-js after parsing Listing 1

Improving the display

Notice that the JSON code stretches beyond the right border of the browser window. To
see all the JSON code, you can scroll left and right. It would be nice to get the code pretty-printed so that you can more easily make out the structure of the resulting JSON. I implemented that effect with a small change to this line in test.html:
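One way to get that effect is the optional third argument of JSON.stringify, which sets the indentation. The sketch below assumes that the page serializes the parse result with JSON.stringify; the model value and the prettyPrint helper are hypothetical names for illustration and are not taken from test.html.

```javascript
// Hypothetical stand-in for a parse result: [name, attributes, children].
const model = ["doc", { lang: "en" }, ["Hello, ", ["strong", {}, ["MicroXML"]]]];

// Compact form: everything on one line, which is what overflows the window.
const compact = JSON.stringify(model);

// Pretty-printed form: the third argument sets the indent width in spaces.
function prettyPrint(value) {
  return JSON.stringify(value, null, 2);
}

console.log(compact);
console.log(prettyPrint(model));
```

In a page, the pretty-printed string must land in a pre element (or one styled with white-space: pre) so that the line breaks survive.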

The data model is simple. Sequences, such as MicroXML element content, become JSON arrays.
Mappings, such as attribute sets, become JSON objects. An element is an array of three
items: its name as a (Unicode) string, an object for its attributes, and then an array
of its children. Notice that the comment in Listing 1 does not appear in the output:
comments are not part of the MicroXML data model.
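Under that description, an empty element such as <doc></doc> corresponds to ["doc", {}, []]. The following sketch works with a hand-written model rather than real parser output. It assumes that character data appears among the children as plain strings, and the helper names isElement and textOf are hypothetical.

```javascript
// Hand-written model for: <p class="note">Hello, <strong>MicroXML</strong></p>
const para = ["p", { class: "note" }, ["Hello, ", ["strong", {}, ["MicroXML"]]]];

// An element is an array; character data is assumed to be a plain string.
function isElement(node) {
  return Array.isArray(node);
}

// Concatenate all character data below a node, in document order.
function textOf(node) {
  if (!isElement(node)) return node;
  const [, , children] = node;
  return children.map(textOf).join("");
}

// Array destructuring names the three parts of an element.
const [name, attributes, children] = para;
console.log(name, attributes.class, children.length);
console.log(textOf(para));
```

Because the model is ordinary arrays, objects, and strings, no special API is needed to read it.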

Error handling

As with any XML or MicroXML parser, you need to understand what happens when the input is erroneous. For example, paste this line of malformed MicroXML (the example from Part 1) into the parser's test text area:

<para>Hello, I claim to be <strong>MicroXML</para>

Figure 5 shows the output:

Figure 5. Test output from malformed MicroXML

In this case an error message (Parse error: name "para" in end-tag does not
match name "strong" in start-tag.) displays right below the text area, and the
JSON data model is blank. The test page also highlights the location of the error
in the input (in this case para). Clark's parser does not recover from
errors. It stops immediately and reports the error, much as an XML 1.0
parser does. Clark is also working on a version of the parser that supports error recovery.
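The stop-at-first-error behavior can be illustrated with a toy nesting check. This sketch is not Clark's parser and does not use its API. The checkNesting function and the shape of its result are hypothetical, and it inspects only tag names.

```javascript
// Toy nesting check: NOT microxml-js. It looks only at tag names and
// ignores comments, attribute syntax, and character rules.
function checkNesting(text) {
  const tag = /<(\/?)([A-Za-z_][\w.-]*)[^<>]*?(\/?)>/g;
  const open = []; // names of start-tags still waiting for an end-tag
  let match;
  while ((match = tag.exec(text)) !== null) {
    const [, slash, name, selfClosing] = match;
    if (slash === "") {
      // Start-tag: remember it unless it closes itself, as in <br/>.
      if (selfClosing === "") open.push(name);
      continue;
    }
    const expected = open.pop();
    if (expected !== name) {
      // Stop at the first error and report where it occurred.
      const message = expected === undefined
        ? `end-tag "${name}" has no matching start-tag`
        : `name "${name}" in end-tag does not match name "${expected}" in start-tag`;
      return { ok: false, index: match.index, message };
    }
  }
  if (open.length > 0) {
    return { ok: false, index: text.length, message: `missing end-tag for "${open.pop()}"` };
  }
  return { ok: true };
}

const result = checkNesting("<para>Hello, I claim to be <strong>MicroXML</para>");
console.log(result.ok, result.index, result.message);
```

Returning the character offset along with the message is what lets a caller highlight the offending spot in the input, as the test page does.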

Misplaced XML 1.0

XML 1.0 is still the dominant format in use and will be for a long time to come. The
most common errors that MicroXML parsers encounter will likely come from XML 1.0 features that were accidentally left in a document. Listing 1 is MicroXML meant to look like the XML flavor of HTML5. I omitted the <!DOCTYPE html>
declaration that is recommended for XHTML5 because it is not allowed in MicroXML. If you restore it and paste the result into the text area, you get the error in Figure 6:

The parser highlights only one character in the input text (the D in DOCTYPE) to mark the error. I added an oval highlight in Figure 6 to emphasize it. The error message is Parse error: expected "-". After <!, the parser expects the -- that opens a comment, so the D is the first character it cannot accept.

XML 1.0 namespaces

Another likely error is a leftover XML namespace declaration. Namespaces are eliminated from MicroXML. Figure 7 demonstrates the banning of the xmlns attribute:

Figure 9. Test output from MicroXML with erroneous colon in attribute name

This code is valid XML 1.0 because the xml prefix is a special
one that does not require declaration. You can no longer use even this prefix in
MicroXML, though, because colons are banned in attribute names as well. In this case the parser outputs Parse error: expected "=". It reads xml as a complete attribute name, expects = to follow, and finds the colon instead.
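Before handing a document to a MicroXML parser, it can help to scan for the XML 1.0 leftovers described in this section: the DOCTYPE declaration, the xmlns attribute, and colons in attribute names. The following is a rough sketch. The rule names and the findLeftovers function are hypothetical, and because it is a plain text scan rather than a parser, it can report false positives when similar text appears in character data.

```javascript
// Pre-flight scan for XML 1.0 constructs named in this article.
// A rough text scan, NOT a parser: rule names are made up for illustration.
const leftovers = [
  { rule: "doctype", pattern: /<!DOCTYPE\b/ },
  { rule: "xmlns-attribute", pattern: /\sxmlns\s*=/ },
  { rule: "colon-in-attribute-name", pattern: /\s[A-Za-z_][\w.-]*:[\w.-]+\s*=/ },
];

// Return the name of every rule whose pattern occurs in the text.
function findLeftovers(text) {
  return leftovers
    .filter(({ pattern }) => pattern.test(text))
    .map(({ rule }) => rule);
}

console.log(findLeftovers("<!DOCTYPE html>\n<html></html>"));
console.log(findLeftovers('<html xmlns="http://www.w3.org/1999/xhtml"></html>'));
console.log(findLeftovers('<p xml:lang="en">Hello</p>'));
```

A scan like this only points at likely trouble. The parser's own error report remains the authority on whether a document is acceptable.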

Character errors

MicroXML also restricts the way that you can represent characters. The most notable
restriction, and the one that I think is most likely to cause errors in the field, is the banning of any encoding except UTF-8. These days, more software is Unicode-aware and can produce UTF-8, but it can still be difficult even for developers to get that right.
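When you control the bytes before they reach the parser, you can check that they are well-formed UTF-8. The sketch below uses the standard TextDecoder with its fatal option, which is available in current browsers and in Node.js builds with ICU. The isValidUtf8 helper is a hypothetical name, and the check is independent of microxml-js.

```javascript
// Returns true when the bytes are well-formed UTF-8. With fatal: true,
// TextDecoder throws a TypeError on malformed input instead of
// silently substituting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// "café" in UTF-8: the é is the two bytes 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode("café");

// "café" in ISO-8859-1: the é is the single byte 0xE9, which is not valid UTF-8 here.
const latin1Bytes = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]);

console.log(isValidUtf8(utf8Bytes));
console.log(isValidUtf8(latin1Bytes));
```

Text that is pure ASCII passes the check under either encoding, so the problem usually surfaces only when the first accented or non-Latin character appears.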

In MicroXML, numeric character references must be hexadecimal (for example, &#xE9;). Decimal
references such as &#233; are not allowed. Figure 10 demonstrates the error from a leftover decimal reference:
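Converting decimal references to hexadecimal is mechanical, so it is a reasonable cleanup step for existing XML content. A minimal sketch follows. The decimalRefsToHex function is a hypothetical helper, and it rewrites every match in the text without regard to context.

```javascript
// Rewrite decimal character references (&#233;) as hexadecimal ones (&#xE9;).
// Hexadecimal references are left alone because "x" is not a decimal digit.
function decimalRefsToHex(text) {
  return text.replace(/&#([0-9]+);/g, (whole, digits) =>
    `&#x${Number(digits).toString(16).toUpperCase()};`);
}

console.log(decimalRefsToHex("<p>caf&#233; &#169; 2012</p>"));
```

The code point is unchanged. Only its notation moves from base 10 to base 16.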

Wrap-up

A specification means little without practical implementation. It was important
for supporters of MicroXML to step up and implement it. John Cowan led the way with
MicroLark, and James Clark wrote the first implementation of the community spec when
it emerged. I am implementing it for Python 3. As a developer, I usually find MicroXML parsers much easier to learn and work with than XML parsers.

The JavaScript interface to Clark's parser is loosely documented. If you want to dig
deeper, start by working with the JavaScript object output that it produces. The output is
similar to the XML Document Object Model (DOM) but much easier to process with common web coding techniques.
microxml-js is an easy way to begin developing your own MicroXML applications, including for mobile usage.
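As an illustration of those common techniques, the following sketch searches a model with ordinary array methods. The page value is hand-written rather than real parser output, findAll is a hypothetical helper, and character data is assumed to appear as plain strings among the children.

```javascript
// Collect every element with the given name, in document order.
function findAll(node, name) {
  if (!Array.isArray(node)) return []; // strings are character data
  const [nodeName, , children] = node;
  const here = nodeName === name ? [node] : [];
  return here.concat(children.flatMap((child) => findAll(child, name)));
}

// Hand-written model in the [name, attributes, children] shape.
const page = ["html", {}, [
  ["body", {}, [
    ["a", { href: "part1.html" }, ["Part 1"]],
    ["p", {}, ["See also ", ["a", { href: "spec.html" }, ["the spec"]]]],
  ]],
]];

// Pull the href attribute out of every a element.
const hrefs = findAll(page, "a").map(([, attrs]) => attrs.href);
console.log(hrefs);
```

The same pattern of recursion plus map, filter, or flatMap covers most of what a DOM query would otherwise be used for.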

XML area on developerWorks: Find the resources that you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
