Saturday, October 20, 2007

I was reading an article on developerWorks that I didn't write... It was about XML and Java. Sound like a tired, old subject? This one had an interesting take: how the choices you make when writing XML influence your Java application code. One of the central themes was the use of attributes vs. elements, and one of its basic points was that using attributes leads to faster code. I decided to test this theory.

I took a pretty simple XML document. It was actually one that I had written recently for a real problem. I created two versions of the same document. One used elements exclusively. The other used attributes wherever possible. I then tested how fast it was to access two pieces of data. One was "shallow", i.e. near the root of the document. The other was heavily nested. In both cases, the value was stored as an attribute in the attribute-favored document. I repeated each test over 10,000 iterations against three XML parsing technologies: the standard DOM implementation included with Java 6, XPath with dom4j, and the StAX implementation included with Java 6. For the DOM and dom4j techniques, I also measured the parsing time.
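To make the setup concrete, here is a minimal sketch of the kind of timing loop described above. It is not the code I ran: the class name MicroBenchSketch, the timeNanos helper, and the placeholder task are all hypothetical, and the warm-up pass is an assumption added so the JIT has a chance to compile the task before the timed run.

```java
class MicroBenchSketch {
    // Runs the task the given number of times and returns the elapsed
    // wall-clock time in nanoseconds. Hypothetical helper for illustration.
    static long timeNanos(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Placeholder work; a real test would look up a value in a parsed document.
        Runnable task = new Runnable() {
            public void run() {
                Integer.toString(42).length();
            }
        };

        timeNanos(task, 10000);                 // warm-up pass (an assumption, see above)
        long elapsed = timeNanos(task, 10000);  // timed pass over 10,000 iterations
        System.out.println("elapsed ns: " + elapsed);
    }
}
```

The same helper can wrap a DOM lookup, an XPath query, or a StAX scan, so that each technique is measured the same way.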

The results were a little surprising. I found no difference between attributes and elements for DOM. This was true both for traversing the tree and for parse time. I don't mean a negligible difference, I mean no difference at all. It was so surprising that I had to double-check my code a few times. The big difference for DOM was that the code for the attribute-favored approach was definitely simpler, which was one of the points in the developerWorks article.
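Here is a rough sketch of why the attribute code ends up simpler with DOM. The two tiny documents and the class name DomAccessSketch are made up for illustration and are not the documents from the test; only standard JAXP and DOM calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomAccessSketch {
    // Hypothetical sample documents, not the ones used in the benchmark.
    static final String ELEMENT_DOC = "<order><id>42</id></order>";
    static final String ATTRIBUTE_DOC = "<order id=\"42\"/>";

    // Parses a string into a DOM tree, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Element style: locate the child element, then read its text content.
    static String idFromElement(Document doc) {
        Element root = doc.getDocumentElement();
        return root.getElementsByTagName("id").item(0).getTextContent();
    }

    // Attribute style: the value is available directly on the element.
    static String idFromAttribute(Document doc) {
        return doc.getDocumentElement().getAttribute("id");
    }

    public static void main(String[] args) {
        System.out.println(idFromElement(parse(ELEMENT_DOC)));
        System.out.println(idFromAttribute(parse(ATTRIBUTE_DOC)));
    }
}
```

The gap in code size grows with nesting depth, since every extra level in the elements document means another child lookup before you reach the text.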

The dom4j story was different. It was slightly faster to parse the attribute document, but a bit faster to retrieve values from the element document. I was surprised by this, but the differences were very small, probably not statistically significant (I didn't test this, though). The code was virtually identical, of course, since we were using XPath for the traversal. dom4j was much slower than the DOM approach, which is again not too surprising.
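To show how little the XPath code changes between the two styles, here is a small sketch. Note the swap: it uses the XPath API that ships with the JDK (javax.xml.xpath) rather than dom4j, so that it runs without extra jars. The sample documents, the expressions, and the class name XPathAccessSketch are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathAccessSketch {
    // Hypothetical sample documents, not the ones used in the benchmark.
    static final String ELEMENT_DOC =
            "<order><customer><name>Ann</name></customer></order>";
    static final String ATTRIBUTE_DOC =
            "<order><customer name=\"Ann\"/></order>";

    // Evaluates an XPath expression against an XML string and returns
    // the string value of the result.
    static String evaluate(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The only difference between the two lookups is the "@" in the last step.
        System.out.println(evaluate(ELEMENT_DOC, "/order/customer/name"));
        System.out.println(evaluate(ATTRIBUTE_DOC, "/order/customer/@name"));
    }
}
```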

Finally, the StAX tests showed faster results for the attributes document. There was a larger difference than in any of the other tests. This makes sense because you don't have to go as far into the attributes document (a start element event contains the attribute data, but does not contain the text child node) and there are fewer events fired for an attributes document than for an elements document. For example, <foo>bar</foo> is three events (start element, characters, end element), but <foo value="bar"/> is two events (start element, end element). Also, StAX was faster than either DOM or dom4j, as you would expect. The StAX code for the attributes document was also slightly simpler than it was for the elements document.
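The event counts are easy to check with a small sketch like the one below. The element name foo, the attribute name value, and the class name StaxEventCountSketch are illustrative assumptions. Coalescing is switched on so that adjacent text is always reported as a single characters event.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEventCountSketch {
    // Hypothetical sample documents.
    static final String ELEMENT_DOC = "<foo>bar</foo>";
    static final String ATTRIBUTE_DOC = "<foo value=\"bar\"/>";

    static XMLStreamReader open(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Report adjacent character data as one CHARACTERS event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return factory.createXMLStreamReader(new StringReader(xml));
    }

    // Counts start element, characters, and end element events only;
    // the end-of-document event is ignored.
    static int countContentEvents(String xml) {
        try {
            XMLStreamReader reader = open(xml);
            int count = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        || event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.END_ELEMENT) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Element style: advance to the start tag, then read on through the text.
    static String readFromElement(String xml) {
        try {
            XMLStreamReader reader = open(xml);
            reader.nextTag();
            return reader.getElementText();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Attribute style: the value is already available on the start tag.
    static String readFromAttribute(String xml) {
        try {
            XMLStreamReader reader = open(xml);
            reader.nextTag();
            return reader.getAttributeValue(null, "value");
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countContentEvents(ELEMENT_DOC));
        System.out.println(countContentEvents(ATTRIBUTE_DOC));
        System.out.println(readFromElement(ELEMENT_DOC));
        System.out.println(readFromAttribute(ATTRIBUTE_DOC));
    }
}
```

In the attribute version the reader can stop at the start element event, while the element version has to keep pulling events until the text has been consumed.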

So if you're using DOM or StAX, you should definitely favor attributes over elements. It will be less code and, in the StAX case, faster code. If you're using dom4j with XPath (or maybe XQuery) based navigation, then it doesn't matter as much and an element-based format seems OK. This really is important, as a lot of these "modern" RESTful web services lean heavily on the elements format over the attributes format. This is doubly bad for web services, since there's obviously a much larger byte cost for element-style documents.

Update: As requested, I am attaching the source code I wrote for this little micro-bench. I tweaked it a little, as I realized there was an inefficiency in one of my dom4j methods. This tweak made dom4j faster on the attributes document, which is more consistent with the rest of the results. To run the code, you need dom4j and either Java 6 or Java 5 plus a StAX implementation. I ran it on my MacBook under Java 5 using Sun's StAX parser.