What is XML optimization

This is a set of techniques aimed at auditing the design metadata of any XML stream. Its purpose is to help
XML producers minimize the side effects of using XML, such as size overhead or versioning lockdown.

For instance, an increase in size results in more network bandwidth required to send or retrieve the equivalent
XML content, in addition to more memory required to store the XML
locally and more time required for the XML parser to process the stream.

XML optimization provides a report showing relevant figures to play with (see screen capture above). With this report in hand,
XML producers may choose to use dedicated XML automation tools to transform
XML streams according to defined rules, or they may find it even more appropriate to redesign the whole
XML metadata.

The figures calculated and displayed in the report were chosen because they are meaningful for almost any kind of
XML stream, i.e. each can translate into a substantial change in size or design. I tested over 50
XML files before settling on these figures.

XML optimization is new territory. Before writing this article, I browsed public internet sites, newsgroups and even quite a few research papers, and I did not find a single topic addressing it. Amazingly enough, I believe this is not only of real-world interest - when you know that every company out there in the high-tech industry now uses
XML somehow - it is as crucial as database tuning tools or network tuners. Why isn't this part of leading
XML tools (Xmlspy, Xmetal, Msxml, .NET API)? I don't know; maybe developers are content enough with their use of
XML without really seeing the impact of using XML instead of binary file formats and standard databases.

What is not XML optimization

XML optimization is not about compressing XML to any proprietary binary format. For that purpose, please don't hesitate to check out
xmill
(at&t) and XMLppm
(sourceforge). Their intent is to produce a binary format from XML by shrinking recurring XML
patterns. And indeed XML compresses very well, for either of these reasons :

element and attribute names appear many times, thus can be replaced with short tokens

lists of values may contain a lot of duplicated data, much like the records of an SQL join

Binary XML may be fine for some applications, but the XML immediately stops
being human readable. That is why such tools are usually applied at the transport level, not at the application level.
XML compression does not make XML optimization any less interesting : compression is the last resort when no smarter code or design principles can help - brute force, in other words.
XML optimization, on the other hand, reveals best practices and caveats, and is thus bound to help
XML producers learn about their own metadata.

A real world sample

Before going into details, here are a few links to an actual source
XML stream and to the report obtained by applying the tool to it :

XML optimization : structure in general

General figures about the XML stream are simple numbers to begin with.

Though the meaning of nb lines, nb elements and nb comments is obvious, it is interesting to know the effects on an
XML stream with a high comment ratio. XML producers usually add comments above, inside, or below the actual
XML elements to explain the hierarchy and the underlying design. But what they may not know is that in a lot of content management server (CMS) software, the
XML is left as is and sent to clients without removing these unnecessary comments, often making the transported data 10% larger than the comment-free size. In this case,
XML producers are more than encouraged to trim down their XML code. NB. nb CDATA sections and nb processing instructions play a similar role to nb comments.

Nb namespaces used is interesting as it reflects whether elements, attributes, and even the data itself use a lot of prefixes, which in turn may significantly increase the size of the
XML stream. To make the report really useful, figures are often displayed both as absolute values and as percentages.
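As an illustration of how such figures can be gathered, below is a minimal sketch built on Expat (the parser this tool is based on, see the Technical details section). This is just the idea, not the tool's actual code, and sample.xml is a made-up input file name :

#include <cstdio>
#include <cstring>
#include <expat.h>

// Counters gathered while the stream is parsed.
struct Stats { long elements; long comments; long commentBytes; };

static void XMLCALL OnStartElement(void *userData, const XML_Char *, const XML_Char **)
{
    ((Stats *)userData)->elements++;
}

static void XMLCALL OnComment(void *userData, const XML_Char *data)
{
    Stats *s = (Stats *)userData;
    s->comments++;
    s->commentBytes += (long)strlen(data) + 7; // account for the "<!--" and "-->" markup
}

int main()
{
    Stats stats = { 0, 0, 0 };
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetUserData(parser, &stats);
    XML_SetStartElementHandler(parser, OnStartElement);
    XML_SetCommentHandler(parser, OnComment);

    FILE *f = fopen("sample.xml", "rb"); // made-up input file
    if (!f) return 1;
    char buf[4096];
    long total = 0;
    for (;;)
    {
        size_t len = fread(buf, 1, sizeof(buf), f);
        total += (long)len;
        int done = (len < sizeof(buf));
        if (XML_Parse(parser, buf, (int)len, done) == XML_STATUS_ERROR || done)
            break;
    }
    fclose(f);
    XML_ParserFree(parser);

    printf("nb elements : %ld\n", stats.elements);
    printf("nb comments : %ld, comment overhead : %.1f%%\n",
           stats.comments, total ? 100.0 * stats.commentBytes / total : 0.0);
    return 0;
}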

XML optimization : structure in details

Fasten your seatbelt, there are many topics here.

Structure pattern

This reverse engineers the XML stream hierarchy by just processing the stream (it never reads the DTD, if any), giving both parent/children relationships and datatypes when they are recognized (including floats, integers, currencies, dates, urls, and emails).
What for? Reverse engineering the structure pattern is not only a unique feature, it also reveals whether the
XML is designed "vertically" (lots of elements), "horizontally" (lots of attributes), or somewhat diagonally. The structure pattern is a preliminary block that must be displayed before proceeding to the next topics, because it simplifies figuring out the design.
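To give an idea of the datatype recognition, here is a crude sketch of the kind of heuristics involved. These rules are deliberately simplified; real currency and date detection needs much more care :

#include <cstdlib>
#include <cstring>
#include <string>

// Crude value classifier : integers, floats, urls, emails, everything else.
const char *ClassifyValue(const std::string &value)
{
    if (value.empty()) return "empty";

    // URL : recognized by its scheme prefix
    if (value.compare(0, 7, "http://") == 0 || value.compare(0, 6, "ftp://") == 0)
        return "url";

    // Email : one '@' followed by at least one '.'
    const char *at = strchr(value.c_str(), '@');
    if (at && strchr(at, '.')) return "email";

    // Number : fully consumed by strtod; integer if no '.' or exponent
    char *end = 0;
    strtod(value.c_str(), &end);
    if (end && *end == '\0' && end != value.c_str())
        return value.find_first_of(".eE") == std::string::npos ? "integer" : "float";

    return "text";
}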

Flattening the structure pattern

Distinct patterns tells whether there is more than one main pattern in the
XML stream. Pattern occurrences, Pattern height (amount of lines) and Pattern size (in bytes) show the key characteristics of the main structure pattern. These figures are worth mentioning by themselves, but they are also preliminary to the next one.

Now what is pattern flattening? It is what is obtained by replacing child elements with attributes, where possible. Below is a sample before and after flattening :

Original XML :

<person>
  <firstname>John</firstname>
  <lastname>Lepers</lastname>
</person>

Modified XML :

<person firstname="John" lastname="Lepers"/>

Flattening the patterns makes use of what the W3C XML
recommendation calls empty-element tags, i.e. tags with no separate end tags, thus reducing the size by significant amounts. Flattening patterns has a lot of interesting effects : 1. because the hierarchy is flatter, parsing is faster; 2. it is much easier to diff
XML streams with flattened patterns.

Structure depth

The depth we are talking about is the element depth in the hierarchy, i.e. "1" for the root element, "2" for the direct children, and so on. A measure usually comes with figures such as : the minimum value over the whole
XML stream, the maximum value, the average value, and the standard deviation. A large standard deviation means that the
XML stream makes intensive use of indentation, < and > characters and end tags, which in turn increase the size.
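Gathering these figures is cheap while parsing. Below is a minimal sketch with Expat, again the idea rather than the tool's exact code :

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <expat.h>

// Records the depth of every element seen; statistics are derived afterwards.
// Hook up with XML_SetUserData(parser, &stats) and
// XML_SetElementHandler(parser, OnStart, OnEnd); start with current = 0.
struct DepthStats { int current; std::vector<int> depths; };

static void XMLCALL OnStart(void *userData, const XML_Char *, const XML_Char **)
{
    DepthStats *d = (DepthStats *)userData;
    d->depths.push_back(++d->current); // the root element gets depth 1
}

static void XMLCALL OnEnd(void *userData, const XML_Char *)
{
    ((DepthStats *)userData)->current--;
}

void ReportDepth(const DepthStats &d)
{
    if (d.depths.empty()) return;
    int mn = *std::min_element(d.depths.begin(), d.depths.end());
    int mx = *std::max_element(d.depths.begin(), d.depths.end());
    double mean = 0, var = 0;
    for (size_t i = 0; i < d.depths.size(); i++) mean += d.depths[i];
    mean /= d.depths.size();
    for (size_t i = 0; i < d.depths.size(); i++)
        var += (d.depths[i] - mean) * (d.depths[i] - mean);
    var /= d.depths.size();
    printf("depth : min=%d max=%d mean=%.2f stddev=%.2f\n", mn, mx, mean, sqrt(var));
}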

To better reveal the depth, we also list the number of elements at each depth.

The depth measure is visually displayed using a bar chart (numerical figures in a list often hide the trend).
The chart is rendered by the small built-in Javascript chart library described in the Technical details section below.

Structure node naming strategy

Element and attribute names are usually chosen to be self-descriptive. While this looks like an advantage, it has a size overhead simply because, even in English, the keywords enclosing the content statistically take a significant amount of space, contributing greatly to the overall stream size. This can be avoided by enforcing the naming strategy described below.
An element or attribute name is, in essence, a combination of letters and digits. With that in hand, why not make these names as short as possible? Let us take an example:
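Reusing the person element from the flattening sample, with a legend mapping p to person, f to firstname and l to lastname :

<person firstname="John" lastname="Lepers"/>

becomes :

<p f="John" l="Lepers"/>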

As with depth, the node naming strategy is also reflected visually in a bar chart, so the trend is easy to see.

The gain resulting from applying the smart node naming strategy to the XML stream is calculated. It is often 30% or more, which is very
significant.

Structure attributes

The Structure attributes indicator reveals how uniformly attributes are dispatched within elements. Besides the standard number of attributes per element (with min, max, mean and standard deviation), there is the disorder ratio.
The disorder ratio attempts to show whether attributes are listed in the same order across element occurrences. That is of course an average, because each element may have any number of associated attributes. According to the W3C
XML recommendation, there is no particular ordering between attributes; it is simply a good habit to have attributes always follow the same order.
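For instance, in the made-up sample below, the two book elements carry the same structure but list their attributes in different orders, which raises the disorder ratio :

<book title="The Round Door" year="1999"/>
<book year="2001" title="The Blue Door"/>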

Structure namespaces

XML
namespaces are declared using a special attribute of the form xmlns:supplier="http://www.namespaces.com/supplier" and refer to a set of element and attribute names with a dedicated semantic meaning. Elements and attributes in a namespace are prefixed with it, for instance supplier:orderID. Namespaces are not required in
XML streams, but they carry special meanings and may simplify data binding, as long as the real meanings of the namespaces are made public and available to everyone. Any number of namespaces can be used, not only one. A namespace must always be declared before it is used. The URL in the declaration does not need to point to a real resource; it is there only for global uniqueness. Below is a sample (the order element and its content are made up) for the supplier namespace:
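<order xmlns:supplier="http://www.namespaces.com/supplier">
  <supplier:orderID>42</supplier:orderID>
</order>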

When namespaces are used, the report shows the ratio of namespace use and the list of namespaces.

Not only does the use of namespaces strongly change the underlying XML design, it also affects the node naming strategy, and in turn the overall size of the
XML stream.

Content itself

Even though the content itself is not part of the XML metadata, there are many ways to produce size overhead. The simplest, of course, is to dump data in
XML format from a relational database system without factorizing duplicate values. It is easy to see that there is a lot to gain here.
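A made-up example of such a dump, where the supplier name is repeated on every row instead of being factorized :

<row supplier="ACME" product="bolt"/>
<row supplier="ACME" product="nut"/>
<row supplier="ACME" product="screw"/>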

Raw content

The size of the content in element or attribute values exhibits a trend which can be described using the minimum, maximum, average, and standard deviation.

In addition, the ratio of elements and attributes with no values is shown. If the ratio is high, it is fair to question whether the design of the metadata is good.

A somewhat odd indicator is the ratio of multiple-part values. Below are two samples of multiple-part values for the <book> element :

<book>
The name of this book is so inadequate for a general audience
that it has been decided not to print it.
</book>
...
<book>The Round Door
<year>1999</year><price>20$</price>Part II
</book>

Content correlation

Content correlation is an in-depth examination of lists of values that reveals valuable things. The first indicator is related to duplication, i.e. how often the same values appear again and again; it includes the max, average and standard deviation. The second indicator is a ranking : it shows the most frequently seen value across all lists of values.
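As a minimal sketch of the duplication indicator, assuming the values have already been extracted from the stream into a plain list (the tool collects them while parsing), the counting itself boils down to :

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Counts how often each value appears and reports the most frequent one.
void ReportDuplication(const std::vector<std::string> &values)
{
    std::map<std::string, int> counts;
    for (size_t i = 0; i < values.size(); i++)
        counts[values[i]]++;

    std::string top;
    int topCount = 0;
    double mean = 0;
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it)
    {
        mean += it->second;
        if (it->second > topCount) { topCount = it->second; top = it->first; }
    }
    if (!counts.empty()) mean /= counts.size();

    printf("distinct values : %u\n", (unsigned)counts.size());
    printf("max occurrences : %d, average : %.2f\n", topCount, mean);
    printf("most seen value : \"%s\" (%d times)\n", top.c_str(), topCount);
}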

Content spacing and indentation

Indentation is often used in XML streams, as they are often designed and read by humans. But indentation produces a significant increase in size. The report shows the new size of the
XML stream with no indentation at all; the gain is often around 30%.
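A rough sketch of what removing indentation amounts to is shown below. It drops whitespace-only text between tags, which assumes such whitespace carries no meaning; CDATA sections and '>' characters inside attribute values are also ignored, which a real implementation must handle :

#include <cctype>
#include <string>

// Drops whitespace-only text between tags. Caveats : whitespace-only text is
// assumed to carry no meaning, and CDATA sections and '>' characters inside
// attribute values are not handled; a real implementation must parse properly.
std::string StripIndentation(const std::string &xml)
{
    std::string out, run;
    out.reserve(xml.size());
    bool inText = false;
    for (size_t i = 0; i < xml.size(); i++)
    {
        char c = xml[i];
        if (c == '<')
        {
            // Flush the pending text run, unless it is pure whitespace.
            bool allSpace = true;
            for (size_t j = 0; j < run.size(); j++)
                if (!isspace((unsigned char)run[j])) { allSpace = false; break; }
            if (!allSpace) out += run;
            run.clear();
            inText = false;
            out += c;
        }
        else if (inText)
            run += c;
        else
        {
            out += c;
            if (c == '>') inText = true;
        }
    }
    return out;
}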

Summary of important measures

Out of the many figures from the HTML reports, several deserve some introductory
explanations :

Flattening patterns : that's the design rule of replacing 1-cardinality elements
by attributes. Sounds awful, but a lot of space is gained here.

Indentation and multiple spaces : beautifying your XML stream is ok, as long
as you're dealing with tiny streams. Indented XML streams are, simply put, up to twice as
large. Just keep this in mind if your server-side component does not scale
and you're wrecking the entire network bandwidth.

Disorder ratios : this is the kind of measure that by itself helps improve
the schema design, and may incidentally reduce XML bug fixing.

Correlation in content : statistically speaking an XML stream has a lot of
overhead in size just because content is duplicated rather than factorized.

Technical details

Technically, the tool is based on James Clark's Expat
(royalty-free SAX parser). The executable, which is a report generator on top
of a static library, can be divided into three parts :

betterxml.dsp (betterxml.exe), a report generator, contains
mostly the HTMLWriter class, which is straightforward and reuses HTML templates
stored in the .rc resource file. All strings are localized and ready for a foreign
release, should anyone be interested. The HTML reports have a built-in chart library
(limited to bar charts) that displays charts using Javascript.

SaxAnalyzer.dsp (SaxAnalyzer.lib), an XML extraction library,
with the following set of classes :

IXmlStats : API to expose measures. Inherits IUnknown.

AppLogic.cpp : callbacks from the XML parser, calculations of all measures

Element.cpp : element + attribute API.

HtmlParser.cpp : general purpose HTML parser, used to extract details that
expat does not see.

xmlparser.dsp (xmlparser.lib), the expat library itself.
Both VC6 and VC7 workspaces are provided.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

Dear Stephane Rodriguez. Your article is very nice and informative, but I am unable to compile and run betterxml_src.zip project source files in visual studio 2008. The error message is given below.
"The source control provider associated with the solution could not be found."

hello sir,
i am a software professional. we are developing a web based application
in which we need to parse more than 10 MB XML files from different web sources. the problems we are facing are:
1) how to parse an XML file before it is completely downloaded
2) will SAX help in this case, as we need to get values as they appear
3) what will be the best way to parse these incomplete XML files

Hi again Stephane,
Here I am again, with comments, no silly conversion question this time
Regarding your remarks on binary XML : xmill and xppm are very far in performance (speed, memory consumption, ...) and ease of use from more recent XML binarizers like BinXML (www.expway.com).
You're right, data are not human readable while binarized, but with BinXML libraries :
- decoding is made through standard APIs like SAX, DOM, making binarization transparent for user (except for manual handling of course)
- decoding process is *much* faster (up to several times, not only a few %) than textual parsers, like Xerces-c, libXML, MSXML, or Java textual parsers etc... Gain factor highly depends on XML structure and grammar.
- Compression rates are up to several times those of a classical textual compressor like zip -compression process uses Schema grammar knowledge-
- decoder size is much smaller than usual parsers, allowing embedded applications, etc...

To be honest : I am working at Expway. But all that is the truth and only the truth. Our XML binarization format (called BiM) is normalized and part of MPEG-7 standard. It is currently being adopted by worldwide major broadcast and telecom industry companies.

CodeProject is changing. Now users are required to create an account and log on before they can download the zip files. Needless to say, this is a shift in how authors relate to article sharing and to the passion of spending time creating articles, with source code, about (hopefully) valuable content for others.

As author of this article, I haven't been given the opportunity to block this shift in the spirit of the site, although I am heavily against it.
I have been told (through flames and miscellaneous kinds of insults by some sectarian Codeproject members) that the logon enforcement was a consequence of an action aimed at preventing users from downloading the entire Codeproject site.
A lower profile solution should have been adopted instead of that nasty, annoying, stinking one. Scripting techniques are available and would have been a good and meaningful alternative.

As a consequence, as long as a safer Codeproject user policy is not back in place, I, as author of this article, should not be held liable for the absence of source code along with it. Furthermore, I should not be by any means liable for any damage resulting from the use of the binaries and source code in the zip files attached to this article.
On the other hand, until then, support of this article is discontinued.

I understand you're upset about the change but I don't understand why you are posting such comments. Making a comment "I should not be by any mean liable for any damage resulting of the use of the binaries and source code in the zip files attached with this article" is a strange statement - it almost seems like you're setting the stage for something. We're an open forum with no barrier (apart from registering) to uploading and downloading source code and binaries. If you upload code or binaries that deliberately includes code that is damaging to others then you are liable for that code - it's the same as if you had emailed that code to an unsuspecting user yourself.

We are a free site and we intend to keep the site free to use. We have absolutely no hidden agenda to make it a pay-per-view site. The barrier to downloading code is minor. You don't need to receive the newsletters, you don't need to post messages, you don't need to do anything but register. If you then never use your account for anything other than downloading files then that's perfectly OK.

I have a concern about your element and attribute naming strategy. As you wrote, those elements should be self-descriptive and human readable. I agree with that. However, if we start using a strategic naming code and a look-up table to find the right final name of an element, this will defeat the purpose of XML document readability.

Furthermore, not only will this strategy make the XML source unreadable (for a human), but all stylesheets used to transform that source will also be unreadable.

I still do think your tool is very useful, in the way it provides us a lot of metrics about an XML document (nb of lines, nb of space characters, indentation usage, depth analysis). Thanks for having built and shared it.

I suppose the next step will be to provide a tool which analyses the different areas of an XML document used by a stylesheet during an XSL-T transformation.
This is another tool I am looking for (I do not think I will have the courage to write it myself)

Converting this to <person firstname="John" lastname="Lepers"/> is not a good idea. The reason is that it limits the extensibility of XML. It is also discouraged in the XML standard.

Another optimization is as follows: If a variable has the default value, do not write it. In many applications, most of the parameters are not modified by the user; i.e. they all contain the default value. So, if a parameter has the default value, just do not write it to the file. In my applications, this reduces the size more than 10 times. Of course this won't be the case for some other applications, but it will help a lot for most of the applications.

I was concerned about this decision too. Flattening the result tree fragment (RTF), for one, would make it a nightmare to still tie the XML to a schema, and secondly it raises a touchy point about parameters. If the content is data, many XML experts agree that it should remain as a node in the element. Data such as the first and last name of a person is arguably data. While this will create a more verbose vocabulary, and increase the overall number of bytes that need to be processed, it helps make the data easier to manage. If there were an ID number associated with this person, or perhaps a description of the individual, i.e. moustache="yes", then perhaps that data would be better suited as a parameter.

I am also concerned with the decision to reduce the element names to single letters. While this again will reduce the overall number of bytes, it also eliminates another benefit of XML, and that is human readability. If you look at the example again, the second, reduced RTF is completely illegible without the legend.

I think the effort to reduce the bandwidth requirements of XML is noble, especially for a thin client such as a PDA or cell phone, but I think this solution sacrifices too many "features" of XML.