Summary

Most readers of The Code Project are familiar with various types of XML parsers in the .NET environment. This article series introduces to The Code Project community a new XML processing model called VTD-XML. It goes significantly beyond those traditional models by fundamentally overcoming many tough technical challenges hampering SOA and enterprise XML application development. The first part of this series demonstrates the benefits of VTD-XML as an indexer and as a parser with integrated XPath. The second part shows you how to benefit from VTD-XML's cutting, editing, and modifying capabilities, as well as introduces the concept of "document-centric" XML processing. The third part of this series shows you how to code your application in the C version of VTD-XML.

Public Enemy #1: DOM's Problem with Modifying XML

Suppose a DOM-based application modifies a particular text node of an XML document. Below are the steps necessary to accomplish that:

Decode characters

Create string objects by taking apart the input document

Allocate node objects to build the DOM tree

Navigate to the text node (manually or by XPath)

Attach a new text node

Encode characters

Byte concatenation

Garbage collecting node and string objects
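To see those round trips concretely, here is the same single-text-node update done with a standard DOM implementation (Java's built-in `javax.xml` DOM, used purely for illustration; the element names are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DomUpdate {
    static String update(String xml) throws Exception {
        // Steps 1-3: decode characters, extract strings, allocate the node tree
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Steps 4-5: navigate to the text node and attach the new value
        Node status = doc.getElementsByTagName("status").item(0);
        status.setTextContent("shipped");

        // Steps 6-8: re-encode and re-serialize the ENTIRE document, then let
        // the garbage collector reclaim every node and string created above
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(update("<order><status>pending</status></order>"));
    }
}
```

Note that the whole document is re-encoded and re-serialized even though only a handful of bytes actually changed.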

But if you focus on the objective, I think many readers will realize that the process outlined above doesn't really make sense. It is, in fact, absurd. DOM processing incurs at least the following three round-trip overheads in those steps:

Every time a character is decoded, it eventually needs to be encoded again.

Every time a document is taken apart for any change, it needs to be put back together (by concatenation).

Every time an object (e.g., strings, nodes etc.) is allocated, it will eventually go out of scope and be garbage collected.

Because those round-trips pretty much restore the document to its original state, they are nothing but a waste of CPU cycles and memory. Notice that a human can modify a text node far more efficiently using a text editor. To edit a text node, just open the document with Notepad, move the cursor to the text node, make the change in place, and save it! This time, the update is "incremental", meaning it does not touch irrelevant parts of the document. And if we humans can edit XML like this, why can't XML parsers?

To me, the answer to this question reveals some of the deep-rooted technical problems in software development today. Below are some of my observations on this topic:

It significantly impacts your application performance: When applications process XML in a read-only fashion, the baseline performance is determined by XML parsing. If applications both read and write XML data, the baseline performance is typically cut in half (as serialization and de-serialization cost roughly the same).

It is a common, but deep, problem: Have you ever wondered why, given that XML is ubiquitous, nobody seems to be complaining? One way to look at it: because this is the way things have always been, everyone seems to have gotten used to it. To make matters worse, no solutions exist to make the problem look obvious. So, we end up with a ubiquitous issue that is surprisingly non-obvious and from which there is almost no escape.

Hidden from the OO perspective: If you live in a pure OO world, the redundant de-serialization/serialization process-- the textbook approach to XML processing-- is very much the right thing to do. So, this problem is again hidden.

It is also worth noting that there is nothing small about this problem. It is, in my view, the biggest and toughest technical issue in enterprise IT today. Consider the ESB (Enterprise Services Bus) example I used in the first part of the series. Right now, the situation is, in my view, bad beyond belief. Because of inefficient DOM parsing, those ESBs are already considered slow in read-only situations, especially for large XML messages. If the desired operations (such as policy enforcement) require both reading and writing, it adds insult to injury: no matter how trivial the change is, the entire XML message needs to be re-serialized, quickly degrading the overall performance to an unbearable level.

So, my question becomes: Am I the only one seeing the elephant in the room?

How VTD-XML Changes the Picture

Simply put, VTD-XML provides a solution so spectacular that the problem is completely gone!

The first part in this article series introduced VTD-XML as a memory efficient, high performance XML parser with integrated indexing and XPath. Virtually every technical benefit of VTD-XML is, in one way or another, the result of non-extractive parsing, meaning the original XML text is loaded in memory and fully preserved. However, the most important benefits of VTD-XML --the ones that truly set it apart from other XML processing models-- lie in its unique ability to manipulate XML document content at the byte level. Below are three distinct, yet related, sets of capabilities available in the latest version of VTD-XML.

Incremental XML modifier—You can modify an XML document incrementally through the XMLModifier, which defines three types of "modify" operations: inserting new content into any location (i.e., offset) in the document, deleting content (by specifying the offset and length), and replacing old content with new content—which effectively is a deletion and insertion at the same location. To compose a new document containing all the changes, you need to call XMLModifier's output(...) method.
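The idea behind such an incremental modifier can be sketched in a few lines: edits are recorded as (offset, delete-length, insertion) records, and a new document is materialized only when output is requested, with untouched bytes copied verbatim. This is a toy illustration of the concept, not XMLModifier's actual implementation:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Toy incremental modifier: edits are deferred until output() is called.
// A sketch of the idea behind VTD-XML's XMLModifier, not its source code.
public class IncrementalModifier {
    record Edit(int offset, int deleteLen, byte[] insertion) {}

    private final byte[] original;
    private final List<Edit> edits = new ArrayList<>();

    IncrementalModifier(byte[] original) { this.original = original; }

    void insert(int offset, byte[] content) { edits.add(new Edit(offset, 0, content)); }
    void delete(int offset, int len) { edits.add(new Edit(offset, len, new byte[0])); }
    void replace(int offset, int len, byte[] content) { edits.add(new Edit(offset, len, content)); }

    byte[] output() {
        edits.sort((a, b) -> Integer.compare(a.offset(), b.offset()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        for (Edit e : edits) {
            out.write(original, pos, e.offset() - pos); // copy untouched bytes verbatim
            out.writeBytes(e.insertion());              // splice in the new content
            pos = e.offset() + e.deleteLen();           // skip any deleted region
        }
        out.write(original, pos, original.length - pos);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        IncrementalModifier m = new IncrementalModifier("<a><b>old</b></a>".getBytes());
        m.replace(6, 3, "new".getBytes()); // "old" occupies offsets 6..8
        System.out.println(new String(m.output()));
    }
}
```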

XML slicer and splicer—You can use a pair of integers (offset and length) to address a segment of XML text so your application can slice the segment from the original document and move it to another location in the same or a different document. The VTDNav class exposes two methods that allow you to address an element fragment: getElementFragment(), which returns a 64-bit integer representing the offset and length value of the current element, and getElementFragmentNs() (in the latest version), which returns an ElementFragmentNs object representing a "namespace-compensated" element fragment. The latest version also transparently supports transcoding, so you can perform cutting and pasting across documents with different encoding formats.
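A rough sketch of the offset/length addressing idea: packing both values into one 64-bit long avoids allocating an object per fragment, and slicing then becomes plain byte copying. The bit layout shown (offset in the lower 32 bits, length in the upper 32) follows my reading of VTD-XML's convention, but treat this code as illustrative rather than the library's implementation:

```java
// Sketch of fragment addressing via a single 64-bit long.
// The exact bit layout is illustrative.
public class Fragment {
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }
    static int offset(long f) { return (int) f; }          // lower 32 bits
    static int length(long f) { return (int) (f >>> 32); } // upper 32 bits

    // Slice a fragment out of a document without intermediate strings
    static byte[] slice(byte[] doc, long f) {
        byte[] copy = new byte[length(f)];
        System.arraycopy(doc, offset(f), copy, 0, length(f));
        return copy;
    }

    public static void main(String[] args) {
        byte[] doc = "<root><item>42</item></root>".getBytes();
        long frag = pack(6, 15); // "<item>42</item>": offset 6, 15 bytes
        System.out.println(new String(slice(doc, frag)));
    }
}
```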

XML editor— You can directly edit the in-memory copy of the XML text using VTDNav's overWrite(...) method, provided that the original tokens you're overwriting are wide enough to hold the new byte content.
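A minimal sketch of the in-place editing idea (not the overWrite() implementation itself): if the new bytes fit within the original token's span, the edit touches only those bytes, and the rest of the document is never re-serialized. The space-padding strategy here is my own simplification:

```java
import java.nio.charset.StandardCharsets;

// In-place token overwriting: only the token's own bytes are touched.
// Padding with spaces keeps the document length unchanged; this is a
// simplification, not VTD-XML's actual overWrite() behavior.
public class OverwriteInPlace {
    static boolean overwrite(byte[] doc, int offset, int len, String newText) {
        byte[] nb = newText.getBytes(StandardCharsets.UTF_8);
        if (nb.length > len) return false;     // doesn't fit: caller must fall back
        System.arraycopy(nb, 0, doc, offset, nb.length);
        for (int i = offset + nb.length; i < offset + len; i++)
            doc[i] = ' ';                      // pad the leftover span
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = "<qty>100</qty>".getBytes(StandardCharsets.UTF_8);
        overwrite(doc, 5, 3, "7");             // token "100" occupies offsets 5..7
        System.out.println(new String(doc, StandardCharsets.UTF_8));
    }
}
```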

Using VTD-XML as an incremental modifier to update the text node, you basically navigate the VTD records to the right location, stick in the change, and generate a new document-- exactly the same way you would do it with Notepad. Below is a simple application updating a text node using VTD-XML:
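(The following is a sketch using the Java version's API; the C# version mirrors these class and method names. The file names and the XPath expression are made up for illustration.)

```java
import com.ximpleware.*;
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UpdateTextNode {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        vg.setDoc(Files.readAllBytes(Paths.get("old.xml")));
        vg.parse(false);                        // parse without namespace awareness
        VTDNav vn = vg.getNav();

        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/order/status/text()"); // navigate to the text node
        XMLModifier xm = new XMLModifier(vn);

        int i = ap.evalXPath();
        if (i != -1)
            xm.updateToken(i, "shipped");       // record the change; nothing is copied yet

        // Compose the new document: untouched bytes are copied verbatim,
        // only the modified token is re-encoded.
        xm.output(new FileOutputStream("new.xml"));
    }
}
```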

XML Processing: Object Oriented vs. Document Centric

Traditional XML processing models (such as DOM, SAX, and various object data binding tools) are designed around the notion of objects. The XML text-- merely the output of object serialization-- is relegated to the status of a second-class citizen. You base your applications on DOM nodes, strings, and various business objects, but rarely on the physical documents. If you have followed my analysis so far, it should be obvious that this object-oriented approach to XML processing makes little sense, as it causes performance hits from virtually all directions (an in-depth discussion of the topic can be found in "the performance woe of binary XML"). Not only are object creation and garbage collection inherently memory- and CPU-intensive, but applications also incur the cost of re-serialization with even the smallest changes to the original text.

What is "document-centric" XML processing? In non-extractive parsing, the XML text --the persistent data format-- is the starting point from which everything else comes about. Whether it is parsing, XPath evaluation, modifying content, or slicing element fragments, by default, you no longer work with objects. You only do that when it makes sense. More often than not, you treat documents purely as syntax, and think in bytes, byte arrays, integers, offsets, lengths, fragments, and namespace-compensated fragments. The first-class citizen in this paradigm is the XML text. And, the object-centric notions of XML processing, such as serialization and de-serialization (or marshalling and un-marshalling), as shown in Figure 1, are often displaced, if not replaced, by more document-centric notions of parsing and composition. Increasingly, you will find that your XML programming experience is getting simpler. And not surprisingly, the simpler, intuitive way to think about XML processing is also the most efficient and powerful (See Table 1 for the technical comparison of DOM and VTD-XML).

The Inflection Point for SOA

After reading many great articles posted on The Code Project, it seems to me that many readers in the community are career-long, ardent practitioners of the object-oriented methodology. But in my view, OO may not always be the right tool for the job. The fact that the serialization/de-serialization problem becomes invisible when you approach XML processing from a pure object-oriented perspective tells me that this design approach has practical limitations. In the world of distributed computing, the consensus is that objects don't distribute well across process boundaries (e.g., across the network). Starting in the early '90s, the distributed computing community spent about ten years attempting to figure out ways to make distributed objects (i.e., CORBA) work as if those objects resided in the same address space. But the effort was eventually abandoned due to numerous technical issues (please see "The Rise and Fall of CORBA" for further reading). Among those issues are tight coupling, rigidity, and stifling complexity. It is those painful lessons of CORBA that led us to SOA, which achieves loose coupling and simplicity by explicitly exposing the XML messages (the wire format) as the public contract of your services. In other words, when building loosely coupled services, think in messages.

How does "document-centric" XML processing fits in, and enhances, the technical foundation of SOA? Simply put, by treating XML as documents (instead of serialization of objects), you gain not just loose coupling and simplicity, but efficiency as well. It usually doesn't make sense to think of XML in objects. Consider an SOA intermediary application that aggregates multiple services. Pretty much all it does is to splice together fragments from multiple documents to compose a single large document and shove it upstream. Where do objects come into the picture? Take the services dissemination point as another example. It is the exact opposite: large XML documents get split into multiple smaller ones, each of which is then forwarded to the respective recipient (downstream services) for further processing. Do you see the need to allocate a lot of objects? As more and more services come alive, you will discover that the composite services/applications are mostly about natively slicing, editing, modifying, splicing, and splitting documents. Traditional, object-oriented, design patterns are going to be less applicable. And, the moment you step across the boundary of OO and into a document-centric world, you will find that both the problem and solution become obvious. Everything suddenly seems to make sense again. But doing so may not be easy: you first need to have the courage to refuse to go along just to "get along," and do something that nobody around you seems to be doing. The web is undergoing a profound transformation around the concept of SOA. The experience you gained from doing SOA the right way should prove to be both rewarding and valuable in the end.

Figure 2. Slicing and Splicing XML Documents to Aggregate Services

In short, to prepare yourself for the upcoming wave of service-oriented computing, right now may be a good time to start embracing this "document-centric" view of XML processing.

Code Examples

For the rest of this article, I am going to show you how to use VTD-XML's cutting, splitting, editing, and transcoding capabilities to manipulate XML content with both flexibility and efficiency. To help you understand what the code does, each example places the input and output of the application side by side.

Recap and Conclusion

I hope that this article has done its job of describing to you how VTD-XML fundamentally solves the common issues of DOM and SAX. The simplest problem, in my view, is also the biggest problem, not just because it affects everyone, but because it is so deep that you have gotten used to it already. This is where VTD-XML again stands out. Be it parsing, indexing, modifying, cutting, splitting, or splicing XML documents, VTD-XML excels in virtually every aspect imaginable, while breaking new ground in others. But, to reap the full benefit of VTD-XML, you need to first step out of the comfort zone of object-oriented thinking and start to think of XML as documents. As the world of IT transitions towards a service-oriented architecture, I am confident that you will discover, in many ways, that the "document-centric" approach to XML processing naturally lends itself to designing and implementing your SOA infrastructure. To me, this is why VTD-XML is the future of XML processing and why the best is yet to come!

If you have visited VTD-XML's project site, you have probably noticed that VTD-XML has a C version that delivers the exact same set of functionality as its C# counterpart. But unlike C#, C is neither OO nor based on a VM. Worse, C doesn't even support exceptions. So, porting VTD-XML from C# to C poses some interesting challenges. In the next part of this series, I will discuss how to overcome those challenges to maximize code reuse and reduce the porting effort to a minimum.


About the Author

Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.

Comments and Discussions

I have some ideas on how the parser could manage XML document editing:

(I don't know if this is already implemented)

-When only reading the XML file, the parser can read directly from the file, without copying it into memory (this is good for big files).

-When the application starts to add and remove elements and attributes, if the file size is small these operations can be done in RAM.

If the file is medium-sized to big, then we could have an in-memory "slice keeper" (we can use another name) to keep track of the XML slices, or segments, that form the XML document being modified.

Example:

Suppose that an application opens an XML file of 10MB and wants to make a lot of updates on it: adding elements, child elements, and attributes, updating attributes, removing others, and removing elements...

When the application has not modified the XML document, there is only one "slice", or none, since the document is entirely in the file.

Once the application adds a new element, we may have 3 slices:

1. The first part of the document, which is in the file.
2. The added elements, already in raw XML format, which are in memory.
3. The remaining part of the document, which is in the file.

If the application wants to walk through the XML document, the parser may start on the first slice (in the file), then continue on the next slice (in RAM), and then on the file again.

The slices could be stored in an array, or could be allocated separately, each having a pointer to the next one. The first option uses only one memory block allocation (and is therefore faster).

As the application makes new changes to the document, more slices are created. (Except in the cases where the updated data has the same size as the previous data, because the XML file is not updated on the fly, only at the end of all the work. And it can be saved to a different file.)

If an element is removed, we may have cut the document into 2 slices: one slice of the preceding data, and another that begins where the removed element ends.

At the moment the application requests the whole raw XML data from the parser, it may rebuild the document by joining the slices together: simply concatenating the strings (in case the application wants the XML document in memory) or copying one slice at a time to the destination file (in case the application wants to save the XML document to a file).

And importantly: all of this should be transparent to the application (and to the programmers who code it, too).

This implementation would be useful in cases where we want to work on big XML files, and even on small ones.

Advantages:

-It is an easy way to make an XML updater (or XML writer).
-There is no need to have the whole XML file loaded in memory.
-There is no need to have an object in memory for each element or attribute.
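The slice bookkeeping described in this comment is essentially a piece table, as used by text editors. A minimal sketch (all names and structure are illustrative, not part of VTD-XML; an array stands in for the on-disk file):

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Minimal piece-table sketch of the "slice keeper" idea: the document is a
// sequence of slices, each pointing either at a range of the original bytes
// or at newly inserted content.
public class SliceKeeper {
    record Slice(byte[] source, int offset, int len) {}

    private final List<Slice> slices = new ArrayList<>();

    SliceKeeper(byte[] original) {
        slices.add(new Slice(original, 0, original.length)); // one slice initially
    }

    // Insert new content at a document offset: the touched slice is split in
    // two, and a third, in-memory slice is placed between the halves.
    void insert(int docOffset, byte[] content) {
        List<Slice> result = new ArrayList<>();
        int pos = 0;
        for (Slice s : slices) {
            if (docOffset >= pos && docOffset <= pos + s.len()) {
                int split = docOffset - pos;
                if (split > 0) result.add(new Slice(s.source(), s.offset(), split));
                result.add(new Slice(content, 0, content.length));
                if (split < s.len())
                    result.add(new Slice(s.source(), s.offset() + split, s.len() - split));
                docOffset = -1; // inserted; copy the remaining slices unchanged
            } else {
                result.add(s);
            }
            pos += s.len();
        }
        slices.clear();
        slices.addAll(result);
    }

    // Rebuild the document by concatenating the slices in order.
    byte[] output() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Slice s : slices) out.write(s.source(), s.offset(), s.len());
        return out.toByteArray();
    }

    public static void main(String[] args) {
        SliceKeeper k = new SliceKeeper("<a></a>".getBytes());
        k.insert(3, "<b/>".getBytes());
        System.out.println(new String(k.output()));
    }
}
```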

Thanks for the suggestion...
Regarding the API part, I believe that, used properly, VTD-XML's interface
is pretty well designed... it might be a bit different from DOM or SAX, but it has its own characteristics to get used to

regarding the makefile/DLL, we are working on that front. I can put you in touch with our C developer to learn your perspective on that...

I have learned that VTD-XML outperformed SAX in a benchmark. In the benchmark, SAX simply parsed through the XML document without any other processing, while VTD-XML parsed through the XML document and formed the corresponding LC entries.

My question is: since SAX did nothing besides parsing the XML document (and reporting every event) during the benchmark, how did VTD-XML outperform SAX when it needed to parse and construct the LC entries at the same time?

we started to play with your parser in a fairly large, XML-heavy project. It is a data extraction system which can mirror a database without direct access to it (just using the user interface, like a web frontend).

we mainly use xml to normalize data, then we import it to a database. we have file sizes from 10 - 100 MB. i wrote a small xml viewer with your system and i can load and display our biggest XML file in just a matter of a few seconds, that's brilliant man. i have not seen ANY xml application that even comes close to that speed. opening that same xml file, let's say with firefox, takes AGES...

i can say that the performance really blew me away. it took me a while to sort out a few unicode problems, but aside from that your parser works flawlessly.

the only beef i have is the error handling. i have seen lots of empty catch blocks in the code, which makes it hard to pin down problems...

but aside from that, it's a really smart system, kudos...

i hope you continue to work on it. i'm sure you can easily turn this into a commercial product if you work on the details a bit more...

Why does the file size increase by a margin of 1.3x to 1.5x? There is another protocol called BXML that decreases the file size and supposedly has better performance than XML. How does VTD rate against BXML? Also, why is the size of the file so inflated with VTD? How can this be alleviated?
I'm not going to get buy-off on using VTD unless I can show that the file size is going to be smaller.

Not often do you see people go against traditional OO and embrace a bloat-free way to freedom for the edge cases.

I've known about your impl for some time now off some protocol specific lists (we have a common interest I guess) and I am happy it is receiving a warm welcome here. My main focus shifted from this topic into specific scenarios so I left it all behind, including playing with VTD.

You are facing an uphill struggle from here though, as the world we're in is about selling, runtime if at all possible, bloatware first and foremost, especially in SOA land.

I actually think that we are on the verge of a turning point in which people increasingly realize the weaknesses of OO... I am not against OO per se; instead I am promoting a more balanced approach to app development in which sensibility is placed ahead of formality.

I like that it is going uphill; at least it isn't flat or going downhill

I've read all your articles, and I agree with you on the disadvantages and advantages of the various methods of parsing and its effect on memory, performance, redundancy, etc.

However, OO programming is not the most suitable method for programming super fast and memory-efficient applications in the first place. C# and .NET themselves do not utilize memory like C++ or C, so from the get-go, there was a huge elephant in the room.

With that being said, C# has quickly become the language of choice for many, even Java developers, because of its elegance, ease, and simplicity for human understanding.

In my opinion, OO can never be compared to XML; they are like apples and oranges. XML is the underlying data, while OO is an extension of that data: it GIVES THE DATA LIFE, it gives the data functionality, etc.

The problem here is not in OO, it is in .NET's implementation of XML Serialization (which applies to many areas including web services etc).

Anyhow, I love your idea. Now here is my idea and what I think you should do to make this idea of yours become successful.

Create a persistent generation layer of an OO layer that can be directly bound to the memory allocated for a deserialized version of an XML document. Then the business objects can remain object-oriented while not being required to be serialized at all.

For example, when you modify a property in memory, you are actually modifying that piece of the XML node value; and when you are reading from memory, you are retrieving a new substring copy of the correct node of the XML document. By doing this, it is important to manage the mapping between XML memory locations and the pointers referenced from the properties of an object.

The point I am trying to make with these articles is that document-centric XML processing is a more effective XML processing model than object-oriented processing... so the goal is not to make vtd-xml compatible with DOM... but to offer an option that goes beyond DOM...

notice that it is not necessarily about programming language... vtd-xml is available in C, C# and Java..

the limitation (of memory management in C#) is inherent to any programming language... because objects are small memory blocks (which incur overhead), VTD-XML gets around that by using big memory blocks..

This way there is no heavy performance hit, while the number of small memory blocks used is reduced.
It also eliminates serialization and deserialization.
i think i'm going to write an article on this, maybe it is worth exploring.

I think your idea in general is a good one. However, I think it might make the code hard to read and deal with. I have thought about this problem in the past. (This idea sounds a bit like Java's ropes.) I think you could hide a lot of the details of an XML element behind a struct or a class. (Something like an STL iterator.) And provide methods to act on the iterator. (In general, OO is not all bad if you use it correctly.) (So, it could be like the DOM without the expense.)

I also noticed that you can only have one active part of the document at a time, so the idea of an iterator might be less restrictive and let you work in multiple areas of the doc interchangeably.

I have a different view on this... I think that more often than not, dealing with XML text will not only make things more efficient, but also simplify programming...
(although the experience will be a bit different from OO, and you may not be familiar with it initially...) but XPath evaluation is equivalent to a typeless iterator in my view (your thoughts on this?)

When you are at one part of the doc, there are things you can do (i.e., saving all the information for further processing etc.)... furthermore, you can save the node position using the nodeRecorder and BookMark classes

In the end, nothing is perfect; everything has its strengths and weaknesses in a particular context... OO may be good for some situations, but bad for others...
If you do a lot of ESB stuff, the OO notion of objects is a major impediment to achieving performance and scalability... The toughest part of the problem lies in how people think, in my view...

I'm not sure pre-modern OSes had much more fun with numerous small memory blocks. But memory management is a red herring here (the heuristic "prefer arrays of structs to lists of pointers" is evidence enough of that). All I'm wondering is, "where is the C implementation of VTD-XML?"

VTD-XML is great. But in my opinion, if your XML files are so big they don't fit in memory, XML is probably the wrong tool for the job anyway. (even though I rarely use parsers that load the whole file in memory)

Oh yeah! Computers have so much memory now that we can choose to use a terribly inefficient format for anything. Let's save image files uncompressed and in ASCII too. Who needs JPEG? Disks are so big and cheap anyway.

I think that XML is not perfect... it has its strengths and weaknesses...
there are things it is good for, and there are things it is not suited for...
so it is a matter of making the right trade offs... there are always two sides
of the argument... so calling it terribly inefficient is only one of them... I am
sure there are arguments for it as well

Of course there are arguments for it. I'm not saying XML sucks. It's great, I use XML-related stuff all the time. It's just some people abuse it as a replacement for a database or similar. I see 4GB XML files that can be replaced by 500MB custom-binary-format files... Like you say, the idea is doing tradeoffs, and XML is good for some things and bad for others. Some people use it always.

I saw a project using it for configuration on the server, for saving state on the client-side, for the client<->server protocol, for IPC (!) between the client and the worker apps (XML in a freaking shared memory block), for storing metadata on the server (XML inside a BLOB on the MySQL database), for GUI<->client communication, for exporting stats to the outside world (200MB gzipped XML files)... It's *everywhere*. And in not a single place they use a real XML parser. It's mostly reading the file line by line and doing things like strstr(line, "<get_state").

I've worked with XML files for website navigation that spanned over 10MB and they were a complete nightmare to work with.
Performance is so low that no matter what parser you use it's still going to be horribly slow.

In my opinion, you have basically two choices when working with XML: work with streams and loads of I/O when the files get larger, or work with the whole document in memory and require loads of memory. Neither is really acceptable in the case of large files.

The method we used to work around this problem was to break up the XML (Our system still requires XML in combination with XSLT to present webpages) and place the fragments in the database. This way the performance went up by miles and it was a lot easier to work with the files.
