Discussions

Version 1.0 of VTD-XML -- an open-source, high-performance, non-extractive XML processing API -- is freely available in Java and C on sourceforge.net, with source code, documentation, a detailed API description, and code examples. VTD-XML is geared toward very fast examination of XML in an in-memory buffer.

New in VTD-XML 1.0 is integrated XPath support with an easy-to-use interface that further enhances VTD-XML's inherent benefits, such as CPU/memory efficiency, random access, and incremental update. Demos are available at http://www.ximpleware.com/demo.html.


Though for 95% of applications I'm sure this would be extremely useful, I could see that some may have problems with the limits on starting offset and depth. Maybe dock some bits from the QName length (511 + 1023 is a lot!) and add them to the others?
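The limits being discussed come from packing each token into a fixed-width record. The sketch below illustrates the trade-off with hypothetical field widths (they are illustrative assumptions, not the actual VTD spec values): widening one field means narrowing another, since every record shares the same 64-bit budget.

```python
# Toy sketch of packing one token into a 64-bit VTD-style record.
# Field widths are illustrative assumptions, not the real spec:
#   type: 4 bits, depth: 8 bits, length: 20 bits, offset: 32 bits
TYPE_BITS, DEPTH_BITS, LEN_BITS, OFF_BITS = 4, 8, 20, 32

def pack(token_type: int, depth: int, length: int, offset: int) -> int:
    # Each field must fit inside its fixed bit budget.
    assert token_type < (1 << TYPE_BITS) and depth < (1 << DEPTH_BITS)
    assert length < (1 << LEN_BITS) and offset < (1 << OFF_BITS)
    rec = token_type
    rec = (rec << DEPTH_BITS) | depth
    rec = (rec << LEN_BITS) | length
    rec = (rec << OFF_BITS) | offset
    return rec

def unpack(rec: int):
    # Reverse the packing: peel fields off from the low bits upward.
    offset = rec & ((1 << OFF_BITS) - 1); rec >>= OFF_BITS
    length = rec & ((1 << LEN_BITS) - 1); rec >>= LEN_BITS
    depth = rec & ((1 << DEPTH_BITS) - 1); rec >>= DEPTH_BITS
    return rec, depth, length, offset   # rec is now the token type

rec = pack(1, 3, 42, 1_000_000)
print(unpack(rec))  # (1, 3, 42, 1000000)
```

Docking bits from one field and adding them to another, as suggested above, is just a matter of changing these constants; the cost is that the new budget applies uniformly to every token in the document.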

But I certainly don't want to detract too much from a very interesting project. Do I gather it is being used in XML hardware?

I had similar ideas in the past, but was way too lazy to actually write something.

I assume you're referring to the "non-extractive tokenization approach that maintains the source document intact in memory ... a cursor-based API that retains most of DOM's random-access capabilities at a fraction of its memory usage". It's the phrase in bold that I don't like. What DOM features are lost? Why ignore that Moore's Law increases available memory and also speeds heap processing? Is there a compelling application unable to succeed due to SAX's linearity and JAXP's DOM being such a pig? Would XSLTC benefit from it?

Kit, it depends on what the definition of "parser" is. In my view it is a parser first, indexer second; by my definition a parser prepares the document into a form that applications can consume (and do whatever they want with). Like I point out in the "XML on a chip" article, XML's performance issue is really a problem of XML's processing models, which has little to do with XML itself, so replacing XML with a binary version is probably not solving the right problem...


If you go to the use cases for Binary XML, there are several there that cover random access into very large documents. I could see VTD-XML as being useful in these sorts of cases. AIUI, VTD-XML is more of a document indexer than parser.

Isn't VTD-XML less like DOM and more like SAX with nonlinearity (the ability to jump around and tokenize unvisited document fragments in any order, not only document order)? What additional services does VTD-XML provide that SAX doesn't? What can DOM do that VTD-XML doesn't?

VTD-XML is simpler to use than DOM: in DOM you have to do a lot of node casting, in VTD-XML you don't. But still, VTD-XML is not DOM... Also the write feature of VTD-XML is different: DOM modifies the data structure, VTD-XML modifies the XML directly.
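The "modifies the XML directly" point can be pictured with a toy sketch (plain Python over a byte buffer; this is an illustration of the idea, not the actual VTD-XML API, and the offsets below are assumed to have come from a prior indexing pass): once a token's offset and length are known, an update is a byte splice on the original document rather than a mutation of a node tree.

```python
# Toy sketch: updating XML by splicing bytes at a recorded offset/length
# in the original buffer, instead of mutating a parsed tree.
doc = b'<order><qty>7</qty></order>'

# Assume indexing recorded the text node "7" at offset 12, length 1.
offset, length = 12, 1

def splice(buf: bytes, offset: int, length: int, replacement: bytes) -> bytes:
    # Every byte outside [offset, offset + length) is carried over verbatim.
    return buf[:offset] + replacement + buf[offset + length:]

updated = splice(doc, offset, length, b'42')
print(updated.decode())  # <order><qty>42</qty></order>
```

A tree API would allocate nodes and re-serialize the whole document; here the untouched bytes are reused as-is, which is where the incremental-update benefit comes from.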


Really? I thought SAX was about as fast as you can get? But I haven't benchmarked it or anything -- just my impression versus DOM. I wrote my own 'dumb' parser once, a dead-stupid version, and I was blown away that SAX was faster (twice as fast). I was muttering to myself, "how could that be..." ;-)

VTD-XML is up to twice as fast as SAX with a NULL content handler. What this means is that even if you just do a dry run with the SAX parser, without any custom logic or program code, VTD-XML is still much faster... Its C version significantly beats Expat almost every time...

Excellent! Congratulations, as that seems like quite a feat to me. However, I wonder if you would come out ahead much if you were extracting most of the data anyway? I looked at your benchmarks and you were faster by 20 to 30%, but I didn't study the examples closely, i.e. how much data is being extracted, etc. Your example of being twice as fast with a NULL handler would seem to be the ideal scenario for VTD-XML (assuming it also does not extract any data). Anyway, it sounds like a good idea with potential. Good luck.

The benchmark is very old; the performance of VTD-XML has since improved quite a bit -- 500 MB/sec is our estimate on a 3 GHz Pentium processor... The entire idea of VTD-XML is that extracting data, in most cases, doesn't accomplish much; in other words, you simply don't have to... offset and length are all you need...

Hi Jimmy, I've read the documentation (briefly) and this looks like an interesting idea, but I have a question. I understand that the parser does not extract the XML data into a separate DOM structure but essentially uses indexes into the original document to reduce memory overhead. But surely the whole point of parsing the XML document, in most cases, is to enable the data in that document to be extracted into Java variables so that it can be processed in some way. Even with this fast approach to parsing, there will still be the same overhead of creating the Java objects and populating them after the document has been parsed? I can see that there is a performance benefit to this approach if only a subset of the document is to be extracted. Have I understood this correctly? Thanks, Andy.

Andy, there are two ways to work with XML, roughly speaking. One is "data centric", which is to convert XML into Java objects; some call it data binding. The other is "document centric", which basically treats XML as a message. The Web Services/SOA community has generally considered the document-centric view of XML the right way, as it takes full advantage of the loose-coupling aspect of XML. In this view, XML documents are *not* restricted by a schema, so applications are less likely to break with the evolution of the data format. The data-centric view of XML, on the other hand, is not designed to take those advantages. At the same time, the data-centric view of XML, which requires a large number of object creations, is also very slow.

With VTD-XML, message-based, document-centric processing not only takes full advantage of XML's loose coupling, but is also much faster than data-centric XML data binding.

I understand the different models and I probably didn't state my original question in enough detail.

If I need to pull the data out of the XML document (which presumably is the point of parsing it in the first place) then the API will still need to extract this data and place it into newly allocated Java variables (Strings or primitives). I wasn't referring to binding to a Java object model ala Castor or XMLBeans.

My point is that post-parsing I will still need to extract the data from the XML document in order to do something with it and that is the point where the usual overhead of memory allocation and data copying will occur. I can see that even with this overhead there are some potential advantages over the traditional DOM model in certain situations (with the VTD approach it is more likely that transient objects will be created that can be garbage collected after they have been processed and if only a subset of the document needs processing then the memory overhead will be lower than the classic DOM approach).

I'm just concerned that the VTD benchmarks are not reflective of real world use cases where data is actually extracted from the XML document for processing.

Andy, at that point the performance impact is also very small, for several reasons:
1. Because you already know the offset and length, you can allocate a String and fill in the chars fairly easily.
2. You navigate the document according to element/attribute names, and only then extract the value/text node, so a lot of unnecessary extractions (of element/attribute names) are avoided.
3. If you want to extract the int/float value of a field, you can perform the conversion from the VTD record to int/float directly, which bypasses string creation.
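Points 1 and 3 above can be illustrated with a toy sketch (plain Python over a byte buffer; the helper names and the hard-coded offsets are hypothetical, not the VTD-XML API): a value stays as an (offset, length) pair until it is actually needed, and a numeric field can be converted straight from the underlying bytes without an intermediate string object.

```python
doc = b'<item><price>1995</price><name>widget</name></item>'

# Assume a prior indexing pass recorded these text tokens as
# (offset, length) pairs into the original buffer.
price_tok = (13, 4)   # the bytes "1995"
name_tok = (31, 6)    # the bytes "widget"

def to_string(buf: bytes, tok) -> str:
    # Point 1: materialize a string only when (and if) it is asked for.
    off, ln = tok
    return buf[off:off + ln].decode()

def to_int(buf: bytes, tok) -> int:
    # Point 3: accumulate digits straight from the buffer; no
    # intermediate string object is ever created.
    off, ln = tok
    value = 0
    for b in buf[off:off + ln]:
        value = value * 10 + (b - ord('0'))
    return value

print(to_int(doc, price_tok))    # 1995
print(to_string(doc, name_tok))  # widget
```

Note that the name token is never touched when only the price is needed -- that is the sense in which "offset and length are all you need".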
