Friday, October 17, 2008

A while back Marcus Zarra did a good article on working with libXML2 and xmlTextReader to parse XML Data without loading the whole thing into memory. However the tutorial failed my needs only for 1 reason... he used a file on Disk. In his call to xmlReaderForMemory() he passes a path to the resource on disk (yes I know it makes for an easy self contained project that's easy to demonstrate, but bear with me here), I made the assumption that this made the method unusable for objects residing only in memory. In fact just about every method in libxml2 seems to have an argument for a path to a resource on disk and initially thought I couldn't really do strict in memory xml parsing. How wrong I was as you'll see later on.
What I am going to show you today is how to create a valid xmlTextReader object from libXML2 with only the assumption that you are downloading the xml data from a server somewhere.
Marcus showed a great intro into XML parsing with libXML2 and using a file on disk, but in my code I am getting Data back in memory from calls to web services on the internet and so I went to find out what it'd take to do XML Parsing with libXML2 without using a reference to a file on disk. Basically I wanted to assume that I just have a valid NSData object that has XML in it. When I got done with this I found out there are 2 ways to do this 1 very insanely easy way which isn't obvious from looking at the libXML2 method names alone and 1 harder way that accomplishes the same thing.
The hard way

The point in this example is to create a xmlParserInputBufferPrt object that will be passed into xmlNewTextReader() so we don't read anything off of the disk. On line 3 we create this xmlParserInputBufferPtr by using the alloc method and passing in that we are going to use UTF8 encoding. And of course on line 5 we check for the existence of the xmlParserInputBufferPrt and if it doesn't exist there is really no point in going any further.
Then I create a NSString object from the NSData object (line 10) which will allow us to get the UTF8String from the NSData object and (again) signify that we are using UTF8 Encoding. Then the big thing we need to create (line 12) is the xmlBufferPrt object. This creates a static buffer with the entire contents of the XML from the NSData object passing in a pointer to the string contents itself and the length of the string.
Then (line14) in the xmlParserInputBufferPtr we point the data buffer pointer to the xmlBufferPrt object we created on line 12. After that it's just a matter of creating a xmlTextReaderPtr (line 16) with xmlNewTextReader and pointing the input buffer to the xmlParserInputBufferPointer which now has all the XML in it and pass NULL to the path of the XML. Now you have a xmlTextReader which you can use to parse xml with.
The Easy Way
Now I started going down this path because of Peter Hosey's suggestion of using xmlParserInputBufferPtr which logically seemed like the best solution to my problem of wanting to use strictly in memory objects and do no reading off of disk. If I could just pass NULL for the path in xmlNewTextReader() could I do the same with xmlTextReaderForMemory and it'll still work? As it turns out... YES... yes it does work.

In fact all you need to do is set the 3rd and 4th parameters to NULL in xmlReaderForMemory() and it works fine with in memory objects like if you have a NSData object you got back from API's like say NSURLConnection sendSynchronousRequest. The only thing I wish is that there was a lot better documentation on libxml2, maybe there is a great rescource I just don't know about, but from my googling it was hard to find anything and I had to do a lot of trial by error.
Update: Im aware of the documentation at xmlsoft.org, most of it however pretty much just shows you a list of method names and describes the arguments you pass in to methods with a couple decent doc's/tutorials. This isn't great documentation for me as it doesn't describe which arguments are necessary, if a argument is optional you should make that crystal clear and the documentation doesn't make it clear that a lot of arguments are in fact optional in libxml2 methods. Specifically one page im referring to is at http://xmlsoft.org/html/libxml-tree.html#xmlParserInputBufferPtr.

4 comments:

Aside from http://xmlsoft.org/docs.html - no. I've resorted to looking at the source (usually of xmllint or xsltproc, but sometimes of libxml2/libxslt) in the past, to work out how to do stuff with libxml2.

To be fair to Daniel Veillard, the docs for xmlReaderForMemory (http://xmlsoft.org/html/libxml-xmlreader.html#xmlReaderForMemory) say and imply *nothing* about needing to read from disk...