Wednesday, May 18, 2011

Options for Slow Performance Parsing Large XML

First let me state that this is a situational solution, not a general fix for all performance problems when parsing large XML and inserting the data into the DB. I was brought onto a project to code the PL/SQL portion of a business process that would be kicked off a few times a year via a screen. This process would call a remote web service to see if a set of data had been updated. If it had been updated since the last run, the code would call the web service again to retrieve all the information in XML format. That information would be parsed and stored into a table in the DB, and a few automatic rules would be applied to update the data before the process finished; the user would then perform manual steps to complete the larger process.

The XML that was being returned from the web service was in a simple format: a list of Row nodes, each containing a handful of child nodes, some of them optional.

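The original sample did not survive the formatting of this post; based on the parsing described below, it looked roughly like the following (element names here are placeholders, not the real ones):

```xml
<Rows>
  <Row>
    <Name>Widget A</Name>
    <Value>10</Value>
    <Comment>an optional node</Comment>
  </Row>
  <Row>
    <Name>Widget B</Name>
    <Value>20</Value>
  </Row>
</Rows>
```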
During my testing, the total run time was 1459 seconds. While this was a seldom-run process, it was still too slow to give to the client. Given that this PL/SQL code included a web service call, an INSERT INTO, a MERGE and an UPDATE statement, I wasn't sure where the slowdown was, so I used dbms_utility.get_time to get some basic timing information. The time for a run broke down as

HTTP TIME in sec: 30.88
INSERT INTO TIME in sec: 1031.99
MERGE TIME in sec: 0.03
UPDATE TIME in sec: 0.15
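The instrumentation behind numbers like these was nothing fancy; a minimal sketch of the pattern, assuming dbms_output for reporting (dbms_utility.get_time returns hundredths of a second):

```sql
DECLARE
  l_start PLS_INTEGER;
BEGIN
  l_start := dbms_utility.get_time;
  -- ... call the web service here ...
  dbms_output.put_line('HTTP TIME in sec: ' ||
                       (dbms_utility.get_time - l_start) / 100);

  l_start := dbms_utility.get_time;
  -- ... run the INSERT INTO here ...
  dbms_output.put_line('INSERT INTO TIME in sec: ' ||
                       (dbms_utility.get_time - l_start) / 100);
END;
/
```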

The time spent dealing with the web service seemed reasonable, and the MERGE/UPDATE times were great given the amount of SQL involved in them. The INSERT INTO, however, was killing performance. Having a pretty good guess at the cause, I ran an explain plan on the statement and saw the dreaded COLLECTION ITERATOR PICKLER FETCH operation in the plan. All those in-memory operations on the XML were killing performance, as expected.
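The original statement didn't make it into this post; it followed the classic TABLE(XMLSequence(...)) pattern, something like this sketch (table, column, and element names are made up for illustration):

```sql
INSERT INTO my_stage_table (name_col, value_col)
SELECT extractvalue(value(t), '/Row/Name'),
       extractvalue(value(t), '/Row/Value')
  FROM TABLE(XMLSequence(l_xml.extract('/Rows/Row'))) t;
```

With l_xml being a PL/SQL XMLType variable, a plan for a statement in this shape shows the COLLECTION ITERATOR PICKLER FETCH step, since the XML is shredded row by row in memory rather than from an optimized storage structure.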

For several reasons, registering a schema was not high on the list of options for better performance. My next approach was to manually parse the XML in PL/SQL into memory and then use a FORALL to store the results in the database. For that approach I set up a parsing structure (simplified here for posting).

This structure loops through each Row node in the XML and parses out the child nodes that exist. As some nodes were optional, the logic hidden within f_extractxmltypestrval avoids throwing an error if a desired node does not exist.
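In outline, and with hypothetical names, the loop looked something like the sketch below; the f_extractxmltypestrval shown here is my guess at the helper's behavior, and l_xml is assumed to hold the XMLType returned from the web service:

```sql
DECLARE
  TYPE t_vc_tab IS TABLE OF VARCHAR2(4000) INDEX BY PLS_INTEGER;
  l_xml    XMLTYPE;  -- assumed populated from the web service call
  l_names  t_vc_tab;
  l_values t_vc_tab;
  l_idx    PLS_INTEGER := 0;

  -- Returns the node text, or NULL when an optional node is absent
  FUNCTION f_extractxmltypestrval(p_xml  XMLTYPE,
                                  p_path VARCHAR2) RETURN VARCHAR2 IS
    l_node XMLTYPE := p_xml.extract(p_path);
  BEGIN
    IF l_node IS NULL THEN
      RETURN NULL;
    END IF;
    RETURN l_node.getStringVal();
  END;
BEGIN
  LOOP
    l_idx := l_idx + 1;
    EXIT WHEN l_xml.existsNode('/Rows/Row[' || l_idx || ']') = 0;
    l_names(l_idx)  := f_extractxmltypestrval(l_xml,
                         '/Rows/Row[' || l_idx || ']/Name/text()');
    l_values(l_idx) := f_extractxmltypestrval(l_xml,
                         '/Rows/Row[' || l_idx || ']/Value/text()');
  END LOOP;
END;
/
```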

Once all this information was parsed and loaded into memory, the FORALL would store it into the table.
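With the collections filled, the bulk insert itself is short; sketched with the same hypothetical names as above:

```sql
FORALL i IN 1 .. l_names.COUNT
  INSERT INTO my_stage_table (name_col, value_col)
  VALUES (l_names(i), l_values(i));
```

FORALL sends all the rows to the SQL engine in one context switch, which is why this step barely registers in the timings below.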

The first clean run through the code was noticeably faster: the total run time was 37 seconds. The time for a run broke down as

HTTP TIME in sec: 32.43
PARSE XML TIME in sec: 3.59
FORALL TIME in sec: 0.47
MERGE TIME in sec: 0.06
UPDATE TIME in sec: 0.27

So the time spent parsing the XML and storing it into the table dropped from nearly 1032 seconds in the first version to approximately 4 seconds in this one. Given how infrequently the process runs, that was good enough performance to call this method done.

So what did this show? When Oracle has to parse a large amount of XML stored as a CLOB or a PL/SQL XMLType variable, performance will not be great (at best). This is confirmed by Mark Drake in this XMLDB thread. If you are looking for a performance increase, you have to explore other options. The approach above is one option that worked well for the situation at hand. Another option is to register a schema in the database, create a table based off that schema, store the XML there, and parse it that way (if still needed). A third option would be to store the XML as binary XML (available in 11g). How those compare to the pure PL/SQL approach is a good question, and hopefully I will get around to looking into them soon.

Edit: June 9, 2011
As stated above, my testing was done on 11.1.0.6. In this OTN thread, see the answer from Mark (mdrake) regarding some internal performance changes made in 11.1.0.7 that would alter my original results. They should make things faster, so I'd be interested to see how much.

2 comments:

Out of curiosity, what software do you use to edit your XML files? I have tried a few free products and have trialled Liquid XML (http://www.liquid-technologies.com/xml-editor.aspx). Also, do you have a preference?

A long time ago (10 years now that I think about it), I started using XMLSpy. I like the GUI style it uses when showing stylesheets. I dislike their validation engines (they had different ones depending on the view you were validating from, though I'm not sure about recent releases) because they could turn up so many false errors and miss some real ones. I tried StylusStudio a while back but never got the stylesheet GUI style I liked, so I've stuck with XMLSpy for XML editing. I run Xerces for validation purposes.

About Me

I do what I do because I like it. I've had exposure to a lot of different products and technology over time. I started with COBOL and flat files on a mainframe, and used VSAM and DB2 there as well. Linux and Solaris have crossed my path several times. XML, XSLT and schemas have been friends for a while. SQL has seen me through MS Access, DB2 and Oracle. I've worked with 2 versions of DB2 and 3 of Oracle. There have been many other things as well, too small to mention.