Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Posted 04 October 2012 - 05:02 PM

xZhongCheng: I'm going to assume that you're looking to validate, as opposed to simply checking for well-formedness. I went ahead and built an XML validator that checks against W3C standards. It took roughly 16 secs per hundred files (with downloading), and each file was about the size of 1989.xml. That's nearly 25 minutes for 8000 files. Using the SAXParser with the Validator API is a little slower - roughly 32 minutes. Quite slow. However, the network transfer appears to be the bottleneck. Validating one file from local disk versus over the network:

URL: 940 ms
File: 17 ms

If you could keep the files on local disk, you could cut the total time to about 2 minutes (10 minutes with SAXParser), which is still pretty slow but not so bad. So, I think the issue is downloading 8000 files. Try seeing how long it would take to download 8000 files to your PC. Perhaps thread-level parallelism will help (test on a few thousand files).

Then I took to the web to see whether SAXParser is just really slow, even though I've already established that the parser probably isn't at fault here. Apparently, it's much slower when validation is enabled, which is no surprise. Some people suggested using Sun's Multi-Schema Validator. Another suggested the RXP validator, which is supposedly the fastest validator available; you'd have to go through JNI, though, since RXP is a C library. And here: Fastest Java XML parsers.
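For reference, here's a minimal sketch of the JAXP Validator API approach described above. The element name and inline schema are made up for the example; the real files would be loaded from disk or the network instead of strings:

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.StringReader;

public class ValidateSketch {
    // Returns true iff 'xml' validates against the W3C XML Schema in 'xsd'.
    public static boolean isValid(String xml, String xsd) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {   // SAXException on invalid input
            return false;
        }
    }

    public static void main(String[] args) {
        // Toy schema: a single <year> element holding an integer.
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='year' type='xs:integer'/>"
          + "</xs:schema>";
        System.out.println(isValid("<year>1989</year>", xsd));       // true
        System.out.println(isValid("<year>nineteen</year>", xsd));   // false
    }
}
```

Note the expensive part is building the Schema; build it once and reuse the Validator across files rather than recreating it per document.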

Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Posted 05 October 2012 - 08:13 PM

blackcompe, on 04 October 2012 - 06:02 PM, said:

xZhongCheng: I'm going to assume that you're looking to validate... If you could keep the files on local disk, you could cut the total time to about 2 minutes (10 minutes with SAXParser)... So, I think the issue is downloading 8000 files. Perhaps thread-level parallelism will help.

I would like to say thank you so much for your help and your time.

I found that the download takes about 1 second per URL. A friend suggested multithreading: use threads to do the downloading, then hand the data off to another thread for processing. I think I would much rather use this multithreading solution (never done it before, so I must learn!). Another factor, I believe, is that I am running this as an Android app.

Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Posted 06 October 2012 - 08:41 AM

You seem to have ignored the code I posted for some reason, which incidentally gives you the exact business object you described. I deliberately removed the validation code for optimisation reasons. If you feel you need it, of course you can use it.

As for multithreading, generally speaking, if you have multiple processors or cores, multithreading can offer performance benefits. Otherwise, it could actually decrease performance owing to the overheads of context maintenance and switching.

Nonetheless, with multiple page retrievals, multithreading should be done anyway. You can usually count on one or two pages giving you retrieval problems, and you don't want those to hold up a single application thread.
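One way to stop a misbehaving page from stalling everything is to put a deadline on each retrieval via Future.get with a timeout. This sketch uses a sleeping task to stand in for a slow server:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RetrievalTimeout {
    // Run 'fetch' on the pool; give up and return null past the deadline.
    public static String fetchWithDeadline(Callable<String> fetch,
                                           long millis, ExecutorService pool) {
        Future<String> f = pool.submit(fetch);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);     // abandon the slow page instead of stalling
            return null;
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        String fast = fetchWithDeadline(() -> "<ok/>", 500, pool);
        String slow = fetchWithDeadline(() -> {
            Thread.sleep(5000);                // simulate an unresponsive server
            return "<late/>";
        }, 100, pool);
        pool.shutdown();
        System.out.println(fast + " / " + slow);
    }
}
```

For real HTTP connections you'd also set connect and read timeouts on the connection itself, so the worker thread isn't left blocked after the Future is abandoned.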

Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Posted 06 October 2012 - 06:12 PM

g00se, on 06 October 2012 - 09:41 AM, said:

You seem to have ignored the code I posted for some reason... Nonetheless, with multiple page retrievals, multithreading should be done anyway. You don't want one or two problem pages to hold up a single application thread.

The approach you described and provided code for was very similar to what I was doing. I am currently testing my program with downloading threads; in less than 5 minutes it has already scanned through 3000 XMLs. One thing, though: I don't know exactly how to pass a value to a thread without having to create a class.
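On passing a value to a thread without writing a named class: an anonymous Runnable can capture a final local variable directly. The URL below is just a placeholder for the example:

```java
public class ThreadArgDemo {
    public static void main(String[] args) throws InterruptedException {
        // A local variable must be (effectively) final to be captured
        // by the anonymous class below.
        final String url = "http://example.com/1989.xml";

        Thread worker = new Thread(new Runnable() {
            @Override public void run() {
                // 'url' is visible here - no extra named class needed
                System.out.println("downloading " + url);
            }
        });
        worker.start();
        worker.join();
    }
}
```

If the value changes per thread (one URL per worker, say), capture a fresh final variable inside the loop that spawns the threads.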

Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Posted 07 October 2012 - 11:39 AM

Quote

and ended up with SAX, which is usually not as efficient as the method I suggested

Goose: Will your code, as is, validate an XML document against a provided schema? I haven't been able to find a single web result associating XMLStreamReader with "validation", other than using an XMLStreamReader as a source for a Validator.

Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Quote

Will your code as is validate an XML document against a provided schema?

I don't know about a schema. That is something very specific. But that's probably quite academic, as:

a. there's no sign of a schema being in use
b. there's no sign of validation being either required or even possible (you can only validate against a schema or DTD, and there's no sign of either being present)

Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Posted 07 October 2012 - 04:55 PM

Quote

I don't know about a schema. That is something very specific. But that's probably quite academic... there's no sign of a schema being in use, and no sign of validation being either required or even possible.

Hmm... I guess I took the following to mean validation was required.

Quote

I need to parse through about 8000-9000 different URLs to see if they are xmls

I think a validating parser is only necessary to ensure strict adherence to some schema, whereas the OP just needs each file to parse as XML without error. I overcomplicated things.
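For the well-formedness-only case, a plain non-validating SAX parse is enough: if parsing completes without an exception, the document is well-formed XML. A minimal sketch:

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;

public class WellFormedCheck {
    // Non-validating parse: succeeds iff the input is well-formed XML.
    public static boolean isWellFormed(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)),
                         new DefaultHandler());   // no-op handler
            return true;
        } catch (Exception e) {    // SAXParseException on malformed input
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));      // fine
        System.out.println(isWellFormed("<a><b></a>"));       // mismatched tag
        System.out.println(isWellFormed("not xml at all"));   // no root element
    }
}
```

In the real program the InputSource would wrap the downloaded stream rather than a string; the parser factory can be reused across files.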

Re: Java - Efficient XML Parsing (Thousands of XMLs from Web)

Posted 07 October 2012 - 07:49 PM

So I have finished my program. I have opened up 9 threads to do the downloading of the XML files. It was going nice and fast earlier, but now it takes about 20 minutes to get through all of them... I didn't change any of my code, yet the speed changed. What's the max number of threads I should be opening?
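There's no universal maximum: for CPU-bound work you'd stay near the core count, but threads that spend most of their time blocked on the network can profitably exceed it. If the speed changed with no code change, the likely culprits are the network or the server throttling you, not the thread count. A common pattern is a fixed-size pool whose size you tune by timing a sample batch; the multiplier below is a starting point for the experiment, not a rule:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSizing {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        int poolSize = Math.max(cores * 4, 8);   // I/O-bound: oversubscribe, then tune

        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        final AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 100; i++) {
            pool.submit(new Runnable() {
                @Override public void run() {
                    done.incrementAndGet();      // real app: download + parse here
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(done.get());
    }
}
```

A fixed pool also means the 8000 URLs queue up inside the executor instead of spawning 8000 threads, so memory stays bounded regardless of the batch size.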