Development, Tech and Music Braindump

Scraping HTML using Java Servlets and TagSoup

I’ve been working on a simple Java project that scrapes data from vendor websites in order to compare prices. I found a neat little library called “TagSoup” to help parse the HTML returned from my URL connection. I ran into a few hiccups along the way, which I figured were worth documenting, not only for my own sanity but hopefully also for other code monkeys searching the net for solutions to these problems.

First of all, good luck finding any clearly written usage guides for the TagSoup library. I did find a writeup over at HackDiary, written in 2003, that gave me a good starting point. The code provided there looks like this:

The default User-Agent that was being added to my GET headers was ‘Java/1.6.0_13’. This was causing some of the pages I needed to scrape to return a 403 Forbidden error. Changing that would have been easy enough if I were not running this code from within a servlet.
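For a standalone program, a minimal sketch of the JVM-wide approach could look like the following. It assumes the setting in question is the standard `http.agent` system property; the class name and the agent string are made up for illustration. Note that Java appends its own “Java/<version>” token to whatever value is supplied, and the property generally needs to be set before the first HTTP connection is opened.

```java
final class DefaultUserAgent {
    // Hypothetical browser-like value; substitute whatever the target site accepts.
    static final String AGENT = "Mozilla/5.0 (compatible; PriceScraper/0.1)";

    private DefaultUserAgent() {}

    /** Sets the JVM-wide default User-Agent and returns the value now stored in the property. */
    static String install(String agent) {
        System.setProperty("http.agent", agent);
        return System.getProperty("http.agent");
    }

    public static void main(String[] args) {
        // Call this before any URL is opened so the HTTP handler can pick it up.
        System.out.println("http.agent = " + install(AGENT));
    }
}
```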

Notice that I’m now passing the build() method a BufferedReader object instead of the Document object.

For whatever reason, setting the system property from the servlet was not getting the job done.
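One way around this is to set the header on each connection rather than relying on the JVM-wide default. A likely (but unverified) explanation for the servlet behaviour is that the container has already initialised Java's HTTP handler by the time the property is set, so the new value is never read. The sketch below uses only `java.net` and `java.io`; the class name, the agent string, and the UTF-8 charset are assumptions, not details taken from the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

final class ScraperConnection {
    // Hypothetical browser-like value; not taken from the original post.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; PriceScraper/0.1)";

    private ScraperConnection() {}

    /** Prepares a connection with an explicit User-Agent; nothing is sent yet. */
    static URLConnection open(String url) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", USER_AGENT);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and wraps the response body; UTF-8 is an assumption. */
    static BufferedReader reader(URLConnection conn) throws IOException {
        return new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        URLConnection conn = open("http://example.com/");
        // Shows the header that will be sent; no network traffic happens here.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The reader returned by `reader(conn)` can then be handed to the parser's build() method in place of the URL, so the request goes out with the custom header.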

TagSoup seems to expect an ‘h:’ namespace prefix in front of every element name in the XPath. I was using the Firebug plugin for Firefox, and the string returned from its ‘Copy XPath’ feature needed to be modified to include ‘h:’ in front of every node name. Additionally, the ‘tidying up’ that TagSoup performs seemed to have removed any ‘/tbody’ nodes, so those needed to be removed from my XPath string. Here is an example of the XPath string I needed to use in order to grab the Silver Bid/Ask price from ‘http://bullion.nwtmint.com/silver_panam.php’:
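Independent of that specific string, the rewriting rule itself can be sketched with the standard library alone. The prefix is needed because TagSoup places the elements it emits in the XHTML namespace (`http://www.w3.org/1999/xhtml`), and XPath 1.0 only matches namespaced elements through a bound prefix. In the sketch below the class name, the sample markup, and the path are all invented for illustration; they do not reproduce the vendor page. The converter is deliberately naive and would mishandle predicates that contain a ‘/’.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class FirebugXPath {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private FirebugXPath() {}

    /** Prefixes each plain element step with "h:" and drops tbody steps. */
    static String toNamespaced(String xpath) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        // limit -1 keeps empty segments so a leading "/" or "//" survives
        for (String step : xpath.split("/", -1)) {
            int bracket = step.indexOf('[');
            String name = bracket < 0 ? step : step.substring(0, bracket);
            if (name.equals("tbody")) {
                continue; // browser-inserted node that is absent from the parsed tree
            }
            if (!first) {
                out.append('/');
            }
            first = false;
            boolean plainElement = !name.isEmpty()
                    && Character.isLetter(name.charAt(0))
                    && name.indexOf(':') < 0   // already prefixed or an axis
                    && name.indexOf('(') < 0;  // node test such as text()
            out.append(plainElement ? "h:" + step : step);
        }
        return out.toString();
    }

    /** Evaluates an XPath against XHTML text with "h" bound to the XHTML namespace. */
    static String evaluate(String xhtml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            xp.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "h" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    String p = getPrefix(uri);
                    return p == null ? Collections.<String>emptyIterator()
                                     : Collections.singleton(p).iterator();
                }
            });
            return xp.evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Invented stand-in for TagSoup output: XHTML namespace, no tbody element.
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table>"
                + "<tr><td>bid</td><td>ask</td></tr></table></body></html>";
        String copied = "/html/body/table/tbody/tr/td[2]"; // as a browser might report it
        String fixed = toNamespaced(copied);
        System.out.println(fixed);                 // /h:html/h:body/h:table/h:tr/h:td[2]
        System.out.println(evaluate(page, fixed)); // ask
    }
}
```

Evaluating the unmodified path against the same document returns an empty string, which matches the symptom of an XPath that silently finds nothing.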