Sunday, July 8, 2007

Scraping content from the web is a real pain. In fact, at the moment, I can't think of any programming tasks I like less. I'm sure they exist; I just can't think of them - feel free to remind me in the comments.

I'm setting up a tool at the moment that is going to use a fair amount of web scraping. Sadly, not all sites are kind enough to provide an API like Compete does, and some actively work to make your life more difficult. The three key points that I feel make web scraping a real pain are:

Undocumented interface

We're dealing with essentially unstructured HTML here. Web scraping also highlights just how little many companies care about validation and writing readable code. I'm assuming their code is a pain to read because they compact their files to save on bandwidth; if not, then maintaining these sites could be an even bigger pain than scraping them.
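For illustration, this is roughly what the work boils down to - a minimal sketch using Python with BeautifulSoup (version 3, which tolerates broken markup). The URL and the 'rank' class name are invented, because that's exactly the problem: with no documented interface, you guess at the structure and hope it holds.

    # A minimal sketch of undocumented-interface scraping. BeautifulSoup 3
    # is used because it tolerates invalid markup. The URL and the 'rank'
    # class are made up for illustration only.
    import urllib2
    from BeautifulSoup import BeautifulSoup

    def fetch_rank(domain):
        html = urllib2.urlopen('http://example.com/stats?q=' + domain).read()
        soup = BeautifulSoup(html)
        # Nothing documents this structure; we guess at a class name.
        cell = soup.find('td', {'class': 'rank'})
        return cell.string.strip() if cell else None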

No warning of changes

You're essentially at the mercy of the designer. If they change their layout, your code breaks.

Timeouts and blank pages

Even with all the code in place you still may get no result. The site may be down, or if you're being too aggressive your requests may be throttled. Even if you do get a page, it may not be the one you want. Does a blank page mean the site has no info for the query you made, or does it mean there was some sort of error? If the site is using 'soft' error messages it may be difficult to know.
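If I were writing a classifier for that guesswork, it would look something like the sketch below. The marker strings are hypothetical - every target site needs its own list, which is rather the point.

    # A rough sketch of telling 'no data' apart from a soft error.
    # The marker strings are hypothetical; each site needs its own.
    def classify(html):
        if not html or not html.strip():
            return 'error'   # blank page: assume failure and retry later
        body = html.lower()
        if 'temporarily unavailable' in body:
            return 'error'   # soft error served with a 200 status code
        if 'no results found' in body:
            return 'empty'   # a genuine "nothing matched your query"
        return 'data'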

With my recent work on web scraping I found the latest 'Whiteboard Friday' video from seomoz to be exceptionally timely. I generally read the seomoz blog for the discussion on marketing issues, but they also have a couple of tools which are heavily reliant on web scraping, and on Friday the issues surrounding this were discussed. Apparently they've been having trouble providing data for all their visitors due to the ever-increasing demand. As well as the video, Matt, their CTO and web developer, goes into a fair amount of detail on their process, and some of it really depressed and/or surprised me:

One of the most unreliable APIs I've had to deal with is the Alexa / Amazon API, which is funny because it's the only one that costs money.

Sadly this isn't the first time I've heard this. The costs aren't unreasonable, but if you're paying for something I think it is fair to expect a higher standard of service than if it was free.

This entire [data fetching] process is repeated between 2-7 times with varying timeout lengths and user-agents until some kind of data is fetched.

7 attempts strikes me as amazingly high. It would be interesting to know the delay between each request.
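I'd guess the loop looks something like the sketch below. To be clear, this is my own invention, not seomoz's actual code: the timeout values, delays and user-agent strings are all made up.

    # A guess at what a 2-7 attempt fetch loop might look like. The
    # timeouts, delay and user-agent strings are invented, not seomoz's.
    import socket, time, urllib2

    USER_AGENTS = ['Mozilla/5.0 (compatible; ExampleBot/1.0)',
                   'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
    TIMEOUTS = [5, 10, 15, 20, 30, 45, 60]   # seconds, one per attempt

    def fetch_with_retries(url, max_attempts=7, base_delay=2):
        for attempt in range(max_attempts):
            socket.setdefaulttimeout(TIMEOUTS[attempt])
            req = urllib2.Request(url, headers={
                'User-Agent': USER_AGENTS[attempt % len(USER_AGENTS)]})
            try:
                return urllib2.urlopen(req).read()
            except (urllib2.URLError, socket.timeout):
                time.sleep(base_delay * (attempt + 1))   # back off and retry
        return None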

Experience so far

I've got two scripts/tools running at the moment that require data from external sources.

The first is the MSN contact grabber script, which can fail after half a dozen or fewer requests on a single IP address. To make matters worse, the interface is complex and poorly documented. Thankfully it is stable.

The second is the DNSBL checker, which I haven't had fail on me yet. It utilises the DNS system and is designed for high use. Even if the tool did become insanely popular, the demand placed on external services could be limited by zone transfers. The interface is also so simple that documentation really isn't needed.
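For comparison, the entire DNSBL 'interface' fits in a few lines: reverse the IP's octets, prepend them to the blacklist zone, and do an ordinary DNS lookup. An A record means listed; NXDOMAIN means clean. The zone named below is just one well-known example, not necessarily what my checker queries.

    # The whole DNSBL protocol: query <reversed-ip>.<zone> as a hostname.
    # zen.spamhaus.org is one well-known zone, used here only as an example.
    import socket

    def is_listed(ip, zone='zen.spamhaus.org'):
        query = '.'.join(reversed(ip.split('.'))) + '.' + zone
        try:
            socket.gethostbyname(query)   # any A record means "listed"
            return True
        except socket.gaierror:           # NXDOMAIN: not on the list
            return False

Checking the conventional test address 127.0.0.2 is one function call, and the blacklist operators handle the question of scale for you.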

My experience seems to be at the two extremes of scraping. I'm hoping my current work will be more DNSBL checker than MSN contact grabber. Maybe I should just drop Alexa data?
