In a previous post I mentioned that web2py is my weapon of choice for building web applications. Before web2py I had learnt a variety of approaches to building dynamic websites (raw PHP, Python CGI, Turbogears, Symfony, Rails, Django), but find myself most productive with web2py.

In the previous post I covered three alternative approaches to regularly scrape a website for a client, with the most common one being in the form of a web application. However hosting the web application on either my own or the clients server has problems.

My solution is to host the application on a neutral third party platform - Google App Engine (GAE). Here is my overview of deploying on GAE:

Pros:

provides a stable and consistent platform that I can use for multiple applications

both the customer and I can login and manage it, so we do not need to expose our servers

Usually my clients request for a website to be scraped into a standard format like CSV, which they can then integrate with their existing applications. However sometimes a client need a website scraped periodically because its data is continually updated. An example of the first use case is census statistics, and of the second stock prices.

I have three solutions for periodically scraping a website:

I provide the client with my web scraping code, which they can then execute regularly

Client pays me a small fee in future whenever they want the data rescraped

I build a web application that scrapes regularly and provides the data in a useful form

The first option is not always practical if the client does not have a technical background. Additionally my solutions are developed and tested on Linux and may not work on Windows.

The second option is generally not attractive to the client because it puts them in a weak position where they are dependent on me being contactable and cooperative in future.
Also it involves ongoing costs for them.

So usually I end up building a basic web application that consists of a CRON job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server, however most clients prefer the security of hosting it on their own server in case the app breaks down.

Unfortunately I find hosting on their server does not work well because the client will often have different versions of libraries or use a platform I am not familiar with. Additionally I prefer to build my web applications in Python (using web2py), and though Python is great for development it cannot compare to PHP for ease of deployment. I can usually figure this all out but it takes time and also trust from the client to give me root privilege on their server. And given that these web applications are generally low cost (~ $1000) the ease of deployment is important.

Flash is a pain. It is flaky on Linux and can not be scraped like HTML because it uses a binary format. HTML5 and Apple’s criticism of Flash are good news for me because they encourage developers to use non-Flash solutions.

The current reality though is that many sites currently use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:

Check for AJAX requests that may carry the data I am after between the flash app and server

AJAX is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example gmail uses AJAX to load new messages. The old way to do this was reloading the webpage and then embedding the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather that just the updated data.
AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX powered websites are often easier to scrape.

The trouble with scraping websites is they obscure the data I am after within a layer of HTML presentation. However AJAX calls typically return just the data in an easy to parse format like JSON or XML. So effectively they provide an API to their backend database.

These AJAX calls can be monitored through tools such as Firebug to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.

In the previous post I covered how to tackle JavaScript based websites with Chickenfoot. Chickenfoot is great but not perfect because it:

requires me to program in JavaScript rather than my beloved Python (with all its great libraries)

is slow because have to wait for FireFox to render the entire webpage

is somewhat buggy and has a small user/developer community, mostly at MIT

An alternative solution that addresses all these points is webkit, the open source browser engine used most famously in Apple’s Safari browser. Webkit has now been ported to the Qt framework and can be used through its Python bindings.

Here is a simple class that renders a webpage (including executing any JavaScript) and then saves the final HTML to a file: