Website screen scraping using Zend Framework

ZF is a component-based framework, so we can use only the packages that a specific task requires. For example, if we are not building a site and don’t need MVC, dispatchers, routers and so on, we can include just the packages the task needs.

Assume we need to build a screen scraper for a site or a group of sites. We’d need Zend_Dom_Query, with its convenient XPath and CSS query methods, and Zend_Json, since many sites exchange data over AJAX as JSON.
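As a rough illustration of what these two components offer, here is a minimal sketch. It assumes Zend Framework 1 is on the include_path; the HTML and JSON snippets are made-up sample data, not output from any real site.

```php
<?php
// Assumption: Zend Framework 1 is reachable through the include_path.
require_once 'Zend/Dom/Query.php';
require_once 'Zend/Json.php';

// Made-up markup standing in for a downloaded page.
$html = '<html><body><table id="offers">'
      . '<tr class="offer"><td class="amount">1000</td><td class="price">9.99</td></tr>'
      . '<tr class="offer"><td class="amount">5000</td><td class="price">45.50</td></tr>'
      . '</table></body></html>';

$dom = new Zend_Dom_Query($html);

// CSS selector query: collect the text of every price cell.
$prices = array();
foreach ($dom->query('tr.offer td.price') as $cell) {
    $prices[] = $cell->nodeValue; // each result item is a DOMElement
}

// The same kind of lookup expressed as an XPath query.
$amounts = array();
foreach ($dom->queryXpath('//tr[@class="offer"]/td[@class="amount"]') as $cell) {
    $amounts[] = $cell->nodeValue;
}

// Decode a made-up JSON response into an associative array.
$data = Zend_Json::decode('{"server":"Ragnaros","faction":"Horde","stock":12}');
```

Both query() and queryXpath() return an iterable, countable result set, so the same loop works whichever query style suits the page better.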

So we start by collecting only the packages we need. ZF classes in turn load their base classes, so we also need Zend_Loader, which registers its own autoload function. Here is all we need for the task:

Zend
│   Json.php
│   Loader.php
│
├───Dom
│   │   Exception.php
│   │   Query.php
│   │
│   └───Query
│           Css2Xpath.php
│           Result.php
│
├───Json
│   │   Decoder.php
│   │   Encoder.php
│   │   Exception.php
│   │   Expr.php
│   │   Server.php
│   │
│   └───Server
│       │   Cache.php
│       │   Error.php
│       │   Exception.php
│       │   Request.php
│       │   Response.php
│       │   Smd.php
│       │
│       ├───Request
│       │       Http.php
│       │
│       ├───Response
│       │       Http.php
│       │
│       └───Smd
│               Service.php
│
└───Loader
    │   Autoloader.php
    │   Exception.php
    │   PluginLoader.php
    │
    ├───Autoloader
    │       Interface.php
    │       Resource.php
    │
    └───PluginLoader
            Exception.php
            Interface.php
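With this directory in place, one way to register the autoloader is sketched here. The "library" folder name is an assumption for illustration; adjust it to wherever the trimmed-down Zend directory actually lives.

```php
<?php
// Assumption: the Zend/ directory listed above sits in ./library next to this script.
set_include_path(implode(PATH_SEPARATOR, array(
    dirname(__FILE__) . '/library',
    get_include_path(),
)));

// Creating the singleton registers the autoloader with spl_autoload_register();
// the Zend_ class prefix is handled by default.
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance();

// From here on ZF classes load on demand, without explicit require_once calls.
$dom = new Zend_Dom_Query('<p class="note">hello</p>');
```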

Let’s start coding. If we are going to scrape several sites, we need a class containing the methods for all the sites plus some common methods for handling cURL operations and service checks. Actually, it would be a better idea to create a base class with the common methods and extend it with one scraper class per site, but let’s leave that for the future.
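A possible shape for such a class is sketched here. The class name Scrapper, the example.com URL and the POST field names are placeholders invented for illustration, and getContent() is only a simplified stand-in for a shared cURL helper.

```php
<?php
// Hypothetical skeleton: one shared cURL helper plus one public method per site.
class Scrapper
{
    /**
     * Common helper: fetch a URL, optionally sending POST data.
     * Returns the response body, or null on failure.
     */
    protected function getContent($url, array $postData = array())
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        if ($postData) {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postData));
        }
        $body = curl_exec($ch);

        return $body === false ? null : $body;
    }

    /**
     * Site-specific method. The URL and field names are placeholders;
     * the real ones come from analysing the target site.
     */
    public function Scrape_gold4power($server, $faction)
    {
        $html = $this->getContent('http://example.com/search', array(
            'server'  => $server,
            'faction' => $faction,
        ));

        // Parsing of $html (for example with Zend_Dom_Query) would go here.
        return $html;
    }
}
```

Keeping getContent() protected leaves room to move it into a base class later, or to override it with canned responses when testing the parsing code offline.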

// $some_input_data stands for the input parameters of the selected site, such as search terms,
// e.g.
$scrap->Scrape_gold4power('Ragnaros', 'Horde');
// i.e. grab some content from the 'gold4power' site for the 'Ragnaros' WoW server and the 'Horde' faction

To implement the method, we have to study the site with a tool such as Firebug, the Charles shareware proxy, or the Tamper Data Firefox add-on, which lets us intercept and analyze HTTP headers and content. All of this is beyond the scope of the article, but I have to say that one can almost always emulate a browser’s behaviour. While working with the site you may notice that it validates some additional headers, encodes POST data in some non-standard manner, and so on. So we should tweak the getContent() method: