Extracting structured data from web sites is not a trivial
task. Most of the information on the web today is in the form of Hypertext
Markup Language (HTML) documents which are viewed by humans with a browser.
HTML documents are sometimes written by hand, sometimes with the aid of HTML
tools. Given that the format of HTML documents is designed for presentation
purposes, not automated extraction, and the fact that most of the HTML content
on the web is ill-formed (“broken”), extracting data from such documents can be
compared to the task of extracting structure from unstructured documents.

In our previous post, we discussed and learnt about simple web harvesting. In this post, we will try to scrape more complex information from naptol.com.

To start with a slightly more complex scraping task, we will try to extract the mobile handset items listed on a given naptol.com page URL.

naptoConfig.xml

Now we will try to understand the configuration file and how it works.

    <list>
        <xpath expression='//*[@id="productView"]'>
            <html-to-xml prunetags="yes">
                <http url="${url}"/>
            </html-to-xml>
        </xpath>
    </list>

The <http> processor downloads the page at the given URL, and the <html-to-xml> processor
cleans up that HTML and produces XHTML content. The <xpath> processor then evaluates its
expression against the XHTML and produces a list of the items that match it.

Here, we get every element whose id is productView. If the page has three elements with id productView, the list will contain three items, from element [0] to element [2]. The <list> element holds
these matched items as the sequence the loop will iterate over.
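
To make the matching concrete, here is a hypothetical fragment of the kind of cleaned-up XHTML the expression would run against. It is not the real markup of the page: the handset names, the link targets and the surrounding div are invented purely for illustration. Only the id productView and the class proName come from the configuration in this post.

```xml
<!-- Hypothetical XHTML, invented for illustration only -->
<div>
    <!-- Each of these three elements matches //*[@id="productView"] -->
    <div id="productView">
        <p class="proName"><a href="#" title="Handset A">Handset A</a></p>
    </div>
    <div id="productView">
        <p class="proName"><a href="#" title="Handset B">Handset B</a></p>
    </div>
    <div id="productView">
        <p class="proName"><a href="#" title="Handset C">Handset C</a></p>
    </div>
</div>
```

With input shaped like this, the list would hold three items, one per productView element.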

    <loop item="link" index="index">

The loop iterates through the specified list and executes the specified body logic for each
item. So item="link" exposes the current item under the variable name link, and the loop
walks through the items in the list one by one, from [0] to [n-1].

In the body section, we take the item variable, apply an XPath expression to it, and store the extracted data in a variable called productName, using the code below:

    <var-def name="productName">
        <xpath expression='//p[@class="proName"]//@title'>
            <var name="link"/>
        </xpath>
    </var-def>

In the above excerpt, we take each list item in turn, extract the title attribute found on or inside the paragraph with class proName, and store it in the variable productName.
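
Putting the excerpts together, the whole configuration might look roughly like the sketch below. This is an assembled illustration, not the original file: the <config> wrapper and the <body> element follow the usual Web-Harvest layout, and the value given to url is a placeholder assumption, since the real page address is not shown in this post and may be supplied from outside the configuration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Assumption: placeholder address; replace with the listing page to scrape -->
    <var-def name="url">http://www.example.com/mobile-handsets.html</var-def>

    <loop item="link" index="index">
        <!-- The list: one entry per element with id="productView" -->
        <list>
            <xpath expression='//*[@id="productView"]'>
                <html-to-xml prunetags="yes">
                    <http url="${url}"/>
                </html-to-xml>
            </xpath>
        </list>
        <!-- The body runs once per entry, with the current entry in "link" -->
        <body>
            <var-def name="productName">
                <xpath expression='//p[@class="proName"]//@title'>
                    <var name="link"/>
                </xpath>
            </var-def>
        </body>
    </loop>
</config>
```

Note that productName is redefined on every pass, so a real configuration would normally do something with it inside the body, such as writing it to a file, before the next iteration replaces it.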