Introduction

This tutorial builds on the first part. If you haven’t read it yet, I recommend doing so first, as it contains vital information that I won’t repeat in this post.

Note: please download and open the Eclipse project. The run methods can be found in the Scraping02 class.

Additional queries

Sometimes the HTML structure is so messed up that you need to take matters into your own hands. In run1 I’ll show you how to loop through all elements of the images table one by one.

So ‘node.getFirstChild()’ is the <td> tag, and ‘node.getFirstChild().getNextSibling()’ is the first <img> tag. To loop through the rest you will need to call ‘next.getNextSibling().getNextSibling()’, as the first nextSibling is the current element.
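The sibling walk can be sketched with the plain W3C DOM classes that ship with the JDK. This is a minimal sketch, not the code from run1: the markup in SAMPLE is made up, it is parsed with DocumentBuilder instead of HtmlCleaner, and the names SiblingWalk, parse and imageSources are hypothetical. Instead of hopping two siblings at a time, it visits every sibling and checks the node type, so whitespace text nodes between the tags are skipped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SiblingWalk {

    // Made-up, well-formed stand-in for the images table (assumption, not the real test file).
    static final String SAMPLE =
        "<table class=\"images\"><tr><td>"
        + "<img src=\"a.png\"/>\n<img src=\"b.png\"/>\n<img src=\"c.png\"/>"
        + "</td></tr></table>";

    // Parse a well-formed XML string into a W3C DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    // Walk the children of 'parent' with getNextSibling() and collect the src of every <img>.
    static List<String> imageSources(Node parent) {
        List<String> sources = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            // Text nodes (the line breaks between tags) are siblings too, so filter on type.
            if (n.getNodeType() == Node.ELEMENT_NODE && "img".equals(n.getNodeName())) {
                sources.add(((Element) n).getAttribute("src"));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        Node td = parse(SAMPLE).getElementsByTagName("td").item(0);
        System.out.println(imageSources(td)); // prints [a.png, b.png, c.png]
    }
}
```

Checking the node type costs one extra condition, but the loop no longer depends on exactly how many text nodes sit between two tags.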

This hasn’t been covered so far, but it is quite useful for web scraping: XPath queries can select attribute values too. This is done using the ‘@’ character, which comes in quite handy when selecting links from <a href=""> tags. In run2 the query selects the links (href attribute) from all <a> tags inside the tag with the class ‘links’.
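To make the ‘@’ selection concrete, here is a minimal sketch using the XPath engine that ships with the JDK. The sample markup, the class name LinkHrefs, the helper select and the exact query string are my own assumptions for illustration; the query in run2 may be written differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkHrefs {

    // Made-up sample markup (assumption); only the first <div> carries the class 'links'.
    static final String SAMPLE =
        "<html><body>"
        + "<div class=\"links\"><a href=\"http://example.com/one\">One</a>"
        + "<a href=\"http://example.com/two\">Two</a></div>"
        + "<div class=\"other\"><a href=\"http://example.com/skip\">Skip</a></div>"
        + "</body></html>";

    // Hypothetical helper: run an XPath query and return the value of each matched node.
    static List<String> select(String xml, String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                // For an attribute node, getNodeValue() is the attribute's value.
                values.add(hits.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Query failed: " + query, e);
        }
    }

    public static void main(String[] args) {
        // '@href' selects the attribute itself instead of the <a> element.
        System.out.println(select(SAMPLE, "//*[@class='links']//a/@href"));
        // prints [http://example.com/one, http://example.com/two]
    }
}
```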

In run3 I’ve combined run2 with run1. Here you can see that HtmlCleaner corrects tags that aren’t correct (in this case the <img> tag). The XPath query reads: select all elements with the class ‘images’, from these select every ‘img’ element with a td parent at any level/depth, and get its src attribute.
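One way to write a query that reads like this is ‘//*[@class='images']//td/img/@src’. The sketch below tries it on made-up, already well-formed markup with the JDK’s own parser, so it does not show HtmlCleaner’s tag correction; the class ImageSources, the helper select and the sample are assumptions, and the query string is my reading of the description rather than a copy of run3.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ImageSources {

    // Made-up sample (assumption): one image sits outside the table, two sit inside <td> tags.
    static final String SAMPLE =
        "<div class=\"images\"><img src=\"banner.png\"/>"
        + "<table><tr><td><img src=\"a.png\"/></td><td><img src=\"b.png\"/></td></tr></table>"
        + "</div>";

    // Hypothetical helper: run an XPath query and return the value of each matched node.
    static List<String> select(String xml, String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                values.add(hits.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Query failed: " + query, e);
        }
    }

    public static void main(String[] args) {
        // '//td' matches a td at any depth below the 'images' element; 'img' must be its direct child.
        System.out.println(select(SAMPLE, "//*[@class='images']//td/img/@src"));
        // prints [a.png, b.png]
    }
}
```

The banner image is left out because its parent is the <div>, not a <td>.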

Introduction

Occasionally I want to scrape a web page for links or images. Being a programmer, I hate such tedious tasks and I always end up writing a script. In the past I always used a combination of substring(), indexOf() or some other string functions. However, web pages are (or can be cleaned up into) XML documents too! This enables a much easier method to search content: XPath.

Setup

The Java library that I will use in this tutorial is HtmlCleaner. HtmlCleaner is an open-source HTML parser written in Java that cleans up ill-formed HTML code. You can download it at: http://htmlcleaner.sourceforge.net/download.php.

Right-click the project in Eclipse and go to ‘Properties’. Under ‘Java Build Path’, go to the ‘Libraries’ tab and press ‘Add JARs’ to add the HtmlCleaner JAR.

I also provided a test file. In the project you can find it in the ‘src/resources’ folder. The function readDocument() will read this file and create a usable Document object.

Now let’s get a little more advanced: the next example selects all <h1> tags which have a class named ‘bookTitle’. The XPath query reads: from the top level, select all h1 elements whose class attribute matches the value ‘bookTitle’.
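A query of this shape can be sketched as ‘//h1[@class='bookTitle']’. The example below runs it with the JDK’s XPath engine on made-up, well-formed markup instead of the tutorial’s test file; the class BookTitles, the helper titles and the sample are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BookTitles {

    // Made-up sample (assumption): two <h1> tags carry the class, one does not.
    static final String SAMPLE =
        "<html><body>"
        + "<h1 class=\"bookTitle\">First Book</h1>"
        + "<h1>Not a book</h1>"
        + "<h1 class=\"bookTitle\">Second Book</h1>"
        + "</body></html>";

    // Hypothetical helper: return the text of every <h1> whose class attribute equals 'bookTitle'.
    static List<String> titles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//h1[@class='bookTitle']", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                // The matches are elements here, so read their text content.
                result.add(hits.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not select book titles", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titles(SAMPLE)); // prints [First Book, Second Book]
    }
}
```

Keep in mind that ‘@class='bookTitle'’ compares the whole attribute value, so a tag with class="bookTitle featured" would not match this query.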