Introduction

Occasionally I want to scrape a web page for links or images. Being a programmer, I hate such tedious tasks and I always end up writing a script. In the past I used a combination of substring(), indexOf() or other string functions. However, web pages are markup documents with a tree structure much like XML, and once the HTML has been cleaned up it can be queried the same way. This enables a much easier method to search content: XPath.

Setup

The Java library that I will use in this tutorial is HtmlCleaner. HtmlCleaner is an open-source HTML parser written in Java that cleans up ill-formed HTML. You can download it at: http://htmlcleaner.sourceforge.net/download.php.

To add the downloaded JAR to your project, right-click the project and go to ‘Properties’. Open the ‘Libraries’ tab and press ‘Add JARs’.

I have also provided a test file. In the project you can find it in the ‘src/resources’ folder. The function readDocument() reads this file and creates a usable Document object.

Now let’s get a little more advanced: the next example selects all <h1> tags whose class is ‘bookTitle’. The XPath query reads: starting from the document root, select all h1 elements whose class attribute equals the value ‘bookTitle’.
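
To make the query concrete, here is a minimal sketch of how such a selection could look. It makes several assumptions: the expression //h1[@class='bookTitle'] is my reading of the description above, the sample markup and the names BookTitleQuery, SAMPLE and selectBookTitles are hypothetical, and the Document is built with the JDK's own XML parser from an already well-formed string instead of going through HtmlCleaner and readDocument(). Only the standard javax.xml.xpath API is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BookTitleQuery {

    // Hypothetical sample markup; it stands in for the tutorial's test file.
    // It is already well-formed, so no HTML cleaning step is needed here.
    static final String SAMPLE =
        "<html><body>"
        + "<h1 class=\"bookTitle\">First Book</h1>"
        + "<h1 class=\"other\">Not a book</h1>"
        + "<div><h1 class=\"bookTitle\">Second Book</h1></div>"
        + "</body></html>";

    // Returns the text of every h1 element whose class attribute is exactly 'bookTitle'.
    static List<String> selectBookTitles(String xhtml) {
        try {
            // Parse the string into a W3C DOM Document (assumes well-formed input).
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

            // '//' searches the whole tree; the predicate filters on the class attribute.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "//h1[@class='bookTitle']", doc, XPathConstants.NODESET);

            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent());
            }
            return titles;
        } catch (Exception e) {
            // Wrap parser and XPath errors so callers need no checked exceptions.
            throw new IllegalStateException("Could not evaluate XPath query", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectBookTitles(SAMPLE)); // prints [First Book, Second Book]
    }
}
```

Note that @class='bookTitle' is an exact string comparison, so an element with class="bookTitle featured" would not be selected by this sketch; a query using contains(@class, 'bookTitle') would be one way to loosen that.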