Web scraping, or crawling, is the practice of fetching data from a third-party website by downloading and parsing its HTML code to extract the data you want.

Since not every website offers a clean API, or any API at all, web scraping can be the only solution when it comes to extracting website information.
Lots of companies use it to gather knowledge about competitor prices, for news aggregation, mass email collection…

Almost everything can be extracted from HTML; the only information that is “difficult” to extract is inside images or other media.

In this post, we are going to see basic techniques in order to fetch and parse data in Java.

Serverless is a term referring to the execution of code inside ephemeral containers (Function as a Service, or FaaS). It is a hot topic in 2019: after the “micro-service” hype, here come the “nano-services”!

Cloud functions can be triggered by different things such as:

An HTTP call to a REST API

A job in a message queue

A log

An IoT event

Cloud functions are a really good fit for web scraping tasks, for several reasons. Web scraping is I/O bound: most of the time is spent waiting for HTTP responses, so we don't need high-end CPU servers. Cloud functions are cheap (the first million requests are free, then $0.20 per million requests) and easy to set up. They are also a good fit for parallel scraping: we can spin up hundreds or thousands of functions at the same time for large-scale scraping.

In this introduction, we are going to see how to deploy a slightly modified version of the Craigslist scraper we built in a previous blog post to AWS Lambda, using the Serverless framework.

Prerequisites

We are going to use the Serverless framework to build and deploy our project to AWS Lambda. The Serverless CLI can generate lots of boilerplate code in different languages and deploy the code to different cloud providers, like AWS, Google Cloud or Azure.

Architecture

We will build an API using API Gateway, with a single endpoint /items/{query} bound to a Lambda function that responds with a JSON array of all the items (on the first result page) for this query.

Here is a simple diagram for this architecture:

Create the Maven project

Serverless is able to generate projects in lots of different languages: Java, Python, NodeJS, Scala...
We are going to use one of these templates to generate a Maven project.
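With the Serverless CLI installed, that looks something like this (the service name is just an example):

serverless create --template aws-java-maven --path craigslist-scraper
cd craigslist-scraper
mvn package

The last command builds the jar that serverless deploy will upload to AWS.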

I also set the timeout to 30 seconds. The default timeout with the Serverless framework is 6 seconds. Since we're running Java code, the Lambda cold start can take several seconds, and we then make an HTTP request to the Craigslist website, so 30 seconds seems reasonable.
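For reference, here is roughly what the function declaration in serverless.yml could look like. The function name is hypothetical, and I use searchQuery as the path parameter so it matches the handler described below:

functions:
  getItems:
    handler: com.serverless.Handler
    timeout: 30 # default is 6 seconds, too short for a Java cold start plus an HTTP call
    events:
      - http:
          path: items/{searchQuery}
          method: get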

Function code

Now we can modify the Handler class. The function logic is simple. First, we retrieve the path parameter called "searchQuery". Then we create a CraigsListScraper and call its scrape() method with this searchQuery. It returns a List<Item> representing all the items on the first Craigslist result page.

We then use the ApiGatewayResponse class that was generated by the Serverless framework to return a JSON array containing every item.
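Here is a sketch of what the handler can look like. CraigsListScraper and Item come from the companion repository, and ApiGatewayResponse (with its builder) is the class generated by the aws-java-maven template, so treat the exact signatures as indicative:

package com.serverless;

import java.util.Collections;
import java.util.List;
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class Handler implements RequestHandler<Map<String, Object>, ApiGatewayResponse> {

    @Override
    public ApiGatewayResponse handleRequest(Map<String, Object> input, Context context) {
        // API Gateway puts path parameters in a nested map inside the input
        @SuppressWarnings("unchecked")
        Map<String, String> pathParameters = (Map<String, String>) input.get("pathParameters");
        String searchQuery = pathParameters.get("searchQuery");

        // Scrape the first Craigslist result page for this query
        CraigsListScraper scraper = new CraigsListScraper();
        List<Item> items = scraper.scrape(searchQuery);

        // Return the items as a JSON array in the response body
        return ApiGatewayResponse.builder()
                .setStatusCode(200)
                .setObjectBody(items)
                .setHeaders(Collections.singletonMap("Content-Type", "application/json"))
                .build();
    }
}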

You can find the rest of the code in this repository, with the CraigsListScraper and Item class.

This is the end of this tutorial. I hope you enjoyed the post. Don't hesitate to experiment with Lambda and other cloud providers; it's really fun, easy, and can drastically reduce your infrastructure costs, especially for web scraping or other asynchronous tasks.

In the previous articles, I introduced you to two different tools for performing web scraping with Java: HtmlUnit in the first article, and PhantomJS in the article about handling JavaScript-heavy websites.

This time we are going to look at a new feature of Chrome: headless mode. There was a rumor going around that Google used a special version of Chrome for its crawling needs. I don't know whether this is true, but Google launched headless mode with Chrome 59, several months ago.

This article is an excerpt from my new book Java Web Scraping Handbook.
The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites, and much more. Check out the book!

PhantomJS was the leader in this space; it was (and still is) heavily used for browser automation and testing. After hearing the news about headless Chrome, the PhantomJS maintainer said that he was stepping down as maintainer because, and I quote, "Google Chrome is faster and more stable than PhantomJS [...]".
It looks like headless Chrome is becoming the way to go when it comes to browser automation and dealing with JavaScript-heavy websites.

HtmlUnit, PhantomJS, and the other headless browsers are very useful tools; the problem is that they are not as stable as Chrome, and sometimes you will run into JavaScript errors that would not have happened with Chrome.

If you don't have Google Chrome installed, you can download it here.
To install ChromeDriver you can use Homebrew on macOS:

brew install chromedriver

Or download it using the link below.
There are a lot of versions; I suggest you use the latest versions of Chrome and ChromeDriver.

Let's log into Hacker News

In this part, we are going to log into Hacker News, and take a screenshot once logged in. We don't need Chrome headless for this task, but the goal of this article is only to show you how to run headless Chrome with Selenium.

The first thing we have to do is create a WebDriver object, set the chromedriver path, and pass some arguments.
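A minimal sketch of that setup (the chromedriver path is an example; adjust it to your installation):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

// Builds a headless Chrome driver; the chromedriver path is an example, adjust it to your install
private static WebDriver createHeadlessChromeDriver() {
    System.setProperty("webdriver.chrome.driver", "/usr/local/bin/chromedriver");

    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless", "--disable-gpu", "--window-size=1280,1024");

    return new ChromeDriver(options);
}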

The --disable-gpu option is needed on Windows systems, according to the documentation.
Chromedriver should automatically find the Google Chrome executable path. If you have a special installation, or if you want to use a different version of Chrome, you can set the binary path explicitly:
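// continuing with the options object from the snippet above; the path is just an example
options.setBinary("/path/to/your/google-chrome");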

The next step is to perform a GET request to the Hacker News login form, select the username and password fields, fill them with our credentials, and click the login button. Then we check for a credential error, and if we are logged in, we take a screenshot.
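Here is a sketch of that flow. The field names ("acct" and "pw") and the "Bad login" error message are what Hacker News used when I checked and may change, so treat them as assumptions:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;

// Sketch of the login flow; replace the credentials with your own
private static void logIntoHackerNews(WebDriver driver) throws Exception {
    driver.get("https://news.ycombinator.com/login");

    driver.findElement(By.name("acct")).sendKeys("myUsername");
    driver.findElement(By.name("pw")).sendKeys("myPassword");
    driver.findElement(By.name("pw")).submit();

    // A failed login keeps us on the form with an error message
    if (driver.getPageSource().contains("Bad login")) {
        System.err.println("Wrong credentials");
        return;
    }

    // We are authenticated: capture the homepage
    File screenshot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
    Files.copy(screenshot.toPath(), Paths.get("hn-logged-in.png"),
            StandardCopyOption.REPLACE_EXISTING);
}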

You should now have a nice screenshot of the Hacker News homepage while being authenticated. As you can see, headless Chrome is really easy to use, and it is not that different from PhantomJS, since we are using Selenium to drive it.

If you enjoyed this post please subscribe to my newsletter!
As usual, the code is available in this GitHub repository.

In the first article, I showed how to extract data from the Craigslist website.
But what if the data you want, or the action you want to carry out on a website, requires authentication?

This article is an excerpt from my new book Java Web Scraping Handbook.
The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites, and much more. Check out the book!

In this short tutorial, I will show you how to write a generic method that can handle most authentication forms.

Authentication mechanism

There are many different authentication mechanisms, the most frequent being a login form, sometimes with a CSRF token as a hidden input.

To auto-magically log into a website with your scrapers, the idea is the following (a minimal HtmlUnit sketch follows the list):

GET /loginPage

Select the first <input type="password"> tag

Select the first <input> before it that is not hidden

Set the value attribute for both inputs

Select the enclosing form, and submit it.
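Here is a minimal sketch of that routine with HtmlUnit. It is a simplified, hypothetical helper illustrating the steps above, not production-ready code:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FormAuthenticator {

    public static HtmlPage autoLogin(WebClient client, String loginUrl,
                                     String username, String password) throws Exception {
        HtmlPage page = client.getPage(loginUrl);

        // 1. Select the first <input type="password"> on the page
        HtmlInput passwordInput = (HtmlInput) page.querySelector("input[type='password']");
        HtmlForm form = passwordInput.getEnclosingForm();

        // 2. Select the closest non-hidden <input> that precedes it in the same form
        HtmlInput usernameInput = null;
        for (HtmlElement element : form.getElementsByTagName("input")) {
            if (element == passwordInput) {
                break;
            }
            HtmlInput candidate = (HtmlInput) element;
            if (!"hidden".equalsIgnoreCase(candidate.getAttribute("type"))) {
                usernameInput = candidate;
            }
        }

        // 3. Fill both inputs and submit the enclosing form; submitting the form
        //    keeps any hidden inputs (e.g. CSRF tokens) intact
        usernameInput.setValueAttribute(username);
        passwordInput.setValueAttribute(password);
        HtmlElement submit = form.getFirstByXPath(
                ".//input[@type='submit'] | .//button[@type='submit']");
        return submit.click();
    }
}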

Hacker News Authentication

Let's say you want to create a bot that logs into Hacker News (to submit a link or perform any action that requires being authenticated).
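With the generic helper from the previous section, a minimal sketch could look like this (the credentials and the logged-in check are placeholders):

// Hypothetical usage of the helper above; replace the credentials with your own
try (WebClient client = new WebClient()) {
    client.getOptions().setJavaScriptEnabled(false); // the Hacker News login form works without JS
    HtmlPage homePage = FormAuthenticator.autoLogin(
            client, "https://news.ycombinator.com/login", "myUsername", "myPassword");

    // The "logout" link only appears once we are authenticated
    System.out.println(homePage.asXml().contains("logout") ? "Logged in!" : "Login failed");
}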

This article is an excerpt from my new book Java Web Scraping Handbook.
The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites, and much more. Check out the book!

Since I am hosting this blog on Digital Ocean ($10 in credit if you sign up via this link), I will show you how to write a bot that automatically downloads every bill you have.

Login

To submit the login form without needing to inspect the DOM, we will use the magic method I wrote in the previous article.

Then we have to go to the billing page: https://cloud.digitalocean.com/settings/billing

It's almost finished; the last thing is to download the invoice. It's pretty easy: we will use the Page object to store the PDF and call getContentAsStream() on it. It's better to check that the file has the right content type when doing this (application/pdf in our case).
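A sketch of that last step, assuming client is the already-authenticated WebClient and invoiceUrl is one of the /billing/....pdf links scraped from the billing page:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

// Hypothetical helper: downloads one invoice to a local file
private static void downloadInvoice(WebClient client, String invoiceUrl, String targetFile)
        throws Exception {
    Page pdf = client.getPage("https://cloud.digitalocean.com" + invoiceUrl);

    // Make sure we actually received a PDF and not, say, a redirect to the login page
    String contentType = pdf.getWebResponse().getContentType();
    if (!"application/pdf".equals(contentType)) {
        throw new IllegalStateException("Expected a PDF but got " + contentType);
    }

    try (InputStream in = pdf.getWebResponse().getContentAsStream()) {
        Files.copy(in, Paths.get(targetFile), StandardCopyOption.REPLACE_EXISTING);
    }
}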

{"label":"Invoice for December 2015","amount":0.35,"date":1451602800000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for November 2015","amount":6.00,"date":1448924400000,"url":"/billing/XXXX.pdf"}
{"label":"Invoice for October 2015","amount":3.05,"date":1446332400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for April 2015","amount":1.87,"date":1430431200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for March 2015","amount":5.00,"date":1427839200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for February 2015","amount":5.00,"date":1425164400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for January 2015","amount":1.30,"date":1422745200000,"url":"/billing/XXXXXX.pdf"}
{"label":"Invoice for October 2014","amount":3.85,"date":1414796400000,"url":"/billing/XXXXXX.pdf"}

This article is an excerpt from my new book Java Web Scraping Handbook.
The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites, and much more. Check out the book!

Today, more and more websites use Ajax for fancy user experiences, dynamic web pages, and many other good reasons.
Crawling an Ajax-heavy website can be tricky and painful, so we are going to look at a few tricks to make it easier.

This article is an excerpt from my new book Java Web Scraping Handbook.
The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites, and much more. Check out the book!

PhantomJS and Selenium

Now we're going to use Selenium and GhostDriver to "pilot" PhantomJS.

The example we are going to look at is a simple "See more" button on a news site that performs an Ajax call to load more news.
So you may think that opening PhantomJS to click on a simple button is a waste of time and overkill? Of course it is!
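Here is a minimal sketch of the whole thing, assuming the GhostDriver/phantomjsdriver dependency is on the classpath; the URL, button id and article selector are placeholders for the news site used in the post:

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class SeeMoreScraper {

    // Sets up PhantomJS through GhostDriver; the phantomjs binary path is an example
    private static WebDriver initPhantomJs() {
        DesiredCapabilities caps = new DesiredCapabilities();
        caps.setCapability("phantomjs.binary.path", "/usr/local/bin/phantomjs");
        return new PhantomJSDriver(caps);
    }

    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = initPhantomJs();
        driver.get("https://example-news-site.com"); // placeholder URL

        List<WebElement> before = driver.findElements(By.cssSelector("article"));
        System.out.println("Articles before the click: " + before.size());

        // Click the "See more" button and wait (crudely) for the Ajax call to complete
        driver.findElement(By.id("seeMoreButton")).click();
        Thread.sleep(800);

        List<WebElement> after = driver.findElements(By.cssSelector("article"));
        System.out.println("Articles after the click: " + after.size());

        driver.quit();
    }
}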

Here we call the initPhantomJs() method to set everything up, then we select the button by its id and click it.

The other part of the code counts the number of articles on the page and prints it, to show what we have loaded.

We could also have printed the entire DOM with driver.getPageSource() and opened it in a real browser to see the difference before and after the click.

I suggest you look at the Selenium WebDriver documentation; there are lots of useful methods for manipulating the DOM.

I used a dirty solution, with a Thread.sleep(800), to wait for the Ajax call to complete.
It's dirty because the number is arbitrary, and the scraper could run faster if we waited only as long as the Ajax call actually takes.
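A cleaner option, sketched below, is an explicit wait that polls until the new articles actually show up (same placeholder selectors as above):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

// Hypothetical replacement for Thread.sleep(800): wait only as long as the Ajax call needs
private static void clickSeeMoreAndWait(WebDriver driver) {
    int countBefore = driver.findElements(By.cssSelector("article")).size();
    driver.findElement(By.id("seeMoreButton")).click();

    // Poll until new articles appear, up to 10 seconds (Selenium 3 style constructor)
    new WebDriverWait(driver, 10)
            .until(d -> d.findElements(By.cssSelector("article")).size() > countBefore);
}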

Conclusion

So we've seen a little bit about how to use PhantomJS with Java.

The example I took is really simple, and it would have been easy to simulate the Ajax request ourselves instead.

But sometimes, when you have tens of Ajax calls and lots of JavaScript being executed to render the page properly, it can be very hard to scrape the data you want, and PhantomJS/Selenium is here to save you :)

Next time we will see how to do it by analyzing the Ajax calls and making the requests ourselves.

This article is an excerpt from my new book Java Web Scraping Handbook.
The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites, and much more. Check out the book!