Web Scraping : Handling AJAX website part I

Today more and more websites are using Ajax for fancy user experiences, dynamic web pages, and many more good reasons.
Crawling Ajax heavy website can be tricky and painful, we are going to see some tricks to make it easier.

This article is an excerpt from my new book Java Web Scraping Handbook
The book will teach you the noble art of web scraping. From parsing HTML to breaking captchas, handling Javascript heavy website and many more. Checkout the book !

PhantomJS and Selenium

Now we're going to use Selenium and GhostDriver to "pilot" PhantomJS.

The example that we are going to see is a simple "See more" button on a news site, that perform a ajax call to load more news.
So you may think that opening PhantomJS to click on a simple button is a waste of time and overkilled ? Of course it is !

Here we call the initPhantomJs() method to setup everything, then we select the button with its id and click on it.

The other part of the code count the number of articles we have on the page and print it to show what we have loaded.

We could have also printed the entire dom with driver.getPageSource()and open it in a real browser to see the difference before and after the click.

I suggest you to look at the Selenium Webdriver documentation, there are lots of cool methods to manipulate the DOM.

I used a dirty solution with my Thread.sleep(800) to wait for the Ajax call to complete.
It's dirty because it is an arbitrary number, and the scraper could run faster if we could wait just the time it takes to perform that ajax call.

Conclusion

So we've seen a little bit about how to use PhantomJS with Java.

The example I took is really simple, it would have been easy to simulate the request.

But sometimes when you have tens of Ajax calls, and lots of Javascript being executed to render the page properly, it can be very hard to scrape the data you want, and PhantomJS/Selenium is here to save you :)

Next time we will see how to do it by analyzing the AJAX calls and make the requests ourselves.

This article is an excerpt from my new book Java Web Scraping Handbook
The book will teach you the noble art of web scraping. From parsing HTML to breaking captchas, handling Javascript heavy website and many more. Checkout the book !