Headless Chrome and the Puppeteer Library for Scraping and Testing the Web

Written by Nikos Vaggalis

Wednesday, 29 November 2017

With the advent of Single Page Applications, scraping pages for information and running automated user interaction tests have become much harder due to the highly dynamic nature of such pages. The solution? Headless Chrome and the Puppeteer library.

Selenium, PhantomJS and others have been around for a long time, but despite arriving late to the party, headless Chrome and Puppeteer make for valuable additions to the team of web testing automation tools, which allow developers to simulate the interaction of real users with a web site or application.

Headless Chrome is able to run without Puppeteer, as it can be programmatically controlled through the Chrome DevTools Protocol (CDP), typically invoked by attaching to a remotely running Chrome instance:

chrome --headless --disable-gpu --remote-debugging-port=9222

After that, loading the protocol's sidekick module 'chrome-remote-interface', which provides a simple abstraction of commands and notifications through a straightforward JavaScript API, lets one execute JavaScript scripts under a local Node.js installation.
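As a rough illustration of that workflow, the sketch below attaches to the Chrome instance started above on port 9222, navigates to a page and evaluates an expression inside it. The function name fetchTitle and the choice of reading document.title are assumptions made for illustration; the module is passed in as an argument so that the sketch does not hard-wire the dependency.

```javascript
// Sketch only: `CDP` is assumed to be the function exported by the
// chrome-remote-interface module, and Chrome is assumed to be running
// with --remote-debugging-port=9222. fetchTitle is a hypothetical name.
async function fetchTitle(CDP, url, port = 9222) {
  // Attach to the already running Chrome instance.
  const client = await CDP({ port });
  try {
    const { Page, Runtime } = client;
    // Page events are only delivered once the Page domain is enabled.
    await Page.enable();
    await Page.navigate({ url });
    await Page.loadEventFired();
    // Evaluate a JavaScript expression in the context of the loaded page.
    const { result } = await Runtime.evaluate({ expression: 'document.title' });
    return result.value;
  } finally {
    // Detach from Chrome whether or not the steps above succeeded.
    await client.close();
  }
}

// Possible usage, with the module installed locally:
// fetchTitle(require('chrome-remote-interface'), 'https://example.com')
//   .then(console.log);
```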

From the official Puppeteer documentation, here is an example that navigates to https://example.com and saves a screenshot as example.png:
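A minimal sketch of that example, following the shape of the one in the Puppeteer README. Wrapping the steps in a function that receives the puppeteer module as an argument is a choice made here for illustration, and saveScreenshot is a hypothetical name.

```javascript
// Sketch only: `puppeteer` is assumed to be the module returned by
// require('puppeteer'). saveScreenshot is a hypothetical helper name.
async function saveScreenshot(puppeteer, url = 'https://example.com', path = 'example.png') {
  // launch() starts a bundled Chromium, headless by default.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Write a PNG of the current viewport to the given path.
    await page.screenshot({ path });
  } finally {
    // Always shut the browser down, even if a step above throws.
    await browser.close();
  }
}

// Possible usage, with Puppeteer installed locally:
// saveScreenshot(require('puppeteer'));
```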

For example, let's go to www.smadeseek.com and load the list of all available smartphones. Then programmatically click on the img element of the second displayed device to bring up its detailed specifications page. From there we can access the innerHTML of the first table element:
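Those steps could be sketched with Puppeteer roughly as follows. The site's actual markup is not shown here, so the CSS selector for the device images is left as a parameter and any value passed to it is an assumption. The function name is hypothetical, and the sketch further assumes that a table element only appears once the details page is shown.

```javascript
// Sketch only: `puppeteer` is assumed to be the module returned by
// require('puppeteer'); `deviceImgSelector` is a hypothetical CSS selector
// matching the img elements of the listed devices.
async function firstTableOfSecondDevice(puppeteer, listUrl, deviceImgSelector) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(listUrl);
    // The list is rendered dynamically, so wait until the images exist.
    await page.waitForSelector(deviceImgSelector);
    const images = await page.$$(deviceImgSelector);
    if (images.length < 2) {
      throw new Error('Expected at least two devices in the list');
    }
    // Click the second displayed device to open its specifications page.
    await images[1].click();
    // Assumption: a table is only present on the details page.
    await page.waitForSelector('table');
    // $eval runs the callback in the page against the first matching element.
    return await page.$eval('table', (table) => table.innerHTML);
  } finally {
    await browser.close();
  }
}

// Possible usage (URL scheme and selector are assumptions):
// firstTableOfSecondDevice(require('puppeteer'), 'http://www.smadeseek.com', 'img')
//   .then(console.log);
```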

There's just one caveat. Since CDP only works with Chromium, Chrome and other Blink-based browsers, so does Puppeteer. If you require more than that, then sticking to Selenium and its WebDriver API remains the best option.
