Web scraping has been growing in popularity for years, and freelance sites are crowded with jobs involving this controversial data-extraction process. So, let’s consider an elegant and modern way to scrape data from websites with Node.js!

First, a few words about the technology in use. Node.js is a cross-platform server-side runtime environment built on the V8 JavaScript engine. Its two main benefits are:

Using JavaScript on the back-end

Asynchronous programming – when thousands of users are connected to the server simultaneously, Node.js handles them asynchronously; that is, it prioritizes requests and distributes resources more efficiently.
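As a tiny illustration of this non-blocking model (a generic sketch, not code from this tutorial), a timer scheduled with setTimeout does not stop the program; execution continues immediately to the next line:

```javascript
// Node.js schedules the "slow" operation and keeps executing the next lines
var order = [];

setTimeout(function () {
  order.push('slow operation finished'); // runs later, from the event loop
}, 10);

order.push('kept working'); // runs immediately, without waiting

console.log(order[0]); // 'kept working' - the main flow was never blocked
```

This is the pattern behind every callback you will see below: you hand Node.js a function, and it calls the function back when the slow work is done.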

Node.js is commonly used for creating APIs; it is also very convenient for building desktop, mobile, and, take notice, IoT applications. The deeper you study it, the more clearly you will see why it has become a mainstay of back-end development.

If you don’t know anything about Node.js, a basic understanding of JavaScript and callback functions will be enough; the more complex code will be explained here.

Modules

Let’s start with overviewing our project. What do we need first? Node.js consists of a lot of useful modules that help you work faster. We will use these:

Express: a Node.js framework that makes it easy to design APIs for mobile and web apps.

fs: the file system module. We will use it to write the results into a file.

Request: this module provides the simplest way to make HTTP calls.

Cheerio: this allows one to use jQuery syntax to parse web data.

Now we will create our project and take some installation steps.

Building a project

To use Node.js you should first download and install it. The installation process is very simple, so right after it’s successfully completed you can start using it. We will talk about launching a bit later. Now we should create a project and install the needed modules.

Building the project is as easy as the installation:

Create a folder.

Inside the folder, create a file named package.json.

Open this file and paste the following into it:


{
  "name": "scrape",
  "version": "1.0.0",
  "description": "web scraping tutorial",
  "main": "server.js",
  "author": "Scraping.pro",
  "dependencies": {
    "express": "latest",
    "request": "latest",
    "cheerio": "latest"
  }
}

In the package.json file this basic information is placed: the name of the project, the project version and description, the main file, and the author. The dependencies section defines all the modules, and their versions (latest), that will be used in the project.

Now we are going to use the command line, but first we should write some code. Create a server.js file and enter the following into it:


console.log("Hello!");

To run it, open a terminal, navigate to your project folder, and enter the command node server – it will print our message in the console.

The basic configuration is done. Now we should install the modules that were listed in the package.json file. The command

npm install

will download them into our project.

Scraping data with an API

So, we have checked that the project works and have downloaded the modules. Let’s try to scrape some info. In the first example, we will get data about users on github.com. Fortunately, GitHub has its own open API. We will create a script which loads data about any GitHub user. For the test, we will get info about the creator of Linux and Git – Linus Torvalds.
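GitHub’s users endpoint (for example, https://api.github.com/users/torvalds) returns plain JSON. As a quick illustration of what handling such a response looks like (the payload below is a trimmed, made-up sample, not a live response), parsing the body comes down to JSON.parse:

```javascript
// a trimmed sample of the kind of JSON the GitHub users API returns
var body = '{"login": "torvalds", "name": "Linus Torvalds", "public_repos": 4}';

// turn the raw response body into a regular JavaScript object
var user = JSON.parse(body);

console.log(user.login); // torvalds
console.log(user.name);  // Linus Torvalds
```

Because the API serves structured JSON, no HTML parsing is needed for this first example.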


// here we initialize our modules - afterwards we will work with them like objects
var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app = express();

// here you can see how routes are made in Node.js + Express. The first parameter is the route,
// the second is a callback function, which has two default parameters - request and response.
// For example, if you want to return some data to the page, you should use the response variable
app.get('/scrape', function (req, res) {
  // GitHub's API requires a User-Agent header and returns JSON
  var options = {
    url: 'https://api.github.com/users/torvalds',
    headers: { 'User-Agent': 'request' }
  };
  request(options, function (error, response, body) {
    if (error) return res.send('Request failed');
    // write the raw JSON result to a file, then return it to the browser
    fs.writeFile('output.json', body, function (err) {
      if (err) console.log(err);
    });
    res.send(JSON.parse(body));
  });
});

app.listen(8081);
console.log('Listening on port 8081; open /scrape to run the scraper');


Scraping with Node.js is like an art, isn’t it?

Conclusions

Web scraping is an engaging experience. We strongly recommend that you go deeper into this topic and explore some other amazing features of scraping with Node.js, but remember: use the knowledge you gain only for legal purposes.

To become a guru in Node.js scraping, we recommend that you read the following 4 articles (the first is this very post):