Restrictions Imposed by Websites

The original Chinese article was written by ninthakeey. It has been translated and remixed by Datumorphism.

Beware that scraping data off websites is neither always allowed nor as easy as a few lines of code. The preceding articles enable you to scrape a lot of data; however, websites have countermeasures. In this article, we deal with some of the common ones.

Request Frequency

Some websites limit the frequency of API requests. The simplest solution is a brief pause between requests. In Node.js, the function setInterval lets us fire requests at a fixed pace.
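
Here is a minimal sketch of that idea: a queue of URLs is drained by a setInterval callback, one request per tick. The fetchPage helper and the URLs are placeholders for your actual request logic.

```javascript
// Drain a queue of URLs at a fixed pace with setInterval.
// `fetchPage` is a hypothetical stand-in for your real request code
// (e.g. a call made with axios or node-fetch).
const urls = ["https://example.com/page/1", "https://example.com/page/2"];

function fetchPage(url) {
  console.log(`requesting ${url}`);
  // ... issue the HTTP request here ...
}

const timer = setInterval(() => {
  const url = urls.shift();
  if (url === undefined) {
    clearInterval(timer); // stop once the queue is empty
    return;
  }
  fetchPage(url);
}, 1000); // at most one request per second
```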

Node.js runs your JavaScript on a single thread. As we mentioned when explaining the function fs.writeFile(), Node.js has non-blocking I/O: while one operation waits on the disk or the network, the event loop can start or finish others. This concurrency, the interleaving of many in-flight operations on one thread, is at the heart of Node.js and can make your code much more efficient.
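
A small demonstration of the non-blocking behavior (the file name here is only an example):

```javascript
const fs = require("fs");

// fs.writeFile returns immediately; its callback runs later,
// so "after writeFile" prints before "file written".
fs.writeFile("demo.txt", "hello", (err) => {
  if (err) throw err;
  console.log("file written");
});

console.log("after writeFile");
```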

In the first block of code, we used setInterval with a callback. However, callbacks quickly become messy. It would be much better if we could chain the steps instead, and a Promise is exactly what we need to achieve this.
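
As a sketch under the same assumptions (a hypothetical fetchPage and a fixed pause), the pacing can be rewritten as a Promise chain. Here sleep wraps setTimeout in a Promise so that the pause itself becomes chainable:

```javascript
// `sleep` resolves after `ms` milliseconds, making the pause chainable.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical request helper; replace with your HTTP client of choice.
function fetchPage(url) {
  console.log(`requesting ${url}`);
  return Promise.resolve();
}

fetchPage("https://example.com/page/1")
  .then(() => sleep(1000)) // pause before the next request
  .then(() => fetchPage("https://example.com/page/2"))
  .then(() => sleep(1000))
  .then(() => fetchPage("https://example.com/page/3"))
  .catch((err) => console.error(err));
```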

async tells Node.js that we are defining an asynchronous function, while await pauses that function until a Promise settles and hands back its resolved value. Together they are basically an alternative to chaining .then and .catch.
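
The Promise chain above can then be written as a plain loop. This is a sketch reusing the same assumed helpers (sleep and fetchPage):

```javascript
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical request helper, as before.
function fetchPage(url) {
  console.log(`requesting ${url}`);
  return Promise.resolve();
}

// `async` marks the function; `await` pauses it until each Promise settles.
async function scrapeAll(urls) {
  for (const url of urls) {
    try {
      await fetchPage(url); // wait for this request to finish
    } catch (err) {
      console.error(`failed on ${url}:`, err); // replaces .catch
    }
    await sleep(1000); // pause before the next request
  }
}

scrapeAll(["https://example.com/page/1", "https://example.com/page/2"]);
```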

User-Agent Verification

Some services only respond successfully to requests that appear to come from certain browsers. The way they do it is to check the User-Agent (UA) header of the HTTP request. In that case, we have to send our requests with a fake UA header.
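
A minimal sketch with Node's built-in https module; the UA string is just an example, and you would usually copy a current one from your own browser:

```javascript
const https = require("https");

// Pretend to be a desktop Chrome browser. The exact string below is
// an example; any realistic, up-to-date UA from a real browser works.
const options = {
  hostname: "example.com",
  path: "/",
  headers: {
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  },
};

https.get(options, (res) => {
  console.log("status:", res.statusCode);
  res.resume(); // drain the response; we only care about the status here
});
```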