5 Tips For Web Scraping Without Getting Blocked or Blacklisted

Published 2019-03-30 by The Scraper API Team

Web scraping can be difficult, particularly when most popular sites
actively try to prevent developers from scraping them. However, there
are many strategies developers can use to avoid blocks and keep their
web scrapers from being detected. Here are a few of our favorites:

1. IP Rotation

The number one way sites detect web scrapers is by examining their IP
address. To avoid sending all of your requests from the same IP
address, you can use Scraper API or another proxy service to route
your requests through a series of different IP addresses. This will
allow you to scrape the majority of websites without issue.
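As a minimal sketch of this idea, a scraper can cycle through a pool of
proxy addresses so consecutive requests leave from different IPs. The
addresses below are placeholders; a service like Scraper API manages a
real pool for you:

```python
import itertools

# Placeholder proxy addresses -- substitute real proxies from your provider.
PROXY_POOL = itertools.cycle([
    "http://203.0.113.1:8080",
    "http://203.0.113.2:8080",
    "http://203.0.113.3:8080",
])

def next_proxies():
    """Return a proxies mapping (as used by libraries such as requests),
    advancing to the next IP in the pool on every call."""
    proxy = next(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# With the requests library, each call would then use a different IP:
#   requests.get("https://example.com", proxies=next_proxies())
```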

2. Set a Real User Agent

Some websites will examine the User Agent header and block requests
from User Agents that don't belong to a major browser. Many web
scrapers don't bother setting a User Agent at all, and are easily
detected by this missing header. Don't be one of these developers!
Remember to set a popular User Agent for your web crawler (lists of
common User Agents are easy to find online).
Advanced users can also set the Googlebot User Agent, since most
websites want to be listed on Google and therefore let Googlebot
through.
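A small sketch using only the standard library; the User Agent string
below is an example value for desktop Chrome, so substitute a current
one from a published list:

```python
import urllib.request

# Example User Agent for a desktop Chrome browser; use a current value
# from a list of popular User Agents, since stale strings can
# themselves look suspicious.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
)

def build_request(url):
    """Attach a browser-like User Agent so the request isn't flagged
    as coming from a default HTTP client."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```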

3. Set other headers

Real web browsers send a whole host of headers, any of which a careful
website can check to block your web scraper. To make your scraper
appear to be a real browser, navigate to
https://httpbin.org/anything
and simply copy the headers that you see there (they are the headers
that your current web browser is sending). Setting headers like
"Accept", "Accept-Encoding", "Accept-Language", and "Cache-Control"
will make your requests look like they are coming from a real browser.
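Such a header set might look like the following sketch; the values are
examples modeled on what a desktop Chrome browser sends, so replace
them with whatever https://httpbin.org/anything shows for your own
browser:

```python
# Example header set modeled on a desktop Chrome session; copy the
# values your own browser reports at https://httpbin.org/anything.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/webp,*/*;q=0.8"
    ),
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
}

# These would be sent with every request, e.g. with the requests library:
#   requests.get("https://example.com", headers=BROWSER_HEADERS)
```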

4. Set random intervals between your requests

It is easy to detect a web scraper that sends exactly one request each
second, 24 hours a day! No real person uses a website like that, and
such an obvious pattern is easily detected. Use randomized delays
(anywhere between 2 and 10 seconds, for example) to build a web
scraper that can avoid being blocked.
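A randomized delay is a one-liner with the standard library; here is a
minimal helper (the 2-10 second defaults match the range suggested
above):

```python
import random
import time

def polite_sleep(min_delay=2.0, max_delay=10.0):
    """Sleep for a random interval so requests don't arrive on a fixed,
    machine-like schedule. Returns the delay actually used."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return delay

# In a scraping loop, call it between requests:
#   for url in urls:
#       fetch(url)
#       polite_sleep()
```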

5. Use a headless browser (advanced)

The trickiest websites to scrape may detect subtle tells like web
fonts, extensions, browser cookies, and JavaScript execution to
determine whether the request is coming from a real user. To scrape
these websites you may need to deploy your own headless browser (or
have Scraper API do it for you!). Tools like Selenium and Puppeteer
let you write a program that controls a real web browser, identical to
what a real user would use, to avoid detection entirely. While this
takes quite a bit of work to set up, it is the most effective way to
scrape websites that would otherwise be quite difficult.
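A sketch of this approach with Selenium might look like the following.
It assumes the selenium package and a matching Chrome driver are
installed; the imports are kept inside the function so the snippet
loads even without them:

```python
def fetch_rendered_html(url):
    """Load a page in headless Chrome via Selenium and return the HTML
    after JavaScript has run. Requires the selenium package and a
    Chrome driver on your PATH (imported lazily here so the module
    loads even when they are absent)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```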

Hopefully you've learned a few useful tips for scraping popular
websites without being blacklisted or IP banned. If you ever have any
questions, feel free to contact us at hello@scraperapi.com. Happy
scraping!

Ready to start scraping?

Scraper API is a tool that handles proxies, browsers, and CAPTCHAs so
developers can get the HTML of any web page with a simple API call. Get started with 1000 free API calls or contact sales.


Our Story

Having built many web scrapers, we repeatedly went through the tiresome
process of finding proxies, setting up headless browsers,
and solving CAPTCHAs. That's why we decided to start Scraper API: it
handles all of this for you, so you can scrape any page with a simple
API call!