The Pitfalls of Building an In-House Web Scraper

The internet can be a treasure trove of data for your business. It’s teeming with customer insights, competitive information, and government regulations that could affect your position in the market. Even better, you can collect this data in a systematic way and analyze it to reveal business threats and opportunities.

The trick is collecting this data in a clean and timely manner. That’s where web scraping comes in.

Web scraping enables you to pull data from a chosen website and store it in an organized way. The process involves pointing your custom web scraping script at the target website, configuring the tool, and scheduling regular scrapes. Writing your own web scraper seems simple … at first.

But the pitfalls of building an in-house web scraper are many. Like a B horror-movie monster, web scraping traps lurk behind the scenes, ready to spring out in the last act and foil your data-gathering efforts. Using a professionally developed and supported web scraping software suite instead can help you avoid the 9 pitfalls below.

Pitfall #1: Crawling Too Fast

All web scraping carries an inherent risk: it increases the load on the target website. A web scraper emulates a human site visitor in that it accesses (or “crawls”) the target web pages, causing the hosting server to serve each page along with any graphics, video, or dynamic content.

However, poorly designed web scraping scripts run the risk of crawling through pages much faster than a human visitor ever would. When a script crawls a website too quickly, it can inadvertently have the same effect as a denial-of-service (DoS) attack on the target website and degrade the experience for other site visitors.

Thus, savvy website owners monitor their web traffic, configure their monitoring software to recognize unnaturally fast crawling, and take action to protect their sites from degraded performance.
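One simple safeguard is to throttle the scraper so it never exceeds a fixed request rate. The sketch below is illustrative, not a definitive implementation; the class name and interval are assumptions:

```python
import time

class Throttle:
    """Enforce a minimum interval between requests so the scraper
    never hits the target site faster than a configured rate.
    (Illustrative sketch; names and defaults are assumptions.)"""

    def __init__(self, min_interval_seconds: float):
        self.min_interval = min_interval_seconds
        self._last_request = 0.0

    def wait(self) -> None:
        # Sleep only for the remainder of the interval since the last request.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Calling `wait()` before each fetch caps the crawl at, say, one request every two seconds with `Throttle(2.0)`, no matter how fast the scraping loop itself runs.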

Pitfall #2: Repeatedly Following Similar Crawling Patterns

Just as you don’t want to scrape too fast, you also want to avoid scraping the same way every time. Repetition is another dead giveaway, since humans rarely perform identical actions on a website over and over. Well-configured site management software will detect patterns in web scraping and flag them for the web developer. Homegrown web scraping scripts often lack the sophistication to include randomized actions (like mouse pointer movement and clicks) that mimic human behavior and throw web monitoring software off the scraper’s trail.
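A lightweight way to break up fixed patterns is to randomize both the pause between requests and the order in which pages are visited. A minimal Python sketch (the URL list and delay bounds are made-up examples):

```python
import random

def humanized_delay(base: float = 2.0, jitter: float = 3.0) -> float:
    """Return a randomized pause (base to base+jitter seconds) so the
    gap between requests never forms a fixed, detectable rhythm."""
    return base + random.uniform(0, jitter)

# Visit pages in a different order on every run (placeholder URLs).
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
random.shuffle(urls)

# Pair each URL with its own randomized delay; a real scraper would
# sleep for that long before fetching the page.
schedule = [(url, humanized_delay()) for url in urls]
```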

Pitfall #3: Getting Stuck in the Honeypot

When web scraping, be sure to take heed of Admiral Ackbar’s immortal line in Return of the Jedi: “It’s a trap!” Clever developers can set traps for poorly-written web scraping scripts in the form of “honeypot” links. These are links hidden within web pages that are invisible to human users, but can be accessed by unwary web scrapers. When these links are accessed, it alerts the site’s web monitoring software, and the offending web scraper may be subject to unhappy consequences.

These links can be detected. For example, honeypot links may carry nofollow tags or be styled the same color as the page background. But building this detection logic is a non-trivial exercise and can be costly and time-consuming to develop on your own.
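As a sketch of what that logic might look like, the snippet below uses Python’s standard `html.parser` to flag anchor tags with nofollow or hidden inline styles. These are heuristics only: links hidden via external CSS or a matching background color would not be caught, and the markup is a made-up example.

```python
from html.parser import HTMLParser

class HoneypotLinkFinder(HTMLParser):
    """Flag links that look hidden: rel=nofollow, display:none, or
    visibility:hidden inline styles. (Heuristic sketch; real honeypots
    may hide links via stylesheets this parser never sees.)"""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        rel = (attrs.get("rel") or "").lower()
        if "nofollow" in rel or "display:none" in style or "visibility:hidden" in style:
            self.suspicious.append(attrs.get("href"))

html = '<a href="/trap" style="display: none">x</a><a href="/ok">ok</a>'
finder = HoneypotLinkFinder()
finder.feed(html)
# finder.suspicious now holds only the hidden "/trap" link.
```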

Pitfall #4: Not Using Rotating IPs and Proxy Services

All incoming website requests (human, web scraping, or otherwise) include an IP address. Most web monitoring software can identify web scrapers by looking for multiple requests from the same IP address. Varying the IP address on outgoing web scraping requests can therefore throw web monitors off the trail. A common approach is to maintain a pool of IP addresses from which the scraper randomly selects for each request.

In addition, more sophisticated web scraping tools can use virtual private networks (VPNs) and proxy servers (separate servers that act as gateways to the internet) to present different IP addresses in their outgoing website requests.
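The selection logic itself can be simple. The sketch below picks a random proxy from a pool on each call; the proxy addresses are placeholders, and the returned dict happens to match the format the popular `requests` library expects for its `proxies` argument:

```python
import random

# Hypothetical pool of proxy endpoints (placeholder addresses).
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def pick_proxy() -> dict:
    """Choose a proxy at random so consecutive requests leave
    through different IP addresses."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Usage with requests (not executed here):
#   requests.get(url, proxies=pick_proxy(), timeout=10)
```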

Pitfall #5: Ignoring Site Guidance in the Robots.txt File

Operators of most ecommerce and other heavily trafficked websites understand that web scraping is a fact of life. In fact, major search engines use a type of web scraping to populate their search results. So most sites allow it, within reason.

An industry-standard practice to balance the needs of discoverability and user experience is the use of a robots.txt file. Easily found in the website’s root directory, this file is meant to define the parameters of acceptable web scraping on the site, such as allowed request rate, allowed pages, disallowed pages, etc.

More sophisticated robots.txt files may vary permissions by web scraper. For example, a robots.txt file may be very permissive for Google and DuckDuckGo web scrapers, but more restrictive for small-time operators or potential competitors. So, web scraping scripts need to include sophisticated robots.txt scanning to understand the allowable behaviors for their particular scraper, and configure scraping parameters on the fly to stay off the radar of the website operator.
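Python’s standard library ships a robots.txt parser that handles much of this. The sketch below parses an example file (the rules and scraper name are made up) and checks what a given scraper is allowed to do:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: permissive for Googlebot, restrictive for everyone else.
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

googlebot_ok = rp.can_fetch("Googlebot", "https://example.com/private/data")
scraper_ok = rp.can_fetch("MyScraper", "https://example.com/private/data")
delay = rp.crawl_delay("MyScraper")  # seconds between requests per the "*" rule
```

A compliant scraper would feed that crawl delay into its throttling logic and skip disallowed paths entirely, rather than learning the limits the hard way.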

Pitfall #6: Using the Same User Agent

By default, web browsers include a user agent header in their requests to websites. Similar to the IP address issue described above, web monitoring software can easily pick up on repeated requests carrying the same user agent header.

The fix is similar to the rotating IP solution described above: programmatically present different user agents, or spoofed agents that look more like human users, to web monitors. One approach is to maintain a list of user agents and randomly pick one each time the scraper runs. Another is to favor user agents commonly associated with human users over default agents.

As with other strategies, this approach requires well-designed web scraping software that anticipates this need and applies user agent randomization at run time with acceptable performance.
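A sketch of that randomization in Python (the agent strings below are representative examples and would need to be kept current in practice):

```python
import random

# Small pool of common browser user-agent strings (examples only;
# real deployments refresh this list as browser versions change).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def request_headers() -> dict:
    """Build headers with a randomly chosen user agent per request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage (not executed here): requests.get(url, headers=request_headers())
```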

Pitfall #7: Not Accounting for the Demands of Web Scraping

Effective web scraping demands deep expertise in web technologies, strong software development and testing skills, and a significant amount of server and network capacity, along with dedicated resources to maintain web scraping infrastructure after it’s built.

Then there’s the matter of managing multiple web scrapers for many different websites. Development, coordination, and management of all these independent web scraping agents can quickly become overwhelming. But if they’re not properly maintained, the value of web scrapers degrades over time as their target websites change.

Finally, there’s the data harvested from web scraping. Ten different websites may present information for the exact same product in 10 different ways. All that raw data will need to be profiled, cleansed, processed, and conformed to a data model in an accessible data asset to produce meaningful and timely insights. So, you’ll need a data engineering team to work hand in hand with a web scraping script development team.

Pitfall #8: Getting Banned or Blacklisted

Another consequence of having your poorly written web scraper detected is that the target website’s operators may learn to recognize it and ban it from accessing their site or restrict certain content (aka blacklisting). If you start encountering CAPTCHA pages, unusual delays, or frequent 404 error messages, you may have been busted by an astute web developer.

Obviously, this would defeat the purpose of web scraping and would put you in the position of having to start over with a better-designed script that the target server won’t recognize. So, if your script-writing chops weren’t good enough to avoid a ban or blacklisting, you’ll really have to pick up your game on the next try.

Pitfall #9: Data Spoofing

Unwary web scrapers can set up their owners for an even worse fate than banning or blacklisting: data spoofing. Sophisticated and somewhat devious website developers may choose to let an unwanted web scraper pull data from their site, but will strategically lace this data with misinformation. Without a source for comparison, the web scraper would unwittingly carry this bad data back to its host server and contaminate any business analysis.

Obviously, this state of affairs could have disastrous consequences for the company doing the scraping, so meticulous care must be taken when coding web scrapers to follow the best practices described above.

Web scraping is a double-edged sword. It can give any business timely insights that lead to a competitive advantage, and in some industries it’s a necessity for keeping up with customer and competitive trends.

But it’s also more complicated and risky than it seems at first. There are a host of unforeseen perils that can ensnare the unwitting. Companies that have a well-defined niche market and strong technical core competencies may be in a position to tackle web scraping. But most companies would be well-advised to strongly consider professional web scraping or web data integration services to reap all the benefits and avoid all the risks.

What is your experience with web scraping? Share in the comments below: