Developer tips

Fix Your Crawler

Every now and then I open the admin panel of my blog hosting and ban a few IPs (after I’ve tried messaging their abuse email, if I find one). It is always IPs that are generating tons of requests (and traffic) – most likely running some home-made crawler. In some cases the IPs belong to an actual service that captures and provides content, in other cases it’s just a scraper for unknown reasons.

I don’t want to ban IPs, especially because that same IP may be reassigned to a legitimate user (or network) in the future. But they are increasing my hosting usage, which in turn leads to the hosting provider suggesting an upgrade in the plan. And this is not about me, I’m just an example – tons of requests to millions of sites are … useless.

Second, make your crawler “polite” (the “politeness” property in the article above). Here’s a good overview on how to be polite, including respect for robots.txt. Existing implementations most likely have politeness options, but you may have to configure them.

Here I’d suggest another option – set a dynamic crawl rate per website that depends on how often the content is updated. My blog updates 3 times a month – no need to crawl it more than once or twice a day. TechCrunch updates many times a day; it’s probably a good idea to crawl it way more often. I don’t have a formula, but you can come up with one that ends up crawling different sites with periods between 2 minutes and 1 day.

Third, don’t “scrape” the content if a better protocol is supported. Many content websites have RSS – use that instead of the HTML of the page. If not, make use of sitemaps. If the WebSub protocol gains traction, you can avoid the crawling/scraping entirely and get notified on new content.

Finally, make sure your crawler/scraper is identifiable by the UserAgent. You can supply your service name or web address in it to make it easier for website owners to find you and complain in case you’ve misconfigured something.