I'm going to be developing some functionality that will crawl various public web sites and process/aggregate the data on them. Nothing sinister like looking for e-mail addresses - in fact it's something that might actually drive additional traffic to their sites. But I digress.

Other than honouring robots.txt, are there any rules or guidelines, written or unwritten, that I ought to be following in order to (a) avoid appearing malicious and potentially being banned, and (b) not cause any problems for the site owners/webmasters?

Some examples I can think of which may or may not matter:

Number of parallel requests

Time between requests

Time between entire crawls

Avoiding potentially destructive links (don't want to be the Spider of Doom - but who knows if this is even practical)

That's really just spit-balling, though; is there any tried-and-tested wisdom out there that's broadly applicable for anybody who intends to write or utilize a spider?

7 Answers
7

There are many who believe that robots.txt is not the proper way to block indexing, and because of that viewpoint they have instructed many site owners to rely on the <meta name="robots" content="noindex"> tag to tell web crawlers not to index a page. If you want to be well-behaved, honour that tag too.

If you're trying to build a graph of connections between websites (anything similar to PageRank), the rel="nofollow" attribute on a link (and the page-level <meta name="robots" content="nofollow">) is supposed to indicate that the source site doesn't trust the destination site enough to give it a proper endorsement. So while you can index the destination site, you ought not store the relation between the two sites.
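As a sketch of how that distinction might be applied while building the graph, the parser below (class name, HTML snippet, and site names are all invented for illustration) keeps rel="nofollow" links out of the set of stored endorsements, using only the standard library:

```python
from html.parser import HTMLParser

# Sketch: when building a link graph, skip edges marked rel="nofollow".
class LinkGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.endorsed = []   # links we may store as graph edges
        self.nofollow = []   # links we may index but not record as endorsements

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        rels = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rels:
            self.nofollow.append(href)
        else:
            self.endorsed.append(href)

parser = LinkGraphParser()
parser.feed('<a href="http://a.example/">good</a> '
            '<a rel="nofollow" href="http://b.example/">untrusted</a>')
print(parser.endorsed)   # ['http://a.example/']
print(parser.nofollow)   # ['http://b.example/']
```

A production crawler would use a tolerant HTML library instead, but the bookkeeping stays the same: both lists feed the fetch queue, and only the first feeds the graph.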

SEO is more of an art than a real science, and it's practiced by a lot of people who know what they're doing, and a lot of people who read the executive summaries of people who know what they're doing. You're going to run into issues where you'll get blocked from sites for doing things that other sites found perfectly acceptable due to some rule someone overheard or read in a blog post on SEOmoz that may or may not be interpreted correctly.

Because of that human element, unless you are Google, Microsoft, or Yahoo!, you are presumed malicious unless proven otherwise. You need to take extra care to act as though you are no threat to a web site owner, and act in accordance with how you would want a potentially malicious (but hopefully benign) crawler to act:

stop crawling a site once you detect you're being blocked: 403/401s on pages you know work, throttling, time-outs, etc.

avoid exhaustive crawls in relatively short periods of time: crawl a portion of the site, and come back later on (a few days later) to crawl another portion. Don't make parallel requests.
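One way to sketch that "stop when blocked" rule is to treat auth and throttling responses on pages you know work as a signal. The codes and thresholds below are illustrative defaults, not established values; this operates on status codes only and makes no real requests:

```python
# 401/403 on known-good pages, or 429 Too Many Requests, suggest a block.
BLOCK_SIGNALS = {401, 403, 429}

def should_stop_crawling(status_history, window=5, threshold=3):
    """Stop if `threshold` of the last `window` responses look like blocks."""
    recent = status_history[-window:]
    blocked = sum(1 for status in recent if status in BLOCK_SIGNALS)
    return blocked >= threshold

print(should_stop_crawling([200, 200, 403, 403, 429]))  # True
print(should_stop_crawling([200, 200, 200, 200, 404]))  # False
```

Time-outs and sudden slowdowns could feed the same counter; the point is that the decision is made per-site from recent history, not hard-coded.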

Even then, it's going to be an uphill battle unless you resort to black-hat techniques like UA spoofing or deliberately masking your crawling patterns: many site owners, for the same reasons above, will block an unknown crawler on sight rather than take the chance that it isn't someone trying to "hack their site". Prepare for a lot of failure.

One thing you could do to combat the negative image an unknown crawler is going to have is to make it clear in your user-agent string who you are:

Aarobot Crawler 0.9 created by John Doe. See http://example.com/aarobot.html for more information.

Where http://example.com/aarobot.html explains what you're trying to accomplish and why you're not a threat. That page should have a few things:

Information on how to contact you directly

Information about what the crawler collects and why it's collecting it

Information on how to opt-out and have any data collected deleted

That last one is key: a good opt-out is like a Money Back Guarantee™ and scores an unreasonable amount of goodwill. It should be humane: one simple step (either an email address or, ideally, a form) and comprehensive (there shouldn't be any "gotchas": opt-out means you stop crawling without exception).
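Setting that identifying string on every request is a one-liner with the standard library. This sketch reuses the hypothetical bot name and info URL from the answer above:

```python
import urllib.request

# The bot name and info URL are the hypothetical ones from the answer.
UA = ("Aarobot Crawler 0.9 created by John Doe. "
      "See http://example.com/aarobot.html for more information.")

def make_request(url):
    """Build a request that always carries the identifying User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": UA})

req = make_request("http://example.com/some/page")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Whatever HTTP client you use, set the header once in a shared session or factory so no request can slip out with the library's default User-Agent.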

Huge +1 for the suggestion of putting clear info in the User-Agent. I've had the job of poring over webserver logs to figure out who was spidering a big site, and it's no fun trying to trace down who is running all the obscure spiders.
– Carson63000, Jul 11 '11 at 4:12


It's quite common to put the URL in the form (+http://example.com/aarobot.html). I don't know what the purpose of the + sign is here, but I've seen it often. Web-Sniffer does it, and so do many others.
– TRiG, Jul 11 '11 at 10:30

This is great information, but I'm confused on one thing: You make mention of rel="noindex" as if it's an <a> attribute, but the page you link to describes it as part of the <meta> tag's content attribute. Is it both, or was this a typo in the answer?
– Aaronaught, Jul 11 '11 at 23:06

@Aaronaught rel="noindex" is just <meta>; rel="nofollow" is both. I fail at coming up with a non-clumsy way to phrase that.
– user8, Jul 11 '11 at 23:34


"SEO is more of an art than a real science" - not true. If you are a statistical programmer, SEO is less an art and more a mathematical recognition skill. Math grads who are skilled in programming or programmers skilled in Maths are in good demand in the web data profiling industry.
– Blessed Geek, Jan 23 '12 at 3:34

While this doesn't answer all of your questions, I believe it will be of help to you and to the sites you crawl.

Similar to the technique used to brute-force websites without drawing attention: if you have a large enough pool of sites to crawl, don't fetch the next page on a site until you have fetched the next page of every other site. Modern servers do allow HTTP connection reuse, so you might want to fetch a few pages per connection to minimise overhead, but the idea still stands. Do not crawl one site to exhaustion before moving on to the next. Share the love.

At the end of the day you can still have crawled just as many pages, but the average bandwidth usage on any single site will be much lower.
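The round-robin scheme above can be sketched with per-site queues that are serviced one page at a time. The site names and page paths are made up for illustration:

```python
from collections import deque

# Sketch of "share the love": interleave per-site queues so no single
# site absorbs a burst of consecutive requests.
def interleave(per_site_queues):
    """Yield one URL from each site in turn until all queues are empty."""
    queues = deque(deque(urls) for urls in per_site_queues.values())
    while queues:
        q = queues.popleft()
        yield q.popleft()
        if q:                 # site still has pages: back of the line
            queues.append(q)

sites = {
    "a.example": ["a/1", "a/2", "a/3"],
    "b.example": ["b/1"],
    "c.example": ["c/1", "c/2"],
}
print(list(interleave(sites)))
# ['a/1', 'b/1', 'c/1', 'a/2', 'c/2', 'a/3']
```

A real crawler would add a per-site minimum delay on top of this ordering, but the rotation alone already spreads the load.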

If you want to avoid being the Spider of Doom, there is no sure-fire method. If someone wants to stick beans up their nose, they will, and they'll probably do so in ways you could never predict. Having said that, if you don't mind missing the occasional valid page, maintain a blacklist of words that, when found in a link, prevent you from following it. For example:

Delete

Remove

Update

Edit

Modify

Not fool-proof, but sometimes you just cannot prevent people from having to learn the hard way ;)
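A minimal sketch of that blacklist, with the caveat baked in: this is crude substring matching on the URL, so it will produce false positives (e.g. a harmless "/updates-newsletter" page would be skipped too):

```python
import re

# The word list is the one suggested above; extend it to taste.
DESTRUCTIVE = re.compile(r"delete|remove|update|edit|modify", re.IGNORECASE)

def is_safe_to_follow(url):
    """Skip links whose URL suggests a state-changing action."""
    return DESTRUCTIVE.search(url) is None

print(is_safe_to_follow("http://example.com/articles/42"))      # True
print(is_safe_to_follow("http://example.com/posts/42/delete"))  # False
```

Checking the link's anchor text and any form/button context as well would cut down the false positives, at the cost of more parsing.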

My one bit of advice is to listen to what the website you are crawling is telling you, and dynamically change your crawl in reaction to that.

Is the site slow? Crawl slower so you don't DDOS it. Is it fast? Crawl a bit more, then!

Is the site erroring? Crawl less so you're not stressing a site already under duress. Use exponentially increasing retry times, so that you retry less frequently the longer the site keeps erroring. But remember to come back eventually, so you can see anything you missed due to, say, a week-long error on a specific URL path.

Getting lots of 404s? (Remember, our fancy 404 pages take server time too!) Avoid crawling further URLs with that path for now, as perhaps everything there is missing; if file001.html - file005.html aren't there, I bet you dollars to donuts file999.html isn't either! Or perhaps turn down the percentage of time you retrieve anything in that path.

I think this is where a lot of naive crawlers go deeply wrong: they have one robotic strategy that they execute the same way regardless of the signals they're getting back from the target site.
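The adapt-to-signals advice above can be sketched as a small controller that widens the inter-request delay on server errors and slow responses, and narrows it when things look healthy. All thresholds and multipliers here are made-up defaults, not tuned values:

```python
# Sketch: exponential backoff on errors, gentle speed-up on fast successes.
class AdaptiveDelay:
    def __init__(self, base=1.0, max_delay=3600.0):
        self.base = base            # seconds between requests when healthy
        self.max_delay = max_delay  # never back off further than this
        self.delay = base

    def record(self, status, response_seconds):
        """Update the delay from the latest response; return the new delay."""
        if status >= 500:                 # site under duress: back off hard
            self.delay = min(self.delay * 2, self.max_delay)
        elif response_seconds > 5.0:      # slow site: ease up a bit
            self.delay = min(self.delay * 1.5, self.max_delay)
        else:                             # healthy and fast: speed back up
            self.delay = max(self.delay * 0.9, self.base)
        return self.delay

d = AdaptiveDelay()
d.record(500, 0.3)          # error: delay doubles to 2.0
d.record(500, 0.3)          # error again: 4.0
print(d.record(200, 0.2))   # recovering: 3.6
```

The crawler would simply sleep for `d.delay` before each request to that site; a separate instance per site keeps one struggling host from slowing down the others.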

Optimize for the typical webserver "directory listing" pages. In particular, they allow sorting by size, date, name, permissions, and so on. Don't treat each sort order as a separate root for crawling.

Ask for gzip (compression on the fly) whenever available.

Limit depth or detect recursion (or both).

Limit page size. Some pages implement tarpits to thwart email-scraping bots: a page that loads at a snail's pace and is terabytes long.
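A sketch of that size limit: read the body in chunks and give up once a cap is exceeded, rather than trusting Content-Length (which a tarpit can lie about). The 1 MiB cap is arbitrary; the example reads from an in-memory stream, but any object with a `.read(n)` method, such as a urllib response, works the same way:

```python
import io

MAX_BYTES = 1024 * 1024   # arbitrary cap; tune per crawl

def read_capped(stream, cap=MAX_BYTES, chunk_size=65536):
    """Return up to `cap` bytes of the body, or None if it exceeds the cap."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > cap:
            return None   # tarpit suspected: abandon this page

small = read_capped(io.BytesIO(b"x" * 100))
print(len(small))                                       # 100
print(read_capped(io.BytesIO(b"x" * (2 * MAX_BYTES))))  # None
```

Pairing this with a socket timeout handles the other half of the tarpit (the snail-speed load), since a slow stream otherwise blocks the reader indefinitely.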

Do not index 404 pages. Engines that boast biggest indexes do this, and receive well-deserved hate in exchange.

This may be tricky, but try to detect load-balancing farms. If v329.host.com/pages/article.php?99999 returns the same content as v132.host.com/pages/article.php?99999, don't scrape the complete list of servers from v001.host.com up to v999.host.com.
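One simple way to sketch that detection: hash page bodies and flag any host that serves content already seen under a different host. The hostnames and body here are invented, and real pages often differ by a few dynamic bytes, so a production version would hash a normalized body:

```python
import hashlib

# digest -> first hostname observed serving that content
seen_digests = {}

def is_duplicate_host(host, body):
    """True if this exact body was first seen under a different host."""
    digest = hashlib.sha256(body).hexdigest()
    first_host = seen_digests.setdefault(digest, host)
    return first_host != host

print(is_duplicate_host("v132.host.com", b"<html>article 99999</html>"))  # False
print(is_duplicate_host("v329.host.com", b"<html>article 99999</html>"))  # True
```

Once a host is flagged as a mirror a few times, collapse it into the canonical one and stop enumerating the rest of the farm.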

Copyright & other legal issues:
I know you write that they are public websites, so there might not be any copyright issue, but there might be other legal issues with storing the data.

This will of course depend on which country's data you are storing (and where you are storing it). A case in point is the conflict between the US Patriot Act and the EU's Data Protection Directive. The executive summary of the problem is that US companies have to give their data to e.g. the FBI if asked, without informing the users, whereas the Data Protection Directive states that users have to be informed of this. See http://www.itworld.com/government/179977/eu-upset-microsoft-warning-about-us-access-eu-cloud