How robots and spiders cause issues, and how to stop them. We can also talk about CAPTCHAs (Completely Automated Public Turing Tests to Tell Computers and Humans Apart) - their use, their compliance issues, porn proxies, PWNtcha, and other ways to defeat them.

I've been thinking about how spiders work in the context of black box web application scanners.

On a very basic level, all the spider does is regex the page for href attributes, enqueue the ones that belong to the same domain, visit them, and so on and so forth.
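
Something like this, as a rough sketch - the regex, the helper name, and the use of urllib are just illustrative choices; a real scanner would want an actual HTML parser:

```python
import re
import urllib.parse

# Naive href extractor: good enough to show the mechanism, not robust HTML parsing.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def extract_links(page_url, html):
    """Return absolute same-domain links found in the page's href attributes."""
    domain = urllib.parse.urlparse(page_url).netloc
    links = set()
    for href in HREF_RE.findall(html):
        absolute = urllib.parse.urljoin(page_url, href)
        # Only keep links that stay on the same domain
        if urllib.parse.urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links
```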

At some point there has to be a cut-off, because you simply can't follow every href forever. This is partly achieved by setting a link depth: keep a memory of how deep each checked link is and go no further than the cut-off point. That imposes a certain limit, but with link depth alone a spider can still take a hell of a long time to complete.
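
Roughly like this, reusing the extract_links() sketch above; fetch_page() is just a stand-in for whatever HTTP client the scanner actually uses, and max_depth is an arbitrary example value:

```python
from collections import deque
import urllib.request

def fetch_page(url):
    # Stand-in fetch; a real spider would handle redirects, content types, etc.
    return urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")

def crawl(start_url, max_depth=3):
    queue = deque([(start_url, 0)])   # each queue entry remembers its link depth
    visited = {start_url}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue                  # skip pages that fail to load
        if depth >= max_depth:
            continue                  # cut-off point: don't follow links any deeper
        for link in extract_links(url, html):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return visited
```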

The problem is that if they're good hashing algorithms, the hashes will vary 100% even if something as simple as "Jan" is replaced by "Feb". You'll probably have to think of something a little more clever - like a percent-different measure or something. I believe the search engines know what headers and footers look like, so they can disregard those parts and just focus on the meat.
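
Something like this, roughly - the strip_boilerplate() helper here is a crude stand-in that just drops tags, not the smarter header/footer detection the search engines presumably use:

```python
import difflib
import re

def strip_boilerplate(html):
    # Crude stand-in: drop tags and collapse whitespace so only the text remains.
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def similarity(html_a, html_b):
    """Return a 0.0-1.0 similarity ratio between the 'meat' of two pages."""
    a = strip_boilerplate(html_a)
    b = strip_boilerplate(html_b)
    return difflib.SequenceMatcher(None, a, b).ratio()

# Two pages that differ only in "Jan" vs "Feb" hash completely differently,
# but score a high ratio here.
page1 = "<html><body><h1>Report</h1><p>Totals for Jan</p></body></html>"
page2 = "<html><body><h1>Report</h1><p>Totals for Feb</p></body></html>"
print(similarity(page1, page2))
```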

This is an interesting problem, and it is also a very hard one, because of the endless possible ways to design web pages and URLs.

As rsnake wrote, the hash values of two pages will almost always differ if even a single character changes. Stripping all HTML elements will not help here. Hashes can be a way to detect exact duplicates of pages, but they will fail to detect near-duplicate pages. Search engines will be very interested in crawlers that can detect near-duplicate pages as well as possible.
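
One common way to approach near-duplicates is shingling plus Jaccard similarity. The sketch below is only illustrative - the shingle size and threshold are arbitrary values I picked, not what any particular search engine uses:

```python
def shingles(text, n=4):
    # Break the page text into overlapping word n-grams ("shingles").
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity of the two shingle sets: |intersection| / |union|.
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def near_duplicate(a, b, threshold=0.8):
    # Two pages count as near-duplicates when their shingle sets overlap heavily,
    # even though their exact hashes differ.
    return jaccard(a, b) >= threshold
```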