Search Engine Spider and User Agent Identification Forum

I wanted to use a stronger word, but there are grownups reading these forums.

HTTrack shows up a good bit in WebmasterWorld, but usually not in the UA-identification context.

UA (details probably irrelevant): Mozilla/4.5* (compatible; HTTrack 3.0x; Windows 98)
IP (almost certainly irrelevant): 176.74.192.nn (in Sweden, of all places)
Files: 185 in a bit under 2 minutes, including 46 html files repeated with HTTP 1.0 instead of HTTP 1.1. In spurts and hiccups, so generally 4-5/second. Only one from a roboted-out directory (the others require at least four recursions of links, and they stopped at two).
Referer: generally my front page, even when picking up interior files
robots.txt: Don't be silly.

They have a www site [httrack.com] that answers all your questions, starting with the basic

Q: Some sites are captured very well, other aren't. Why?
A: There are several reasons (and solutions) for a mirror to fail. Reading the log files (ans [sic] this FAQ!) is generally a VERY good idea to figure out what occured. <snip>
* Website 'robots.txt' rules forbide [sic again] access to several website parts - you can disable them, but only with great care!
* HTTrack is filtered (by its default User-agent IDentity [sic no. 3]) - you can change the Browser User-Agent identity to an anonymous one (MSIE, Netscape..) - here again, use this option with care, as this measure might have been put to avoid some bandwidth abuse (see also the abuse faq!)

If you want to hide, give your UA as Netscape. Nobody will ever notice you then.

Wait, don't rush off to have your jaw wired back into place just yet. The Abuse FAQ [httrack.com] is even better. Very educational.

I guess I must be getting bigger; I had never met these guys before yesterday.

Don't know about the rest of y'all. But as far as I'm concerned, if someone chooses to copy my site for personal offline reading (the official reason for using this program), they can bloody well do it the way I do. Manually load up the desired pages, save them with your browser... and then sneer at the HTML before quietly cleaning it up.

HTTrack is listed in the Close to Perfect Thread [webmasterworld.com] along with the long list of UAs that have been copied and pasted thousands of times (the one every newbie seems to find), which means HTTrack existed as a valid UA even prior to 2002.

I moved my first site over from the university (UCSD) servers to the WWW in 1998. HTTrack was the first user-end site copier I saw and probably singularly responsible for starting a campaign to educate myself in defense measures.

Don't see it nearly as much as I used to, but it still rears its nasty presence now and then.

Oh, I blocked 'em both ways (UA of course, and IP for the heck of it) before even posting ;) What struck me was the absolute brazenness of their documentation. It's like the old ads for radar detectors: OK, now can you explain to me again why this is a good and worthy purpose that every law-abiding person should embrace without hesitation?
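For anyone who wants the "both ways" block spelled out, here is a minimal sketch in Python rather than in any particular server's config. Everything in it is an assumption for illustration: the function name is_blocked is made up, the UA test is a plain case-insensitive match on "HTTrack", and the /24 range is only a guess at the scope, since the last octet was not posted.

```python
import ipaddress
import re

# Assumption: block any UA string that mentions HTTrack, in any letter case.
BLOCKED_UA = re.compile(r"httrack", re.IGNORECASE)

# Assumption: block the whole /24, because the last octet was never given.
BLOCKED_NETS = [ipaddress.ip_network("176.74.192.0/24")]


def is_blocked(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches the UA pattern or a blocked range."""
    if BLOCKED_UA.search(user_agent or ""):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in BLOCKED_NETS)
```

The UA half only catches visitors who leave the default string in place; the IP half only catches this one visitor.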

"Mozilla 4.5" is interesting in its own right. Does there even exist such a thing? Had to go all the way back to last July to find another specimen in the logs-- and that was an unequivocal robot. Also Win98, coincidentally.

Ya know those days when you see a big spike in traffic one hour and when ya check the logs, it turns out to be a legit user but they're downloading dozens & dozens of pages all in a couple minutes? That's HTTrack, or one of its clones. It can run in stealth so there's no identifiable UA to block and of course the user can come from any ol' IP address.
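Since the UA and IP are no help in the stealth case, the request rate is about the only thing left to go on. A rough sketch of that idea, assuming you have already pulled (ip, timestamp) pairs out of your logs in log order; the function name find_bursts and the thresholds (more than 40 requests inside any 2-minute window) are made-up starting points, not recommended values.

```python
from collections import defaultdict, deque
from datetime import timedelta


def find_bursts(hits, max_pages=40, window=timedelta(minutes=2)):
    """Return the set of IPs with more than max_pages requests in any window.

    hits: iterable of (ip, datetime) pairs in log (chronological) order.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ip, when in hits:
        stamps = recent[ip]
        stamps.append(when)
        # Drop timestamps that have fallen out of the sliding window.
        while when - stamps[0] > window:
            stamps.popleft()
        if len(stamps) > max_pages:
            flagged.add(ip)
    return flagged
```

This only flags after the fact, and a legitimate visitor with a fast connection and a lot of open tabs could trip it too, so treat the result as a list to look at, not a list to ban.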

So, how to block? Well if you believe their documentation, it says they respect robots.txt. So if the deny is there, even the stealth runners should be blocked, right? Trouble is, this can't be verified unless you install it yourself and test. So as I said, the fact that this bad boy is rarely seen nowadays is the only consolation.

Well if you believe their documentation, it says they respect robots.txt.

Only if you use the default settings. You can override the block. And camouflage the UA. Or, as in this case, you could just sorta forget to read robots.txt in the first place. Gee, officer, I didn't know this was a 40-mile zone. Nobody stopped me and told me.
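To make the point concrete: the robots.txt check happens entirely on the client's side, keyed on whatever name the client chooses to give. The sketch below uses Python's standard urllib.robotparser as a stand-in for a well-behaved client; it says nothing about how HTTrack itself is implemented, and the robots.txt content is a hypothetical example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that singles out HTTrack by name.
ROBOTS_TXT = """\
User-agent: HTTrack
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A client that announces itself as HTTrack is told to stay out...
named = parser.can_fetch("HTTrack", "/index.html")

# ...but the same client under a borrowed name matches no rule at all.
disguised = parser.can_fetch("Mozilla", "/index.html")
```

Here named comes back False and disguised comes back True. A rule addressed to a name stops only the clients that admit to the name, and it stops nobody who skips the check.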

It gives a different perspective on all those one-off robots from innocuous IPs. They must be hard-coded to start at the top level of the site and work downward, because the chances of someone being equally interested in all my various directories are vanishingly small. It's a very distinctive pattern, like when you tell the Link Checker to go up to some level of recursion.