Most of the time, when I catch scumbags attempting to spam, scrape, leech, or otherwise hack my site, I stitch up a new voodoo doll and let the cursing begin. No, seriously, I just blacklist the idiots. I don’t need their traffic, and so I don’t even blink while slamming the doors in their faces.

Of course, this policy presents a bit of a dilemma when the culprit is one of the four major search engines. Slamming the door on Yahoo! would be unwise, but if their Slurp crawler continues behaving suspiciously, I may have no choice. Check out the following records, pulled directly from one of my error logs, where Yahoo! exhibits some extremely questionable behavior.

July 29th, 2007

The first observed record of suspicious activity shows Yahoo! attempting to access a nonexistent URL. Note that the first portion of the URL does exist; it is the Perishable Press Redirection Lounge. Only explicitly predefined 301 redirects (via htaccess) should arrive at the Redirection Lounge.

July 30th, 2007

Yahoo! dropped in eight times on the 30th. Hmmm. We have several additional requests for nonexistent subdirectories of the Redirection Lounge, as well as several new and insightful clues into Yahoo!’s suspicious behavior. Allow me to break a few of these log entries on down.

The first record of the day presents an interesting clue as to what on earth ol’ Slurp might be doing. Check out the value of the query string. “88t” is an abbreviation for the username of fellow DLa [Dead Letter Art] member 88teeth. The query string in question is prepended to several of 88teeth’s DLa images and appears nowhere on the Perishable Press domain. Further, there is no specific “gallery” directory either.

This next one is interesting as well. The URL once again begins with the Redirection Lounge location, but then continues on to “/paintboard/editor.php”. I can assure you, there is no such directory or resource here at Perishable Press, but there is a reference to one over at the DLa website. What are you up to, Yahoo!?

Here is a 404 error that is not unique to the Yahoo! Slurp crawler. I have been unable to ascertain the originating source of myriad URL requests looking for nonexistent locations named after various JavaScript functions. With almost every visit, Google, MSN, Ask, and (as shown here) Yahoo! generate 404 errors such as the following:

Later that day, Yahoo! strolls into town again, this time looking to “view” “/easyboard.php”. Huh? There are no links or mentions of any such resource on this domain! Any ideas as to why Yahoo! is picking up keywords, image-name prefixes, and JavaScript functions from a different domain and looking for them here at Perishable Press? I would love to know.

This one confuses me. While there is an article on this domain with a URL matching the first part of that shown below, there is absolutely nothing associated with that post that has anything to do with “function.array-rand”. This is confusing because, in previous log entries, Yahoo! was apparently following links created on a different domain, or at least somewhere that redirects them to the Lounge. In this case, however, the request is not redirected but goes directly to a 404 status code.

Totally shady, Yahoo! Slurp is not only looking for a JavaScript function, it is looking for it in an incomplete, nonexistent directory. In previous cases, at least the first part of the URL was a legitimate resource. Here, that is simply not the case. If you ask me, this is very unusual behavior for the Slurp crawler.

Similar to the first error shown for July 30th (above), this case shows Yahoo! passing the value of a filename prefix (another DLa member username) located on a different domain to the URL of a referral that was ultimately redirected to the Redirection Lounge. I should also mention that this type of request has not happened before (to my knowledge), and that both of the domains that seem to be involved (perishablepress.com and deadletterart.com) have been crawled previously many times by Yahoo! without such errors. Further, I have not altered the DLa site in any way within the previous several months (sadly). Perishable Press has been changed here and there, but nothing (to my mind) that would result in sudden crawl errors for Slurp.

August 1st, 2007

At the time of this writing, August 1st is the most recent date for which I have been able to gather such data. Again, we see more of the previously demonstrated behavior. First, the ol’ JavaScript function trick:

Conclusion

Two different sites seem to be involved, Perishable Press and Dead Letter Art.

No changes whatsoever have been made to the DLa site in several months.

Changes have been made to this site, but nothing related to DLa, paintboards, or JavaScript.

The pattern of 404 errors presented in this article is unique and has not been seen before.

Reverse IP lookups verify that these crawl errors belong to Yahoo!/Slurp.

None of the other major search engines have demonstrated any similar patterns of behavior.

None of the resources requested in these examples exist on the Perishable Press domain.

None of the resources requested in these examples are referred to anywhere on this domain.
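For anyone wanting to repeat the reverse IP lookup mentioned in the clues above, here is a minimal sketch in Python. It is not the procedure used for this article, just one common way to do it: look up the hostname for the requesting IP, check that it ends with a crawler domain, then resolve that hostname forward and confirm it maps back to the same IP. The suffixes `.crawl.yahoo.net` and `.inktomisearch.com` are assumptions about how Slurp hosts are named, and the function names are hypothetical.

```python
import socket

# ASSUMPTION: hostname suffixes used by Yahoo! Slurp crawlers.
# Verify these against current documentation before relying on them.
SLURP_SUFFIXES = (".crawl.yahoo.net", ".inktomisearch.com")


def hostname_matches(hostname, suffixes=SLURP_SUFFIXES):
    """Return True if hostname ends with one of the expected suffixes."""
    host = hostname.lower().rstrip(".")
    return host.endswith(suffixes)


def is_verified_crawler(ip, suffixes=SLURP_SUFFIXES,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a single IPv4 address.

    The reverse/forward parameters exist only so the resolvers can be
    swapped out (for example, when testing without network access).
    """
    try:
        # Step 1: reverse lookup, IP -> hostname.
        hostname = reverse(ip)[0]
    except OSError:
        return False

    # Step 2: the hostname must belong to an expected crawler domain.
    if not hostname_matches(hostname, suffixes):
        return False

    try:
        # Step 3: forward lookup, hostname -> list of IPs.
        addresses = forward(hostname)[2]
    except OSError:
        return False

    # Step 4: the original IP must be among the forward results.
    return ip in addresses
```

Calling `is_verified_crawler("192.0.2.10")` with an IP taken from a log entry (the address here is a placeholder) returns `True` only if both lookups agree. The forward step matters because anyone who controls the reverse DNS for their own IP range can make it claim any hostname.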

On the other hand, here are a few reasons why such behavior may not be so suspicious after all:

The IP addresses are from different domains.

Relatively speaking, there are few unresolved requests.

The unresolved requests occur over the course of several days.

A significant amount of time transpired between each request.

There is no recognizable pattern to the unresolved requests.

Beyond these clues, I can only guess at the reason behind such unusual crawl behavior. This weekend, I plan on digging into the DLa site to determine if there is an underlying cause for these errors. If/when anything turns up, I will post an update here in this article. Until then, if you have any ideas or clues as to why Yahoo! is doing this, please let me know. Also, if you have experienced similar behavior from Yahoo!/Slurp, I would love to hear about it. Otherwise, thanks for your kind attention.

I went over the info briefly. The line number of the error doesn’t point to a query; it points to where we check the login level. I have no idea yet why it gave the “can’t log in to database” error. I also got this same error with Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/). I am going to talk to my host guy.

Okay, so I talked to my host guy. There was a hardware failure around the time of this error, and a brief failure during the second one also. The hardware is being replaced tonight. That was probably the root of it.

Alright, thanks for the follow-up. It’s a bit of a relief to know that Slurp’s not trying to access anyone’s database... that would’ve been nuts. Even so, it’s another example of Slurp’s aggressive/inept programming. Let us know if you notice anything else weird with ol’ Slurp.

About the site

Perishable Press is the work of Jeff Starr, professional developer, designer, author, and publisher with over 10 years of experience.

Fun fact: Perishable Press has been online since 2005, and features over 800 articles and more than 11,000 comments.