Bad Bots

It's been a while since I visited; I've mostly been involved with some technical stuff and moving back to the US of A.

This is a question I couldn't answer on the really geek tech forums, so I am putting it out to the "generalist" body of cre8...not 100% sure of correct forum, but security sounds good enough...

Nice new look by the way...

I have had some server-slowing periods stemming from discourteous bots originating on amazonaws.com.

I tried contacting amazonaws, but their reporting process required a lengthy online form...then it turned out the form didn't work correctly.

Ultimately I blocked all their reported IP ranges at the server firewall level, which completely solved the problem.
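
Before adding a range to the firewall, it can help to check which logged visitor addresses it would actually catch. A minimal Python sketch of that check follows. The ranges are RFC 5737 documentation placeholders, not real amazonaws ranges, and `is_blocked` is a hypothetical helper name chosen for illustration.

```python
import ipaddress

# Placeholder CIDR ranges (RFC 5737 documentation blocks).
# These are NOT real amazonaws ranges; substitute the published list.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

Running logged addresses through `is_blocked` only previews what a rule set would match; the blocking itself still happens in the firewall.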

But here's the question: Am I cutting off my nose to spite my face? Throwing out the baby with the bathwater?

Is there any server traffic coming from amazonaws that is doing me any good? For example, bots that actually get me directory links and generate visitors/traffic/links/revenues?

One thing I discovered is that bit.ly uses bitlybot hosted on amazonaws. Apparently it is used to read the titles of pages, which are then displayed in mouseovers on the bit.ly urls. I wrote to bit.ly, they responded immediately, I sent off the list of blocked IP ranges (had to go to their tech department) and I have not heard back from them. I suspect that amazonaws assigns IP addresses at random within a geographical area, so there is no relevant "range" that might be used by one amazonaws client such as bit.ly.

It's probably impossible to know if you might be missing out on something great either now or in the future. I was just reading this: http://blog.red7.com...rom-amazon-aws/ and he mentioned only blocking unidentified bots from amazonaws, which he says are the ones being the most obnoxious. That might be a good compromise solution.
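
A rough Python sketch of that compromise, for illustration only: block a request when it comes from an amazonaws range and its user-agent does not name a bot you recognise. The range is an RFC 5737 placeholder rather than a real amazonaws range, `should_block` and `IDENTIFIED_BOTS` are hypothetical names, and the allow list holds only bitlybot as an example. User-agent strings are easily spoofed, so this is a courtesy filter, not real security.

```python
import ipaddress

# Placeholder range (RFC 5737), NOT a real amazonaws range.
AWS_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# Lowercase user-agent substrings of bots you choose to let through.
# The contents of this list are an assumption for illustration.
IDENTIFIED_BOTS = ("bitlybot",)


def should_block(ip: str, user_agent: str) -> bool:
    """Block amazonaws traffic unless the user-agent names an allowed bot."""
    addr = ipaddress.ip_address(ip)
    if not any(addr in net for net in AWS_RANGES):
        return False  # not from amazonaws, so leave it alone
    ua = (user_agent or "").lower()
    return not any(name in ua for name in IDENTIFIED_BOTS)
```

Unlike a firewall rule, a check like this has to run for each request inside the web server or application, so it costs some server load.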

For many of us who actively block bots, amazonaws (AWS) is perhaps the one standard default block. Yes, there are legitimate services on AWS, including many/most/all of Amazon's properties, e.g. Alexa, Archive.org, IMDb; which, if any, of their crawlers you consider valuable to your site is a business decision. Personally I block them all with a vengeance because AWS not only serves up bots by the zillions but is also host to zillions of proxy servers utilised by scrapers (and scammers). With one exception: because my sites are Amazon Affiliates I must allow AMZNKAssocBot/4.0 to crawl.

Note: if you hate AWS you might also love to hate Google App Engine. If AWS is default block #1, GAE is #2.

I just realized I had to allow cre8 emails with postini, which by the way does a fantastic job and has not only dramatically reduced spam (and the daily time needed to deal with it) but also lowered server load on spamassassin.

I'm glad to see that others have taken the same tack. Since I dropped amazon affiliate stuff (due to zero revenues) -- along with ebay (due to their api problems plus low revenues) and pretty much all cpa advertising -- no problem there.

Donna, I read the link, however that approach is labor-intensive and creates server load, while firewall blocking stops the traffic before it hits the server. I *love* my managed hosting (ahem) **plug** for datapipe.net.

Personally I have never seen the value of alexa, I mean, a visitor has to have the alexa toolbar for hits to count, right? Wayback (archive) may have some value, but how much really?

Iamlost, regarding google app engine -- my understanding is that it is used by individuals and the IP numbers **should not** cross over with google maps, right?

Personally I have never seen the value of alexa, I mean, a visitor has to have the alexa toolbar for hits to count, right?

No. They claim they gather data from other sources as well. I think most people are still looking at Alexa as if it had not changed the way it functions several years ago.

From "How are Alexa’s traffic rankings determined?": "Alexa’s Traffic Ranks are based on the traffic data provided by users in the Alexa Toolbar panel and data collected from other, diverse sources over a rolling 3 month period."

In addition to the Alexa toolbar, they also collect data from their analytics package. And several years ago I read that, like Compete and other metrics services, they were buying aggregated user data from various ISPs. I don't know if they are still doing that today.

Alexa is actually a much more reliable data source than most people believe it to be, and has been for several years.

That said, all third-party metrics services operate at a disadvantage as they only have incomplete data to work with for the Web in general. Alexa points that out in one or two of their FAQ answers.

If they collect enough data about a Website they will try to rank it. I think they advise people not to put too much stock in the rankings below 100,000. Their estimates for the top 100,000 sites (1..100,000) are apparently more reliable than for the sites about which they have less data.

With the release of Amazon's Kindle Fire tablet upcoming at a competitor-hosing price, I expect its uptake to be swift and considerable. And that raises several problem questions...

First, as its Silk browser will leverage the AWS cloud (in a similar manner to Opera, it looks like), those of us simply blocking AWS IPs will need to reconsider our stance. How to adjust will depend on how Silk uses AWS: from dedicated IPs or random ones. If dedicated, the fix is simple; if random, then one will have to consider user-agent identification...and how long before scammer-scrapers are masquerading as Silk?

Second, as it will also use a prefetch, webdevs such as myself who block prefetch will have to learn the best method of identifying this one in order to block it too.
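
For context, prefetch blocking is often done by looking for a hint header on the request. A minimal Python sketch, assuming the request headers are available as a plain dict: Firefox's link prefetch sends `X-moz: prefetch`, which is the only hint listed here. Whether Silk sends any such header is unknown, so any entry added for it would be a guess. `is_prefetch` and `PREFETCH_HINTS` are hypothetical names.

```python
# Header name/value pairs (lowercase) treated as prefetch hints.
# Only the Firefox hint is listed; anything for Silk would be an assumption.
PREFETCH_HINTS = {"x-moz": "prefetch"}


def is_prefetch(headers: dict) -> bool:
    """Return True if any known prefetch hint header is present."""
    # Header names are case-insensitive, so normalise before comparing.
    lowered = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return any(
        lowered.get(name) == value for name, value in PREFETCH_HINTS.items()
    )
```

If a browser prefetches without announcing it in a header, this kind of check will not catch it, and identification would have to fall back on IP or user-agent.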

Running a 'clean' site for ease of calculating direct ad visitor metrics is an uphill battle on an ever-rising hill...

I am posting an update to the list of amazonaws bots. This comes from https://forums.aws.a...jspa?annID=1252 posted Dec 14, 2011. The list is long, and I blocked the older ones months ago. Here are a few new bad actors: