Dreamhost asking clients to block GoogleBot

Yes, it’s quite true. DreamHost representatives are asking their clients to block GoogleBot through the .htaccess file, because their websites were “hammered by GoogleBot”.

Dreamhost representatives are also guiding their clients to make all their websites “unsearchable and uncrawlable by search engine robots”, because they cause “high memory usage and load on the server”.
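For context, blocking a crawler via .htaccess is typically done with a user-agent match. The exact rule DreamHost sent is not quoted here, so this is only a sketch of the general technique, using Apache 2.2-era access-control syntax:

```apache
# Hypothetical illustration (not DreamHost's actual rule): flag any request
# whose User-Agent header contains "Googlebot", then deny flagged requests.
SetEnvIfNoCase User-Agent "Googlebot" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
```

A client pasting something like this into .htaccess would return 403s to the crawler while leaving ordinary visitors unaffected.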

I am the system stability manager here at DreamHost, and I thought I might try to explain how googlebot can occasionally be problematic. Of late I have been working with our very heavy usage customers (people using from over a quarter to, in some cases, 250% of what the full server should be processing). In more than half of these cases the cause is google’s crawler malfunctioning in how it interacts with the site, resulting in heavily loaded or even crashing machines (I have had cases where googlebot accounted for over 95% of the last 10,000 hits on a site that doesn’t even have that many pages). Our terms of service (http://www.dreamhost.com/tos.html) specifically state that “if your processes are adversely affecting server performance disproportionately DreamHost Web Hosting reserves the right to negotiate additional charges with the Customer and/or the discontinuation of the offending processes”, so that we can ensure that we keep machines working for everyone on them and not just one user. In a case where a faulty googlebot interaction is killing a machine I have two options:

1. disable the site
2. block the bot

We feel that the best solution for our customers is to stop just the malfunctioning behavior and keep _everyone’s_ sites working as well as possible on the machine. In the specific case referenced above I do not feel that it was handled with the appropriate level of care (I always include detailed information about usage, the logs I looked at and the resulting improvement of loads or the like on the customer’s server) and will more strictly define our policy on this subject so we can avoid any more confusion like this. I will also remove the block here if it was not necessary (we do make mistakes sometimes and I am glad this came to my attention so that I can properly train everyone). Really the goal is to provide the best possible hosting experience for our customers as a whole!
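The kind of log check described above (seeing what fraction of recent hits come from googlebot) can be sketched with standard Unix tools against an Apache combined-format access log. The `access.log` path is a placeholder, not DreamHost’s actual layout:

```shell
# Tally user agents in the last 10,000 lines of a combined-format log.
# In the combined format the User-Agent is the 6th quote-delimited field.
tail -n 10000 access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn | head
```

If the top line of the output is a Googlebot user agent with a count in the thousands on a small site, that matches the malfunction pattern described above.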

John, no matter how the client’s website is behaving, no matter how much the GoogleBot crawler is crawling, you just CAN’T ASK your clients to ban the bot, which in most cases brings in excess of 50% of their total traffic.

Consider that more than 40% of your clients may not even understand the code you have given them; they will put it in the .htaccess file anyway, thus depriving themselves of all their Google traffic.

It’s not natural what you are doing.

If some of your clients are using the servers at 250% capacity, then disable those accounts, or recommend a hardware upgrade.

I agree they handled it professionally, but this shouldn’t have happened in the first place; that’s the whole point.
Oh well… you can’t expect too much for $10/month.
I would like to add that I also have some sites with Dreamhost and they have their ups and downs… but I really like them hosting my static content and keeping my 46 gigs of backups.
thank you

We actually do provide upgrade paths for heavy usage customers. From my experience, customers overwhelmingly prefer that we specifically address the source of the usage issue rather than disable their account. Also, there appears to be a misconception that we are blocking traffic from visitors coming via google – we are not; we are blocking the google crawler that is causing the artificial traffic (people connect from their ISP, which is a different location). Hits from googlebot are not actual people trying to access the domain (the legitimate visitors are exactly who we are protecting by getting rid of the malfunctioning software that is hammering the site =).

I apologize if I did not make myself clear – the only cases where we would block googlebot are when the following conditions are met:

-the site in question is causing the server it is on to be unstable

-the site in question is causing erratic or abnormal behavior on the part of google’s crawler

We do not block googlebot on every busy customer site, only when it is demonstrated that it is causing artificial usage (a 10 page site does not require 5000 hits from googlebot to be indexed =) and when the alternative to blocking googlebot is disabling the entire domain. It is disingenuous to suggest that this refers to google simply indexing sites – I have actually been in direct contact with google engineers to help sort out the specific cases where their crawler was not performing as it should.

For most of us, adsense is the reason we have websites. They should just say they no longer want to offer hosting services. Everyone will “pack up” (back up) their websites and move to another host. I suggest hostgator or vamphost (if you don’t have a lot of money).

We actually looked into that as a preferable option; unfortunately this is not under our control, as googlebot specifically ignores the Crawl-delay directive in robots.txt files (also, it can take up to 24 hours for it to recheck the status of a robots.txt file, and in the cases where we are forced to take action an immediate solution is required). What we do suggest to customers is working with google to determine what specifically is happening; if they can resolve the issue with google engineers there is no need to protect their site from the crawler. Again, the cases where this is even an issue are a minute fraction of a percent (out of half a million domains I probably run into at most 10 a week that are having this sort of problem – that’s less than 2e-05).
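For reference, the crawl-throttling rule being described is conventionally written as `Crawl-delay` in robots.txt. It was honored by some crawlers of the era (e.g. Yahoo! Slurp and msnbot) but not by Googlebot, whose crawl rate had to be adjusted through Google’s webmaster tools instead. A minimal example:

```
# robots.txt — ask compliant crawlers to wait 10 seconds between requests.
# Note: Googlebot does not honor this directive.
User-agent: *
Crawl-delay: 10
```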

John, please do us a favor: stop talking nonsense. Telling a customer to ban googlebot is like telling a traditional store to get rid of the nice front window and put up some bricks instead. Maybe also a steel door.

I run some websites that have more than 70% traffic from google. Imagine how bad it will get if I tell google: stop crawling and get the f@#% off.

From my point of view your intervention here did nothing good. Oh, I’m wrong, it did something good: made me one of the guys that tell others that dreamhost sucks.

I feel that you are not really reading my posts – we’re not blocking traffic from google to all or most or even many customer websites (and in the specific case referenced above we did not handle it correctly, I have already removed the blocks and contacted the customer to apologize – I also have trained the tech responsible and conveyed the proper situations in which such steps would be necessary to our entire company at large).

I am sure your websites would run perfectly fine on our system without any interference from us or block of google (more than 99.99998% of sites hosted here never run into this problem). Please consider the effect also on other customers on servers with one of these anomalies. Would you accept the explanation that your hosting company couldn’t do anything about it because it’s just googlebot malfunctioning while hitting another customer’s site and causing them to consume most of the server resources to service it? It’s really no different from disabling a bad script that is crashing a server – in both cases the action taken is required to maintain server stability and is a far cry better than simply disabling a customer’s entire site or account.

John: You’re taking things a bit too literally… it’s pretty clear to me that when someone compares blocking googlebot to locking customers out, they mean that the traffic _will_ stop because their site won’t be crawled.

That being said, no one else here seems to be able to read either. It was pretty clear that the recommendation was only made when someone’s site was already using more than its allotted resources.

I would say it’s pretty disingenuous to assume it’s the fault of the GoogleBot, but you’re working with them so that should eventually be resolved.

On a side note, I’ve seen the infrastructure of hosting companies where one site can crash or severely degrade an entire server, and I have to say that would imply your network/setup is crap. Anyone who values customers would have shared backend storage and would be able to easily handle an influx of traffic to a few sites, whether it’s through load balancing or not cramming a server so full that a single customer can affect however many hundreds of other users are on the server.

Good luck getting through to the hard heads on this site, and good luck maintaining what sounds like an architecture that won’t scale :)

The cases I have been working on are ones where users are at a quarter or more of the processing consumption of the entire server. It is true that our architecture does not scale such that one user can hoard 25% of the processing power of a server without ill effect (though to be fair, the servers can usually manage to stay up; we simply don’t think constantly high loads are acceptable). At the point where a user is taking up that kind of utilization, they aren’t in the shared hosting market (no $10/month plan is going to support hosting 4 or fewer customers per machine =). If you would like to make informed observations about our hosting infrastructure, we have some openings available:

I’m not a fan of google; they are worse than microsoft. If googlebot is blocked to keep it from crashing the server, how is google going to make their billions of dollars on imaginary hits? Adsense is probably a ripoff; the money you get is probably just a fraction of what google gets.

Of course, if the server is going unstable, steps must be taken to rectify it, and if that means inhibiting googlebot then so be it. BUT this should be a short-term measure, and the offending site owner should be given all the information required to make the site work with googlebot again. I am not sure this is happening here.

Have you tried talking to someone from Google? I’m sure, since you host so many sites, they might be receptive to your questions. Maybe they can look into what’s wrong with those 0.00001% of websites that are overloading your servers, and you might find a workaround that would affect neither your server’s performance nor the traffic and site performance of the websites in question.

IMHO, if a web host wants to block Googlebot or any major search engine robot, then it does not make sense to host a website there. The point of having a website is so that others can find it on the search engines, in this case Google, and this cannot be accomplished if you are blocking Googlebot.

It sounds like Dreamhost may need to look at their own network and hardware to see where the bottleneck is. I would advise trying to find out what your competitors are doing differently, because they do not seem to have this problem, even though I am certain they get just as many crawls from Googlebot. I think Dreamhost should take a closer look at their load balancing equipment, as you are a web host and are supposed to be able to handle high spikes in traffic. I worked for years as a systems engineer and have worked in data centers; I have dealt with many traffic spike issues, but banning bots was never the answer.

[QUOTE]
IMHO if a web host wants to block Googlebot or any major search engine robot, then it does not make sense to host a website there.
[/QUOTE]

They are not blocking Googlebot from all their clients’ sites. Only about 0.0001% of their clients get the google block.

Instead of blocking Googlebot from these sites, should Dreamhost:

#1) Tell the customer to fuck off and suspend their site? This option would make it so NO ONE could get to their web site, including Google.

#2) Tell the other 100 clients on the server to fuck off, and continue letting google consume the majority of the server’s resources? Meanwhile, every client on the server (including the site getting pounded by google) is loading extremely slowly, or just flat-out timing out.

#3) Temporarily block the googlebot from the 1 site it’s causing problems on, and notify the client. Meanwhile, the client’s site is still operating (able to take sales, show ads, or whatever) and every other site on the server is running fine as well.

Hmm… Let’s see… The choice is so fucking hard, isn’t it!!!

According to the majority posting comments… Dreamhost should just tell all the clients on the server to fuck off. After all, google is doing its natural thing… indexing sites. The thing you’re obviously not getting: if google is generating that much of a strain on the server, google won’t be able to index a lot of the pages. Why??? Because the pages aren’t even going to load when the server is under that much stress!!! Not only will the site not load for google… it ain’t going to load for anyone else visiting it either. And it won’t be just that 1 site with problems… it will be 100’s of other sites on that server too!

As a site owner… I would MUCH rather a hosting company suspend google temporarily than suspend my whole site. That would give me time to analyze the situation and upgrade to a dedicated server if I need to. If google is displaying erratic behavior, as John suggested, I would personally contact them to tell them to fix their shit.

Last but not least… Dreamhost is taking an extra step to identify the exact problem. Most hosting companies would simply suspend your account without even seeing that the googlebot was the culprit and was acting erratically. Cheers to them for identifying the problem in the first place and avoiding having to suspend a client’s account.

Google isn’t going to de-list your web site because it couldn’t access it for a day. So stop overreacting.

Dreamhost is a fantastic hosting company. They may not be the best choice for everybody, of course. Dreamhost is a worker-owned co-op, carbon neutral, and they purchase their electricity with green credits (i.e. they financially support wind and solar power until such time as they can obtain all power from such sources). I don’t work for Dreamhost; I am just a completely satisfied customer, and have been for 5 years. They do offer dedicated servers if you can afford it. But if you choose the $8/month shared server deal, then you have to abide by the agreement that you “sign” when you join. If you are more concerned with googlebots than the overall health of the server and all of its users, simply upgrade to a different deal. I don’t see why people are attacking Dreamhost for handling their situation for the benefit of the whole versus the benefit of the few. You chose the server you use because of what it offers. If you choose Dreamhost, it would be because of what they offer. For me, that offer includes green hosting with a worker-owned co-op. In the big picture, that means more to me than traffic.

oh, and take note: John Bishoff does not claim to be John from Dreamhost. John from Dreamhost was quite polite, not yelling and cussing. John Bishoff does not give any website or contact info. So it may be a different John, or someone trying to make Dreamhost look bad. Just a thought…

To be fair, I too have had this problem on some of my dedicated boxes at another host. Google’s bot, particularly the one that checks for adwords permission, likes to run wild on websites when people start creating 1000+ keyword campaigns. I’m not entirely sure why, but we see days where our hit count shoots from the 100’s to the 1000’s, and it’s all coming from Google IP space. The strange part about the whole thing is that it’s a 3 page site.

Now, if those were all dynamically generated pages, poorly coded so that each one executed 100 slow queries… that could put a massive strain on a server.

I think there are times where the host has to take action even for the site owners own good, and it sounds like in most cases, this is exactly what dreamhost does. On the other upside now, they have their virtual server stuff in place, so you can now get your processor dedicated to you for those times where you do generate that much traffic.

Dreamhost is one of the most open shared hosts I’ve worked with (I’ve worked with about 15 at this point). They’re in my top 3 choices, and with promo codes allowing you to get the price down to something like $2.50/month for the first year… hard to beat.

If you’re in this situation, though, and generating that much *dynamic* traffic, it’s probably time to start looking at dedicated hosting anyway…

Don’t be fooled, that promocode website is how Justin here makes money, not how DH makes money.

I, too, am a very satisfied customer of Dreamhost, having been with them for a little over 2 years (and currently having a 2 year contract with them expiring in 2009). For what it’s worth, I have never had a problem with them; heck, if you’re having a problem with your website, and it’s due to a coding error, or maybe an endless loop in your .htaccess, they will actually figure out what’s wrong, instead of just telling you “f**k off, it’s all good on our end, let us fix the real problems”. I’ve had that exact line fed to me by other shared hosts – when one could even get a response out of them. Most of the time, I couldn’t even get that. I would definitely rather work with a host that will look at every conceivable problem point, rather than take what we’ll call the “There’s nothing wrong with our s**t, check your code” route.

I just wanted to point out that I got the same email asking me to restrict googlebot due to high traffic storms at one of my sites.

I understand and support the measure; indeed, I’d have done the same, and I also think that for the cash I spent on my hosting plan, I’m already getting a lot. I’m glad they take care of these kinds of issues, as I wouldn’t be really happy if my server-neighbors were crashing the server the whole day.

And finally, I have to say Dreamhost’s support team is the BEST one I have ever seen (and I have already suffered through several at other hosting companies).

And from my own experience, being myself a “professional resource eater” (huge databases, thousands of queries, dynamic graphs, …) :) Dreamhost’s support team is extremely helpful, takes care of your problems, and I really believe they enjoy helping you in every aspect.

I have submitted several tickets and all of them have been answered faster than I’d expect. And in all cases I had professional answers.

So I’d never complain about DH, having invested so few bucks and getting such cool support (which is the MOST important value IMHO).

The problem is we are on an SEO forum, and the forest is not visible because of the trees.

Posting here that you “dared” to suggest a temporary Google ban in order to keep a shared host up is like being a physician and suggesting a blood transfusion to a Jehovah’s Witness in order to save his life :)

John Bishoff is right, although he could have expressed that in a nicer way. Unless… Google is God and “Thou shall not question erratic Googlebot behavior!” in which case: John Bishoff shame on you! [let’s burn him!]

I hate cheapskates like the ones complaining who DON’T WANT to pay but expect the moon with service. People like you should be put on a permanent never-host-with-these-customers blacklist. Good luck with finding a host that takes your bull.

Hi, I have adwords, adsense, sitemaps, analytics, everything; I live and breathe SEO, I dream SEO, and I’m also a Dreamhost customer. I was reading blog.dreamhost.com and they linked here from an old blog post. I love Dreamhost; they are amazing, and have values that make them outstanding in a cold, cold business world. They are completely transparent about shit that goes wrong and help out whenever they can. If not for Dreamhost, I’d have Go Daddy or fu**ing Wildwest sneakily doing everything they can to steal my traffic with their dodgy 404’s or forced parking, and charging me for shit they just make up. No way!

Look, I am not only an SEO guy but a hardened PHP programmer, and I can totally agree that it’s the responsibility of the programmer to make his programs run “nicely” in their environment. I have paid dearly for shit scripting, with sites being taken offline, which is devastating to say the least, but not with DreamHost. Blocking Google’s crap spider (which is just a machine and has no feelings ;) for a short period of time IS NOT GOING TO HURT YOUR BLOODY SITE as much as a host pulling the plug on you. It probably won’t have an effect at all, because Google still has your site indexed and cached, and it will stay in the index for a long time. If you really are the type to care, and are half as good an SEO as you say you are, you’ll have sitemaps and Google will still know you are there! Plus, for those that don’t know about Googlebot, like the saying goes – it won’t hurt them.

Some of you people are so one-eyed that it’s stupid, and you can’t spell – that sucks for SEO – and you’re all indignant about something you know nothing about, so you suck. I feel for John, being on an SEO website and putting his case to any sensible person but getting dumb, scatty replies.

I don’t know anyone at Dreamhost, but John, you can block Googlebot anytime it’s crashing my $10 a month machine, because I know that my $10 per month comes nowhere near the cost of me using its full capacity, or even a quarter of it. I trust Dreamhost; they’ve been good. And stop bloody complaining. You’ll notice that the longer a thread draws out, the more sensible it becomes. God bless.

Leave the guy alone. He is doing all of us a favor by not letting these fracking bastards from Google harm their servers and collect so much data without our consent. I wish my free host were able to provide the same protection against these bots (which I am unable to set up myself). John, you have my full support. Unfortunately, these Google fanatics are like a disease without a cure.

About Dreamhost PS: you need to set the memory according to your websites’ usage, but it’s true that I wonder who can get by on only 150 MB for $15, as even just google’s #$&*#@#@ bots might use more than that…