Learning From Mistakes, Growing Through Crisis

It was a terrible, horrible, no good, very bad week at ServInt, the worst we’ve had in 10 years, since a fiber cut in just the wrong place took us offline completely for seven hours. That day was one of the most professionally terrifying of my life, but we learned from it and we grew. In the wake of that event we added redundancies far beyond the “industry standard,” we fixed a ton of processes and we quickly regained the faith of our customers. To this day, that date in 2004 remains the last time ServInt’s entire network went down.

Every time we experience problems I am determined to make sure we learn from them. This week is a big learning week, because it’s been fraught with some of the biggest problems we’ve seen in nearly a decade. Let me take a few moments to tell you a bit about what challenges this week brought — to show you what went wrong, and what we did right as we resolved them.

This week started with an announcement that the largest kernel-level exploit in the history of our VPS and virtual dedicated offerings had been discovered. This exploit could have allowed hackers to access not only our customers’ VPSs but also the machines they were hosted on. A fix would require rebooting literally thousands of servers while minimizing the impact on our clients’ businesses, which is always our top priority. Within 48 hours we performed emergency maintenance for nearly every single customer in our datacenter. This meant forcing every single customer to accept at least a little downtime in the pursuit of vital security protections. Some customers did not like this, but if I had to do it all over again I would do it the same way. I am proud of the way ServInt rose to the challenge and protected our customer base from this dangerous exploit so swiftly.

I was really hoping that the week would get easier from there — but it didn’t. Last night, one of ServInt’s datacenters experienced one of the strangest, most difficult to explain, and most difficult to solve networking problems we have ever seen.

We build our networks to withstand most anything. We have stayed up through hurricanes, ice storms, and more equipment failures than I can count. We’ve made it through power disruption for extended periods, and other horrendous events that would have taken down providers that aren’t as thorough, many times over. But this one got us good for a while.

On Saturday evening, our network was running smoothly, as it generally has for more than a decade. Suddenly our monitoring system started flipping between red and green, over and over. The phrase “this is not a drill” had to be used as senior engineers were plucked from their lives and rushed into the datacenter. Our COO was on a plane, I was at dinner, but the engineering fix-it team that really needed to be there was there, immediately. What made this situation unique, and what made it impossible to fix in the normal few minutes, was the fact that the critical equipment that was in the process of failing seemed incapable of making up its mind whether it was healthy or not. Making matters more challenging: high levels of equipment redundancy (normally a very good thing) made it nearly impossible to determine where the problem lay. Our top engineers, without access to reliable diagnostic data, literally had to pull the network apart and put it back together to find the exact piece of hardware that had gone haywire (in this case a router) and was causing everything else to behave erratically. In the meantime, there was simply no information to share with increasingly frustrated customers, and our Tweets and Facebook posts began to sound unnecessarily vague.

In a typical router-failure situation, as soon as the router shows “red/down” on our monitoring system, we post “we had a failed router interrupt traffic and impact the network. This is being fixed and we’re routing around it. We’re sorry for the inconvenience.” Those are facts and details, things people can draw confidence from. However, with no reliable detail to pass on, our team was left to give rather vague updates for quite some time. It was frustrating and made us seem much worse at communication than we actually are.

In the end, last night’s events pointed out some of ServInt’s greatest historical strengths — and some newly discovered weaknesses. We’re still the best in the business at running a reliable, robust network and data center — and, when necessary, finding and fixing complex technical problems. When it comes to customer support and communication through a crisis, however, we need to do better. Having no support/communication failover systems, and forcing ServInt and its customers to rely on Twitter and Facebook to communicate, was totally unacceptable. We will build greater redundancy into our ticketing and communication systems to make sure that never happens again.

Having said that, we can’t promise that technical glitches will never happen again. They are a fact of internet life. What matters most is that we must always — always — learn from these thankfully rare events, and become a better service provider as a result. I promise you we will do so in this case as well. You’ll see the results of this growth as the weeks and months unfold. I am confident you’ll like what you see. Thank you, as always, for your continued faith and trust in us.

Comments

Even if you don't have much to add, a simple tweet sent more often, stating that you are continuing to work on the problem, is helpful so we have something to say to clients. The worst thing I can say to my clients is that my host isn't telling me anything, and I will not lie to my hosting clients, so I felt boxed into a corner. Communication is key.

I'm glad you're working on a better communication system for the future inevitable outages, far apart as they tend to be.
That said, I strongly suggest that some form of offsite redundancy hosting plan be created.
Also, I will be moving my email hosting for my primary domain from ServInt to a third party so that I can at least communicate with my clients during periods of downtime like this.
That said, I'm keeping my ServInt account.

I have been a customer since 2005. This is by far the best hosting company I have ever seen. The one thing that leaves me dumbfounded is why there has never been a mechanism to communicate to the public. Past experience has proven this was a need that was never addressed. Please post your roadmap for changes to make sure this never happens again.

Can I just echo/like/retweet/favorite Greg's suggestion? He made an excellent suggestion in that saying what it's NOT is as important as (eventually) saying what it IS:
"Sometimes (and for me in this case) telling us what it's not is just as informative as what it actually is. For example, a piece of communication could have been
"We are experiencing issues with a subset of customers; not all are affected. This is not an attack or DDoS or a virus. It's an equipment failure we are isolating so it can be repaired as safely as possible for the entire network. Your data is safe.
We do not have an ETA at this moment, but as in all crises we feel the immediacy of time. We will update you as soon as new information is available. Our engineers are full-hands-on-deck until this is resolved." "

Thanks for giving us the lowdown on what occurred. I was concerned when it happened, but unlike with the few hosting companies I used prior to finding you all a couple of years ago, I knew it would get taken care of and eventually explained. Just being straightforward with us instead of feeding us a line of BS (like I received from the previous companies I was using) is not only refreshing, but the kind of ethical behavior I've grown to expect from your company. It sucks that my sites were down for several hours, but I also know there is no perfectly failsafe technology in this world, and this wasn't the end of the world either. Please continue to be forthright in the unlikely event something like this occurs another 10 years from now. Just getting straight answers on top of y'all's stellar service is more than enough to keep me around even when something as rare as what happened on Saturday happens.

Reed, thank you for this honest explanation. In my opinion, the fact that you as owner and a team leader personally spoke to us, ServInt customers, explaining what happened, only means that I can continue to have full confidence in ServInt, and that also means that ServInt didn't lose its soul.
We were all very lucky that it happened on Saturday night, or after midnight, depending on the time zone. For most users it happened outside working hours, which significantly mitigated the situation.
When I compare the 3.5 years I have spent as a ServInt user with the experience I had before with other hosting companies, I can say this period has passed very quietly most of the time.
This honest explanation of yours gives me hope that in the future it will remain that way.

I echo Chad's remark. As a non-techie, ServInt support rocks!!! They are always helpful and extremely patient with non-techie-me.
Thank you for thinking enough of us customers to take the time to proactively be transparent so quickly after the incident.
I remind myself: we are dealing with technology, software, machines and people working together, of course things are going to break, issues happen and people make mistakes...it's all about how you deal and address them, and the lessons learned.
Sometimes (and for me in this case) telling us what it's not is just as informative as what it actually is. For example, a piece of communication could have been
"We are experiencing issues with a subset of customers; not all are affected. This is not an attack or DDoS or a virus. It's an equipment failure we are isolating so it can be repaired as safely as possible for the entire network. Your data is safe.
We do not have an ETA at this moment, but as in all crises we feel the immediacy of time. We will update you as soon as new information is available. Our engineers are full-hands-on-deck until this is resolved." (Or something like this...I'm not a wordsmith)
When prioritizing resources I'd rather them all focus on the issue...first-things-first put out the fire. I'm thrilled with the service I receive from ServInt and I'm staying put for a long long time.

Hi Reed,
Thank you for the details, I always LOVE the culture around here. The way you guys handled this crisis was very professional (despite the lack of communication), and I'm glad you plan to build greater redundancy into the ticketing and communication systems to make sure that never happens again.
This is also my first outage since I became a ServInt customer 5 years ago, but I still have faith and trust in ServInt and I strongly believe they will prevent bad things like this from happening again in the future.
Some suggestions from me:
1. Use hosted status page like https://www.statuspage.io/ or https://status.io
2. Host the support page outside of the ServInt network, or at least in a different DC in LA or Amsterdam
3. Make the Forum Public

Hi Reed,
My wife and I have been happy ServInt customers for about 4 years. She runs a small business blog, and the Managed Services team has helped us out of more predicaments than I can count.
This was a rough week, like you said, but for my wife and me as small business owners, it's not just about uptime/downtime. I know it is for some companies, but for us, there's much more to it than that. We don't have all of the IT resources that larger companies have, so we rely on ServInt's expertise, and the help you've provided has on more than one occasion literally saved our business, so I just want you to know from both my wife and me that we appreciate the way ServInt does business! Thank you!
Things like this outage are bound to happen sometimes, but we know ServInt will learn from it and improve because the company always has. Thank you, Reed, from two very happy and appreciative ServInt customers. We've never worked with a better hosting company, and we've worked with quite a few.
-Chad

I recommend using statuspage.io, and if you link to it from a custom domain, use DNS outside ServInt's network.
You then have a reliable way of communicating with customers, even if ServInt's entire network goes down.
And I'm sure you'll be fully investigating the router situation to see if exactly the same thing might be able to happen again.

This is my 6th year with ServInt hosting and I must say this was the only downtime I've had, and the longest. Although I hated the downtime, the fact that everything is back to normal is more satisfying than anything else. All the best to the teams at ServInt and good luck to them for running the show :-)

Additional support redundancy would be good. We're in the process of migrating our own support into Zendesk for these reasons. We also have off-server live chat available, and a status blog hosted off-server.
On the positive side, I have to say I did appreciate the Twitter updates, even if they were rather light in content. It was better than nothing at all, which is what some give. It would be good if there was more communication however...with posts going to Facebook (much earlier than they did). It's best if all bases are covered. Incidentally, not that I'm trying to push you into Zendesk, but tickets are automatically created from tweets and Facebook posts; it's a good way to cover all the bases.
On the not-so-positive side..... there needs to be more communication (you've already acknowledged that), and more transparency. We shouldn't have to log in to the portal for the "Major Announcement." That adds unnecessary steps to find out what's going on. Obviously not everything, especially when it involves a major security issue such as the kernel vulnerability, should have all the gory details made public.... but there should be *something* either posted publicly or sent as an email newsletter blast. You guys used to send out notices prior to scheduled maintenance - that seems to have stopped.
I have also never agreed that ServInt's forums should be private. I know of no other major hosting company/data center that doesn't have publicly accessible support forums.
In many ways, you guys are cutting edge. Your network and services have always been rock solid, and like you said...you came through the Snowpocalypse and the floods (from Sandy) that took down lesser NOCs. I've been with you since 2006 and have seen many improvements in the lines of communication along the way, but I'm glad to hear you know there is room for improvement.

Chief Executive Officer, ServInt

Reed Caldwell is the founder and CEO of ServInt. Founded in 1995, ServInt has since expanded its network nationwide with data centers in Los Angeles, Northern Virginia, and the District of Columbia. Caldwell’s vision and leadership have led ServInt to become one of the most successful privately held hosting companies in the U.S. Reed is a member of San Diego Social Venture Partners (SDSVP) and a Strategic Advisor for the Equinox Center in San Diego. Reed is an active venture philanthropist and lives in Southern California.