Posted by Soulskill on Monday May 02, 2011 @06:55PM from the third-time's-a-charm dept.

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."

"An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.'" Really? Pressing a single key and bam! All gone? Is that the best they can do?

I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:

This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.

My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".

Notably, Excel uses Scroll Lock for its intended function (making the arrow keys scroll rather than moving the cursor). And Linux, when the kernel's busy handling the screen itself (say during the boot process), uses Scroll Lock to temporarily pause quickly scrolling output to the screen so that you can see what it says. Apparently KVM switches often use a double-tap of Scroll Lock in order to send signals to the switch itself rather than the computers connected to it (on the basis that quickly turning Scroll

When an unlocked and unmanned workstation is found in our Dept, the SOP is to place a RICKROLL somewhere in the system. Bonus points for being creative. I have one that is still waiting to go off, because the guy never reboots his computer. He'll never know who did it, or when.

The Enter key being pressed after doing something silly, like typing up an example command line (for a half-written script that will automate some large process) that was simply meant to be copied and pasted into another document.

The reality of it is that the reason they said "hands off" was to avoid exactly this kind of accident: an engineer executing the test plan before it was actually ready to do its job, by accident. And it happened.

It's really one of those moments where the poor guy is just the most perfect example of why

They didn't even say that a key was pressed. Perhaps someone accidentally brushed a hand against the keyboard. Perhaps the "very bad design" of the data center involves the electrical wiring.

Seriously, this does indicate bad design, and it does NOT inspire confidence. If cloud services go down and the official explanation given is "Someone accidentally touched some equipment, and everything go boom," then I don't want to rely on this cloud service. That's not good enough.

Seriously? You can read that and come away with that interpretation? Rather than the obvious meaning: "they were supposed to be planning out what to do without actually executing any commands, but someone misunderstood and actually did the actions."

They used the language they use internally without bothering to translate it (aka dumb it down) into something people who don't have the right context would understand. I agree that's unprofessional and stupid of them, but it is not dumbing down. And while the general public might misunderstand (which is why it was stupid of them), anyone with an IT background who thinks for two seconds knows what they mean.

Some routers have extremely unsafe defaults and silently ignore syntax errors in commands. If a single stray letter gets added to the command that corrects the unsafe default (perhaps while the configuration file is open in an editor), the resulting syntax error means the correction is dropped and the unsafe default takes effect, which can trigger far-reaching outages. Taking down a data center is not even the worst thing that can happen. For example, if an ISP accidentally redistributes the global BGP table into OSPF, they can produce a world-wide outage affecting thousands of routers and almost all c

It's called the "Windows key". It has a little Windows flag on it. It was placed on keyboards for the purpose of slowing down, crashing, mutilating, and annihilating data centers, desktops, laptops, and phones.

I remember from almost 20 years ago (DOS / floppy era) overhearing a couple of kids in my school yard. Apparently one of them had promised the other a floppy with a game and had not delivered. The excuse was "you know, I had it ready and everything, but I hit the 'delete' key by accident and I lost it - sorry". The other party agreed it was an unfortunate accident and did not make a fuss. I was in disbelief at the idiocy of the exchange I had just heard - and I was just 13 years old.

One time, "The Cloud" corrupted a few files on my server, toasted my dev machine's hard drive (couldn't even re-install!), made several monitors explode, and split the tree outside my home-office completely in two; flying chunks of bark shattered my windows... to say nothing of the horror that became of the decorative landscape lighting that foolishly linked the outside to my main electrical system, may it rest in pieces.

The ironic thing is that I had a lightning rod installed; I thought I was safe from The Cloud, but The Cloud decided that my, now deceased, 200ft pine tree was a better target of opportunity.

The Cloud is a scary concept -- Super charged flying electrical batteries, always looming overhead, unpredictably destroying their targets with tremendous power, and surgical precision. Hell, the terror of witnessing such an event has permanently emotionally scarred my dog -- She has a prescription for Valium now because she hyperventilates and continuously shakes for hours at the mere sound of distant thunder...

My psyche is not unscathed either: I have to take a tranquilizer whenever I hear the words: "To The Cloud"

"Cloud" is just a term. Everything is on the net nowadays; the same problems exist even if we don't call it the Cloud. Managing the service is just too easy if one key press can cause this kind of damage. They need more redundancy, both for infrastructure and for management.

Bullshit. One keypunch doesn't cause this unless shit is being run by people who shouldn't be any-ware near a server.

By taking all of the various sites and services for hundreds of companies and condensing them into one, two, or even three buildings, you create the exact opposite of what the Internet was designed to be in the first place, which is decentralised. You also remove accountability from the people who are running the show. People stop being able to see the forest for the trees and you end up with mo

Amazingly, the Cloud Foundry blog itself had a much more dramatic telling:

"... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.

Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual, but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change, or simply that an unscheduled change was made to our infrastructure that caused X.
This also outlines a major issue with "cloud" technologies: they are only as redundant and stable as the individuals managing them. There is also always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself, as offsite DR can get expensive quickly.
For VMware it can be the Virtual Center, a misconfigured vRouter, or even a vSwitch. Finally, putting all your eggs in one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage, but I would find it hard to believe that replacing the engineer would make the "risk" go away.

Agreed. They seem to treat it as some magical instance where touching the keyboard breaks things, as though this was written by someone's grandmother.

How did one engineer, touching a keyboard when he shouldn't have, take everything down? I don't think I could do this at work unless I was really trying hard. This is a really shitty response, especially compared to the writeup that Amazon put out.

VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual, but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change, or simply that an unscheduled change was made to our infrastructure that caused X.

"Transparency is bad" +4 Insightful

What the... ?

You know, I'd prefer my vendor/partner (i.e., VMware) doesn't throw their employees under the bus when bad stuff happens. If this happened at Apple or Google, the group would announce they messed up (leadership taking responsibility)... not "one of the peons pushed a magic button".

Transparency is only useful as a way to diagnose and improve. This "explanation" from VMware hides all explanation ("...touched the keyboard. This resulted in a full outage of the network infrastructure...") while torching a single employee.

Contrast that with Google's explanation of the Gmail outage: "So what caused this problem? We released a storage software update that introduced the unexpected bug, which caused 0.02% of Gmail users to temporarily lose access to their email. When we discovered the problem, we immediately stopped the deployment of the new software and reverted to the old version."

To me it sounds like someone (non-technical) high up in the chain wanted to focus blame on an inadvertent act by one of the engineers. Inadvertent, of course, so no one needs to get fired and file a lawsuit, and an engineer so that no one in upper management appears culpable. The downside is that they dramatically underscore the fragility of their cloud, thereby undermining its acceptance in the market. Not a good tradeoff, if that's the case.

The infrastructure design is not resilient, and it seems late in the game to "develop a playbook" after you've gone live. Their credibility in building a fault-tolerant platform is also questionable. While VMware is at the core of a lot of data centers, there are other players that bring things to the table to build out the other pieces that make high availability and reliability a reality; I don't think they understand how all of this fits together. By reading that this was a "paper only" all hands on d

You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!

In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/ [xkcd.com]

I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context.
Like, did they touch it and press a key?
Did they touch it for an extended period, typing "killall cloud"?
Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

The keyboard they touched wasn't a keyboard in the conventional sense. It was a small 3"x3" yellow/black striped board with one large circular red key on it. Somebody touched that key even though the sign said "DON'T PUSH THIS." A harmless prank.

"...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."

OK, seriously, who the hell has that much shit tied to a single key on a keyboard?

You cannot really stop stupid people. However, many companies cripple their networks through so-called "security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors (serial access) to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VOIP communications providers.

I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network, which is the only way one key could take it all out, or the idiot had the command queued up.

More likely some idiot introduced a Cisco switch into their VTP domain and it had a higher revision number queued up, and it overwrote their entire LAN environment. Simply fixed by requiring a password, so you can really nail an idiot who does it, and secondly by biting the admin bullet and running VTP transparent mode.

There's no one command that's going to bring it all down; it's going to be a series of actions that result from a lack of proper network management and a lack of properly tested redundancy. Redundancy does not exist in the same physical facility; redundancy exists in a separate facility with no association to anything that runs the backed-up facility. Pull the plug on data center A, and your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.

I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there need to be some better upper-layer changes that allow client-side replication to work as well. So if the app loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content, where you can accept replies from two different sources and the app can use the data as it comes in from each, much like BitTorrent, but at a real-time level. It requires twice the resources to handle an app, but if redundancy is what you're after, this type of system would be king and prevent some of the large outages we have seen in the past.
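
To make the idea concrete, here's a rough sketch in Python of what accepting replies from more than one source could look like at the application layer today; the hostnames are made up and a real client would still have to reconcile the data, but the point is that losing server A costs you a slightly slower answer rather than an outage:

import concurrent.futures
import urllib.request

# Made-up endpoints assumed to serve identical content.
REPLICAS = [
    "http://server-a.example.com/data",
    "http://server-b.example.com/data",
]

def fetch(url, timeout=2.0):
    # One request to one replica; raises if it is down or too slow.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def fetch_first_available(urls):
    # Ask every replica at once and return the first successful reply,
    # so a dead server costs latency, not availability.
    with concurrent.futures.ThreadPoolExecutor(len(urls)) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for fut in concurrent.futures.as_completed(futures):
            try:
                return fut.result()
            except Exception:
                continue  # that replica failed; wait for the others
    raise RuntimeError("all replicas failed")

if __name__ == "__main__":
    print(fetch_first_available(REPLICAS))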

More likely some idiot introduced a Cisco switch into their VTP domain and it had a higher revision number queued up, and it overwrote their entire LAN environment.

How does that even happen in a properly managed environment? In fact, even in an improperly managed one? I'd have to try hard to make that happen... I mean... really. Bring up an identically configured VTP master, change it enough times to get a higher rev number, put it on the same LAN and... without external inputs (dropping links to the real VTP master), pretty much nothing ought to happen (other than syslog screaming) unless you're using some really crusty old IOS/CatOS.

Just so you know, even a VTP *client* with a higher revision number and a different table can (or at least used to be able to) wipe out a VTP domain just by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things, though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scrollback buffer.

See my previous post about "crusty old IOS/CatOS".

Also, who the hell runs the same VTP name and auth key in production and the lab? That is BEGGING for problems.

Maybe I've just been doing this the right way for too long. I find it difficult to believe that there are networks of any scale that have any duration of uptime that aren't following very, very simple procedures to ensure uptime and/or are operating with such a complete lack of knowledge of the basic plumbing that makes them work. Also, who doesn

So... what I said. Except you have it in your lab environment. And you don't realize it's your VTP master. And you don't bother to put your production config on your replacement box before putting it in production...
Yeah. Not buying it as a likely scenario. This required multiple steps, and a fundamental lack of understanding of key functions of networking equipment in a datacenter setting (namely not knowing what your VTP master is) and a lack of any sort of sane procedures (putting a piece of eq

However, many companies cripple their networks through so-called "security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors (serial access) to get back into these devices.

The problem with such "security" is that the easier you make it for your admins to connect... the easier you make it for the bad guys to connect.

When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that. The Cloud is great for sharing photos or game saves, but I don't see a future where we all do our computing "in the cloud".

When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that.

You know what beats a local hard drive? Two local hard drives, so that if one of them dies, you can still retrieve your data on the other one. And you know what beats two local hard drives? N hard drives in different locations, so that even after Evil Otto nukes your office and your branch office, you can still retrieve a backup copy of your data from another zip code.
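
If anyone wants the ten-line version of that, here's a rough Python sketch (the mount points are made up; pretend they're a second disk and a couple of network shares in other buildings) that just copies a file to several destinations and shrugs off any single one being dead:

import shutil
from pathlib import Path

# Made-up destinations: a second local disk plus two mounts that live
# somewhere else entirely.
BACKUP_TARGETS = [
    Path("/mnt/second-disk/backups"),
    Path("/mnt/office-nas/backups"),
    Path("/mnt/offsite-colo/backups"),
]

def replicate(source: Path):
    # Copy one file to every target; a single dead disk or unreachable
    # mount costs you one copy, not the data.
    copies = []
    for target in BACKUP_TARGETS:
        try:
            target.mkdir(parents=True, exist_ok=True)
            copies.append(shutil.copy2(source, target / source.name))
        except OSError as err:
            print(f"warning: could not copy to {target}: {err}")
    return copies

if __name__ == "__main__":
    replicate(Path("important-spreadsheet.xlsx"))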

I forget exactly where I first read it, but it bears repeating: Unless you can put your finger on a damn good reason why your business cannot deal with any downtime, you don't need high availability and probably shouldn't bother with it.

It invariably introduces a lot more complication, a lot more to go wrong. Few businesses truly need it; usually all they need is a clear plan to recover from system failure which accounts for the length of time such recovery wi

I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.

I can't see why it is so hard to realize that if you end up tying everything into one big structure and put everything in it, then regardless of how much redundancy you designed in, it will eventually flop grandly.

If not downtime, it will be security; if not that, it's something else. The idea is, you are creating one HUGE environment which contains everything. It's inevitable that some issue eventually affects all the participants in that environment, those being the clients.

OK... so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low-end Clariion has 2x power supplies. And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed, so that if you lose one PDU, the cabinet still has 100% of its power, just with no redundancy left.

I also find it odd that they'd have an application configuration where, if access was lost to ONE LUN on ONE array, it would cripple the entire application. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array. That way the array could blow up and you'd be fine, and being VMware (a part of EMC), disk is cheap, unlike the brutal prices the rest of us pay.

Either that, or the power failure caused a loss of a single path from host to disk and they forgot to configure PowerPath on the server... or to verify that VMware's native multipathing was working correctly...

And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed

If you don't know whether the other circuit is on another phase or not and you have a power supply fault, that can be a truly shocking suggestion that can destroy the equipment you intended to save, since you may be dealing with 480V now instead of 240V. If you DO know they are on the same phase it is a good idea - but in some circumstances it can be a very, very stupid idea to randomly plug the power into random sock

True story: one DC where I worked about 12 years ago called Sun support because a machine wouldn't power up after a simulated power failure. Stupid Sun SE wound up replacing the motherboard before he would listen to me and plug in a damn keyboard.