UPDATE: Google has now posted an official incident report to describe why the outage occurred. Here is the key takeaway:

Between 8:45 AM PT and 9:13 AM PT, a routine update to Google’s load balancing software was rolled out to production. A bug in the software update caused it to incorrectly interpret a portion of Google data centers as being unavailable. The Google load balancers have a failsafe mechanism to prevent this type of failure from causing Google-wide service degradation, and they continued to route user traffic. As a result, most Google services, such as Google Search, Maps, and AdWords, were unaffected. However, some services, including Gmail, that require specific data center information to efficiently route users’ requests, experienced a partial outage.

The percentage of Gmail users hit with slow performance, server error messages, or timeouts during the 18-minute outage ranged from 8 percent to 40 percent. A smaller percentage of users received errors from applications including Google Drive, Chat, Calendar, Google Play, and Chrome Sync. Google has corrected the problem in the load balancing software, and will change the release process for load balancing software to push changes "in one location before proceeding with a general rollout." The load balancing software is particularly important as it "routes the millions of users’ requests to Google data centers around the world for processing and serving content, such as search results and email."
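
To illustrate the revised release process, here is a minimal Python sketch of a "one location first" rollout. It is not Google's tooling: the function `staged_rollout` and the `deploy` and `is_healthy` callbacks are hypothetical names chosen for this example, and a real health check would be far more involved.

```python
from typing import Callable, List


def staged_rollout(
    locations: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one location first; continue only if it stays healthy.

    Returns the locations that received the update.
    """
    if not locations:
        return []

    canary, rest = locations[0], locations[1:]

    # Step 1: push the change to a single location.
    deploy(canary)

    # Step 2: stop here if that location looks unhealthy, so a bad
    # release is contained instead of reaching every data center.
    if not is_healthy(canary):
        return [canary]

    # Step 3: general rollout to the remaining locations.
    for location in rest:
        deploy(location)
    return [canary] + rest
```

If the health check fails on the first location, only that location receives the update and the general rollout never starts.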

Original story follows:

Portions of the Internet panicked yesterday when Gmail was hit by an outage that lasted for an agonizing 18 minutes. The outage coincided with reports of Google's Chrome browser crashing. It turns out the culprit was a faulty load balancing change that affected products including Chrome's sync service, which allows users to sync bookmarks and other browser settings across multiple computers and mobile devices.

That quota service experienced traffic problems today due to a faulty load balancing configuration change.

That change was to a core piece of infrastructure that many services at Google depend on. This means other services may have been affected at the same time, leading to the confounding original title of this bug [which referred to Gmail].

Because of the quota service failure, Chrome Sync Servers reacted too conservatively by telling clients to throttle "all" data types, without accounting for the fact that not all client versions support all data types.

The crash is due to faulty logic responsible for handling "throttled" data types on the client when the data types are unrecognized.

If the Chrome sync service had gone down entirely, the Chrome browser crashes would not have occurred, it turns out. "In fact this crash would *not* happen if the sync server itself was unreachable," Steele wrote. "It's due to a backend service that sync servers depend on becoming overwhelmed, and sync servers responding to that by telling all clients to throttle all data types (including data types that the client may not understand yet)."
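
The client-side half of the problem can be sketched in a few lines of Python. This is only an illustration of the idea of ignoring data types the client does not recognize, not Chrome's actual code; `SUPPORTED_TYPES` and `apply_throttle` are hypothetical names, and the type names are made up.

```python
from typing import Iterable, Set

# Hypothetical set of data types this client version understands.
SUPPORTED_TYPES: Set[str] = {"bookmarks", "preferences", "passwords"}


def apply_throttle(
    throttled_by_server: Iterable[str],
    supported: Set[str] = SUPPORTED_TYPES,
) -> Set[str]:
    """Return the locally known types that should be throttled.

    Types the client does not recognize are skipped rather than
    treated as an error, so an unexpected server response cannot
    take the client down.
    """
    return {data_type for data_type in throttled_by_server if data_type in supported}
```

A response naming a type from a newer client version is simply filtered out, and the client keeps running with the types it does know throttled.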

An outage like this often leads to grand pronouncements about the viability of "cloud computing." What it really shows is that cloud services can be affected by mistakes made by people, just as IT services always have been. Single points of failure that can affect multiple services are also bad, especially in an infrastructure as large and widely used as Google's.

As noted in the developer forum, preventing this problem from reoccurring requires changes both in Google's servers and in the Chrome application on users' computers. Google's Apps Status Dashboard promised that "we are confident we have established the root cause of the event and corrected it."
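
A server-side counterpart might look like the following Python sketch. It assumes, purely for illustration, that each client tells the server which data types it supports; `throttle_response` and its parameters are hypothetical and do not describe the real sync protocol.

```python
from typing import Iterable, List


def throttle_response(client_types: Iterable[str], quota_service_ok: bool) -> List[str]:
    """Return the data types the server asks this client to throttle."""
    if quota_service_ok:
        # Quota backend is healthy: nothing needs to be throttled.
        return []
    # Quota backend is struggling: throttle only the types this client
    # declared, never the server's full list of known types.
    return sorted(set(client_types))
```

The design choice here is that the server stays conservative when its quota backend is overloaded, but never mentions a type the client did not declare.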

Google has promised a more thorough explanation of the root cause will come later today.

The headline on this article was changed to clarify that the problem originated with a load balancing configuration change.

Such failures in incredibly complex systems are inevitable. What is critical is root cause analysis and changing procedures to avoid a repeat in the future. The FAA has been doing that for decades, as the cost of a failure is extremely high. IT systems could learn a few things from the FAA playbook (obviously adjusting for cost/benefit ratio).

It should be a well-known fact for everyone who works with high-availability systems that most problems are caused by exactly the software/hardware that is used to increase availability, because they always introduce SPOFs.

Never really understood why people gripe about small outages like this and claim they prove that cloud computing isn't viable. Are these people really pulling better uptime with their own IT staff than a company like Google?

I'm guessing the belly aching about the cloud is less about reliability and more about fear of consolidation and job loss from IT staff.

Isn't it pretty brain-dead programming to have the browser crash if the sync server sends data the browser doesn't understand? Cause that's my reading of the Chrome crashing problem: the sync server sent a response that the browser was not configured to understand, so the browser crashed.

As far as the core cloud services problem, this article really doesn't give good information about root cause. Other service providers would be pilloried for giving such lame answers: "faulty load balance configuration change". Was there a fundamental design problem somewhere? Or did someone just input a bad configuration change by accident? Is this a single human's error? Or a fundamental problem in the load balancing team?

EDIT: Note: I don't mean to pick on Google. I really mean to pick on the fact that if this were Microsoft or Apple, the early comments would not be so forgiving. We'd see "M$" and "Crapple" all over the place.

As far as I'm concerned, it's not even vaguely an indictment of cloud computing as a concept. What it highlights is the co-dependency of (conceptually to the user) entirely different services on shared points of failure. The fact that my sync service isn't running "shouldn't" mean my email is down.

Put another way, it highlights that consolidating disparate services and data with one provider creates (from the user's POV) a single point of failure.

And not just from the user's perspective. It highlights the potential risk to Google of its "internal cloud" setup, where basically all services run across the same pooled resources. They can't update the load balancer for Google Maps, then roll it out to GMail the next day if nothing went wrong. (At least that's my limited understanding of how Google's systems are set up.)

Such failures in incredibly complex systems are inevitable. What is critical is root cause analysis and changing procedures to avoid a repeat in the future. The FAA has been doing that for decades, as the cost of a failure is extremely high. IT systems could learn a few things from the FAA playbook (obviously adjusting for cost/benefit ratio).

Obviously you haven't been privy to the Boeing vs. Airbus debates for the past 20 years. Even Sully had to keep shutting off the computer alpha protection.

From my POV, it's not even vaguely an indictment of cloud computing. What it highlights, to me, is the co-dependency of (conceptually to the user) entirely different services on shared points of failure. The fact that my sync service isn't running "shouldn't" mean my email is down.

Put another way, it highlights that consolidating disparate services and data with one provider creates (from the user's POV) a single point of failure.

And not just from the user's perspective. It highlights the potential risk to Google of its "internal cloud" setup, where basically all services run across the same pooled resources. They can't update the load balancer for Google Maps, then roll it out to GMail the next day if nothing went wrong. (At least that's my limited understanding of how Google's systems are set up.)

With the same understanding you have, I completely agree.

I wanted to avoid calling it definitively a single point of failure because one thing the cloud provides is an abstraction layer. There's no fundamental reason a cloud provider must have an infrastructure with internally-shared points of failure, but it is fundamental to the notion of consolidation that the user perceives a single point of failure. EC2, Glacier, Route 53, and CloudFront may have a shared-nothing architecture (I'm pretty sure they don't, but hypothetically), so in that sense there's no SPOF.

From the user's point of view, though, if Amazon goes bankrupt or leaves the cloud computing market, everything they've got in those four services goes down at the same time.

Hence my trying to distinguish between a SPOF in the service, and a SPOF to the user.

The title here is somewhat misleading: the load balancer change that caused the Gmail outage and Chrome sync crashes wasn't for a Chrome-specific set of servers. This was a load balancer that was responsible for parts of many different Google services. Gmail and the Chrome sync quota service were just the two casualties of the misconfiguration that were most visible from the outside.

I understand the failure to load balance backend services, and those services having to throttle back because of the increase in traffic, but it sounds like Chrome "stopped working" for people which is different than not being able to update (working in "offline" mode). I just have this image in my head of Atlas, with the world on his shoulders, slipping on a banana peel.

Never really understood why people gripe about small outages like this and claim they prove that cloud computing isn't viable. Are these people really pulling better uptime with their own IT staff than a company like Google?

I'm guessing the belly aching about the cloud is less about reliability and more about fear of consolidation and job loss from IT staff.

The difference is you know when you are going to be doing an upgrade, something risky, or something at all... And you stop everyone from doing it during times when a mistake would be disastrous.

With cloud-based services, you don't know whether they are making changes during an incredibly sensitive time for your company. I work in a hospital system, so we do most of our big changes at night, when traffic goes down a bit. We know we won't be doing maintenance during the day, when it could cause widespread outages.

With Office365 you don't know that, and it can and WILL happen, because you can't control it.

EDIT: Note: I don't mean to pick on Google. I really mean to pick on the fact that if this were Microsoft or Apple, the early comments would not be so forgiving. We'd see "M$" and "Crapple" all over the place.

Maybe the fact that "M$" and "Crapple" actually take the customers' money whereas Google doesn't? After all, even today those worthless fiat currency notes mean more to people than some obscure "online search habit"... which Google sells to faceless corporations to monetize its services...

Never really understood why people gripe about small outages like this and claim they prove that cloud computing isn't viable. Are these people really pulling better uptime with their own IT staff than a company like Google?

I'm guessing the belly aching about the cloud is less about reliability and more about fear of consolidation and job loss from IT staff.

You usually hear IT people say that. In many places, IT folk will say that changes are frozen around peak times (e.g., Christmas rush, tax time), so that will just be an issue with public cloud services that needs to be weighed against the other costs and benefits. They also invent various schemes to disclaim ownership of downtime (e.g., weekly maintenance windows, blame-shifting).

Business people often don't see a difference between random cloud outages and "planned" internal outages.

FWIW, during the infamous 15 minutes, it wasn't just the Chrome browser that failed. My wife uses Safari, but has her Address Book in sync with Google Contacts. During the 15 minutes, she got 502s trying to get into Gmail. I, with no such sync, could get into my "junk" Gmail account easily.

Maybe the fact that "M$" and "Crapple" actually take the customers' money whereas Google doesn't?

Right. You must have missed the part where they bought Sparrow, and continue to charge for it. Or the part where they just did away with the free version of Google Apps. Or how they charge for advertising.

Those all sound like taking customers' money to me.

I work in a business where we have ~300 users on Google Apps (and pay through the nose for it) and you should have seen the wave of users coming into my office when the mail outage happened. We pay, and service outages happen. I'm not happy about it, but I'm not about to throw stones for someone making a mistake. People make mistakes, and IT people are people, too.....regardless of what the users think of us.

FWIW, during the infamous 15 minutes, it wasn't just the Chrome browser that failed. My wife uses Safari, but has her Address Book in sync with Google Contacts. During the 15 minutes, she got 502s trying to get into Gmail. I, with no such sync, could get into my "junk" Gmail account easily.

Yeah, I had Chrome crash and the app formerly known as iCal freak the hell out, nearly simultaneously. Sparrow wasn't connecting to any of the 3 GMail accounts it handles, either.

I found this on the High Scalability website a year or two ago. Every sysadmin can relate.

"Rarely do things just fail. They become intermittent, they become slow, they don't failover cleanly, they don't recover cleanly, they lie about what is really happening, they don't collect the data you need to figure out what is going on, they corrupt messages, they drop work randomly, they do crazy things that are failures but don't fit in clean crisp failure definition boundaries."

EDIT: Note: I don't mean to pick on Google. I really mean to pick on the fact that if this were Microsoft or Apple, the early comments would not be so forgiving. We'd see "M$" and "Crapple" all over the place.

Maybe the fact that "M$" and "Crapple" actually take the customers' money whereas Google doesn't?

Google gets "payment in kind" when we implicitly provide it with access to some of our screen space for ad placement and/or to valuable, marketable information about ourselves in exchange for use of its products and services. It is not a charity, and does not deserve to be treated differently or more generously than other companies that we deal with.

P. S. Perhaps comparison to free-to-air, for-profit commercial TV networks is closer: they do not get as much slack as Google.

I think many people here have this confusion. We, the users, are not Google's customers, we are the products. We get free services in exchange for agreeing to be their products. Google's customers definitely have to pony up money, and they buy us. If Google didn't take customers' money, where the hell do people think they are making the billions from?

I'd love to see more information on this. From the article I find this whole thing rather troubling. I don't like the idea that there is no "offline" mode if a service goes down. I want to be able to work autonomously and get things done. Yes, it was only 18 minutes, but as has been mentioned, it highlights that some problems exist or did exist. But at the moment I'll wait and see if there is more to it than meets the eye.

I think many people here have this confusion. We, the users, are not Google's customers, we are the products. We get free services in exchange for agreeing to be their products. Google's customers definitely have to pony up money, and they buy us. If Google didn't take customers' money, where the hell do people think they are making the billions from?

Ugh. I don't understand why otherwise intelligent people would think that what is essentially a t-shirt slogan would provide any insight. Let's go even trivially deeper here. You are concerned that since you do not provide money directly to a company, they have reduced motivation to treat you right and protect your data (at least I'm going to assume this, because "they buy us" is essentially meaningless rhetoric unless they are involved in slave trade).

Of course, for Google to continue to make those billions, they have to *have* users to show ads to. They can only keep those users if they treat them well and provide them with a useful product, and don't betray their trust to the point their users stop using them.

Meanwhile, companies bundle up their *paying* customers' information and actually do sell it off all the time (and when that was banned in some states, they would collate it by doctor instead so drug companies could target individual doctors for drug marketing).

So on the one hand, we've shown that companies can be motivated even by indirect payment for user happiness and protection, and on the other, we've demonstrated that companies that take direct payment have no problem in engaging in the behavior that you decry. So can we please stop trading in platitudes and actually stick to actual events and evidence? It's pretty ridiculous (and mostly just a non sequitur) to assert that there was a crashing bug in checking the type of an entry in a sync message because users of Chrome are products, not customers.

Also, if you want to read about the actual economics here, the field is quite interesting:

Oh, I think my point came out the wrong way. I'm not saying it's bad behavior or trying to make it sound slaveryish. What I mean by "they buy us" is that they buy our eyeballs, our attention. I meant to reply to the statements saying that at least Google doesn't take the customers' money, while other companies do. This is a misunderstanding of who Google's customers are and what Google sells to them. Any advertising company sells people: it sells impressions or clicks on ads. The more of people's attention it has, the more valuable it is.

I'm a happy Google user, and I am fully aware there's a reason why they provide me with so many free services of high quality: to retain my attention. I have gmail, gdrive, two blogspot accounts, docs, chrome, youtube, Google's search, etc. However, that doesn't mean that I am the customer. What they do is analyze my usage of their services and build a profile of me, which they sell to someone else saying "look, this Mr. Felix might be a potential customer for that product or service that you are trying to sell". Most likely, that will be a true statement, which is why their actual customers pay them. From my perspective, this is a triple-win situation for Google, their customers, and me. It's not like anyone is forcing me to use Google's free services; I voluntarily trade some information to "pay" for the usage. Besides, I will receive ads anyway, and at least for me, it's preferable to receive ads for things I like than random crap I don't care about.

Maybe the fact that "M$" and "Crapple" actually take the customers' money whereas Google doesn't?

Right. You must have missed the part where they bought Sparrow, and continue to charge for it. Or the part where they just did away with the free version of Google Apps. Or how they charge for advertising.

Those all sound like taking customers' money to me.

I work in a business where we have ~300 users on Google Apps (and pay through the nose for it) and you should have seen the wave of users coming into my office when the mail outage happened. We pay, and service outages happen. I'm not happy about it, but I'm not about to throw stones for someone making a mistake. People make mistakes, and IT people are people, too.....regardless of what the users think of us.

$5/month isn't paying through the nose. But there are "plenty" of paying Google customers; we hand them a couple grand a month for Google Apps.

Sounds pretty cheap to me; cheaper than doing it yourself. Factor in staff, server, licensing, air conditioning, whatever costs. Cloud isn't for everyone, but it makes tonnes of sense for many businesses.