Netflix is pointing the finger at Amazon’s cloud for the terribly-timed outage that left millions of US customers unable to access the service on Christmas Eve.
Amazon’s ‘Elastic Load Balancing’ snapped, affecting a number of Amazon Web Services customers in the USA – with Netflix among them and, by extension, more than 20 …

this could never happen in the cloud

Re: this could never happen in the cloud

Very, very succinct observation.

Constant availability and similar buzzwords are just that: words. Cloud without data is useless, and data and interconnects are subject to the laws of physics (and we're talking tens or hundreds of PB per locality with cloud providers). And even with cloud you have to know what you are doing. Exactly.

The difference between cloud and your own datacenter is just that you outsource the IT service at a particular level (infrastructure, platform or software). You yourself are responsible for the design of all higher-level services built from those cloudy building blocks (and for all built-in redundancies and all SLAs), so Netflix actually IS responsible for the outage, not Amazon*. They didn't construct their service according to reality; they had a SPOF in their path. So it is their fault.

Everything else is BS.

*) if a non-clustered server fails in your datacenter and your service to customers is unavailable as a result, you alone are responsible, not the server vendor. You failed your due diligence.
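The SPOF point can be made with back-of-the-envelope arithmetic: availabilities multiply along a chain of dependencies, so one weak link caps the whole service, while independent redundant paths compound in your favor. A minimal sketch (the figures are illustrative, not Netflix's or Amazon's actual numbers):

```python
def serial_availability(*parts: float) -> float:
    """Availability of a chain of dependencies (a SPOF path):
    every part must be up, so availabilities multiply."""
    result = 1.0
    for a in parts:
        result *= a
    return result

def parallel_availability(a: float, n: int) -> float:
    """Availability of n independent redundant components, each with
    individual availability a: the system is down only when all n
    are down at the same time."""
    return 1 - (1 - a) ** n

# A single 99%-available load balancer in the path caps the whole
# service, however good everything else is:
print(serial_availability(0.999, 0.99))   # roughly 0.989
# Two independent 99% paths instead of one:
print(parallel_availability(0.99, 2))     # roughly 0.9999
```

The asymmetry is the whole argument: redundancy you design in buys you orders of magnitude; a single dependency you didn't design around takes it all back.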

@bri

Re: this could never happen in the cloud

I love that argument - blame the customer not the provider. You didn't build your app so it could handle the cloud outage so it's your fault.

I'm trying to think of another scenario in IT where that is the case but am struggling to do so. The closest I can come is to blame the management that chose a service provider that is not high quality, perhaps because they are cheap bastards and only looked at price (which also comes into play with Amazon, since they are competitive on price).

I believe Netflix, more than most, has done a ton of work to deal with this built-to-fail model. They even released this 'chaos monkey' app to randomly kill shit. I wonder if at some point they say fuckit and go entirely on their own. I suspect not until there is an IT management change. They've got their heads buried up their own asses, relying on a competitor for their critical services.
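Chaos Monkey is a real Netflix tool; the core idea can be sketched in a few lines (this toy version and its names are illustrative, not Netflix's actual code): regularly pick random members of the fleet and kill them, so the failure-handling paths are exercised every day instead of only during a real outage.

```python
import random

def chaos_monkey(instances, kill_probability=0.1, rng=None):
    """Toy 'randomly kill shit' pass: walk the fleet and pick
    victims at random with the given probability."""
    rng = rng or random.Random()
    return [i for i in instances if rng.random() < kill_probability]

fleet = [f"web-{n}" for n in range(10)]
victims = chaos_monkey(fleet, kill_probability=0.3, rng=random.Random(42))
for v in victims:
    print(f"terminating {v}")  # in real life: an API call to the provider
```

The point of running it constantly is cultural as much as technical: if instances die every week, nobody gets to ship a service with a single point of failure and not notice.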

Unfortunately the vast majority of people looking to use cloud - especially those that use Amazon - have absolutely no idea that this is the case. I have worked at two organizations in a row that designed their apps from day 1 to run in Amazon, and in both cases neither of them did a lick of work to deal with the built-to-fail model. In both cases they had many single points of failure, inflexible configurations, and sensitivity to things that you can't take for granted in EC2. Fortunately the most recent one moved out almost a year ago (it was in EC2 just a few months for production). My god, how much better it has been since, from a technical standpoint - not to mention the really quick ROI. I have more than 2 years of experience in EC2 and related services and it was an absolute nightmare. Even IF the app was cloud-designed, the level of service EC2 provides is so terrible and so ass-backwards I wouldn't want to use it anyway.

It's like living in the 90s with APIs bolted on.

* No pooling of resources??

* No billing based on what you USE (based on what you PROVISION)??

* Fixed instance sizes ??

* No live migration of VMs off of degraded servers???

* No thin provisioning???

* Persistent networking configuration ????

* How about booting a VM with an ISO image ??????

I wrote a document about 1.5 years ago, "reasons I wouldn't use EC2 even if it was free"; it ran to 4 pages of ~12-point text.

Things enterprise IT has had for more than half a decade show no sign of appearing in Amazon any time soon.

All of the other software development companies are similar - none had apps built for the built-to-fail model. It's just not a priority. They'd rather develop features that customers want than make their software resilient to failure. That is the smart way to go, unless you have nothing better to do than make things globally distributed. The amount of overhead in such designs is terribly huge as well.

The best solution at a small scale is not the best solution at a massive scale. Unfortunately for Amazon customers, you are forced to design for massive scale regardless. If you do not, you get massive pain & frustration.

If more customers realized this they would not be using Amazon to begin with, so I do my best to educate folks wherever possible.

Re: @bri

This is frankly irrelevant. I understand very well that the onus is *technically* with Amazon, but that's not the point. These guys (Netflix) are not Joe Public; they should've known better.

Your end customer doesn't give a <censored> which of the suppliers YOU'VE CHOSEN to lower your own expenses is to blame*. When company A outsources something to some offshore service company B and as a result customer service tanks, it is a failure of A's management (be it on a technical or financial level) and, from the customer's POV, of the whole of A.

*) "flexibility", "on demand" etc. are only different names for expenses in various forms (incl. depreciation) and financial risk management

Re: @bri

It's not exactly like that. Often it's: the CIO of company A outsources to company B, collects a bonus based on estimated savings, and immediately resigns; a new CIO assumes the position and realizes he's totally screwed. The outsource is a fiasco, customers are complaining, and the cost to get out of the contract and hire and train a new set of workers is prohibitive. In the meantime, investors are asking "where are our savings?" and when the situation is explained to them, they retort "the previous CIO assured us that this would make us loads of cash. You must not be doing it right." The new CIO looks like an idiot through no fault of his own.

Although management covered it up, it was already looking like the outsource had destroyed decades of experience and competence at my company and we were starting to fail badly by the time the CIO quit. When the company tried to promote someone into the position, there were no takers. A year later, the position is still not filled.

The point being, this is a situation where one person or a very few people in the company can lead an initiative to do something profoundly stupid that ultimately wrecks the company. And then, get out before the fall. It's not necessarily the company's fault.

Re: this could never happen in the cloud

Anyone who does their research knows that Amazon's cloud has LOTS of serious outages that turn into total outages for your service.

I am designing a service and have it structured so that I use two small clouds, neither of them Amazon, and neither with a major outage to date.

So, yes, if you put all your eggs in one basket and didn't do your research on how many show stopping outages Amazon had, it's your fault.

You say your company went with Amazon. Did they not Google the major outages of the past year? Did they not view a comparison table or a detailed report on those outages? If they had done any of these things they would know not to use Amazon, as its failures are often catastrophic and ridiculously frequent. Go look at Heroku's reports on the issue, or the Netflix blog.
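The two-cloud idea above boils down to routing around whichever provider is currently broken. A minimal sketch (provider names and the health-check mechanism here are made up for illustration): prefer one cloud, fail over to the other when its health check fails, so a single provider's outage degrades the service rather than killing it.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint whose health check passes.
    `endpoints` is ordered by preference; `is_healthy` is a
    callable so the real probe can be swapped out in tests."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    raise RuntimeError("all providers down -- page a human")

# Hypothetical two-cloud setup: provider A is having a bad day.
clouds = ["https://app.cloud-a.example", "https://app.cloud-b.example"]
health = {"https://app.cloud-a.example": False,
          "https://app.cloud-b.example": True}
print(pick_endpoint(clouds, health.get))  # falls through to cloud B
```

In practice this logic usually lives in DNS failover or a global load balancer rather than application code, but the decision being made is the same.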

To paraphrase Homer Simpson

Unlikely, I think.

A substantial part of Amazon's infrastructure is committed to supporting Netflix, an extremely high profile customer of theirs - AWS cites them all the time. The small amount of customer "churn" to Prime Movies resulting from the Christmas Eve outage would be wiped out many thousands of times over if Netflix decided to migrate to a different infrastructure provider as a result.

Also, consider the impact on the business. Just the outage itself resulted in a 4% loss in Amazon's market value today. The publicity, if Netflix defected, would wreck their share price, and investors don't like that very much. It would lead to Board level changes.

I really don't think Amazon deliberately downed Netflix. Sometimes an outage really is just an outage.

Pass the buck

Re: Pass the buck

Maybe not to "pass the buck", but to be honest and forthright about the issue. I have a SaaS running off of Rackspace and have had my share of issues with the cloud, almost ALL of them virtual-switch related. In the end, many customers don't care; it is YOU who are at fault, regardless of what the actual issue was.

Do I regret the move to the cloud? In some ways, yes. I love the virtualization and the ability to ramp up horsepower as I need it. Do I TRUST it? Let's say I am cautiously optimistic. As long as there aren't any switch issues that take down internal communication between servers, it runs great and I don't have to run out at O'dark-thirty and drive an hour to kick a server that doesn't respond to anything other than a physical hard boot.

I also like the fact that I can scale as much as I need to without incurring more hardware costs. I monitor various metrics on the hardware using munin and get paged when something is misbehaving. Rackspace won me over with their support, which is stellar, btw, so I plan on sticking with them. I believe what Ronald Reagan said about trust when it came to the Soviets... Trust, but verify.

@Moof

Thanks for your honesty.

As you regret the move to the cloud "in some ways" and don't really trust your core IT infrastructure, could you tell us what firm you're part of so we can make an informed decision whether to do business with it? For instance, if you're running my bank account, I'd like to change banks. If you're a marketing firm, not so bothered.

Re: @Moof

Prior to the move to the cloud I had a half dozen 2U Dell servers running my SaaS in a data center. Great pricing, but I did (and still do) all IT-related functions. I had a half rack and bandwidth in a semi-secure colo. During the 5 years I was using them, I had one (1) minor network outage and a few unscheduled hardware outages on my own hardware, a couple of which required me to drive an hour to hard-cycle one of the boxes (the same box every time), which greatly pissed me off. I generally trust unto myself: I have several machines on a load balancer that are constantly synced and an 'emergency' backup in case the load balancer goes tits-up or the internal network goes down. Unfortunately, the hardware was getting old and I didn't want to replace it all for my own virtualized setup, so I looked around and went with Rackspace.

My regret is that Rackspace had some virtual switch issues that really affected performance and in several cases caused outages. It seems they have cleared things up with their last update and I am cautiously optimistic, but remain vigilant. I run a PostgreSQL 9.1 cluster with no real master/master solution. I opted not to do something like DRBD for the database data due to speed issues, and if I need to change to a new master all I have to do is kick off a script. That has never happened except while testing failover.

The bottom line is I believe in trusting unto myself. I'll happily use their hardware resources, but I don't like shared database environments so I have my own. I have several slaves in my DB cluster, have multiple forward-facing machines that are kept in sync and I keep a very close eye on things. Several times I have noticed larger than normal I/O issues that were being caused by someone else sharing the same hardware on one of my VMs. Sucky I/O really affects database performance a LOT and each time I notified Rackspace that they had a problem and they responded immediately, including after the 3rd time in the same morning, in which they disabled the VM of the idiot who was causing the issue until they could prove it was fixed. Their support is AWESOME and one of the main reasons I went with them.
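The "keep a very close eye on things" part can be as simple as thresholding an iowait metric before opening a ticket (the numbers, threshold, and function names here are illustrative, not the commenter's actual monitoring):

```python
def noisy_neighbour_alert(iowait_samples, threshold=30.0, sustained=3):
    """Flag when iowait (percent) stays above `threshold` for
    `sustained` consecutive samples -- one crude way to spot a
    co-tenant hammering the shared disks under your VM."""
    streak = 0
    for sample in iowait_samples:
        streak = streak + 1 if sample > threshold else 0
        if streak >= sustained:
            return True
    return False

# One bad sample is noise; three in a row is a ticket to the provider:
print(noisy_neighbour_alert([5, 42, 7, 38, 45, 51]))  # True
```

Requiring a sustained streak rather than alerting on a single spike is what keeps a check like this from paging you on every transient burst.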

My service is a small business solution geared towards the pet / dog daycare industry. There is a marketing component, but mainly scheduling and tracking pets. Short of the virtual switch issues Rackspace had with their Open Stack cloud, and the couple other instances, I have been more than pleased with their service.

Re: @Moof

You should be thankful that you chose Rackspace - at least there, if you want to go physical or go hybrid, they are more than happy to accommodate. About two years ago I worked briefly with a company that had a few SQL servers hosted at Rackspace and said "hey Rackspace - give us Fusion-io cards", and they got them.

I believe Rackspace also has multiple cloud offerings; the big one is OpenStack.

Re: @Moof

Yeah, I don't regret my decision in that regard. We very well may go back to physical hardware for part of the database cluster (master and one slave) when we go to PostgreSQL 9.3, although we are currently pretty happy with where we are now for the time being.

Funny

I would swear that they had a problem earlier this year with the exact same AWS 'region'. Is there some reason they haven't spent the intervening time implementing a back-up for this type of failure?

Re: Funny

No, it is cloud networking and the concept of SDN (as seen by Amazon) at play here.

That region is where most of their network interconnects (and clusterf***s) are.

Those of us who have done network routing protocols and implementations can say: not entirely unexpected. That is what happens when you try to reinvent the wheel in an area which _REALLY_ requires some knowledge of mathematics and not just software mongering, even if it is being done by someone like James Hamilton. It will happen again... And again... And again...

Outsourcing

This is a good example of where outsourcing is not a good idea. Netflix is relying on a company that competes with them to facilitate their service. There was nothing Netflix could do to resolve the problem.

I've had "business" partners that tried to get me to outsource certain pieces of production when I had a manufacturing company. I never saw the point since we were able to keep our employees busy all of the time and didn't have to rely on another company to meet our goals.

Outsourcing has been a fashionable thing for business executives, but doesn't make much sense in the real world. There is always a certain amount of outside services a company will use such as machining, plating and shipping, but there are many suppliers and switching from one to the next is quick and painless. Having a single source for something critical to your company is always a risk. If that single source is also a competitor, it's time to reconsider if your business is worthwhile.

Re: Outsourcing

It's a matter of business investments.

I think it's one many companies will face going forward, and certainly something I considered when "moving to the cloud".

When Netflix was starting out, they were probably just developers who needed a platform. Amazon was the company they chose (probably because it was the most current) and they had no thoughts of Amazon as a competitor. Amazon was simply a third party by which to deliver their product.

The fact that Amazon now competes with them on their product is reason to consider a switch, but the question is: do you move to another player? Your next best option is Rackspace, but are you getting the same deal?

Should you setup your own infrastructure? Now you need to invest in the people, equipment, and data-centers to make it happen.

I can still see more benefit for Netflix in staying with Amazon; namely, Netflix isn't going to commit the required resources to operate on their own, no matter how much "better" that might be.

Re: Outsourcing

I agree with your last statement: they aren't going to move until their management is replaced. I've seen time and time again that the costs of EC2 FAR outweigh the costs of operating your own stuff, even after staff overhead. In fact you need MORE expertise to operate in EC2 than you do on traditional enterprise equipment, mainly because it is difficult to use, lacks features, is built to fail, is billed based on provisioning rather than utilization, has limited training available, the support sucks, etc etc... all of these drive costs.

Even when you toss out everything beyond basic EC2 costs and assume EC2 operates with the same level of features and reliability you can get elsewhere, the costs still blow past doing it on your own in most cases.

EC2 and related services are like a roach motel, easy for developers to get in, very hard to get out.

S3 is not bad, by contrast. I dislike Amazon, but S3 is a halfway decent service with true pay-for-utilization; it's fairly reliable and costs are reasonable. The main thing S3 lacks, of course, is automated inter-region replication.
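The provisioned-vs-utilization billing distinction drives the cost argument above, and it is easy to put in numbers. A sketch with made-up rates (neither Amazon's actual pricing nor anyone's real workload):

```python
def provisioned_cost(hourly_rate, hours_allocated):
    """Provisioned billing: you pay for what you allocate,
    whether or not it is busy."""
    return hourly_rate * hours_allocated

def utilization_cost(hourly_rate, hours_actually_used):
    """Utilization billing (the S3-style model): you pay only
    for what you actually consume."""
    return hourly_rate * hours_actually_used

# Illustrative numbers only: an instance provisioned 24/7 for a month
# vs. a workload that is actually busy 25% of the time.
month_hours = 730
print(provisioned_cost(0.50, month_hours))           # 365.0
print(utilization_cost(0.50, month_hours * 0.25))    # 91.25
```

At low average utilization the gap dominates everything else in the bill, which is why the billing model matters more than the headline hourly rate.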

It's the Netflix Christmas special!!

Starring Jeff Bezos as The Grinch :)

( I think it's extremely unlikely that the outage was an attempt to subvert Netflix in favor of Amazon Prime, but even so, Netflix might want to ask why they are relying on a competitor for key infrastructure when there are lots of other options out there.)

clusterfuck-->cloudfuck

It was most annoying - here of course it was the early hours of Xmas day - and just like Netflix said, it didn't affect ALL devices, just some of them. While my Nexus 4 could quite happily connect to Netflix, my Panasonic Freeview HD+ box could not - and I really REALLY wanted to watch some more of The 4400.....

I complained on Twitter and asked about the possibility of a free month - and hey presto, 2 minutes later it was streaming like nothing had happened...... guess I'm not getting a free month :(

Well, look at that!

It seems that despite the hype of cloud computing and the outsourcing craze, it is still necessary to sit down and work out the design of your systems, including figuring out how you will avoid compromising your business when the inevitable component failure (cloud provider outage) occurs.

Also, Canadians. This also affected Canadians - or at least this Canadian, in Eastern Canada.

First it prevented me from watching on my WDTV, then I tried my Xbox360 and PS3 with no luck. My Asus Transformer Infinity worked fine for another hour, then that too stopped working. I continued watching on my PC - which eventually got a little wonky, but did keep letting me watch stuff, if after a bit of a longer wait than usual.

(This was the Canadian Netflix, not the US Netflix with the DNS trick.)

same on the wet coast

ahhh, hah

This explains why, on every other movie my wife and I tried to load, it'd get to 7% and then just sit there in graybar limbo. Even more infuriating were the times that a movie load would get to 99% and then freeze up. This couldn't have happened to them at a worse time -- Christmas Eve, when everybody and their cat is wanting to watch It's A Wonderful Life or Miracle On 34th Street or The Waltons' Christmas episode, and all they get is "Loading... 7%..."

We finally were able to get three episodes of The Larry Sanders Show to play before Netflix finally flatlined for the rest of the night.

Once again, I was thankful for my local stash of DVDs and mp4's. My wife used to have a DVD rental account with Netflix but went to their streaming service a couple of years ago and, at last report, still swears by it. Eurgh.

Conspiracy Theory Anyone?

"...and the social media app Scope"

seriously...

What idiot at Netflix thought it would be ok to host their service on a competitor's machines? I mean, what did you expect to happen? "Sorry, something broke, your service is not available. But our competing service is fine. What a coincidence."