Saturday, December 29, 2012

Over the last few days, a few of the engineers at Transcend Computing have been discussing what we could have done to help Netflix
avoid their Christmas outage. For those of you who aren’t aware, AWS suffered
an outage in the Elastic Load Balancer (ELB) service in the US East region.

In the middle of our discussions on creating massively
scalable, highly available, clustered load balancers with feature parity to
ELB, I caught a post by Diane Mueller at ActiveState. The gist of her post is
that Netflix went down because of AWS, but her personal app (which leveraged
FeedHenry and Stackato) was revived after 10 minutes. The post seems to imply
that if you use a PaaS (like Stackato), you can switch clouds easily, like she did
when she moved her application to the HP Cloud.

I’ll avoid the overly dramatic retort, but let’s just say
that I disagree with Diane’s implication. Here’s my position: if core Netflix
applications were negatively affected by any core service (such as ELB), it
would be extremely difficult to quickly switch to another cloud. Here are some
specifics:

No disrespect to my friends on the HP Cloud team, but I
honestly believe that if Netflix had suddenly switched from AWS to
HP, it would have brought HP Cloud to its knees. ELBs (if they had them) would
have been crushed and Internet gateways would have been overloaded. Finding a
very large number of idle servers might also have been a challenge.

In this imaginary scenario, I guess we’ll assume that
Netflix decided to keep their movie library and all application services
running on multiple clouds. Sure, this would be expensive, but it wouldn’t have
been realistic for them to do a just-in-time copy of the data from one location
to the other.

Netflix has done a great job of publishing their technical
architecture: EMR, ELB, EIP, VPC, SQS, Autoscale, etc. None of these are
available in the solution Diane prescribed (Stackato), nor does HP Cloud offer
them natively. There is a complete mismatch of services between the clouds. CloudFoundry
offers some things that are ‘similar’ but I’m concerned that they wouldn’t have
offered performance at scale.

Netflix has also created tools specific to the AWS cloud
(Asgard, Astyanax, etc.) as well as tuned off-the-shelf tools for AWS like
Cassandra. These would have to be refined to work on each target cloud.

In summary, there’s little-to-no chance that Netflix could
have quickly moved to ANY other cloud provider (including Rackspace or Azure)
and there’s not a thing that Stackato would have done to alleviate the problem.
All medium and large customers have real needs that are service dependent. I’ve
joked that CloudFoundry is a toy. It is, but it’s a toy that is maturing and
eventually may help with ‘real’ problems – but let’s be clear – that day isn’t
today. Any suggestion that it is ready for a ‘Netflix-like outage’ is either
naïve or intentionally misleading.

I’ve spent the last three years working on solving the AWS
portability problem – and it’s a bitch. As with Diane’s setup, if you have a simple app,
my solution, TopStack, will work. It replicates core AWS services for workload
portability. As proud as I am of what the team at Transcend Computing has done,
I’m also quick to note that cloning any of the AWS services at massive scale
with minimal downtime, across heterogeneous cloud platforms and providers, is
an incredibly tough problem.

Here’s my belief: Running the Transcend Computing ELB
service on HP Cloud would not have worked for Netflix in their time of need.
Our software would have been crushed. HP’s cloud would have been crushed.
Netflix’s homegrown software wouldn’t have been ‘practically portable’. It
would not have worked.

I’m happy to acknowledge where we suck. We’ll continue to
listen to the unfortunate incidents that AWS, Netflix and others encounter. My
2013 prediction for Transcend Computing is this: we’ll suck less. Acknowledging
reality is the first step.