Netflix's 5 Secrets For Maximizing Amazon Cloud Value

Instead of trying to build out its own data centers for its rapidly expanding film and video distribution business, Netflix finds the better strategy is to use Amazon Web Services' cloud resources. At Cloud Connect 2013, the architect of that strategy disclosed some of his secrets for optimizing use of the Amazon cloud.

Adrian Cockcroft is a leading proponent of Amazon Web Services, so much so that he is sometimes criticized for channeling Netflix computing onto Amazon's EC2. A recent InformationWeek column, "How Netflix Is Ruining Cloud Computing," drew 39 comments, including several by Cockcroft defending himself. John Engates, CTO of Rackspace Cloud, an Amazon competitor, added a follow-up commentary, "What Netflix Could Do For Cloud Computing."

An April 4 session at Cloud Connect 2013, a UBM Tech event in Santa Clara, Calif., featured Cockcroft and Amazon Web Services technology evangelist Jinesh Varia. The session drew a crowd eager to hear Netflix's five tips for maximizing the cost effectiveness of AWS.

And while it's commonly viewed that Netflix is dependent on AWS for vital services, Varia said at the start that AWS relies on Netflix to educate it about meeting a large, demanding customer's needs. "Adrian and his team challenge Amazon Web Services in every way. They help us to make AWS better," he said.

Cockcroft, in turn, said the switch to Amazon allowed Netflix to try using large data center resources, fail at it without paying a heavy penalty in unused gear because it was only rented by the hour, then try again. The ability to execute a rapid, iterative testing of ideas "[gives] us an ability to try things out, even more than our own data centers would," he said.

Cockcroft offered these five tips for using AWS.

1. Weigh Costs Vs. Business Goals.

As an example, he said Netflix had no staff or servers in South or Central America when it opened operations in South America on Sept. 5, 2011. It expected that adding servers in the AWS center in Sao Paulo, Brazil, would improve customer service throughout the region. But it found that requests to its virtual servers in Brazil from other countries were almost always routed over the Internet through Miami. Miami is a massive hub for network carriers, and an Internet user in Ecuador wanting to talk to one in Brazil will almost always be routed through it. Netflix found no performance advantage to using servers in Brazil, which were farther from Miami than AWS's East Coast servers. It reverted to serving South and Central America through its hundreds of servers in Northern Virginia because "it made sense to serve them out of U.S. East," said Cockcroft.

European customers, on the other hand, could be more efficiently served out of AWS's Dublin, Ireland, data center. AWS services in Dublin are slightly more expensive than those in U.S. East, but reducing latencies for customers was worth the increase, he said. Netflix launched a 1,000 virtual machine footprint in Dublin, using the same procedures and same APIs to which it was already accustomed. "Everything just worked," he said.

Mastering these business tradeoffs, weighing the cost and latency penalties, where they exist, against your business goals, is one of the fundamental challenges of cloud computing. Cockcroft framed the trade-off as a rough equation: "How many dollars should you spend to reduce customer latencies by 50% if that increases your conversion rate by 10%?"
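Cockcroft's rough equation reduces to simple break-even arithmetic. The sketch below uses purely illustrative figures, not Netflix's numbers: it computes the most you could spend per month on latency reduction before the extra revenue from a given conversion-rate lift is wiped out.

```python
def max_justified_spend(conversion_revenue, conversion_lift):
    """Break-even bound: the extra revenue generated by a relative
    lift in conversion rate. Spending less than this on latency
    reduction is a net win; spending more is a net loss.

    conversion_revenue -- monthly revenue attributable to conversions
    conversion_lift    -- relative increase, e.g. 0.10 for +10%
    """
    return conversion_revenue * conversion_lift

# Hypothetical numbers: $2M/month of conversion revenue and a 10%
# lift from halving latency justify up to $200K/month of extra spend.
budget = max_justified_spend(2_000_000, 0.10)  # 200000.0
```

The real calculation would also discount for uncertainty in the lift estimate, but the shape of the decision is the same: quantify the revenue effect, then cap the infrastructure spend below it.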

2. Plan Ahead For Disaster.

Netflix is rare among Amazon users in operating across multiple regions. Amazon urges customers to achieve high availability by running the same application and data in more than one availability zone. Multiple zones exist within the same region, but Netflix wants its U.S. East operations in Northern Virginia to be able to fail over to one of Amazon's West Coast facilities. That way, a region-wide disaster such as Hurricane Sandy can strike Northern Virginia and still leave Netflix operating. (Sandy created temporary problems with some services but did not knock AWS off the air.)
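The failover decision itself can be sketched in a few lines. This is a minimal illustration, not Netflix's actual tooling: the region names are standard AWS identifiers, and the health map stands in for whatever monitoring feeds the routing layer.

```python
PRIMARY = "us-east-1"   # Northern Virginia
FALLBACK = "us-west-2"  # a West Coast region

def serving_region(healthy):
    """Route traffic to the primary region while it is up; fail over
    to the fallback region when it is not. `healthy` maps a region
    name to a bool reported by monitoring."""
    if healthy.get(PRIMARY):
        return PRIMARY
    if healthy.get(FALLBACK):
        return FALLBACK
    raise RuntimeError("no healthy region available")

# Normal operation stays in U.S. East; a region-wide outage there
# shifts traffic to the West Coast.
serving_region({PRIMARY: True, FALLBACK: True})   # "us-east-1"
serving_region({PRIMARY: False, FALLBACK: True})  # "us-west-2"
```

In practice the hard part is not the routing switch but keeping the application and its data replicated in both regions so the fallback can actually absorb the load.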

The consolidation comment brings up an interesting point. Many IT chiefs have rogue Amazon instances out there, set up by developers or even folks on the business side, that they don't know about. You have to track them down before you can consolidate.

Most of the discussion and slides were actually by Jinesh, so many of the quotes in this article are things Jinesh said, rather than what I said.

The discussion of Brazil overstates what we did. We ran a small experiment in AWS Brazil for a week or two earlier this year; it wasn't a large-scale deployment. The point was that we could easily try out deploying systems anywhere in the world.

Points 4 and 5 above don't quite have it right. With consolidated billing, reservations apply across accounts. It makes sense to hold excess reservations in production accounts so that you have a capacity guarantee for handling production peaks. The excess is mopped up by other accounts at the end of the month, so there is no cost penalty for the extra headroom. The other optimization is to autoscale the production web services instances down during the night and use the same reserved instances to create short-lived Hadoop clusters that do the daily ETL processing for business intelligence metrics.
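The day/night reuse Cockcroft describes comes down to headroom arithmetic: reservations the web tier isn't using at any given hour are free to back batch work. A sketch with made-up instance counts (not Netflix figures):

```python
def spare_reservations(reserved, web_demand):
    """Reserved instances left over after the web tier's current
    demand is met. With consolidated billing, these already-paid-for
    slots can back short-lived Hadoop/ETL clusters at no extra cost."""
    return max(reserved - web_demand, 0)

RESERVED = 500  # hypothetical reservation count across accounts

# Daytime peak: the web tier consumes the whole reservation,
# leaving nothing spare for batch jobs.
day_spare = spare_reservations(RESERVED, 500)    # 0

# Night: autoscaling shrinks the web tier, freeing capacity for
# the daily ETL Hadoop cluster.
night_spare = spare_reservations(RESERVED, 120)  # 380
```

The effect is that the reservation is sized for the production peak, and the off-peak hours that would otherwise be wasted headroom get spent on business intelligence processing instead.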
