Amazon's Vogels Challenges IT: Rethink App Dev

Amazon Web Services CTO says promised land of cloud computing requires a new generation of applications that follow different principles.

In an unusually direct challenge to enterprise IT departments, Amazon Web Services CTO Werner Vogels advocated a new set of rules for enterprise application development. "I'm not going to go too much Old Testament on you," said the would-be Moses of cloud computing during a keynote at the Amazon event Re:Invent, "but here are some tablets that have been given to me."

Tablets, as in stone etched with commandments. Instead of ten, he had four tablets -- Controllable, Resilient, Adaptive and Data Driven -- each with some command lines written on them. As he expounded upon each, he made it clear that Amazon itself followed these rules in building out its infrastructure as a service on which the Amazon.com retail operation now runs.

"Thou shalt use new concepts to build new applications," he decreed as an opening command line for "controllable."

"You have to leave behind the old world of resource-based, data-centric thinking," he said. Until recently, IT conceived first of the physical resources it would need to build a system, then let those resources constrain what it did next.

If an application is developed to run on AWS' Elastic Compute Cloud, it runs in virtual servers and can scale far beyond the resources initially allocated to it. Instead of conceiving of physical servers, think of "fungible software components" that can be put to work on jobs of different scales and traffic intensities.

Such software needs to be "decomposed into small, loosely coupled, stateless building blocks" that can be modified independently without disrupting other building blocks, Vogels said. The idea has been around since the advent of Web services and service-oriented architecture in the enterprise, but Vogels dusted it off and gave it renewed urgency.
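A minimal sketch of what "small, loosely coupled, stateless" can mean in practice: each stage below depends only on the message it receives, and the stages talk only through a queue, so either can be redeployed or scaled out without touching the other. The stage names and message fields are invented for illustration, not taken from the talk.

```python
import json
import queue

def validate_order(message: dict) -> dict:
    """Stateless: the output depends only on the input message."""
    return {**message, "valid": message.get("quantity", 0) > 0}

def price_order(message: dict) -> dict:
    """Also stateless, so any number of copies can run in parallel."""
    return {**message, "total": message.get("quantity", 0) * 9.99}

# A local queue stands in for the network between components.
bus = queue.Queue()
bus.put(json.dumps({"quantity": 3}))

order = json.loads(bus.get())
order = price_order(validate_order(order))
print(order["valid"], round(order["total"], 2))  # True 29.97
```

Because neither function keeps state between calls, a load balancer can route any request to any copy, which is what makes the components "fungible."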

Automating application deployment and operational behavior was roughly a second commandment, which he at one point expressed as, "Let business levers control your system."

By that, he meant an application should run itself, following parameters set to meet changing levels of business need. "As an engineer, you do not want to be involved in scaling," he said. Furthermore, if you want scalability to always work, "it is best if there are no humans in the process." Business rules can determine appropriate response times from an application for customers, and automated processes, such as fresh server spin-up and load balancing, can see that they are followed.
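One way to picture a "business lever" driving scaling with no human in the loop: a target response time (the lever) and observed load decide the fleet size. Every number below is an invented assumption for illustration, not anything Amazon disclosed.

```python
TARGET_RESPONSE_MS = 200          # business rule: what customers should see
REQUESTS_PER_SERVER_PER_SEC = 50  # assumed capacity of one server

def desired_fleet_size(current_rps: float, observed_ms: float,
                       current_servers: int) -> int:
    """Scale out when responses are slow, in (gently) when idle."""
    if observed_ms > TARGET_RESPONSE_MS:
        # Add capacity proportional to how far over target we are.
        factor = observed_ms / TARGET_RESPONSE_MS
        return max(current_servers + 1, int(current_servers * factor))
    needed = int(current_rps / REQUESTS_PER_SERVER_PER_SEC) + 1
    return max(1, min(current_servers, needed))

print(desired_fleet_size(900, 350, 10))  # slow responses: grows to 17
print(desired_fleet_size(100, 80, 10))   # fast and idle: shrinks to 3
```

A monitoring loop calling this function, plus an API that starts and stops servers, is the whole control system; changing the business rule means changing one constant, not paging an engineer.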

"Architect with cost in mind" was another rule. Vogels said he's good at choosing the most fault-tolerant algorithm, but "has no clue" how to determine which algorithm will be the lowest-cost one to run over a long period. For effective long-term operations, some projection of the cost of running the code must be included in the decision on which code is used.
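The projection he's describing can be as simple as instance-hours times price. A toy comparison, with made-up prices and run times, shows how a "slower" algorithm on fewer machines can win on cost:

```python
INSTANCE_PRICE_PER_HOUR = 0.10  # hypothetical on-demand price

def monthly_cost(runs_per_day: int, hours_per_run: float,
                 instances: int) -> float:
    """Project a month of compute cost for a recurring job."""
    return runs_per_day * 30 * hours_per_run * instances * INSTANCE_PRICE_PER_HOUR

# Algorithm A: fast but needs a big fleet; B: slower on fewer machines.
cost_a = monthly_cost(runs_per_day=24, hours_per_run=0.5, instances=40)
cost_b = monthly_cost(runs_per_day=24, hours_per_run=2.0, instances=6)
print(f"A: ${cost_a:.2f}/mo  B: ${cost_b:.2f}/mo")
# A: $1440.00/mo  B: $864.00/mo
```

Here algorithm B takes four times as long per run but, needing far fewer instances, costs 40% less per month, exactly the trade-off that raw performance benchmarks hide.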

"Protecting your customers is your first priority," he asserted, saying too few applications build in protections that are now cost-effective with today's processing power. "If you have sensitive customer data, you should encrypt it," whether it's in transit or at rest. "Integrate security into your application from the ground up. If firewalls were the way to go, we'd still have moats around cities," he warned.

The adaptive, resilient and data-driven tablets also got some air time. "Build, test, integrate and deploy continuously," Vogels urged. That's a widely shared but hard-to-implement DevOps idea that's also been around a long time, but Amazon practices it with a vengeance. It deploys new code every 11 seconds, and has made a maximum of 1,079 deployments in an hour, he said.

In its early days, AWS phased in a new deployment one server at a time until it had spread through a rack, then the next, and so on. "We were good at it but it was a very complex and very error-prone process," he noted. Amazon has since automated its code deployments, with open source Chef and other tools managing configuration, version management and rollback. What used to be a one-machine-at-a-time process is now done in blocks of 10,000 machines at a time.

If something goes wrong, "rollback is one single API call" to return a block of machines to their former state, he said.
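The pattern behind one-call rollback is straightforward to sketch (this is an illustration, not Amazon's actual tooling): record the previous version for every block of machines on each deploy, so that restoring the former state is a single operation.

```python
class Deployer:
    """Toy deployment manager tracking versions per block of machines."""

    def __init__(self, blocks: dict):
        self.current = dict(blocks)   # block name -> running version
        self.previous: dict = {}

    def deploy(self, version: str):
        """Roll the new version out to every block, saving prior state."""
        self.previous = dict(self.current)
        for block in self.current:
            self.current[block] = version

    def rollback(self):
        """The 'one API call': restore every block's former state."""
        self.current = dict(self.previous)

fleet = Deployer({"block-1": "v41", "block-2": "v41"})
fleet.deploy("v42")
fleet.rollback()
print(fleet.current)  # {'block-1': 'v41', 'block-2': 'v41'}
```

The point of the design is that rollback never depends on re-running the (possibly broken) deployment logic; the former state is data, already captured.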

Applications need to be instrumented so that they continuously report metrics as they run. That information can be relayed back into a business management system that monitors whether users, including customers, are being served. A well-placed reporting mechanism "is the canary in the coal mine," warning of imminent system slowdown as some service or component in an application starts to show signs of stress.
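A bare-bones version of that canary: wrap handlers so they record their own latency, and trip a flag when the tail of the distribution (p95 here, an arbitrary choice) drifts past a threshold, before most users notice. The threshold and sample sizes are invented for illustration.

```python
import time
from statistics import quantiles

latencies_ms: list = []

def instrumented(fn):
    """Decorator: record how long each call takes, in milliseconds."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latencies_ms.append((time.perf_counter() - start) * 1000)
    return wrapper

def canary_tripped(threshold_ms: float = 250.0) -> bool:
    """True when the rough 95th-percentile latency exceeds the threshold."""
    if len(latencies_ms) < 20:
        return False
    p95 = quantiles(latencies_ms, n=20)[-1]
    return p95 > threshold_ms

@instrumented
def handle_request():
    pass  # real work would go here

for _ in range(100):
    handle_request()
print(canary_tripped())  # fast no-op calls: False
```

In production the list would be a rolling window shipped to a metrics service, but the principle is the same: the application itself reports the signal that precedes a visible slowdown.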

"Don't treat failure as an exception. There are many events happening in your system," some of which "will make you need to reboot," he warned.
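Treating failure as routine rather than exceptional shows up in code as retry logic rather than a crash. A common sketch of the idea, with a simulated flaky dependency (`flaky_call` is invented for the example):

```python
import random
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 0.01):
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Back off exponentially, with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * random.random())

calls = {"n": 0}

def flaky_call():
    """Simulated dependency that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky_call))  # ok
```

The caller's code path is the same whether the dependency failed zero times or twice; failure is absorbed as a normal event, not surfaced as an exception to a human.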

AWS had to rebuild the S3 storage service to let it better cope with disk drive failures, which happen "many, many times a day" in the AWS storage service, said Alyssa Henry, VP of storage services, after being called to the stage by Vogels.

S3 was launched in 2006 as one of the earliest cloud services, built to hold 20 billion objects "with the wrong design assumptions about object size," Henry said. Rebuilt around a simpler, more resilient design, the new architecture is one of the reasons Amazon was able to announce a 24% to 27% reduction in storage prices Wednesday, she said.

"Instrument everything, all the time," added Vogels. "Put everything in [server] logs. You need to control the worst experience your customers are getting," and analytics can be periodically applied to server logs to see what events lead to slowdowns and how they may be prevented in the future.
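Controlling "the worst experience your customers are getting" out of server logs can start as simply as parsing a latency field and ranking requests by it. The log format below is invented for illustration:

```python
# Three sample log lines in a hypothetical format with a latency field.
log_lines = [
    "2012-11-29T10:00:01 GET /cart 200 latency_ms=45",
    "2012-11-29T10:00:02 GET /checkout 200 latency_ms=1890",
    "2012-11-29T10:00:03 GET /cart 200 latency_ms=52",
]

def latency_of(line: str) -> float:
    """Pull the latency value out of a log line."""
    return float(line.rsplit("latency_ms=", 1)[1])

# The single worst request, and the path that produced it.
worst = max(log_lines, key=latency_of)
print(latency_of(worst), worst.split()[2])  # 1890.0 /checkout
```

Run periodically over real logs, the same idea surfaces which endpoints and which events precede the slowdowns, which is exactly the analysis Vogels describes.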

Enterprise cloud adoption has evolved to the point where hybrid public/private cloud designs and the use of multiple providers are common. Yet who among us has mastered provisioning resources in different clouds, allocating the right resources to each application, and assigning applications to the "best" cloud provider based on performance or reliability requirements?