Google CIO and others talk DevOps and “Disaster Porn” at Surge

At the Surge conference in Baltimore, Google CIO Ben Fried said that IT …

Google CIO Ben Fried bared his soul to the systems and software engineers and other IT pros gathered at OmniTI's Surge scalability conference in Baltimore Thursday, sharing the story of his greatest IT failure and how it informed how Google runs its IT operations. While he didn't call it by name, Fried's keynote was as much a manifesto for the "cult of DevOps" as it was “disaster porn.”

There were plenty of other cautionary tales from Surge presenters, many of which promoted DevOps in some way. But they also highlighted just how fickle public cloud services—and Amazon's EC2 in particular—can be.

DevOps is the growing practice of forging tight collaboration between application developers and IT operations staff to continually improve performance, automation, and scalability of software and systems. The philosophy is also the force behind scripting language-based infrastructure automation tools, such as Puppet and Chef. But it's also a reflection of the necessities that come from trying to provide reliable systems based on increasingly complex and unreliable stacks of software and infrastructure.
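Puppet and Chef use their own Ruby-based configuration languages, but the idea they embody—declare the state you want and let the tool converge toward it, rather than hand-scripting one-off changes—can be sketched in a few lines of Python. The package and service names below, and the Debian-style commands, are illustrative assumptions only.

```python
# Illustrative only: a minimal, idempotent "desired state" sketch in the spirit
# of tools like Puppet and Chef (which have their own DSLs). Assumes a
# Debian-style system; the resource names are hypothetical.
import subprocess

def ensure_package(name):
    """Install a package only if it is not already present."""
    installed = subprocess.run(
        ["dpkg", "-s", name], capture_output=True
    ).returncode == 0
    if not installed:
        subprocess.run(["apt-get", "install", "-y", name], check=True)

def ensure_service_running(name):
    """Start a service only if it is not already running."""
    running = subprocess.run(
        ["service", name, "status"], capture_output=True
    ).returncode == 0
    if not running:
        subprocess.run(["service", name, "start"], check=True)

if __name__ == "__main__":
    ensure_package("nginx")          # hypothetical example resources
    ensure_service_running("nginx")
```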

A failure to communicate

Fried described a catastrophic failure of an institutional trading application whose development he led seven years ago at his previous employer, Morgan Stanley (which he identified in his presentation as “a large investment banking firm”)—a failure that cost the company millions and took 18 months to correct. While a number of contributing technical errors led to the failure, Fried said the source of the problem was how the IT organization scaled up as the application became a bigger success for the business.

The application, which Fried said was the source of much back-patting at the time of its launch, was a desktop app for Morgan Stanley's large institutional customers—clients making high volumes of trades—and it re-used the infrastructure of the investment bank's Web-based trading tool. It used SSL connections to feed the application real-time market data, send trades, and pass back reports on how those trades were being executed. While the application ran on the Web infrastructure, all of its customers were “high value,” said Fried, so the decision was made to move them over to private connections to help ensure quality of service—which many of them did.

Soon, there were complaints about the application's performance. But Fried said his response, when asked about the problems, was to “ask for data." This typically led to the complainers leaving him alone. That ended when “a very important person in the company called me,” he said, “and told me it was in my best interest to go look at what was going on on the trade floor, or the consequences would not be pleasant.” He found the trader support team slammed with calls from customers. And at that moment, he received a page: “There was a hard failure in the system, and it was going down.”

The reasons for the failure were legion. “The most interesting thing is that big disasters rarely happen because of just one thing,” Fried said. In this case, the first of them was that the dedicated load balancer that had been put in front of the application had a gigabit Ethernet port but was rated for only 45 megabits per second of throughput. “And what made it worse was that in the sea of configuration changes that had been made, someone had blocked the SNMP port” for the load balancer, he added—so it showed up as “green” on the network management console until it failed completely. But the solution required more than just a new load balancer—it required a change in network architecture. The leased lines that had been sold to customers were routed through the system's public-facing Internet connection—totally defeating any effort at quality of service.
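That monitoring gap is easy to reproduce in miniature: if a poller treats "no answer from the device" the same as "no alarm," a load balancer whose SNMP port has been blocked stays green right up until it fails outright. The sketch below is purely illustrative—the polling function is hypothetical, and only the 45Mbps rating comes from Fried's story.

```python
# Illustrative sketch of the monitoring pitfall: a hypothetical poller that
# cannot reach a device's SNMP agent and quietly reports the last known-good
# status instead of raising an alert.
def poll_device(get_snmp_stats):
    try:
        stats = get_snmp_stats(timeout=5)      # hypothetical SNMP fetch
    except TimeoutError:
        return "green"                         # the bug: unreachable == fine
    return "red" if stats["throughput_mbps"] > 45 else "green"

def poll_device_fixed(get_snmp_stats):
    try:
        stats = get_snmp_stats(timeout=5)
    except TimeoutError:
        return "unknown"                       # surface the blind spot instead
    return "red" if stats["throughput_mbps"] > 45 else "green"
```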

As if that wasn't enough, the application itself was causing network issues. While its communications had originally been based on small HTTP messages, that had changed because of what Fried called “the organization's love affair with XML.” The messages had grown to as much as 3,000 bytes—twice the maximum size of an Ethernet packet—so there was a “hockey-stick spike” in traffic and dropped packets.
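A back-of-the-envelope calculation shows why that hurt: once a message outgrows the roughly 1,460 bytes of payload that fit in a single standard Ethernet frame after IP and TCP headers, every message costs multiple packets, and any one dropped packet delays the whole message. The figures below are illustrative, not numbers from the Morgan Stanley system.

```python
# Rough illustration of message size vs. Ethernet framing. Sizes are
# assumptions for the sake of the example, not data from the article.
import math

ETHERNET_MTU = 1500          # bytes of IP payload per standard Ethernet frame
IP_TCP_HEADERS = 40          # typical IPv4 + TCP header overhead per packet

def packets_needed(message_bytes):
    payload_per_packet = ETHERNET_MTU - IP_TCP_HEADERS
    return math.ceil(message_bytes / payload_per_packet)

for size in (400, 1400, 3000):   # small HTTP message vs. an XML-bloated one
    print(f"{size:5d}-byte message -> {packets_needed(size)} packet(s)")
```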

The opposite of DevOps

The real root of the problem, Fried said, was the way the organization around the system had been built. "Without even thinking about it, the way we scaled up was through specialization," Fried explained. "We added people to specialized teams, each operating within a functional boundary. We never said understanding how everything works is important." Because none of them had knowledge of how the application worked beyond their area of expertise, the teams made decisions that led to a "hard failure" of the application.

As companies strive to scale up applications to handle larger tasks, Fried said, it's increasingly important to have IT generalists on the team who can look cross-functionally at systems. "Scalability is pushing the boundaries of the possible,” he said. “We operate at the interface of the known and unknown. Normal industrial style thinking doesn't work, because specialists' expertise is not good at dealing with the unknown."

Fried said the process of fixing the problems at Morgan Stanley “forced me to rethink how we do operations, and what the culture of operations should be. Operations is engineering. We need generalists in operations, and we can't allow the tech barriers to separate us because that will result in failure.” He said that it's important to reward and recognize generalist skills and broad understanding of systems, and added that he thinks Google gets this right. “We go to great lengths to hire people with engineering skills, put engineers in operational roles and give them power and accountability."

That sounds a lot like an embrace of the philosophy of DevOps, and it was a message that many in the audience who are responsible for Web applications received warmly. In many ways, the DevOps style is something endemic to Web startups—especially small ones, where the developers end up being responsible for operations as well.

That was the case at Ruby-based platform-as-a-service provider Heroku, which was acquired earlier this year by Salesforce. As Heroku's cloud operations director Mark Imbriaco said in a presentation on the company's approach to responding to system failures, "A year ago, Heroku had no ops at all." The operations team is still small, so every engineer on staff participates in the company's on-call incident response. “It gives us a sense of shared suffering,” he added, “and lets everyone see the problems—particularly the people who wrote the code.”

It also means that every engineer at Heroku has sysadmin privileges—something Imbriaco admitted he would prefer not to be the case.

Amazon and other disasters

Some of the other "disaster porn" at Surge yielded practical advice that Google's CIO couldn't give, particularly about the dark arts of dealing with Amazon's EC2 cloud infrastructure services. EC2 was the platform of choice for most of the cloud service players at the conference; Heroku, for example, runs completely on EC2. But that's a choice that doesn't come without pain.

Imbriaco said that Heroku has seen "so many different errors from Amazon" that its engineers have become experts at diagnosing them, and Heroku's own monitoring usually beats Amazon's by 15 minutes in diagnosing problems. And when the problems are related to ephemeral disk failures, Amazon does little to deal with them other than occasionally sending a message. "We will get an email saying, 'Your host is in a degraded state and you need to move your stuff,'" Imbriaco noted.

The majority of the issues that Heroku encounters with its services on Amazon are related to disk I/O, including failures of the "ephemeral" disks attached to instances, which can crash the virtual machine itself. Most of Heroku's "playbooks" for dealing with system failures and degradations include resetting or "destroying" EC2 instances. Imbriaco said that he'd like to automate most of these responses, but "we're too afraid to right now." Automation, he said, is also a great way to rapidly distribute failure.
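Heroku hasn't published those playbooks, but the "destroy and replace" step might look something like the following sketch, written against boto3, a current AWS SDK for Python. The region, AMI ID, and instance type are placeholders, not anything Heroku actually uses.

```python
# Hypothetical sketch of an "instance replacement" playbook step using boto3.
# Nothing here is Heroku's actual tooling; it only illustrates the pattern of
# terminating a degraded EC2 instance and launching a fresh one from a known image.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def replace_instance(bad_instance_id, ami_id, instance_type="m3.large"):
    """Terminate a degraded instance and start a replacement from a base image."""
    ec2.terminate_instances(InstanceIds=[bad_instance_id])
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return resp["Instances"][0]["InstanceId"]

# Left as a manually invoked function rather than a scheduled job: as Imbriaco
# notes, automation is also a great way to rapidly distribute failure.
```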

Andy Parsons, formerly of hyperlocal news service Outside.in and now CTO of a startup called Bookish, provided a long history of war stories from his serial startups, and many of them included lessons about EC2. "Machines disappear, you don't know why. Sometimes it's network availability—the infrastructure at Amazon's data centers is immense; people are plugging things in all the time, and there are network outages." At Outside.in, he said, "We went through a period where we lost an instance a day. In any week, we were doing 10 emergency reboots."

Parsons also had I/O problems, which he said were in part because "you have no idea where the actual SAN is" that supports virtual systems. The storage area network might be on the other side of one of Amazon's EC2 campuses, practically in a separate data center. He also warned against using ephemeral storage for anything critical.

Another set of frequent problems that both Parsons and Imbriaco cited was related to Amazon's Domain Name System service. "Local DNS failure is a common problem," Imbriaco said, resulting in failed connections. Parsons said that the private IP addresses of instances in the Amazon cloud change without warning as well, so it's important to use DNS for communications between them.
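The practical upshot is to resolve peers by name at connection time rather than caching their private addresses. A minimal sketch, with a hypothetical hostname and port:

```python
# Sketch: look a peer up by DNS name each time we (re)connect, so a changed
# EC2-internal IP is picked up automatically. Hostname and port are placeholders.
import socket

def connect_to_peer(hostname="db-internal.example.com", port=5432):
    # getaddrinfo performs the DNS lookup fresh on each call.
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            hostname, port, socket.AF_INET, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(5)
        try:
            sock.connect(addr)
            return sock
        except OSError:
            sock.close()
    raise ConnectionError(f"could not reach {hostname}:{port}")
```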

In the end, Parsons said, with Amazon instances, "Failure is assured." He recommended keeping a hardened basic system image, using tools like Puppet or Chef to load and patch instances, and replacing instances early—before they fail. When asked if he had considered switching to another cloud provider, he replied that he still thinks Amazon is the best option. "For all my complaints, I think EC2 is fantastic."

I wonder how large Heroku's presence on EC2 is. Is it so large that they're seeing the network and its services flake out? When you're on a big network, weird-ass stuff does occur. Or is there something amiss with the EC2 infrastructure?

As a public-sector IT worker, I find this all too familiar. I think my boss said it best during a recent server upgrade on a mission-critical system:

"Educational software vendors can never fully test the impact of real world, day to day loads on their software."

Not without hiring an army of temp workers to push through test data. Sean accurately described the difficulties of scaling applications. Amazon's cloud infrastructure only highlights this difficulty. As more and more apps become cloud-based, I think these stories will become more public.

It sounds like most of those guys were using EC2 as IaaS. Would more of those problems be mitigated if they were using PaaS, within some kind of managed platform? I ask because I'm digging into, and will soon be depending on, Microsoft's Azure. The vast majority of the complaints listed sound like they're out of my control. From what I understand about Azure, it automates a lot of that for me, but I don't have the experience, scale, or load to really tell.

DevOps is the new cargo cult. It is the latest reminder that developers need to remember that their code will run in the real world. Problems like network throughput, latency, and outages can't be abstracted away.

Yup, there's a lot of that going around. (e.g. Microsoft). Personally, I hate XML, but more to the point it's used in all kinds of inappropriate places. Sending XML over a data transport channel is just plain dumb. Sure, you can read the contents right off the wire with a packet sniffer but a whole lot of bandwidth has been wasted for the sake of debugging convenience. When sending data through a data channel of any kind it should almost always be in raw binary format.

It sounds like they are re-learning the purpose of systems engineering. You need someone who has a grasp of the whole picture to spot the issues that happen at the interfaces between systems. I can't believe people would build such huge systems without dedicated systems engineering teams. It's standard practice in the world of physical things, like aircraft, robotics, automobiles, etc.

Yup, there's a lot of that going around. (e.g. Microsoft). Personally, I hate XML, but more to the point it's used in all kinds of inappropriate places. Sending XML over a data transport channel is just plain dumb. Sure, you can read the contents right off the wire with a packet sniffer but a whole lot of bandwidth has been wasted for the sake of debugging convenience. When sending data through a data channel of any kind it should almost always be in raw binary format.

XML is popular like that because of the tooling. There are lots of wizards and programs and grokkers for XML. But it's verbose and I've seen some very expensive "enterprise" apps get it wrong. But anyway, I think this is at least part of the reason that JSON is a rising star.

Yup, there's a lot of that going around. (e.g. Microsoft). Personally, I hate XML, but more to the point it's used in all kinds of inappropriate places. Sending XML over a data transport channel is just plain dumb. Sure, you can read the contents right off the wire with a packet sniffer but a whole lot of bandwidth has been wasted for the sake of debugging convenience. When sending data through a data channel of any kind it should almost always be in raw binary format.

Huh? Have you heard of Content-Encoding: gzip? XML, JSON, and CSV are all fairly wasteful in their payload size. Yet so are (to pick some nasty proprietary binary formats): .DOC, MAPI RPC, and .XLS. Ok, so let's use ProtoBuf for a raw binary format. If we're sending lots of strings, compressed XML can be smaller than uncompressed ProtoBufs. Enabling output compression on any web server is trivial these days, and the processing time minimal.

I'm no particular fan of XML, but use the easiest open interchange format supported by your toolchain and libraries. Sending XML to a browser is silly, but then again, sending JSON to be deserialized into a C++ object is pretty silly too, given how optimized stream-processing XML parsers are. With Gzip enabled by default on all text assets and messages, network bandwidth is the least relevant criterion for picking a wire format.

However, the point from the article is a good one: be aware of how your messages are being encoded on the network, and the characteristics of your transport. If you have high-latency connections, don't design a chatty protocol; if you have small MTUs (esp. over tunnels or cell/frame layer2s), don't use large messages requiring fragmentation. If you have lots of packet-loss, don't use a streaming transport. etc.
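The compression point above is easy to sanity-check; the repetitive trade document in this sketch is made up purely for illustration:

```python
# Quick check of how much of XML's tag overhead gzip recovers on a
# repetitive payload. The trade record is a made-up example.
import gzip

record = "<trade><symbol>MSFT</symbol><qty>100</qty><price>25.10</price></trade>"
xml_doc = "<trades>" + record * 200 + "</trades>"

raw = xml_doc.encode("utf-8")
print(f"raw XML:     {len(raw):6d} bytes")
print(f"gzipped XML: {len(gzip.compress(raw)):6d} bytes")
```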

This is just showing that IT is pretty much the bottom of the barrel of the technology industry, not necessarily from a talent standpoint, but from the standpoint of being at such infancy compared to other engineering disciplines. I worked for a few design firms years ago, and have worked with many major manufacturers, and pretty much all of them would be aware that you need a "systems" guy (or a few) if you plan to produce anything that actually functions as intended. Typically those are the veterans of the corporation who've done the low-level design for years and gradually worked their way up to overseeing larger subsystems and/or the system as a whole.

Seems like IT needs to take some lessons from the more traditional engineering design fields...

DevOps is the new cargo cult. It is the latest reminder that developers need to remember that their code will run in the real world. Problems like network throughput, latency, and outages can't be abstracted away.

Like Game_Ender, and Redbeer, I think that devops looks like systems engineering (with the organization as a whole considered as a system), and its proponents would do well to learn something about this established field before they go about trying to reinvent it.

That said, I think it is a good idea, overdue by about four decades. Because it is a discipline that requires looking at the big picture and anticipating problems, it will only succeed if it is practiced by people who can think about systems in abstract terms, though from a practical point of view. I once worked for an organization with a reputation for being a 'technical' shop, but what that meant was that operations ran development, and with a couple of notable exceptions, the operations managers did not understand development, instead treating it like a process in which you did the same tasks every day. On the other hand, I have also worked at organizations where the developers were only interested in doing technically 'sweet' things (such as XML everywhere, back when it was the latest fad), with little regard for the needs of the organization. Systems engineering / devops will not work if it is run from either of these points of view.

As a well-respected HA scientist, I bet he may have some answers for you.

I would think it would be an over-reliance on academics instead of on experienced HA practitioners, although I would not be surprised to hear about marketing/sales-force issues or customer misconceptions.

HA implementations have been on the market for decades, and it is unreasonable to expect every issue found to be published. Some are still known only within corporations as proprietary know-how. There is a reason it costs so much.

Sean Gallagher / Sean is Ars Technica's IT Editor. A former Navy officer, systems administrator, and network systems integrator with 20 years of IT journalism experience, he lives and works in Baltimore, Maryland.