Why RIM still hasn’t found the cause of its worldwide outage

While the initial cause of Research In Motion's BlackBerry messaging network failure is known, the company still can't explain why it cascaded around the world.

It has been nearly a month since Research In Motion's huge BlackBerry service outage ended, and the company hasn't yet revealed why the failure cascaded so widely. While company executives blamed the failure of a core switch and backup systems at a data center in Europe for the initial outage, they've so far been unable to explain why it caused a backlog of network data to bring services down around the world. RIM has formed a "SWAT team" led by Chief Technology Officer David Yach to uncover why the outage spread so far and lasted so long, Bloomberg reports.

Sure, eventually everything fails—anyone who's ever worked in a data center knows that. And the big cloud and managed service companies like RIM, Amazon, Facebook, Google, and Microsoft have built their infrastructures around the idea that things fail regularly, and have gone to great pains to build resiliency into their systems. So why, if they're designed to deal with little failures—like a dying server, a lost drive, or a cut power line—do they seem to fail so often on a huge scale?

In his book Normal Accidents: Living with High Risk Technology, Charles Perrow wrote that for highly complex systems, "multiple [failures] and unexpected interactions of failures are inevitable." At the time, he was writing about nuclear power plants and the risks of the Y2K bug, but the same is true of big online systems: despite all the efforts to make them 99.95 percent available, there will always be unknowable risks buried within the systems' complexity. When something fails in a server room or a smaller data center, the cause is usually not too hard to find after going through the log files or inspecting the components. But the sheer complexity and scale of cloud services like BlackBerry email or Amazon's EC2 service can turn a small failure into a large one. The very things that cloud and managed service providers do to make their systems more resilient can actually cause them to fail more spectacularly.

The vast majority of big service outages are caused by the same things that have always caused system outages. "It's usually a router, a switch, or a backup power supply," Antonio Priano, the chief technology officer of cloud management software company ScienceLogic, said in an interview with Ars Technica.

"These are the kind of failures that happen at any data center," Priano said. "Over the last decade, every data center operator has dealt with an outage, not because of software failures but because of issues lower in the infrastructure. The issue [with cloud data centers] is their scale—a small failure can have far greater impact."

He pointed to the August data center outages in Europe for both Amazon and Microsoft as examples—they were caused by the explosion of a transformer near their data centers in Ireland. The resulting power spike damaged the systems that were supposed to start up backup generators.

RIM is hardly the only service provider to suffer a multiday outage as the result of the failure of a piece of network gear. Amazon's April EC2 outage at the company's data center in Northern Virginia was triggered by a network configuration change, ironically intended to increase the capacity of Amazon's primary network. Traffic was inadvertently switched to the wrong router during maintenance, interrupting both the primary network and the backup network in the process. This caused a backlog of storage replication changes to build up in Amazon's Elastic Block Store (EBS) systems, filling the available excess storage with change data and creating a "re-mirroring storm" that left about 13 percent of Amazon's storage volumes "stuck" in a death spiral, continuously searching for more storage space.
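To make the dynamic concrete, here is a minimal sketch in Python. It is purely illustrative and not Amazon's actual EBS logic; all of the numbers are hypothetical. The point is simply that when many volumes lose their replicas at once, they all compete for the same finite pool of spare capacity, and the pool can run dry before every volume finds a new home:

```python
# Toy model of a "re-mirroring storm" -- an illustration of the dynamic
# described above, NOT Amazon's actual EBS implementation. All numbers
# are hypothetical.

TOTAL_VOLUMES = 10_000   # volumes in the cluster
SPARE_SLOTS = 800        # spare capacity available for new mirrors
FAILED_FRACTION = 0.13   # share of volumes that lost their replica

needing_mirror = int(TOTAL_VOLUMES * FAILED_FRACTION)
spare = SPARE_SLOTS
stuck = 0

for _ in range(needing_mirror):
    if spare > 0:
        spare -= 1       # volume claims a spare slot and re-mirrors
    else:
        stuck += 1       # pool exhausted: volume keeps searching

print(f"{needing_mirror} volumes needed a new mirror")
print(f"{stuck} of them are stuck in the storm")
```

Once the spare pool is exhausted, every remaining volume keeps polling for space that never arrives; that's the "death spiral" in miniature.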

Priano said that Amazon's resiliency system caused the outage in this case because "they got a little too clever" in how they managed their storage. "If they hadn't had such a massive volume of data being re-replicated all over the place," he said, the outage would have been much shorter.

Four weeks later...

It took Amazon just a week to completely decipher the cause of that outage, which for some users lasted more than four days. So why, more than four weeks later, has RIM not been able to explain its own? The answer may be that there simply isn't a single cause to find: the complexity of the system itself may be to blame.

"The complexities of cloud and managed services have made a lot of companies have to take a new look at how they manage their ops," said Jim Adreon, former director of advanced cloud services at IBM and now vice president of strategy at Virtela Technology Services in Denver. "It takes quite a while to debug a cloud problem, if it wasn’t something that was already diagnosed."

While cloud service providers like RIM, Microsoft, and Amazon have done a lot to create resiliency, the technology they use is still in its infancy, he said. "Think about how long it took IBM to develop the Sysplex model," the redundancy model used in IBM mainframes, introduced in 1994. "They have to redevelop that for cloud, and it will take at least ten years."

The trigger for Amazon's April outage was human error, and the RIM outage was triggered by the failure of a single switch. But public and private cloud system failures can be triggered even by something going *right.* That's because it has become almost impossible to model the relationships and dependencies of the different elements within them, according to Mark Jaffe, the CEO of Prelert—a company that has developed a data center monitoring tool based on artificial intelligence and predictive analytics. "As the systems have gotten more complex, they have not lost their ability to communicate what they're doing and how they're doing," Jaffe said. But data center operations teams don't have the ability to parse it all. "It's a tremendous amount of data, requires more effort to understand, and the relationships in it aren't clear."

Jaffe cited the example of one customer that called in his company to diagnose a major failure: a financial services company running a trading system in a private cloud. "The trading system has lots of resiliency, with three backend databases they can fail over to," Jaffe explained. "A change was made on Friday night to move to an exact replica of the database used the previous week. It was darn close to exact, but in the darn close was the problem."

As trading started around the world while it was still Sunday in the US, the database started having minor performance issues—which turned into outright failure at the open of the New York Stock Exchange. "They spent hours trying to understand the cause, because this change is one they make fairly regularly," Jaffe said.

As it turned out, someone had applied a patch to the database server that hadn't been applied to the other systems, which degraded performance enough to cause a "wobble" in the system, eventually bringing it to its knees.
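That kind of drift is straightforward to check for once you know to look. A minimal sketch, assuming SSH access and RPM-based hosts (the hostnames and tooling here are hypothetical, not details from the incident), might simply diff the installed-package sets of nodes that are supposed to be identical:

```python
import subprocess

# A minimal drift-check sketch. The hostnames, SSH access, and the use of
# `rpm -qa` are assumptions for illustration; the idea is simply to diff
# the installed-package sets of hosts that are supposed to be identical.

HOSTS = ["db-primary", "db-replica-1", "db-replica-2"]  # hypothetical names

def installed_packages(host):
    """Return the set of packages installed on a host (via ssh, assumed)."""
    out = subprocess.run(
        ["ssh", host, "rpm", "-qa"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

baseline = installed_packages(HOSTS[0])
for host in HOSTS[1:]:
    drift = installed_packages(host) ^ baseline  # symmetric difference
    if drift:
        print(f"{host} differs from {HOSTS[0]}: {sorted(drift)}")
```

The hard part, as Jaffe's story shows, isn't running the comparison; it's knowing that a stray patch is the thing worth comparing before it takes the system down.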

This isn't the sort of failure that would be detected in most systems-management environments, because they depend on rules and exceptions that are based on models of expected types of failure. Prelert's tool was able to perform a complex relationship analysis on the log data and show the problem emerging 12 hours before the failure occurred—because it constructs the relationships between different types of data in the log files based on pattern recognition rather than a predefined model of the system.
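The distinction matters in practice. Here's a toy example, not Prelert's algorithm but an illustration of the principle: instead of a fixed rule like "alert if latency exceeds 500 ms," a detector can learn what "normal" looks like from the data's own history and flag departures from that baseline:

```python
from statistics import mean, stdev

# Toy baseline detector -- not Prelert's algorithm, just an illustration
# of pattern-based detection versus a fixed rule. Each point is compared
# against the rolling mean/stddev of its own recent history; anything
# beyond `sigmas` standard deviations is flagged, even if it would never
# trip a static threshold like "alert when latency > 500 ms".

def anomalies(latencies_ms, window=50, sigmas=3.0):
    flagged = []
    for i in range(window, len(latencies_ms)):
        history = latencies_ms[i - window:i]
        mu, sd = mean(history), stdev(history)
        if sd and abs(latencies_ms[i] - mu) > sigmas * sd:
            flagged.append((i, latencies_ms[i]))
    return flagged

# A slow drift from ~20 ms toward ~35 ms never crosses a 500 ms rule,
# but it departs from its own history almost immediately.
series = [20.0, 21.0] * 50 + [23.0, 26.0, 30.0, 35.0]
print(anomalies(series))
```

A rule-based monitor stays silent until the predefined threshold is breached, which for a slow "wobble" may be only minutes before total failure; a baseline learned from the data itself can surface the anomaly hours earlier.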

As RIM searches for the source of its own outage, Patrick Spence, RIM's head of European sales and marketing, told Bloomberg that there is "nothing that's not on the table," including a complete redesign of RIM's server network. But maybe what RIM really needs is a deeper knowledge of how its system actually works, rather than how it's modeled. Otherwise, the RIM SWAT team may be poking around for a long time in search of the real answers.