The bigger they get, the harder we fall: Thinking our way out of cloud crash

Holy cascading cloutage catastrophe!

Cloud computing is wonderful, until it isn’t. A digital screw comes loose somewhere, and before you know it the whole engine has ground to a halt in a cascading cloud outage – or, as we like to call it, a cloutage.

It has happened before, and Bryan Ford was very worried about it in 2012. Then a researcher at Yale, he published a paper warning about a digital iceberg: well-understood, visible risks at the top, and a whole bunch of hidden risks below the waterline. Complex interdependencies could sink the cloud computing ship, he warned.

That was four years ago – 28 dog years and far longer in irreverent online tech publication years.

Ford, proud owner of a PhD from MIT, has since gone from being a professor at Yale to leading the decentralised/distributed systems lab at the Swiss Federal Institute of Technology (EPFL) in Lausanne, where he is an associate professor. There, he builds secure, decentralised systems.

Looking out from EPFL, what does he see in the internet four years on? The cloud has become more complex, that's what.

"The complexity of interdependencies in the ecosystem, kind of as expected, is developing like crazy," he told The Reg recently. "In part that's driven by functionality, in part by efficiency and economics."

Cloud services are getting bigger and serving more customers. Microsoft nearly doubled its number of cloud customers in the last 12 months alone. Amazon Web Services' Q2 results showed a 70 per cent revenue hike to $2.4bn.

Their service portfolio is also ballooning. "The whole cloud business has developed into a market with more and more layers, and it keeps growing more layers every time I look at it," Ford said. "There are new, different kinds of services that different applications build on, that are also often sold as first-class products while being used internally."

The bigger, more complex infrastructures offer everything from IoT provisioning and management through to identity management services, enterprise applications and various layers of analytics. That means convenience for customers, but Ford worries about the risks of building complex systems atop even more complex systems. What happens, he asks, when you can't see all the gears?

"Many of the services are virtual storefronts on top of the wide variety of other services that they don't even talk about," he reflects. "Even if you're paying attention and you're worried about these kinds of reliability and stability issues, you're not going to be able to find the information you need to reason about it."

Then there's the IoT, which represents an even more complex set of moving parts, he muses. Don't even get him started. That'll be worse than cloud, he suggests. The implementation stacks are already all over the place. What happens if these interdependencies aren't managed properly, and everything is automated? "It will be totally the wild west," he says.

What's been done about it since 2012? Not much, if we're honest. Ford worked on an interesting paper for Usenix's OSDI conference in 2014 that proposed something called Independence as a Service (INDaaS), which would audit the redundancy of systems in hugely complex environments. It required an architecture of pluggable modules to acquire the dependency information. There was a proof of concept, too.
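To give a flavour of what auditing redundancy means, here is a minimal Python sketch. It is not the INDaaS algorithm or its proof of concept; it simply assumes the dependency information has already been gathered into a dictionary, and every component name in it (`storage-a`, `power-feed-east` and so on) is made up for illustration. It walks each supposedly independent replica's dependencies and reports whatever they have in common.

```python
from typing import Dict, Iterable, List, Set


def transitive_deps(graph: Dict[str, List[str]], node: str) -> Set[str]:
    """Return everything `node` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(node, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, ()))
    return seen


def shared_dependencies(graph: Dict[str, List[str]],
                        replicas: Iterable[str]) -> Set[str]:
    """Return the components that every replica ultimately relies on."""
    dep_sets = [transitive_deps(graph, r) for r in replicas]
    return set.intersection(*dep_sets) if dep_sets else set()


# Hypothetical dependency data: two "redundant" storage services.
DEPS = {
    "storage-a": ["rack-1", "dns-provider"],
    "storage-b": ["rack-2", "dns-provider"],
    "rack-1": ["power-feed-east"],
    "rack-2": ["power-feed-east"],
}

common = shared_dependencies(DEPS, ["storage-a", "storage-b"])
print(sorted(common))  # ['dns-provider', 'power-feed-east']
```

The two replicas sit in different racks, yet both hang off the same DNS provider and the same power feed. That is the sort of hidden common dependency that can turn one fault into a cascade. The hard part in practice is getting hold of the dependency data at all, which is where the paper's pluggable acquisition modules come in.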

Timothy "Mothy" Roscoe, professor of computer science at Swiss Federal Institute of Technology (ETH) Zürich, shepherded the 2014 paper. He explains that the problem is hard to tackle with commercial systems management software, because management and deployment teams are often fragmented in large, complex computing environments.

In the kinds of large enterprises that use or provide cloud services, different functions are controlled by different divisions, he warned.

"The structure of the enterprise means that the various bits are managed by different people. If you're a startup and trying to sell a management solution that crosses all of those boundaries, you have a problem, because few organisations will make that purchasing decision."

What can cloud providers do to protect increasingly complex systems from cascading failures, or other connected problems such as security holes? One nice idea would be to formally verify everything – but that's just what it is – a nice idea. Formal verification involves formally defining the function and characteristics of a system and then proving its correctness mathematically.

Formal verification has worked in systems running nuclear power stations or airplanes, where the software's functions can be tightly bounded and where the stakes are high in the event of a system failure. Now, thanks to firms like Veriflow, it's starting to make its way into networks.

Firms may be making some headway at lower levels of the network stack, but the further up the stack you go, the harder this becomes, points out Roscoe. Even defining language to formally describe the properties and states of everything in a data centre is a highly complex job.

There are other ways to mitigate potential failures, though. One promising area is simulation, in which you build up a software picture of a data centre and its cloud components. This simulated model will be driven by sensors providing real-time traces of what's happening there.

"You can use that online, and then you have a model of what's really happening, driven by those traces," Roscoe said. "Then you can query that model."
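As a rough illustration of the idea, and emphatically not Roscoe's system, the Python sketch below keeps a tiny in-memory model up to date from a stream of trace events and then answers a question about it. The event format, the class name `DataCentreModel` and the component names are all assumptions invented for this example.

```python
from collections import defaultdict
from typing import Dict, Set


class DataCentreModel:
    """Toy model of a data centre, updated from trace events."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}                      # component -> "up" / "down"
        self.hosted_on: Dict[str, Set[str]] = defaultdict(set)  # host -> services

    def ingest(self, event: dict) -> None:
        """Apply one trace event to the model (hypothetical event schema)."""
        if event["kind"] == "status":
            self.status[event["component"]] = event["state"]
        elif event["kind"] == "placement":
            self.hosted_on[event["host"]].add(event["service"])

    def services_at_risk(self) -> Set[str]:
        """Query: which services run on a host currently marked down?"""
        return {
            service
            for host, services in self.hosted_on.items()
            if self.status.get(host) == "down"
            for service in services
        }


# A made-up trace, as it might arrive from sensors.
trace = [
    {"kind": "placement", "host": "host-1", "service": "billing"},
    {"kind": "placement", "host": "host-2", "service": "search"},
    {"kind": "status", "component": "host-1", "state": "up"},
    {"kind": "status", "component": "host-2", "state": "up"},
    {"kind": "status", "component": "host-1", "state": "down"},
]

model = DataCentreModel()
for event in trace:
    model.ingest(event)

print(model.services_at_risk())  # {'billing'}
```

A real version would have to ingest vastly more data and support far richer queries, but the shape is the same: traces go in, the model tracks the current state, and questions are put to the model rather than to the live kit.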

Roscoe's team is working on this right now. The project is still in its early stages, but he believes he'll have an academic paper and some results to share by the end of the year. Designing the query processor that puts questions to the simulation is an interesting problem in itself.

"Some of the early work we did here suggests that you can take a biggish data centre and have it maintained in real time by ingesting all that data," he said. "It doesn't take that much computational power to do that. You can do it in a rack."

These proposed solutions are intellectually stimulating, but they aren't out of the lab yet, and are unlikely to be for some time to come. Most of the work being done on correlated cloud failures is still academic.

Meanwhile, the cloud engines thunder along, hoovering up large portions of the world's data – from our public tweets to our private moments. Let's hope no digital screws fall out of something important, eh? ®