Failure Cascading Through the Cloud

However, some experts question whether this will really help prevent future outages. “It’s not just individual systems that can fail,” says Neil Conway, a PhD student at the University of California, Berkeley, who works on a research project involving large-scale and complex computing platforms. “One failure event can have all of these cascading effects.” A similar problem led to a temporary failure of Amazon’s Simple Storage Service in 2008.

One of the biggest challenges, Conway says, is that “testing is almost impossible, because by definition these are unusual situations.” He adds that it’s difficult to simulate the behavior of a system as large and complex as Amazon Web Services, or even to know what to simulate.

Conway expects companies and researchers to look into new ways of testing abnormal situations for cloud computing systems. “The severity of the outage and the time it took [Amazon] to recover will draw a lot of people’s attention,” he says.

Sony’s PlayStation Network, an online gaming platform linked to the PlayStation 3, has yet to be fully restored after its outage on April 20. The company took it down in response to a security breach and has been frantically reworking the system to better protect it in the future. In a press release, Sony offered some details of its progress to date. The company has added enhanced levels of data protection and encryption, additional firewalls, and better methods for detecting intrusions and unusual activity.

For both Sony and Amazon, these struggles are happening in public, under pressure, and under the scrutiny of millions. Systems as complex as cloud services are going to fail, and it’s impossible to anticipate all the conditions that could lead to trouble. But as cloud computing matures, companies will build more extensive testing, monitoring, and backup systems to prevent outages that result in public embarrassment and financial loss.