A Question of Cloud Reliability

21 August 2011 by Jenn Granger

On Sunday 7th August, parts of Amazon Web Services in Europe were knocked offline after a power failure. The outage caused errors in the Amazon Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and Relational Database Service (RDS) platforms.

The initial explanation was that a lightning bolt at Amazon’s Dublin data centre caused a power surge that knocked out the generators. The power outage caused the uninterruptible power supply (UPS) systems to kick in. The generators should then have taken over, but when they did not start up, the UPS quickly ran out of reserves.

Amazon was able to get services back online quite quickly, but an earlier hardware failure delayed the deployment of recovery snapshots, and some customers were offline for three days.

The lightning theory has since been discounted. Amazon now says it “currently believe[s] (supported by all observations of the state and behaviour of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task.” The PLC referred to is a programmable logic controller.

This is not the first time Amazon has suffered a massive outage on its EC2 system. The previous one had an international impact, with high-profile sites such as Quora and Foursquare affected.

Cloud hosting is seen by many as the best way to attain an ‘always-on’ solution without the expense of a setup built on multiple dedicated servers.

To date, the biggest factor stopping companies from moving to the cloud has been security. With Amazon suffering two long-term outages in the past six months, each affecting hundreds of clients, reliability might soon become the biggest reason businesses are unwilling to outsource to the cloud.