Joyent cloud outage caused by sysadmin reboot error

Joyent is looking at how to improve software and operational procedures to prevent a recurrence

By Mikael Ricknäs

May 28, 2014

IDG News Service

Cloud provider Joyent suffered an outage yesterday after an administrator was able to simultaneously reboot all virtual servers hosted in the company's US-East-1 data centre.

"It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a data centre," said Bryan Cantrill, CTO at Joyent, in a post on Hacker News.

The company first noticed something had gone wrong when it started seeing transient availability issues.

"Due to an operator error, all compute nodes in US-East-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time," Joyent said in an initial update on the issue.

About an hour after first reporting the problem, the company said that all compute nodes and virtual machines were back online.

Joyent didn't say how many customers or servers were affected by the reboot. However, an error of this magnitude shouldn't be allowed to happen, and it highlights the importance of processes that balance the need for effective management with the need to protect users against these kinds of issues.

"As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are and will be making," Cantrill said.

The company is looking at how it can improve software and operational procedures to ensure that this doesn't happen in the future, and also how the recovery after a failure can be made smoother, according to Cantrill.

Just like any IT system, cloud-based services and servers can suffer from outages, but because of the large number of users, the consequences are usually larger.

This week some Amazon Web Services users were hit by a power outage. Servers in one of the US-West-1 region's availability zones were affected, and it took almost three hours for Amazon to recover all instances. Amazon didn't elaborate on what caused the power failure.

Recently, Twitter also suffered an outage after a change to one of its core services went wrong, and HBO angered users of its Go service twice when it was overwhelmed by the number of people who wanted to watch the season premiere of "Game of Thrones" and the finale of "True Detective."