Joyent’s Downtime Attributed to Admin Error

Joyent is a provider of high-end cloud infrastructure for web and mobile apps. On Tuesday, Joyent experienced a complete outage in its US-East-1 datacenter, located in Ashburn, Virginia.

Because Joyent serves apps that depend on high-performance cloud infrastructure, many of its clients noticed the downtime immediately. No public post-mortem is available yet, but Chief Technology Officer Bryan Cantrill has promised a full breakdown of what went wrong.

On the Hacker News forum, Cantrill writes, “It should go without saying that we’re mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn’t happen in the future (and that the recovery is smoother for failure modes of similar scope).”

Cantrill attributes the downtime to operator error. “Fat finger” is slang for hitting the wrong key, in this case with catastrophic results. As a result of the error, the entire datacenter had to be rebooted before Joyent’s cloud services could run properly again.

All services appear to have been back up and running within an hour. Cantrill mentions that the administrator who caused the error will not face consequences. In plain terms, the administrator won’t be fired. In fact, Cantrill says that he’s more interested in learning how this happened so that his engineers can prevent it from happening again. Joyent’s datacenters in San Francisco, Las Vegas and Amsterdam were unaffected by this outage.