Extra Second That Crashed the Web

This is a very interesting story from Robert McMillan and Cade Metz, published in Wired yesterday. I am re-posting the same here for readers of techkhabaren:

When Saturday night’s leap second glitch hit Reddit, Jason Harvey didn’t realize it was the leap second glitch. He thought it was some sort of internet slowdown related to the massive Amazon cloud outage that brought down some of the web’s most popular services less than 24 hours earlier.

“It looked like the network was just moving really poorly,” says Harvey, one of the system administrators who oversee the operation of Reddit, the popular news aggregation and discussion site. “With Amazon going down, a network problem just made sense.”

But after about half an hour, Harvey and his team traced the problem to a group of their own machines running the open source Linux operating system. These servers had almost ground to a halt after failing to properly accommodate the “leap second” that was added to the world’s atomic clocks on Saturday night, as June turned into July.

Depending on how quickly the earth is spinning, the planet’s official time keepers periodically add an extra second to these clocks to keep them in sync with the planet’s rotation. This keeps us from drifting away to a place where sunsets happen in the morning, but it can cause problems with computing systems that plug into these clocks but aren’t quite agile enough to deal with that extra second.

In Reddit’s case, the problem could be traced to a glitch in the Linux kernel, the core of the open source operating system. A Linux subsystem called “hrtimer” — short for high-res timer — got confused by the time change, and suddenly sparked some hyperactivity on those servers, which locked up the machines’ CPUs.

Reddit was just one of several web outfits that were hit by leap second glitches just after midnight Greenwich Mean Time on Saturday, including Gawker Media and Mozilla, and these sorts of problems tend to pop up every time there’s a leap second adjustment. In January 2009, for instance, the leap second reportedly caused problems with Sun Microsystems’ Solaris operating system and an Oracle software package.

“Almost every time we have a leap second, we find something,” Linux’s creator, Linus Torvalds, tells Wired. “It’s really annoying, because it’s a classic case of code that is basically never run, and thus not tested by users under their normal conditions.”

The hrtimer glitch was patched in the Linux kernel this past March by a Linux kernel hacker named John Stultz, but some versions of Linux have not yet been updated to include this fix. Stultz was unavailable for comment on Monday, but in a post to an online mailing list, he discusses the problem that seemed to hit Reddit.

Inside the Crash

What actually happened to these machines? It’s complicated. Even Linus Torvalds said that in order to really understand what went on, we should talk to Stultz. But after interviews with several others familiar with the problem, we have a pretty good idea of what went down.

Hrtimer is a subsystem that is used when an application is “sleeping,” waiting for the OS to complete some other task. In some cases, it sets a kind of alarm clock for these sleeping applications that will go off when the OS is taking too much time with its other work.

Judging from Stultz’s mailing list post, when the leap second hit and these hrtimers were suddenly a second ahead of the core OS, they started ringing those alarm clocks, waking up countless sleeping applications at once and overloading the machines’ CPUs.
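To make that description concrete, here is a toy model of the mismatch. It is not kernel code, and every name in it is hypothetical. It assumes only what the paragraph above says: the timer subsystem's clock ends up one second ahead of the system clock that sleepers use to compute their wake-up deadlines.

```python
# Toy model only: this is NOT how the Linux kernel is written.
# All names below are hypothetical and exist purely for illustration.

LEAP = 1.0  # seconds by which the timer clock runs ahead after the leap second


def wait_time(deadline, timer_clock):
    """Seconds a sleeper actually waits before its timer fires."""
    return max(0.0, deadline - timer_clock)


def simulate(system_clock, timeout, timer_skew):
    """A sleeper asks to be woken `timeout` seconds from now.

    The deadline is computed from the system clock, but the timer
    subsystem compares it against its own clock, which runs
    `timer_skew` seconds ahead of the system clock.
    """
    deadline = system_clock + timeout
    timer_clock = system_clock + timer_skew
    return wait_time(deadline, timer_clock)


healthy = simulate(system_clock=1000.0, timeout=0.25, timer_skew=0.0)
skewed = simulate(system_clock=1000.0, timeout=0.25, timer_skew=LEAP)

print(healthy)  # 0.25: the sleeper really sleeps
print(skewed)   # 0.0: the alarm rings immediately
```

In this model, any sleep shorter than the skew returns at once, so a program that repeatedly naps for a fraction of a second turns into a busy loop. That matches the symptom described here, though the real kernel mechanics are more involved.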

Reddit, however, saw something a little different. Its servers were running an open source database known as Cassandra that was built with the Java programming language and runs atop Linux. From what Jason Harvey can tell, Cassandra was failing to pause Java processes, and these processes were caught in constantly spinning loops, eating up the CPU power on Reddit’s servers.

Eventually, Reddit solved the problem by rebooting its servers. The site was all but inoperable for about 30 to 40 minutes, and it was entirely offline for about an hour and a half.

While Reddit was struggling with its Cassandra servers, Gawker had issues with its Tomcat servers, and Mozilla had trouble with Hadoop. Both Hadoop and Tomcat also depend on Linux and Java, and it would seem that they were hit by the same glitch.

Other systems, however, experienced problems a day before the leap second arrived. Systems such as Linux use the Network Time Protocol, or NTP, to plug into the world’s atomic clocks and check the time. On Friday, NTP began warning servers that this year’s leap second was on the way, and according to Opera Software system admin Marco Marongiu, at least some Opera servers started locking up when they received the announcement. This issue is discussed on a Linux mailing list, and it’s unclear how closely it is tied to the hrtimer problem experienced by Reddit.
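For readers curious what that warning looks like on the wire: in the NTP packet format, the top two bits of the first header byte are a "leap indicator" field, and a value of 1 announces that the last minute of the day will have 61 seconds. The sketch below decodes that field in Python. The helper name and the sample packet are made up for illustration; the packet is built by hand, not captured from a real server.

```python
# Minimal sketch: read the leap indicator from an NTP packet header.
# `leap_indicator` and `sample_packet` are hypothetical names; the packet
# below is hand-built for illustration, not real network traffic.

LEAP_MEANINGS = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "unknown (clock not synchronized)",
}


def leap_indicator(packet: bytes) -> int:
    """Return the 2-bit leap indicator from the first header byte."""
    if len(packet) < 1:
        raise ValueError("empty packet")
    return packet[0] >> 6


# First byte layout: leap indicator (2 bits), version (3 bits), mode (3 bits).
# Here: leap indicator 1, version 4, mode 4 (server).
first_byte = (1 << 6) | (4 << 3) | 4
sample_packet = bytes([first_byte]) + bytes(47)  # pad to a 48-byte header

li = leap_indicator(sample_packet)
print(li, LEAP_MEANINGS[li])
```

This only shows where the announcement lives in the packet. It says nothing about why some servers locked up on receiving it, which the article leaves open.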

The Best Laid Plans of Mice and Linux Geniuses

We don’t know when the next leap second will be. It depends on how quickly the earth spins — and that can slow down or speed up, depending on tides, weather and the flow of molten metals in the earth’s core. But when the next leap second does come, there could be more problems.

Whenever you mess around with time, things have a pretty good chance of going wrong, Torvalds says. Developers may test for this stuff beforehand, but it’s hard to predict how things will play out in the real world.

“Leap seconds and daylight savings time changes are particularly painful, though, because they have the added complexity of being ad hoc without strict rules,” he says. “And of those two, leap seconds are the even more painful of the two.”

As Torvalds points out, synching up the earth with the time measured by atomic clocks is a tricky business. But, in general, the tech industry hasn’t had much experience with leap seconds over the past decade and a half. In fact, that may be part of the problem, says Steve Allen, a programmer with the Lick Observatory, just outside of San Jose, California. “From 1999 to 2005, there hadn’t been leap seconds. So all of the notions of cloud services and multiprocessors and so on came into existence during a period of time when leap seconds weren’t happening,” he says.

Since then, there have been leap seconds in 2005, late 2008, and now 2012. “So there was a long interval when people created all sorts of new stuff and didn’t have to think about that,” he says. “And then the earth stopped accelerating.”

Some have called for an end to leap seconds so that these problems can be avoided. But in the meantime, others have proposed master fixes that seek to hide the sudden time changes from systems such as Linux. Opera’s Marongiu suggests pausing NTP on a system for a second, rather than actually moving the system’s clock back.

“Basically, you trick NTP, so it won’t take that sudden step back, but still adds the extra second,” says Marongiu.

But he calls this a “poor man’s workaround.” The better solution, he says, is the one used by Google. Last fall, in a blog post, Google described a method it calls “leap smear.” Rather than add the extra second all at once, Google has modified NTP so that it adds milliseconds to clocks over a relatively long period of time.
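The idea of spreading one second across a long window can be sketched in a few lines of Python. This is a simplified illustration, not Google's implementation: it assumes a straight-line ramp and a hypothetical 20-hour window, neither of which comes from the article, and the function name is invented.

```python
# Simplified sketch of a "smear": spread one extra second evenly over a window.
# ASSUMPTIONS (not from the article): a linear ramp and a 20-hour window.
# `smeared_offset` is a hypothetical helper, not part of any NTP software.

def smeared_offset(seconds_into_window: float, window: float) -> float:
    """Portion of the extra second (in seconds) applied so far."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Clamp so the offset is 0 before the window and 1 after it.
    clamped = min(max(seconds_into_window, 0.0), window)
    return clamped / window


WINDOW = 20 * 60 * 60  # hypothetical 20-hour smear window, in seconds

for hours in (0, 5, 10, 20):
    offset = smeared_offset(hours * 3600, WINDOW)
    print(f"{hours:>2} h into the window: {offset:.2f} s of the leap second applied")
```

Because the clock is nudged by a tiny amount at every step, nothing downstream ever sees time jump or repeat, which is the point of the technique.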

It’s a clever fix. But don’t expect it to become the norm. When the next leap second hits, someone somewhere will go down.