Google: Computer memory flakier than expected

Wondering why your computer just crashed again? Its memory might be to blame, according to real-world Google research that finds error rates higher than what earlier work showed.

With hundreds of thousands of computers in its data centers, Google can collect an abundance of real-world data about how those machines actually work. That's exactly what the company did for a research paper that found error rates are surprisingly high.

"We found the incidence of memory errors and the range of error rates across different DIMMs (dual in-line memory modules) to be much higher than previously reported," according to the paper jointly written by Bianca Schroeder, a professor at the University of Toronto, and Google's Eduardo Pinheiro and Wolf-Dietrich Weber. "Memory errors are not rare events."

The probability of an uncorrected memory error goes way up if a memory module has experienced a correctable error within the most recent month--431 times more likely in some cases.
(Credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber)

How many errors? On average, about one in three Google servers experienced a correctable memory error each year and one in a hundred an uncorrectable error, an event that typically causes a crash.

4,000 errors per year
That may not sound like a high fraction, but bear these factors in mind, too: each memory module experienced an average of nearly 4,000 correctable errors per year, and unlike your PC, Google servers use error correction code (ECC) that can nip most of those problems in the bud. That means a correctable error on a Google machine likely is an uncorrectable error on your computer, said Peter Glaskowsky, an analyst at the Envisioneering Group (and member of CNET's blog network).

ECC detects where a memory cell that should have stored a one ended up with a zero or vice versa, and Google also uses a higher-end error correction technology called chipkill, the paper said. The study measured the majority of Google's servers, gathering data for nearly two and a half years, the first study at such scale, they said.
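
For readers curious how a flipped bit can be found and fixed at all, here is a minimal Python sketch of a textbook Hamming(7,4) code, which stores 4 data bits alongside 3 check bits. It is an illustration only, not the scheme described in the paper: real server ECC typically protects much wider memory words, and chipkill goes further still. The function names `encode` and `decode` are made up for this example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword.

    Layout (1-based positions): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the single bit that was flipped and repaired.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the bad bit
    if position:
        c[position - 1] ^= 1  # flip it back
    return [c[2], c[4], c[5], c[6]], position


# Simulate a single-bit memory error and repair it.
stored = encode([1, 0, 1, 1])
stored[4] ^= 1  # one cell flips
recovered, where = decode(stored)
print(recovered, where)
```

A single flipped bit is repaired silently, which corresponds to the "correctable" case. This simple code cannot cope with two flipped bits in the same word, which gives a feel for why some errors end up uncorrectable.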

Previous research, such as some data from a 300-computer cluster, showed that memory modules had correctable error rates of 200 to 5,000 failures per billion hours of operation. Google, though, found the rate much higher: 25,000 to 75,000 failures per billion hours.
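
To get a feel for those numbers, a rate quoted in failures per billion hours can be converted to an expected count per year of continuous operation. The short Python sketch below does that arithmetic for the figures quoted above. The helper name `failures_per_year` is hypothetical, and the result applies to whatever unit of memory the quoted rate was measured on, which this article does not spell out.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def failures_per_year(per_billion_hours):
    """Convert failures per billion hours into expected failures per year."""
    return per_billion_hours * HOURS_PER_YEAR / 1e9


# Rates quoted in the article, in failures per billion hours.
earlier_low, earlier_high = 200, 5_000
google_low, google_high = 25_000, 75_000

for label, rate in [
    ("earlier work, low", earlier_low),
    ("earlier work, high", earlier_high),
    ("Google, low", google_low),
    ("Google, high", google_high),
]:
    print(f"{label}: {failures_per_year(rate):.4f} expected failures per year")

# How far apart the two ranges are.
smallest_gap = google_low / earlier_high
largest_gap = google_high / earlier_low
print(f"Google's range is {smallest_gap:.0f}x to {largest_gap:.0f}x higher")
```

By this arithmetic, even the low end of Google's range is five times the high end of the earlier figures.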

While memory errors can cause serious problems, they're a lot less serious for PCs than for servers, Glaskowsky said. That's because servers keep a lot of data in memory, writing it periodically to the relative safe haven of a hard drive, whereas most of a PC's memory holds just application or operating system files or perhaps some content that's being seen but not edited.

But the study's results are causing some to rethink their software approach. One Google Chrome programmer, John Abd-El-Malek, suggested that the browser's database code be split off into a separate process from the rest of the browser code to cut down on corruption problems.

"Even if only a small fraction of these are harmful, spread over millions of users that's a lot of corruption," he wrote. He failed to convince at least some of his peers of his particular approach, but one skeptic, Scott Hess, responded, "I can see how it would make it useful to minimize how much in-memory data SQLite keeps, regardless of where SQLite lives."

Other myths debunked
The paper also challenged some other beliefs about memory.

â¢ Temperature isn't such a big deal.

Higher temperatures generally cause higher error rates, but differences in temperature at Google's data center "had a marginal impact on the incidence of memory errors." However, system utilization, which tends to go hand in hand with high temperature, did cause more errors.

â¢ "Hard errors" are more common than "soft errors."

Hard errors, which are irreparable problems with hardware, are more likely at fault than soft errors, which are transient issues caused by events such as random cosmic rays. This finding is interesting "since much previous work has assumed that soft errors are the dominating error mode in DRAM," the authors said, referring to the common dynamic random access memory used for computers' main memory.

â¢ Newer generations of memory modules, such as DDR2, aren't any worse than older ones.

There has been concern that newer memory modules, which pack electronics more tightly, suffer higher error rates. "In fact, DIMMs used in the three most recent platforms exhibit lower correctible error rates than the two older platforms, despite generally higher DIMM capacities," the authors wrote. "This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling."

The researchers based this conclusion in part on the evidence that one error in a memory module is a good predictor of another to come--either correctable or uncorrectable. Worse, error rates go up with time:

"We see a surprisingly strong and early effect of age on error rates," the paper said. "Aging in the form of increased correctible error rates sets in after only 10 to 18 months in the field."

Google replaces error-prone memory modules, but it's harder for regular computer users without ECC memory to spot problems. In the olden days of personal computing and into the 1990s, memory was unreliable enough that people ran reliability tests.

But those tests could come back, perhaps built into operating system software, Glaskowsky said: "If error rates are high enough, there may be an argument for running memory tests again."