Is the ECC Performance Price Worth It for GPUs?

One of several elements that separate high performance computing GPUs from their gaming and graphics brethren is the addition of error-correcting code (ECC) memory protection, which targets bit-flip errors in memory that can lead to invalid results or system problems.

While ECC is often deemed a necessary component for confirming the viability of simulation results, it does come with a performance price. According to a team of researchers from the San Diego Supercomputer Center and Los Alamos National Laboratory, enabling ECC cuts the size of the system that can be simulated by 10%, because the error correction codes themselves consume memory. They add that turning ECC on “reduces simulation speed, resulting in greater opportunity for other sources of error such as disk failures in large file systems, power glitches, and unexplained node failures during the timeframe of the calculation.”

With this performance penalty and greater potential for failure in mind, the question turns to whether ECC is preventing enough critical flaws to justify its price. In other words, are these errors so common that ECC is necessary? As one might imagine, this is a difficult question to tackle, since compute time, hardware, application, GPU and other issues are all involved. However, the team took up the question of ECC usefulness across several systems: Keeneland, a large XSEDE system at the Georgia Institute of Technology; a smaller production cluster at Los Alamos; and Dante at SDSC, which is equipped with GPUs of the gaming variety (and so has no ECC).

As seen in the graph, the performance penalty on Keeneland, the largest system in the test by GPU and node count, is certainly observable. Similar performance hits were observed on the other systems as well. Most interestingly, when it came to how useful ECC actually was across all of the systems, there turned out to be very few errors. In fact, the most significant errors or problems with the results, when compared across the different systems, stemmed from the hardware itself, faulty motherboards or other variables, not from the types of errors ECC is designed to address. That holds at least for the AMBER molecular dynamics code that was used as the basis for the cross-system testing. There is far more detail about the nature of this MD code and why it was particularly relevant for this sort of testing in the full paper.

As the researchers summarize, “Although the ability of ECC to detect and correct single bit errors is undeniably useful in theory, the practical application of this technology may not be in the interests of the MD community.” They point to the rarity of ECC-correctable errors and conclude that the benefits of ECC do “not outweigh the costs in terms of system size and calculation speed,” adding that “the errors appear to be so rare in production GPU calculations that their rate of incidence could not be quantified in this experiment.”

Finally, they surmise that overall, “the fact that other sources of hardware error were observed during the experiment, regardless of ECC status, indicates that there are much more probable ways for simulations to fail and that such failures most likely cause the simulation to crash rather than to produce bad data.”

Again, there is a great deal more information in the full piece, but this breathes new life into the debate over whether ECC is all it’s cracked up to be for some scientific applications. Does this mean a new life for low-brow gaming graphics cards in large-scale scientific computing sites? Probably not, but it is an interesting read nonetheless.

You are on a risky track here. Vendors sell what users signal they want to buy. Once users signal that they don’t need reliable memory by purchasing without ECC, some vendors may let memory reliability drop further. You get what you buy, accelerated toward cost reduction. The business certainty is that purchasing without ECC will give drastically lower reliability than the simple calculation in this article indicates.

The real problem with memory errors has more to do with their fundamental nature than their frequency. The errors that ECC corrects are random changes of individual bits out of memory systems containing billions of bits. Nearly all other types of hardware errors are pervasive, changing large numbers of bits, and repeatable (running the same program with the same data will cause the same error every time). Pervasive errors typically result in no answer from the application, either because the program or entire system crashes, or because the output of the application is obviously nonsensical. Single bit errors are different – much of the time, they just give an incorrect answer that looks reasonable. Depending on the particular bit involved, the wrong answer may be fine, or may be catastrophic when it is relied on in a critical application. No answer is a vastly preferable situation – you know something is wrong, so you fix the problem and re-run the program, and all that is lost is time.
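The point about single bit errors can be made concrete with a small sketch. The helper `flip_bit` and the sample value `energy` below are hypothetical, chosen only for illustration, and the example assumes values stored as IEEE 754 double-precision numbers. It flips one bit at a time and prints the result, so the effect of each bit position can be compared.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the chosen bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


energy = 1234.5  # hypothetical simulation output, not taken from the paper

for bit in (0, 30, 51, 52, 62, 63):
    print(f"bit {bit:2d} flipped: {flip_bit(energy, bit)!r}")
```

A flip in a low mantissa bit changes the value by an amount far below most tolerances, a flip in a high mantissa bit or the lowest exponent bit yields a number that still looks plausible, and a flip in the top exponent bit produces a value so far off that it would likely be noticed. Only the last case resembles the "obviously nonsensical" output described above.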

Also, we should note that Molecular Dynamics algorithms perform iterative convergence. This means that errors, unless they are numerically quite large, tend to be corrected as the algorithm progresses – the only harm a small error does is increase the number of iterations needed for convergence. They are therefore less sensitive to bit flips than many other algorithms. Conclusions based on MD behavior should not be assumed valid for algorithms that do not have the same properties.
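The self-correcting behavior described here can be sketched with a toy iteration. This is not MD code: it uses the Babylonian square-root iteration as a stand-in for any convergent iterative method, and the function name `babylonian_sqrt` and the parameters `perturb_at` and `perturb_by` are hypothetical, introduced only to inject a single error into an intermediate value.

```python
def babylonian_sqrt(a, x0=1.0, tol=1e-12, perturb_at=None,
                    perturb_by=0.0, max_iter=1000):
    """Approximate sqrt(a) by iterating x <- (x + a / x) / 2.

    If `perturb_at` is given, `perturb_by` is added to x once, at the start
    of that iteration, as a stand-in for a corrupted intermediate value.
    Returns (estimate, iterations_used).
    """
    x = x0
    for i in range(1, max_iter + 1):
        if i == perturb_at:
            x += perturb_by  # simulated one-off error
        new_x = 0.5 * (x + a / x)
        if abs(new_x - x) < tol:
            return new_x, i
        x = new_x
    raise RuntimeError("did not converge")


clean, n_clean = babylonian_sqrt(2.0)
dirty, n_dirty = babylonian_sqrt(2.0, perturb_at=4, perturb_by=0.5)

print(f"clean run:     {clean!r} after {n_clean} iterations")
print(f"perturbed run: {dirty!r} after {n_dirty} iterations")
```

Both runs arrive at essentially the same answer, with the perturbed run needing extra iterations. The caveat in the comment still applies: a direct calculation with no such feedback loop would simply carry the injected error through to its output.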

HPCwire is a registered trademark of Tabor Communications, Inc. Use of this site is governed by our Terms of Use and Privacy Policy.
Reproduction in whole or in part in any form or medium without express written permission of Tabor Communications Inc. is prohibited.