Flash Fails 5 Ways, Says Alibaba

SANTA CLARA, Calif. — Datacenters are hungrily eating up flash memory, and feeling some indigestion. An R&D manager from China's Alibaba cloud service gave a frank talk on the pains it is experiencing in a keynote at the Flash Memory Summit here.

Among the top issues, flash vendors need to reduce cost per GByte, lower power consumption and latency, and increase reliability, said Wu Peng, a chief technologist in Alibaba's datacenter group.

A lot of flash products offer mean time between failures of many hours and warranties that last for years. "But actually we encounter a lot of degradation in performance specs so we are seeking more stability and certainty in life cycle performance," he said.

Despite the issues, flash is making its way into datacenters. Last year Alibaba bought at least one percent of the world's enterprise flash memory, and its use is growing.

"Roc" Wu Peng called for a streamlined software stack to ease the job of communicating flash memory health to apps.

Alibaba started exploring flash five years ago. It now uses all-flash databases and significant amounts of flash in its content distribution networks and app servers. The e-commerce company set records last year of 188 million transactions in 24 hours and 15,000 transactions/second.

To reliably keep the pace, applications need more information about the health of the flash memories they depend on. "Failures cannot be avoided, but the best thing is to let the system know when the hardware will fail, when should I use caution and when I should shift to backup," he said.

Alibaba wants to streamline both the hardware and software to ease the job of letting apps know the state of underlying flash. The datacenter giant wants to handle provisioning and data redundancy issues itself. "If the app knows a lot it can do a lot," he said.

Among short-term troubles at Alibaba, RAID controllers are generating problems in error correction and battery back-up when used with flash. Meanwhile, flash vendors are too focused on delivering ever higher I/O operations/second and spend too little time on lowering latency, he said.

Long term, work is needed to create a new software programming model to address changing storage hierarchies with the advent of flash memory. Separately, the vendor-driven concept of software-defined storage "is just a very rough direction" given datacenters have separate storage requirements for different apps, he said.

Further out, a host of next-generation memories such as STT-MRAM and phase-change memory on the horizon raise questions for datacenters, he said. "A lot of proprietary things are pushed to us and we have trouble understanding which will fail and which have the right timing for deployment," he said.

He extended an invitation to all vendors to test new products in the Alibaba labs, where it hosts a diverse range of apps.

"We have a program to try new things that could eventually end up in our infrastructure," he said. "The process is frankly long, but we are always open to new products," he added.

I expect that the data centers will drive flash reliability and help make the product better for everyone. It is a new technology, so it will take some time and effort to get to the reliability needed.

Flash corruption and read endurance are two issues that often get raised in our applications. The latest micros have better error detection and correction. We still see these types of failures with off-the-shelf SD Cards. SLC cards are much better but are often ten times too expensive. MLC/TLC cards are more cost effective but need to be managed carefully.