NVIDIA Speeds Product Development with NetApp Hybrid Flash Arrays


At NVIDIA, success is driven by relentless innovation and the ability to bring new products to market quickly. The company’s engineers design a range of processors, from chips that power smartphones and tablets to supercomputing processors packed with 7 billion transistors. According to Pethuraj Perumal, IT storage manager at NVIDIA, “We don’t want our product engineering teams to even have to think about storage while they are testing their designs, and we certainly don’t want storage to be a bottleneck to their research and development workflow.”

However, several years ago, the NVIDIA storage team ran into the following problems when using a major storage vendor’s scale-out technology that striped data across all the disks in the system:

Small-file random I/O became a bottleneck

Performance did not scale linearly

Stability and reliability were insufficient

To address these shortcomings, the NVIDIA team evaluated solutions from a number of vendors and selected NetApp® hybrid flash storage systems with intelligent caching as the best solution for their electronic design automation (EDA) software requirements.

NVIDIA Benefits from Flash Cache

Much of the workload at NVIDIA is dependent on reads, which are accelerated by NetApp Flash Cache™. By caching recently read data and metadata on storage controller PCIe cards, Flash Cache works as an extended buffer to the PCI bus, helping to accommodate very large data sets. The NVIDIA team worked closely with NetApp to size the Flash Cache capacity for its specific workloads and, as a result, cache usage is always above 90%.

Flash Cache allows NVIDIA to use a hybrid storage model that mixes high-performance SAS drives with higher-density, lower-cost SATA drives, which minimizes the storage footprint and keeps costs down. The hybrid approach provided three times the savings in the number of disk shelves required, as well as significant savings in power and cooling.

Without the ability of Flash Cache to optimize performance across both SAS and SATA drives, NVIDIA would have required more rack space and a new data center. Instead, the NetApp systems fit within the existing data center and helped NVIDIA earn a $200,000 rebate from the power company, while at the same time providing a significant increase in total storage capacity.


Processing power: A single FAS6290 controller has 12 processing cores, all of which can be used to accelerate data processing and handle concurrent compute jobs. Storage clusters enable linear performance scaling up to 24 nodes.

Controller memory (DRAM): With 96 GB of memory per controller, metadata can be cached in base memory, which provides sub-millisecond response time for metadata. This is critical for accommodating very large active working set sizes.

Networking: Two IOH chips in the FAS6290 provide 72 PCIe Gen 2 lanes, which can be broken out further using switches to create 152 PCIe lanes of I/O connectivity, with total internal bandwidth in excess of 72 GB per second.
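As a rough sanity check on the bandwidth figure above, the sketch below multiplies the 72 PCIe Gen 2 lanes by the standard Gen 2 per-lane rate (5 GT/s with 8b/10b encoding, or 500 MB/s in each direction). Counting both directions gives a number of the same order as the one quoted. This is an assumption about how such a figure could be derived, not NetApp's stated method, and the function names are hypothetical.

```python
# Back-of-the-envelope PCIe Gen 2 bandwidth estimate.
# Assumption: the quoted figure counts both directions of every lane;
# the source does not say how it was calculated.

PCIE_GEN2_GT_PER_S = 5.0          # giga-transfers per second, per lane
ENCODING_EFFICIENCY = 8 / 10      # 8b/10b line encoding used by Gen 1 and Gen 2
BITS_PER_BYTE = 8


def lane_bandwidth_gb_per_s() -> float:
    """Usable bandwidth of one Gen 2 lane in one direction, in GB/s."""
    return PCIE_GEN2_GT_PER_S * ENCODING_EFFICIENCY / BITS_PER_BYTE


def aggregate_bandwidth_gb_per_s(lanes: int, bidirectional: bool = True) -> float:
    """Total bandwidth across `lanes` lanes, optionally counting both directions."""
    directions = 2 if bidirectional else 1
    return lanes * lane_bandwidth_gb_per_s() * directions


if __name__ == "__main__":
    per_direction = aggregate_bandwidth_gb_per_s(72, bidirectional=False)
    both_directions = aggregate_bandwidth_gb_per_s(72)
    print(f"72 lanes, one direction:   {per_direction:.1f} GB/s")
    print(f"72 lanes, both directions: {both_directions:.1f} GB/s")
```

The estimate covers only the 72 lanes provided directly by the IOH chips. Lanes fanned out through switches share upstream bandwidth, so they add connectivity rather than raw throughput.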

Bottom-Line Results

One of the IT team’s key objectives is maximizing the “CPU time-to-wall time ratio,” whereby wall clock time represents the total amount of time necessary to process the compute job, and CPU time measures the amount of time that the CPU is actively working on processing the task. The higher the ratio, the more efficient the compute factory. A 25% improvement in compute factory efficiency means that chip designs can be tested, validated and brought to market more quickly.
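The ratio described above can be sketched in a few lines of Python. The job timings are invented purely for illustration, the names JobTiming and cpu_to_wall_ratio are hypothetical, and each job is assumed to run on a single core, so the ratio cannot exceed 1. Nothing here reflects how NVIDIA actually collects or reports the metric.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JobTiming:
    """Timing for one compute job (hypothetical record layout)."""
    cpu_seconds: float   # time the CPU spent actively working on the job
    wall_seconds: float  # total elapsed time, including waits on storage I/O


def cpu_to_wall_ratio(jobs: Iterable[JobTiming]) -> float:
    """Aggregate CPU time divided by aggregate wall-clock time."""
    jobs = list(jobs)
    total_cpu = sum(job.cpu_seconds for job in jobs)
    total_wall = sum(job.wall_seconds for job in jobs)
    if total_wall <= 0:
        raise ValueError("total wall-clock time must be positive")
    return total_cpu / total_wall


if __name__ == "__main__":
    # Illustrative numbers only: the same CPU work, with less time spent
    # waiting on storage in the second case.
    slow_storage = [JobTiming(cpu_seconds=60, wall_seconds=100)]
    fast_storage = [JobTiming(cpu_seconds=60, wall_seconds=80)]

    before = cpu_to_wall_ratio(slow_storage)
    after = cpu_to_wall_ratio(fast_storage)
    print(f"ratio before: {before:.2f}")
    print(f"ratio after:  {after:.2f}")
    print(f"relative improvement: {(after / before - 1) * 100:.0f}%")
```

In this made-up case the CPU work is unchanged and only the time spent waiting shrinks, which is the mechanism by which faster storage raises the ratio.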

The end result of the NetApp storage systems, caching and other optimizations is that the overall processing efficiency of the compute factory has more than doubled, from 2 million compute jobs per day to 4.5 million. The overall CPU-to-wall time ratio has improved, with up to 19% improvement in wall-clock performance for compiles, and up to 25% improvement in simulation run times.

In addition to scaling performance, the NetApp systems have exceeded the NVIDIA reliability goals. According to Perumal, “We have stopped measuring storage uptime (after documenting availability greater than 99.99%) because our NetApp storage is always available when engineers need it.”

No portions of this document may be reproduced without prior written consent of NetApp, Inc. Specifications are subject to change without notice. NetApp, the NetApp logo and Go further, faster, are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.