Conventional DBA Wisdom and High Performance Flash Storage

We are at the greatest inflection point in storage since the inception of the beast pictured below, the world's first commercial hard drive: the IBM RAMAC. I am of course speaking of flash storage, which challenges a lot of conventional DBA wisdom.

High performance flash storage turns a lot of conventional DBA storage wisdom and lore on its head:

#1 Compress your data to boost performance

The raw IOPS horsepower of flash storage (RAIDed gen 1 ioDrives in this example) is enough to make scans CPU bound, to the extent that the overhead of decompression has the net effect of sending performance backwards:

The graph provides the perfect illustration of the problem that row and page level compression is designed to solve: the performance disparity between spinning disks and modern CPUs.
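The effect is easy to reproduce outside SQL Server: once the scan itself is no longer IO bound, the decompression step is pure extra CPU work. A minimal Python sketch of the principle, with zlib standing in for row/page compression and a synthetic, made-up payload:

```python
import time
import zlib

# Synthetic, highly compressible "row" data (~62 MB) - illustrative only.
raw = b"customer_id,order_total,region\n" * 2_000_000
compressed = zlib.compress(raw, level=6)

def scan(buf: bytes) -> int:
    # Stand-in for a table scan: one cheap pass over every byte.
    return buf.count(b"\n")

t0 = time.perf_counter()
rows_raw = scan(raw)
t_raw = time.perf_counter() - t0

t0 = time.perf_counter()
# With compressed data, the CPU must decompress before it can scan.
rows_comp = scan(zlib.decompress(compressed))
t_comp = time.perf_counter() - t0

print(f"uncompressed scan: {t_raw:.3f}s  compressed scan: {t_comp:.3f}s")
```

On slow disks the smaller on-disk footprint wins; when the device can feed the CPU at flash speeds, the decompress pass is the new bottleneck.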

I repeated this test with my much faster ioDrive 2 Duo, rated at 3GB/s for sequential reads. An INSERT ... SELECT based on this statement is used to populate a heap hash-partitioned across 24 partitions:

If we start off with a basic heap and then go all the way to a column store with archive compression, this is what we see for elapsed times:

The SELECT COUNT query runs in under a second for the column store; I suspect this is because the database engine picks count information directly out of the global dictionary. Below are the scan rates and CPU consumption per query/compression type:

I have omitted the scan rates and CPU consumption for the SELECT COUNT query against the column store because it runs so fast. The main takeaway from this data is that as we head towards page compression for the heap and the SELECT SUM query, the scan rate goes down and CPU consumption goes up (turbo boost allows for CPU consumption above 100%); the net effect is an increase in elapsed time. Column stores do not suffer from the CPU being swamped by the SELECT SUM query; I suspect this is largely because the huge CPU penalty of working out where columns begin and end in the page is not incurred for column stores:
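The column start/end point can be illustrated with a toy example: summing one column out of row-packed records means decoding every record to locate that column, whereas a columnar segment is summed directly. A hedged Python sketch (the three-column schema and row count are made up for illustration):

```python
import struct
from array import array

# Hypothetical schema: (id INT, qty INT, price FLOAT), packed row by row.
rec = struct.Struct("<iid")          # 16 bytes per record
rows = [(i, i % 10, 1.5) for i in range(100_000)]
row_pages = b"".join(rec.pack(*r) for r in rows)

# Row store: to sum 'qty' we must unpack every record just to find
# where the column sits inside each one.
qty_sum_row = sum(rec.unpack_from(row_pages, i * rec.size)[1]
                  for i in range(len(rows)))

# Column store: 'qty' lives in its own contiguous segment, so there is
# no per-row offset arithmetic at all.
qty_column = array("i", (r[1] for r in rows))
qty_sum_col = sum(qty_column)

assert qty_sum_row == qty_sum_col
print(qty_sum_col)
```

The real engine's segment processing is far more sophisticated than this, but the shape of the saving is the same: per-row decoding work disappears.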

#2 Large IO request sizes are needed to achieve maximum sequential throughput

Wrong again: because IO requests can be parallelized when talking to flash via the PCIe bus, the maximum IO throughput of a flash device can be achieved with 64K reads. My ioDrive 2 Duo, rated at 3GB/s for sequential reads, can deliver 3GB/s scan rates with 64K reads:
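A rough way to see how small reads can still saturate a device is to keep many 64K positional reads in flight at once. A Python sketch of the idea, where a scratch file stands in for a data file and the thread count is an arbitrary assumption:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024        # 64K reads, as in the ioDrive example
THREADS = 8              # arbitrary number of concurrent in-flight requests

# Illustrative scratch file (8 MB) standing in for a table's data file.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(CHUNK * 128))
size = os.fstat(fd).st_size

def read_chunk(offset: int) -> int:
    # os.pread lets many threads read the same fd at different offsets
    # without seeking, i.e. many outstanding small requests at once.
    return len(os.pread(fd, CHUNK, offset))

with ThreadPoolExecutor(max_workers=THREADS) as pool:
    total = sum(pool.map(read_chunk, range(0, size, CHUNK)))

print(total)
os.close(fd)
os.unlink(path)
```

With spinning disks, small reads mean more seeks; on flash over PCIe there is no head to move, so concurrency rather than request size is what drives throughput.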

#3 Put TEMPDB, Transaction Logs etc On Different Spindles

This table is borrowed from my blog post on singleton inserts. After experimenting with different key types and various 'tweaks' as documented below, I managed to insert 727,272 rows per second. And how was the storage configured for this? Everything was striped across a Fusion-io drive and two conventional SanDisk solid state drives . . . no separation of tempdb, transaction logs etc whatsoever.

HBA Queue Depth Settings

When configuring HBAs on the host, ensure that the Queue Depth is set to an optimal value. SQL Server applications are generally I/O-intensive, with many concurrent outstanding I/O requests. As a result, the default values for Queue Depth on HBAs are usually not high enough to support optimal performance. Currently the default value for the major HBA vendors is in the range of 8 to 32.

In our SQL Server testing, we have seen substantial gains in I/O performance when increasing this to 64 or even higher. It is worth noting that in these tests SQL Server was usually the only application using the storage array. It is important to discuss with your storage administrator the appropriate values for this setting, as the setting may affect other applications sharing the same storage environment. When Queue Depth is set too low, a common symptom is increasing latency and less-than-expected throughput given the bandwidth between host/storage and the number of spindles in a particular configuration.
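The "less-than-expected throughput" symptom falls straight out of Little's Law: sustained IOPS is roughly outstanding IOs divided by per-IO service time. A back-of-the-envelope sketch, where the 5 ms latency and 64K transfer size are assumptions chosen for illustration:

```python
# Little's Law: IOPS ~ queue depth / per-IO service time.
LATENCY_S = 0.005            # assumed 5 ms per IO (spinning-disk territory)
IO_SIZE_MB = 64 / 1024       # assumed 64K transfers

throughput = {qd: qd / LATENCY_S for qd in (8, 32, 64, 128)}
for qd, iops in throughput.items():
    print(f"QD {qd:3}: {iops:8.0f} IOPS ~ {iops * IO_SIZE_MB:6.0f} MB/s")
```

At a queue depth of 8 the array can only deliver around 100 MB/s under these assumptions no matter how much bandwidth sits between host and storage, which is why raising the HBA default pays off.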

Flash PCIe cards do not require high queue depths in order to deliver near-peak IO throughput. The screenshot below comes from a parallel full table scan of a heap on my ioDrive 2 Duo; the card is rated at 3GB/s for sequential reads, and 2845MB/s is being achieved with a queue depth of just 1.

Even with a LUN on most popular SANs, the LUN is carved out of a pool of disks, and you have no control over segregating IO down to spindle level. Joe Chang has a good blog post on his experience of using a Violin Memory array and the fact that checkpoints have no effect on latency (if I remember it correctly); I would wager that making special provision for logs, tempdb etc buys you nothing on such arrays either.

I hate SANs in their current incarnation with a passion: you struggle to get low latency writes to your transaction logs, they crumble under heavy random IO, and you often run out of IOPS before you run out of storage capacity. Only when the market has fully transitioned to all-flash arrays will SANs fulfill their value proposition; then the focus of the DBA will change from tuning for IO and tuning around IO to looking at where CPU cycles are being burned.