It's been a long time since the last blog post on SSD benchmarking - I've been busy! I'm starting up my benchmarking activities again and hope to post more frequently. You can see the whole progression of benchmarking posts here.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me. My test systems now have 16GB of memory each, and all tests were performed with the buffer pool ramped up, so memory allocation didn't figure into the performance numbers.

In this recent set of tests I wanted to explore three questions:

What kind of performance gain do I get upgrading from Fusion-io's v1.2 driver to the v2.2 driver?

What is the sweet spot for the number of files on an SSD?

Does a 4KB sector size give any gains in performance for my test?

To keep it simple I'm using one half of the 640GB drive (it's two 320GB drives under the covers). For each test, my test harness does the following:

I have 64 connections each inserting 2.5 million records into the table (with the loop code running server-side) for a total of 160 million records inserted, in batches of 1000 records per transaction. This works out to be about 37 GB of allocated space from the database.

A clustered index on a random GUID is the easiest way to generate random reads and writes, and it's a very common design pattern out in the field (even though it performs poorly) – for my purposes it's perfect.
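As a quick sanity check on the workload numbers above, the arithmetic can be sketched like this (the constants come straight from the text; the bytes-per-row figure is derived from the ~37 GB total, not measured):

```python
# Workload arithmetic from the harness description above (constants from the
# text; the bytes-per-row figure is derived, not measured).
CONNECTIONS = 64
ROWS_PER_CONNECTION = 2_500_000
ROWS_PER_TXN = 1_000

total_rows = CONNECTIONS * ROWS_PER_CONNECTION             # 160 million rows
txns_per_connection = ROWS_PER_CONNECTION // ROWS_PER_TXN  # 2,500 transactions
total_txns = CONNECTIONS * txns_per_connection             # 160,000 transactions

# ~37 GB of allocated space implies roughly this many bytes per row on average
bytes_per_row = 37 * 1024**3 / total_rows

print(total_rows, total_txns, round(bytes_per_row))
```

That works out to roughly 250 bytes per row of allocated space, including index and page overhead.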

I tested each of the eight data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, RAID was not involved):

Log and data on a single 320GB SSD with the old v1.2.7 Fusion-io driver (each of the 3 ways of formatting)

Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver (each of the 3 ways of formatting)

Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver and a 4KB sector size (each of the 3 ways of formatting)

That's a total of 9 configurations, with 8 data file layouts in each configuration – making 72 separate test cases. I ran each test case 5 times and took the average of the results – so altogether I ran 360 tests, for a cumulative test time of just over 1.43 million seconds (16.5 days) during April.
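The test matrix above, as arithmetic. The driver-setup and format counts come from the text; the 1-to-128 doubling sequence of file layouts is my reading of the file counts mentioned later in the post:

```python
# The test matrix described above, as arithmetic (driver/format counts from
# the text; the doubling sequence of file layouts is an assumption).
driver_setups = 3                        # v1.2, v2.2, v2.2 with 4KB sectors
formats = 3                              # regular, Improved Write, Max Write
file_layouts = [2**n for n in range(8)]  # 1, 2, 4, 8, 16, 32, 64, 128 files
runs_per_test = 5

configurations = driver_setups * formats         # 9
test_cases = configurations * len(file_layouts)  # 72
total_runs = test_cases * runs_per_test          # 360

avg_seconds_per_run = 1_430_000 / total_runs     # ~3,972 s (~66 minutes)
print(configurations, test_cases, total_runs, round(avg_seconds_per_run))
```

So the average run took a little over an hour, which squares with the completion times discussed below.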

The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. This uses the script from this blog post.

On to the results… the wait stats are *really* interesting!

Note: the y-axes in the graphs below do not start at zero. All the graphs have the same axes so there is nothing to misunderstand. They are not misleading – and I make no apologies for my choice of axes – I want to show the difference between the various formats more clearly.

v1.2.7 Fusion-io driver and 512-byte sector size

The best performance I could get with the old driver was 3580 seconds for test completion, with 4 data files and the Improved Write format – a tiny amount less than the time for 8 data files.

v2.2.3 Fusion-io driver and 512-byte sector size

The best performance I could get with the new driver was 2993 seconds for test completion, with 8 data files and the Max Write format - a tiny amount less than the time for other formats for 8 data files, and very close to the times for 4 data files.

On average across all the tests, the new v2.2 Fusion-io driver gives a 20.5% performance boost over the old v1.2 driver, and for the regular format the boost is 24%. It also reduces the amount of system memory required to use the SSDs (though I didn't measure by how much). Good stuff!
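For reference, the improvement between two completion times is just the percentage reduction in elapsed time. Comparing only the two best times quoted above gives a smaller number than the 20.5% all-tests average, because the best runs use different file layouts and formats:

```python
def improvement_pct(old_seconds: float, new_seconds: float) -> float:
    """Percentage reduction in elapsed time going from old to new."""
    return (old_seconds - new_seconds) / old_seconds * 100

# Comparing only the two best completion times quoted above: 3580s on the
# v1.2.7 driver vs 2993s on v2.2.3. This best-case-only comparison is narrower
# than the 20.5% average across all tests.
best_case = improvement_pct(3580, 2993)
print(round(best_case, 1))
```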

v2.2.3 Fusion-io driver and 4KB sector size

The performance using a 4KB sector size is roughly the same for my test as with the traditional 512-byte sector size. The largest gain I saw was 3% over the 512-byte sector size, but on average across all tests the performance was very slightly (0.5%) lower.

The columns in the wait stats output below are as follows:

Wait_S – cumulative wait time in seconds, from a thread being RUNNING, going through SUSPENDED, back to RUNNABLE and then RUNNING again

Resource_S – cumulative wait time in seconds while a thread was SUSPENDED (called the resource wait time)

Signal_S – cumulative wait time in seconds while a thread was RUNNABLE (i.e. after being signalled that the resource wait has ended and waiting on the runnable queue to get the CPU again – called the signal wait time)

WaitCount – number of waits of this type during the test

Percentage – percentage of all waits during the test that had this type

AvgWait_S – average cumulative wait time in seconds

AvgRes_S – average resource wait time in seconds

AvgSig_S – average signal wait time in seconds
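The three Avg* columns are just the totals divided by the wait count, and Wait_S is the sum of the resource and signal components. A minimal sketch of those relationships (the sample row is made up, with numbers chosen to land on the 4.2ms resource / 0.7ms signal averages discussed later in the post):

```python
def wait_stat_averages(resource_s: float, signal_s: float, wait_count: int) -> dict:
    """Derive Wait_S and the Avg* columns from the raw totals."""
    # Total wait: RUNNING -> SUSPENDED (resource) -> RUNNABLE (signal) -> RUNNING
    wait_s = resource_s + signal_s
    return {
        "Wait_S": wait_s,
        "AvgWait_S": wait_s / wait_count,
        "AvgRes_S": resource_s / wait_count,
        "AvgSig_S": signal_s / wait_count,
    }

# Hypothetical PAGEIOLATCH_EX row: 420s resource wait, 70s signal wait,
# 100,000 waits -> 4.2ms average resource wait, 0.7ms average signal wait
row = wait_stat_averages(resource_s=420.0, signal_s=70.0, wait_count=100_000)
print(row)
```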

For a single file, the wait stats for all the various formatting options look like:

With more files, the percentage of PAGEIOLATCH_EX waits increases, and by the time we get to 8 files, SOS_SCHEDULER_YIELD has started to appear. At 8 files, the wait stats for all the various formatting options look like:

By 16 files, the PAGELATCH waits have disappeared from the top 95%. As the number of files increases to 128, the PAGEIOLATCH_EX waits increase to just over 91% and the wait stats look like this for regular format:

What does this mean? It's obvious from the wait stats that as I increase the number of data files on the drive, the average resource wait time for each PAGEIOLATCH_EX wait increases from 4.2ms for 1 file up to 6.3ms for 128 files – 50% worse, with the signal wait time static at 0.7ms.

But look at the wait stats for 128 files using the Maximum Write format:

The average resource wait time for the PAGEIOLATCH_EX waits has dropped from 6.3ms to 3.7ms! But isn't PAGEIOLATCH_EX a wait type that's for a page *read*? Well, yes, but what I think is happening is that the buffer pool is having to force pages out to disk to make space for the pages being read in so rows can be inserted into them (which I believe is included in the PAGEIOLATCH_EX wait time) – and when the SSD is formatted with the improved write algorithm, this flushing is faster, so the PAGEIOLATCH_EX resource wait time decreases.

But why the gradual decrease in PAGELATCH waits and increase in SOS_SCHEDULER_YIELD waits as the number of files increases?

I went back and ran a single file test and used the sys.dm_os_waiting_tasks DMV (see this blog post) to see what the various threads are waiting for. Here's some example output:

Dividing 1520544 by 8088 gives exactly 188. Running the same query a few seconds later gives most of the threads waiting on resource 5:1:1544808, another exact multiple of 8088. These resources are PFS pages! What we're seeing is PFS page contention, just like you can get in tempdb with lots of concurrent threads creating and dropping temp tables. In this case, I have 64 concurrent threads doing inserts that are causing page splits, which requires page allocations. As the number of files increases, the amount of PFS page contention decreases. It disappears after 8 files because I've only got 8 cores, so there can only be 8 threads running at once (one per SQLOS scheduler, with the others SUSPENDED on the waiter list or waiting in the RUNNABLE queue).
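A PFS page occurs at page 1 of a SQL Server data file and then every 8,088 pages after that, so spotting one from a waiting-tasks resource is a simple modulo check (a sketch; the two page IDs are the ones from the output above):

```python
# In a SQL Server data file, page 1 is the first PFS page and another PFS
# page follows every 8,088 pages, so a page ID from sys.dm_os_waiting_tasks
# can be tested with a simple modulo check.
PFS_INTERVAL = 8088

def is_pfs_page(page_id: int) -> bool:
    """True if this page ID within a data file is a PFS page."""
    return page_id == 1 or page_id % PFS_INTERVAL == 0

# The two contended resources seen above: 5:1:1520544 and 5:1:1544808
for page_id in (1520544, 1544808):
    print(page_id, page_id // PFS_INTERVAL, is_pfs_page(page_id))
```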

From 16 to 128 files, the wait stats hardly change and the performance (in the Improved Write and Max Write formats) only slightly degrades (5%) with each doubling of the number of files. Without deeper investigation, I'm putting this down to increased amounts of metadata to deal with – maybe with more to do when searching the allocation cache for the allocation unit of the clustered index. If I have time I'll dig in and investigate exactly why.
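If the roughly 5% hit per doubling holds, the cumulative slowdown from 16 to 128 files compounds like this (a back-of-the-envelope check on my own numbers, not a measurement):

```python
# Back-of-the-envelope: if each doubling of the file count costs ~5%, going
# from 16 files to 128 files (three doublings) compounds to roughly 16%.
per_doubling = 1.05
doublings = 3                 # 16 -> 32 -> 64 -> 128
cumulative = per_doubling ** doublings
print(f"{(cumulative - 1) * 100:.1f}% slower at 128 files than at 16")
```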

The SOS_SCHEDULER_YIELD waits are just because the threads are able to do more before having to wait, and so they're hitting voluntary yield points in the code – the workload is becoming more CPU bound.

Summary

I've clearly shown that the new Fusion-io driver gives a nice boost compared to the older one – very cool.

I've also shown that the number of files on the SSD does have an effect on performance – with the sweet spot appearing to be the number of processor cores. I'd love to see someone do similar tests on a 16-way, 32-way or higher (or lend me one to play with :-)

[Edit: I discussed the results with my friend Thomas Kejser on the SQLCAT team and he sees the same behavior on a 64-way running the same benchmark (in fact we screen-shared on a 64-way system with 2TB and 4 640GB Fusion-io cards this weekend). He posted some more investigations on his blog – see here.]

And finally, I showed that for my workload, using a 4-Kb sector size did not improve performance.

Very interesting. We use RAM drives on some of our servers for tempdb and full-text catalogues; I wonder if there’d be any similarity between that and your FusionIO findings regarding the number of database files? Time to test it I guess. :-)

Hey Dave – I don’t see why not, as the principle is the same – especially as Fusion-io drives are server-side rather than attached through some network fabric. I’d love to hear of your results if you get time to do any testing!

Your update is very timely as we are about to purchase Fusion-IO cards to offload tempdb storage from our SAN. I was debating the benefit using a larger sector size and your benchmark numbers help very much.

Hi again Paul. I’ll definitely post the results of the RamDisk comparison. We did have some eval FusionIO’s recently, but have since given them back (bugger!). But we do have a nice new EMC V-Max SAN with plenty of solid state & FC disk, so I’m going to do some comparisons between that (solid state & FC) and the RamDisk. I was wondering if you could give me a little more info on your test rig? For example, are you using some kind of load tool to spin up the 64 connections? Sorry if you’ve given this detail somewhere else; I searched but couldn’t find it. Thanks.

Paul – reading some white papers on this type of storage, a concern seems to be deterioration of performance as they fill up with data (i.e. slower writes to a full card than an empty one.) I wonder if that type of test is in your plans – maybe run a test to data files that fill 1/2 an empty card, then a test to fill a card from 50% to 99%? Also wonder if you would like another vendor’s cards for comparison, or if that’s less interesting :-)

Merrill, filling up a flash based SSD in a high i/o scenario basically equates to mis-use of the device.

Data cannot be over-written in place in flash memory & hence is relocated somewhere else upon every write (at block scope). If you fill up the device, there’s nowhere for this relocation to occur & writes then have to wait on previously used blocks to be "flashed" (zero’d, ready for new writes), hence "over-provisioning" is required with flash based SSDs. The higher your i/o rate, the more you have to "over provision" (smaller File System allocation vs device capacity, or in other words, the device is "over provisioned" against the FS allocation).

We provision SSDs by default for SQL Server with 25% over-provision (FS = 75% of device capacity) and sometimes go lower (eg 50%) in extreme i/o scenarios. As a result, we’ve never seen device degradation after years of use in high volume sites. If you didn’t over-provision however, you could expect the degradation you’ve described but this just means the device wasn’t correctly specified.

Another insight Paul: The reason you see lower latency on PAGEIOLATCH at 128 could be caused by the characteristics of the Fusion cards. A wide I/O pattern with few outstanding IOPS on each thread is more efficient on NAND than a narrow pattern with many outstanding on each thread.

I had a RamSAN at my last company and compared it to my Fusion-IO drives I had locally. I didn’t run the same tests as Paul, just a standard IOStats set for each of the drives. Basically what I found is that the RamSAN could do the same numbers but was limited by whatever network pipe you were connecting to it with. In our case a 4GB fiber. Unless you can get a network pipe to it that can get you faster speed I think it will be the deciding factor on the performance you get. :)