Why have we not seen TPC-C and TPC-E benchmarks using SSD storage?

Dell recently published a TPC-H report for the PowerEdge T610, 2 x Xeon 5570, with 4 FusionIO 80GB SSD storage devices at 100GB scale factor. So why have we not seen TPC-C or TPC-E OLTP benchmark results published?

Now it is much more feasible to run the TPC-H data warehouse benchmark on SSD because the Scale Factor 100 size is still allowed, for which the Lineitem table is 100GB for data alone, not counting indexes or other tables. The full SF100 TPC-H database is about 170GB for all tables and indexes; additional space is required for tempdb.

The TPC-C and TPC-E benchmarks require the database size to be scaled with performance target ranges. Consider the Fujitsu TPC-E published result for the Primergy RX300 S5 with 2 Xeon 5570. The dual-socket Xeon 5570 system scored 800 tps-E, for which the required initial database size is about 3TB. The space actually allocated for the data files is approximately 4.5TB, plus another 85GB for log space.

System: Fujitsu Primergy RX300 S5
Processors: 2 x Intel Xeon X5570
Memory: 96GB
RAID controllers: 5+1
Disk enclosures: 30
HDD: 360 (192 x 73GB 15K + 168 x 146GB 15K)
Storage cost: $148K + $49K for 3-year maintenance
Raw capacity: 35TB
RAID 10 capacity: 18TB
Estimated IOPS: 360 x 200 = 72K

For the 360 15K disk drives, at 200 IOPS per disk, the small-block random IOPS capability of this storage system is 72K, excluding RAID 10 overhead. If the actual load is 10,000 IOPS (at the operating system) with a 50/50 read/write mix, then the load at the disks is 5K reads plus 2 x 5K writes, for a total of 15K IOPS. So a 72K IOPS system can actually handle 48K IOPS at a 50/50 read/write mix in RAID 10.
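This read/write arithmetic can be sketched as a quick calculation (purely illustrative; the 200 IOPS per disk and the mirror-write penalty are the assumptions above):

```python
# RAID 10 IOPS budget: each OS-level write costs two disk writes (mirror pair),
# so the usable OS-level IOPS depends on the read fraction of the workload.
def raid10_os_iops(disks, iops_per_disk, read_fraction):
    raw = disks * iops_per_disk
    # Each OS read costs 1 disk IO; each OS write costs 2.
    cost_per_os_io = read_fraction + 2 * (1 - read_fraction)
    return raw / cost_per_os_io

print(raid10_os_iops(360, 200, 1.0))  # 72000.0 (pure read: the raw capability)
print(raid10_os_iops(360, 200, 0.5))  # 48000.0 (50/50 mix at the OS level)
```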

If we consider that the active database resides on only 15% of the disk space (3TB of 18TB after RAID 10 overhead), then there is some benefit from the short-stroke effect. If the average disk queue depth per disk were higher than 1, then command queuing would result in even higher IOPS per disk. The actual IOPS per disk might be anywhere from 200-300, depending on whether the emphasis was on pure performance or balanced price/performance.

Below is a proposed SSD (+ HDD for archival space) configuration.

SSD + HDD configuration

SSD capacity: 4.5TB
60-day space: 13TB
SSD drives: 155 x 32GB or 62 x 80GB
Cost for Intel SSD: $520 for X25-E 32GB ($80K total) or $340 for X25-M 80GB ($21K total)
HDD drives for 60-day space: 20 x 1TB SATA ($3,200) or 42 x 450GB SAS

In addition to the above, we need disk enclosures. Ideally I would like to place no more than 4-5 SSD devices on each x4 SAS port. A x4 3Gbps SAS port can support 1GB/s, but if an HBA/RAID controller with 2 x4 SAS ports is plugged into a x8 PCI-E gen 1 slot, we can only expect 1.6GB/s total (single direction) throughput. The Intel X25 SSDs are rated at 250MB/s sequential read, with sequential write at 170MB/s for the E and 70MB/s for the M.

The X25-E random 4K IO characteristics are 35K IOPS read and 3.3K IOPS write. In the absence of data, let's assume the 8K random read rate is half that, 17.5K IOPS (it is probably higher), so under 8K random IO the bandwidth requirement is only 140MB/s, and much less for read/write mixes.
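As a rough sanity check on the port budget, the numbers above can be worked through as follows (the 8K rate of half the 4K figure is an assumption; the port and sequential bandwidths are the nominal figures quoted):

```python
# Per-SSD bandwidth at 8K random reads vs. the SAS port budget (assumed figures).
iops_4k_read = 35_000
iops_8k_read = iops_4k_read / 2            # assumption: half the 4K rate
bw_random_mb = iops_8k_read * 8_000 / 1e6  # MB/s per SSD at 8K random read

port_bw_mb = 1_000   # x4 3Gbps SAS port, ~1GB/s
seq_read_mb = 250    # X25 rated sequential read

print(bw_random_mb)                     # 140.0
print(port_bw_mb // seq_read_mb)        # 4 SSDs saturate a x4 port sequentially
print(int(port_bw_mb // bw_random_mb))  # 7 at the 8K random read rate
```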

The data sheet also lists 7K IOPS at 8K 2:1 R/W for the X25-E. No random IO data is listed for the X25-M in the datasheet, so it is not clear whether the X25-M can meet the TPC-C/E random IO requirements for mixed R/W.

A 1U enclosure with 2 SAS ports (daisy-chained enclosures are not expected) and 8-10 2.5in bays seems appropriate. A 2U enclosure with 24 bays should have 6 independent SAS ports. The 20 or so 3.5in SATA drives for the 60-day space requirement could be accommodated between the internal bays and 1 or 2 external enclosures. The next generation of systems and components should be PCI-E gen 2 (5Gbps per lane) and 6Gbps SAS, but we expect higher SSD bandwidths as well.

So the SSD cost structure does seem to support the TPC-E benchmark, and it would probably support the TPC-C benchmark as well, based on the HP DL370G6 result.

The main issue above is that I have not included RAID overhead. It is my opinion that the SSD is not fundamentally a single-component device, like a disk drive with a single motor. If the SSD were built with dual controllers and chip-kill ECC on the NAND, then the SSD would be inherently tolerant of a single component failure. Of course, this is not the case yet; I am just looking forward to when we can do without RAID on SSD. I am not convinced RAID controllers are going to be able to keep up with SSD arrays anyway.

Given the high IOPS capability of the X25-E with SLC NAND (I am not sure about the X25-M with MLC NAND), the higher small-block random write overhead of RAID 5 is not an issue. So 190 of the 32GB or 77 of the 80GB SSDs would be required in RAID 5.

I should briefly touch on the expected performance benefits of SSD over HDD. The Dell TPC-H result did seem to indicate some benefit from SSD, even though there was no otherwise similar HDD result to compare with. While the TPC-H data warehouse queries may generate many table scans, for which HDD is fine, there are still loop joins and key lookups, which generate pseudo-random IO. Several TPC-H queries also dump intermediate results to tempdb.

I am expecting TPC-C and E to show reasonable benefits from SSD over HDD. Consider the main TPC-C new order transaction. A typical TPC-C published result might show an average response time of 0.3-0.4sec. This procedure processes an order for up to 15 items (average of 10?), which means one update of the Stock table for each item, one insert into the Order Line table, and one insert into the New Order table, plus a few others. Since the TPC-C database is very large, each of the above steps might require a disk IO. On a perfectly configured disk system (for OLTP), the average latency could be as low as 5ms even when the entire system drives 200K IOPS.

Still, if you look at the New Order procedure, it is clear each item must be processed serially. The SQL Server engine might use the Scatter-Gather IO API to consolidate IO calls from multiple concurrent users, but each step in the new order is issued sequentially, after the previous step completes. Since there are over 20 steps, if each step takes 5ms, then we can see why the average duration is well over 100ms.

With SSD, the IO latency should drop to 0.08 milli-sec (80us), meaning 20 steps should take on the order of 2ms. Because there are fewer transactions "in-flight" at any given point in time, the expectation is that the SQL Server engine has less to keep track of.

Consider a system supporting 600,000 tpm-C. That's 10,000 new order transactions per second. If each new order procedure averages 0.3sec, then there are about 3,000 new order transactions in flight at any point in time (plus others).
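The serial-step and in-flight arithmetic can be sketched as follows (the 20 steps, per-IO latencies, and 0.3s response time are the rough figures assumed in the discussion, not measured values):

```python
# New-order transaction: serial IO steps and in-flight count (assumed figures).
steps = 20          # serial steps in the new order procedure
hdd_io_ms = 5.0     # per-IO latency on a well-configured HDD array
ssd_io_ms = 0.08    # per-IO latency on SSD (80us)

print(steps * hdd_io_ms)  # 100.0 ms of IO wait per new order on HDD
print(steps * ssd_io_ms)  # ~1.6 ms on SSD

tpmC = 600_000
new_orders_per_sec = tpmC / 60   # 10,000 per second
avg_response_sec = 0.3
print(new_orders_per_sec * avg_response_sec)  # ~3,000 new orders in flight
```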

TPC-C also has performance/size scaling requirements. A 600K transactions per minute result requires approx 50,000 warehouses, each of which requires approx 84MB, for a database size of 4.2TB. The recent 600K tpm-C results required 1000+ disk drives (there is no RAID requirement), meaning the IOPS load is probably 200-300K, with a R/W mix possibly close to 50/50.
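A hypothetical check of the scaling rule (12.86 is the TPC-C spec's maximum tpm-C per warehouse; 84MB per warehouse is the figure above):

```python
import math

# TPC-C sizing: warehouses required for a target tpm-C, and the database size.
tpmC = 600_000
max_tpmC_per_warehouse = 12.86                         # TPC-C spec ceiling
warehouses = math.ceil(tpmC / max_tpmC_per_warehouse)  # minimum required
db_size_tb = 50_000 * 84 / 1e6                         # 84MB per warehouse

print(warehouses)   # 46657 (rounded up to ~50,000 in practice)
print(db_size_tb)   # 4.2 TB at 50,000 warehouses
```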

Since Wes says the FusionIO 640GB devices are out, let's consider what kind of system would be required. The FusionIO is built with a PCI-E interface, that is, it plugs into the PCI-E slot directly, so it presumably comes with its own driver. The second generation FusionIO matches up nicely with either PCI-E gen 1 x8 or PCI-E gen 2 x4 in terms of bandwidth.

For 4TB we need 7-8 of the 640GB drives. So, ideally, a system should be configured with 9 PCI-E gen 2 x4 slots plus embedded devices (the extra slot or two is for additional network adapters or SATA drives). The new Intel 5500 IOH has 36 PCI-E gen 2 lanes, plus the x4 gen 1 off the ESI. So a single IOH would support 9 x4 slots, plus GE and SAS off the ESI. The HP ML/DL370G6 actually uses 2 IOHs for a mix of x16, x8 and x4 slots.
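Counting devices and slots under those figures (640GB per device and 36 gen 2 lanes per IOH, as above):

```python
import math

# Device and slot counts for an all-FusionIO configuration (assumed figures).
db_gb = 4_200                     # ~4.2TB database
devices = math.ceil(db_gb / 640)  # FusionIO 640GB devices needed, minimum
ioh_lanes = 36                    # Intel 5500 IOH, PCI-E gen 2 lanes
x4_slots = ioh_lanes // 4         # x4 slots a single IOH can supply

print(devices)   # 7
print(x4_slots)  # 9
```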

Per Grumpy below, at this point in time SSD devices have very different characteristics, particularly with regard to writes. Writes to NAND need to be in large blocks, and depending on how the SSD controller is implemented, expect some issues. So it may not be time yet to deploy transaction processing to SSD; DW might be worth considering. Still, we should see OLTP benchmarks, plus accompanying details, to better understand SSD characteristics. Where are the Bashful, Doc, Dopey, Happy, Sleepy and Sneezy DBAs?

Comments

I've been playing with 4 x 32GB SSDs (mainly bought to replace the SCSI disks used for video editing; my wife complains the PC is noisy). What I've found is that although random IO is much faster, under heavy sequential load they tend not to have the throughput of the same number of 15K SCSI disks, but there's no noise, less heat and power, and I'm doubling the disks to get capacity. With big databases (> 1TB) the number of SSDs required makes it very expensive, and I'm not sure I'd actually want to run a critical business application on Intel SSDs; those SSDs touted for the enterprise are much more expensive than the Intel ones. I did talk to a couple of storage vendors about SSD for a project I was working on, and I'm told the number of SSDs deployed in, let's say, a typical SAN storage unit has to be limited to avoid overloading the backplane with IO. You also remarked at one point that the SQL optimiser may not work well with SSD; I had suggested placing tempdb on SSD.

There are two basic problems: no suitable hardware available (yet), and cost.

The read-write ratio of TPC-E is close to 90% reads / 10% writes, perfect for SSDs.

The problem is that there are basically no SAS SSDs out there; the Intel drives are SATA, and there are some technical limitations with regards to running SATA drives behind SAS expanders (usually found in external enclosures).

Another problem is that RAID cards are not fast enough to run SSDs; most RAID cards top out at 20K-25K IOPS, which, given the read/write ratio of TPC-E, is only 4 to 5 drives. TPC-E requires fault-tolerant storage, no more RAID 0, so we need additional capacity.

We found that given the read-heavy nature of TPC-E, RAID 5 is the preferred RAID level for SSDs.

While it might technically be TPC-legal to use and price SSDs for the benchmark and price spinning media for the 60-day space, it's still quite expensive.

Since there are basically no SAS SSDs out there yet, FusionIO seems to be an interesting alternative. The problems with FusionIO are the limited capacity and some performance issues with the driver. Since the driver does a lot more work than a regular StorPort driver, we see rather high system CPU with FusionIO compared to optimized drivers for traditional RAID cards.

Now as far as the real world is concerned, FusionIO is great for TempDB or staging tables.

Gunter

Disclaimer: This is my personal opinion, not an official HP statement.

Well, this explains it. You can't use SATA SSDs behind SAS expanders (of course this would apply to daisy-chaining SATA HDDs too), RAID controllers can't handle multi-SSD IOPS, and FusionIO still needs to work on their driver (it's probably good enough for most ordinary use). You never hear this stuff from marketing material.

This is why I like to look at the components used in TPC benchmarks, it must be able to withstand heavy abuse.

Perhaps Gunter might casually mention what big iron Nehalem-EX system we might expect from a vendor that should remain nameless.

An 8-socket system is definitely expected, as there is already one for Opteron. But perhaps a 16 or even 32 socket system, since the nameless vendor has already built a cross-bar for the Intel QPI. The question is then whether Windows/SQL Server 2008 R2 can use 32 x 8 x 2 (sockets x cores x threads) = 512. I am not sure what Microsoft means by a processor group: does it mean a single process can only use 64 processors, or is there a way to use multiple groups?

Well, all I can say is that Windows Server 2008 R2 and SQL Server 2008 R2 will support more than 64 logical processors, and that the largest Windows system in existence today has 256 logical processors (from the same nameless vendor, I should add); performance and scaling are good. Using the new APIs, one process can use all processors.

There are some good WinHEC presentations regarding the new APIs (google).

I am eagerly awaiting the new big iron Windows systems. In particular, I would like to take measurements on Hyper-Threading in Nehalem. I had a good set of data on both Pentium 4 generations (Northwood and Prescott), but I never got the chance to evaluate HT on Montvale/Montecito.

I know this is an older thread, but I found it, via Google, because I was looking for comments about using SATA SSDs behind SAS expanders.

And I see that you say that for technical reasons it's not a good idea, and I found another post at another website that simply said "=bad".

But is there any more information to add? Is it because it doubles the latency? I mean, I was measuring about 0.18ms latency on seeks direct-attached to an ICH10 motherboard, and measuring over 0.3ms when using an Adaptec card and a SAS expander.

Well, the Adaptec 5805 has been widely reported to add latency, so that is part of the issue, but I'm trying to figure out if, additionally, you get a lot of latency out of the SAS expander, so just replacing the card may not be a solution.

My SSDs are doing about half the IOPS in the server that they do when attached to my $700 workstation!

Does anyone have any numbers or more specifics on the issue with SATA SSDs and SAS expanders? Thanks!

I am inclined to think that the technical issues mentioned by Gunter are more than latency. The SAS protocol has greater scope than SATA, and I am inclined to guess that this might be part of the issue.