with consideration for random and sequential IO patterns, and in certain cases possibly also separation of low-queue and high-queue patterns, but this is not always possible. It also helps to know how to estimate the theoretical performance in IOPS and bandwidth for a given number of disks and IO channel, and then test to see how your configuration compares to the expected characteristics.

It is also necessary to have a basic idea of the capabilities and limitations of each component or bus in this chain. Storage performance cannot be achieved with magic/secret registry settings or other incantations.

A dozen 1TB 7200RPM supporting data, temp and log files, however impressive the capacity seems to be, will have poor performance by database standards no matter what secret settings are applied.

Nor is performance achieved with a grossly overpriced SAN storage system, with relatively few big capacity disk drives, configured in complete disregard of the principals of disk system performance.

Reference Configuration

Without getting deep into concepts, I will provide a simple example of what I consider a balanced storage system configuration. The objective for this reference configuration is the ability to sustain transaction processing throughput with no more than minor degradation during a moderately heavy reporting query. The configuration is also suitable for data warehouse workloads.

A 4-way server, that is a system with four processor sockets, on which the current generation Intel Xeon 7400 (now 7500) series and Opteron 8400 series processors have six (or 8) cores per socket, the reference storage configuration is 4-5 controllers and 120 15K disk drives as detailed below. (Intel finally announced the Xeon 7500/6500 series in the middle of writing this, so I will have make adjustments later.)

Processors

Intel Xeon X7560or Intel Xeon X7460or Opteron 8439

Cores

4 x 8 = 32 (X7560)4 x 6 = 24

Memory

64 x 4GB = 256GB (X7500)32 x 4GB = 128GB

HBAControllers

4-5 Dual-Port 4/8 Gbit/s FC or 4-5 6Gb/s SAS with 2x4 ports

IO Channels

8-10

Disk Drives

120 x 15K

Disks per channel

12-15

This is only a reference configuration. With direct-attach storage, eminently suitable for data warehouse, should have an amortized cost per disk of $500-600, for a total of $60-70K. In a SAN, the cost per disk might range from $1500-3000 per disk, for a total cost of $180-360K. A SAN vendor will probably attempt to substitute 600GB 15K disk instead of the low capacity models. This will push cost per disk to over $6K, usually resulting in a storage system with far too few disks.

At this time, a SAN is required for clustering. In the past, Windows supporting clustering on SCSI, with two hosts on the same SCSI bus. But this capability was removed as customers seemed anxious to buy very expensive SAN storage. The SAS protocol also supports two hosts connected to the same SAS network, so it should also be possible to enable clustering, but Microsoft does not currently support this.

A really high-end storage system could have over 1000 disk drives. This does need not be a single storage system, it could be multiple systems. Of course, for exceptional random IO needs, a serious effort should be made to determine if solid-state storage can be implemented to keep the spindle count manageable.

Sizing Guidance

If your storage vendor opens with a question as to your capacity requirements, don't waste anymore time. Just throw the rep out and proceed to the next vendor.

Performance Targets

For calculation purposes, I am going to assume 100 of 120 disks are allocated for data and temp, and the remaining 20 for other purposes including logs. In actuality, if only 4 disks are required for logs, then 116 disks would be allocated to data and temp.

Random IOPS

low queue, full-stroke

185 per 15K 3.5in disk,205 per 15K 2.5in disk, 17-22K tot

high queue, full-stroke

300+

low queue, short-stroke

250+ per disk, 25K tot

high queue, short-stroke

400+

Sequential

4Gb/s FC

330-360MB/s per port,720MB/s per dual-port HBA,2.6-3.0GB/s+ total

2x4 3Gb/s SAS RAID controller in x8 PCI-E Gen 1 slot

0.8GB/s per x4 port, 1.6GB/sec per adapter6GB/s+ or system limit*

2x4 6Gb/s SAS RAID controller in x8 PCI-E Gen 2 slot

1.6GB/s per x4 port, 2.8GB/sec per adapter10GB/s+ or system limit*

Short-Stroke Effect

The short-stroke effect is absolutely essential for transaction processing systems with tight mandatory limits on responsiveness. The short-stroke effect lowers latency and improves random IO performance. Most importantly, the short-stroke effect keeps latency low during heavy IO surges when active data is kept in a very narrow band of the disk. On a fully populated disk where full strokes are required, latency can jump to several hundred milli-sec during heavy IO surges.

One of the fundamental arguments made by SAN storage vendors is that by consolidating storage, it is possible to achieve high storage utilization, i.e., guaranteeing the full-stroke criteria. A heavy IO surge is very likely to cause transaction processing volume to collapse. To benefit from the short-stroke effect, it is necessary to restrict the active data to a very narrow range. The remaining disk capacity can still be used for data not in use during busy hours. This means having far more capacity than the active database, which in turn implies that it is essential to keep amortized cost per disk low, i.e., forgoing frills.

System Memory and IO bandwidth

The general concept here: the server system has several PCI-E slots. If the objective include IO bandwidth, plan on using the PCI-E slots, instead of leaving them empty.

The previous generation of Intel systems (and Server Systems 2009 Q3) built around the 5000P/X and 7300 chipset may have been limited to 3GB/sec in realizable IO bandwidth, regardless of the apparent bandwidth of the IO slots. There is no clear source for the realizable IO bandwidth of a 4-way Opteron system. An authoritative source indicated that the 8-way Opteron platform could achieve 9GB/sec in IO bandwidth, with approximately 7GB/sec realized from a SQL Server query. This may have been the TPC-H Pricing Summary Report, which is moderately CPU-intensive for a single table scan query, so the 9GB/sec value might be achieved in other SQL queries. It is reasonable to suppose that a 2009-10 generation 4-way Opteron should be able to achieve 4.5GB/sec or higher, but actual documentation is still desired.

The Intel Nehalem generation servers (Xeon 5500 and 5600 series, and the 6500 and 7500 series should be able to sustain phenomenal IO bandwidth, but I have yet to get my hands on a system with truly massive IO brute force capability.

The system memory bandwidth contribution is more complicated. Consider that a read from disk is also a write to memory, possibly followed by a read from memory. A system with SDRAM or DDR-x memory, the cite memory bandwidth is frequently the read bandwidth. The write rate to SDR/DDR memory is one-half the read rate. So the IO bandwidth might be limited to one-third the memory bandwidth, regardless of the bandwidth of the PCI busses. In the past there was a system with 2 DDR memory channels (64 bit or 8 byte wide) at 266MHz has a read bandwidth of 4,264MB/sec. The maximum disk IO bandwidth possible was around 1,350MB/sec, even though the system had two PCI-X 100/133MHz busses.

The more recent Intel chipsets, including the 5000 and 7300, have FB-DIMM which uses DDR, but have a separate device on the memory module. This allows simultaneous read and write traffic at full and half-speed. The 5000P chipset has 4 memory channels. With DDR2-667, the memory bandwidth is 5.3GB/s per channel or 21GB/sec system total for read, and 10.5GB/s for write. There are reports demonstrating 10GB/sec IO bandwidth, or even 7GB/s. The PCI-E bandwidth over 28 PCI-E lanes is 7GB/s unidirectional.

PCI-E Nominal and Realizable Bandwidth (bi-directional)

The table below shows PCI-E nominal and realizable bandwidths in GB/sec. PCI-E gen 1 (or just PCI-E) signals at 2.5Gbit/s. After 8B/10B (or is it 10B/8B?) overhead, the nominal bandwidth is 250MB/sec per lane per direction. Keep in mind PCI-E has simultaneous bi-directional capability. So a PCI-E x4 slot has a nominal bandwidth of 1GB/sec in each direction. Actual test transfers show that the maximum realizable bandwidth for a PCI-E x4 slot is approximately 800MB/sec. PCI-E gen 2 signals at 5.0Gbit/s or 500MB/sec per lane per direction, or double then gen 1 bandwidth for a given bus width.

Slot width

PCI-E Gen 1

PCI-E Gen 2

x4

1.00.8

2.01.6

nominalrealizable

x8

2.01.6

4.03.2

nominalrealizable

Systems of the Intel Core 2 processor architecture generation (Xeon 5100-5400 and Xeon 7300-7400 series) are almost exclusively PCI-E gen 1, as are the accompanying chipsets: the 5000P and 7300. The Intel 5400 MCH did support PCI-E gen 2, but no tier-1 system vendor produced a server with this chipset. (Supermicro, popular with white-box builders, did have 5400-based motherboards.) Systems of the Intel Nehalem generation and later have PCI-E gen 2. If someone could advise on when AMD Opteron transitioned from PCI-E gen 1 to gen 2, I would appreciate it.

Serial Attached SCSI (SAS)

SAS started out with 3.0Gbit/sec signaling. Unlike SATA, SAS appears to be used only with a x4 wide connection. Most SAS adapters have 2 x4 ports. The HP Smart Array P800 has 4 x4 ports. The nominal bandwidth of a x4 3Gb/s SAS connection is 12Gbit/sec. The realizable bandwidth appears to be 1.0-1.1GB/sec.

Unfortunately, this is not matched with the bandwidth of a PCI-E gen 1 slot. To realize more than 800MB/sec from a single x4 SAS channel requires a x8 PCI-E gen 1 slot, which in turn, results in under-utilizing the PCI-E slot or not achieving balance between the 2 x4 SAS ports. Since most adapters have 2 x4 ports, the maximum realizable bandwidth in a x8 PCI-E gen 1 slot is 1.6GB/sec. Some of the early PCI-E SAS adapters have an internal PCI-X bus that limits realizable bandwidth over both x4 SAS ports to 1GB/sec.

Server systems usually have some combination of x16, x8 and x4 slots. No server adapter relevent to databases can use more bandwidth than that provided by a x8 slot, so each x16 slot could have been 2 x8 slots, for a waste of an otherwise perfectly good x8 slot. The x4 slots are usually a good match for network adapters. A PCI-E gen 2 x4 slot is exactly matched to 2 x 10GbE ports.

Matching the available x16 and x8 slots to storage controllers is not always possible. Sometimes it may be necessary to place one or more SAS storage controllers in the x4 slots, in which case it is important to distribute the number disks behind controllers in x8 and x4 slots proportionately as appropriate.

In the last year, 6.0Gb/s SAS adapters and disk drives became available. The same bandwidth mismatch situation between 3Gb/s SAS and 2.5 Gb/s PCI-E gen 1 also occurs with 6Gb/s SAS and 5Gb/s PCI-E gen 2. In addition, LSI Logic states that their 6Gb/s SAS controller has a maximum combined bandwidth of 2.8GB/sec over both x4 SAS ports.

SAS RAID Controllers

For direct-attach storage, the SAS adapter is frequently also a RAID controller. Most SAS RAID controllers are built around LSI Logic silicon, notably the LSI SAS 1078 for 3Gb/s SAS and the new SAS 2008 for 6Gb/s SAS and 5Gb/s PCI-E gen 2. Intel used to make a PCI-E to SAS RAID controller built around the 80333 IO Processor, but mysteriously dropped out of the market soon after releasing the new 81348 IOP in 2007. There might be another vendor as I am not sure who makes the controller for the HP P800.

LSI has a 4 x4 PCI-E gen 2 6Gb/s SAS RAID Controller, listing a LSI SAS 2116. It is unclear if this is a variation of the 2008 or just two die in board.

Fibre Channel (FC)

It is Fibre channel to emphasize that the media is not necessarily fiber. Or it might be that some one thought fibre was more sophisticated. For a long time FC signaling stayed put at 4Gbit/sec, which I consider to be a serious mistake. The mistake might have also been in staying with a single lane, unlike SAS which employed 4 lanes as the standard connection.

Anyways, a dual-port 4Gb/s FC HBA is a good match for a PCI-E x4 slot. To make best use of system IO bandwidth, the x8 slot should be populated with a quad-port 4Gb/s FC HBA. Some Intel 4-way Xeon systems with the 7300MCH have one or two PCI-E bridge expanders, that allow two x8 slots share the upstream bandwidth of one x8 port. In this case, it is recommended that one slot be populated with storage controller and the other with a network controller, as simultaneous heavy traffic is predominately in opposite directions.

PCI-E Gen 1

PCI-E Gen 2

x4

x8

x4

x8

4Gb/s FC

dual-port

quad-port

quad-port

?

8Gb/s FC

single-port

dual-port

dual-port

quad-port?

A dual-port 8Gb/s FC HBA should be placed in a x8 PCI-E slot or a x4 PCI-E gen slot. I am not aware that there are any quad-port 8Gb/s FC HBAs for gen 2 x8 slots, much less an 8-port for the gen 2 x16 slot.

Fibre Channel HBAs

There are currently two main vendors for FC controllers and HBAs, Emulex and QLogic. Keep in mind that the SAN controller itself is just a computer system and also has HBAs, for both front-end and back-end as applicable, from these same FC controller vendors. It might be a good idea to match the HBA, firmware and driver on both the host and SAN, but this is not a hard requirement.

On the Emulex HBAs, driver settings that used to be in the registry are now set from the HBAnyware utility. Of particular note are the pairs Queue Depth and Queue Target, and CoalesceMsCnt and CoalesceRspCnt. Various Microsoft documents say that increasing Queue Depth from the default of 32 to the maximum value 254 can improve performance without qualifications.

In life, there are always qualifications. At one time, this queue depth setting was for the entire HBA. Now on Emulex, the default is per LUN, with the option being per target. The general concept is that in a SAN storage system with hundreds of disk drives, limiting the queue depth generated by one server helps prevent overloading the SAN. Well, a line-of-business SQL Server database means it runs the business, and it is the most important host. So increasing the queue depth allowed helps.

Notice that I said a storage system with hundreds of disks. What if the storage system only has 30 disks? Does increasing queue depth on the HBA help? Now that Emulex defaults to per LUN, what if each LUN only comprises 15 disks? The Microsoft Fast Track Data Warehouse papers recommend LUNs comprised of 2 disks. What should the per LUN queue depth be?

My thinking is it should be any where from 2 to 32 per physical disk in the critical LUN. The disk itself has command queuing for up to 64 tasks (128 on current Seagate enterprise drives?). Piling on the queue increases throughput at the expense of latency. In theory, restricting the queue depth to a low value might prevent one source from overloading the LUN. An attempt to test this theory showed no difference in queue depth setting over a certain range.

Note: Queue depth has meaning at multiple locations: at the operating system, on the HBA, on the SAN storage controller, possibly both front and back-end HBAs, and on the disk drive itself.

SAN Storage Systems

As Linchi Shea pointed out, SAN stands for Storage Area Network. A storage system that connects to a SAN is a SAN based storage system. But it is common refer to the SAN based storage system as the SAN.

Many documents state that the bandwidth achievable in 2Gb/s FC is in range of 160-170MB/sec, and 320-360MB/sec for 4Gb/s FC. Nominally, 4G bits translates to 500M bytes decimal. Lets assume that there is a protocol overhead of 20% leaving 400M. Then translate this to MB binary where 1MB = 1,048,576 bytes. So 400MB decimal is really 380MB binary. So there is still a gap between observed and nominal bandwidth. Back in 2Gb/s FC days, I investigated this matter, and found that it was possible to achieve 190MB/sec from host to SAN cache, but only 165MB/sec from host to storage controller, then over the back-end FC loop to disk and back. This disks in the back-end are in a loop, with 15-120 disks in one loop path. It is possible that the number of disks in a loop influences that maximum achievable bandwidth.

In the 4Gb/s FC generation, EMC introduced the UltraPoint DAE with star-point topology to disks with an enclosure. This might be what allows EMC to achieve 360MB/s per 4Gb/s FC port.

Most SAN storage systems today are 4Gb/s on the back-end. The front-end might be able to support 8Gb/s FC. SAN vendors are usually not quick about moving to the latest technology. On the front-end, it only involves the HBA. The back-end is more complicated, also involving the disk drives and the enclosures, which might have custom FC components. Personally, I think storage vendors should just ditch FC on the back-end for mid-range systems and go to SAS like the Hitachi AMS. Otherwise customers should ditch the mid-range and go with multiple entry-level systems.

The SAN configuration employs four dual-port HBAs and four fiber channel loop pairs on the backend. Each FC loop pair consists just that, two FC loops, each loop connected to a different storage/service processor (SP) depending on SAN vendor specific terminology.

EMC CLARiiON CX4

Some details of the EMC CLARiiON CX4 line is show below. Each Clariion system is comprised of two Service Processors (SP). The SP is simply a Intel Core 2 architecture server system.

CX4 120

CX4 240

CX4 480

CX4 960

Processorper SP

1 dual-core 1.2GHz

1 dual-core 1.6GHz

1 dual-core 2.2GHz

2 quad-core 2.33GHz

Memory per SP

3GB

4GB

8GB

16GB

Max Cache

600MB

1.264GB/s

4.5GB

10.764GB

Front-end FC ports (Base/Max)

4/8

4/12

8/16

8/24

Back-end FC ports (Base/Max)

2/2

4/4

8/8

8/16

The Clariion CX4 line came out in 2008. I do have some criticism on the choice of processors for each model. First, the Intel Processor price list does not even show a 1.2GHz model in the Xeon 5100 or 3000 series. This means EMC asked Intel for a special crippled version of the Core 2 processor. The Intel Xeon processors start at 1.6GHz for a dual-core with a price of $167. The quad-core X3220 2.4GHz has price of only $198, so why in the world does EMC use a 1.2GHz dual-core at the low-end? Sure, basic storage server operations does not require a huge amount of compute cycles, but all the fancy features (that really should not be used in a critical SQL system) the SAN vendors advocate do use CPU cycles. So when the features are used, performance tanks on the crippled CPU used in the expensive SAN storage system.

Now what we really want at the mid-range 480 level is having two processor sockets populated, as this will let the system use the full memory bandwidth of the Intel 5000 (or 5400) chipset, with 4 FB-DIMM memory channels. Yes, the 960 does have two quad-core processors, but I am inclined to think that the 960 (SP pair) with up to 16 back-end FC port might be over-reaching for the capability of the Intel 5000P chipset. If the CX4 960 in fact uses the 5400 chipset, then this might be a good configuration. But I have seen no documentation that the 960 can drive 5.6GB/sec. The quad-core E5405 2.00GHz processor is a mere $209 each, and the E5410 2.33GHz used in the high-end 960 model is $256 each. In late 2008, the dual-core E5205 1.86GHz was the same price as the quad-core E5405 2.0GHz. The Dell PowerEdge 2900 with 2 E5405 quad-core processors and 16GB was $2300.

This is less than the cost of each of the quad-port FC adapters, of which there are two in each SP of the 480. Consider also the cost of the 480 and 960 base systems, and that the 16GB memory in each 960 SP has a cost of around $800 each. Why not just fill the 16 DIMM sockets allowed by the 5000P chipset with 4GB DIMMs at about $3200 for 64GB per SP, unless it is because a large cache on a storage controller is really not that useful?

My final complaint in the EMC Clariion line is the use of a slice of the first 5 disk drives for the internal operating system (which is Windows XP or version of Windows). This results in the 5 disks having slightly less performance than the other disks, which can completely undermine the load balancing strategy. Given the price that EMC charges per disk, the storage system OS really should be moved to dedicated internal disks. If it seems that I am being highly critical of the EMC Clariion line, let me say now that the other mid-range SAN storage system use truely pathetic processors. So, the Clariion CX4 is probably the best of the mid-range systems.

HP StorageWorks 2000 Modular Storage Array

First, the model name and numbering system for the HP entry storage line is utterly incomprehensible. Perhaps the product manager may have been on powerful medications at the time, or there were 2 PMs who did not talk to each other. The official name seems to be StorageWorks 2000 Modular Storage Array, but the common name seems to be MSA2000 G2 (for the second generation). This name might just apply to the parent chassis family, comprised of the 2012 12-bay enclosure for 3.5in (LFF) drives and the 2024 24-bay for 2.5in (SFF) drives. The controller itself appears to be the MSA2300 with suffix for the front-end interface. There are two models of interest for database systems, the 4Gb/s fiber channel fc model and the 3Gb/s SAS sa model. Do not even think of putting a critical database server on iSCSI. The choice is between fc and sa on the front-end interface. The configured unit might be the 2312 or 2324.

Apparently there is also the StorageWorks P2000 G3 MSA. This appears to consolidate the G2 fc and i (iSCSI) models, with 8Gb/s FC. Above this, HP has the P4000 series. I am not sure how this relates to the EVA 4400 series.

The back-end interface is SAS, and allows both SAS and SATA drives. The back-end can also connect to either additional 12-bay LFF enclosures (MSA2000) or 25-bay SFF enclosures (MSA70). There is the option of having a single controller or dual-controllers. The storage expansion enclosures can have single or dual IO interfaces. My opinion is that SAS for the back-end interface is the right choice. FC incurs a large cost premium and has no real advantages over SAS. A single 4Gb/s FC port has one-third the bandwidth of a 3Gb/s x4 SAS port, and the same ration for 8Gb/s FC to 6Gb/s x4 SAS.

There are 2 FC ports per controller on the fc model, and four SAS ports on the sa model. There is a single (3Gb/s x4?) SAS port on the backend. HP initially put out a performance report showing reasonable performance numbers for the MSA2000 G2 with 96 15K drives on the fc model of 22,800 random Read IOPS and 1,200MB/sec sequential in RAID 10, but ridiculously low numbers of 10,800 IOPS and 700MB/s for the sa model. Either this was a benchmarking mistake, which seems unlikely for HP's history in this area, or there were bugs in the sa software stack. This was later corrected to 21,800 IOPS and 1,000MB/s. This configuration is essentially the maximum for the MSA2000 with 2.5in. The random reads works out to just over 225 IOPS per disk, but the sequential is 12.5MB/sec per disk. I am presuming that 1GB/sec sequential could have been reach with about 40 disks. The Microsoft Fast Track Data Warehouse Reference Architecture 2.0 document seems to indicate that 100MB/sec per disk is possible for 2-disk RAID-1 groups.

I do not know much about the Hitachi AMS line, and have never worked on one (vendors should be alert to subtle, or not, hints). I point out this SAN storage system because Hitachi did submit a SPC benchmark report for it, with a price of about $1500 per 15K disk, amortizing the controller and supporting components. Most SAN storage systems usually work out from $2,500 to $3,500 per 15K 73 or 146GB disk, and up to $6K per 450 or 600GB disk, which seems to be what SAN vendors like to push, with horrible performance consequences. The Hitachi AMS has FC on the front-end and SAS on the back-end. The HP MSA 2000 and EMC Clariion AX also have SAS back-ends, but both are entry storage systems, in having limited backend ports. The Hitachi AMS is a mid-range comparable in bandwidth capability to the CX4 line. I reiterate that FC on the back-end is a major waste of money for less performance.

Enterprise Storage Systems

The big-iron storage systems are really beyond the scope of this document, but a couple of comments are worth noting. EMC top of the line used to be the DMX-4, which was a cross-bar architecture connecting front-end, memory and back-end. Last year (2009), the new V-Max line replaced the DMX-4. The V-Max architecture is comprised of up to 8 engines. Each engine is a pair of directors. Each director is a 2-way quad-core Intel Xeon 5400 system with up to 64GB memory (compared with 16GB for the CX4-960).

Each director also has 8 back-end 4Gb/s FC ports (comprised of quad-port HBAs?) and various options for the front-end including 8 4Gb/s FC ports. In the full configuration of 128 4Gb/s FC ports on the front and back ends, the expectation is that this system could deliver 40GB/s if there a no bottlenecks in the system architecture. Of course, there is no documentation on the actual sequential capability of the V-Max system. EMC has not submitted SPC benchmark results for any of their product line.

EMC V-Max documentation does not say what the Virtual Matrix interface is, but I presume it is Infini-Band, as I do not think 4 or even 8Gb/s FC is a good choice.

The main point here is that even EMC has decided it is a waste of time and money to build a custom architecture in silicon, and just using the best of Intel Xeon (or AMD Opteron) architecture components. It should be possible to build even more powerful storage systems around the Intel Nehalem architecture infrastructure. Unfortunately, storage systems evolve slowly, usually lagging 1-2 generations behind server systems.

Disk Enclosures

The next step in the chain of devices from the system IO bus to the disk drive is the disk enclosure (EMC uses the term DAE, which will also be used here even for non-EMC enclosures). Some years ago, a 3U enclosure for 15 3.5in disk drives was more or the less the only standard configuration.

HP may have been the first major vendor to switch to a 2U 12-disk enclosure for 3.5in drives.

The standard configuration for 2.5in drives seems to be a 2U enclosure for 24 or 25 drives.

I am not aware of any SAN vendors offering the high-density enclosures for 2.5in drives, except for HP in the Storage Works 2000 MSA line. This may indicate a serious lack of appreciation (or even understanding) of the importance of performance over capacity.

Hard Drives

The table below shows the specifications for the recent Seagate 3.5in (LFF) and 2.5in (SFF) 15K drives. The 2.5in Savvio drive has lower average seek time. The rotational latency for 15K drives is 2.0ms. The transfer time for an 8KB block ranges from 0.04ms at 204MB/s to 0.065ms at 122MB/s. The average access time for 8K IOP randomly distributed over the entire disk is then 5.45ms for the 3.5in disk and 4.95ms for the 2.5in disk. It should also be considered that the 3.5in Cheetah 15K.7 has media density of 150GB per platter versus 73GB for the 2.5in Savvio 15K.2. If the 3.5in disk were only populated to 50% capacity, the average seek latency would probably be comparable with the 2.5in disk.

Cheetah 15K.6

Cheetah 15K.7

Savvio 15K.2

Avg. Read Seek

3.4ms

3.4ms

2.9ms

Avg. Write Seek

3.9ms

3.9ms

3.3ms

Sequential Max

171MB/s

204MB/s

160MB/s

Sequential Min

112MB/s

122MB/s

122MB/s

The sequential transfer rates assume no errors and no relocated logical blocks. On the enterprise class disk drives, this is effectively achieved. On the high-capacity 7200RPM drives, the ability to sustain the perfect transfer rates is highly problematic, and the data sheet may not specify the design transfer rate.

The chart below shows IOMeter results for a single 10K over a range of data space utilization and queue depth demonstrating the short-stroke effect on IOPS (vertical axis).

The charts below show latency on the vertical scale in ms for a range of data utilizations and queue depth.

There is no point to having the big capacity SATA disks in the main storage system. We said early that the short-stroke effect was key. This meant we will have much more space than needed on the set of 15K drives. The SATA drives are good for allowing dev and QA to work with the full database. There are too many developers who cannot understand why a query works fine on a tiny 10MB dev database, but not the 10TB production database.

Solid State Storage

Most SSDs fall into one of two categories. One is an SSD with one of the standard disk drive interfaces such as SATA, SAS, FC, or one of the legacy interfaces. The second type connects directly into the system IO port (PCI-E), for example the Fusion-IO SSDs. TMS has a complete solid state SAN system, which might even included DRAM for storage as well as non-volatile memory.

Most of SSD devices in the news have a SATA interface and are intended for use in desktop and mobile systems. There might be (or have been) technical issues with using the SATA SSD in an SAS storage system when there are multiple SAS-SAS bridges in the chain, even though SATA drives can be used in these systems.

STEC makes the SSD for the EMC DMX line, possibly other models as well, and for several other storage vendors. The specifications for the STEC SSD is 52K random read IOPS, 17K random write IOPS, 250MB/s sequential reads, and 200MB/s sequential write.

The general idea behind the Fusion-IO architecture is that the storage interfaces were not really intended for the capabilities of an SSD. The storage interface, like SAS, was designed for many drives to be connected to a single system IO port. Since Fusion-IO could build a SSD unit to match the IO capability of a PCI-E slot, it is nature to interface directly to PCI-E.

What I would like from Fusion-IO are a range of cards that can match the IO bandwidth of PCI-E gen 2 x4, x8 and x16 slots, that deliver 2, 4 and 8GB/s respectively. Even better is the ability to simultaneously read 2GB/s and write 500MB/s or so from a x4 port, and so on for x8 and x16. I do not think it is really necessary for the write bandwidth to be more than 30-50% of the read bandwidth in proper database applications. One way to do this is to have a card with a x16 PCI-E interface, but the onboard SSD only connects to a x4 slice. The main card allows daughter cards each connecting to a x4 slice, or something to this effect.

One more thing I would like from Fusion-IO is using the PCI-E to PCI-E bridge chips. In my other blog on System Architecture, I mentioned that the 4-way systems such as the Dell PowerEdge R900 and HP ProLiant DL580G5 for Xeon 7400 series with the 7300MCH use bridge chips that let two PCI-E port share one upstream port. My thought is that the Fusion-IO resides in an external enclosure, attached to the bridge chip. The other two ports connect to the host system(s). One the host would be a simple pass through adapter that sends the signals from the host PCI-E port to the bridge chip in the Fusion-IO external enclosure. This means the SSD is connected to two hosts. So now we can have a cluster? Sure it would probably involve a lot of software to make this work, who said life was easy.

At this time, I am not entirely sure what the proper role is for SSD in database servers. A properly designed disk drive storage system can already achieve phenomenally highly sequential IO, just stay away from expensive SAN storage, and do not follow their standard advice on storage configuration.

Random IO is the natural fit for SSD. Lets suppose the amortized cost of a 146GB 15K disk is $500 in direct attach, $2K in a SAN, and a similar capacity SSD is $3000. Then table below shows cost per GB, and cost per IOP between HD and SSD, using fictitious but reasonable numbers.

Direct Attach

SAN

SSD

Capacity

146GB

146GB

146GB

Amortized Unit Cost

$500

$2000

$3000

IOPS

200

200

20,000

$/GB

3.4

13.7

20.5

$/IOP

2.5

10

0.15

A database storage system should not be entirely SSD. Rather a mix of 15K HD and SSD should be employed. The key is to put the data subject to high random IO into its own filegroup on the SSD.

Some people have suggested and temp as good candidates for SSD. For a single active database, a storage system that works correctly with the hard drive sequential characteristics is fine. It is only the situation of multiple high activity databases. Ideally, the storage controller cache interprets the pattern of activity from multiple log files on one LUN, so that a single RAID group is sufficient. If not, then this is a good fit for SSD.

I am not certain that SSDs are necessary for temp. For the data warehouse queries, the temp on a large HD array seems to work fine. In the TPC-H benchmark results, the queries that showed strong advantage for SDD involve random IO, not heavy tempdb activity.

Two criteria can sometimes terminate extended discussion. If the database happens to fit in memory, then there will not be heavy disk activity, except possible for log and temp. The log activity can be handled by disk drives. In this, it may not be worth the effort to setup a 48-disk HD array for temp, so a single SSD for temp is a good choice.

The second criteria is for slightly larger databases that exceed system memory, perhaps in the 200-400GB range, but are sufficiently small to fit on one or two SSDs. Again, it may not be worth the effort to setup the HD array, making the SSD a good choice.

Solid State Storage in the future without RAID?

A point I stress to people is to not blindly carry a great idea from the past into the future without understanding why. (Of course, one should also not blindly discard knowledge, most especially the reason underlying reason behind the knowledge).

So why do we have RAID? In the early days, disk drives were notoriously prone to failure. Does anyone remember was the original platter size was? I thought it was in the 12-18in range. MTBF may have been in the 1000hr range? Even today, at 1M-hr MTBF, for a 1000-disk array, the expectation is 8.8 disk failures per year (the average hours per year is 8,765.76, based on 365.24 days) Some reports show much higher failure rates, perhaps 30 per year per 1000 disks. Of course this includes all components in the storage system, not just the bare drive.

Sure, SSDs will also have a non-infinite MTBF. But the HD is fundamentally a single device. If the motor or certain components in the read/write mechanism fails, the entire disk is inaccessible. An SSD is not by necessity inherently a single device. The figure below shows a functional diagram of an Intel SSD. There is a controller, and there are non-volatile memory chips (NAND Flash).

In system memory, the ECC algorithm is design to correct single bit errors and detect double-bit errors within an 8-byte channel using 72-bit. When four channels are uniformly populated it can also detect and correct an entire x4 or x8 dram device failure and detect double x4 chip failures. I suppose SSDs might already have some chip failure capability (but I have not found the documentation that actually states this detail).

There should be no fundamental reason an SSD cannot have redundant controllers as well. With proper design, the SSD may nolonger be subject to single component failure. With proper design, an SSD storage system could conceivably copy off data from a partially failed individual SSD to a standby SSD, or even a disk drive.

I stress that may not be what we have today, but what I think the future should be.

Configuration

Having set most of the base, we can now discuss the strategies behind configuring the storage system. The most fundamental is probably that the traffic for each LUN must travel through a specific path. So on FC, do not plan on having more than 320-360MB/sec per LUN. EMC documents say that traffic can go on both loops of the FC pair, meaning 720MB/sec. I may have not interpreted this correctly, so verify this. The traffic to LUNs must be properly distributed over the available paths and HBAs with adjustments for the PCI-E slot bandwidth.

In the past, I have seen approximately 11MB/sec per disk in running a table scan on a SAN storage system. This works out to 175 x 64KB IOs, meaning the storage is issuing 64K IOs serially instead of taking advantage of the sequential capabilities of the disk drive.

The EMC whitepaper h5548 "Deploying EMC CLARiiON CX4-960 for Data Warehouse/Decision Support System (DSS) Workloads" states: number of key FLARE changes have been included in release 28, the array system software release in support of the CX4 family, to ensure that we can drive the underlying disk drives to the considerably higher level of data delivery rate using the "thin" RAID striping configurations.Thin RAID is described as a RAID group with few drives. Later the EMC paper advocates 2+1R5 or 2+2R6, in contrast with Microsoft FTDW which advocates 1+1 RAID 1.

The Microsoft FTDW documents also statement 100MB/sec per disk is possible. It would then take only 8 disks in one DAE to saturate a FC loop pair. My assertion is that even DW is not always true sequential, so a more reasonable target is 15 disks in one DAE, 2 DAE in one 4Gb/s FC loop pair. If the 720MB/sec per loop pair can be achieved, this works out to 24MB/sec per disk.

If there ever is 8Gb/s FC on the back-end, it would be desirable to continue with 2 DAE per loop pair averaging 48MB/sec per disk.

A direct attach storage system has no problems in aggregating disk drive sequential performance. In the past, I have place 15 disks in a single enclosure on a 3Gb/s x4 SAS port for 800MB/sec. Today, perhaps 12 disks on a 6Gb/s x4 SAS port works well with the 12 disk enclosures, or splitting the 24 disk SFF enclosure into two.

Technically, the correct strategy on SAN is to create three or more LUNs on each of the main (non-log) RAID groups. The first LUNs is for data, or the filegroup with the big table (usually order line items) with the other tables and indexes in the filegroup on the second LUN. The tempdb data file is on the next LUN. The last LUN, with the short-stroke strategy, should be the largest capacity and is intended for inactive files. This could be database backups, flat files or even data files for archive tables.

If we were to be extremely aggressive in performance optimization, the data and temp LUNs would be raw partitions. Most people do not even know about this capability in the Windows operating system and SQL Server, and will probably be afraid to venture into this territory.

The following example shows a storage system with 9 RAID groups, 4 FC loop pairs, and perhaps 8 DAEs total, 2 per loop pair. With the data & temp LUNs evenly distributed across SP and loops, how should the log RAID group LUN be configured? If the SPs and loops are not heavily loaded, the log could be set to either SP. It could be in its own loop, or even be distributed among unused disks in the other loops.

RAID Group

SP

Loop

LUN A

LUN B

LUN C

LUN D

0

A

0

FG1 data 1

FG2 data

temp 1

other

0

B

0

FG1 data 2

FG2 data

temp 2

other

0

A

1

FG1 data 3

FG2 data

temp 3

other

0

B

1

FG1 data 4

FG2 data

temp 4

other

0

A

2

FG1 data 5

FG2 data

temp 5

other

0

B

2

FG1 data 6

FG2 data

temp 6

other

0

A

3

FG1 data 7

FG2 data

temp 7

other

0

B

3

FG1 data 8

FG2 data

temp 8

other

8

?

?

log

If one were to follow the thin RAID group strategy, then there would be multiple RAID groups in each loop.

One storage system feature I do not intend to use is dynamic volume growth or relocation. SQL Server already has features for this. If more space is needed, add another DAE, RAID group, and set of LUNs. Use the ADD FILE command. Be sure to rebuild indexes during off hours to redistribute data across all files in the filegroup. When moving data from older disks to new disks, we could use DBCC SHRINKFILE with the EMPTYFILE option, but I have found that it can be faster to rebuild the indexes (and clustered indexes) to another filegroup first, then shrink the file.

This can also be a fundamental question of whether to use the SQL Server or storage system features to perform certain functions. A transactional database engines, be it SQL Server, Oracle, DB2 etc have carefully designed procedures to main data integrity.

One of most serious problems I have seen is that in large IT organizations, the database and storage systems are managed by completely different groups. The storage group is intent on carrying out the latest propaganda from storage vendors, such as storage as a service. Applications request service based on capacity only. The storage service (system) will automatically solve redistribute load. In any case, the SAN admin will strictly follow the configuration strategies put out by the storage vendor, in complete disregard of the base of accumulated knowledge on Database storage performance. This is guaranteed to be the worst possible configuration for database performance. The line-of-business and data warehouse systems really need to be on dedicated storage.

Per other discussion, each highly active log file must have its dedicated RAID 1 pair of physical disk drives, and possibly RAID 10 sets in extreme cases. Unknown is whether really critical log traffic should be directed over a dedicated HBA. Even more extreme is if an SP under heavy load from data and temp LUNs can provide super low latency for log LUNs. If not, then in such extreme circumstances, the log might require its own dedicated SP. The middle storage systems all have 2 controllers. Enterprise storage systems can have multiple controllers, but each is incredibly expensive. An entry level storage system has one or two controllers, but the intent is the complete storage system has multiple entry storage units, which is why entry level SAN systems makes sense even for big-time databases.

The other question on logs, suppose there multiple high activity log files on the same RAID group. Without a cache, the actual IO to disk is non-sequential, hopping between the log files. Can the storage controller effectively cache the log writes? Since the writes to the individual log files are sequential, it will fill the cache line and RAID stripe for that matter.

Comments on System Memory Configuration

Some people think I try to put formulas in areas that they think are too mysterious, like SQL Server execution plan cost structure. I put formulas were I think it is meaningful. The Microsoft FTDW puts out an arbitrary on system memory with no substantiating data. My thoughts:

Server systems today are frequently configured with 128GB of memory. A 4GB ECC DIMM costs about $150 depending on the source. So 32 x 4GB contributes on the order of $6,000 to the cost of a system. Given the cost of storage performance, filling the system with 4GB memory modules can have a relatively high value to cost ratio. In any case, an effort to determine the most correct memory configuration will likely exceed the cost of filling the DIMM sockets with the largest capacity DIMM without a price per GB premium. Today, this is the 4GB DIMM. Next year or in 2012, it might be the 8GB DIMM.

Comments

Thank you for the very interesting article. I was wondering if it was possible to convert an EMC DAE to a JBOD and attach it directly to a host/server? What kind of connectors/cable would be required if possible.

the regular FC cable should work, just connect from HBA to the DAE, skip the dual-path for now

My bad, EMC Clariion uses copper Fibre Channel on the back-end, 10 years ago, it was a DB9 connector, today I am sure it is a different connector. A media interface adapter can convert Fiber to Copper, with the right connectors on each side.

ok, I am really slippingif you look at your HBA, you will notice that the module where the fiber cable plugs in is removable,its called a transceiver, a small form-factor pluggable (SFP) or Mini-GBIC

why would a 10GbE not work with PCI-E gen1 or PCI-X? 10GbE has been around since 2004, but perphaps only more reliable since 2006. It might be difficult to find PCI-X 10GbE nowadays. I am inclined to think that most new 10GbE support both PCI-E gen 2 and gen 1, while the older ones support just gen 1.

I found the HP url for the ML370G4. So HP did not support 10GbE on the G4, even though other vendors had 10GbE cards at the time. I believe HP even bought a 10GbE vendor, S2IO?

Apparently HP only supported 10GbE from G5 on, and they only had dual-port, hence the better match for gen1 x8.

Of the course, the G4 is very old system, and it is probably time for it to start a new life as a boat anchor perhaps?