I suppose I write this post to mock some of the things I keep seeing in the industry today. Direct Attached Storage (DAS) has moved on dramatically in the past 5 years: a single SSD is capable of up to 525 MBytes/sec throughput (SATA/SAS 6 Gbits/sec), 700 MBytes/sec over M.2, and then there is PCIe-connected flash operating at multiple GBytes per second.

I will simply put up a picture; if you understand what I’m portraying then you are like the rest of us – you understand maths and common sense.

A clue: what happens when you drive down a 10-lane motorway that gets reduced to just a single lane?

In the diagram below the database servers have a reporting role, so they pull a lot of data from storage.

What is wrong with the architecture below? Assume data is located on a single LUN across all the SSDs.

For SAN calculations convert everything to gigaBITS/sec as this is what the ‘pipes’ are measured in!

SSD aggregate: roughly 3 GBytes/sec = 24 Gbits/sec.
Hosts: 3 × 8 = 24 Gbits/sec.
SSD-to-switch is 8 Gbits/sec, hence that is up to a 3x bottleneck!
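To spell that out, here is a minimal back-of-envelope sketch in Python; the figures are the illustrative ones from the diagram (six SSDs at roughly 4 Gbits/sec usable each, three hosts on 8 Gbit HBAs, one 8 Gbit pipe out of the SSD shelf), not measurements:

```python
# Back-of-envelope sums for the diagram (illustrative figures, not measurements).
ssd_aggregate  = 6 * 4   # ~24 Gbit/sec the six drives can deliver
host_aggregate = 3 * 8   # 24 Gbit/sec the three hosts can accept
ssd_to_switch  = 1 * 8   # 8 Gbit/sec - the single lane everything funnels through

# The end-to-end ceiling is whichever link is narrowest.
bottleneck = min(ssd_aggregate, host_aggregate, ssd_to_switch)
print(f"End-to-end ceiling: {bottleneck} Gbit/sec - "
      f"up to {ssd_aggregate // bottleneck}x bottleneck versus the drives")
```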

With SSDs there is no more waiting for the disk to spin/seek, hence the pipes become the next bottleneck.

Imagine you solve the SAN bottleneck and scale to the point that one query cannot be satisfied by one host; now you scale to the next level: multi-host database servers (DB2 DPF shared nothing, where the database server owning the connection collates the data) or sharding (where the client collates the data).

Now you hit the issue that the network on the other side of the database servers is the bottleneck – either between database servers (even top-of-rack switches have a limit) or to the client.

So you go to the network folks and say you want to do a grouping on, say, 5TB of data over the Ethernet network!

6 Gbits/sec equates to around 525 MiBytes per second
8 Gbits/sec equates to around 700 MiBytes per second

Each drive is connected using a 6 Gbit channel; with 6 drives you basically have 6 channels, so an aggregate bandwidth of 6 × 525 MiBytes = 3,150 MiBytes per second.

Now you connect the SAN to the infrastructure using a single 8Gbit channel so you get a maximum of 700MiBytes per second – no more.

You therefore curtail your available performance down to little more than 1/6 of its capability – one 8 Gbit pipe against six 6 Gbit drive channels, 700 out of 3,150 MiBytes per second.
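The same sums in MiBytes/sec, as a small sketch using the practical per-channel figures quoted above (not theoretical line rates):

```python
# Aggregate drive bandwidth versus the single pipe out of the SAN,
# using the practical figures quoted above.
per_drive = 525          # MiB/sec down one 6 Gbit SATA channel
drives    = 6
san_link  = 700          # MiB/sec down the single 8 Gbit channel

aggregate = drives * per_drive   # 3,150 MiB/sec the drives can supply
print(f"Aggregate {aggregate} MiB/sec, single pipe {san_link} MiB/sec -> "
      f"equivalent of {san_link / per_drive:.1f} of the {drives} drive channels")
```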

Databases aren’t file servers; the access patterns are entirely different – to satisfy a single query we might actually end up reading the same data from storage many hundreds or tens of thousands of times if the available memory in the server isn’t enough to cache the data we are working over!

“…what happens when you drive down a 10-lane motorway that gets reduced to just a single lane?”
“Assume data is located on a single LUN across all the SSDs.”
A single LUN on the hosts would most likely present a serious DB server bottleneck before getting to the downstream network bottleneck. Assume a LUN service queue depth of 32, and a queue length of the same, since no specification has been made about LUN service queue depth. (32 is the maximum LUN queue depth for the Hitachi VSP, XP24000 and related offerings from HP, as well as for EMC VNX with virtual provisioning.) With those assumptions, 128,000 IOPs of 8k each at 0.25 ms average latency would be needed to hit 1 gigabyte/sec, and 384,000 IOPs of 8k each across the 3 servers to hit 3 gigabytes/sec.

At a LUN service queue depth of 256 it becomes a more manageable 2 ms avg latency. But not all storage controllers will allow such a high LUN service queue depth. Other than 32, the most common LUN queue depth I encounter is 64.
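Spelling that queue-depth arithmetic out – a rough Little’s Law sketch (outstanding IO = IOPs × latency), using the same 1 gigabyte/sec target and 8 KiB IO size assumed above:

```python
KIB = 1024

def required_latency_ms(target_bytes_per_sec, io_size_bytes, queue_depth):
    # Little's Law: outstanding IO = IOPs * latency, so for a fixed throughput
    # target the average latency you must sustain scales with the queue depth.
    iops = target_bytes_per_sec / io_size_bytes
    return queue_depth / iops * 1000.0

one_gb = 1_000_000_000
print(required_latency_ms(one_gb, 8 * KIB, 32))    # ~0.26 ms needed at queue depth 32
print(required_latency_ms(one_gb, 8 * KIB, 256))   # ~2.1 ms needed at queue depth 256
```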

With 8 or more LUNs at queue depth 32 on a database server, I think you could easily be throttled by available downstream bandwidth. But with a single LUN – the most typical case on a database server, with transfer sizes ranging from a single block to the maximum physical transfer size – the throughput requirements of the outstanding IO don’t come close to saturating the available bandwidth of an 8 gigabit HBA.

Although accounting for outstanding IO potential isn’t common in SQL Server configurations, for example, those responsible for system design and performance should know that a 4 vCPU VM can drive up to 1,000 outstanding IOs, and the more of them that are relegated to the wait queue, the more punishment in latency and throughput the VM will experience.

It’s very common for flash and SSD devices to provide more IOPs and bytes/sec than storage controllers can serve, and even more common, in my opinion, for the storage controllers to be underutilized while host/guest disk IO queues on heavily used reporting database servers are saturated.

sql_handle: you’ve based that maths on an 8 KiB page, and for a reporting server that just isn’t realistic. Looking at a real, semi-busy server doing OLTP, avg disk bytes/write is averaging 15,550 and avg disk bytes/read 13,264; when doing a reporting workload the average rises to 325,240.

If I’m running a query that requires an index scan (the left side of the join on a loop/merge/hash, even) on a table that is 200 GiB, indexed so that the real data read is just 50 GiB, then that 50 GiB isn’t read in 8 KiB IOs; I’ll be getting between 100 KiB and 2 MiB per IO depending on how fragmented the data is and what I’m actually doing in the query.
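As a rough illustration (assuming ~1 MiB average IOs, the mid-range of the sizes above; the exact figure doesn’t change the conclusion):

```python
# Illustrative only: pulling the 50 GiB the scan actually reads through a
# single ~700 MiB/sec (8 Gbit) pipe, assuming ~1 MiB average IOs.
MIB = 1024 * 1024
GIB = 1024 * MIB

data_read = 50 * GIB
avg_io    = 1 * MIB
san_link  = 700 * MIB        # bytes/sec through the single 8 Gbit channel

print(f"{data_read / avg_io:,.0f} large IOs, "
      f"~{data_read / san_link:.0f} seconds just to move the data through the one pipe")
```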

Do you now see why the 700 MiB/sec limiter (8 Gbits) – i.e. one lane out of the SAN servicing multiple servers, or even just the one – is just plain silly?

I think you’ve fallen into one of the traps I see – SAN engineers assuming 8 KiB fully random IO when looking at database servers.

Also – as to the single LUN mentioned, that’s actually what I see time and time again out in the field; to use multiple LUNs for a SQL Server database you need to use multiple files in the database, which a lot of folk just don’t do.

So in reality, at a read rate of 128 KiBytes per IO off the SAN, you can only manage around 5.5K IOPs – which really isn’t that much!
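The same 700 MiB/sec ceiling at a few transfer sizes, as a quick sketch (the 325 KiB figure is the average read size quoted earlier):

```python
# Maximum IOPs a single ~700 MiB/sec (8 Gbit) pipe can carry at different
# transfer sizes - the 8 KiB 'fully random' assumption versus the larger
# reads a reporting scan actually issues.
KIB = 1024
MIB = 1024 * KIB
link = 700 * MIB             # bytes/sec

for io_kib in (8, 64, 128, 325):
    print(f"{io_kib:>4} KiB IOs -> ~{link / (io_kib * KIB):,.0f} IOPs max")
```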