IOPS and latency are not related - HDD performance explored

Today, we routinely hear people carrying on about IOPS-this and IOPS-that. Mostly this seems to come from marketing people: 1.5 million IOPS-this, billion IOPS-that. Right off the bat, a billion IOPS is not hard to do: the metric lends itself rather well to parallelization...

This post is the first in a series looking at the use and misuse of IOPS for storage system performance analysis or specification.

Let's do some simple math. We all want low latency -- the holy grail of performance. In the bad old days, many computer systems were bandwidth constrained in the I/O data path, so it was very easy to measure the effect of bandwidth constraints on latency. For example, fast/wide parallel SCSI and UltraSCSI, all the rage when the dot-com bubble was bubbling, capped out at 20 MB/sec. Suppose we had to move 100 MB of data; then the latency is easily calculated:

Latency = 100 MB / 20 MB/sec = 5 sec

Well, there it is, 5 seconds later. For modern computers, 6Gbps SAS or SATA is prevalent. Taking the raw line rate, a modern system's disk channel bandwidth is around 750 MB/sec (closer to 600 MB/sec of payload once 8b/10b encoding is accounted for):

Good New Latency = 100 MB / 750 MB/sec = 0.133 sec

Sweet! But that is just channel bandwidth, for HDDs there is another limiting factor, the media bandwidth. For bulk HDDs, you can guess 150 MB/sec for media bandwidth and you'll be in the ballpark. Consult the datasheet or detailed design docs for your drive to see its rating.
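The arithmetic so far fits in a few lines of Python. This is only a sketch: the function name `transfer_time_s` is made up for illustration, decimal megabytes are assumed, and the slower of the channel and media rates is treated as the bottleneck.

```python
from typing import Optional


def transfer_time_s(size_mb: float, channel_mb_s: float,
                    media_mb_s: Optional[float] = None) -> float:
    """Seconds to stream size_mb at the slower of the channel and media rates."""
    rate = channel_mb_s if media_mb_s is None else min(channel_mb_s, media_mb_s)
    return size_mb / rate


print(f"{transfer_time_s(100, 20):.3f} s")        # old 20 MB/sec channel
print(f"{transfer_time_s(100, 750):.3f} s")       # 6Gbps channel alone
print(f"{transfer_time_s(100, 750, 150):.3f} s")  # capped by 150 MB/sec media
```

With the 150 MB/sec media guess plugged in, the 100 MB transfer takes about two thirds of a second rather than 0.133 sec, so the channel is no longer the limiting factor.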

The near elimination of bandwidth-induced latency in the data path has led to an interesting change in systems thinking. The good news is that bandwidth generally isn't an issue. The bad news is that the marketing folks need something else to spew about. Something that goes up and to the right when you graph it over time, just like your stock portfolio. In other words, latency doesn't work because you are happier when it is smaller. The solution to this marketing dilemma: talk about IOPS! They go up and to the right, bigger is better, my product has more than yours, and the 2-drink minimum ensures a party well into the night.

Let's revisit our scenario. Assume that we have a fixed, 4KB I/O size, which is reasonably common for many workloads, especially those running on Intel x86-based systems today.
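A minimal sketch of that calculation, assuming IOPS is simply channel bandwidth divided by I/O size and using decimal units (1 MB = 1,000 KB). The helper name `streaming_iops` is hypothetical, and the results are derived from the bandwidths quoted earlier, not measured.

```python
def streaming_iops(bandwidth_mb_s: float, io_size_kb: float = 4.0) -> float:
    """Upper bound on IOPS if every I/O streams at full bandwidth (no seeks)."""
    return bandwidth_mb_s * 1000.0 / io_size_kb


for label, bw in [("20 MB/sec channel", 20), ("750 MB/sec channel", 750)]:
    print(f"{label}: {streaming_iops(bw):,.0f} IOPS")
```

Under those assumptions the old 20 MB/sec channel works out to 5,000 IOPS and the 750 MB/sec channel to 187,500 IOPS.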

How about them apples! A million IOPS cannot be far away! Billions will follow! Joy and happiness will overtake the legions of struggling performance geeks and all will be good in the universe!

Hold it right there, fella! Unfortunately it doesn't quite work like that. The physics behind the technology says you can maintain 100MB/sec (or media bandwidth) as long as you don't have to seek. It turns out that most real-world workloads are not of the streaming bandwidth type. Just a little seek to an adjacent track and you blow the whole equation. What's worse, for HDDs it gets blown in unpredictable ways. To deal with unpredictable systems, we resort to our good old friends, measurement and statistics. So let's take a look at how an HDD reacts to a random workload -- where one wants to see high IOPS.

The workload of choice is a full-stroke 4KB random workload. The victim under test is a typical 7,200 rpm 3.5" HDD (most vendors have similar performance specs). If you look at the datasheet you might see specifications like:

Average seek time: 8.5 ms

Average rotational delay: 4.17 ms

The average rotational delay is based on the rotation speed, 7,200 rpm. Pay close attention to these specs for HDDs and be aware that many of the new, low power "green" drives have variable rotational delay (read as: random I/O performance will suck even worse). From the specs, we can expect that an average random read will take around 12.5 ms. We test the drive, and sure enough we get 12.5 ms, for a single thread case. This is good, because it means that the datasheets don't lie and I can rely on them for system sizing without ever actually testing the HDD. Now that we know the average latency, it is a simple matter of math to get the IOPS.

IOPS = 1/avg latency = 1/0.0125 = 80 IOPS

Well, there it is. If you know nothing at all about a drive, you can guess that it should be able to deliver 80 IOPS, give or take a few. Drives with faster rotational speed, such as 15krpm, reduce the rotational delay and the real fast, enterprise-grade HDDs can also reduce the average seek time to a few ms. Do the math and you'll find them around 80 to 200 IOPS.
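The same estimate can be sketched in Python, assuming the average rotational delay is half a revolution and ignoring transfer time and controller overhead. The function names are made up for illustration, and the 3.5 ms seek time used for the 15krpm drive is an assumed value standing in for "a few ms", not a datasheet figure.

```python
def rotational_delay_ms(rpm: float) -> float:
    """Average rotational delay: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def random_iops(seek_ms: float, rpm: float) -> float:
    """Single-outstanding-I/O estimate: one I/O per (seek + rotational delay)."""
    return 1000.0 / (seek_ms + rotational_delay_ms(rpm))


print(round(rotational_delay_ms(7200), 2))  # 4.17 ms, matching the datasheet
print(round(random_iops(8.5, 7200), 1))     # 78.9, i.e. roughly 80 IOPS
print(round(random_iops(3.5, 15000), 1))    # 181.8 with the assumed 3.5 ms seek
```

Summing the two datasheet numbers gives 12.67 ms rather than exactly 12.5 ms, which is why the sketch lands just under 80 IOPS.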

But Richard, this is a long way from a billion IOPS. Yes, it is. But before we go there, we need to take another look at HDDs under concurrent load. The above case is for a single I/O operation at a time. Remember when I said IOPS can be easily tamed by parallelization? Let's give it a try. For the next test, let's increase the number of concurrent I/O operations. Modern HDDs have either Tagged Command Queuing (TCQ) for SCSI or Native Command Queuing (NCQ) for SATA. The idea is that you can submit multiple I/O operations to an HDD and it will optimize the head movements to give you better performance. We adjust our test to measure 100% writes and 100% reads for 4KB random I/Os with 1, 2, 4, 6, 8, or 10 threads. ZFS fans will know that 10 is a magic number because it is the default I/O concurrency for disks. Since these tests result in multiple answers, it is best to graph them.

There is our 80 IOPS for 100% reads with a single thread, so our tests look legitimate. But wait... the rest of the measurements are slightly unexpected. We can attribute the better performance for writes to the write buffer cache in the drive (the workload does not issue SYNCHRONIZE_CACHE commands). So ok, good, we can get 180 write IOPS to the drive, a bonus over the 80 IOPS for reads.

But, hold it right there again, fella! Remember when we said that the IOPS is related to latency? This data shows that there is no correlation between IOPS and latency for concurrent workloads on HDDs! Latency is on the X axis and IOPS is on the Y axis, so the data clearly shows that IOPS tends to remain constant around 180 to 190 IOPS for the write case even though the average latency ranges from around 5.5 ms up to more than 55 ms. Reads are even worse, where at around 137 IOPS average latency is more than 70 ms. Going back to our math:

IOPS = 1/avg latency = 1/75 ms = 13.3 IOPS

13.3 IOPS != 137 IOPS
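One way to see why the single-stream formula falls apart is Little's law from queueing theory, which relates the mean number of I/Os in flight to throughput and mean latency. The sketch below assumes the 137 IOPS, roughly 75 ms read measurement was taken with about 10 I/Os outstanding (10 threads is the largest setting listed, but the graph itself is not reproduced here), and the function name is hypothetical.

```python
def outstanding_ios(iops: float, avg_latency_s: float) -> float:
    """Little's law: mean I/Os in flight = throughput x mean latency."""
    return iops * avg_latency_s


naive_iops = 1 / 0.075                   # the one-at-a-time estimate
in_flight = outstanding_ios(137, 0.075)  # what the measurement implies
print(f"naive: {naive_iops:.1f} IOPS, implied I/Os in flight: {in_flight:.1f}")
```

With a single I/O in flight the law reduces to IOPS = 1/latency, which is why the single-thread case matched the datasheet. With more in flight, the drive can hold IOPS roughly steady while each individual I/O waits longer, so the IOPS figure alone says nothing about the latency an application sees.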

Clearly, there is no correlation between IOPS and latency for concurrent workloads on HDDs! What the data shows is that some I/Os are efficiently handled by the drive's elevator algorithm, but there are other I/Os that get penalized rather badly. Some of the maximum measurements, not shown in the graphs, were in the 900+ ms range. Perhaps these measurements need to include reporting of the standard deviation in addition to the mean? Pity the poor application that has to wait 900 ms because all of the other I/Os were being serviced out of order.
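To illustrate why the mean alone is not enough, here is a small sketch using Python's standard `statistics` module. The latency values are made up for illustration and are not the measurements from these tests: nine I/Os served quickly and one starved by reordering.

```python
import statistics

# Made-up latencies in ms, NOT measurements from this post:
# nine I/Os served quickly and one penalized by out-of-order service.
latencies_ms = [10.0] * 9 + [900.0]

mean = statistics.mean(latencies_ms)
stdev = statistics.stdev(latencies_ms)  # sample standard deviation
worst = max(latencies_ms)
print(f"mean={mean:.0f} ms  stdev={stdev:.0f} ms  max={worst:.0f} ms")
```

In this toy sample the mean of 99 ms describes none of the ten I/Os, and a standard deviation larger than the mean is the hint that a long tail is hiding behind the average.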

Back to the marketing department... a billion IOPS should look like:

Latency = 1/IOPS = 1/1,000,000,000 = 1 nanosecond (ns)

Can you reasonably expect a 1 ns response time for any I/Os on any modern computer system? Absolutely not! You can't even get 1 ns response time from memory, let alone across the PCIe interconnect, through an HBA, down the wire to the disk and back again. Clearly, the games the marketeers are playing have one or more of the following caveats:

1. Big I/Os are being divided ex post-facto into smaller I/Os, as shown in the UltraSCSI example above.

2. I/Os are a strange or useless size. We've seen this recently from FusionIO (who should know better!) saying 1 billion IOPS where each I/O was 64 bytes. A more accurate statement is perhaps that they passed 1 billion PCIe transactions per second, except that they didn't say how big the PCI transfers were, so they could be confusing the public with #1, too.

3. Parallel storage systems are prevalent, but very often do not deliver better latency. The analogy I typically use here is: nine women can deliver nine babies in nine months (IOPS), but nine women cannot deliver a baby in one month (latency).
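The parallelization caveat can be put into numbers with a short sketch. It assumes IOPS scales perfectly linearly across independent drives, which is an idealization, and it reuses the single-thread figures from earlier in this post. The function name is hypothetical.

```python
import math


def drives_needed(target_iops: float, per_drive_iops: float) -> int:
    """Drives required if IOPS scales linearly across independent drives."""
    return math.ceil(target_iops / per_drive_iops)


per_drive_iops = 80       # single-thread figure from earlier in the post
per_io_latency_ms = 12.5  # unchanged no matter how many drives are added

n = drives_needed(1_000_000_000, per_drive_iops)
print(f"{n:,} drives -> 1 billion IOPS, each I/O still ~{per_io_latency_ms} ms")
```

Under that idealized assumption, 12,500,000 HDDs deliver a billion IOPS on paper, yet no single I/O completes any faster than it would on one drive.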

I've painted a bleak picture for HDDs here, and indeed their role in high performance systems is over. SSDs won, game over. There are some very good SSDs that have very consistent, low latency and are amongst my favorite choices for cases where latency is important. Even more are being developed as I type, and it is an interesting time to be in the storage business. If you can, please help squash the crazy marketeers who are spewing drivel by understanding your system and how latency matters.

As we deliver better tools for observing latency and its effects on your storage workload, we will necessarily have to discourage use of meaningless or confusing measures. Bandwidth is already buried, IOPS will be next. Stay tuned...

I've been wanting to get a better understanding of our storage system, so I will be following this. Any chance you will explain how a SAN fits into this? You often hear words like number of spindles etc. spouted by the SAN people, and I wonder what it all means.

Hi Matin, SANs have no impact on this analysis. A SAN is simply a different form of transport and does not alter the mechanics of HDDs.

When people talk about adding spindles, they are attempting to get lower average latency. Sometimes they explain this as "more IOPS" but I think a simpler queueing theory view is more appropriate. Note that the latest public benchmark results, for those benchmarks that are latency-sensitive, are all SSD-based.

Excellent article. One thing I got from reading this is something I've been saying for a while now: this is why the newer tier-based storage systems (3PAR, Compellent) will win out against the old incumbents (NetApp) in the long run.

The ability to have your data automatically moved to its tier of most efficiency is to my mind the 'killer app' of the storage world right now. Allows you to choose all three points of the 'cost', 'size', 'speed' triangle.

In general, yes. The problem with the tiered storage systems is that the decision to migrate between the tiers is based on temporal locality. Indeed, from a storage-centric view of the world, this is the best effort possible. This is why storage solutions with a better knowledge of the workload are superior: they know the context of the data and the intended use. This is why storage management software (!) is becoming more important for systems. Think technologies like ZFS, ASM, etc.
