In my previous blog post, I wrote about CPU utilization and saturation, the practical difference between them, and how CPU utilization and saturation impact response times. Now we will look at another critical component of database performance: the storage subsystem. In this post, I will refer to the storage subsystem as "disk" (as a casual catch-all).

The most common tool for command line IO performance monitoring is iostat, which shows information like this:

The first line shows the average performance since system start. In some cases, it is useful to compare the current load to the long term average. In this case, as it is a test system, it can be safely ignored. The next line shows the current performance metrics over five-second intervals (as specified in the command line).

The iostat command reports utilization information in the %util column, and you can look at saturation by either looking at the average request queue size (the avgqu-sz column, named aqu-sz in newer sysstat releases) or looking at the r_await and w_await columns (which show the average wait for read and write operations). If these values go well above "normal" then the device is over-saturated.
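To make these columns less of a black box, here is a minimal Python sketch of how such numbers can be derived from two snapshots of the cumulative counters in /proc/diskstats. The field positions follow the kernel's documented layout; the `DiskSample` class, the function names, and the two sample lines are hypothetical and made up for illustration, not output from the test system.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Cumulative counters for one device, as exposed in /proc/diskstats."""
    reads: int           # reads completed
    read_ms: int         # total time spent on reads (ms)
    writes: int          # writes completed
    write_ms: int        # total time spent on writes (ms)
    io_ms: int           # time with at least one request in flight (ms)
    weighted_io_ms: int  # in-flight request count integrated over time (ms)


def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into (device name, DiskSample)."""
    f = line.split()
    # f[0], f[1] are major/minor numbers, f[2] is the device name.
    sample = DiskSample(
        reads=int(f[3]), read_ms=int(f[6]),
        writes=int(f[7]), write_ms=int(f[10]),
        io_ms=int(f[12]), weighted_io_ms=int(f[13]),
    )
    return f[2], sample


def iostat_metrics(prev, curr, interval_s):
    """Derive iostat-style metrics from two samples taken interval_s apart."""
    interval_ms = interval_s * 1000.0
    d_reads = curr.reads - prev.reads
    d_writes = curr.writes - prev.writes
    return {
        # Share of the interval with at least one request in flight.
        "util_pct": 100.0 * (curr.io_ms - prev.io_ms) / interval_ms,
        # Average number of requests in flight (queue size).
        "avg_queue": (curr.weighted_io_ms - prev.weighted_io_ms) / interval_ms,
        # Average time per completed read / write, in ms.
        "r_await_ms": (curr.read_ms - prev.read_ms) / d_reads if d_reads else 0.0,
        "w_await_ms": (curr.write_ms - prev.write_ms) / d_writes if d_writes else 0.0,
    }


# Made-up snapshots, five seconds apart, for a read-only workload.
_, before = parse_diskstats_line("259 0 nvme0n1 1000 0 8000 140 0 0 0 0 0 100 140")
name, after = parse_diskstats_line("259 0 nvme0n1 36000 0 288000 5040 0 0 0 0 1 5090 5040")
metrics = iostat_metrics(before, after, interval_s=5.0)
print(name, metrics)
```

With these invented counters the device shows close to 100% utilization with an average queue size just under one, which is the same shape of result discussed for the single-thread test.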

To focus specifically on the disk, we're using the Sysbench fileio test. I'm using just one 100GB file, as I'm using DirectIO so all requests hit the disk directly. I'm also using "sync" request submission mode so I can get better control of request concurrency.

I'm using an Intel 750 NVMe SSD in this test (though it does not really matter).

The Disk Latency graph confirms the disk IO latency we saw in the iostat output, and it will be highly device-specific. We use it as a baseline against which to compare the results at higher concurrency.

Disk IO Utilization

Disk IO utilization is close to 100% even though we have just one outstanding IO request (queue depth). This is the problem with Linux disk utilization reporting: unlike with CPUs, Linux does not have direct visibility into how the IO device is designed. How many "execution units" does it really have? How are they utilized? A single spinning disk can be seen as a single execution unit, while RAID, SSDs, and cloud storage (such as EBS) have more than one.
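The gap between "busy time" and real capacity can be sketched in a few lines of Python. The functions below are hypothetical helpers, and the number of execution units (16) is an assumption chosen purely for illustration, since Linux cannot tell us the real figure for a given device.

```python
def _clip(requests, start, end):
    """Clip (start, end) request intervals to the observation window."""
    return sorted(
        (max(s, start), min(e, end))
        for s, e in requests
        if e > start and s < end
    )


def busy_time_utilization(requests, start, end):
    """Fraction of the window with at least one request in flight.

    This is the definition behind the utilization figure discussed above.
    """
    busy = 0.0
    cur_start = cur_end = None
    for s, e in _clip(requests, start, end):
        if cur_end is None or s > cur_end:
            # Gap before this request: close the previous busy period.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or back-to-back request: extend the busy period.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (end - start)


def capacity_utilization(requests, start, end, units):
    """Fraction of total execution-unit time used, for an ASSUMED unit count."""
    used = sum(e - s for s, e in _clip(requests, start, end))
    return used / (units * (end - start))


# One thread issuing back-to-back requests, each lasting one time unit.
one_thread = [(t, t + 1) for t in range(10)]
# Four threads doing the same thing in parallel.
four_threads = one_thread * 4

for label, reqs in (("1 thread", one_thread), ("4 threads", four_threads)):
    print(
        label,
        "busy-time:", busy_time_utilization(reqs, 0, 10),
        "capacity (16 units assumed):", capacity_utilization(reqs, 0, 10, units=16),
    )
```

Both workloads report the same busy-time utilization, while the share of the assumed parallel capacity they use differs by a factor of four.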

Disk Load

This graph shows the disk load (or request queue size), which roughly matches the number of threads that are hitting the disk as hard as possible.

Saturation (IO Load)

The IO load on the Saturation Metrics graph shows pretty much the same numbers. The only difference is that unlike Disk IO statistics, it shows the summary for the whole system.

Sysbench FileIO 4 Threads

Now let's increase IO to four concurrent threads and see how the disk responds:

We can see the number of requests scales almost linearly, while request latency changes very little: 0.14ms vs. 0.15ms. This shows the device has enough execution units internally to handle the load in parallel, and there are no other bottlenecks (such as the connection interface).

These stats and graphs show an interesting picture: we barely see a response time increase for IO requests, while utilization inches closer to 100% (with four threads submitting requests all the time, it is hard to catch a moment when the disk does not have any requests in flight). The load is near four (showing the disk has to handle four requests at a time on average).
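The relationship between load, throughput, and latency here is Little's law: average requests in flight equals throughput multiplied by average response time. The sketch below assumes a closed loop in which every thread always has exactly one request outstanding (roughly what "sync" submission with no think time gives you). The function names are hypothetical, and the IOPS figures are implied by the latencies quoted above, not measured values from the test.

```python
def implied_iops(threads, latency_ms):
    """Throughput implied by Little's law when each thread keeps one request in flight."""
    return threads / (latency_ms / 1000.0)


def avg_in_flight(iops, latency_ms):
    """Little's law: average requests in flight = throughput * response time."""
    return iops * latency_ms / 1000.0


one = implied_iops(threads=1, latency_ms=0.14)
four = implied_iops(threads=4, latency_ms=0.15)

print(f"1 thread : ~{one:,.0f} IOPS implied")
print(f"4 threads: ~{four:,.0f} IOPS implied ({four / one:.2f}x)")
print(f"load at 4 threads: {avg_in_flight(four, 0.15):.1f}")
```

Under that assumption, four threads at nearly unchanged latency imply a bit under four times the throughput, and a load of four falls straight out of the formula.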

Going from four to 16 threads, we again see a good throughput increase with a mild response time increase. If you look at the results closely, you will notice one more interesting thing: the average response time has increased from 0.15ms to 0.21ms (which is a 40% increase), while the 95% response time has increased from 0.21ms to 0.36ms (which is 71%). I also ran a separate test measuring 99% response time, and the difference is even larger: 0.26ms vs. 0.48ms (or 84%).

This is an important observation to make: once saturation starts to happen, the variance is likely to increase and some of the requests will be disproportionately affected (beyond what the average response time shows).

The graphs show an expected figure: the disk load and IO load from saturation are up to about 16, and utilization remains at 100%.

One thing to notice is increased jitter in the graphs. IO utilization jumps to over 100% and disk IO load spikes to 18, when there should not be that many requests in flight. This comes from how this information is gathered. An attempt is made to sample this data every second, but on a loaded system the sampling itself takes time: sometimes we try to get the data for a one-second interval but really get data for a 1.05- or 0.95-second interval. When the math is applied to the data, it creates spikes and dips in the graph where there should be none. You can just ignore them if you're looking at the big picture.
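The arithmetic behind those spikes is easy to reproduce. The following Python sketch assumes counters are turned into rates by dividing a counter delta by the sampling interval; it illustrates the effect only and is not a description of how any particular monitoring tool is implemented. The counter deltas are invented numbers.

```python
def utilization_pct(busy_ms_delta, interval_s):
    """Utilization over an interval, from the growth of a busy-time counter (ms)."""
    return 100.0 * busy_ms_delta / (interval_s * 1000.0)


def avg_load(weighted_ms_delta, interval_s):
    """Average requests in flight, from the growth of a weighted-time counter (ms)."""
    return weighted_ms_delta / (interval_s * 1000.0)


# A fully busy device with 16 requests in flight, sampled 1.05 s after the
# previous sample instead of the intended 1.00 s.
actual_interval_s = 1.05
busy_delta_ms = 1050          # device was busy for the whole 1.05 s
weighted_delta_ms = 16 * 1050  # 16 requests in flight for the whole 1.05 s

# Dividing by the nominal interval overstates both metrics.
print("nominal 1.00 s:", utilization_pct(busy_delta_ms, 1.0), avg_load(weighted_delta_ms, 1.0))
# Dividing by the measured interval gives the expected values.
print("actual 1.05 s :", utilization_pct(busy_delta_ms, actual_interval_s),
      avg_load(weighted_delta_ms, actual_interval_s))
```

A 5% error in the assumed interval is enough to push utilization to about 105% and a load of 16 to nearly 17, which matches the kind of overshoot visible in the graphs.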

Going from 16 to 64 threads, we can see the average has risen from 0.21ms to 0.50ms (more than two times), and the 95% response time more than tripled, from 0.36ms to 1.25ms. From a practical standpoint, we can see some saturation starting to happen, but we're still not seeing a linear response time increase with increasing numbers of parallel operations as we have seen with CPU saturation. I guess this points to the fact that this IO device has a lot of parallel capacity inside and can process requests more effectively (even going from 16 to 64 concurrent threads).

Over the series of tests, as we increased concurrency from one to 64, we saw average response times increase from 0.14ms to 0.5ms (or approximately 3.6 times). The 95% response time over the same range grew from 0.17ms to 1.25ms (or about seven times). For practical purposes, this is where we see the IO device saturation start to show.

With 256 threads, finally, we're seeing the linear growth of the average response time that indicates overload and queueing to process requests. There is no easy way to tell if it is due to the IO bus saturation (we're reading 2GB/sec here) or if it is the internal device processing ability.

As we've seen a less than linear increase in response time going from 16 to 64 connections, and a linear increase going from 64 to 256, we can see the "optimal" concurrency for this device: somewhere between 16 and 64 connections. This allows for peak throughput without a lot of queuing.
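One way to put a number on "less than linear" versus "linear" is to divide the latency growth between two test runs by the concurrency growth between them. This is a hypothetical helper, not part of the original analysis; it uses the average response times reported in this post.

```python
# (threads, average response time in ms) as reported in this post.
RESULTS = [(1, 0.14), (4, 0.15), (16, 0.21), (64, 0.50), (256, 1.95)]


def latency_scaling(c1, lat1, c2, lat2):
    """Latency growth relative to concurrency growth.

    Near 0: extra concurrency is absorbed in parallel.
    Near 1: latency grows in step with concurrency (requests are queueing).
    """
    return (lat2 / lat1) / (c2 / c1)


for (c1, l1), (c2, l2) in zip(RESULTS, RESULTS[1:]):
    ratio = latency_scaling(c1, l1, c2, l2)
    print(f"{c1:>3} -> {c2:<3} threads: scaling ratio {ratio:.2f}")
```

The ratio stays well below one up to 64 threads and lands very close to one for the step from 64 to 256, which is consistent with placing the sweet spot somewhere between 16 and 64.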

Another Way to Think About Saturation

Before we get to the summary, I want to make an important note about this particular test. The test is a random reads test, which is a very important pattern for many database workloads, but it might not be the dominant load for your environment. You might be write-bound as well, or have mainly sequential IO access patterns (which could behave differently). For those other workloads, I hope this gives you some ideas on how to also analyze them.

When I asked the Percona staff for feedback on this blog post, my colleague Yves Trudeau provided another way of thinking about saturation: measure saturation as the percent increase in the average response time compared to the single-user case. Like this:

| Threads | Avg Response Time (ms) | Saturation |
|---|---|---|
| 1 | 0.14 | – |
| 4 | 0.15 | 1.07x or 7% |
| 16 | 0.21 | 1.5x or 50% |
| 64 | 0.50 | 3.6x or 260% |
| 256 | 1.95 | 13.9x or 1290% |
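The same calculation can be written as a small Python sketch. The helper name is hypothetical; the inputs are the average response times from the table. Note that computing from these rounded averages gives roughly 257% and 1293% for the last two rows, so the table's 260% and 1290% appear to be derived from the rounded factors.

```python
BASELINE_MS = 0.14  # average response time with a single thread

# threads -> average response time in ms, as listed in the table
AVG_RESPONSE_MS = {4: 0.15, 16: 0.21, 64: 0.50, 256: 1.95}


def saturation(avg_ms, baseline_ms=BASELINE_MS):
    """Return (factor, percent increase) relative to the single-thread baseline."""
    factor = avg_ms / baseline_ms
    return factor, (factor - 1.0) * 100.0


for threads, avg_ms in AVG_RESPONSE_MS.items():
    factor, pct = saturation(avg_ms)
    print(f"{threads:>3} threads: {factor:.2f}x or {pct:.0f}%")
```

The advantage of this framing is that it needs no knowledge of the device internals: only a baseline taken at low concurrency on the same hardware and workload.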

Summary

We can see how understanding disk utilization and saturation is much more complicated than for the CPU:

The Utilization metric (as reported by iostat and by PMM) is not very helpful for showing true storage utilization, as it only measures the time when there is at least one request in flight. If you had the same metric for the CPU, it would correspond to something running on at least one of the cores (not very useful for highly parallel systems).

Unlike with CPUs, Linux tools do not provide us with information about the structure of the underlying storage and how much parallel load it should be able to handle without saturation. Even more so, storage might well have different low-level resources that cause saturation. For example, it could be the network connection, the SATA bus, or even the kernel IO stack for older kernels and very fast storage.

Saturation as measured by the number of requests in flight is helpful for guessing if there might be saturation, but since we do not know how many requests the device can efficiently process concurrently, just looking at the raw metric doesn't let us determine whether the device is overloaded.

Average Response Time is a great metric for looking at saturation, but as with any response time metric, you can't say in isolation what value is good or bad for a given device. You need to look at it in context and compare it to the baseline. When you're looking at the Average Response Time, make sure you look at read request response time and write request response time separately, and keep the average request size in mind to ensure you are comparing apples to apples.
