VMAX Performance – Queue Depth Utilization Explained

I recently ran a project for a new EMC Symmetrix VMAX 10k installation. The install was a breeze and the migration of data to the system was fairly straightforward. The customer saw some good performance improvements on the storage front and there is plenty of capacity in the system to cater for immediate growth. Yet when I opened the Unisphere for VMAX interface and browsed to the performance tab, my heart skipped a beat. What are those red queue depth utilization bars? We were seeing good response times, weren’t we? Were we at risk? How about scalability? Let’s dig deeper and find out.

Analyzing Queue Depth metrics

Some background info on the VMAX10k: we’ve made port groups of 4 ports each, grouping all the E, F, G and H ports. Port groups E through G have ESX hosts attached with the various clusters being spread across the three port groups. Port group H runs physical Linux and HP-UX hosts. Currently there’s only 1 Linux EMC NetWorker host attached to port group H, with the rest of the HP-UX infrastructure (roughly 20 servers) following in the next couple of months. We’ve only attached the first port for each FE director: e.g. FA-1E:0 is connected, FA-1E:1 is disconnected.

First of all, let’s look at the actual queue depth utilizations; are they indeed high?

Opening up the FE director dashboard you can see high % Queue Depth Utilization in the top middle chart: the 1-4G ports are topping the list, with 1-4E following closely and 1-4F not too far behind. Out of the H-ports, only 1H has a Queue Depth Utilization of about 40%; the other H-ports aren’t queuing at all. This makes sense because the H-ports only have one physical server attached to them, doing 5.5 IOPs on average.

Curiosity #1: why is port 1H, which only has to cope with 5.5 IOPs on average, already showing a Queue Depth Utilization of 40%?

The actual % Busy for the front end directors is reasonably low: 30% for the E-ports, 15-ish for the G-ports and <10% for the F-ports. The % Busy roughly follows the same pattern as the IOps per FE director graph (bottom middle) shows.

Okay, let’s look at response times.

The top two storage groups have higher than average response times at 15ms; these storage groups are presented out of port group G. The other storage groups are mapped to port groups E and F, the exception being the last storage group, which is mapped to port group H.

Curiosity #2: port groups E and G have almost the same queue depth utilization but a vast difference in response time… how is this possible?

So what differences are there between pg E and G?

Let’s take one director per port group; 1E is up first.

Queue depth utilization is at around 70%, with 30% busy time. Latencies are roughly 3-4ms for reads and 1.5ms for writes. The read/write ratio is 50:50, with the director handling a total of 1500-2000 IOps and 25-30MB/s.

So how does 1G look?

We can immediately spot a bursty workload with high read latencies. The read/write ratio is on average somewhere between 2:1 and 4:1, going all the way to an 8:1 ratio at the beginning of the graphs. Average IOps are 600-800 IOps with 15-25 MB/s bandwidth.

Looking at these graphs I’d expect a bursty workload with a large amount of reads, probably random reads with a bigger than average I/O size. Let’s check this with a couple more graphs… we have plenty of those!

Storage group graphs

This is the most active storage group on port group E:

If you look at the shape of the graphs they closely resemble the IOps graphs for FA-1E: 50:50 read:write ratio, low latency. About 40% of I/O is random write I/O that hits the cache, 12% of I/O is random read misses (i.e. they have to come from disk), and the remaining 48% is read I/O that hits the cache (either random or sequential). I/O is predominantly small, roughly 20KB.

For the most active storage group on port group G:

Again the shape of the graphs resembles the FA graphs: stable write workload of about 15-30% of total I/O, with a bursty read workload on top. Even though this storage group does about 1/3rd of the amount of IOps compared to the previous storage group, it’s moving quite a lot of MB/s. Response times for the writes are low but the reads are circling 20ms. Looking at the I/O patterns we can see 25% write I/O that hits the cache, roughly 20% random read I/O that misses the cache and the remainder is random read I/O that does hit cache. The read I/O size is roughly 50KB.

Both port groups have almost identical queue depth utilizations but vastly different workloads and response times. Can we dig a bit deeper?

Queue Depth Buckets

Queuing is inevitable. To accurately assess how long the queues are, Enginuity keeps track of the length of the queue. It does this by maintaining a set of Queue Depth buckets, each with its own counter and each reflecting a certain range of queue lengths. The ranges or buckets are:

0 – 0

1 – 0-5

2 – 5-10

3 – 10-20

4 – 20-40

5 – 40-80

6 – 80-160

7 – 160-320

8 – 320-640

9 – >640

So for example: say an FA has no queue and one I/O comes in. This increments the counter of range 0. Another I/O comes in while the first one isn’t processed yet: the queue is 1 so this second I/O increments the bucket belonging to range 1 (which is a queue of 0-5). In a similar manner, let’s say the queue gets longer and reaches a queue length of 60 outstanding I/Os. Another I/O comes in, Enginuity sees the queue is 60 I/Os long and it increments the counter for bucket/range 5. What this does is show you how deep the queue is at certain points in time.
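The counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Enginuity implements it: the names bucket_for and count_buckets are made up, and because the published ranges share their edges (5, 10, 20, …) I am assuming an edge value belongs to the lower bucket.

```python
from bisect import bisect_left
from collections import Counter

# Upper edge of buckets 0..8; anything above 640 lands in bucket 9.
# Assumption: a shared edge value (5, 10, 20, ...) counts toward the lower bucket.
UPPER_EDGES = [0, 5, 10, 20, 40, 80, 160, 320, 640]


def bucket_for(queue_length):
    """Return the bucket number (0-9) for the queue length an arriving I/O sees."""
    if queue_length < 0:
        raise ValueError("queue length cannot be negative")
    return bisect_left(UPPER_EDGES, queue_length)


def count_buckets(queue_lengths):
    """Tally one increment per arriving I/O, keyed by bucket number."""
    return Counter(bucket_for(q) for q in queue_lengths)


# The three arrivals from the example: empty queue, queue of 1, queue of 60.
counts = count_buckets([0, 1, 60])
print(sorted(counts.items()))
```

For the three arrivals in the example this increments buckets 0, 1 and 5 once each, matching the walkthrough.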

Blue is range 0, red is range 1, yellow is range 2, purple 3, orange range 4, etc. What we can see here is that for port group E, FA-3E, the majority of incoming I/Os fall in buckets 0 and 1 (blue and red). This means that the majority of incoming I/O has to wait for at most 5 other I/Os to finish. There is very little I/O hitting the deeper queues belonging to bucket 3+, with the exception of that peak at 13:40.

FA-1G:

Light blue is range 0, red range 1, yellow range 2, purple range 3, orange range 4, dark blue range 5, green range 6. We can see that the spikes are hitting bucket 5 and some are even hitting bucket 6. Hence a lot of I/Os have between 40 and 80 I/Os in front of them in the queue, some even 80-160. Combine this with the fact that this port group sees a lot of (random) read I/O, which is processed more slowly if it isn’t prefetched into cache, and you’ve got an explanation for the higher read latency: even when a single read I/O is serviced in 4ms, if you have 100 read I/Os waiting in line in front of you, you’re going to have a bad response time…
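As a rough cross-check on the two bucket graphs, Little’s Law (average I/Os in flight = arrival rate × average response time) can be applied to the approximate figures quoted earlier. This is only a back-of-the-envelope sketch: the function name and the blended latencies (about 2.5ms for FA-1E, about 15ms for FA-1G) are my own assumptions read off the graphs, not values reported by Unisphere.

```python
def avg_outstanding(iops, latency_ms):
    """Little's Law: mean I/Os in flight = arrival rate x mean response time."""
    return iops * (latency_ms / 1000.0)


# Assumed midpoints of the ranges quoted in the text.
# FA-1E: 1500-2000 IOps, 50:50 mix of ~3.5ms reads and ~1.5ms writes.
fa_1e = avg_outstanding(iops=1750, latency_ms=2.5)
# FA-1G: 600-800 IOps, response times around 15ms.
fa_1g = avg_outstanding(iops=700, latency_ms=15)

print(f"FA-1E: ~{fa_1e:.1f} I/Os in flight on average")
print(f"FA-1G: ~{fa_1g:.1f} I/Os in flight on average")
```

The averages come out at roughly 4 outstanding I/Os for 1E and roughly 10 for 1G, which fits the picture of 1E living mostly in buckets 0 and 1. Averages hide bursts, though, so the spikes into buckets 5 and 6 on 1G only show up in the bucket counters.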

The next step would be to figure out which device is responsible for the spikes in IOps, find out which application or host is causing these spikes and see if we could start tuning these peaks away. But that’s a story for another post…

My two cents…

While the metrics “Queue Depth Count Range x” might be confusing if you see them for the first time, once you know what they mean and what queue depths belong to each range or bucket they become invaluable in sensing how deep your queue is at certain points in time.

I’m slightly skeptical about the “% Queue Depth Utilization” metric: it appears this metric will pretty much always live at a 50%+ value and with the default Unisphere alert settings this will turn your dashboard permanently red even when FA directors are servicing I/O at low latencies. If I had to compare it to another overrated metric it would be the CLARiiON/VNX “LUN Utilization %”, where 100% utilization just meant that 1 disk servicing that LUN was doing something, possibly as little as a continuous 1 IOps. Fortunately you can change the thresholds for these alerts, so I’ve changed the first threshold to 80% and the second to 90%: this correctly flags port group G as warning/yellow and leaves the other port groups alone.