Ramblings from Richard's Ranch

Saturday, October 4, 2014

The other day someone asked on the #zfs IRC (irc.freenode.net) channel about using ZFS at home. As one of the early adopters, I can say it is a great idea! I've been running ZFS at home since late 2005. The first pool of "stuff" I created has been upgraded, expanded, and had its drives replaced. In 2008 I created the latest version of "stuff" as a simple mirrored pair of HDDs. The prior version of "stuff" was transferred to the 2008 pool, which is still in use. Therefore I do not have the actual creation date of the original "stuff," but since I used ZFS send/receive to transfer the datasets, I can definitively say the oldest snapshot was created in July 2006.

# zfs get creation stuff/home@20060715
NAME                 PROPERTY  VALUE                  SOURCE
stuff/home@20060715  creation  Sat Jul 15 20:20 2006  -

I've made many snapshots since and it seems quite impressive to know that I can roll back in time over 8 years to see how my "stuff" has evolved. Let's hear it for long-lived data!

Tuesday, August 5, 2014

One of the nice changes to the kstat (kernel statistics) command in illumos is its conversion from perl to C. There were several areas in the illumos (née OpenSolaris) code where perl had been used. But these were too few to maintain critical mass, and it is difficult for interpreted runtimes to change at the pace of an OS, so keeping the two in lockstep is simply not worthwhile. The few places where parts of illumos were written in perl have therefore been replaced by native C implementations.

The kstat(1m) command rewritten in C was contributed by David Höppner, an active member of the illumos community. It is fast and efficient at filtering and printing kstats. By contrast, the old perl version had to start perl (an interpreter), find and load the kstat-to-perl module, and then filter and print the kstats.

Internal to the kernel, kstats are stored as name-value lists (nvlists) containing strongly-typed data. Many of these values are 64-bit integers. This poses a problem for the version of perl used (5.12), because its 64-bit support depends on how it was compiled, and illumos can be compiled for both 32-bit and 64-bit processors. To compensate for this mismatch, the following was added to the man page for kstat(3perl):

Several of the statistics provided by the kstat facility are stored as 64-bit integer values. Perl 5 does not yet internally support 64-bit integers, so these values are approximated in this module. There are two classes of 64-bit value to be dealt with: 64-bit intervals and times.

These are the crtime and snaptime fields of all the statistics hashes, and the wtime, wlentime, wlastupdate, rtime, rlentime and rlastupdate fields of the kstat I/O statistics structures. These are measured by the kstat facility in nanoseconds, meaning that a 32-bit value would represent approximately 4 seconds. The alternative is to store the values as floating-point numbers, which offer approximately 53 bits of precision on present hardware. 64-bit intervals and timers are stored as floating-point values expressed in seconds, meaning that time-related kstats are rounded to approximately microsecond resolution.

It is not useful to store these values as 32-bit values. As noted above, floating-point values offer 53 bits of precision. Accordingly, all 64-bit counters are stored as floating-point values.
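The arithmetic behind the man page's claims is easy to verify. A quick sketch (my illustration, not from the man page):

```python
# A 64-bit nanosecond counter cannot fit in 32 bits for long:
# 2^32 nanoseconds is only about 4.29 seconds.
print(2**32 / 1e9)              # 4.294967296

# A double-precision float carries 53 bits of mantissa.
import sys
print(sys.float_info.mant_dig)  # 53

# So a timestamp near 2^20 seconds of uptime (about 12 days) still has a
# smallest representable increment of 2^(20-53) seconds -- well under a
# microsecond, hence "approximately microsecond resolution".
print(2.0 ** (20 - 53) < 1e-6)  # True
```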

For consumers of the kstat(1m) command output, this means that kstat I/O statistics are stored as seconds (floating point) instead of nanoseconds. For example, with formatting adjusted for readability:

Perl-based kstat(1m)

# kstat -p sd:0:sd0
sd:0:sd0:class disk
sd:0:sd0:crtime 1855326.99995062
sd:0:sd0:nread 380919301
sd:0:sd0:nwritten 1984175104
sd:0:sd0:rcnt 0
sd:0:sd0:reads 18455
sd:0:sd0:rlastupdate 2371703.49260763
sd:0:sd0:rlentime 147.154123471
sd:0:sd0:rtime 49.399890683
sd:0:sd0:snaptime 2371828.77138052
sd:0:sd0:wcnt 0
sd:0:sd0:wlastupdate 2371703.49174494
sd:0:sd0:wlentime 2.425675727
sd:0:sd0:writes 103602
sd:0:sd0:wtime 1.43643661

C-based kstat(1m)

# kstat -p sd:0:sd0
sd:0:sd0:class disk
sd:0:sd0:crtime 244.271312204
sd:0:sd0:nread 25549493
sd:0:sd0:nwritten 1698218496
sd:0:sd0:rcnt 0
sd:0:sd0:reads 4043
sd:0:sd0:rlastupdate 104543293563241
sd:0:sd0:rlentime 68750036336
sd:0:sd0:rtime 64365048052
sd:0:sd0:snaptime 104509.092582653
sd:0:sd0:wcnt 0
sd:0:sd0:wlastupdate 104543293482995
sd:0:sd0:wlentime 4569934990
sd:0:sd0:writes 289766
sd:0:sd0:wtime 4551425719

I find kstat(1m) output to be very convenient for historical tracking and use it often. If you do too, then be aware of this conversion.
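If you have tooling built around the perl-era seconds, converting the C-based output is a divide by one billion. A minimal sketch, using the rlentime value from the C-based sample output above:

```python
NS_PER_SEC = 1e9

# Time-related I/O kstats from the C-based kstat(1m) are in nanoseconds.
rlentime_ns = 68750036336        # sample value from the output above
rlentime_sec = rlentime_ns / NS_PER_SEC
print(rlentime_sec)              # 68.750036336
```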

One of the best features of the new C-based kstat(1m) is the ability to get kstats as JSON. This is even more useful than the "parseable" output shown previously.

# kstat -j sd:0:sd0
[{
    "module": "sd",
    "instance": 0,
    "name": "sd0",
    "class": "disk",
    "type": 3,
    "snaptime": 104547.013504692,
    "data": {
        "crtime": 244.271312204,
        "nread": 25549493,
        "nwritten": 1700980224,
        "rcnt": 0,
        "reads": 4043,
        "rlastupdate": 104733296813446,
        "rlentime": 68901598866,
        "rtime": 64513785819,
        "snaptime": 104547.013504692,
        "wcnt": 0,
        "wlastupdate": 104733296708770,
        "wlentime": 4579560895,
        "writes": 290404,
        "wtime": 4561051625
    }
}]

Using JSON has the added advantage of being easy to parse without making assumptions about the data. For example, did you know that some kernel modules use ':' in the kstat module, instance, name, or statistic? This makes using the parseable output a little bit tricky. The JSON output is not affected and is readily and consistently readable or storable in the many tools that support JSON.
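As a sketch of how easily the JSON form is consumed (Python with the stdlib json module; the sample here is abbreviated from the output above):

```python
import json

# In practice, you would capture the output of "kstat -j sd:0:sd0";
# here we parse a saved, abbreviated sample instead.
sample = '''[{"module": "sd", "instance": 0, "name": "sd0",
              "class": "disk", "type": 3, "snaptime": 104547.013504692,
              "data": {"nread": 25549493, "reads": 4043, "writes": 290404}}]'''

for ks in json.loads(sample):
    # No worries about ':' embedded in names -- JSON keeps fields distinct.
    print(ks["module"], ks["instance"], ks["name"], ks["data"]["reads"])
```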

Now you can see how to take advantage of the kstat(1m) command and how it has evolved under illumos to be more friendly to building tools and taking quick measurements. Go forth and kstat!

Monday, August 4, 2014

Latency and performance problems in storage subsystems can be tricky to understand and tune. If you've ever been stuck in a traffic jam or waited in line to get into a concert, you know that queues can be frustrating to understand and can try your patience. In modern computing systems, there are many different queues, and any time we must share a constrained resource, one or more queues will magically appear in the architecture. Thus, in order to understand the impacts of the constrained resource, you need to understand the queues.

The Good Old ZFS I/O Scheduler

Looking at the queues, the old ZFS I/O (ZIO) scheduler deserves investigation first, because it still represents the vast majority of the installed base, including all Solaris installations. The new OpenZFS write throttle changes this area of the code and deserves its own treatment in a separate post. Adding a resilvering workload into the mix shows how overall performance is affected by the queues and constrained resources.

The ZIO scheduler sends I/O requests to the virtual devices (vdevs) based upon weighted priority. For old ZFS implementations, the priority table maps an I/O type to a value useful for sorting incoming I/O workload requests. The priority table has grown over time, but the general approach is to favor the "I need it now!" demands over the "process this when you get a chance" workloads. The table usually looked like this, with higher-priority entries having lower numbers:

0 - Priority NOW
0 - Sync read
0 - Sync write
0 - Log write
1 - Cache fill
2 - Dedup Table Prefetch
4 - Free
4 - Async write
6 - Async read
10 - Resilver
20 - Scrub

Background I/Os such as resilver and scrub are scheduled with a lower priority so they do not interfere with more immediate needs, such as sync reads or writes. The ZIO scheduler does a good job scheduling high priority I/Os over resilver and scrub I/Os.
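The effect of the table is essentially a stable sort of the pending I/Os by priority value before issue. A toy sketch (my illustration, not the actual illumos code):

```python
# Lower number = higher priority, matching the table above.
PRIORITY = {
    "sync_read": 0, "sync_write": 0, "log_write": 0,
    "cache_fill": 1, "dedup_prefetch": 2,
    "free": 4, "async_write": 4, "async_read": 6,
    "resilver": 10, "scrub": 20,
}

pending = ["scrub", "sync_read", "resilver", "async_write", "sync_write"]

# sorted() is stable, so equal-priority I/Os keep their arrival order.
issue_order = sorted(pending, key=PRIORITY.get)
print(issue_order)
# ['sync_read', 'sync_write', 'async_write', 'resilver', 'scrub']
```

Notice the resilver and scrub I/Os always land at the back of the line, which is exactly the behavior described above.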

But the ZIO scheduler only works for one queue -- and there are many queues in the system -- some of which don't have priority-based schedulers. And this is where I begin to grumble about queues... grumble, grumble, grumble.

Consider the following simplified model of the ZIO scheduler: a resilver workload, a normal (I need it now!) workload, and a pool built of mirrors.

The ZIO scheduler prioritizes work in the ZIO pipeline. But once the I/Os have been passed along to the vdevs or disks, the ZIO scheduler can do nothing. Also, the typical queue wrapped around a constrained physical storage device, such as a HDD, doesn't have an easy-to-use, priority-based scheduler.

The depth of the queues down the stack can be quite large. It is not uncommon to see HBAs capable of queuing thousands of I/Os. Modern enterprise-grade SSDs can also have thousands of concurrent I/Os. Even modest HDDs can handle multiple concurrent I/Os.

To maintain some control over the number of I/Os passed along to the disks, old ZFS has a kernel tunable named zfs_vdev_max_pending. In very old ZFS, this was set to 35, which turned out to be a bad default choice. In old ZFS, it is 10, a better choice than 35 for many systems. In modern OpenZFS, the ZIO scheduler is replaced with a better algorithm altogether. Concentrating on old ZFS, this means 10 I/Os will be passed along to the disks and the remainder will hang around in a queue to be scheduled by the ZIO scheduler. In the previous queue diagram, this is represented by the sigma (Σ) as the sum of the depths of the queues.

The two-queue model for the disks shown in the diagram also works well for analysis in Solaris-based OSes, where the iostat command has a two-queue model. The internal kstats implement a Riemann sum algorithm with two queues, and this can give some insight into the workings of the I/O subsystem. In iostat, these are known as the wait and active queues, for historical reasons.
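The Riemann sum approach accumulates the integral of queue length over time; dividing the change in that accumulator by the elapsed interval yields the average queue depth. A minimal sketch of the math, with hypothetical snapshot values (both expressed in the same time unit):

```python
def avg_queue_length(lentime_t0, lentime_t1, elapsed):
    """Average queue depth over an interval, from kstat I/O accumulators.

    lentime is the running integral of (queue length x time), so the
    delta divided by the elapsed time is the time-weighted average depth.
    """
    return (lentime_t1 - lentime_t0) / elapsed

# Hypothetical rlentime snapshots taken 10 time units apart:
print(avg_queue_length(100.0, 135.0, 10.0))  # 3.5 I/Os active on average
```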

An important assumption in this model is that the service time for handling the I/Os is constant. If you actually measure this, you will find very different service times for SSDs vs HDDs. For example, most Flash-based SSDs really like many concurrent I/Os and have write pipelines especially well suited for coalescing many write I/Os. For these devices, deeper queues can result in lower per-I/O response time, a very good thing. By contrast, HDDs perform poorly as the queue depth increases, the opposite of Flash SSDs. By default, HDDs buffer writes, so their service time tends to be more consistent. But HDD reads do require physical head movement and media rotation, leading to response times less predictable than writes.

The biggest issue with queuing concurrent I/Os to HDDs is the elevator algorithms used tend to result in a few I/Os getting starved, while the HDD tries to optimize for the majority. Further, these I/Os are not prioritized, so it is quite possible for a ZFS sync read I/O to be starved while the disk optimizes for the resilver I/Os. This starvation is not under control of the OS or ZFS.

What might this starvation look like? Benchmarks and statistics will clearly show us. We know there is a scheduler in the disk device and we hope it will work for us. To find out, we'll create a full-stroke, totally random, 4KB-sized, 100% read or 100% write workload. Our workload includes:

Full-stroke gives us the average worst case workload, but also one validating the vendor's datasheet for average seek and average rotational delay

4KB isn't particularly magic, it is just small enough that we won't reach bandwidth saturation for any component in the system: memory, CPU, PCI Express, HBA, SAS/SATA link, or media bandwidth

100% read workload because these workloads are difficult to cache, thus providing better insight into where the real performance limits lie

100% write workload to show how deferred writes can lull you into a false sense of performance, due to caching

In this workload experiment, we're measuring latency. Bandwidth isn't saturated and IOPS are meaningless. To prove both, we'll take a look at the workload running with a typical enterprise-grade 7,200 rpm HDD.

For this device, the average rotational delay is 4.167ms (7,200 rpm) and the datasheet says the average read seek time is just over 8ms. We expect the average 4KB full-stroke random read to be about 12.5ms or 80 IOPS. And there it is! For a single thread (single concurrent I/O) 100% read workload, the measured response time is close to 12.5 ms and 80 IOPS. Woohoo!

The bandwidth consumed by servicing 80 IOPS at 4KB is a whopping 320 KB/sec or around 2.5 Mbits/sec. This is, at best, about 1/1000th of the bandwidth of the disk interconnect, HBA, media, or memory. Clearly, there is no bandwidth constraint for this experiment.
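The expected numbers fall out of simple arithmetic; a quick sketch (the 8.3 ms seek is my approximation of the datasheet's "just over 8 ms"):

```python
rpm = 7200
avg_rotational_ms = 60.0 / rpm / 2 * 1000  # half a rotation on average
avg_seek_ms = 8.3                          # approximation of the datasheet value
avg_latency_ms = avg_rotational_ms + avg_seek_ms
iops = 1000.0 / avg_latency_ms
bandwidth_kb_s = iops * 4                  # 4 KB per I/O

print(round(avg_rotational_ms, 3))  # 4.167
print(round(avg_latency_ms, 1))     # 12.5
print(round(iops))                  # 80
print(round(bandwidth_kb_s))        # 321
```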

Unfortunately, vendor datasheets are not sufficient for explaining the constraints demonstrated by the experiment. The 100% write workload has both lower latency and higher IOPS than the read workload for all measured numbers of concurrent operations (threads). The writes are also buffered and only committed to permanent media when we issue a synchronize cache command. In this test, we don't issue synchronize cache commands, so the decision of when to commit to media is left to the HDD. We know there is a scheduler ultimately trying to make the best decision on writes to the media, so the latency is a measure of the efficiency of the HDD scheduler under the constraints placed by a continuous stream of small, random writes. The benefits of the scheduler and deferred commits to permanent media for writes are shown by the lower latency and higher IOPS than the 100% read workload.

The clue that there is trouble in paradise is not in the number of measured IOPS. In all cases, the IOPS increases as the concurrency increases -- a good thing. Unfortunately, the latency also increases rather significantly -- a very bad thing.

All IOPS are Not Created Equal

The measurements shown in the previous graph are averages and do not show a linear relationship between IOPS and latency. Also, if we were to try a linear regression, we would have a problem fitting the mixed workload cases. This is evident from the data: a linear relationship could fit each curve reasonably well, but the read and write cases have very different slopes.

We need to dig further into the latency and queues to understand what lies beneath. For this, we examine the latency of each and every I/O operation. I've long been a fan of iosnoop and will draw your attention to Brendan's port of iosnoop to Linux. Now you can run iosnoop on Mac OS X, illumos and Solaris distros, and Linux. Toss in your favorite stats graphing program and you no longer have any excuse for not understanding the distribution of the latency of your storage systems. Let's take a look at our data distributions by read vs write and one vs ten concurrent I/Os. We'll measure in units of microseconds, which is useful for storage devices.

We begin by looking at the Concurrent I/Os=1 cases. For both reads and writes, we see a really nice example of the normal distribution; this is what your stats teacher was trying to drill into your head while you were daydreaming in class. Rejoice when you see these in real life; in computer systems, they are few and far between. From a practical perspective, this means the median and averages are almost identical and common intuition about response time will apply.

The mean of the reads is 11.8 ms, very close to the datasheet prediction of 12.5 ms. Writes show a mean of 5.1 ms, due primarily to the deferred writes and device scheduler optimizations. The 90th percentile for writes is also much better than that of the reads, affirming the benefits of the device's scheduler.

By contrast, issuing 10 concurrent I/Os shows a different picture of the latency. The read workload looks strikingly similar to an exponential distribution. The large difference between mean and median latency is an indicator that the device's scheduling algorithms are starving some reads. The 90th percentile has ballooned to 148 ms, and some poor reads took more than 300 ms.

The write workload still seems normal, but not quite the bell curve we expect from normal distributions. Writes fare better than the reads, with the worst-case write taking slightly more than 100 ms.
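The mean-versus-median check is easy to apply to your own iosnoop traces. A sketch with synthetic latencies standing in for measured data (stdlib only; the distribution parameters are my loose stand-ins for the graphs described above):

```python
import random
import statistics

random.seed(42)

# Stand-ins for measured latencies (ms): roughly normal at queue depth 1,
# roughly exponential at queue depth 10 where some I/Os get starved.
depth1_lat = [random.gauss(11.8, 2.0) for _ in range(10000)]
depth10_lat = [random.expovariate(1 / 40.0) for _ in range(10000)]

for name, lat in (("depth 1", depth1_lat), ("depth 10", depth10_lat)):
    mean = statistics.mean(lat)
    median = statistics.median(lat)
    print(f"{name}: mean={mean:.1f} median={median:.1f}")

# For a normal distribution, mean and median nearly coincide; for an
# exponential one, the mean sits well above the median -- the signature
# of a few badly starved I/Os dragging the average up.
```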

Device Queues and the ZIO Scheduler

Now that we understand how HDDs react to concurrent workloads, we can revisit the ZIO scheduler. The resilver workload is typically described as one or more reads for each write:

Resilver I/Os are constantly running in the background, injecting their workload into the ZIO pipeline along with the "I need it now!" workload

Once issued to the device, the ZIO scheduler cannot adjust the priority

At the device, more concurrent I/Os leads to longer latency, with extreme latency 10x longer than the single-I/O case

To prevent normal workload I/Os from getting stuck behind resilver I/Os after being issued to the device, we need to reduce the depth of the queue at the device by reducing zfs_vdev_max_pending. The ideal value depends on the expected workload. For latency-sensitive workloads, a smaller number results in more consistency and less variance. For bandwidth-intensive workloads, having a few more I/Os can help.

To quiet my grumbling, here are my rules of thumb for zfs_vdev_max_pending:

1 = the lower bound, but exposes all of the serialization in the stack
2 = a good alternative to 1 for HDDs -- offers the disk an opportunity to optimize, but doesn't overwhelm
3 = not a bad choice
4 = upper bound for HDDs, IMHO
16 = good for many Flash SSDs
30+ = good for RAID arrays

Herein lies the dilemma: HDDs like fewer concurrent I/Os, but SSDs like more concurrent I/Os, especially for write-intensive workloads. But there is only one tunable for all disks in all pools on the system. Fortunately, you can change this on the fly, without rebooting, and measure the impact on your workload. For Solaris or older illumos-based distributions, this example sets the value to 2:

echo zfs_vdev_max_pending/W0t2 | mdb -kw

To return to the modern default:

echo zfs_vdev_max_pending/W0t10 | mdb -kw

The effects will take place almost immediately. Monitor the latency changes using iostat -x or another disk latency measure. For direct measurements of all I/Os, as shown previously, consider using iosnoop and a stats package.

In some storage appliances based on ZFS, there are options for setting this value. Consult your documentation for further information. In some cases, the term zfs_vdev_max_pending might be disguised as another name, so look for a tunable related to the number of concurrent I/Os sent to disks.

Understanding queues and their impact on constrained resources can be tricky. Not all of the behaviors follow normal distributions and simple math. Measuring the latency of constrained resources while varying the workload allows you to make adjustments, improving the overall performance of the system. Tuning ZFS device queues is easier when using my tips and proven techniques.

May your traffic jams be short, may you be in the front of the line at the next concert, and may all of your I/O have low latency variance!

Monday, September 24, 2012

The ZFS hybrid storage pool model is very flexible and allows many different combinations of storage technology to be used. This presents a dilemma to the systems architect: what is the best way to build and configure a pool to meet business requirements? We'll discuss modeling ZFS systems and hybrid storage pools at a datacenter scale. The models consider space, performance, dependability, and cost of the storage devices and any interconnecting networks (including SANs). We will also discuss methods for measuring the performance of the system.
I hope to see your smiling face in the audience, or you can register to see the live, streaming video. Sign up today!

Tuesday, September 18, 2012

We are hosting illumos and ZFS day events in San Francisco October 1 - 3, 2012. Our good friends from DDRdrive, Delphix, Joyent, and Nexenta are also sponsoring the event. I will be talking about how to optimize the design of ZFS-based systems and explain how to get the best bang for your buck. Jason and Garrett are also on the speakers list, talking about how illumos has really taken hold as a foundation for building modern businesses.

Wednesday, September 12, 2012

When I originally wrote cifssvrtop (top for CIFS servers), all of the systems I tested with had one thing in common: the workstations (clients) had names. Interestingly, I recently found a case where the workstations are not named, so the results were less useful than normal.

In this case, there are supposed to be 5 clients. But none have workstation names, so they all get lumped together under "".

The fix is, of course, easy and obvious: make an option to discern clients by IPv4 address instead of workstation name. This is more consistent with nfssvrtop and iscsisvrtop, a good thing. Now the output looks like:

2012 Sep 12 19:52:23, load: 2.50, read: 0 KB, write: 1766632 KB

Client        CIFSOPS  Reads  Writes  Rd_bw   Wr_bw  Rd_t  Wr_t  Align%
172.60.0.101      452      0     441      0   27984     0   108     100
172.60.0.104      488      0     473      0   30072     0   101     100
172.60.0.103      505      0     490      0   31068     0   849      99
172.60.0.102      625      0     614      0   38979     0  2710      99
172.60.0.105      792      0     773      0   49002     0  4548      99
all              2864      0    2792      0  177106     0  2030      99

Here we can clearly see the clients separated by IPv4 address. The sorting is by CIFSOPS, which is the easiest way to deal with dtrace aggregations.

To implement this change, I added a new "-w" flag that will print the workstation name instead of the IPv4 address. If you prefer the previous defaults, then feel free to fork it on GitHub.