Oracle Blog

John L. Henning

Losing My Fear of ZFS

Abstract

The original ZFS vision was enormously ambitious: End the Suffering, Free Your Mind, providing
simplicity, power, safety, and speed. As is common with most new technologies, this ambitious vision was not completely
fulfilled in the initial versions. Early usage showed that although it did have useful and convenient
features, for some workloads, such as the memory-intensive SPEC CPU benchmarks, there were reasons for concern. Now that
ZFS has had time to grow, more of the vision is fulfilled. This article, told from the personal perspective of one
performance engineer, describes some of the improvements, and provides examples of use.

Can I Please Just Forget About IO? (NO)

As a performance engineer, my primary concern is for the SPEC CPU
benchmarks - which intentionally do relatively little IO. Usually.

To a first approximation, IO can be ignored in this context. Usually.

To a first approximation, it's fine if my ZFS "knowledge" is limited to rumors / innuendo as quoted above.
Until....

Until there comes the second approximation, the re-education, and the beginner loses some fear of ZFS.

Why a SPEC CPU Benchmarker Might Care About IO

Although the SPEC CPU benchmarks intentionally try to avoid doing IO, some amount inevitably remains. An
analysis of the IO in the
benchmarks shows that one benchmark, 450.soplex, reads 300 MB of data. Most of that comes from a single 1/4 GB file,
ref.mps, which is read during the second invocation of the benchmark.

Given the speed of today's disk drives, is that a lot? Using an internal drive (Seagate ST973402SSUN72G),
a T5220 with a Niagara 2 processor reads the 1/4 GB file at about 50 MB/sec. It takes about 5.5 seconds to read one copy
of the file, which is a tiny amount of time compared to how long it takes to run one copy of the actual benchmark - about
3000 seconds.

But 1/4 GB becomes a concern when one takes into account that we do not, in fact, read one copy of the
file when testing SPEC CPU2006, because we are interested in the SPECrate metrics, which run
multiple copies of the benchmarks. On a single-chip T5220 system, which supports 64 threads, 63 copies of the benchmark
are run. An 8-chip M5000, which supports 8 threads per chip, also runs 63 copies.

On such systems, it is not uncommon to see 10 to 30 minutes of time when the CPU is sitting idle
- which is not the desired behavior for a CPU benchmark.

For example, on the M5000, as shown in the graph below, it takes about 18 minutes before the CPU reaches
the desired 99% User time. During that 18 minutes, a single disk with ufs on it is, according to iostat, 100%
busy. It reads about 16 MB/sec, doing about 725 reads/sec.
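As a rough sanity check on those numbers, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope model, not a measurement: it assumes the input file is 267 MB (its uncompressed size), that every copy reads its own file in full from disk with no help from caching, and that the disk sustains the quoted rate the whole time. The function name is made up for this illustration.

```python
def read_minutes(copies, file_mb, rate_mb_per_sec):
    """Minutes needed to read `copies` copies of a file at a steady rate."""
    return copies * file_mb / rate_mb_per_sec / 60.0

# One copy on the T5220 internal drive at about 50 MB/sec.
single_copy_seconds = read_minutes(1, 267, 50) * 60

# 63 copies on the M5000 ufs disk at about 16 MB/sec.
m5000_minutes = read_minutes(63, 267, 16)

print(f"one copy:  {single_copy_seconds:.1f} seconds")
print(f"63 copies: {m5000_minutes:.1f} minutes")
```

Under those assumptions the 63-copy estimate comes out at roughly 17.5 minutes, which is in line with the 18 minutes observed before the CPU gets busy.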

Note that in this graph, and all other graphs in this article, the program being tested is only one
of the benchmarks drawn from a larger suite, and only one of its inputs. Therefore, no statements in this article should
be taken as indicative of "compliant" (aka "reportable") runs of an entire suite. SPEC and the benchmark names SPECfp and
SPECint are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPEC and
the CPU benchmarks, see www.spec.org/cpu2006.

ZFS Makes its Dramatic Entrance

Although this tester has heard concerns raised by people who have passed along rumors of ZFS limitations,
there have been other teachers who have sung its praises, including one who has pointed out that 450.soplex's 1/4 GB input
file is highly compressible, going from 267 MB to 20 MB with gzip.
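A quick check of what those two sizes imply, sketched in Python. The only inputs are the 267 MB and 20 MB figures quoted above; the function name is invented for this illustration, and the result says nothing about how ZFS's own gzip setting lays out blocks on disk.

```python
def io_saved_fraction(original_mb, compressed_mb):
    """Fraction of bytes that no longer have to be read after compression."""
    return 1.0 - compressed_mb / original_mb

saved = io_saved_fraction(267, 20)
ratio = 267 / 20

print(f"compression ratio: {ratio:.1f}x")
print(f"bytes avoided:     {saved:.1%}")
```

If the on-disk savings are similar to what gzip achieves on the file, a bit more than nine tenths of the read traffic disappears.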

The best IO is the IO that you never have to do at all. By using the ZFS compression feature, we can make
90% of the IO go away:

The careful reader may note that there are actually two lines on the far left: one measured
with Solaris 10 Update 7, the other with Solaris Express. The version of Solaris did not appear to be a significant
variable for the tests reported in this paper, as can be seen by the fact that the two lines are right on top of each
other.

What About Memory Consumption?

Although ZFS has done a great job above, what about its memory consumption? Concerns have been raised
that it is memory-hungry, and indeed the "Best Practices" Guide plainly says that it will use all the memory on the
system if it thinks it can get away with it:

The ZFS adaptive replacement cache (ARC) tries to use most of a system's available memory to cache
file system data. The default is to use all of physical memory except 1 Gbyte. As memory pressure increases, the ARC
relinquishes memory.

ZFS memory usage is an important concern when running the SPEC CPU benchmarks, which are designed to
stress the CPU, the memory system, and the compiler. Some of the benchmarks in the suite use just under 1 GB of physical
memory, and it is desirable to run (n - 1) copies on a system with (n) threads and
(n) GB of memory. Fortunately, there is a tuning knob available to control the size of the ARC:
set zfs:zfs_arc_max = 0x(size) can be added to /etc/system.
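Since the setting wants the size as a hexadecimal byte count, a tiny helper can produce the line. This is a minimal sketch: the helper name is made up, and the 4 GiB value is purely illustrative, since the ARC limit actually used for the tests is not stated here.

```python
GIB = 1024 ** 3

def arc_max_line(size_bytes):
    """Format an /etc/system line that caps the ZFS ARC at size_bytes."""
    if size_bytes <= 0:
        raise ValueError("ARC size must be a positive number of bytes")
    return f"set zfs:zfs_arc_max = 0x{size_bytes:x}"

# Illustrative value only: cap the ARC at 4 GiB.
print(arc_max_line(4 * GIB))
```

As with other /etc/system settings, the change takes effect at the next reboot.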

The tests reported on this page all use a limited ARC.

It should also be noted that all tests are done after a fresh reboot, so presumably the ARC is not
contributing to the reported performance. More details about methods may be found at the end of the article.

ZFS on T5440: Good, But Not As Dramatic

Although the above simple commands are enough to remove the IO idle time on the M5000, for the 4-chip
T5440 there is a bigger problem: this system supports 256 threads, and 255 copies of the benchmark are run. Therefore, it
needs to quickly inhale on the order of 64 GB.

A somewhat older RAID system was made available for this test: an SE3510 with 12x 15K 72GB drives. Using
this device with ufs, it takes 30 minutes before the system hits the maximum user time, as shown by the line on the right
in the graph below:

In the ufs test above, the SE3510 is configured as 12x drives in a RAID-5 logical drive, with a simple ufs
filesystem (newfs -f 8192 /dev/dsk/c2t40d0s6). Despite the large number of drives, the SE3510 sustains a steady
read rate of only about 45 MB/sec, processing about 3000 IO/sec according to iostat. (Aside: the IO expert may
question why the hardware RAID provides only 45 MB/sec, but please bear in mind we are following the path of the IO
beginner here. This topic is revisited below.)

The zfs file system reads about 16 MB/sec, doing about 4500 IO/sec, but takes less than 1/2 as long to peg
the CPU, since it is reading compressed data.

The zfs file system also used an SE3510 with SUN72G 15k RPM drives. On that unit, 12 individual "NRAID"
drives were created, and made visible to the host as 12 separate units. Then, 10 of them were strung together as zfs
RAID-Z using:

A kind ZFS expert notes that "with RAID-Z the disks are saturated delivering >400 iops. The problem of
RAID-Z is that those iops carry small amount of data and throughput is low." For more information, see this popular
reference: https://blogs.oracle.com/roch/entry/when_to_and_not_to.
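The expert's point can be seen in the iostat figures quoted in this article by dividing throughput by operations per second. The sketch below does only that division; it assumes 1 MB = 1024 KB and treats the quoted rates as steady averages, and the function name is invented for this illustration.

```python
def kb_per_io(mb_per_sec, ios_per_sec):
    """Average payload per IO operation, in KB."""
    return mb_per_sec * 1024 / ios_per_sec

observations = {
    "M5000, single ufs disk":  (16, 725),
    "T5440, SE3510 RAID-5 ufs": (45, 3000),
    "T5440, SE3510 RAID-Z zfs": (16, 4500),
}

for name, (mb, ios) in observations.items():
    print(f"{name}: {kb_per_io(mb, ios):.1f} KB per IO")
```

On these figures the RAID-Z configuration moves only a few KB per operation, several times less than either ufs configuration, which fits the description of many small IOs and low throughput.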

A secondary reason might be that as the reads are done, ZFS is decompressing the gzip'd data on a system
where single thread performance is much slower than the one in Graph #2. On the M5000,
'gunzip ref.mps' requires about 2 seconds of CPU time; on the T5440, about 7 seconds. It should be
emphasized that this is only a secondary concern for the read statistics described in this article, although it can become
more important for write workloads, since compression is harder than decompression. Doing 'gzip ref.mps'
takes ~12 seconds on the M5000, and ~51 seconds on the T5440. Furthermore, although the T5440 has 256 threads available,
as of Solaris 10 s10s_u7, and Solaris Express snv_112, it is only willing to spend 8 threads doing gzip/gunzip operations.
(This limitation may change in a future version of Solaris.)
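To get a feel for what the 8-thread limit could mean, here is a hedged estimate in Python. It leans on two assumptions that are not verified here: that the in-kernel gzip work costs about as much CPU per file as the user-level gzip and gunzip timings quoted above, and that the work spreads perfectly over the 8 threads. Treat the result as a rough lower bound, not a measurement. The function name is made up for this illustration.

```python
def wall_seconds(copies, cpu_seconds_per_copy, threads):
    """Lower bound on elapsed time when CPU work is spread over a fixed thread count."""
    return copies * cpu_seconds_per_copy / threads

# T5440: 255 copies, about 7 CPU seconds to gunzip and 51 to gzip one file.
decompress = wall_seconds(255, 7, 8)
compress = wall_seconds(255, 51, 8)

print(f"decompress 255 copies: about {decompress / 60:.1f} minutes")
print(f"compress 255 copies:   about {compress / 60:.1f} minutes")
```

Under those assumptions decompression adds a few minutes at most, while compressing the same data would take much longer, which is consistent with compression mattering more for write workloads.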

Solution: Mirrors, No Gzip

The kind ZFS expert suggested trying mirrored drives without gzip. When this is done, the %b (busy) time,
which is about 90% in the iostat report just above, changes to 98-100%. The %w (queue non-empty) time, which shows
wide variability just above, also pushes 90-100%. Because we are reading much more data, elapsed time is actually slower
- the red line in the graph below:

Adding 12 more drives, configured as 8x three-way mirrors, does the trick: the leftmost line shows the
desired slope. We spend about 3-4 minutes reading the file, an acceptable amount given that the benchmark as a whole runs
for more than 120 minutes.
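For a sense of scale, the aggregate read rate implied by "3-4 minutes" can be estimated as follows. This sketch assumes that each of the 255 copies reads its own uncompressed 267 MB file in full and that nothing is served from cache; the function name is invented for this illustration.

```python
def required_mb_per_sec(copies, file_mb, minutes):
    """Aggregate read rate needed to deliver every copy's file within `minutes`."""
    return copies * file_mb / (minutes * 60.0)

total_gb = 255 * 267 / 1000.0
fast = required_mb_per_sec(255, 267, 3)
slow = required_mb_per_sec(255, 267, 4)

print(f"total data: about {total_gb:.0f} GB")
print(f"needed rate: about {slow:.0f} to {fast:.0f} MB/sec")
```

That works out to somewhere around 280 to 380 MB/sec, several times the 45 MB/sec seen with the hardware RAID-5 ufs configuration.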

The command creates 3-way mirrors, splitting each mirror across the two available controllers (c2 and c3).
There are 8 of these 3-way mirrors, and zfs will dynamically stripe data across the 8 mirrors.
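The layout described above can be sketched in Python. To be clear, this is not the command used for the tests: the device names (c2t0d0 and so on) and the pool name "tank" are hypothetical placeholders, and the only thing taken from the text is the shape, namely 24 disks on two controllers grouped into 8 three-way mirrors, each mirror touching both controllers. The generated string follows the usual "zpool create pool mirror disk disk disk mirror ..." form.

```python
from itertools import chain

def mirror_layout(controllers=("c2", "c3"), disks_per_controller=12, way=3):
    """Group disks into `way`-wide mirrors, alternating between controllers."""
    # Hypothetical device names; real target numbers will differ.
    per_controller = [
        [f"{ctrl}t{target}d0" for target in range(disks_per_controller)]
        for ctrl in controllers
    ]
    # Interleave the controllers: c2t0d0, c3t0d0, c2t1d0, c3t1d0, ...
    interleaved = list(chain.from_iterable(zip(*per_controller)))
    # Cut the interleaved list into consecutive groups of `way` disks.
    return [interleaved[i:i + way] for i in range(0, len(interleaved), way)]

layout = mirror_layout()
command = "zpool create tank " + " ".join(
    "mirror " + " ".join(members) for members in layout
)
print(command)
```

Because the disks are interleaved before grouping, every group of three contains at least one disk from each controller, so no single controller holds all copies of any block.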

Were These Tests Fair?

The hardware IO expert may be bothered by the data presented here for the RAID-5 ufs configuration. Why
would the hardware RAID system, with 12x drives, deliver only 45 MB/sec? In addition, it may seem odd that the tests use
a RAID device which is now 5 years old, and compare it versus contemporary ZFS.

This is a fair point. In fact, a more modern RAID device has been observed delivering 97 MB/sec to
450.soplex, although with a very different system under test.

On the other hand, it should be emphasized that all the T5440 tests reported in
this article used SE3510/SUN72G/15k. For the ufs tests, the SE3510 on-board software did the RAID-5 work. For the zfs
tests, the SE3510 simply presented its disks to the Solaris system as 12 separate ("NRAID") logical units, and zfs did the
RAID-Z and mirroring work.

Could there be something wrong with the particular SE3510 used for ufs? That seems unlikely. Although Graph 3 compares two different SE3510s (both connected to the same HBA, both configured with SUN72G 15k
drives), a later test repeated the RAID-5 run on the exact same SE3510 unit as had been used for zfs. The time did not
improve.

Is it possible that the SE3510 was mis-configured? Maybe. The author does not claim to be an IO expert,
and, in particular, relied on the SE3510 menu system, not its command line interface (sccli). The menus provide
limited access to disk block size setting, and the tester did not at first realize that the disk block size depends on
this other parameter .... located over here in the menus ...

For this particular controller, default block sizes are controlled indirectly by whether this setting is
yes or no. Changing it to "No" makes the default block size larger (128 KB rather than 32 KB). Once this was discovered, various
tests were repeated. The hardware RAID-5 test was repeated with explicit selection of a larger size; however, it did not
improve. On the other hand, the NRAID devices, controlled by zfs, did improve.

Finally, in order to isolate any overhead from RAID-5, the SE3510 was configured as 12x drives in a
RAID-0 stripe (256 KB stripe size). The time required to start 450.soplex was still over 30 minutes.

YMMV

As usual, your mileage may vary depending on your workload. This is especially true for IO workloads.

Summary / Basic Lessons

Some basic lessons about ZFS emerge:

1) ZFS can be easily taught not to hog memory.

2) Selecting gzip compression can be a big win, especially on systems with relatively faster CPUs.

3) Setting up mirrored drives with dynamic striping is straightforward.

4) ZFS is not so scary, after all.

Notes on Methods

During an actual test of a "reportable" run of SPEC CPU2006, file caches are normally not effective for
450.soplex, because its data files are set up many hours prior to their use, with many intervening programs competing for
memory usage. Therefore, for all tests reported here, it was important to avoid unwanted file caching effects that would
not be present in a reportable run, which was accomplished as summarized below:

The tests on the M5000 used 72GB 10K RPM disks. The ufs disk was a FUJITSU MBB2073RCSUN72G (SAS); the zfs
disk was a SEAGATE ST973401LSUN72G (Ultra320 SCSI). The tests on the T5440 used 72GB 15K RPM disks: FUJITSU MAU3073FCSUN72
(Fibre Channel).

Acknowledgments.

My IO teachers include Senthil Ramanujam, Cloyce Spradling, and Roch Bourbonnais, none of whom
should be blamed for this beginner's ignorance. Karsten Guthridge was the first to point out the
usefulness of ZFS gzip compression for 450.soplex.

Very interesting. I'd also be interested in seeing whether lzjb (the default ZFS compression) would give a different result. It isn't as aggressive as gzip at shrinking the data, but it isn't as CPU-intensive either.