
I/O Reduction and the ZIL

I came across an interesting microbenchmark this week. It shows how some workloads can produce confusing results, or head fakes, that make benchmark results difficult to interpret. In this case, a method we often use for finding the performance envelope of ZFS is not effective.

Before I dive into the microbenchmark, a few words about the ZFS Intent Log (ZIL). ZFS is a transactional file system, which means that it collects I/O into a transaction group (txg) and commits that txg to persistent storage. In later ZFS implementations, that txg commit occurs every 30 seconds. However, if an application needs to ensure that an I/O is written to persistent storage immediately, often called synchronous writes (though that is arguably not the best descriptive term), then waiting for up to 30 seconds is not an option. This is where the ZIL enters the picture. In the synchronous write case, ZFS will write the record to the ZIL and later commit the record with the txg. This ensures the synchronous write agreement between the application and ZFS is honored -- a good thing. Neil Perrin offers a more detailed description in his famous lumberjack blog posting.
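If you want to see the txg cadence for yourself, one simple (and admittedly rough) way is to watch the pool with zpool iostat while ordinary buffered writes trickle in: they tend to show up as periodic bursts when the txg commits, while synchronous writes appear almost immediately as ZIL writes. The pool name "tank" below is just a placeholder.

# Watch pool I/O in 1-second samples; "tank" is a placeholder pool name.
# Light, buffered writes tend to appear as periodic bursts at txg commit
# time, while synchronous writes show up right away as ZIL writes.
zpool iostat tank 1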

Synchronous writes are the bane of high performance. Really. We see this every day. They cause performance guys to gnash their teeth and cuss. When a microbenchmark wanders along and does a lot of synchronous writes, complaints about how "<insert file system here> sucks" and "I can't believe those file system developers could be so insensitive" come pouring forth.

To determine the performance envelope of a benchmark, it is relatively easy to disable the ZIL. This is neither a safe nor recommended option for production systems or people who like their data. But for benchmarking, it allows a performance engineer to quickly determine the best possible performance for the given system configuration. The ZIL is then re-enabled and the work can concentrate on how to approach that performance goal. Tools like zilstat are designed to help with this endeavor, and can save you a lot of time when you suspect synchronous write performance might be an issue.
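For example, a quick zilstat run looks something like this. This is just a sketch: it assumes the zilstat script is installed and that you have the DTrace privileges it needs, and it uses iostat-style interval and count arguments.

# Sample ZIL activity in 10-second intervals, 6 samples.
# Requires DTrace privileges; see the script's usage for other options.
zilstat 10 6

If the ops and bytes columns stay near zero while your workload runs, synchronous writes are probably not your problem.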

But disabling the ZIL can also hide important behavior. That is why this microbenchmark could be a poster child for benchmarking that doesn't do what you expect. Here it is:

while true; do
    echo "blah" > outputfile
done

When run on a Solaris NFS client with a Solaris NFS server using default NFS settings, this will cause the following to occur:

outputfile is LOOKUPed

outputfile is OPENed

ACCESS to outputfile is checked

The data is written to the file with WRITE

The data is COMMITted

outputfile is CLOSEd

This will also, by default, cause the file to be synchronously written, the so-called "sync-on-close" operation.
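One rough way to watch these operations pile up is nfsstat on the Solaris NFS client. The sketch below assumes a default (NFSv4) mount and root privileges for zeroing the counters.

# Zero the NFS client counters, let the loop run for a while, then look at
# the per-operation counts (root is required for the -z option).
nfsstat -z > /dev/null
sleep 30
nfsstat -c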

Argv! This simple microbenchmark actually makes a lot of synchronous writes to the file system. zilstat will happily show that the ZIL is working hard when running this microbenchmark. If you run this, then you can experiment with various pool or separate (ZIL) log configurations to your heart's content.
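For instance, adding a separate log device is a one-liner; the pool name "tank" and device "c4t0d0" below are placeholders for your pool and SSD.

# Add a fast SSD as a separate (ZIL) log device; names are placeholders.
zpool add tank log c4t0d0

# Confirm the log vdev appears in the pool configuration.
zpool status tank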

However, if you disable the ZIL, then the number of I/O operations is reduced to just a handful, every 30 seconds. Why? Because ZFS is clever enough to recognize that the same file is being overwritten and only bothers to physically commit the last version in the transaction group. In other words, the amount of I/O traffic to the pool is dramatically reduced. When this happens, you are no longer measuring the effect of the ZIL I/O; you are measuring only the (much reduced) main pool I/O. The results look something like this:

ZIL enabled, no separate log = 100 iterations/second

ZIL enabled, separate log on a fast SSD = 1,000 iterations/second

ZIL disabled = 10,000 iterations/second

In other words, the effect of eliminating the pool I/O in addition to the ZIL I/O made the system faster! Hurray! But wait just a dog-gone second. That means that the benchmark is basically useless -- it does almost zero physical I/O when the ZIL is disabled. This is kinda like redirecting all of the data to /dev/null -- a fun trick to amuse your friends at parties, but otherwise completely useless.
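If you want to reproduce iterations/second numbers like those above, one crude approach is to time a fixed number of iterations and divide. This sketch assumes a POSIX shell and that outputfile lives on the file system (or NFS mount) under test.

# Time 1,000 iterations of the microbenchmark; divide 1000 by the elapsed
# seconds to get iterations/second. "outputfile" is on the fs under test.
time sh -c 'i=0; while [ $i -lt 1000 ]; do echo "blah" > outputfile; i=$((i+1)); done'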

The moral of this tale is: beware of microbenchmarks and how they can confuse your understanding of the real system behavior.

P.S. Don't disable the ZIL.

P.P.S. I really mean it, don't disable the ZIL. Seriously. I might cut you some slack for benchmark purposes, but other than that, don't disable the ZIL. Period. End of discussion.

Comments

Interesting post, but regarding the ZIL: with one storage server and two servers using NFS, where the storage server is the NFS server and is connected to a UPS that can run for 4 hours in a power outage, should the ZIL still not be disabled? Thanks.

@it4it, even if you have power, that does not preserve the data if the cause of the outage is not the power subsystem. For example, a reset, panic, or catastrophic mobo crash would result in data loss. Do your data a favor, don't disable the ZIL.

