Now let's run zdb with the new option "-S". We pass in "user:all", where "user" tells zdb that we only want user data blocks (as opposed to both user and metadata) and "all" tells zdb to print out all blocks (skipping any checksum algorithm strength comparisons).
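
The invocation looks something like this (a sketch; 'mypool' is a hypothetical pool name):

# zdb -S user:all mypool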

This displays the signature of each block pointer, where the columns are: level, physical size, number of DVAs, object type, checksum type, compression type, and finally the actual checksum of the block.

So this is interesting, but what could we do with this information? Well, one thing we could do is figure out how much your pool could benefit from dedup. Let's assume that the dedup implementation matches blocks based on the actual checksum, and that any checksum algorithm is strong enough (in reality, we'd need sha256 or stronger). So starting with the above pool and using a simple perl script 'line_by_line_process.pl' (shown at the end of this blog), we find:
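
As a rough sketch of the same idea, you could tally how many blocks share a checksum straight from the zdb output with a shell pipeline like this (assuming the checksum is the last field on each line and 'mypool' is the pool name):

# zdb -S user:all mypool | awk '{print $NF}' | sort | uniq -c | sort -rn | head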

In our trivial case, we can see that we could get a huge win - 87% of the pool can be dedup'd! Upon closer examination, we notice that mkfile writes out all-zero blocks. If you had compression enabled, there wouldn't be any actual blocks for this file. So let's look at a case where just the "ejk.txt" contents are getting dedup'd:

Tuesday Sep 11, 2007

Previously, I did some analysis on NCQ in OpenSolaris. It turned out that to get good multi-stream sequential read performance, you had to disable NCQ via the tunable 'sata_func_enable'. Disabling NCQ actually does two things: 1) it sets the number of concurrent I/Os to 1, and 2) it changes what you send down protocol-wise.

Turns out, the first is all we really need to get good performance for the multi-stream sequential read case, and doing the second actually exposes a bug in certain firmware of certain disks. So I highly recommend the newly added 'sata_max_queue_depth' tunable instead of 'sata_func_enable'. As a reminder, put the following in /etc/system and reboot:

set sata:sata_max_queue_depth = 0x1

An admin command to let you do this on the fly without rebooting would be another step forward, but there are no official plans for that just yet.

As you can guess, this inflation of small reads by the vdev cache would really hurt typical databases, as they do lots of record-aligned random I/Os - and the random I/Os are typically under 16KB (Oracle and Postgres are usually configured with 8KB, JavaDB with 4KB, etc.). So why do we have this inflation in the first place? Turns out it's really important for pre-fetching metadata. One workload that demonstrates this is the multi-stream sequential read workload for FileBench. We can also use the OLTP workload of FileBench to test database performance.

What we changed in order to fix 6437054 was to make the vdev cache only inflate I/Os for *metadata* - not *user* data. You can now see that logic in vdev_cache_read(). This makes sense logically: we can rely on zfetch to pre-fetch user data (which depends more on what the application is doing), and on the vdev cache to pre-fetch metadata (which depends more on where it is located on disk).

Ok, yeah, theory is nice, but let's see some measurements...

OLTP workload

Below are the results from using this profile (named 'db.prof'). This was on a thumper, non-debug bits, ZFS configured as a 46-disk RAID-0, and the recordsize set to 8KB.

OpenSolaris results without the fix for 6437054 (onnv-gate:2007-07-11)

10520.9 ops/s vs. 8543.3 ops/s - over 20% faster! That's a nice out-of-the-box improvement!

Multi-Stream Sequential Read workload

A workaround previously mentioned to get better DB performance was to set 'zfs_vdev_cache_max' to 1 byte (which essentially disables the vdev cache, as the random I/Os are never going to be smaller than that). The problem with this approach is that it really hurts other workloads, such as the multi-stream sequential read workload. Below are the results using the same thumper, non-debug bits, ZFS in a 46-disk RAID-0, checksums turned off, NCQ disabled via 'set sata:sata_func_enable = 0x5' in /etc/system, and using this profile (named 'sqread.prof').
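
(For reference, that old workaround was typically applied with an /etc/system entry along these lines - a sketch, assuming the stock tunable name:)

set zfs:zfs_vdev_cache_max = 0x1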

OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' left at its default value

By disabling the vdev cache, throughput drops from 1787MB/s to 1410MB/s - over a 20% regression. So disabling the vdev cache really hurts here. The nice thing is that with the fix for 6437054, we don't have to - and we get great DB performance too. My cake is tasty.

Tuesday Jun 12, 2007

Sun is known for servers, not laptops. So surely a filesystem designed by Sun would be too powerful and too "heavy" for laptops - the features of a "datacenter" filesystem just wouldn't fit on a laptop. Right? Actually... no. As it turns out, ZFS is a great match for laptops.

Backup

One of the most important things a user needs to do on a laptop is back up their data. Copying your data to DVD or an external drive is one way. ZFS snapshots with 'zfs send' and 'zfs recv' are a better way. Due to its architecture, snapshots in ZFS are very fast and only take up as much space as the data that has changed. For a typical user, taking a snapshot every day, for example, will only take up a small amount of capacity.

So let's start off with a ZFS pool called 'swim' and two filesystems: 'Music' and 'Pictures':
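
That setup might look something like this (a sketch; 'c0d0' and the snapshot name 'monday' are made up for illustration):

# zpool create swim c0d0
# zfs create swim/Music
# zfs create swim/Pictures
# zfs snapshot -r swim@monday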

What's really nice about using ZFS's snapshots is that you only need to send over (and store) the differences between snapshots. So if you're doing video editing on your laptop and have a giant 10GB file, but only change, say, 1KB of data in a given day, with ZFS you only have to send over 1KB of data - not the entire 10GB of the file. This also means you don't have to store multiple 10GB versions (one per snapshot) of the file on your backup device.

You can also back up to an external hard drive. Create a backup pool on the second hard drive, and just 'zfs send/recv' your nightly snapshots.
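
A sketch of that flow, assuming a hypothetical external drive 'c1d0' and the 'monday'/'tuesday' snapshot names from above:

# zpool create backup c1d0
# zfs send swim/Pictures@monday | zfs recv backup/Pictures
# zfs snapshot -r swim@tuesday
# zfs send -i monday swim/Pictures@tuesday | zfs recv backup/Pictures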

Reliability

Since laptops (typically) only have one disk, handling disk errors is very important. Bill introduced ditto blocks to handle partial disk failures. With typical filesystems, if part of the disk is corrupted/failing and that part of the disk stores your metadata, you're screwed. There's no way to access the data associated with the inaccessible metadata, short of restoring from backup. With ditto blocks, ZFS stores multiple copies of the metadata in the pool. In the single-disk case, we strategically store multiple copies of the metadata at different locations on disk (such as at the front and back of the disk). A subtle partial disk failure can make other filesystems useless, whereas ZFS can survive it.

Matt took ditto blocks one step further and allowed the user to apply them to any filesystem's data. What this means is that you can make your more important data more reliable by stashing away multiple copies of your precious data (without muddying your namespace). Here's how you store two copies of your pictures:
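
For example (assuming the 'swim/Pictures' filesystem from above):

# zfs set copies=2 swim/Pictures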

Built-in Compression

With ZFS, compression comes built-in. The current algorithms are lzjb (based on Lempel-Ziv) and gzip. Now it's true that your jpegs and mp4s are already compressed quite nicely, but if you want to save capacity on other filesystems, all you have to do is:
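
For example (a sketch; 'swim/Documents' is a hypothetical filesystem):

# zfs set compression=on swim/Documents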

That single disk stickiness

A major problem with laptops today is the single point of failure: the single disk. It makes complete sense today that laptops are designed this way given the physical space and power issues. But looking forward, as, say, flash gets cheaper and cheaper as well as more reliable, it becomes more and more of a possibility to replace the single disk in laptops. And since you save physical space, you can actually fit more than one flash device in the laptop. Wouldn't it be really cool if you could then build RAID on top of the multiple devices? Introducing some hardware RAID controller doesn't make any sense - but software RAID does.

ZFS allows you to do mirroring as well as RAID-Z (ZFS's unique form of RAID-5) - in software.
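
For example, a two-device mirrored pool might be created like this (a sketch; the device names are hypothetical):

# zpool create swim mirror c0d0 c1d0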

At this point, both clientA and clientB have the same pool imported and both can write to it - however, ZFS is not designed
to have multiple writers (yet), so both clients will quickly corrupt the pool as both have a different view of the pool's state.

Now that we store the hostid in the label and verify the system importing the pool was the last one that accessed the pool, the
poor man's cluster corruption scenario mentioned above can no longer happen. Below is an example using shared storage over iSCSI.
In the example, clientA is 'fsh-weakfish', clientB is 'fsh-mullet'.

First, let's create the pool on clientA (assume both clients are already set up for iSCSI):
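
A sketch of that step (using the iSCSI device name that shows up in the output below):

fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0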

fsh-mullet# zpool import
  pool: i
    id: 8574825092618243264
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        i                                          ONLINE
          c2t01000003BAAAE84F00002A0045F86E49d0    ONLINE
fsh-mullet# zpool import i
cannot import 'i': pool may be in use from other system, it was last accessed by
fsh-weakfish (hostid: 0x4ab08c2) on Tue Apr 10 09:33:07 2007
use '-f' to import anyway
fsh-mullet#

Ok, we don't want to forcibly import the pool until clientA is down. So after clientA (fsh-weakfish) has rebooted,
forcibly import the pool on clientB (fsh-mullet):
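
A sketch of that forced import:

fsh-mullet# zpool import -f i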

One detail I'd like to point out is that you have to be careful about *when* you forcibly import a pool. For instance,
if you forcibly import the pool on clientB *before* you reboot clientA, then corruption can still happen. This is because
the reboot(1M) command cleanly takes down the machine, which means it unmounts all filesystems, and unmounting a
filesystem will write a bit of data to the pool.

Friday Feb 02, 2007

If ZFS detects either a checksum error or a read I/O failure and is not able to correct it (say, by successfully reading from the other side of a mirror), then it will keep a log of the objects that are permanently damaged (perhaps due to silent corruption).

Previously (that is, before snv_57), the output we gave was only somewhat useful:

If you were lucky, the DATASET object number would actually get converted into a dataset name. If it didn't, then you would have to use zdb(1M) to figure out what the dataset name/mountpoint was. After that, you would have to use the '-inum' option to find(1) to figure out what the actual file was (see the OpenSolaris thread on it). While it is really powerful to even have this ability, it would be really nice to have ZFS do all the dirty work for you - we are, after all, shooting for easy administration.
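
The old manual route looked something like this (the object number 12345 and the '/monkey' mountpoint are just for illustration):

# find /monkey -xdev -inum 12345 -print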

For the listings above, we attempt to print out the full path to the file. If we successfully find the full path and the dataset is mounted then we print out the full path with a preceding "/" (such as in the "/monkey/a.txt" example above). If we successfully find it, but the dataset is not mounted, then we print out the dataset name (no preceding "/"), followed by the path within the dataset to the file (see the "monkey/ghost:/e.txt" example above).

If we can't successfully translate the object number to a file path (either due to an error, or because the object doesn't have a real file path associated with it, as is the case for, say, a dnode_t), then we print out the dataset name followed by the object's number (as in the "monkey/dnode:<0x0>" case above). If an object in the MOS gets corrupted, then we print out the special tag of <metadata>, followed by the object number.

Couple this with background scrubbing and you have very impressive fault management and observability. What other filesystem/storage system can give you this ability?

Note: these changes are in snv_57, will hopefully make s10u4, and perhaps even Leopard :)

If you're stuck on old bits (without the above-mentioned changes) and are trying to figure out how to translate object numbers to filenames, then check out this thread.

Tuesday Nov 21, 2006

People are finding that setting 'zil_disable' seems to increase their performance - especially NFS/ZFS performance. But what does setting 'zil_disable' to 1 really do? It completely disables the ZIL. Ok fine, what does that mean?
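
(For reference, the setting in question is typically applied with an /etc/system entry along these lines - a sketch:)

set zfs:zil_disable = 1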

Disabling the ZIL causes ZFS to not immediately write synchronous operations to disk/storage. With the ZIL disabled, synchronous operations (such as fsync(), O_DSYNC, OP_COMMIT for NFS, etc.) will still be written to disk, just with the same guarantees as asynchronous operations. That means ZFS can return success to applications/NFS clients before the data has been committed to stable storage. In the event of a server crash, if the data hasn't been written out to the storage, it is lost forever.

With the ZIL disabled, no ZIL log records are written.

Note: disabling the ZIL does NOT compromise filesystem integrity. Disabling the ZIL does NOT cause corruption in ZFS.

Disabling the ZIL is definitely frowned upon and can cause your applications much confusion. Disabling the ZIL can cause corruption for NFS clients in the case where a reply is sent to the client and the server then crashes before the data is committed to stable storage. If you can't live with this, then don't turn off the ZIL.

All subcommands of zfs(1M) and zpool(1M) that modify the state of the pool get logged persistently to disk. That means no matter where you take your pool or what machine is currently accessing it (such as in the SunCluster failover case), your history follows. Sorta like your permanent record.

Now you have a convenient way of finding out if someone did something bad to your pool...
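
For example (a sketch; 'tank' is a hypothetical pool name and the entries are illustrative):

# zpool history tank
History for 'tank':
2006-11-20.15:01:02 zpool create tank mirror c0d0 c1d0
2006-11-21.09:12:33 zfs create tank/home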

The history log is implemented using a ring buffer of <packed record length, record nvlist> tuples. More details can be found in spa_history.c, which contains the main kernel code changes for 'zpool history'. The history log's size is 1% of your pool, with a maximum of 32MB and a minimum of 128KB. Note: the original creation of the pool via 'zpool create' is never overwritten.

If you add a new subcommand to zfs(1M) or zpool(1M), all you need to do is call zpool_log_history(). If you build a new consumer of 'zpool history' (such as a GUI), then you need to call zpool_get_history() and parse the nvlist. A good example of that is in get_history_one().

In the future, we will add the ability to also log uid, hostname, and zonename. We're also looking at adding "internal events" to the log since some subcommands actually take more than one txg, and we'd like to log history every txg (this would be more for developers and debuggers than admins).

These changes are in snv_51, and I would expect s10u4 (though that schedule hasn't been decided yet).

Monday Aug 07, 2006

As part of its I/O scheduling, ZFS has a tunable called 'zfs_vdev_max_pending'. This limits the maximum number of I/Os we can send down per leaf vdev. This is NOT the maximum per filesystem or per pool. Currently the default is 35. This is a good number for today's disk drives; however, it is not a good number for storage arrays that are really comprised of many disks but exported to ZFS as a single device.

But if you've created, say, a 2-device mirrored pool - where each device is really a 10-disk storage array - and you think that ZFS just
isn't doing enough I/O for you, here's a script to see if that's true:

If you see the "avg pending I/Os" hitting your vq_max_pending limit, then raising the limit would be a good thing. The way to do that used to be per vdev, but we now have a single global way to change all vdevs.
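
A sketch of raising the global tunable (the value 70 is just an example) - either in /etc/system, which takes effect after a reboot:

set zfs:zfs_vdev_max_pending = 70

or on a live system via mdb:

# echo zfs_vdev_max_pending/W0t70 | mdb -kw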

Friday Mar 31, 2006

So you're running some app, and you're curious where ZFS is spending its time... here are some D scripts to figure out how much time
each VOP is taking.

This one (zfs_get_vop_times.d) grabs the number of times each VOP was called, the VOP's average time, and the total time spent in each VOP. This is for all ZFS file systems on the system. It generates output like:

This one (zvop_times_fsid.d) does the same as above, but just for one file system - namely the one you specify via the passed-in FSID ints.

Lastly, this one (zvop_times_fsid_large.d) does the same as above (tracking per FSID), but also spits out the stack and quantize information when a ZFS VOP call goes over X time - where X is passed into the script. This makes it easy to see if there are any really slow calls. It generates output like (skipping the output that's the same as in the above examples):

Yikes! So looking at throughput (number of transactions per second), ZFS is ~16x better than UFS on this benchmark. Ok, ZFS is not this good on every benchmark when compared to UFS, but we rather like this one.

This was run on a 2-way Opteron box, using the same SCSI disk for both ZFS and UFS.

And why the 'lockfs' call, you ask? It ensures that all data is flushed to disk - and measuring how long it takes to do something that doesn't necessarily get flushed is just not legit in this case. Persistent data is good.
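
For reference, the flush itself is just this (a sketch, assuming the UFS file system under test is mounted at /ufs):

# lockfs -f /ufs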