Blog for scottdickson

Saturday Nov 05, 2011

Just a note that the slides are now available from our ZFS in the cloud session at OpenWorld. Tom Shafron, CEO of Viewbiquity, and I presented on the new features in ZFS and how Viewbiquity is using them to provide a cloud-based data storage system.

From our abstract: Oracle Solaris ZFS, a key feature of Oracle Solaris, integrates the
concepts of a file system and volume management capabilities to deliver
simple, fast data management. This session provides a case study of
Viewbiquity, a provider of market-leading, innovative M2M platforms. Its
cloud-based platform integrates command and control, video, VoIP, data
logging and management, asset tracking, automated responses, and
advanced process automation. Viewbiquity relies on Oracle Solaris ZFS
and Oracle Solaris for fast write with integrated hybrid traditional and
solid-state storage capabilities, snapshots for backups, built-in
deduplication, and compression for data storage efficiency. Learn what
Oracle Solaris ZFS is and how you can deploy it in high-performance,
high-availability environments.

Monday Dec 14, 2009

Just for grins, I thought it would be fun to do some "extreme" deduping. I started out by creating a pool from a pair of mirrored drives on a system running OpenSolaris build 129. We'll call the pool p1. Notice that everyone agrees on the size when we first create it: zpool list, zfs list, and df -h all show 134G available, more or less. Notice also that when we created the pool, we turned deduplication on from the very start.
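The setup might look something like this; the disk names are illustrative, not the ones from my actual system:

```shell
# Create a mirrored pool with dedup enabled from the start
zpool create p1 mirror c2t0d0 c2t1d0
zfs set dedup=on p1

# At this point all three commands agree on the size
zpool list p1
zfs list p1
df -h /p1
```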

So, what if we start copying a file over and over? Well, we would expect that to dedup pretty well. Let's get some data to play with. We will create a set of 8 files, each one being made up of 128K of random data. Then we will cat these together over and over and over and over and see what we get.
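Something along these lines, with /dev/urandom assumed as the source of random data:

```shell
# Eight 128K files of random data
cd /p1
i=1
while [ $i -le 8 ]; do
    dd if=/dev/urandom of=b$i bs=128k count=1
    i=`expr $i + 1`
done

# Concatenate them, then keep quadrupling the result
cat b1 b2 b3 b4 b5 b6 b7 b8 > f0
n=0
while [ $n -lt 8 ]; do
    cat f$n f$n f$n f$n > f`expr $n + 1`
    n=`expr $n + 1`
done
```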

Why choose 128K for my file size? Remember that we are trying to deduplicate as much as possible within this dataset. As it turns out, the default recordsize for ZFS is 128K. ZFS deduplication works at the ZFS block level. By selecting a file size of 128K, each of the files I create fits exactly into a single ZFS block. What if we picked a file size that was different from the ZFS block size? The blocks across the boundaries, where each file was cat-ed to another, would create some blocks that were not exactly the same as the other boundary blocks and would not deduplicate as well.

Here's an example. Assume we have a file A whose contents are "aaaaaaaa", a file B containing "bbbbbbbb", and a file C containing "cccccccc". If our blocksize is 6, while our files all have length 8, then each file spans more than 1 block.

The combined contents of the three files span 4 blocks. Notice that the only block in this example that is replicated is block 4 of f1 and block 4 of f2. The other blocks all end up being different, even though the files were the same. Think about how this would work as the number of files grew.
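You can see the layout with ordinary tools; here fold stands in for the block splitter, using the 6-byte blocks from the example:

```shell
# Three 8-byte files, split into "blocks" of 6 bytes
printf 'aaaaaaaa' > A
printf 'bbbbbbbb' > B
printf 'cccccccc' > C
cat A B C > f1

# Each 6-byte line below is one "block" of the combined file
fold -b -w 6 f1
# aaaaaa
# aabbbb
# bbbbcc
# cccccc
```

Only the last block lines up with a file boundary; the middle blocks straddle two files and so would never match the blocks of the individual files.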

So, if we want to make an example where things are guaranteed to dedup as well as possible, our files need to always line up on block boundaries (remember, we're not trying to be realistic here - we're trying to get silly dedupratios). So, let's create a set of files that all match the ZFS blocksize. We'll just create files b1-b8 full of blocks of /dev/

Somehow, df now believes that the pool is 422GB instead of 134GB. Why is that? Well, rather than reporting the amount of available space by subtracting used from size, df now calculates its size dynamically as the sum of the space used plus the space available. We have lots of space available since we have many many many duplicate references to the same blocks.

zpool list tells us the actual size of the pool, along with the amount of space that it views as being allocated and the amount free. So, the pool really has not changed size. But the pool says that 225M are in use. Metadata and pointer blocks, I presume.

Notice that the dedupratio is 299594! That means that on average, there are almost 300,000 references to each actual block on the disk.

One last bit of interesting output comes from zdb. Try zdb -DD on the pool. This will give you a histogram of how many blocks are referenced how many times. Not for the faint of heart, zdb will give you lots of ugly internal info on the pool and datasets.
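For reference, the commands involved look like this (with the example pool name p1):

```shell
# dedupratio is a column in zpool list, and also a pool property
zpool list p1
zpool get dedupratio p1

# Histogram of how many blocks are referenced how many times,
# plus plenty of other internal detail about the pool and datasets
zdb -DD p1
```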

So, what's my point? I guess the point is that dedup really does work. For data that has a commonality, it can save space. For data that has a lot of commonality, it can save a lot of space. With that come some surprises in terms of how some commands have had to adjust to changing sizes (or perceived sizes) of the storage they are reporting.

My suggestion? Take a look at zfs dedup. Think about where it might be helpful. And then give it a try!

Our topic this time was "What's New In ZFS" and we talked about some of the new features that have gone into ZFS recently, especially DeDupe. George Wilson of the ZFS team was kind enough to share some slides that he had been working on and they are posted here.

Our next meeting will be Tuesday, January 12 at GCA. Details and info can be found on the ATLOSUG website at http://hub.opensolaris.org/bin/view/User+Group+atl-osug/

Tuesday Dec 23, 2008

A Different Approach

A week or so ago, I wrote about a way to get around the current limitation of mixing flash and ZFS root in Solaris 10 10/08.
Well, here's a much better approach.

I was visiting with a customer last week and they were very excited to move forward quickly with ZFS boot in their Solaris 10
environment, even to the point of using this as a reason to encourage people to upgrade. However, when they realized that
it was impossible to use Flash with Jumpstart and ZFS boot, they were disappointed. Their entire deployment infrastructure
is built around using not just Flash, but Secure WANboot. This means that they have no alternative to Flash; the images deployed
via Secure WANBoot are always flash archives. So, what to do?

It occurred to me that in general, the upgrade procedure from a pre-10/08 update of Solaris 10 to Solaris 10 10/08 with a
ZFS root disk is a two-step process. First, you have to upgrade to Solaris 10 10/08 on UFS and then use lucreate
to copy that environment to a new ZFS ABE. Why not use this approach in Jumpstart?

Turns out that it works quite nicely. This is a framework for how to do that. You likely will want to expand on it, since
one thing this does not do is give you any indication of progress once it starts the conversion. Here's the general approach:

Create your flash archive for Solaris 10 10/08 as you usually would. Make sure you include all the appropriate LiveUpgrade
patches in the flash archive.

Use Jumpstart to deploy this flash archive to one disk in the target system.

Use a finish script to add a conversion program to run when the system reboots for the first time. It is necessary to make
this script run once the system has rebooted so that the LU commands run within the context of the fully built
new system.

Details of this approach

Our goal when complete is to have the flash archive installed as it always has been, but to have it running from a ZFS root
pool, preferably a mirrored ZFS pool. The conversion script requires two phases to complete this conversion. The first phase
creates the ZFS boot environment and the second phase mirrors the root pool. In this example, our flash archive
is called s10u6s.flar. We will install the initial flash archive onto the disk c0t1d0 and build our
initial root pool on c0t0d0.

We specify a simple finish script for this system to copy our conversion script into place:

cp ${SI_CONFIG_DIR}/S99xlu-phase1 /a/etc/rc2.d/S99xlu-phase1

You see what we have done: We put a new script into place to run at the end of rc2 during the first boot.
We name the script so that it is the last thing to run. The x in the name makes sure that this will
run after other S99 scripts that might be in place. As it turns out, the luactivate that we will
do puts its own S99 script in place, and we want to come after that. Naming ours S99x makes it happen later in the
boot sequence.

So, what does this magic conversion script do? Let me outline it for you:

Create a new ZFS pool that will become our root pool

Create a new boot environment in that pool using lucreate

Activate the new boot environment

Add the script to be run during the second phase of the conversion

Clean up a bit and reboot
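The steps above might be sketched like this, using the disk layout from the example (flash archive on c0t1d0, new root pool on c0t0d0). The boot environment names and the staging path for the phase-2 script are hypothetical, and a real script would add error checking and some progress reporting:

```shell
#!/bin/sh
# /etc/rc2.d/S99xlu-phase1 -- phase 1 of the ZFS root conversion.
# Runs once at the end of the first boot after Jumpstart.

# 1. Create the pool that will become the root pool
zpool create rpool c0t0d0s0

# 2. Copy the running UFS boot environment into the pool
#    (-c names the current BE, -n the new one, -p the target pool)
lucreate -c ufsBE -n zfsBE -p rpool

# 3. Make the new ZFS boot environment the one to boot next
luactivate zfsBE

# 4. Stage the phase-2 script to run after the next boot
#    (staging path is hypothetical)
cp /var/tmp/S99xlu-phase2 /etc/rc2.d/S99xlu-phase2

# 5. Clean up this script and reboot into the new environment
rm /etc/rc2.d/S99xlu-phase1
init 6
```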

That's Phase 1. Phase 2 has its own script to be run at the same time that finishes the mirroring of the root pool.
If you are satisfied with a non-mirrored pool, you can stop here and leave phase 2 out. Or you might prefer to make
this step a manual process once the system is built. But, here's what happens in Phase 2:

Delete the old boot environment

Add a boot block to the disk we just freed. This example is SPARC, so use installboot. For x86, you
would do something similar with installgrub.

Attach the disk we freed from the old boot environment as a mirror of the device used to build the new
root zpool.

Clean up and reboot.
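A sketch of the phase-2 script, again assuming the example disks and the hypothetical BE names from phase 1:

```shell
#!/bin/sh
# /etc/rc2.d/S99xlu-phase2 -- phase 2: mirror the new root pool.

# 1. Delete the old UFS boot environment, freeing its disk
ludelete ufsBE

# 2. Put a boot block on the freed disk (SPARC shown; on x86 you
#    would use installgrub with the GRUB stage1/stage2 files)
installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk \
    /dev/rdsk/c0t1d0s0

# 3. Attach the freed disk as a mirror of the root pool device
zpool attach rpool c0t0d0s0 c0t1d0s0

# 4. Clean up and reboot
rm /etc/rc2.d/S99xlu-phase2
init 6
```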

I have been thinking it might be worthwhile to add a third phase to start a zpool scrub, which will force
the newly attached drive to be resilvered when it reboots. The first time something goes to use this drive, it will
notice that it has not been synced to the master drive and will resilver it, so this is sort of optional.

The reason we add bootability explicitly to this drive is because currently, when a mirror is attached to a root zpool,
a boot block is not automatically installed. If the master drive were to fail and you were left with only the mirror,
this would leave the system unbootable. By adding a boot block to it, you can boot from either drive.

So, here's my simple little script that got installed as /etc/rc2.d/S99xlu-phase1. Just to make the code a
little easier for me to follow, I first create the script for phase 2, then do the work of phase 1.

I think that this is a much better approach than the one I offered before, using ZFS send. This approach
uses standard tools to create the new environment and it allows you to continue to use Flash as a way to
deploy archives. The dependency is that you must have two drives on the target system. I think that's
not going to be a hardship, since most folks will use two drives anyway. You will have to keep them as separate
drives rather than using hardware mirroring. The underlying assumption is that you previously used SVM or VxVM
to mirror those drives.

So, what do you think? Better? Is this helpful? Hopefully, this is a little Christmas present for
someone! Merry Christmas and Happy New Year!

Friday Dec 05, 2008

Ancient History

Gather round kiddies and let Grandpa tell you a tale of how we used to clone systems before we had Jumpstart and Flash,
when we had to carry water in leaky buckets 3 miles through snow up to our knees, uphill both ways.

Long ago, a customer of mine needed to deploy 600(!) SPARCstation 5 desktops all running SunOS 4.1.4. Even then, this was
an old operating system, since Solaris 2.6 had recently been released. But it was what their application required.
And we only had a few days to build and deploy these systems.

Remember that Jumpstart did not exist for SunOS 4.1.4, and Flash did not exist for Solaris 2.6. So, our approach was to
build a system, a golden image, the way we wanted it deployed, and then use ufsdump to save the contents of the filesystems.
Then, we were able to use Jumpstart from a Solaris 2.6 server to boot each of these workstations. Instead of having a
Jumpstart profile, we only used a finish script that partitioned the disks and restored the ufsdump images.
So Jumpstart just provided us a clean way to boot these systems and apply the scripts we wanted to them.

Solaris 10 10/08, ZFS, Jumpstart and Flash

Now, we have a bit of a similar situation. Solaris 10 10/08 introduces ZFS boot to Solaris, something that many of my
customers have been anxiously awaiting for some time. A system can be deployed using Jumpstart and the ZFS boot environment
created as a part of the Jumpstart process.

But. There's always a but, isn't there.

But, at present, Flash archives are not supported (and in fact do not work) as a way to install into a ZFS boot environment,
either via Jumpstart or via Live Upgrade. Turns out, they use the same mechanism under the covers for this. This is CR 6690473.

So, how can I continue to use Jumpstart to deploy systems, and continue to use something akin to Flash archives to speed
and simplify the process?

Build a "Golden Image" System

The first step, as with Flash, is to construct a system that you want to replicate. The caveat here is that you use ZFS for
the root of this system. For this example, I have left /var as part of the root filesystem rather than a separate
dataset, though this process could certainly be tweaked to accommodate a separate /var.

Once the system to be cloned has been built, you save an image of the system. Rather than using flarcreate, you will create a
ZFS send stream and capture this in a file. Then move that file to the jumpstart server, just as you would with a flash archive.

In this example, the ZFS bootfs has the default name - rpool/ROOT/s10s_u6wos_07.
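Capturing the image might look something like this on the golden system. The snapshot name matches the @flar snapshot that the finish script receives, but the local file path is illustrative; the file then gets copied to the flash directory on the Jumpstart server like any other archive:

```shell
# Snapshot the root dataset and save the send stream to a file
zfs snapshot rpool/ROOT/s10s_u6wos_07@flar
zfs send rpool/ROOT/s10s_u6wos_07@flar > /var/tmp/s10s_u6wos_07_flar.zfs
```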

How do I get this on my new server?

Now, we have to figure out how to have this ZFS send stream restored on the new clone systems. We would like to take advantage
of the fact that Jumpstart will create the root pool for us, along with the dump and swap volumes, and will set up all of the needed
bits for the booting from ZFS. So, let's install the minimum Solaris set of packages just to get these side effects.

Then, we will use Jumpstart finish scripts to create a fresh ZFS dataset and restore our saved image into it.
Since this new dataset will contain the old identity of the original system, we have to reset our system identity.
But once we do that, we are good to go.

So, set up the cloned system as you would for a hands-free jumpstart. Be sure to specify the sysid_config and install_config
bits in the /etc/bootparams. The manual Solaris 10 10/08 Installation Guide: Custom JumpStart and Advanced Installations
covers how to do this. We add to the rules file a finish script (I called mine loadzfs in this case) that will do the
heavy lifting. Once Jumpstart installs Solaris according to the profile provided, it then runs the finish script to finish up
the installation.

Here is the Jumpstart profile I used. This is a basic profile that installs the base, required Solaris packages into a ZFS pool
mirrored across two drives.

The finish script is a little more interesting since it has to create the new ZFS dataset, set the right properties, fill it up,
reset the identity, etc. Below is the finish script that I used.

#!/bin/sh -x
# TBOOTFS is a temporary dataset used to receive the stream
TBOOTFS=rpool/ROOT/s10u6_rcv
# NBOOTFS is the final name for the new ZFS dataset
NBOOTFS=rpool/ROOT/s10u6f
MNT=/tmp/mntz
FLAR=s10s_u6wos_07_flar.zfs
NFS=serverIP:/export/solaris/Solaris10/flash
# Mount directory where archive (send stream) exists
mkdir ${MNT}
mount -o ro -F nfs ${NFS} ${MNT}
# Create file system to receive ZFS send stream &
# receive it. This creates a new ZFS snapshot that
# needs to be promoted into a new filesystem
zfs create ${TBOOTFS}
zfs set canmount=noauto ${TBOOTFS}
zfs set compression=on ${TBOOTFS}
zfs receive -vF ${TBOOTFS} < ${MNT}/${FLAR}
# Create a writeable filesystem from the received snapshot
zfs clone ${TBOOTFS}@flar ${NBOOTFS}
# Make the new filesystem the top of the stack so it is not dependent
# on other filesystems or snapshots
zfs promote ${NBOOTFS}
# Don't automatically mount this new dataset, but allow it to be mounted
# so we can finalize our changes.
zfs set canmount=noauto ${NBOOTFS}
zfs set mountpoint=${MNT} ${NBOOTFS}
# Mount newly created replica filesystem and set up for
# sysidtool. Remove old identity and provide new identity
umount ${MNT}
zfs mount ${NBOOTFS}
# This section essentially forces sysidtool to reset system identity at
# the next boot.
touch /a/${MNT}/reconfigure
touch /a/${MNT}/etc/.UNCONFIGURED
rm /a/${MNT}/etc/nodename
rm /a/${MNT}/etc/.sysIDtool.state
cp ${SI_CONFIG_DIR}/sysidcfg /a/${MNT}/etc/sysidcfg
# Now that we have finished tweaking things, unmount the new filesystem
# and make it ready to become the new root.
zfs umount ${NBOOTFS}
zfs set mountpoint=/ ${NBOOTFS}
zpool set bootfs=${NBOOTFS} rpool
# Get rid of the leftovers
zfs destroy ${TBOOTFS}
zfs destroy ${NBOOTFS}@flar

When we jumpstart the system, Solaris is installed, but it really isn't used. Then, we load from the send stream
a whole new OS dataset, make it bootable, set our identity in it, and use it. When the system is booted, Jumpstart
still takes care of updating the boot archives in the new bootfs.

On the whole, this is a lot more work than Flash, and is really not as flexible or as complete. But hopefully, until
Flash is supported with a ZFS root and Jumpstart, this might at least give you an idea of how you can replicate systems
and do installations that do not have to revert back to package-based installation.

Many people use Flash as a form of disaster recovery. I think that this same approach might be used there as well. Still
not as clean or complete as Flash, but it might work in a pinch.

So, what do you think? I would love to hear comments on this as a stop-gap approach.

Friday Dec 01, 2006

Continuing with some of the ideas around zvols, I wondered about UFS on a zvol. On the surface, this appears to be sort of redundant and not really very sensible. But thinking about it, there are some real advantages.

I can take advantage of the data integrity and self-healing features of ZFS since this is below the filesystem layer.

I can easily create new volumes for filesystems and grow existing ones

I can make snapshots of the volume, sharing the ZFS snapshot flexibility with UFS - very cool

In the future, I should be able to do things like have an encrypted UFS (sort-of) and secure deletion

Creating UFS filesystems on zvols

Creating a UFS filesystem on a zvol is pretty trivial. In this example, we'll create a mirrored pool and then build a UFS filesystem in a zvol.
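A minimal sketch, with illustrative disk and pool names:

```shell
# Mirrored pool and a 1 GB zvol
zpool create upool mirror c1t0d0 c1t1d0
zfs create -V 1g upool/ufsvol

# Build a UFS filesystem inside the zvol and mount it
newfs /dev/zvol/rdsk/upool/ufsvol
mkdir -p /ufs
mount /dev/zvol/dsk/upool/ufsvol /ufs
```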

Growing UFS filesystems on zvols

But, what if I run out of space? Well, just as you can add disks to a pool and grow the size of the pool, you can grow the size of a zvol. Now, since the UFS filesystem is a data structure inside the zvol container, you have to grow it as well. Were I using just ZFS, the size of the file system would grow and shrink dynamically with the size of the data in the file system. But a UFS has a fixed size, so it has to be expanded manually to accommodate the enlarged volume. Now, this seems to have quit working between b45 and b53, so I just filed a bug on this one.
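Assuming the pool and zvol from the earlier example, growing both layers looks something like this (growfs is delivered with the SVM packages but works on any mounted UFS):

```shell
# Double the size of the zvol...
zfs set volsize=2g upool/ufsvol

# ...then grow the UFS inside it to fill the larger volume
growfs -M /ufs /dev/zvol/rdsk/upool/ufsvol
```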

What about compression?

Along the same lines as growing the file system, I suppose you could turn compression on for the zvol. But since the UFS is of fixed size, it won't especially help you fit more data into the file system. You can't put more into the filesystem than the filesystem thinks it can hold, even if it isn't actually using that much space on the disk. Here's a little demonstration of that.

First, we will loop through, creating 200MB files in a 1GB file system with no compression. We will use blocks of zeros, since these will compress quite a bit the second time round.
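The fill loop might look like this, again using the example zvol:

```shell
# Five 200 MB files of zeros into a 1 GB UFS
i=1
while [ $i -le 5 ]; do
    dd if=/dev/zero of=/ufs/zero$i bs=1024k count=200
    i=`expr $i + 1`
done

# Compare what UFS thinks with what the pool actually stores
df -h /ufs
zfs list upool/ufsvol
```

For the second time round, turn compression on first with zfs set compression=on upool/ufsvol and recreate the files.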

This time, even though the volume was not using much space at all, the file system was full. So compression in this case is not especially valuable from a space management standpoint. Depending on the contents of the filesystem, though, compression may still help performance by converting multiple I/Os into single or fewer I/Os.

The Cool Stuff - Snapshots and Clones with UFS on Zvols

One of the things that is not available in UFS is the ability to create multiple snapshots quickly and easily. The fssnap(1M) command allows me to create a single, read-only snapshot of a UFS file system. In addition, it requires an additional location to maintain backing store for files changed or deleted in the master image during the lifetime of the snapshot.

ZFS offers the ability to create many snapshots of a ZFS filesystem quickly and easily. This ability extends to zvols, as it turns out.

For this example, we will create a volume, fill it up with some data and then play around with taking some snapshots of it. We will just tar over the Java JDK so there are some files in the file system.

Now, we will create a snapshot of the volume, just like for any other ZFS file system. As it turns out, this creates new device nodes in /dev/zvol for the block and character devices. We can mount them as UFS file systems same as always.
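Assuming the upool/ufsvol volume from earlier, that looks something like:

```shell
# Snapshot the zvol; matching device nodes show up under /dev/zvol
zfs snapshot upool/ufsvol@monday

# Mount the snapshot read-only as an ordinary UFS
mkdir -p /ufs-monday
mount -o ro /dev/zvol/dsk/upool/ufsvol@monday /ufs-monday
```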

You can create multiple snapshots just as easily. And as with any other ZFS file system, you can roll back a snapshot and make it the master again. You have to unmount the filesystem in order to do this, since the rollback happens at the volume level. Changing the volume underneath the UFS filesystem would leave UFS confused about the state of things. But ZFS catches this, too.

I can create additional read-write instances of a volume by cloning the snapshot. The clone and the master file system will share the same objects on-disk for data that remains unchanged, while new on-disk objects will be created for any files that are changed either in the master or in the clone.
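Continuing with the example names:

```shell
# Clone the snapshot into a new, writable zvol
zfs clone upool/ufsvol@monday upool/ufsclone

# The snapshot was taken of a live UFS, so check it before mounting
fsck -y /dev/zvol/rdsk/upool/ufsclone
mkdir -p /ufsclone
mount /dev/zvol/dsk/upool/ufsclone /ufsclone
```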

I am pretty sure that this isn't exactly what the ZFS guys had in mind when they set out to build all of this, but this is pretty cool. Now, I can create UFS snapshots without having to specify a backing store. I can create clones, promote the clones to the master, and do the other things that I can do in ZFS. I still have to manage the mounts myself, but I'm better off than before.

I mentioned recently that I just spent a week in a ZFS internals TOI. Got a few ideas to play with there that I will share. Hopefully folks might have suggestions as to how to improve / test / validate some of these things.

ZVOLs as Swap

The first thing that I thought about was using a zvol as a swap device. Of course, this is right there in the zfs(1M) man page as an example, but it still deserves a mention here. There has been some discussion of this on the zfs-discuss list at opensolaris.org (I just retyped that dot four times thinking it was a comma. Turns out there was crud on my laptop screen). The dump device cannot be on a zvol (at least if you want to catch a crash dump), but this still gives a lot of flexibility. With root on ZFS (coming before too long), ZFS swap makes a lot of sense and is the natural choice. We were talking in class that maybe it would be nice if there were a way to turn off ZFS caching for the swap device to improve performance, but that remains to be seen.

At any rate, setting up mirrored swap with ZFS is way simple! Much simpler even than with SVM, which in turn is simpler than VxVM. Here's all it takes:

Pretty darn simple, if you ask me. You can make it permanent by changing the lines for swap in your /etc/vfstab (below). Notice that you use the path to the zvol in the /dev tree rather than the ZFS dataset name.
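A sketch of the whole thing, with illustrative pool and disk names:

```shell
# Mirrored pool and a 2 GB zvol for swap
zpool create spool mirror c1t0d0 c1t1d0
zfs create -V 2g spool/swapvol

# Add it as a swap device -- note the /dev/zvol path,
# not the dataset name
swap -a /dev/zvol/dsk/spool/swapvol

# /etc/vfstab entry to make it permanent:
# /dev/zvol/dsk/spool/swapvol  -  -  swap  -  no  -
```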

I would like to do some performance testing to see what kind of performance you can get with swap on a zvol. I am curious about how this will affect kernel memory usage. I am curious about the effect of things like compression on the swap volume, though thinking about that one, it doesn't make a lot of sense. I am also curious about the ability to dynamically change the size of the swap space. At first glance, changing the size of the volume does not automatically change the amount of available swap space. That makes sense for expanding swap space. But if you reduce the size of the volume and the kernel doesn't notice, that sounds like it could be a problem. Maybe I should file a bug.

Suggestions for things to try and ways to measure overhead and performance for this are welcomed.

Thursday Nov 30, 2006

I just spent the last four days in a ZFS Internals TOI, given by George Wilson from RPE. This just reinforces my belief that the folks who build OpenSolaris (and most any complex software product, actually) have a special gift. How one can conceive of all of the various parts and pieces to bring together something as cool as OpenSolaris or ZFS or DTrace, etc., is beyond me.

By way of full disclosure, I ought to admit that the main thing I learned in graduate school and while working as a developer in a CO-OP job at IBM was that I hate development. I am not cut out for it and have no patience for it.

Anyway, though, spending a week in the ZFS source actually helps you figure out how best to use the tool at a user level. You see how things fit together, and this helps you figure out how to build solutions. I got a ton of good ideas about some things that you might do with ZFS even without moving all of your data to ZFS. I don't know whether they will pan out or not, but they are some ideas to play around with. More about that later.

The same kind of thing applies to the internals of the kernel. Whether or not you are a kernel programmer, you can be a better developer and a better system administrator if you have a notion of how the pieces of the kernel fit together. Sun Education is now offering a Solaris 10 Operating System internals class, previously only offered internally at Sun. Since Solaris has been open-sourced, the internal Internals is now an external Internals! If you have a chance, take this class! I take it every couple of Solaris releases and never regret it.

But, mostly I want to say a special thanks to George Wilson and the RPE team for putting together a fantastic training event and for allowing me, from the SE / non-developer side of the house to sit in and bask in the glow of those who actually make things for a living.