Again, ZFS handled everything for you and you now have the contents of ''mysecondDS'' exactly as it was at the time the snapshot ''Charlie'' was taken. Not more complicated than that. Hang on you hat, we have not finished and here is a bonus to have read this tutorial so far: '''you can access the various snapshots without cloning them'''

Again, ZFS handled everything for you and you now have the contents of ''mysecondDS'' exactly as it was at the time the snapshot ''Charlie'' was taken. Not more complicated than that. Hang on you hat, we have not finished and here is a bonus to have read this tutorial so far: '''you can access the various snapshots without cloning them'''

−

=== the .zfs pseudo-directory or the secret gate to your snapshot ===

+

=== the .zfs pseudo-directory or the secret passage to your snapshots ===

+

Any directory where a ZFS dataset is mounted (having snapshots or not) secretly contains a pseudo-directory named '''.zfs''' (dot-ZFS) and you will not see it even with the option ''-a'' given to a '''ls''' command unless you specify it. It is a contradiction to Unix and Unix-like systems' philosophy to not hide anything to the system administrator. It is not a bug of ZFS On Linux implementation and the Solaris implementation of ZFS exposes the exact behavior. So what is inside this little magic box?

+

+

<console>

+

###i## cd /myfirstpool/mysecondDS

+

###i## ls -la | grep .zfs

+

###i## cd /myfirstpool/mysecondDS

+

###i## ls -la | grep .zfs

+

###i## ls -lad .zfs

+

dr-xr-xr-x 1 root root 0 Mar 2 15:26 .zfs

+

###i## pwd

+

/myfirstpool/mysecondDS/.zfs

+

###i## ls -la

+

total 4

+

dr-xr-xr-x 1 root root 0 Mar 2 15:26 .

+

drwxr-xr-x 3 root root 145 Mar 2 19:29 ..

+

dr-xr-xr-x 2 root root 2 Mar 2 19:47 shares

+

dr-xr-xr-x 2 root root 2 Mar 2 18:46 snapshot

+

<console>

=== The time-traveling machine (incremental snapshots) ===

=== The time-traveling machine (incremental snapshots) ===

Revision as of 00:52, March 3, 2014

Important

This tutorial is under a heavy revision to be switched from ZFS Fuse to ZFS on Linux.

Introduction

ZFS features and limitations

ZFS offers an impressive amount of features even putting aside its hybrid nature (both a filesystem and a volume manager -- zvol) covered in detail on Wikipedia. One of the most fundamental points to keep in mind about ZFS is it targets a legendary reliability in terms of preserving data integrity. ZFS uses several techniques to detect and repair (self-healing) corrupted data. Simply speaking it makes an aggressive use of checksums and relies on data redundancy, the price to pay is a bit more CPU processing power. However, the Wikipedia article about ZFS also mention it is strongly discouraged to use ZFS over classic RAID arrays as it can not control the data redundancy, thus ruining most of its benefits.

Virtual block-devices support support over a ZFS pool (zvol) - (extremely cool when jointly used over a RAID-Z volume)

Read-only Snapshot support (it is possible to get a read-write copy of them, those are named clones)

Encryption support (supported only at ZFS version 30 and upper, ZFS version 31 is shipped with Oracle Solaris 11 so that version is mandatory if you plan to encrypt your ZFS datasets/pools)

Built-in RAID-5-like-over-steroid capabilities known as RAID-Z and RAID-6-like-over-steroid capabilities known as RAID-Z2. RAID-Z3 (triple parity) also exists.

Copy-on-Write transactional filesystem

Meta-attributes support (properties) allowing you to you easily drive the show like "That directory is encrypted", "that directory is limited to 5GiB", "That directory is exported via NFS" and so on. Depending on what you define, ZFS takes the appropriates actions!

Dynamic striping to optimize data throughput

Variable block length

Data deduplication

Automatic pool re-silvering

Transparent data compression

Transparent encryption (Solaris 11 and later only)

Most notable limitations are:

Lack a features ZFS developers knows as "Block Pointer rewrite functionality" (planned to be developed), without it ZFS suffers of currently not being able to:

Pool defragmentation (COW techniques used in ZFS mitigates the problem)

Pool resizing

Data compression (re-applying)

Adding an additional device in a RAID-Z/Z2/Z3 pool to increase it size (however, it is possible to replace in sequence each one of the disks composing a RAID-Z/Z2/Z3)

NOT A CLUSTERED FILESYSTEM like Lustre, GFS or OCFS2

No data healing if used on a single device (corruption can still be detected), workaround if to force a data duplication on the drive

No support of TRIMming (SSD devices)

ZFS on well known operating systems

Linux

Despite the source code of ZFS is open, its license (Sun CDDL) is incompatible with the license governing the Linux kernel (GNU GPL v2) thus preventing its direct integration. However a couple of ports exists, but suffers of maturity and lack of features. As of writing (February 2014) two known implementations exists:

ZFS-fuse: a totally userland implementation relying on FUSE. This implementation can now be considered as defunct as of February 2014). The original site of ZFS FUSE seems to have disappeared nevertheless the source code is still available on http://freecode.com/projects/zfs-fuse. ZFS FUSE stalled at version 0.7.0 in 2011 and never really evolved since then.

ZFS on Linux: a kernel mode implementation of ZFS in kernel mode which supports a lot of NFS features. The implementation is not as complete as it is under Solaris and its siblings like OpenIndiana (e.g. SMB integration is still missing, no encryption support...) but a lot of functionality is there. This is the implementation used for this article. As ZFS on Linux is an out-of-tree Linux kernel implementation, patches must be waited after each Linux kernel release. ZfsOnLinux currently supports zpools version 28.

Solaris/OpenIndiana

Oracle Solaris: remains the de facto reference platform for ZFS implementation: ZFS on this platform is now considered as mature and usable on production systems. Solaris 11 uses ZFS even for its "system" pool (aka rpool). A great advantage of this: it is now quite easy to revert the effect of a patch at the condition a snapshot has been taken just before applying it. In the "old good" times of Solaris 10 and before, reverting a patch was possible but could be tricky and complex when possible. ZFS is far from being new in Solaris as it takes its roots in 2005 to be, then, integrated in Solaris 10 6/06 introduced in June 2006.

OpenIndiana: is based on the Illuminos kernel (a derivative of the now defunct OpenSolaris) which aims to provide absolute binary compatibility with Sun/Oracle Solaris. Worth mentioning that Solaris kernel and the Illumos kernel were both sharing the same code base, however, they now follows a different path since Oracle announced the discontinuation of OpenSolaris (August 13th 2010). Like Oracle Solaris, OpenIndiana uses ZFS for its system pool. The illumos kernel ZFS support lags a bit behind Oracle: it supports zpool version 28 where as Oracle Solaris 11 has zpool version 31 support, data encryption being supported at zpool version 30.

*BSD

FreeBSD: ZFS is present in FreeBSD since FreeBSD 7 (zpool version 6) and FreeBSD can boot on a ZFS volume (zfsboot). ZFS support has been vastly enhanced in FreeBSD 8.x (8.2 supports zpool version 15, version 8.3 supports version 28), FreeBSD 9 and FreeBSD 10 (both supports zpool version 28). ZFS in FreeBSD is now considered as fully functional and mature. FreeBSD derivatives such as the popular FreeNAS takes befenits of ZFS and integrated it in their tools. In the case of that latter, it have, for example, supports for zvol though its Web management interface (FreeNAS >= 8.0.1).

NetBSD: ZFS has been started to be ported as a GSoC project in 2007 and is present in the NetBSD mainstream since 2009 (zpool version 13).

OpenBSD: No ZFS support yet and not planned until Oracle changes some policies according to the project FAQ.

ZFS alternatives

WAFL seems to have severe limitation [1] (document is not dated), also an interesting article lies here

BTRFS is advancing every week but it still lacks such features like the capability of emulating a virtual block device over a storage pool (zvol) and built-in support for RAID-5/6 is not complete yet (cf. Btrfs mailing list). At date of writing, it is still experimental where as ZFS is used on big production servers.

VxFS has also been targeted by comparisons like this one (a bit controversial). VxFS has been known in the industry since 1993 and is known for its legendary flexibility. Symantec acquired VxFS and proposed a basic version (no clustering for example) of it under the same Veritas Storage Foundation Basic

An interesting discussion about modern filesystems can be found on OSNews.com

ZFS vs BTRFS at a glance

Some key features in no particular order of importance between ZFS and BTRFS:

ZFS support in Linux is not considered as production quality (yet) although it is very robust. Several operating systems like Solaris/OpenIndiana have a production quality implementation, Solaris/OpenIndiana is now installed in ZFS datasets by defaults.

Integrated within the Linux kernel tree

NO

YES

ZFS is released under the CDDL license...

ZFS resource naming restrictions

Before going further, you must be aware of restrictions concerning the names you can use on a ZFS filesystem. The general rule is: you can can use all of the alphanumeric characters plus the following specials are allowed:

Underscore (_)

Hyphen (-)

Colon (:)

Period (.)

The name used to designate a ZFS pool has no particular restriction except:

it can't use one the reserved words in particular:

mirror

raidz (raidz2, raidz3 and so on)

spare

cache

log

names must begin with an alphanumeric character (same for ZFS datasets).

Some ZFS concepts

Once again with no particular order of importance:

ZFS

What it is

Counterparts examples

zpool

A group of one or many physical storage media (hard drive partition, file...). A zpool has to be divided in at least one ZFS dataset or at least one zvol to hold any data. Several zpools can coexists in a system at the condition they each hold a unique name. Also note that zpools can never be mounted, the only things that can are the ZFS datasets they hold.

Volume group (VG) in LVM

BTRFS volumes

dataset

A logical subdivision of a zpool mounted in your host's VFS where your files and directories resides. Several uniquely named ZFS datasets can coexist in a single system at the conditions they each own a unique name within their zpool.

Logical subvolumes (LV) in LVM formatted with a filesystem like ext3.

BTRFS subvolumes

snapshot

A read-only photo of a ZFS dataset state as is taken at a precise moment of time. ZFS has no way to cooperate on its own with applications that read and write data on ZFS datasets, if those latter still hold data at the moment the snapshot is taken, only what has been flushed will be included in the snapshot. Worth mentioning that snapshot do not take diskspace aside of sone metadata at the exact time they are created, they size will grow as more and data blocks (i.e. files) are deleted or changed on their corresponding live ZFS dataset.

No direct equivalent in LVM.

BTRFS read-only snapshots

clone

What is is... A writable physical clone of snapshot

LVM snapshots

BTRFS snapshots

zvol

An emulated block device whose data is hold behind the scene in the zpool the zvol has been created in.

No known equivalent even in BTRFS

Your first contact with ZFS

Requirements

ZFS userland tools installed (package sys-fs/zfs)

ZFS kernel modules built and installed (package sys-fs/zfs-kmod), there is a known issue with kernel 3.13 series see this thread on Funtoo's forum

Disk size of 64 Mbytes as a bare minimum (128 Mbytes is the minimum size of a pool). Multiple disk will be simulated through the use of several raw images accessed via the Linux loopback devices.

At least 512 MB of RAM

Preparing

Once your have emerged sys-fs/zfs and sys-fs/zfs-kmod you have two options to start using ZFS at this point :

Either you start /etc/init.d/zfs (will load all of the zfs kernel modules for you plus a couple of other things)

Either you load the zfs kernel modules by hand (will load all of the zfs kernel modules for you)

Then let's see what loopback devices are in use and which is the first free:

# losetup -a
# losetup -f
/dev/loop0

In the above example nothing is used and the first available loopback device is /dev/loop0. Now associate all of the disks with a loopback device (/tmp/zfs-test-disk00.img -> /dev/loop/0, /tmp/zfs-test-disk01.img -> /dev/loop/1 and so on):

ZFS literature often names zpools "tank", this is not a requirement you can use whatever name of you choice (as we did here...)

Every story in ZFS takes its root with a the very first ZFS related command you will be in touch with: zpool. zpool as you might guessed manages all ZFS aspects in connection with the physical devices underlying your ZFS storage spaces and the very first task is to use this command to make what is called a pool (if you have used LVM before, volume groups can be seen as a counter part). Basically what you will do here is to tell ZFS to take a collection of physical storage stuff which can take several forms like a hard drive partition, a USB key partition or even a file and consider all of them as a single pool of storage (we will subdivide it in following paragraphs). No black magic here, ZFS will write some metadata on them behind the scene to be able to track which physical device belongs to what pool of storage.

Your first ZFS dataset

What does this mean? Several things: First, your zpool is here and has a size of, roughly, 8 Go minus some space eaten by some metadata. Second is is actually usable because the column HEALTH says ONLINE. Other columns are not meaningful for us for the moment just ignore them. If want more crusty details you can use the zpool command like this:

Information is quite intuitive: your pool is seen as being usable (state is similar to HEALTH) and is composed of several devices each one listed as being in a healthy state ... at least for now because they will be salvaged for demonstration purpose in a later section. For your information the columns READ,WRITE and CKSUM list the number of operation failures on each of the devices respectfully:

READ for reading failures. Having a non-zero value is not a good sign... the device is clunky and will soon fail.

WRITE for writing failures. Having a non-zero value is not a good sign... the device is clunky and will soon fail.

CKSUM for mismatch between the checksum of the data at the time is had been written and how it has been recomputed when read again (yes, ZFS uses checksums in a agressive manner). Having a non-zero value is not a good sign... corruption happened, ZFS will do its best to recover data by its own but this is definitely not a good sign of a healthy system.

Cool! So far so good you have a new 8 Gb usable brand new storage space on you system. Has been mounted somewhere?

Remember the tables in the section above? A zpool in itself can never be mounted, never ever. It is just a container where ZFS datasets are created then mounted. So what happened here? Obscure black magic? No, of course not! Indeed a ZFS dataset named after the zpool's name should have been created automatically for us then mounted. Is is true? We will check this shortly. For the moment you will be introduced with the second command you will deal with when using ZFS : zfs. While the zpool command is used with anything related to zpools, the zfs is used to anything related to ZFS datasets (a ZFS dataset always resides in a zpool, always no exception on that).

Note

zfs and zpool commands are the two only ones you will need to remember when dealing with ZFS.

So how can we check what ZFS datasets are currently known by the system? As you might already guessed like this:

Tala! The mystery is busted! the zfs command tells us that not only a ZFS dataset named myfirstpool has been created but also it has been mounted in the system's VFS for us. If you check with the df command, you should also see something like this:

Notice the various sizes reported by zpool and zfs commands. In this case it is the same however it can differ, this is true especially with zpools mounted in RAID-Z.

Unmounting/remounting a ZFS dataset

Important

Once again, remember that only ZFS datasets can be mounted inside your host's VFS, no exception on that okay? Zpools cannot be mounted, never, never, never... please pay attention to the terminology and keep things clear by not messing up with terms. We will introduce ZFS snapshots and ZFS clones but those are ZFS datasets at the basis so they can also be mounted and unmounted.

If a ZFS dataset behaves just like any other filesystem, can we unmount it?

# umount /myfirstpool
# mount | grep myfirstpool

No more /myfirstpool the line of sight! So yes, it is possible to unmount a ZFS dataset just like you would do with any other filesystem. Is the ZFS dataset still present on the system even it is unmounted? Let's check:

Your first contact with ZFS management by attributes or the end of /etc/fstab

At this point you might be curious about how the zfs command know what it has to mount and where is has to mount it. You might be familiar with the following syntax of the mount command that, behind the scenes, scans the file /etc/fstab and mount the specified entry:

# mount /boot

Does /etc/fstab contain something related to our ZFS dataset?

# cat /etc/fstab | grep myfirstpool
#

Doh!!!... Obvisouly nothing there. Another mystery? Sure not! The answer lies in a extremely powerful feature of ZFS: the attributes. Simply speaking: an attribute is named property of a ZFS dataset that holds a value. Attributes govern various aspects of how the datasets are managed like: "Is the data has to be compressed?", "Is the data has to be encrypted?", "Is the data has to be exposed to the rest of the world by NFS or SMB/Samba?" and of course... '"Where the dataset has to be mounted?". The answer to that latter question can be tell by the following command:

Bingo! When you remounted the dataset just some paragraphs ago, ZFS automatically inspected the mountpoint attribute and saw this dataset has to be mounted in the directory /myfirstpool.

A step forward with ZFS datasets

So far you were given a quick tour of what ZFS can do for you and it is very important at this point to distinguish a zpool from a ZFS dataset and to call a dataset for what it is (a dataset) and not for what is is not (a zpool). It is a bit confusing and an editorial choice to have choosen a confusing name just to make you familiar with the one and the other.

Creating datasets

Obviously it is possible to have more than one ZFS dataset within a single zpool. Quizz: what command would you use to subdivide a zpool in datasets? zfs or zpool? Stops reading for two seconds and try to figure out this little question. Frankly.

Answer is... zfs! Although you want to operate on the zpool to logically subdivide it in several datasets, you manage datasets at the end thus you will use the zfs command. It is not always easy at the beginning, do not be too worry you will soon get the habit when to use one or the other. Creating a dataset in a zpool is easy: just give to the zfs command the name of the pool you want to divide and the name of the dataset you want to create in it. So let's create three datasets named myfirstDS, mysecondDS and mythirdDS in myfirstpool(observe how we use the zpool and datasets' names) :

Noticed the size given for the 'AVAIL' column? At the very beginning of this tutorial we had slightly less than 8 Gb of available space, it now has a value of roughly 6 Gb. The datasets are just a subdivision of the zpool, they compete with each others for using the available storage within the zpool, no miracle here. To what limit? The pool itself as we never imposed a quota on datasets. Hopefully df and zfs list gives a coherent result.

Second contact with attributes: quota management

Remember how painful is the quota management under Linux? Now you can say goodbye to setquota, edquota and other quotacheck commands, ZFS handle this in the snap of fingers! Guess with what? An ZFS dataset attribute of course! ;-) Just to make you drool here is how a 2Gb limit can be set on myfirstpool/mythirdDS :

# zfs set quota=2G myfirstpool/mythirdDS

Et voila! The zfs command is bit silent however if we check we can see that myfirstpool/mythirdDS is now capped to 2 Gb (forget about 'REFER' for the moment): around 1 Gb of data has been copied in this dataset thus leaving a big 1 Gb of available space.

Of course you can use this technique for the home directories of your users /home this also having the a advantage of being much less forgiving than a soft/hard user quota: when the limit is reached, it is reached period and no more data can be written in the dataset. The user must do some cleanup and cannot procastinate anymore :-)

To remove the quota:

# zfs set quota=none myfirstpool/mythirdDS

none is simply the original value for the quota attribute (we did not demonstrate it, you can check by doing a zfs get quota myfirstpool/mysecondDS for example).

Destroying datasets

Important

There is no way to resurrect a destroyed ZFS dataset and the data it contained! Once you destroy a dataset the corresponding metadata is cleared and gone forever so be careful when using zfs destroy notably with the -r option ...

We have three datasets, but the third is pretty useless and contains a lot of garbage. Is it possible to remove it with a simple rm -rf? Let's try:

This is perfectly normal, remember that datasets are indeed something mounted in your VFS. ZFS might be ZFS and do alot for you, it cannot enforce the nature of a mounted filesystem under Linux/Unix. The "ZFS way" to remove a dataset is to use the zfs command like this at the reserve no process owns open files on it (once again, ZFS can do miracles for you but not that kind of miracles as it has to unmount the dataset before deleting it):

A bit more subtle case would be to try to destroy a ZFS dataset whenever another ZFS dataset is nested in it. Before doing that nasty experiment myfirstpool/mythirdDS must be created again this time with another nested dataset (myfirstpool/mythirdDS/nestedSD1):

# zfs destroy myfirstpool/mythirdDS
cannot destroy 'myfirstpool/mythirdDS': filesystem has children
use '-r' to destroy the following datasets:
myfirstpool/mythirdDS/nestedDS1

The zfs command detected the situation and refused to proceed on the deletion without your consent to make a recursive destruction (-r parameter). Before going any step further let's create some more nested datasets plus a couple of directories inside myfirstpool/mythirdDS:

Nothing really new the portage directory is here nothing more a priori. If you have used BTRFS before reading this tutorial you probably expected to see a @Charlie lying in /myfirstpool/mysecondDS? So where the check is Charlie? In ZFS a dataset snapshot is not visible from within the VFS tree (if you are not convinced you can search for it with the find command but it will never find it). Let's check with the zfs command:

Wow... No sign of the snapshot. What you mus know is that indeed zfs list shows only datasets by default and omits snapshots. If the command is invoked with the parameter -t set to all it will list everything:

So yes, @Charlie is here! Also notice here the power of copy-on-write filesystems: @Charlie takes only a couple of kilobytes (some ZFS metadata) just like any ZFS snapshot at the time they are taken. The reason snapshots occupy very little space in the datasets is because data and metadata blocks are the same and no physical copy of them are made. At the time goes on and more and more changes happens in the original dataset (myfirstpool/mysecondDS here), ZFS will allocate new data and metadata blocks to accommodate the changes but will leave the blocks used by the snapshot untouched and the snapshot will tend to eat more and more pool space. It seems odd at first glance because a snapshot is a frozen in time copy of a ZFS dataset but this the way ZFS manage them. So caveat emptor: remove any unused snapshot to not full your zpool...

Noticed the size's increase of myfirstpool/mysecondDS@Charlie? This is mainly due to new files copied in the snasphot: ZFS had to retained the original blocks of data. Now time to roll this ZFS dataset back to its original state (if some processes would have open files in the dataset to be rolled back, you should terminate them first) :

Again, ZFS handled everything for you and you now have the contents of mysecondDS exactly as it was at the time the snapshot Charlie was taken. Not more complicated than that. Hang on you hat, we have not finished and here is a bonus to have read this tutorial so far: you can access the various snapshots without cloning them

the .zfs pseudo-directory or the secret passage to your snapshots

Any directory where a ZFS dataset is mounted (having snapshots or not) secretly contains a pseudo-directory named .zfs (dot-ZFS) and you will not see it even with the option -a given to a ls command unless you specify it. It is a contradiction to Unix and Unix-like systems' philosophy to not hide anything to the system administrator. It is not a bug of ZFS On Linux implementation and the Solaris implementation of ZFS exposes the exact behavior. So what is inside this little magic box?

# cd /myfirstpool/mysecondDS
# ls -la | grep .zfs
# cd /myfirstpool/mysecondDS
# ls -la | grep .zfs
# ls -lad .zfs
dr-xr-xr-x 1 root root 0 Mar 2 15:26 .zfs
# pwd
/myfirstpool/mysecondDS/.zfs
# ls -la
total 4
dr-xr-xr-x 1 root root 0 Mar 2 15:26 .
drwxr-xr-x 3 root root 145 Mar 2 19:29 ..
dr-xr-xr-x 2 root root 2 Mar 2 19:47 shares
dr-xr-xr-x 2 root root 2 Mar 2 18:46 snapshot
<console>
=== The time-traveling machine (incremental snapshots) ===
So far we only used a single snapshot just to keep things simple. However a dataset can hold several snapshots and moreover you can do a delta between two snapshots and nothing is really much more complicated than you have seen so far.
Let's consider myfirstDS this time. This dataset should be empty as we did nothing in it so far:
<pre>
# ls -la /myfirstpool/myfirstDS
total 3
drwxr-xr-x 2 root root 2 Sep 4 23:34 .
drwxr-xr-x 6 root root 6 Sep 5 15:43 ..
</pre>
Now generate some contents, take a snapshot (snapshot-1), add more content, take a snapshot again (snapshot-2), do some more modifications and take a third snapshot (snapshot-3):
<pre>
# echo "Hello, world" > /myfirstpool/myfirstDS/hello.txt
# cp /usr/src/linux-3.1-rc4.tar.bz2 /myfirstpool/myfirstDS
# ls -l /myfirstpool/myfirstDS
# ls -l /myfirstpool/myfirstDS
total 75580
-rw-r--r-- 1 root root 13 Sep 5 22:38 hello.txt
-rw-r--r-- 1 root root 77220912 Sep 5 22:38 linux-3.1-rc4.tar.bz2
# zfs snapshot myfirstpool/myfirstDS@snapshot-1
# echo "Goodbye, world" > /myfirstpool/myfirstDS/goodbye.txt
# echo "Are you there?" >> /myfirstpool/myfirstDS/hello.txt
# cp /usr/src/linux-3.0.tar.bz2 /myfirstpool/myfirstDS
# rm /myfirstpool/myfirstDS/linux-3.1-rc4.tar.bz2
# zfs snapshot myfirstpool/myfirstDS@snapshot-2
# echo "Still there?" >> /myfirstpool/myfirstDS/goodbye.txt
# rm /myfirstpool/myfirstDS/hello.txt
# cp /proc/config.gz /myfirstpool/myfirstDS
# zfs snapshot myfirstpool/myfirstDS@snapshot-3
# zfs list -t all
# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 2.41G 5.40G 444M /myfirstpool
myfirstpool/myfirstDS 147M 5.40G 73.3M /myfirstpool/myfirstDS
myfirstpool/myfirstDS@snapshot-1 73.8M - 73.8M -
myfirstpool/myfirstDS@snapshot-2 20K - 73.3M -
myfirstpool/myfirstDS@snapshot-3 0 - 73.3M -
</pre>
Wow, nice demonstration on how a Copy-on-Write filesystem like ZFS works: what do we observe? First it is quite obvious to see that ''snapshot-1'' is quite big. Is is possible that having a so big snapshot to be the consequence of removing /myfirstDS/linux-3.1-rc4.tar.bz2? Absolutely. Remember that a snapshot is a photograph of what a dataset contains at a given time, deleted information and unmodified original information is retained by the snapshot even you delete it from the dataset or bring in some changes to it. If you look again at the command history between snapshot-2 and snapshot-3, you will notice that we removed a small file and changed another small file a bit thus having a little information delta between what the dataset content at this time and what it also actually contains leading to a very small snapshot at the end. The third dataset is the exact copy of what the current dataset contains thus its size is very close to zero (truncated to zero on what you see).
$100 question: "How can I see what changed between snapshots?". Answer: ''yes, you can!'' Nuance is: ZFS Fuse does not support it yet :( Nevertheless here is what snapshots diffing looks like on an OpenIndiana/Solaris machine:
<pre>
# zfs create testpool/test2
# cd /testpool/test2
# wget http://www.kernel.org/pub/linux/kernel/v3.0/testing/patch-3.1-rc4.bz2
# echo "Hello,world" > hello.txt
# zfs snapshot testpool/test2@s1
# rm patch-3.1-rc4.bz2
# echo 'Goodbye!' > goodbye.txt
# echo 'Still there?' >> hello.txt
# zfs snapshot testpool/test2@s2
# echo 'Hello, again' >> hello.txt
# ln -s goodbye.txt goodbye2.txt
# mv hello.txt hello-new.txt
# zfs snapshot testpool/test2@s3
# zfs list -t all | grep test2
testpool/test2 8.49M 3.76T 47.9K /testpool/test2
testpool/test2@s1 8.41M - 8.42M -
testpool/test2@s2 29.2K - 46.4K -
testpool/test2@s3 0 - 47.9K -
# zfs diff testpool/test2@s1 testpool/test2@s2
M /testpool/test2/
- /testpool/test2/patch-3.1-rc4.bz2
M /testpool/test2/hello.txt
+ /testpool/test2/goodbye.txt
# zfs diff testpool/test2@s2 testpool/test2@s3
M /testpool/test2/
R /testpool/test2/hello.txt -> /testpool/test2/hello-new.txt
+ /testpool/test2/goodbye2.txt
# zfs diff testpool/test2@s1 testpool/test2@s3
M /testpool/test2/
- /testpool/test2/patch-3.1-rc4.bz2
R /testpool/test2/hello.txt -> /testpool/test2/hello-new.txt
+ /testpool/test2/goodbye.txt
+ /testpool/test2/goodbye2.txt
# zfs diff testpool/test2@s3 san/test2@s1
Unable to obtain diffs:
Not an earlier snapshot from the same fs
</pre>
Where M,R,+,- stands for:
* M: item has been modified
* R: item has been renamed
* +: item has been added
* -: item has been removed
Observe the output of each diff and draw you own conclusion on what we did at each step and what appears in the diff. It is not possible to get a detailed diff similar to what Git and others gives but you have the big picture of what changed between snapshots.
If ZFS-Fuse does not implements (yet) a snapshot diffing capability, it can deal with several snapshots and is able to jump across several steps backwards. Suppose we want ''myfirstDS'' to go back exactly is was when we took the dataset photograph named ''snapshot-1'':
<pre>
# zfs rollback myfirstpool/myfirstDS@snapshot-1
cannot rollback to 'myfirstpool/myfirstDS@snapshot-1': more recent snapshots exist
use '-r' to force deletion of the following snapshots:
myfirstpool/myfirstDS@snapshot-3
myfirstpool/myfirstDS@snapshot-2
</pre>
This is not a bug, this is absolutely normal. The '''zfs''' command asks you to give it the explicit permission to remove the two others snapshots as they becomes useless (restoring them would be an absolute no sense) once snapshot-1 is restored. Second attempt:
<pre>
# zfs rollback myfirstpool/myfirstDS@snapshot-1
# ls -l /myfirstpool/myfirstDS
total 75580
-rw-r--r-- 1 root root 13 Sep 5 22:38 hello.txt
-rw-r--r-- 1 root root 77220912 Sep 5 22:38 linux-3.1-rc4.tar.bz2
# zfs list -t all
NAME USED AVAIL REFER MOUNTPOINT
myfirstpool 2.34G 5.47G 444M /myfirstpool
myfirstpool/myfirstDS 73.8M 5.47G 73.8M /myfirstpool/myfirstDS
myfirstpool/myfirstDS@snapshot-1 0 - 73.8M -
myfirstpool/mysecondDS 1.84G 5.47G 1.84G /myfirstpool/mysecondDS
myfirstpool/mysecondDS@snapshot1 37K - 1.84G -
</pre>
''myfirstDS'' effectively returned to its state when ''snapshot-2'' was taken and the snapshots ''snapshot-2'' and ''snapshot-3'' vanished.
{{fancynote|You can leap several steps backward at the cost of '''loosing''' your subsequent modifications forever. }}
=== Snapshots and clones ===
=== Streaming ZFS datasets over the network ===
You find ZFS snaphots useful? Well, you have seen just a small part of their potential. As a snapshot is a photograph of what a dataset contains frozen in the time, snapshots can be seen as being no more than a data backup. Like any backup, they must not stay on the local machine but must be put elsewhere and the common good sense tells to keep backups in a safe place, making them travel through a secure channel. By "secure channel" we intend something like a trusted person in your organization whose job consists of bringing a box of tapes off-site in a secure location but we also intend a secure communication channel like an SSH tunnel over two hosts without any human intervention.
ZSH designers had the same vision and made possible for a dataset to be able to be sent over a network. How is that possible? Simple: the process involves two peers who can use through a communication channel like the one established by '''netcat''' (OpenSSH supports a similar functionality but with an encrypted communication channel). For the sake of the demonstration, we will use two Solaris boxes at each end-point.
How stream some ZFS bits over the network? Here again, '''zfs''' is the answer. A nifty move from the designers was to use ''stdin'' and ''stdout'' as transmission/reception channels thus allowing great a flexibility in processing the ZFS stream. You can envisage, for instance, to compress your stream then crypt it then encode it in base64 then sign it and so on. It sounds a bit overkill but it is possible and in the general case you can use any tool that swallow the data from ''stdin'' and spit it through ''stdout'' in your plumbing.
{{fancynote|The rest of this section has been done entirely on two Solaris 11 machines.}}
1. Sender side:
<pre>
# zfs create testpool2/zfsstreamtest
# echo 'Hello, world!' > /testpool2/zfsstreamtest/hello.txt
# echo 'Goodbye, world' > /testpool2/zfsstreamtest/goodbye.txt
# zfs snapshot zfs testpool2/zfsstreamtest@s1
# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
testpool2/zfsstreamtest@s1 0 - 32K -
</pre>
2. Receiver side (the dataset ''zfs-stream-test'' will be created and should not be present):
<pre>
# nc -l -p 7000 | zfs receive testpool/zfs-stream-test
</pre>
At this point the receiver is waiting after some data.
3. Sender side:
<pre>
# zfs send testpool2/zfsstreamtest@s1 | nc 192.168.aaa.bbb.ccc 7000
</pre>
4. Receiver side:
<pre>
# zfs list -t snapshot
NAME USED AVAIL REFER
...
testpool2/zfs-stream-test@s1 0 - 46.4K -
</pre>
Note that we did not set an explicit snapshot name in the second step but it could have been possible to choose anything else but the default which is the name of the snapshot sent over the network. In that case the dataset which will contain the snapshot needs to be created first:
<pre>
# nc -l -p 7000 | zfs receive testpool/zfs-stream-test@mysnapshot01
</pre>
Once received you would get:
<pre>
# zfs list -t snapshot
NAME USED AVAIL REFER
...
testpool2/zfs-stream-test@mysnapshot01 0 - 46.4K -
</pre>
5. Just for the sake of the curiosity let's do a rollback on the receiver side:
<pre>
# zfs rollback testpool2/zfsstreamtest@s1
# ls -l /testpool2/zfs-stream-test
total 2
-rw-r--r-- 1 root root 15 2011-09-06 23:54 goodbye.txt
-rw-r--r-- 1 root root 13 2011-09-06 23:53 hello.txt
# cat /testpool2/zfs-stream-test/hello.txt
Hello, world
</pre>
Because ZFS streaming operates using the starnd input and output (''stdin'' / ''stdout'') you can build a bit more complex pipeline like:
<pre>
# zfs send testpool2/zfsstreamtest@s1 | gzip | nc 192.168.aaa.bbb.ccc 7000
</pre>
The above example was using two hosts but a simpler setup is also possible: you are not required to send you data over the network with '''netcat''', you can store it to a regular file then mail it or store it on a USB key. By the way: we have not finished! We took only a simple case here: it is absolutely possible to do the exact same operation with the difference between snapshots (incremental). Just like an incremental backup takes only what has changed, ZFS can determine the difference between two snapshots and streaming instead of streaming a snapshot taken at whole. Although ZFS can detect and act on differentials, it does not operate (yet) at the block level: if only a few bytes of a very big file have changed, the whole file will be taken into consideration (operating at data block level is possible with some tools like the well-known '''rsync''').
Consider the following:
* A dataset snapshot (S1) contains two files:
** A -> 10 MB
** B -> 4 GB
* A bit later some files (named C, D and E) are added to the dataset and another snapshot is (S2) taken. S2 contains:
** A -> 10 MB
** B -> 4 GB
** C -> 3 MB
** D -> 500 KB
** E -> 1GB
With a full transfer of S2 A,B,C,D and E would be streamed whereas an incremental transfert (S2-S1), zfs would only process C, D and E. The next $100 question:''"How can we stream a difference of snapshot? '''zfs''' again?"'' Yes! This time with a subtle difference: a special option specified on the command line telling it must use a difference rather than a full snapshot. Assuming a few more files are added in ''testpool2/zfsstreamtest'' dataset and a snapshot (s2) is has been taken, the delta between s2 and s1 (s2-s1) giving s3 can be send like this (on the receiver side the same as shown above is used, nothing special is required alos notice the presence of the -i option):
* Sender:
<pre>
# zfs send -i testpool2/zfsstreamtest@s1 testpool2/zfsstreamtest@s2 | nc 192.168.aaa.bbb.ccc 7000
</pre>
* Receiver:
<pre>
# nc -l -p 7000 | zfs receive testpool/zfs-stream-test
# zfs list -t snapshot
testpool/zfs-stream-test@s1 28.4K - 46.4K -
testpool/zfs-stream-test@s2 0 - 47.1K -
</pre>
Note that although we did not specified any snapshot name to use on the receiver side, ZFS used by default the name of the second snapshot involved in the delta (''s2'' here).
$200 question: suppose we delete all of the received snapshots so far on the receiver side and we try to send the difference between s2 and s1, what would happen? ZFS will protest on the receiver side although no error message will be visible on the sender side:
<pre>
cannot receive incremental stream: destination testpool/zfs-stream-test has been modified
since most recent snapshot
</pre>
It is even worse if we remove the dataset used to receive the data:
<pre>
cannot receive incremental stream: destination 'testpool/zfs-stream-test' does not exist
</pre>
{{fancyimportant|ZFS streaming over a network has '''no underlying protocol''', therefore the sender just assumes the data has been successfully received and processed. It '''does not care''' whether a processing error occurs.}}
== Govern a dataset by attributes ==
So far, most of a filesystem capabilities were driven by separate and scarced command line line tools (e.g. tune2fs, edquota, rquota, quotacheck...) which all have their own ways to handle tasks and can go through tricky ways sometimes especially the quota-related management utilities. Moreover, there was no easy way to handle a limitations on a directory rather than putting it a a dedicated partition or logical volume implying downtimes when additional space was to be added. Quota management is however one of the many facets disk space management includes.
In the ZFS world, many aspects are now managed by simply setting/clearing a property attached to a ZFS dataset through the now so well-known command '''zfs'''.You can, for example:
* put a size limit on a dataset
* reserve a space for dataset (that space is ''guaranteed'' to be available in the future although not being allocated at the time the reservation is made)
* control if new files are encrypted and/or compressed
* define a quota per user or group of users
* control checksum usage => '''never turn that property off unless having very good reasons you are likely to never have''' (no checksums = no silent data corruption detection)
* share a dataset by NFS/CIFS
* control automatic data deduplication
Not all of a dataset properties are settable, some of them are set and managed by the operating system in the background for you and thus cannot be modified.
{{fancynote|Solaris/OpenIndiana users: ZFS has a tight integration with the NFS/CIFS server, thus it is possible to share a zfs dataset by setting adequate attributes. ZFS on Linux (native kernel mode port) also has a tight integration with the built-in Linux NFS server, the same for ZFS fuse although still experimental. Under FreeBSD ZFS integration has been done both with NFS and Samba (CIFS).}}
Like any other action concerning datasets, properties are sets and unset via the zfs command. On our Funtoo box running zfs-Fuse we can, for example, start by seeing the value of all properties for the dataset ''myfirstpool/myfirstDS'':
<pre>
# zfs get all myfirstpool/myfirstDS
zfs get all myfirstpool/myfirstDS
NAME PROPERTY VALUE SOURCE
myfirstpool/myfirstDS type filesystem -
myfirstpool/myfirstDS creation Sun Sep 4 23:34 2011 -
myfirstpool/myfirstDS used 73.8M -
myfirstpool/myfirstDS available 5.47G -
myfirstpool/myfirstDS referenced 73.8M -
myfirstpool/myfirstDS compressratio 1.00x -
myfirstpool/myfirstDS mounted yes -
myfirstpool/myfirstDS quota none default
myfirstpool/myfirstDS reservation none default
myfirstpool/myfirstDS recordsize 128K default
myfirstpool/myfirstDS mountpoint /myfirstpool/myfirstDS default
myfirstpool/myfirstDS sharenfs off default
myfirstpool/myfirstDS checksum on default
myfirstpool/myfirstDS compression off default
myfirstpool/myfirstDS atime on default
myfirstpool/myfirstDS devices on default
myfirstpool/myfirstDS exec on default
myfirstpool/myfirstDS setuid on default
myfirstpool/myfirstDS readonly off default
myfirstpool/myfirstDS zoned off default
myfirstpool/myfirstDS snapdir hidden default
myfirstpool/myfirstDS aclmode groupmask default
myfirstpool/myfirstDS aclinherit restricted default
myfirstpool/myfirstDS canmount on default
myfirstpool/myfirstDS xattr on default
myfirstpool/myfirstDS copies 1 default
myfirstpool/myfirstDS version 4 -
myfirstpool/myfirstDS utf8only off -
myfirstpool/myfirstDS normalization none -
myfirstpool/myfirstDS casesensitivity sensitive -
myfirstpool/myfirstDS vscan off default
myfirstpool/myfirstDS nbmand off default
myfirstpool/myfirstDS sharesmb off default
myfirstpool/myfirstDS refquota none default
myfirstpool/myfirstDS refreservation none default
myfirstpool/myfirstDS primarycache all default
myfirstpool/myfirstDS secondarycache all default
myfirstpool/myfirstDS usedbysnapshots 18K -
myfirstpool/myfirstDS usedbydataset 73.8M -
myfirstpool/myfirstDS usedbychildren 0 -
myfirstpool/myfirstDS usedbyrefreservation 0 -
myfirstpool/myfirstDS logbias latency default
myfirstpool/myfirstDS dedup off default
myfirstpool/myfirstDS mlslabel off -
</pre>
How can we set a limit that prevents ''myfirstpool/myfirstDS'' to not use more than 1 GB of space in the pool? Simple, just set the ''quota'' property:
<pre>
# zfs set quota=1G myfirstpool/myfirstDS
# zfs get quota myfirstpool/myfirstDS
NAME PROPERTY VALUE SOURCE
myfirstpool/myfirstDS quota 1G local
</pre>
May be something poked your curiosity: ''what "SOURCE" means?'' "SOURCE" describes how the property has been determined for the dataset and can have several values:
* '''local''': the property has been explicitly set for this dataset
* '''default''': a default value has been assigned by the operating system if not explicitely set by the system adminsitrator (e.g SUID allowed or not in the above example).
* '''dash (-)''': not modifiable intrinsic property (e.g. dataset creation time, whether the dataset is currently mounted or not, dataset space usage in the pool, average compression ratio...)
Before copying some files in the dataset, let's fix a binary (on/off) property:
<pre>
# zfs set compression=on myfirstpool/myfirstDS
</pre>
Now try to put more than 1GB of data in the dataset:
<pre>
# dd if=/dev/zero of=/myfirstpool/myfirstDS/one-GB-test bs=2G count=1
dd: writing `/myfirstpool/myfirstDS/one-GB-test': Disk quota exceeded
</pre>
== Permission delegation ==
ZFS brings a feature known as delegated administration. Delegated administration enables ordinary users to handle administrative tasks on a dataset without being administrators. '''It is however not a sudo replacement as it covers only ZFS related tasks''' such as sharing/unsharing, disk quota management and so on. Permission delegation shines in flexibility because such delegation can be handled by inheritance though nested datasets. Pewrmission deleguation is handled via '''zfs''' through its '''allow''' and '''disallow''' options.
= Data redundancy with ZFS =
Nothing is perfect and the storage medium (even in datacenter-class equipment) is prone to failures and fails on a regular basis. Having data redundancy is mandatory to help in preventing single-points of failure (SPoF). Over the past decades, RAID technologies were powerful however their power is precisely their weakness: as operating at the block level, they do not care about what is stored on the data blocks and have no ways to interact with the filesystems stored on them to ensure data integrity is properly handled.
== Some statistics ==
It is not a secret to tell that a general trend in the IT industry is the exponential growth of data quantities. Just thinking about the amount of data Youtube, Google or Facebook generates every day taking the case of the first [http://www.website-monitoring.com/blog/2010/05/17/youtube-facts-and-figures-history-statistics some statistics] gives:
* 24 hours of video is generated every ''minute'' in March 2010 (May 2009 - 20h / October 2008 - 15h / May 2008 - 13h)
* More than 2 ''billions'' views a day
* More video is produced on Youtube every 60 days than 3 major US broadcasting networks did in the last 60 years
Facebook is also impressive (Facebook own stats):
* over 900 million objects that people interact with (pages, groups, events and community pages)
* Average user creates 90 pieces of content each month (750 millions users active)
* More than 2.5 million websites have integrated with Facebook
What is true with Facebook and Youtube is also true with many other cases (think one minutes about the amount of data stored in iTunes) especially with the growing popularity of cloud computing infrastructures. Despite the progress of the technology a "bottleneck" still exists: the storage reliability is nearly the same over the years. If only one organization in the world generate huge quantities of data it would be the [http://public.web.cern.ch CERN] (''Conseil Européen pour la Recherche Nucléaire'', now officially known as ''European Organization for Nuclear Research'') as their experiments can generate spikes of many terabytes of data within a few seconds. A study done in 2007 quoted by a [http://www.zdnet.com/blog/storage/data-corruption-is-worse-than-you-know/191 ZDNet article] reveals that:
* Even ECC memory cannot be always be helpful: 3 double-bit errors (uncorrectable) occurred in 3 months on 1300 nodes. Bad news: it should be '''zero'''.
* RAID systems cannot protect in all cases: monitoring 492 RAID controller for 4 weeks showed an average error rate of 1 per ~10^14 bits, giving roughly 300 errors for every 2.4 petabytes
* Magnetic storage is still not reliable even on high-end datacenter class drives: 500 errors found over 100 nodes while writing 2 GB file to 3000+ nodes every 2 hours then read it again and again for 5 weeks.
Overall this means: 22 corrupted files (1 in every 1500 files) for a grand total of 33700 files holding 8.7TB of data. And this study is 5 years old....
== Source of silent data corruption ==
http://www.zdnet.com/blog/storage/50-ways-to-lose-your-data/168
Not an exhaustive list but we can quote:
* Cheap controller or buggy driver that does not reports errors/pre-failure conditions to the operating system;
* "bit-leaking": an harddrive consists of many concentric magnetic tracks. When the hard drive magnetic head writes bits on the magnetic surface it generates a very weak magnetic field however sufficient to "leak" on the next track and change some bits. Drives can generally, compensate those situations because they also records some error correction data on the magnetic surface
* magnetic surface defects (weak sectors)
* Hard drives firmware bugs
* Cosmic rays hitting your RAM chips or hard drives cache memory/electronics
*
== Building a mirrored pool ==
== ZFS RAID-Z ==
=== ZFS/RAID-Z vs RAID-5 ===
RAID-5 is very commonly used nowadays because of its simplicity, efficiency and fault-tolerance. Although the technology did its proof over decades, it has a major drawback known as "The RAID-5 write hole". if you are familiar with RAID-5 you already know that is consists of spreading the stripes across all of the disks within the array and interleaving them with a special stripe called the parity. Several schemes of spreading stripes/parity between disks exists in the natures, each one with its own pros and cons, however the "standard" one (also known as ''left-asynchronous'') is:
<pre>
Disk_0 | Disk_1 | Disk_2 | Disk_3
[D0_S0] | [D0_S1] | [D0_S2] | [D0_P]
[D1_S0] | [D1_S1] | [D1_P] | [D1_S2]
[D2_S0] | [D2_P] | [D2_S1] | [D2_S2]
[D2_P] | [D2_S0] | [D2_S1] | [D2_S2]
</pre>
The parity is simply computed by XORing the stripes of the same "row", thus giving the general equation:
* [Dn_S0] XOR [Dn_S1] XOR ... XOR [Dn_Sm] XOR [Dn_P] = 0
This equation can be rewritten in several ways:
* [Dn_S0] XOR [Dn_S1] XOR ... XOR [Dn_Sm] = [Dn_P]
* [Dn_S1] XOR [Dn_S2] XOR ... XOR [Dn_Sm] XOR [Dn_P] = [Dn_S0]
* [Dn_S0] XOR [Dn_S2] XOR ... XOR [Dn_Sm] XOR [Dn_P] = [Dn_S1]
* ...and so on!
Because the equations are a combinations of exclusive-or, it is possible to easily compute a parameter if it is missing. Let say we have 3 stripes plus one parity composed of 4 bits each but one of them is missing due to a disk failure:
* D0_S0 = 1011
* D0_S1 = 0010
* D0_S2 = <missing>
* D0_P = 0110
However we know that:
* D0_S0 XOR D0_S1 XOR D0_S2 XOR D0_P = 0000 also rewritten as:
* D0_S2 = D0_S1 XOR D0_S2 XOR D0_P
Applying boolean algebra it gives:''' D0_S2 = 1011 XOR 0010 XOR 0110 = 1111'''.
Proof: '''1011 XOR 0010 XOR 1111 = 0110''' this is the same as '''D0_P'''
''''''So what's the deal?''''''
Okay now the funny part, forgot the above hypothesis and imagine we have this:
* D0_S0 = 1011
* D0_S1 = 0010
* D0_S2 = 1101
* D0_P = 0110
Applying boolean algebra magics gives 1011 XOR 0010 XOR 1101 => 0100. Problem: this is different of D0_P (0110). Can you tell which one (or which ONES) of the four terms lies? If you find a mathematically acceptable solution, found your company because you have just solved a big computer science problem. If humans can't solve the question, imagine how hard it is for the poor little RAID-5 controller to determine which stripe is right and which one lies and the resulting "datageddon" (i.e. massive data corruption on the RAID-5 array) when the RAID-5 controller detect error and start to rebuild the array.
This is not science fiction, this a pure reality and the weakness stays in the RAID-5 simplicity. Here is how it can happen: an urban legend with RAID-5 arrays is that they update stripes in an atomic transaction (all of the stripes+parity are written or none of them). Too bad, this is just not true, the data is written on the fly and if for a reason or another the machine where the RAID-5 array has a power outage or crash, the RAID-5 controller will simply have no idea about what he was doing and which stripes are up to date which ones are not up to date. Of course, RAID controllers in servers do have a replaceable on-board battery and most of the time the server they reside in is connected to an auxiliary source like a battery-based UPS or a diesel/gas electricity generator. However, Murphy laws or unpredictable hazards can, sometimes, happens....
Another funny scenario: imagine a machine with a RAID-5 array (on UPS this time) but with non ECC memory. the RAID-5 controller splits the data buffer in stripes, computes a data stripe and starts to write them on the different disks of the array. But...but...but... For some odd reason, only one bit in one of the stripes flips (cosmic rays, RFI...) after the parity calculation. Too bad too sad, one of the written stripes contains corrupted data and it is silently written on the array. Datageddon in sight!
Not to make you freaking: storage units have sophisticated error correction capability (a magnetic surface or an optical recording surface is not perfect and reading/writing error occurs) masking most the cases. However, some established statistics estimates that even with error correction mechanism one bit over 10^16 bits transferred is incorrect. 10^16 is really huge but unfortunately in this beginning of the XXIst century with datacenters brewing massive amounts of data with several hundreds to not say thousands servers this this number starts to give headaches: '''a big datacenter can face to silent data corruption every 15 minutes''' (Wikepedia). No typo here, a potential disaster may silently appear 5 times an hour for every single day of the year. Detection techniques exists but traditional RAID-5 arrays in them selves can be a problem. Ironic for a so popular and widely used solution :)
If RAID-5 was an acceptable trade-off in the past decades, it simply made its time. RAID-5 is dead? '''*Horray!*'''
= More advanced topics =
== ZFS Intention Log (ZIL) ==
= Final words and lessons learned =
ZFS surpasses by far (as of September 2011) every of the well-known filesystems around there: none of them propose such an integration of features and certainly not with this management simplicity and robustness. However in the Linux world it is definitely a no-go in the short term especially for production systems. The two known implementations are not ready for production environments and lacks some important features or behave in a clunky manner, this is absolutely correct as none of them pretend to be at this level of maturity and the licensing incompatibility between the code opened by Sun Microsystems some years ago and the GNU/GPL does not help the cause. However, both look '''very promising''' once their corners will become rounded.
For a Linux system, the nearest plan B is you seek for a BTRFS like filesystem covering some of the functionalities offered by ZFS is BTRFS (still considered as experimental, be prepared to a disaster sooner or later although BTRFS is used by some Funtoo core team members since 2 years and proved to be quite stable in practise). BTRFS however does not pushes the limits as much as ZFS does: it does not have built-in snapshot differentiation tool nor implement built-in filesystem streaming capabilities and roll-backing a BTRFS subvolume is a bit more manual than in ''"the ZFS way of life"''.
= Footnotes & references =
Source: [http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/index.html solaris-zfs-administration-guide]
[[Category:Labs]]
[[Category:Articles]]
[[Category:Filesystems]]
<references/>