Installing ZFS and setting up a pool

Build considerations & preparation

Hardware plays a large role in the performance and integrity of your ZFS file server. Although ZFS will function on a variety of commodity hardware, you should consider the following before proceeding:

ECC RAM

The question of whether non-ECC RAM is good enough gets asked again and again, but the bottom line is that you do need ECC. ZFS does its best to protect your data and ensure its integrity, but it cannot do so if the memory it uses cannot be trusted. ZFS is an advanced filesystem that can self-heal your files when it discovers silent bit rot (bit flips on the disk from bad sectors or cosmic rays). But what if the information on disk is OK and an undetected bit flip occurs in your RAM? ZFS could attempt to "self-heal" and actually corrupt your data because the information it received from RAM was incorrect.

ZFS will run just fine without ECC RAM, but you run the risk of silent data corruption and (although very unlikely) of losing the zpool entirely if your metadata gets corrupted in RAM and is subsequently written to disk. The chance of a random bit flip is small, but if a RAM stick is going bad and is riddling your filesystem with errors, you do not want to run the risk of catching that too late and losing everything.

Keep in mind that in order to use ECC RAM, you must buy a motherboard AND CPU that both support it. There are also buffered (also known as registered) DIMMs and unbuffered DIMMs. Buffered DIMMs tend to be slower and more expensive but scale much better (e.g. a single board could support up to 192GB RAM), while unbuffered ECC DIMMs tend to be less expensive and perform better but don't scale as high (a maximum of 32GB RAM on most current boards).

A more detailed analysis on this topic is available in this FreeNAS forum post.

Sufficient RAM for ARC cache

Conventional wisdom is that you should plan to allocate 1GB of RAM per TB of usable disk space in your ZFS filesystem. ZFS will run on far less (e.g. 4GB), but then you have little space available for your ARC cache and your read performance may suffer. Plan ahead and buy enough RAM from the start, or be sure that you'll be able to get your hands on additional DIMMs if you plan on adding disks later.

If you are unsure which pool type you would like to use, there is a very good and detailed comparison here. As the article points out, if you can afford it, striped mirrors (mirrored disks combined into a pool - effectively a RAID 0 of several groups of 2 disks in RAID 1) offer the best performance. However, you'll lose 50% of your raw disk capacity at a minimum, or 66% if you use three-way mirrors so that you can sustain two drive losses (which I highly recommend you do).

If you don't mind limiting performance to the equivalent of a single disk, RAIDZ2 is your best choice. With five disks it costs 40% of your raw capacity, and that fraction shrinks as you add more disks. A RAIDZ2 with 6 disks, for example, only loses 2/6 disks to parity (33%). Always remember that RAID is redundancy, not a backup!
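
The capacity figures above are simple arithmetic. A minimal sketch in shell; the helper names parity_loss_pct and mirror_loss_pct are made up for this example and are not part of ZFS:

```shell
# Percentage of raw capacity given up to redundancy. Integer division,
# so results are rounded down. Helper names are illustrative only.

# RAIDZ vdev with $1 disks in total, $2 of them worth of parity.
parity_loss_pct() {
    echo $(( 100 * $2 / $1 ))
}

# Striped mirrors where every mirror holds $1 copies of the data.
mirror_loss_pct() {
    echo $(( 100 * ($1 - 1) / $1 ))
}

parity_loss_pct 6 2    # 6-disk RAIDZ2
parity_loss_pct 5 2    # 5-disk RAIDZ2
mirror_loss_pct 2      # two-way mirrors
mirror_loss_pct 3      # three-way mirrors
```

This only covers the space spent on redundancy; it ignores metadata and the allocation overhead discussed further down.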

Unrecoverable Read Error (URE)

Consumer hardware has become extremely inexpensive for the capacity it can offer, but it's not perfect. All hard disks are manufactured with a specified mean time between failures (MTBF) and non-recoverable bit error rate. MTBF is nothing to worry about, as we can simply swap the disk out for a functioning one when it fails. The point of interest here is the non-recoverable bit error rate, which for consumer disks is typically 1 out of every 10^14 bits read. This means that if you read 10^14 bits (about 12.5 TB) from your disk, on average one bit will be unreadable and irreparably lost.

This is a significant problem with modern disk sizes: if a drive in a RAID array fails and is replaced, the reconstruction process reads several TB of data from multiple disks, and there's a significant chance (often above 50% - calculator here) that a single URE will be encountered. In a traditional RAID setup, the controller cannot proceed and reconstruction ends. Your data is lost.
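
As a rough sketch of where a figure like 50% comes from, assume (a simplification, not something a drive datasheet promises) that every bit read fails independently with probability p = 10^-14. The chance of hitting at least one URE while reading N bits is then:

```latex
P(\text{at least one URE}) = 1 - (1 - p)^{N} \approx 1 - e^{-pN},
\qquad p = 10^{-14}

% Example: a rebuild that reads 10 TB, i.e. N = 8 \times 10^{13} bits
pN = 0.8
\quad\Rightarrow\quad
P \approx 1 - e^{-0.8} \approx 0.55
```

Real drives do not fail this uniformly, so treat the result as an order-of-magnitude estimate rather than a prediction.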

However, because ZFS is in control of both the filesystem and the disks in a software RAIDZ, it can degrade gracefully should you encounter a URE, since it knows exactly where that bit fell. Instead of dropping your array, it simply notifies you which file was lost and moves on with the reconstruction. ZFS is also aware of free space, and so doesn't need to waste time reconstructing the free space on a replacement disk.

Disk controllers

Although your hardware may support RAID, do not use it. RAIDZ2 is a software RAID implementation that works best when ZFS has direct control over your disks. Running ZFS on top of a hardware RAID array eliminates some of the advantages of ZFS, such as being able to gracefully recover from an Unrecoverable Read Error (URE), as described in the URE section above.

If you want to add more disks and are looking to buy a PCIe add-in card, ensure that you purchase an HBA (Host Bus Adapter) that will present the disks as JBOD and not a RAID-only controller. An excellent HBA is the IBM M1015 cross-flashed to IT mode, which offers excellent performance for the price.

Optimizing the number of disks

In addition to the above, consider that the number of disks you choose to use in your pool can also have an impact on performance. Adam Nowacki posted this helpful data on the freebsd-fs mailing list (emphasis mine):

Free space calculation is done with the assumption of 128k block size.
Each block is completely independent so sector aligned and no parity
shared between blocks. This creates overhead unless the number of disks
minus raidz level is a power of two. Above that is allocation overhead
where each block (together with parity) is padded to occupy the multiple
of raidz level plus 1 (sectors). Zero overhead from both happens at
raidz1 with 2, 3, 5, 9 and 17 disks and raidz2 with 3, 6 or 18 disks.

Personally, I recommend RAIDZ2 with 6 disks - it offers a very nice balance between the cost of disks, performance and redundancy.
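
The rule in the quote can be checked with a few lines of shell. This is a simplified model rather than the actual ZFS allocator: it assumes 4 KiB sectors (the quote does not state a sector size) and a 128k block split evenly across the data disks, and the function names are made up for this example.

```shell
# Simplified model of the quoted rule. Assumptions that are NOT in the
# quote: 4 KiB sectors. zero_overhead and list_zero_overhead are
# hypothetical helper names.
BLOCK_SECTORS=$(( 128 * 1024 / 4096 ))   # 32 data sectors per 128k block

# zero_overhead DISKS PARITY -> exit status 0 if neither overhead applies
zero_overhead() {
    data_disks=$(( $1 - $2 ))
    [ "$data_disks" -ge 1 ] || return 1
    # The block must split evenly across the data disks.
    [ $(( BLOCK_SECTORS % data_disks )) -eq 0 ] || return 1
    rows=$(( BLOCK_SECTORS / data_disks ))
    total=$(( BLOCK_SECTORS + rows * $2 ))
    # Data plus parity must already be a multiple of (parity + 1) sectors,
    # otherwise the block is padded.
    [ $(( total % ($2 + 1) )) -eq 0 ]
}

# list_zero_overhead PARITY -> matching disk counts up to 18
list_zero_overhead() {
    out=""
    n=$(( $1 + 1 ))
    while [ "$n" -le 18 ]; do
        if zero_overhead "$n" "$1"; then out="$out $n"; fi
        n=$(( n + 1 ))
    done
    echo "${out# }"
}

echo "raidz1: $(list_zero_overhead 1)"
echo "raidz2: $(list_zero_overhead 2)"
```

Under these assumptions the script prints 2 3 5 9 17 for raidz1 and 3 6 18 for raidz2, matching the quote.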

When creating the pool with zpool create, replace [poolname] with the name of your zpool (e.g. "data" or "tank"), [type] with the ZFS pool type (e.g. raidz2) and finally [disks] with the disks you wish to use to create the zpool. There are several ways to specify the disks; see "How to choose device names" in the ZFS on Linux FAQ.

Note that the contents of these disks will be erased and ZFS will assume control over the partition table & disk data.
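
A minimal sketch of how the three placeholders fit together. The pool name, pool type and device paths below are made-up examples, and the script only prints the command so you can review it before touching any disks:

```shell
# Placeholder values: substitute your own pool name, type and disks.
# The device paths are invented for illustration.
POOLNAME="tank"
TYPE="raidz2"
DISKS="/dev/disk/by-id/ata-EXAMPLE_DISK_1 \
/dev/disk/by-id/ata-EXAMPLE_DISK_2 \
/dev/disk/by-id/ata-EXAMPLE_DISK_3 \
/dev/disk/by-id/ata-EXAMPLE_DISK_4 \
/dev/disk/by-id/ata-EXAMPLE_DISK_5 \
/dev/disk/by-id/ata-EXAMPLE_DISK_6"

# Print the command instead of running it, because creating a pool
# destroys whatever is on the listed disks.
pool_create_cmd() {
    echo "zpool create $POOLNAME $TYPE $DISKS"
}

pool_create_cmd
```

Once the printed command lists exactly the disks you intend to use, run it as root.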

Create one or more datasets

ZFS datasets (or "filesystems") behave like multiple filesystems on a disk would, except they are all backed by the same storage pool. You can divide your pool into several filesystems, each with different options and mountpoints, and the free space is shared among all filesystems on the pool.
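
As an illustration, the following sketch prints commands for two datasets on a pool called "tank". The dataset names, the mountpoint and the compression setting are examples chosen for this sketch, not recommendations from the text above:

```shell
# Hypothetical layout: pool "tank" with two datasets.
POOLNAME="tank"

dataset_cmds() {
    # A dataset with default options, mounted under the pool's mountpoint.
    echo "zfs create $POOLNAME/documents"
    # A dataset with its own mountpoint and compression property.
    echo "zfs create -o mountpoint=/srv/media -o compression=lz4 $POOLNAME/media"
    # List every dataset in the pool; the AVAIL column is shared pool space.
    echo "zfs list -r $POOLNAME"
}

dataset_cmds   # prints the commands; run them as root once reviewed
```

Properties set with -o at creation time can be changed later with zfs set.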

Remember to replace [poolname] as per above. Use zpool status -v to get the pool status and display any scrub errors.
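
A sketch of a manual scrub followed by a status check, assuming a pool named "tank". The scrub_and_report helper is a hypothetical name; it needs root and an existing pool, so the script only calls it when the zpool tool is present:

```shell
# Sketch only: "tank" is a placeholder pool name.
POOLNAME="tank"

scrub_and_report() {
    # Start a scrub; the command returns at once and the scrub keeps
    # running in the background.
    zpool scrub "$POOLNAME" || return 1
    # Show pool state, scrub progress and any files with errors.
    zpool status -v "$POOLNAME"
}

# Only attempt the scrub when ZFS is actually installed.
if command -v zpool >/dev/null 2>&1; then
    scrub_and_report
else
    echo "zpool not found; install ZFS first"
fi
```

Because the scrub runs in the background, run zpool status -v again later to see the final result.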

Receiving email notifications

Installing an MTA

All of ZFS's fancy data protection features are useless if we cannot respond quickly to a problem. Since Fedora 20 does not include a Mail Transfer Agent (MTA) by default, install one now to ensure we can receive email notifications when a disk goes bad:

You need to configure your myhostname to be something valid; in this case, I have chosen a free DynDNS hostname. Most ISPs block port 25, so you will need to use their mail server details for relayhost. Alternatively, you can always set up a free Gmail account and use Gmail as your relay on an alternate port (e.g. 587).
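
The sketch below assumes the MTA is Postfix, since myhostname and relayhost are Postfix settings. The hostname and relay values are placeholders, and the script prints the commands for review instead of running them:

```shell
# Assumption: Postfix is the MTA. Both values below are placeholders.
MYHOSTNAME="myserver.example.com"
RELAYHOST="[smtp.example.com]:587"   # brackets skip the MX lookup

postfix_setup_cmds() {
    echo "yum install postfix"
    # postconf -e edits /etc/postfix/main.cf in place.
    echo "postconf -e 'myhostname = $MYHOSTNAME'"
    echo "postconf -e 'relayhost = $RELAYHOST'"
    echo "systemctl enable postfix"
    echo "systemctl restart postfix"
}

postfix_setup_cmds   # prints the commands; run them as root once reviewed
```

Relaying through a provider that requires a login, such as Gmail on port 587, also needs SASL authentication and TLS settings, which are not shown here.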

Monitoring SMART disk health information

The smartd daemon can monitor your disks' health and notify you immediately should an error turn up.

yum install smartmontools
systemctl enable smartd

Edit /etc/smartmontools/smartd.conf and change the -m root flag to point to the desired email address, for example -m myaddress@gmail.com. To test if notifications are working correctly, add the line DEVICESCAN -H -m myaddress@gmail.com -M test to the configuration and then restart smartd (for example with systemctl restart smartd).

Comments

There is no need to reduce the record size for lots of small files, as recordsize actually refers to the maximum record size, not the actual record size. In your example, 32k files would still have 32k records. Setting the maximum recordsize is of more use when you have, say, a database writing in 8k blocks, but multiple blocks at a time. The initial write is fine at 128k, as the 16x8k blocks go into one 128k record. However, if the db wants the 3rd 8k chunk, the whole 128k needs to be read. If this 3rd block is modified, then the whole 128k is read again and then written out again. Not good. So unless the user is updating lots of these 32k files simultaneously, it makes little sense to reduce the recordsize for the reason you stated.