Best Hard Drives for ZFS Server (Updated Nov 2018)

Today’s question comes from Jeff….

Q. What drives should I buy for my ZFS server?

Answer: Here’s what I recommend, balancing cost per TB, performance, and reliability. I prefer NAS-class drives since they are designed to run 24/7 and are better at tolerating vibration from neighboring drives. I prefer SATA, but SAS drives would be better in some designs (especially when using expanders). For a home or small-business FreeNAS storage server with 8 or fewer drives, I think these are the best options.

2TB Hitachi Drives – $24/TB – Budget

They won’t carry the HGST 5-year warranty, but you can usually get a 1-year warranty from the seller. HGST drives are reliable, so the lower cost probably justifies the shorter warranty. 2TB HGST drives also boast an MTBF of 2 million hours!

4TB and 6TB White Label Drives – $23/TB to $27/TB – Budget

A great way to save money is to get white label drives. These are NAS-class drives made by the same manufacturers, just with the branding removed, and the seller usually provides a 1-year warranty: the 4TB White Label Drive or the 6TB White Label Drive. These are what I buy for my home and I’ve yet to have one fail. They are most likely re-branded Western Digital Reds. I hate dealing with warranty returns–I’d much rather pay a little less and keep an extra drive sitting on the shelf in case one fails than pay more for a warranty and deal with the paperwork and hassle of exchanging it.

3TB, 4TB, 5TB, 6TB, 8TB, and 10TB Drives – $37/TB to $40/TB

I’d purchase the HGST Deskstar NAS, the WD Red NAS, or the newer Western Digital Gold datacenter-class hard drives. The first two are NAS class (designed for configurations of up to 8 bays); the WD Gold is datacenter class. All are designed for 24/7 operation. The main difference is that the WD Gold and HGST run at 7200RPM while the WD Reds run at ~5400RPM, so it’s a performance vs. cost/energy/heat trade-off. The Golds also carry a 5-year warranty while the other two carry a 3-year warranty.

WD Red NAS, 64MB–128MB cache, ~5400RPM, SATA III, 3-year warranty. This drive is available in capacities from 1TB to 10TB and typically runs a little cheaper than the HGST and WD Gold versions. It’s a fantastic drive and runs cool and quiet. If it’s priced more than $5/TB below the HGST, I’d consider it to save a little money.

The Western Digital Gold is a newer datacenter-class drive. It has a larger cache, runs at 7200RPM, and comes with a 5-year warranty. These may be a bit louder than the other two drives.

Or buy a TrueNAS Storage Server from iXsystems

I’m cheap and tend to go with a DIY approach most of the time, but when I’m recommending ZFS systems for environments where availability is important I like the TrueNAS servers from iXsystems, which of course come with drives in configurations that have been well tested. The prices on a TrueNAS are very reasonable compared to other storage systems, and it can be set up in an HA cluster. Even a FreeNAS Certified Server is probably not going to cost much more than doing it yourself (more often than not it ends up being less expensive than DIY). And of course for a small server you can grab the 4-bay FreeNAS Mini (which ships with WD Reds).

Careful with “archival” drives

If you don’t get one of the drives above, be careful: some larger hard drives use SMR (Shingled Magnetic Recording), which should not be used with ZFS if you care about performance, at least until better driver support is developed. Be wary of any drive that says it’s for archiving purposes.

The ZIL / SLOG and L2ARC

The ZFS Intent Log (ZIL) should be on an SSD with power-loss protection (capacitors that can flush the cache out to flash if power is lost). I have done quite a bit of testing and like Intel’s DC S35xx, S36xx, and S37xx series drives and also HGST’s S840Z. These are rated to have their data overwritten many times and will not lose data on power loss. They run on the expensive side, so for a home setup I typically try to find them used on eBay. From a ZIL perspective there’s no reason to get a large drive, but keep in mind you generally get better performance with larger drives. In my home I use 100GB DC S3700s and they do just fine.
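If you’re adding a SLOG to an existing pool, here’s a rough sketch of the general idea. The pool name “tank” and the device label are hypothetical examples, not my setup; substitute your own names.

```python
# Sketch only: attach a SLOG (separate log device) to an existing pool by
# shelling out to zpool. "tank" and "gpt/slog0" are hypothetical names --
# substitute your own pool and SSD device/label.
import subprocess

POOL = "tank"              # hypothetical pool name
SLOG_DEVICE = "gpt/slog0"  # hypothetical GPT label for the SSD partition

subprocess.run(["zpool", "add", POOL, "log", SLOG_DEVICE], check=True)
subprocess.run(["zpool", "status", POOL], check=True)  # confirm the log vdev shows up
```

If you have two SSDs you can mirror the SLOG instead of adding a single device, which protects your in-flight sync writes if one SSD dies.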

I generally don’t use an L2ARC (SSD read cache) and instead opt to add more memory. There are a few cases where an L2ARC makes sense, such as when you have a very large working set.

Capacity Planning for Failure

Most drives running 24/7 start having a high failure rate after 3 years; you might be able to squeeze 4 or 5 years out of them if you’re lucky. So a good rule of thumb is to estimate your growth and buy drives big enough that you will only start to outgrow them in 4 to 5 years. The price of hard drives is always dropping, so you don’t really want to buy much more than you’ll need before they start failing. Consider that in ZFS you shouldn’t run more than 70% full (with 80% being the max) for typical NAS applications, including VMs on NFS. If you’re planning to use iSCSI you shouldn’t run more than 50% full.
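As a rough sketch of that math (the starting size and growth rate below are made-up example numbers, so plug in your own):

```python
# Rough capacity-planning sketch: how much usable pool space to buy so your
# data still fits under the ZFS fill limits in 4-5 years. The current size
# and growth rate are made-up example numbers.

def required_usable_tb(current_tb, annual_growth, years, max_fill):
    projected = current_tb * (1 + annual_growth) ** years  # projected data size
    return projected / max_fill                            # keep the pool under the fill limit

current_tb = 6.0   # TB of data today (example)
growth = 0.30      # 30% growth per year (example)
years = 4          # plan for the useful life of the drives

print(f"NAS/NFS (70% full): {required_usable_tb(current_tb, growth, years, 0.70):.1f} TB usable")
print(f"iSCSI   (50% full): {required_usable_tb(current_tb, growth, years, 0.50):.1f} TB usable")
```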

ZFS Drive Configurations

My preference is almost always RAID-Z2 (similar to RAID-6) with 6 to 8 drives, which provides a storage efficiency of 0.66 to 0.75. This scales pretty well as far as capacity is concerned, and with double parity I’m not that concerned if a drive fails. Six drives in RAID-Z2 nets 8TB of usable capacity with 2TB drives, all the way up to 24TB with 6TB drives. For larger setups use multiple vdevs, e.g. with 60 bays use 10 six-drive RAID-Z2 vdevs (each additional vdev increases IOPS). For smaller setups I run 3 or 4 drives in RAID-Z (similar to RAID-5). In all cases it’s essential to have backups… and I’d rather have two smaller RAID-Z servers replicating to each other than one server with RAID-Z2. The nice thing about smaller setups is the cost of upgrading 4 drives isn’t as bad as 6 or 8!
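Here’s a quick sketch of the capacity math for the layouts above (real pools will report a bit less once metadata, padding, and TB vs. TiB accounting are factored in):

```python
# Usable capacity and storage efficiency for the RAID-Z layouts discussed above.

def raidz(drives_per_vdev, parity, drive_tb, vdevs=1):
    data_disks = drives_per_vdev - parity
    usable_tb = data_disks * drive_tb * vdevs
    efficiency = data_disks / drives_per_vdev
    return usable_tb, efficiency

layouts = {
    "6 x 2TB RAID-Z2":               (6, 2, 2),
    "6 x 6TB RAID-Z2":               (6, 2, 6),
    "8 x 6TB RAID-Z2":               (8, 2, 6),
    "60 bays: 10 x 6-drive RAID-Z2": (6, 2, 6, 10),
}

for name, args in layouts.items():
    usable, eff = raidz(*args)
    print(f"{name}: {usable:.0f} TB usable, efficiency {eff:.2f}")
```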

Enabling CCTL/TLER

Desktop-class drives such as the HGST Deskstar typically aren’t run in RAID, so by default they’re configured to take as long as needed (sometimes several minutes) to try to recover a bad sector. That’s what you’d want on a desktop, but performance grinds to a halt during the recovery attempt, which can cause your ZFS server to hang for several minutes. If you already have ZFS redundancy, it’s pretty low risk to tell the drive to give up after a few seconds and let ZFS rebuild the data.

The basic rule of thumb: if you’re running RAID-Z you can only survive a single drive failure, so I’d be a little cautious about enabling TLER. If you’re running RAID-Z2 or RAID-Z3 you can survive two or three drive failures, so there’s very little risk in enabling it.
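As a hedged sketch of how you might actually set this: many (not all) drives expose TLER/CCTL through SCT Error Recovery Control, which smartctl can adjust. The device names below are examples only, the setting often resets at power-on (so run it at boot), and you should check your drive’s documentation first.

```python
# Sketch only: set SCT Error Recovery Control (the TLER/CCTL timeout) to
# 7 seconds via smartctl. Device names are hypothetical examples; not every
# drive supports this, and on many drives the setting does not persist
# across a power cycle.
import subprocess

DEVICES = ["/dev/ada0", "/dev/ada1"]   # example FreeBSD/FreeNAS device names

for dev in DEVICES:
    # smartctl takes the timeout in tenths of a second: 70 = 7.0s (read,write)
    subprocess.run(["smartctl", "-l", "scterc,70,70", dev], check=True)
    subprocess.run(["smartctl", "-l", "scterc", dev], check=True)  # read it back to verify
```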

Hi, Jon. I haven’t tested 4K-native vs. 512-emulated sector disks. I am curious, so if you come across any benchmarks let me know.

I have a 6-disk RAID-Z2 of 2TB Seagate Barracudas on my main pool and I haven’t been very happy with them. These aren’t enterprise grade because I didn’t know any better when I first built my pool. Last year I had two fail out of warranty. A couple of weeks ago another one failed; this time I got smart and replaced it with an HGST. Of the three I haven’t replaced, two are reporting SMART issues and the other one is corrupting data (ZFS keeps reporting read errors on it, and on every scrub I see it being resilvered). That’s a 100% failure rate within 4 years. So I went ahead and ordered 5 more HGST 7K4000s to replace them all and hopefully be done with it.

The 4TB Seagates are a lot more reliable than the 2TB models. I feel like if Seagate wants more of my business they should give me partial credit on those 6 drives, since they didn’t last very long. Last week I sat down to write a letter to Seagate to see if they’d be interested in earning back my business with a partial credit or a couple of free drives, but they don’t have a mailing address listed on their website.

I always try to buy with a five (5) year warranty and keep a spare drive lying around (even if I’m burning the warranty). For that matter, I always buy computers and servers in threes just so I have some swap-the-part capability. Sorry about your experience with Seagate.

I too was less than enthused about ZFS RAID-Z (I stopped using it years ago and went to mirrors). At first the pools seemed to perform like “magic,” but once they got to 70% full or above they really started to crawl (due to inherent copy-on-write fragmentation). The lousy thing is you can’t really get rid of it, as there is no (and probably never will be) block pointer rewrite (BPR). Once I had the pain of dealing with a Sun X4500 “thumper” with two 10-disk ZFS RAID sets which became fragmented and useless; I had to move the data off the chassis, rebuild smaller pools, and then move the data back (and that took a long, long time). That was the final straw that caused me to switch to mirrored pairs, and I keep them relatively empty. If I want to refresh performance I copy the data to a freshly minted mirrored pair. In general I’m pretty happy with this arrangement.

As for benchmarks, I did a little more digging (no benchmarks of 512 vs. 4K sectors by me):

This also is not my benchmark (and it’s for Windows), but the big issue here is misalignment: the unaligned setup results in a performance impact of up to 50%, depending on the workload. However, workloads that do not involve write operations, such as the Web server test pattern, don’t show any disadvantage at all in I/O testing.

The PCMark Vantage application test shows decreased performance for the new drive with its 4 KB sector size in popular Windows scenarios. … However, we cannot control the way data writes are actually executed and organized. In the case of PCMark Vantage, the benchmark was never tweaked to minimize the number of smallest-size write requests in favor of larger chunks of data (this is something ZFS does). http://www.tomshardware.com/reviews/advanced-format-4k-sector-size-hard-drive,2759.html

FYI, in ZFS RAID-Z scenarios (which I stopped using years ago), if 4K is the minimum block of data that can be written or read, data blocks smaller than 4K will be padded out to 4K. This article says that this can hurt the most when parity blocks have to be created for small chunks of data: http://www.docs.cloudbyte.com/knowledge-base-articles/implications-of-using-4k-sector-size-zfs-pools/
a) As a rule of thumb, the ZFS record size should be a multiple of the number of data disks in the RAID-Z times the sector size.
b) For example, for 4+1 the record size should be 16K (4 x 4096) and for 2+1 it should be 8K (2 x 4096).
c) CloudByte recommends 32K as the record size for the least space overhead.
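The a)/b) rule above works out as simple arithmetic (assuming an ashift=12 pool, i.e. 4K sectors):

```python
# The a)/b) rule of thumb as arithmetic: record size = data disks x sector size.
SECTOR_BYTES = 4096  # 4K sectors, i.e. ashift=12

for data_disks, parity in [(4, 1), (2, 1)]:
    record_kb = data_disks * SECTOR_BYTES // 1024
    print(f"{data_disks}+{parity} RAID-Z: suggested record size {record_kb}K")
```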

Well, you’re wise to get a 5-year warranty. Mine were bought shortly after the Thailand flooding so that may have played a role in the shorter lifespan.

You’ve done quite a bit of research there. For heavy I/O I agree that mirrors are the way to go, mostly for the extra IOPS. At home I use a 6-drive RAID-Z2 and my performance is fine for my needs… I have a few VMware VMs on NFS, but the ARC and SLOG are more than enough to run them on RAID-Z2, and the bulk of my data is movies and pictures.

I think those articles on record size might be a little dated. Nowadays everyone runs with compression=on (which for FreeNAS and OmniOS is LZ4), and that changes the thinking on record sizes and on keeping the number of data disks at a power of 2 for RAID-Z/Z2/Z3. Here are some newer articles you may find interesting:

Matt makes the point that you don’t need to worry about the record size being a multiple of the number of data disks: http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ With LZ4 compression (which at worst doesn’t hurt and at best saves space and increases performance) it doesn’t really matter much, since most of the time the block will compress into a smaller record anyway.

Max Bruning has an interesting article about how parity works with RAID-Z: https://www.joyent.com/blog/zfs-raidz-striping He states that in the case of a small write, RAID-Z will only put the data on enough disks to get the required redundancy. So if you have a 4+1 RAID-Z and write a block of 4K or less (assuming ashift=12), RAID-Z will place the data on only 2 disks (effectively mirroring it) instead of wasting space across all 5 drives.
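As a rough illustration of that allocation (this is a simplification: it ignores compression and ZFS’s padding of allocations to a multiple of parity+1):

```python
# Rough illustration of RAID-Z1 allocation: one parity sector per row of data
# sectors. Treat this as an approximation of the point in the linked articles,
# not an exact model of ZFS's allocator.
import math

def raidz_sectors(block_bytes, data_disks, parity=1, sector=4096):
    data_sectors = math.ceil(block_bytes / sector)  # sectors of actual data
    rows = math.ceil(data_sectors / data_disks)     # stripes across the vdev
    return data_sectors + rows * parity             # add one parity sector per row

# 4+1 RAID-Z with ashift=12 (4K sectors):
print(raidz_sectors(4 * 1024, data_disks=4))    # small block -> 2 sectors (1 data + 1 parity)
print(raidz_sectors(128 * 1024, data_disks=4))  # 128K block  -> 40 sectors (32 data + 8 parity)
```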

So correct me if I’m interpreting this wrong, but I think the best record size from a storage-efficiency standpoint is the largest possible (1M). Smaller writes are going to use the smallest record size they can anyway. Using a large record size will help throughput but could come at the cost of IOPS, so there may be performance reasons to force a smaller record size.

I talked to somebody from iXsystems… and apparently they are using 4K-native sector drives, specifically HGST; they told me those provide the best performance.

Also… I had some major issues with recent WD Red drives: they would time out and abort (ABRT) block read commands randomly under heavy usage, so I switched to HGST NAS drives for now. HGST also has the He (helium) series drives in 4Kn, and the non-He 4Kn drives are available as well.

Update for 2018? The HGST He6–He12 helium drives are revolutionary: they come in both SAS and SATA interfaces, with longer life, less power, less heat, and better speed than most competing models, at ~$28/TB. Backblaze is reporting leading reliability ratings for these. For the ZIL, the Intel Optane 900P is a new winner for budget users; it has no cache, so there’s no need to worry about cache power-loss capacitors, which means writes always go directly to media. Like your posts, bless you and yours.
