I have recently built a home NAS / HTPC rig. Here is my experience from research, building it, and experimentation, for those interested in building a similar solution or improving upon it… or just for the curious. I ended up including more details and writing at greater length than I first planned, but I guess you can skip over the familiar parts. I'd certainly have appreciated reading a similar review or write-up while shopping and researching.

TL;DR: ZFS on Linux, powered by an AMD A8 APU, 2x4GB RAM and 2x16GB flash drives in a Raid-1 boot drive, backed by 6x3TB WD Red in Raidz2 (Raid-6 equivalent), for under $1400 incl. tax and shipping. The setup was the cheapest that met or exceeded my goals, and it is very stable and performant. AMD's APUs are very powerful for ZFS and transcoding video, the RAM is sufficient for this array size, and flash drives are more than adequate for booting purposes (with the caveat of low long-run reliability). ZFS is a very stable and powerful FS with advanced, modern features, at the cost of learning to self-serve. The build gives 10.7TiB of usable space with double parity and transparent compression and deduplication, with ample power to transcode video, at a TCO of under $125/TiB (~2x raw disk cost). Reads/writes are sustained above 350MB/s with gzip-9 but without deduplication.

This is significantly cheaper than ready-made NAS solutions, even when ignoring all the advantages of ZFS and generic Linux running on a modern quad-core APU (compare with the Atom or Celeron chips that often plague home NAS solutions).

Research on hardware vs. software RAID

The largest cost of a NAS is the drives. Unless one wants to spend an arm and a leg on h/w raid cards, that is. Regardless, the drives were my first decision point. Raid-5 gives 1-disk redundancy, but due to high resilver time (rebuilding a degraded array), the chance of a secondary failure increases substantially during the days, and sometimes weeks, that resilvering takes. Drive failures aren't random or independent occurrences: drives from the same batch, used in the same array, tend to have very similar wear and failure modes. As such, Raid-5 was not good enough for my purposes (high data security). Raid-6 gives 2-drive redundancy, but requires a minimum of 5 drives to make it worthwhile; with 4 drives, 2 of them used for parity, Raid-6 is about as good as a mirror. Mirrors, however, have the disadvantage of being susceptible to total failure if the 2 drives that fail happen to be the two sides of the same mirror. Raid-6 is immune to this, but its IO performance is significantly poorer than mirroring's.
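
To make these trade-offs concrete, here is a back-of-the-envelope comparison, a sketch using the 6x3TB drives the build ended up with. The mirror's "failures tolerated" is the guaranteed minimum; a striped mirror can survive more failures if they land in different pairs.

```python
# Rough comparison of the raid layouts discussed above, for 6x3TB drives.
TB = 3e12  # one 3TB drive, decimal bytes as sold
N = 6      # number of drives

layouts = {
    # name: (data drives, guaranteed failures tolerated)
    "raid5":  (N - 1, 1),
    "raid6":  (N - 2, 2),
    "mirror": (N // 2, 1),  # 3 striped 2-way mirrors
}

for name, (data_drives, failures) in layouts.items():
    usable_tib = data_drives * TB / 2**40  # decimal TB -> binary TiB
    print(f"{name:>6}: {usable_tib:5.2f} TiB usable, survives any {failures} failure(s)")
```

As the output shows, Raid-6 trades about 2.7TiB of usable space against Raid-5 for the second drive of redundancy, while mirroring gives up even more space for better IO.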

Resilver time could be reduced significantly with true h/w raid cards. However, they cost upwards of $300 for the cheapest and realistically $600-700 for the better ones. In addition, most give peak performance only with a backup battery, which costs more and, I expect, needs some maintenance. At this point I started comparing h/w with s/w performance. My personal experience with software data processing told me that a machine that isn't under high load, or is dedicated to storage, should do at least as well as, if not better than, the raid card's on-board processor. After all, even the cheapest CPUs are much more powerful, have ample fast cache, and, with GBs of system RAM running at 1333MHz or better, should beat h/w raid easily. The main advantage of h/w raid is that, with the backup battery, it can flush cached data to disk even on power failure. But this assumes that the drives still have power! For a storage box, everything is powered from the same source, so the same UPS that keeps the drives spinning will also keep the CPU and RAM pumping data long enough to flush the cache to disk. (This is true provided no new data is being written while on battery power.) The trouble with software raid is that there is no abstraction of the disks from the OS, so it's harder to maintain (the admin maintains the raid in software). Also, resilvering with h/w raid will probably be lighter on the system, as the card handles the IO without affecting the rest of the system. But if I was to accept that performance penalty, I was probably still going to meet my goals even during resilvering. So I decided to go with software raid.

The best software solutions were the following: Linux raid using MD, BtrFS, or ZFS. The first is limited to traditional Raid-5 and Raid-6; it is straightforward to use, bootable and well-supported, but it lacks modern features like deduplication, compression, encryption and snapshots. BtrFS and ZFS have these features, but are more complicated to administer. Also, BtrFS is still not production-ready, unlike ZFS. So ZFS it was; the feedback online on ZFS is great, too. One important note on software raid solutions is that they don't play well with h/w raid cards, so if there is a raid controller between the drives and the system, it should be set to bypass the devices or work in JBOD mode. I'll have more to say on ZFS in subsequent posts.

To reach 8TiB with 2-drive redundancy I had to go with either 4x4TB drives or 5x3TB. But with ZFS (RaidZ), growing an array by adding drives to an existing vdev is impossible. The only options are to replace the drives with larger ones (one at a time, resilvering each, until all have the larger capacity, at which point the new space becomes available to the pool), or to create a new vdev from a new set of drives and extend the pool with it. While simply adding a 6th drive to an existing 5 would have been a sweet deal, the upgrade path still works: when the time comes I can replace all the drives with larger ones and enjoy new-drive longevity and extra disk space. But first I had to have some headroom, so it was either 5x4TB = 12TB or 6x3TB = 12TB.
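
On paper the two candidate layouts come out identical. A quick sketch of the raidz2 arithmetic (usable data capacity before filesystem overhead, which is why a live pool reports somewhat less):

```python
def raidz2_usable_tib(drives, tb_per_drive):
    """Data capacity of a raidz2 vdev (2 drives' worth of parity), in TiB."""
    return (drives - 2) * tb_per_drive * 1e12 / 2**40  # decimal TB -> TiB

print(f"5x4TB raidz2: {raidz2_usable_tib(5, 4):.2f} TiB")
print(f"6x3TB raidz2: {raidz2_usable_tib(6, 3):.2f} TiB")
```

Both give 12TB of data capacity, about 10.9TiB, so the decision comes down to drive count, price per drive, and the upgrade path.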

Which drives? The storage market is still recovering from the Thailand flood; prices are coming down, but they are still not great. The cheapest 3TB drives are very affordable, but they are 7200rpm: heat, noise and the power bill are all high with 6 drives in an enclosure, and they come with a 1-year warranty! Greens run cool and are low on power requirements, but they are more expensive, and the warranty isn't much better, if at all longer. WD Red drives cost ~$20 more than the Greens, come with all the advantages of a 5400rpm drive, are designed for 24/7 operation and have a 3-year warranty. The only disadvantage of 5400rpm drives is lower IOPS, but 7200rpm doesn't make a night-and-day difference anyway. Considering that it's more likely than not that more than one drive will fail within 3 years, the $20 premium is a warranty purchase, quite apart from the NAS-drive-vs-home-usage advantage. There was no 4TB Red to consider at the time (Seagate and WD have "SE" 4TB drives that cost substantially more); although the cost per GB would have been the same, it was outside my budget (~$800 for the drives) and I had no immediate use for 16TB of usable space to justify the extra cost.

Computer hardware

AMD's APU with its tiny heatsink and 2x 4GB DIMMs. Notice the absence of a discrete GPU (it's on the CPU chip).

I wanted the cheapest hardware that could do the job. I needed no monitor: just an enclosure with motherboard, RAM and CPU. The PSU had to be solid, supplying clean, stable power to the 6 drives. Rosewill's Capstone is one of the best on the market at a very good price; the 450W version delivers 450W continuous, not peak (peak is probably in the ~600W range). I only needed ~260W continuous, plus headroom for the initial spin-up. The case had to be big enough for the 6 drives, plus front fans to keep the already cool-running drives even cooler (drive temperature is very important for data integrity and drive longevity). Motherboards with more than 6 SATA ports are fewer and typically cost significantly more than those with 6 or less. With 6 drives in raid, I was missing a boot drive. I searched high and low for a PCI-e SSD, but it seems there was nothing on offer at a good price, not even used (even the smallest ones were very expensive). The best price was a WD Blue 250GB (platter) for ~$40, but it would take up a precious SATA port, or cost me more for a motherboard with 7 ports. My solution was to use flash drives: they are solid-state and come in all sizes and at all prices. I got two A-DATA 16GB drives for $16 each, thinking I'd keep the second one for personal use. It was only after I placed the order that I thought I should Raid-1 the two drives for better reliability.

With ZFS, RAM is a must, especially for deduplication (if desired); the recommendation is 1-2GB for each TB of storage. So far, I see ~145 bytes/block used in core (RAM), which for ~1.3TiB of user data in 10 million blocks = 1382MB of RAM. Those 10 million blocks were used by <50K files (yes, mostly documentaries and movies at this point). The per-block requirement goes down as duplicate blocks increase, so it's important to know how much duplication there is in those 1.3TiB. In this particular case, almost none (there were <70K duplicate blocks in those 10 million). So if this were all the data I had, I should disable dedup and save myself the RAM and processing time. But I still have ~5TiB of data to load, with all of my backups, which surely have a metric ton of dups in them. Bottom line: 1GB of RAM per 1TiB of data is a good rule of thumb, but it looks like a worst-case scenario here (and it would leave little room for file caching). So I'm happy to report that my 8GB of RAM will do OK all the way to 8TiB of user data, and realistically much more, as I certainly have duplicates. (Yes, I had to settle for 8GB, as the budget and this year's 40-45% RAM price hike didn't help; had the downward price trend continued from last year, I'd have gotten 16GB for almost the same dough.) Updates on RAM and performance below.
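
A sketch of that back-of-the-envelope dedup-table math, using the numbers above:

```python
# Dedup-table RAM estimate: ~145 bytes of core (RAM) per unique block,
# observed for ~1.3TiB of user data in ~10 million blocks.
bytes_per_entry = 145
blocks = 10_000_000

ddt_ram_mib = bytes_per_entry * blocks / 2**20
print(f"~{int(ddt_ram_mib)} MiB of RAM for the dedup table")

# Average block size implied by those figures, for context:
avg_block_kib = 1.3 * 2**40 / blocks / 2**10
print(f"~{avg_block_kib:.0f} KiB average block size")
```

Scale the same arithmetic to 8TiB of similarly-blocked, mostly unique data and you land in the ~8.5GB range, which is why 8GB of RAM is cutting it close in the worst case and fine once duplicates bring the per-block cost down.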

CPU-wise, nothing could beat an AMD APU, which includes a Radeon GPU, in price/performance. I could either go dual-core at $65, or quad-core at $100 with the L2 cache upgraded from 1MB to 4MB and a better GPU core. I went for the latter, to future-proof video decoding and transcoding and to give ZFS ample cycles for compression, checksums, hashing and deduplication. The GPU on the CPU also loves high-clock RAM. After shopping for a good pair of DIMMs that work in dual-channel at 1600MHz, I found 1866MHz ones for $5 more that are reported to clock to over 2000MHz. So G.Skill wins the day yet again for me, as it did on my bigger machine with 4x8GB G.Skill @ 1866MHz. I should add that my first choice in both cases had been Corsair, as I've been a fan for over a decade, but at least on my big build they failed me: they weren't really stable in quad-channel (certainly not at the advertised frequency), while the G.Skill has overclocked from 1866MHz to 2040MHz with the CPU at 4.6GHz (but that's the big boy, not this NAS/HTPC).

Putting it all together

I got the drives for $150 each and the PC cost me about $350. The two 16GB USB 3.0 flash drives are partitioned for swap and rootfs: swap on Raid-0, ext4 on Raid-1. Even though I can boot off of ZFS, I didn't want to install the system on the raid array, in case I need to recover it; it also simplifies things. The flash drives are for booting, really: /home, /usr and /var go on ZFS. I can back everything up to other machines, and all I'd need is to dd the flash drive image onto a spare to boot the machine in case of a catastrophic OS failure. I also keep a Linux rescue disk on another flash drive at hand at all times; it automatically detects MD partitions, loads mdadm and lets me resilver a broken mirror. One good tip: set mdadm to boot in degraded mode and rebuild, or send an email to get your attention. You probably don't want to end up at a rescue disk and a blank screen to resilver the boot raid.

The 6 Red drives run very quietly and end up just warmer than the case metal (with ambient at ~22C): they don't feel metallic-cold even under sustained writing, thanks to two 120mm fans blowing on them. Besides those, no other fans are used; AMD's stock cooler is an unassuming little thing, quieter than my 5-year-old laptop. The A-DATA drives in raid give upwards of 80MB/s sustained reads (averaged over a full-drive dd read) and drop to ~19MB/s sustained writes; bursts reach 280MB/s reading and a little over 100MB/s writing. Ubuntu Minimal 13.04 was used (which comes with kernel 3.8), the kernel was upgraded to 3.11, and ZFS pool version 28 (the latest available for Linux; Solaris is at 32, which adds transparent encryption) was installed. After Raidz-2 zpool creation, 16.2TiB of raw disk space is reported, with 10.7TiB usable (excluding parity). The system boots faster than the monitor (a Philips 32" TV) turns on, that is to say, in a few seconds. The box is connected with a Cat-5 cable to the router, which assigns it a static IP (just for my sanity).

I experimented with ZFS for over a day, just to learn to navigate it while reading up on it, before scratching the pool for the final build. Most info online is out of date (from 2009, when deduplication was all the rage in FS circles), so care must be taken when reading about ZFS. Checking out the code, building it and browsing it certainly helps. For example, online articles will tell you Fletcher4 is the default checksum (it is not) and that one should use it to improve dedup performance (instead of the much slower sha256); but the code reveals that deduplication defaults to, and forces, the sha256 checksum, and that sha256 is the default even for the on-disk integrity checksums. Therefore, switching to Fletcher4 will only weaken on-disk integrity checking, without affecting deduplication at all (Fletcher4 was removed from the dedup code when a severe endianness bug was found). With dedup enabled, speed should only get worse with Fletcher4, because both checksums must now be computed (without dedup, Fletcher4 should improve performance at the cost of data security, as Fletcher4 is known to have a much higher collision rate than sha256).
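
To illustrate the difference between the two checksums, here is a simplified Python rendition of the Fletcher4 idea (four running 64-bit sums over 32-bit little-endian words; the details are approximated, so treat it as a sketch of the scheme rather than the exact on-disk algorithm) next to sha256:

```python
import hashlib

MASK64 = (1 << 64) - 1

def fletcher4(data: bytes):
    """Sketch of fletcher4: four cascading 64-bit sums over 32-bit words.
    Very cheap to compute, but far weaker collision resistance than a
    cryptographic hash, hence unsuitable as a dedup key."""
    a = b = c = d = 0
    for i in range(0, len(data), 4):
        w = int.from_bytes(data[i:i + 4], "little")
        a = (a + w) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)

block = b"some block of file data" * 100
print(fletcher4(block))                       # fast, weak integrity checksum
print(hashlib.sha256(block).hexdigest())      # what dedup actually keys on
```

The cascading sums make Fletcher4 sensitive to byte order and position (unlike a plain sum), but blocks can still collide far more easily than under sha256, where a collision is computationally infeasible.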

ZFS administration is reasonably easy, and it does all the mounting transparently for you. It also has smb/nfs sharing administration built in, as well as quota and ACL support. You can set up as many filesystems as necessary, anywhere you like; each filesystem looks like a folder with subfolders (the difference between the root of a filesystem nested within another and a plain subfolder is not obvious). The advantage is that each filesystem has its own settings (inherited from the parent by default) and statistics (except for dedup stats, which are pool-wide). Raid performance was very good. I didn't do extensive tests, but sustained reads reached 120MB/s. Ingesting data from 3 external drives connected over USB 2.0 runs at 100GB/hour using rsync, each drive writing into a different filesystem on the same zpool: one copying RAW/Jpg/Tif images (7-8MB each) with gzip-9, and two copying compressed video (~1-8GB) and SHN/FLAC/APE audio (~20-50MB) with gzip-7. Deduplication is enabled; the checksum is sha256. ZFS's background integrity checking and auto-rebuild of corrupted on-disk data does have a non-negligible impact on write rates: the Red drives can do no more than ~50 random-read and ~110 random-write IOPS, but under the aforementioned load each levels off at ~400 IOPS, since most writes are sequential. These numbers fluctuate with smaller files, such that IOPS drop to 200-250 per drive and average ingestion falls to a third, ~36GB/hour. This is mostly due to FS overhead on reads and writes, which forces much higher seek rates than sequential writes do. The CPU runs at ~15-20% user and ~50% kernel, leaving ~25% idle, on each of the 4 cores at peak times, and drops substantially otherwise. iostat shows about 30MB/s sustained reads from the source drives combined, and writes on the Reds averaging 50MB/s with spikes of 90-120MB/s (this includes parity, which is 50% of the data, plus updates of FS structures, checksums, etc.)

2x 16GB flash drives in RAID-1 as boot drive and HDMI connection.

UPDATE: It's been 3 days since I wrote the above, and I now have over 2TiB of data ingested (I started fresh after the first 1.3TiB). The drives sustain a very stable ~6000KB/s of writes and anywhere between 200 and 500 IOPS (depending on how sequential they are); typically it's ~400 IOPS and ~5800KB/s. This translates into ~125GiB/hour (about 85GiB/hour of user-data ingestion), including parity and FS overhead. Even though the write rate goes down with gzip-9 and highly compressible data, I am now writing from 4 threads and the drives are saturated at the aforementioned rates, so at this point I'm fairly confident the ingestion bottleneck is the drives. Still, 85GiB/hour is decent for the price tag. I haven't done any explicit performance tests, because that was never among my goals. I'm curious to see the raw read/write performance, but this isn't a raw raid setup, so filesystem overhead is always in the equation, and that will vary quite a bit as data fills up and the dedup tables grow; the numbers wouldn't be representative. Still, I do plan to run some tests once I've ingested my data and have a real-life system with actual data.
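
A sanity check of those rates (assuming the ~6000KB/s figure is per drive across all 6 drives and that KB here means KiB; the results land close to the ~125 and ~85 GiB/hour quoted above):

```python
# Ingestion-rate arithmetic for the 6x3TB raidz2 pool.
drives = 6
per_drive_kib_s = 6000  # sustained writes observed per drive

total_gib_h = drives * per_drive_kib_s * 1024 * 3600 / 2**30
user_gib_h = total_gib_h * 4 / 6  # 2 of every 6 drives' worth of writes is parity

print(f"~{total_gib_h:.0f} GiB/hour to disk, ~{user_gib_h:.0f} GiB/hour of user data")
```

The small gap between this estimate and the measured numbers is plausibly FS metadata and checksum traffic, which the 4/6 parity ratio doesn't capture.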

Regarding compression, high-compression settings affect only performance. If the data is not compressible, the original data is stored as-is: no penalty is incurred for reading it back (I read the code), nor is there extra storage overhead (incompressible data typically grows a bit when compressed). So for archival purposes the only penalty is slower ingestion. Unless frequent modification is expected, slow-but-good compression is a good compromise, as it yields a few percentage points of compression even on mp3 and jpg files.
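
The store-the-smaller-copy behavior is easy to demonstrate with any compressor. A sketch in Python with zlib (simplified: the real ZFS logic additionally requires a minimum saving before it keeps the compressed copy):

```python
import hashlib
import zlib

def gen_incompressible(n: int) -> bytes:
    """Deterministic pseudo-random bytes (hash chaining), a stand-in for
    already-compressed data like mp3/jpg."""
    out, seed = b"", b"seed"
    while len(out) < n:
        seed = hashlib.sha256(seed).digest()
        out += seed
    return out[:n]

def store(data: bytes, level: int = 9) -> bytes:
    """Keep the compressed copy only if it is actually smaller,
    mirroring the store-as-is behavior described above."""
    comp = zlib.compress(data, level)
    return comp if len(comp) < len(data) else data

block = gen_incompressible(4096)
text = b"highly compressible text " * 200

print(len(zlib.compress(block, 9)))  # slightly larger than 4096: would be stored as-is
print(len(zlib.compress(text, 9)))   # far smaller than 5000: compressed copy kept
```

So incompressible blocks cost only the wasted compression attempt at write time; reads of such blocks skip decompression entirely.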

With Plex Media Server installed on Linux, I can stream full-HD movies over WiFi (thanks to my Asus dual-band router) transparently while ingesting data. I haven't tried heavy transcoding (say, HD to iPhone), nor have I installed a window manager on Linux (AMD APUs show up in forums with Linux driver issues, but that's mostly old and about 3D games, etc.) Regarding the overhead of dedup: I can disable dedup per filesystem and remove the overhead for folders that don't benefit anyway, so it's very important to design the hierarchy correctly and organize filesystems around file types. Worst-case scenario: upgrade to 16GB RAM, which is the limit for this motherboard (I didn't feel the need to pay an upfront premium for a 32GB-max board).

I haven't planned for a UPS. Some are religious about availability and avoiding hard power cuts; I'm more concerned about the environmental impact of batteries than anything else. ZFS is very resilient to hard reboots, not least thanks to its transactional, copy-on-write design, data checksums and background scrubbing (validating checksums and rebuilding transparently). I've had two hard power cycles that recovered transparently. I also know that ztest, a developer test tool, performs all sorts of crazy corruptions and kills, and it's reported that no corruptions were found over a million test runs.

Conclusion

For perhaps anything but the smallest NAS solutions, a custom build will be cheaper and more versatile. The cost is the lack of warranties or satisfaction guarantees (responsibility is on you) and the possibility of ending up with something underpowered, or worse. Maintenance might be an issue as well, but from what I gather, ready-made NAS solutions are known to be very problematic, especially once they show any issue, like a failed drive, buggy firmware or flaky management software. ZFS has proved, so far at least, to be fantastic! Especially as deduplication and compression really do work well and increase data density without compromising integrity. I also plan to make good use of snapshots, which can be configured to auto-snapshot at a preset interval, for backups and code. The only thing I miss in ZFS on Linux is transparent encryption (Solaris got it, but it hasn't been allowed to trickle down yet). Otherwise, I couldn't be more satisfied (except maybe with 16GB RAM, or larger drives… but I would settle for 16GB RAM for sure).

29 Responses to “18TB Home NAS/HTPC with ZFS on Linux (Part 1)”

I really appreciate this post. I'm planning on building something very similar this coming June (2014). I'm outgrowing my 4TB storage (2x2TB in LVM), and I would like to incorporate raid to mitigate the risk of drive failure as I add more disks. I currently use a second, independent 4TB store for backups and plan to build two independent systems like the one in this post for the same purpose. I'm planning to use 6x2TB disks in Raid-5 for each system (because I've already got 6 of the 12 I would need: 4 in my storage and 2 spare). That will give me almost as much usable space as 6x3TB in Raid-6. Hopefully the independent backup system will save me if I ever have to deal with 2 disk failures on the same array. I would like to use BtrFS instead of ZFS, as long as Raid-5 support is ready in time. Among other reasons, the most compelling is that it allows the array to grow or shrink online.

I still haven't figured out an inexpensive way to establish offsite storage of this size, except maybe building a third such system to keep at a friend's house and sync with btsync. Maybe that's overkill, but the larger my data set grows, the more I worry about data loss. Any thoughts?

The only obvious suggestion is to do Raid-6 with the 6 drives. With 6 drives, the risk of losing a drive while resilvering is a bit high, and if you lose a second drive on Raid-5, you lose the array, and your data, completely.

BtrFS is a good alternative to ZFS, albeit not as mature. If you feel its features and stability are good enough for your needs, then by all means use it. Ultimately you need to have a good understanding of your needs and how you use your data. That will make deciding redundancy of the array and the necessary features of the software stack rather simpler.

On backup, there are a few options. If your most valuable data has a (much) smaller footprint than the full array size, then you can have two external drives: back up your files to one while keeping the other at a remote location, and swap them after each backup. This way you'd have two backups, one of the current data and one a version older, at two locations. Alternatively, you could upload your data (after encryption) to a cloud backup service. However, if you need to back up the full array, I think two identical rigs with rsync or ZFS send/receive should do the trick (provided you have the bandwidth to back up faster than you ingest/modify data!).

You have enough time to do more research. You can play with ZFS and BtrFS on loop devices, btw, to get a good feel for, and experience with, their features.

ECC with ZFS is a subject that I've discussed a number of times offline but didn't touch upon in my write-ups. Technically speaking, ECC is necessary to avoid corruption before a block is written to disk (at which point the checksum is computed over the already-corrupted block, making it useless).

But in practice, for a home setup, it's probably excessive. The reason is that any point in the pipeline can also introduce corruption, including all the other machines, routers, hubs, wires, etc. Example: suppose you download some files from the web, and you have ECC, ZFS and top-grade hardware, but the file was corrupted before it even reached your gateway or router. What do you do? Well, if you don't check, your written copy will be corrupt. But how do you check? Only if the server provides a checksum can you compare; otherwise, you can try downloading again and comparing the bytes (the same corruption is very unlikely to happen at the exact same bytes, unless someone is deliberately modifying the data in flight).

This brings us to the main point: if you have a way to validate ingested data, the case for ECC is weakened. If you don't have such a method (say, your data is generated by software on the network), then you should worry about the weakest point in the chain: all your machines, routers, cables, etc. must be of the highest quality and do error detection/recovery. For example, while TCP does error detection and retransmission, UDP does not, so UDP would be a very bad choice for any data you care about.
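
For data fetched over the network, end-to-end validation boils down to comparing the payload against a digest the server published. A minimal sketch:

```python
import hashlib

def verify(data: bytes, published_sha256: str) -> bool:
    """End-to-end check: does the payload match the checksum the server
    published? This catches corruption anywhere along the pipe, whether
    from a flaky router, cable, or non-ECC RAM on an intermediate host."""
    return hashlib.sha256(data).hexdigest() == published_sha256

payload = b"contents fetched over the network"
published = hashlib.sha256(payload).hexdigest()  # what the server would advertise

print(verify(payload, published))            # intact transfer passes
print(verify(payload + b"\x00", published))  # any altered byte fails
```

When such a digest exists, the write path's RAM is just one more link in a chain you are already verifying end to end.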

Finally, ECC RAM is hardly the only thing to upgrade: the motherboard and CPU must also support ECC. That brings the total hardware cost to at least twice (if not more than) the original.

So, if you can afford it, by all means use ECC. But for me, who wanted the cheapest solution: I feel that eating healthy food is still better than eating junk food, even if I don't exercise regularly or follow every other healthy practice. ZFS is a great improvement to data integrity, with or without ECC.

If the source of a file is corrupted, only that file ends up stored corrupted. On the other hand, if the RAM goes bad, ZFS will actively corrupt ALL your data during scrubbing and whatnot! I understand that ECC requires more investment in hardware; I just wanted to point out that there is a potential risk, albeit small.

I don't think this is about RAM going bad. Both ECC and non-ECC RAM can go bad, and that is not what ECC protects against. As I understand it, it is about a fluke bit flip causing a data error due to some external influence, such as cosmic rays/particles. IOW, it is a rare, transient and intermittent incident, and ECC can detect and correct that rare bit-flip condition.

That is much different from data corruption caused by a RAM module going bad, which is not the issue here. Yes, you can buy an enterprise-level server board that uses ECC and possibly incorporates standby RAM that takes over when another module is detected as malfunctioning… Such reliability may be priceless if you need it and can afford it, but such populated motherboards typically cost 10-20 times what a personal-use computer does and cost a lot more in electricity to operate.

In a personal LAN setup, what's the point? If you have 5 normal computers without ECC sending their files to an ECC-equipped file server, the potential for cosmic bit flips still exists: the file server accepts and stores the files as received and cannot correct them.

Nashod, you're making a very DANGEROUS assumption about ECC RAM; let me clarify so that you can also edit your original post. 1) ECC IS a HARD requirement for ZFS. It doesn't matter if it's for your home "barely used" NAS or a super-performance one; otherwise you're playing Russian roulette, as I'll explain. 2) "Technically speaking, ECC is necessary to avoid corruptions before writing to disk (which will include the corrupted block's checksum, which is useless)" is wrong and a dangerous mislead. ZFS checks checksums ALL THE TIME, not only on writes (that's why it has unmatched integrity). So if you have corrupted memory and perform a read, the values read will NOT match the checksum, triggering a block repair, which will repair based on the corrupt data in your RAM. This is a self-sustaining death spiral, and in short order it will render the entire filesystem corrupt beyond repair. This death spiral is particular to ZFS, because of how it handles integrity plus its large RAM usage for caching, which guarantees you will hit data at a corrupted memory address; it won't happen with other filesystems (except BtrFS, which does the same).

This is not fearmongering; there's plenty of evidence of users whose systems broke beyond repair, and after running a memtest it turned out they had memory errors. In one case a guy left his NAS turned on overnight without any activity. Harmless, right? Nope: the next day the system had crashed and ZFS was unrecoverable.

Let me put it this way: you want to implement ZFS because it has unmatched data integrity features, yet you put a zero-integrity buffer on top of it? If you don't want to use ECC, then don't use ZFS; go with soft-raid, period.

Here's a list of people who lost all their data, all of them due to RAM errors:

Thanks Guillermo. I’ll check the references. I keep backups of my data and expect anyone who cares about their files to do the same. ZFS is not a backup solution and the data shouldn’t be assumed to be indestructible.

Nashod, the thing is that if you're thinking about implementing ZFS, it is because you have an interest in its indestructibility/resiliency vs. "normal" filesystems like NTFS/EXTx that have zero provision for data integrity. But if you put it on a host without ECC RAM, you're actually lowering the reliability below that of the filesystems you seek to replace in the first place! The same ECC-class error would probably do nothing, or corrupt part of a file, on NTFS/EXT (I speak from experience), yet on ZFS it will destroy the entire FS irreparably. Backups aren't always possible past a certain size (the backup solution would be either non-portable or too expensive to be worth it), and that still leaves you with data loss between backup windows. And for a home server, backups often aren't even on the menu; that's the whole point of a ZFS setup: device redundancy via raid AND data integrity via continuous checksums.

Agreed, I replaced all the hardware in my NAS when I discovered that ECC was not optional for ZFS. If I’d known as much when I first purchased it, the added cost would have been nominal. Instead, I ended up buying two sets of everything.

Not having ECC doesn’t just mean that a file here or there will get corrupted. If you get a bad chip, ZFS will detect errors across your drive every time it scrubs, and will attempt to “correct” them by writing back bad data to disk. So even if the file gets written correctly the first time, it’ll get corrupted over time.

Please add something to your post noting that ECC RAM is required for ZFS… otherwise it just propagates the rumor that ECC is optional.

Thanks Toby for the feedback. More info is highly valuable in this domain.

> If you get a bad chip, ZFS will detect errors across your drive every time it scrubs, and will attempt to “correct” them by writing back bad data to disk. So even if the file gets written correctly the first time, it’ll get corrupted over time.

I don't think this is necessarily accurate; it assumes a number of things that don't hold. For one, it assumes that the reconstructed data has at least equal weight to the other copies, which needn't be true. Suppose the "corrected" data ends up corrupted due to bad RAM: shouldn't we expect ZFS to detect this bad copy in exactly the same manner it detected the original problem? The same method for detecting bad data should hold whether the data was "original" or "corrected": if it doesn't match the other copies and/or the checksums, it is at fault, not the other copies. It also assumes that the checksum (which we know to be valid, as the file was originally written correctly) will match the corrupted data, which it won't (the chance of that is essentially nil). Indeed, it is because of the checksum mismatch that ZFS will detect this "corrected" data as bad and try to replace it yet again. Note that I'm talking about 3+ disk raid configurations, not mirrors.
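
A toy model of this argument (hypothetical code, not ZFS internals): a copy that was mangled in RAM fails the stored checksum just like any other bad copy, so it gets no special weight over the originals.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def pick_good_copy(copies, stored_checksum):
    """Return the first copy whose checksum matches the stored one, or None
    if every copy is bad. A 'corrected' copy must pass the same test as the
    originals; it cannot silently displace a copy that still verifies."""
    for c in copies:
        if checksum(c) == stored_checksum:
            return c
    return None

original = b"block contents"
stored = checksum(original)              # checksum recorded at write time

corrupted_in_ram = b"block c0ntents"     # a rebuilt copy mangled by bad RAM
print(pick_good_copy([corrupted_in_ram, original], stored) == original)
```

Under this model, bad RAM causes repeated repair attempts rather than a silent overwrite of good copies; whether a real implementation behaves this way in every path is exactly what the thread is debating.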

Otherwise, I think it's fair that potential ZFS users should know that ECC is considered highly recommended.

However, you get a much lighter load on the system with mirroring: no need for parity checks, calculations, etc.

As a consequence, the resilvering time required for a mirror is minimal, and a striped-mirror pool can even survive 3 drive failures (as long as no 2 of them are in the same mirror, of course!).

Using mirror vdevs also makes it easy to grow the pool's storage. As you wrote, you can't resize a vdev without changing all its disks, one at a time, with a resilver each!

Have 6 drives? 3 vdevs of 2-drive mirrors. Need to expand in a year? Drive prices will have dropped, so it's easy to either get another set of 2 drives of the same capacity, or go for larger drives, whose prices will likely be no more than what one paid a year earlier.

Regarding rust drives (spinning platters, I mean…), I selected the Seagate Video Surveillance models. They are designed to spin 24/7, with a 3-year warranty and the option of the "Rescue Plan" on top, where (if I am not mistaken) Seagate will recover all the data on a failed drive for free. Maybe more expensive, but in the end, what is most important?

Given current SSD reliability, I built my system with EFI boot, /boot and / on a single 120 GB SSD, and /home striped across 2 x 256 GB SSDs (Crucial MX100, with power-failure protection). Then I went crazy on a special deal for the Seagate 3 TB Video Surveillance drives (7,200 rpm, 64 MB cache, Rescue Plan included) at $100 each… got 10 of them… These will form one pool for most data (movies, music, backups of household computers), organized into their respective ZFS datasets.

Then I have a second 120 GB SSD that I sliced, giving a small part to the ZIL and a large one to the L2ARC. Maybe I’ll get another SSD in the future to give each its own device, but even shared, these should remain faster than any rust drive. No need for either on the SSD pool (/home).
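In ZFS terms, that sliced SSD gets attached as a log (SLOG) device and a cache (L2ARC) device. A sketch, with hypothetical pool and partition names:

```shell
# Small slice for the intent log (ZIL/SLOG), large slice for L2ARC.
zpool add tank log   /dev/disk/by-id/ata-SSD2-part1
zpool add tank cache /dev/disk/by-id/ata-SSD2-part2

# Verify: both devices should now appear in the pool layout.
zpool status tank
```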

I also capped the maximum ZFS cache (ARC) size to constrain ZFS’s hunger for RAM. It might hurt performance a little, but it is a home desktop/server after all and I don’t care too much; I am not serving huge database records to millions of people.
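On ZFS on Linux, that cap is the zfs_arc_max module parameter. A sketch, assuming a 4 GiB limit (pick your own figure):

```shell
# Persistent: cap the ARC at 4 GiB (value in bytes), applied when the
# zfs module loads.
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf

# Immediate: change the running value without a reboot.
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
```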

ECC or not ECC? That is the question! Well, it seems that ZFS, especially for home use, can work very well without ECC. As long as it can find one good copy of a block, it is fine. ZFS will NEVER write anything bad to disk.

Regarding ZFS needing backup, remember that snapshots exist. There is also a property that can be set: copies=2 (or more!). This gets ZFS to always keep more than one copy of each block in the dataset, spread across vdevs where possible, which is another reason having all mirror vdevs is a good idea!
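The property in question is set per dataset. A sketch (the dataset name is hypothetical):

```shell
# Keep two copies of every block written to this dataset from now on.
# Note: existing blocks are not duplicated retroactively, and ZFS
# spreads the extra copies across vdevs on a best-effort basis only,
# so this is not a guaranteed substitute for mirroring or backups.
zfs set copies=2 tank/important
zfs get copies tank/important
```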

If you back up a corrupted file, it will still be corrupt. It is nevertheless a good idea to back up data that CANNOT BE LOST (financial documents, priceless photos, love letters, websites, emails, the WIFE’s DATA!…). Movies, music and the OS are NOT IMPORTANT (well, not as much). And do not keep the backup on the same ZFS pool: if the pool is the first thing to go, you’ll be very annoyed! Get another big X TB drive for this sole purpose, and mirror it if that’s not too expensive.

BTW, two of us could also build completely identical systems and back each other up! And again…

Then you can sign up for Microsoft Office 365 and have unlimited OneDrive storage for the very low price of $150/year… less than what you would pay for a VPS with similar storage…

Just remember: anything can go wrong, and in time, something wrong WILL HAPPEN. Just be prepared as well as possible. ZFS, Btrfs, RAID etc. are all good. And remember to back up your data… backup, backup, backup!

As a dumb user I’m confused. On the one hand ZFS people are telling me ECC is a hard requirement; on the other hand I’m reading blogs like this one and others (http://blog.brianmoses.net/2013/10/diy-nas-econonas-2013.html) saying ECC isn’t really such a big deal to worry about… Being in the unfortunate position of not being very clever… who should I trust and why? I understand there is a risk of faulty memory causing ZFS’s error correction to go awry, but how big is that risk, by what factor does ECC mitigate it, and at what cost?

In my own speculative build I’m looking at hiking the price from around $500 to $700 to go from non-ECC to ECC config… small price to pay if the risk is large right? Huge price to pay if the risk is negligible… (this is /mostly/ going to be ‘junk’ data, movies, music, photos, etc.)

There are many ways to approach this. ECC does plug a serious hole, or at least adds a layer of reliability to the integrity of the data. So one approach is a pure cost-benefit analysis: if the premium for ECC is as low as 40%, go for it. You can’t lose.

However, in most cases the premium is more on the order of a few hundred percent, and that can be prohibitive, considering other things. So the other approach is to consider that the data is only as reliable as the weakest link in the chain. This means that if, say, you had some data on an old CD that you moved to your new NAS with ECC, block checksums etc., but the original data on the CD was corrupted, then you’ll be preserving the corrupted data with extreme integrity (not very useful, nor a good use of one’s resources, if you ask me).

Another point is that most home users’ data is really static: it’s written once and read many times over. We might edit some metadata, move some files around, occasionally clean some up and replace them with others, but mostly the data is never modified after the initial ingestion. ECC’s value is when modifying data, because the system memory is part of the chain I mentioned. That is, if your data has high integrity at the point of reaching the NAS, and ZFS has high integrity, then the hardware of the NAS is the weak point unless it too has some way of validating the data in its main memory and processor. But this only matters when data is being written or modified. When reading, there is no risk of corrupting the original, and when exporting we can compare with the source (this isn’t true when modifying the data on the NAS, where we may end up with corrupted data because of bad memory). This means that ECC is important for video editing and similar storage uses, where the data is constantly going back and forth between the workstations and the NAS and being processed. A corruption at any point in the workflow (including hubs, switches and RAM) will end up in the data on the NAS and render it useless, and because of the way the data is generated, there is no easy way to compare with a source. In that scenario all machines need ECC and a high-reliability network etc., not just the NAS.

As a practical point, most of us take photos and videos and want to store them securely on our shiny NAS. But most cameras don’t have anything even remotely similar to the features of a NAS (most flash memory uses FAT). The flash memory is extremely unreliable in its own right (averaging a few thousand writes per cell before the cell dies out). So that is the weakest point in the chain between taking the photos and storing them, no matter how good your NAS is. The only solution here is to md5 or otherwise compare the photos on the flash with the copy in ZFS, and hope the photos were not corrupted between the time you released the shutter and the time you ingested them to the NAS. Once in, ZFS will take care of them. This is not unlike downloading files from the internet: there is no guarantee of integrity in the hardware all the way between us and the source, but either there are checksums in the protocol (e.g. BitTorrent) or the website provides md5 checksums that we can use to validate the copy we got. We’re doing essentially the same thing here by distrusting our system memory and validating the ingested data against the source.

Considering these, my recommendation is this: unless you are building a business-grade or high-reliability NAS with a fat budget, forgo ECC and invest the money in better drives and backup (the latter is so much better bang for the buck than ECC). When ingesting your data, always md5 (or another cryptohash) your data on both the source and the NAS to validate it. Once validated, ZFS will ensure integrity from there on, unless you modify the data.
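That ingest-validation step is easy to script. A minimal sketch in Python (the function names are mine, and I use sha256 rather than md5, but the idea is the same):

```python
import hashlib

def file_digest(path, algo="sha256", chunk=1 << 20):
    """Hash a file in chunks so large media files don't fill RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_ingest(source_path, nas_path):
    """True if the NAS copy is bit-identical to the source copy."""
    return file_digest(source_path) == file_digest(nas_path)
```

Run it against the camera card and the NAS copy right after ingesting; once the digests match, the ZFS block checksums take over from there.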

So, the price difference for me was about $220 to upgrade to an ECC-compatible motherboard and CPU. As a bonus, the CPU was a *lot* faster than what I started with, and the motherboard includes a nice IPMI interface. All around, the whole configuration is a much better fit for a NAS.

Well, if you have to change your mobo and processor to accommodate the new ECC RAM, that adds much more to the bill. Desktop mobos are usually not built to accept ECC RAM. Some are, such as AMD boards (though maybe not all of them).

Then you have another dilemma: registered ECC or unregistered ECC? Both types are available, but in my case (AMD CPU with an Asus M5A99X board), the mobo can only accept unregistered ECC.

I read that you must have registered ECC for the best bit-flip protection?

Then it is also recommended to use the most expensive SSD type (SLC), mirrored, for the ZIL, and on top of that to give it the biggest L2ARC SSD you can. That is another $800 for a Samsung 850 PRO (10-year warranty though, LOL).

TCP/IP has checksums on each sent packet. So does BitTorrent (at the file or block level, I’m not sure which). So if you download a file using the most common methods (HTTP, FTP, BitTorrent), it will be checksummed, and randomly flipped bits between the server’s RAM and your RAM will be detected and corrected without you noticing. CDs have checksums too, so you won’t read garbage just because a CD has some scratches on it. A scratched CD/DVD will typically “hang” in your DVD player first, and produce read errors later.

So what Ashod Nakashian says about not having ways to ensure the “correctness” of files during the transfer from “somewhere” to you is wrong in many cases.

However, as he mentioned, finding hardware that supports ECC is a lot more difficult, and it will cost a bit more. How much depends…

With Intel, you need Xeons (or some Celerons) which promote ECC support (Intel ARK will tell you this). On the AMD side, most (all?) FX processors (not the APUs) apparently support ECC. The APUs (Kaveri, Kabini) don’t (probably). The G-Series embedded APUs do.

The next thing you need is an ECC-capable mainboard. For AMD, some mainboards for FX processors do support it (e.g. Asus M5A78L-M/USB3). Some vendors say they support “DDR3 […] Non-ECC”, so you at least know what’s what. Mostly, ECC is simply not mentioned, and you don’t really know whether it’s supported or not (probably not). You obviously need ECC RAM too, but that’s easy, and price-wise it does not make much difference.

In theory, the AMD APUs (even Kabini) are perfect, because they are fast enough, have decent graphics for media playback, maybe GPU transcoding support in the future, and they are cheap. If you want ECC, however, you will probably need a G-Series embedded processor and an “embedded” or “industrial” mainboard, yet none of the mainboards of that kind that I have found support ECC. They do boast prices in the $200 range, compared to $50–80 for most AM1 mainboards. Which is ridiculous (and pisses me off), because the main difference between the normal and embedded APUs is ECC. Yes, the components and validation are better, but why not throw ECC support in too? They are also geared towards different applications, so the interconnects they offer are not well suited to storage.

> “[…] not having ways to ensure “correctness” of files during the transfer from “somewhere” to you is wrong in many cases.”

Notice that there is a major difference between TCP/IP-level checksums and those of a file (or a collection of files, as in BitTorrent). I believe I have spelled this distinction out (if not in this post, then in one of the follow-ups): if one has a way to checksum a file after a copy/download, then one is in a very good position. But simply downloading a file over HTTP is no guarantee that it won’t be corrupt. TCP/IP reduces corruption rates, but doesn’t eliminate them. One shouldn’t assume that packet-level checksums are a substitute for file-level checksums, which are what we should care about.

And yes, if one can afford ECC, by all means get it. But it’s equally important to understand where it helps. Certainly not when we download an already-corrupted file (from the web, a camera, a phone…) that we can’t checksum.

TCP/IP ensures that the network packet that got sent is the packet that is received (within the limits of the checksum). Yeah, sure, it’s not 100%, but then nothing is. If there is corruption, it most likely did not occur during the network transfer, because that is checksummed. If the PCI(e) bus to the server’s NIC is faulty and the NIC gets garbage to send, then that doesn’t help, obviously.

ZFS assumes ECC Memory, and based on that it can guarantee that the data you’re reading now is the data that you have sent from memory to disk. If the PCI bus to storage, or the cable or controller or disk is faulty, you will know. Sure, if you write garbage (because that’s what you got in your memory), you get garbage. But you get exactly the same garbage you’ve written. That is better than any other solution. It is exactly the kind of checksum you’d need to ensure that your http download did not get corrupted from remote memory to your own. Gotta start somewhere.

I came across your post on a dedicated ZFS NAS via a search engine. Your blog post was at position no. 3.

I like the format, your writing style and, for the non-professional users that we all were at some point in our lives, the digestible content. Most of the time blog writers do themselves and their readers good not to overuse tech terms just because they sound cool, or to dive too deep into the matter of, say, checksumming of logical blocks in a filesystem (not to be confused with physical blocks on disk).

But here’s the problem: giving your readers the impression that a proper storage solution is super frickin’ easy, like an iPhone, does them harm in the long term. There’s a reason why the computer scientists and mathematicians at Sun Microsystems (now Oracle), with doctoral degrees and whatnot, needed ten years and $100 million to come up with such a masterpiece of technology as ZFS.

Before I dig into ‘the matter’ as much as I feel is necessary, and for the TL;DR guys, I’ll tell you this: if a senior developer of ZFS, who worked on it right from the beginning, absolutely and under all circumstances tells you to use ECC RAM, you should trust this person until proven wrong!

And here’s why these scientists are right and why Ashod is wrong regarding ECC RAM. Suppose the following scenario:

You send a good and working copy of a 512-byte text file, say a logfile, to your ZFS NAS. Let’s, for now, suppose the cabling and switches aren’t a problem (more on that argument later). Your logfile now resides in the server’s RAM for a period of usually five seconds, as part of the currently open transaction group. (BTW, there are three active transaction groups at all times, all based in RAM.) Now suppose a bitflip takes place before the checksum for that 512-byte file, equivalent to a 512-byte logical ZFS block, is calculated. Say that within the logfile an ‘A’ character flips to a ‘B’. ZFS then calculates the checksum, commits the transaction group to stable storage (disk), and that transaction is closed.
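The crux of that scenario fits in a few lines of Python (sha256 stands in here for ZFS’s block checksums, and the file contents are invented):

```python
import hashlib

block = bytearray(b"2015-01-01 12:00:00 service started: state A")

# Case 1: checksum computed first, bit flips afterwards (e.g. on disk).
# The stored checksum no longer matches the data, so a later read or
# scrub detects the corruption.
good_sum = hashlib.sha256(block).hexdigest()
on_disk = bytearray(block)
on_disk[-1] ^= 0x03                 # 'A' (0x41) becomes 'B' (0x42)
assert hashlib.sha256(on_disk).hexdigest() != good_sum

# Case 2: bit flips in RAM *before* the checksum is computed.
# ZFS dutifully checksums the already-corrupted buffer, so checksum and
# data agree, and the corruption is committed silently and permanently.
in_ram = bytearray(block)
in_ram[-1] ^= 0x03                  # same flip, but pre-checksum
bad_sum = hashlib.sha256(in_ram).hexdigest()
assert hashlib.sha256(in_ram).hexdigest() == bad_sum
```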

Within a logfile, no big deal, right? Even within non-professional photos, no big deal if some pixel doesn’t have the correct RGB value, right? WRONG! Because the cosmic ray hitting your system, the alpha particle that decays radioactively and sends out electrons, the power supply unit that produces a spike on the 12V line and accelerates the platter motor or arm, the little spike in temperature that hits your RAM and causes a bit to flip (or, even worse, to exchange state with its neighbour): none of these decide where to strike and when.

Your RAM, and circumstances in general, don’t deterministically ‘decide’ which bit to flip and which to spare the pain. If it were a deterministic process, like those computers normally perform, then we could identify the flip and roll it back. But since so many natural causes for this kind of RAM error exist, and because natural events strike without notice, it’s important simply to guard against the following scenarios (ordered from mild disturbance to absolute catastrophe):

– As mentioned above, character ‘A’ flips to character ‘B’ within some logfile that’s likely never read again.

– Within a private photo some RGB value gets corrupted, but the image still opens.

– Within a private, compressed photo the compression algo gets corrupted so the image won’t open.

– The bitflip affects part of an operating system ISO you store on your NAS, so the error is detectable and recoverable (checksum comparison against the hash provided on the website).

– The bitflip takes place while reading the ISO from the NAS into your DVD buffer while burning (so you have a decent copy on stable storage within your NAS but an invalid one on the DVD).

– The bitflip takes place while reading or writing a virtual disk image stored on your ZFS NAS, which could affect a single unimportant logfile within your VM, or a critical system script, the kernel, or the virtual filesystem itself.

– The bitflip takes place while dumping your important files and folders to the ZFS NAS.

– The bitflip affects an ongoing backup of your PCs’ harddrive (TrueImage and the likes) which you store onto your NAS.

Think this was bad? Now come the REAL SHOWSTOPPERS!

– The bitflip takes place while ZFS checksums logical blocks, which can be as small as 512 Byte or as big as 128 KB.

– The bitflip takes place while ZFS works on or checksums Metablocks.

– The bitflip takes place while ZFS works on or checksums Uberblocks.

When reaching the level of Uberblocks, we’re likely in a position where ZFS panics the box and leaves you with an unmountable filesystem at reboot or, at best, is able to roll back to a previous filesystem state so that you lose just the recent data. And you might think it’s a bug for ZFS to panic your NAS; no, it’s intentional, put there by the above-mentioned scientists! The reason is that silent data corruption evidently took place on your NAS, and in order to protect the rest of the pool (YES, YOUR WHOLE POOL, NOT JUST THE DATASET), the system rather panics than risks a downward spiral.

Oh, and sadly, on forums and blogs I all too often hear statements like ‘yeah, well, now and then such things happen’ or ‘once in a lifetime’ or ‘never happened to me’ or ‘the probability of such events is near zero’.

You really want to hear some numbers? The probability of a single uncorrectable error per non-ECC DIMM per year is as high as 8%. The probability of a single uncorrectable error per ECC DIMM per year is 0.22%. In either case (8% on one hand, 0.22% on the other), there’s roughly a 5% chance that the error is serious enough to panic your system or make your pool unusable. In short, four DIMMs in the system = roughly a 32% chance of a bitflip and a 1.6% chance of serious damage, per year!
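Taking those quoted rates at face value (they are the commenter’s figures, not independently verified here), and treating DIMM errors as independent, the per-year arithmetic works out roughly like this:

```python
# Claimed per-DIMM, per-year rates of an uncorrectable error.
p_non_ecc = 0.08     # non-ECC DIMM
p_ecc     = 0.0022   # ECC DIMM
dimms     = 4

# Chance that at least one DIMM sees an error within a year.
p_any_non_ecc = 1 - (1 - p_non_ecc) ** dimms   # ~28%; the 32% in the
p_any_ecc     = 1 - (1 - p_ecc) ** dimms       # text is the simple 4 x 8%

# ~5% of such errors are claimed serious enough to endanger the pool.
p_serious = 0.05

print(f"non-ECC: {p_any_non_ecc:.1%} any error, "
      f"{p_any_non_ecc * p_serious:.2%} serious, per year")
print(f"ECC:     {p_any_ecc:.2%} any error per year")
```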

I don’t even need 3rd-grade math or anything near a computer science degree to understand and accept the fact that with ZFS, ECC RAM is not an option, it’s a requirement. The people who invested about ten years and $100 million in creating, testing and developing ZFS implemented an architecture that acts more like a database than a classical filesystem. And no one in their right mind would use non-ECC RAM when building a database server! So where did this non-ECC vs. ECC discussion even pop up to begin with? You don’t hear it on, e.g., the PostgreSQL or MySQL forums!

Don’t trust me or any guy on the internet; just ask Facebook. They had cases where single DIMMs would produce as many as 4,000 bitflips per year.

So, please do yourself a favor and don’t try to play down a probability that, given the NAS runs long enough, will hit you! Where, when and how seriously it hits is down to luck. Don’t play roulette with your data, even if you say ‘ah well, just movies, no biggie’, since the time it took to assemble a movie collection or take pictures with your camera also has value.

Regarding the cabling: of course you should be consistent across your infrastructure and use shielded, foiled twisted-pair cabling in environments with strong electromagnetic fields from PSUs and wall plugs. Following that consistent logic, it doesn’t make any sense to me to talk about three-way mirror vdevs, RAIDZ2 pools and cabling, and to avoid $30 network switches, only to then start arguing about ECC RAM.

I also have to disagree that an ECC solution costs a few hundred percent more than a non-ECC box. Most AMD CPUs natively support ECC. Many Asus consumer boards, even µATX ones, support ECC, since Asus wires the lanes connecting the DRAM banks to the CPU (and therefore the memory controller). And who said you have to buy brand-new ECC RAM? Go to eBay and look for ‘Kingston ECC’; some once-used kits are sold for a bargain because people bought ECC RAM thinking it was a cool new feature when in reality they had a consumer motherboard.

Thanks for anyone who read my comment and thanks again for informing others instead of doing them a disservice by playing down the risks.

I agree with Arthur; it is highly irresponsible to advise against ECC, so that the new and uneducated keep getting conflicting information. So, for the new FreeNAS users: at the moment of writing, a minimum of 8GB of ECC RAM is required for safe operation (not 4GB, not 6GB, but 8GB MINIMUM, with ECC, together with an ECC-compatible motherboard and CPU).

Read the guides and stickies on the FreeNAS community on forums.freenas.org and the Cyberjock posts.

I’m not sure why there is an assumption that I’m in any way related to the FreeNAS community, or that I have to read their guides and stickies before I talk about a server I built, or that I’m somehow representing any community at all (so my words would be irresponsible if I abuse some authority I don’t even have).

Whatever the reasons, I only wrote about a server I designed and built to meet certain needs for a certain budget. I discuss some of the technical decisions and trade-offs I made and why I made them.

Sorry to disappoint, but I’m no authority on FreeNAS, nor am I affiliated with them. I do, however, stand by my rationale when I justify a decision, and if it’s wrong I admit to it.

Please do feel free to post your own findings and recommendations elsewhere as you see fit.


Disclaimer

The contents of this site are the personal opinions, views, ideas and products of Ashod Nakashian, and are not intended to malign any religion, ethnic group, country, race, government, law enforcement officer, club, organization, company, insect or individual or anyone or thing, especially those with the ability and desire to fight back.

Text, code and photos copyright Ashod Nakashian, unless stated otherwise. Do not use without permission. Do not hot-link.