A friendly guide to building ZFS-based SAN/NAS solutions

ZFS includes two exciting features that dramatically improve the performance of read operations: the ARC and the L2ARC. ARC stands for Adaptive Replacement Cache. The ARC is a very fast cache located in the server’s memory (RAM). The amount of memory available to the ARC is usually all of the system RAM except for about 1GB.

For example, our ZFS server with 12GB of RAM has 11GB dedicated to ARC, which means our ZFS server will be able to cache 11GB of the most accessed data. Any read requests for data in the cache can be served directly from the ARC memory cache instead of hitting the much slower hard drives. This creates a noticeable performance boost for data that is accessed frequently.
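The sizing rule of thumb above can be sketched in a few lines. This is only the heuristic described in this post, not how ZFS actually sizes the ARC; real systems tune it through platform-specific settings such as `zfs_arc_max`.

```python
def arc_size_gb(ram_gb):
    """Rule-of-thumb ARC size from the text: all RAM except roughly 1 GB.

    This is the heuristic described above, not ZFS's actual default
    calculation, which varies by platform and tunables."""
    return max(ram_gb - 1, 0)

print(arc_size_gb(12))  # the 12 GB server described above -> 11
```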

As a general rule, you want to install as much RAM into the server as you can to make the ARC as big as possible. At some point, adding more memory is just cost prohibitive. That is where the L2ARC becomes important. The L2ARC is the second level adaptive replacement cache. The L2ARC is often called “cache drives” in the ZFS systems.

These cache drives are typically MLC-style SSDs. SSDs are slower than system memory, but still much faster than hard drives. More importantly, SSDs are much cheaper than system memory. Most people compare the price of SSDs with the price of hard drives, which makes SSDs seem expensive. Compared to system memory, MLC SSDs are actually very inexpensive.

When cache drives are present in the ZFS pool, the cache drives will cache frequently accessed data that did not fit in ARC. When read requests come into the system, ZFS will attempt to serve those requests from the ARC. If the data is not in the ARC, ZFS will attempt to serve the requests from the L2ARC. Hard drives are only accessed when data does not exist in either the ARC or L2ARC. This means the hard drives receive far fewer requests, which is awesome given the fact that the hard drives are easily the slowest devices in the overall storage solution.
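The read path described above can be sketched as a two-level lookup. This is an illustrative model only (plain dictionaries standing in for the ARC, the L2ARC, and the disks), not ZFS's actual implementation:

```python
def read_block(key, arc, l2arc, disks):
    """Serve a read from the fastest tier that holds the block.

    arc and l2arc are dicts acting as caches; disks is the backing store.
    Returns (data, tier) so callers can see where the hit landed."""
    if key in arc:
        return arc[key], "ARC"          # RAM hit: fastest path
    if key in l2arc:
        data = l2arc[key]
        arc[key] = data                 # promote into ARC for next time
        return data, "L2ARC"
    data = disks[key]                   # slowest path: the hard drives
    arc[key] = data
    return data, "disk"

arc, l2arc = {}, {"b": "beta"}
disks = {"a": "alpha", "b": "beta"}
print(read_block("a", arc, l2arc, disks))  # ('alpha', 'disk')
print(read_block("a", arc, l2arc, disks))  # ('alpha', 'ARC')
print(read_block("b", arc, l2arc, disks))  # ('beta', 'L2ARC')
```

Note that the hard drives are touched only on the first miss; every later read of the same block is absorbed by a cache tier.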

In our ZFS project, we added a pair of 160GB Intel X25-M MLC SSD drives for a total of 320GB of L2ARC. Between our ARC of 11GB and our L2ARC of 320GB, our ZFS solution can cache a total of 331GB of the most frequently accessed data! This hybrid solution offers considerably better performance for read requests because it reduces the number of accesses to the large, slow hard drives.

Things to Keep in Mind
There are a few things to remember. The cache drives do not get mirrored: when you add cache drives, you cannot set them up as a mirror, but there is no need to, since the cached content already exists on the mirrored hard drives. The cache drives are just a cheap alternative to RAM for caching frequently accessed content.

Another thing to remember is that you still need to use SLC SSDs for the ZIL drives, even when you use MLC SSDs for cache drives. The SLC SSDs used as ZIL drives dramatically improve the performance of write operations, while the MLC SSDs used as cache drives are used to improve read performance.

If you decide to use MLC SSD drives for actual storage instead of using SATA or SAS hard drives, then you don’t need to use cache drives. Since all of the storage drives would already be ultra fast SSD drives, there would be no performance gained from also running cache drives. You would still need to run SLC SSD drives for ZIL drives, though, as that would reduce wear on the MLC SSD drives that were being used for data storage.

If you plan to attach a lot of SSD drives, remember to use multiple SAS controllers. The SAS controller in the motherboard for our ZFS Build project is able to sustain 140,000 IOPS. If you use enough SSD drives, you could actually saturate the motherboard’s SAS controller. As a general rule of thumb, you may want to have one additional SAS controller for every 24 MLC style SSD drives.
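One reading of that rule of thumb can be put into a tiny sizing helper. The 24-drives-per-controller figure is taken straight from the text and is an assumption, not a hardware specification:

```python
import math

def sas_controllers(ssd_count, per_controller=24):
    """Rough controller count for a given number of MLC SSDs, using the
    one-controller-per-24-SSDs rule of thumb from the text. per_controller
    is an assumption from the article, not a spec."""
    return max(1, math.ceil(ssd_count / per_controller))

print(sas_controllers(24))  # -> 1
print(sas_controllers(25))  # -> 2
```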

Effective Caching in Virtualized Environments
At this point, you are probably wondering how effectively the two levels of caching will be able to cache the most frequently used data, especially when we are talking about 9TB of formatted RAID10 capacity. Will 11GB of ARC and 320GB of L2ARC make a significant difference for overall performance? It will depend on what type of data is located on the storage array and how it is being accessed. If it contained 9TB of files that were all accessed in a completely random way, the caching would likely not be effective. However, we plan to use the storage for virtual machine (VPS) file systems, and that workload will cache very effectively for our intended purpose.

When you plan to deploy hundreds of virtual machines, the first step is to build a base template that all of the virtual machines will start from. If you were planning to host a lot of Linux/cPanel virtual machines, you would build the base template by installing CentOS and cPanel. When you get to the step where you would normally configure cPanel through the browser, you would shut off the virtual machine. At that point, you would have the base template ready. Each additional virtual machine would simply be chained off the base template. The virtualization technology will keep the changes specific to each virtual machine in its own child or differencing file.
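The base-template-plus-differencing-file arrangement can be sketched as a copy-on-write overlay. This is a toy model (class and block names are invented for illustration), but it shows why every VM's unmodified blocks resolve to the same shared base data, and therefore the same cache entries:

```python
class DifferencingDisk:
    """Toy model of a child/differencing file over a shared base template.

    Writes land in the per-VM overlay; reads fall through to the base
    image, so unmodified blocks are shared by every VM built on it."""
    def __init__(self, base):
        self.base = base      # shared, read-only base template
        self.overlay = {}     # this VM's private changes

    def write(self, block, data):
        self.overlay[block] = data

    def read(self, block):
        return self.overlay.get(block, self.base.get(block))

base_template = {0: "centos-kernel", 1: "cpanel-binaries"}
vm1 = DifferencingDisk(base_template)
vm2 = DifferencingDisk(base_template)
vm1.write(1, "vm1-custom-config")
print(vm1.read(1))  # 'vm1-custom-config' (private change)
print(vm2.read(1))  # 'cpanel-binaries' (shared base, cached once for all VMs)
```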

When the virtualization solution is configured this way, the base template will be cached quite effectively in the ARC (main system memory). This means the main operating system files and cPanel files should deliver near RAM-disk performance levels. The L2ARC will be able to effectively cache the most frequently used content that is not shared by all of the virtual machines, such as the files and folders of the most popular websites or MySQL databases. The least frequently accessed content will be pulled from the hard drives, but even that should show solid performance, since the pool is RAID10 across 20 drives and the frequently accessed reads never burden the RAID10 volume at all: they are served from the ARC or L2ARC.

19 Comments on “Explanation of ARC and L2ARC”

Thanks for creating this blog; we are soon to embark on the ZFS SAN adventure, and it’s really nice to have a record of someone else’s experience to draw from.

It’s important to note that when building VM templates, you should remove the SSH host keys just before halting the original image. Having multiple hosts sharing the same key is a significant security problem.

I also want to thank you for making this blog. I know it’s April 26th already, and you probably have this system in production.

I am curious regarding the following, hopefully you can give me some insight.

1) What was the total cost of your implementation?

2) What tools are you using for automatic notification of hard drive failures?

3) What happens if your non-ECC RAM fails? Is your array down until you replace it, or do you have some fault tolerance?

4) What kind of virtualization will you be running with this as a storage repository? Xen, Linux-KVM, or OpenVZ? I myself am looking into getting ZFS to be the storage repository for a Linux-KVM virtualization environment.

We don’t have this one in production quite yet. We are still running additional benchmarks and have quite a few pages we will post to this site before we move the ZFS Build project over to production storage duties.

We will be posting project cost summaries. A significant part of our project cost was related to InfiniBand, so we will be breaking down those totals separately.

We will also be posting about drive failure notifications and how to get the LEDs to light up.

We are using ECC RAM in this project.

We currently use Hyper-V and Xen for virtualization of Windows and Linux VMs. We are actually planning to do performance comparisons of Hyper-V, Xen, and VMware ESX with our blade center and this ZFS-based storage solution. I guess we could run some additional benchmarks with OpenVZ, too.

Something to bear in mind here is that ZFS has to map the cache devices in its RAM. For every 100GB of L2ARC you have on your system, ZFS will use ~2GB of main memory to map the cache. Thus you lose quite a chunk of RAM that could have gone to the ARC if you have too much cache, and there’s a trade-off to be struck depending on the size of your working set of data.
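The rule of thumb above works out like this for the build in the article. The ~2GB-per-100GB ratio is the commenter's ballpark figure (the real overhead depends on record size), so treat it as an estimate, not a formula:

```python
def l2arc_header_ram_gb(l2arc_gb, gb_ram_per_100gb=2.0):
    """Approximate ARC memory consumed by L2ARC bookkeeping, using the
    commenter's ~2 GB of RAM per 100 GB of L2ARC rule of thumb. The true
    overhead varies with record size; this is a ballpark only."""
    return l2arc_gb * gb_ram_per_100gb / 100

print(l2arc_header_ram_gb(320))  # the 320 GB L2ARC above -> 6.4
```

So the 320GB L2ARC in this build would eat roughly 6.4GB of the 11GB ARC, which is exactly the trade-off the commenter is pointing at.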

This is something that we’re aware of but didn’t really touch on in the article. Our next build will most likely include a significantly larger system board configuration to allow for dual processors and a lot more RAM to accommodate the L2ARC a little bit better.

I just finished a build. I have a Dell R510 with 12 2TB SAS drives plus 2 Samsung 830s used as cache. The server has 64GB of RAM. I am seeing virtually no L2ARC hits. I know my workload is light, so everything may be fitting in the ARC given how much memory I have. My question, though, is that I am not seeing anywhere near the write performance I would expect (the ZIL is disabled). I have a 10GbE connection directly between the server and the ZFS device. I am seeing only about 400mb/s writes. I would expect this to be much higher, since I thought all writes would go to RAM and utilize the 10Gb link.

How are you testing the capabilities of the system? Depending upon the application writing to the disk, you may not see full utilization of the 10GbE pipe.

Also, when you say 400mb/s: is that megabits, or megabytes? If it’s megabits, that does seem quite low. If it’s megabytes, then it’s really not terrible. Obviously not the full potential of the link, but still nothing to sneeze at.

I am wanting to build a somewhat large RAID-Z2 array using 8-10 4 TB disks with an SSD cache. The system will likely have 32-64 GB RAM and a high-end Core i7. This system is primarily going to be used for storing and streaming multiple high-definition video files (often transcoded “on the fly” with ffmpeg to match the client’s resolution/bitrate settings). Files will typically range between 1-8 GB and there may be up to 4-5 simultaneous streams, along with multiple incoming uploads. I may also run a small VM or two on it. I am trying to determine the best way to optimize the disk performance of such a system without spending a ridiculous amount of money.

I am curious as to whether I would potentially get a RAID-0 like performance boost from having two SSDs as cache or if I should just get one larger drive. I haven’t come across anything regarding how ZFS handles multiple L2ARC SSDs yet (still looking though).

Also, in regards to using SLC for ZIL, SLC drives are quite rare, and write performance has increased significantly on modern MLC drives (e.g. http://www.newegg.com/Product/Product.aspx?Item=N82E16820227792). Would a modern MLC drive be sufficient? Would there be any noticeable benefit to using separate drives for L2ARC and ZIL?

For large streaming workloads, L2ARC and lots of memory aren’t going to help a ton, unless you add a _lot_ of L2ARC, and the data in the L2ARC gets accessed often. Unless the data is accessed frequently, it’ll get pushed out of the ARC and L2ARC caches by new data getting read/written. If you’re talking about 32TB of data, and accessing 8GB at a time (and that 8GB gets accessed once a day) you’d basically have to cache the entire drive contents to see an appreciable gain in speed.

For the L2ARC, using two drives will balance requests across both drives. This gives the effect of RAID0, but if one drive fails, you only lose the caching of the data that was on that drive, not on both drives.

SLC vs MLC – if you’re relying on the ZIL to accelerate write requests, it’s not the speed difference between SLC and MLC that is of concern, it’s the longevity of the media. The chances of writing an SLC to death are low. The chances of writing an MLC to death are significantly higher. I would rather use an SLC drive simply for the longevity of the device.

Using the same drive for ZIL and L2ARC – not recommended. You _can_ slice a drive up and use different slices for ZIL and for L2ARC, but replacement is a pain and always from the command line. I prefer to replace entire devices rather than having a device die, replace it, re-slice it, and then re-map those slices to the failed slices.

I don’t really think I need the acceleration for the writes. Data is not likely going to be written to the array nearly as fast as the physical disks are capable of. The only exception might be when writing backups from other systems to it across the local network, but those won’t be that often and speed isn’t really a concern for that.

The most recently added content would be the most likely to be read at any given time, which is why I’m thinking a large-ish SSD (e.g. 256 GB) as an L2ARC might help. Do you happen to know if ZFS caches newly written content, or just recently accessed content? I haven’t stumbled across that info in the reading I’ve been doing recently.

I like that it balances the reads between L2ARC devices. 2 smaller SSDs could potentially offer significantly improved read performance.

It may get a little tricky when the video transcoding comes into play. I don’t know whether the applications will be writing to disk at all while doing that or just using RAM. That’s something I still need to test in a VM, especially as I will be moving from Windows applications to Linux/BSD ones, and they may behave differently.

I would still argue that you want the dedicated ZIL device. If you do not have it, the ZIL (ZFS Intent Log) is spread across all of the disks in the pool. This will effectively turn every IO on your disks into a random IO. From experience, not having a dedicated ZIL device is a _bad_ idea from a performance standpoint.

As for caching newly written content – it depends upon what’s in the cache. If everything in the cache has been accessed 20-30 times, it’s not going to be evicted for newly written data. If it’s been accessed once, there is a possibility that it’ll be evicted for the “most recently used” data that just got written. I’m not exactly sure how the MRU (Most Recently Used) and MFU (Most Frequently Used) algorithms work, so I can’t say that with 100% certainty.
