ESXi iSCSI/RAID/Disk Performance: Improving Through RAM Cache

When designing any kind of system, whether it be an ESXi lab or otherwise, disk performance can end up being a bottleneck. No matter how fast your processor is or how much RAM you have, if you have a ton of iowait and your system has to wait in line to read or write data on your drive or volume, then your system will be slow. Modern systems in need of extreme IOPS are turning to SSD arrays and SSD cards for performance, but even with the drastic drop in price since their introduction, their cost per GB is still much higher than that of mechanical hard drives.

A good option for increasing performance on the cheap is a RAM cache. I’ve advocated PrimoCache (formerly FancyCache) prior to this on my blog, and had plenty of comments concerning people using it in their lab with incredible results. However, I’ve never covered its performance directly in an article, and that’s what I’d like to do here.

Note that PrimoCache is currently free and in beta, with a provided 90-day license that, so far, has always been extended by the nice folks over at Romex Software. There has been no mention of the price once it exits beta, but it will be a commercial piece of software in the end. Also, it’s worth mentioning that this is beta software, and although I run it in my ESXi lab, with a good portion of that lab devoted to production use, I wouldn’t recommend using it in a full production environment.

SAN Hardware for Testing

iSCSI SAN Hardware

As part of another project, I built a custom SAN for testing purposes, and that’s what I will be using PrimoCache on. It performs like a low- to mid-range server platform and uses a hardware RAID card with 15K SAS drives. The testing platform will be used for a number of different RAID levels, but it is currently configured in RAID0, the fastest RAID level, to compare speeds with and without PrimoCache.

PrimoCache Installation and Settings

PrimoCache RAM Cache

PrimoCache has a simple, standard installer: run it and follow the prompts.

At the end of the installation, you will be prompted to reboot your system, after which PrimoCache is available for use. You choose how much RAM to assign to each cache when you create it in the software, up to the limit of your free RAM.

After installation, clicking the PrimoCache shortcut brings you to the primary screen of the software:

Primo Cache Main Screen

Once here, you will need to create a cache. Note that the target of the cache does not need to be a physical drive. It can be a volume: either a presented RAID volume, a dynamic disk, a JBOD … anything that classifies as a volume with a drive letter inside Windows. The amount of memory that you use is limited only by the amount that you have free in your system. To prevent paging, I would suggest keeping total memory use at idle (OS + Programs + Cache) to no more than 80% of your RAM. In this instance, I’m going to select D: drive (my RAID volume) as my target drive:

Primo Cache: Add a Cache
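The 80% guideline can be turned into a quick budget calculation. This is only a minimal sketch: the function name and the example figures (32GB of RAM, 6GB used by the OS and programs at idle) are hypothetical and are not taken from PrimoCache or from the test system.

```python
def max_cache_mb(total_ram_mb: int, idle_used_mb: int, ceiling_pct: int = 80) -> int:
    """Largest cache size (MB) that keeps OS + programs + cache under the ceiling."""
    # Integer math avoids floating-point rounding surprises.
    allowed_mb = total_ram_mb * ceiling_pct // 100
    return max(allowed_mb - idle_used_mb, 0)


if __name__ == "__main__":
    total_mb = 32 * 1024   # hypothetical: 32GB installed
    idle_mb = 6 * 1024     # hypothetical: 6GB in use at idle
    print(f"Cache budget: {max_cache_mb(total_mb, idle_mb)} MB")
```

If the idle footprint already exceeds the ceiling, the function returns 0 rather than a negative size.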

Once selected, I click Next and am taken to the cache configuration screen. Here I’ll be able to select the amount of memory that I want to devote to my cache (in MB), and how I want that cache to be used: whether I’d like to improve read performance, write performance, or both. Write caching allows you to defer all writes, holding them in RAM and then writing them out later. Since deferred writes that are still in RAM are lost if the power fails, it would only be prudent to use write deferral on systems with a battery backup/UPS, of course. You can also change the block size for the granularity of the cache. Smaller block sizes bring better performance at the cost of higher overhead for maintaining the cache. PrimoCache also allows the use of an SSD as a Level 2 cache. In this instance, I will not be using this.

Primo Cache: Configure Cache Parameters
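To make the idea of deferred writes concrete, here is a generic write-back cache sketch. It is not PrimoCache's actual algorithm, which the article does not describe; the class name, the dictionary standing in for the disk, and the explicit `now` timestamps are all assumptions made for illustration.

```python
class DeferredWriteCache:
    """Toy write-back cache: writes sit in RAM until they are old enough to flush."""

    def __init__(self, backing: dict, latency_s: float = 10.0):
        self.backing = backing      # stands in for the disk volume
        self.latency_s = latency_s  # how long a write may be deferred
        self.dirty = {}             # block number -> (data, time first deferred)

    def write(self, block: int, data: bytes, now: float) -> None:
        # Overwriting a dirty block keeps its original deferral time.
        first_seen = self.dirty[block][1] if block in self.dirty else now
        self.dirty[block] = (data, first_seen)

    def read(self, block: int):
        # The newest data may only exist in RAM, so check there first.
        if block in self.dirty:
            return self.dirty[block][0]
        return self.backing.get(block)

    def flush_due(self, now: float) -> int:
        """Write out every block deferred for at least latency_s; return the count."""
        due = [b for b, (_, t) in self.dirty.items() if now - t >= self.latency_s]
        for b in due:
            self.backing[b] = self.dirty.pop(b)[0]
        return len(due)


if __name__ == "__main__":
    disk = {}
    cache = DeferredWriteCache(disk, latency_s=10.0)
    cache.write(7, b"v1", now=0.0)
    cache.write(7, b"v2", now=3.0)
    print(cache.read(7), disk)          # data is readable, but the disk is untouched
    print(cache.flush_due(now=10.0), disk)
```

Until `flush_due` runs, the data exists only in the `dirty` dictionary, which is the window in which a power failure would lose it.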

I’ll select custom (since I’ll be using both the read cache and write defer), assign 16GB of memory to the cache, leave the block size at 4KB since I have plenty of CPU power, check Enable Defer-Write, and leave it at the default 10-second latency.

Primo Cache: Cache Parameters Set
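One way to get a feel for the block size trade-off is to count how many blocks a cache has to keep track of. The sketch below only does that arithmetic for a 16GB cache; the function name is hypothetical, and how much memory or CPU PrimoCache actually spends per block is not something the article states.

```python
def cache_block_count(cache_gb: int, block_kb: int) -> int:
    """Number of cache blocks needed to cover cache_gb at the given block size."""
    cache_kb = cache_gb * 1024 * 1024
    return cache_kb // block_kb


if __name__ == "__main__":
    for block_kb in (4, 16, 64):
        blocks = cache_block_count(16, block_kb)
        print(f"{block_kb:>2}KB blocks -> {blocks:,} blocks to track")
```

At 4KB, a 16GB cache is divided into 4,194,304 blocks, sixteen times as many as at 64KB, which is where the extra bookkeeping comes from.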

Once done, we click Start, and PrimoCache will assign the RAM to the cache (your available memory in Task Manager will go down by the amount you assign) and show a success message.

Primo Cache: Successful Cache Creation

Clicking OK on the success message takes us back to the PrimoCache screen, and now we can see our existing cache, information about the volume it’s assigned to, detailed information about the cache itself, statistics on reads and writes, and a cache hit rate chart to track our cache performance. There are also a number of icons for stopping, pausing, flushing and other operations on the cache.

Primo Cache: Main Screen with Cache

PrimoCache Performance Testing

To give us a baseline, we’re going to stop the cache on D: drive now that we’ve created it, and run some basic I/O and performance tests without it. After each test, PrimoCache will be stopped and started to clear the cache, so each test starts with a fresh cache. The following tests will be performed with and without PrimoCache enabled. The only exception will be HDTune Pro, which bypasses any cache; it is included as a baseline measurement of the performance of the RAID array itself. The RAID card settings are also listed below for the virtual drive:

Microsoft Exchange Server Jetstress Tool: Disk throughput tests at 100% drive capacity run for 2 hours, both with and without cache.

A Note About Benchmarking and Cache

Cache Performance

Having a large cache in the two- or three-digit GB range presents some interesting issues when testing performance, as does benchmarking disk use in the first place. Hardware RAID cards have algorithms that adapt to the workload they are under, so you may see better performance in the long run, and of course, real-world use is rarely 75/25, 60/40 or some other fixed read/write split over long periods of time. The best we can do is look for general performance.

Caches bring a whole new problem: when benchmarking, most of the time, you’re going to have a 100% cache hit rate, which means you’re benchmarking pure RAM. For writes, this is representative of real-world use, since PrimoCache defers writes, so all writes are theoretically 100% cache hits. With a big enough cache (think server-level, if you had 192GB of RAM), you could easily absorb all writes, even from several ESXi nodes and lots of VMs.

Read caching becomes a bit harder to estimate. Caches hold the most used items, and thus read caches take a while to “build” for good cache hit performance. So the read results are what you could see if the data you needed was already cached. PrimoCache also sports the ability to use SSDs as L2 cache, moving items off to them. This would dramatically increase your read cache hit rate over time, as you could theoretically have an L2 cache in the terabyte range. Take the read results with that in mind.
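The warm-up effect on reads can be shown with a small simulation. This sketch assumes a least-recently-used (LRU) eviction policy purely for illustration; the article does not say which policy PrimoCache uses, and the class and function names are hypothetical.

```python
from collections import OrderedDict


class LRUReadCache:
    """Toy read cache that evicts the least recently used block when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block: int) -> None:
        if block in self.blocks:
            self.blocks.move_to_end(block)       # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                     # would go to disk here
            self.blocks[block] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict the oldest entry


def hit_rate(cache: LRUReadCache, accesses) -> float:
    """Hit rate for just this batch of accesses."""
    hits_before, misses_before = cache.hits, cache.misses
    for block in accesses:
        cache.read(block)
    hits = cache.hits - hits_before
    total = hits + (cache.misses - misses_before)
    return hits / total if total else 0.0


if __name__ == "__main__":
    cache = LRUReadCache(capacity=100)
    working_set = list(range(100))
    print("cold pass:", hit_rate(cache, working_set))
    print("warm pass:", hit_rate(cache, working_set))
```

The first pass over the data misses every time, while the second pass is served entirely from cache. A benchmark that rereads the same small test file behaves like the warm pass, which is why cached read numbers should be treated as a best case.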

One last note is that I have been using PrimoCache on the RAID array I have my iSCSI target on (using Starwind’s iSCSI) and have seen huge improvements in my VM responsiveness. Although I have not done any performance testing from a VM as of yet (it’s coming, it’s coming!), I can unequivocally endorse it.

HDTune Pro RAID Benchmark Results

The results below are run without the cache, since HDTune bypasses it. This is to get an idea of the raw performance of the RAID volume. In this series, I also run some IOPS and latency tests to show raw access times.

HDTune RAID Benchmark: Read

HDTune RAID Benchmark: Random Access

HDTune RAID Benchmark: Extras

Crystal DiskMark3 RAID Benchmark Results

Crystal DiskMark3 tests the read and write speed of a drive. Although this is limited to pure drive speed, and not IOPS, it will still give us a good baseline of how our RAID array is performing speed-wise. I run all of the performance tests at all of the test sizes, both with and without PrimoCache. Crystal DiskMark3 runs each test five times and then averages the results. All results are in MB/sec, including the vertical axis of the graph.

Crystal DiskMark RAID Performance Results

The most amazing results here are the increases in random reads and writes. These are the most punishing type of reads and writes for a drive, since the head is forced to move, at random, all over the platter surface, putting the seek times to the real test. Of course, coming out of RAM cache, there are no seek times, so we see both write and read performance skyrocket, with some results in the +30,000% range. This is where the cache results shine, and this would be incredible for high-write applications such as databases, if you had enough RAM cache. One thing to remember here is that these are deferred writes, so if you fill the cache up enough, the system could spend minutes, if not hours, writing out all the data. A UPS/battery backup is a requirement in this case.
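Two bits of arithmetic sit behind that paragraph: how a percentage gain is computed, and how long a full cache of deferred writes takes to drain. The sketch below works through both. The throughput figures are hypothetical round numbers chosen for illustration, not measurements from these benchmarks.

```python
def percent_gain(baseline_mb_s: float, cached_mb_s: float) -> float:
    """Percentage improvement of the cached result over the uncached baseline."""
    return (cached_mb_s - baseline_mb_s) / baseline_mb_s * 100.0


def flush_seconds(dirty_mb: float, disk_write_mb_s: float) -> float:
    """Time for the disk to absorb all deferred writes at a sustained rate."""
    return dirty_mb / disk_write_mb_s


if __name__ == "__main__":
    # A +30,000% gain means the cached run is 301 times the baseline.
    print(f"{percent_gain(1.0, 301.0):,.0f}%")

    full_cache_mb = 16 * 1024  # a 16GB cache filled entirely with dirty data
    for label, rate in (("random writes, 2 MB/s", 2.0), ("sequential, 200 MB/s", 200.0)):
        secs = flush_seconds(full_cache_mb, rate)
        print(f"{label}: {secs:,.0f} s ({secs / 3600:.2f} h)")
```

With the assumed 2 MB/s random write rate, draining 16GB takes over two hours, while at 200 MB/s it takes under a minute and a half. The time at risk therefore depends heavily on how random the deferred writes are.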

IOMeter RAID Benchmark Results

IOMeter is one of the go-to benchmarks in the world of IOPS performance, and for good reason: it’s highly configurable, accurate, and can push a drive to its limits when properly used. I spent the most time with these results, testing almost every aspect of performance on the RAID. Here I did a full series of sequential tests with mixed read/write loads, as well as a full series of random access tests with mixed read/write loads. Although the most common block sizes are 4K, 8K and 64K, I decided to spend the time to test at every block size, just to get a complete picture. All values below are in IOPS.

IOMeter RAID Performance Benchmark: IOPS

Although we don’t see the same out-of-the-ballpark performance gains with PrimoCache on, you still see a remarkable percentage gain in IOPS with it enabled. The largest gains are in the middle of the curve in the random writes, which I expected to see considering all deferred writes are automatically a 100% hit rate unless the cache is full. Once again, RAM caching delivers solid gains.
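IOPS figures and the MB/sec figures from the other benchmarks are related through the block size: throughput is IOPS multiplied by the size of each operation. A minimal conversion sketch follows, using a hypothetical 10,000 IOPS rather than any value from the chart.

```python
def iops_to_mb_per_sec(iops: float, block_bytes: int) -> float:
    """Throughput in MB/sec for a given IOPS rate and I/O block size."""
    return iops * block_bytes / (1024 * 1024)


if __name__ == "__main__":
    hypothetical_iops = 10_000
    for block_bytes in (512, 4096, 65536):
        mb_s = iops_to_mb_per_sec(hypothetical_iops, block_bytes)
        print(f"{block_bytes:>6} B blocks at {hypothetical_iops:,} IOPS = {mb_s:.2f} MB/sec")
```

The same IOPS number means very different throughput at 512B than at 64K, so results are only comparable at matching block sizes.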

ATTO RAID Benchmark Results

ATTO Benchmark is another large name in benchmarking results, and by default, it measures transfer speeds in several different test sizes. I had some consistently odd results with ATTO in the 4096KB test size that I’m unsure about. Even though PrimoCache would report 100% cache hits, CPU was low, and no other performance issues manifested themselves, I would get bad results only in this range about 50% of the time. I didn’t see similar results in any other benchmarking utility, and attribute it to some quirk with ATTO. In the interests of transparency, I’m leaving the results here.

ATTO RAID Benchmark Results: No Cache

ATTO RAID Benchmark Results: PrimoCache Enabled

Conclusions on RAID and RAM Cache Performance

Obviously, RAM caches, and PrimoCache in particular, deliver a huge performance gain when properly used. Although we’re only using a 16GB cache in these tests, I can see this scaling without issue into the hundreds of GB on server-class hardware. PrimoCache, unfortunately, is still in beta, and there’s no way I could recommend it in a production environment until it’s officially in a release stage. However, for a home or lab environment, I whole-heartedly recommend it. For the past year, I’ve used every version that’s come out, and have not had a single reliability issue.

For iSCSI targets for ESXi, I use Starwind iSCSI SAN Free Edition. It runs on Windows and when it creates iSCSI targets/devices, it does so by creating a virtual drive as a file, which it then presents to ESXi as a LUN. I simply make my virtual drive/file on the same volume I’m using PrimoCache on. For example, in my lab, my RAID volume is M:, and that’s where I run my iSCSI mounts and PrimoCache (among other things since my RAID volume is 32TB).

NISMO1968

StarWind has built-in RAM cache everywhere and flash with V8 being in beta. Why do you want to use an external one?

Easy answer: the free version of Starwind only allows a 512KB cache, while an external cache, such as PrimoCache, allows an unlimited size external cache. In my lab, I’m using 28GB of RAM cache with the free version of Starwind, something you can’t achieve unless you use the commercial version.

Brian Bos

Don’t both the Linux and Windows filesystem caches have you covered without resorting to 3rd party software?

It was my impression that any unused RAM will be used as a block level cache until an application needs it in modern operating systems. What are your thoughts/experiences?

Not at all, or software like this wouldn’t exist and there would be no need for RAM drives in Windows or Linux at all. Modern OSes will indeed cache memory for system use, but there’s no such block caching going on. If that were true, then the performance differences in the benchmarks would be much smaller, or nonexistent. As is, you’re seeing gains of over 100% at a minimum, and of thousands of percent in some cases.

Perhaps I was wrong about Windows. I can prove this out with dd on my Linux boxes, though: it definitely reads far faster than disk if the output file is smaller than my total RAM and the file has been written/read since the last bootup.

Not sure if that’s clear, but what I mean is writing a file with dd, then reading it to /dev/null results in a very fast read, as it cached the file while it was written.

On Linux, I know there are a few block-level disk caching kernel modules available, but you normally have to build these yourself. I’m not aware of a distro that they come standard with, although in the wide world of Linux, I’m sure there are some. There’s dm-cache, bcache, and flashcache that I know of. I would consider these, however, to be “third party” since they are not part of the mainline kernel. Also, in the 3.9 kernel, Linus released an SSD caching feature to cache hard drives to SSD drives. As I said, however, just because I haven’t heard of it or seen it (and I’ve been working with drive performance a long time in both major OSes), doesn’t mean it doesn’t exist. As for Windows, however, I’m certain there’s no block level caching going on.

Brian Bos

Fair enough. I don’t know enough about it to be sure, hence my curiosity.

For what it’s worth, with Linux, this block level caching happens on anything 2.6 and north, with almost any distro I’ve tried. My SAN runs openmediavault on Debian squeeze (but with the 3.2 backports kernel) and manages ~400MB/sec reads from my disks, but 1GB/s from files in cache. No thousand percent gain, but it’s a modest AMD C-60 based box with a 12TB RAID5.

Where I was coming from with the question, I was set to throw another 4GB at my SAN (it’s at 4 right now and I’ve got the matching stick from the set lying around now) and was curious if this would actually even help me after reading your article. From the tests I just re-ran I’m pretty sure I’ll see the boost.

Thanks for the compliments, and you’ve given me something to investigate, and being incessantly curious, I’ll track something down and make a post on it. I’ll make sure and reference back that you got me on the track for it 😀
Your SAN is a great build, and since you’re running ZFS, I’d definitely throw more RAM at it, since it loves it. Love your site, too: I see we’re both Crunchbang fans, and I love the article on booting the Pi over NFS. I’m booting 4 Pi’s around the house off iSCSI, and it’s definitely a performance booster.

Duke Abbaddon

Very good post, and most complete. I’d recommend it to anyone looking to know the facts about drive caching.

wla

Sorry, I cannot believe your non-cached IOMeter results. I think you measured cache effects of the RAID controller and/or OS.
Why? You have 4 good hard disks, each with approximately 300 IOPS. So, with a RAID0 array, I expect something like 1200 IOPS (512B random R/W). You have 48000???

Always thoroughly read an article before replying. The results state “without RAM cache”; they don’t say without the RAID cache. The entire article is a POC about using a RAM cache to improve RAID performance, even with the RAID cache itself, once it has been exhausted. The RAM cache lets you keep your results in memory (you’re just moving pointers, really), and then flushes them out to the drive as it can. I note it isn’t something I’d use in production (same thing Starwind stated when they broke the 1 million IOPS barrier), but it’s a cool POC and does have some applications.

Don Fountain
