Wanted - testers: SSD as cache for Linux

So, I've been working on this little project for the past couple of months... It's getting to the point where I'm having a pretty hard time breaking it.

Bcache uses SSDs to cache arbitrary block devices. It's designed around the performance characteristics of flash, in particular MLC flash - it doesn't do random writes; it allocates buckets sized to your erase block size, clears them all at once with TRIM, and then fills them up sequentially. The end result is that there's not much point in using expensive SLC drives, and it's fast.
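To make that concrete, here's a rough userspace-flavoured sketch of the allocation idea - this is not bcache's actual code, and the names and sizes are invented:

Code:

/* Hypothetical sketch of bucket-at-a-time allocation; not bcache's real code. */
#include <stdint.h>

#define BUCKET_SECTORS 2048u    /* e.g. a 1MB erase block, 512-byte sectors */

struct bucket {
	uint64_t start;    /* first sector of this bucket on the SSD */
	uint32_t fill;     /* sectors written so far; only ever appended to */
};

/* Stand-in for TRIM: in the kernel this would be a discard request. */
static void discard_bucket(struct bucket *b)
{
	b->fill = 0;
}

/*
 * Allocate room for a write of 'sectors' sectors.  Writes never land at a
 * random offset inside a bucket: they either append to the currently open
 * bucket, or that bucket is retired and a freshly discarded one is opened.
 */
static uint64_t alloc_sectors(struct bucket *open, uint32_t sectors,
			      struct bucket *(*get_free_bucket)(void))
{
	if (open->fill + sectors > BUCKET_SECTORS) {
		struct bucket *fresh = get_free_bucket();
		discard_bucket(fresh);        /* clear it all at once */
		*open = *fresh;
	}

	uint64_t where = open->start + open->fill;
	open->fill += sectors;                /* strictly sequential fill */
	return where;
}

The point is just that the SSD only ever sees large sequential writes and whole-bucket discards, which is exactly what cheap MLC flash is good at.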

There's definitely still room for more optimization, but here are the relevant bits from some recent benchmarks (I posted the whole runs in the thread in LKF):

Better numbers across the board, but sequential reads improved 80% and seeks by a lot. And like I said, there's room for improvement...

Currently I'm working on tracking recent IO and using that to not cache large sequential reads - so you can run a backup and not have it blow your cache. Once I'm finished, you'll be able to cache the drives that make up your raid5/6 (so the p/q blocks get cached), and raid resyncs will completely bypass the cache, without the user even having to know what's going on. I'm really aiming for this to be a "flip it on and forget about it" solution that you can use for just about anything. Really looking forward to having it on my dev machine...

So, if I understand this correctly, Bcache is a "read through" cache at the moment - getting a block from the disk will place it into the cache for future access? Not that there's anything wrong with that, but other algorithms can be more effective.

Linux could use something similar to the Solaris (L2)ARC: it's both a least-recently-used and a most-frequently-used cache, so backups won't completely kill the cache. Seagate did something interesting as well with their Momentus XT line: it doesn't cache sequential reads at all, only random ones, in order to use the limited cache space optimally.

As for the benchmark numbers: 1783 IOPS seems too low on 16G of data - shouldn't it be served 100% from the cache, since the data set is way smaller than the cache device? The Corsair Nova managed 40-50 MB/s in 4kB random reads in benchmarks, which is 10k+ IOPS. Also, it's a bit silly to compare sequential read speeds; the real power of SSDs lies in random IO.

I don't have any spare hardware at the moment, but I'm wondering: is there a way to capture and replay disk traces in Linux? I'd love to see how much bcache could speed things up on an NFS server with ~2T of data, but there's no way I can test that on a live server.

My request pattern is mostly sequential reads, though, on 2-10M data files, with ~16G already reserved for the file/page cache (another question: how do I see the block cache hit rate in Linux?), so I'm not sure there would be any difference at all versus basically just a large LRU; the cache device to pool size ratio would be about 1:10.

Quote:

So, if I understand this correctly, Bcache is a "read through" cache at the moment - getting a block from the disk will place it into the cache for future access? Not that there's anything wrong with that, but other algorithms can be more effective.

It's got nothing to do with algorithms, and writes get put in the cache too (you have to, or else you'd have to invalidate that part of the cache) - it's that the consistency guarantees for caching writes are harder, so I've done the easier part first. Writeback caching is definitely happening.
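Roughly, the difference looks like this - just a toy illustration to show why write-through is the easy case, none of this is the actual bcache code:

Code:

/* Toy userspace illustration of write-through vs. writeback caching. */
#include <stdint.h>

#define BLOCKS 16

static uint8_t backing[BLOCKS];    /* stand-in for the cached disk */
static uint8_t ssd[BLOCKS];        /* stand-in for the SSD */
static uint8_t cached[BLOCKS];     /* which blocks the SSD currently holds */

/*
 * Write-through: keep the cache coherent (update or invalidate the cached
 * copy) and write the backing device before reporting completion.  If the
 * SSD dies you only lose clean, re-readable data.
 */
static void write_through(unsigned block, uint8_t data)
{
	if (cached[block])
		ssd[block] = data;
	backing[block] = data;     /* completion means "it's on the disk" */
}

/*
 * Writeback: complete as soon as the SSD has the data and write the backing
 * device later.  Much faster for random writes, but now the cache holds the
 * only copy - that's where the harder consistency guarantees come from.
 */
static void write_back(unsigned block, uint8_t data)
{
	ssd[block] = data;
	cached[block] = 1;
	/* a flusher would do backing[block] = data sometime later */
}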

Quote:

Linux could use something similar to the Solaris (L2)ARC: it's both a least-recently-used and a most-frequently-used cache, so backups won't completely kill the cache. Seagate did something interesting as well with their Momentus XT line: it doesn't cache sequential reads at all, only random ones, in order to use the limited cache space optimally.

Deciding what to retain in the cache is a different problem from just writing the cache; right now it functions as an LRU, but there's absolutely nothing preventing something else from being implemented - you'd just have to come up with the equations for initial priority, cache hit priority, and whatever else you can think of. I have some ideas; there's just other, more important stuff to do first.
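To give an idea of what I mean by "equations" - something along these lines, where the constants are completely made up and nothing here is settled:

Code:

/* Made-up priority scheme, just to show where a replacement policy plugs in. */
#include <stdint.h>

#define INITIAL_PRIO	 100u    /* what a freshly cached bucket starts at */
#define HIT_BONUS	  50u    /* bump on every cache hit */
#define MAX_PRIO	1000u
#define PRIO_DECAY	   1u    /* periodic decay makes it age like an LRU */

struct cached_bucket {
	uint16_t prio;           /* eviction picks the lowest priority bucket */
};

static void on_insert(struct cached_bucket *b)
{
	b->prio = INITIAL_PRIO;
}

static void on_hit(struct cached_bucket *b)
{
	uint32_t p = b->prio + HIT_BONUS;
	b->prio = p > MAX_PRIO ? MAX_PRIO : p;   /* hot data sticks around */
}

static void on_decay_tick(struct cached_bucket *b)
{
	if (b->prio >= PRIO_DECAY)
		b->prio -= PRIO_DECAY;           /* everything slowly ages out */
}

Tweak the constants (or the whole scheme) and you can get anything from a plain LRU to something much closer to an ARC-style frequency-aware cache.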

Also, LRU or not has nothing to do with not caching large sequential reads/writes. The initial implementation of that is done, as mentioned in my original post. Backups are going to be a non-issue; I can track the average IO size for the last n processes to do IO.

Backups are actually the easier case. The one that's going to need more work is raid resyncs: if you've got a raid5 or raid6 you want to cache the underlying devices, so that the p/q blocks get cached. The resync thread will be doing multiple streams of sequential IO, so I need to split the notion of a recent IO from open cache buckets, so I can track multiple recent IOs per process. And sooner or later I'm going to want to switch to a combined heap/red-black tree or a hash table instead of the current linked list, so I can track 100+ recent IOs instead of 10-20. Anyway, it's something I was considering months ago.
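Roughly, the sequential detection amounts to something like the sketch below - heavily simplified, with invented names and sizes; the real thing lives in the kernel and is tracked per task:

Code:

/* Simplified sketch of sequential-IO detection; not the actual implementation. */
#include <stdint.h>
#include <stdbool.h>

#define RECENT_IOS	16             /* short list now; heap/hash later for 100+ */
#define SEQ_CUTOFF	(4u << 20)     /* bypass the cache after ~4MB sequential */

struct recent_io {
	int      task;           /* which process issued the IO */
	uint64_t next_sector;    /* where this stream would continue */
	uint64_t sequential;     /* bytes seen so far in the stream */
};

static struct recent_io recent[RECENT_IOS];

/* Returns true if this request should skip the cache and go straight to disk. */
static bool should_bypass(int task, uint64_t sector, uint32_t bytes)
{
	for (int i = 0; i < RECENT_IOS; i++) {
		struct recent_io *r = &recent[i];

		if (r->task == task && r->next_sector == sector) {
			/* Continues a recent stream - extend it. */
			r->sequential += bytes;
			r->next_sector = sector + bytes / 512;
			return r->sequential >= SEQ_CUTOFF;
		}
	}

	/* New stream: reuse slot 0 (the real code would evict the oldest). */
	recent[0] = (struct recent_io){ task, sector + bytes / 512, bytes };
	return false;
}

A backup process quickly blows past the cutoff and its reads stop polluting the cache, while genuinely random IO never accumulates enough sequential bytes to be bypassed.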

Quote:

As for the benchmark numbers: 1783 IOPS seems too low on 16G of data - shouldn't it be served 100% from the cache, since the data set is way smaller than the cache device? The Corsair Nova managed 40-50 MB/s in 4kB random reads in benchmarks, which is 10k+ IOPS. Also, it's a bit silly to compare sequential read speeds; the real power of SSDs lies in random IO.

Like I said, it's alpha quality and there's definitely room for more optimization. The code is easily capable of saturating any SATA SSD with 4k random reads, but there's some weirdness with plugging in the block IO layer that I don't think I've completely sorted out. I wanted to show a benchmark people would be familiar with, not output from my test code that shows what it does in an optimal situation.

Quote:

I don't have any spare hardware at the moment, but I'm wondering: is there a way to capture and replay disk traces in Linux? I'd love to see how much bcache could speed things up on an NFS server with ~2T of data, but there's no way I can test that on a live server.

My request pattern is mostly sequential reads, though, on 2-10M data files, with ~16G already reserved for the file/page cache (another question: how do I see the block cache hit rate in Linux?), so I'm not sure there would be any difference at all versus basically just a large LRU; the cache device to pool size ratio would be about 1:10.

blktrace is what you want. No idea how you see the page cache hit rate.

I'm interested in experimenting with this a bit. I've got one 40GB SSD (Intel X25M, yeah, cheap) and a RAID-6 made up of 8 1TB drives and I'd rather not give up a channel on the Adaptec 5805 card just to put in a caching device. It's not yet in production use so it's not currently critical that it not get nuked :-)

From initiating a "git clone" it feels as if I'm about to get a full kernel tree. Is this available as a stand-alone module that could be built against an existing kernel, or is there something else that might make it easier to test? I used to build my own kernels all the time, but I'm going to have to dig around to figure out how to not b0rk Ubuntu's kernel setup when building a custom kernel.

Is there any good metric for knowing if/when this has caused things to get munged on disk (aside from manual checksumming)? I have about 1.1 TB of data from a previous system that I've just copied onto the array that could be played around with, but I'd rather not have to checksum everything just to see whether this caching system has eaten some data :-)

Quote:

I'm interested in experimenting with this a bit. I've got one 40GB SSD (Intel X25M, yeah, cheap) and a RAID-6 made up of 8 1TB drives and I'd rather not give up a channel on the Adaptec 5805 card just to put in a caching device. It's not yet in production use so it's not currently critical that it not get nuked :-)

Cool!

Quote:

From initiating a "git clone" it feels as if I'm about to get a full kernel tree. Is this available as a stand-alone module that could be built against an existing kernel, or is there something else that might make it easier to test? I used to build my own kernels all the time, but I'm going to have to dig around to figure out how to not b0rk Ubuntu's kernel setup when building a custom kernel.

It requires some hooks in the block layer, so there's no way to build a standalone module against a vanilla kernel. Fortunately, building a kernel for Ubuntu and Debian is easy - copy the distro config from /boot/config-foo to .config, then run make oldconfig; make; make install; update-initramfs -c -k <new kernel version>; update-grub.

The git tree is on top of 2.6.35-rc3, but the only thing keeping the patch from applying cleanly to older kernels (at least 2.6.34) is that some Kconfig stuff moved around. If you want to patch a stable kernel it should be pretty obvious how to fix the Kconfig file, or if you haven't done that before I can make a patch.

Quote:

Also, is there any good metric for knowing if/when this has caused things to get munged on disk (aside from manual checksumming)? I have about 1.1 TB of data from a previous system that I've just copied onto the array that could be played around with, but I'd rather not have to checksum everything just to see whether this caching system has eaten some data :-)

Unfortunately no - however, the risk of it actually corrupting data on the disk being cached should be pretty minuscule, as it's not doing writeback caching yet; everything gets passed straight through. The real danger is the cache getting corrupted and returning stale data, but ext3/4 are quite conservative and should remount the filesystem read-only if they notice anything screwy.

It's still definitely late alpha quality - I'm not running it on my dev machine quite yet - so don't run it if you want to get actual use out of it. It's coming together rapidly now, but right after I posted this thread I did find a bug that I eventually tracked down to a race in the btree IO code. If you're willing to run potentially unstable code, though, having some people help out with testing at this point could help a lot. I stress test as much as I can, but the sooner bugs are found the better.

Flashcache is further along - there are people running it in production, though from what I can tell not a lot; it's still pretty new. They've got write-behind caching too, though (and I'm somewhat talking out of my ass here) I'm skeptical about how trustworthy it is in that mode... But write-behind caching does help massively for certain workloads.

Bcache is a very different design: it's considerably more complex (around twice as much code right now), but it has the potential to be a lot faster - when it's done, I honestly don't think there's anything out there that's going to be able to touch it. I've written up my thoughts on the design way too many times so far, so rather than go in depth here I'm going to try to get started on the wiki tonight.

I'll be happy to answer any specific questions, too.

Quote:

P.S.: Is 9px still alive? I decided to see if there was anything at the server hosting the git repo and found a project page of yours (http://evilpiepirate.org/~kent/)

-jsnyder

Alas no, it was more for fun than anything else. (There isn't much about Plan 9 that isn't awesome - 9p really is a pleasure to behold compared to nearly any other network protocol.) It would be awesome if I could put in the time to make it compete with, say, NFS, but even though NFS's design is shit it has had a lot of work put into it, and nobody's going to pay me to get it up to that level. Bcache, fortunately, does have people interested enough in it to put money towards it. The time I spent on it wasn't wasted though; if I ever get the time to finish it, I've got a block-based dedup network fs thing partway done that took some cues from 9p.

I'm planning to build a 6x2TB RAID5 + SSD cache system for home use and tinkering, and I'm trying to choose between Linux/bcache/flashcache and Solaris/ZFS/L2ARC/ZIL.

Is bcache compatible with LVM? Software RAID5? Should it be set up to cache the disk devices themselves, or higher up the ladder at md0 or LVM? Or is this something that should be tested and benchmarked?

It looks stable enough for production use, but it hasn't seen enough testing for me to suggest using it for anything critical. It's fast, and writeback caching works great - using half an SSD to buffer random writes helps nicely with raid5/6 performance.

Documentation and such are works in progress, but I'm more than happy to help and answer questions.

Quote:

Documentation and such are works in progress, but I'm more than happy to help and answer questions.

Where should the cache be put? As far as I can see from the documentation, the system is pretty much flexible enough to cache any of the options in the diagram below.

From a logical standpoint, I'd prefer to have the cache attached to the logical volume, since not all applications will benefit that much from an SSD cache and it'd be easier to manage SSD wear. But from a performance perspective, it would probably be best to add all the individual raid drives to the caching system?

What benchmarks should be used for testing that everything is performing OK? I'm particularly concerned about the drives and their alignment, because of the 4k block size on the F4 drives. I'd also like to do some measurements of the cache benefit for different kinds of workloads. Since you've developed the system, you probably have some benchmarks you could recommend for checking that everything is OK?

Since this configuration is ordered with Solaris in mind, it'll also run S11X. If there are good cross-platform tests, I can probably run those too, just to see how Solaris/ZFS/L2ARC/ZIL compares to Linux/bcache using different file systems.

You shouldn't see any differences in performance with options 2-4, so you can do whatever's most convenient.

The philosophy behind bcache is that you shouldn't have to worry about specific applications, cache wear, etc. - hence detection of sequential IO. If there's a workload that it doesn't do well on, that's a bug that ought to be fixed.

For benchmarking... I'd just try to benchmark something close to your actual workloads. If something's not working right, blktrace is pretty useful for debugging, but it's not a benchmark.

I use sysbench/MySQL a lot for testing performance; that ought to work on Solaris too, and it ought to be somewhat relevant to people. I am curious to see how it compares to Solaris.

The other thing I'm working on right now is a bcache-specific superblock for the backing device - i.e. non-transparent caching. The transparent caching isn't going away, though it's probably not going into mainline.

But this'll allow bcache to prevent you from using a backing device without the cache when the cache has dirty data, and generally make sure things stay consistent. You trade convenience for safety.
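Conceptually the superblock doesn't need to be anything fancy - something along these lines, with made-up field names since the on-disk format isn't settled:

Code:

/* Conceptual sketch of a backing-device superblock; layout and names invented. */
#include <stdint.h>

struct backing_dev_sb {
	uint64_t magic;              /* marks this as a bcache backing device */
	uint8_t  cache_set_uuid[16]; /* which cache set it's attached to */
	uint8_t  dirty;              /* cache may hold data not yet on this device */
	uint64_t data_offset;        /* where the real data starts, past the sb */
};

/* Only expose the device without its cache if nothing can be lost. */
static int safe_without_cache(const struct backing_dev_sb *sb)
{
	return !sb->dirty;
}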

That code's not going to be ready for a while yet though... but after that I might start on volumes and thin provisioning.