S3, MEMCACHE AND MEMBASE

Recently in SEOmoz Engineering, the Linkscape team had the opportunity to evaluate Membase, a distributed caching technology similar to Memcache. Linkscape’s continued popularity had confronted us with a new problem: what is the proper way to cache data in the cloud? The journey didn’t lead us where we expected, but the road to discovery is always interesting!

OUR PROBLEM

Linkscape, the driving force behind Open Site Explorer, is a large read-only database that runs in Amazon’s cloud. It has almost no user state to manage, so the metaphor of a paging cache is handy for thinking about its API. The bulk of the data is kept in “blocks” in S3, compressed and indexed in a certain way. At request time, a few blocks are pulled from S3, decompressed and examined for the desired records.
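
To make the read path concrete, here is a minimal sketch of a lookup. The bucket name, index API, and record scanner are hypothetical stand-ins (the post doesn’t describe the real block naming or format), and boto3 is simply a convenient S3 client:

    import zlib

    import boto3  # any S3 client would do; boto3 is assumed here

    s3 = boto3.client("s3")

    def fetch_block(block_key):
        # One compressed block per S3 object; keys come from the index.
        obj = s3.get_object(Bucket="linkscape-blocks", Key=block_key)  # hypothetical bucket
        return zlib.decompress(obj["Body"].read())

    def lookup(index, url):
        block_key = index.block_for(url)    # hypothetical index API
        block = fetch_block(block_key)
        return scan_for_record(block, url)  # hypothetical block parser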

The block organization tries to take advantage of locality in requests, so it makes sense to cache entire blocks at a time. This can be a boon to performance, but it comes at a price. The Internet is not equally interesting in all places, and as a reflection of the Internet, neither is Linkscape. Some blocks are vastly more popular than others, because they contain information about popular sites. Unless special care is taken, these “hotspots” can be bad for performance: an unexpectedly disproportionate amount of resources is sometimes needed to handle them.

Linkscape is happily growing, and with that come growing pains. In this story those pains are increased throttling from S3. Our hotspots, which we were essentially passing on to S3, had finally grown large enough to aggravate S3 into occasional bursts of rate-limited service denial errors. It was time to shape up our cranky access pattern.

Currently, Memcache is our chosen caching technology; we like its simplicity and maturity. Each of our (100 or so) API nodes runs a local, isolated Memcache installation, and only S3 requests originating from that node pass through it. This unclustered arrangement is clearly redundant (in a bad way), but fast, since only cache misses require actual network activity.
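
In code, the per-node arrangement is plain cache-aside. A sketch, assuming the pylibmc binding for libmemcached and the fetch_block helper from the sketch above:

    import pylibmc  # Python binding for libmemcached (assumed)

    mc = pylibmc.Client(["127.0.0.1"])  # node-local, isolated instance

    def get_block(block_key):
        # Cache-aside: consult the local Memcache first, fall back to S3.
        block = mc.get(block_key)
        if block is not None:
            return block                # hit: no network activity at all
        block = fetch_block(block_key)  # miss: one S3 round trip
        mc.set(block_key, block)
        return block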

Before we found ourselves in the hotspot hot seat, this was a reasonable trade-off: the cache miss penalty was the latency of an S3 request. Now, however, the miss penalty was that same latency plus the amortized cost of exponential back-off, should our request be denied.
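
That amortized cost is just the expected price of a retry loop like the following sketch; RateLimitError is a hypothetical stand-in for however the S3 client surfaces a rate-limiting response:

    import random
    import time

    class RateLimitError(Exception):
        """Hypothetical: raised when S3 answers with a rate-limiting error."""

    def fetch_with_backoff(do_request, max_retries=5):
        for attempt in range(max_retries):
            try:
                return do_request()
            except RateLimitError:
                # Sleep a random interval whose ceiling doubles each try;
                # the jitter keeps throttled nodes from retrying in lockstep.
                time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
        raise RateLimitError("gave up after %d throttled attempts" % max_retries)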

Suddenly the image of a distributed cluster of Memcache instances starts looking mighty fine. Since the penalty of a cache miss was becoming more severe, a reasonable tactic is to reduce the chance of misses in the first place. Clever eviction schemes aside, the way one achieves that is with a bigger cache. Our Memcache instances were oblivious to each other: a miss in one might be a hit in another, but since the two instances don’t “talk”, they’d never know.

To be fair, Memcache does support a limited form of distributed caching (after all, the word “distributed” appears in its tagline). This feature, however, amounts to the observation that the client could keep track of everything. By having a list of all nodes in the caching cluster and sticking to some scheme for assigning keys to them, a smart client could treat several nodes as one large cache. Many do. It just doesn’t sound that great to us.
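
The scheme amounts to a few lines of client code. A naive sketch (server names are hypothetical):

    import hashlib

    SERVERS = ["cache-1:11211", "cache-2:11211", "cache-3:11211"]  # hypothetical

    def server_for(key):
        # Every client that agrees on the server list and the hash
        # function sees the three nodes as one cache three times the size.
        h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        return SERVERS[h % len(SERVERS)]

Note that modulo hashing reshuffles nearly every key whenever the server list changes; consistent-hashing schemes (libmemcached’s ketama mode, for instance) are the usual refinement.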

For one thing, we had tried it before, way back when. Managing the client was all right, but we found our (somewhat naive) approach led to a lot of network “cross-talk”, to the point that it had a negative performance impact. Additionally, in our new situation, I fear we’d only end up moving the hotspot network activity from S3 to the cache nodes: without some kind of key replication, it seems likely that some nodes would be pounded an order of magnitude harder than others.

OUR SOLUTION

Enter Membase. Like Memcache, it is a typeless key-value store meant for “predictable, low-latency, random access to data with high sustained throughput.” Additionally, its notion of “distributed” is a bit more hands-on: using a flavor of consistent hashing called “vbuckets”, it will deal with server management as well as provide a configurable degree of replication. Thus hot (and cold) blocks could be shared between (say) 3 cache nodes to spread the heat around. Finally, the coup de grace: “It is protocol-compatible with Memcached (both text and binary protocols), so if an application is already using Memcached, Membase can be dropped in without any change to application code or client configuration.” Rock on!
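
In rough terms, the vbucket scheme inserts a fixed-size indirection table between the key hash and the servers. A sketch with a made-up map and node names (Membase’s real hash function and map format differ):

    import hashlib

    NUM_VBUCKETS = 1024  # fixed when the cluster is created

    # The cluster publishes a vbucket map; this made-up one spreads the
    # vbuckets over four nodes, each with one replica on the next node.
    vbucket_map = {vb: ("node-%d" % (vb % 4),        # active copy
                        "node-%d" % ((vb + 1) % 4))  # replica copy
                   for vb in range(NUM_VBUCKETS)}

    def nodes_for(key):
        h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        return vbucket_map[h % NUM_VBUCKETS]

Rebalancing moves whole vbuckets between servers rather than rehashing keys, which is what lets the cluster manage membership without disturbing clients.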

After grabbing the stock Debian package and perusing the command-line tool’s help, I found the first bit of fine print: Membase will happily speak modern versions of the Memcache protocol. We, however, were holding onto a moldy old libmemcached-0.26 due to an as-yet-undiagnosed bug and/or feature present in later versions, including the 0.44 version that Membase is apparently coded against. Thus we got (somewhat) mysterious segfaults immediately after initial memcache client usage. This can’t reasonably be held against Membase, and now that we had a defensible reason for upgrading, we went about tracing the bug and replacing the client library (which is itself a tale for another time).

Having migrated to a modern libmemcached client, the next logical step was to swap out our unclustered, isolated Memcache instances for unclustered, isolated Membase instances on a set of testbed nodes. This wouldn’t garner any of the benefits of a distributed cache, but if we could demonstrate a working system in this state, the problem would be reduced to little more than configuration.

As promised, Membase supplies a means for Memcache clients to connect to it on Memcache’s usual port, while transparently adding in its consistent hashing and replication features. It turns out this is realized with a proxy process called “moxi”, which takes the role of the “smart client” envisioned in Memcache’s original distributed architecture. The stock daemon control script fires up this proxy along with the instance proper, so our API process was indeed able to start up and begin going through its paces under our warmup script.

OUR NOT-SO-HAPPY ENDING

During the warmup script, funny things began happening. The cached blocks, arriving compressed from S3, began failing their checksums. On fresh machines, with cold caches, the error would appear slowly and non-deterministically. On warm machines the error would be swift and assured. What happened?

An obvious thing to check is the data itself. Upon manually fetching it from S3, we found the checksums to be fine, and different from the failing checksums the API process complained of. Dusting off the ol’ packet sniffer turned up something unfortunate: pairs of cache insertions and probes where one set of data goes in, and another, slightly different set of data comes out. The change was usually restricted to only a few bytes, but it was there. Some aspect of how we were using the client was causing corruption somewhere in the Membase stack.
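
For context, the failing check is essentially the following; where the expected checksum lives is a detail the post doesn’t give, so take the signature as hypothetical:

    import zlib

    def verify_block(payload, expected_crc):
        # Recompute the checksum of the block as it came out of the cache
        # and compare it to the one recorded when the block was built.
        actual = zlib.crc32(payload) & 0xFFFFFFFF
        if actual != expected_crc:
            raise ValueError("corrupt block: crc %08x, expected %08x"
                             % (actual, expected_crc))
        return payload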

After hammering our isolated Membase with many flavors of traffic via simple testing scripts, we concluded that the corruption seems to hit only our specific usage pattern. Lucky us. This also made it quite difficult to help the Membase team diagnose the issue, as there was no obvious way to communicate a reproducing script. In the end, we decided that it wasn’t for us, and instead built a varnish proxy cluster to stand between our API machines and S3, an arrangement that has been serving us well ever since.

The core issue is that the isolated cache isn’t scalable: as the cluster grows, the hit rate goes down (since the load balancer actively destroys any locality in the requests). So, in hindsight (haha), it wasn’t very surprising that this issue emerged only as we began scaling to meet increased demand.

You asked what our cache hit rates are. I don’t have the numbers from the isolated Memcache topology, but I do for our varnish cache arrangement. As detailed in another post, it is a two-layer cache, with the “front” layer composed of isolated caches that delegate on miss to a “back” layer arranged in a CARP array. Currently the cache cluster has 4 m1.xlarge machines (15 GB RAM each), each running both a “front” and a “back” varnish process. The front hit rate is 33.5% (averaged across the cluster for the last two weeks) while the back hit rate is 64.7%. This achieves an overall hit rate of 1 - (1 - 0.335) * (1 - 0.647) = 76.5%.
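
The same arithmetic generalizes to any number of stacked layers, since a request misses overall only if it misses every layer in turn:

    def overall_hit_rate(layer_rates):
        miss = 1.0
        for rate in layer_rates:
            miss *= 1.0 - rate  # survive to the next layer only by missing
        return 1.0 - miss

    print(overall_hit_rate([0.335, 0.647]))  # 0.7652..., the 76.5% above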

Nick Stielau

Awesome post Phil! Nothing like a good ‘cache hit rate’ post to get the heart pumping in the morning.

Sagar Sonawane

Hi Phil,
Great Post!!
An eye-opener for me, at least, if I understood it correctly!!

Let me explain what made my eyeballs pop out. As you mentioned in your post,
“pairs of cache insertions and probes where one set of data goes in, and another set of slightly different data comes out …..
….we concluded that the corruption seems to only hit our specific usage pattern.”
This shatters the conception on which these tools are built, i.e. *typeless*. As per my understanding, if memcache/membase only understands *byte streams* and doesn’t care about the *type of data*, then how come there was *loss of data* (reckoning from “one set of data goes in … slightly different data comes out”)?

Please share your observations with me, as I am betting *huge* on Membase (30+ million items, growing at 30% per fortnight).

Hoping for a favorable reply!!

Regards,
Sag

phil

Hello, @Sagar!

The corruption we saw was “low-level”, in the sense that what happened can be easily described (if not explained) in a bytes-over-a-wire sense. No particular type system need be involved.

There are other qualities of exchanged messages that could “finger-print” our usage pattern besides the schema of the payloads. The size of the messages on the wire, the order, multiplicity, types and timing of the memcache-level messages all come to mind.

Protocol specifications are typically not air-tight (I’m not aware of a standard way to constrain message timing, for example) and so even if our usage is spec-legal, it may have qualities that would exercise bugs in an implementation that does not expect them. I haven’t devoted any more attention to the problem, but I would expect there’s some race condition or reentrance quirk at the bottom of it.

To underscore the above point (and to help you sleep a little better) I should point out that I spent a sizable amount of time trying to reproduce this bug with a dumb script that generated high volumes of legal, uniformly random traffic. I couldn’t do it. This suggests that our API usage is actually somehow unique in one of these harder-to-quantify ways.
