January 04, 2009

Note: This post relies heavily on one’s general understanding of database sharding strategies. If you’re unsure about any particular points within this post, I recommend you read my previous post, Scalable Strategies Primer: Database Sharding, before continuing.

Introduction

While working with Memcache the other night, it dawned on me that its usage as a distributed caching mechanism is really just one of many ways to use it. There are, in fact, many alternative uses one could find for Memcache by recognizing what it really is at its core: a simple distributed hash-table. That point is worthy of further discussion.

To be clear, when I say “simple”, by no means am I implying that Memcache’s implementation is simple, just that the ideas behind it are. Think about that for a minute. What else could we use a simple distributed hash-table for, besides caching? How about using it as an alternative to the traditional shard lookup method we used in our Master Index Lookup scalability strategy, discussed previously here.

Implementation

Now, I’m a particularly intense supporter of the “use the correct tool for the job” and “think outside the box” mantras. I strongly believe that databases are not the end-all-be-all of persistent data storage solutions. To that end, I’m proposing that we can utilize Memcache as a highly scalable, highly available, in-memory database shard indexing solution.

The following is a short list of requirements that any distributed shard indexing solution must take into account:

It should be highly available. The failure of any single node should, ideally, not result in any data being unavailable for an index lookup and, at worst, should not make a majority of the data unavailable. Minimizing the impact of failed instances is very important.

It should be highly scalable. That is, we should be able to add linear capacity to our indices by adding instances of our solution.

It should display characteristics that promote easy indexing of data. For example, it should loosely represent a data structure that lends itself well to retrieving data by a single unique value (e.g. an array, list, or hash-table).

The actual location lookup for a piece of data should be high-performance when executed on each instance, or node, of our solution.

In order to give some context to our Memcache shard index solution, let’s describe a plausible use-case:

“Our system has a single database server. That database server is over utilized, nearly to the point of failure, by intermittent but long running queries. The primary issue is the sheer amount of growth the dataset is experiencing. To put this in more specific terms, we are working with approximately 5 million users added non-linearly over the last 2 years. Each user is made up of a user table, a user_profile table, a user_blog table, and a user_blog_entry table. Each row within the user_profile table is related to a single row within the user table. Each row within the user_blog table is related to a single row within the user table. Each row within the user_blog_entry table is related to a single row within the user_blog table.”

Normally, we might apply the Master Index Lookup strategy straight through. If we’re using Memcache, however, we would substitute the “Defining the Index Shard schema” section with the following alternative method of implementing a shard index lookup.

First, because Memcache is a key/value data structure, we need to think about the differences between creating an index lookup with a database versus a key/value data structure. It’s important to understand that with a key/value lookup, we’re trading structured data for simplicity. Because we can’t create a schema of any real sort with a key/value data structure, it helps to adopt a method of managing keys that supports a convention-over-configuration approach. To that end, if we can guarantee that any key we enter into Memcache is unique, regardless of the value contained within its entry, we will have successfully denormalized our indexed data and, indirectly, simplified working with Memcache. One way to accomplish this is to use Globally Unique IDentifiers (GUIDs) as keys for all entries.
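As a minimal sketch of this convention, assuming the python-memcached client and invented server addresses and field values:

    import json
    import uuid
    import memcache  # python-memcached client (an assumed dependency)

    # Server addresses are invented for illustration.
    mc = memcache.Client(["10.0.0.1:11211", "10.0.0.2:11211"])

    # A GUID key is unique regardless of the value stored under it,
    # so no key-collision management or schema is needed.
    shard_id = str(uuid.uuid4())
    mc.set(shard_id, json.dumps({"connectionString": "db01;userdb", "status": 1}))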

Now let’s define our serialized index data to be stored in Memcache. For this, I’m choosing to stay as platform-agnostic as possible. How exactly the data is serialized is fairly irrelevant to the concept, so whether we serialize into XML, JSON, or bytes, the techniques presented here should require no significant alteration.

For our shard information, we could use an object something like the following:

Key: shardId (GUID)
Value: shard (Serialized Shard Object)
    shardId (GUID)
    connectionString (String)
    status (Byte)
    createdDate (Date and Time)

And for our user index information, we could use an object something like the following:

Key: userId (GUID)
Value: user (Serialized User Object)
    userId (GUID)
    shardId (GUID)
    username (String)
    password (String)
    createdDate (Date and Time)

And for our user index information, indexed by username, we could use an object something like the following:

Key: username (String)
Value: user (Serialized User Object)
    userId (GUID)
    shardId (GUID)
    username (String)
    password (String)
    createdDate (Date and Time)

Lastly, for our Active Insert User Shard status, we could use the following:

Key: activeInsertUserShardId (GUID)
Value: activeInsertUserShard (Serialized Active Insert User Shard Object)
    activeInsertUserShardId (GUID)
    shardId (GUID)
    lastModifiedDate (Date and Time)
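To make these layouts concrete, here is how one of the user entries might look once serialized as JSON (every value below is invented for illustration):

    Key:   "0b98d5a2-1c44-4cfe-b3f2-6d5a2e9c7f10"
    Value: {"userId": "0b98d5a2-1c44-4cfe-b3f2-6d5a2e9c7f10",
            "shardId": "7f3a9c1e-52bb-4d08-9e67-4b8f0d2a6c31",
            "username": "jdoe",
            "password": "<hashed>",
            "createdDate": "2009-01-04T12:00:00Z"}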

Now that we’ve defined the objects we’ll be using to index our sharded database user data, we can begin to think about how we might load this data into Memcache, use it to locate users, and generally manage all user indexing operations. Common CRUD operations would be executed using procedures like the following (shown here for inserting a new user):

Connect to the Memcache Index using an application configuration-level connection setting.

Insert the new user’s lookup information as a new user object, using the shardId from the retrieved shard table and the userId from the Domain Shard’s user table, for the new location of the user’s information.

Keep in mind that in order to index a user by more than just their userId, as we have above, we are storing the same set of user values twice. A workaround is to store another entry with the username as the key and the userId as the value, and then retrieve the user value by userId. Whether or not you implement this workaround is really a trade-off between memory and round-trip requests. While the method I’ve used in the above procedures uses more memory, it also reduces the number of round-trips required to retrieve a user by their username or userId; a sketch of it follows below. Optimize this for your specific bottleneck and/or business requirements.
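A minimal sketch of that insert procedure, assuming the python-memcached client, JSON serialization, and invented server and helper names:

    import json
    import memcache

    mc = memcache.Client(["index01:11211", "index02:11211"])

    def insert_user_index(shard_id, user_id, username, password_hash, created):
        payload = json.dumps({"userId": user_id, "shardId": shard_id,
                              "username": username, "password": password_hash,
                              "createdDate": created})
        # Store the full user value twice, keyed by userId and by username,
        # trading memory for fewer lookup round-trips.
        mc.set(user_id, payload)
        mc.set(username, payload)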

Loading Index Data

Loading index data from our database data source into Memcache is an important piece of the overall in-memory indexing process. This can be achieved simply: query the data to be indexed on each shard, assign the appropriate data from each shard’s user rows (userId, shardId, etc.) to new Memcache entries, set each index entry to have no expiration date, and ensure that enough servers are running Memcache nodes that Memcache’s memory reclamation isn’t triggered. It would be wise to have more servers running, and therefore more available memory, than is minimally necessary, so that there is room for index growth.
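A rough sketch of that loading process, assuming a MySQL source per the use-case above (the MySQLdb usage, table layout, and server names are all assumptions):

    import json
    import memcache
    import MySQLdb  # assumed MySQL source, per the use-case above

    mc = memcache.Client(["index01:11211", "index02:11211"])

    def load_shard_index(shard_id, conn_info):
        db = MySQLdb.connect(**conn_info)
        cursor = db.cursor()
        cursor.execute("SELECT userId, username, password, createdDate FROM user")
        for user_id, username, password, created in cursor.fetchall():
            value = json.dumps({"userId": user_id, "shardId": shard_id,
                                "username": username, "password": password,
                                "createdDate": str(created)})
            # time=0 means the entry never expires; reclamation is avoided
            # through capacity planning rather than TTLs.
            mc.set(user_id, value, time=0)
            mc.set(username, value, time=0)
        db.close()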

Inevitably, we’ll need to add more servers (as a result of needing more memory), and a reload of the index data will then be necessary, given how Memcache places new entries: it hashes each entry key among the currently available Memcache nodes. Depending on the size of the dataset to be indexed, this loading process can become time consuming, and frequent reloading of index data should be avoided if possible.

Weaknesses

Memcache can only use as much memory to store entries on a node as there is available memory on the system that is running it. In the case that Memcache has reached the memory limit of the system that it’s running on, and it’s attempting to add another entry, it will automatically reclaim memory by discarding expired entries or the oldest entries within its data structure. Normally, this behavior is appropriate given Memcache’s purpose – to cache data from a data source that will fill it as necessary. Unfortunately for us, this behavior isn’t ideal. We can, however, work around it with a little clever thinking.

Because the data we’re storing in Memcache is index data, we can make a few assumptions as to the type and length of data that we’ll be storing. Almost all of the data within our index will be of a data type that has a preset maximum size. For example, storing a userId in Memcache, with a database source type of varchar(36), we can assume that every entry will have a predictable maximum key size (36 chars x 2 bytes = 72 bytes). Armed with this knowledge, we can apply the same thought process to the maximum data being stored in each entry’s value. If we know how much memory each key/value pair will utilize, we can put application level constraints in place so that we store only as many key/value pairs as the node system can fit within its available memory, therefore making memory reclamation unnecessary.
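As a back-of-the-envelope sketch of such a constraint (the sizes below are illustrative assumptions, including the 2-bytes-per-character figure from above):

    # Per-entry cost, using the 2-bytes-per-char assumption from above.
    KEY_SIZE = 36 * 2          # 72 bytes for a GUID key
    MAX_VALUE_SIZE = 512       # assumed ceiling for a serialized user object
    ITEM_OVERHEAD = 64         # assumed per-item memcached bookkeeping cost

    node_memory = 4 * 1024 ** 3   # e.g. 4 GB dedicated to memcached on a node
    max_entries = node_memory // (KEY_SIZE + MAX_VALUE_SIZE + ITEM_OVERHEAD)

    # The application refuses index inserts beyond max_entries, so
    # memcached's LRU reclamation never needs to evict index entries.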

Wrap-Up

In this post, I’ve briefly presented an alternative usage for Memcache that exemplifies how a simple distributed data structure can be used for more than just the caching of data. Memcache allows us to build a simple, fast, and powerful indexing system that complements a database sharding architecture, while simplifying the overall system.

It’s worth noting that Memcache is just one example of a distributed key/value system that can be used as an indexing mechanism. It might even be in one’s best interest to develop one’s own distributed key/value system, or to fork Memcache, in order to remove some of the weaknesses of the current version when it is used in the manner described in this post (guaranteeing data won’t be removed due to lack of space, etc.).

As always, I’m interested in hearing others’ proposals for using distributed data structures, beyond databases, to manage system data in new and innovative ways. Please comment below with any thoughts along these lines.

Comments

I don't quite follow your use of "highly available". The schemes you outlined seem to depend on specific memcached servers being reachable at all times. For example, what happens to the insert process if the server (to which the key 'activeInsertUserShard' hashes) goes down?

If you want to maximize the number of records memcached can store, you could tweak the memcached memory block sizes. You'll need to recompile it to do that, but you could save lots of memory. If memory serves me, the relevant default block sizes are 64 and 128 bytes, which would mean that for each stored GUID you lose another 50-ish bytes to internal fragmentation. Tweaking that could give you 50% more stored keys.

@Robert Brewer
This strategy is highly available in that it uses Memcache, which is in itself highly available. There are different ways to achieve high availability, none of which truly ensure 100% uptime. Using Memcache as I've described it achieves high availability in that if one index node goes down, only a subset of the overall index is down. That's high availability in action right there.

To use your example - if the shard which holds the 'activeInsertUserShard' goes down, we would have a problem. A workaround is to replicate that value among multiple keys that hash to different servers. Regardless, in the worst case, you wouldn't be able to insert new users, and that's all. No major data outage would occur as a result, beyond the indexed data that was on the downed node. The idea here is to minimize the effect of downed nodes by distributing the indexed data among a number of servers.
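Something like this hypothetical sketch, for instance:

    import json
    import memcache

    mc = memcache.Client(["index01:11211", "index02:11211"])
    active_shard_id = "..."  # the shardId currently accepting inserts

    # Write the pointer under several derived keys, which hash to
    # different memcached servers (a hypothetical replication scheme).
    payload = json.dumps({"shardId": active_shard_id})
    for i in range(3):
        mc.set("activeInsertUserShard:%d" % i, payload)
    # On read, try each derived key in turn until a live server answers.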

@m_eiman
Good point. Alternatively, you could remove the memory reclamation feature and it would serve as an in-memory version of a database. When you've hit a server's memory limit, you would just receive an error and not an overwrite. This behavior would be far more predictable and could yield decent results.

@Paolo Casciello
You're correct, Memcache isn't transactional. However, that shouldn't have a major impact on the usage of Memcache in this way (or even in the normal way, i.e. to cache data). Many distributed systems trade off between scalability and consistency, and end up with eventual consistency. To ensure eventual consistency, you could have an out-of-band process checking data intermittently, looking for signs of a loss in data integrity. I believe Flickr uses something like this.

@Me
Good point. I haven't looked into memcachedb yet. Sounds like that may be an existing solution to the memory reclamation issue that occurs when using this strategy.

Wouldn't it be wise to use memcached just like we use it in a web app, that is, to still have a shard for indexing purposes? Each time we need info from the index, we query memcached. If it's found, we use it; if not, we query the shard for the info, store it in memcached for next time, and use it.

That way, there would be no need to re-load the data into memcached, and even if a memory reclamation happens, it wouldn't be so bad.

"It should be highly available. The failure of any single node should, ideally, not result in any data being unavailable for an index lookup, and at worst, not result in a majority percentage of data being unavailable for an index lookup. Minimizing the impact of failed instances is very important."

You didn't discuss your strategy for this requirement. HA is the biggest issue since the data isn't persistent.

Many people have already mentioned that the data being placed in MC as proposed in this article is not persistent, and if you run out of memory you will start to lose data. How about all the events that are more likely to happen, such as maintenance and upgrades of MC? These are times when the MC instance will need to come down.

@Martin
You certainly could, and there's nothing stopping you from implementing that on top of this strategy. However, I'm proposing to use Memcache in a different way - explicitly using it as an in-memory key/value datastore for a shard index.

@Jose Borges Ferreira
Good idea, that could make updating Memcached for index related inserts even easier. I'll have to look into that some more.

@Chris
The data is persistent, just not within Memcache. You can always re-retrieve the index data from the MySQL source dataset. That is how you would achieve HA: by reloading data from the source shards as necessary. I briefly address this process in the "Loading Index Data" section.

@Ryan Shneider
Please see my above reply to Chris. Taking down the MC instance may be unavoidable as a result of Memcache's lack of persistence. Using MemcacheDB might be a work around to that. Otherwise it may be necessary to just do a reload of the whole Memcache cluster. This could likely be done fairly quickly (seconds or minutes) if there are enough shards and enough Memcache nodes to distribute the load of the reloading operation, and there likely will be if you've planned ahead appropriately.

@All
A quick note about where I was going with this article; I responded to a few of the same questions on Hacker News and wanted to re-post one of my responses here, to give further insight into the underlying purpose of this article.

...

"My intent in using Memcache as the mechanism for this type of indexing was more an attempt to relate the subject with something that most developers are at least somewhat familiar with. I explicitly address the weaknesses of Memcache in the section titled "Weaknesses". I also recommend some work-arounds to lessen the effect of Memcache's limitations. In the "Wrap-Up" section, I even go so far as to say that Memcache is really only one example of a distributed hash-table and that there are alternatives. Rolling your own is another option, and both of solutions are probably better suited to the indexing problem than Memcache.

Again, I was afraid that the subject would be lost on most people if there wasn't at least some relation made between the concept and something that concretely exists. Try to look at it as an exercise in thinking "outside the box", using the best example I could think of.

Thanks for the comments, good and bad. These critiques really do help."

Just a sidenote regarding the so-called memcached "weakness".
Memcached is a cache. Like any other cache, it doesn't have to hold all the information. Assuming a nice read/write ratio, Memcache will boost the speed of access to the information and lower your databases' load!

If you really need to have all your data in cache, add a dedicated cluster. To see some impressive numbers, take a look at Facebook, where they have "(...) more than 800 servers supplying over 28 terabytes of memory to our users (...)" ( http://www.facebook.com/note.php?note_id=39391378919 ).