January 2009

January 22, 2009

I follow a certain philosophy when developing system architectures. I assume that very few systems will ever exist in a consistent form for more than a short period of time. What constitutes a “short period of time” differs depending on the specifics of each system, but in an effort to quantify it, I generally find that it falls somewhere between a week and a month.

The driving forces behind the need for an ever-changing architecture are largely business requirements. This is a side effect of the reality that software development, in most cases, plays a supporting role within the business unit it serves. As business requirements (i.e. additional features, new products, etc.) pour forth, it is the developer’s job to evolve their software system to accommodate those requirements and provide a software-based solution to whatever problems lie ahead.

Given that many businesses can be identified as having the above characteristics, I can now begin to explain why I believe that Heterogeneous System Architectures hold a significant advantage over Homogeneous System Architectures, in many distributed system cases.

An Experiential Use Case

I work daily on a mobile platform that mixes web applications, server applications, and XMPP servers together into an architecture that delivers on the business requirements, as set forth by our business development team and clients. This mobile platform is fairly distributed, with web applications on one set of machines, server applications on another, and XMPP servers on yet another. These three components interconnect, with the XMPP protocol and/or a database being the glue that binds them.

It’s important to point out that each individual web application and server application interfaces with XMPP or a database only, and that each is independent of the other – we’ve essentially developed a “shared-nothing” architecture. This loosely coupled, non-interdependent XMPP and database architecture has served our purposes well so far.

We currently run on the Windows platform and utilize the .NET Framework to build both our web applications and our server applications. The only exception to using Microsoft technologies within our platform is our Java based XMPP server (Jive Software’s OpenFire). When we researched XMPP servers, we found no exceptional .NET based XMPP servers with the support and maturity we needed. This led us to our currently implemented Java based XMPP server solution. Given the nature of XMPP, and the openness of its protocol, we were able to find a couple of solid .NET based XMPP client SDKs to work with, which made interfacing with the Java XMPP servers trivial.

Just to drive my previous points home: turning to a Heterogeneous System Architecture wasn’t intentional; it was a side effect of working with the 3rd-party software available at the time. This decision essentially turned our all-Microsoft Homogeneous System into a Heterogeneous system overnight.

One added benefit of this shared-nothing architecture is that, because it is based on an open protocol (XMPP) and uses databases with wide support across a variety of platforms and development languages, we are now able to mix and match platforms, differing technology stacks (e.g. LAMP), and development environments with minimal effort and near-zero incompatibility issues.

Differing Platforms, Technology Stacks, and Development Environments

Having the ability to work with differing platforms, technology stacks, and development environments, all in an effort to find the best tool for the job (based on budgetary requirements, differing developer skill sets, the availability of certain 3rd-party software, etc.), gives us a real advantage in that our options have been vastly diversified.

Now, of course, the desire to utilize any solution should never depend solely on its availability as an option. Just because we can mix and match doesn’t mean that it is in our best interest to do so. Our decision to derive value from this ability depends on a number of factors. For example, we’re a startup division within a larger company; our roots are startup based, and so is our budget. In going over some of this year’s software purchases, we’ve had to include Windows Server purchases/upgrades, SQL Server licenses, and Visual Studio development environment licenses. Those are just a few examples of the common large-ticket items that must be weighed against a startup’s budgetary constraints.

The flipside to the above software purchases is that our platform, although currently based on a majority of Microsoft software, is in no way dependent on that software retaining its current majority share. We could, for example, move to MySQL and eliminate the high cost of SQL Server licenses, or we could move to a LAMP stack and decrease the costs associated with running our web servers on Windows Server. And again with the LAMP stack, we could opt to run a mature PHP based MVC framework instead of working with the ASP.NET MVC Framework, which requires us to work through various Beta/RC1 bugs.

The above is just one budgetary example of where having the option to become more heterogeneous with our architecture may be in our best interest. Obviously, using one set of tools versus another has its trade-offs, and no one platform, technology stack, or development environment is the best in all situations. But, given how dynamic our business requirements have been over the last few years, I’m excited to say that we have a whole slew of options available to us, largely as a result of having a Heterogeneous System Architecture.

Heterogeneous System Impact on Human Resources

I strongly believe that the single greatest detractor from using a Heterogeneous approach to your system’s architecture is the human resource factor. When architecting a system, it is important to keep in mind the skill sets of your developers, the quantity of those developers, and the ability for those developers to work within a Heterogeneous environment.

It is far more common than not to have a team of developers aligned with a particular technology. For example, a Microsoft shop often employs developers that follow the Microsoft path of technologies (.NET, SQL Server, Windows), whereas with an Open Source system it is more common to have a crew of developers well versed in Linux/Unix based technologies (Red Hat Linux, Apache, MySQL, PHP). This type of specialization amongst developers makes a Heterogeneous Architecture somewhat harder to hire for.

Along these same lines, another unfortunate commonality between developers aligned with a particular platform is that they are often reluctant to learn and evolve their skill set outside of their realm of specialization. The developer who simply loves all technology and has the capacity to apply their general development knowledge across many platforms, quickly and successfully, is a rare breed indeed. Understandably so, as it takes a very dedicated and smart individual to become specialized in more than one platform. Equally as difficult as finding the will within developers is finding developers with more than an introductory level of experience with multiple platforms.

So, in cases where your system is going the Heterogeneous route, ideally you would try to hire the particularly “smart”, technology-passionate developers; developers who have no allegiance to any particular technology stack. These are the guys who just want to use the best tool for the job, and have fun doing it; these guys are the Rockstars of the development world. Now, it’s easy to say “just hire the Rockstars”, but this usually comes at an increased salary cost, and positions of this type are therefore much more difficult to fill. The question then becomes, based on your particular business requirements, your current team, and the general financial outlook of your business, whether it is to your benefit to go with a Heterogeneous, and therefore extremely flexible, system, or a more easily manageable and more thoroughly supported Homogeneous system.

Getting personal for a moment, I prefer the flexibility of a Heterogeneous system, especially when working within a distributed architecture. The freedom and sheer number of options available to solve the often more complex software problems associated with distributed systems make it worth the steeper developer requirements.

Wrap-Up

I attempt to weave into my posts the common theme that all implementations involve trade-offs. I try to drive home the reality that there is no single “best” way to accomplish any sufficiently complex task. That every strategy you implement, and every decision you make, has consequences in addition to its benefits is one of the more important fundamentals of building robust system architectures.

I also try to illustrate these concepts using real-life examples, and I hope that they help you better relate their applicability to your situation, however different it may be. Please comment below on your experiences with Heterogeneous versus Homogeneous Architectures. I’m always interested in hearing how others tackle this subject, and what rules they’ve set as guidelines for the implementations of their systems, one way or the other.

January 11, 2009

No long write-ups this week, just a short list of some great resources that I've found very inspirational and thought provoking. I've broken these resources up into two lists: Blogs and Presentations.

Blogs

The blogs listed below are ones that I subscribe to and are filled with some great posts about capacity planning, scalability problems and solutions, and distributed system information. Each blog is authored by exceptionally smart people and many of them have significant experience building production-level scalable systems.

Presentations

The presentations listed below are from the SlideShare site and are primarily the slides used to accompany scalability talks from around the world. Many of them outline the problems that various companies have encountered during their non-linear growth phases and how they've solved them by scaling their systems.

January 04, 2009

Note: This post relies heavily on one’s general understanding of database sharding strategies. If you’re unsure about any particular points within this post, I recommend you read my previous post, Scalable Strategies Primer: Database Sharding, before continuing.

Introduction

While working with Memcache the other night, it dawned on me that its usage as a distributed caching mechanism was really just one of many ways to use it. There are, in fact, many alternative usages one could find for Memcache by realizing what Memcache really is at its core – a simple distributed hash-table – and that is an important point worthy of further discussion.

To be clear, when I say “simple”, by no means am I implying that Memcache’s implementation is simple, just that the ideas behind it are. Think about that for a minute. What else could we use a simple distributed hash-table for, besides caching? How about using it as an alternative to the traditional shard lookup method we used in our Master Index Lookup scalability strategy, discussed previously here?

Implementation

Now, I’m a particularly intense supporter of the “use the correct tool for the job” and “think outside the box” mantras. I strongly believe that databases are not the end-all-be-all of persistent data storage solutions. And to that end, I’m proposing that we can utilize Memcache as a highly scalable, highly available, in-memory, database shard indexing solution.

The following is a short list of requirements that any distributed shard indexing solution must take into account:

It should be highly available. The failure of any single node should, ideally, not result in any data being unavailable for an index lookup, and at worst, not result in a majority percentage of data being unavailable for an index lookup. Minimizing the impact of failed instances is very important.

It should be highly scalable. That is, we should be able to add linear capacity to our indices by adding instances of our solution.

It should display characteristics that promote easy indexing of data. For example, it should loosely represent a data structure that lends itself well to the concept of retrieving data by a single unique value (i.e. an array, list, hash-table, etc.).

The actual location lookup for a piece of data should be high performance when being executed on each instance, or node, of our solution.

In order to give some context to our Memcache shard index solution, let’s describe a plausible use-case:

“Our system has a single database server. That database server is over utilized, nearly to the point of failure, by intermittent but long running queries. The primary issue is the sheer amount of growth the dataset is experiencing. To put this in more specific terms, we are working with approximately 5 million users added non-linearly over the last 2 years. Each user is made up of a user table, a user_profile table, a user_blog table, and a user_blog_entry table. Each row within the user_profile table is related to a single row within the user table. Each row within the user_blog table is related to a single row within the user table. Each row within the user_blog_entry table is related to a single row within the user_blog table.”

Normally, we might apply the Master Index Lookup strategy all the way through. If we’re using Memcache, however, we would substitute for the “Defining the Index Shard schema” section the following alternative method of implementing a shard index lookup.

First, because Memcache is a key/value data structure, we need to think about the differences between creating an index lookup with a database versus a key/value data structure. It’s important to understand that with a key/value lookup, we’re making a trade-off between structured data and simplicity. Because we can’t create a schema of any real sort with a key/value data structure, it would help if we went with a method of managing keys that supports a convention over configuration approach. To that end, if we can guarantee that any key we enter into Memcache is unique, regardless of the value contained within its entry, we will have successfully denormalized our indexed data, and therefore also indirectly simplified working with Memcache. One way to accomplish this is by using Globally Unique IDentifiers (GUIDs) as keys for all entries.
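As a tiny sketch of that GUID-key convention, here is Python’s `uuid4` standing in for whatever GUID generator the platform provides (the point being that uniqueness comes from the key itself, not from any schema):

```python
import uuid

# uuid4 values are unique for all practical purposes, so any entry can be
# written into the index without first coordinating a key schema -- the
# convention (GUID keys) replaces configuration (a schema).
keys = {str(uuid.uuid4()) for _ in range(100_000)}
assert len(keys) == 100_000  # no collisions observed in practice
```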

Now let’s define our serialized index data to be stored in Memcache. For this, I’m choosing to stay as non-platform specific as possible. How exactly the data is serialized is fairly irrelevant to the concept, so whether we serialize into XML, JSON, or bytes, it should require no significant alterations on the techniques presented here.

For our shard information, we could use an object something like the following:

Key: shardId (GUID)

Value: shard (Serialized Shard Object)

shardId (GUID)

connectionString (String)

status (Byte)

createdDate (Date and Time)

And for our user index information, we could use an object something like the following:

Key: userId (GUID)

Value: user (Serialized User Object)

userId (GUID)

shardId (GUID)

username (String)

password (String)

createdDate (Date and Time)

And for our user index information, indexed by username, we could use an object something like the following:

Key: username (String)

Value: user (Serialized User Object)

userId (GUID)

shardId (GUID)

username (String)

password (String)

createdDate (Date and Time)

Lastly, for our Active Insert User Shard status, we could use the following:

Key: activeInsertUserShardId (GUID)

Value: activeInsertUserShard (Serialized Shard Object)

activeInsertUserShardId (GUID)

shardId (GUID)

lastModifiedDate (Date and Time)
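Sketching the shard and user records above in Python, with JSON as the serialization format and a plain dict standing in for a Memcache client (field values here are illustrative; a real deployment would use whatever serializer and client library the platform provides):

```python
import json
import uuid

cache = {}  # stand-in for a Memcache client's get/set interface

shard_id = str(uuid.uuid4())
user_id = str(uuid.uuid4())

# Shard record, keyed by shardId.
cache[shard_id] = json.dumps({
    "shardId": shard_id,
    "connectionString": "Server=shard01;Database=users",  # illustrative
    "status": 1,
    "createdDate": "2009-01-04T00:00:00Z",
})

# User record, stored twice -- once under userId, once under username --
# so that a lookup by either key is a single get.
user = json.dumps({
    "userId": user_id,
    "shardId": shard_id,
    "username": "jdoe",
    "password": "<hashed>",
    "createdDate": "2009-01-04T00:00:00Z",
})
cache[user_id] = user
cache["jdoe"] = user

# Locating a user's shard by username is then two gets:
# username -> user record, user record's shardId -> shard record.
shard = json.loads(cache[json.loads(cache["jdoe"])["shardId"]])
assert "connectionString" in shard
```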

Now that we’ve defined the objects we’ll be using to index our sharded database user data, we can begin to think about how we might load this data into Memcache, use it to locate users, and generally manage all user indexing operations. Common CRUD operations would be executed using the following procedures:

Connect to the Memcache Index using an application configuration-level connection setting.

Insert the new user’s lookup information as a new user object, using the shardId from the retrieved shard table and the userId from the Domain Shard’s user table, for the new location of the user’s information.

Keep in mind that in order to index a user by more than just their userId, as we have above, we are storing the same set of user values twice. A workaround is to store another entry with the username as the key and the userId as the value, and then retrieve the user value by userId. Whether or not you implement this workaround is really a trade-off between memory and round-trip requests. Whereas the method I’ve used in the above procedures uses more memory, it also reduces the number of round-trips required to retrieve a user by their username or userId. Optimize this for your specific bottleneck and/or business requirements.
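The indirection variant of that trade-off might look like the following sketch (again with a dict standing in for the Memcache client): one extra round-trip on username lookups, but only one copy of the user record is held in memory.

```python
import json
import uuid

cache = {}  # stand-in for a Memcache client
user_id = str(uuid.uuid4())
user = json.dumps({
    "userId": user_id,
    "shardId": str(uuid.uuid4()),
    "username": "jdoe",
})

# Store the full record once, plus a small pointer entry for the username.
cache[user_id] = user
cache["jdoe"] = user_id  # username -> userId pointer

# A lookup by username now costs two gets instead of one,
# in exchange for not duplicating the user record.
found = json.loads(cache[cache["jdoe"]])
assert found["userId"] == user_id
```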

Loading Index Data

Loading index data from our database data source into Memcache is an important piece of the overall in-memory indexing process. This can be achieved simply: query the data to be indexed on each shard, assign the appropriate data from each shard’s user rows (userId, shardId, etc.) to new Memcache entries, set each index entry to have no expiration date, and ensure that enough servers are running Memcache nodes that Memcache’s memory reclamation isn’t triggered. It would be wise to have more servers running, and therefore more available memory, than is minimally necessary, so that there is room for index growth.

Inevitably we’ll need to add more servers (as a result of needing more memory), and a reload of index data will be necessary, given how Memcache places new entries on each server – hashing each entry key among the currently available Memcache nodes. Depending on the size of the dataset to be indexed, this loading process can become time-consuming, and frequent reloading of index data should be avoided if possible.
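The reload cost comes from how keys map to nodes. With naive modulo placement (a simplified model; real Memcache clients vary, and some use consistent hashing precisely to soften this), growing the node count remaps the large majority of keys:

```python
import hashlib

def node_for(key: str, node_count: int) -> int:
    # Simplified modulo placement over a stable hash of the key.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % node_count

keys = [f"user-{i}" for i in range(10_000)]

# Count keys whose owning node changes when going from 4 to 5 nodes.
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
print(f"{moved / len(keys):.0%} of keys remap when growing 4 -> 5 nodes")
```

With modulo placement roughly 1 - 1/5 = 80% of keys move, which is why a full index reload is needed after adding capacity.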

Weaknesses

Memcache can only use as much memory to store entries on a node as there is available memory on the system that is running it. In the case that Memcache has reached the memory limit of the system that it’s running on, and it’s attempting to add another entry, it will automatically reclaim memory by discarding expired entries or the oldest entries within its data structure. Normally, this behavior is appropriate given Memcache’s purpose – to cache data from a data source that will fill it as necessary. Unfortunately for us, this behavior isn’t ideal. We can, however, work around it with a little clever thinking.

Because the data we’re storing in Memcache is index data, we can make a few assumptions about the type and length of the data we’ll be storing. Almost all of the data within our index will be of a data type that has a preset maximum size. For example, when storing a userId in Memcache with a database source type of varchar(36), we can assume that every entry will have a predictable maximum key size (36 chars x 2 bytes = 72 bytes). Armed with this knowledge, we can apply the same thought process to the maximum data being stored in each entry’s value. If we know how much memory each key/value pair will utilize, we can put application-level constraints in place so that we store only as many key/value pairs as the node system can fit within its available memory, making memory reclamation unnecessary.
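That capacity arithmetic can be made concrete. In this sketch, the value-size ceiling and per-entry overhead figures are assumptions for illustration; you would measure your actual serialized record size and your Memcache build’s bookkeeping overhead.

```python
# Key size follows from the post: a varchar(36) GUID at 2 bytes per char.
KEY_BYTES = 36 * 2            # = 72 bytes
VALUE_BYTES = 512             # assumed ceiling for a serialized user record
OVERHEAD_BYTES = 64           # assumed per-entry Memcache bookkeeping

ENTRY_BYTES = KEY_BYTES + VALUE_BYTES + OVERHEAD_BYTES

# Application-level cap: never store more entries than fit in the
# memory dedicated to Memcache on this node, so reclamation never fires.
node_memory = 2 * 1024**3     # e.g. 2 GB dedicated to Memcache
max_entries = node_memory // ENTRY_BYTES
print(f"each entry: {ENTRY_BYTES} bytes; cap per node: {max_entries:,} entries")
```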

Wrap-Up

In this post, I’ve briefly presented an alternative usage for Memcache that exemplifies how a simple distributed data structure can be used for more than just caching data. Memcache allows us to build a simple, fast, and powerful indexing system that complements a database sharding architecture, while simplifying the overall system.

It’s worth noting that Memcache is just one example of a distributed key/value system that can be used as an indexing mechanism. It might even be in one’s best interest to develop a distributed key/value system of their own, or even fork Memcache, to remove some of the weaknesses of the current version of Memcache when being used in the manner described in this post (guaranteeing data won’t be removed due to lack of space, etc).

As always, I’m interested in hearing others’ proposals for using distributed data structures, besides databases, to manage system data in new and innovative ways. Please comment below with any thoughts along these lines.