January 22, 2009

I follow a certain philosophy when developing system architectures. I assume that very few systems will ever exist in a consistent form for more than a short period of time. What constitutes a “short period of time” differs depending on the specifics of each system, but in an effort to quantify it, I generally find that it falls somewhere between a week and a month.

The driving forces behind the need for an ever-changing architecture are largely based on business requirements. This is a side effect of the reality that software development, in most cases, plays a supporting role within the business unit it serves. As business requirements (i.e. additional features, new products, etc.) pour forth, it is the developer’s job to evolve their software system to accommodate these requirements and provide a software-based solution to whatever problems lie ahead.

Given that many businesses can be identified as having the above characteristics, I can now begin to explain why I believe that Heterogeneous System Architectures hold a significant advantage over Homogeneous System Architectures, in many distributed system cases.

An Experiential Use Case

I work daily on a mobile platform that mixes web applications, server applications, and XMPP servers together into an architecture that delivers on the business requirements, as set forth by our business development team and clients. This mobile platform is fairly distributed, with web applications on one set of machines, server applications on another, and XMPP servers on yet another. These three components interconnect, with the XMPP protocol and/or a database being the glue that binds them.

It’s important to point out that each individual web application and server application interfaces with XMPP or a database only, and that each is independent of the other – we’ve essentially developed a “shared-nothing” architecture. This loosely coupled, non-interdependent XMPP and database architecture has served our purposes well so far.

We currently run on the Windows platform and utilize the .NET Framework to build both our web applications and our server applications. The only exception to using Microsoft technologies within our platform is our use of a Java based XMPP server (Jive Software’s OpenFire). When we researched XMPP servers at the time, we found no exceptional .NET based XMPP servers with the support and maturity we needed, which led us to our currently implemented Java based XMPP server solution. Given the nature of XMPP, and the openness of its protocol, we were able to find a couple of solid .NET based XMPP client SDKs to work with, which made interfacing with the Java XMPP servers trivial.

Just to drive my previous points home: turning to a Heterogeneous System Architecture wasn’t intentional; it was a side-effect of working with the 3rd-party software available at the time. This decision essentially turned our all-Microsoft Homogeneous System into a Heterogeneous System overnight.

One added benefit of this shared-nothing architecture is that, because it is based on an open protocol (XMPP) and uses databases with wide support across a variety of platforms and development languages, we are now able to mix and match platforms, differing technology stacks (e.g. LAMP), and development environments with minimal effort and with near zero incompatibility issues.

Differing Platforms, Technology Stacks, and Development Environments

The ability to work with differing platforms, technology stacks, and development environments, all in an effort to find the best tool for the job (based on budgetary requirements, differing developer skill sets, the availability of certain 3rd-party software, etc.), gives us a real advantage: our options have been vastly diversified.

Now, of course, the desire to utilize any solution should never depend solely on its availability as an option. Just because we can mix and match doesn’t mean that it is in our best interest to do so. Our decision to derive value from this ability depends on a number of factors. For example, we’re a startup division within a larger company; our roots are startup based, and so is our budget. In going over some of this year’s software purchases, we’ve had to include Windows Server purchases/upgrades, SQL Server licenses, and Visual Studio development environment licenses. Those are just a few examples of common large ticket items that need to be considered when working within a startup’s budgetary constraints.

The flipside to the above software purchases is that our platform, although currently built on a majority of Microsoft software, is in no way dependent on that software keeping its current majority share. We could, for example, move to MySQL and eliminate the high cost of SQL Server licenses, or we could move to a LAMP stack and decrease the costs associated with running our web servers on Windows Server. And again with the LAMP stack, we could opt to run a mature PHP based MVC framework instead of working with the ASP.NET MVC Framework, which requires us to work through various Beta/RC1 bugs.

The above is just one budgetary example of where having the option to become more heterogeneous with our architecture may be in our best interest. Obviously, using one set of tools versus another has its trade-offs, and no one platform, technology stack, or development environment is the best in all situations. But, given how dynamic our business requirements have been over the last few years, I’m excited to say that we have a whole slew of options available to us, largely as a result of having a Heterogeneous System Architecture.

Heterogeneous System Impact on Human Resources

I strongly believe that the single greatest detractor from using a Heterogeneous approach to your system’s architecture is the human resource factor. When architecting a system, it is important to keep in mind the skill sets of your developers, the quantity of those developers, and the ability for those developers to work within a Heterogeneous environment.

It is far more common than not to have a team of developers aligned with a particular technology. For example, a Microsoft shop often employs developers that follow the Microsoft path of technologies (.NET, SQL Server, Windows), whereas an Open Source shop more commonly has a crew of developers well versed in Linux/Unix based technologies (Red Hat Linux, Apache, MySQL, PHP). This type of specialization amongst developers makes a Heterogeneous Architecture somewhat harder to hire for.

Along these same lines, another unfortunate commonality between developers aligned with a particular platform is that they are often reluctant to learn and evolve their skill set outside of their realm of specialization. The ideal developer, who simply loves all technology and has the capacity to apply their general development knowledge across many platforms quickly and successfully, is a rare breed indeed. Understandably so, as it takes a very dedicated and smart individual to become specialized in more than one platform. Equally difficult as finding that willingness within developers is finding developers with more than an introductory level of experience across multiple platforms.

So, in cases where your system is going the Heterogeneous route, ideally you would try to hire those particularly “smart”, technology-passionate developers; developers who have no allegiance to any particular technology stack. These are the guys who just want to use the best tool for the job, and have fun doing it; these guys are the Rockstars of the development world. Now, it’s easy to say “just hire the Rockstars”, but this usually comes at an increased salary cost, and it is therefore more difficult to fill positions of this type. The question then becomes, based on your particular business requirements, your current team, and the general financial outlook of your business, whether it is to your benefit to go with a Heterogeneous, and therefore extremely flexible, system, or a more easily manageable and more thoroughly supported Homogeneous system.

Getting personal for a moment, I prefer the flexibility of a Heterogeneous system, especially when it comes to working within a distributed architecture. The freedom and sheer number of options available to solve the often more complex software problems associated with distributed systems make it worth the more demanding developer requirements.

Wrap-Up

I attempt to weave into my posts the common theme that all implementations involve trade-offs. I try to drive home the reality that there is no single “best” way to accomplish any sufficiently complex task. That every strategy you implement and every decision you make has consequences in addition to its benefits is one of the more important fundamentals of building robust system architectures.

I also try to illustrate these concepts using real life examples and I hope that they help you to better relate their applicability to your situation, however different it may be. Please comment below on what your experiences have been with Heterogeneous versus Homogeneous Architectures. I’m always interested in hearing how others tackle this subject, and what rules they’ve placed as guidelines for the implementations of their systems, one way or the other.

January 11, 2009

No long write-ups this week, just a short list of some great resources that I've found very inspirational and thought-provoking. I've broken these resources up into two lists: Blogs and Presentations.

Blogs

The blogs listed below are ones that I subscribe to and are filled with some great posts about capacity planning, scalability problems and solutions, and distributed system information. Each blog is authored by exceptionally smart people and many of them have significant experience building production-level scalable systems.

Presentations

The presentations listed below are from the SlideShare site and are primarily the slides used to accompany scalability talks from around the world. Many of them outline the problems that various companies have encountered during their non-linear growth phases and how they've solved them by scaling their systems.

January 04, 2009

Note: This post relies heavily on a general understanding of database sharding strategies. If you’re unsure about any particular points within this post, I recommend you read my previous post, Scalability Strategies Primer: Database Sharding, before continuing.

Introduction

While working with Memcache the other night, it dawned on me that its usage as a distributed caching mechanism is really just one of many ways to use it. There are, in fact, many alternative uses one could find for Memcache by realizing what Memcache really is at its core – a simple distributed hash-table – and that point is worthy of further discussion.

To be clear, when I say “simple”, by no means am I implying that Memcache’s implementation is simple, just that the ideas behind it are. Think about that for a minute. What else could we use a simple distributed hash-table for, besides caching? How about using it as an alternative to the traditional shard lookup method we used in our Master Index Lookup scalability strategy, discussed previously here?

Implementation

Now, I’m a particularly intense supporter of the “use the correct tool for the job” and “think outside the box” mantras. I strongly believe that databases are not the end-all-be-all of persistent data storage solutions. And to that end, I’m proposing that we can utilize Memcache as a highly scalable, highly available, in-memory, database shard indexing solution.

The following is a short list of requirements that any distributed shard indexing solution must take into account:

It should be highly available. The failure of any single node should, ideally, not result in any data being unavailable for an index lookup, and at worst, not result in a majority percentage of data being unavailable for an index lookup. Minimizing the impact of failed instances is very important.

It should be highly scalable. That is, we should be able to add linear capacity to our indices by adding instances of our solution.

It should display characteristics that promote easy indexing of data. For example, it should loosely represent a data structure that lends itself well to the concept of retrieving data by a single unique value (i.e. an array, list, hash-table, etc.).

The actual location lookup for a piece of data should be high performance when being executed on each instance, or node, of our solution.

In order to give some context to our Memcache shard index solution, let’s describe a plausible use-case:

“Our system has a single database server. That database server is over utilized, nearly to the point of failure, by intermittent but long running queries. The primary issue is the sheer amount of growth the dataset is experiencing. To put this in more specific terms, we are working with approximately 5 million users added non-linearly over the last 2 years. Each user is made up of a user table, a user_profile table, a user_blog table, and a user_blog_entry table. Each row within the user_profile table is related to a single row within the user table. Each row within the user_blog table is related to a single row within the user table. Each row within the user_blog_entry table is related to a single row within the user_blog table.”

Normally, we might apply the Master Index Lookup strategy straight through. If we’re using Memcache, however, we would replace the “Defining the Index Shard schema” section with the following alternative method of implementing a shard index lookup.

First, because Memcache is a key/value data structure, we need to think about the differences between creating an index lookup with a database versus a key/value data structure. It’s important to understand that with a key/value lookup, we’re making a trade-off between structured data and simplicity. Because we can’t create a schema of any real sort with a key/value data structure, it would help if we went with a method of managing keys that supports a convention over configuration approach. To that end, if we can guarantee that any key we enter into Memcache is unique, regardless of the value contained within its entry, we will have successfully denormalized our indexed data, and therefore also indirectly simplified working with Memcache. One way to accomplish this is by using Globally Unique IDentifiers (GUIDs) as keys for all entries.

Now let’s define our serialized index data to be stored in Memcache. For this, I’m choosing to stay as non-platform specific as possible. How exactly the data is serialized is fairly irrelevant to the concept, so whether we serialize into XML, JSON, or bytes, it should require no significant alterations on the techniques presented here.

For our shard information, we could use an object something like the following:

Key: shardId (GUID)

Value: shard (Serialized Shard Object)

shardId (GUID)

connectionString (String)

status (Byte)

createdDate (Date and Time)

And for our user index information, we could use an object something like the following:

Key: userId (GUID)

Value: user (Serialized User Object)

userId (GUID)

shardId (GUID)

username (String)

password (String)

createdDate (Date and Time)

And for our user index information, indexed by username, we could use an object something like the following:

Key: username (String)

Value: user (Serialized User Object)

userId (GUID)

shardId (GUID)

username (String)

password (String)

createdDate (Date and Time)

Lastly, for our Active Insert User Shard status, we could use the following:

Key: activeInsertUserShardId (GUID)

Value: activeInsertUserShard (Serialized Shard Object)

activeInsertUserShardId (GUID)

shardId (GUID)

lastModifiedDate (Date and Time)
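
To make the serialized form concrete, here is what a single user index entry might look like if we chose JSON as the serialization format (all values below are purely illustrative):

{
    "userId": "b4f1c2d3-8e9a-4b6c-a1d2-3e4f5a6b7c8d",
    "shardId": "c5a2d3e4-9f0b-4c7d-b2e3-4f5a6b7c8d9e",
    "username": "jdoe",
    "password": "<hashed password>",
    "createdDate": "2008-12-01 14:32:00"
}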

Now that we’ve defined the objects we’ll be using to index our sharded database user data, we can begin to think about how we might load this data into Memcache, use it to locate users, and generally manage all user indexing operations. Common CRUD operations would be executed using the following procedures:

Connect to the Memcache Index using an application configuration-level connection setting.

Insert the new user’s lookup information as a new user object, using the shardId from the retrieved shard table and the userId from the Domain Shard’s user table, for the new location of the user’s information.

Keep in mind that in order to index a user by more than just their userId, as we have above, we are storing the same set of user values twice. A proper workaround for this is to store another entry with the username as the key and the userId as the value, and then retrieve the user value by userId. Whether or not you implement this workaround is really a trade-off between memory and round-trip requests. While the method I’ve used in the above procedures uses more memory, it also reduces the number of round-trips required to retrieve a user by their username or userId. Optimize this for your specific bottleneck and/or business requirements.
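
To illustrate the lookup path this gives us, here is a minimal PHP sketch assuming the pecl Memcache client and JSON serialization; the host names and key values are illustrative:

<?php
// A sketch of finding a user's shard by username via Memcache, using the approach
// described above (the pecl Memcache client, host names, and values are illustrative).
$memcache = new Memcache();
$memcache->addServer('index1.example.com', 11211);
$memcache->addServer('index2.example.com', 11211);

// One round-trip: the username key holds the full serialized user object,
// which in turn carries the shardId of the Domain Shard storing this user.
$user = json_decode($memcache->get('jdoe'), true);

// A second lookup translates that shardId into a connection string.
$shard = json_decode($memcache->get($user['shardId']), true);

// $shard['connectionString'] now tells us which Domain Shard holds jdoe's data.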

Loading Index Data

Loading index data from our database data source into Memcache is an important piece of the overall in-memory indexing process. This can be achieved simply by querying the data to be indexed on each shard, assigning the appropriate data from each shard’s user rows (userId, shardId, etc.) to new Memcache entries, setting each index entry to have no expiration date, and ensuring that there are enough servers running a Memcache node that Memcache’s memory reclamation isn’t triggered. It would be wise to have more servers running, and therefore more available memory, than is minimally necessary, so that there is room for index growth.
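
A minimal PHP sketch of that loading process might look like the following; the pecl Memcache client, mysqli usage, hosts, credentials, and shardId-to-host mapping are all illustrative assumptions:

<?php
// A sketch of the index-loading process: walk each Domain Shard, pull its user rows,
// and write one never-expiring Memcache entry per key.
$memcache = new Memcache();
$memcache->addServer('index1.example.com', 11211);

$shardHosts = array(
    'c5a2d3e4-9f0b-4c7d-b2e3-4f5a6b7c8d9e' => 'shard1.example.com',
    'd6b3e4f5-0a1c-4d8e-b3f4-5a6b7c8d9e0f' => 'shard2.example.com',
);

foreach ($shardHosts as $shardId => $host) {
    $db     = new mysqli($host, 'appuser', 'secret', 'users');
    $result = $db->query('SELECT userId, username, password, createdDate FROM user');

    while ($row = $result->fetch_assoc()) {
        $row['shardId'] = $shardId;                                  // record where this user lives
        $memcache->set($row['userId'], json_encode($row), 0, 0);     // expire = 0: never expire
        $memcache->set($row['username'], json_encode($row), 0, 0);   // secondary index by username
    }
    $db->close();
}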

Inevitably we’ll need to add more servers (as a result of needing more memory) and a reloading of index data will be necessary, given how Memcache places new entries on each server – hashing each entry key among the currently available Memcache nodes. Depending on the size of the dataset to be indexed, this loading process can become time consuming and frequent reloading of index data should be avoided if possible.

Weaknesses

Memcache can only use as much memory to store entries on a node as there is available memory on the system that is running it. In the case that Memcache has reached the memory limit of the system that it’s running on, and it’s attempting to add another entry, it will automatically reclaim memory by discarding expired entries or the oldest entries within its data structure. Normally, this behavior is appropriate given Memcache’s purpose – to cache data from a data source that will fill it as necessary. Unfortunately for us, this behavior isn’t ideal. We can, however, work around it with a little clever thinking.

Because the data we’re storing in Memcache is index data, we can make a few assumptions as to the type and length of data that we’ll be storing. Almost all of the data within our index will be of a data type that has a preset maximum size. For example, storing a userId in Memcache, with a database source type of varchar(36), we can assume that every entry will have a predictable maximum key size (36 chars x 2 bytes = 72 bytes). Armed with this knowledge, we can apply the same thought process to the maximum data being stored in each entry’s value. If we know how much memory each key/value pair will utilize, we can put application level constraints in place so that we store only as many key/value pairs as the node system can fit within its available memory, therefore making memory reclamation unnecessary.
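
As a rough, back-of-the-envelope sketch of that application-level constraint (the value size and node memory figures below are purely illustrative assumptions, not measured Memcache overhead):

<?php
// Estimate the per-entry footprint, then cap the number of entries assigned to a node.
$keyBytes      = 36 * 2;       // 36-char GUID key at 2 bytes per char = 72 bytes
$valueBytes    = 512;          // assumed ceiling for one serialized user index entry
$perEntryBytes = $keyBytes + $valueBytes;

$nodeMemoryBytes   = 2 * 1024 * 1024 * 1024;   // memory dedicated to this Memcache node (2 GB)
$maxEntriesPerNode = (int) floor($nodeMemoryBytes / $perEntryBytes);

echo $maxEntriesPerNode;   // stop assigning entries to this node once this count is reached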

Wrap-Up

In this post, I’ve briefly presented an alternative usage for Memcache that exemplifies how a simple distributed data structure can be used for more than just the caching of data. Memcache allows us to build a simple, fast, and powerful indexing system that complements a database sharding architecture, while simplifying the overall system.

It’s worth noting that Memcache is just one example of a distributed key/value system that can be used as an indexing mechanism. It might even be in one’s best interest to develop a distributed key/value system of one’s own, or even to fork Memcache, to remove some of the weaknesses of the current version of Memcache when it is used in the manner described in this post (guaranteeing data won’t be removed due to lack of space, etc.).

As always, I’m interested in hearing others’ proposals for using distributed data structures, besides databases, to manage system data in new and innovative ways. Please comment below with any thoughts along these lines.

December 28, 2008

I recently received a great comment, in the form of a question, on the Scalability Strategies Primer: Database Sharding post. I thought the answer to Yuriy’s comment deserved its own post so that I could get as detailed as possible in discussing what is in actuality a question with multiple valid answers. I also believe Yuriy’s question is one that is fairly commonly asked and not always clearly explained, or is at least overlooked in favor of some of scalability’s more obvious issues.

The question asked was:

“… Let's say I have shards that store proverbial invoices. Since each shard is a stand-alone db instance, it will have "invoices" table with auto_incremental PK. Each shard auto increments the PK value on its own and so values are not unique among shards. In such a case I cannot expose an invoice by its id. www.example.com/invoices/123 could return multiple invoices from multiple shards. Am I missing anything here? ...”

The key (no pun intended) to understanding how to organize your dataset’s data is to think of your shards not as individual databases, but collectively as one large singular database. Just as in a normal single server database setup, where you have a unique key for each row within a table, each row key within each individual shard must be unique to the whole dataset partitioned across all shards.

There are a few different ways we can accomplish uniqueness of row keys across a shard cluster. Each has its pros and cons, and the one you choose should be specific to the problems you’re trying to solve.

Multiple Column Keys

A simple method of guaranteeing uniqueness between shards is to include a shard identifier within the key. This is accomplished via a normal multiple column key. To effectively look up keys, you will need to use a Master Index Lookup, as previously discussed here. For example, using the invoice scenario as an example and assuming we’re using MySQL for our RDBMS, we might specify the Invoice table as follows:
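
A minimal MySQL sketch of what such a table might look like follows; the payload columns (customerId, amount, createdDate) and their types are illustrative assumptions:

CREATE TABLE `invoices` (
    `shardId` INT UNSIGNED NOT NULL,                   -- identifies the shard a row lives on
    `invoiceId` INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- still auto-increments per shard
    `customerId` INT UNSIGNED NOT NULL,
    `amount` DECIMAL(10,2) NOT NULL,
    `createdDate` DATETIME NOT NULL,
    PRIMARY KEY (`shardId`,`invoiceId`)
) ENGINE=MyISAM;  -- MyISAM permits AUTO_INCREMENT on the second column of a composite key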

Pay special attention to the PRIMARY KEY (`shardId`,`invoiceId`) line in the above CREATE TABLE statement. The shardId and invoiceId columns combine to create the table’s key. This method of setting our key enables us to continue to use auto-incrementing columns the way we’re used to using them in a single server instance, in that the only change needed to prepare the table for sharding is the addition of the shardId column to the table’s key. This method’s primary advantage is that it is intuitively simple.

Referring to an invoice requires both a shardId and an invoiceId in this scenario, because it is possible, even likely, that there will be invoiceIds with the same number on different shards, each of which reflects a different invoice. Including the shardId in a lookup is necessary to discern between different invoices with matching invoiceIds.

So as to not go too far off the mark in relation to the original question, here is an example of what a REST or MVC request might look like for viewing an invoice:

www.example.com/invoices/1/123 or www.example.com/invoices/[shardId]/[invoiceId]

Essentially, an invoice is now referenced by two numbers instead of just one. We could even combine them with a hyphen to create a simpler, more intuitive form of the identifier. For example:

www.example.com/invoices/1-123 or www.example.com/invoices/[shardId]-[invoiceId]

Note, however, that an invoice’s key is transitory in that if you move an invoice to another shard as the result of a hotspot or necessary reorganization of data, the invoice key will change. If this is a problem, it may be wise to keep a history of moved data or even use one of the alternate keying methods below.

Also, moving data requires a re-identification process to occur on each row that is moved off its origin shard. Because each invoice key includes a shardId, when an invoice is moved, the shardId value must change. In addition to the shardId changing, the invoiceId must also change, because each invoiceId is unique only to the shard it currently resides on as a result of the auto-incrementing invoiceId values. Also, please note that changing these key values requires updating them on any other data that references a moved invoice by its key. Automating this process with a script or small application is highly recommended.

Globally Unique Identifier Keys

Another fairly simple method of guaranteeing uniqueness between shards is to use a GUID, or Globally Unique Identifier, as your table row’s key. This is often accomplished by generating a GUID within the application layer and applying it to a row, thereby guaranteeing uniqueness. To effectively look up keys, you will need to use a Master Index Lookup, as previously discussed here.

Using the invoice scenario as an example, and assuming we’re using MySQL for our RDBMS, we might specify the Invoice table as follows:
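
A minimal MySQL sketch of the GUID-keyed version might look like the following; again, the payload columns and their types are illustrative assumptions:

CREATE TABLE `invoices` (
    `invoiceId` VARCHAR(36) NOT NULL,    -- GUID generated in the application layer
    `customerId` INT UNSIGNED NOT NULL,
    `amount` DECIMAL(10,2) NOT NULL,
    `createdDate` DATETIME NOT NULL,
    PRIMARY KEY (`invoiceId`)
);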

In the CREATE TABLE statement above, we’ve set our primary key, invoiceId, to be a varchar(36). Within that varchar(36), we can simply store a 36 character GUID. (A quick note about optimizing the storage of GUIDs: you can remove the hyphens and compress a GUID down to a binary(16) without much difficulty. The only side effect is that it requires you to code encoding and decoding capabilities into your application so that you can easily work with the compressed GUIDs – I might write more about this and provide a code sample at some later time.)

This method changes the identifier of an invoice into a less intuitive format. GUIDs aren’t pretty, and they are hard for people to remember and refer to, given their length and randomness. That being said, we’ve just solved many of the issues involved in scaling a dataset.

In relation to the original question, here is an example of what a REST or MVC request might look like for viewing an invoice:

www.example.com/invoices/03c7408e-0936-4e19-8dc7-323613e8618c or www.example.com/invoices/[invoiceId]

In this case, we are granted the ability to refer to an invoice using one identifier, which should in theory never change, no matter what shard it is located on or how many times it has been moved. An added bonus to using a GUID as a unique key is that if our dataset needs to scale beyond our original intentions, we are using a key that is globally unique. This means that we can move data out of our cluster and onto another cluster, or split our original shard cluster into many geographically diverse clusters, and never have to change our invoice identifiers.

Moving data is especially simple using this method of guaranteeing uniqueness. We don’t have to re-identify the data when moving it to another shard, because the shard identifier isn’t part of the invoice table, nor is the invoiceId unique only to the shard it currently resides on.

Auto-Increment Offset and Increment

I’ve left this method of guaranteeing uniqueness across shards for last on purpose. The usage of this method is RDBMS vendor specific, and not all RDBMSs support this functionality out of the box. For ones that don’t, my recommendation is that you use one of the alternative methods discussed above.

Essentially, this method grants us the ability to use the normally intuitive auto-increment property on our invoice key column, without having to add any further shard specific data to our rows. This is accomplished by using an auto-increment offset and an auto-increment increment. To effectively look up keys, this method requires that you use an Algorithmic Lookup (hashing the key against each shard’s auto-increment offset and increment values), as previously discussed here.

An auto-increment offset value determines the starting point for an auto-incrementing row value. For example, if you set an auto-increment offset value to 3, an auto-incrementing column would assign row values starting at 3 (i.e. 3, 4, 5…) instead of the default 1 (i.e. 1, 2, 3…).

An auto-increment increment controls the interval between successive auto-incrementing row values. For example, if you set an auto-increment increment value to 3, an auto-incrementing column would assign row values in intervals of 3 (i.e. 1, 4, 7…) instead of the default 1 (i.e. 1, 2, 3…).

By combining an auto-increment offset and an auto-increment increment, assigning varied values of each to each shard, we can create uniqueness across our shard cluster. I’ll provide examples for this method using MySQL 5.0+, as it is an RDBMS that supports this functionality out of the box.
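
A minimal MySQL sketch of the table for this method might look like the following; the payload columns and their types are illustrative assumptions:

CREATE TABLE `invoices` (
    `invoiceId` INT(11) NOT NULL AUTO_INCREMENT,  -- offset/increment settings keep this unique across shards
    `customerId` INT UNSIGNED NOT NULL,
    `amount` DECIMAL(10,2) NOT NULL,
    `createdDate` DATETIME NOT NULL,
    PRIMARY KEY (`invoiceId`)
);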

In the CREATE TABLE statement above, we’ve set our primary key, invoiceId, to be a normal int(11) AUTO_INCREMENT column. You’ll notice that this method preserves the simplicity of a single integer invoice identifier, commonly used throughout invoicing systems for its easy-to-remember, easy-to-work-with format.

In order for us to make use of this method, we need to set MySQL’s AUTO_INCREMENT_INCREMENT and AUTO_INCREMENT_OFFSET values in the my.cnf file. Let’s assume that we have four shards in our shard cluster. We would set the following values on the following shards:

Shard 1
AUTO_INCREMENT_OFFSET = 1
AUTO_INCREMENT_INCREMENT = 4

Shard 2
AUTO_INCREMENT_OFFSET = 2
AUTO_INCREMENT_INCREMENT = 4

Shard 3
AUTO_INCREMENT_OFFSET = 3
AUTO_INCREMENT_INCREMENT = 4

Shard 4
AUTO_INCREMENT_OFFSET = 4
AUTO_INCREMENT_INCREMENT = 4

The following is what would transpire in the event of multiple rows being inserted into each shard:

Shard 1
invoiceId = 1
invoiceId = 5

Shard 2
invoiceId = 2
invoiceId = 6

Shard 3
invoiceId = 3
invoiceId = 7

Shard 4
invoiceId = 4
invoiceId = 8

As you can see, no shard has an invoiceId value the same as any other shard. Setting the AUTO_INCREMENT_OFFSET and AUTO_INCREMENT_INCREMENT values appropriately guarantees non-collision of invoiceId values and therefore uniqueness of the key data.
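
To make the Algorithmic Lookup for this method concrete, here is a minimal PHP sketch assuming exactly the four-shard configuration above; the helper function name is hypothetical:

<?php
// Map an invoiceId back to the shard whose AUTO_INCREMENT_OFFSET produced it.
function shardForInvoiceId($invoiceId, $shardCount = 4)
{
    $offset = $invoiceId % $shardCount;              // matches the shard's AUTO_INCREMENT_OFFSET
    return ($offset === 0) ? $shardCount : $offset;  // a remainder of 0 maps to the last shard
}

echo shardForInvoiceId(7);  // 3 – invoiceId 7 was inserted on Shard 3
echo shardForInvoiceId(8);  // 4 – invoiceId 8 was inserted on Shard 4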

In relation to the original question, here is an example of what a REST or MVC request might look like for viewing an invoice:

www.example.com/invoices/12345 or www.example.com/invoices/[invoiceId]

In this case, we are granted the ability to refer to an invoice using one number that is as elegant and short as possible. Note, that as in the Multiple Column Keys method, an invoice’s key is transitory in that if you move an invoice to another shard as the result of a hotspot or necessary reorganization of data, the invoice key will change. If this is a problem, it may be wise to keep a history of moved data or use the Globally Unique Identifier Keys method instead.

Also, moving data requires a re-identification process to occur on each row that is moved off its origin shard. Because each invoice key includes an auto-incrementing invoiceId that is incremented according to the shard’s AUTO_INCREMENT_OFFSET and AUTO_INCREMENT_INCREMENT values, when an invoice is moved, those values will differ on the destination shard, and so the invoiceId must change. Again, please note that changing these key values requires updating them on any other data that references a moved invoice by its key. Automating this process with a script or small application is the way to go here too.

Wrap-Up

This post covered some of the more common techniques used to guarantee uniqueness of a dataset dimension’s keys across a shard cluster, and in some cases globally. Each technique’s usefulness is entirely dependent on the specific characteristics of your dataset, and it’s ultimately up to you to choose the most appropriate solution. I hope I’ve successfully outlined the basics of these techniques in a way that makes that choice fairly straightforward.

December 21, 2008

Successful software design is all about trade-offs. In the typical (if there is such a thing) distributed system, recognizing the importance of trade-offs within the design of your architecture is integral to the success of your system. Despite this reality, I see, time and time again, developers choosing a particular solution based on an ill-placed belief in their solution as a “silver bullet”, a solution that conquers all, despite the inevitable occurrence of changing requirements. Regardless of the reasons behind this phenomenon, I’d like to outline a few of the methods I use to ensure that I’m making good scalable decisions without losing sight of the trade-offs that accompany them. I’d also like to compile (pun intended) the issues at hand by formulating a simple theorem that we can use to describe this oft-occurring situation.

First though, let me be clear about what I’m trying to accomplish by coining a theorem to describe the above actuality. My thoughts on this are hardly original. In fact, I would say that these issues are practically standardized as issues that are expected to be successfully overcome by the average system architect. What I am proposing, however, is that we can take a number of descriptive core issues and sum them up into a single, easy to remember, theorem.

Now, let me explain why trade-offs should be taken into careful consideration when designing your distributed system. For every problem that a system is designed to solve, there is a limitation added to the system. By this I mean that if your system solves one issue, the code and/or data put in place to do so may require us to write additional code and/or add additional data to the system in order to remove its unintended effects. But by adding that extra code and/or data, we’ve effectively required ourselves to write further code or add further data to then remove the consequences of our previous additions; and on and on, until, as you can see, we’ve hit a seemingly infinite loop of adding in order to remove.

To better explain this, take for example a system that is designed to solve a real-time messaging based problem. The architecture that has been decided on is an XMPP based architecture. As such, we’ve basically decided to abstract away the database in such a way as to nearly render it unavailable for useful reporting purposes. So, by choosing our XMPP based architecture, we’ve made it more difficult to report on the events occurring within our system, even though we’ve developed a system that very specifically solves the initial problem.

Now let’s assume that this system grows large and the inevitable happens: we need to monitor our system and report on its usage. We must now add code that monitors every event and dumps it into a database. This database does what we need, but then the inevitable happens again: we need to add a feature to the system that involves the addition of a separate component, one that runs as a web service and connects to the system via XMPP. So we write code for this feature, and we write code for the monitoring portion of the system so that it can successfully monitor our newly added web service. As you can see, the monitoring portion of our system is a portion that must also scale, both with the system’s usage and with the system’s addition of features. That is our limitation: we now need to code additionally for the monitoring system.

The above is a simplified example, but I hope it helps you understand the issues involved in developing a system – there are trade-offs involved in every aspect of the design and development process. Now this is not to say that every problem we solve in code will have an equal impact on the usage of our system for our purposes. Sometimes we’ll add a feature that produces a limitation that we indeed never come into contact with. That is ideal, and we attempt to write code that produces these kinds of limitations all the time. What I’m trying to do is discuss some of the broader design issues that aren’t of that ilk.

Breaking this concept down into its core issues is crucial to fully understanding the need to recognize trade-offs within your system. The core issues are as follows:

Horizontally scaling a system complicates the system’s continued maintenance as more components are necessarily added to the system, therefore increasing the overall system’s size.

As a result of a system’s increase in size, monitoring becomes more difficult, and eventually becomes a sub-system in and of itself.

As a result of a system’s increase in size, the addition of features becomes a more time intensive and difficult task. This is due to the necessity of taking into account the instantaneous volume of additional data that results from rolling out a feature amongst a system’s existing components. This is especially the case for a system with a large number of concurrent clients or users.

As a result of a system’s increase in size, refactoring or reorganizing a system’s components and/or their data involves the removal and subsequent re-launch of the affected components and/or their data.

I’d like to hammer home that a decrease in overall system flexibility will likely be the result of the multiple issues that come with horizontally scaling your system, and that this theorem is meant to state that reality in an easily remembered form. This theorem also encompasses what is, in fact, the foundation on which most other distributed system issues rest.

I hope that this brief brain-dump has helped to define an easy to remember theorem that you can use to keep the realities of developing distributed systems in mind. I’ve been using it for a while now and I’ve found that it’s helped to keep my head planted squarely on a few separate occasions.

December 16, 2008

Introduction

We’ve entered an era of application development that increasingly requires scalable solutions to everyday problems. By and large, the methods used to scale an application to massive proportions are fairly standardized. With a modicum of time spent researching database scalability methods, one can get the gist of the strategies involved in solving many common scalability problems. The actual implementations of such methods, however, have nary seen the light of day beyond the doors of the companies and developers that develop them.

This article is a primer, intended to shine some much needed light on the logical, process oriented implementations of these scalability strategies in the form of a broad introduction. More specifically, my intention is to teach the majority of these implementations by example. Examples will be formatted as imagined scenarios modeling real life scalability challenges.

The technology focus of this article is intended to be as non-specific as possible. However, in scenarios or example implementations where it is necessary to define a specific software stack, I have chosen the typical LAMP (Linux, Apache2, MySQL, PHP) web development stack.

This article is intended for intermediate developers familiar with the intricacies of a data access layer and its role within the typical web application.

Brief Author’s Note

Most industry standard development practices, frameworks, and design pattern fundamentals have inherent in their intended usage the requirement that the system on which they are running will be of sufficient performance to provide a consistently reliable environment. Now there are obvious exceptions to this horrifyingly broad generalization, such as any distributed design pattern for one, but I argue that the above holds true for much of what the software industry’s development is focused upon. I have found, through a combination of personal experience and fairly extensive research, that most of what is taught to and learned by the typical developer is based on the absence of scale. As a result of this, the solutions to many problems of scale are often counter-intuitive.

Before delving further into the subject at hand, I encourage the reader to keep an open mind and attempt to overcome what might at first be an initial sense that you are a bit like Alice in Wonderland – that what you are reading is counter to much of what you’ve previously been taught or learned. Keeping in mind that there is no silver bullet solution, no solution that can solve all your scalability problems without requiring that you compromise on other portions of your application, will help to hammer home the reasoning behind many of the theories presented here. Simply stated, the generally accepted development solutions to most business requirements break down at scale and must be approached from an often counter-intuitively different angle.

Database Scalability

It’s important to have an understanding of some of the basic scalability terminology used throughout this article to explain the more complex concepts and exacting scalability solutions presented later.

Let’s start by defining some of that basic terminology, and discerning some of it from their often misused forms. The addition of brackets to some of the definitions below is used to specify a more database centric form of the word or phrase.

Performance – The speed at which a system is able to service a single transaction.

Scalability – The ability of a system to maintain its quality of service as overall system load increases.

Scaling Vertically – To scale a system vertically is to increase the scalability of the system by adding resources to each individual component within the system.

Scaling Horizontally – To scale a system horizontally is to increase the scalability of the system via the addition of individual components to the system.

Vertical Dataset Partitioning – To partition a [database table] dataset vertically is to partition a dataset by field [or table column].

System – A system, as it is being used in the context of this article, is a general purpose term used to describe a collection of components that are interconnected in both software and hardware, and that combine to create an integrated whole.

Shard – A single instance of a database, often run on a commodity server, which houses a subset of the system’s total dataset. Each shard within a system typically shares the same topically based schema, though the actual data on each shard is unique to that shard.

Did you notice the differentiation made between the definition of Performance and Scalability? Both words are often used interchangeably to mean one or the other, but they are in fact starkly different in definition. I thought it wise to make a concerted effort to bring these two terms to attention, as it is important to realize that this article attempts to tackle the problems specific to rolling out a solution to the Scalability of a system and not the Performance of a system. Also, please note how each definition below Scalability uses one or more previously defined terms within its own definition. We’re building upon concepts here in a way that I hope is easy to follow.

I’d like to get even more specific about the kinds of scalability problems we’re trying to solve with the strategies presented here. This article focuses on scaling horizontally by partitioning horizontally (by row). You may alternatively decide to scale horizontally by partitioning vertically (by column) as necessary, but in my experience, you end up needing to partition horizontally at some point along the path to massive scalability regardless of your decision to partition vertically. Because of this inevitability, I’ve chosen to focus my efforts on an article that details the horizontal partitioning of a database schema.

Partitioning Your Dataset

To better explain the reasoning behind partitioning your dataset, we need to first answer the following fundamental two-part question:

Why are we partitioning our dataset and how does it help us to achieve scalability of our application?

It is difficult, if not nearly impossible, to massively scale your data layer when the data is limited to residing on a single server. Whether the limiting factor is a hardware cost issue, or you’ve simply equipped your server with the highest performing hardware possible, we ultimately find ourselves up against a wall – there are inherent limitations to what is currently possible by vertically scaling our hardware; it is a simple matter of fact. If we instead take our dataset schema, duplicate it onto multiple servers (shards), and split (or partition) the data on the original single server into equal portions distributed amongst our new set of servers (shards), we can parallelize our query load across them. Adding more servers (shards) to our existing set of servers results in near limitless scalability potential. The theory behind how this works is simple, but the execution is a fair bit more complex thanks to a series of scale-specific issues one encounters along the way.

We’ll continue with the concepts behind partitioning the actual dataset. There are numerous dimensions, or data subsets (often topically organized tables), by which you may choose to partition your dataset, and the method to follow is highly dependent on your particular problem domain’s requirements. This process is often referred to as denormalizing your schema. At small scale, having a normalized schema is ideal, but at large scale, having your data denormalized makes scalability much simpler – we’ll get into why that is shortly. Now I’m going to be completely honest here – how one decides to partition their dataset involves some serious thought, so let’s break down a few of the more commonly encountered, real-life dataset scenarios and walk through each by example.

I’ve divided each dataset scenario first by its lookup type, then by its partitioning method. I’ve done this for a couple of reasons. First, the way in which your web application will find where a specific piece of data (commonly a table row) resides within a partitioned system is different depending on the partitioning method used. Secondly, each lookup type has several trade-offs associated with its use, and therefore I think it appropriate that each lookup type be discussed at length.

Algorithmic Lookups

To perform an algorithmic lookup is to use a rule to retrieve the shard location of the data based upon a simple algorithm seeded by some portion of the data itself.

Date and Time Partitioning

Partitioning by a dataset entry’s relevant date and time is one of the simplest ways to perform an algorithmic lookup. Often the relevant date and time that you are performing the lookup on is the date and time of the entry’s insertion into the dataset, though this is not always the case.

Take, for example, a common billing scenario involving invoices. If you are working with a large dataset that is overpowering a single server within your system, you may opt to partition that server’s dataset over multiple servers, or shards. In doing so, you relieve much of the contention on the origin server’s resources. In our example, we’re presented with the following problem:

“Our system has a single database server. That database component is over utilized, nearly to the point of failure, by intermittent but long running queries. The primary issue is the sheer amount of data within the dataset. To put this in more specific terms, we are working with approximately 25 million invoices added evenly over a period of 5 years.”

Given the above scenario, we can say the following:

Low Volume of Reads/Writes – we aren’t working with a high volume of reads or writes

Resource Intensive Reads – although the database reads are intermittent, they are long running and therefore each utilize an intensive amount of both CPU and Disk resources

High Volume of Data – we’re working with a large volume of data

Even Distribution of Data – our dataset is large, but it’s evenly distributed over 5 years at about 5 million entries per year

Now that we’ve broken this problem down by its core characteristics, we can begin to apply our sharding solution!

In this scenario, we’re fortunate enough to be working with a predictable dataset. Based upon historical precedents, our dataset will continue to grow over time at a rate of 5 million invoices per year. The typical commodity server is capable of handling a table of 5 million records without much difficulty (the actual number of records is arbitrary in the context of this example). Because of this, we can easily partition our dataset of 25 million records by relocating each year’s worth of records onto a separate database shard. If we apply this strategy we end up with five commodity servers, or shards, each running a separate instance of MySQL. Within each instance of MySQL is a single Invoices table containing one year’s worth of invoices or five million records.

Each year going forward, we’ll add another shard to our system, and on it will reside a single instance of MySQL with its table containing that year’s worth of records. Assuming we continue to experience linear growth and fairly evenly distributed and consistent query execution across all shards, we’ve now positioned ourselves for long-term scalability.

Determining where to insert a new row is simple. Assuming that each shard is labeled starting with zero, the following pseudo-code expression will produce the shard to insert into.
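
A minimal sketch of that expression, in PHP, assuming shards are labeled 0 through 4 and that shard 0 holds the first year of invoices (the variable names are illustrative):

<?php
// Insert-shard calculation for the Date and Time Partitioning strategy.
$firstShardYear = 2004;
$invoiceYear    = (int) date('Y');                 // the new invoice's relevant year
$shardId        = $invoiceYear - $firstShardYear;  // e.g. a 2008 invoice lands on shard 4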

The above expression can also be used to determine which shard we need to query in order to retrieve the data that we need. If we need to retrieve invoices from a year ago, we would use the following pseudocode.
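
And a minimal sketch of the corresponding retrieval expression, using the same illustrative labeling as above:

<?php
// Query-shard calculation for invoices from a year ago.
$firstShardYear = 2004;
$targetYear     = (int) date('Y') - 1;             // "a year ago"
$shardId        = $targetYear - $firstShardYear;   // query this shard for last year's invoices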

This strategy is fairly straightforward and lends itself well to parallelization of queries that span multiple shards. For example, if we need to query records over two years, we would split the query up into two separate queries, each running on the shard that contains the specified year’s data. We’re now running these queries in parallel and taking advantage of the increase in resources that our new system provides us – running a query on two shards is effectively twice as fast as running nearly the exact same query on a single database server.

Row Count Partitioning

Another common alternative to Date and Time partitioning is Row Count Partitioning. Row Count Partitioning utilizes the same basic concepts as Date and Time partitioning, with the following exceptions:

We aren’t limited to a single type of dataset field as our criteria. We can simply use a common, albeit simplistic, benchmark for determining dataset size – the dataset’s row count.

We don’t need to worry about the rate at which our dataset grows, because the row count is a measurable result of our dataset’s growth itself.

Let’s apply the Row Count Partitioning strategy to the same scenario we used in the Date and Time Partitioning strategy. I’ve restated the scenario here so you don’t have to navigate backwards:

“Our system has a single database server. That database component is over utilized, nearly to the point of failure, by intermittent but long running queries. The primary issue is the sheer amount of data within the dataset. To put this in more specific terms, we are working with approximately 25 million invoices added evenly over a period of 5 years, though this can’t be guaranteed.”

Given the above scenario, we can say the following:

Low Volume of Reads/Writes – we aren’t working with a high volume of reads or writes

Resource Intensive Reads – although the database reads are intermittent, they are long running and therefore each utilize an intensive amount of both CPU and Disk resources

High Volume of Data – we’re working with a large volume of data

Even Distribution of Data – our dataset is large, but it’s evenly distributed over 5 years at about 5 million entries per year, though this can’t be guaranteed

We no longer have to worry about whether we’re working with a predictable dataset. Based upon historical precedents, our dataset will continue to grow over time at a rate of 5 million invoices per year, but this now can’t be guaranteed. The typical commodity server is capable of handling a table of 5 million records without much difficulty. Because of this, we can easily partition our dataset of 25 million records by relocating five million records at a time onto a separate database shard. If we apply this strategy we end up with five shards, each running a separate instance of MySQL. Within each instance of MySQL is a single table containing five million records worth of invoices.

Going forward, each time the active shard fills to its five million record threshold, we’ll add another shard to our system, and on it will reside a single instance of MySQL with a table that will hold the next five million records. Now that we don’t need to worry about whether our dataset maintains its apparent even distribution, we are positioned for long-term scalability, even in the face of non-linear growth.

Determining where to insert a new row is simple. We simply set a configuration variable within our application indicating which of our shards is currently taking on additional rows. As the active shard reaches its maximum row count threshold, our application simply needs to increment (or otherwise change) the configuration variable to point to the next shard ready for use.
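
A minimal PHP sketch of that active-shard bookkeeping might look like the following; the configuration mechanism, row threshold, and getRowCount() helper are illustrative assumptions:

<?php
// Roll the application's "active insert shard" forward once it fills up.
$activeShardId = 3;          // application configuration: the shard currently taking inserts
$maxRowCount   = 5000000;    // rows allowed per shard before rolling over

function getRowCount($shardId)
{
    // e.g. run SELECT COUNT(*) FROM invoices against the given shard
    return 5000000;          // hard-coded for the sake of the sketch
}

if (getRowCount($activeShardId) >= $maxRowCount) {
    $activeShardId++;        // point the configuration at the next shard ready for use
}

// New invoice rows are inserted into shard $activeShardId.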

Now here’s where the flexibility we’ve received as a function of working with a row count tends to make querying the dataset slightly more difficult. The above rule cannot also be used to determine which server we need to query in order to retrieve the data that we need. If we need to retrieve invoices from a year ago, we need to query all of our shards, since we don’t have the luxury of being able to determine where the records that fell within the last year were inserted.

This isn’t necessarily as bad as it sounds. What we’ve done here is build a highly parallelized version of our existing system. Now, instead of one server querying through our dataset to find what we need, we have five shards doing the work in parallel, over smaller subsets of the same dataset. We would then take the results from each shard and combine them in our application layer, resulting in a significantly faster method of querying a large dataset.
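
A minimal PHP sketch of that scatter-gather approach might look like the following; for simplicity it issues the per-shard queries sequentially, whereas a production implementation would run them in parallel (hosts, credentials, and schema are illustrative assumptions):

<?php
// Query each shard for its subset and merge the results in the application layer.
$shardHosts = array('shard1.example.com', 'shard2.example.com', 'shard3.example.com',
                    'shard4.example.com', 'shard5.example.com');

$invoices = array();
foreach ($shardHosts as $host) {
    $db     = new mysqli($host, 'appuser', 'secret', 'billing');
    $result = $db->query("SELECT * FROM invoices WHERE createdDate >= '2008-01-01'");
    while ($row = $result->fetch_assoc()) {
        $invoices[] = $row;   // combine each shard's subset into one result set
    }
    $db->close();
}
// $invoices now holds the merged result from every shard.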

It’s important to understand that each method of partitioning your dataset is better suited to certain problems, as opposed to others. In the case of the Row Count Partitioning strategy, we’ve explored a method of partitioning that is best used in conjunction with a large dataset that requires queries to be run often across a majority of the data within our dataset.

This strategy is fairly straightforward and lends itself well to parallelization of queries that span multiple shards. For example, if we need to query records over all five years, we would split the query up into five separate queries, each running on a separate shard. We’re now running these queries in parallel and taking advantage of the increase in resources that our new system provides us – running a query on five shards is effectively five times as fast as running nearly the exact same query on a single server.

Master Index Lookups

Using a Master Index Lookup isn’t nearly as simple or intuitive as an Algorithmic Lookup. It is, however, necessary for scenarios that require partitioning along a dimension that is specific to a particular dataset field that is not inherently consecutively ordered.

Domain Partitioning

One of the more complex partitioning methods, Domain Partitioning provides the highest amount of flexibility in choosing one’s partitioned dimension.

Determining what dimension to partition on is highly dependent on the nature of one’s database schema. However, in order to properly illustrate how you might do so, I will use a typical user-centric database pseudo-schema as our working example. I’m hoping that this example will clearly describe the process involved in determining a dimension to partition in such a way that you will be able to infer how to alter this strategy for your particular application.

The key to understanding how to partition a database schema is to first determine if you are working with a schema that can be appropriately partitioned in the first place. A schema that is well suited for partitioning along a particular dimension will have the following properties:

The schema’s tables are naturally topically segregated as opposed to topically cross-related. For example, most tables relate to a User table, and contain rows that are particular to a single User row.

The schema’s most significant topical tables, if split, are unlikely to require many joins across them. For example, if we are partitioning along the User table, it should be unlikely that we will need to join multiple Users that each reside on a different shard.

The satellite tables related to a single hierarchical table, and the data they contain, are small enough in size, and will remain small enough, that they will not become a performance bottleneck.

Please note that although the above properties make for a best case scenario dataset, failure of your dataset to have all of them does not imply that your dataset cannot be partitioned using the Domain Partitioning strategy.

Domain Partitioning involves two main server types: an Index Shard and one or more Domain Shards.

The Index Shard serves as the primary lookup mechanism. At a minimum, it will contain a single index of a dataset’s partitioned dimensions and their Domain Shard locations. The flow is straightforward: an application utilizing the Domain Partitioning strategy first connects to and queries the Index Shard to retrieve the shard location of the requested data. Once that location is returned, the application makes another connection, this time to the Domain Shard storing the requested data, and the original query is executed there, returning its result to the application.

You may be wondering whether there is a high amount of overhead involved in always connecting to the Index Shard and querying it to determine where the second, data-retrieving query should be executed. There is some overhead, but it is often insignificant in comparison to the increase in overall system performance granted by this strategy’s parallelization. In most scenarios the Index Shard contains a relatively small amount of data, which means the database tables holding that lookup data are likely to be stored entirely in memory. Couple that with the low latencies one can achieve on a typical gigabit LAN, and the connection pooling in use within most applications, and we can safely assume that the Index Shard will not become a major bottleneck within the system (have fun cutting down this statement in the comments, I already know it’s coming :)
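In PHP, that two-step lookup might look roughly like the sketch below, using the shard and user_lookup tables defined later in this section; the connection details and credentials are assumptions.

<?php
// Sketch of the Index Shard lookup followed by the Domain Shard query.
// Connection strings and credentials are illustrative assumptions.

$username = 'some_requested_user';

$indexShard = new PDO('mysql:host=index-shard;dbname=lookup', 'appuser', 'secret');

// Step 1: ask the Index Shard where the requested user's data lives.
$stmt = $indexShard->prepare(
    'SELECT u.userId, s.connectionString
       FROM user_lookup u
       JOIN shard s ON s.shardId = u.shardId
      WHERE u.username = ?'
);
$stmt->execute(array($username));
$location = $stmt->fetch(PDO::FETCH_ASSOC);

// Step 2: connect to that Domain Shard and execute the original query there.
$domainShard = new PDO($location['connectionString'], 'appuser', 'secret');
$stmt = $domainShard->prepare('SELECT * FROM user WHERE userId = ?');
$stmt->execute(array($location['userId']));
$user = $stmt->fetch(PDO::FETCH_ASSOC);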

Let’s proceed through our user-centric scenario by detailing it out further:

“Our system has a single database server. That database server is over utilized, nearly to the point of failure, by intermittent but long running queries. The primary issue is the sheer amount of growth the dataset is experiencing. To put this in more specific terms, we are working with approximately 5 million users added non-linearly over the last 2 years. Each user is made up of a user table, a user_profile table, a user_blog table, and a user_blog_entry table. Each row within the user_profile table is related to a single row within the user table. Each row within the user_blog table is related to a single row within the user table. Each row within the user_blog_entry table is related to a single row within the user_blog table.”

Given the above scenario, let’s start by choosing our partitioning dimension. We know that all of the data within our dataset is directly or indirectly topically related to a single entity – a user. We also know that users being added to the system are the primary source of our system’s enormous growth. Since users are defined by their user row within the user table, we can confidently say that by partitioning the user table across multiple shards, we can distribute load in a manageable manner. The user table, and its rows, is our chosen partition dimension.

Defining the Index Shard schema

Now that we know our partition dimension is a user, we need a unique identifier for each user so that we can use it to look up that user. This is easily done by incorporating a userId key column within the user table. Having the userId key is important, but it isn’t very useful for looking up a user in and of itself. Most users are queried using their username. We can take this a step further and say that authenticating a user is nearly as important as identifying them by their username. Authenticating a user requires both a username and a password. Armed with this common knowledge, we can begin to specify the schema of our Index Shard’s first couple of tables.

The most obvious required table within the Index Shard lookup database is a shard table. This table has the following columns:

shardId – used as the unique identifier for a shard

connectionString – a connection string used to connect to a shard

status – used to signify a shard’s status as online, offline, or in active insert mode

createdDate – the date the shard was added to the system, used for historical purposes

The second table we need to define is the user_lookup table. This table has the following columns:

userId – used to uniquely identify a user. Is the same userId used to identify a user within the user table located on each shard.

shardId – used to uniquely identify the current shard that a user is located on

username – the username corresponding with the userId. Is the same username used within the user table located on each shard.

password – the password corresponding with the userId. Is the same password used within the user table located on each shard.

Now let’s run through the four common SQL command types and apply their usage to our sharded system, making sure to note the differences in procedures used. These command types are the SELECT, INSERT, UPDATE, and DELETE.

Insert Scenario: A new user signs up.

Connect to the Index Shard using an application configuration-level connection string.

Query the shard table and retrieve the shard row that represents the current shard with a status of active insert mode.

Disconnect from the Index Shard.

Connect to the Domain Shard as specified by the previously retrieved shard row’s connectionString.

Insert the user’s sign-up information into the user table, retrieving the userId as a result.

Connect to the Index Shard using an application configuration-level connection string.

Insert the new user’s lookup information into the user_lookup table, using the shardId from the previously retrieved shard row and the userId from the Domain Shard’s user table, to record the location of the user’s information.

Disconnect from the Index Shard.
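Expressed as a PHP sketch, the sign-up scenario above might look like this. For brevity the Index Shard connection is reused rather than closed and reopened, and the DSN, credentials, example values, and the 'active' status value are assumptions.

<?php
// Sketch of the INSERT scenario: signing up a new user across the
// Index Shard and the currently active Domain Shard.

$username = 'newuser';
$password = 'hashed-password';

$indexShard = new PDO('mysql:host=index-shard;dbname=lookup', 'appuser', 'secret');

// Retrieve the shard currently in active insert mode.
$shard = $indexShard
    ->query("SELECT shardId, connectionString FROM shard WHERE status = 'active'")
    ->fetch(PDO::FETCH_ASSOC);

// Insert the new user's sign-up information on that Domain Shard.
$domainShard = new PDO($shard['connectionString'], 'appuser', 'secret');
$stmt = $domainShard->prepare('INSERT INTO user (username, password) VALUES (?, ?)');
$stmt->execute(array($username, $password));
$userId = $domainShard->lastInsertId();

// Record the user's location on the Index Shard for future lookups.
$stmt = $indexShard->prepare(
    'INSERT INTO user_lookup (userId, shardId, username, password) VALUES (?, ?, ?, ?)'
);
$stmt->execute(array($userId, $shard['shardId'], $username, $password));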

Update Scenario: A user changes their password.

Connect to the Index Shard using an application configuration-level connection string.

Query the user_lookup table, using the username and current password of the user, and retrieve the user_lookup row that contains the user’s lookup information.

Query the shard table and retrieve the shard row that represents the user’s Domain Shard location.

Update the retrieved user_lookup row, changing the password field to the user’s new password.

Disconnect from the Index Shard.

Connect to the Domain Shard as specified by the previously retrieved shard row’s connectionString.

Update the user’s user row, found using the userId as retrieved earlier from the user_lookup table, changing the password field to the user’s new password.

Disconnect from the Domain Shard.

Delete Scenario: A user closes their account.

Connect to the Index Shard using an application configuration-level connection string.

Query the user_lookup table, using the username and current password of the user, and retrieve the user_lookup row that contains the user’s lookup information, saving it for later use.

Query the shard table and retrieve the shard row that represents the user’s Domain Shard location.

Delete the user’s user_lookup row, using the user’s username and password, or userId to find the user’s row.

Disconnect from the Index Shard.

Connect to the Domain Shard as specified by the previously retrieved shard row’s connectionString.

Delete the user’s user row, found using the userId as retrieved earlier from the user_lookup table.

Disconnect from the Domain Shard.

Select Scenario: A system visitor views a user’s profile page.

Connect to the Index Shard using an application configuration-level connection string.

Query the user_lookup table, using the username or userId of the user, and retrieve the user_lookup row that contains the user’s lookup information.

Query the shard table and retrieve the shard row that represents the user’s Domain Shard location.

Disconnect from the Index Shard.

Connect to the Domain Shard as specified by the previously retrieved shard row’s connectionString.

Query the user and user_profile tables to retrieve the user’s profile information, using the previously retrieved userId.

Our system has now been successfully Domain Partitioned. As a result, we have effectively enabled our system to scale along the User dimension far into the future. By adding additional shards to our system and updating the necessary shard table rows, we can increase the capacity of our system with little effort.

Dealing with Hotspots and Rebalancing Shards

It is nearly inevitable that, at some point within the lifecycle of any of the previously discussed scalability strategies, we will encounter a scenario that requires a redistribution of our data in order to prevent continued performance degradation within a single overloaded shard.

Examples of real-life scenarios where this kind of behavior may be observed are numerous. Using our user-centric theme, one might notice a hotspot originating around a specific user that is receiving a higher than usual query hit-rate due to some level of user importance. That user may be a prominent blogger or some other similarly popular individual. As a result of their prominence, we can imagine their user profile receiving multiple orders of magnitude more pageviews than the average user’s profile.

Building upon our Domain Partitioning strategy, and by enumerating a further set of specifics, we can find a suitable solution to the hotspot problem. Let’s imagine that each row within our user table contains a unique identifier, a Globally Unique Identifier (GUID), whose uniqueness is guaranteed throughout the system. If we also require that related tables, such as the user_profile, user_blog, and user_blog_entry tables, contain GUIDs serving as their keys, we can move all of the data associated with a single user to any of our Domain Shards by migrating the data and updating the user_lookup table. These migrations could be automated, but they are likely to be rare enough that a manual migration may be preferred, given that it is sometimes difficult to estimate the resources that need to be dedicated to a hotspot user or group of users.

A migration of a single hotspot user would then proceed roughly as follows:

Connect to the Index Shard and retrieve the user’s user_lookup row, along with the shard rows for both the user’s current (old) Domain Shard and the destination (new) Domain Shard.

Connect to the new Domain Shard as specified by the destination shard row’s connectionString, and insert copies of the user’s rows from the user, user_profile, user_blog, and user_blog_entry tables.

Update the user’s user_lookup row on the Index Shard, changing its shardId to that of the new Domain Shard, and disconnect from the Index Shard.

Connect to the user’s old Domain Shard as specified by the user’s originally retrieved shard row’s connectionString.

Delete the user’s now-stale rows, found using the userId as retrieved earlier from the user_lookup table.

Disconnect from the Domain Shard.
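A PHP sketch of such a manual migration follows. The copyUserRows() helper, connection strings, and shard identifiers are hypothetical and stand in for whatever copy mechanism (application code, mysqldump, and so on) you prefer.

<?php
// Sketch of migrating a hotspot user to a less loaded Domain Shard.
// copyUserRows() is a hypothetical helper; all connection details are assumptions.

$indexShard = new PDO('mysql:host=index-shard;dbname=lookup', 'appuser', 'secret');
$oldShard   = new PDO('mysql:host=shard2;dbname=app', 'appuser', 'secret');
$newShard   = new PDO('mysql:host=shard7;dbname=app', 'appuser', 'secret');

$userGuid   = 'f47ac10b-58cc-4372-a567-0e02b2c3d479'; // the hotspot user's GUID
$newShardId = 7;

// 1. Copy the user's rows (user, user_profile, user_blog, user_blog_entry)
//    to the new Domain Shard. GUID keys mean no re-keying is required.
copyUserRows($oldShard, $newShard, $userGuid);

// 2. Repoint the Index Shard at the user's new location.
$stmt = $indexShard->prepare('UPDATE user_lookup SET shardId = ? WHERE userId = ?');
$stmt->execute(array($newShardId, $userGuid));

// 3. Remove the now-stale copy (and its satellite rows) from the old Domain Shard.
$stmt = $oldShard->prepare('DELETE FROM user WHERE userId = ?');
$stmt->execute(array($userGuid));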

It’s important to realize that dimension entity migration using the above method is a storage and performance trade-off. A GUID is a relatively large key, typically 16 bytes when stored in binary form or 36 bytes when stored as text. Working with keys of this size makes migrations simple, but can have a negative effect on queries running through the dataset. Be sure to weigh your options and plan accordingly based upon your particular business requirements.

Adding High-Availability

Up until this point, this article has focused solely on solutions to database scalability challenges. However, if one is to design a database solution that is both highly scalable and highly available, there are a few other points that need to be applied to the design principles presented so far. Please note that the subject of high-availability is worthy of its own article; this section is merely a brief introduction to one of the more common failover strategies.

Providing high-availability, or failover functionality to ensure continued operation of our system, is a concept that is independent of the scalability of a system. Therefore, we can add high-availability to any scenario previously used to explain the application of our scalability strategies. There are also numerous availability strategies that one can use, similar to the differences in how we apply scalability strategies. I’ll be reviewing one of the more common availability strategies used in MySQL – master to master replication.

Master to Master Replication

Master to Master Replication involves utilizing MySQL’s replication capabilities to create redundancy of data between two servers, effectively creating a small redundant cluster.

This strategy requires two MySQL servers, which together form a single shard as defined in any one of our previously imagined scenarios. As is typical in a MySQL replication implementation, there is a master and a slave replication instance on each server within the shard’s cluster. As mentioned earlier, each shard is partitioned along a specific topical dimension. This topical dimension, implemented via a database table, must contain a unique identifier. This unique identifier must be unique not just on a single machine, but either globally or between the two servers being used to create our highly-available shard. This uniqueness is really the only data-level requirement necessary for this availability strategy.

In an effort to explain this as simply as possible, let’s assume that our unique identifier is a Globally Unique Identifier (GUID). Upon inserting a row into our primary dimension table on our first Master, our replication configuration will replicate that command to its slave, which in our case is the second Master. Because the same configuration is set up on our second Master, inserting a row into our primary dimension table there will replicate the command back to our first Master. This replication setup effectively results in a high-availability, hot-failover solution.

In this strategy, any table that we wish to replicate is given a GUID as its key so that rows inserted on either Master cannot collide with one another. In practice, one should provide every table within a shard with a GUID key so as to ensure complete global independence of the dataset contained within the shard.

Our application layer now needs to be aware of both servers within the shard cluster. In making it aware, we’ve provided our application with two endpoints to which it may connect, execute commands, and retrieve data. A common method of utilizing both servers equally is to implement a simple round-robin algorithm that spreads connections fairly evenly between the two.

The failure of a single server within the shard cluster does not result in the loss of any significant amount of data, because the replication mechanism has been continuously copying commands from each server to its partner. In order for our system to take advantage of this failover, we need to make our application layer aware of what happens when one server fails. Simply put, if one server fails within a shard cluster, our application layer needs to detect the failure and redirect its SQL commands to the remaining server while the failed server is recovered via human intervention.
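A minimal PHP sketch of that round-robin selection with failover might look like this. The DSNs and error handling are assumptions, and a production implementation would also need to mark failed servers and alert an operator.

<?php
// Sketch of round-robin connection selection across a two-master shard
// cluster, falling back to the partner server if the preferred one is down.

$clusterDsns = array(
    'mysql:host=shard1-a;dbname=app',
    'mysql:host=shard1-b;dbname=app',
);

$preferred = rand(0, 1); // crude round-robin; a rotating counter also works

$db = null;
foreach (array($preferred, 1 - $preferred) as $i) {
    try {
        $db = new PDO($clusterDsns[$i], 'appuser', 'secret');
        break; // connected successfully
    } catch (PDOException $e) {
        // This master appears to be down; try its replication partner.
    }
}

if ($db === null) {
    // Both masters in the shard cluster are unreachable.
    throw new Exception('Shard cluster unavailable');
}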

Wrap-Up

Originally this article was meant to be no more than a few pages long, coupled with a coded implementation of a simple PHP sharding framework. About halfway through, I realized that the concepts themselves were so important to a complete understanding of how a sharding implementation might work, and the subject itself so immense, that I wouldn’t be doing the reader justice by cutting their explanations short and not getting as logically detailed as is reasonable for an introduction. Hence, I rewrote much of the original article into a format that uses real-life scenarios to explain many of the more advanced partitioning strategies. I hope you, the reader, agree that I made the correct choice – please let me know if you don’t.

I feel confident saying that we’ve covered much of what I consider to be the bare conceptual basics of horizontally scaling your application’s database layer, and that was the intent – to provide a solid primer and no more. This subject is easily worthy of a book, and in fact, the absence of such literature was largely the motivation for this article. I hope to expound further on this subject over a series of future posts, but in the meantime, I’d like to hear some feedback on the format I’ve chosen. Did you find the “teach by example” theme easy to follow? Did the article provide a good base from which to build? Were there any sections that were particularly unclear? Please comment and let me know so that I may further refine my writing style for this type of technical article.

November 29, 2008

Introduction

This article details the specifics of a simple PHP based web application framework. There are many different architectural patterns that one can use to develop a web application framework, and for this framework I’ve chosen the Model-View-Controller (MVC) pattern.

This article is intended for consumption by beginner-level PHP developers who have limited experience developing web applications, or by someone with a development background in a web programming language other than PHP who is familiar with the MVC pattern. It is not my intention to define every technical term or to instruct the reader on how to set up a basic development environment. When in doubt, I strongly encourage you to look up any terms you may not be familiar with, as it will only help to better your understanding of the concepts within this article.

Before continuing, I'd like to address the ever present "Why are we reinventing the wheel?" question. The answer is this: there are lots of great MVC frameworks, but none that I’ve been able to find are simple and flexible enough for my taste, nor are they generally paired with well-documented reasoning. I hope that you’ll be able to take the concepts presented here and couple them with the code from the accompanying framework to form the basis of your future web applications. How you extend the functionality of this MVC framework is entirely up to you, and in fact, I highly encourage you to experiment and implement features specific to your development requirements.

This framework has formed the basis of a few of my small hobby-level web applications, and I’ve made a particular effort to strip down much of the functionality that I myself have added over time. This was done in an attempt to simplify the framework as much as possible. I believe that this framework, with some work, is truly not far off from a production-level MVC framework.

One important point to note is that this framework espouses convention over configuration, which is a major reason why I consider it to be so simple. Though we are going to use a few small configuration files to hold settings such as database connection information and other global web site variables, for the most part the framework’s implementation is designed with the intention that the developer using it will follow the basic conventions outlined in this article.

The thinking here is thus – simply defined coding conventions provide more flexibility and lessen the need for strict framework enforcement code. This results in a smaller, simpler framework that is easier to extend. It should be said, however, that the argument for and against convention over configuration is beyond the scope of this article.

Pre-Requisites

The development of this article has been completed on a typical LAMP (Linux, Apache, MySQL, PHP5) stack. That being said, the usage of this article’s web framework only requires that the following programs are installed and running on your system, regardless of operating system:

Apache 2

mod_rewrite Apache2 module

PHP5

MySQL 5

Again, please ensure that the above applications are running properly before continuing. Once you have double-checked that they are, you need to execute the MySQL schema dump included with the accompanying code. This schema is located in the model/db/schema.sql file and should be executed against an existing database of your choosing.

Lastly, you will need to configure the settings in the database configuration file located at model/db/dbconfig.php. The settings there are few and fairly intuitive, and there are further instructions within the file itself.
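For reference, a stripped-down dbconfig.php might contain little more than the following. The constant names here are assumptions for illustration; the actual file documents its own settings.

<?php
// model/db/dbconfig.php - illustrative sketch only; constant names are assumptions.

define('DB_HOST', 'localhost');  // MySQL server host
define('DB_NAME', 'mvc_demo');   // database the schema.sql dump was loaded into
define('DB_USER', 'web');        // MySQL user
define('DB_PASSWORD', 'secret'); // MySQL password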

MVC Concepts

Model-View-Controller (MVC) is an architectural pattern purpose built to ensure proper loose coupling of an application’s data layer, business logic layer, and presentation layer within GUI applications. Though without question, the MVC pattern can be useful for many development scenarios that include data, logic, and presentation components, we are specifically applying it to web application development for the purposes of this demonstration.

The underlying theme of this article will be to provide a simple PHP Framework that supports the following concept:

Through the separation of the business logic layer from the presentation layer, via usage of the data layer, we are able to make changes to each, independent of the other.

To help visualize this concept, consider how often we find ourselves needing to refactor a web application at some point in its lifetime. If we had used a tightly-coupled pattern, or no pattern at all, we would likely need to make changes throughout the application, as one component would depend on one or more other components. This ultimately leads to the unfortunate realization that one change requires further changes, with changes trickling down the application stack, and commonly results in a fairly poor ratio of work to productivity.

One solution to this kind of problem is that of a separation of concerns. For example, the MVC pattern separates the business logic layer from the presentation layer in such a way that making a business logic layer change has little or no impact on the presentation layer, and vice versa, if the interface between the two remains the same. In the case of the MVC pattern, that interface primarily consists of data (Model) being unidirectionally passed from the business logic layer (Controller) to the presentation layer (View).

Model

The Model contains both the actual data containers and the code used to access that data. For web applications, the Model is largely comprised of data containers based upon the application’s database schema, and the data access code is, more specifically, database access code.

I’ve chosen to use PHP5’s PDO library as the basis of the Model’s database access code. One notable feature of the PDO library is that it takes advantage of PHP5’s new Anonymous Objects. We use them to create, at runtime, our Model’s data containers such that they reflect our database’s schema exactly. The decision to use PDO in this way was based upon keeping up a transparent relationship with the database – hence following our simple, convention over configuration concept.

Many other data access layers attempt to abstract away the database in an attempt to ease database vendor lock-in or reliance. Although I believe that particular point to be valid, I have found that it is often the case that web applications are so closely coupled to the particular RDBMS in use, that it never becomes an issue that is beyond a fairly simple refactoring. The above reasons are why I’ve chosen to go with PDO as the core to this Model’s data access layer, but it is your prerogative to use any other data access pattern you wish, since ultimately, the Model is fairly loosely coupled to both the Controller and View.

The Model’s core data access layer functionality is encapsulated within the ModelBase class. This class contains the following methods, included to ease common usage patterns:

query($sql) - Executes a SQL query (Select) and returns the result as an Anonymous Object that reflects the dataset’s columns

queryFirst($sql) – The same as the query($sql) method above, except that it only returns the first row, instead of the entire dataset

Each Model database access class will inherit from the ModelBase class and implement its particular database access methods. The convention used here is that each Model class will coincide with a table within the database schema. For example, in our demo project, we have an ArticleModel class that performs all database access logic on the “article” database table. This encourages clean, easily readable, and maintainable database access code going forward. This is again, another example of using convention over configuration since you are free to do as you wish with your individual Model classes.
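To make the relationship concrete, here is a condensed sketch of how ModelBase and ArticleModel might be laid out. The accompanying code differs in its details; the PDO wiring, the constants (from the earlier dbconfig.php sketch), and the ArticleModel method names are assumptions.

<?php
// Condensed sketch of the ModelBase / ArticleModel convention.
// The real framework classes differ in detail; this only illustrates the pattern.

class ModelBase
{
    protected $db;

    public function __construct()
    {
        // Connection settings come from model/db/dbconfig.php
        $this->db = new PDO(
            'mysql:host=' . DB_HOST . ';dbname=' . DB_NAME,
            DB_USER,
            DB_PASSWORD
        );
    }

    // Executes a SELECT and returns all rows as runtime-created objects
    // whose properties mirror the dataset's columns.
    public function query($sql)
    {
        return $this->db->query($sql)->fetchAll(PDO::FETCH_OBJ);
    }

    // Same as query(), but returns only the first row of the dataset.
    public function queryFirst($sql)
    {
        $rows = $this->query($sql);
        return empty($rows) ? null : $rows[0];
    }
}

// One Model class per table: ArticleModel owns all access to the "article" table.
// These method names are illustrative assumptions.
class ArticleModel extends ModelBase
{
    public function getAll()
    {
        return $this->query('SELECT * FROM article ORDER BY articleId DESC');
    }

    public function getById($articleId)
    {
        return $this->queryFirst('SELECT * FROM article WHERE articleId = ' . (int) $articleId);
    }
}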

All database connection information used by the ModelBase class is housed within the dbconfig.php configuration file. This is one of the exceptions to the convention over configuration rule that we mentioned earlier. The decision to put database connection information within a configuration file was based upon how often this information changes during development; as a result, it should be as easy to alter as possible.

In summary, a Model within the framework has the following properties:

Data containers are runtime created Anonymous Objects with properties that reflect the returned dataset’s columns.

Data access code is encapsulated within a Model class that inherits from the ModelBase framework class. A Model class contains all data access code specific to a single table within a web application’s database schema.

The ModelBase framework class uses PDO as its basis for handling the details of database access.

View

The View contains the code necessary to express the GUI to the user of the application, under the Controller’s direction. The Controller receives the user’s input, processes that input against the application’s business logic, and then passes the output of that process to the View for display.

Generally speaking, a View will render HTML as instructed by the Controller and Action that has processed a web request’s business logic. The convention for organizing a View is as follows:

A View’s controller is defined as a folder

A View’s controller’s action is defined as a PHP file within the controller’s folder

All Views are contained within the “view” base folder

Example: view/[controller]/[action] or view/article/index

The View has the added benefit of a ViewHelper class that contains static methods that make working with common View components and HTML easy. For example, it contains a “createLinkUrl($controller, $action)” method that will auto-generate a hyperlink URL by passing in a Controller name and Action name.
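A sketch of what a view file such as view/article/index.php might contain follows. The $articles variable is assumed to have been placed in the Controller’s viewData collection, and the $article->title property is an assumption about the demo schema; only the ViewHelper::createLinkUrl() call comes from the description above.

<?php /* view/article/index.php - illustrative sketch; $articles and title are assumptions */ ?>
<h1>Articles</h1>
<ul>
<?php foreach ($articles as $article): ?>
    <li>
        <a href="<?php echo ViewHelper::createLinkUrl('article', 'view'); ?>">
            <?php echo $article->title; ?>
        </a>
    </li>
<?php endforeach; ?>
</ul>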

In summary, a View within the framework has the following properties:

Is comprised of a Controller folder and Action file, within the View base folder e.g. view/article/index.php

Is passed data from the Controller via the “viewData” collection which is accessed directly by variable

Is rendered upon a Controller’s “render($view)” or “renderWithTemplate($view, $template)” method call

Can optionally be contained within a template

Can optionally include a template navigation section

Controller

The Controller contains an application’s business logic, specific to a particular input. In the case of a web application, the input is usually a web request based upon a URL. A web request is normally mapped to a Controller and an Action. An Action is exactly what it sounds like: an action to be performed upon a Controller. An Action within a Controller is defined as any “public” method within a Controller class. Two examples of a URL mapped to a Controller and Action are as follows:

http://www.someurl.com/[controller]/[action] - mapped to the Controller and Action

http://www.someurl.com/article/index - mapped to the ArticleController Controller class and the index() Action method
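A rough sketch of how a front controller could perform that mapping is shown below. The actual framework’s dispatch code may differ, and the default controller and action names are assumptions.

<?php
// Sketch of mapping a rewritten URL such as /article/index onto the
// ArticleController class and its index() Action method.

// mod_rewrite is assumed to pass the original path through to this script.
$path     = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/');
$segments = explode('/', $path);

$controllerName = empty($segments[0]) ? 'article' : $segments[0]; // default controller (assumption)
$actionName     = isset($segments[1]) && $segments[1] !== '' ? $segments[1] : 'index';

// Convention: "article" maps to ArticleController, "index" maps to its index() method.
$className  = ucfirst($controllerName) . 'Controller';
$controller = new $className();
$controller->$actionName();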

The concrete implementation of a Controller, in our case, is a class that inherits from the ControllerBase class as defined within the framework. Within ControllerBase are two methods designed to help you render a View, with or without a template, and pass it some data. These two methods are as follows:

render($view)

renderWithTemplate($view, $template)

These methods are meant to be called at the end of a Controller’s Action method.

In our demonstration project, I have created an ArticleController Controller class. This class handles all Article related activity. This activity is reflected in the following Action methods:

index() – lists all Articles

view() – displays an individual article

add() – displays a form UI used for adding an Article

addsubmit() – redirects after receiving a form post from the add() Action

Each one of the above methods is called via URL and individually processes business logic, then calls either “render($view)” or “renderWithTemplate($view, $template)”, passing to the View any data added to the “viewData” collection.

You’ll notice that the ArticleController class has “Controller” within its name. This is not a coincidence and is an example of the framework’s convention over configuration mantra. All controller classes must inherit from the ControllerBase class and have a name that ends with “Controller”.
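Putting the conventions together, an Action within such a Controller might look roughly like this. The exact arguments to renderWithTemplate() and the shape of the viewData collection are assumptions, as is the ArticleModel::getAll() call from the earlier Model sketch; the method and property names themselves come from the description above.

<?php
// Sketch of a Controller Action following the framework's conventions.
// Argument formats and the ArticleModel::getAll() call are assumptions.

class ArticleController extends ControllerBase
{
    // Mapped from the URL /article/index
    public function index()
    {
        $articleModel = new ArticleModel();

        // Anything placed in viewData becomes available to the View by variable.
        $this->viewData['articles'] = $articleModel->getAll();

        // Render view/article/index.php inside the site template.
        $this->renderWithTemplate('article/index', 'template');
    }
}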

In summary, a Controller within the framework has the following properties:

Inherits from the ControllerBase class

Has a class name that ends with “Controller” e.g. ArticleController

Has Actions defined as public methods

Uses the “viewData” collection to pass data to the View

Usually calls either “render($view)” or “renderWithTemplate($view, $template)” after processing any Controller business logic

Wrap-Up

This article briefly overviewed the major concepts and design patterns used within a simple MVC framework written in PHP5. Details on each piece of the framework can be found within the code files that accompany this article, as nearly every line is commented to explain its purpose. If anything is particularly unclear, I urge you to review the code itself, since, as a reality of development, only so much can be gleaned without reading through the code.

It is worth noting that there are many features missing from this MVC framework, the implementation of which is left to you. The intention here was to setup a framework from which you as a developer can expand and improve as you see fit, based upon the architectural pattern set forth within the slim codebase of this demo.

Update 12/2/2008:

I've received a few inquiries into the quality of the data access layer, and I'd like to reiterate that although I consider the Model to be architecturally sound, its implementation is missing some important parts that are required to make this framework production ready (SQL injection protection, connection management, etc.). Craig Francis made a good comment about how to work through these issues in the comments below.

It's now clear to me that the Model layer is what most people are focusing on, and that it would probably help if I made a follow-up post addressing the issues that have been pointed out, in addition to some further features that really should be included to make the Model more robust. Look for this update soon, as it will be part of a larger database scalability article I'm working on.

May 16, 2008

I just listened to a great podcast from the Gillmor Gang. Starring front and center was Blaine Cook, Twitter's former Chief Software Architect. Some interesting discussion surrounding Twitter's scaling issues was highlighted, plus some future feature talk regarding topics such as grouping and tracking.

It was interesting hearing Blaine break down some of the issues that his team endured during Twitter's early days. I'm a strong believer in Blaine's obvious technical ability, and I'm even more impressed after his admission that Twitter had only three engineers, including himself, up until September/October of last year. It helps to put into perspective some of the issues Twitter was having when you realize the limited human resources they had at their disposal.

This podcast is definitely worth a listen if you're in any way interested in Twitter's history and/or software architecture.