All over the interwebs, people are equating NoSQL with scalability. Yesterday, I wrote a blog post explaining that most NoSQL dbs aren't, in fact, scalable. The general consensus in the comments seemed to be that my article was missing substantiation — that without explaining why the dbs weren't scalable, I wasn't adding anything to the discussion. So, what does “scalable database” mean?

There are two kinds of scalability: vertical and horizontal. Vertical scaling is just adding more capacity to a single machine. Virtually every database product is vertically scalable to the extent that they can make good use of more CPU cores[1], RAM, and disk space. With a horizontally scalable system, it's possible to add capacity by adding more machines. By far, most database products are not horizontally scalable.

But, people have been scaling products like MySQL for years, so how'd they do it?

Vertically is one way. 37signals has written about the mammoth (128GB of RAM and 8 x 15,000 RPM SAS drives) machines they run their dbs on — and that was over a year ago. I bet they've at least doubled the capacity in those machines by now.

It's also common to see RDBMS-backed applications run one or more read-slaves. With master-slave replication, it's possible to scale reads horizontally. But, there's a trade-off there. Since most (all?) replication systems are asynchronous, reads from slaves may be somewhat stale. It's not uncommon for replication lag times to be measured in minutes or hours in the event of network partitioning (for example). So, read-sharding trades some consistency (the “C” in ACID) for additional read capacity[2].

When an application needs more write capacity than it can get out of a single machine, it's forced to partition (shard) its data across multiple database servers. This is how companies like facebook and twitter have scaled their MySQL installations to massive proportions. This is the closest you can get to horizontal scalability with most database products.

Sharding is a client-side affair — that is, the database server doesn't do it for you. In this kind of environment, when you access data, the data access layer uses consistent hashing to determine which machine in the cluster a specific piece of data should be written to (or read from). Adding capacity to (or alleviating “hot spots” from) a sharded system is a process of manually rebalancing the data across the cluster. So, while it's possible to add capacity to a mysql-backed application by adding machines, mysql itself is not horizontally scalable.
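A toy version of that data access layer's routing logic, in Ruby (the node names and key scheme here are invented for illustration):

```ruby
require "digest/sha1"

# Toy consistent-hash ring: each node owns the keys that hash to the
# arc between the previous node's position and its own.
class Ring
  def initialize(nodes)
    @ring = nodes.map { |n| [hash_of(n), n] }.sort
  end

  # Walk clockwise to the first node at or past the key's position,
  # wrapping around to the start of the ring if necessary.
  def node_for(key)
    pos, node = @ring.find { |h, _| h >= hash_of(key) } || @ring.first
    node
  end

  private

  def hash_of(value)
    Digest::SHA1.hexdigest(value.to_s).to_i(16)
  end
end

ring = Ring.new(%w[db1 db2 db3])
shard = ring.node_for("user:42")  # every client computes the same answer
```

Because every client hashes the same way, they all agree on which shard owns a given key. But the rebalancing described above is still entirely manual: adding a node changes key ownership, and moving the affected rows is your problem.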

On the other hand, products like cassandra, riak, and voldemort automatically partition your data. It's possible to add capacity to these systems by simply turning on a new machine and starting the service (actually, I only know this for certain about cassandra, but I'm reasonably certain it's true for the others). The database system itself takes care of rebalancing the data and ensuring that it is sufficiently replicated across the cluster. This is what it means for a database to be horizontally scalable[3].

[1] Many databases are not capable of making use of additional CPU cores. Many NoSQL dbs, such as redis, are event-driven and run in a single thread.

[2] For the sake of completeness, I am aware of some tools (such as mysql-mmm) that monitor replication lag, and remove overly lagged shards from read rotation to help maintain consistency.

[3] Truly distributed dbs like cassandra also provide other goodies like per-query consistency and availability settings. But, that's beyond the scope of this article. For more on that (and many other things on this topic), the dynamo paper is a good place to start.

Every blog post in this (stupid) debate seems to equate NoSQL with scalability. Depending on who is talking, you either do or don't have to be google to use NoSQL. It's time to stop doing that.

Here are the well-known NoSQL dbs that are not scalable (at least, no differently than MySQL):

MongoDB (Some form of horizontal scalability is planned but not yet production ready. By far, most users of mongo are not using it for its scalability promises.)

CouchDB

Redis

Tokyo Cabinet

MemcacheDB

Berkeley DB

Here are the well-known NoSQL dbs that are scalable:

Cassandra

Riak

Voldemort

At this point, I'd say that couch, mongo, redis, and cassandra are by far the most well-known products in the space. Of those solutions, only one is scalable. So, people must be using NoSQL for other reasons.

In fact, nearly every one of the databases I listed above has its own specific reason somebody might want to use it. Redis has a rich set of data structures; it can be used as a queue server, for example. Mongo is document oriented, so schema evolutions are very natural, you don't really need an ORM, and very rich queries are supported against collections (among other things). Couch shares some of the advantages of mongo, plus it has very nice master-master replication support (not doing it justice, but that's not the point). Cassandra is horizontally scalable. I could go on, but I think I've made my point.

People use NoSQL for all kinds of different reasons. Of all of the people using NoSQL dbs, most of them are probably not motivated by scalability. So, let's stop saying NoSQL when we mean scalable database. It only obscures the discussion.

Update: The consensus in the comments here seems to be that this article is missing substantiation. So, I followed up with another article titled “What does ‘scalable database’ mean?”

When you start writing modular code, and using techniques like dependency injection, you end up with a lot of pieces, but not necessarily an obvious whole. Working with java libraries can be an exercise in instantiating huge towers of dependencies to finally get to the object you actually need.

My scala wrapper around cassandra's thrift bindings depends on an instance of TCassandra.Client, which depends on an instance of TProtocol (of which TBinaryProtocol is a subclass), which depends on an instance of TTransport (of which TSocket is a subclass).

Thrift is necessarily modular. The user of a thrift service might wish to use a different TProtocol implementation or a non-blocking socket (TNonblockingSocket). Still, though, 4 lines of setup code just to get a simple cassandra client is cumbersome.

Contrasted with tightly coupled code, this appears to be a major drawback of composability. With coupled code, if you need an object, you just instantiate it and it creates — or refers directly to — all of the collaborators it needs. It's easy to imagine a tightly coupled version of my wrapper that instantiates hard dependencies.

In a relatively high percentage of use cases, this set of defaults is perfectly acceptable. This kind of opinionated code is really easy to get started with, but can become problematic later on when its user decides she really does need that non-blocking socket. So, what to do?

It turns out it's possible to have it both ways. This is going to sound so simple that it's silly… but just create a second constructor that invokes the first one.
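In Ruby terms (the wrapper in the post is Scala, but the idea translates directly; Transport, Protocol, and Client below are stand-ins for Thrift's TSocket, TBinaryProtocol, and Cassandra.Client):

```ruby
# Stand-ins for the Thrift dependency tower.
class Transport
  def initialize(host, port)
    @host, @port = host, port
  end
end

class Protocol
  def initialize(transport)
    @transport = transport
  end
end

class Client
  def initialize(protocol)
    @protocol = protocol
  end
end

class CassandraWrapper
  attr_reader :client

  # The "second constructor": callers who don't care get sane
  # defaults; callers who do can inject any client they like.
  def initialize(host = "localhost", port = 9160, client: nil)
    @client = client || Client.new(Protocol.new(Transport.new(host, port)))
  end
end

quick  = CassandraWrapper.new  # defaults all the way down
custom = CassandraWrapper.new(client: Client.new(Protocol.new(Transport.new("db1", 9161))))
```

One call instead of four lines of setup for the common case, and the full dependency chain is still injectable for the uncommon one.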

This considerably lowers the bar to getting started with my Cassandra client. If somebody just wants to give it a try or whip up a quick and dirty program, they can do it easily — likely without even reading any documentation. But as the user's needs become more complex, I haven't shut them out of customizing the client to their heart's content.

Of course, as your object models become increasingly complex, there will be some additional effort required to maintain these auxiliary constructors. And arguably, you will create some degree of coupling between the classes. But it's mostly harmless. Provided it's possible to supply alternate dependencies, you have accomplished modularity. We're just adding sane defaults.

This makes for an object model that is really easy to work with, yet still highly modular. It's quick to get started with, but doesn't get in your way when you need to reach in and change some shit around. Making your classes both modular and convenient gets you the best of both worlds.

The current best-practice for writing rails code dictates that your business logic belongs in your model objects. Before that, it wasn't uncommon to see business logic scattered all over controller actions and even view code. Pushing business logic into models makes apps easier to understand and test.

I used this technique rather successfully for quite some time. With plugins like resource_controller, building an app became simply a matter of implementing the views and the persistence layer. With so little in the controller, the focus was mainly on unit tests, which are easier to write and maintain than their functional counterparts. But it wasn't all roses.

As applications grew, test suites would get slow — like minutes slow. When you're depending on your persistence objects to do all of the work, your unit tests absolutely must hit the database, and hitting the database is slow. It's a given in the rails world: big app == slow tests.

But slow tests are bad. Developers are less likely to run them. And when they do, it takes forever, which often turns into checking twitter, reading reddit, or a coffee break, harming productivity.

Also, coupling all of your business logic to your persistence objects can have weird side-effects. In our application, when something is created, an after_create callback generates an entry in the logs, which are used to produce the activity feed. What if I want to create an object without logging — say, in the console? I can't. Saving and logging are married forever and for all eternity.

When we deploy new features to production, we roll them out selectively. To achieve this, both versions of the code have to co-exist in the application. At some level, there's a conditional that sends the user down one code path or the other. Since both versions of the code typically use the same tables in the database, the persistence objects have to be flexible enough to work in either situation.

If calling #save triggers version 1 of the business logic, then you're basically out of luck. The idea of creating a database record is inseparable from all the actions that come before and after it.

Here Comes the Crazy Part

The solution is actually pretty simple. A simplified explanation of the problem is that we violated the Single Responsibility Principle. So, we're going to use standard object oriented techniques to separate the concerns of our model logic.

Let's look at the first example I mentioned: logging the creation of a user. Here's the tightly coupled version:
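Sketched in plain Ruby (User and Log come from the discussion; the bodies are invented, and the actual database write is stubbed out):

```ruby
# Stand-in for the app's activity log (hypothetical API).
class Log
  def self.entries
    @entries ||= []
  end

  def self.record(message)
    entries << message
  end
end

class User
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Tightly coupled: persisting a User *always* writes a log entry,
  # just like an after_create callback would. There is no way to
  # save without logging.
  def save
    # ...write the record to the database...
    Log.record("user #{name} created")
    true
  end
end

User.new("james").save
Log.entries  # => ["user james created"]
```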

To decouple the logging from the creation of the database record, we're going to use something called a service object. A service object is typically used to coordinate two or more objects; usually, the service object doesn't have any logic of its own (simplified definition). We're also going to use Dependency Injection so that we can mock everything out and make our tests awesomely fast (seconds not minutes). The implementation is simple:
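A sketch of that service object in plain Ruby, with fakes standing in for the real collaborators (names other than UserCreationService are invented):

```ruby
# The service coordinates the user class and the log; it has no
# logic of its own. Both collaborators are injected, so tests can
# hand in doubles and never touch the database.
class UserCreationService
  def initialize(user_class:, log:)
    @user_class = user_class
    @log = log
  end

  def create(name)
    user = @user_class.create(name)
    @log.record("user #{name} created")
    user
  end
end

# In a spec, both dependencies are trivial in-memory fakes.
FakeUser = Struct.new(:name) do
  def self.create(name)
    new(name)
  end
end

class FakeLog
  attr_reader :messages

  def initialize
    @messages = []
  end

  def record(message)
    @messages << message
  end
end

log = FakeLog.new
service = UserCreationService.new(user_class: FakeUser, log: log)
service.create("james")
```

Creating a user without logging is now a one-liner (`FakeUser.create` directly, or the real class in the console), because the two concerns are no longer welded together.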

Aside from being able to create a user record in the console without triggering a log item, there are a few other advantages to this approach. The specs will run at lightning speed because no work is actually being done. We know that fast specs make happier and more productive programmers.

Also, debugging the actions that occur after save becomes much simpler with this approach. Have you ever been in a situation where a model wouldn't save because a callback was mistakenly returning nil? Debugging (necessarily) opaque callback mechanisms is hard.

But then I'll have all these extra classes in my app!

Yeah, it's true. You might write a few more "class X; end"s with this approach. You might even write a few percent more lines of actual code. But you'll wind up with more maintainability for it (not to mention faster tests, code that's easier to understand, etc).

The truth is that in a simple application, obese persistence objects might never hurt. It's when things get a little more complicated than CRUD operations that these things start to pile up and become pain points. That's why so many rails plugins seem to get you 80% of the way there, like immediately, but then wind up taking forever to get that extra 20%.

Ever wondered why it seems impossible to write a really good state machine plugin — or why file uploads always seem to hurt eventually, even with something like paperclip? It's because these things don't belong coupled to persistence. The kinds of functionality that are typically jammed into active record callbacks simply do not belong there.

Something like a file upload handler belongs in its own object (at least one!). An object that is properly encapsulated and thus isolated from the other things happening around it. A file upload handler shouldn't have to worry about how the name of the file gets stored to the database, let alone where it is in the persistence lifecycle and what that means. Are we in a transaction? Is it before or after save? Can we safely raise an error?

In the tightly coupled version of the example above, the interactions between the User object and the Log object are implicit. They're unstated side-effects of their respective implementations. In the UserCreationService version, they are completely explicit, stated nicely for any reader of our code to see. If we wanted to log conditionally (say, if the User object is valid), a plain old if statement would communicate our intent far better than simply returning false in a callback.

These kinds of interactions are hard enough to get right as it is. Properly separating concerns and responsibilities is a tried, tested, and true method for simplifying software development and maintenance. I'm not just pulling this stuff out of my ass.

Every so often, somebody blogs about getting bit by what they usually call "over-mocking". That is, they mocked some object, its interface changed, but the tests that were using mocks didn't fail because they were using mocks. The conclusion is: "mocks are bad".

Martin Fowler outlines two kinds of unit testers: stateist and mockist. To simplify things for a minute, a stateist tester asserts that a method returns a particular value. A mockist tester asserts that a method triggers a specific set of interactions with the object's dependencies. The "mocks are bad" crowd is arguing for a wholly stateist approach to unit testing.
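To make the distinction concrete, here's a contrived Ruby unit with one collaborator (all names are invented for illustration):

```ruby
class Greeter
  def initialize(printer)
    @printer = printer
  end

  def greet(name)
    message = "hello, #{name}"
    @printer.print_line(message)
    message
  end
end

# A hand-rolled double that records its interactions.
class RecordingPrinter
  attr_reader :lines

  def initialize
    @lines = []
  end

  def print_line(line)
    @lines << line
  end
end

printer = RecordingPrinter.new
result = Greeter.new(printer).greet("james")

# Stateist assertion: the method returned the right value.
raise unless result == "hello, james"

# Mockist assertion: the right interaction with the collaborator
# took place.
raise unless printer.lines == ["hello, james"]
```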

On the surface, stateist testing certainly seems more convenient. A mockist is burdened with maintaining both the implementation of an object and its various test doubles. So why mocks? It seems like a lot of extra work for nothing.

Why Mocks?

A better place to start might be: what are the goals of unit testing?

For a stateist tester, unit tests serve primarily as a safety net. They catch regressions, and thus facilitate confident refactoring. If the tests are written in advance of the implementation (whether Test Driven or simply test-first), a stateist tester will derive some design benefit from their tests by virtue of designing an object's interface from the perspective of its user.

A mockist draws a thick line between unit tests and functional or integration tests. For a mockist, a unit test must only test a single unit. Test doubles replace any and all dependencies, ensuring that only an error in the object under test will cause a failure. A few design patterns facilitate this style of testing.

Dependency Injection is at the top of the list. In order to properly isolate the object under test, its dependencies must be replaced with doubles. In order to replace an object's dependencies with doubles, they must be supplied to its constructor (injected) rather than referred to explicitly in the class definition.
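Sketched in plain Ruby (VideoUploader and the persister concept come from the text; the method names are invented):

```ruby
# The persister is injected rather than instantiated inside the
# class, so it can be swapped for a double in tests, or for a
# different implementation in production.
class VideoUploader
  def initialize(persister)
    @persister = persister
  end

  def upload(name, data)
    @persister.persist(name, data)
  end
end

# One concrete implementation among many possible ones.
class FileSystemPersister
  def persist(name, data)
    File.binwrite(name, data)
  end
end

# The mockist test replaces the persister with a recording double
# and asserts on the interaction, never touching the file system.
class RecordingPersister
  attr_reader :calls

  def initialize
    @calls = []
  end

  def persist(name, data)
    @calls << [name, data]
  end
end

persister = RecordingPersister.new
VideoUploader.new(persister).upload("cat.mp4", "bytes")
raise unless persister.calls == [["cat.mp4", "bytes"]]
```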

When we're unit testing the above VideoUploader (ruby code, by the way), it's easy to see how we'd replace the concrete Persister implementation with a fake persister for test purposes. Rather than test that the file was actually saved to the file system (the stateist test), the mockist tester would simply assert that the persister mock was invoked correctly.

This design has the benefit of easily supporting alternate persister implementations. Instead of persisting to the filesystem, we may wish to persist videos to Amazon's S3. With this design, it's as simple as implementing an S3Persister that conforms to the persister's interface, and injecting an instance of it.

This is possible because the VideoUploader is decoupled from the Persister. If the Persister class were referred to explicitly in the VideoUploader, it would be far more difficult to replace it with a different implementation. For more on decoupled code, you must read Nick Kallen's excellent article, which goes into far more detail on these patterns and their benefits.

To be sure, we're really talking more about Dependency Injection here than anything else, and stateist testers can and do make use of DI. But the mockist test paradigm prods us towards this sort of design.

We're forced to look at the system we're building in terms of objects' interactions and boundaries. This is because it tends to be quite painful (impossible in many languages) and verbose to unit test tightly coupled code in a mockist style.

So the primary goal of a mockist's unit tests is to guide design of their object model. Making it difficult to couple objects tightly is one such guiding force.

Mockist tests also tend to highlight objects that violate the Single Responsibility Principle since their tests become a jungle of test double setup code. We can think of mockist testing like a kind of shock therapy that pushes you towards a certain kind of design. You can ignore it, but it'll hurt.

Failure isolation is probably the other big advantage of mockist tests. If your unit tests are correctly isolated, you can be sure exactly which object is responsible for a test failure. With stateist tests, a given unit test could fail if the unit or any of its dependencies are broken.

But is it worth it?

Mockist or Stateist?

The burden of maintaining mocks is by far the most common argument against mockist tests. You have to write both the implementation and at least one test double. When one changes, the other has to change too.

Perhaps most troubling, if an object's interface changes, its consumers' unit tests will continue to pass because the mock objects will function as always — arguably a hindrance to refactoring. Since you need to test for that scenario, mockists also write integration tests. Integration tests are probably a good idea anyway, but as a mockist, you don't really have a choice.

Also, the refactoring problem only applies to dynamic languages. In a statically typed language, the program will simply fail to compile.

I find this burden troubling. More code to write makes the “we don't have time” argument come out in pressure situations. For a design exercise, the cost of mockist tests seems quite high.

On my last open source project (friendly), I decided to give mockist testing a try. Most of the code turned out beautifully. And the mistakes I did make could have been avoided had I listened to the pain I felt while testing them.

Since that project worked out well, I've been applying mockist techniques to other work. I've written mockist tests in everything from my scala projects to my rails apps. So far, so good.

In theory, I hate the idea of mockist tests. They just seem like too much work. I don't want to like them and remain reluctant to admit that I do. But in practice, I'm writing better code, and it's hard to hate that.