HowTo deploy a sharded MongoDB Cluster with AWS CloudFormation

As I hinted yesterday, I've tried to automate the deployment of a sharded MongoDB cluster on Amazon. It's unnecessarily difficult (rumor has it 10gen is doing something about it in the future) but it's doable with an appropriate amount of persistence.

There are many reasons why this was more difficult than it should be. Creating a database cluster is always harder than launching, for example, a Tomcat server. A web server is stateless, so you just use some tool to say "launch 5 EC2 servers and install these RPMs on all of them". Easy to do with puppet, chef or a bash script. A database cluster, on the other hand, is not a bunch of independent cowboys. Rather, the nodes need to talk to each other, and in the case of sharding or master-slave replication (both of which MongoDB does) they need to agree on different roles.

On Amazon, by default, you get a different hostname and IP address every time. This makes, for example, the default puppet approach useless, since it is based on identifying the nodes by hostname. The only tool I've found that really addresses this problem is Amazon CloudFormation. It allows you to describe a deployment without having to know the hostnames of the nodes in advance. CloudFormation sucks in 15 other ways, but currently it's the only tool I know that is usable from this perspective.

I could write a separate blog post, maybe even a small book, on all the flaws of CloudFormation, but in the above snippet you already see one antipattern: Embedding a bash script as strings in a json file is a terrible idea. It's also unnecessary, but seems to be widespread practice. Sysadmins are terrible coders. Anyways...

Alas, nobody has posted templates to also deploy shards, but it was simple enough to use the above examples to also deploy mongos and mongoc nodes and then have a top-level template that orchestrates the building of a whole sharded cluster. My template launches 3 shards, so that's a total of 15 nodes. In production it would include a backup node per shard, so 18 nodes. (Sorry, I would like to post my templates here, but the process of getting approval to publish them is harder than for someone else to just rewrite them.)

Except for one thing. At least in the 2.2 version, the MongoDB RPMs do not actually contain init scripts or config files for starting mongoc and mongos processes. WTF? It almost makes me think sharding isn't seriously supported. So in my template I also have to manually copy such files into place from a git repo I googled (mongos.init, mongoc.init).

This worked to launch a 15 node MongoDB cluster. And it would have been great except for one thing. What happens when a node eventually fails and needs to be replaced? Well, CloudFormation is completely oblivious to such things. So you have to copy and paste from the AutoScalingKeepAtNSample.template posted on the AWS CloudFormation templates page. Note: The point here is not to do any scaling. Rather, the AutoScaling group is configured to keep the number of nodes at exactly 3, which means that when one node terminates, another is started to replace it.
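To make the "keep at N" idea concrete, here is a minimal sketch in Python that builds the relevant CloudFormation resource as a dict and prints it as JSON. It is not the AutoScalingKeepAtNSample.template itself, only my assumption of what the essential part looks like: MinSize and MaxSize are set to the same value. The resource names ShardReplicaSetGroup and ShardLaunchConfig are made up for the example.

```python
import json


def fixed_size_group(launch_config_name, size=3):
    """Return a CloudFormation AutoScaling group resource pinned at `size` nodes."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "AvailabilityZones": {"Fn::GetAZs": ""},
            # Hypothetical launch configuration defined elsewhere in the template.
            "LaunchConfigurationName": {"Ref": launch_config_name},
            # MinSize == MaxSize: the group never scales up or down, it only
            # starts a replacement when an instance terminates.
            "MinSize": str(size),
            "MaxSize": str(size),
        },
    }


# Hypothetical resource names, used for illustration only.
template_fragment = {
    "Resources": {"ShardReplicaSetGroup": fixed_size_group("ShardLaunchConfig")}
}
print(json.dumps(template_fragment, indent=2))
```

Generating the JSON from a small program like this is also one way to avoid hand-editing large templates, though that is a side benefit rather than the point here.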

AutoScaling has another limitation: In an AutoScaling group, all nodes are clones. They get exactly the same meta data, as far as I could see. Well jeez... The need for nodes to do different things was the reason to use CloudFormation in the first place. Now we are back to square one.

In the end I concluded I need to take matters into my own hands, as the AWS tooling is still inadequate on this point. Alternatively, you can also blame MongoDB for not making this very easy. So I wrote a bash script to the rescue. It's only a proof of concept and, for example, will not recover from error conditions.

So what you need to do to deploy a sharded MongoDB with AWS tooling is the following:

Using AWS CloudFormation we will deploy each Replica Set as an AutoScaling group. (Need for AutoScaling is a workaround due to CloudFormation limitations.)

A Replica Set is launched as N nodes that belong to the AutoScaling group. N >= 3. The N nodes are identical to each other. (AutoScaling limitation.)

To create a Replica Set, we need to pick one of the nodes as the first one to call rs.initiate() and he will then connect the other nodes to himself. Note that due to the previous point, it is not trivial to pick one node that will execute a different sequence of actions. (MongoDB limitation - creating a replica set is not a symmetric operation across all of the members.)

A starting node can use the config database, available via MongoS nodes, to check whether a shard already exists or whether he should initialize and add it. If the shard already exists in the cluster, he can also find out the hostname of a current member and connect to it to join the shard. (MongoDB limitation - for example Galera would allow you to specify a list of potentially available members.)

If the shard does not yet exist, the node can proceed to call rs.initiate() and after that call sh.addShard(). However, a race condition exists: rs.initiate() is a very slow operation so the risk is high that all N nodes would proceed to initialize themselves. Further, there is no easy way to undo replica set initialization, so if two or more nodes call rs.initiate() at the same time, the latter is essentially useless after that. (MongoDB limitation.)

To ensure that only one node proceeds to initialization, nodes have to compete for a lock. We write this lock into our own collection in the config database on MongoS nodes. (MongoDB limitation)
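The decision each node has to make in the steps above can be sketched as follows. This is not the attached bash script, just a small Python simulation of the same logic. FakeConfigDB is a made-up in-memory stand-in for our own lock collection in the config database, and its try_lock() method imitates an insert that only succeeds for the first node because the shard name is used as a unique key. All class, function and host names are assumptions for the example.

```python
import threading


class FakeConfigDB:
    """In-memory stand-in for the config database reached through mongos."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.shards = {}  # shard name -> list of member hostnames
        self.locks = {}   # shard name -> hostname holding the init lock

    def try_lock(self, shard, host):
        # Imitates an insert keyed on the shard name: only the first one succeeds.
        with self._mutex:
            if shard in self.locks:
                return False
            self.locks[shard] = host
            return True

    def members(self, shard):
        with self._mutex:
            return list(self.shards.get(shard, []))

    def add_member(self, shard, host):
        with self._mutex:
            self.shards.setdefault(shard, []).append(host)


def decide_role(db, shard, host):
    """Return 'initiate' for the single node that wins the lock, else 'join'."""
    if db.members(shard):
        return "join"      # shard already exists: join a current member
    if db.try_lock(shard, host):
        return "initiate"  # we won the lock: initiate the set, then add the shard
    return "join"          # lost the race: wait for the winner, then join it


def boot(db, shard, host, roles):
    role = decide_role(db, shard, host)
    roles[host] = role
    if role == "initiate":
        # Stands in for initiating the replica set and registering the shard.
        db.add_member(shard, host)


# Three identical nodes from the same AutoScaling group start at the same time.
db = FakeConfigDB()
roles = {}
threads = [
    threading.Thread(target=boot, args=(db, "shard1", f"node{i}", roles))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(roles.values()))
```

Whichever node gets to the lock first becomes the initiator and the other two join it, regardless of start order. Like the bash proof of concept, the sketch does not handle the case where the lock holder dies before finishing.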

The shell script is attached to this post.

It's not pretty, and everyone involved should feel a bit embarrassed.

Getting to this point has taken me about 2 months of work (more in calendar time). The main reason it is so slow is CloudFormation. It is really easy to make mistakes with it, and each iteration is really slow (30-60 minutes per iteration, in this case).

In practice we have pretty much decided to abandon CloudFormation and will use one of several 3rd party tools to deploy our stack instead. (One of Cloudify, juju, Scalr, Severalnines, etc...) The bash script was really finished just to understand the problem and prove it can be solved, but I don't currently consider CloudFormation a maintainable approach.

For me, this is a great demonstration of over-reaching with CloudFormation. I lay the blame firmly at Amazon's door for this - they provide a whole page of examples using the same anti-patterns.

In my opinion, CloudFormation ought to be a tool for describing infrastructure only. It should provision machines and associated infrastructure (subnets, security group rules etc). Its only purpose should be to get these to the point where configuration management (Puppet, Chef, cfengine, whatever) can do the config. We use it like this across a number of environments, and it actually works really well.

Unfortunately, no-one told the guys who write the CloudFormation example templates. They present CloudFormation as the only solution you need to build your infrastructure.

I could also talk about how most of the example templates have clearly never been tested against the live APIs, because many of them simply don't work at all, but we're drifting off topic here.

Long story short, the zen of Unix still applies: do one thing and do it well. Use CloudFormation just to specify Amazon infrastructure and you'll be fine. Try to do anything clever with it and you'll be in a world of pain.

I agree completely, except for one thing: How do you deal with the main limitation I mention here, that CloudFormation will not replace terminated nodes and that AutoScaling loses meta-information? This is a problem squarely in the domain of what these tools are supposed to be doing, and they don't do it well.

As for using bash scripts in the templates, I would also have used puppet in production, but I suppose bash is a useful "lingua franca" to provide examples in. In fact, in my own work I simply perpetuated the antipattern and probably lost a week or two of Nokia time solely due to that decision :-)

While I've not played with mongo specifically, I have handled this type of chicken and egg auto discovery problem with CloudFormation and chef--I'd imagine that similar abilities exist within puppet. Anyways, as the commenter above said, use CloudFormation to handle enforcing the number of nodes in each group, then use chef to configure them. It is very straightforward to use chef search to query the chef server for information about other machines it is controlling, and it seems that the business logic for deciding a new node's role in a given shard cluster would be pretty easy to implement--eg if there's no master, promote one of the slaves, any new node comes in as a slave. If there's no backup node, become the backup node.
You can use chef node attributes and/or roles to identify what type of nodes are already present in the given cluster and then dynamically slot in any new nodes where they need to fit.

While chef can be a bit of a lift, it's going to be scads easier than brute force bash-ing a solution.

Note: you do need to do a little work in chef to make sure that terminated AWS nodes are correctly removed from chef server, but it's a pretty straightforward cron sanity check, and/or shutdown script.

Indeed, bash was never intended to be the ultimate solution, just kind of a "placeholder" language, or prototyping, if you will. I wanted to focus on CloudFormation first and later think about what to do with puppet, and how. Updates would be problematic with bash in any case, whereas puppet and chef excel at that.

So what happens when all three nodes in the same autoscaling group register with chef and do a chef search at the same moment in time? We've seen situations where all the nodes try to come up at the same time, so the search results are inconsistent across the cluster. They each decide that none of the other nodes are up, and basically my cluster is in umpteen separate pieces.