Scalability Strategy

On http://www.datastax.com/solutions/scaleable-elastic-datacenter you write that it is very easy to add new hardware on demand to a Cassandra cluster.
I wonder about the best strategy here. Given a cluster with n nodes, does it make sense to just add 1, 2, 3 nodes or so, or should one always add exactly n nodes, i.e. always double the cluster size?
Since Cassandra arranges the data in a logical ring, new nodes can always be placed between two existing nodes, right?
This means, however, that the token range of the successor neighbor is bisected.
Thus, unless exactly n nodes are added, the token ranges become unbalanced across the ring, leading to different load factors for different nodes.
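To make the imbalance concrete, here is a rough sketch (Python, with toy numbers of my own, assuming a ring of evenly spaced tokens) of what happens when a single node bisects one existing range:

    RING = 2**127  # size of the token space in this toy example

    def ownership(tokens):
        """Fraction of the ring each node owns: from its predecessor's token
        (exclusive) up to its own token (inclusive)."""
        ts = sorted(tokens)
        return [((t - p) % RING) / RING for p, t in zip([ts[-1]] + ts[:-1], ts)]

    four = [i * RING // 4 for i in range(4)]
    print(sorted(ownership(four)))                 # [0.25, 0.25, 0.25, 0.25]
    print(sorted(ownership(four + [RING // 8])))   # [0.125, 0.125, 0.25, 0.25, 0.25]

So the new node and the node whose range it split each carry half the load of every other node.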

Yeah, depending on your partitioner and data model, there are different strategies you can consider.

If you're using the random partitioner (an MD5 hash determines where keys land on the node ring), then you can assume your data is spread evenly across all nodes. So, when all of the nodes are filling up evenly, the easy option is to double the cluster size, bisecting each node's range in the ring. You could also just add, say, 3 additional nodes to an existing 20-node cluster. But then you'd either have to pick just 3 nodes to bisect, or, if you want an even spread of the data, change the tokens on all 20 existing nodes and have them redistribute their data, which could be heavy on the network.
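For the token math, here's a quick sketch (Python, helper names are my own) assuming the random partitioner's 0 .. 2**127 - 1 MD5 token space, comparing the two approaches:

    RING = 2**127  # random partitioner (MD5) token space

    def balanced_tokens(n):
        """Evenly spaced tokens for an n-node ring."""
        return [i * RING // n for i in range(n)]

    def doubled_tokens(tokens):
        """Keep the existing tokens and add one new token bisecting each range."""
        ts = sorted(tokens)
        mids = [(p + ((t - p) % RING) // 2) % RING
                for p, t in zip([ts[-1]] + ts[:-1], ts)]
        return sorted(ts + mids)

    old = balanced_tokens(20)
    doubled = doubled_tokens(old)      # 40 tokens; existing nodes keep theirs and each streams half its data to one new node
    rebalanced = balanced_tokens(23)   # 23 tokens; nearly every existing node has to move, so data shuffles all around the ring

Either way, you'd hand the chosen value to each node as its initial_token, or use nodetool move to shift existing nodes.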

If you're using the order-preserving partitioner (OPP), then adding 3 nodes to a 20-node cluster makes more sense. As you may know, with OPP rows are stored in key order, aligning the physical layout of the data with the sort order (which gives you the ability to perform ordered range slices). So, if you're down with OPP, your ring could potentially be very lopsided. When one node develops a hot spot, you can bisect just that node's range.
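Since OPP tokens are just keys, one simple way to pick the splitting token (a rough sketch, names are mine) is to take the median of a sample of the keys the hot node actually stores:

    def splitting_token(sample_keys):
        """Median key of a sample taken from the hot node's range."""
        keys = sorted(sample_keys)
        return keys[len(keys) // 2]

    # e.g. a hot node holding user names from the 'm'..'t' part of the ring:
    print(splitting_token(["monica", "nathan", "oscar", "peter", "quentin", "rachel"]))
    # -> 'peter'; a new node at token 'peter' takes roughly half the hot node's rows

Since a node owns the range from its predecessor's token up to and including its own, the new node at the median key picks up about half of the hot node's data.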

Also, keep in mind that you should only plan on using 50% of the space on the Cassandra data volume. The other half needs to remain free to accommodate compaction. And on the Apache mailing list, the general consensus is that each Cassandra node should store no more than roughly 500 GB of data. That's just a rule of thumb, so depending on your use case, you may be able to store more.
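As a back-of-the-envelope check (toy numbers of my own, using only those two rules of thumb):

    def usable_per_node_gb(volume_gb):
        """Keep 50% free for compaction, and cap a node at roughly 500 GB."""
        return min(volume_gb * 0.5, 500)

    def nodes_needed(total_data_gb, volume_gb):
        """total_data_gb = total on-disk data across the cluster, replicas included."""
        per_node = usable_per_node_gb(volume_gb)
        return int(-(-total_data_gb // per_node))   # ceiling division

    print(nodes_needed(3000, 2000))   # 3 TB of data on 2 TB volumes -> 6 nodes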