Sunday, 29 November 2009

The Cassandra database has been getting quite a lot of publicity recently. I think this is a good thing in general, but it seems that some people are considering using it for unsuitable purposes.

Cassandra is a cluster database which uses multiple nodes to provide:

Read-scaling

Write-scaling

High availability

Unless you need at least TWO of those things, you should probably not bother.

Good reasons to use Cassandra

High availability

Cassandra tolerates the failure of some nodes and will continue to serve reads and take writes despite some nodes being offline or unreachable - the exact behaviour depends on its settings and on the consistency level requested for each read or write.
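To make the "depends on the consistency level" part concrete, here is a rough sketch of the arithmetic. ONE, QUORUM and ALL are Cassandra consistency levels; the function names and the model itself (counting live replicas of a single key and nothing else) are my own simplification, not Cassandra's code.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Replica responses needed for a request at the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas for this key.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def can_serve(level: str, replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas of the key are reachable (simplified model)."""
    return live_replicas >= required_acks(level, replication_factor)


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the example
    for live in (3, 2, 1):
        row = {lvl: can_serve(lvl, rf, live) for lvl in ("ONE", "QUORUM", "ALL")}
        print(f"{live} of {rf} replicas up: {row}")
```

With three replicas and one node down, ONE and QUORUM requests still succeed in this model while ALL does not, which is the trade-off you pick per request.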

Write scaling

Cassandra allows you to scale writes by just adding more nodes; writes are split between nodes, hence you can generally get better and better write performance by JUST adding more nodes. (NB: it doesn't necessarily balance the load, so you might not see the improvement in all cases, but this is what it aspires to.)
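A toy illustration of why splitting writes by key helps: hash each key to pick a node and count how many writes the busiest node receives as the cluster grows. This is only a sketch. Hash-modulo placement is a stand-in I chose for brevity (Cassandra assigns keys to nodes differently), and the `owner` and `writes_per_node` names are made up for the example.

```python
import hashlib
from collections import Counter


def owner(key: str, nodes: list[str]) -> str:
    """Pick a node for a key by hashing it (toy stand-in for a partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def writes_per_node(keys: list[str], nodes: list[str]) -> Counter:
    """Count how many writes each node would receive."""
    return Counter(owner(k, nodes) for k in keys)


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(10_000)]
    for n in (2, 4, 8):
        nodes = [f"node{i}" for i in range(n)]
        counts = writes_per_node(keys, nodes)
        print(n, "nodes: busiest node takes", max(counts.values()), "writes")
```

With evenly spread keys the busiest node's share shrinks as nodes are added. If the keys (or the node assignments) are skewed, it won't, which is the load-balancing caveat above.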

Less good reasons to use Cassandra

Read scaling

Cassandra gives you read-scaling in the same way as write-scaling. This is a good thing, but can also be achieved relatively easily* with a conventional database by adding more and more read-only slaves / replicas, or using a cache (if you tend to get a lot of similar requests). Many big MySQL users do both.

Also Cassandra does NOT create more than the configured number of replicas of any given piece of data, regardless of the amount of traffic on that part, so you could end up having a small number of servers hammered and the rest idle.
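Here is a small sketch of that hot-spot effect. Every read for a key can only go to that key's replicas, so a very popular key loads at most "replication factor" nodes no matter how big the cluster is. The placement rule (primary node plus the next nodes in order), the toy hash and the function names are all assumptions for illustration.

```python
from collections import Counter
from itertools import cycle


def replicas_for(key: str, nodes: list[str], replication_factor: int) -> list[str]:
    """Toy placement: a primary node plus the next nodes in list order."""
    start = sum(key.encode("utf-8")) % len(nodes)  # deterministic toy hash
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def read_load(requests: list[str], nodes: list[str], replication_factor: int) -> Counter:
    """Spread reads for each key round-robin over that key's replicas only."""
    load = Counter({n: 0 for n in nodes})
    pickers = {}
    for key in requests:
        if key not in pickers:
            pickers[key] = cycle(replicas_for(key, nodes, replication_factor))
        load[next(pickers[key])] += 1
    return load


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(10)]
    # 9000 reads for a single popular key on a 10-node cluster, 3 replicas.
    load = read_load(["hot"] * 9000, nodes, replication_factor=3)
    for node in nodes:
        print(node, load[node])
```

In this example three nodes take 3000 reads each and the other seven take none. A cache or extra read-only replicas in front of a conventional database would not have that fixed ceiling.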

Bad reasons to use Cassandra

Schema flexibility

aka "I cannot figure out how to use ALTER TABLE", or at least make a flexible conventional schema ...

Some people have cited schema flexibility as a good reason to use Cassandra (the same argument applies to Voldemort, CouchDB etc).

However, in practice this is NOT a benefit, because it comes at the cost of EVERYTHING ELSE YOU HAVE IN A TRADITIONAL DATABASE.

That's quite a big list (and very incomplete), so you'd better have a better reason for using it than "I cannot figure out how to use ALTER TABLE".

Because X or Y uses it

Just because Digg, Facebook et al use Cassandra doesn't mean you have to. Your data are probably more important than theirs. Your workload is probably different from theirs. In particular, your write/read scale requirements are probably less than theirs.

I have a lot of respect for Facebook, Digg developers etc, but I also have a lot of envy:

They lose data, nobody cares

They lose data, nobody rings up and complains

They lose data, and NOBODY DEMANDS THEIR MONEY BACK

They could get a bit of bad press, their users might desert them in numbers, but they wouldn't lose money directly and immediately.

Most companies who have big data provide a service, which comes with an SLA. The SLA often says that if we lose a customer's data, the customer gets their money back.

* May or may not be easy, depending on the calibre of your developers, ops staff, change control requirements, data structure etc.