With the list of features above, why don’t we all use Cassandra for all our database needs? This is the hype I hear at conferences and from some commercial entities pushing their version of Cassandra. Unfortunately, some people believe it, especially now that many users of proprietary database technologies like Oracle and SQL Server are looking to get out from under massive license fees. The (apparent) low cost of open source, in combination with the list of features above, makes Cassandra very attractive to many corporate CTOs and CFOs. What they are missing is that core features they assume every database has are absent from Cassandra.

I am a database architect and consultant. I have been working with Cassandra since version 0.7 came out in 2010.

I like Cassandra, and I often promote it to my customers for the right use cases.

Unfortunately, I am often asked to help after the choice has already been made, and either the use case turned out to be a poor fit for Cassandra or the team made poor data modeling choices.

In this blog post I am going to discuss some of the pitfalls to avoid, suggest a few good use cases for Cassandra and offer just a bit of data modeling advice.

Where Cassandra users go wrong

Cassandra projects tend to fail as a result of one or more of these reasons:

The wrong Cassandra features were used.

The use case was totally wrong for Cassandra.

The data modeling was not done properly.

Wrong Features

To be honest, it doesn’t help that Cassandra has a bunch of features that probably shouldn’t be there, features that lead you to believe you can do some of the things everyone expects a relational database to do:

Secondary indexes: They have their uses but not as an alternative access path into a table.

Counters: They work most of the time, but they are very expensive and should not be used very often.

Lightweight transactions: They are neither transactions nor lightweight.

Batches: Sending a bunch of operations to the server at one time is usually good because it saves network time, right? Well, in the case of Cassandra, not so much.

Materialized views: I got taken in on this one. It looked like it made so much sense. Because of course it does. But then you look at how it has to work, and you go…Oh no!

CQL: Looks like SQL which confuses people into thinking it is SQL.

Using any of the above features the way you would expect them to work in a traditional database is certain to result in serious performance problems and in some cases a broken database.

Get your data model right

Another major mistake developers make in building a Cassandra database is making a poor choice for partition keys.

Cassandra is distributed. This means you need a way to distribute the data across multiple nodes. Cassandra does this by hashing a part of every table’s primary key, called the partition key, and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider the following rules when choosing your partition keys:

There should be enough partition key values to spread the data for each table evenly across all the nodes in the cluster.

Keep data you want to retrieve in a single read within a single partition.

Don’t let partitions get too big. Cassandra can handle large partitions (over 100 megabytes), but it’s not very efficient. Besides, if you are getting partitions that large, it’s unlikely your data distribution will be even.

Ideally all partitions would be roughly the same size. It almost never happens.

Typical real-world partition keys are user ID, device ID, account number, etc. To manage partition size, a time modifier such as year, or year and month, is often added to the partition key.

If you get this wrong, you will suffer greatly. I should probably point out that this is true in one way or another of all distributed databases. The key word here is distributed.
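The partitioning rules above can be illustrated with a small sketch. This is not how Cassandra is implemented: it uses MD5 as a stand-in for the Murmur3 hash used by Cassandra's default partitioner, and a simple modulo instead of token ranges. The node names, the `device-N` IDs and the year-month bucket format are all hypothetical. It only shows why the number of distinct partition key values matters for spreading data.

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical 4-node cluster


def token(partition_key: tuple) -> int:
    """Hash a partition key to an integer token.

    Stand-in only: Cassandra's default partitioner uses Murmur3, not MD5.
    """
    raw = "|".join(str(part) for part in partition_key).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")


def owner(partition_key: tuple) -> str:
    """Map a token to a node. Real clusters assign token ranges, not a modulo."""
    return NODES[token(partition_key) % len(NODES)]


def placement(keys) -> Counter:
    """Count how many partitions land on each node."""
    return Counter(owner(k) for k in keys)


# Too few distinct keys: two partitions can never occupy more than two nodes.
by_region = placement([("east",), ("west",)])

# Many distinct keys, with a year-month bucket added to cap partition size.
by_device_month = placement(
    (f"device-{d}", f"2024-{m:02d}") for d in range(500) for m in range(1, 13)
)

print(len(by_region), "node(s) hold data when there are only 2 partition keys")
print(dict(by_device_month))
```

With only two key values, at least half of this toy cluster sits idle no matter how much data arrives. With 6,000 device-and-month combinations, the partitions spread across all four nodes, and each partition only holds one month of one device's data.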

Wrong Use Cases for Cassandra

If you have a database where you depend on any of the following things, Cassandra is wrong for your use case. Please don’t even consider Cassandra. You will be unhappy.

ACID: Cassandra does not do ACID. LSD, sulphuric or any other kind. If you think you need it, go elsewhere. Many times people think they need it when they don’t.

Aggregates: Cassandra does not support aggregates. If you need to do a lot of them, think about another database.

Joins: You may be able to data model yourself out of this one, but take care.

Locks: Honestly, Cassandra does not support locking. There is a good reason for this. Don’t try to implement them yourself. I have seen the end result of people trying to do locks using Cassandra and the results were not pretty.

Updates: Cassandra is very good at writes, okay with reads. Updates and deletes are implemented as special cases of writes and that has consequences that are not immediately obvious.

Transactions: CQL has no begin/commit transaction syntax. If you think you need it then Cassandra is a poor choice for you. Don’t try to simulate it. The results won’t be pretty.

If you are thinking about using Cassandra with any of the above requirements, you likely don’t have an appropriate use case. Please think about using another database technology that might better meet your needs.
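The point about updates and deletes being special cases of writes can be pictured with a toy model. The sketch below is an assumption-laden simplification, not Cassandra's storage engine: the class name `AppendOnlyTable` and its methods are hypothetical, and a single in-memory list stands in for the on-disk files. It shows the general idea that nothing is changed in place: an update appends a newer version, a delete appends a marker (Cassandra calls these tombstones), and a read has to reconcile every version by timestamp.

```python
TOMBSTONE = object()  # marker appended by a delete


class AppendOnlyTable:
    """Toy model: every insert, update and delete appends a timestamped record."""

    def __init__(self):
        self.log = []  # list of (key, timestamp, value) records

    def write(self, key, value, ts):
        self.log.append((key, ts, value))

    def delete(self, key, ts):
        # A delete does not remove anything; it adds one more record.
        self.log.append((key, ts, TOMBSTONE))

    def read(self, key):
        """Scan every version of the key and keep the one with the highest timestamp."""
        versions = [(ts, value) for k, ts, value in self.log if k == key]
        if not versions:
            return None
        _, latest = max(versions, key=lambda version: version[0])
        return None if latest is TOMBSTONE else latest


table = AppendOnlyTable()
table.write("user-1", "alice@example.com", ts=1)
table.write("user-1", "alice@example.org", ts=2)  # an "update" is another write
table.delete("user-1", ts=3)                      # a "delete" is a write too

print(table.read("user-1"))  # the row is gone for readers...
print(len(table.log))        # ...but three records are still stored
```

The non-obvious consequence is visible in the last two lines: a row that has been updated and deleted still costs storage and read work until some background cleanup removes the old versions, which is why update-heavy and delete-heavy workloads tend to fit poorly.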

When you should think about using Cassandra

Every database server ever designed was built to meet specific design criteria. Those design criteria define the use cases where the database will fit well and the use cases where it will not.

Cassandra’s design criteria are the following:

Distributed: Runs on more than one server node.

Scale linearly: By adding nodes, not more hardware on existing nodes.

Work globally: A cluster may be geographically distributed.

Favor writes over reads: Writes are an order of magnitude faster than reads.

Support data with a defined lifetime: Data in a Cassandra database can be given a defined lifetime. There is no need to delete it; after the lifetime expires, the data goes away.

There is nothing in the list about ACID, support for relational operations or aggregates. At this point you might well say, “what is it going to be good for?” ACID, relational operations and aggregates are critical to the use of all databases. No ACID means no atomicity, and without atomic operations, how do you make sure anything ever happens correctly, meaning consistently? The answer is you don’t. If you were thinking of using Cassandra to keep track of account balances at a bank, you probably should look at alternatives.
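A defined lifetime can be sketched as follows. In CQL it is expressed with `USING TTL` on a write; the Python below is only a toy model of the behavior, with a hypothetical `ExpiringStore` class and a caller-supplied clock so the example is deterministic. It is not how Cassandra stores or purges expired data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: str
    written_at: int      # seconds, supplied by the caller
    ttl: Optional[int]   # lifetime in seconds; None means the cell never expires


class ExpiringStore:
    """Toy key-value store where each write may carry a time-to-live."""

    def __init__(self):
        self.cells = {}

    def write(self, key, value, now, ttl=None):
        self.cells[key] = Cell(value, now, ttl)

    def read(self, key, now):
        cell = self.cells.get(key)
        if cell is None:
            return None
        if cell.ttl is not None and now >= cell.written_at + cell.ttl:
            return None  # expired: invisible to readers, no explicit delete issued
        return cell.value


store = ExpiringStore()
store.write("session-42", "token-abc", now=0, ttl=3600)  # keep for one hour

print(store.read("session-42", now=3599))  # still readable
print(store.read("session-42", now=3600))  # lifetime over
```

The application never issues a delete for `session-42`. That is the appeal for data such as sessions, sensor readings or logs, where the retention period is known at write time.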

Ideal Cassandra Use Cases

It turns out that Cassandra is really very good for some applications.

The ideal Cassandra application has the following characteristics:

Writes exceed reads by a large margin.

Data is rarely updated, and when updates are made they are idempotent.

Read access is by a known primary key.

Data can be partitioned via a key that allows the database to be spread evenly across multiple nodes.
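The idempotency requirement is worth a short illustration. In a distributed system, a client whose write times out cannot tell whether the write was applied, so the usual answer is to retry. The sketch below, with hypothetical function and column names, compares an update that is safe to repeat with one that is not.

```python
def apply_all(state: dict, operations) -> dict:
    """Apply each operation in order to a copy of the starting row."""
    result = dict(state)
    for operation in operations:
        result = operation(result)
    return result


def set_status(row: dict) -> dict:
    # Idempotent: writing the same value twice leaves the same result.
    return {**row, "status": "shipped"}


def add_one_view(row: dict) -> dict:
    # Not idempotent: every repeat changes the result again.
    return {**row, "views": row["views"] + 1}


start = {"status": "new", "views": 10}

# Each update delivered exactly once.
once = apply_all(start, [set_status, add_one_view])

# Each update delivered twice, as happens when a timed-out write is retried.
retried = apply_all(start, [set_status, set_status, add_one_view, add_one_view])

print(once)
print(retried)
```

The status column ends up the same in both runs, while the view count drifts by one for every retry. Updates of the first kind are the ones that fit Cassandra well; increment-style updates are the reason counters come with the caveats mentioned earlier.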

Conclusion

Frequently, executives and developers look at the feature set of a technology without understanding the underlying design criteria and the methods used to implement those features. When dealing with distributed databases, it’s also very important to recognize how the data and workload will be distributed. Without understanding the design criteria, implementation and distribution plan, any attempt to use a distributed database like Cassandra is going to fail. Usually in a spectacular fashion.


About the Author

John has 40 years of experience working with data: data in files and in databases, from flat files through ISAM to relational databases and, most recently, NoSQL. For the last 15 years he has worked on a variety of open source technologies including MySQL, PostgreSQL, Cassandra, Riak, Hadoop, and HBase. As a Chief Database Architect at AOL he brought MySQL in to replace Sybase and has worked hands-on with MySQL databases holding hundreds of billions of rows and running millions of transactions per second. For the last three years he has been working for Pythian to help their customers improve their existing databases and select new ones for new applications.

Secondary indexes can be very useful in improving performance when querying a large partition (one with a significant number of rows in it) on non-primary key columns. Secondary indexes should not be used to provide an alternate access path into a table. Used as an alternate access path, they limit the scalability of the cluster.

Thank you, John.
Your post is so good and it’s soooo short.
Why don’t you think about writing a book?
I understand it’s an enormous effort and not much money at all
but for the sake of humanity :)
Thank you again

Hi John,
I have a question regarding consistency. What happens when two threads add a new record to a table ordered by a clustering column at the same time, especially when the slightly faster thread writes a record that contains a more recent timestamp than the slower one?

Does it result in a wrongly ordered table, or is there some kind of wait or correction mechanism?
