About Brian O'Neill

Broken Glass: Diagnosing Production Cassandra Issues

I just passed my two-year anniversary at Health Market Science (HMS), and we’ve been working with Cassandra for almost the entirety of my career here. In that time, we have had remarkably few problems with it. Like few other technologies I’ve worked with, Cassandra “just works”.

But, as with *every* technology I’ve ever worked with, you eventually have some sort of issue, even if it is not with the technology itself, but rather your use of the technology. And that was the situation here. (gun? check. foot? check. aim… fire. =)

Here is our tale of when bullet met foot…

Our dependency on Cassandra has increased exponentially since it’s been in production. We’ve been adding product lines and clients to those product lines at an ever-increasing rate. And with that success, we’ve had to evolve the architecture over time, but some parts of the system have remained untouched because they’ve been cruising along. Over the last couple of weeks, one of those parts reared its ugly head.

We’ve been scaling the nodes in our cluster vertically to accommodate demand. Our cluster is entirely virtual, so this was always the path of least resistance. Need more memory? No problem. Need more CPU? No problem. Need space/disk? We’ve got tons in our SAN. You do that a few times and with increasing frequency, and you can start to see a trend that doesn’t end well. =)

We had our heap size set too large given our system memory, and that started causing hiccups in Cassandra. Once we brought that back in line, we limped along for a few more weeks.
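As a rough sanity check on heap sizing, the stock cassandra-env.sh of this era derived its default max heap as max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8 GB)). The sketch below encodes that rule in Java. The class and method names are made up for illustration, the numbers in main are example values rather than the settings of the cluster described here, and the rule should be checked against the script shipped with your own version.

```java
// Sketch of the default heap-sizing rule from the stock cassandra-env.sh.
// HeapSizing and its method names are hypothetical, for illustration only.
final class HeapSizing {

    // Default max heap in MB: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)).
    static long defaultMaxHeapMb(long systemMemoryMb) {
        long halfCapped = Math.min(systemMemoryMb / 2, 1024);
        long quarterCapped = Math.min(systemMemoryMb / 4, 8192);
        return Math.max(halfCapped, quarterCapped);
    }

    // True when a hand-configured heap exceeds what the default rule would pick.
    static boolean looksOversized(long configuredHeapMb, long systemMemoryMb) {
        return configuredHeapMb > defaultMaxHeapMb(systemMemoryMb);
    }

    public static void main(String[] args) {
        long systemMemoryMb = 16 * 1024; // example value: a 16 GB node
        long configuredHeapMb = 12 * 1024; // example value: a 12 GB heap
        System.out.println("default max heap (MB): " + defaultMaxHeapMb(systemMemoryMb));
        System.out.println("configured heap oversized? "
                + looksOversized(configuredHeapMb, systemMemoryMb));
    }
}
```

A heap above the default is not automatically wrong, but it is a cue to look at what is left for the OS page cache and off-heap structures.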

Then things came to a head last week. We saw the cliff at the end of the road. We found a “bug” in one of our client applications that was inadvertently introducing an artificial throttle. Fantastic! We make the code change (2 lines of code), do some testing, and release it to production. Bam, we increased our concurrency by orders of magnitude. Uh oh, what’s that? Cassandra is choking?

We started looking at tpstats and cfstats (via nodetool). All seemed relatively okay. What could be expanding our footprint?

Well, we have a boat-load of column families. We’ve evolved the architecture and our data model, and in the newer applications we’ve taken a virtual-keyspaces approach, consolidating data into a single large column family using composite row keys. But alas, the legacy data model remains in production. Many of those column families see very little traffic, but Cassandra still reserves some memory for them. That might have been the culprit, but those column families had been there since the beginning of time. We had to look deeper.
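To make the composite-row-key idea concrete, here is a minimal sketch, assuming a key laid out as tenant:table:id inside one shared column family. The layout, the separator, and every name below are hypothetical; the post does not say how the keys were actually composed.

```java
// Sketch of a "virtual keyspace" row key: many logical tables share one
// physical column family, and the row key carries the tenant and table.
// The tenant:table:id layout is an assumption for illustration.
final class VirtualKeyspaceKey {
    private static final String SEP = ":";

    // Builds the composite key. The id comes last, so it may contain the
    // separator without making the key ambiguous.
    static String rowKey(String tenant, String virtualTable, String id) {
        requireComponent(tenant, "tenant");
        requireComponent(virtualTable, "virtualTable");
        return tenant + SEP + virtualTable + SEP + id;
    }

    // Splits a composite key back into {tenant, virtualTable, id}.
    static String[] parse(String rowKey) {
        String[] parts = rowKey.split(SEP, 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("not a composite key: " + rowKey);
        }
        return parts;
    }

    // Tenant and table names must be non-empty and free of the separator.
    private static void requireComponent(String value, String name) {
        if (value == null || value.isEmpty() || value.contains(SEP)) {
            throw new IllegalArgumentException("invalid " + name + ": " + value);
        }
    }

    public static void main(String[] args) {
        String key = rowKey("acme", "providers", "42");
        System.out.println(key);
        System.out.println(String.join(" | ", parse(key)));
    }
}
```

The appeal of this layout is that adding a logical table costs nothing on the server: no new column family, and so no additional per-column-family memory.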

Way back when, we had the brilliant idea to introduce some server-side AOP code to act as triggers. Initially, we used them to keep indexes in sync: wide-rows at first, and at one point we even kept Elastic Search up to date with server-side triggers. This kept the client-side code simple-stupid. The apps connecting to C* didn’t need to know about any of our indexing mechanisms.

Eventually, we figured out that it was better to control that data flow in the app-layer (via Storm), but we still had AOP code server-side to manage the wide-rows. And despite the fact that I’ve recently been speaking out against our previous approach, that code was still in there. Could that be the root cause? Our wide-rows were certainly getting wider… (into the millions of columns at this point)

One of our crew (kudos to sandrews) found JMeter Cassandra and started hammering away in a non-production environment. We attached a profiler, which exposed our problem — the AOP inside. Fortunately, we had already been working on a patch that removed the AOP from C*. The patch moved the AOP code to the client-side (point-cutting Hector instead of Thrift/Cassandra). We applied the patch and tested away.
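The shape of that move, intercepting writes in the client rather than inside the server, can be sketched with a JDK dynamic proxy standing in for the AOP pointcut. Everything here is hypothetical: RowWriter is an invented interface, not Hector's API, and the index layout (one wide row whose column names are the written row keys) is an assumption about what the trigger maintained.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Sketch of a client-side "trigger": a proxy wraps the write API and adds an
// index write after every insert. A JDK dynamic proxy stands in for the AOP
// pointcut; all names are hypothetical.
final class ClientSideTriggerSketch {

    // Invented stand-in for the client's write API (not Hector's real interface).
    interface RowWriter {
        void insert(String rowKey, String column, String value);
    }

    // Records writes in memory so the sketch runs without a cluster.
    static final class RecordingWriter implements RowWriter {
        final List<String> writes = new ArrayList<>();

        @Override
        public void insert(String rowKey, String column, String value) {
            writes.add(rowKey + "/" + column + "=" + value);
        }
    }

    // Returns a RowWriter that performs the original insert, then adds a
    // column named after the row key to the wide index row. The index write
    // goes straight to the target, so it is not intercepted again.
    static RowWriter withIndexTrigger(RowWriter target, String indexRowKey) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result = method.invoke(target, args);
            if (method.getName().equals("insert")) {
                String writtenRowKey = (String) args[0];
                target.insert(indexRowKey, writtenRowKey, "");
            }
            return result;
        };
        return (RowWriter) Proxy.newProxyInstance(
                RowWriter.class.getClassLoader(),
                new Class<?>[] {RowWriter.class},
                handler);
    }

    public static void main(String[] args) {
        RecordingWriter backend = new RecordingWriter();
        RowWriter writer = withIndexTrigger(backend, "index-row");
        writer.insert("row-1", "name", "alice");
        backend.writes.forEach(System.out::println);
    }
}
```

Application code still calls insert and stays unaware of the indexing, which keeps the original benefit of the triggers, while the extra work runs in the client process instead of inside the Cassandra JVM.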

Voila, C* was humming again, and we all lived happily ever after.

A big thanks to Aaron Morton again for the help. You are a rock star. And to the crew at HMS, it’s an honor to work with such a talented, passionate team.
