Over the past nine months I have delved into the world of providing hardware to support our application teams’ use of the Cassandra datastore. This has turned out to be a somewhat unusual challenge, as the platform is evolving rapidly along with our use of it. Cassandra is a very different beast compared to your traditional RDBMS (as you would expect).

I absolutely love the fact that Cassandra has a clear scaling path to allow massive datasets and it runs on very cheap commodity hardware with local storage. It is built with the expectation of underlying hardware failure. This is wonderful from an operations perspective as it means I can buy extremely cheap “consumer grade” hardware without having to buy “enterprise grade” (whatever that really means) servers and storage for $$$.

Before I dive into my findings, I should point out that this is not a one-size-fits-all solution, as it greatly depends on what your dataset looks like and what your read/write patterns are. Our dataset happens to be billions of exceedingly small records. This means we do an incredible amount of random read i/o. Your mileage may vary depending on what you do with it.

Finding the optimal node size

As usual, spec’ing out hardware for a given application is a matter of balancing five variables:

CPU capacity (taking into account the single/multi threaded aspects of the application)

RAM capacity (how much working space does the application need and how much cache is optimal)

Disk capacity (actual disk storage space)

Disk i/o performance (the number of read and write requests per second that can be handled)

Network capacity (how much bandwidth is needed)

If you run into a bottleneck on any of these five items, any additional capacity that is available within the other four categories is wasted. The procedure to determine the optimal node size is as follows:

Determine which of the five variables is going to be your limiting factor through performance testing

Research the most cost-effective price/performance point for the limiting variable

Spec out hardware to meet the other four variables’ needs relative to the bottleneck

Note that this is a somewhat iterative process as (for example) it may make sense to buy a CPU significantly beyond the price/performance sweet spot (when looking at CPU pricing in a vacuum) as paying for that higher end CPU may allow you to make much better use of the other pieces of the system that would otherwise sit idle. I am not suggesting that most Cassandra shops will be CPU bound, but this is just an example.

There is also fuzziness in this process as there can be some interdependencies between the variables (i.e. increasing system RAM can reduce disk i/o needs due to increased caching).
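As a rough illustration of the balancing act described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the resource names, the idea of a "unit" of workload, and all of the numbers are invented, and real figures have to come from your own performance testing. It simply finds which of the five variables runs out first and how much of the other four would sit idle at that point.

```python
# Hypothetical illustration: all resource names and numbers are made up.
RESOURCES = ("cpu", "ram", "disk_capacity", "disk_io", "network")


def limiting_resource(capacity, demand_per_unit):
    """Return (resource, units): the resource that runs out first and how
    many units of workload the node can carry before it does."""
    units = {r: capacity[r] / demand_per_unit[r] for r in RESOURCES}
    bottleneck = min(units, key=units.get)
    return bottleneck, units[bottleneck]


def idle_fraction(capacity, demand_per_unit):
    """Fraction of each resource left unused once the bottleneck is saturated."""
    _, max_units = limiting_resource(capacity, demand_per_unit)
    return {r: 1 - max_units * demand_per_unit[r] / capacity[r] for r in RESOURCES}


# Made-up node: 8 cores, 48 GB RAM, 3000 GB disk, 300 random IOPS, 1000 Mbit/s.
node = {"cpu": 8, "ram": 48, "disk_capacity": 3000, "disk_io": 300, "network": 1000}
# Made-up cost of one "unit" of workload (say, 1000 requests per second).
per_unit = {"cpu": 1, "ram": 4, "disk_capacity": 200, "disk_io": 150, "network": 50}

bottleneck, units = limiting_resource(node, per_unit)
print(bottleneck, units)
print(idle_fraction(node, per_unit))
```

With these invented numbers the node is disk i/o bound, and three quarters of the CPU would be wasted. The sketch treats the variables as independent, so it does not capture effects like extra RAM reducing disk i/o through caching.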

Nehalem platforms

If you are at all familiar with the current server-platform market then you know that servers based on the Nehalem microarchitecture (you need to read the Wikipedia article) are the platform of choice today, with the Westmere processors being the current revision within that series. In general, the most cost-effective solution when scaling large systems out on Nehalem platforms is to go with dual processor machines, as this gives you twice the amount of processing power and system memory without doubling your costs (i.e. you still only need one motherboard, power supplies, etc…).

All of the major OEMs have structured their mainline platforms around this dual processor model. Note that there ARE situations where dual processors don’t make sense including:

Single threaded applications that can not make use of all those cores and that do NOT need the additional memory capacity.

Applications that are purely Disk i/o or network bound where the additional CPU and memory would be wasted (perhaps a file server).

Applications that need less than a “full” machine (i.e. your DNS/DHCP servers).

In general, I don’t think Cassandra falls into these special use case scenarios, unless you’re just completely i/o bound or network bound and can’t solve that in any way other than adding more nodes. You may need that second processor, however, just for the memory controllers it contains (i.e. it gives you twice as many RAM slots). If you are i/o bound you can consider SSD’s, and if you are network bound you can leverage 10 gigabit network interfaces.

In looking at platforms to run Cassandra on, we wanted a vanilla Nehalem platform to run on, without too many bells and whistles. If you drink the Cassandra kool-aid you will let Cassandra handle all the reliability needs and purchase hardware without node level fault tolerance (i.e. disk RAID). This means putting disks in a RAID 0 (for optimal speed and capacity) but then letting the fact that Cassandra can store multiple copies of the data across other nodes handle fault recovery. We are currently using linux kernel RAID, but may also test hardware RAID 0 that is available on the platform we ended up choosing.

It is shocking to me to see how many OEM’s have come up with platforms that do not have equal numbers of RAM slots per memory channel. News flash folks- In Nehalem it is critical to install memory in equal sets of 3 (or 6 for dual processor) in order to take advantage of memory interleaving. Every server manufactured should have a number of memory slots divisible by three as the current crop of processors has three memory controllers per processor (this may change in the next generation of processors).
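The divisible-by-three rule is easy to check before placing an order. The sketch below is a small Python helper written for this purpose; the function name is my own, and it assumes identical DIMMs and three memory channels per processor as described above.

```python
def dimms_per_channel(dimm_count, sockets, channels_per_socket=3):
    """Return how many DIMMs land on each memory channel if the population
    is balanced, or None if the DIMMs cannot be spread evenly.

    Assumes identical DIMMs and three channels per processor."""
    total_channels = sockets * channels_per_socket
    if dimm_count <= 0 or dimm_count % total_channels != 0:
        return None
    return dimm_count // total_channels


print(dimms_per_channel(12, sockets=2))  # 12 DIMMs across 6 channels
print(dimms_per_channel(18, sockets=2))  # 18 DIMMs across 6 channels
print(dimms_per_channel(8, sockets=2))   # 8 DIMMs cannot be balanced
```

A dual processor board with 12 or 18 DIMMs comes out balanced (two or three per channel), while 8 DIMMs returns None, which is the situation you end up in on boards whose slot count is not a multiple of three per processor.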

A note about chipsets – The Intel 5500 vs. 5520 – The main difference here is just in the number of PCIe paths the chipset provides. They should both provide equivalent performance. The decision point here is made by your OEM and is just based on the number of PCI devices your platform supports.

Our platform choice

In looking at platform options, the following options were lead contenders (there are of course many other possible options, but most are too focused on the enterprise market with features we do not need that just drive costs up):

At first we were looking at 1U machines with 4x 3.5 inch bays (and in fact bought some C1100′s in this configuration) though it turned out that Cassandra was extremely i/o bound which made a small number of large SATA disks impractical. Once we realized we were going to need a larger number of drives we decided to go with 1U platforms that supported 2.5 inch bays as we can put eight to ten 2.5 inch drives in a 1U to give us more spindles (if we go with disks), or more SSD’s (for the disk capacity rather than iops) if we go with SSD’s. It’s also worth noting that the 2.5 inch SATA drives draw a lot less power than the 3.5 inch SATA disks of the same capacity.

We ended up going with the Dell C1100 platforms (over the Supermicro offering) as we already had purchasing relationships with Dell and they have a proven track record of being able to support systems throughout a lifecycle (provide “like” replacement parts, etc…), though on this particular order they fell down in numerous ways (mostly related to their recent outsourcing of production to Mexico) which has caused us to re-evaluate future purchasing plans. In the end, the C1100′s have worked out extremely well thus far, but the speed-bumps along the way were painful. We have not physically tested any Supermicro offerings so perhaps they have as bad (or worse) issues as well.

What we like:

Inexpensive platform

Well-targeted to our needs

Have 18 RAM slots (only populating 12 of them right now with 4 gig sticks)

Dual Intel nic’s not Broadcom

They include out of band controllers

Dual power supplies available (this is the only “redundancy” piece we do purchase)

Low power consumption

Quiet

What we don’t like:

Lead time issues

Rails with clips that easily break

Servers arriving DOA

Using a SAS expander to give 10 bays vs only 8 (we would rather have had the option to only use 8 bays)

They don’t give us the empty drive sleds to add disks later -> force you to purchase from them at astronomical rates

The 2 foot IEC to IEC power cords they sent us were only rated to 125 volts (we use 208 volt exclusively)

Lack of MLC SSD option from factory

OCZ Technology Vertex 2 MLC SSD’s

After purchasing our first round of Dell C1100′s with four SATA disks (one for boot/commit and three in a RAID 0 for data) we rapidly discovered they were EXTREMELY i/o bound. Cassandra does an extremely poor job of bringing pertinent data into memory and keeping it there (across a four node cluster we had nearly 200 gigs of RAM as each node has 48 gigs). Things like the fact that Cassandra invalidates the cache for any data it writes to disk (rather than writing the data into the cache) make it extremely painful. Cassandra also (in 0.6) will do a read on all three nodes (assuming your data is replicated three places) in order to do a read-repair, even if the read consistency level is only set to one. This puts extremely high load on the disks across the cluster in aggregate. I believe in 0.7 you will be able to tune this down to a more reasonable level.
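To put a number on the read-repair effect, here is a deliberately simplified Python model. It is my own back-of-the-envelope sketch, not anything from Cassandra itself: it assumes every replica that is consulted does one disk read (no cache hits), and the 1000 reads per second figure is invented for the example.

```python
def replica_reads_per_second(client_reads_per_second, replication_factor,
                             consistency_level=1, read_repair=True):
    """Estimate cluster-wide replica reads caused by client reads.

    Simplified model: with read repair on, every replica is consulted for
    every read; with it off, only as many replicas as the consistency
    level asks for. Cache hits are ignored."""
    if not 1 <= consistency_level <= replication_factor:
        raise ValueError("consistency_level must be between 1 and replication_factor")
    replicas_touched = replication_factor if read_repair else consistency_level
    return client_reads_per_second * replicas_touched


# Invented workload: 1000 client reads per second, three replicas.
print(replica_reads_per_second(1000, 3, read_repair=True))
print(replica_reads_per_second(1000, 3, read_repair=False))
```

Under those assumptions, reading at a consistency level of one still costs three replica reads per request while read repair is on, which is why the disks across the cluster saw roughly three times the load we expected.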

Our solution was to swap the 1TB SATA disks with 240 gig OCZ Vertex 2 MLC SSD’s which are based on the Sandforce controller. Now normally I would not consider using “consumer grade” MLC SSD’s for an OLTP type application; however, Cassandra is very unusual in that it NEVER does random write i/o operations and instead does everything with large sequential i/o. This is a huge deal because with MLC SSD’s, random writes can rapidly wear out the device: flash can only be erased in large blocks, so editing any data in place requires erasing an entire block and re-writing it.

The Sandforce controller does an excellent job of managing where data is actually placed on the SSD media (it has more space available than what is made available to the O/S so that it can shift where things actually get written). By playing games with how data is written the Sandforce controller is supposed to dramatically improve the lifespan of MLC SSD’s. We will see how it works out over time.

It is unfortunate that Dell does not have an MLC SSD offering, so we ended up buying small SATA disks in order to get the drive sleds, and then going direct to OCZ Technology to buy a ton of their SSD’s. I must say, I have been very happy with OCZ and I am happy to provide contact info if you shoot me an email. I do understand the hesitation Dell has with selling MLC SSD’s, as Cassandra is a very unique use-case (only large sequential writes) and a lot of workloads would probably kill the drives rapidly.

It is also worth noting that our first batch of C1100′s with the 3.5 inch drives were using the onboard Intel ICH10 controller (which has 6 ports), but the second batch of C1100′s with the 10 2.5 inch bays are using an LSI 2008 controller (available on the Dell C1100) with a SAS expander board (since the LSI 2008 only has 8 channels). We are seeing *much* better performance with the LSI 2008 controllers, though that may be simply due to us not having the disks tuned properly on the ICH10 (using native command queueing, DMA mode, etc…) in CentOS 5.5. The OCZ Sandforce based drives are massively fast.
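One quick way to see whether command queueing is actually in effect is to look at the queue depth the kernel reports for each disk. The Python sketch below reads the `device/queue_depth` attribute under `/sys/block`; the function name is my own, it assumes a Linux host with a reasonably modern Python 3 (not the interpreter that ships with CentOS 5.5), and devices that do not expose the attribute are simply skipped.

```python
from pathlib import Path


def queue_depths(sys_block="/sys/block"):
    """Map block device name -> reported queue depth.

    Devices without a device/queue_depth attribute (loop devices, md
    arrays and the like) are left out of the result."""
    depths = {}
    root = Path(sys_block)
    if not root.is_dir():
        return depths
    for dev in sorted(root.iterdir()):
        attr = dev / "device" / "queue_depth"
        try:
            depths[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this device
    return depths


for name, depth in queue_depths().items():
    print(name, depth)
```

As far as I know, a SATA disk reporting a depth of 1 is generally not using NCQ, so that is the first thing I would check on the ICH10 before blaming the controller itself.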

If you are going to have any decent number of machines in your Cassandra cluster I highly recommend keeping spare parts on hand and then just purchasing the slow-boat maintenance contracts (next business day). You *will* lose machines from the cluster due to disk failures, etc (especially since we are using inexpensive parts)… It is much easier to troubleshoot when you can go swap out parts as needed and then follow up after the fact to get the replacement parts.

Networking

Since Cassandra is a distributed data store it puts a lot more load on the network than say monolithic applications like Oracle that generally have all their data backended on FibreChannel SAN’s. Particular care must be taken in network design to ensure you don’t have horrible bottlenecks. In our case, our existing network switches did not have enough available ports and their architecture is 8:1 over-subscribed on each gigabit port, which simply would not do. After much investigation, we decided to go with Arista 7048 series switches.

The Arista 7048 switches are 1U, with 48 ports of copper 1 gig and 4 ports of 10 gig SFP+. This is the same form factor as the Cisco 4948E switches. This form factor is excellent for top-of-rack switching as it provides fully meshed 1 gig connectivity to the servers with 40 gigabit uplink capacity to the core. While the Arista product offering is not as well baked as the Cisco offering (they are rapidly implementing features still), they do have one revolutionary feature that Cisco does not have called MLAG.
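The port counts above make the uplink math easy to sanity check. This small Python sketch (the function name is my own) computes the ratio of host-facing bandwidth to uplink bandwidth for a top-of-rack switch, assuming every port is in use at line rate.

```python
def oversubscription_ratio(host_ports, host_gbps, uplink_ports, uplink_gbps):
    """Host-facing bandwidth divided by uplink bandwidth.

    1.0 means the uplinks can carry everything the hosts can send;
    larger values mean the uplinks are oversubscribed."""
    return (host_ports * host_gbps) / (uplink_ports * uplink_gbps)


# 48 x 1 gig host ports against 4 x 10 gig uplinks.
print(oversubscription_ratio(48, 1, 4, 10))
```

Even with every host port and every uplink in use this works out to 48:40, which is a very different picture from the 8:1 per-port oversubscription on our existing switches. Note that this only describes traffic leaving the rack; traffic between hosts on the same switch never touches the uplinks.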

MLAG stands for “Multi-Chassis Link Aggregation”. It allows you to physically plug your server into two separate Arista switches and run LACP between the server and the switches as if both ports were connected to the same switch. This allows you to use *both* ports in a non-blocking mode giving you full access to the 2 gigabits of bandwidth while still having fault-tolerance in the event a switch fails (of course you would drop down to only 1 gig of capacity). We are using this for *all* of our hosts now (using the linux kernel bonding driver) and indeed it works very well.

MLAG also allows you to uplink your switches back to the core in such a way as to keep all interfaces in a forwarding state (i.e. no spanning-tree blocked ports). This is another great feature, though I do need to point out a couple of downsides to MLAG:

You still have to do all your capacity planning as if you are in a “failed” state. It’s nice to have that extra capacity in case of unexpected conditions, but you can’t count on it if you want to always be fully functional even in the event of a failure.

When running MLAG one of the switches is the “master” that handles LACP negotiation and spanning-tree for the pair of switches. If there is a software fault in that switch it is very possible that it would take down both paths to your servers (in theory the switches can fall back to independent operation, but we are dealing with *software* here).
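The first downside is worth spelling out with numbers. The Python sketch below is my own illustration (function names and the example peak figures are invented): it assumes each bonded link goes to a different switch, so losing one switch removes exactly one link, and it checks a host's peak traffic against the degraded capacity rather than the healthy one.

```python
def bond_capacity_gbps(links, link_gbps=1.0, failed_switches=0):
    """Usable bandwidth of a bond whose links each go to a different switch."""
    surviving = max(links - failed_switches, 0)
    return surviving * link_gbps


def fits_when_degraded(peak_gbps, links=2, link_gbps=1.0):
    """True if peak traffic still fits after losing one switch."""
    return peak_gbps <= bond_capacity_gbps(links, link_gbps, failed_switches=1)


print(bond_capacity_gbps(2))                     # both switches healthy
print(bond_capacity_gbps(2, failed_switches=1))  # one switch down
print(fits_when_degraded(1.5))                   # invented peak of 1.5 Gbit/s
print(fits_when_degraded(0.8))                   # invented peak of 0.8 Gbit/s
```

A host that peaks at 1.5 gigabits looks fine on a healthy 2 gigabit bond but does not survive a switch failure, so for planning purposes each host really has 1 gigabit.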

It is worth noting that we did not go with 10 gig NIC’s and switches as it does not seem necessary yet with our workload and 10 gig is not quite ready for prime time yet (switches are very expensive, the phy’s draw a lot of power, and cabling is still “weird” – either Coax or Fiber or short distance twisted pair over CAT6, or CAT7 / 7a over 100 meters). I would probably consider going with a server platform that had four 1 gig NIC’s still before going to 10 gig. As of yet I have not seen any Cassandra operations take over 100 megabit of network bandwidth (though my graphs are all heavily averaged down so take that with a grain of salt).

Summary

So to recap, we came up with the following:

Dell C1100′s – 10x 2.5 inch chassis with dual power supplies

Dual 2.4 ghz E5620 processors

12 sticks of 4 gig 1066mhz memory for a total of 48 gigs per node (this processor only supports 1066mhz memory)

Notes:

We did not evaluate the low power processors, they may have made sense for Cassandra, but we did not have the time to look into the

We just had our Cassandra cluster lose its first disk, and the data filesystem went read-only on one node, but the Cassandra process kept running and processing requests. I am surprised by this as I am not sure what state the node was in (what was it doing with writes when it came time to write out the memtables?). We manually killed the Cassandra process on the node.

The Dell C1100′s did not come with NUMA mode enabled by default in the BIOS. CentOS 5.5 supports this and so we turned it on. I am not sure how much (if any) performance impact this has on Cassandra.
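Since Cassandra kept serving requests after the data filesystem went read-only, it seems worth having monitoring catch that condition directly. Here is a minimal Python sketch that parses `/proc/mounts` and reports whether a given mount point is currently mounted read-only. The function names and the `/var/lib/cassandra/data` path are assumptions for the example, not necessarily where your data lives.

```python
def is_read_only(mount_point, mounts_text):
    """Return True if mount_point is mounted read-only, False if it is
    read-write, and None if it does not appear in mounts_text at all.

    mounts_text is the content of /proc/mounts. If a mount point is
    listed more than once, the last entry wins. Mount points containing
    spaces appear escaped (\\040) in /proc/mounts and must be passed
    in that form."""
    state = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        state = "ro" in fields[3].split(",")
    return state


def read_mounts(path="/proc/mounts"):
    """Read the kernel's view of mounted filesystems (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Example with canned input; the device and mount point are hypothetical.
sample = (
    "/dev/sda1 / ext3 rw,relatime 0 0\n"
    "/dev/md0 /var/lib/cassandra/data ext3 ro,relatime 0 0\n"
)
print(is_read_only("/var/lib/cassandra/data", sample))
```

On a live node you would call `is_read_only("/var/lib/cassandra/data", read_mounts())` from whatever monitoring system you already have and alert (or stop the Cassandra process) when it returns True.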

Conclusion

This is still a rapidly evolving space so I am sure my opinions will change here in a few months, but I wanted to get some of my findings out there for others to make use of. This solution is most certainly not the optimal solution for everyone (and in fact, it remains to be seen if it is the optimal solution for us), but hopefully it is a useful datapoint for others that are headed down the same path.

Please feel free as always to post questions below that you feel may be useful to others and I will attempt to answer them, or email me if you want contact information for any of the vendors mentioned above.

We use Dell servers at VoltDB and have purchased empty drive trays from a company called SCSI4me with decent success (usually via EBay).

I wasn’t aware of the MLAG feature on the Arista switches. Very cool.

I would be very interested in knowing how the performance of your machines compares to the big machines in EC2. Would you say your post supports the idea that back-room servers with fast networking and SSDs offer a big value proposition over hosted alternatives?

Nice post! Unfortunately very few people take the time to plan their capacity correctly these days – particularly with Cassandra it seems. Thank you for detailing your process and choices.

Fwiw, the latest versions of 0.6.x address some of your issues regarding read repair (you can turn it off) and cache performance (caches can be serialized at intervals to avoid restart overhead). See http://www.riptano.com/blog/whats-new-cassandra-066 for details of both.

Further, with 0.7.0, cache performance is addressed via CASSANDRA-1267 and read repair configuration per CF in CASSANDRA-930. A release candidate for 0.7.0 is currently undergoing a vote on the Apache Cassandra developer list.

Thanks for the tip on the Dell drive sleds. This is an issue I think I need to work out directly with Dell since I am sure SuperMicro sells their chassis full of the proper sleds rather than blanking panels.

I have not performance tested the EC2 boxes so I can’t offer a direct comparison. We already have investment in core datacenter infrastructure (internet routers/connections, firewalls, switching infrastructure) so adding on pieces is not so expensive. I will say that the best comparison to the C1100′s in EC2 is the high-memory quadruple extra large instances and I can purchase the box outright for less than the 1 year reserved instance price (and the C1100 includes 3 yrs of hardware maint).

Of course that does not factor in labor costs, datacenter costs, and other infrastructure costs.

I can say that local SSD’s are crazy fast and as far as I know there is no way to get performance anywhere near that in Amazon. The SSD’s are at least 20x as fast as local SATA disks for random reads.

The analysis I have done thus far indicates I can deploy and maintain hardware in-house (including bandwidth costs) for less than we would pay Amazon for it and with a higher level of consistency. YMMV

Amazon is most cost effective when you have short-term needs for computing and can spin up/down instances when needed.

It’s not like you can realistically shutdown and provision Cassandra nodes dynamically to handle traffic spikes at different times of day.

I should also note that perhaps it’s more cost-effective to rent many smaller Amazon instances vs. fewer large instances for Cassandra. This would require extensive testing. It may also depend on how heavily the other folks provisioned on the same physical box are using it. In reality, that’s what Amazon does. They run the same Nehalem boxes that you can run in house and they chop them up with a hypervisor.

Nice post Eric! You have an inaccurate statement that you may want to fix:
“they do have one revolutionary feature that Cisco does not have called MLAG.”

Cisco supports MEC (Multi-chassis EtherChannel) on Cat6k VSS; it provides an even better solution as it doesn’t have the drawbacks you presented, and you can also configure L3 MEC. In the Nexus family, Cisco supports VPC (Virtual Port Channels), which does the same as MLAG.

Indeed you are correct, I am well aware of MEC on the Cat6k VSS switches and VPC in the Nexus family. I was referring to a comparable feature not being available in the Cisco 4948E which was the Cisco alternative I was considering.

For this project I was uninterested in a large chassis switch (such as the Cat6k with VSS) as I only had a limited number of hosts needing connectivity (I also did not want the power overhead, and my understanding is that the 6500 is being relegated to campus switching going forward and away from datacenter switching). I wanted to go top-of-rack as 1 gig host links with 10 gig links back to the core are very cost effective and it makes cabling easy. If I was going with 10 gig to the hosts I would be doing big chassis switches as 40 gig and 100 gig uplinks are not commercially viable yet and aggregating a ton of 10 gig does not make sense. Also, I am not 100% up on the VSS architecture but I have concerns that a software fault (bug) on the primary switch could take down both chassis simultaneously (this same argument could be made about the Arista’s).

As far as Nexus goes, I had no need or desire to combine my SAN and network fabric and in fact, the whole point of Cassandra is to avoid expensive “enterprise grade” things like SAN’s in the first place. I also strongly dislike the vendor-lock-in aspects of Nexus (i.e. you must hang it off a Nexus 5k chassis, which means some day when I want to waterfall these devices down to be out of band controller aggregation switches they will be useless without a corresponding Nexus 5k with its associated power draw and maintenance costs). Not to even mention that their switching capacity is limited since the FEX is dumb and can’t make any switching decisions on its own and must one-arm traffic through the Nexus 5k (correct me if I am wrong) to go between two Cassandra nodes attached to one FEX. The Arista is fully non-blocking on all ports.

I did evaluate Cisco options before purchasing Arista for this project, but I ruled out 6500′s due to cost/power/space available (also I am unclear on their growth path to 10 gig host switching at this time), and I ruled out Nexus partly due to disliking the architecture (though it would certainly work), but more because it was just way too expensive at the scale I needed, and I eliminated the 4948′s due to the lack of an MLAG like feature but moreover due to Cisco’s lack of a high density 10 gig core option to aggregate them like Arista has (the closest I am aware of in the Cisco product line is the 4948M).

My mind is (as always) open to change. Every environment’s needs are different, so this is not one-size-fits-all.

We are landing on a config very similar to this, but currently testing with Intel SSDs.

The sandforce controllers have a somewhat shaky reputation due to stability issues caused by some firmware releases (but the main issue has been related to power saving/hibernation which is not a big issue for servers of course).

We are running all RAID0. Initially we were using C1100′s with 3.5 inch bays and so we were using the onboard Intel ICH10R controllers with Linux kernel RAID 0. When we moved to the LSI controllers we kept using Linux RAID 0 since that was how we had our kickstart environment built, though using the LSI controller to do the RAID 0 would have probably made more sense.

All of that being said, I must post an update to the article above. Under high load we have discovered a SERIOUS issue whereby each day one or two of our Cassandra systems (out of a cluster size of 4) will have its file system go read-only due to some underlying i/o subsystem error. The LSI controller logs an error code that we cannot find documentation for.

In order to get our cluster stable we have manually cabled the OCZ drives to the onboard Intel ICH10R controller (which is much slower, at least without tuning things like NCQ which we have not yet done). This is a serious problem with the hardware config described above and I am hoping that Dell/LSI/OCZ will step up to the plate and help us resolve the issue. Dell does not technically support these drives since they don’t sell them, and LSI won’t support us directly as the LSI2008 is OEM’d by Dell. Perhaps Supermicro servers are looking better by the day…

Another update (long overdue)- We eventually ended up moving back to the Dell LSI 2008 series controller by purchasing hydra cables that go from a 4 lane SAS/SATA connector out to four individual SATA connectors that could be plugged directly into the midplane that the drives slide into. This allows us to use up to eight of the drive slots for SSD’s without going through the SAS expander which was causing us grief. We have now been running for months and have not lost ANY of our OCZ drives.

You’re probably not watching this thread any more, but we’re looking at a very similar build for our key-value store. Did the OCZs end up being as reliable for that workload as you thought they would be? Real durability data is impossible to find these days. Also, did you use the LSI for RAID on the SSDs? I didn’t think you could pass TRIM that way. We can’t find info on that anywhere either.

The OCZ’s have ended up working very well. Zero defect rate on the first cluster. On the second cluster, we did have a couple that failed in the first week, but actually, we never went back and tested them. For all we know they could have just needed re-seating, or it could have been a Dell C1100 issue.

We are still running on all the original Vertex 2′s and last I checked, they all reported plenty of life remaining with the three I just randomly sampled indicating over 90% left (no idea how accurate that is).

Now for the down sides:
1. We had to bypass the SAS expander in the C1100′s because we had issues and Dell refused to support us (understandably). This is a hack, and it means we can only use 8 drives instead of 10 (not that we cared for our application).
2. Last I checked, OCZ did not have a Linux firmware update utility and so the only way to update them is to remove them and put them in a Windows box one at a time. There have been MANY updates since we bought them (several of them involving data loss issues) but because of the pain factor we have never updated them.
3. These models do not have super-capacitors in them to protect from data loss during power failures, so frankly, I have no idea what happens if the datacenter loses power. Does our entire cluster brick? I have been meaning to test this in QA.
4. Since these are consumer grade devices, they are on a consumer lifecycle. I may not be able to get them in the future when I want to upgrade my cluster.

I am happy to share more on this topic if you have further questions.

P.S. We don’t use TRIM yet since we are not running a new enough kernel version. Not sure if the LSI2008 will even allow it. That’s a good question.

Thanks for this article. As an employee of Dell, I’m always interested in seeing how people in the real world design solutions. I’m curious as to what your CPU design ended up being (single Quad core?). Also, did the 48GB RAM perform to your needs, or could you go with less?

@eprosenx
I’m not from a particular product group – I’m a server specialist on the Global 500 sales team. The Dell PowerEdge C product line is the best suited product line for Apache Cassandra, so I’m trying to get some real world examples to reference.

I would love to chat with you offline. My Twitter name is @Kevin_Houston – send me a message with your Twitter name and I’ll DM you my contact info.

I am from the C Series team at Dell. I am a Sales Specialist (not an engineer), and I have had numerous project discussions with customers around new Cassandra projects of late. I’d like to catch up with you on what server platform you would lean towards if you were building the environment from scratch today. New PowerEdge C form factors have been released recently, so I’d like to know which of these Cassandra would like. From a pure core density perspective, nobody can touch the C6145. 2x 4 socket 8-core (soon to be 12 core) server nodes in a shared 2U chassis packs in 64 cores aggregate. Is there any particular concern with the 4 socket footprint in how Cassandra interacts with the CPU and Nodes?