Cassandra documentation from DataStax

Introduction

This document aims to provide a few easy-to-follow steps that take the first-time user from installation, to running a single-node Cassandra instance, to an overview of configuring a multinode cluster. Cassandra is meant to run on a cluster of nodes, but will run equally well on a single machine. This is a handy way of getting familiar with the software while avoiding the complexities of a larger system.

Step 0: Prerequisites and connection to the community

Cassandra requires the most stable version of Java 1.6 you can deploy. For Sun's JVM, this means at least u19; u21 is better. Cassandra also runs on the IBM JVM, and should run on JRockit as well.

The best way to ensure you always have up to date information on the project, releases, stability, bugs, and features is to subscribe to the users mailing list (subscription required) and participate in the #cassandra channel on IRC.

Step 1: Download Cassandra Kit

Download links for the latest stable release can always be found on the website.

Users of Debian or Debian-based derivatives can install the latest stable release in package form; see DebianPackaging for details.

Step 2: Edit configuration files

Step 2.1: Edit cassandra.yaml

The distribution's sample configuration conf/cassandra.yaml contains reasonable defaults for single node operation, but you will need to make sure that the paths exist for data_file_directories, commitlog_directory, and saved_caches_directory.
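A quick way to confirm those paths exist before the first start is a small script like the one below. It is only a sketch: the three directory paths are assumed example values, not read from your cassandra.yaml, so substitute whatever your configuration lists. The helper name missing_or_unwritable is hypothetical.

```python
import os

# Assumed example paths; replace them with the values from your conf/cassandra.yaml.
CASSANDRA_DIRS = [
    "/var/lib/cassandra/data",
    "/var/lib/cassandra/commitlog",
    "/var/lib/cassandra/saved_caches",
]


def missing_or_unwritable(paths):
    """Return the paths that are not existing, writable directories."""
    problems = []
    for path in paths:
        if not os.path.isdir(path) or not os.access(path, os.W_OK):
            problems.append(path)
    return problems


if __name__ == "__main__":
    for path in missing_or_unwritable(CASSANDRA_DIRS):
        print("missing or not writable: " + path)
```

Run it as the same user that will run Cassandra, since write permission is checked for the current user.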

Verify that storage_port and rpc_port do not conflict with other services on your computer. By default, Cassandra uses 7000 for storage_port and 9160 for rpc_port. The storage_port must be identical on all Cassandra nodes in a cluster. Cassandra client applications use rpc_port to connect to Cassandra.
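If you are not sure whether another service already holds one of these ports, a check along the following lines may help. This is a minimal sketch that assumes the default port numbers given above and tests only the loopback address; the function name port_is_free is hypothetical. Run it while Cassandra is stopped, otherwise Cassandra itself will show up as the conflict.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind host:port ourselves, i.e. nothing else holds it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # Default ports named in the text above.
    for name, port in [("storage_port", 7000), ("rpc_port", 9160)]:
        state = "free" if port_is_free(port) else "already in use"
        print("%s %d: %s" % (name, port, state))
```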

It is a good idea to change cluster_name to avoid accidental conflicts with existing clusters.

You can leave initial_token blank, but setting it to 0 is recommended if you are configuring your first node.

Step 2.2: Edit log4j-server.properties

conf/log4j-server.properties contains the path of the log file. Edit the line if you need to.

# Edit the next line to point to your logs directory
log4j.appender.R.File=/var/log/cassandra/system.log

Step 2.3: Edit cassandra-env.sh

Cassandra exposes a JMX (Java Management Extensions) interface, and JMX_PORT is defined in conf/cassandra-env.sh. Edit the following line if you need to.

# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT="7199"

By default, Cassandra allocates heap based on the physical memory in your system. For example, it will allocate a 1GB heap on a 2GB system and a 2GB heap on an 8GB system. To specify the heap size yourself, remove the leading pound sign (#) from the following lines and set the memory sizes.

#MAX_HEAP_SIZE="4G"
#HEAP_NEWSIZE="800M"
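To get a feel for how the automatic sizing scales, here is a small sketch. Note that the formula in it, max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8192MB)), is an assumption chosen because it reproduces both figures quoted above (1GB on a 2GB system, 2GB on an 8GB system). The calculation in your own conf/cassandra-env.sh may differ by version, so treat that file as the authority. The function name default_max_heap_mb is hypothetical.

```python
def default_max_heap_mb(system_memory_mb):
    """Assumed rule: max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB))."""
    half = system_memory_mb // 2
    quarter = system_memory_mb // 4
    return max(min(half, 1024), min(quarter, 8192))


if __name__ == "__main__":
    # The two system sizes used as examples in the text.
    for ram_gb in (2, 8):
        heap_mb = default_max_heap_mb(ram_gb * 1024)
        print("%d GB RAM -> %d MB heap" % (ram_gb, heap_mb))
```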

If you are not familiar with Java GC, 1/4 of MAX_HEAP_SIZE may be a good starting point for HEAP_NEWSIZE.

Cassandra will need more than a few GB of heap for production use, but you can run it with a smaller footprint for a test drive. To assign 96MB as the maximum, edit the lines as follows.

MAX_HEAP_SIZE="96M"
HEAP_NEWSIZE="24M"

If you encounter OutOfMemory exceptions or excessive garbage collection with this configuration, increase these values. Don't start your production service with such a tiny heap configuration!
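The one-quarter rule of thumb for HEAP_NEWSIZE can be sketched as a tiny helper. This is an illustration only: the function name suggest_heap_newsize is hypothetical, and it accepts only whole-number sizes with an M or G suffix, like the values shown above.

```python
import re


def suggest_heap_newsize(max_heap_size):
    """Return one quarter of a size such as "96M" or "4G", expressed in megabytes."""
    match = re.match(r"^(\d+)([MG])$", max_heap_size)
    if match is None:
        raise ValueError("expected a value like 96M or 4G: %r" % max_heap_size)
    amount, unit = int(match.group(1)), match.group(2)
    megabytes = amount * 1024 if unit == "G" else amount
    return "%dM" % (megabytes // 4)


if __name__ == "__main__":
    print(suggest_heap_newsize("96M"))  # prints 24M, matching the test-drive values
    print(suggest_heap_newsize("4G"))   # prints 1024M
```

It is a starting point rather than a tuning recommendation; adjust after watching GC behaviour under your own workload.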

Note for Mac Users:

Some people running OS X have trouble getting Java 6 to work. If you've kept up with Apple's updates, Java 6 should already be installed (it comes in Mac OS X 10.5 Update 1). Unfortunately, Apple does not default to using it. What you have to do is change your JAVA_HOME environment setting to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home and add /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin to the beginning of your PATH.

Step 3: Start up Cassandra

And now for the moment of truth: start up Cassandra by invoking bin/cassandra -f from the command line. The service should start in the foreground and log gratuitously to standard output. Assuming you don't see messages with scary words like "error" or "fatal", or anything that looks like a Java stack trace, chances are you've succeeded.

Press "Control-C" to stop Cassandra.

If you start Cassandra without the "-f" option, it will run in the background, so you will need to kill the process to stop it.
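When Cassandra runs in the background you cannot watch standard output, so the check for "scary words" from Step 3 has to be done against the log file instead. The sketch below shows one way to do that. The regular expression is a rough assumption about what a worrying line looks like (the words "error" or "fatal", or an indented "at package.Class.method(" stack-trace frame), and the names SUSPICIOUS and suspicious_lines are hypothetical.

```python
import re
import sys

# The words the text calls "scary", plus a rough pattern for Java stack-trace frames.
SUSPICIOUS = re.compile(r"\b(error|fatal)\b|^\s+at [\w.$]+\(", re.IGNORECASE)


def suspicious_lines(lines):
    """Return the lines that mention error/fatal or look like a stack-trace frame."""
    return [line for line in lines if SUSPICIOUS.search(line)]


if __name__ == "__main__":
    # Pass the log file path as the first argument,
    # e.g. the path configured in Step 2.2.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as log_file:
            for line in suspicious_lines(log_file):
                sys.stdout.write(line)
```

An empty result is a good sign but not proof of health; it only means nothing matched the pattern.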

Step 4: Using cassandra-cli

bin/cassandra-cli is an interactive command line interface for Cassandra. You can define a schema and store and fetch data with the tool. Run the following command to connect to your Cassandra instance.

You have inserted a row into the Users column family. The row key is '1234', and we set two columns in the row, named 'name' and 'password'. 'utf8()' means that the data is treated as a UTF8 string. Refer to 'help set;' for more details. Now let's try to fetch the data you inserted.

Please note that we didn't use "utf8()" for the row key this time. You can define the data type as metadata of the column family. Check 'help update column family;' and 'help create column family;' for more details.

To be certain though, take some time to try out the examples in CassandraCli before moving on. Also, if you run into problems, Don't Panic: calmly proceed to If Something Goes Wrong.

Users of recent Linux distributions and Mac OS X Snow Leopard should be able to start up Cassandra simply by untarring and invoking bin/cassandra -f with root privileges. Snow Leopard ships with Java 1.6.0 and does not require changing the JAVA_HOME environment variable or adding any directory to your PATH. On Linux, just make sure you have a working Java JDK package installed, such as openjdk-6-jdk on Ubuntu Lucid Lynx.

Configuring Multinode Cluster

Now you have a single working Cassandra node. It is a Cassandra cluster that has only one node. By adding more nodes, you can make it a multinode cluster.

Setting up a Cassandra cluster is almost as simple as repeating the above procedures for each node in your cluster. There are a few minor exceptions though.

Cassandra nodes exchange information about one another using a mechanism called Gossip, but to get the ball rolling, a newly started node needs to know of at least one other node; this is called a Seed. It's customary to pick a small number of relatively stable nodes to serve as your seeds, but there is no hard-and-fast rule here. Do make sure that each seed also knows of at least one other. Remember, the goal is to avoid a chicken-and-egg scenario and provide an avenue for all nodes in the cluster to discover one another.
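The "each seed knows of at least one other" rule is easy to get wrong when configurations are edited by hand, so here is a sketch of a sanity check. It assumes you have collected, for each seed, the seed list configured on that node. It reads "one other" as "one other seed", which is an interpretation rather than something the rule spells out. The function name and the addresses are hypothetical.

```python
def seeds_missing_a_peer(seed_config):
    """Return the seeds whose own seed list names no *other* seed node.

    seed_config maps each seed's address to the list of seeds configured on it.
    """
    all_seeds = set(seed_config)
    isolated = []
    for node, configured in seed_config.items():
        others = (set(configured) & all_seeds) - {node}
        if not others:
            isolated.append(node)
    return sorted(isolated)


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    config = {
        "10.0.0.1": ["10.0.0.1", "10.0.0.2"],
        "10.0.0.2": ["10.0.0.2"],  # lists only itself: chicken-and-egg risk
    }
    print(seeds_missing_a_peer(config))  # prints ['10.0.0.2']
```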

In addition to seeds, you'll also need to configure the IP interfaces to listen on for Gossip and Thrift (listen_address and rpc_address, respectively). Use a listen_address that will be reachable from the listen_address used on all other nodes, and an rpc_address that will be accessible to clients.

One other thing you need to take care of in a multinode cluster is the token. Each node in the cluster owns a part of the token range from 0 to 2^127-1. If the Nth node in the cluster has token value T(N), the node owns the range from T(N-1)+1 to T(N). Cassandra decides which nodes should store a piece of data based on a consistent mapping of the row key onto the token range (refer to RandomPartitioner, ByteOrderedPartitioner).
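The ownership rule can be made concrete with a short sketch. It covers only the range lookup described above, where the node with token T(N) owns T(N-1)+1 through T(N), and assumes that anything above the highest token wraps around to the node with the lowest token. How a row key is turned into a token is left to the partitioner and is not modelled here. The function name owner_index is hypothetical.

```python
from bisect import bisect_left


def owner_index(sorted_tokens, token):
    """Return the index of the node that owns `token`.

    sorted_tokens holds one token per node, in ascending order. The node with
    token T(N) owns T(N-1)+1 .. T(N); a token above the highest node token
    wraps around to the first node (an assumption of this sketch).
    """
    index = bisect_left(sorted_tokens, token)
    return index % len(sorted_tokens)


if __name__ == "__main__":
    num_node = 4
    tokens = [2**127 // num_node * n for n in range(num_node)]
    for t in (0, 1, tokens[1], tokens[3] + 1):
        print("token %d -> node %d" % (t, owner_index(tokens, t)))
```

With four evenly spaced tokens, token 0 belongs to node 0, tokens 1 through T(1) belong to node 1, and tokens above T(3) wrap back to node 0.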

A token can be assigned to a node with the initial_token parameter in cassandra.yaml. The parameter is effective only at the first boot of the node. Once a node has been booted, use the 'nodetool move' command to change its assigned token. You need to specify an appropriate initial_token for each node to balance the data load across the nodes. Here is a Python script to calculate balanced tokens.

# Number of nodes in the cluster
num_node = 4
for n in range(num_node):
    # Integer division keeps the token exact on both Python 2 and Python 3
    print(2**127 // num_node * n)

Once everything is configured and the nodes are running, use the bin/nodetool ring utility to verify a properly connected cluster. For example:

If you don't yet have access to hardware for a Cassandra cluster, you can try it out on EC2 with CloudConfig.

For more details about configuring a multinode cluster, please refer to MultinodeCluster.

Write your application

The recommended way to communicate with Cassandra in your application is to use a higher-level client. These provide language-specific APIs for talking to Cassandra in a variety of programming languages. The details will vary depending on the language and client, but in general, using a higher-level client means that you write less code and get several features for free that you would otherwise have to write yourself.

That said, it is useful to know that Cassandra uses Thrift for its external client-facing API. Thrift supports a wide variety of languages, so you can code your application to use Thrift directly if you so choose (but again, we recommend a high-level client where available).

Important note: If you intend to use Thrift directly, you need to install a version of Thrift that matches the revision that your version of Cassandra uses. See InstallThrift for details.

Cassandra's main API/RPC/Thrift port is 9160 by default, which is defined as rpc_port in cassandra.yaml. It is a common mistake for API clients to connect to the JMX port instead.
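Before debugging a client, it can help to confirm that something is actually listening on the port you are pointing it at. The sketch below only opens and closes a TCP connection. It does not speak Thrift, so a True result shows the port is reachable, not that it is the right service. The function name can_connect is hypothetical, and the host and port are the defaults assumed from the text.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True


if __name__ == "__main__":
    # 9160 is the default rpc_port; clients should use it, not the JMX port.
    print("rpc_port 9160 reachable: %s" % can_connect("localhost", 9160))
```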

Checking out a demo application like Twissandra (Python + Django) will also be useful.

If Something Goes Wrong

If you followed the steps in this guide and failed to get up and running, we'd love to help. Here's what we need.

If you are running anything other than a stable release, please upgrade first and see if you can still reproduce the problem.

Make sure debug logging is enabled (hint: conf/log4j-server.properties) and save a copy of the output.

Search the mailing list archive to see if anyone has reported a similar problem and what resolution, if any, they received.