令牌选择

With order preserving partitioners, your key distribution will be application-dependent. You should still take your best guess at specifying initial tokens (guided by sampling actual data, if possible), but you will be more dependent on active load balancing (see below) and/or adding new nodes to hot spots.

Once data is placed on the cluster, the partitioner may not be changed without wiping and starting over.

Replication

A Cassandra cluster always divides up the key space into ranges delimited by Tokens as described above, but additional replica placement is customizable via IReplicaPlacementStrategy in the configuration file. The standard strategies are

RackUnawareStrategy: replicas are always placed on the next (in increasing Token order) N-1 nodes along the ring

RackAwareStrategy: replica 2 is placed in the first node along the ring the belongs in another data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the same rack as the first

Note that with RackAwareStrategy, succeeding nodes along the ring should alternate data centers to avoid hot spots. For instance, if you have nodes A, B, C, and D in increasing Token order, and instead of alternating you place A and B in DC1, and C and D in DC2, then nodes C and A will have disproportionately more data on them because they will be the replica destination for every Token range in the other data center.

The corollary to this is, if you want to start with a single DC and add another later, when you add the second DC you should add as many nodes as you have in the first rather than adding a node or two at a time gradually.

Replication factor is not really intended to be changed in a live cluster either, but increasing it may be done if you (a) read at ConsistencyLevel.QUORUM or ALL (depending on your existing replication factor) to make sure that a replica that actually has the data is consulted, (b) are willing to accept downtime while anti-entropy repair runs (see below), or (c) are willing to live with some clients potentially being told no data exists if they read from the new replica location(s) until repair is done.

Network topology

Besides datacenters, you can also tell Cassandra which nodes are in the same rack within a datacenter. Cassandra will use this to route both reads and data movement for Range changes to the nearest replicas. This is configured by a user-pluggable EndpointSnitch class in the configuration file.

EndpointSnitch is related to, but distinct from, replication strategy itself: RackAwareStrategy needs a properly configured Snitch to place replicas correctly, but even absent a Strategy that cares about datacenters, the rest of Cassandra will still be location-sensitive.

Range changes

Bootstrap

Adding new nodes is called "bootstrapping."

To bootstrap a node, turn AutoBootstrap on in the configuration file, and start it.

If you explicitly specify an InitialToken in the configuration, the new node will bootstrap to that position on the ring. Otherwise, it will pick a Token that will give it half the keys from the node with the most disk space used, that does not already have another node bootstrapping into its Range.

Important things to note:

You should wait long enough for all the nodes in your cluster to become aware of the bootstrapping node via gossip before starting another bootstrap. The new node will log "Bootstrapping" when this is safe, 2 minutes after starting. (90s to make sure it has accurate load information, and 30s waiting for other nodes to start sending it inserts happening in its to-be-assumed part of the token ring.)

Relating to point 1, one can only boostrap N nodes at a time with automatic token picking, where N is the size of the existing cluster. If you need to more than double the size of your cluster, you have to wait for the first N nodes to finish until your cluster is size 2N before bootstrapping more nodes. So if your current cluster is 5 nodes and you want add 7 nodes, bootstrap 5 and let those finish before boostrapping the last two.

As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their Token Range to a newly added node. Run nodetool cleanup on the source node(s) (neighboring nodes that shared the same subrange) when you are satisfied the new node is up and working. If you do not do this the old data will still be counted against the load on that node and future bootstrap attempts at choosing a location will be thrown off.

When bootstrapping a new node, existing nodes have to divide the key space before beginning replication. This can take awhile, so be patient.

During bootstrap, a node will drop the Thrift port and will not be accessible from nodetool.

Bootstrap can take many hours when a lot of data is involved. See Streaming for how to monitor progress.

Cassandra is smart enough to transfer data from the nearest source node(s), if your EndpointSnitch is configured correctly. So, the new node doesn't need to be in the same datacenter as the primary replica for the Range it is bootstrapping into, as long as another replica is in the datacenter with the new one.

Bootstrap progress can be monitored using nodetool with the streams argument.

During bootstrap nodetool may report that the new node is not receiving nor sending any streams, this is because the sending node will copy out locally the data they will send to the receiving one, which can be seen in the sending node through the the "AntiCompacting... AntiCompacted" log messages.

Moving or Removing nodes

Removing nodes entirely

You can take a node out of the cluster with nodetool decommission to a live node, or nodetool removetoken (to any other machine) to remove a dead one. This will assign the ranges the old node was responsible for to other nodes, and replicate the appropriate data there. If decommission is used, the data will stream from the decommissioned node. If removetoken is used, the data will stream from the remaining replicas.

No data is removed automatically from the node being decommissioned, so if you want to put the node back into service at a different token on the ring, it should be removed manually.

Moving nodes

nodetool move: move the target node to a given Token. Moving is essentially a convenience over decommission + bootstrap.

Load balancing

nodetool loadbalance: also essentially a convenience over decommission + bootstrap, only instead of telling the target node where to move on the ring it will choose its location based on the same heuristic as Token selection on bootstrap.

The status of move and balancing operations can be monitored using nodetool with the streams argument.

Repairing missing or inconsistent data

Cassandra repairs data in two ways:

Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas.

Anti-Entropy: when nodetool repair is run, Cassandra computes a Merkle tree of the data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is is very network efficient).

Running nodetool repair: Like all nodetool operations, repair is non-blocking; it sends the command to the given node, but does not wait for the repair to actually finish. You can tell that repair is finished when (a) there are no active or pending tasks in the CompactionManager, and after that when (b) there are no active or pending tasks on AE-SERVICE-STAGE.

Repair should be run against one machine at a time. (This limitation will be fixed in 0.7.)

Handling failure

If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to deal with any inconsistent data. Remember though that if a node is down longer than your configured GCGraceSeconds (default: 10 days), it could have missed remove operations permanently. Unless your application performs no removes, you should wipe its data directory, re-bootstrap it, and removetoken its old entry in ghe ring (see below).

If a node goes down entirely, then you have two options:

(Recommended approach) Bring up the replacement node with a new IP address, and AutoBootstrap set to true in storage-conf.xml. This will place the replacement node in the cluster and find the appropriate position automatically. Then the bootstrap process begins. While this process runs, the node will not receive reads until finished. Once this process is finished on the replacement node, run nodetool removetoken once, supplying the token of the dead node, and nodetool cleanup on each node.

You can obtain the dead node's token by running nodetool ring on any live node, unless there was some kind of outage, and the others came up but not the down one -- in that case, you can retrieve the token from the live nodes' system tables.

(Alternative approach) Bring up a replacement node with the same IP and token as the old, and run nodetool repair. Until the repair process is complete, clients reading only from this node may get no data back. Using a higher ConsistencyLevel on reads will avoid this.

The reason why you run nodetool cleanup on all live nodes is to remove old Hinted Handoff writes stored for the dead node.

Backing up data

Cassandra can snapshot data while online using nodetool snapshot. You can then back up those snapshots using any desired system, although leaving them where they are is probably the option that makes the most sense on large clusters.

With some combinations of operating system/jvm you may receive an error related to the inability to create a process during the snapshotting, such as this on Linux

Remain calm. The operating system is trying to allocate for the "ln" process a memory space as large as the parent process (the cassandra server), even if it's not going to use it. So if you have a machine with 8GB of RAM and no swap, and you gave 6 to the cassandra server, it will fail during this because the operating system will wan 12 GB of memory before allowing you to create the process.

This can be worked around depending on the operating system by either creating a swap file, snapshotting, turning it off or by turning on "memory overcommit". Since the child process memory is the same as the parent, until it performs an exec("ln") call the operating system will not use the new memory and will just refer to the old one, and everything will work.

Currently, only flushed data is snapshotted (not data that only exists in the commitlog). Run nodetool flush first and wait for that to complete, to make sure you get all data in the snapshot.

To revert to a snapshot, shut down the node, clear out the old commitlog and sstables, and move the sstables from the snapshot location to the live data directory.

Consistent backups

You can get an eventually consistent backup by flushing all nodes and snapshotting; no individual node's backup is guaranteed to be consistent but if you restore from that snapshot then clients will get eventually consistent behavior as usual.

There is no such thing as a consistent view of the data in the strict sense, except in the trivial case of writes with consistency level = ALL.

Import / export

As an alternative to taking snapshots it's possible to export SSTables to JSON format using the bin/sstable2json command:

Usage: sstable2json [-f outfile] <sstable> [-k key [-k key [...]]]

bin/sstable2json accepts as a required argument, the full path to an SSTable data file, (files ending in -Data.db), and an optional argument for an output file (by default, output is written to stdout). You can also pass the names of specific keys using the -k argument to limit what is exported.

Note: If you are not running the exporter on in-place SSTables, there are a couple of things to keep in mind.

The corresponding configuration must be present (same as it would be to run a node).

SSTables are expected to be in a directory named for the keyspace (same as they would be on a production node).

JSON exported SSTables can be "imported" to create new SSTables using bin/json2sstable:

Usage: json2sstable -K keyspace -c column_family <json> <sstable>

bin/json2sstable takes arguments for keyspace and column family names, and full paths for the JSON input file and the destination SSTable file name.

You can also import pre-serialized rows of data using the BinaryMemtable interface. This is useful for importing via Hadoop or another source where you want to do some preprocessing of the data to import.

NOTE: Starting with version 0.7, json2sstable and sstable2json must be run in such a way that the schema can be loaded from system tables. This means that cassandra.yaml must be found in the classpath and refer to valid storage directories.

Monitoring

Cassandra exposes internal metrics as JMX data. This is a common standard in the JVM world; OpenNMS, Nagios, and Munin at least offer some level of JMX support. The specifics of the JMX Interface are documented at JmxInterface.

Running nodetool cfstats can provide an overview of each Column Family, and important metrics to graph your cluster. Some folks prefer having to deal with non-jmx clients, there is a JMX-to-REST bridge available at http://code.google.com/p/polarrose-jmx-rest-bridge/

Important metrics to watch on a per-Column Family basis would be: Read Count, Read Latency, Write Count and Write Latency. Pending Tasks tell you if things are backing up. These metrics can also be exposed using any JMX client such as jconsole

You can also use jconsole, and the MBeans tab to look at PendingTasks for thread pools. If you see one particular thread backing up, this can give you an indication of a problem. One example would be ROW-MUTATION-STAGE indicating that write requests are arriving faster than they can be handled. A more subtle example is the FLUSH stages: if these start backing up, cassandra is accepting writes into memory fast enough, but the sort-and-write-to-disk stages are falling behind.

If you are seeing a lot of tasks being built up, your hardware or configuration tuning is probably the bottleneck.

Running nodetool tpstats will dump all of those threads to console if you don't want to use jconsole. Example: