How to Set Up and Monitor a Multi-Node Cassandra and Hadoop Cluster on Linux

Would you like to stand up a scalable, distributed database system that can manage big data in real time with Apache Cassandra, and also analyze that same data via Apache Hadoop? If so, good, because that's exactly what I'll be showing you how to do in this article.

For this tutorial, I'll be using four CentOS 6.0 Linux machines that I've provisioned on Rackspace. Each has 2GB of RAM and 80GB of disk space. If you're not using CentOS, don't worry: the process below works on any Linux distribution (Red Hat, Ubuntu, Debian, and so on).

For this exercise, we’ll designate that two of our nodes will manage real-time data with Cassandra, and the other two will be assigned analytic operations with Hadoop. I’ll also be using DataStax OpsCenter to manage and monitor the Cassandra and Hadoop cluster.

Prerequisites

Before you install the Apache Cassandra/Hadoop software, there are a few prerequisites that you’ll want to make sure are on your boxes before you begin. Specifically, they include:

Java 1.6 or higher

Python 2.6 or higher

OpenSSL

You can check these prerequisites on your Linux boxes by querying each tool's version from the command line.
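For example, a quick check like the sketch below prints the first line of each tool's version output, or a warning if the tool isn't on the PATH:

```shell
# Check each prerequisite's version; warn if a tool is missing.
# (java and some python builds print their version on stderr, hence the 2>&1.)
for cmd in "java -version" "python --version" "openssl version"; do
  tool=${cmd%% *}
  if command -v "$tool" >/dev/null 2>&1; then
    $cmd 2>&1 | head -n 1
  else
    echo "WARNING: $tool not found"
  fi
done
```

Compare the versions printed against the minimums above before going further.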

To improve performance, you can also install JNA, which helps with memory management. JNA is optional, but if you’d like to install it, see the DataStax documentation for the few required steps.

Also, if you have a firewall enabled, be sure that you open ports so that all your machines have access to one another. If you don’t do this, when you start Cassandra or Hadoop on a node, it will act as a standalone database server rather than as a database cluster.

The ports you need to open for Cassandra are 7000 and 9160. For Hadoop, you need to open 8012, 9290, 50030, and 50031. For OpsCenter, you need to open 7199, 8888, 61620, and 61621.

For each port you need to open, you can use the iptables command similar to this:

iptables -A INPUT -p tcp --dport 7000 -j ACCEPT

You’ll want to enter a command like the above on all machines that make up your cluster and for all ports that need to be opened.

Also, a quick note on CentOS and RHEL systems: their default iptables configuration often contains a catch-all "REJECT all -- anywhere anywhere" rule that you'll want to remove. If you don't, your machine will still reject all incoming requests, even after you open the ports listed above.

Download the Software

Next we need to download the Apache Cassandra and Hadoop software. For this article, I’ll be using the DataStax Enterprise Edition bundle, which contains the latest and most stable Apache Cassandra and Hadoop versions, along with the Cassandra Query Language (CQL) utility, a sample database and application, and the enterprise edition of DataStax OpsCenter, which is the tool you’ll want to use for managing and monitoring your Cassandra/Hadoop cluster.

For Linux, DataStax makes available rpm, deb, and tar downloads. For this tutorial, I’ll be using the tar download files.

You can register and obtain the DataStax Enterprise Edition from the downloads page. There’s no time bomb or trial license date for DataStax Enterprise if you use it for development/testing purposes (production deployments do require a subscription license).

Once you've been assigned a userid/password for the DataStax Enterprise Edition software, you can also fetch the download file with the wget command, passing those credentials on the command line.

Unpack the Download File

Next, on each of your machines, move the tar file you've downloaded to the directory where you'd like to install things and unpack it (e.g. tar -xzf).
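The tarball's exact file name varies by version, so the sketch below demonstrates the move-and-unpack step with a throwaway archive; substitute your real DataStax tar file and install directory:

```shell
# Build a stand-in archive so the commands can be tried safely;
# in practice you'd start from the tarball you downloaded.
WORK=$(mktemp -d)
cd "$WORK"
mkdir -p dse-demo && echo demo > dse-demo/README
tar -czf dse-demo.tar.gz dse-demo && rm -r dse-demo

mkdir -p "$WORK/install"                 # your chosen install directory
mv dse-demo.tar.gz "$WORK/install/"      # move the tarball there
cd "$WORK/install" && tar -xzf dse-demo.tar.gz   # unpack in place
ls dse-demo
```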

Generate the Tokens for the Cluster

Cassandra automatically distributes data evenly across all the machines in a cluster via a hashing algorithm that's applied to incoming data. Because DataStax Enterprise is built on top of Cassandra, the same is done for data used by Hadoop.

All you have to do is assign a token to each machine in the cluster, which is a numerical identifier that determines the machine’s position in the cluster and the range of data each machine is responsible for.

DataStax supplies a small Python program that makes generating tokens very easy. To use it:

Create a new file on one of your Linux machines using your favorite editor

Save and name your file (e.g. tokentool). Also be sure to set the permissions on the file so you can execute it

Then, execute the new script and input the number of nodes you intend to use for your new Cassandra/Hadoop cluster; the script will output a token for each node.
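If you don't have the DataStax script handy, the calculation it performs is straightforward: divide the RandomPartitioner token range (0 to 2**127) evenly among the nodes. Here's a minimal Python sketch of that calculation (my own version, not the DataStax file itself):

```python
# Evenly divide the RandomPartitioner token range (0 .. 2**127)
# among the nodes of the cluster.
def generate_tokens(num_nodes):
    return [i * (2 ** 127) // num_nodes for i in range(num_nodes)]

# Tokens for the four-node cluster built in this article.
for i, token in enumerate(generate_tokens(4), start=1):
    print("node %d: %d" % (i, token))
```

Running it with 4 nodes produces the four tokens I assign below.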

Record the tokens because we’ll need them for the next step. For the cluster I’m building for this exercise, I’ll assign the above tokens as follows:

Node 1: Cassandra: token 0

Node 2: Hadoop: token 42535295865117307932921825928971026432

Node 3: Cassandra: token 85070591730234615865843651857942052864

Node 4: Hadoop: token 127605887595351923798765477786913079296

Next, Set a Few Parameters

There are a few startup parameters you'll want to set for your new Cassandra/Hadoop cluster. Some are mandatory and others are optional. All are contained in the cassandra.yaml file, which can be found in the /conf subdirectory directly under the directory where you unpacked the DataStax Enterprise Edition server package.

The mandatory parameters you’ll need to set are:

initial_token: using the token list you generated, input a token for each machine in your cluster.

listen_address: Set to the IP address of the machine.

seeds: When a node first starts, it contacts a seed node to bootstrap the gossip communication process. The seed node designation has no purpose other than bootstrapping new nodes joining the cluster. Seed nodes are not a single point of failure, and you can input multiple seed nodes if you'd like. For your new Cassandra/Hadoop cluster, input the IP address or machine name of your first Hadoop node, followed by the IP/name of your first Cassandra node, separated by a comma.

cluster_name: is (surprise!) the name of your cluster. You can leave “Test Cluster” in if you’d like or give it a more meaningful name.

data_file_directories, commitlog_directory, saved_caches_directory: Like many other database systems, Cassandra has data and log files. And like other databases, you’ll get better performance if you separate the data and log files so that they’re on different drives/devices. This isn’t necessary if you’re just getting your feet wet with Cassandra, but you’ll want to create separate directories for these if/when you go into production, and include those directories here. If you leave them blank, Cassandra will just use the /var/lib directory for everything.
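Putting those settings together, the relevant portion of cassandra.yaml for Node 1 of this example cluster might look like the sketch below. The IP addresses and directory paths are placeholders, and the exact seeds syntax varies a little between Cassandra versions, so check the comments in your own cassandra.yaml:

```yaml
cluster_name: 'Test Cluster'
initial_token: 0                 # Node 1's token from the list generated earlier
listen_address: 10.0.0.1         # this machine's IP (placeholder)
seeds: "10.0.0.2,10.0.0.1"       # first Hadoop node, then first Cassandra node
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
```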

Of course, there are other parameters you can tweak in the cassandra.yaml file for performance tuning, etc., but you’re done for now. Save your changes to the cassandra.yaml file once you’re finished.

Start your Cluster

You’re now ready to start up your Cassandra and Hadoop cluster. Start your Hadoop seed node first by going to the [installation directory]/bin directory and executing ./dse cassandra –t:

Then go to your Cassandra seed node and execute ./dse cassandra. With the seed nodes up, you can start up your other Hadoop and Cassandra nodes in the same fashion.

Give the nodes a few seconds to establish themselves in the ring, then check the status of your cluster with the nodetool utility in the [installation directory]/resources/cassandra/bin directory (on any node in your cluster) by entering the command ./nodetool -h localhost ring.

You should see the nodes that comprise your cluster along with their status and some other pieces of information.

Notice how two of my nodes are labeled Cassandra nodes for real-time workloads, and two are labeled Analytic nodes for Hadoop/batch analytic workloads.

One of the great things about using Cassandra and Hadoop with DataStax Enterprise is that it provides automatic workload isolation between your Cassandra and Hadoop nodes. Your Cassandra real-time work won't compete for data or compute resources on your Hadoop nodes, and your Hadoop MapReduce, Hive, and Pig tasks won't compete for the data or compute resources on your Cassandra nodes.

All data in the cluster is kept in sync via Cassandra’s replication feature, so you don’t have to ETL any data between nodes. For example, if you create and populate a Hive table on your Hadoop nodes, that object is automatically copied for you and made available in a column family on your Cassandra nodes.

You also get a fault-tolerant Hadoop implementation via DataStax Enterprise, so you'll have none of the single points of failure found in standard Hadoop installs. Your Hadoop Job Tracker will run on your Hadoop seed node, but you can move it to another Hadoop node in the cluster via the dsetool utility that comes packaged with your install.

That was pretty easy, wasn't it? Download and unpack a single tar file, set a couple of config parameters, start your nodes, and you've got a full-blown, scalable, fault-tolerant Cassandra/Hadoop cluster that's ready for work.

Let’s now install and configure the management and monitoring tool you’ll use to maintain your new cluster.

Installing DataStax OpsCenter

DataStax OpsCenter is a visual management and monitoring solution for Apache Cassandra that lets you easily see what’s going on in your database cluster, manage objects, and more. There are two editions of DataStax OpsCenter: a free community edition and a paid enterprise edition. The paid version is what you’ll need to manage and monitor both Cassandra and Hadoop, but there’s no need to pay for it until you move your cluster into production status.

With DataStax OpsCenter, you’ll install the main OpsCenter service on one of your nodes and agents on every node. You can choose any node for the OpsCenter service; I’ll just use my first node.

Go to the OpsCenter installation directory and then to the bin directory. Enter the command ./setup.py (in OpsCenter 1.3 and earlier, this setup routine was called create-agent-tarball), which will set up SSL for OpsCenter and handle a few other housekeeping chores.

Next, go to the conf directory and edit the opscenterd.conf file. Change the interface parameter from 127.0.0.1 (which limits access to the localhost machine) to 0.0.0.0 so you can reach OpsCenter from any machine.

Then, start DataStax OpsCenter as a background process by executing ./opscenter &.

Now, you have the main OpsCenter service running on one of your nodes. Next, we have to install the OpsCenter agent on each node in your new cluster.

For every node in your cluster, follow these instructions:

Make a directory on each machine for the OpsCenter agent (e.g. mkdir opscenteragent)

The setup script for OpsCenter created a file called agent.tar.gz on your first machine. Copy that file (e.g. scp) to all the machines in your cluster and put it into the new directory you just created.

Unzip the agent.tar.gz file

Next, point the agent at the main OpsCenter machine (the one where you just installed the primary OpsCenter service) and tell it which machine to monitor. Do this by executing the setup script in the bin subdirectory of the agent directory, passing it first the IP or name of the machine where the main OpsCenter service is running, then the local machine's address or name: ./setup main-opscenter-machine-name-or-ip local-machine-name-or-ip

Start up the agent by executing the opscenter-agent script.
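If you have more than a handful of nodes, the agent rollout steps above can be scripted. This sketch only prints the commands for review; the node addresses, directory names, and the agent's internal paths are assumptions, so adjust them to match your environment before running anything:

```shell
# Placeholders: substitute your own machine addresses.
OPSCENTER_HOST="10.0.0.1"
NODES="10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4"

# Print the per-node rollout commands; run them by hand (or pipe to `sh`)
# once you've confirmed the paths match your install.
for node in $NODES; do
  echo "ssh $node 'mkdir -p ~/opscenteragent'"
  echo "scp agent.tar.gz $node:~/opscenteragent/"
  echo "ssh $node 'cd ~/opscenteragent && tar -xzf agent.tar.gz'"
  echo "ssh $node '~/opscenteragent/agent/bin/setup $OPSCENTER_HOST $node'"
  echo "ssh $node '~/opscenteragent/agent/bin/opscenter-agent'"
done
```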

Lastly, DataStax OpsCenter uses the iostat command on Linux to obtain disk I/O metrics. On my new Rackspace boxes, I didn’t have iostat installed, so I installed it by issuing the command: yum install sysstat.

You can now bring up the DataStax OpsCenter dashboard from any Google Chrome or Firefox web browser by typing the following into the browser's address bar: http://[IP address of the OpsCenter service machine]:8888/opscenter/index.html.

Congratulations: you've got a multi-node Cassandra and Hadoop cluster up and running, along with your visual management and monitoring solution.

Conclusion

We’ve reached the end of this short article on how to setup and start monitoring a multi-node Cassandra cluster on Linux. Look for some more articles on setting up Cassandra on Amazon EC2 and other platforms in the future.

We’ve now reached the end for this short article on how to stand up a multi-node Cassandra and Hadoop cluster on Linux. To download the DataStax Enterprise edition and start working with Cassandra and Hadoop today, please visit the downloads page.

There is not much written about Cassandra administration with respect to backup and recovery. I’ve read on various blog sites that with a multi-node cluster you need to run snapshot on all nodes in the cluster in parallel to have a ‘consistent’ backup (unlike with MySQL master-slave setups where you can backup one of the slaves and transaction log coordinates to recover a failed slave).
Is that true? If a node goes down it can only be recovered from its own snapshot files?

No, not at all. Cassandra is built to withstand and recover from node failure very well. As long as you replicate and maintain multiple copies of the data on other nodes, a node can be automatically recovered from surviving nodes once it comes back online. See the DataStax Cassandra documentation that covers the topic of Repair. For more info on backup, see the sections on Backup and Restoring data. Lastly, see StackOverflow for more Q&A on topics such as this.