Quick start – Creating your first Java application

Cassandra's storage architecture is designed to manage large data volumes and revolves around some important factors:

Decentralized systems

Data replication and transparency

Data partitioning

Decentralized systems are systems that provide maximum throughput from each node. Cassandra achieves decentralization by keeping every node identically configured; there is no master-slave configuration between nodes. Data is spread across nodes, and each node is capable of serving read/write requests with the same efficiency.

A data center is a physical space where critical application data resides. Logically, a data center is made up of multiple racks, and each rack may contain multiple nodes.

Cassandra replication strategies

Cassandra replicates data across the nodes based on the configured replication factor. If the replication factor is 1, one copy of each dataset will be available on one node only. If the replication factor is 2, two copies of each dataset will be kept on different nodes in the cluster. Still, Cassandra ensures data transparency: to an end user, data is served from one logical cluster. Cassandra offers two types of replication strategies.

Simple strategy

Simple strategy is best suited for clusters involving a single data center, where data is replicated across different nodes in a clockwise direction based on the replication factor. With a replication factor of 3, two more copies of each row will be placed on the next nodes in a clockwise direction:

Network topology strategy

Network topology strategy (NTS) is preferred when a cluster is made up of nodes spread across multiple data centers. With NTS, we can configure the number of replicas to be placed within each data center. Data colocation and avoiding a single point of failure are two important factors that we need to prioritize when configuring the replication factor and consistency level. NTS identifies the first node based on the selected schema partitioning and then looks for nodes in a different rack (in the same data center). If no such node exists, the data replicas are placed on different nodes within the same rack. In this way, data colocation can be guaranteed by keeping a replica of each dataset in the same data center (to serve read requests locally), which also minimizes the risk of network latency. NTS depends on the snitch configuration for proper placement of data replicas across different data centers.
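As a sketch of an NTS configuration (the keyspace name and the data center names dc1 and dc2 are illustrative, and the names must match what your snitch reports), a keyspace spanning two data centers could be created from cassandra-cli as follows, placing two replicas in dc1 and one in dc2:

create keyspace multiDCSample
with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
and strategy_options = {dc1:2, dc2:1};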

A snitch relies upon the node IP address for grouping nodes within the network topology. Cassandra depends upon this information for routing data requests internally between nodes. The preferred snitch configurations for NTS are RackInferringSnitch and PropertyFileSnitch . We can configure snitch in cassandra.yaml (the configuration file).
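For example, the snitch is selected with a single line in cassandra.yaml, and PropertyFileSnitch then reads the data center and rack for each node IP from cassandra-topology.properties (the addresses and names below are illustrative):

# cassandra.yaml
endpoint_snitch: PropertyFileSnitch

# conf/cassandra-topology.properties, format: node IP=data center:rack
192.168.1.10=dc1:rack1
192.168.1.11=dc1:rack2
192.168.2.10=dc2:rack1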

Data partitioning

A data partitioning strategy is required to select the node that serves a given read/write request. Cassandra offers two types of partitioning strategies.

Random partitioning

Random partitioning is the recommended partitioning scheme for Cassandra. Each node is assigned a 128-bit token value (the initial_token for a node is defined in cassandra.yaml) generated by a one-way hashing (MD5) algorithm. The initial token determines the node's position in the ring, and a range of token values is assigned to each node. If the token generated for a row key of a read/write request falls within a node's assigned range, that node is responsible for serving the request. The following diagram is a common graphical representation of a number of nodes placed in a circular arrangement, or ring, with the data range evenly distributed between these nodes:

Ordered partitioning

Ordered partitioning is useful when an application requires key distribution in a sorted manner. Here, the token value is the actual row key value. Ordered partitioning also allows you to perform range scans over row keys. However, with ordered partitioning, key distribution might be uneven and may require load balancing administration. It is certainly possible that the data for multiple column families may get unevenly distributed and the token range may vary from one node to another. Hence, it is strongly recommended not to opt for ordered partitioning unless it is really required.

Cassandra write path

Here, we will discuss how Cassandra processes a write request and stores the data on disk:

As we have mentioned earlier, all nodes in Cassandra are peers and there is no master-slave configuration. Hence, on receiving a write request, a client can select any node to serve as the coordinator. The coordinator node is responsible for delegating the write request to the eligible nodes based on the cluster's partitioning strategy and replication factor. The write is first appended to a commit log and then applied to the corresponding memtable (see the preceding diagram). A memtable is an in-memory table that serves subsequent read requests without any lookup on disk; there is one memtable per column family. Once a memtable is full, its data is flushed to disk asynchronously in the form of SSTables. Once all the segments are flushed onto the disk, they are recycled. Periodically, Cassandra performs compaction over SSTables (sorted by row keys) and reclaims unused segments. If a data node restarts (in unwanted scenarios such as failover), the commit log is replayed to recover any previous incomplete write requests.

Hands on with the Cassandra command-line interface

Cassandra provides a default command-line interface that is located at:

CASSANDRA_HOME/bin/cassandra-cli on Linux

CASSANDRA_HOME/bin/cassandra-cli.bat on Windows

Before we proceed with the sample exercise, let's have a look at the Cassandra schema:

Keyspace: A keyspace may contain multiple column families; similarly, a cluster (made up of multiple nodes) can contain multiple keyspaces.

Column family: A column family is a collection of rows with defined column metadata. Cassandra offers two types of column families, namely, static and dynamic column families.

Static column family: A static column family contains a predefined set of columns with metadata. Please note that a predefined set of columns may exist, but the number of columns can vary across multiple rows within the column family.

Dynamic column family: A dynamic column family generally defines a comparator type and validation class for all columns instead of individual column metadata. The client application is responsible for providing columns for a particular row key, which means the column names and values may differ across multiple row keys:

Column: A column can be attributed as a cell, which contains a name, value, and timestamp.

Super column: A super column is similar to a column and contains a name, value, and timestamp, except that a super column's value may itself contain a collection of columns. Super columns cannot be sorted; however, the subcolumns within a super column can be sorted by defining a sub comparator. Super columns do have some limitations: secondary indexes over super columns are not possible, and it is not possible to read a particular super column without deserializing all the wrapped subcolumns. Because of these limitations, the use of super columns is highly discouraged within the Cassandra community; the same functionality can be achieved using composite columns, which we will cover in detail in later articles:

Counter column family: From version 0.8 onwards, Cassandra has supported counter columns. Counter columns are useful for applications that perform the following:

Maintain the page count for the website

Do aggregation based on a column value from another column family

A counter column holds a 64-bit signed integer. To create a counter column family, we simply need to define default_validation_class as CounterColumnType. Counter columns do have some application and technical limitations:

In case of events, such as disk failure, it is not possible to replay a column family containing counters without reinitializing and removing all the data

Secondary indexes over counter columns are not supported in Cassandra

Frequent insert/delete operations over the counter column in a short period of time may result in inconsistent counter values
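As a sketch, a counter column family for page counts could be created and incremented from cassandra-cli as follows (the column family, row key, and column names are illustrative):

create column family PageCounts
with default_validation_class = CounterColumnType
and key_validation_class = UTF8Type
and comparator = UTF8Type;

// increment the "views" counter for the "home" row by 1
incr PageCounts['home']['views'];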

You can start a Cassandra server simply by running $CASSANDRA_HOME/bin/cassandra. If started in local mode, there is only one node. Once it has started successfully, you should see logs on your console, as follows:

Cassandra-cli: The Cassandra distribution, by default, provides a command-line utility (cassandra-cli), which can be used for basic DDL/DML operations; you can connect to a local/remote Cassandra server instance by specifying the host and port options, as follows:

$CASSANDRA_HOME/bin/cassandra-cli -host localhost -port 9160

Performing DDL/DML operations on the column family

First, we need to create a keyspace using the create keyspace command, as follows:

The create keyspace command:
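A create keyspace command matching this description might look as follows (a sketch; the fully qualified strategy class name is the standard one from Cassandra's locator package):

create keyspace cassandraSample
with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = {replication_factor:1};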

This operation will create a keyspace, cassandraSample, with the node placement strategy SimpleStrategy and a replication factor of one. If you don't specify placement_strategy and strategy_options, Cassandra defaults to NTS with replication on one data center:

We can always update the keyspace for configurations, such as replication factor. To update the keyspace, do the following:

Modify the replication factor: You can update a keyspace for changing the replication factor as well as the placement strategy. For example, to change a replication factor to 2 for cassandraSample, you simply need to execute the following command:
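Such an update might look like this in cassandra-cli (a sketch for a SimpleStrategy keyspace):

update keyspace cassandraSample
with strategy_options = {replication_factor:2};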

Strategy options are in the format {datacentername:number of replicas}, and there can be multiple data centers.

After successfully creating a keyspace, and before proceeding with other DDL operations (for example, column family creation), we need to select that keyspace. We can switch to a keyspace using the following command:

use cassandraSample;

Create a column family/super column family as follows:

Use the following command to create column family users within the cassandraSample keyspace:
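A minimal definition of such a column family might look as follows (a sketch; the username and password columns match the set examples later in this section):

create column family users
with comparator = UTF8Type
and key_validation_class = UTF8Type
and default_validation_class = UTF8Type
and column_metadata = [
{column_name: username, validation_class: UTF8Type},
{column_name: password, validation_class: UTF8Type}
];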

default_validation_class: It defines the datatype for the column value

subcomparator: It defines the datatype for subcolumns.

You can create/update a column by using the set method as follows:

// create a column named "username", with a value of "user1" for row key 1
set users[1][username] = user1;
// create a column named "password", with a value of "password1" for row key 1
set users[1][password] = password1;
// create a column named "username", with a value of "user2" for row key 2
set users[2][username] = user2;
// create a column named "password", with a value of "password2" for row key 2
set users[2][password] = password2;

To fetch all the rows and columns from a column family, execute the following command:

// to list down all persisted rows within a column family.
list users;
// to fetch a row from users column family having row key value "1".
get users[1];

If you want to change key_validation_class from UTF8Type to BytesType and validation_class for the password column from UTF8Type to BytesType, then type the following command:

update column family users with key_validation_class=BytesType and comparator=UTF8Type and column_metadata = [{column_name:password, validation_class:BytesType}]

To drop/truncate the column family, follow these steps:

Delete all the data rows from a column family users, as follows:

truncate users;

Drop a column family by issuing the following command:

drop column family users;

These are some basic operations that should give you a brief idea about how to create/manage the Cassandra schema.

Cassandra Query Language

Cassandra is schemaless, but CQL is useful when we need data modeling with a traditional RDBMS flavor. Cassandra provides two variants of CQL (2.0 and 3.0). We will use CQL 3.0 for a quick exercise, repeating exercises similar to those we performed with the cassandra-cli interface.

The command to connect with cqlsh is as follows:

$CASSANDRA_HOME/bin/cqlsh host port cqlversion

You can connect to localhost on port 9160 by executing the following command:

$CASSANDRA_HOME/bin/cqlsh localhost 9160 -3

After successfully connecting to the command-line CQL client, you can create the keyspace as follows:
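Such a keyspace, and a users table matching the queries below, might be created as follows (a sketch using the CQL 3.0 syntax of the Cassandra 1.1 era; on Cassandra 1.2 and later, the replication clause is instead written as WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}):

CREATE KEYSPACE cassandrasample
WITH strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor = 1;

USE cassandrasample;

-- illustrative table; user_id and age appear in the delete examples below
CREATE TABLE users (
user_id int PRIMARY KEY,
username text,
password text,
age int
);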

To select all the data from the users column family, we need to execute the following CQL query:

select * from users;

We can delete a whole row as well as specific columns using the delete operation. The following commands delete a complete row and the age column from the users column family, respectively:

// delete complete row for user_id=1
delete from users where user_id=1;
// delete age column from users for row key 1.
delete age from users where user_id=1;

You can update a column family to add columns and to update or drop column metadata.
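For example (a sketch; the email column is illustrative, and dropping a column with ALTER TABLE ... DROP is only available in later Cassandra versions):

-- add a new column to the users column family
ALTER TABLE users ADD email text;
-- change the type of an existing column (where the conversion is permitted)
ALTER TABLE users ALTER password TYPE blob;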

Truncating a column family will delete all the data belonging to the corresponding column family, whereas dropping a column family will also remove the column family definition along with the containing data. We can drop/truncate the column family as follows:

truncate users;
drop columnfamily users;

Dropping a keyspace instantly removes all the column families and data available within that keyspace. We can drop a keyspace using the following command:

drop keyspace cassandrasample;

By default, the CQL shell converts column family and keyspace names to lowercase. You can preserve case sensitivity by wrapping these identifiers in double quotes.
