How can you manage large volumes of data using the Apache Cassandra NoSQL database?

Overview: Apache Cassandra is one of the most popular and scalable open source NoSQL databases. Cassandra is well suited to managing huge volumes of unstructured, semi-structured and structured data across multiple data centers and cloud environments. It delivers high scalability and availability across many commodity servers without compromising performance. Its architecture has no single point of failure, and it provides a powerful data model for maximum flexibility and fast response times. Linear scalability combined with fault-tolerant commodity hardware or cloud infrastructure makes Cassandra a strong choice for critical data.

Introduction: Relational databases are very good at solving certain types of data storage problems. But because an RDBMS is designed around normalized data and joins, it runs into problems when scaling up to large volumes of data. Joins become expensive once data is spread across many machines, so we need a way to get rid of them, which means de-normalizing the data. De-normalization in turn means maintaining multiple copies of the data, and it can do serious damage to the design of both the database and the application. In this situation the solutions offered by NoSQL seem less radical and less scary than we may have thought. The design goals of a NoSQL database must be clearly understood before it is used in any application.
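To make the de-normalization trade-off concrete, here is a minimal Python sketch. The records, field names and helper functions are hypothetical and exist only for illustration. It contrasts a normalized layout, which needs a join at read time, with a de-normalized layout that stores a copy of the user name in every order row.

```python
# Normalized layout: two "tables" linked by user_id (hypothetical sample data).
users = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
orders = [
    {"order_id": 100, "user_id": 1, "total": 25.0},
    {"order_id": 101, "user_id": 2, "total": 40.0},
    {"order_id": 102, "user_id": 1, "total": 15.0},
]


def orders_with_names_join(users, orders):
    """Read-time join: look up the user name for every order."""
    return [
        {
            "order_id": order["order_id"],
            "user_name": users[order["user_id"]]["name"],
            "total": order["total"],
        }
        for order in orders
    ]


def denormalize(users, orders):
    """Write-time copy: store the user name inside every order row."""
    return [
        {**order, "user_name": users[order["user_id"]]["name"]}
        for order in orders
    ]


def rename_user(denormalized_orders, user_id, new_name):
    """The cost of de-normalization: every copy of the name must be updated."""
    updated = 0
    for row in denormalized_orders:
        if row["user_id"] == user_id:
            row["user_name"] = new_name
            updated += 1
    return updated
```

Reading from the de-normalized rows needs no join, but a single rename now touches every row that carries a copy of the name, which is the design cost described above.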

Design goals of Cassandra NoSQL database: The design goals of a NoSQL database are completely different from those of a relational database, so the choice between a NoSQL DB and an RDBMS depends on the type of application and its requirements. ACID transactions provide a strong consistency model for traditionally designed web applications. But scalability comes at a cost and conflicts with some of the rules followed in RDBMS design. Promoting availability over consistency is therefore one of the key design factors for NoSQL databases. The common design goals of Cassandra are stated below.

High performance

Horizontal scalability

Simplicity

Schema flexibility

Cassandra architecture to manage large data volume: NoSQL databases are typically distributed across a number of commodity nodes. Cassandra is likewise distributed across a number of nodes, and it follows a ‘masterless’ architecture. ‘Masterless’ means that all nodes are the same and no single node controls the others. Cassandra automatically distributes data across all the commodity nodes, which form a ‘ring’ known as the database cluster. Because the data is automatically and transparently partitioned across the cluster, developers do not need to do anything programmatically. Another important feature of the Cassandra architecture is its support for built-in and customizable replication. Redundant copies of the data are stored on multiple nodes in the Cassandra ring, so if any node fails, the same data can be retrieved from other nodes that hold replicas. Replication can be configured in the following ways.

Across one data center

Across multiple data centers

Across multiple cloud infrastructures
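The ring and replication behaviour described above can be illustrated with a simplified Python sketch. It is not Cassandra's actual implementation. It assumes one token per node, MD5 hashing, and replicas placed on the next nodes clockwise around the ring, while a real cluster uses configurable partitioners and replication strategies. The class name, method names and node names are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy masterless ring: every node is equal and owns a token range."""

    def __init__(self, nodes, replication_factor=2):
        unique_nodes = set(nodes)
        if not unique_nodes:
            raise ValueError("the ring needs at least one node")
        # Never ask for more replicas than there are nodes.
        self.replication_factor = min(replication_factor, len(unique_nodes))
        # Sort nodes by token so the list order is the ring order.
        self._ring = sorted((token(node), node) for node in unique_nodes)
        self._tokens = [node_token for node_token, _ in self._ring]

    def replicas(self, key: str):
        """Return the nodes that store `key`, walking clockwise from its owner."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        return [
            self._ring[(start + offset) % len(self._ring)][1]
            for offset in range(self.replication_factor)
        ]

    def read(self, key: str, down=()):
        """Return the first replica of `key` that is not in the `down` set."""
        for node in self.replicas(key):
            if node not in down:
                return node
        raise RuntimeError("no live replica holds this key")
```

For example, with `Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)`, a key is stored on three distinct nodes, and `read` still succeeds when the first of those nodes is marked as down.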

Another architectural feature is the support for linear scalability. This means that capacity can be increased simply by adding new nodes. For example, if 2 nodes can handle 1000 transactions/sec, then 4 nodes will support about 2000 transactions/sec, and so on. The following picture shows the linear scalability of a Cassandra ring.

Figure: Cassandra Ring
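The arithmetic behind the linear scalability example can be written as a short Python sketch. It is an idealized model that assumes throughput grows in exact proportion to the node count. Real clusters are also affected by replication, consistency settings and workload, and the function names are hypothetical.

```python
import math


def per_node_throughput(nodes: int, total_tps: float) -> float:
    """Throughput contributed by each node, assuming an even spread."""
    return total_tps / nodes


def projected_throughput(nodes: int, per_node_tps: float) -> float:
    """Idealized linear model: total capacity is nodes times per-node capacity."""
    return nodes * per_node_tps


def nodes_needed(target_tps: float, per_node_tps: float) -> int:
    """Smallest node count whose projected throughput reaches the target."""
    return math.ceil(target_tps / per_node_tps)
```

Using the figures from the text, `per_node_throughput(2, 1000)` gives 500 transactions/sec per node, so `projected_throughput(4, 500)` gives 2000 transactions/sec, matching the example above.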

Accessing large volume of data: The first thing that comes to mind when developing a database-driven application is the availability of client libraries. For RDBMS products the available libraries are straightforward. For example, JDBC is the standard database access API for Java-based applications, and normally there is a single JDBC driver vendor for a particular database product. Cassandra, on the other hand, has approximately nine different clients for Java application development. More importantly, these clients offer different styles of managing the data: some provide object relational mapping APIs, some offer CQL-based support, and so on. This flexibility in accessing the NoSQL DB is another major advantage for application development, because developers can choose the type of access that suits their requirements.

Large volumes of data in Cassandra can be accessed and managed through APIs that follow an RPC style. Cassandra also provides a query language called CQL, which is similar to SQL to some extent. Either way, the application developer must have a sound knowledge of the storage engine and how it works.
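To give a feel for how close CQL is to SQL, here is a hedged Python sketch. The keyspace `demo`, the table `users` and the helper names are hypothetical. The code assumes a session object with an `execute(statement, parameters)` method, in the style of the DataStax Python driver's session, and it ships a small recording stand-in so the sketch runs without a cluster.

```python
# CQL statements; the keyspace and table names are made up for illustration.
CREATE_KEYSPACE = (
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS demo.users ("
    "user_id int PRIMARY KEY, name text, city text)"
)
INSERT_USER = "INSERT INTO demo.users (user_id, name, city) VALUES (%s, %s, %s)"
SELECT_USER = "SELECT name, city FROM demo.users WHERE user_id = %s"


class RecordingSession:
    """Stand-in for a real driver session: it only records what was executed."""

    def __init__(self):
        self.calls = []

    def execute(self, statement, parameters=None):
        self.calls.append((statement, parameters))
        return []  # a real session would return the result rows


def setup_and_query(session, user_id, name, city):
    """Create the schema, insert one row, then read it back by primary key."""
    session.execute(CREATE_KEYSPACE)
    session.execute(CREATE_TABLE)
    session.execute(INSERT_USER, (user_id, name, city))
    return session.execute(SELECT_USER, (user_id,))
```

Against a running cluster, the stand-in would be replaced by a session obtained from a client library. Note that the SELECT filters on the primary key: unlike SQL, CQL queries are shaped by how the data is partitioned, which is why knowledge of the storage engine matters.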

Standard use cases for Cassandra NoSQL DB: As already discussed, the standard use cases for Cassandra are different from those of traditional RDBMS applications. The following are some standard use cases.

Applications handling very large data volume

Applications requiring high scalability and availability

Applications with high reliability requirements for data storage

Applications with a dynamic data model that is expected to change significantly over time

Applications distributed over different data centers

Downloading and Installing Cassandra: Now let us discuss how to download and install the Cassandra NoSQL DB. The download and installation will take some time.

Apache Cassandra can be downloaded from http://cassandra.apache.org . The binary distribution is named apache-cassandra-<VERSION>-bin.tar.gz. The easiest way to install Cassandra is to follow the steps below:

Download the binary distribution from the above website

Extract the archive using any utility that can handle .tar.gz files (for example, tar -xzf on Linux)

Once extracted, you should see the following directories:

bin – this contains the executables to run Cassandra and the command line interface client.

conf – this contains files used to configure Cassandra

interface – this contains the client interface definition, written in the Thrift syntax, which provides an easy way to generate clients. If you want to see all of the operations that Cassandra supports, open the interface file in a regular text editor. Cassandra supports clients for Java, C++, PHP, Ruby, Python, Perl, and C# through this interface.

lib – this contains the external libraries which are required to run Cassandra.

javadoc – this contains the documentation for Cassandra in HTML format.
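The extraction step can also be scripted. The following Python sketch uses only the standard library. The function name and the paths passed to it are hypothetical, and it simply unpacks a .tar.gz archive and reports the top-level entries it produced.

```python
import tarfile
from pathlib import Path


def extract_distribution(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and list its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can reject unsafe archive members.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return sorted(entry.name for entry in dest.iterdir())
```

Calling it with the downloaded apache-cassandra-<VERSION>-bin.tar.gz file and a target directory of your choice should leave the distribution's directories, such as bin, conf and lib, under that target.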

To start the Cassandra server on an OS such as Linux or Windows, first open a command prompt or terminal window. Go to <cassandra-directory>/bin, where <cassandra-directory> is the location where you unpacked Cassandra, and start the server by running the cassandra script with the -f option (on Linux, ./cassandra -f). If the installation was clean, you will see the server's startup log statements in the terminal.

The -f option tells Cassandra to stay in the foreground instead of running as a background process. As a result, all of the server logs are printed to standard output and you can see them in your terminal window, which is useful for testing.
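The start-up step described above can be wrapped in a small Python helper. This is only a sketch: the function names and the installation path in the usage note are hypothetical, and it assumes the launch script is named cassandra and lives in the bin directory of the unpacked distribution.

```python
import subprocess
from pathlib import Path


def build_start_command(cassandra_home, foreground=True):
    """Build the command line for the launch script in <cassandra_home>/bin."""
    command = [str(Path(cassandra_home) / "bin" / "cassandra")]
    if foreground:
        # -f keeps the server in the foreground so logs go to the terminal.
        command.append("-f")
    return command


def start_server(cassandra_home):
    """Run the server in the foreground; returns its exit code when it stops."""
    completed = subprocess.run(build_start_command(cassandra_home), check=False)
    return completed.returncode
```

For instance, `build_start_command("/opt/cassandra")` yields the script path followed by `-f`, and `start_server` blocks until the server process exits, which suits a test session in a terminal.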

Summary:

Let us conclude our discussion with the following points:

Apache Cassandra is a scalable NoSQL database

It can be downloaded and installed from the Apache website

Cassandra is an ideal database for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud.

Cassandra supports linear scalability and high performance across multiple commodity servers with no single point of failure, and provides a powerful dynamic data model designed for maximum flexibility and fast response time.

About Us:

TechAlpine is a technology-centric software solutions company in India. TechAlpine was formed in 2008 by a group of information technology professionals from premier institutions and organizations, with an emphasis on the use of modern technologies on different technology platforms.