Introduction

With the uninterrupted growth of data volumes ever since the primitive ages of computing, storage of information, support and maintenance has been the biggest challenge. Also, the cloud computing technology has positioned a new dimension ("pay-as-you-go" model) to the information storage with efficient use of computer resources. Even the matured relational database products in the current market fall behind to scaling the applications according to the incoming traffic at a conventional cost. The demands of huge data and elastic scaling with desired performance has led to concept of No SQL databases. No SQL - undoubtedly is the hottest stint in today’s database technology and moving the data from the existing data structures to No SQL would be the potential area of interest for the customers.

What are No SQL Databases

No SQL databases are the data stores that are non-relational (without any fixed schemas & joins), distributed, horizontally scalable and often don’t adhere to the principles (ACID: atomicity, consistency, isolation, durability) of traditional relational databases.

Based on type of data storage, No SQL databases are broadly classified as:

Column family (Wide column) stores

Graph databases

Key value/tulip store

Document stores

There are more than 100+ No SQL databases in current market. The information on some of the most popular No SQL databases is depicted in the below list (For detailed information on No SQL databases, please refer to: http://nosql-databases.org):

What is JSON

JSON (stands for JavaScript Object Notation) is a lightweight and highly portable data-interchange format. JSON is intuitive to the web as well as the browser. Interoperability with any/all platforms in the current market can be easily achieved using JSON message format.

·A collection of name/value pairs. In various languages, this is realized as an object, record, dictionary, structure, keyed list, hash table or associative array.

·An ordered list of values. In most languages, this is realized as an array, list, vector, or sequence.

These are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages also be based on these structures.”

A typical JSON syntax is as follows:

Data is represented in the form of name-value pairs.

A name value pair is comprised of a “Member Name” in double quotes, followed by colon “:” and the value in double quotes

• JSON is plain text data format• JSON is human readable and self-describing• JSON is categorized (contains values within values)• JSON can be parsed by scripting languages like Java script• JSON data is supported and transported using AJAX

JSON vs. XML

Though JSON & XML are both data formats, JSON has the upper hand over XML because of the following reasons:

• JSON is lighter compared to XML (No unnecessary/additional tags in JSON)• JSON is easier to read and understand by humans.• JSON is easier to parse and generate for machines.• For AJAX related applications, JSON is quite faster compared to XML

Several No SQL products have provided built-in capabilities/readily available tools for loading data from JSON file format. Below is the list of import/export utilities for some of the widely held No SQL products

Product

Import/Export Utilities for JSON

Cassandra

json2sstable -> JSON to Cassandra data structure

sstable2JSON -> Cassandra data structure to JSON

MongoDB

mongoimport -> JSON/CSV/TSV to MongoDB data structure

mongoexport -> MongoDB data structure to JSON/CSV

CouchDB

tools/load.py-> JSON to CouchDB data structure

tools/dump.py -> CouchDB data structure to JSON

Riak

bucket_importer:import_data-> JSON to Riak data structure

bucket_exporter:EXport_data -> Riak data structure to JSON

Cassandra Data Model

Understanding Cassandra data model

The Cassandra data model is premeditated for highly distributed and large scale data. It trades off the customary database guidelines (ACID compliant) for important benefits in operational manageability, performance and availability.

An illustration of how a Cassandra data model would like is as below:

The basic elements of the Cassandra data model are as follows:

• Column• Super Column • Column Family • Keyspace• Cluster

Column: A column is the basic unit of Cassandra data model. A column comprises of name, value and a time stamp (by default). An example of column in JSON format is as follows:

• Column’s value is a string but the super column’s value is a record of columns• A super column doesn’t include any time stamp (only terms name & value).

Note: Cassandra does not index sub columns, so when a super column is loaded into memory; all of its columns are loaded as well.

Column Family (CF):A column family resembles an RDBMS table closely and is an assembly of ordered collection of rows which in-turn are ordered collection of columns.A column family can be a “standard” or a “super” column family.

A row in a standard column family contains collections of name/value pairs whereas the row in a super column family(SCF) holds collections of super columns (group of sub columns). An example for a column family is described below (in JSON):

Note: Each column would contain “Time Stamp” by default. For easier narration, time stamp is not included here.

The address of a value in a regular column family is a row key pointing to a column name pointing to a value, while the address of a value in a column family of type “super” is a row key pointing to a column name pointing to a sub column name pointing to a value. An example for Super column in JSON format is as follows:

Columns are always organized as per the Column‘s name within their rows. The data would be sorted as soon as it is inserted into the data model.

Keyspace:A keyspace is the outmost grouping for data in Cassandra, closely resembling an RDBMS database. Similar to the relational database, a keyspace has title and properties that describe the keyspace demeanor. The keyspace is a container for a list of one or more column families (without any enforced association between them).

Cluster:Cluster is the outermost structure in Cassandra (also called as ring). Cassandra database is specially designed to be spread across several machines functioning together that act as a single occurrence to the end user. Cassandra allocates data to nodes in the cluster by arranging them in a ring.

Relational data model vs. Cassandra data model

Relational Data Model

Cassandra data model (Standard)

Cassandra data model (Super)

Server

Cluster

Database

Key space

Table

Column Family

Primary Key

Key

Column Value

Column Name

Super Column Name

Column Value

Column Name

Column Value

Unlike the traditional RDBMS, Cassandra doesn’t support

Query language like SQL (T-SQL, PL/SQL etc.). Cassandra provides an API called thrift through which the data could be accessed.

Referential Integrity (operations like cascading deletes are not available)

Designing Cassandra data structures

1. Entities – Point of Interest

The finest way to model a Cassandra data structure is to identify the entities on which most queries would be attentive and creating the entire structure around the entity. The activities performed (generally the use cases) by the user applications, how the data is retrieved and displayed would be the areas of interest for designing the Cassandra column families.

For example, a simple employee data model (in any RDMBS) would contain:

·Employee

·Employeecontact details

·Employeefinancial information

·Employeerole information

·Employeeattendance information

·Employeeprojects

….

And so on…

Here “Employee” is the entity for point of interest and any application using this design would frame the queries relating to the employee.

2. De-normalization

Normalization is the set of rules established to aid in the design of tables and their relation-ships in any RDBMS. The benefits of normalizing would be:

Achieving the similar kind of performance for the growing data volume is a challenge in traditional relational data models and the companies could compromise on de-normalization to achieve performance. Cassandra does not support foreign key relationships like a relational database and the better way is to de-normalize the data model. The important fact is that instead of modeling the data first and framing the queries, with Cassandra the queries would be modeled and the data be framed around them.

3. Planning for Concurrent Writes

In Cassandra, every row within a column family is identified by the unique row key (generally a string of unlimited length). Unlike the traditional RDBMS primary key (which enforces uniqueness), Cassandra doesn’t impose uniqueness (Duplicate row key insertion might disturb the existing column structure). So the care must be taken to create the rows with unique row keys. Some of the ways for creating unique row keys is as follows:

• Surrogate/ UUID type of row keys • Natural row keys

Data Migration approach (Using ETL)

There are various ways of porting the data from relational data structures to Cassandra structures, but the migrations involving complex transformations and business validations might accommodate a data processing layer comprising ETL utilities.

In case of using in-built data loaders, the processed data can be extracted to flat files (in JSON format) and then uploaded to the Cassandra data structure’s using these loaders. Custom loaders could be fabricated in case of additional dispensation rules, which could either deal the data from the processed store or the JSON files.

The overall migration approach would be as follows:

Data preparation as per the JSON file format.

Data extractions into flat files as per the JSON file format or extraction of data from the processed data store using custom data loaders.

The following activities will be executed during data extraction into JSON file formats:

Data Selection as per the JSON file layout

Creation of SQL programs based on as the JSON file layout

Scripts or PLSQL programs are created based on the data mapping requirements and the ETL processes. These programs shall serve various purposes including the loading of data into staging tables and standard open interface tables.

Data Transformation before extract as per the JSON files layout specification and mapping documents.

Flat files in form of JSON format for data loading

Data Loading

Cassandra data structures can be accessed using different programing languages like (.net, Java, Python, Ruby etc.). Data can be directly loaded from the relational databases (like Access, SQL Server, Oracle, MySQL, IBM DB2, etc.) using these programing languages. Custom loaders could be used to load data into Cassandra data structure(s) based on the enactment rules, customization level and the kind of data processing.