

NoSQL databases are systems for data storage and retrieval that do not primarily use the now-dominant relational database management system (RDBMS) model: tabular data structures, organized relationally and accessed using Structured Query Language (SQL). Instead, NoSQL databases employ a variety of methods, such as schema-free key-value pairs, intended to map better to a growing class of problems that are difficult to solve with an RDBMS. For example, some problems are best approached using data structures (such as trees) that are hard to represent with relational tables, or using algorithms that are difficult to express in SQL. Others can only be solved efficiently by creating and accessing very large, unstructured, or distributed databases with massive parallelism.

Apache Cassandra NoSQL Database

Apache Cassandra is a high-performance, scalable, distributed, and robust NoSQL database designed to handle very large data stores on many inexpensive commodity servers and across multiple datacenters with no single point of failure, using a flexible, simple partitioned row-store data model. It was originally developed at Facebook by Avinash Lakshman (one of the authors of Amazon’s Dynamo) and Prashant Malik to solve the company’s Inbox search problem. The code was published in July 2008 as free software under the Apache License 2.0, and development has continued at a rapid pace since then, driven in part by contributions from IBM, Twitter, and Rackspace. Since February 2010, Cassandra has been an Apache top-level project.

Cassandra forgoes the widely used master-slave setup in favor of a peer-to-peer cluster. This is one reason Cassandra has no single point of failure: there is no master server that, when overloaded or down, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster. While this architecture is more complex to implement behind the scenes, users don’t have to deal with that complexity. Because there is no distinction between master and slave nodes, you can add any number of machines to any cluster in any datacenter without worrying about which type of machine you need at the moment. Every server accepts requests from any client. Every server is equal.
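One way to picture how equal peers can route requests without a master is a token ring, in which each node owns a slice of a hash space and any node can work out who owns a given key. The sketch below is a simplified illustration of that idea, not Cassandra’s actual implementation: the `Ring` class, the node names, the key, and the choice of SHA-256 are all assumptions made for the example.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: every node can locate the owner of any key."""

    def __init__(self, nodes):
        # Place each node on the ring at a token derived from its name.
        self._tokens = sorted((self._token(node), node) for node in nodes)

    @staticmethod
    def _token(value):
        # Stable hash, so every node computes the same ring layout.
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def owner(self, key):
        # The owner is the first node clockwise from the key's token.
        tokens = [token for token, _ in self._tokens]
        index = bisect.bisect_right(tokens, self._token(key)) % len(tokens)
        return self._tokens[index][1]

    def add_node(self, node):
        # A new peer simply takes its place on the ring.
        bisect.insort(self._tokens, (self._token(node), node))


ring = Ring(["node-a", "node-b", "node-c"])
# Any node could run this lookup; there is no master to ask.
owner_before = ring.owner("user:42")
ring.add_node("node-d")
owner_after = ring.owner("user:42")
```

In this sketch, adding `node-d` either leaves the key with its previous owner or moves it to the new node, which is why a cluster laid out this way can grow one machine at a time.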

What does that mean at a higher level?

Cassandra excels at online transactions, also known as real-time transactions: requests that must execute fully in a short time because users will otherwise perceive latency. Such queries need to complete in single-digit milliseconds, not hundreds or thousands of milliseconds. With Cassandra’s multiple caching levels, your data can be served very quickly. Writes are fast thanks to the log-structured storage design, and each write is persisted to a commit log, making Cassandra an excellent choice when downtime or data loss is unacceptable.
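The write path described above (append to a commit log, then update an in-memory structure) can be sketched in a few lines. This is a toy model under stated assumptions, not Cassandra’s storage engine: `TinyStore`, the JSON-lines log format, and the file name are hypothetical, and the sketch leaves out flushing to on-disk tables and compaction.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy log-structured store: append to a commit log, then update memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()

    def _replay(self):
        # On startup, rebuild the in-memory table from the commit log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable[record["key"]] = record["value"]

    def write(self, key, value):
        # Sequential append first (durability), then the in-memory update.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.memtable[key] = value

    def read(self, key):
        return self.memtable.get(key)


log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
store = TinyStore(log_path)
store.write("user:42", "alice")
store.write("user:42", "alice-updated")

# Simulate a crash: discard the object and recover from the log alone.
recovered = TinyStore(log_path)
```

Because every write is a sequential append, no existing data has to be located or rewritten on the write path, and replaying the log after a restart restores the latest value for each key.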

Cassandra also does well in the other area of data management: analytics. With the current release, MapReduce is supported across your stored data. MapReduce is a programming model popularized by Google that allows analytical queries to run in parallel on large data sets across large numbers of servers. It’s not real time (typical jobs can take minutes, if not hours), but it can process gigantic data sets to find the information you need. Because Cassandra provides both online and analytical solutions, you can use a single technology to meet the majority of your data needs, which benefits development, QA, and operations alike. Given that Cassandra has shown itself to work at scale, you can trust it to perform well as your needs grow.
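For readers new to MapReduce, the following minimal sketch shows the two phases on a single machine. It does not use Cassandra or Hadoop; the `partitions` data is hypothetical, with each inner list standing in for the rows held by one server, and a thread pool standing in for the cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for rows held by one server.
partitions = [
    ["login", "search", "login"],
    ["search", "purchase"],
    ["login"],
]


def map_partition(rows):
    # Map phase: emit a (key, 1) pair for every row in one partition.
    return [(row, 1) for row in rows]


def reduce_counts(pairs):
    # Reduce phase: sum the values emitted for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Run the map phase over all partitions in parallel, then reduce the results.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_partition, partitions))

event_counts = reduce_counts(pair for chunk in mapped for pair in chunk)
print(event_counts)  # {'login': 3, 'search': 2, 'purchase': 1}
```

The map step touches each partition independently, which is what lets a real cluster spread the work across many servers before the results are combined.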

Cassandra and OpenStack

As should be clear by now, Cassandra and OpenStack are conceptually a good pairing: OpenStack powers and abstracts the datacenters, defines the server infrastructure Cassandra needs to work, and simplifies all phases of development, deployment, and operations.

Until recently, however, managing Cassandra on OpenStack was difficult. It was possible to provision database instances using Orchestration (Heat) templates, but normal security policies (i.e., no access to the database from the WAN) made management by end users largely impractical. Today, the Trove OpenStack DBaaS solution has arrived, offering an API that lets users interact directly with in-VM agents and enables all operations defined by the management interface.

Cassandra and OpenStack DBaaS

OpenStack DBaaS now supports the Apache Cassandra NoSQL database. The first iteration covers:

Provisioning of Cassandra as a single-instance database.

Power maintenance (start, stop, restart, restart with new configurations).

Resize events (volume and flavors).
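To make these operations concrete, the sketch below builds the JSON request bodies that the provisioning, restart, and resize operations listed above would send. It assumes the request shapes of the Trove v1.0 REST API (an instance-creation call and a per-instance `action` call), so check them against the API reference for your release. The helper names, the instance name, the flavor IDs, the sizes, the version string, and the datastore type name `cassandra` are all hypothetical; the datastore name in particular depends on how the operator registered it. No network calls are made.

```python
import json


def create_instance_body(name, flavor_ref, volume_gb, datastore_version):
    # Assumed body for POST /v1.0/{tenant_id}/instances.
    return {
        "instance": {
            "name": name,
            "flavorRef": flavor_ref,
            "volume": {"size": volume_gb},
            # "cassandra" is an assumed datastore name, set by the operator.
            "datastore": {"type": "cassandra", "version": datastore_version},
        }
    }


def restart_body():
    # Assumed body for POST /v1.0/{tenant_id}/instances/{id}/action.
    return {"restart": {}}


def resize_flavor_body(flavor_ref):
    # Same action endpoint: move the instance to a different flavor.
    return {"resize": {"flavorRef": flavor_ref}}


def resize_volume_body(volume_gb):
    # Same action endpoint: grow the attached volume.
    return {"resize": {"volume": {"size": volume_gb}}}


# Hypothetical values, for illustration only.
create_body = create_instance_body("cassandra-1", "7", 10, "2.0")
requests_to_send = [
    create_body,
    restart_body(),
    resize_flavor_body("8"),
    resize_volume_body(20),
]
for body in requests_to_send:
    print(json.dumps(body, sort_keys=True))
```

Keeping the payloads as plain dictionaries makes it easy to pass them to whichever HTTP client or Trove client library you already use.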

The next iteration of improvements for OpenStack’s Juno release will cover:

Configuration management.

Backup (nodetool snapshot + custom scripts).

Restore (custom scripts).

Incremental backup (for Cassandra version 2.x.x or above).

Conclusion

Cassandra is a highly available, Internet-scale NoSQL database with design goals that are very different from those of traditional relational databases. The differences between Cassandra and relational databases identified in this article should each be weighed for their pros and cons and evaluated in the context of your problem domain. Also, using NoSQL does not exclude the use of an RDBMS: it’s quite common to have a hybrid architecture in which each database type is used in different situations according to its strengths.

When starting their first NoSQL project, developers are likely to enter new territory and have their first encounters with related concepts such as big data and eventual consistency. Relational databases are often associated with strong consistency, whereas NoSQL systems are associated with eventual consistency (even though the use of a certain type of database doesn’t formally imply a particular consistency model). When moving from the relational world and strong consistency to the NoSQL world, the biggest mind shift may be in understanding and architecting an application for eventual consistency. Data modeling is another area where developers may need to develop a new understanding.
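A small sketch may help make eventual consistency tangible. It models two replicas that briefly disagree and then converge once they exchange data, using a last-write-wins rule based on timestamps. This is a toy illustration under stated assumptions, not Cassandra’s replication protocol: the `Replica` class, the keys, the values, and the integer timestamps are all hypothetical.

```python
class Replica:
    """Toy replica that resolves conflicts by keeping the newest write."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Keep the value with the newest timestamp (last write wins).
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Pull every entry from the other replica; older entries are ignored.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("cart:7", "1 item", timestamp=1)
b.write("cart:7", "1 item", timestamp=1)

# A later update reaches replica a only, so b is temporarily stale.
a.write("cart:7", "2 items", timestamp=2)
stale = b.read("cart:7")

# Once the replicas exchange data, both return the same value.
b.sync_from(a)
a.sync_from(b)
converged = b.read("cart:7")
```

An application designed for eventual consistency has to tolerate the `stale` read in the middle of this sequence, which is the mind shift described above.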

Cassandra is a very interesting product with a wide range of use cases. I think it’s particularly well suited for use cases involving:

Very large data volumes

Very large user transaction volumes

High reliability requirements for data storage

A dynamic data model, where data may be relatively unstructured, or whose structure may change over time

Cross-datacenter distribution

And now the Apache Cassandra NoSQL database is available as part of the OpenStack Database cloud service.

