Migrating from a Relational Database to Apache Cassandra Database

Categories

Many applications running today are built using relational database technology such as PostgresSQL, MySQL, Microsoft SQL Server, or Oracle Database. Relational technology was the mainstay database for applications development for at least 20 years. However, the availability and scalability expectations of modern applications (along with the ability to realise substantial savings through adoption of open source technology), is causing many people to re-examine this choice.

Overview

Apache Cassandra database is an ideal candidate as a modern operational database to replace an existing relational database for many applications. Apache Cassandra NoSQL technology is designed from the ground up to support always-on availability, scale to unlimited data and transaction volumes, and remain manageable as your application grows.

Instaclustr has helped many customers make the transition from a relational to Cassandra database. This paper sets out our recommendations on why, when, and how to make the migration based on our learnings from these successful migrations.

When to consider migration

There are a number of signs that indicate that you have reached a point where it is time to consider a migration away from relational database technology to a modern NoSQL technology; such as Cassandra. In many cases, as you business grows, you will find yourself hitting several of these limits at the same time.

The first and clearest indication is where you begin to hit the scalability limits of your existing technology. This could become apparent either due to a realisation that you approaching hard, physical limits with your existing technology; performance of aspects of your application that are getting worse and you have no solution for; or ever increasing DBA time being spent to maintain performance and availability. Remember that migration will take significant time so you need to be aware of these signs long before they become critical to your application’s basic availability.

A second indication is changing (escalating) requirements for your application. Perhaps it was previously acceptable to have an outage once a month to apply patches and upgrades but you want to remove this downtime. Maybe you have a desire to implement a second, synchronised data center for disaster recovery, you want to improve response times by having multiple synchronized instances of your stack running geographically close to your users or there is a need to introduce a workload-isolated but synchronised environment for analytics purposes.

The final typical indication would be that you are approaching a major re-architect of your application and you want to future-proof it against the type of issues highlighted above. If you have to make major schema changes anyway then changing underlying technology from relational to Cassandra may represent a minimal incremental cost for substantial future benefits.

Preparing your application

Whether you see a short-term need to make the migration to Cassandra or just have a hope that your application will be successful enough in the future to reach the scale to need Cassandra, there are some approaches you can take in your development practices and architecture that will minimise the effort required to make the migration when the time comes.

The approaches that we will discuss in this section are:

Abstracting the data access layer (service oriented architecture)

Denormalizing within the relational database

Minimizing logic implemented within the database

Building data validation checks and data profiles

Most of these approaches (with the possible exception of denormalising) are considered good practice architecture in any event and will aid in maintainability of your application even if you never make a migration to Cassandra.

Abstracting the data access layer means adopting a service-oriented-architecture approach so you have a single service responsible for updating and retrieving data from your database, rather than having code modules access the database directly. This approach hides the database implementation details from the majority of your code. In the ideal case the changes to your application resulting from the change in database technology would be isolated to this single module.

Denormalizing within the relational database means building the schema within your relational database as if you were already using Cassandra. Keep all data within denormalized tables so a single query on the primary key extracts all the data related to an entity. This approach both ensures minimal application change when you make the migration and can make the performance of your relational database stretch a bit further than complex schemas requiring many joins to retrieve information.

Minimizing logic implemented in the database means not using features such as stored procedures, custom functions and triggers within your relational database. Instead, implement this logic within your application. Although Cassandra 3.x has started to provide features like this in Cassandra they are not a one-for-one replacement for relational features and logic in the database will still need to be, at best, rewritten for Cassandra or, more likely, moved to the application. In any event, excess use of triggers and stored procedures is likely to make your application hard to understand and debug.

Building data validation checks and data profiles is not a change to your core application architecture but rather recognising that when you do come to migrate you will need a range of tools for data validation and a good understanding our the profile of your data. Data validation tools are often useful for more minor migrations in your application. Recognising that you have a major requirement coming up can allow you to invest a little more upfront to save overall effort in the long run.

Migration approaches

In our experience there are two basic approaches to migration: big bang migration or parallel run.

Big bang migration refers to approach where you stop your application, copy data from your existing database to Cassandra and then start a version of your application that works with Cassandra. The big bang approach minimises the effort required to build migration tools and versions of your application that run in the transition state, and also minimises the operational effort managing the migration. However, there are some negatives to consider before pursuing the big-bang approach:

Significant application downtime may be required to execute the data copy: Your application will need to stop writing to the existing database for a period so that data can be reliably and completely copied to Cassandra. Depending on the volume of data, this process could take hours or even days. It may be possible to shorten this downtime if you can load a snapshot of your relational database and then reliably extract and load a copy of the changes since the snapshot once your application is stopped.

It’s high risk – once you’ve cutover to the new database and started writing, rollback is hard to impossible. The only real mitigation to this is testing – you needs to be very certain your migration process has accurately copied all data and that your application is functionally correct and operationally stable before you make the cutover. Once you have made updates in the new database, you will have no way of apply those writes to the old database and rolling back to the old application if you do hit issues.

Parallel run refers to an approach where you modify your application to write to both Cassandra and the relational database and the same time, gain confidence that this is working correctly (via regular reconciliations and performance monitoring), gradually cutover reads to Cassandra and then decommission the reads and writes to the relational database. This approach typically requires more development effort and also more effort in operating the system during the transition phase. However, it can be completed without downtime and allows you to gain full confidence of the new solution in production before removing the possibility of a rollback. Some considerations for this approach are:

You will need a reliable, repeatable process for validating that the relational and Cassandra databases are in synch while your application is still making updates. This will be run regularly during the parallel run period to improve your confidence that everything is functioning correctly. An ideal process would likely have the ability to do both a complete reconciliation or to reconcile a selected subset or randmon sample of the data.

You will need an efficient, repeatable process for resynching the Cassandra database from the relational database while both are running. Most likely, you will do the bulk of the initial loading of your Cassandra database from an offline snaphshot of your production relational database. So, your first use of your synch process will be to bring Cassandra up to date with changes made to the relational database after the snapshot while both are online. You can also use the synch process to recover from any issues found with writing to Cassandra after the initial synch.

With either of these approaches you will likely have the choice of cutting over your entire application in one operation or migrating individual tables (or, more likely, groups or related tables) one at a time. This option can be very attractive in the fairly common scenario where you have one or two very large table in relational database that are reaching scalability limits and need to be migrated quickly but they are supported by a large number of related tables that will take significant effort to migrate. The main issue to consider here is that join to the table to be migrated that were undertaken in the relational database will now have to be undertaken in the application, possibly leading to more, smaller read operations on the database. Also, you will not be able to rely on the database enforcing referential integrity.

Estimating a migration

The actual effort required to complete the migration will obviously be highly dependant on the particulars of your application and environment. However, all migrations will have many common, high-level tasks. A typical list of tasks (ordered roughly from most work to least work) would include:

Conclusion

Many organisations have successfully migrated applications from relational database technology to Cassandra and reaped significant benefits. While this kind of change will be a major undertaking for any application, the level of effort can be reduced through pre-migration design approaches and the risk of migration can be managed by careful planning and parallel run approaches.

Good point – (lack of) full ACID transaction support is another key difference of Cassandra vs relational databases.

There are two patterns that are typically used to deal with this requirement.

The first is a Cassandra’s logged batch capability. A logged batch is executed as an atomic operation in that it is guaranteed that either all writes succeed or none do.

However, logged writes do have a performance overhead so in some circumstances it’s better to construct the application so that writes are idempotent and replay them from the client in the event of a failure (although this does potentially leave you in an inconsistent state for a period).

The choice of pattern will depend on your exact requirements.

The other thing worth mentioning here is C* 3’s materialised views. Materialised views are updated in the same operation as the base write so a materialised view is guaranteed to be consistent with the base table.