Next-Generation Databases for Big Data

In a new book, titled Next Generation Databases: NoSQL, NewSQL and Big Data, Guy Harrison explores and contrasts both new and established database technologies. Harrison, who leads the team at Dell that develops the Toad, Spotlight, and SharePlex product families, wrote the book to address the gap he sees in the conversation about the latest generation of databases.

What was your motivation in writing this book?

There was a lot of talk about why one type of database was better than another, but nothing that I could see that compared them all in context: how they arose and what each is good for. The second motivation, beyond putting them in context, was to compare, from a technical point of view, the way the various database technologies solve the problems common to all databases, such as concurrency, locking, durability, and performance.

Who is it targeted at?

Primarily, I was aiming at database developers, administrators, and application architects. Those are the people who generally need to understand the strengths and weaknesses of various technology options in order to make the best decisions about which database to choose, how to optimize code for a particular database, and how to manage the technology once it is established. If there’s one thing I wanted the book to do, it was to provide a reasonably concise survey of the database landscape for people who are trying to choose a database technology for a new project.

What does the explosion of non-relational database alternatives mean from a business perspective?

There are two main business perspectives to consider, and modern databases are to some extent responding to those business pressures rather than driving them. First, what we generally call NoSQL is a response to the demands of modern applications in the cloud. The traditional RDBMS was perfectly suited to powering on-premises and small web systems, but it really was unable to scale to the global scope and continuous availability that large enterprises now aspire to. This led to some of the eventually consistent distributed database models that we see in systems such as Cassandra.

And, the second?

Systems such as Hadoop and Spark are part of the ongoing transformation to a “data economy.” That is the idea that the competitive differentiation between two companies may come down to how much data each has about its customers and how well it processes that data. That’s where data farms and big data come in, and Hadoop and Spark are the key technologies because they are the only ones that provide an economically viable way of handling the volume and lack of structure that this data will have.

What are the biggest forces shaping the data landscape today?

I think the biggest force continues to be the phenomenon that we refer to as big data. The term has been somewhat overhyped and become poorly defined, but at its essence it describes how an enterprise can leverage data for competitive advantage. The possession and smart use of data has become as much of a competitive differentiator in many industries as any other factor. I think we’re seeing a bit of a “trough of disappointment” as enterprises struggle to find the key to unlocking the value inherent in data, but I still believe this remains a strategic priority for most modern businesses.

If the ability to leverage data is key to the survival of a modern business, what are the technologies that you see as being key to leveraging data?

The database technologies are not the most important factor here. You obviously need a data storage platform that is capable of storing all the data that you may need to process, and capable of applying enough CPU to the data to allow it to be processed in a timely manner, but with technologies such as Hadoop and Spark we really have a “good enough” solution for this.

What’s missing seems to be a big data analytic platform that is usable by people without a Ph.D. in statistics. Today, you need an almost unattainable combination of skills to succeed in data science projects—advanced programming, statistics, and machine learning, together with a deep level of business insight. The larger companies can acquire these rock stars, but in the mid-market, we clearly need software solutions that are easier to use.

How do relational database technologies fit in?

The relational database is an absolute triumph in terms of software engineering. Certainly for anything that doesn’t have global scope, the relational database is likely to be the best choice and that will remain true for the foreseeable future. The relational database remains the best model for the widest variety of workloads and I expect it to remain dominant for CRM, ERP, and data warehousing applications. Its ability to maintain relevance across three generations of application architecture—green screen, client/server, and web—is a testament to its incredible utility and applicability.

The idea that there is a system in which the data has been carefully structured and put into a schema remains critical for business intelligence. There is a need in big data to be able to handle masses of raw data that might be in various structures—JSON documents, XML, text files, or images. However, at some point there is also a need for curated, cleansed, transformed data that is reliable and on which you can base real-time decision making. There is a lot of energy going into non-relational technologies because they are new and because they are associated with the leading edge of application development, but the relational database definitely is the mainstay for the majority of applications.

Do you foresee a time when relational database technologies will not be part of the technology mix?

No, I don’t. Actually what I expect is that core relational principles will increasingly find their way into the next generation of systems. When we say relational database, we usually mean the combination of Codd’s relational model, the ACID transaction pattern, and SQL. These combined in the modern RDBMS in an incredibly fruitful manner, but they are really independent technologies.

The relational representation of data is the most logically correct representation that we have been able to devise, and its theoretical foundation is as valid today as it ever was. You see the SQL language finding its way into new-generation systems all over the place. Although the ACID transaction model, which became strongly associated with the relational database, has proven too restrictive for modern workloads, it will still be a strong factor in future database design.
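The point that ACID is a distinct ingredient of the RDBMS package is easiest to see in code. As a minimal illustration of ours, not an example from the book, Python’s built-in sqlite3 module shows the atomicity half of ACID: a transfer that would overdraw an account raises inside the transaction and rolls back, leaving no partial update behind.

```python
import sqlite3

# Illustrative sketch: the "A" in ACID with Python's built-in sqlite3.
# A failed transfer rolls back atomically, leaving both balances untouched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds in one transaction; roll back if src would go negative."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)    # succeeds and commits
transfer(conn, "alice", "bob", 500)   # fails and rolls back completely
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances is {"alice": 70, "bob": 80}: the failed transfer left no trace
```

The same transactional guarantee could sit on top of any data model; that separability is what lets NoSQL systems drop or relax it independently of the relational representation.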

Will enterprises in the future continue to rely on separate, special-purpose database technologies for different requirements?

I hope not. As I outline in the book, I hope that the special-purpose functionalities will be incorporated as configuration options into a single DBMS rather than requiring users to make a sort of Hobson’s choice between multiple Not Quite Right solutions.

You can see the trend toward convergence. Databases such as MongoDB are talking about implementing joins and transactions. There are many SQL options available for Hadoop distributions, and N1QL is an impressive attempt to provide a full SQL dialect for Couchbase. Meanwhile, Oracle has implemented a complete document database within its flagship product.

We still need a more tunable consistency model. Cassandra and other Dynamo-style systems provide an example of how this might work, allowing you to choose the level of consistency—strict, eventual, or weak—on a transaction-by-transaction basis, depending on availability and performance requirements.
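The quorum arithmetic behind tunable consistency can be sketched in a few lines of plain Python. This is a toy model of our own, not Cassandra code: with N replicas, a write acknowledged by W nodes and a read that consults R nodes are guaranteed to overlap—and therefore see the latest value—whenever R + W > N.

```python
# Toy model (illustrative, not Cassandra's implementation) of tunable
# consistency: a write lands on the first w replicas, and a read polls the
# last r replicas (the worst case for overlap), keeping the newest version.
N = 3  # replication factor

class Replica:
    def __init__(self):
        self.value, self.version = None, 0

def write(replicas, w, value, version):
    # "W" consistency: only w replicas acknowledge the write.
    for rep in replicas[:w]:
        rep.value, rep.version = value, version

def read(replicas, r):
    # "R" consistency: consult r replicas, keep the highest version seen.
    return max(replicas[-r:], key=lambda rep: rep.version).value

replicas = [Replica() for _ in range(N)]
write(replicas, w=2, value="v1", version=1)  # quorum write (2 of 3)

assert read(replicas, r=2) == "v1"  # R + W = 4 > N: read set must overlap
assert read(replicas, r=1) is None  # R + W = 3 = N: may hit a stale replica
```

Dialing W and R up or down per operation is exactly the trade-off being described: lower values favor availability and latency, higher values favor consistency.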

What do you see ahead as the most promising new technologies that are not yet mainstream?

In database technology, I’m currently really interested in Blockchain—the distributed ledger that underlies Bitcoin. Blockchain is like a new type of distributed database that uses peer-to-peer protocols to confirm transactions. In a Blockchain, it’s possible for users to really own their own data. For instance, there is nothing technically stopping a Facebook DBA from changing one of your posts, but in a Blockchain system you could maintain complete control over the contents of your personal data. Blockchain stands to revolutionize and possibly severely disrupt banking, but I think, in a less dramatic way, it’s going to become incorporated into a lot of database technologies.

If you had one piece of advice for an organization getting started with a big data project, what would it be?

The first thing to do is to stop throwing data away. There are so many stories of companies that have discovered that valuable products or decisive insights can be generated from data that was previously regarded as industrial waste from some other production system.

And second, build a Hadoop cluster and start pushing all of the data you are collecting from all sources into it. Don’t wait until you work out what you are going to do with it, because that might be 2 years down the road, and by then you might have lost 2 years of data—data that might be decisive in the success of your project. After all, once you delete data, there’s no way to get it back.