I’m a big believer in polyglot persistence. There are so many (very different) production-ready databases available today that it is becoming more and more common to find applications using more than one database, utilizing the strengths of each. Using the right tool for the job gives me a warm, fuzzy feeling inside.

However, polyglot persistence comes with its own set of drawbacks. One of those drawbacks is the loss of foreign keys, which are very important in maintaining data integrity. Another drawback is that Object/Relational Mapping (ORM) libraries typically focus on a specific database, or type of database. So, writing code that manages relationships between objects backed by different databases hasn’t been nearly as easy as writing code to manage relationships between objects in the same database.

Tenacity’s goal is to address some of these issues. Tenacity is a ruby gem that provides an ORM independent way of managing relationships between models backed by different databases.

Tenacity works by extending popular Ruby ORM libraries to respond to a set of methods that the tenacity core uses to build and manage relationships between objects. By extending the ORM libraries to implement this interface, tenacity is able to work with the objects in a generic way, without having to know what database is backing the given objects. This approach also allows you to continue using your favorite ORM libraries. To use tenacity, you simply need to include Tenacity inside your model.
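The idea can be illustrated with a small, self-contained sketch. The method and macro names below (`_t_find`, `t_belongs_to`) are illustrative stand-ins, not Tenacity’s actual API, and the in-memory “databases” are just hashes — the point is that the association code only talks to a generic interface:

```ruby
# Sketch of the idea: every "ORM" is extended to answer the same
# minimal interface (here just _t_find), so association code never
# needs to know which database backs a given model.
module PolyglotAssociations
  def self.included(model)
    model.extend(ClassMethods)
  end

  module ClassMethods
    # Defines reader/writer methods for a belongs_to-style association
    # that may point at a model backed by a different database.
    def t_belongs_to(name, target_class)
      id_attr = "#{name}_id"
      attr_accessor id_attr

      define_method(name) do
        # The core only relies on the generic _t_find interface.
        target_class._t_find(send(id_attr))
      end

      define_method("#{name}=") do |object|
        send("#{id_attr}=", object.id)
      end
    end
  end
end

# Stand-in for an ORM extension: a trivial in-memory "database".
module FakeStore
  def save(obj)
    store[obj.id] = obj
  end

  def _t_find(id)
    store[id]
  end

  def store
    @store ||= {}
  end
end

class Account # imagine this one backed by a relational database
  extend FakeStore
  attr_accessor :id
end

class Profile # imagine this one backed by a document database
  extend FakeStore
  include PolyglotAssociations
  attr_accessor :id
  t_belongs_to :account, Account
end

account = Account.new
account.id = 42
Account.save(account)

profile = Profile.new
profile.id = "doc-1"
profile.account = account
Profile.save(profile)

profile.account # => the Account with id 42, fetched via _t_find
```

Because `Profile` only stores `account_id` and calls `_t_find`, it never touches the other model’s database directly — which is exactly what lets the two models live in different stores.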

Tenacity is heavily based on ActiveRecord’s associations, and aims to behave in much the same way, supporting many of the same options.


I’m going to be speaking about CouchDB at the next ChicagoDB meeting, which will be held on August 16th, 2010. I’m currently putting together some slides that will (I hope) provide a good introduction to CouchDB and its features. I also plan on doing a live demo at the end, so everybody can see CouchDB in action.

Information about the meeting can be found here. I hope to see you there!

WindyCityDB has been over for a while now, and I’ve had some time to collect my thoughts about the conference.

First and foremost, this was an incredibly well organized and wonderfully run conference. Ray Hightower and the WindyCityDB team did an amazing job. The space at IIT was perfect. The food and drink were ample and tasty. And the talks were interesting and thought-provoking. To top it off, all proceeds from the conference, approximately $1,000, were donated to the Greater Chicago Food Depository.

One theme that took me by surprise was the broad support for relational databases. Almost all of the speakers stated that the relational database still has a place in this new world, alongside all of the NoSQL technologies that we had gathered to discuss. This is something I completely agree with, but since this was a NoSQL conference, I had imagined that there would be quite a bit of RDBMS bashing going on. Nothing could have been further from the truth. Everybody seemed to agree that the RDBMS is a very functional and reliable piece of software that is still very much needed. Everybody also seemed to agree that there are some things the RDBMS doesn’t do too well based on how it is designed, and that the different NoSQL technologies aim to fill this gap. Nobody was advocating their particular NoSQL database as a replacement for the RDBMS. Instead, all were pitched as alternatives, created to solve different problems.

Another theme that sort of caught me off guard was the wide acceptance of Polyglot Persistence as “the future” of application persistence methods. I say “sort of” because I was there to speak about Polyglot Persistence, which of course means that I think it can be a great tool if used properly. However, I thought it was a more controversial topic. As it turns out, there are many already taking a polyglot approach to solving their persistence needs. This was great to hear, as it made me feel a little less insane for wanting to use multiple databases within a single application. Well, I could still be insane, but at least now I know I’m in good company :) Several speakers mentioned Polyglot Persistence in their talks, and I spoke with several individuals at the conference, and after the conference, who are actively using a polyglot approach to address their persistence requirements.

I also found John Nunemaker’s prediction that hosted database solutions are going to be the future of application data persistence to be very interesting. If you think about it, this is exactly what happened with web hosting. Most companies, especially small companies, now use a 3rd party to host their production environments. In fact, I’d bet that the thought of hosting your own environment doesn’t even cross the mind of the vast majority of small companies nowadays. I can easily see how hosted database solutions, such as MongoHQ or Cloudant, could follow a similar path. You would no longer need to worry about hardware issues, scaling issues, backups, maintenance, etc. It would allow you to simply focus on your data.

Even though I didn’t get a chance to play with it, the WindyCityDB lab was a nice touch, especially for those who had never tinkered with some of the NoSQL databases on display. The lab consisted of several iMacs, each loaded with a particular NoSQL database. All of the databases were preloaded with the Stack Overflow data set. Sharing a data set was a great idea, as it gave you a good way to compare and contrast how the different databases handled the same set of data.

I really enjoyed myself at this conference, and am already looking forward to next year’s. At the rate the NoSQL movement is moving, I can only imagine what the landscape will look like a year from now.

The slides from my WindyCityDB talk about Polyglot Persistence have been posted to Slideshare. You can see them here, or embedded in the post below.

The case for Polyglot Persistence was made throughout the day by several of the speakers. Most people seemed to acknowledge that it is very likely one database will not have all of the tools you need to get your job done. Instead of coercing a single database to try to do things it was never designed to do, it is becoming more common for applications to use multiple databases, utilizing each for their respective strengths. But, of course, using multiple databases in a single application comes with its own set of issues, and you should make sure there is a real need for Polyglot Persistence before making that leap.

Thank you very much to the WindyCityDB organizers for putting on such a great event. I had a great time, learned a ton, met some interesting people, and participated in some great conversations. What more can you ask for in a tech conference?

The days of the relational database being a one-stop-shop for all of your persistence needs are over. A new class of application is beginning to emerge with requirements that exceed the capabilities of the relational database. Some of these applications need unlimited scalability or bulletproof fault tolerance, while others may require blazing fast access or flexible data storage. The relational database was simply not designed to meet the needs of this small but growing class. Instead, a new breed of data stores is gaining momentum. These data stores are looking at data persistence with a fresh set of eyes, diverging from the relational model considerably in order to meet these challenges.

What’s wrong with the relational database?

For 99% of the applications out there, absolutely nothing. The relational database has been the industry standard for data storage over the past 30+ years for good reason. It is an incredibly capable piece of software. Although it may not be the best tool for everything it is used for, it certainly satisfies the needs of the vast majority of applications just fine.

However, while not new, the class of applications mentioned above is becoming more common. These applications either handle enormous amounts of traffic, or deal with tremendous amounts of data. The relational database falls short in a few areas when trying to meet the demands of an application like this.

A single database server is usually not enough to support these requirements. Applications like this need a true database cluster, capable of adding storage space and processing power on the fly without the application even noticing. However, relational databases weren’t designed to operate in a cluster where all machines are capable of reading and writing data. This is largely due to the promises they make regarding data integrity. In order to fulfill these promises, the database needs easy, quick access to all of the data at all times to verify that duplicates aren’t being inserted, constraints aren’t being violated, etc. This quickly becomes a bottleneck when dealing with very large amounts of data.

There are techniques for scaling out relational databases, but they don’t address every concern. One popular technique is to use one or more slave databases for read requests, while continuing to funnel all write requests through the master database. The master database constantly synchronizes with the read-only databases, so the data remains consistent between databases. This technique works great for read-heavy applications, but does not help applications that perform just as many creates, updates, and deletes. Data sharding is another popular technique, which involves splitting the data up onto several different databases based on some criteria. But this pushes an extraordinary amount of complexity onto the application, as it is now responsible for determining which database to use for specific data sets. Master-master replication can be used to keep multiple master databases in sync, so any database server can perform read or write operations. However, for some applications there comes a point where the replication can’t keep up with the traffic.
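The complexity sharding pushes onto the application shows up even in a trivial sketch. The host names below are made up; the point is that the routing decision now lives in application code:

```ruby
require 'zlib'

# Sketch of application-side shard routing: the application, not the
# database, decides which server owns a given user's data.
SHARDS = [
  { host: "db1.example.com" }, # hypothetical shard connections
  { host: "db2.example.com" },
  { host: "db3.example.com" },
].freeze

def shard_for(user_id)
  # A stable hash keeps a given user's data on the same shard.
  SHARDS[Zlib.crc32(user_id.to_s) % SHARDS.size]
end

shard_for(1001) # same shard every time for this user
```

The cost appears as soon as a query spans users: the application has to fan out to every shard and merge the results itself, logic the database used to handle for free.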

Relational databases are also (intentionally) very strict when it comes to the structure of the data being stored. Data must be broken up into a series of rows and columns. Good object/relational mapping tools hide much of this awkwardness from us, but some applications deal with data that doesn’t map well into rows and columns. A simple key/value store is usually a better fit for applications like this.

How does the new breed address these problems?

The new breed of data stores, called NoSQL databases, make very few promises regarding data integrity. In this new model, data integrity becomes the application’s concern. By not having to enforce any complex data integrity rules, NoSQL databases can scale to levels way beyond that of a relational database. Adding more processing power or storage capacity can be as simple as adding a new machine to the cluster. The database can then store and process the data using any machine in the cluster.

In this model, the data being stored is self-contained, and does not rely on any other data in the database. Therefore there is no need for one machine to know anything about the other machines in the cluster. This approach is quite different from the relational model, where data is broken up into multiple tables to eliminate duplicate data, and joined back together when being accessed.

Most of these databases subscribe to a theory called eventual consistency. In situations where duplicate information is scattered across different servers in the cluster, it is not feasible for the database to find all instances of that data and update it as a part of the original operation. Instead, the data will be replicated to the other database servers at a later time. Until that replication takes place, the database will be in an inconsistent state, where simultaneous queries fetching the same data could return different results. Although this sounds terrible, it turns out that in practice it is really not too big of a deal for most applications. Do all customers of an online retailer need to see the exact same set of product reviews 100% of the time? Probably not.
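The observable effect can be simulated in a few lines. This is a toy model, not any real database’s replication protocol:

```ruby
# Toy model of eventual consistency: a write lands on one replica and
# only reaches the other when replication "catches up".
class Replica
  def initialize
    @data = {}
  end

  def write(key, value)
    @data[key] = value
  end

  def read(key)
    @data[key]
  end
end

primary   = Replica.new
secondary = Replica.new

primary.write(:review_count, 10)

# Until replication catches up, the same query can return different
# results depending on which replica happens to serve it:
primary.read(:review_count)   # => 10
secondary.read(:review_count) # => nil (stale)

# Replication later copies the change over and the replicas converge.
secondary.write(:review_count, primary.read(:review_count))
secondary.read(:review_count) # => 10
```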

Also, because there are few promises regarding data integrity, NoSQL databases can offer data storage that is much more flexible. The database no longer has to enforce the uniqueness of a column, or ensure that the id of some referenced piece of data actually exists in the database. Some of these databases are true key/value data stores, where you can store just about anything. Others require a certain document format to be used (such as JSON or XML), but still allow you to freely change the contents of that document as you wish.

Still no one-stop-shop for persistence

Although NoSQL databases address some issues that can’t be addressed by relational databases, the opposite is true as well. The relational database offers an unparalleled feature set. While some of these features prevent it from serving the needs of the class of applications described above, they are absolutely required by other classes of applications. In some domains, data integrity is the number one concern. You need to look no further than the classic “try to withdraw money from the same account at the same time” example to justify the need for locks and transactions.

For the vast majority of applications out there, relational databases work great. There are a boatload of tools and libraries that support them, and software developers are very familiar with how to use them. It is safe to say that the relational database has secured its spot in IT departments and data centers around the world, and it isn’t going anywhere. It is far from dead.

Polyglot persistence

A growing number of case studies are appearing that describe how real-world applications need the data integrity offered by the relational database in addition to the benefits offered by NoSQL databases. I believe this trend will continue, as companies are storing more data than ever, and processing that data in different ways than previously imagined.

To address these needs, some companies are beginning to run their relational database side-by-side with one or more of the NoSQL alternatives. Extremely large data sets that require scalable storage space and processing power are moved to a NoSQL database, while everything else, especially data that needs its integrity kept in check, remains in the relational database. The term Polyglot Persistence has been used to describe the use of multiple databases within the same project.

The benefits of polyglot persistence

The benefits are somewhat obvious. By running a relational database side-by-side with a NoSQL database, you get the best of both worlds. Strict enforcement of data integrity from the relational database, and the scalability and flexibility provided by the NoSQL database. This allows you to use the best tool for the job, depending on your use case.

There are a few scenarios where I’ve seen systems take advantage of polyglot persistence. The first scenario involves the need to perform some set of complex calculations on an extremely large data set. The data is either copied/moved from the relational database to the NoSQL database, or inserted directly into the NoSQL database by the application. A cluster of NoSQL database servers can then divide the work, process the data, and aggregate the results. The more machines you have in your cluster, the less time the processing will take. The resulting data can either remain in the NoSQL database or be inserted into the relational database, depending on what needs to be done with the results.
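That divide-the-work-then-aggregate flow is essentially map/reduce. Here is a toy, in-process version with made-up data; a real NoSQL cluster would run each chunk on a different machine:

```ruby
# Pretend each slice of a large data set lives on a different node.
orders = (1..1_000).map { |i| { id: i, total: i * 2 } }
chunks = orders.each_slice(250).to_a

# "Map" phase: each node computes a partial result over its own chunk,
# independently of the others.
partials = chunks.map { |chunk| chunk.sum { |order| order[:total] } }

# Aggregate ("reduce") phase: combine the small partial results.
grand_total = partials.sum # => 1_001_000
```

Adding machines just means more, smaller chunks, which is why the processing time shrinks as the cluster grows.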

The other scenario takes advantage of the schema-less nature of some NoSQL databases. While it is certainly possible to store a serialized data structure in a single column of a relational database, interacting with that data can be a bit more challenging than if that data were in a schema-less, document-oriented database. This use case, after all, is what the document-oriented databases were designed for. These types of databases simply treat the data as a collection of key/value pairs, identified by a unique ID. The NoSQL databases provide ways in which you can add structure back into the document so the data inside the document can be queried. These databases are great for storing data that can be radically different from document to document, or data whose structure changes constantly.
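A small sketch of the point: two documents in the same “collection” with radically different structure, each round-tripping through JSON with no schema to define or migrate (the data here is made up):

```ruby
require 'json'

# Two documents in the same collection, no shared schema between them.
docs = {
  "book-1"  => { "title" => "Some Book", "pages" => 300 },
  "album-7" => { "title" => "Some Album",
                 "tracks" => ["Track One", "Track Two"] },
}

# Each document is stored as a self-describing blob of key/value pairs.
stored = docs.transform_values(&:to_json)

# Reading one back requires no knowledge of the other's structure.
loaded = JSON.parse(stored["album-7"])
loaded["tracks"].size # => 2
```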

The challenges of polyglot persistence

Polyglot persistence comes with its own set of challenges. While you potentially get the best of both worlds as far as features go, you also get the complexity and hassle of dealing with not only multiple databases, but multiple database models.

Determining which database to use to store certain data

With more than one database, you now have to decide where to store the data. It’s no longer a given. If you make the wrong decision, you could be looking at a painful migration from one database model to another as a result. To make this decision, you need to carefully examine how the data will be used.

Increased application complexity

Applications also face increased complexity as they now have to interface with two different (potentially very different) data stores. If done correctly, you should be able to isolate this complexity to the persistence layer of your application, freeing the rest of the application from having to know which database a specific piece of data is coming from. But, interfacing with multiple data stores could greatly increase the complexity of that persistence layer. Your application will now need to know:

How to connect to each of the databases

What database to use for specific sets of data

How to handle the different types of errors from each database

How to map results from each database back to your application’s object model

How to handle queries for information across databases

How to mock out the different databases for testing

Potentially, how to move data from one database to another

Addressing these concerns could result in a bunch of new application code, and with added code usually comes added complexity, and more bugs.
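One common way to contain these concerns is the isolation mentioned above: a thin persistence layer that is the only code aware of which store holds what. A hypothetical sketch, with made-up class, collection, and record names standing in for real database clients:

```ruby
# Hypothetical persistence layer: the only place in the application
# that knows which database backs which collection of data.
class PersistenceLayer
  def initialize(relational:, document:)
    @stores = { users: relational, activity_logs: document }
  end

  def find(collection, id)
    store = @stores.fetch(collection) do
      raise ArgumentError, "no store configured for #{collection}"
    end
    store.find(collection, id)
  end
end

# Stand-in for a real database client library.
class FakeClient
  def initialize
    @data = Hash.new { |hash, key| hash[key] = {} }
  end

  def insert(collection, id, record)
    @data[collection][id] = record
  end

  def find(collection, id)
    @data[collection][id]
  end
end

relational = FakeClient.new
document   = FakeClient.new
layer = PersistenceLayer.new(relational: relational, document: document)

relational.insert(:users, 1, { name: "Ada" })
document.insert(:activity_logs, "log-9", { event: "login" })

layer.find(:users, 1)               # routed to the relational store
layer.find(:activity_logs, "log-9") # routed to the document store
```

The rest of the application calls `layer.find` and never learns which database answered, which is what keeps the polyglot complexity from leaking everywhere.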

Increased deployment complexity

In addition to the increased application complexity, you will also face increased deployment complexity.

Will you need to provision new hardware to host the new database?

How will you backup the data in the new database?

How will you manage and control changes to the configuration of the new database?

Training for developers and operational staff

Given that this database will likely be radically different from the relational database that your developers and operational staff are comfortable with, how will you bring them up to speed on how to use and manage this new database? And, given that the majority of the NoSQL databases are still very young, how will you keep your developers and operational staff up to speed with the latest developments on the project?

This is a big issue, especially in companies with large development and operations teams, and needs to be thought through carefully.

Is there an expert you can hire to help you get up and running, and mentor your staff?

Is there any training available that you can give to your staff?

Who can you turn to for support when something goes wrong in production?

Summary

I’ve always been a big advocate of using the right tool for the right job. For the past 30 years, the relational database has been the de-facto standard for persistence. Creative people have managed to utilize and manipulate it to serve all sorts of different use cases, quite successfully. But just because you can fit a square peg through a round hole if you hit it with a big enough hammer doesn’t necessarily mean that you should.

NoSQL databases can be great tools for addressing data persistence cases that the relational database struggles with. In addition, each NoSQL database brings its own set of strengths and weaknesses to the table. They are becoming very important tools to have around, and I believe that our industry will see a steady increase in the adoption of these tools going forward.