It's a truism that we should choose the right tool for the job. Everyone says that. And who can disagree? The problem is this is not helpful advice without being able to answer more specific questions like: What jobs are the tools good at? Will they work on jobs like mine? Is it worth the risk to try something new when all my people know something else and we have a deadline to meet? How can I make all the tools work together?

In the NoSQL space this kind of real-world data is still a bit vague. When asked, vendors tend to give very general answers like NoSQL is good for BigData or key-value access. What does that mean for the developer in the trenches faced with the task of solving a specific problem when there are a dozen confusing choices and no obvious winner? Not a lot. It's often hard for developers to take that next step and imagine how their specific problems could be solved in a way that's worth the trouble and risk.

Let's change that. What problems are you using NoSQL to solve? Which product are you using? How is it helping you? Yes, this is part of the research for my webinar on December 14th, but I'm a huge believer that people learn best by example, so if we can come up with real, specific examples I think that will really help people visualize how they can make the best use of all these new product choices in their own systems.

Here's a list of use cases I came up with after some trolling of the interwebs. The sources are so varied I can't attribute every one, so I'll put a list at the end of the post. Please feel free to add your own. I separated the use cases out for a few specific products simply because I had a lot of use cases for them and they were clearer on their own. This is not meant as an endorsement of any sort. Here's a master list of all the NoSQL products. If you would like to provide a specific set of use cases for a product I'd be more than happy to add that in.

General Use Cases

These are the general kinds of reasons people throw around for using NoSQL. Probably nothing all that surprising here.

Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.

Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB of data per day with the prospect of this requirement doubling multiple times per year. This is the "data is too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes in-memory systems can be used.
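The arithmetic behind that "a day to store 7 TB" figure is easy to check. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even spread of writes over the cluster, which a real system will not quite achieve; the function name is just for illustration.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption)
MB = 10**6   # decimal megabyte, in bytes (assumption)

def hours_to_write(total_bytes, bytes_per_second, nodes=1):
    """Hours needed to write total_bytes when the load is spread evenly over nodes."""
    seconds = total_bytes / (bytes_per_second * nodes)
    return seconds / 3600

single_node_hours = hours_to_write(7 * TB, 80 * MB)
ten_node_hours = hours_to_write(7 * TB, 80 * MB, nodes=10)
print(f"1 node: {single_node_hours:.1f} h, 10 nodes: {ten_node_hours:.1f} h")
```

One node needs roughly 24 hours just to absorb a day's worth of data, so it can never catch up. Ten nodes writing in parallel would, under these idealized assumptions, need about a tenth of that.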

Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mind set. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that.

Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program and programmer friendly datatypes like JSON.

Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
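To make "imposed by the application at run-time" concrete, here is a minimal sketch in plain Python. The field names and the two document versions are made up for illustration; the point is only that the reader code, not the store, decides how old and new documents are interpreted.

```python
def read_user(doc):
    """Apply the application's current view of a user record to a stored document."""
    # Older documents stored a single "name" field; newer ones split it.
    if "first_name" in doc:
        first, last = doc["first_name"], doc["last_name"]
    else:
        first, _, last = doc.get("name", "").partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        # Field added later; old documents simply lack it, so default it here.
        "timezone": doc.get("timezone", "UTC"),
    }

# Two documents written by different versions of the application.
old_doc = {"name": "Ada Lovelace"}
new_doc = {"first_name": "Grace", "last_name": "Hopper", "timezone": "US/Eastern"}

users = [read_user(old_doc), read_user(new_doc)]
print(users)
```

Nothing had to be rewritten in the store when the fields changed. The cost is that every reader now carries this kind of compatibility logic.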

Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.

Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.

No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.

Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.

Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?

Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.

Avoid hitting the wall. Many projects hit some type of wall. They've exhausted all options to make their system scale or perform properly and are wondering what's next. It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.

Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focussed on scale, tend to exploit partitions, tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.

Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can't tolerate a partition failure. In the end this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.

Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.

Real-time inserts, updates, and queries.

Hierarchical data like threaded discussions and parts explosion.

Dynamic table creation.

Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.

Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.

Slicing off part of a service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.

Caching. A high performance caching tier for web sites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.

Voting.

Real-time page view counters.

User registration, profile, and session data.

Document, catalog management and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.

Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.

Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.

Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.

A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
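A rough sketch of that partitioning idea, using plain Python dicts as stand-ins for partitions. The partition count, the CRC32-based routing, and all function names are assumptions made for illustration, not how any particular product does it. Note that in this sketch, changing the owner means the row moves to the buyer's partition.

```python
import zlib

NUM_PARTITIONS = 4  # illustrative value
partitions = [dict() for _ in range(NUM_PARTITIONS)]  # building_id -> row

def partition_for(owner):
    """Route by a stable hash of the owner, so one owner maps to one partition."""
    return zlib.crc32(owner.encode("utf-8")) % NUM_PARTITIONS

def add_building(building_id, owner, price):
    partitions[partition_for(owner)][building_id] = {"owner": owner, "price": price}

def buildings_of(owner):
    """Single-partition read: only the owner's partition is scanned."""
    rows = partitions[partition_for(owner)]
    return sorted(b for b, row in rows.items() if row["owner"] == owner)

def buy(building_id, seller, buyer, price):
    """Update owner and price; the row is re-homed to the buyer's partition."""
    row = partitions[partition_for(seller)].pop(building_id)
    row.update(owner=buyer, price=price)
    partitions[partition_for(buyer)][building_id] = row

add_building("b1", "ann", 100)
add_building("b2", "ann", 150)
add_building("b3", "bo", 120)
buy("b2", seller="ann", buyer="bo", price=200)
print(buildings_of("ann"), buildings_of("bo"))
```

The read path stays cheap because it touches one partition. The purchase is the expensive operation, since it may touch two.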

JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3.

Helping diagnose the typology of tumors by integrating the history of every patient.

In-memory database for high update situations, like a web site that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 sec, then you will pretty much be at your limit with about 5000 simultaneous users.

Database for university course availability information. If the set contains the course ID it has an open seat. Data is scraped and processed continuously and there are ~7200 courses.
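The "if the set contains the course ID it has an open seat" rule maps directly onto a set type. Here is a minimal in-process sketch using a Python set; in a key-value store with sets, such as Redis, the equivalent operations would be SADD, SREM, and SISMEMBER. The course IDs and the `update_course` hook called by the scraper are hypothetical.

```python
open_courses = set()

def update_course(course_id, seats_available):
    """Called by the (hypothetical) scraper each time it re-reads a course."""
    if seats_available > 0:
        open_courses.add(course_id)
    else:
        open_courses.discard(course_id)

def has_open_seat(course_id):
    """Membership test is the whole query."""
    return course_id in open_courses

update_course("CS101", 3)
update_course("MATH200", 0)
update_course("CS101", 0)     # the last seat was taken on a later scrape
update_course("PHYS150", 12)
print(sorted(open_courses))
```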

Server-backed sessions (a random cookie value which is then associated with a larger chunk of serialized data on the server) are a very poor fit for relational databases. They are often created for every visitor, even those who stumble in from Google and then leave, never to return again. They then hang around for weeks taking up valuable database space. They are never queried by anything other than their primary key.

Fast, atomically incremented counters are a great fit for offering real-time statistics.

Polling the database every few seconds. Cheap in a key-value store. If you're sharding your data you'll need a central lookup service for quickly determining which shard is being used for a specific user's data. A replicated Redis cluster is a great solution here - GitHub use exactly that to manage sharding their many repositories between different backend file servers.

Transient data. Any transient data used by your application is also a good fit for Redis. CSRF tokens (to prove a POST submission came from a form you served up, and not a form on a malicious third party site) need to be stored for a short while, as does handshake data for various security protocols.

Redis is incredibly easy to set up and ridiculously fast (30,000 reads or writes a second on a laptop with the default configuration).

Share state between processes. Run a long running batch job in one Python interpreter (say loading a few million lines of CSV into a Redis key/value lookup table) and run another interpreter to play with the data that’s already been collected, even as the first process is streaming data in. You can quit and restart your interpreters without losing any data.

Redis semantics map closely to Python native data types, so you don’t have to think for more than a few seconds about how to represent data.

A simple capped log implementation (similar to a MongoDB capped collection): push items on to the tail of a ’log’ key and use ltrim to only retain the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever increasing amounts of logging information.
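The capped log behaviour can be sketched without a server by using Python's `collections.deque` with `maxlen` as a stand-in. Against a real Redis instance the same effect would come from an RPUSH onto the log key followed by an LTRIM that keeps the last X entries. The `CappedLog` class and its method names are made up for this example.

```python
from collections import deque

class CappedLog:
    """In-process stand-in for a capped log: keeps only the newest max_items entries."""

    def __init__(self, max_items):
        # A deque with maxlen silently drops the oldest entry when full.
        self._items = deque(maxlen=max_items)

    def push(self, entry):
        self._items.append(entry)

    def recent(self):
        """Entries from oldest retained to newest."""
        return list(self._items)

log = CappedLog(max_items=3)
for i in range(1, 6):
    log.push(f"event {i}")
print(log.recent())
```

Five events go in but only the last three are retained, so memory use stays bounded however long the system runs.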

An interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath.

It’s common to use MySQL as the backend for storing and retrieving what are essentially key/value pairs. I’ve seen this over-and-over when someone needs to maintain a bit of state, session data, counters, small lists, and so on. When MySQL isn’t able to keep up with the volume, we often turn to memcached as a write-thru cache. But there’s a bit of a mis-match at work here.

With sets, we can also keep track of ALL of the IDs that have been used for records in the system.

Quickly pick a random item from a set.

API limiting. This is a great fit for Redis as a rate limiting check needs to be made for every single API hit, which involves both reading and writing short-lived data.
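As a rough illustration of that read-and-write-per-hit pattern, here is a fixed-window rate limiter kept in process memory. It is only a sketch: the class name, limits, and injected clock are assumptions, and a shared deployment would keep the counters in the store instead (in Redis, typically an INCR on a per-client, per-window key plus an EXPIRE so old windows disappear).

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` hits per client in each window of `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock            # injectable so the example is deterministic
        self._counts = {}             # (client_id, window_index) -> hits

    def allow(self, client_id):
        window = int(self.clock() // self.window_seconds)
        # Drop counters from earlier windows (stand-in for key expiry).
        for old_key in [k for k in self._counts if k[1] < window]:
            del self._counts[old_key]
        key = (client_id, window)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self.limit

now = [0.0]  # fake clock value, advanced by hand below
limiter = FixedWindowLimiter(limit=2, window_seconds=60, clock=lambda: now[0])
results = [limiter.allow("alice") for _ in range(3)]
now[0] = 61.0  # move into the next window
after_reset = limiter.allow("alice")
print(results, after_reset)
```

Every call both reads and writes a short-lived counter, which is exactly the access pattern described above.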

A/B testing is another perfect task for Redis - it involves tracking user behaviour in real-time, making writes for every navigation action a user takes, storing short-lived persistent state and picking random items.

Implementing the inbox method with Redis is simple: each user gets a queue (a capped queue if you're worried about memory running out) to work as their inbox and a set to keep track of the other users who are following them. Ashton Kutcher has over 5,000,000 followers on Twitter - at 100,000 writes a second it would take less than a minute to fan a message out to all of those inboxes.
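The inbox method and the fan-out estimate can both be sketched in a few lines. This version uses in-process dicts, sets, and capped deques as stand-ins for the per-user queue and follower set; the cap of 100 and all names are illustrative assumptions.

```python
from collections import defaultdict, deque

INBOX_CAP = 100  # illustrative cap so inboxes cannot grow without bound

followers = defaultdict(set)                            # user -> set of followers
inboxes = defaultdict(lambda: deque(maxlen=INBOX_CAP))  # user -> capped inbox

def follow(follower, followee):
    followers[followee].add(follower)

def post(author, message):
    """Fan out on write: one inbox append per follower. Returns the write count."""
    for user in followers[author]:
        inboxes[user].append((author, message))
    return len(followers[author])

def fanout_seconds(follower_count, writes_per_second):
    """Time to deliver one message to every follower at a given write rate."""
    return follower_count / writes_per_second

follow("bob", "alice")
follow("carol", "alice")
writes = post("alice", "hello")
print(writes, list(inboxes["bob"]), fanout_seconds(5_000_000, 100_000))
```

At 100,000 writes a second, 5,000,000 followers works out to 50 seconds, which matches the "less than a minute" estimate.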

Have workers periodically report their load average in to a sorted set.

Redistribute load. When you want to issue a job, grab the three least loaded workers from the sorted set and pick one of them at random (to avoid the thundering herd problem).
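The two steps above (workers report load, the dispatcher picks at random among the three least loaded) can be sketched as follows. A plain dict plus `heapq.nsmallest` stands in for the sorted set; with Redis the report would be a ZADD and the lookup a ZRANGE over the first three members. Worker names and load values are made up.

```python
import heapq
import random

worker_load = {}  # worker name -> most recently reported load average

def report_load(worker, load_average):
    """Workers call this periodically with their current load average."""
    worker_load[worker] = load_average

def pick_worker(candidates=3, rng=random):
    """Choose at random among the least loaded workers.

    Picking randomly from a small group, rather than always taking the
    minimum, keeps every dispatcher from piling onto the same worker.
    """
    least_loaded = heapq.nsmallest(candidates, worker_load, key=worker_load.get)
    return rng.choice(least_loaded)

for name, load in [("w1", 0.9), ("w2", 0.1), ("w3", 2.5), ("w4", 0.4), ("w5", 1.7)]:
    report_load(name, load)

chosen = pick_worker()
print(chosen)
```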

Multiple GIS indexes.

Recommendation engine based on relationships.

Web-of-things data flows.

Social graph representation.

Dynamic schemas so schemas don't have to be designed up-front. Building the data model in code, on the fly by adding properties and relationships, dramatically simplifies code.

Reducing the impedance mismatch because the data model in the database can more closely match the data model in the application.

VoltDB Use Cases

VoltDB as a relational database is not traditionally thought of as in the NoSQL camp, but I feel based on their radical design perspective they are so far away from Oracle type systems that they are much more in the NoSQL tradition.

Analytics Use Cases

Kevin Weil at Twitter is great at providing Hadoop use cases. At Twitter this includes counting big data with standard counts, min, max, std dev; correlating big data with probabilities, covariance, influence; and research on Big data. Hadoop is on the fringe of NoSQL, but it's very useful to see what kind of problems are being solved with it.

How many requests do we serve each day?

What is the average latency? 95% latency?

Grouped by response code: what is the hourly distribution?

How many searches happen each day at Twitter?

Where do they come from?

How many unique queries?

How many unique users?

Geographic distribution?

How does usage differ for mobile users?

How does usage differ for 3rd party desktop client users?

Cohort analysis: all users who signed up on the same day—then see how they differ over time.

Site problems: what goes wrong at the same time?

Which features get users hooked?

Which features do successful users use often?

Search corrections and suggestions (not done now at Twitter, but coming in the future).

What can we tell about a user from their tweets?

What can we tell about you from the tweets of those you follow?

What can we tell about you from the tweets of your followers?

What can we tell about you from the ratio of your followers/following?

What graph structures lead to successful networks? (Twitter’s graph structure is interesting since it’s not two-way)

What features get a tweet retweeted?

When a tweet is retweeted, how deep is the corresponding retweet tree?

Long-term duplicate detection (short term for abuse and stopping spammers)

Machine learning. About not quite knowing the right questions to ask at first. How do we cluster users?

Language detection (contact mobile providers to get SMS deals for users—focusing on the most popular countries at first).

How can we detect bots and other non-human tweeters?
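The first few questions in this list (average latency, 95% latency, hourly distribution by response code) are ordinary aggregations. At Twitter's volume they run as Hadoop jobs over huge logs; the sketch below only shows the shape of the computation on a handful of made-up request records, in a single process, using the nearest-rank definition of a percentile (one of several in common use).

```python
import math
from collections import Counter

# Made-up sample records: (hour_of_day, response_code, latency_ms)
requests = [
    (9, 200, 10), (9, 200, 20), (9, 500, 30),
    (10, 200, 40), (10, 200, 50), (10, 404, 60), (10, 200, 70),
    (11, 200, 80), (11, 500, 90), (11, 200, 100),
]

def average(values):
    return sum(values) / len(values)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

latencies = [latency for _, _, latency in requests]
avg_latency = average(latencies)
p95_latency = percentile(latencies, 95)

# Grouped by response code: how many requests in each hour?
hourly_by_code = Counter((code, hour) for hour, code, _ in requests)

print(avg_latency, p95_latency, hourly_by_code[(200, 10)])
```

The grouping step is the part that parallelizes naturally: each (code, hour) key can be counted independently and the partial counts summed, which is the MapReduce pattern in miniature.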

Poor Use Cases

OLTP. Outside VoltDB, complex multi-object transactions are generally not supported. Programmers are supposed to denormalize, use documents, or use other complex strategies like compensating transactions.

Data integrity. Most of the NoSQL systems rely on applications to enforce data integrity where SQL uses a declarative approach. Relational databases are still the winner for data integrity.

Data independence. Data outlasts applications. In NoSQL applications drive everything about the data. One argument for the relational model is as a repository of facts that can last for the entire lifetime of the enterprise, far past the expected life-time of any individual application.

SQL. If you require SQL then very few NoSQL systems will provide a SQL interface, but more systems are starting to provide SQLish interfaces.

Ad-hoc queries. If you need to answer real-time questions about your data that you can’t predict in advance, relational databases are generally still the winner.

Complex relationships. Some NoSQL systems support relationships, but a relational database is still the winner at relating.

Maturity and stability. Relational databases still have the edge here. People are familiar with how they work, what they can do, and have confidence in their reliability. There are also more programmers and toolsets available for relational databases. So when in doubt, this is the road that will be traveled.

Reader Comments (36)

Fantastic post -- I've been looking for this kind of thing for some time. As a one-man dev shop trying to build a proto-type of a product, I'm at a stage where I'm trying to figure out how to best separate my data and which systems are best for solving what problems.

Any additional links to methodologies for finding the right data source for a development problem / data are greatly appreciated.

there is a huge number of writes with possible surges. In the traditional mysql relational approach, feeds to display to a user had to be generated at the time of read. this involved joins and sorts across multiple tables.

moving all the processor heavy logic to the time of write as opposed to the time of read solved the performance issue from the user's point of view.

the systems are sharded out to scale writes.

the two tangible mongo benefits are: 1. it supports a very high sustained rate of writes (dstat shows disk write as high as 70M); 2. it is schemaless -- json/bson based, which allows us to support arrays and other structures as we please

From this: http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html

"For now, we're not working on using Cassandra as a store for Tweets. This is a change in strategy. Instead we're going to continue to maintain our existing Mysql-based storage. We believe that this isn't the time to make large scale migration to a new technology. We will focus our Cassandra work on new projects that we wouldn't be able to ship without a large-scale data store."

Yet you say:

"Massive write performance: This is probably the canonical usage based on Google's influence. Twitter, for example, has the problem of storing 7 TB/data per day with the prospect of this requirement doubling multiple times per year. This is the data is too big to fit on one node problem. At 80 MB/s it takes a day to store 7TB so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes in-memory systems can be used."

Two points BJ: 1. The second reference is for Twitter analytics, not the tweets. Projects will use different products for different purposes. 2. People who are using MySQL at massive scale are generally using it in a NoSQL manner. They are sharding and storing key-value pairs; they aren't generally using MySQL as a typical OLTP database.

Brian, I separated Redis out because when I tried to fold Redis into the general list it looked silly because there were so many and they are so Redis specific. So I got the idea of separating out use cases. I have a well defined list of VoltDB use cases that are also very specific so that was an easy call. I thought about having a Neo4j section, Cassandra section, Membase section, etc, especially Neo4j because the graph use cases are unique, but most everything else was pretty general so I just kept them in the more general list. Plus it would take me about another 2 weeks to finish :-) If you would help create more specialized lists that would be very helpful.

I would like to add that Ad Hoc queries are actually our database's bread-and-butter. We have distributed indexes and fulltext search on top of HBase, so people can write queries like "DocumentID, Top 10 Authors Group By(Date)".

Maybe you could also talk about the pros and cons of embedded databases? I thought about using MongoDB but I use Sqlite instead because it is embedded. I've developed a bloom filter to make the system faster.

I would like to use an embedded key-value-store that is open source and runs on windows but afaik such a thing doesn't exist.

The OLTP apps mentioned under VoltDB use cases: does anyone really do these? VoltDB's definition of durability is being durable when <= 'k' nodes are down, not the traditional definition of having data on disk. If the full system (> k nodes) crashes, it is no longer durable - bye bye to all your current data and restore from at least a few hours' old backup. I doubt serious OLTP having financial/legal obligations (like booking an airline ticket) will be done in VoltDB.

"If the full system (> k nodes) crashes, it is no longer durable - bye bye to all your current data and restore from at least a few hours' old backup"

Your definition of k-safety is right, but VoltDB is able to snapshot data continuously to disk, so your statement is plainly wrong, e.g., no way the backup will be hours' old. Furthermore, in the scenario you described (if the full system (> k nodes) crashes) it would also be a disaster to a NoSQL system as well as to a disk-based RDBMS too, so no VoltDB's fault here. :)

Furthermore, the "nodes crash" assumption is a very vague notion as there are a lot of possible problems that could happen, with various degrees of probability. Finally, companies with financial/legal obligations will invest a LOT of money in expensive and redundant hardware so that they can reduce this 'crash all' scenario as much as possible (and yes, it's impossible to get 100% security/safety).

Again, you are misinformed, because companies on Wall Street, for example, have been using KDB+ (Kx Systems), an IN-MEMORY RDBMS, for at least 20 years! It's heavily used for trading stocks and data mining algorithms in hedge fund corporations. By the way, as far as I know, KDB+ has some share of influence on VoltDB's design. ;-)

Thanks for the excellent write-up. Programmers' ease of use is important. For fast development, you sometimes want to deploy a NoSQL solution quickly, and without any admin overhead (even the daemon ;-) -- we use the Python y_serial module which takes less than 10 minutes to understand and implement. The key-value access is fast, and even offers regex on the key.

The main idea is to let the "value" be any arbitrary Python object. Thus if the objects happen to be dictionaries, then one has a schema-less database. Or the objects can be the result of intermediate but intense scientific computations -- which are accessed later to be finalized. Etc.

BWSB, you obviously have not tried a serious hands-on with VoltDB and are just parroting a lot of general good things jumbled up that no one can argue about.

>>> it would also be a disaster to a NoSQL system as well as to a disk-based RDBMS too <<<

A disk-based RDBMS has data ON DISK even if the system crashes. So maybe it will take time for the system to come back, but the data is not lost in a black hole. I am not saying it will not be a disaster in either case, but with disk-based synchronous commit, the data is not lost. That is the key point for OLTP. If you make a transaction and the system crashes, it is one thing to take hours to restore but be assured your transaction will be recoverable, and another thing to restore without knowing whether you made the transaction or not.

>>> no way the backup will be hours' old. <<<

Try doing a hands-on and see how frequently you can have a backup on VoltDB without compromising on its performance. Even if you come down to having a backup which is only 5 minutes old, for an OLTP system not knowing what you committed in the last 5 minutes (hell, why 5 minutes, even in the last 300ms) is a deal-breaker. If you make an ACH transfer of 40K out of your bank and the FULL system goes down in the next 300ms and you have to restore anything older than 300ms, you have just created money out of nowhere. You have 40K in both accounts!

>>> an IN-MEMORY RDBMS, for at least 20 years! <<<

Eventually anything can be made to work. What I am saying is out of box functionality.