Dominated by the RDBMS paradigm, databases were a pretty sleepy area of technology for many years. NoSQL changed all that. Although the technology behind most NoSQL databases isn't new, a wild variety of approaches and products has left many customers struggling to catch up (see Andrew Oliver's classic "Which freaking database should I choose" for a lively primer).

One of the key benefits touted by NoSQL is the ability to scale easily compared to RDBMS databases. But how exactly does that work? For the answer, we turn this week to Rahim Yaseen, senior vice president of engineering at Couchbase, vendor of a popular, document-oriented NoSQL database. He also touches on the complexities of querying using a document database. -- Paul Venezia

Scaling out and querying large datasets with NoSQL

Today, we're at an inflection point where organizations are looking for ways to manage, store, and capitalize on a ballooning influx of data. With ultralarge data sets, organizations must determine the solution that best fits their needs to scale their technology and their business.

One key database decision is whether to scale out or to scale up -- but what does that mean, anyway?

In a scale-out architecture, associated with NoSQL databases, a distributed set of nodes known as a cluster is used as the basic architecture. It provides highly elastic scaling capability, enabling you to add nodes to handle load on the fly. This is the opposite of the scale-up architecture associated with the RDBMS, which adds more resources to a single, larger machine.

A key concept in scaling out is "shared nothing." An ideal scale-out architecture is based on a shared-nothing architecture, where all nodes are peers and there is no single shared resource that serves as a bottleneck. In addition to all nodes being independent, all the data must be evenly distributed or partitioned across these nodes through a process called sharding. This is an important process and can be accomplished either manually or through an automated system.

Manual vs. autosharding

To understand the difference between manual and autosharding, consider the registration process at a typical conference. When you walk into the registration area, you may be asked to go to the registration booth that corresponds to the first initial of your last name to check in. For instance, A through D might check in at booth No. 1, E through H at booth No. 2, and so on. This is an example of scaling via manual sharding.

With manual sharding, the registrations are distributed across a series of check-in booths. This works because there is a well-defined, pre-determined scheme. There are no guarantees, however, that the data or registrants will be distributed evenly or that the booths can be easily expanded without reshuffling all the registrations. Furthermore, the shutdown of a single booth (equivalent to a node failure) requires reshuffling across the other booths.
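The booth analogy can be sketched in a few lines of code. This is a minimal, hypothetical range-based routing table, not any particular product's scheme; the booth boundaries are made up to match the example above. Note that nothing guarantees the ranges receive equal traffic -- "S" names may swamp one booth while another sits idle.

```python
# Manual (range-based) sharding sketch: shards ("booths") are assigned
# fixed last-name ranges ahead of time. Boundaries are hypothetical.
BOOTH_RANGES = [
    ("A", "D", "booth-1"),
    ("E", "H", "booth-2"),
    ("I", "P", "booth-3"),
    ("Q", "Z", "booth-4"),
]

def route_by_last_name(last_name: str) -> str:
    """Return the booth (shard) responsible for this last name."""
    initial = last_name[0].upper()
    for low, high, booth in BOOTH_RANGES:
        if low <= initial <= high:
            return booth
    raise ValueError(f"No booth covers initial {initial!r}")

print(route_by_last_name("Smith"))   # booth-4
print(route_by_last_name("Garcia"))  # booth-2
```

Expanding from four booths to five means rewriting BOOTH_RANGES and physically moving registrations between booths -- exactly the reshuffling problem described above.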

Alternatively, suppose you walk into the same conference and are given the option to check in at any open registration booth. This scenario is similar to how distributing data through autosharding works, where data is automatically hashed in the cluster for even distribution. Data can be accessed from any node by an algorithm that routes the request to the particular node.
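A hash-based routing function is one common way autosharding is implemented. The sketch below is a generic illustration (node names and the modulo scheme are assumptions, not Couchbase's actual mechanism): any client can compute a key's owner locally, and the hash spreads keys roughly evenly with no predetermined ranges.

```python
import hashlib

# Autosharding sketch: keys are hashed and mapped to nodes, so
# placement is automatic and roughly even. Node names are made up.
NODES = ["node-0", "node-1", "node-2", "node-3"]

def node_for_key(key: str) -> str:
    """Hash the key and route it to one of the nodes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Any client can compute a key's owner without a central directory.
for name in ["Smith", "Garcia", "Yamamoto"]:
    print(name, "->", node_for_key(name))
```

The same hash that places the data also routes every later read, which is the "algorithm that routes the request to the particular node" mentioned above.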

Note, however, that node failures are quite common. Large scale-out architectures must operate on the assumption that several nodes being in a failed state at any given time is normal operating procedure.

If nodes in a scale-out architecture fail, a well-designed architecture can enable processing to continue across the rest of the cluster. Here, autosharding has a clear advantage; with manual sharding, replicating and moving data around during node failures -- that is, rebalancing data -- can be a significant challenge. As data sets become very large, rebalancing a manually sharded deployment carries a very high overhead, and an autosharding scheme becomes a must.
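One reason rebalancing can be cheap under autosharding is consistent hashing, sketched below. This is a generic illustration of the technique, not Couchbase's implementation (Couchbase maps data to fixed vBuckets rather than a hash ring); the point is that when a node fails, only the keys it owned move, while keys on surviving nodes stay put.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Hash a string to a large integer (ring position)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self.points = sorted(
            (_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.hashes = [p[0] for p in self.points]

    def owner(self, key: str) -> str:
        """The node whose ring point follows the key's hash."""
        idx = bisect.bisect(self.hashes, _h(key)) % len(self.points)
        return self.points[idx][1]

before = Ring(["node-0", "node-1", "node-2", "node-3"])
after = Ring(["node-0", "node-1", "node-2"])  # node-3 has failed

keys = [f"doc-{i}" for i in range(1000)]
moved = sum(before.owner(k) != after.owner(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved")  # roughly a quarter, not all
```

With naive modulo hashing, shrinking from four nodes to three would remap almost every key; with a ring, only the failed node's share moves, which keeps rebalancing overhead proportional to the lost capacity.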

Querying large datasets

The contrast between manual and autosharding also emerges in querying large data sets. Think of the registration analogy again. With manual sharding, if you want to find all individuals at the conference with a last name beginning with "S," you would only have to go to a single check-in booth to determine those names.

With autosharding, if you wanted to find the same information, you would have to go to each booth to determine who checked in with a last name beginning with "S" to pull the same data. Typically, map-reduce techniques are used to accomplish this.
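The "visit every booth" query can be sketched as a scatter-gather in the map-reduce style: each shard runs the filter (map) locally over its own documents, and a coordinator merges (reduces) the partial results. The shard contents below are made-up sample data.

```python
# Scatter-gather sketch of the last-names-beginning-with-"S" query on
# an autosharded cluster. Data is illustrative sample registrations.
shards = [
    [{"name": "Garcia"}, {"name": "Smith"}],
    [{"name": "Sato"}, {"name": "Lee"}],
    [{"name": "Singh"}, {"name": "Okafor"}],
]

def map_shard(docs, initial):
    """Map step: each shard filters its own documents independently."""
    return [d["name"] for d in docs if d["name"].startswith(initial)]

def scatter_gather(initial):
    """Coordinator: fan the query out to every shard, merge the results."""
    partials = [map_shard(docs, initial) for docs in shards]
    return sorted(name for part in partials for name in part)

print(scatter_gather("S"))  # ['Sato', 'Singh', 'Smith']
```

Note that every shard does work even when it holds no matches -- the cost the article flags as the price of hash-distributed data.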

Alternatively, the query might not be all last names beginning with "S," but rather all members of a certain group ("get all registrants from a specific organization"). This is where secondary indexing and querying on a distributed data set becomes a key challenge.
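One common design is a local secondary index per shard, sketched below under assumed data (registrant IDs, an "org" field): each shard can answer the organization query quickly for its own documents, but because matching registrants may live on any shard, the query must still fan out to all of them -- which is exactly why distributed secondary indexing is called out as a key challenge.

```python
from collections import defaultdict

# Local secondary indexes on a sharded cluster. Each shard indexes its
# own documents by organization; a query on that field still fans out.
shards = [
    {"r1": {"name": "Smith", "org": "Acme"},
     "r2": {"name": "Garcia", "org": "Globex"}},
    {"r3": {"name": "Sato", "org": "Acme"}},
]

# Build one index per shard: org -> registrant IDs held on that shard.
local_indexes = []
for shard in shards:
    index = defaultdict(list)
    for doc_id, doc in shard.items():
        index[doc["org"]].append(doc_id)
    local_indexes.append(index)

def registrants_from(org):
    """Fan out to every shard's local index and merge the hits."""
    return sorted(
        doc_id for index in local_indexes for doc_id in index.get(org, [])
    )

print(registrants_from("Acme"))  # ['r1', 'r3']
```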

Key direct access patterns in a document database involve accessing data through a single key and navigating to another document through a related key. For example, you might pull up a certain order with an order ID, then navigate to the customer or account data by using an account ID. Unlike in a relational database, where a single logical object may be normalized across many tables, a document database stores the object as one document -- so in a sharded scale-out architecture, that object is unlikely to be scattered across multiple nodes and require an expensive join just to re-create it.
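The order-to-account navigation above amounts to two direct key lookups, sketched here against a toy in-memory store (keys, field names, and the "type::id" convention are illustrative assumptions): the order document carries the account's key, so the second hop is another cheap single-key get rather than a cross-node join.

```python
# Key-based navigation sketch: follow a stored key from one document
# to a related one. Documents and key format are illustrative.
db = {
    "order::1001": {"item": "badge printer", "account_id": "account::42"},
    "account::42": {"name": "Acme Corp", "tier": "gold"},
}

def get(key):
    """Direct key access -- routed straight to the owning node."""
    return db[key]

order = get("order::1001")          # first hop: look up the order by ID
account = get(order["account_id"])  # second hop: follow the stored key
print(account["name"])  # Acme Corp
```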

However, there is a key need to access data through secondary indexes and to query data that is likely distributed across the cluster. Using a map-reduce technique is overkill since every node is required to participate in the query execution.

Challenges ahead

As an industry, we need to solve the access patterns that require retrieving related objects from sharded scale-out architectures, as well as support querying data through secondary indexes. While map-reduce techniques have been useful in building the first generation of solutions, interesting challenges arise in building the next generation of these innovations.

A large body of work has been accomplished in distributed algorithms that is relevant in moving toward distributed query processing and query optimizers for large scale-out architectures. The future of managing large data sets is likely to see significant innovations in indexing schemes and query optimization.

Despite such challenges, scale-out architectures are gaining more traction every day. Scaling up RDBMS-style becomes less and less practical as the volume of data increases. And when a scale-up architecture underlying a very large data set fails, it'll likely be one very large central point of failure.

The Internet is a classic example of ultrascale, shared-nothing clusters -- that is, very large distributed systems working in tandem. In many ways, database systems must learn and adapt from this example and build out ultrascale, distributed database systems on similar principles.

New Tech Forum provides a means to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all enquiries to newtechforum@infoworld.com.