Data Distribution

As a reminder, the goal of data distribution is to cause workloads to become distributed across multiple servers. This allows an application to experience higher TPS throughput, allow more concurrent users, and to allow for much larger overall database size, all while avoiding database bottlenecks.

For the best performance, any data that is logically related should be stored together. In this way, queries can be satisfied in a single fetch. But, that ideal situation isn’t always available. What happens when a query needs data from multiple shards (or nodes)? Let’s take a look at a few example application scenarios and see how the way you distribute data can impact performance.

The model on the left is the MongoDB data model (represented via BSON) and the model on the right is a relational model. While the exact way that the data is stored is different, both models accomplish the same thing.

So, with the groundwork of our example understood, let’s examine how MongoDB and ScaleBase execute the four blogging application scenarios we outlined earlier.

Scenario Execution with MongoDB

You can see that for the two “write” scenarios, in MongoDB, with that model, adding an article, a high frequency activity, and adding a tag, are both easily accomplished in a single call to a single database. For example, we already know the user so it is simply a matter of inserting the relevant documents.

However, both of the “read” scenarios require that a “join” be made in the application. For example, if we want to query all of an author’s articles, along with that author’s details, we would need to first request the Users collection (table) and then query the Article collection (table). The application would then have to join those two sets of data and return it to the client.

Depending on how the Article collection is sharded, MongoDB may need to query multiple shards and then aggregate that data before returning it to the blogging application, and for that data to then also be joined with the User collection.

Figure 3: Read / Write Scenario Execution in MongoDB

Scenario Execution with ScaleBase with MySQL

In ScaleBase with MySQL, the data may be distributed across articles, but the distribution process will co-locate logically related user data with an author’s articles. In cases where data is not co-located, ScaleBase will join the data at the database level. This means the application does not need to code those joins, and three of the four scenarios are satisfied in a single call.

Figure 4: Read / Write Scenario Execution in ScaleBase and MySQL

As you can see, both MongoDB and ScaleBase data models obviously can support our example Blogging application. In MongoDB, there can be read scenarios where the application needs to accomplish a join. In ScaleBase, for the one scenario that requires it, the join is accomplished by ScaleBase, and not within the application.

Data Redistribution in MongoDB, and ScaleBase with MySQL

There can come a time when you need to rethink how data is distributed across servers. For example, application usage patterns may evolve. And, as the number of concurrent users goes up and/or transaction and data volumes increase, you may find that certain data nodes (shards) become “hotspots” receiving more read/write workloads than other parts of the database array. When this happens, you will want to adjust how data is distributed in the future, and possibly also redistribute existing data.

Data Chunks

In both MongoDB and ScaleBase, data is distributed in “chunks”. In MongoDB, a data chuck is a set of documents and the default size for a data chunk 64 megabytes. In ScaleBase, a data chuck is a logical data slice from an entire relational hierarchy.

Splitting Shards and Rebalancing Shards

When you want to alleviate the workloads from a database hotspot, there are two typical data redistribution use-cases: splitting shards and rebalancing shards

They are easily explained in a few pictures.

Figure 5: Data Redistribution – Splitting and Rebalancing Shards

Simply stated, in splitting shards you are adding resources to your database array, and in re-balancing shards you are redistributing data across already existing server resources.

Both MongoDB and ScaleBase can automatically redistribute data across shards, online and without downtime, by adding and removing nodes as needed, to handle volume changes and hotspots.

A new server might need to be added to add capacity, but it may also be possible to move hot data from an overused server node to an underused server node. Remember that in MongoDB, a chunk is the smallest logical set of data. When migrating between over- and under-used nodes, MongoDB moves entire chunks from one server to the other. Whereas, in ScaleBase, a data chuck is a logical data slice from an entire relational hierarchy. ScaleBase will similarly move entire chunks of data in a way that maintains logical data relations.

The best part, for both MongoDB users and ScaleBase users, is that the logic for data redistribution and rebalancing happens behind the scenes, automatically. Hotspots are detected and data is moved using logic at the data layer. Once again, your application does not need to contain code for that functionality.

Summary

In this blog entry, and the previous entry that compared MongoDB with ScaleBase autosharding, we’ve covered a lot of material. The figure below summarizes the discussion so far for MongoDB and ScaleBase sharding and data distribution capabilities.

As you can see MongoDB and ScaleBase both equally support auto-sharding, data distribution at the database level, scalable workload distribution, and application transparency. Where they differ is at the references and relational level. In MongoDB, each document collection is sharded independently and joins are not supported. In ScaleBase, relations are retained and form the basis of a good data distribution policy and joins are supported locally on each node, and across the entire database array.