Sitecore MongoDB Shard Key Suggestions

By: Antonios Giannopoulos, Grant Killian

Posted on: September 12, 2017

In an earlier post in this series, which we're calling ObjectRocket MongoDB Deep Dive for Sitecore, we explored the "replica set" vs "sharded cluster" topic. We shared that for our large Sitecore projects at ObjectRocket, we've found a good MongoDB shard key for Sitecore is _id:1 or _id:hashed. In this write-up, we'll dive further into the topic of shard keys for MongoDB used with Sitecore. You may want to review the first post in this series for more background.

Shard Key Basics

A MongoDB sharded cluster needs a shard key, which MongoDB uses to distribute data efficiently across shards and to retrieve it quickly. The shard key is derived from fields stored in the MongoDB documents, and picking a good one is important for MongoDB performance. Changing the shard key later in a project can require downtime and significant hassle, so our hope is that this guidance will help you select the best shard key for your Sitecore projects from the very start.

The following attributes describe what makes a good shard key:

High-cardinality fields

Fields without null values

Immutable fields

Fields not monotonically increasing

Even read/write distribution

Even data distribution

Read targeting/locality

When trying to determine a good shard key for our Sitecore projects, we collect a sample of the workload MongoDB is performing, identify statement patterns, and check whether any of the shard key constraints are violated.

Analyzing Sitecore workloads for MongoDB

“Sitecore calculates diskspace sizing projections using 5KB per interaction and 2.5KB per identified contact and these two items make up 80% of the diskspace”

Note: The above statement from Sitecore isn’t entirely consistent with what we’ve observed on large implementations; in some cases it overstates the size of contact records by an order of magnitude. However, the volume of data collected into MongoDB is hard to predict because each Sitecore implementation is very different in this respect.

Let’s take the Sitecore documentation as accurate and focus on the two areas, interaction and contact, as the majority of our MongoDB disk space.

Taking the interaction collection first, we observed the activity through MongoDB and saw that it receives a steady stream of inserts, queries, and updates with a read/write ratio of 60/40 (60% reads, 40% writes). In our analysis, the updates rely entirely on the Sitecore _id field. The queries use a scattering of fields along the following lines:

_id, ContactId queries for 80% of the activity

ContactId, ContactVisitIndex queries for 15% of the activity

ContactId, _t queries for 5% of the activity

Based on what we’ve found, a recommended shard key for the interaction collection would be based on _id. This will scale all writes (inserts & updates) and the vast majority of query statements, leaving only around 20% as scatter-gather queries (i.e. queries not using the shard key).

Basing a shard key on ContactId could also be decent, but the update statements use _id, and it’s recommended (if not mandatory) for updates to sharded collections to make use of the shard key. Thus, we settle on _id instead of ContactId. Additionally, sharding on _id may reduce the occurrences of the exception described in KB939840. ContactId may also lead to hotspots and create jumbo chunks, since every interaction for a given contact shares the same ContactId value.
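To make the comparison concrete, here is a minimal Python sketch that tallies how much of the interaction query activity would be targeted (i.e. includes the shard key) for each candidate. The percentages are the ones from our analysis above; the function name targeted_share is our own, and the sketch assumes a query is targeted whenever it filters on the shard key field.

```python
# Query patterns observed on the interaction collection: (fields used, % of activity).
QUERY_PATTERNS = [
    ({"_id", "ContactId"}, 80),
    ({"ContactId", "ContactVisitIndex"}, 15),
    ({"ContactId", "_t"}, 5),
]


def targeted_share(shard_key_field, patterns=QUERY_PATTERNS):
    """Percent of query activity that filters on the candidate shard key field."""
    return sum(share for fields, share in patterns if shard_key_field in fields)


for candidate in ("_id", "ContactId"):
    targeted = targeted_share(candidate)
    print(f"{candidate}: {targeted}% targeted, {100 - targeted}% scatter-gather")
# _id: 80% targeted, 20% scatter-gather
# ContactId: 100% targeted, 0% scatter-gather
```

On queries alone ContactId looks attractive, which is why we call it decent. The tally does not capture the update statements, which rely on _id, or the hotspot risk, and those are what tip the decision toward _id.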

When we performed this analysis on the contact collection, the results were much the same. There’s a read/write ratio of 80/20, and the update statements, along with most queries, rely on _id.

Looking more broadly and expanding our attention to the other busy MongoDB collections for Sitecore, like GeoIPs, Devices or KeyBehaviorCache, we find the _id field plays a prominent role in most of the activity.

So, _id it is - easy, right?

The MongoDB _id field stored for Sitecore workloads is a .NET Guid (Globally Unique Identifier) that the Sitecore application uses as a primary key for most records in SQL Server. Guids are pervasive in Sitecore development, so determining that Guids are an important consideration for sharding MongoDB clusters won’t surprise any experienced Sitecore technologists.

There’s more to this for MongoDB, however.

In defining the shard key, we want a random value that lends itself to better distribution of inserts. There are cases where taking a hash of the _id field may yield better performance because the hashing process precomputes and distributes the chunks better. MongoDB provides a consistent hashing mechanism for transforming the _id into an integer and saves the output to an index; this would be noted as _id:hashed in terms of the shard key. Generally speaking, the _id:hashed option comes in handy when the source data is either monotonically increasing or too long, for example, a full URL. In notation easy for MongoDB professionals to digest, then, we would make the following general recommendation for Sitecore shard keys in MongoDB sharded clusters:

{_id: 1} or {_id: "hashed"}
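To illustrate why hashing helps with monotonically increasing keys, here is a small Python sketch that routes 1,000 new, ever-increasing keys to 4 shards, first by raw key range and then by hash range. Everything in it is an assumption for illustration: the shard count, the split points, and the use of MD5 from Python's hashlib as a stand-in hash. It is not MongoDB's internal implementation.

```python
import hashlib
from bisect import bisect_right

NUM_SHARDS = 4
# Split points assumed to have been chosen while existing keys were 0..999.
BOUNDARIES = [250, 500, 750]


def ranged_shard(key):
    """Ranged key: the shard is picked by where the raw key value falls."""
    return bisect_right(BOUNDARIES, key)


def hashed_shard(key):
    """Hashed key stand-in: the shard is picked by where the key's hash falls."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")  # unsigned 64-bit value
    return hash_value * NUM_SHARDS // 2**64  # equal slices of the hash range


def count_per_shard(keys, pick_shard):
    """Count how many of the keys land on each shard."""
    counts = [0] * NUM_SHARDS
    for key in keys:
        counts[pick_shard(key)] += 1
    return counts


new_keys = range(1000, 2000)  # monotonically increasing inserts
ranged = count_per_shard(new_keys, ranged_shard)
hashed = count_per_shard(new_keys, hashed_shard)
print("ranged:", ranged)  # every insert lands on the last shard
print("hashed:", hashed)  # inserts spread across all shards
```

With the ranged key, all new inserts pile onto the shard that owns the highest range, which is the hotspot we want to avoid. With the hashed key, the same inserts spread out roughly evenly.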

Remember, Sitecore workloads can have significant variance, so we consider this our default stance and always measure to confirm for a specific scenario.

Empty collections vs full and some Grecian Formulas

Besides profiling your specific data, deciding between _id:1 or _id:hashed can sometimes be a question of whether there’s already Sitecore data in MongoDB or not. For a new Sitecore project with empty collections in MongoDB, we suggest using _id:hashed and defining MongoDB numInitialChunks to pre-split and distribute the empty storage units (chunks) that MongoDB allocates. This will make for a more efficient MongoDB environment as the data accumulates over time: you’ll avoid chunk operations like splits and moves if you set this up ahead of time.
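As a rough sketch of what this looks like in practice, the Python below builds the shardCollection command document for an empty collection with a hashed _id key and a numInitialChunks value. The helper name build_shard_command and the namespace "analytics.Interactions" are hypothetical; substitute your own database and collection names. The sketch only builds the document; it does not connect to a cluster.

```python
def build_shard_command(namespace, hashed=True, num_initial_chunks=None):
    """Build a shardCollection command document keyed on _id.

    namespace is "<database>.<collection>"; the command name must be the
    first key, which Python dicts preserve via insertion order.
    """
    key = {"_id": "hashed" if hashed else 1}
    command = {"shardCollection": namespace, "key": key}
    if num_initial_chunks is not None:
        if not hashed:
            # Pre-splitting via numInitialChunks goes with a hashed key
            # on an empty collection.
            raise ValueError("numInitialChunks requires a hashed shard key")
        command["numInitialChunks"] = int(num_initial_chunks)
    return command


# Hypothetical namespace and chunk count, for illustration only.
command = build_shard_command("analytics.Interactions", num_initial_chunks=313)
print(command)
```

Assuming the PyMongo driver and a connection to a mongos router, a document like this could be passed to client.admin.command(command) once sharding has been enabled on the database.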

After significant analysis on numerous Sitecore projects, we derived the "Grecian Formula" to determine the value of numInitialChunks. The "Grecian Formula" (named for one of our senior MongoDB DBAs at ObjectRocket who is responsible for a lot of the in-depth mathematics that helped us arrive at this... and who happens to be Greek ☺ ) is as follows:

numInitialChunks = Min(Max(varSize, varCount), varLimit)

The variables in our Grecian Formula are:

varSize = MongoDB collection size in MB divided by 32

varCount = Number of MongoDB documents divided by 125,000

varLimit = Number of shards multiplied by 8,192

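Here's an example Grecian Formula calculation for numInitialChunks given a contrived Sitecore environment of 10,000 MB in size with 1 million documents and 3 shards:

varSize = 10,000 / 32 = 312.5

varCount = 1,000,000 / 125,000 = 8

varLimit = 3 x 8,192 = 24,576

numInitialChunks = Min(Max(312.5, 8), 24,576) = 312.5, or roughly 313 chunks once rounded to a whole number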

You can experiment with the "Grecian Formula" and get a sense for the orders of magnitude involved. The math isn’t hard! The trick can be in making wise estimates of document counts and anticipated collection sizes; pinpoint accuracy isn’t crucial.
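For experimenting, here is a minimal Python sketch of the formula. The function and parameter names are our own, and rounding up to a whole number of chunks is an assumption on our part, since the formula itself does not say how to round.

```python
import math


def grecian_num_initial_chunks(size_mb, doc_count, shard_count):
    """numInitialChunks = Min(Max(varSize, varCount), varLimit)."""
    var_size = size_mb / 32            # collection size in MB divided by 32
    var_count = doc_count / 125_000    # number of documents divided by 125,000
    var_limit = shard_count * 8192     # number of shards multiplied by 8,192
    # Rounding up is an assumption; the formula leaves rounding unspecified.
    return math.ceil(min(max(var_size, var_count), var_limit))


# The contrived environment: 10,000 MB, 1 million documents, 3 shards.
print(grecian_num_initial_chunks(10_000, 1_000_000, 3))  # 313
```

Try a few estimates of size and document count for your own environment. For small and mid-sized collections the size term tends to dominate, and varLimit only comes into play for very large estimates.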

If this is intimidating

Our ObjectRocket team specializes in working with customers to analyze and run MongoDB for Sitecore workloads in the most performant manner. Determining proper MongoDB sharded cluster keys and applying those findings to specific environments is what ObjectRocket does routinely for customers running Sitecore. Our deep MongoDB expertise continues through to topics of pruning, compactions, upgrades and stepdowns . . . the list goes on, as MongoDB is a vast platform.

MongoDB is a powerful system that’s easily taken for granted by Sitecore technologists. Our hope is that these notes on sharded cluster keys, and the earlier write-up on replica sets vs sharded clusters, help others make the most of their MongoDB investments. As always, feel free to get in touch with ObjectRocket if you’re looking for additional help with your MongoDB cluster!