DateTieredCompactionStrategy: Notes From the Field

It has been a few months since DateTieredCompactionStrategy made its way into Cassandra 2.0.11 and then into DataStax Enterprise 4.0.5. Since then, we’ve been able to learn quite a bit about it. This post consolidates various sources of information and provides some basic guidance on when and how to use it. DTCS was originally contributed to Apache Cassandra by Björn Hegerfors as part of his master's thesis, with additional work from Marcus Eriksson. For additional detail, including implementation notes, please see Marcus's original post introducing DTCS.

Basics

DateTieredCompactionStrategy, also known as DTCS from here on, is a new compaction strategy that is most suited for time-series data and access patterns. While not strictly limited to time-series data, such scenarios provide the best reference point for understanding what DTCS is good at.

LSM, Compaction, Data Locality

One of the inherent trade-offs with LSM systems is that modifications are buffered and then stored in immutable chunks of data. It is a trade-off in the sense that it simplifies the write path for maximum throughput while deferring the work of optimizing data storage to a workload-specific strategy. This means that the logical view of the data and its physical arrangement in storage will diverge over time, until some process simplifies the stored data to more closely match the logical view. Changes that overlap rows or fields leave extraneous data in the older versions, which represents a carrying cost both to storage and to the operations that have to read over it. During simplification, the old versions are forgotten and the last version of any field is retained. The last version of a deleted field is known as a tombstone, and is retained only long enough to be seen by all replicas. This is what compaction is all about.

However, there are varying levels of overlapping changes. Certain data models and access patterns create no overlapping changes at all. In a canonical time-series scenario, all data is in the form of time-stamped facts-- measurements or events that span a timeline in some regular fashion. For such cases, compaction can be effective by simply grouping the stored data for more optimal access, which is far less intensive. That is what DTCS is all about.

By keeping data clustered in storage the same way it is used in the data model, reads which use the clustering structure will be more efficient.

There are two fundamental details that are important for understanding the behavior of DTCS:

DTCS tries to keep data close together in storage when it is written close together in time. As data ages in the system, it is compacted into bigger timespans.

DTCS knows nothing about your data model. It does know about the cell timestamps which Cassandra uses for conflict resolution. I use the term cell timestamp here, since it is the most precise. You may also see the phrase “write timestamp,” but to be clear, the timestamp we are talking about is not part of a CQL data model that a user would normally interact with.

Taking these together, it follows that data models and access patterns which utilize time-ordering for clustering can benefit directly from DTCS. Specifically, data models in which the clustering structure aligns with the order of insertion are a natural fit for DTCS.
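As a minimal sketch, the cell timestamp DTCS operates on is the internal write timestamp exposed by CQL's writetime() function, not any timestamp column in your schema. The table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical time-series table; names are illustrative only.
CREATE TABLE sensor_data (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- writetime() returns the cell timestamp that DTCS buckets by.
-- It usually tracks reading_time for append-only ingestion, but it
-- is a separate, internal value.
SELECT reading_time, value, writetime(value)
FROM sensor_data
WHERE sensor_id = 'abc123';
```

When ingestion is append-only and roughly in time order, these cell timestamps naturally arrive in order too, which is exactly the property DTCS exploits.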

Appropriate Use

Since DTCS can be used with any table, it is important to know when it is a good idea, and when it is not. I’ll try to explain the spectrum and trade-offs here:

Perfect Fit: Time Series Fact Data, Deletes by Default TTL: When you ingest fact data that is ordered in time, with no deletes or overwrites. This is the standard “time series” use case.

OK Fit: Time-Ordered, with limited updates across whole data set, or only updates to recent data: When you ingest data that is (mostly) ordered in time, but revise or delete a very small proportion of the overall data across the whole timeline.

Not a Good Fit: many partial row updates or deletions over time: When you need to partially revise or delete fields for rows that you read together. Also, when you revise or delete rows within clustered reads.

It may be tempting to modify your data model to support DTCS, but this is not the best approach. A first principle of designing for scale is to model your data around your most intensive patterns. Since these are nearly always your queries, it’s fair to assume that you should model your data around your reads. Your queries are the closest thing you have in a Cassandra system to a native representation of your system requirements. Therefore, it’s wise to follow the practice of 1) modeling your queries for your system requirements, and 2) modeling your data to support your queries. Then, and only then, should you choose an appropriate compaction strategy according to your data model, access patterns, and data life-cycle.

DTCS Parameters

base_time_seconds (default: 60)

The smallest interval of time that DTCS will use to assemble its view of the timeline.

min_threshold (default: 4)

Controls how many intervals of time are combined to create the next larger interval size.

max_sstable_age_days (default: 365)

How old an sstable must be before DTCS stops considering it for compaction. This default is likely to be lowered to less than 1 month in a future version.

There are a few other parameters, but they are mostly for internals, so we’ll leave those out for now.
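To make the parameters concrete, here is a sketch of enabling DTCS with explicit options on a hypothetical table. The values shown are the defaults discussed above, except max_sstable_age_days, which is lowered here as an illustration for a short-lived data set, not as a recommendation:

```sql
-- Illustrative only: the table name is hypothetical, and option
-- values should be chosen for your own workload.
ALTER TABLE sensor_data
WITH compaction = {
    'class': 'DateTieredCompactionStrategy',
    'base_time_seconds': '60',
    'min_threshold': '4',
    'max_sstable_age_days': '30'
};
```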

DTCS Parameters, Visually

The timeline is broken apart in divisions of 4, i.e., min_threshold. This looks almost exactly like a measuring stick with English units.

You’ll notice that “now” is quoted. This is because the notion of now in DTCS has nothing to do with the wall clock. All timing for DTCS logic is based on two specific points in time: 1) the beginning of the Unix epoch, and 2) the latest timestamp of the latest sstable. This means that DTCS compaction logic runs relative to the last flushed sstable.

The buckets do not slide around in time. They are fixed with respect to the beginning of the Unix epoch. They simply coalesce from smaller bucket sizes to larger ones, going backwards in time from the newest sstable. With the defaults, for example, bucket sizes grow from 60 seconds to 4 minutes, to 16 minutes, and so on, each one min_threshold (4) times the previous. Every compaction interval fits neatly into the next bigger bucket size without overlaps. This means that compaction progress is incremental, making measured progress. It will always do the same work for a given starting state, and there are no intermediate states which might cause it to recompact differently based on the current time.

Did you notice that a compaction is showing active at the first completed 4 minute interval? It is indicated by the curvy arrow. This is because it was the first eligible interval, going backwards in time, which had 4 (min_threshold) sstables. These 4 sstables were the result of compacting the sstables flushed from memory in the 4 base (60 second) intervals which overlap it.

Also, the max_sstable_age_days parameter is set extremely low here, for illustrative purposes. Generally, you will want to set it to weeks or months. For systems with a high operational velocity and relatively short data lifetimes, setting it to a day or a few days is reasonable. For systems with low operational velocity and long data lifetimes, a much longer setting makes sense. Choosing a good value for this parameter is one of the most important aspects of tuning DTCS. Set it too low, and your sstables remain small and numerous, which hurts read performance, especially cold reads. Set it too high, and compaction keeps putting load on the system well past the point of diminishing returns, holding resources that you would prefer to keep available for operations. The best setting will be somewhere between those two extremes.

To choose a useful value for max_sstable_age_days, consider these heuristics:

Prefer lower values for systems which will only keep new data for a short while.

Prefer higher values for systems which need to optimize storage for reads that may otherwise need to access many sstables.

Key Benefits of DTCS

Compaction is a necessary element of a log-structured system like Cassandra’s write path. An effective compaction strategy will be complementary to the read and write patterns of its data. You want to consider not only the result of compaction, in terms of sstable consolidation, but also the work that the system has to do in order to achieve those results. The best compaction strategy for a table is one that balances the compaction workload with the need to optimize your read requests to the storage layer.

DTCS allows you to more closely amortize the compaction work over the ingested data for compatible workloads. This is fundamentally different from STCS and LCS, which both exhibit a cascade-like loading behavior when used with time-series data. By limiting the age of data which is eligible for compaction, you can choose a balance between compaction workload and the amount of consolidation required. This is controlled with the max_sstable_age_days parameter, as described above.

Testing Details

We ran a test that illustrates the behavior of DTCS in such a scenario: a typical time-series workload over an interval of two weeks, with DTCS on commodity spinning disks. The test was tuned to load data at the fastest sustainable rate-- it was intended to tax the system at a much higher level than a typical ingestion rate. To counterbalance the ingestion rate, the DTCS max_sstable_age_days was set to 1 day. We also striped the commit log and the data directory. This allowed us to push a dense system with spinning disks at a relatively high operational rate. This should be considered merely a simulation of system dynamics and behavior, but it does illustrate some interesting possibilities. The goal was to keep loading data into the system, observing the behavior and resource utilization as density increased. Since we were focusing on compaction and node dynamics, this was a single-node test. The results can easily be projected to a larger cluster with the baselines gathered here.

Data was written as telemetry-- time-series fact data. The table was configured with a default TTL, and no TTLs were applied per operation. The test started with only writes. A week later, a most-recent-n read workload was added at a ratio of 10%, across all partitions. This is called out in the plots below with the first vertical annotation. Around 3 days later, a cold-read workload was added, with a few requests constantly pending. This one is called out by the second vertical annotation.

All workloads were sustained at a queuing target level with asynchronous operations. In my experience, this is the most direct way to measure a system’s performance, as opposed to trying to enforce a rate target.

Here is the most telling plot from the test:

Interpretation

Notice how the op rate and the compaction load are almost mirror images on the graph: when the system is busier with compactions, front-end operations complete at a slower rate. This follows from first principles, but the graph provides visual confirmation for the skeptical. The op rate on the left is the actual number of all operations combined, while the compactions are scaled x 1000 for visual comparison. Notable as well is the density line, increasing linearly throughout the test with no significant effect on the ops and compaction. Other significant improvements included non-saturating survivor space (see below), stable latencies over the course of the test, and bounded integral IO load throughout. When the warm reads and then the cold reads were added, the read IO against storage increased only slightly. The net effect of reduced compaction load was that the system was able to sustain a relatively flat rate of operations over the course of the test, even though density was increasing steadily. This result has also been shown relative to STCS in previous tests, in which the DTCS system ran much more smoothly over the long term than an equally configured STCS system for a canonical time-series workload.

In order to get the test node to a meaningful density in a reasonable time, we initially focused on tuning for throughput. Our intent was to load the system at maximum rate to get it to a dense configuration, and then change the tunings to observe latency-affecting behavior. Due to the test running shorter than we expected (see below), we were not able to test the low latency settings. Still, the latencies remained stable, with little variation across each workload.

For the latency plots both reads and writes were broken into two separate views to make it easier to see the low and high percentiles with clarity. Also, only the most interesting part of the test (including the reads) is shown in focus.

The write latencies are solid across the board. The median write latency stays below 1ms the whole time.

The read latencies are stable as well. They do shift slightly with the addition of the cold read load. This is pretty encouraging, given that the test was only tuned for throughput and not latency.

Heap usage was well contained throughout the test. The graph above shows each memory pool on the 0%-100% scale, not stacked. Even after tuning for maximum ingestion throughput and then adding the warm and cold reads, there is no concerning change to the GC behavior. The small stair step in the middle marks the beginning of the warm-read workload, as expected.

For those who are familiar with GC tuning, this is a particularly interesting plot. The survivor space tends to be a canary in the coal mine for heap pressure. Notice how it was very tightly bounded, even when taking the min and max values across 10 minute windows. In systems which have high GC pressure, you will tend to see this peg to 100% shortly before significant GC pauses. It simply doesn’t happen in this test.

In this graph, you can see the storage shift from slightly write-biased to slightly read-biased, and then to moderately read-biased. There are two interesting characteristics here. First, the addition of the warm and then the cold workloads corresponds directly to the increases in read activity. As well, the lower disk write throughput corresponds to the longer-running compaction tasks (higher concurrent compactions) seen in the very first plot above. This is a direct illustration of the trade-off between compaction load and front-end capacity, partially driven by IO contention. Notably, the interplay here was more between the read and write operations, by way of compaction; the density of the data was not a significant factor. So long as active compactions remain above a threshold, front-end operations remain below a threshold, and vice-versa.

The plot above shows the CPU idle range across all cores. We can see an obvious reduction in CPU availability as the read workloads are added, mostly in the headroom (max idle), but less so in the base load (min idle). This shows that we were utilizing the CPUs almost fully, and probably had little more capacity to extract from this particular system. The further CPU is saturated, the more extended the tail latencies become. However, op rates and latencies remained stable at the loading levels we used.

Partial Success

The test ended before we reached a significant trade-off in performance. This was because the timestamp stepping on the telemetry data was too high per partition. For the record, I’ve never run out of epoch time in this type of test before; systems tend to fall over well before that point. Next time, we’ll know to reduce the timestamp stepping by a factor. The monotonicity of the timestamps was an important detail in the test setup, so the stepping could not be adjusted retroactively without affecting the integrity of the results. We may reproduce these results in the future with the additional latency-tuning phase.

Complementary Settings

A default TTL on a table is another natural fit for time-series data. It allows you to avoid costly compactions that run solely for the purpose of dropping tombstoned and/or TTLed cells. In fact, it allows the compaction logic to be much more efficient, simply dropping an sstable when its newest cell timestamps are old enough to indicate that the entire sstable is past its shelf life. This data is kept in the sstable metadata when it is written, making it easy to check. When a default TTL is used with a time-series workload, you can further reduce the compaction load associated with data density. When used in conjunction with DTCS, you can almost eliminate the geometric loads associated with high density on the compaction side of the system.
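As a sketch of this pairing, the table below (names and values are hypothetical) combines DTCS with a table-level default TTL so that no per-write TTLs are needed:

```sql
-- Illustrative only: rows expire 30 days after they are written.
-- With DTCS, an entire sstable whose newest cells are past this TTL
-- can eventually be dropped outright rather than compacted away.
CREATE TABLE sensor_data (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH compaction = {'class': 'DateTieredCompactionStrategy'}
  AND default_time_to_live = 2592000;  -- 30 days, in seconds
```

Because the TTL is uniform across the table, every cell in a given time bucket expires together, which is what makes whole-sstable drops possible.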

Cautionary Tales

These results are very promising for certain workloads, but we have to remain honest about the trade-offs, as with any distributed system. Specifically, DTCS is not an automatic win for every case. It is an optimization for a specific type of workload. If you try to use it with the wrong workload, it will perform worse than STCS or LCS-- painfully worse. On the flip side, if you have the right workload for it, the benefits are undeniable. If you are interested in using it but want to approach it cautiously, consider testing it out with write sampling as described here: What’s new in Cassandra 1.1: live traffic sampling. It will be helpful to monitor the changes in total bytes compacted, IO load, and active compactions over time. Be sure to test beyond the max_sstable_age_days window so that you can see the maximum bound of compaction load up to that point.

This is probably an appropriate time to remind everyone that operational headroom is necessary for your sanity. If you tune your system to hold lots of data without considering the amount of operational capacity you need in reserve, you are signing up for some pain. Operational headroom is there to absorb the overhead of reacting to node failures, bootstrapping new nodes, running repairs, and so on. This means that you should size the whole system for operations as well as your application loads. A simple approach is to plan for a reasonably low time-to-recovery-- that is, how long it takes to bootstrap a node while your cluster is serving traffic. It is better to design towards a specific time-to-recovery goal than to discover that it is uncomfortably long due to an imbalanced node profile. Specifically, the amount of data you keep on your nodes is an operational design parameter that depends on compute, storage, memory, and network capacity.

Known Issues as of April 2015

This ticket is about max_sstable_age_days interfering with the default table TTL optimization. A fix is included in 3.0, 2.0.15, and 2.1.5.

In Summary

With DTCS and default TTLs, significant data density is possible for time-series scenarios. Even scenarios which closely resemble time-series can benefit from DTCS.

Our saturation test showed a stable disk-based system operating near throughput saturation, with mixed workload, with stable latencies, with continuously increasing data density. We did not hit any walls or sudden density barriers for the time series workload in the test. It is fair to assume that there is a data density at which the system would need retuning, but we have not identified that point yet in our testing.

If you choose to use DTCS with a higher density configuration, it is imperative that you consider the time that it takes to bootstrap or repair your data. Please design responsibly.
