Leveled Compaction in Apache Cassandra

Introduction to Compaction

Cassandra's log-structured storage engine enables its excellent performance and features like application-transparent compression by turning all updates into data files called sstables, which are written sequentially to disk. No update-in-place is done (that would require random i/o); instead, a new version of the columns being inserted or updated is written to the next sstable.

Figure 1: adding sstables with size tiered compaction

Thus, over time, many versions of a row may exist in different sstables. Each of these versions may have different sets of columns. If the sstables were allowed to accumulate, reading a row could require many seeks in different files to return a result.

To prevent read speed from deteriorating, compaction runs in the background, merging sstables together. (This can also be done performantly, without random i/o, because rows are sorted by primary key within each sstable.)

Tiered Compaction

Figure 2: sstables under size-tiered compaction after many inserts

Cassandra's size-tiered compaction strategy is very similar to the one described in Google's Bigtable paper: when enough similar-sized sstables are present (four by default), Cassandra will merge them. In figure 1, each green box represents an sstable, and the arrow represents compaction. As new sstables are created, nothing happens at first. Once there are four, they are compacted together, and again when we have another four. Figure 2 shows what we expect much later, when second-tier sstables have been combined to create a third tier, third-tier sstables combined to create a fourth, and so on.
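The merging behavior described above can be sketched in a few lines of Python. This is a toy model, not Cassandra's code: sstables are represented only by their sizes, and every merge of four equal-sized tables produces one table four times as large (i.e., no obsolete data is discarded).

```python
MERGE_THRESHOLD = 4  # Cassandra's default: merge when four similar-sized sstables exist

def flush_and_compact(num_flushes, flush_size=1):
    """Flush `num_flushes` sstables of `flush_size` each, merging whenever
    four equal-sized sstables accumulate. Returns the final sstable sizes."""
    sstables = []
    for _ in range(num_flushes):
        sstables.append(flush_size)
        merged = True
        while merged:  # keep merging until no bucket has four members
            merged = False
            for size in set(sstables):
                if sstables.count(size) >= MERGE_THRESHOLD:
                    for _ in range(MERGE_THRESHOLD):
                        sstables.remove(size)
                    sstables.append(size * MERGE_THRESHOLD)
                    merged = True
                    break
    return sorted(sstables)

# Sixteen flushes yield one third-tier sstable (4 * 4 = 16 units),
# matching the progression shown in figures 1 and 2:
print(flush_and_compact(16))  # [16]
print(flush_and_compact(21))  # [1, 4, 16]
```

Note how a row written early and updated late can leave versions of itself in each of the surviving tiers, which is exactly the read-performance problem discussed below.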

There are three problems with size-tiered compaction in update-heavy workloads:

Performance can be inconsistent because there are no guarantees as to how many sstables a row may be spread across: in the worst case, we could have columns from a given row in each sstable.

A substantial amount of space can be wasted since there is no guarantee as to how quickly obsolete columns will be merged out of existence; this is particularly noticeable when there is a high ratio of deletes.

Space can also be a problem as sstables grow larger from repeated compactions, since an obsolete sstable cannot be removed until the merged sstable is completely written. In the worst case of a single set of large sstables with no obsolete rows to remove, Cassandra would need 100% as much free space as is used by the sstables being compacted, into which to write the merged one.
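To make the last point concrete, here is a toy calculation of the temporary space a compaction needs, assuming the simple model that the merged output must be fully written before the inputs can be deleted (the function and its parameters are illustrative, not part of Cassandra):

```python
def temp_space_needed(sstable_sizes_mb, obsolete_fraction=0.0):
    """Free space needed to merge a set of sstables: the size of the
    merged output, i.e. the combined inputs minus obsolete data."""
    return sum(sstable_sizes_mb) * (1 - obsolete_fraction)

# Four 50GB sstables with no obsolete rows: 200GB of free space is
# required, i.e. 100% of the space already used by the inputs.
print(temp_space_needed([51200] * 4))  # 204800.0
```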

Leveled Compaction

Figure 3: adding sstables under leveled compaction

Cassandra 1.0 introduces the Leveled Compaction Strategy, based on LevelDB from the Chromium team at Google.

Leveled compaction creates sstables of a fixed, relatively small size (5MB by default in Cassandra's implementation), that are grouped into "levels." Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
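The level sizing can be sketched numerically, assuming the defaults just described (5MB sstables, each level 10x the size of the previous); the helper names are illustrative:

```python
SSTABLE_SIZE_MB = 5   # default sstable size under leveled compaction
GROWTH_FACTOR = 10    # each level is 10x as large as the previous

def level_capacity_mb(level):
    """Maximum amount of data allowed in level L (L >= 1)."""
    return SSTABLE_SIZE_MB * GROWTH_FACTOR ** level

def sstables_in_level(level):
    """Number of fixed-size sstables a full level holds."""
    return level_capacity_mb(level) // SSTABLE_SIZE_MB

print(level_capacity_mb(1), sstables_in_level(1))  # 50 10
print(level_capacity_mb(2), sstables_in_level(2))  # 500 100
```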

In figure 3, new sstables are added to the first level, L0, and immediately compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap. As more data is added, leveled compaction results in a situation like the one shown in figure 4.

Figure 4: sstables under leveled compaction after many inserts

This solves the above problems with tiered compaction:

Leveled compaction guarantees that 90% of all reads will be satisfied from a single sstable (assuming nearly-uniform row size). Worst case is bounded at the total number of levels -- e.g., 7 for 10TB of data.

At most 10% of space will be wasted by obsolete rows.

Only enough space for 10x the sstable size needs to be reserved for temporary use by compaction.
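These guarantees can be checked with back-of-the-envelope arithmetic, assuming the defaults described above (5MB sstables, 10x growth per level):

```python
SSTABLE_SIZE_MB = 5

def levels_needed(total_mb):
    """Smallest number of levels whose combined capacity holds total_mb."""
    levels, capacity = 0, 0
    while capacity < total_mb:
        levels += 1
        capacity += SSTABLE_SIZE_MB * 10 ** levels
    return levels

# 10TB of data fits within 7 levels, matching the bound quoted above:
print(levels_needed(10 * 1024 * 1024))  # 7

# The largest level accounts for ~90% of total capacity, which is why
# ~90% of reads can be served from a single sstable:
capacities = [SSTABLE_SIZE_MB * 10 ** level for level in range(1, 8)]
print(round(capacities[-1] / sum(capacities), 2))  # 0.9
```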

Leveled compaction can be enabled by creating (or updating) a ColumnFamily with the compaction_strategy option set to LeveledCompactionStrategy. When updating an existing column family, reads and writes can continue as usual while leveling of existing sstables is performed in the background.
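As a sketch of what this might look like in practice, using the cassandra-cli syntax of the 1.0 era (the column family name users is hypothetical):

```
update column family users with compaction_strategy='LeveledCompactionStrategy';
```

The equivalent option can be supplied to create column family when defining a new one.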

Considerations

Because leveled compaction makes the above guarantees, it does roughly twice as much i/o as size-tiered compaction. For primarily insert-oriented workloads, this extra i/o does not pay off in terms of the above benefits, since there are few obsolete row versions involved.

Leveled compaction ignores the concurrent_compactors setting. Concurrent compaction is designed to avoid tiered compaction's problem of a backlog of small compaction sets becoming blocked temporarily while the compaction system is busy with a large set. Leveled compaction does not have this problem, since all compaction sets are roughly the same size. Leveled compaction does honor the multithreaded_compaction setting, which allows using one thread per sstable to speed up compaction. However, most compaction tuning will still involve using compaction_throughput_mb_per_sec (default: 16) to throttle compaction back.

In http://leveldb.googlecode.com/svn/trunk/doc/impl.html you can find a rule: "We also switch to a new output file when the key range of the current output file has grown enough to overlap more than ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2)."
I think the "10x the sstable size" figure may come from this rule.

Each level is allowed to grow to 10x the size of the previous level, so (slightly less than) 90% of the data will be in the largest level. Remember that within a level, data is guaranteed not to overlap across sstables, or put another way, a given row will be in at most one sstable.

1. Since the sstables are 5MB each here, are bloom filter and index files maintained for each sstable?

If so, wouldn't it take time to scan through all those filters and indexes? How does that work here?

2. Regarding "extra sstables are promoted to L2 (violet)" in the document above:

Is it something like this: L1 contains sstables 10 (newest) through 1 (oldest). When a new sstable (11) is to be compacted from L0 into L1, sstable 1 is moved from L1 to L2 and compacted with the L2 sstables, while the new sstable (11) is compacted into L1?

3. If the sstable size is set to 5MB and there are rows larger than 5MB, how does that work?

4. During compaction, do all the sstables in a level that are affected by new updates participate when a new sstable is compacted in? And once the compaction is done, are new sstables created with the changes and the old ones discarded? (in an update-heavy environment)

All metadata is maintained for sstables compacted by LCS, although the default bloom filter accuracy is reduced starting in 1.2. We use an IntervalTree to cut down the number of sstables that need to be checked.

SSTables retain the level they are created in. If a level is full, new sstables resulting from compacting it will be placed in the next-higher level.

An SSTable will contain a minimum of one row.

New SSTables are always flushed to L0 and compact “upwards” from there.

A compaction will run whenever there is more data in level N > 0 than desired (sstable_size_in_mb * 10**level), or whenever there is anything in L0. In practice, you will often get multiple sstables flushed to L0 while it is busy compacting higher levels (especially before 1.2, which allows parallel leveled compactions), but in theory a single L0 sstable can be compacted with L1.
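The trigger condition just described can be sketched as follows, assuming a caller that knows the total bytes currently in each level (a hypothetical helper, not Cassandra's actual code):

```python
SSTABLE_SIZE_MB = 5

def needs_compaction(level, bytes_in_level):
    """L0 compacts whenever it is non-empty; higher levels compact
    whenever they exceed sstable_size_in_mb * 10**level."""
    if level == 0:
        return bytes_in_level > 0
    max_bytes = SSTABLE_SIZE_MB * 1024 * 1024 * 10 ** level
    return bytes_in_level > max_bytes

print(needs_compaction(0, 1))             # True: anything in L0 triggers
print(needs_compaction(1, 40 * 1024**2))  # False: under L1's 50MB cap
print(needs_compaction(1, 60 * 1024**2))  # True: over L1's 50MB cap
```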

I don't understand the 10x temporary space requirement for leveled compactions. If my total CF size is 500G, does that mean I will need 5 terabytes? That makes no sense. Say my highest-level sstable will be 100G. Does that mean I need a terabyte for temporary compactions? That's worse than size-tiered. Can you please clarify?

Hi Jonathan,
I have some questions about LeveledCompaction after reading the related code in Cassandra 1.2:
1. In the case where we get candidates from level L (L >= 1) and pick just one sstable (since we didn't find overlapping ones in level L+1), why not simply promote that sstable instead of doing a single-sstable compaction?
2. In our situation, each node has 12 disks (1.6T capacity), each node with 500G of data right now, and the memtable size of cf A is 1G. How should we tune sstable_size_in_mb? What about a 4G size?