support incremental sstable switching

Details

Type: Improvement

Status: Resolved

Priority: Minor

Resolution: Won't Fix

Fix Version/s: None

Component/s: None

Labels: None

Description

I have been thinking about how to minimize the impact of compaction further beyond CASSANDRA-1470. 1470 deals with the impact of the compaction process itself in that it avoids going through the buffer cache; however, once compaction is complete you are still switching to new sstables which will imply cold reads.

Instead of switching all at once, one could keep both the old and new sstables around for a bit and incrementally switch over traffic to the new sstables.

A given request would go to the new or old sstable depending on e.g. the hash of the row key, coupled with the point in time relative to compaction completion and relative to the intended target sstable switch-over.

In terms of end-user configuration/mnemonics, one would specify, for a given column family, something like "sstable transition period per gb of data" or similar. The "per gb of data" would refer to the size of the newly written sstable after a compaction. So, for a major compaction you would wait for a very significant period of time, since the entire database just went cold. For a minor compaction, you would only wait for a short period of time.
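
A minimal sketch of the routing rule described above, in Python for brevity. All names (`TRANSITION_SECONDS_PER_GB`, `transition_fraction`, `read_from_new_sstable`) and the choice of MD5 as the hash are assumptions made for illustration; none of this reflects actual Cassandra code.

```python
import hashlib

# Hypothetical per-column-family setting:
# "sstable transition period per gb of data".
TRANSITION_SECONDS_PER_GB = 60.0


def transition_fraction(new_sstable_gb, seconds_since_compaction,
                        seconds_per_gb=TRANSITION_SECONDS_PER_GB):
    """Fraction of the switch-over that is complete, clamped to [0, 1]."""
    period = new_sstable_gb * seconds_per_gb
    if period <= 0:
        return 1.0
    return max(0.0, min(1.0, seconds_since_compaction / period))


def key_position(row_key):
    """Map a row key (bytes) to a stable position in [0, 1)."""
    digest = hashlib.md5(row_key).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def read_from_new_sstable(row_key, fraction):
    """A key switches once its position falls below the fraction.

    Because the fraction only grows over time, a key that has been
    routed to the new sstable is never routed back to the old one.
    """
    return key_position(row_key) < fraction
```

With the hypothetical 60 seconds per GB, a 50 GB sstable produced by a major compaction would be phased in over 50 minutes, while a 100 MB sstable from a minor compaction would be phased in within seconds.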

The result should be a modest negative impact on e.g. disk space usage, but hopefully a very significant positive impact in terms of making the sstable transition as smooth as possible for the node.

I like this because it feels pretty simple, does not rely on OS-specific features or any specific support from the OS other than a "well functioning cache mechanism", and does not imply something hugely significant like writing our own page cache layer. The performance cost w.r.t. CPU should be very small, but the improvement in terms of disk I/O should be very significant for workloads where it matters.

The feature would be optional and per-sstable (or possibly global for the node).

Activity

Clarification: The reason why sstable choice should be deterministic on row key (or something similar) is that it means that once some particular piece of data has been read from the new sstable, it is never again read from the old one. This is important both because it means the transition can happen faster, and because it means you do not increase the working set size as significantly as you otherwise would.

(In an ideal case the algorithm to determine sstable should take into account actual on-disk placement in the new sstable, given that the OS cache will often be pulling in other things than what is being asked for due to locality. But even lacking that, being deterministic is a significant improvement.)

Peter Schuller
added a comment - 23/Oct/10 23:30

In addition to determinism, there should be on-disk locality, which implies that it should be by token, not key. Without such locality, your first fractions could cause all the blocks to be read. In other words, you want to make sure when you're at 10% through the transition, it roughly amounts to 10% of the blocks, not 10% of each block.
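
A small simulation can illustrate the difference between the two policies. It is only a sketch under assumed numbers: `ROWS_PER_BLOCK`, `NUM_BLOCKS` and the function names are hypothetical, rows are numbered in token order (taken to be their on-disk order), and MD5 stands in for an arbitrary key hash.

```python
import hashlib

ROWS_PER_BLOCK = 100  # hypothetical: rows stored per on-disk block
NUM_BLOCKS = 100
NUM_ROWS = ROWS_PER_BLOCK * NUM_BLOCKS


def hash_position(row):
    """Pseudo-random but stable position in [0, 1) for a row."""
    digest = hashlib.md5(str(row).encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def blocks_touched(percent, by_token):
    """Count blocks of the new sstable holding at least one switched row.

    percent  -- how far through the transition we are (0 to 100)
    by_token -- True: switch rows in token order; False: by key hash
    """
    touched = set()
    for row in range(NUM_ROWS):
        if by_token:
            switched = row * 100 < percent * NUM_ROWS
        else:
            switched = hash_position(row) * 100 < percent
        if switched:
            touched.add(row // ROWS_PER_BLOCK)
    return len(touched)
```

At 10% through the transition, `blocks_touched(10, by_token=True)` covers exactly 10 of the 100 blocks, whereas `blocks_touched(10, by_token=False)` touches nearly every block, since each block almost certainly contains at least one switched row.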

Ryan King
added a comment - 15/Nov/10 22:36

I didn't think of that; that we can actually make the transition sequential in terms of sstable coverage. That's great.

The only case I can think of where it might not be advantageous would be use of the OPP on a terribly uneven workload, where a gradual transition from 0-100% in terms of the node's ring segment might imply very uneven jumps in the number of requests going to the new cold sstable at any given moment. (Not that I'm using this as an argument against the idea, just pointing it out.)

Peter Schuller
added a comment - 16/Nov/10 08:14

FWIW, this seems like much less bang-for-complexity-buck than limiting sstable size as in CASSANDRA-1608 to me. (They are not mutually exclusive but I would like to see how urgent this feels after limiting size is done, first.)

Jonathan Ellis
added a comment - 16/Nov/10 18:28

Ryan: Yes, good point.

Jonathan:

So originally with 1608 I looked at the size restrictions as something that you'd select to be pretty high; essentially limiting it to "very large" instead of "huge" for bloom filter purposes (and probably disk space purposes - avoiding spikes). Limiting sizes sufficiently such that individual sstable compactions are no longer an issue would imply having pretty severe limits (on the order of a smallish subset of RAM size rather than, say, 100 gig).

My main concern is the number of sstables this would generate if the maximum size was e.g. 500 MB or something along those lines. This means that row locality (between sstables) becomes significantly more important for large data sets. However, assuming 1608 works well enough, and coupled with rate limited compactions (outside the scope of this ticket or 1608), I agree that this should essentially become unnecessary, instead effectively being a complex way to achieve the same thing as one of the side-effects of 1608.

That said, I'm still not sold on how 1608 is to accomplish sufficiently aggressive row "de-spreading" without incurring significant overhead by compacting too aggressively. But I am starting to think I have misunderstood something about 1608, so take that with a grain of salt.

Peter Schuller
added a comment - 16/Nov/10 18:50

I'd like to propose a slightly modified approach to the switch-over: it could be better to switch reads over to the new sstable incrementally during compaction rather than after it has completed. Since compaction writes rows in token order, it is easy to determine whether the row for a given key has already been written to the new sstable. As soon as a row is written to the new sstable it will never change, so it can be read from the new sstable like any other row.

Ok, I agree, this implementation is more complex, but it gives a number of advantages:

We don't need to keep storage occupied after compaction is completed.

A just-written row is more likely to reside in the buffer cache, so reads of it are 1) cheaper and 2) prevent the OS from purging hot read blocks of the new sstable from the buffer cache.

In addition, if we limit the speed of compaction (say, to no more than 10% of disk I/O utilisation), we can avoid disk spikes completely without even employing the direct I/O writes approach. My reasoning is that a constantly low load spread over time is much better for overall system stability than occasional spikes of disk activity AND read latency. So, ideally, compaction with incremental switch-over should be tuned to run slowly and continuously: as soon as one compaction ends, another starts.
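
The "already written or not" check could be sketched as a high-water mark on the token, as below. The class and method names are hypothetical, tokens are assumed to be comparable values, and the sketch ignores write buffering and index availability, which a real implementation would have to handle before serving reads from a partially written sstable.

```python
class CompactionSwitchOver:
    """Tracks how far an in-progress compaction has written, by token."""

    def __init__(self):
        self.last_written_token = None  # nothing written yet

    def row_written(self, token):
        """Record that the row with this token is now in the new sstable."""
        # Compaction emits rows in increasing token order.
        if (self.last_written_token is not None
                and token <= self.last_written_token):
            raise ValueError("rows must be written in increasing token order")
        self.last_written_token = token

    def read_from_new_sstable(self, token):
        """True once the token is at or below the high-water mark."""
        return (self.last_written_token is not None
                and token <= self.last_written_token)
```

A read for a token above the mark still goes to the old sstables; once compaction passes that token, the same read is served from the new sstable and never goes back.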

Oleg Anastasyev
added a comment - 24/Nov/10 08:11

At first I really, really liked it, but then I realized a problem that takes away a little bit of it, and now I'm not sure. Anyway, first what I like: couple this with posix_fadvise()/DONTNEED on the sstables being switched from, and one would not even need memory for both sets of sstables in order to remain hot in cases where you rely on a cf being mostly or completely in memory.

The posix_fadvise() (and munlock(), if mlock()ed sstables come into the picture in the future) would presumably be done at some granularity coarser than rows, or the calls would be much too frequent for performance purposes. But doing so every few tens of MB or so should be fine.
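
A rough sketch of the chunked posix_fadvise() idea, using Python's `os.posix_fadvise` wrapper (available on some Unix platforms only). The function name, the 32 MB chunk size and the bookkeeping of offsets are assumptions for illustration; it also assumes that byte offsets in the old sstable advance along with the switch-over, as they would when switching in token order.

```python
import os

CHUNK_BYTES = 32 * 1024 * 1024  # hypothetical: "every few tens of MB"


def drop_switched_pages(fd, dropped_upto, switched_upto, chunk=CHUNK_BYTES):
    """Advise the OS to drop cached pages of the old sstable, chunk by chunk.

    fd            -- open file descriptor of the old sstable
    dropped_upto  -- byte offset up to which pages were already advised away
    switched_upto -- byte offset up to which reads now go to the new sstable
    Returns the new dropped_upto offset.
    """
    target = (switched_upto // chunk) * chunk  # round down to whole chunks
    if target <= dropped_upto:
        return dropped_upto  # not enough progress for another call yet
    if hasattr(os, "posix_fadvise"):  # not present on every platform
        os.posix_fadvise(fd, dropped_upto, target - dropped_upto,
                         os.POSIX_FADV_DONTNEED)
    return target
```

Calling this once per written row stays cheap, because the system call is only issued when a whole new chunk has been switched over.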

In addition, on the topic of rate limiting, fsync()ing would still be required under some circumstances to avoid affecting read latencies too much (e.g. to avoid the OS pushing out more than fits in the battery-backed cache on a RAID controller as a result of pushing data in bursts).

A big downside though is this: for workloads where performance depends on the warmness of the cache with respect to the active set, this way of doing it would still imply most of the negative effects of mass cache eviction. Any large cf with a significant warm-up period would be heavily affected.

A possible way to categorize a cf might be:

(1) Very small cf; fits in RAM with lots of margin.
(2) Smallish, just barely fits in RAM.
(3) Large; a lot larger than RAM.

On the premise that we're discussing situations where cache warmth is relevant, I see the following disposition of the above cf:s with respect to an incremental switch-over:

(1) Works, but doesn't matter much since it fits in RAM anyway (except for multiple such sstables, but then see (2))
(2) Here we improve significantly by allowing us to lower the constant factor of RAM required relative to domain data size.
(3) Doesn't work anyway due to eviction on writes.

So really, it seems to me that for situations where you need a reasonably high rate of compaction, it would only work very well in (2), which is sort of a special case sitting in the middle of a spectrum.

You do point out that slow compaction is a potential helper here, and I agree. Provided that compaction is sufficiently slow that the warm-up period of the node is similar to or less than the time spent compacting, this would indeed work well even in the case of (3).

I would further suggest that if you are IOPS sensitive you probably have a strong desire to limit compaction rate to something reasonable anyway.

It's not clear to me whether the trade-offs would tend to land on the side of it working well in practice or not.

A reasonably realistic example of type (3) with concrete numbers (let me know if I'm taking a misstep in the calculations):

(I am about to engage in some pretty speculative stuff that terminates with insufficient math skills on my part; you may want to just skip the remainder of this comment.)

Say you have a 500 GB CF, and 16 GB of page cache in the OS. Say you have a warm-up period of 30 minutes on a completely cold start before you're comfortable taking the load. Assume that you don't want more than a ~25% impact in terms of cold IOPS during a compaction, relative to the level of warmness you reach after your 30 minute warm-up on node start.

Eviction will tend to be random relative to the frequency/recency of access. So an instant eviction of some percentage of page cache should result in a proportional (by some factor) percentage of IOPS.

Assume that your workload is such that 90% of reads are served from cache. This should mean that the factor in question is 10. I.e., a 10% eviction should result in a 100% increase in IOPS.

Now, if over time cache hit rates increased linearly this would mean that a 25% target IOPS increase during compaction translates into a 2.5% maximum eviction rate over the 30 minute time window. But here is where we become dependent on the distribution of reads and unfortunately where my math skills fail me.

But at least in the worst possible (unrealistic) case, those 2.5% over 30 minutes translate, with a 16 GB page cache, into 400 MB per 30 minutes. Compacting 500 GB would thus take 26 days. Of course this is utterly unrealistic, but it should be an upper bound. Does anyone with more math skills want to chime in on the expected behavior given a long-tail distribution (for example) where 30 minutes translates into the 90% hit rate?
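
The arithmetic above can be reproduced with a few lines of Python. The function name and parameters are hypothetical, and the sketch takes the rule of thumb from the comment at face value (a 90% hit rate means evicting x% of the cache raises IOPS by about 10 times x%); it is a restatement of the estimate, not a model of real cache behavior.

```python
def worst_case_compaction_days(cf_size_gb=500.0, page_cache_gb=16.0,
                               warmup_minutes=30.0, hit_rate=0.90,
                               max_iops_increase=0.25):
    """Worst-case compaction duration under the linear warm-up assumption."""
    # Rule of thumb from the comment: factor 10 at a 90% hit rate.
    iops_factor = 1.0 / (1.0 - hit_rate)
    # Largest share of the page cache we may evict per warm-up window.
    max_eviction = max_iops_increase / iops_factor       # about 2.5%
    gb_per_window = max_eviction * page_cache_gb         # about 0.4 GB
    windows = cf_size_gb / gb_per_window                 # about 1250
    return windows * warmup_minutes / (60.0 * 24.0)
```

With the default numbers this gives roughly 26 days, matching the figure in the comment: 1250 windows of 30 minutes each.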

Peter Schuller
added a comment - 25/Nov/10 19:25

Marking as wontfix since people with super latency sensitive workloads are increasingly switching to SSDs, so if nobody has been motivated to work on this in the last ~18 months I doubt it's going to happen.

I'm not -1 on the idea though, which is interesting. Feel free to reopen if you have a patch.

Jonathan Ellis
added a comment - 15/May/12 18:43