Make the reading half of compactions memory-efficient

Description

This issue is the next step on the road to finally fixing CASSANDRA-16. To make compactions memory-efficient, we have to be able to perform the compaction process on the smallest possible chunks that might intersect and contend with one another, meaning that we need a better abstraction for reading from SSTables.

Activity

Stu Hood added a comment - 04/Mar/10 04:36

This patchset applies something almost identical to the SSTableScanner interface from CASSANDRA-674 to our current file format, and uses the new interface for compactions.

0001: Adds the necessary data structures from 674, and makes minor changes to other supporting classes.
0002: Implements the majority of SSTableScanner for our current file format, and adds shims to allow various portions of the code to continue using iterators.
0003: Splits tests so that we can exercise both the SSTableScanner API and direct FileDataInput access.
0004: Continues to flesh out the RowIndexedScanner implementation and tests, and separates SuperColumn handling into a subclass for clarity.
0005: Adds a more limited 'Scanner' interface (which might also apply to memtables?), implements a FilteredScanner which filters slices in a wrapped Scanner for use during anti-compaction, and adds the CompactionIterator implementation from 674.
0006: Adds support for compaction of super column families, which has one remaining bug to be ironed out (but I wanted to post this set for feedback ASAP, since it is likely to change anyway), plus a few currently failing tests to help track down the bug.

"one huge BF" is still a bad idea. you're cramming more into a single BF than it can usefully handle. You remember CASSANDRA-790 of course. Throwing columns into the same BF as row keys means that (a) your estimation of how big a BF you'll need gets drastically less accurate in the worst case and (b) you can either support many less rows, or have a much less accurate filter because of capacity problems.

furthermore, the more I think about this, the less I think "access column X by name that doesn't actually exist" is a frequent operation. usually if you are accessing columns by name the column names are uniform across your rows and will exist close to 100% of the time. and if you are accessing columns by slice then BF is useless.

Put another way, the row key is not just another level of column name and deserves special treatment at least in this respect.

[the one exception may be if you are accessing rows whose contents have been deleted, but whose tombstones haven't been GC'd. we should make sure we don't actually have a BF entry for a row unless it actually contains data. I don't think the current code does this.]

Structures
========

the Scanner api seems like a step back from IteratingRow to me. self-contained iterators are good. any time you get more complicated than "here's an object I call next() on" things get buggy in my experience. even more confusing, scanners can return IR (but you're not supposed to use it as an iterator? or you are? not sure).

telling bad sign: CompactionIterator is 2x as long as it used to be.

I have some thoughts on this but I am going to save this here, typing long things in JIRA is risky.

Jonathan Ellis added a comment - 05/Mar/10 23:32

Some high-level thoughts:

Meta
====
Please increase your column width to 120, and put spaces around operators (n-1 -> n - 1).

Bloom filters
==========
"One huge BF" is still a bad idea: you're cramming more into a single BF than it can usefully handle. You remember CASSANDRA-790, of course. Throwing columns into the same BF as row keys means that (a) your estimate of how big a BF you'll need gets drastically less accurate in the worst case, and (b) you can either support many fewer rows, or have a much less accurate filter because of capacity problems.

Furthermore, the more I think about this, the less I think "access column X by a name that doesn't actually exist" is a frequent operation. Usually, if you are accessing columns by name, the column names are uniform across your rows and will exist close to 100% of the time; and if you are accessing columns by slice, then the BF is useless.

Put another way, the row key is not just another level of column name, and deserves special treatment at least in this respect.

[The one exception may be if you are accessing rows whose contents have been deleted, but whose tombstones haven't been GC'd. We should make sure we don't actually have a BF entry for a row unless it actually contains data; I don't think the current code does this.]

Structures
========
The Scanner API seems like a step back from IteratingRow to me. Self-contained iterators are good; in my experience, any time you get more complicated than "here's an object I call next() on", things get buggy. Even more confusing, scanners can return IteratingRow (but you're not supposed to use it as an iterator? Or you are? Not sure).

A telling bad sign: CompactionIterator is twice as long as it used to be.

I have some thoughts on this, but I am going to save this here; typing long things in JIRA is risky.
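To make point (b) about capacity concrete, here is a minimal sketch (hypothetical sizes, not Cassandra's BloomFilter code) of the standard false-positive estimate p = (1 - e^(-kn/m))^k, showing how a filter sized for row keys alone degrades once columns are thrown into the same filter:

```java
// Sketch: the standard bloom-filter false-positive estimate. The bit count,
// hash count, and element counts below are made-up illustrative numbers.
public class BloomFilterMath
{
    /** Expected false-positive rate for m bits, k hash functions, n inserted elements. */
    static double falsePositiveRate(long m, int k, long n)
    {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args)
    {
        long bits = 10_000_000; // sized for ~1M row keys at 10 bits/key
        int hashes = 7;
        // filter holding only the 1M row keys it was sized for
        System.out.printf("1M keys:           p = %.4f%n",
                          falsePositiveRate(bits, hashes, 1_000_000));
        // same filter also holding 10 columns per row
        System.out.printf("1M keys + columns: p = %.4f%n",
                          falsePositiveRate(bits, hashes, 11_000_000));
    }
}
```

With 10 bits per key and 7 hashes, the rate is under 1% for the intended 1M entries, but jumps to nearly 100% once ten columns per row are added, which is exactly the "much less accurate filter because of capacity problems" described above.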

> "one huge BF" is still a bad idea
I wasn't attempting to make a decision on the BF issue with this code: the ColumnKey BF methods are remnants from 674 which I will happily remove. I think we agreed at some point that having a bloom filter per SSTable block was going to be a better approach, and to leave the global bloom filter as-is.

> the Scanner api seems like a step back from IteratingRow to me. self-contained iterators are good
I definitely agree re encapsulation, but I've always considered the Iterator interface to be super-weak for IO. There is no support for exception handling, and hasNext() is dangerous (when you begin chaining iterators that are returning lazy objects like IteratingRow) because it usually involves seeks.

> scanners can return IR
You'll note that that method is deprecated: IteratingRow is sticking around as a shim in this case, because I didn't want to spend too much time porting SSTable(Im|Ex)port to the Slice API quite yet. The same goes for other @Deprecated methods on the SSTableScanner and SSTableReader base classes: I hope that we can remove them when the new file format is added.

> telling bad sign: CompactionIterator is 2x as long as it used to be.
This is mostly because I replaced ReducingIterator with a custom merge-sort which may or may not be slower anway. Additionally, we're doing a lot of work to collect slices into CFs that we can remove once the writing half of compactions is memory efficient.

Stu Hood added a comment - 06/Mar/10 00:29

> "one huge BF" is still a bad idea
I wasn't attempting to make a decision on the BF issue with this code: the ColumnKey BF methods are remnants from 674 which I will happily remove. I think we agreed at some point that having a bloom filter per SSTable block was going to be a better approach, and to leave the global bloom filter as-is.

> the Scanner api seems like a step back from IteratingRow to me. self-contained iterators are good
I definitely agree re: encapsulation, but I've always considered the Iterator interface to be super-weak for IO. There is no support for exception handling, and hasNext() is dangerous (when you begin chaining iterators that return lazy objects like IteratingRow) because it usually involves seeks.

> scanners can return IR
You'll note that that method is deprecated: IteratingRow is sticking around as a shim in this case, because I didn't want to spend too much time porting SSTable(Im|Ex)port to the Slice API quite yet. The same goes for the other @Deprecated methods on the SSTableScanner and SSTableReader base classes: I hope that we can remove them when the new file format is added.

> telling bad sign: CompactionIterator is 2x as long as it used to be.
This is mostly because I replaced ReducingIterator with a custom merge-sort, which may or may not be slower anyway. Additionally, we're doing a lot of work to collect slices into CFs, which we can remove once the writing half of compactions is memory efficient.
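A minimal sketch of the complaint about Iterator and IO (hypothetical code, not from the patch): java.util.Iterator declares no checked exceptions, so an IO-backed iterator must smuggle IOException out as an unchecked exception, and hasNext() may have to do real IO (a read-ahead, possibly a seek) just to answer a boolean:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical IO-backed iterator illustrating the two weaknesses described
// above: checked exceptions must be wrapped, and hasNext() forces a read.
abstract class RowSource<T> implements Iterator<T>
{
    private T next; // read-ahead buffer

    /** Read the next row from disk, or return null at EOF. May seek. */
    protected abstract T readNext() throws IOException;

    public boolean hasNext()
    {
        if (next == null)
        {
            try
            {
                next = readNext(); // hasNext() is no longer a cheap check
            }
            catch (IOException e)
            {
                throw new RuntimeException(e); // checked exception is lost
            }
        }
        return next != null;
    }

    public T next()
    {
        if (!hasNext())
            throw new NoSuchElementException();
        T result = next;
        next = null;
        return result;
    }

    public void remove() { throw new UnsupportedOperationException(); }
}
```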

Jonathan Ellis added a comment - 06/Mar/10 02:02 - edited

> I've always considered the Iterator interface to be super-weak for IO. There is no support for exception handling, and hasNext() is dangerous (when you begin chaining iterators that are returning lazy objects like IteratingRow) because it usually involves seeks.

Statically checked exceptions are a misfeature anyway, so that's a non-issue.

hasNext won't seek any more than a clunkier manual interface. Either way you need to (a) design your format so you don't have to do random I/O, and (b) buffer to avoid seeks when switching between different objects, and that is what we do.

The old iterator-based compaction, for instance, never does more seeks than another interface would given the same buffer sizes.

Stu Hood added a comment - 06/Mar/10 02:46

> hasNext won't seek any more than a clunkier manual interface
I guess it is actually the lazy object that might cause the problem I described.

Alright. So we convert Scanner back into an Iterator over a lazily fetched Slice subclass?

Jonathan Ellis added a comment - 06/Mar/10 18:37

Let's keep this simple.

The goal is to create an abstraction that (a) compaction code can apply to both old and new data formats, while (b) allowing for memory-efficient compactions on the new format, (c) making new-format indexing of subcolumns possible, and (d) ideally allowing up to 256 levels of new-format subcolumn nesting.

In particular, the goal does not include improving the efficiency of compaction of old-format data; if that falls out naturally, fine, but it's not really our goal.

Nor is it yet our goal to support the new format in this patchset, although maybe it should be. Compacting from old format to old format, with new data structures, is not part of our ultimate goal either, and making it an intermediate step may be making things harder than necessary. It may be simpler to introduce the new format first, so we can skip to compacting from old -> new and new -> new, not bothering with old -> old.

Are we on the same page?

I think the simplest way to get there is to simply continue using IColumn. It generalizes just fine to multiple levels, and the existing implementation knows how to use abstractions like mostRecentLiveChangeAt to handle tricky problems like tombstones. Throwing this away and starting over will lead us eventually to the same place. [Although certainly some parts, like getObjectCount, won't be needed and can ultimately be removed.] Also, sharing code between the old and new formats is, within reason, a good thing. So let's keep IColumn (I believe the analogue in your patch is Named?) and Column.

ColumnFamily + SuperColumn should be replaced with a more generalized structure supporting arbitrary nesting. Here I think ColumnGroup is a better name than Slice; we use the latter term in querying, which would be potentially confusing. But I think it would have a lot in common with the existing CF/SuperColumn code. Each ColumnGroup, like Column, only needs a byte[] name. No need to copy a lot of full paths around; experience with the existing code shows that this is unnecessary.

Mapping this to the old data format is hopefully clear, since it resembles it relatively strongly. What about the new format? Here we come back to my advocating that "all container information goes in the block header, followed by serialized Columns [not IColumns, just name-data-timestamp triples]." This is where we will need something like ColumnKey to contain column boundaries; that is, not in this patchset, unless you decide that actually introducing the new format here is the way to go.

Thus, for compaction, our algorithm goes something like "read all the header information at once and build the ColumnGroup structure in memory, then iterate through matching sub-ColumnGroups, merging as necessary." Since we read the header all at once, and then the subcolumns in order, all I/O within a single SSTable remains sequential.

It's not clear to me how to apply the old ReducingIterator approach to multilevel groups when the data to merge into one Block may be spread across multiple Blocks in another SSTable, although I find the iterator design very elegant and easy to confirm correctness in. So you are probably right that this has to change.

One other thing about header info / column key: it would be nice to come up with a scheme that doesn't repeat the full path in the description of each ColumnGroup [i.e., ColumnKey or its analogue], at least not on disk; in a heavily nested structure that would be a lot of duplication of the initial path elements, although presumably compression would mitigate this somewhat.

What do you think?
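A rough sketch of the ColumnGroup idea (hypothetical; the names and structure were not pinned down in this discussion): a single nestable container replaces both ColumnFamily and SuperColumn, and each node carries only its own byte[] name, with the full path implied by its position in the tree rather than copied around:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical leaf: just a name-data-timestamp triple, as proposed above.
class Column
{
    final byte[] name;
    final byte[] value;
    final long timestamp;

    Column(byte[] name, byte[] value, long timestamp)
    {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }
}

// Hypothetical generalized container: arbitrary nesting replaces the fixed
// two-level ColumnFamily/SuperColumn pair. Only the local name is stored.
class ColumnGroup
{
    final byte[] name;
    final SortedMap<String, ColumnGroup> groups = new TreeMap<>();
    final SortedMap<String, Column> columns = new TreeMap<>();

    ColumnGroup(byte[] name) { this.name = name; }

    /** Get or create a child group: groups may nest to any depth. */
    ColumnGroup addGroup(byte[] childName)
    {
        return groups.computeIfAbsent(new String(childName), k -> new ColumnGroup(childName));
    }

    void addColumn(Column c)
    {
        columns.put(new String(c.name), c);
    }

    /** Nesting depth of this subtree (1 for a group with only leaf columns). */
    int depth()
    {
        int max = 0;
        for (ColumnGroup g : groups.values())
            max = Math.max(max, g.depth());
        return 1 + max;
    }
}
```

In this shape, the old format maps directly: a ColumnFamily is a depth-2 group and a SuperColumn is its depth-1 child, while the new format could nest further.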

Stu Hood added a comment - 06/Mar/10 19:25

> Are we on the same page?
Absolutely: the only point I would add is (e) supporting range deletes.

> So let's keep IColumn (I believe the analogue in your patch is Named?) and Column.
The analogue in this patch was Slice: a slice has nested Metadata and a nested ColumnKey, and it supports merging and garbage collection (see SliceBuffer.merge() and SliceBuffer.garbageCollect()).

> Here I think ColumnGroup is a better name than Slice
I'm fine with changing the name of Slice to ColumnGroup, but the problem I was trying to solve by having a 'begin' and 'end' key in Slice is that it indicates you may not be holding the entire ColumnGroup in memory: for instance, when a ColumnGroup needs to be split into multiple blocks. Also, the reason we can support SliceBuffer.garbageCollect is that we know we have all columns between the two names.

> It's not clear to me how to apply the old ReducingIterator approach to multilevel groups. So you are probably
> right that this has to change.
I was thinking about this more last night, and I think you actually convinced me that Scanner should go back to being an Iterator... heh. When iterating over Slice objects (because they contain all of the information about a set of columns), you can build a SuperColumn with a ReducingIterator that collects all Slices with equal names at depth 1, and you can build a ColumnFamily by collecting all Slices with equal names at depth 0.

Also, CompactionIterator can go back to being a ReducingIterator that says all intersecting Slices are "equal", and reduces them using SliceBuffer.merge() and garbageCollect(). There would need to be one subtle change to ReducingIterator, to allow getReduced to return multiple objects (since merge() does).

> it would be nice to come up with a scheme that doesn't repeat the full path in the description of each
> ColumnGroup [i.e., ColumnKey or its analogue], at least not on-disk
We definitely agree on this: a block header storing as much key and metadata information as possible is definitely a Good Idea™.
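For readers unfamiliar with the pattern, here is a simplified stand-in for ReducingIterator (a sketch, not Cassandra's actual class): it consumes runs of "equal" items from a sorted source and emits one reduced value per run. CompactionIterator would define "equal" as "intersecting Slices" and reduce via SliceBuffer.merge():

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simplified sketch of the ReducingIterator pattern: group consecutive
// "equal" items from a sorted input and reduce each group to one output.
abstract class Reducing<In, Out> implements Iterator<Out>
{
    private final Iterator<In> source;
    private In pending; // first item of the next run, if we read ahead

    Reducing(Iterator<In> source) { this.source = source; }

    /** Do these two items belong to the same run? */
    protected abstract boolean isEqual(In a, In b);

    /** Reduce one complete run to a single output value. */
    protected abstract Out getReduced(List<In> run);

    public boolean hasNext() { return pending != null || source.hasNext(); }

    public Out next()
    {
        if (!hasNext())
            throw new NoSuchElementException();
        List<In> run = new ArrayList<>();
        run.add(pending != null ? pending : source.next());
        pending = null;
        while (source.hasNext())
        {
            In item = source.next();
            if (isEqual(run.get(0), item))
                run.add(item);
            else
            {
                pending = item; // belongs to the next run
                break;
            }
        }
        return getReduced(run);
    }

    public void remove() { throw new UnsupportedOperationException(); }
}
```

The "subtle change" mentioned above would relax getReduced to return an Iterator<Out> instead of a single value, so that one run of intersecting slices can reduce to several output slices.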

Stu Hood added a comment - 06/Mar/10 19:56

TODO:
1. Remove BloomFilter references from ColumnKey
2. Make Scanner back into an Iterator
3. Make CompactionIterator a subclass of ReducingIterator
4. Try to make ColumnKey.BEGIN and END both be empty byte arrays, and compare using reference equality
5. Rename Slice to ColumnGroup

Working on this now.

Jonathan Ellis added a comment - 06/Mar/10 20:13

I guess I didn't make it clear that the reason I think ColumnGroup should implement IColumn is so that we can port the rest of the old-format code to using it and drop CF/SC entirely, even in 0.7.

Stu Hood added a comment - 06/Mar/10 20:23

> the reason i think CG should implement icolumn is so that we can port the rest of the old-format code to using
> it and drop CF/SC entirely
Ahh... gotcha. I'll see what I can do there.

Stu Hood added a comment - 10/Mar/10 02:43

I managed to keep 0007-0010 fairly separated, but I can't imagine a good way to rebase this into the original patch set.

All Scanners now extend Iterator again, and Compaction extends a new MergingIterator class.

The MergingIterator class is almost a copy-paste of ReducingIterator, but it changes the return type of getReduced to an Iterator, and I didn't think we'd want to incur that overhead for other ReducingIterators.

Compaction is completely memory efficient now: the actual collection into an IteratingRow/CompactedRow happens externally in CompactionManager, so it will be easy to remove.

Jonathan Ellis added a comment - 10/Mar/10 17:38

Ack, piling more patches on top of the existing ones makes life hard for reviewers. This means we have to understand (a) the original code, (b) the original patchset, and (c) the final result. This is particularly onerous when there are big changes from (b) to (c).

I guess I will try to review by squashing everything down at once, but that defeats the purpose of having small patches.

Jonathan Ellis added a comment - 11/Mar/10 05:33

I spent some quality time with this patchset on the plane, but rather than focus on the trees I'd like to back up and look at the forest. Git sums up my misgivings here:

31 files changed [14 new], 2574 insertions, 521 deletions

There is too much code churn, and simultaneously too much left to be cleaned up or proved later: specifically, the new structures aren't used on the read/write paths or in the new data format that is their real raison d'être.

Let's reboot this. Save it off into a git branch (it will be useful), but let's start fresh and from a different direction. I apologize for not coming to this conclusion sooner. I guess I owe you two lunches now.

I think the most productive way to move forward on CASSANDRA-16 and company is as follows:

0. Create a compaction benchmark (rows compacted per second would be nice).
1. Change keys to byte[].
1.5. Run the stress.py benchmark and the compaction benchmark to establish a baseline. We should pick up some gains from this alone.
2. Replace ColumnFamily and SuperColumn with ColumnGroup, implementing IColumn and deleting IColumnContainer. Leave the compaction algorithm alone for now, and don't introduce Slice yet.
2.5. Run the stress.py benchmark for informational purposes.
3. Implement the new disk format, read + write, but no compaction yet. I strongly suspect that getting this working will greatly clarify how the supporting Slice etc. structures should look. Maybe it looks a lot like the current patchset, in which case you can think of this as an exercise in providing me with enough examples that I can understand it. (And it will be easy to crib code from your branch of this patchset.) Although I still think ColumnKey should be a ColumnKeyRange pairing begin + end...
3.5. Run the stress.py benchmark, and fix performance regressions if any.

Note that IMO it is okay to take a minor speed hit in the 2.5 benchmark, as long as we gain it back in 3. But if 3.5 shows that we got way slower, then it will be good to have the 2.5 data to show where the problem(s) are.

So at this point we can still compact old -> old, but not old -> new or new -> new. The last step is to remove the former and add the latter, and I suspect that the most natural order is:

4. Implement old -> new compaction.
5. Implement new -> new compaction.
6. Benchmark new -> new compaction, and compare with the results from step 0.

(Note: I think starting with old -> new is easier, since we can "cheat" to simplify the problem, knowing that old-format rows fit in memory. But maybe it is more natural to do new -> new first, or both together. It doesn't really matter.)

I think this gives us a natural progression to where we want to go, while adding a minimum of new structures at each step and staying confident that we haven't made changes without a good idea of the performance implications.
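Step 0 could be as simple as timing a compaction pass and dividing. A hypothetical sketch (compactAll() stands in for the real compaction entry point, and the workload below is simulated, not actual SSTable data):

```java
// Hypothetical rows-compacted-per-second harness for step 0. The Compactor
// interface and the simulated workload are illustrative stand-ins only.
public class CompactionBench
{
    interface Compactor
    {
        long compactAll(); // returns the number of rows compacted
    }

    /** Time one compaction pass and report throughput in rows per second. */
    static double rowsPerSecond(Compactor c)
    {
        long start = System.nanoTime();
        long rows = c.compactAll();
        double seconds = (System.nanoTime() - start) / 1e9;
        return rows / seconds;
    }

    public static void main(String[] args)
    {
        double rate = rowsPerSecond(() -> {
            long rows = 1_000_000;
            long checksum = 0;
            for (long i = 0; i < rows; i++)
                checksum += i; // simulated per-row merge work
            System.out.println("checksum: " + checksum);
            return rows;
        });
        System.out.printf("%.0f rows/sec%n", rate);
    }
}
```

Running the same harness against trunk and the patched branch, for both wide-row and narrow-row datasets, would give the before/after numbers the plan asks for in steps 1.5 and 6.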

Gary Dusbabek added a comment - 12/Mar/10 13:47

I'm not comfortable with these changes as they stand. From the perspective of a reviewer, they are daunting enough; add to that the number of changes, and I think we're looking at 2-3 weeks of an unstable trunk while the kinks are worked out, which will hamper other features.

I realize that this issue is just a part of CASSANDRA-16, but I think this issue should be broken into several pieces as well: 1) decide on the API we want to end up with, and meld it in or have it replace what we have; 2) alter SSTable read/write; 3) alter the compaction process. Yeah, this basically corresponds with what Jonathan outlined, but I think it is sensible.

Starting with step 0. from jbellis's suggested order, I implemented a compaction benchmark, and tested trunk and a rebased version of this patch. As expected, this patch is significantly faster for wide rows and slightly slower for narrow rows, but I'm sure that the performance can be improved by cleaning up the mergesort in SliceBuffer.merge().
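SliceBuffer.merge() itself isn't shown in this thread; as an illustrative sketch of the kind of mergesort cleanup being suggested (not the actual implementation), a heap-based k-way merge over sorted runs might look like:

```java
import java.util.*;

// Illustrative k-way merge over sorted runs: the general shape a cleaned-up
// SliceBuffer.merge() might take. All names here are hypothetical.
class KWayMerge {
    static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries: {value, run index, position within run}
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < runs.size(); i++)
            if (!runs.get(i).isEmpty())
                heap.add(new int[]{runs.get(i).get(0), i, 0});

        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(e[0]); // smallest head across all runs
            int next = e[2] + 1;
            if (next < runs.get(e[1]).size())
                heap.add(new int[]{runs.get(e[1]).get(next), e[1], next});
        }
        return out;
    }
}
```

The heap keeps only one element per run in memory at a time, which is the property that matters for wide rows.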

The reason I bring this up now is that I began work on 767, and realized how painful it was going to be to make format changes before we had settled on an interface to replace FileDataInput/SSTableScanner. We need to remove/improve those interfaces before we go about making sweeping changes like the String->byte[] refactor.

I think that this patch begins the 674 refactor in the correct place, so rather than starting over and losing more time, I would love to be able to clean up this patch and remove any structures that you guys think are excessive. If I can squash this down into much clearer patches and remove more of the speculative code, what are the chances of getting it committed?

Stu Hood
added a comment - 29/Mar/10 05:42

Why do we need to change FDI first? Treating keys as bytes shouldn't require a data format change. Remember we already have a method for interpreting on-disk keys differently in the IPartitioner code. In fact, treating it only as a format change is wrong: we need to preserve the old ordering, not just read it once and rewrite in a different order, because changing the comparison order could change the nodes that data is supposed to be on.

ISTM that the easiest way to make the String -> byte[] change is to make a new OPP that is strictly byte-oriented, update the old OPP to preserve the old utf8-based ordering, add a getFilterBytes to DecoratedKey, and have BloomFilter.add take a DK instead of a String (the only caller of add(String) already has a DK object, so no problem there).

So: DK will change to (Token, byte[]) instead of (Token, String); COPP will become (BytesToken, byte[]), decorating from bytes to bytes with the different collation order; the old OPP will become (StringToken, byte[]); RP will become (BigIntToken, byte[]); and we'll add a new BytesOPP (BytesToken, byte[]) where the decoration is a no-op, the way the current OPP behaves.
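As a rough sketch of the shapes described above (class and method names are illustrative; the real Cassandra classes differ):

```java
import java.util.Arrays;

// Hypothetical sketch of the proposed key types; specialized to the bytes
// partitioner for brevity. Not the actual patch code.
class BytesToken {
    final byte[] token;
    BytesToken(byte[] token) { this.token = token; }
}

// DecoratedKey as (Token, byte[]) instead of (Token, String).
class DecoratedKey {
    final BytesToken token;
    final byte[] key;
    DecoratedKey(BytesToken token, byte[] key) { this.token = token; this.key = key; }

    // Stand-in for the proposed getFilterBytes(): the bytes fed to BloomFilter.add
    byte[] getFilterBytes() { return key; }
}

// BytesOPP: decoration is a no-op, the key bytes themselves are the token.
class BytesOrderPreservingPartitioner {
    DecoratedKey decorate(byte[] key) {
        return new DecoratedKey(new BytesToken(key), key);
    }
}
```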

Jonathan Ellis
added a comment - 29/Mar/10 15:26

Alright, after the hiatus to implement byte[] keys, I'm back on this horse.

> 2. Replace ColumnFamily and SuperColumn with ColumnGroup, implementing IColumn,
> and deleting IColumnContainer.
I don't think that nested structures, each with their own iterators, are a good idea... especially when they may be hiding the fact that they are fetching columns from disk. And if they are not fetching transparently from disk, how do we make this any more memory-efficient than the current approach?

The beauty in the Slice approach is that a List<Slice> can represent any arbitrarily nested structure you can think of, and yet the Slices are still autonomous.
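As an illustrative sketch of that idea (fields and names hypothetical, not the patch's actual code), a Slice that carries its full key path can encode arbitrary nesting in a flat list:

```java
import java.util.*;

// Hypothetical Slice: each slice carries its full key path, so a flat
// List<Slice> can represent any nesting -- e.g. [rowKey] for a standard CF,
// [rowKey, superColumnName] for a super CF -- while each slice stays autonomous.
class Slice {
    final List<byte[]> keyPath;
    final SortedMap<byte[], byte[]> columns;

    Slice(List<byte[]> keyPath, SortedMap<byte[], byte[]> columns) {
        this.keyPath = keyPath;
        this.columns = columns;
    }

    // Slices order globally by comparing key paths element-wise, so a merge
    // never needs to materialize a parent row or super column object.
    static int compare(Slice a, Slice b) {
        int n = Math.min(a.keyPath.size(), b.keyPath.size());
        for (int i = 0; i < n; i++) {
            int c = Arrays.compare(a.keyPath.get(i), b.keyPath.get(i));
            if (c != 0) return c;
        }
        return Integer.compare(a.keyPath.size(), b.keyPath.size());
    }
}
```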

EDIT: Erased an offtopic point.

> 3. Implement new disk format, read + write, but no compaction yet.
I'm not sure how this is supposed to work: is the idea that we would break backwards compatibility in trunk, and then restore it later on in your steps 4,5,6?

Stu Hood
added a comment - 06/Apr/10 04:20 - edited

Jonathan Ellis
added a comment - 06/Apr/10 21:40
> I don't think that nested structures, each with their own iterators is a good idea... especially when they may be hiding the fact that they are fetching columns from disk.
Sure it is. You're trying to make the api sane. There's nothing "hidden" about the fact that they read from disk; that is what buffering is for.
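To make the buffering point concrete, here is a hypothetical sketch (none of these names are from Cassandra) of an iterator that fetches columns in fixed-size batches, so the disk access behind a plain Iterator is amortized rather than hidden:

```java
import java.util.*;

// Hypothetical buffered column iterator: reads from an underlying source in
// batches so callers see a plain Iterator while I/O is amortized. The String
// source stands in for a disk-backed reader; all names are illustrative.
class BufferedColumnIterator implements Iterator<String> {
    private final Iterator<String> source;
    private final int batchSize;
    private final Deque<String> buffer = new ArrayDeque<>();

    BufferedColumnIterator(Iterator<String> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    public boolean hasNext() {
        if (buffer.isEmpty())
            for (int i = 0; i < batchSize && source.hasNext(); i++)
                buffer.add(source.next()); // one "disk" fetch fills a batch
        return !buffer.isEmpty();
    }

    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return buffer.poll();
    }
}
```

Only one batch is resident at a time, which is what bounds memory regardless of row width.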
> I'm not sure how this is supposed to work: is the idea that we would break backwards compatibility in trunk, and then restore it later on in your steps 4,5,6?
The idea is you introduce the new format w/o breaking the old one (which continues to do old -> old compaction in the meantime).

Jonathan Ellis
added a comment - 06/Apr/10 21:44
> The idea is you introduce the new format w/o breaking the old one (which continues to do old -> old compaction in the meantime).
... to clarify, first you implement the old format in terms of the new structures [step 2], then you implement the new format [step 3].