cassandra-commits mailing list archives

[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exist for the same partition and interval

Date

Thu, 07 Apr 2016 11:53:25 GMT

[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230127#comment-15230127 ]
Stefan Podkowinski commented on CASSANDRA-11349:
------------------------------------------------
[~frousseau], what makes things more complicated here is that changes to LCR will affect regular compactions as well. Adding all tombstones as expired in your {{11349-2.1-v2.patch}} will have unwanted side effects for regular compactions, e.g. try {{RangeTombstoneMergeTest}} with it.
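For illustration, here's a minimal standalone sketch of the merge behaviour that regular compaction (and {{RangeTombstoneMergeTest}}) relies on; the {{RangeTombstone}} record and {{reconcile}} helper are simplified stand-ins, not Cassandra's actual code:
{noformat}
// Simplified model: what a regular compaction does with two range tombstones
// covering the same interval -- they reconcile into a single tombstone that
// carries the newest deletion timestamp.
public class RangeTombstoneMergeSketch {
    record RangeTombstone(String start, String end, long markedAt) {}

    // For identical ranges the newer deletion supersedes the older one.
    static RangeTombstone reconcile(RangeTombstone a, RangeTombstone b) {
        return a.markedAt() >= b.markedAt() ? a : b;
    }

    public static void main(String[] args) {
        RangeTombstone first  = new RangeTombstone("b", "b", 1000L);
        RangeTombstone second = new RangeTombstone("b", "b", 2000L);

        // Compaction output: one tombstone with markedAt = 2000.
        System.out.println(reconcile(first, second));
    }
}
{noformat}
Any change to LCR that alters this behaviour for regular compactions is exactly the kind of side effect the patch needs to avoid.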
I've now spent some time trying to make use of the RT.Tracker there, but without much success. Adding non-expired range tombstones to the tracker from within LCR would cause corrupted sstables. Even creating an edge case just for validation compaction would not handle all potential TS shadowing scenarios and would probably cause more harm than good (and potential digest mismatch storms). I'm not even sure it's possible given the current iterative MergeIterator > LazilyCompactedRow > RT.Tracker interaction.
I'm now at a point where I'd suggest just sticking with {{11349-2.1.patch}}, unless someone else has a better idea of how to solve this. I've updated the [dtest PR|https://github.com/riptano/cassandra-dtest/pull/881] with two of the described shadowing scenarios that will only work with 3.0+ even after the patch, if someone wants to give it a try.
Cassci results for {{11349-2.1.patch}}:
||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|
> MerkleTree mismatch when multiple range tombstones exist for the same partition and interval
> ---------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-11349
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
> Project: Cassandra
> Issue Type: Bug
> Reporter: Fabien Rousseau
> Assignee: Stefan Podkowinski
> Labels: repair
> Fix For: 2.1.x, 2.2.x
>
> Attachments: 11349-2.1-v2.patch, 11349-2.1.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.
> After investigation, it appears that, if two range tombstones exist for a partition for the same range/interval, they're both included in the merkle tree computation.
> But, if for some reason, on another node, the two range tombstones were already compacted into a single range tombstone, this will result in a merkle tree difference.
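> For illustration, a minimal standalone sketch of why the two representations hash differently (plain Java; {{digestTombstone}} is a hypothetical stand-in, not the actual validation compaction code):
> {noformat}
> import java.security.MessageDigest;
> import java.util.Arrays;
>
> // Standalone model: validation compaction hashes every fragment it streams,
> // so two overlapping range tombstones digest differently than the single
> // tombstone they would compact into.
> public class MerkleMismatchSketch {
>     // Hypothetical stand-in for feeding one range tombstone into the digest.
>     static void digestTombstone(MessageDigest d, String start, String end, long markedAt) {
>         d.update(start.getBytes());
>         d.update(end.getBytes());
>         d.update(Long.toString(markedAt).getBytes());
>     }
>
>     public static void main(String[] args) throws Exception {
>         // Node 1: both tombstones flushed, not yet compacted together.
>         MessageDigest node1 = MessageDigest.getInstance("MD5");
>         digestTombstone(node1, "b", "b", 1000L);
>         digestTombstone(node1, "b", "b", 1000L);
>
>         // Node 2: compaction already collapsed them into one tombstone.
>         MessageDigest node2 = MessageDigest.getInstance("MD5");
>         digestTombstone(node2, "b", "b", 1000L);
>
>         // Different leaf hashes -> MerkleTree mismatch -> overstreaming.
>         System.out.println(Arrays.equals(node1.digest(), node2.digest())); // false
>     }
> }
> {noformat}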
> Currently, this is clearly bad because MerkleTree differences are dependent on compactions (and if a partition is deleted and created multiple times, the only way to ensure that repair "works correctly"/"doesn't overstream data" is to major compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
> c1 text,
> c2 text,
> c3 float,
> c4 float,
> PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair, an accumulation of many small SSTables (up to thousands for a rather short period of time when using VNodes, until compaction absorbs those small files), but also an increased size on disk.