Jonathan Ellis
added a comment - 02/Sep/10 23:35 - edited As discussed here and in CASSANDRA-1072, this ticket is a dead end for counting, and the primary use case for classic vector clocks of merging non-conflicting updates to different fields w/in a value is already handled by Cassandra breaking a row into columns.
Is there a good reason not to close this?

Kelvin Kakugawa
added a comment - 18/Aug/10 03:26 Hi Sylvain,
I read through your patch:
0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch
It makes sense and generalizes to 1072. I agree that it's better than the current impl in 580.
If you don't mind, I'm working on integrating the above diff w/ 1072 logic, here:
https://issues.apache.org/jira/browse/CASSANDRA-1397

Sylvain Lebresne
added a comment - 16/Aug/10 17:40 I believe the attached patch (580-version-vector-wip.patch) has a problem. At
CL.ZERO and CL.ONE, it doesn't replicate writes (the ones using version
vectors) at all (SP.updateDestinationByClock() clears the destinationEndpoints
but still returns an empty collection). This is (overly) unsafe.
This could certainly be fixed by adding a new WriteResponseHandler for those
cases. But I believe there is a much better alternative.
This alternative consists of changing the write protocol (for version vectors
only, of course) to do the following (note that the protocol of the current
patch is already different from the one for timestamps):
1) A node receives a write request (with a version vector clock) from a client.
If it is a replica for the write, go to 3); otherwise go to 2).
2) The node delegates the write to one replica (along with the requested CL) and
then waits only for an ack from that replica before answering the client (it
doesn't replicate anything itself).
3) The chosen replica applies the mutation locally first (we must do this before
replication).
4) It then sends the mutation to the other replicas, waiting for however many
responses are necessary to achieve the requested consistency.
To make this more concrete, I'm attaching a patch (0001-Add-handler-to-delegate-the-write-protocol-to-a-repl.patch)
that implements this protocol (it all starts in SP.delegateMutateBlocking()).
A small disclaimer: this should work but is not really tested (so please be
nice). The function RowMutation.updateBeforeReplication() can safely be
ignored on a first read, but it would be needed if #1072 were to use this. It
could also probably be slightly optimized by allowing the
DelegatedRowMutationVerbHandler to handle multiple mutations at once. This is
also just the protocol mentioned above; #580 would have to be rebased on top
of this.
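For illustration only, the four-step protocol above might be sketched as follows. All names here (Replica, coordinate(), replicaWrite()) are hypothetical stand-ins, not the actual patch, and replication is shown sequentially for brevity where the real protocol would send in parallel:

```java
import java.util.List;

class DelegatedWriteSketch {
    interface Replica {
        boolean isLocal();
        void apply(String mutation);              // apply the mutation locally
        boolean sendAndAwaitAck(String mutation); // send to a peer and wait for its ack
    }

    // Steps 1-2: if the coordinator is not a replica, it forwards the write
    // to one replica and waits only for that replica's ack.
    static boolean coordinate(String mutation, List<Replica> replicas, int consistency) {
        for (Replica r : replicas)
            if (r.isLocal())
                return replicaWrite(r, mutation, replicas, consistency);
        // not a replica: delegate to one replica along with the requested CL
        return replicas.get(0).sendAndAwaitAck(mutation);
    }

    // Steps 3-4: the chosen replica applies locally first, then replicates,
    // waiting for enough acks to satisfy the requested consistency level.
    static boolean replicaWrite(Replica self, String mutation,
                                List<Replica> replicas, int consistency) {
        self.apply(mutation);  // local apply must happen before replication
        int acks = 1;          // our own write counts toward consistency
        for (Replica r : replicas) {
            if (r == self) continue;
            if (r.sendAndAwaitAck(mutation)) acks++;
            if (acks >= consistency) return true;
        }
        return acks >= consistency;
    }
}
```

The point the sketch makes concrete is that the coordinator never replicates itself; only a replica does, and only after its own local apply.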
Anyway, I think this alternative is superior to the one used by the currently
attached #580 patch, for the following reasons:
- The protocol used by the current patch (write to one replica, wait for the
ack and then replicate to the others; it differs from what I propose in
that this is done from a potentially non-replica node) doesn't work for
#1072 (because of a potential race condition with read repairs). The
protocol I'm proposing does not suffer from this problem and (I'm quite
convinced; let's hope I'm not wrong) would reconcile #1072 with the EC
model of Cassandra. This is obviously the most important point.
- It is slightly faster (network-latency-wise), as we don't wait for a full
round-trip to a node before starting the replication.
- It more cleanly separates the protocols of timestamped writes and versioned
ones (without much code duplication, really). I suppose whether this is
better is more a matter of opinion, but at the very least it makes it
clearer that version vectors neither slow down nor break the other writes.
I'd be happy if someone had a look at this and confirmed that I'm not
completely wide of the mark. If I'm not, I may be able to spare some cycles
merging this idea with #580 (and #1072).

Kelvin Kakugawa
added a comment - 21/Jul/10 18:21 Both 580 and 1072 are against 0.7.
580 works now. However, I'd like to add the improvements that Cliff and Andy recommended.
1072 is closer, but we're writing up a design document for how distributed counters work.


Kelvin Kakugawa
added a comment - 19/Jul/10 02:05 580 and 1072 are forks of each other. So, either could be applied first.
Basically, 1072 was built off of the original 580 work. However, 1072 made a lot of various improvements to the underlying code. So, when I brought 580 up to trunk, I incorporated all of the underlying improvements in 1072, but replaced the counter-specific code w/ the vector clock implementation.
I believe more ppl are interested in 1072. So, after that gets committed to trunk, I'll refactor the above patch to let it be applied on top of 1072.

Johan Oskarsson
added a comment - 18/Jul/10 21:41 CASSANDRA-1243 was my somewhat failed attempt to split out code from CASSANDRA-1072 to make that patch more manageable, but there is not really a clear cut component to split out so I have canceled that patch. I can close it for now and reopen if we want to try again. Alternatively create a few smaller and more focused issues instead. It won't make much of a dent in 1072 though.
The idea is to get CASSANDRA-1072 committed first and then resume work on CASSANDRA-580. They both use a lot of the same changes that we have already committed to Cassandra (clocks, reconciler) and probably still share a bit of code.

Kelvin Kakugawa
added a comment - 18/Jul/10 10:33 - edited bringing 580 up to trunk (git@714eab0c4079276eefcdd5a1d8d9ca7b225c9f80)
TODO:
1) modify version vector node ids to use a unique, per-connection id (thanks Cliff Moon and Andy Gross for pointing this out as an improvement)
note: see Riak's implementation, which derives unique client ids based on erlang process id.
pros: less replica coordination wrt node ids.
necessary evil: must add timestamps to version vector tuples and prune the version vector (based on time).
2) modify tuple format
add timestamp (to allow version vector pruning)
shorten the count type to int, instead of long (node ids will be updated for a much shorter period of time)

Kazuki Aranami
added a comment - 15/Jul/10 13:40 Hello.
I think the implementation of this vector clock is very important to the architecture.
Has the work on vector clocks stopped?
I am concerned about the influence on the roadmap.
0.7 roadmap
http://www.mail-archive.com/dev@cassandra.apache.org/msg00404.html
There is also this project:
itclocks
http://code.google.com/p/itclocks/
thanks!

Sylvain Lebresne
added a comment - 06/May/10 12:16
Can you create a new (sub) ticket for that change and attach the patch to it?
Sure: CASSANDRA-1058
Also, it looks like the actual ColumnType java file didn't make it into the patch.
Ahah, my bad. Forgot the add. That's corrected in the patch attached to #1058

Johan Oskarsson
added a comment - 06/May/10 11:40 Sylvain: Good work splitting out the ColumnType into a new patch. Can you create a new (sub) ticket for that change and attach the patch to it? Also, it looks like the actual ColumnType java file didn't make it into the patch.

Sylvain Lebresne
added a comment - 06/May/10 10:34 The attached patch is the first part of the WIP patch (ColumnType as enum) rebased
against trunk. It ships the addition of the ColumnType in SuperColumns, which is
not strictly needed per se but makes sense here, I believe (note that this patch depends
on CASSANDRA-1054).
Since I did not write this but merely rebased it, I'll +1 it (just this first part, of course) at
the same time.

Kelvin Kakugawa
added a comment - 05/May/10 22:38 Hey Sylvain,
Any help is appreciated. I'm still testing and refactoring the counters logic.
However, the ColumnType as enum is pretty distinct, so if you want to rebase that and submit the patch I'm totally fine w/ that. Thanks.
-Kelvin

Sylvain Lebresne
added a comment - 05/May/10 20:55 Where are you with this patch splitting/rebasing? In the process of
giving a hand reviewing this, I've extracted the first part
(ColumnType as enum) and rebased it against trunk. I certainly don't
want to go behind your back, but if it can help I can attach it so
that maybe this first part gets committed.

Kelvin Kakugawa
added a comment - 14/Apr/10 17:06 Yes, you're right. This WIP patch will be broken out into 3 separate patches: ColumnType as enum, VC extension, and counters extension.
And, most of the TODO: Test comments can be covered by system tests.

Jonathan Ellis
added a comment - 14/Apr/10 13:52 If there is any "prep" refactoring in here, it should be pulled out into a separate patch. It's impossible to tell, looking at a 300k patch.
Adding system tests should make most of the TODO: TESTs go away, right?

Kelvin Kakugawa
added a comment - 14/Apr/10 05:22 WIP patch against svn r923420
This is the last patch we have against cassandra 0.6. The internal version has more test coverage and bug fixes, but it needs to be backported.

Jonathan Ellis
added a comment - 22/Jan/10 20:01 kelvin, can you r/m patches from this that are of historical interest only, for the benefit of those who want to see what the current (or at least, most-recently-posted) patchset is?

Kelvin Kakugawa
added a comment - 06/Jan/10 02:10 Moved utility methods in VersionVectorContext to FBUtilities.
Modified impl to use System.arraycopy().
Modified internal methods to be protected/private (depending on whether necessary for testing).
Removed now unused methods.
Note: FBUtilities.compareByteSubArrays() will throw an IllegalArgumentException if a length is passed in that extends past either array.
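The documented IllegalArgumentException behavior, together with explicit null handling, might look like the sketch below. This is a reconstruction for illustration, not the committed FBUtilities code; in particular, the unsigned byte comparison is an assumption:

```java
class ByteCompareSketch {
    // Lexicographically compare length bytes of bytes1 (from offset1) against
    // bytes2 (from offset2). Throws IllegalArgumentException if either range
    // extends past the end of its array, as described for the real method.
    static int compareByteSubArrays(byte[] bytes1, int offset1,
                                    byte[] bytes2, int offset2, int length) {
        if (bytes1 == null && bytes2 == null) return 0;
        if (bytes1 == null) return -1;  // treat null as sorting before non-null
        if (bytes2 == null) return 1;
        if (offset1 + length > bytes1.length || offset2 + length > bytes2.length)
            throw new IllegalArgumentException("length extends past end of an array");
        for (int i = 0; i < length; i++) {
            int a = bytes1[offset1 + i] & 0xff;  // compare as unsigned bytes (an assumption)
            int b = bytes2[offset2 + i] & 0xff;
            if (a != b) return a < b ? -1 : 1;
        }
        return 0;
    }
}
```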

Kelvin Kakugawa
added a comment - 05/Jan/10 19:03 For the initial server-side reconciliation, we'll probably adopt an approach similar to the way custom ColumnFamily comparators are specified.
It might help to refer to:
http://wiki.apache.org/cassandra/StorageConfiguration
In particular, the section called, "Keyspaces and ColumnFamilies," where they discuss compareWith and how you can extend org.apache.cassandra.db.marshal.AbstractType to use your own ColumnFamily comparator.
For the initial implementation to get this out the door, we won't support a javascript API. However, I'm open to suggestions. It probably would be nice to support a high-level scripting language for certain components.
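Schematically, the extension point described on that wiki page is a pluggable byte[] comparator. The classes below are local stand-ins, not the real org.apache.cassandra.db.marshal.AbstractType (whose exact signature may differ), meant only to show the shape of the pattern:

```java
import java.util.Comparator;

// Stand-in for the AbstractType base class: a comparator over raw column names.
abstract class AbstractTypeSketch implements Comparator<byte[]> {
    public abstract int compare(byte[] o1, byte[] o2);
}

// Example "custom comparator": order column names as unsigned byte strings.
class BytesTypeSketch extends AbstractTypeSketch {
    public int compare(byte[] o1, byte[] o2) {
        int n = Math.min(o1.length, o2.length);
        for (int i = 0; i < n; i++) {
            int a = o1[i] & 0xff, b = o2[i] & 0xff;  // unsigned byte order
            if (a != b) return a < b ? -1 : 1;
        }
        return Integer.compare(o1.length, o2.length);  // shorter prefix sorts first
    }
}
```

The suggestion in the comment is that server-side reconciliation could be made pluggable the same way: users subclass a base type and name their class in the configuration.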

Pedro Gomes
added a comment - 05/Jan/10 16:43 I have been following this issue for some time now, and I'm curious how you will deal with reconciliation.
I suppose, to start the discussion, that some sort of interface will be implemented to allow pluggable logic to be added to the server; personalized scripts were an idea I have heard.
Is something planned? As a suggestion: the Java Scripting API, with scripts stored in Cassandra?
Sorry if I seem hasty, but I will probably work on some dummy implementation, as I need this for my work, and I didn't want to diverge from a future release of Cassandra.
If you can shed some light, I would be thankful.

Kelvin Kakugawa
added a comment - 04/Jan/10 22:39 Thanks for your comments.
(1) I thought about putting the utility methods in FBUtilities. However, I left them in VVC for this pass. I can move them over.
(2) You're right about cBSA. I didn't implement the cases that were already handled by VVC. However, if I move it to FBUtilities, I'll look for all the missing cases.
(3) Thanks for the heads up. I'll look into replacing those loops w/ arraycopy.
(4) My goal is to keep the context an opaque array, because I want to support other version implementations, e.g. interval tree clocks, which have a different format. So, if I wanted to use an object representation, VVC would have to internally inflate the opaque context. However, the manual byte manipulation isn't as easy to read as an object-based implementation, and this was a concern of mine.
Right now, for the reconcile() method, I'm probably going to modify its interface. Instead of using a List of Pair objects, I have a new IColumn impl, VectorColumn, that would probably be more appropriate than Pair.

Jaakko Laine
added a comment - 04/Jan/10 09:10 Read through the patch; comments/questions follow:
(1) Utility methods should be put into FBUtilities
(2) compareByteSubArrays seems to have the following issues: (a) it does not handle the case where bytes2 is null and bytes1 is non-null (b) it does not handle the case where the 'length' parameter is bigger than the byte array length
(3) bytes are copied in manual for loops. Is there a reason for not using arraycopy? It would perhaps make the code slightly cleaner and faster (don't know about the latter, though, as this involves copying elements within the same array)
(4) Would it make sense to use some level of data abstraction? Sorting and comparisons always involve copying and handling of individual bytes, which makes the code slightly cumbersome to read, and might be more inefficient than using object references as well (again, not sure about this; I have to do some research).

Kelvin Kakugawa
added a comment - 24/Dec/09 15:27 potential incompatibility w/ IColumn and IColumnContainer interfaces.
The key issue is that Column and SuperColumn rely on timestamps to indicate when deletion should occur. In particular, through these methods:
getMarkedForDeleteAt()
mostRecentLiveChangeAt()
A separate, less interesting, issue is that SuperColumn auto-assumes Column for its sub-column serializer.
Right now, I've put together a new class, VersionColumn, that's context-based. I've added context() methods to IColumn and added some instanceof checks. However, I'm still exploring the implementation for the delete logic.
atm, I'm trying to avoid having to add a SuperVersionColumn class. However, what is an interesting way to define the version "context" of a SuperColumn? An aggregated context of all the sub-columns may be worth exploring. Alternatively, since all the updates are timestamped, we could use that as a rough approximation for mostRecentLiveChangeAt(), but that seems to break the spirit of versioned contexts.

Kelvin Kakugawa
added a comment - 23/Dec/09 06:07 re: the node id format.
If Cassandra supports IPv6, then it might be advantageous to just use the IPv4 portion of it. If that's possible. Otherwise, IPv6's 16 bytes is kind of rough.

Kelvin Kakugawa
added a comment - 23/Dec/09 05:54 - edited The patch adds a new package:
db/context
which contains:
IContext
VersionVectorContext
IContext is just a general interface to manipulate context byte[]. Basically, create(), update() and reconcile().
create(): create a new/empty context.
update(): update this context w/ the local node's id.
reconcile(): pass in a list of context-value pairs and it'll return a merged context (that supersedes all the passed in contexts) and a list of values that it couldn't automatically reconcile.
VersionVectorContext is a version vector implementation.
The VV format is a concatenated list of node id (IPv4's 4 bytes), count (int), and timestamp (long) tuples in a byte[].
create(): returns an empty byte[].
update(): will look for the local node's tuple in the byte[], increment its count, then prepend it to the front of the byte[] w/ an updated timestamp, so that the byte[] is always in timestamp-descending order.
reconcile(): looks for all disjoint (incompatible) VVs and collapses all VVs that are a subset of another VV in the list. (implementation note: if 2 VVs are equal, but their values are not equivalent, both values will be added to the set of values that need to be manually reconciled. It seems inefficient, though, so when I go through the rest of the system, I'm going to see if I can avoid this check. Since, it's a problem that can only happen on the local node.)
VersionVectorContext helper methods of interest:
compareContexts(): sorts contexts by id, then steps through both contexts to determine pairwise: equality, superset, subset, disjoint.
mergeContexts(): creates a map from node id to count-timestamp pairs, then creates a timestamp-sorted array and pulls off up to the max entries to form the new merged context.
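To make the tuple layout concrete, here is a minimal sketch of create()/update() over the described format (4-byte node id, int count, long timestamp, most recently updated tuple first). It is illustrative only, not the actual VersionVectorContext; reconcile() and merging are omitted, and countOf() is a made-up helper:

```java
import java.nio.ByteBuffer;

class VersionVectorSketch {
    static final int TUPLE = 4 + 4 + 8;  // node id (4 bytes) + count (int) + timestamp (long)

    static byte[] create() {
        return new byte[0];  // an empty context
    }

    // Increment the local node's count and move its tuple to the front with a
    // fresh timestamp, keeping the byte[] in timestamp-descending order.
    static byte[] update(byte[] context, int nodeId, long now) {
        ByteBuffer in = ByteBuffer.wrap(context);
        ByteBuffer rest = ByteBuffer.allocate(context.length);
        int count = 0;
        while (in.remaining() >= TUPLE) {
            int id = in.getInt();
            int c = in.getInt();
            long ts = in.getLong();
            if (id == nodeId) count = c;                 // remember and drop our old tuple
            else rest.putInt(id).putInt(c).putLong(ts);  // keep everyone else's, in order
        }
        ByteBuffer out = ByteBuffer.allocate(TUPLE + rest.position());
        out.putInt(nodeId).putInt(count + 1).putLong(now);  // prepend the fresh tuple
        out.put(rest.array(), 0, rest.position());
        return out.array();
    }

    // Linear scan for a node's count (0 if the node never updated this value).
    static int countOf(byte[] context, int nodeId) {
        ByteBuffer in = ByteBuffer.wrap(context);
        while (in.remaining() >= TUPLE) {
            int id = in.getInt();
            int c = in.getInt();
            in.getLong();
            if (id == nodeId) return c;
        }
        return 0;
    }
}
```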

Jonathan Ellis
added a comment - 07/Dec/09 20:45 If it's going to be complex – and it sounds like it is – I'd be inclined to prefer the "timestamped vector clock, with truncation" approach. (And since it's opaque to the client this can be changed later if we determine that the truncation actually is a problem in practice.)

Kelvin Kakugawa
added a comment - 07/Dec/09 20:31 I've been talking w/ the authors of the interval tree clocks (ITC) paper about how to apply ITC to Cassandra, and it looks like we may need to modify the ITC algorithm for our use-case.
The crux of the matter is Cassandra's hinted hand-off feature. The ITC algorithm composes an id-tree and event-tree to represent the version of a given value. The id-tree is a nice way to create unique ids on-the-fly for any node (by splitting the id-tree, as necessary) and the event-tree represents causality. However, the problem is that for a node to update the event-tree for a value, it has to be assigned a part of the id-tree beforehand.
A short example follows:
if a node tries to forward a value but (because of failure scenarios) has to store the value locally, it wouldn't be able to update the value's version unless it had been assigned a part of the id-tree beforehand by the set of nodes responsible for the value.
The authors have a couple of solutions:
1) Split the id-tree between all nodes in the cluster from the very start. This solves the problem, but it does mute the attractive benefits of ITC over traditional version vectors, i.e. dynamically partitioning the id space at run-time, and only to the extent necessary to conserve space.
2) On client reads, doing a "fork" instead of a "peek" and sharing the id-tree w/ the client. However, this is a more complicated approach that may need to be worked out some more.
In any case, since we're using an opaque context, these decisions won't affect the interface. However, it's an interesting implementation concern. Depending on the average size of a Cassandra cluster, it may or may not be worth pre-forking the id-tree to all nodes from the very start.

Jonathan Ellis
added a comment - 02/Dec/09 05:34 making it a byte[] context makes sense.
replacing the existing timestamps with that does not, since they are client-provided by design which is the opposite of the VC context.

Kelvin Kakugawa
added a comment - 02/Dec/09 05:29 Thanks Stu for the lead.
I studied the Interval Tree Clocks paper and it looks promising. Let me work out the algorithm, so that we can investigate its implementation in Cassandra.
Having said the above, I think we should re-do the "clock" aspect of the interface and make it an opaque context object (like Dynamo). We pass it out to clients on a read, and when they update a given value they pass back the context that they're updating. I'm not sure if we want to extend the concept so far as to make existing timestamps just a special case of an opaque context, though.

Stu Hood
added a comment - 01/Dec/09 23:05 Regarding the timestamp being necessary in a version vector: have you looked at Interval Tree Clocks? The paper is slightly over my head, but the algorithm is supposed to be a generalization of version vectors / vector clocks, and it has a natural solution for changing members: http://en.wikipedia.org/wiki/Version_vector#cite_note-5

Kelvin Kakugawa
added a comment - 01/Dec/09 23:00 I think you're right about the timestamps not necessarily being useful to a client. It shouldn't be part of the interface, since it's really an implementation detail to manage the size of the vector. I'll make it internal to the server.

Jonathan Ellis
added a comment - 01/Dec/09 22:54 ... but shouldn't that timestamp be internal to the server? We definitely don't want the client to specify what amounts to an implementation detail, and I don't think there is any reason to send it to the client on a read, either.

Kelvin Kakugawa
added a comment - 01/Dec/09 22:49 Ah, I see what you were going after. That's a more concise interface change.
My understanding about including a timestamp along w/ the logical clock is to limit the potential growth of the vector. Basically, whenever a new node updates a value, the vector size grows by one. However, the problem is that if many different nodes happen to update a given value (for various reasons--failure scenarios, etc.), the potential size of a vector could grow to an unmanageable length and it would keep that length forever. So, the Dynamo authors chose to tag each update with a timestamp, so they could truncate the vector to only the last 10 nodes to update the value. There is a possibility that an inconsistency could arise, because of the truncation. However, the paper said in practice, it was a non-issue.
In summary, the timestamp is not there to help resolve consistency problems, it's there to make the vector more manageable.
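The truncation scheme Kelvin describes could be sketched as follows (names are hypothetical; a vector is assumed to be a dict of node id to a (counter, last-update timestamp) pair, capped at the Dynamo paper's 10 entries):

```python
import time

MAX_ENTRIES = 10  # the Dynamo paper truncates to the last 10 updaters

def record_update(clock, node_id, now=None):
    """Bump node_id's counter and truncate the vector if it grew too large.

    clock: {node_id: (counter, last_update_ts)} -- returned as a new dict.
    """
    now = time.time() if now is None else now
    counter, _ = clock.get(node_id, (0, 0.0))
    clock = dict(clock)
    clock[node_id] = (counter + 1, now)
    if len(clock) > MAX_ENTRIES:
        # Drop the entries with the oldest update timestamps. This is the
        # step that can (rarely) manufacture a false conflict, per the
        # truncation caveat discussed above.
        oldest = sorted(clock, key=lambda n: clock[n][1])
        for n in oldest[:len(clock) - MAX_ENTRIES]:
            del clock[n]
    return clock
```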

Jonathan Ellis
added a comment - 01/Dec/09 19:38 The reason I switched to favoring server-side-only is that it lets us get by with much smaller API changes, as in the attached.
Also, why do we need a timestamp in LogicalClock when we have the counter? I thought the whole point was to get away from the problems posed by timestamps.

Kelvin Kakugawa
added a comment - 25/Nov/09 04:57 proposed vector clock interface
The vector clock is returned by gets and required for insert/remove. I assume that server-side conflict resolution will be implemented, so I only pass back the definitive version of the row. If client-side conflict resolution is implemented in the future, another data structure will be needed that encapsulates a list of conflicts and, maybe, a summary vector clock to be used for an insert/remove operation that resolves the conflict.
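The get/insert round-trip described above hinges on being able to order two clocks. A minimal, assumed sketch of that comparison, with each vector represented as a node-to-counter dict:

```python
def compare(a, b):
    """Compare two version vectors ({node: counter} dicts).

    Returns "equal", "descends" (a supersedes b), "precedes"
    (b supersedes a), or "concurrent" (a genuine conflict).
    """
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "descends"
    if b_ge:
        return "precedes"
    return "concurrent"  # only here does conflict resolution get invoked
```

Server-side resolution (or the conflict-list structure Kelvin mentions) would only come into play in the "concurrent" branch; the other three cases resolve syntactically.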

Jonathan Ellis
added a comment - 25/Nov/09 03:14 Yes, that's how you'd have to change it. I'd rather not; it would get messy.
(If we do server-side resolution we could still conceivably support both "classic" columnfamilies and vector clocked ones in the same Column and SuperColumn objects, just differing in their clock/timestamp field. But if we have to potentially store multiple versions of a column in a single row for the vector clock version, then I think that is diverging too far and we'd have to split the implementation. Remember that a classic ColumnFamily object just has a hash of its columns by name.)

Stu Hood
added a comment - 25/Nov/09 03:01 > if the client has to resolve writes, you can no longer compact without involving the client.
Kind of: you can still compact SSTables, but you need to keep all conflicting versions, which is why options (a) and (c) are still feasible.

Kelvin Kakugawa
added a comment - 25/Nov/09 02:59 I'll take Stu and Jonathan's advice and restrict the scope of this ticket to server-side conflict resolution.
If client-side resolution is interesting, we can pursue it in a later ticket.

Jonathan Ellis
added a comment - 25/Nov/09 02:46 As Stu points out, if the client has to resolve writes, you can no longer compact without involving the client. This is a big, big loss. +1 pluggable server-side conflict resolution from me.
(This doesn't have to be complicated; just allow a class name to be specified per-CF like we do for CompareWith.)
Also, I think you can make a good case that this is a better stylistic fit for Cassandra, which tries to support "dumb" clients more than Dynamo did.
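As an illustration of the per-CF class-name idea, the plumbing might look like the following sketch. All names here are hypothetical (sketched in Python for brevity; the real thing would be Java), not Cassandra's actual API:

```python
import importlib

class Reconciler:
    """Hypothetical interface for pluggable conflict resolution."""
    def reconcile(self, versions):
        raise NotImplementedError

class LastWriteWins(Reconciler):
    """A trivial resolver: versions are (timestamp, value) pairs."""
    def reconcile(self, versions):
        return max(versions, key=lambda v: v[0])

def load_reconciler(class_path):
    """Instantiate the class named in the CF's config, e.g.
    "mypackage.LastWriteWins" -- analogous to how CompareWith names a
    comparator class per column family.
    """
    module, _, name = class_path.rpartition(".")
    return getattr(importlib.import_module(module), name)()
```

The point of the design is that the server stays deterministic: every replica loads the same class for a given CF, so resolution yields the same answer everywhere without client involvement.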

Kelvin Kakugawa
added a comment - 25/Nov/09 02:31 Right now, I'm leaning towards client-side conflict resolution.
Basically, all updates are written out and conflict resolution is handled at read time. An exception is a version in the Memtable that can be resolved syntactically. However, this would require more copies of the data and a more complex API. It would make the storage system more flexible for end users, though, since they wouldn't have to write server-side logic, although they would have to parse a list of conflicting versions and pass back a context/summary version vector of the merged conflict.
My reasoning is that Cassandra is write-optimized, so we should shift the burden to reads rather than writes.
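Under this scheme, the client's side of the contract might look like the following sketch (hypothetical helper names): merge the conflicting values however the application likes, then write back under a summary clock that descends from every conflicting version.

```python
def summary_clock(clocks):
    """Pointwise max of the conflicting versions' vectors. Writing under
    this clock makes the merged value supersede every conflict."""
    merged = {}
    for clock in clocks:
        for node, count in clock.items():
            merged[node] = max(merged.get(node, 0), count)
    return merged

def client_resolve(conflicts, merge_fn):
    """conflicts: list of (clock, value) pairs returned by a read.

    The application supplies merge_fn to combine the values; the result
    is the (summary_clock, merged_value) to send back in an insert.
    """
    value = merge_fn([v for _, v in conflicts])
    return summary_clock([c for c, _ in conflicts]), value
```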

Stu Hood
added a comment - 25/Nov/09 02:03 - edited One important difference between Cassandra and some other systems utilizing vector clocks is that in Cassandra, writes do not read from disk any old versions of the value that is being written to. This means that conflict resolution currently happens in 3 different places:
1. At read time - Two nodes have different versions,
2. At write time - The version being written is older than the version a node has in a Memtable,
3. At compaction time - Versions of values persisted to different SSTables disagree.
NB: For the purposes of this ticket, I think that all resolution should be handled server side, deterministically, and that one of the following options should be implemented as part of a separate ticket.
But before too much progress is made, we will probably want to decide whether we want to support:
a) Client side conflict resolution (logic implemented on the client side),
b) Server side resolution (pluggable logic on the server),
c) A hybrid (pluggable resolution, which can optionally send the versions to the client at resolution time #1 or #2)
If we decide to implement client-side resolution (a), then we will need to remove resolution at steps #2 and #3 (or make it optional for option (c)), and keep more copies of the data. For #2, a Memtable could store conflicting versions in memory until they are resolved by a read or flushed to disk. For #3, SSTables will need to be able to store multiple versions of a row/cf until they are resolved by a read.
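For resolution point #2 under option (a), a Memtable write that keeps concurrent siblings instead of resolving them could be sketched as follows (an assumed representation: a list of (clock, value) pairs per column, clocks as node-to-counter dicts):

```python
def dominates(a, b):
    """True if version vector a descends from (or equals) b."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) >= b.get(n, 0) for n in nodes)

def memtable_write(siblings, new_clock, new_value):
    """Apply a write to the in-memory siblings of one column.

    A stale write (superseded by something already stored) is dropped;
    a newer write replaces whatever it supersedes; a concurrent write
    is kept alongside the others until a read (or the client) resolves it.
    """
    if any(dominates(c, new_clock) for c, _ in siblings):
        return siblings  # stale: a stored version already descends from it
    survivors = [(c, v) for c, v in siblings if not dominates(new_clock, c)]
    return survivors + [(new_clock, new_value)]
```

Flushing such a Memtable means SSTables must also carry sibling lists, which is the extra storage cost at point #3 that Stu notes above.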