incubator-cassandra-dev mailing list archives

Taking the discussion below to the dev list.
Continuing the discussion: judging from this passage, objects in
Cassandra might be quite large:
"The main limitation on column and supercolumn size is that all data
for a single key and column must fit (on disk) on a single machine in
the cluster. Because keys alone are used to determine the nodes
responsible for replicating their data, the amount of data associated
with a single key has this upper bound." -
http://wiki.apache.org/cassandra/CassandraLimitations
So there is (apparently) plenty of room for objects to become large
relative to the storage overhead of a single piece, at which point an
online code could provide significant space savings for a given level
of resiliency. There is also plenty of room for objects to become large
relative to a network frame, at which point an online code would not
impose a significant additional read latency penalty and might even
help (just as fetching different ranges from different nodes would
anyway). It would also seem to improve the latency of achieving write
consistency, if propagating whole duplicates across the network is
more expensive than computing and communicating the check symbols.
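To put rough numbers on the space argument, here is a back-of-the-envelope
sketch in Python. It assumes an idealized code in which any k of the k + m
stored symbols are enough to rebuild the data; real online codes need
slightly more than k symbols, so treat the figures as a lower bound on
overhead. The function names and the (k, m) values are made up for
illustration and are not taken from Cassandra.

```python
def replication_overhead(copies: int) -> float:
    """Bytes stored per byte of user data when keeping `copies` full replicas."""
    return float(copies)


def erasure_overhead(k: int, m: int) -> float:
    """Bytes stored per byte of user data for k source symbols plus m repair symbols."""
    return (k + m) / k


def losses_tolerated_by_replication(copies: int) -> int:
    """Full replicas survive the loss of all but one copy."""
    return copies - 1


def losses_tolerated_by_ideal_code(m: int) -> int:
    """ASSUMPTION: idealized code, any k of the k + m symbols suffice."""
    return m


if __name__ == "__main__":
    # Both configurations below tolerate the loss of two pieces.
    print("3 replicas:", replication_overhead(3), "x storage")
    for k, m in [(4, 2), (10, 2)]:
        print(f"k={k}, m={m}:", erasure_overhead(k, m), "x storage")
```

The larger k is (that is, the larger the object relative to one piece), the
closer the ratio gets to 1 for the same number of tolerated losses.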
Then again, judging from the reaction, it seems like it would be a lot
of work, but I can't comment on that, not knowing much about the
current implementation. (I'm going mostly off the Dynamo paper; the
jargon in the ArchitectureInternals page is still a bit opaque to me;
I haven't gotten into the code at all; and I'm overgeneralizing about
which of the read and write semantics that Dynamo/Cassandra offers I'm
talking about.)
Jonathan Ellis resolved CASSANDRA-755.
--------------------------------------
Resolution: Invalid
This is not a good fit for Cassandra. See
http://wiki.apache.org/cassandra/ArchitectureInternals (and read the
Dynamo paper).
erasure codes are a good fit when you are storing (a) relatively large
pieces of data that (b) don't care too much about latency. neither of
those applies to cassandra.
also note that jira is not a good place for "is X a good fit for
cassandra?" discussions; use the -dev mailing list for that.
> Use a systematic online code for efficient redundancy
> -----------------------------------------------------
>
> Key: CASSANDRA-755
> URL: https://issues.apache.org/jira/browse/CASSANDRA-755
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Anthony Di Franco
> Priority: Minor
>
> Use a systematic online code for efficient redundancy (or specify why this is not appropriate
> for Cassandra in the documentation). Systematic online codes permit an arbitrarily large
> or small number of repair symbols to be added to the original data to more smoothly increase
> the amount of redundant storage; in particular, they would permit redundancy ratios between
> 1 and 2 to be used.
> See here:
> http://en.wikipedia.org/wiki/Online_codes
> and here:
> http://archipelago.rubyforge.org/svn/trunk/oneliner/
> and here:
> http://tools.ietf.org/html/rfc5053
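
For readers unfamiliar with the term "systematic", a toy illustration
follows. It is NOT an online code: it is a single XOR parity block, the
simplest systematic code, chosen only because it fits in a few lines. It
shows the two properties the issue relies on: the original blocks are
stored unchanged, and the redundancy ratio (here 1 + 1/k) falls between 1
and 2. The function names are hypothetical and all blocks are assumed to be
the same length.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(blocks: List[bytes]) -> List[bytes]:
    """Systematic encoding: the source blocks unchanged, plus one XOR repair block."""
    parity = reduce(xor_bytes, blocks)
    return list(blocks) + [parity]


def recover(stored: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the source blocks when at most one stored block is missing (None)."""
    missing = [i for i, block in enumerate(stored) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can repair only one loss")
    repaired = list(stored)
    if missing:
        present = [block for block in stored if block is not None]
        # XOR of all surviving blocks equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return repaired[:-1]  # drop the repair block, keep the source blocks


if __name__ == "__main__":
    source = [b"abcd", b"efgh", b"ijkl"]
    stored = encode(source)
    print("redundancy ratio:", len(stored) / len(source))
    stored[1] = None  # simulate losing one piece
    print("recovered intact:", recover(stored) == source)
```

An online code generalizes this by letting the number of repair blocks be
chosen freely rather than fixed at one.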