From: Yang <teddyyyy123@gmail.com>
To: user@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Date: Sat, 2 Jul 2011 12:57:55 -0700
Subject: Re: Strong Consistency with ONE read/writes

Jonathan:

Could you please elaborate on why, specifically, they are "not even
close"?
--- I kind of see what you mean (please correct me if I misunderstood):
Cassandra's failure detector is consulted on every write, while HBase's
failure detector is only used when a tablet server joins or leaves.

To have the single write-entry-point approach originally brought up in
this thread, I think you need a strong membership protocol to lock the
key-range leadership; once leadership is acquired, failure detectors do
not need to be consulted on every write.
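
A rough sketch of the kind of key-range leadership lock I have in mind,
against the plain ZooKeeper Java API (the /leaders layout and the range
naming are made up for illustration, and the parent znode is assumed to
already exist):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RangeLeaderElection {
    private final ZooKeeper zk;
    private final String nodeId;

    public RangeLeaderElection(ZooKeeper zk, String nodeId) {
        this.zk = zk;
        this.nodeId = nodeId;
    }

    // Try to become leader for one key range by creating an ephemeral
    // znode. Ephemeral nodes vanish when the creator's session dies, so
    // at most one live session holds a range at any given time.
    public boolean tryAcquire(String rangeName)
            throws KeeperException, InterruptedException {
        String path = "/leaders/" + rangeName;  // layout made up here
        try {
            zk.create(path, nodeId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;   // we are the single write entry point now
        } catch (KeeperException.NodeExistsException e) {
            return false;  // another node already leads this range
        }
    }
}

Once tryAcquire succeeds, the holder stays the single write entry point
for that range until its ZK session dies, at which point the ephemeral
znode vanishes and another replica can take over.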
Yes, by definition of the original requirement brought up in this
thread, Cassandra's write behavior would change to be more like HBase,
or MongoDB in "replica set" mode. But it seems that this leader mode
could even co-exist with the multi-entry write mode that Cassandra uses
now, just as you can use a different CL for each single write request.
In that case you would need to keep the current lightweight
Phi-detector and also add ZooKeeper for leader election in single-entry
write mode.
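
For reference, the accrual idea behind the lightweight Phi-detector
looks roughly like this (a simplified sketch using the exponential
inter-arrival approximation; not Cassandra's actual FailureDetector
class):

import java.util.ArrayDeque;
import java.util.Deque;

// Simplified phi accrual failure detector: suspicion is a continuous
// value, not a hard up/down membership decision.
public class PhiAccrualDetector {
    private static final int WINDOW = 1000;   // heartbeat samples kept
    private final Deque<Long> intervals = new ArrayDeque<Long>();
    private long lastHeartbeatMillis = -1;

    // Record a heartbeat arrival, e.g. driven by gossip.
    public synchronized void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis > 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // phi = -log10 P(silence this long | exponential model)
    //     = (elapsed / meanInterval) / ln(10).
    // Callers suspect the node once phi crosses a threshold (e.g. 8).
    public synchronized double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double sum = 0.0;
        for (long interval : intervals) sum += interval;
        double mean = sum / intervals.size();
        return (nowMillis - lastHeartbeatMillis) / (mean * Math.log(10.0));
    }
}

The point is that phi only grows continuously with silence; consumers
pick a threshold rather than getting a hard verdict, which is exactly
why it cannot by itself give the leadership guarantee above.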
Thanks
Yang
(I should correct my terminology: it's not a "strong failure detector"
that's needed, it's a "strong membership protocol". Strongly complete
and accurate failure detectors do not exist in asynchronous distributed
systems; see Chandra and Toueg, "Unreliable Failure Detectors for
Reliable Distributed Systems", Journal of the ACM, 43(2):225-267, 1996,
and the FLP result, "Impossibility of Distributed Consensus with One
Faulty Process".)
On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis wrote:
> The way HBase uses ZK (for master election) is not even close to how
> Cassandra uses the failure detector.
>
> Using ZK for each operation would (a) not scale and (b) not work
> cross-DC for any reasonable latency requirements.
>
> On Sat, Jul 2, 2011 at 11:55 AM, Yang wrote:
> > There is a JIRA completed in 0.7.x that "prefers" a certain node in
> > the snitch, so this does roughly what you want MOST of the time.
> >
> > But the problem is that it does not GUARANTEE that the same node
> > will always be read. I recently read the HBase vs. Cassandra
> > comparison thread that started after Facebook dropped Cassandra for
> > their messaging system, and understood some of the differences.
> > What you want is essentially what HBase does. The fundamental
> > difference there is really due to the gossip protocol: it is a
> > probabilistic, or eventually consistent, failure detector, while
> > HBase/Google Bigtable use ZooKeeper/Chubby to provide a strong
> > failure detector (a distributed lock). So in HBase, if a tablet
> > server goes down, it really goes down; it cannot re-grab the tablet
> > from the new tablet server without going through a start-up
> > protocol (notifying the master, which would notify the clients,
> > etc.). In other words, it is guaranteed that one tablet is served
> > by only one tablet server at any given time. In comparison, the
> > above JIRA only TRIES to serve that key from one particular
> > replica. HBase can have that guarantee because the group membership
> > is maintained by the strong failure detector.
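
To make the "it really goes down" point concrete: with ZooKeeper,
ownership dies with the session, so a server can fence itself off the
moment its session expires (again only an illustration against the
plain ZooKeeper API):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Stop serving the moment our ZooKeeper session expires: expiry means
// our ephemeral ownership znode is already gone, so another server may
// legitimately own the tablet/range by now.
public class OwnershipWatcher implements Watcher {
    private volatile boolean mayServe = true;

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            mayServe = false;  // fenced; must rejoin via startup protocol
        }
    }

    public boolean mayServe() {
        return mayServe;
    }
}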
> > Just for hacking curiosity, a strong failure detector + Cassandra
> > replicas is not impossible (actually it seems not difficult),
> > although the performance is not clear. What would such a strong
> > failure detector bring to Cassandra besides this ONE-ONE strong
> > consistency? That is an interesting question, I think.
> >
> > Considering that HBase has been deployed on big clusters, it is
> > probably OK with the performance of the strong ZooKeeper failure
> > detector. Then a further question is: why did Dynamo originally
> > choose the probabilistic failure detector? Yes, Dynamo's main theme
> > is "eventually consistent", so the Phi-detector is **enough**, but
> > if a strong detector buys us more at little cost, wouldn't that be
> > great?
> >
> >
> > On Fri, Jul 1, 2011 at 6:53 PM, AJ wrote:
> >>
> >> Is this possible?
> >>
> >> All reads and writes for a given key will always go to the same
> >> node from a client. It seems the only thing needed is to allow the
> >> clients to compute which node is the closest replica for the given
> >> key, using the same algorithm C* uses. When the first replica
> >> receives the write request, it will write to itself, which should
> >> complete before any of the other replicas, and then return. The
> >> loads should still stay balanced if using the random partitioner.
> >> If the first replica becomes unavailable (however that is
> >> defined), then the clients can send to the next replica in the
> >> ring and switch from ONE reads/writes to QUORUM reads/writes
> >> temporarily, until the first replica becomes available again.
> >> QUORUM is required since there could be some replicas that were
> >> not updated after the first replica went down.
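
For what it's worth, the "compute which node is the closest replica"
step is easy to sketch client-side against the random partitioner
(assuming the client is simply handed the token ring, with
SimpleStrategy-style placement and no rack/DC awareness):

import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Client-side replica lookup mimicking RandomPartitioner placement.
public class ReplicaLocator {
    private final TreeMap<BigInteger, String> ring =
            new TreeMap<BigInteger, String>();  // token -> node

    public void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    // MD5 token as RandomPartitioner computes it: abs of the 128-bit
    // MD5 digest interpreted as a BigInteger.
    static BigInteger token(byte[] key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key);
            return new BigInteger(d).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);  // MD5 is always available
        }
    }

    // First `rf` nodes clockwise from the key's token; element 0 is
    // the "first replica" every client should prefer for ONE.
    public List<String> replicasFor(byte[] key, int rf) {
        List<String> replicas = new ArrayList<String>();
        SortedMap<BigInteger, String> tail = ring.tailMap(token(key));
        for (String node : tail.values()) {
            if (replicas.size() == rf) break;
            replicas.add(node);
        }
        for (String node : ring.values()) {  // wrap around the ring
            if (replicas.size() == rf) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }
}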
> >>
> >> Will this work? The goal is to have strong consistency with a
> >> read/write consistency level as low as possible, with a network
> >> performance boost as a secondary goal.
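
And the ONE-to-QUORUM fallback could look something like the following
(the Client interface here is hypothetical, standing in for whatever
driver is in use):

import java.util.List;

// Sketch of the fallback policy described above: prefer the first
// replica at CL ONE, degrade to QUORUM while it is down.
enum CL { ONE, QUORUM }

interface Client {
    void write(String node, byte[] key, byte[] value, CL cl)
            throws Exception;
}

class FallbackWriter {
    private final Client client;

    FallbackWriter(Client client) { this.client = client; }

    void write(List<String> replicas, byte[] key, byte[] value)
            throws Exception {
        try {
            // Happy path: the designated first replica is the single
            // entry point, so CL.ONE observes all prior writes.
            client.write(replicas.get(0), key, value, CL.ONE);
        } catch (Exception unavailable) {
            // First replica down: next replica at QUORUM, because the
            // survivors may have missed its latest CL.ONE writes.
            client.write(replicas.get(1), key, value, CL.QUORUM);
        }
    }
}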
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>