In-Reply-To: <4AD4F69F.5020802@bulkowski.org>
References: <4AD4F69F.5020802@bulkowski.org>
Date: Tue, 13 Oct 2009 18:27:01 -0500
Message-ID:
Subject: Re: eventual consistency question
From: Jonathan Ellis
To: cassandra-user@incubator.apache.org
On Tue, Oct 13, 2009 at 4:52 PM, Brian Bulkowski wrote:
> Question 1:
>   the bootstrap parameter: what does it do, exactly?
It's for adding nodes to an existing cluster. (This is being reworked
to be more automatic for 0.5.) If you start a node without it, it
assumes it already has the data that's "supposed" to be on it (either
b/c you are starting a new cluster, or restarting an existing node in
one). If you do specify -b it will contact the cluster you're
starting it in to move the right data to it.
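As a rough illustration of what "move the right data to it" means, here
is a minimal Python sketch of a token ring in which each key belongs to
the first node token at or after the key's hash. It is not Cassandra's
code: the ring size, the token_for, owner, and keys_to_stream names, and
the single-replica placement are all assumptions made for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # toy ring size, chosen only for illustration


def token_for(key: str) -> int:
    """Hash a key onto the toy ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner(ring: dict, key: str) -> str:
    """Return the node holding the first token at or after the key's
    token, wrapping around to the lowest token past the end of the ring."""
    tokens = sorted(ring)
    idx = bisect.bisect_left(tokens, token_for(key))
    return ring[tokens[idx % len(tokens)]]


def keys_to_stream(ring: dict, new_token: int, new_node: str, keys) -> dict:
    """Map each key that the joining node takes over to its previous owner."""
    new_ring = dict(ring)
    new_ring[new_token] = new_node
    return {k: owner(ring, k) for k in keys if owner(new_ring, k) == new_node}
```

With nodes A, B, C at evenly spaced tokens and a new node D placed
between B and C, only keys that C used to own change hands; everything
else stays where it was.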
> Question 2:
>   "how eventual is eventual?"
>   Imagine the following case:
>      Defaults from storage-conf.xml + replication count 2 (and the IP
> addresses required, etc)
>      Up server A (no -b)
>      Insert a few values, read, all is good (using _cli)
>      Up server B, C (with -b)
>      read values from A, B, or C - all is good, appears to be reading from A
>      wait a few minutes - servers appear quiescent.
>      Down server A
>      read values from B - values are not available (NPE exception on server
> & _cli interface)
>
> So I read that Cassandra doesn't optimistically replicate, so I understand
> in theory that the data inserted to A shouldn't replicate.
> I believe if I used the proper thrift interface and asked for replication
> count 2, the transaction would have failed.
> Yet, I expect that if I asked for replication count 2, I should get it. At
> some point. Eventually. The data has been inserted.
> I expect the cluster to work toward replication count 2 regardless of the
> current state of the cluster --- is there a way to achieve this behavior?
There are a couple of things going on here.
The big one is that after a fresh start A doesn't know what other
nodes "should" be part of the cluster. Cassandra does assume that you
reach a "good" state before bad stuff starts happening (in production
this turns out to be quite reasonable). So what you need to do is
bring up A/B/C, then turn things off, rather than just bring up A by
itself to start with.
The second one is that there's a different time scale for "eventually"
between "eventually, all the replicas are in sync" and "eventually,
the failure detector will notice that a node is down and not route
requests to it." The first is in ms (with a few caveats, mostly the
same as in the Dynamo paper); the second is in seconds (a handful for
a small cluster, maybe 15 for a large one).
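To make the second time scale concrete, here is a minimal sketch of a
fixed-timeout heartbeat detector in Python. This is a deliberately
simpler technique than the failure detector Cassandra actually ships,
and the class name, the method names, and the 5-second timeout in the
note below are assumptions for illustration only.

```python
class TimeoutFailureDetector:
    """Treat a node as down once no heartbeat has arrived within `timeout`."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_heard = {}  # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` was heard from at time `now` (in seconds)."""
        self.last_heard[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        """A node never heard from, or silent for too long, counts as down."""
        last = self.last_heard.get(node)
        return last is not None and (now - last) <= self.timeout

    def live_nodes(self, now: float) -> list:
        """Nodes that requests may still be routed to at time `now`."""
        return sorted(n for n in self.last_heard if self.is_alive(n, now))
```

With a 5-second timeout, a node that stops sending heartbeats is still
treated as alive for up to 5 seconds, so requests can keep being routed
to it for that window.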
> Question 3:
>   "balancing"
>      This question is similar to question 2, from a different way.
>      I have three nodes which I brought up at the dawn of time. They've
> taken a lot of inserts, and have 1T each.
>      Let's say the load now is mostly reads, as the data has already been
> inserted
>      I bring up a fourth node.
>      Clients (aka app servers) are pointing at the first 3 nodes. I have to
> reconfigure those servers to start using the 4th server, right?
Depends on your infrastructure. I'm a fan of round-robin DNS. Or you
can use a software or hardware load balancer. Or you can ask the
cluster what machines are in it and manually balance from your client
app, but that is my least favorite option.
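For that last option, a client-side rotation can be sketched in a few
lines of Python. The host names and the RoundRobinHosts class are
hypothetical, and the host list is assumed to come from configuration
rather than from asking the cluster.

```python
import itertools


class RoundRobinHosts:
    """Hand out hosts from a fixed list in rotation."""

    def __init__(self, hosts):
        self._hosts = list(hosts)
        if not self._hosts:
            raise ValueError("at least one host is required")
        self._cycle = itertools.cycle(self._hosts)

    def next_host(self) -> str:
        """Return the next host in the rotation, wrapping around at the end."""
        return next(self._cycle)


# Hypothetical host names; adding a fourth node means adding a fourth entry.
pool = RoundRobinHosts(["node1", "node2", "node3", "node4"])
first_five = [pool.next_host() for _ in range(5)]
```

The drawback is the one raised in the question: every client has to be
reconfigured when the host list changes, which round-robin DNS or a load
balancer avoids.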
>      New writes may take advantage of the 4th server, but no data will
> automatically move?
If you specify -b when starting the 4th node, the right data will be
copied to it. (Then, you need to manually tell the other nodes to
"cleanup" -- remove data that doesn't belong on them. This is not
automatic since if the nodes are "missing" as in the answer to #2 you
can shoot yourself in the foot here.)
> Thanks for the hints - I'm clearly not "getting" Cassandra yet and don't
> want to foolishly misrepresent it.
These are not dumb questions. Carry on. :)
-Jonathan