From: Aaron Morton
Subject: Re: is the select result grouped by the value of the partition key?
Date: Tue, 17 Sep 2013 10:38:22 +1200
To: user@cassandra.apache.org

> My high-level understanding of how Cassandra handles a SELECT is that:
> (excuse incorrect terminology)
> 1. client connects to some node N
> 2. node N acts as a kind of coordinator and fires off the thrift or binary-protocol messages
> to all other nodes to fetch rows off the memtables and/or disks

The internode messages use a custom binary protocol, not the Thrift / native API messages. These messages are also used on the node itself to move your request into the appropriate thread pool.

Each node reads the data needed for the request as if it were the only node performing the request. The only time we act differently is when sending the data back to the coordinator.

> 3. coordinator merges, truncates, etc the sets from the nodes and returns one answer set to client.
>

The coordinator simply compares the results from the replicas and determines if they match. It does not merge or truncate.
If they do not match we perform the read again, but this time transmit some extra data so we can resolve the differences.
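
As a rough illustration of that compare-then-resolve flow, here is a minimal Python sketch. It is not Cassandra's code: the function names, the `(key, value, timestamp)` row shape, and the use of MD5 are all assumptions made for the example. It also assumes the "extra data" includes write timestamps, so the newest value for each key can win.

```python
import hashlib


def digest(rows):
    """Hash one replica's ordered response so responses can be compared cheaply."""
    h = hashlib.md5()
    for key, value, timestamp in rows:
        h.update(repr((key, value, timestamp)).encode("utf-8"))
    return h.hexdigest()


def responses_match(responses):
    """True when every replica returned exactly the same rows in the same order."""
    return len({digest(rows) for rows in responses}) == 1


def resolve(responses):
    """Second pass (hypothetical): keep the value with the newest timestamp per key."""
    newest = {}
    for rows in responses:
        for key, value, timestamp in rows:
            if key not in newest or timestamp > newest[key][1]:
                newest[key] = (value, timestamp)
    # Sorting by key is a shortcut for this sketch; it stands in for the
    # storage order that the replica responses already carry.
    return [(key, value, ts) for key, (value, ts) in sorted(newest.items())]


def coordinator_read(responses):
    """Return one replica's answer untouched if all agree, otherwise reconcile."""
    if responses_match(responses):
        return responses[0]
    return resolve(responses)
```

When the replicas agree, the first response is returned as-is, which is why the ordering question never comes up on the common path.
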

> It is step 3 which has me wondering - does it explicitly preserve the on-disk order?

Order from the on-disk read (including reverse order requested in the select statement) is preserved in the serialisation process. After that we never order again.

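To make the ordering point concrete, here is a small Python sketch, assuming a partition is just a list of `(clustering_value, data)` pairs already in on-disk order. The names `replica_read`, `serialise` and `deserialise` are hypothetical, and JSON stands in for the real internode format. The only point is that ordering is decided once, at read time, and serialisation carries it through unchanged.

```python
import json


def replica_read(partition, reverse=False):
    """Return rows in on-disk order, or reversed if the query asked for it."""
    return list(reversed(partition)) if reverse else list(partition)


def serialise(rows):
    """Stand-in for the internode format: a list keeps its element order."""
    return json.dumps(rows)


def deserialise(payload):
    """Rebuild the rows on the coordinator side without re-sorting them."""
    return [tuple(row) for row in json.loads(payload)]


on_disk = [(1, "a"), (2, "b"), (3, "c")]
forward = deserialise(serialise(replica_read(on_disk)))
backward = deserialise(serialise(replica_read(on_disk, reverse=True)))
```

Here `forward` comes back in the same order as `on_disk` and `backward` in the reverse order, with no sort anywhere after the read.
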
> In fact - does it simply keep each individual node's answer set separate? Is that how it works?

I did some recent webinars for PlanetCassandra that may help:

Introduction to Apache Cassandra 1.2
http://thelastpickle.com/speaking/2013/04/25/Community-Webinar.html
Talks about the read / write and cluster process at a high level.

Cassandra Internals
http://thelastpickle.com/speaking/2013/08/25/Cassandra-Community-Webinar.html
Goes deep into the code to explain how Cassandra works.

Hope that helps.

-----------------
Aaron Morton
New Zealand
@aaronmorton
Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
On 13/09/2013, at 1:11 AM, John Lumby wrote:
> Aaron, thanks for the super-rapid response. That clarifies a lot for me,
> but I think I am still wondering about one point embedded below.
>
> ________________________________
>> From: aaron@thelastpickle.com
>> Subject: Re: is the select result grouped by the value of the partition key?
>> Date: Thu, 12 Sep 2013 14:19:06 +1200
>> To: user@cassandra.apache.org
>>
>> GROUP BY "feature",
>> I would not think of it like that, this is about physical order of rows.
>>
>> since it seems really important yet does not seem to be mentioned in the
>> CQL reference documentation.
>> It's baked in, this is how the data is organised on the row.
>
> Yes, I see, and I absolutely get the relevance of where columns are stored on disk to,
> say, doing INSERTs.
> But what I am wondering about is, in the context of a SELECT, we seem to be relying on
> the Cassandra client api preserving that on-disk order while returning rows.
> My high-level understanding of how Cassandra handles a SELECT is that:
> (excuse incorrect terminology)
> 1. client connects to some node N
> 2. node N acts as a kind of coordinator and fires off the thrift or binary-protocol messages
> to all other nodes to fetch rows off the memtables and/or disks
> 3. coordinator merges, truncates, etc the sets from the nodes and returns one answer set to client.
>
> It is step 3 which has me wondering - does it explicitly preserve the on-disk order?
> In fact - does it simply keep each individual node's answer set separate? Is that how it works?
>
>>
>> http://www.datastax.com/dev/blog/thrift-to-cql3
>> We often say the PRIMARY KEY is the PARTITION KEY and the GROUPING COLUMNS
>> http://www.datastax.com/documentation/cql/3.0/webhelp/index.html#cql/cql_reference/create_table_r.html
>>
>> See also http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
>>
>> Is it something we can bet the farm and farmer's family on?
>> Sure.
>>
>> The kinds of scenarios where I am wondering if it's possible for
>> partition-key groups
>> to get intermingled are :
>> All instances of the table entity with the same value(s) for the
>> PARTITION KEY portion of the PRIMARY KEY existing in the same storage
>> engine row.
>>
>> . what if the node containing primary copy of a row is down
>> There is no primary copy of a row.
>>
>> . what if there is a heavy stream of UPDATE activity from
>> applications which
>> connect to all nodes, causing different nodes to have different
>> versions of replicas of same row?
>> That's fine with me.
>> It's only an issue when the data is read, and at that point the
>> Consistency Level determines what we do.
>>
>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> New Zealand
>> @aaronmorton
>>
>> Co-Founder & Principal Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> On 12/09/2013, at 7:43 AM, John Lumby wrote:
>>
>> I would like to make quite sure about this implicit GROUP BY "feature",
>> since it seems really important yet does not seem to be mentioned in the
>> CQL reference documentation.
>>
>> Aaron, you said "yes" -- is that "yes, always, in all scenarios
>> no matter what"
>> or "yes usually"? Is it something we can bet the farm and farmer's
>> family on?
>>
>> The kinds of scenarios where I am wondering if it's possible for
>> partition-key groups
>> to get intermingled are :
>>
>> . what if the node containing primary copy of a row is down
>> and
>> cassandra fetches this row from a replica on a different node
>> (e.g. with CONSISTENCY ONE)
>>
>> . what if there is a heavy stream of UPDATE activity from
>> applications which
>> connect to all nodes, causing different nodes to have different
>> versions of replicas of same row?
>>
>> Can you point me to some place in the cassandra source code where this
>> grouping is ensured?
>>
>> Many thanks,
>>
>> John Lumby