From user-return-34562-apmail-cassandra-user-archive=cassandra.apache.org@cassandra.apache.org Tue Jun 11 06:38:25 2013
MIME-Version: 1.0
Received: by 10.112.7.168 with HTTP; Mon, 10 Jun 2013 23:37:33 -0700 (PDT)
In-Reply-To:
References:
From: Alain RODRIGUEZ
Date: Tue, 11 Jun 2013 08:37:33 +0200
Message-ID:
Subject: Re: Why so many vnodes?
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=001a11c373600323c904dedb2305
X-Virus-Checked: Checked by ClamAV on apache.org
--001a11c373600323c904dedb2305
Content-Type: text/plain; charset=ISO-8859-1
I think he actually meant *increase*, for this reason: "For small T, a
random choice of initial tokens will in most cases give a poor distribution
of data. The larger T is, the closer to uniform the distribution will be,
with increasing probability."
Alain
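
To make the quoted claim concrete, here is a minimal simulation sketch in Python. It assumes a unit-length ring, uniformly random tokens, and that each token owns the range reaching back to the previous token. The names `ownership` and `imbalance` are invented for illustration; this is not Cassandra's token allocation code, only a model of random assignment with N nodes and T tokens per node.

```python
import random


def ownership(num_nodes, tokens_per_node, rng):
    """Return the fraction of a unit ring owned by each node.

    Assumption: every node picks tokens_per_node uniformly random tokens,
    and a token owns the range from the previous token up to itself.
    """
    ring = sorted(
        (rng.random(), node)
        for node in range(num_nodes)
        for _ in range(tokens_per_node)
    )
    owned = [0.0] * num_nodes
    # The first token on the ring owns the wrap-around range from the last one.
    prev = ring[-1][0] - 1.0
    for token, node in ring:
        owned[node] += token - prev
        prev = token
    return owned


def imbalance(num_nodes, tokens_per_node, trials=20, seed=1):
    """Average ratio of the largest share to the mean share (1.0 is ideal)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        owned = ownership(num_nodes, tokens_per_node, rng)
        # The mean share is 1 / num_nodes, so max / mean = max * num_nodes.
        total += max(owned) * num_nodes
    return total / trials


if __name__ == "__main__":
    for t in (1, 4, 16, 64, 256):
        print(f"N=6 T={t:3d} max/mean ownership ~ {imbalance(6, t):.2f}")
```

A ratio of 1.0 would mean a perfectly even ring. In this model the ratio should move closer to 1.0 as T grows, which is the effect Richard describes.
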
2013/6/11 Theo Hultberg
> thanks, that makes sense, but I assume in your last sentence you mean
> decrease it for large clusters, not increase it?
>
> T#
>
>
> On Mon, Jun 10, 2013 at 11:02 PM, Richard Low wrote:
>
>> Hi Theo,
>>
>> The number (let's call it T and the number of nodes N) 256 was chosen to
>> give good load balancing for random token assignments for most cluster
>> sizes. For small T, a random choice of initial tokens will in most cases
>> give a poor distribution of data. The larger T is, the closer to uniform
>> the distribution will be, with increasing probability.
>>
>> Also, for small T, when a new node is added, it won't have many ranges to
>> split so won't be able to take an even slice of the data.
>>
>> For this reason T should be large. But if it is too large, there are too
>> many slices to keep track of as you say. The function to find which keys
>> live where becomes more expensive and operations that deal with individual
>> vnodes e.g. repair become slow. (An extreme example is SELECT * LIMIT 1,
>> which when there is no data has to scan each vnode in turn in search of a
>> single row. This is O(NT) and for even quite small T takes seconds to
>> complete.)
>>
>> So 256 was chosen to be a reasonable balance. I don't think most users
>> will find it too slow; users with extremely large clusters may need to
>> increase it.
>>
>> Richard.
>>
>>
>> On 10 June 2013 18:55, Theo Hultberg wrote:
>>
>>> I'm not sure I follow what you mean, or if I've misunderstood what
>>> Cassandra is telling me. Each node has 256 vnodes (or tokens, as the
>>> preferred name seems to be). When I run `nodetool status` each node is
>>> reported as having 256 vnodes, regardless of how many nodes are in the
>>> cluster. A single node cluster has 256 vnodes on the single node, a six
>>> node cluster has 256 vnodes on each machine, making 1536 vnodes in total.
>>> When I run `SELECT tokens FROM system.peers` or `nodetool ring` each node
>>> lists 256 tokens.
>>>
>>> This is different from how it works in Riak and Voldemort, if I'm not
>>> mistaken, and that is the source of my confusion.
>>>
>>> T#
>>>
>>>
>>> On Mon, Jun 10, 2013 at 4:54 PM, Milind Parikh wrote:
>>>
>>>> There are n vnodes regardless of the size of the physical cluster.
>>>> Regards
>>>> Milind
>>>> On Jun 10, 2013 7:48 AM, "Theo Hultberg" wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The default number of vnodes is 256; is there any significance in this
>>>>> number? Since Cassandra's vnodes don't work like for example Riak's, where
>>>>> there is a fixed number of vnodes distributed evenly over the nodes, why so
>>>>> many? Even with a moderately sized cluster you get thousands of slices.
>>>>> Does this matter? If your cluster grows to over thirty machines and you
>>>>> start looking at ten thousand slices, would that be a problem? I guess that
>>>>> traversing a list of a thousand or ten thousand slices to find where a
>>>>> token lives isn't a huge problem, but are there any other up or downsides
>>>>> to having a small or large number of vnodes per node?
>>>>>
>>>>> I understand the benefits of splitting up the ring into pieces, for
>>>>> example to be able to stream data from more nodes when bootstrapping a new
>>>>> one, but that works even if each node only has say 32 vnodes (unless your
>>>>> cluster is truly huge).
>>>>>
>>>>> yours,
>>>>> Theo
>>>>>
>>>>
>>>
>>
>
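
As a rough illustration of the cost points raised in the thread, the sketch below builds a ring of N*T tokens, looks up the owner of a token with a binary search, and counts how many ranges a scan must visit when no range holds any data. Everything here is an assumption made for the example: the 64-bit random tokens, the names `build_ring`, `owner_of` and `ranges_scanned`, and the lookup strategy itself. It is not how Cassandra implements these operations.

```python
import bisect
import random


def build_ring(num_nodes, tokens_per_node, seed=0):
    """Return (tokens, owners): sorted random tokens and the node owning each."""
    rng = random.Random(seed)
    ring = sorted(
        (rng.getrandbits(64), node)
        for node in range(num_nodes)
        for _ in range(tokens_per_node)
    )
    tokens = [token for token, _ in ring]
    owners = [node for _, node in ring]
    return tokens, owners


def owner_of(key_token, tokens, owners):
    """Find the node whose range contains key_token.

    The owner is the first token at or after key_token, wrapping around to
    the start of the ring. Binary search keeps this at O(log(N*T)).
    """
    index = bisect.bisect_left(tokens, key_token)
    return owners[index % len(tokens)]


def ranges_scanned(tokens, range_has_data):
    """Count the ranges visited, in ring order, until one reports data."""
    visited = 0
    for index in range(len(tokens)):
        visited += 1
        if range_has_data(index):
            break
    return visited


if __name__ == "__main__":
    for nodes in (1, 6, 30):
        tokens, owners = build_ring(nodes, 256)
        empty_scan = ranges_scanned(tokens, lambda index: False)
        print(f"N={nodes:2d} T=256 ranges={len(tokens)} empty scan visits={empty_scan}")
```

With no data anywhere, the scan touches every one of the N*T ranges, which matches the O(NT) behaviour described for SELECT * LIMIT 1, while the single-token lookup only grows logarithmically in this model.
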