MIME-Version: 1.0
Date: Sun, 1 May 2011 23:50:45 -0500
Subject: Re: Combining all CFs into one big one
From: Tyler Hobbs
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0016e6d7df743504e904a243c4a6
--0016e6d7df743504e904a243c4a6
Content-Type: text/plain; charset=ISO-8859-1
> If you had one big cache, wouldn't it be the case that it's mostly
> populated with frequently accessed rows, and less populated with rarely
> accessed rows?

Yes.

> In fact, wouldn't one big cache dynamically and automatically give you
> exactly what you want? If you try to partition the same amount of memory
> manually, by guesswork, among many tables, aren't you always going to do a
> worse job?

Suppose you have one CF that's used constantly through interaction by
users. Suppose you have another CF that's only used periodically by a batch
process; during each run you tend to access most or all of its rows, and
it's too large for all of those rows to fit in the cache. Normally, you
would dedicate cache space to the first CF, since anything with human
interaction tends to have good temporal locality and you want to keep
latencies there low. Caching the second CF, on the other hand, provides
little to no real benefit. When you combine these two CFs, every time your
batch process runs, rows from the second CF will populate the cache and
evict rows from the first CF, even though having those batch rows in the
cache provides little benefit to you.
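
To make the eviction effect concrete, here is a rough simulation sketch. It
is not Cassandra code: it assumes a plain LRU cache, and the cache size, the
number of hot user rows, and the ratio of batch reads to user reads are all
made-up numbers chosen only for illustration. It compares the hit rate seen
by user reads when the batch rows stay out of the cache versus when they
share it.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache that only tracks which keys are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, load the key and evict the oldest."""
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            return True
        self.rows[key] = None
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict least recently used
        return False


def user_hit_rate(shared_cache, capacity=1000, hot_rows=800,
                  batch_reads_per_user_read=5, user_reads=50000, seed=1):
    """Hit rate of user reads while a batch scan runs at the same time.

    All parameters are hypothetical. With shared_cache=False the batch rows
    never enter the cache, which stands in for keeping them in a separate CF
    with caching turned off.
    """
    rng = random.Random(seed)
    cache = LRUCache(capacity)
    hits = 0
    next_batch_row = 0
    for _ in range(user_reads):
        # Interactive traffic: repeated reads over a small hot set.
        hits += cache.access(("users", rng.randrange(hot_rows)))
        if shared_cache:
            # Batch scan: every row is read once and not touched again.
            for _ in range(batch_reads_per_user_read):
                cache.access(("batch", next_batch_row))
                next_batch_row += 1
    return hits / user_reads


if __name__ == "__main__":
    print("user hit rate, batch rows kept out:", user_hit_rate(False))
    print("user hit rate, batch rows sharing: ", user_hit_rate(True))
```

With these assumed numbers the hot user rows fit in the cache on their own,
so nearly every user read is a hit once the cache is warm. When the scan
shares the cache, rows that will not be read again keep pushing the user
rows out, and the user hit rate falls sharply.
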
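As another example, if you mix a CF with wide rows and a CF with small rows,
you no longer have the option of using a row cache, even if it makes great
sense for the small-row CF data. The row cache holds entire rows, so it's a
poor fit for wide rows, and once the data shares a CF you can't enable it
for just the small rows.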
Knowledge of data and access patterns gives you a very good advantage when
it comes to caching your data effectively.
--
Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra
Python client library
--0016e6d7df743504e904a243c4a6--