From: Aaron Morton
To: Cassandra User <user@cassandra.apache.org>
Subject: Re: Storage management during rapid growth
Date: Tue, 5 Nov 2013 20:07:40 +1300
> However, when monitoring the performance of our cluster, we see sustained periods - especially during repair/compaction/cleanup - of several hours where there are >2000 IOPS.
If the IOPS are there, compaction / repair / cleanup will use them if the configuration allows it. If they are not there and the configuration matches the resources, the only issue will be that things take longer (assuming the HW can handle the throughput).
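For what it's worth, the knobs in question live in cassandra.yaml. A sketch (the values shown are just the 1.2 defaults, tune them for your hardware):

    # caps the total disk I/O compaction may use; 0 disables throttling
    compaction_throughput_mb_per_sec: 16

    # throttles outbound streaming, which repair and rebuilds rely on
    stream_throughput_outbound_megabits_per_sec: 200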
> 2) Move some nodes to a SAN solution, ensuring that there is a mix of storage, drives, ...
IMHO you will have a terrible time and regret the decision. Performance in anger rarely matches local disks, and when someone decides the SAN needs to go through a maintenance process, say goodbye to your node. You will also need very good network links.
Cassandra is designed for a shared-nothing architecture; it's best to embrace that.
> 1) Has anyone moved from SSDs to spinning-platter disks, or managed a cluster that contained both? Do the numbers we're seeing exaggerate the performance hit we'd see if we moved to spinners?
Try to get a feel for the general IOPS used for reads without compaction etc. running, and also for the bytes going into the cluster on the RPC / native binary interface.
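Something like this is a starting point (a sketch; assumes a Linux node with sysstat installed, and nodetool ships with Cassandra):

    # per-disk read/write IOPS (the r/s and w/s columns), sampled every 5s
    iostat -x 5

    # check whether compactions are running before trusting the numbers
    nodetool compactionstats

    # thread pool stats give a feel for read / write request volume
    nodetool tpstats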
> 2) Have you successfully used a SAN or a hybrid SAN solution (some local, some SAN-based) to dynamically add storage to the cluster? What type of SAN have you used, and what issues have you run into?
I've worked with people who have internal SANs and those that have used EBS. I would not describe either solution as optimal. The issues are performance under load, network contention, and SLA / consistency.
> 3) Am I missing a way of economically scaling storage?
Version 1.2+ has better support for fat nodes (nodes with up to 5 TB of data) via:
* JBOD: mount each disk independently and add it to data_file_directories. Cassandra will balance the write load between disks and run one flush thread per data directory; I've heard this gives good performance with HDDs. This gives you 100% of the raw disk capacity and means a single disk failure does not necessitate a full node rebuild. (See the cassandra.yaml sketch after this list.)
* disk failure: set the disk_failure_policy to best_effort or stop so the node can handle disk failure: https://github.com/apache/cassandra/blob/cassandra-1.2/conf/cassandra.yaml#L125
* have good networking in place so you can rebuild a failed node, either completely or from a failed disk.
* use vnodes so that as the number of nodes grows, the time to rebuild a failed node drops.
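Pulling those together, a sketch of the relevant cassandra.yaml settings (the mount points are hypothetical, and num_tokens: 256 is just the usual vnodes starting point):

    # JBOD: one entry per independently mounted disk
    data_file_directories:
        - /mnt/disk1/cassandra/data
        - /mnt/disk2/cassandra/data
        - /mnt/disk3/cassandra/data

    # best_effort: keep serving from the remaining disks;
    # stop: shut down thrift and gossip on disk failure
    disk_failure_policy: best_effort

    # enable vnodes so a rebuild streams from many nodes at once
    num_tokens: 256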
I would be a little uneasy about very high node loads with only three nodes. The main concern is how long it will take to replace a node that completely fails.
I've also seen people have a good time moving from SSD to 12 fast disks in a RAID10 config.
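For reference, building that sort of array on Linux with mdadm looks something like this (a sketch; device names and the mount point are illustrative only):

    # create a 12-disk RAID10 array, make a filesystem, mount it
    sudo mdadm --create /dev/md0 --level=10 --raid-devices=12 /dev/sd[b-m]
    sudo mkfs.ext4 /dev/md0
    sudo mount /dev/md0 /var/lib/cassandra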
You can mix HDDs and SSDs and have some hot CFs on the SSD and others on the HDD.
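This is filesystem-level plumbing rather than a built-in feature. A sketch (keyspace and CF names are hypothetical, and the node must be stopped while the files move):

    sudo service cassandra stop
    # move the hot CF's data directory onto the SSD mount,
    # then symlink it back into place
    mkdir -p /mnt/ssd/cassandra/data/myks
    mv /mnt/hdd/cassandra/data/myks/hot_cf /mnt/ssd/cassandra/data/myks/
    ln -s /mnt/ssd/cassandra/data/myks/hot_cf /mnt/hdd/cassandra/data/myks/hot_cf
    sudo service cassandra start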
Hope that helps.
-----------------
Aaron Morton
New Zealand
@aaronmorton
Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
On 1/11/2013, at 10:01 am, Franc Carter wrote:
>
> I can't comment on the technical question, however one thing I learnt with managing the growth of data is that the $/GB tends to drop at a rate that can absorb a moderate proportion of the increase in cost due to the increase in the size of the data. I'd recommend having a wet-finger-in-the-air stab at projecting the growth in data sizes versus the historical trend in the decrease in the cost of storage.
>
> cheers
>
> On Fri, Nov 1, 2013 at 7:15 AM, Dave Cowen wrote:
> Hi, all -
>
> I'm currently managing a small Cassandra cluster: several nodes with local SSD storage.
>
> It's difficult for us to forecast the growth of the Cassandra data over the next couple of years for various reasons, but it is virtually guaranteed to grow substantially.
>
> During this time, there may be times where it is desirable to increase the amount of storage available to each node but, assuming we are not I/O bound, keep from expanding the cluster horizontally with additional nodes that have local storage. In addition, expanding with local SSDs is costly.
>
> My colleagues and I have had several discussions of a couple of other options that don't involve scaling horizontally or adding SSDs:
>
> 1) Move to larger, cheaper spinning-platter disks. However, when monitoring the performance of our cluster, we see sustained periods - especially during repair/compaction/cleanup - of several hours where there are >2000 IOPS. It will be hard to get to that level of performance in each node with spinning-platter disks, and we'd prefer not to take that kind of performance hit during maintenance operations.
>
> 2) Move some nodes to a SAN solution, ensuring that there is a mix of storage, drives, LUNs and RAIDs so that there isn't a single point of failure. While we're aware that this is frowned on in the Cassandra community due to Cassandra's design, a SAN seems like the obvious way of being able to quickly add storage to a cluster without having to juggle local drives, and it provides a level of performance between local spinning-platter drives and local SSDs.
>
> So, the questions:
>
> 1) Has anyone moved from SSDs to spinning-platter disks, or managed a cluster that contained both? Do the numbers we're seeing exaggerate the performance hit we'd see if we moved to spinners?
>
> 2) Have you successfully used a SAN or a hybrid SAN solution (some local, some SAN-based) to dynamically add storage to the cluster? What type of SAN have you used, and what issues have you run into?
>
> 3) Am I missing a way of economically scaling storage?
>
> Thanks for any insight.
>
> Dave
>
> --
> Franc Carter | Systems architect | Sirca Ltd
> franc.carter@sirca.org.au | www.sirca.org.au
> Tel: +61 2 8355 2514
> Level 4, 55 Harrington St, The Rocks NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215