Date: Fri, 11 Jul 2008 21:21:25 -0700
From: "Joseph Liu"
To: couchdb-user@incubator.apache.org
Subject: Re: view index build time
Late to the discussion, but here's my 2 cents:

Depending on your virtualization software, disk accesses can suck. On
a "hosted" hypervisor, you have to rely on the host to schedule
your disk accesses. Disk I/O is scheduled in the guest, potentially goes
through an emulation layer in the hypervisor, and is then scheduled
again in the host. Furthermore, there can be significant latency when
switching between the host and the guest. If the disk accesses are
small and random, this can cause the slowdown you are observing.
Finally, your guest is not always scheduled in, since to the host it's
just like any other process, so the guest gets less CPU time than it
would on real hardware, and that adds to the total wall-clock time of
the computation.

I'm not saying that virtualization sucks, as it has many important uses
(e.g. VMotion), and some of these issues may be mitigated with proper
paravirtualization, but in the end you should still run benchmarks to
see if your workload is suited to the hypervisor you are considering.
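
As a starting point for such a benchmark, here is a minimal sketch that times small reads at random offsets in a scratch file. It is only an illustration: the function names, the 4 KiB block size, and the file size are my own assumptions, and it is not how CouchDB itself accesses its files. Note that the OS page cache will serve many of these reads, so for a realistic number the file should be much larger than RAM, or the cache should be dropped first. Run it once in the guest and once on real hardware and compare.

```python
import os
import random
import tempfile
import time


def make_test_file(size_bytes):
    """Create a temporary file of the given size and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reached the disk
    return path


def time_random_reads(path, num_reads=2000, block_size=4096, seed=0):
    """Return the average number of seconds per small read at a random offset."""
    rng = random.Random(seed)  # fixed seed so both machines read the same offsets
    size = os.path.getsize(path)
    max_offset = max(size - block_size, 0)
    # buffering=0 disables Python's own buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        for _ in range(num_reads):
            f.seek(rng.randint(0, max_offset))
            f.read(block_size)
        elapsed = time.perf_counter() - start
    return elapsed / num_reads


if __name__ == "__main__":
    path = make_test_file(16 * 1024 * 1024)  # 16 MiB scratch file (assumed size)
    try:
        per_read = time_random_reads(path)
        print("average time per 4 KiB random read: %.1f us" % (per_read * 1e6))
    finally:
        os.remove(path)
```
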
On Tue, Jul 8, 2008 at 6:53 AM, Brad King wrote:
> Following up on this. After moving to real hardware my view index time
> for the same data set dropped from 25 minutes to 6 minutes, so the VM
> definitely was a factor. If there are any other optimizations I can
> make, I'd love to know what they are. Thanks.
>
> On Thu, Jul 3, 2008 at 9:35 AM, Brad King wrote:
>> That would be fantastic, but it sounds like other users are seeing
>> performance similar to what I see. When you say tuning and
>> optimizations, are you talking about code changes in future versions
>> of couchdb or parameters we can change now? VM is definitely a
>> variable. I probably should try this out on real hardware too and
>> compare.
>>
>> On Wed, Jul 2, 2008 at 7:30 PM, Damien Katz wrote:
>>> This sounds really slow, like something's wrong. 25 minutes to process 300k
>>> docs means ~200 docs/sec, or 5 ms per document. That's a really long time
>>> CPU-wise.
>>>
>>> Assuming it's not another VM bug, we should be able to get that down
>>> to under a minute with some tuning, and probably closer to 10 secs after
>>> serious optimizations.
>>>
>>> -Damien
>>>
>>>
>>> On Jul 2, 2008, at 6:28 PM, Chris Anderson wrote:
>>>
>>>> On Wed, Jul 2, 2008 at 3:08 PM, Paul Davis wrote:
>>>>>
>>>>> I'd have to go back and double check, but off the top of my head 25
>>>>> min for 300K docs seems about like what I was getting. I.e., not orders
>>>>> of magnitude slower or anything.
>>>>
>>>> In my experience, views generate about 1/2 as fast as that, if not
>>>> more slowly. My views are often quite complex with a lot of internal
>>>> looping and multiple emits, so that probably explains it. In short,
>>>> the times you're reporting seem reasonable.
>>>>
>>>> The bottleneck (based on my extremely unscientific use of top) doesn't
>>>> seem to be the view server, but rather CouchDB's beam process, which
>>>> as I understand it, is busy sorting the results as they come back from
>>>> the view server. So the quickest route to parallelizing this may be to
>>>> manually partition your data across CouchDB instances, generate the
>>>> views, and query them in parallel, merging the results in your
>>>> application.
>>>>
>>>> I don't actually plan to do all that work until my insert rate
>>>> eclipses CouchDB's view generation speed. :)
>>>>
>>>> Once upon a time there was a feature to return the available results
>>>> of a view, even while generation is still occurring. The feature has
>>>> fallen by the wayside, and it would be non-trivial to turn it back on,
>>>> according to Damien on IRC. Maybe if it would be useful to enough
>>>> people, we'll see it again.
>>>>
>>>> --
>>>> Chris Anderson
>>>> http://jchris.mfdz.com
>>>
>>>
>>
>
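
For what it's worth, the "merging the results in your application" step that Chris mentions in the quoted text could look roughly like the sketch below. It assumes each CouchDB instance has already returned its view rows sorted by key, and that rows are dicts with "key" and "value" fields. The function name and the sample rows are hypothetical, and Python's default ordering is not the same as CouchDB's view collation, so this only holds for simple keys such as plain ASCII strings.

```python
import heapq


def merge_view_rows(*partitions):
    """Merge row lists that are each already sorted by key into one sorted list."""
    return list(heapq.merge(*partitions, key=lambda row: row["key"]))


# Hypothetical rows, as they might come back from two separate instances.
part_a = [{"key": "apple", "value": 1}, {"key": "cherry", "value": 3}]
part_b = [{"key": "banana", "value": 2}, {"key": "date", "value": 4}]

merged = merge_view_rows(part_a, part_b)
print([row["key"] for row in merged])
```

Because every partition is already sorted, the merge is a single linear pass and never needs to re-sort the combined result.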