couchdb-dev mailing list archives

On Nov 7, 2010, at 2:52 PM, Filipe David Manana wrote:
> On Sun, Nov 7, 2010 at 7:20 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>> On Nov 7, 2010, at 11:35 AM, Filipe David Manana wrote:
>>
>>> Also, with this patch I verified (on Solaris, with the 'zpool iostat
>>> 1' command) that when running a writes only test with relaximation
>>> (200 write processes), disk write activity is not continuous. Without
>>> this patch, there's continuous (every 1 second) write activity.
>>
>> I'm confused by this statement. You must be talking about relaximation
>> runs with delayed_commits = true, right? Why do you think you see larger
>> intervals between write activity with the optimization from COUCHDB-767?
>> Have you measured the time it takes to open the extra FD? In my tests
>> that was a sub-millisecond operation, but maybe you've uncovered
>> something else.
>
> No, it happens for tests with delayed_commits = false. The only
> possible explanation I see for the variance might be related to the
> Erlang VM scheduler decisions about when to start/run that process.
> Nevertheless, I don't know the exact cause, but the fsync run frequency
> varies a lot.
I think it's worth investigating. I couldn't reproduce it on my plain-old spinning disk MacBook
with 200 writers in relaximation; the IOPS reported by iostat stayed very uniform.
>>> For the goal of not having readers getting blocked by fsync calls (and
>>> write calls), I would propose using a separate couch_file process just
>>> for read operations. I have a branch in my github for this (with
>>> COUCHDB-767 reverted). It needs to be polished, but the relaximation
>>> tests are very positive, both reads and writes get better response
>>> times and throughput:
>>>
>>> https://github.com/fdmanana/couchdb/tree/2_couch_files_no_batch_reads
>>
>> I'd like to propose an alternative optimization, which is to keep a
>> dedicated file descriptor open in the couch_db_updater process and use
>> that file descriptor for _all_ IO initiated by the db_updater. The
>> advantage is that the db_updater does not need to do any message passing
>> for disk IO, and thus does not slow down when the incoming message queue
>> is large. A message queue much much larger than the number of concurrent
>> writers can occur if a user writes with batch=ok, and it can also happen
>> rather easily in a BigCouch cluster.
>
> I don't see how that will improve things, since all write operations
> will still be done in a serialized manner. Since only couch_db_updater
> writes to the DB file, and since access to the couch_db_updater is
> serialized, to me it only seems that your solution avoids one level
> of indirection (the couch_file process). I don't see how, when using a
> couch_file only for writes, you get the message queue for that
> couch_file process full of write messages.
It's the db_updater which gets a large message queue, not the couch_file. The db_updater
ends up with a big backlog of update_docs messages that get in the way when it needs to make
gen_server calls to the couch_file process for IO. It's a significant problem in R13B, probably
less so in R14B because of some cool optimizations by the OTP team.
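
As a rough illustration of the backlog effect described above, here is a toy
Python model (not CouchDB or OTP code). It assumes that a synchronous call
waits for its reply by scanning the mailbox from the head, so every unrelated
message queued ahead of the reply is inspected once per call. The names
Mailbox, selective_receive and io_calls_cost are hypothetical and exist only
for this sketch.

```python
from collections import deque


class Mailbox:
    """Toy model of a process mailbox; an illustrative assumption, not OTP code."""

    def __init__(self):
        self.queue = deque()
        self.scanned = 0  # total number of messages inspected so far

    def send(self, msg):
        self.queue.append(msg)

    def selective_receive(self, ref):
        """Scan from the head for ("reply", ref, payload) and remove it."""
        for index, msg in enumerate(self.queue):
            self.scanned += 1
            if msg[0] == "reply" and msg[1] == ref:
                del self.queue[index]
                return msg[2]
        raise LookupError("no matching reply in mailbox")


def io_calls_cost(backlog, io_calls):
    """Count messages inspected while making `io_calls` synchronous IO calls
    with `backlog` unrelated update_docs messages already queued."""
    mailbox = Mailbox()
    for n in range(backlog):
        mailbox.send(("update_docs", n))
    for ref in range(io_calls):
        # The reply arrives behind the whole backlog.
        mailbox.send(("reply", ref, "ok"))
        mailbox.selective_receive(ref)
    return mailbox.scanned


if __name__ == "__main__":
    print(io_calls_cost(0, 10))     # empty backlog
    print(io_calls_cost(1000, 10))  # 1000 pending update_docs messages
```

In this model the cost of each IO call grows linearly with the backlog, which
is the slowdown that doing IO directly in the db_updater, without message
passing, would avoid.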
> Also, what I did on that branch is a bit more generic, as it works for
> view index files as well, and doesn't introduce significant changes
> elsewhere except in couch_file.erl. Of course your solution might be
> extended to the view updater process as well easily, I don't have
> anything against it.
>
> Anyway, +1.
I do like that the work you did applies immediately to the view group files. Applying what
I'm proposing to the view updater would probably be easy, but not "zero lines changed" easy.
On the other hand, the problem I'm trying to avoid is a non-issue with views, since they're
never updated directly by clients.

Best,
Adam