Asynchronous file writes

Description

This change updates the file module so that it can do
asynchronous writes. Basically, it replies immediately
to the process asking to write something to the file,
with the position where the chunks will be written,
while a dedicated child process keeps collecting
chunks and writing them to the file (batching them
when possible). After issuing a series of write requests
to the file module, the caller can call its 'flush'
function, which blocks the caller until all the
chunks it requested to write are effectively written
to the file.

This maximizes use of the IO subsystem: for example,
while the updater is traversing and modifying the
btrees and doing CPU-bound tasks, the writes happen
in parallel.
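The description above can be sketched in a few lines. This is an illustrative Python analogue, not the actual Erlang couch_file code: `append` replies immediately with the position a chunk will occupy, a dedicated background thread drains queued chunks and writes them (batching when possible), and `flush` blocks until everything queued so far is on disk. All names (`AsyncFileWriter`, `append`, `flush`) are invented for this sketch, which assumes a single writing caller.

```python
import queue
import threading

class AsyncFileWriter:
    """Hypothetical sketch of the async-write pattern described above."""

    def __init__(self, path):
        self._fh = open(path, "wb")
        self._queue = queue.Queue()
        self._eof = 0                      # next logical write position
        self._flushed = 0                  # bytes actually written to the file
        self._cond = threading.Condition()
        threading.Thread(target=self._writer_loop, daemon=True).start()

    def append(self, chunk: bytes) -> int:
        """Queue a chunk; return immediately with its future file position."""
        pos = self._eof
        self._eof += len(chunk)
        self._queue.put(chunk)
        return pos

    def flush(self):
        """Block until every chunk queued so far has been written."""
        target = self._eof
        with self._cond:
            self._cond.wait_for(lambda: self._flushed >= target)

    def _writer_loop(self):
        while True:
            batch = [self._queue.get()]     # block for at least one chunk
            while True:                     # batch whatever else is queued
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            data = b"".join(batch)
            self._fh.write(data)            # one write call for the batch
            self._fh.flush()
            with self._cond:
                self._flushed += len(data)
                self._cond.notify_all()
```

A caller would issue a series of `append` calls, note the returned positions, and call `flush` once before reading any of that data back.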

Filipe Manana
added a comment - 22/Jan/12 13:00

I really appreciate Randall pushing for collaboration rather than expecting a single person to do it all, or letting this fall into oblivion.

I will make some updates to the branch soon, as well as repost performance benchmarks, with instructions on how to reproduce them as usual, in comparison to the latest master (the results posted months ago don't account for many improvements that came after, such as COUCHDB-1334).

Bob Dionne
added a comment - 22/Jan/12 12:25

I revisited this a bit this morning. I tried to rebase against master just to see how far it's moved; it's not too bad. couch_db had some conflicts, mostly relating to a new field, #db.updater_fd.

Given Damien's abandonment of the project, I'm not sure whether we should push on this or not. I suppose it's worth cleaning up and using if in fact the improvements are substantial.

Randall Leeds
added a comment - 22/Jan/12 02:42

Bump! I made a branch on the central git repo so you all could iterate, not so I could kill the topic.

What's left to make everyone feel good about merging this?

Randall Leeds
added a comment - 18/Nov/11 09:00

Ok. Wow.

I'm with Paul on these things: proc_lib and an exported loop function are a must.
I'm with Damien/Jason on these things: I fear a culture where things languish too much because we don't have it together, collectively, to iterate on things effectively.
I'm with Bob on this thing: I don't see why this can't sit on a branch where we can actually iterate and commit things together.

I stand alone in that: I made you all a branch. Guess what it's called!? COUCHDB-1342. Have at it!

Paul Joseph Davis
added a comment - 18/Nov/11 06:29

@Damien

For #3, the issue is that it's broken. Granted, there's lots in CouchDB that breaks Erlang idioms, but this is basic broken-window theory. We know this is broken and it's easily fixable. Why would we check in something that we know is broken?

For #4, the patch is reimplementing parts of the standard library. In Erlang there's a Right Way™ to start a process synchronously that might fail to initialize. It reduces the code that we have to write and maintain, and it's considerably more tested than the custom implementation we have here. Trying to minimize this by calling it a "style issue" is not conducive to the technical discourse.

You're also forgetting #2, which you agreed can be done. If done, it not only reduces the size of this patch in terms of total lines, but reduces the number of files that are touched, removes the need for people to scour the source for places where we might need to add calls to couch_file:flush/1, and prevents the need for devs to remember to add it in the future. It also makes the flushing more optimal in terms of decreased message-passing latency.

I'm sorry, but the culture issue here is that you continue to be subtly derogatory by deriding technical comments as "pet concerns". This is especially disheartening when I've spent so much time and effort trying to help by reviewing this patch and responding to this conversation. The sad part is that most of the things I've voiced an opinion on are easy to fix, or abstract enough that a brief discussion on the issue with a list of things to watch out for would have been enough to assuage my concerns.

This patch is less than 48 hours old, and it changes a core aspect of how we write data to disk in a database that prides itself on not losing data. What are you honestly expecting for a timetable here? It hasn't been two days, and you've already referenced speed of development and project progress multiple times in response to technical points.

You ask for my help, yet you remain dismissive of anything I'm offering. I'm sorry you're frustrated, but this is how code review works. I just recently finished maintaining a 7K-line refactor for over a month while people reviewed it, and even made changes to code that I was refactoring which I then had to merge (behavior, not code) into my branch. I understand how painful review can be, but that's how this system works and I, for one, think it makes us stronger.
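The "Right Way" Paul refers to is the synchronous-start handshake that Erlang's proc_lib:start_link and proc_lib:init_ack provide: the starter blocks until the new process either acknowledges successful initialization or reports a failure. As a hedged, language-neutral illustration of that handshake (not the proc_lib implementation itself), a Python analogue might look like this; `start_worker` and `InitError` are invented names.

```python
import queue
import threading

class InitError(Exception):
    """Raised in the starter when the worker's init fails."""

def start_worker(init_fn, loop_fn):
    """Spawn a worker; block until its init succeeds, or raise if it fails."""
    ack = queue.Queue(maxsize=1)

    def run():
        try:
            state = init_fn()          # may raise during initialization
        except Exception as exc:
            ack.put(("error", exc))    # tell the starter init failed
            return
        ack.put(("ok", None))          # tell the starter we're up
        loop_fn(state)                 # only then enter the main loop

    t = threading.Thread(target=run, daemon=True)
    t.start()
    status, err = ack.get()            # synchronous: wait for the ack
    if status == "error":
        raise InitError(err)
    return t
```

The point of the pattern is that the caller never proceeds against a half-initialized process, and initialization errors surface synchronously in the caller rather than as an asynchronous crash.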

Damien Katz
added a comment - 18/Nov/11 03:04

Paul, what I mean by "Apache users concerns" is that #3 isn't something that vanilla Apache CouchDB users deal with, but third parties who modify the code or embed it in interesting ways might (I suppose Cloudant has to deal with this). Perhaps I'm mistaken about that. I do think patches should only be concerned with the vanilla use cases in order to be considered check-in quality.

#4 is a style issue, not a correctness issue, or at least you haven't made a case that it's a correctness issue. I have no problem with you changing it to a style you prefer, but we should not expect that submitters of patches conform to an undocumented style.

There is no urgency around this patch; at Couchbase we can keep adding performance enhancements and drift our codebase further and further from Apache. I don't want to see that happen, but it only hurts the Apache project.

And I do see we have some culture problems in the Apache project. We need a culture where useful, correct, fast code is verified and checked in, and then improved incrementally. Right now we have a culture where everyone's pet concerns must be addressed before code gets checked in, which is demoralizing and slows things down, which is a very big problem the project has right now. I want your help in trying to change that.

Adam Kocoloski
added a comment - 18/Nov/11 01:25

The gen_server2 hack is designed to make gen_server:call more efficient inside the server loop. It's also ~obsolete in R14+. It's not really an issue for writer_loop and reader_loop here, because they've each got only one receive statement, with a clause for every message they expect to receive.

Paul Joseph Davis
added a comment - 18/Nov/11 01:16

Damn it, JIRA just ate a comment.

The short and skinny of it was that I just remembered that there's an (unmeasured) optimization in zip_server that I learned from gen_server2. The basic idea is to pull all messages out of the mailbox and then process them off a queue. The original idea was that it'd prevent pattern matching from running away scanning for messages, though I think writer_loop's repeated receive is a similar implementation. Might be worth a check, but I only note this because the thought occurred. Feel free to ignore.

Also, just to be clear, I'm not suggesting the use of zip_server or anything nutty like that. It just occurred to me that it might help inform the current discussion.
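The gen_server2-style optimization described above (drain every pending message from the mailbox in one pass, then process them off a local queue, so a selective receive never re-scans a long mailbox) can be sketched like this. This is a hedged Python illustration of the idea only; `drain` and `serve` are invented names, and Erlang mailboxes behave differently from `queue.Queue` in detail.

```python
import queue

def drain(mailbox: queue.Queue) -> list:
    """Pull everything currently queued into a local list in one pass."""
    pending = []
    while True:
        try:
            pending.append(mailbox.get_nowait())
        except queue.Empty:
            return pending

def serve(mailbox: queue.Queue, handle):
    """Process one drained batch of messages, in arrival order."""
    return [handle(msg) for msg in drain(mailbox)]
```

As Adam notes above, this mostly matters when a loop has multiple receive points that each scan the mailbox; a loop with a single receive covering every expected message sees little benefit.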

Paul Joseph Davis
added a comment - 18/Nov/11 00:56

@Damien

> Robert, the inflight batching of writes is limited to 1 meg per database.

No, it's up to 1 meg per file that's being written to. It's also important to note that the buffering isn't actually a passive thing like is generally done. The "buffer" is actually just the mailbox for the writer_loop process. The queued_bytes_len or whatever is just counting how much data has been submitted to the process but not yet acked, to prevent blowing the top off that mailbox.

The writer isn't really buffering anything itself; it's just leaning on Erlang's message-passing internals to be that buffer (which is quite reasonable). All the writer_loop does is accept messages and respond to the parent couch_file gen_server. If it happens to find multiple write messages in the mailbox consecutively, it'll write those in a single call to file:write/2.

I would not be at all surprised if it were shown that the bulk of the improvement from this patch is due to this specific part of the patch. For the curious, the zip_server test at [1] tests something quite similar to this setup.

[1] https://github.com/davisp/zip_server
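The coalescing behavior Paul describes (consecutive write messages in the mailbox become a single OS write call; any other message type ends the run) can be sketched as follows. This is an illustrative Python sketch only; the message shapes (`{"write": ...}` vs. anything else) and the function name `coalesce` are invented, not the patch's actual protocol.

```python
def coalesce(messages):
    """Group consecutive write payloads into single combined writes.

    Returns a list of ("write", joined_bytes) and ("other", msg) entries,
    preserving arrival order, so each "write" entry maps to one write call.
    """
    batches, current = [], []
    for msg in messages:
        if "write" in msg:
            current.append(msg["write"])   # extend the current run of writes
        else:
            if current:                    # a non-write message ends the run
                batches.append(("write", b"".join(current)))
                current = []
            batches.append(("other", msg))
    if current:                            # flush a trailing run of writes
        batches.append(("write", b"".join(current)))
    return batches
```

Under a bursty writer, most appends land in the mailbox back-to-back, so most data goes to disk in large combined writes; that is plausibly why Paul suspects this is where the bulk of the patch's improvement comes from.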

Paul Joseph Davis
added a comment - 18/Nov/11 00:38

@Damien

> However, there is another optimization coming where a raw erlang FD is used in a calling process to avoid messaging overhead (another big performance improvement in certain long operations), which will maybe make it necessary again.

I'm not sure what you mean here. Something along the lines of a file:open call in the couch_db_updater process (and couch_mrview_updater)? If so, that's an interesting idea. Seems like we could make couch_file handle that quite easily, along the lines of how file handles #prim_file vs #file (if I recall those record names correctly). This could also solve some of the fd duplication if we only need an extra fd for views that are updating.

> The concern with doubling the # non-db file descriptors is a real one. How big of a concern of this?

The thing is, I'm not certain how it'll behave. Hence why it concerns me. Is it a matter of just making sure that ulimit is set sufficiently high? How high is sufficient? If I'm running in production and I upgrade to a version of CouchDB that has this patch, can I at least guesstimate how configs might need to change? Maybe I'm being overly paranoid and it's not an issue. I dunno. Hence why it concerns me.

> The 5th concern would definitely make code more complicated for callers

I agree. I should've prefaced that bit with "I wonder if, in the future, there's a follow-up direction we can go". It only occurred to me as I was finishing that comment, so I figured I'd write it down.

> Your 3rd and 4th concerns aren't Apache user concerns, but can be easily addressed after check-in.

I have no idea what you mean by "Apache user concerns" here. If you're referring to "no one cares how the sausage is made so long as it's faster", then I'm going to have to disagree. Strongly. Saying that databases are complicated, so we shouldn't concern ourselves with code quality, is just going to leave us with a source tree in an even worse state than it already is.

And I'd like to address this argument about progress and the desires of users. This patch was submitted to JIRA yesterday. My initial review was up within 3.5 hours. This patch changes how the file abstraction works. In a database. As far as I'm concerned, development on this started yesterday at 13:28 when Jan uploaded the patch to JIRA. If you wanted things to be moving more quickly at this stage, you should have been developing this on a branch in git and asking for input from the community.

Secondly, while I understand that you're highly motivated to help users by improving performance, what does that have to do with the conversation about the technical merits of this patch? This sense of urgency, that progress must be made so let's address the issues I brought up after it's in trunk, is not a convincing argument. You could address my comments by spending thirty minutes in an editor and resubmitting the patch. Instead you're asking me to clean this up for you after it's committed.

Thirdly, every time someone asks, "Can it wait till it's on trunk?", all I hear is, "Can I ignore what you just said and commit this anyway?" If I point at something and say that it's broken, it's because I'm expecting the patch to change or an explanation of why I'm wrong. And I'm fine being wrong. It happens quite often. But this pattern of submitting patches and asking for all concerns to be addressed after the patch is in trunk is starting to get a bit annoying. If we want to adjust our policies around CTR vs RTC for larger patches, that's fine. Perhaps adding an edge branch in git that would accept all our bigger, somewhat scary commits would be beneficial. If we start doing automated package building, then users could even pull bleeding-edge code to test. But I digress.

Robert Newson
added a comment - 17/Nov/11 19:07

I saw the (hardcoded) 1 MiB limit, but my context is hundreds or thousands of such things at once. Making it configurable, or even optional, would be an improvement.

I appreciate the effort to close some of these gaps but, speaking personally and from (bitter) experience, trying to get a large piece of work into trunk with promises to fix things 'later' really bothers me. I don't see why this work can't sit on a branch while the discussion, and the enhancements, continue.

Damien Katz
added a comment - 17/Nov/11 18:52

Robert, the inflight batching of writes is limited to 1 meg per database. We've not seen it increase memory usage, but the code is written such that it could be made configurable to a smaller max batch size. If we are concerned, this is a detail I'd like to see addressed post check-in.

Damien Katz
added a comment - 17/Nov/11 18:49

I don't mean to imply that Paul, or any committer, isn't smart enough to handle a flush call. I know Paul has the smarts and talent to deal with much more complexity. What I am saying is that if a flush call requirement makes it so that someone can't work on the internals of CouchDB, then they aren't suited for core database development. Database engines are complex beasts.

Paul's point that the flush call can maybe be gotten rid of seems right. Originally, we didn't have the code that prevented the write queue getting overwhelmed, because in our product it's not possible. But I added it to make the rest of the enhancements suitable for Apache, and now it seems it could be used to prevent the reads of unflushed data. However, there is another optimization coming where a raw erlang FD is used in a calling process to avoid messaging overhead (another big performance improvement in certain long operations), which may make it necessary again. We can remove it in the meantime, but it may need to be added back in the future.

The concern with doubling the # of non-db file descriptors is a real one. How big of a concern is this? Do you have ideas how to fix it? Can we address it post check-in?

Your 3rd and 4th concerns aren't Apache user concerns, but can be easily addressed after check-in. I have no objections, but I would prefer we have a culture of small changes/environment-specific changes like that happening after check-in. That will increase the rate of progress on the project in general. If you agree, would you be willing to add those changes post check-in?

The 5th concern would definitely make code more complicated for callers, and would involve them batching a usually non-optimal amount of data. This code makes the batching automatic and parallelizes the writes, retiring batched data as fast as it can, and preventing the batching of too much data.
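The flow control discussed in this thread (count the bytes submitted to the writer but not yet acknowledged, and block further submissions once the cap is hit, so the writer's mailbox can't grow without bound) can be sketched as follows. This is a hedged illustration, not the patch's code; `InflightLimiter` and its methods are invented names, and the 1 MiB default mirrors the hardcoded limit mentioned above.

```python
import threading

MAX_INFLIGHT = 1024 * 1024  # 1 MiB, the hardcoded limit discussed above

class InflightLimiter:
    """Caps bytes submitted to the writer but not yet acknowledged."""

    def __init__(self, limit=MAX_INFLIGHT):
        self._limit = limit
        self._inflight = 0
        self._cond = threading.Condition()

    def submit(self, nbytes):
        """Block until nbytes fits under the in-flight cap, then reserve it."""
        with self._cond:
            self._cond.wait_for(lambda: self._inflight + nbytes <= self._limit)
            self._inflight += nbytes

    def ack(self, nbytes):
        """Writer acknowledges nbytes written; wake any blocked submitters."""
        with self._cond:
            self._inflight -= nbytes
            self._cond.notify_all()
```

Making `limit` a constructor argument is the "configurable max batch size" change Robert asks for: one parameter, no change to the submit/ack protocol.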

Robert Newson
added a comment - 17/Nov/11 12:29

@Damien, thanks for the clarification. It's possible to read what you originally wrote as an intention to dismiss concerns over introducing further complexity as long as it improved performance. Since Paul has now explicitly described several technical problems with the patch, I'm sure we can all move on to addressing them. I'll only add that I had read the entire page on voting and didn't feel that the section you highlighted applied in this case, which is why I didn't mention it myself.

I would like to know why couch_file now needs two file descriptors to provide asynchronous writes, though. If it's integral, I'd appreciate knowing why, and whether there are alternatives. If not, a separate ticket seems appropriate. The only thing I can think of is the inability to usefully pass a handle to a file opened with [raw] between processes. In any case, doubling the consumption of precious server resources should be called out explicitly, rather than incidentally, in my opinion.

Am I also right in thinking that the improved write performance entails increased memory usage (and therefore increased GC too)? What's the impact of that on very busy servers?

Paul Joseph Davis
added a comment - 17/Nov/11 01:58

Also, I'd point out that I've not so much as implied that this patch shouldn't go in. The patch came to JIRA, I reviewed it, and I've been trying to discuss the technical aspects of it. Somehow that got dragged off into the weeds about voting and other silly things. I'd prefer if we could avoid spending time squabbling like raccoons in a dumpster. There's a good idea here, and we should be spending our time working on the engineering to make it happen.

That's an awful lot of disappointment packed into a single comment. First, resorting to an ad hominem attack to insinuate that I'm not intelligent enough to work on databases is quite disconcerting. Secondly, its an egregious fallacy to suggest that because a patch appears to be technically correct that it should be committed. Thirdly, declaring what is and isn't a valid reason to hold up a patch is not how the ASF works.

And now back to the regularly scheduled technical discussion.

First, couch_file:flush/1. Unless I'm missing something extremely subtle here, it's existence is so that clients can read their own writes. Yet the couch_file gen_server has all the knowledge it needs to know if it has to flush to service a write call. If the requested read position is between #file.eof and #file.eof + #file.queued_write_bytes, then it can call flush and move on with its life. Not only does this mean that clients don't have remember to call flush, but it removes unnecessary message passing that every unconditional call to flush would generate.

Second, this is doubling the number of file descriptors required for anything that isn't a database. On the first production machine I checked that's an increase of 75% from 40K to 70K file descriptors. That's a fairly serious change that ought to be discussed. At the very least it ought to be mentioned somewhere so ops teams know to expect it.

Third, this is spawning long lived processes that aren't looping on exported functions. After two code upgrades this would crash every couch_file in the VM simultaneously.

Fourth, as I've mentioned numerous times before, the proper way to synchronously start a process that might fail to initialize is to use proc_lib:start_link and proc_lib:init_ack.
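For reference, a minimal sketch of the proc_lib pattern being referred to, combined with a loop that recurses through an exported function so hot code upgrades don't kill it (the previous point). The function names and message shape are assumptions, not the patch's actual API.

```erlang
%% The parent blocks in proc_lib:start_link/3 until the child either
%% calls proc_lib:init_ack/2 or dies, so init failures surface
%% synchronously to the caller.
start_writer(Filepath) ->
    proc_lib:start_link(?MODULE, writer_init, [self(), Filepath]).

writer_init(Parent, Filepath) ->
    case file:open(Filepath, [append, raw, binary]) of
    {ok, Fd} ->
        proc_lib:init_ack(Parent, {ok, self()}),
        writer_loop(Fd);
    {error, Reason} ->
        exit(Reason)   %% the caller's start_link returns {error, Reason}
    end.

writer_loop(Fd) ->
    receive
    {chunk, Bin} ->
        ok = file:write(Fd, Bin),
        %% Fully qualified call: the loop picks up new code on upgrade,
        %% avoiding the "crash after two code upgrades" failure mode.
        ?MODULE:writer_loop(Fd)
    end.
```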

Fifth, has anyone considered using a write buffer outside of the couch_file API that would allow clients more precise control? For instance, thinking briefly on the view updater, you could buffer writes for a single add_remove call. This also opens the possibility that mostly-read views aren't needlessly holding a writer fd open.
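One hedged reading of this fifth point, as a caller-owned buffer the view updater could flush once per add_remove call. All names here are hypothetical illustrations, not anything from the actual patch.

```erlang
%% A tiny accumulate-then-write buffer: the client batches binaries
%% and hands the file a single iolist per flush, keeping batching
%% policy in the caller rather than inside couch_file.
-record(wbuf, {fd, acc = [], size = 0, limit = 1048576}).

wbuf_append(#wbuf{acc = Acc, size = Sz, limit = Lim} = B, Bin)
        when Sz + byte_size(Bin) < Lim ->
    {ok, B#wbuf{acc = [Bin | Acc], size = Sz + byte_size(Bin)}};
wbuf_append(B, Bin) ->
    {ok, B2} = wbuf_flush(B),        %% over the limit: drain first
    wbuf_append(B2, Bin).

wbuf_flush(#wbuf{fd = Fd, acc = Acc} = B) ->
    ok = file:write(Fd, lists:reverse(Acc)),
    {ok, B#wbuf{acc = [], size = 0}}.
```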

Paul Joseph Davis
added a comment - 17/Nov/11 01:41 @Damien

"To prevent vetos from being used capriciously, they must be accompanied by a technical justification showing why the change is bad (opens a security exposure, negatively affects performance, etc. ). A veto without a justification is invalid and has no weight."

No one is suggesting one person wants this change and will force it in. I am saying that we don't refuse progress simply because someone finds it more complicated, especially when it offers what users want: much better performance.

Damien Katz
added a comment - 16/Nov/11 23:41 - edited Robert, from the next paragraph:

Technical comments notwithstanding, saying "We won't be holding up this patch because it seems more complicated than before. It's about what users want." is a little bold. As I understand the Apache rules, if Paul feels strongly he can vote -1 which is a veto.

Robert Newson
added a comment - 16/Nov/11 22:08
From http://www.apache.org/foundation/voting.html:
"Votes on code modifications follow a different model. In this scenario, a negative vote constitutes a veto , which cannot be overridden."
Any suggestion that one person can force in a controversial change to an Apache project needs to be challenged.

This is about write performance of database internals, which this improves dramatically. If database devs can't deal with a flush call after some writes, then they probably shouldn't be working on databases. Unless someone has issues with the correctness (note, writes also get buffered in other places as well; fsync is the only thing that ensures they hit disk), this is what USERs want: better overall performance. This improves view index creation and write throughput greatly.

Databases are complicated, making them perform well is hard. We won't be holding up this patch because it seems more complicated than before. It's about what users want.

Damien Katz
added a comment - 16/Nov/11 21:32

> I disagree fairly strongly on this point. couch_file is about as core of an API as it gets. couch_file:flush/1 shouldn't even exist as far as I'm concerned. This isn't a mediocre API that should be improved later, it's a bad API that shouldn't go in at all. It's a 100% foot-gun.

The new issue would definitely be a blocker for the next release. I don't see a problem with this, but I'm happy to iterate the patch in JIRA as well.

> I don't need another patch to know that this one is complicated and could be less complicated.

Earlier versions of this patch did use gen_servers, and they were neither less complicated nor did they give the desired performance improvements, so we moved past them. They add a lot of overhead for such a low-level module. So yes, I think we need an alternative patch to see whether it would be simpler and as useful.

> You're pointing at code that's about three abstractions away from couch_file's writer loop. Suffice it to say, erlang:monitor/2 on a random process in the write path does little to assuage my fears that we're entering dangerous territory relying on our own file writing buffers.

Not sure where this is three abstractions away, but I'll have to defer to the experts, Filipe and Damien, here.

Jan Lehnardt
added a comment - 16/Nov/11 20:49

> These are deliberate. Currently we are maintaining an Fd/couch_file pair for reads and writes in a bit of a clumsy way in the #btree.fd record and the couch_file module and do a switcheroo for writes vs. reads. Now all of that is moved into couch_file and couch_db doesn't have to concern itself with the details of efficient file access.

Oh I hadn't noticed that weirdness before, but I see it now.

> Good point, API-wise this is a bit leaky and we should definitely look into making this better, but I don't think this blocks the bulk of the improvements that this patch introduces. I'm happy to open a new ticket about this. Good starting points are calling flush() inside couch_file 1) after writing a header and 2) if a read fails with

{error, eof}

(fwiw, Damien tried the latter before but removed it again, the details elude me).
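The second starting point might look like the following read wrapper, sketched under the assumption of a couch_file:flush/1 as in the patch; do_pread_bin/2 is a placeholder for the real low-level read.

```erlang
%% Retry a read once after flushing: an {error, eof} caused by data
%% still queued in the writer heals transparently, while a genuine
%% past-end-of-file read still fails on the second attempt.
pread_bin(Fd, Pos) ->
    case do_pread_bin(Fd, Pos) of
    {error, eof} ->
        ok = couch_file:flush(Fd),
        do_pread_bin(Fd, Pos);
    Else ->
        Else
    end.
```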

I disagree fairly strongly on this point. couch_file is about as core of an API as it gets. couch_file:flush/1 shouldn't even exist as far as I'm concerned. This isn't a mediocre API that should be improved later, it's a bad API that shouldn't go in at all. It's a 100% foot-gun.

> The difference is that delayed_write does buffering whereas we just want to keep writing to the file all the time. delayed_write would not write until the buffer is full or the timeout is hit. In addition, we wouldn't be able to tell for delayed_commits=false writes when data hit the file reliably.

Oh right, I remember that discussion about clearing the buffer being an issue now. Good call.

> The whole point of this patch is to make couch_file smarter and move the reader/writer abstraction further down the stack and reap the associated performance benefits. Unless we have an alternate patch, we can't really compare this.

Code Review 2.0:

Q: "This looks complicated, can we think about trying to simplify some of the organization?"
A: "No. Write it yourself if you care that much."

I don't need another patch to know that this one is complicated and could be less complicated.

> This doesn't change the error scenarios. We already rely on monitoring to tell the request process a couch_file write didn't work

You're pointing at code that's about three abstractions away from couch_file's writer loop. Suffice it to say, erlang:monitor/2 on a random process in the write path does little to assuage my fears that we're entering dangerous territory relying on our own file writing buffers.

Paul Joseph Davis
added a comment - 16/Nov/11 19:06

> This patch looks like it's a mishmash of a couple patches that are applied slightly differently. Not sure if that's a branch thing or what. But for instance, there's a lot of id_btree(Db) -> Db#db.couch type changes.

These are deliberate. Currently we are maintaining an Fd/couch_file pair for reads and writes in a bit of a clumsy way in the #btree.fd record and the couch_file module and do a switcheroo for writes vs. reads. Now all of that is moved into couch_file and couch_db doesn't have to concern itself with the details of efficient file access.

> There's also a switch between couch_file:sync(Filepath) and couch_file:sync(Fd). I'm not sure if that's on purpose or not because we switched to Filepath on purpose. I see it as being possible but I don't see a reference to it.

We used to pass in the filepath to allow the fsync() to be async so it wouldn't block any readers or writers. That behaviour is now obsolete, as couch_file will now pass the fsync() request to its writer child process.
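A rough sketch of what that relay could look like inside the couch_file gen_server; the writer field and the message shape are assumptions for illustration only.

```erlang
%% sync/1 no longer reopens the file by path: the gen_server forwards
%% the request to its writer, which replies via gen_server:reply/2
%% only after draining its queue and running file:sync/1, so the
%% fsync happens after all queued chunks are on disk and the main
%% process never blocks on it.
sync(Fd) ->
    gen_server:call(Fd, sync, infinity).

handle_call(sync, From, #file{writer = Writer} = File) ->
    Writer ! {sync, From},
    {noreply, File}.
```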

> I'm really not a fan of exposing the inner workings of couch_file to every client as can be seen by the massive number of flush calls that this creates. I find it quite likely that we'll eventually end up missing one of these flush calls and causing a nasty bug because the error would look like data hadn't been written. I'd want to see things rewritten so that the usage of couch_file doesn't change. Easiest solution I see would be to automatically flush buffers if a read request comes in for something that's not on disk.

Good point, API-wise this is a bit leaky and we should definitely look into making this better, but I don't think this blocks the bulk of the improvements that this patch introduces. I'm happy to open a new ticket about this. Good starting points are calling flush() inside couch_file 1) after writing a header and 2) if a read fails with

{error, eof}

(fwiw, Damien tried the latter before but removed it again, the details elude me).

> I'm pretty sure we talked about this at one point, but can someone describe the differences between this and Erlang's delayed_write mode for files?

The difference is that delayed_write does buffering whereas we just want to keep writing to the file all the time. delayed_write would not write until the buffer is full or the timeout is hit. In addition, we wouldn't be able to tell for delayed_commits=false writes when data hit the file reliably.
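For comparison, this is the built-in mode being ruled out. With {delayed_write, Size, Delay}, data sits in a driver-side buffer until Size bytes accumulate or Delay milliseconds pass, rather than streaming to disk continuously; the filename here is just a stand-in.

```erlang
%% Standard OTP delayed_write: buffer up to 512 KiB or 2 seconds.
{ok, Fd} = file:open("test.couch",
                     [append, raw, binary,
                      {delayed_write, 512 * 1024, 2000}]),
ok = file:write(Fd, <<"chunk">>),  %% may linger in the buffer
ok = file:close(Fd).               %% close (or a sync) forces it out
```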

> couch_file becomes much more complicated with this patch. I'd prefer to see it simplified if at all possible. I'm not entirely certain how it'd look but I might start by making a couch_file_reader and couch_file_writer gen_servers instead of having bare processes in couch_file.

The whole point of this patch is to make couch_file smarter and move the reader/writer abstraction further down the stack and reap the associated performance benefits. Unless we have an alternate patch, we can't really compare this.

> I have to say that this patch scares me a bit. If we make the switch to something like this then we're opening up a whole new type of failure scenario where file "writes" appear to succeed but then end up failing after the caller has moved on. While there are certainly cases where I can see this being a fine tradeoff (view updaters, compactors, etc) it worries me a bit for the main database file. The fact is that I can't (yet) reasonably consider all of the possible new failure modes and convince myself that things are correct. I've already seen some super weird reactions to things like running out of disk space; not knowing about such an error until you've "written" a megabyte of data seems awkward.

This doesn't change the error scenarios. We already rely on monitoring to tell the request process a couch_file write didn't work:

Jan Lehnardt
added a comment - 16/Nov/11 18:26 Good review Paul, thanks!
Let me quote and reply inline to each of your points:
https://github.com/fdmanana/couchdb/blob/master/src/couchdb/couch_db.erl#L819 ff. (825 in particular)

This patch looks like it's a mishmash of a couple patches that are applied slightly differently. Not sure if that's a branch thing or what. But for instance, there's a lot of id_btree(Db) -> Db#db.by_docid_btree type changes. There's also a switch between couch_file:sync(Filepath) and couch_file:sync(Fd). I'm not sure if that's on purpose or not because we switched to Filepath on purpose. I see it as being possible but I don't see a reference to it.

I'm really not a fan of exposing the inner workings of couch_file to every client as can be seen by the massive number of flush calls that this creates. I find it quite likely that we'll eventually end up missing one of these flush calls and causing a nasty bug because the error would look like data hadn't been written. I'd want to see things rewritten so that the usage of couch_file doesn't change. Easiest solution I see would be to automatically flush buffers if a read request comes in for something that's not on disk.

I'm pretty sure we talked about this at one point, but can someone describe the differences between this and Erlang's delayed_write mode for files?

couch_file becomes much more complicated with this patch. I'd prefer to see it simplified if at all possible. I'm not entirely certain how it'd look but I might start by making a couch_file_reader and couch_file_writer gen_servers instead of having bare processes in couch_file.

I have to say that this patch scares me a bit. If we make the switch to something like this then we're opening up a whole new type of failure scenario where file "writes" appear to succeed but then end up failing after the caller has moved on. While there are certainly cases where I can see this being a fine tradeoff (view updaters, compactors, etc) it worries me a bit for the main database file. The fact is that I can't (yet) reasonably consider all of the possible new failure modes and convince myself that things are correct. I've already seen some super weird reactions to things like running out of disk space; not knowing about such an error until you've "written" a megabyte of data seems awkward.

Paul Joseph Davis
added a comment - 16/Nov/11 16:46