This feature limits how long unflushed data can stay in the memstore.
By default, a memstore is flushed if it holds data older than 1h (3600000 ms).
This can be configured via hbase.regionserver.optionalcacheflushinterval (default value is 3600000).
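
As a rough illustration of how such an interval setting is consumed (the key and the 3600000 ms default are from this issue; the SimpleConf class below is a stand-in for Hadoop's Configuration, not real HBase code):

```java
import java.util.HashMap;
import java.util.Map;

public class FlushIntervalConfig {
    static final String KEY = "hbase.regionserver.optionalcacheflushinterval";
    static final long DEFAULT_MS = 3600000L; // 1 hour

    // Minimal stand-in for org.apache.hadoop.conf.Configuration#getLong
    static class SimpleConf {
        private final Map<String, String> props = new HashMap<>();
        void set(String k, String v) { props.put(k, v); }
        long getLong(String k, long def) {
            String v = props.get(k);
            return v == null ? def : Long.parseLong(v);
        }
    }

    static long flushInterval(SimpleConf conf) {
        return conf.getLong(KEY, DEFAULT_MS);
    }

    public static void main(String[] args) {
        SimpleConf conf = new SimpleConf();
        System.out.println(flushInterval(conf)); // prints 3600000 (default applies)
        conf.set(KEY, "600000");                 // operator overrides to 10 minutes
        System.out.println(flushInterval(conf)); // prints 600000
    }
}
```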

Description

A colleague of mine ran into an interesting issue.
He inserted some data with the WAL disabled; it happened to fit within the aggregate memstore memory.

Two weeks later he had a problem with the HDFS cluster, which caused the region servers to abort. He found that his data was lost. Looking at the logs, we found that the memstores had not been flushed at all during those two weeks.

Should we have an option to flush memstores periodically? There are obvious downsides to this, like many small storefiles, etc.

Lars Hofhansl
added a comment - 03/May/12 21:09 What should trigger the flush is an interesting discussion in itself. Should we flush:
after N time units of write inactivity, or
when the last flush happened more than N time units ago?
The former would avoid smaller storefiles, the latter would put a limit on how stale an entry in the memstore can be.
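
The two candidate triggers can be sketched as predicates (illustrative names, not HBase code):

```java
public class FlushTriggers {
    // Option 1: flush after N time units of write inactivity.
    static boolean inactivityTrigger(long nowMs, long lastEditMs, long idleMs) {
        return nowMs - lastEditMs >= idleMs;
    }

    // Option 2: flush when the last flush happened more than N time units ago
    // (bounds how stale any entry in the memstore can get).
    static boolean stalenessTrigger(long nowMs, long lastFlushMs, long maxAgeMs) {
        return nowMs - lastFlushMs >= maxAgeMs;
    }

    public static void main(String[] args) {
        long now = 10_000_000L;
        // Writes trickling in every minute keep the inactivity trigger quiet...
        System.out.println(inactivityTrigger(now, now - 60_000L, 300_000L));     // false
        // ...while the staleness trigger still fires once the last flush is old enough.
        System.out.println(stalenessTrigger(now, now - 3_600_000L, 3_600_000L)); // true
    }
}
```

The demo shows the trade-off from the comment: a steady trickle of writes never looks "inactive", so only the staleness trigger bounds entry age.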

Matt Corgan
added a comment - 03/May/12 22:25 Maybe add a boolean to the memstore to track if it contains edits that were not written to the WAL. No need to auto-flush in the frequent case where all edits are in the WAL.

Jean-Daniel Cryans
added a comment - 03/May/12 22:36 No need to auto-flush in the frequent case where all edits are in the WAL.
And we already roll every hour. From LogRoller:
this.rollperiod = this.server.getConfiguration().getLong("hbase.regionserver.logroll.period", 3600000);
Meaning that your data in the WAL can only be sitting there for so long.

Todd Lipcon
added a comment - 03/May/12 22:39 Maybe add a boolean to the memstore to track if it contains edits that were not written to the WAL
HBASE-5886 adds code which tracks how much un-WAL-ed data is in the memstore.
Meaning that your data in the WAL can only be sitting there for so long.
But if we retain 20 or so HLogs, and we roll only every hour, then we still have 20 hours worth of data sitting there unflushed, which might be a little strange if the cluster is entirely idle.

Matt Corgan
added a comment - 05/May/12 18:53 Periodically flushing the memstore seems like a good feature to me. Could also help clear out cold data from memory to make more room for bigger memstores on regions that are actually being used.
A different solution to the underlying data loss issue might be to have a third client setting for WAL writing: NONE, SYNC, and ASYNC. ASYNC would write the data to a memory buffer, return success to the client, and another thread would flush the buffer to the WAL. The WAL would ideally only lag a few seconds behind the memstores, but some form of throttling would probably be needed.
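
A toy version of the ASYNC idea, assuming a bounded in-memory buffer drained by a background syncer (all names here are invented; real WAL handling is far more involved):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncWalSketch {
    // The bounded queue doubles as the throttle mentioned above: when the
    // syncer falls behind, append() fails and the caller must back off.
    private final BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(1024);
    private final List<byte[]> durableLog = new ArrayList<>(); // stand-in for the WAL file

    // Client path: success is reported once the edit is buffered, not synced.
    boolean append(byte[] edit) { return buffer.offer(edit); }

    // Background syncer path: drain whatever is buffered to the "WAL".
    int syncOnce() {
        int n = 0;
        for (byte[] e; (e = buffer.poll()) != null; n++) durableLog.add(e);
        return n;
    }

    int durableCount() { return durableLog.size(); }

    public static void main(String[] args) {
        AsyncWalSketch wal = new AsyncWalSketch();
        for (int i = 0; i < 3; i++) wal.append(new byte[] {(byte) i});
        System.out.println(wal.syncOnce());     // prints 3 (edits made durable)
        System.out.println(wal.durableCount()); // prints 3
    }
}
```

The "lag a few seconds" property corresponds to how often syncOnce() runs; anything still in the buffer at crash time is lost, which is exactly the window ASYNC trades for latency.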

Lars Hofhansl
added a comment - 05/May/12 23:30 That (deferred flush) is what I told my colleague to use last week.
Would be nice if the client could control this (in addition to writeToWal, we could have writeToWalAsynchronously - or something).
A periodic memstore flush still makes sense. If I get some time next week I'll come up with a patch (unless somebody else wants to take this).

Nicolas Liochon
added a comment - 10/Jul/12 16:52 I also think that a periodic memstore flush makes sense. Even with a WAL, it seems safer/more efficient.
It seems that HBase had this a long long time ago:
<property>
<name>hbase.regionserver.optionalcacheflushinterval</name>
<value>1800000</value>
<description>
Amount of time to wait since the last time a region was flushed before
invoking an optional cache flush (An optional cache flush is a
flush even though memcache is not at the memcache.flush.size).
Default: 30 minutes (in milliseconds)
</description>
</property>
It could also be linked to major compactions (before a major compaction, flush 'old' memstore)?

Lars Hofhansl
added a comment - 18/Nov/12 04:52 I would like to pick this up again and add a flag to Mutation to indicate deferred WAL sync. If HRegion receives a batch of Mutation of which at least one is not marked as deferred the log is sync'ed. Otherwise it is deferred.
This will mingle well later with HBASE-5954 .

Enis Soztutar
added a comment - 23/Jan/13 00:37 Regardless of whether mutations have deferred sync, we might always want to flush periodically. We are rolling the WAL periodically, but if we do not flush, we may end up with a lof of hlogs to recover from in case of failover.

Devaraj Das
added a comment - 24/Jan/13 00:43 After going some back and forth, I decided to separate the issue of having the feature on asynchronous write to WAL from the periodic flush (and Enis also suggested the same in our offline discussions).
This is a work-in-progress patch that does the periodic flush based on when was the last flush done, and whether there are any edits after the last flush in any memstore in that region. Resurrected the config hbase.regionserver.optionalcacheflushinterval. Not tested yet.

Enis Soztutar
added a comment - 24/Jan/13 01:14 I would like to pick this up again and add a flag to Mutation to indicate deferred WAL sync. If HRegion receives a batch of Mutation of which at least one is not marked as deferred the log is sync'ed. Otherwise it is deferred.
I like the idea of having a deferred flush at the Put level. Now the weird thing is that it is per table, not per column family. I guess we can have per-table/per-cf or per batch deferred flush setting.
With this, maybe we will no longer need skipWAL if we can prove that deferred flush is as fast as skip WAL. Most of the time, we actually do not want to skip WAL, we just want a deferred flush.
I decided to separate the issue of having the feature on asynchronous write to WAL from the periodic flush
+1 on separating the two.

Nicolas Liochon
added a comment - 24/Jan/13 08:23 With this, maybe we will no longer need skipWAL if we can prove that deferred flush is as fast as skip WAL.
In a standard database, skipping the WAL is often used when you're doing a functional upgrade that requires some downtime, i.e.:
dump
run batch scripts to update your data
if anything goes wrong, reload the dump
For hundreds of reasons it makes much less sense with HBase, but it could happen (some companies don't need 24/7 availability). So we should not remove skipWAL imho, unless it really simplifies something internally.
On the patch itself, I have a question about adding some randomness. The scenario I'm thinking about is a massive but periodic update on a table: all the regions will be written simultaneously, hence flushed simultaneously. That's the main use case for this JIRA, and this could hammer the namenode, imho, unless we think there is enough randomness from having a different flusher per regionserver (which may not be the case if all region servers are started simultaneously).
As a side note, I would personally like a flush interval of 10 minutes:
it would help on .META. recovery, especially with the separate WAL for .META.
it would allow more regions: today, on average and in theory, each region holds 50% of an HDFS block's worth of data in memory. The more regions we flush early, the more empty memstores we have...

Lars Hofhansl
added a comment - 24/Jan/13 16:36 How can deferred log flush ever be as fast as not writing the WAL at all?
Considering only the latency of a single request that might be true in many cases, but it will definitely not be true on a busy cluster since all data is written to the disks twice.

Ted Yu
added a comment - 24/Jan/13 18:10 Where is PeriodicMemstoreFlusher instantiated?
Currently MEMSTORE_PERIODIC_FLUSH_INTERVAL is read by both HRegion and PeriodicMemstoreFlusher.
+ boolean shouldFlush() {
Can we pass the interval to the above method so that HRegion doesn't need to introduce:
+ private long flushCheckInterval;
What value of MEMSTORE_PERIODIC_FLUSH_INTERVAL would be interpreted as disabling the periodic flush?
Thanks

Devaraj Das
added a comment - 24/Jan/13 18:38 Lars Hofhansl, yeah, what Enis Soztutar meant IMHO is that the latency from the client's point of view would improve when deferred flush is used for the mutations. Also, we considered that users would most likely not want to skip the WAL if we promise them that there won't be latency issues (maybe on a busy cluster). But yeah, it'd make no difference to the overall IOPS in the cluster...
Nicolas Liochon, I generally agree with you that we should not remove the skipWal option without giving it real thought and considering more use cases. And yes, the idea of randomizing the flushes across regionservers sounds good. I'll think about how to incorporate that.
Ted Yu, good catch on the instantiation; I was focusing on getting the logic right and forgot to instantiate the chore. I'd prefer to leave the shouldFlush() signature as is (it's an implementation detail that shouldFlush uses the same constant underneath; it could very well use a different constant, or the shouldFlush implementation could change so that this constant is not used at all).

Ted Yu
added a comment - 25/Jan/13 00:24 w.r.t. randomizing the flushes across regionservers, one approach is to introduce a new znode whose data is the outstanding count of flush requests, cluster wise. We place an upper bound on this count. PeriodicMemstoreFlusher wouldn't create new request if the count is at upper bound.

Lars Hofhansl
added a comment - 25/Jan/13 00:39 That would work (using a znode). I do think it's fine to place an upper limit per regionserver, and maybe we won't need an upper limit at all.
I do like the idea of some randomness. We could stagger per memstore and add a random jitter that could be up to 1/2 (just making this up, though) of the flush interval. We'd get a new random jitter value after each flush and at memstore creation.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

Ted Yu
added a comment - 25/Jan/13 23:31
+ private Random rand = new Random();
Please use SecureRandom.
+ } catch (InterruptedException ie){
+ //ignore
Please restore interrupt status.
Should the upper bound for the sleep take the length of MemStoreFlusher.flushQueue into consideration?
When many FlushQueueEntry's pile up in flushQueue, we may want to wait longer.
Also, the sleep should be bounded by the remaining time w.r.t. cacheFlushInterval - we don't want the loop in chore() to outlast cacheFlushInterval.

Lars Hofhansl
added a comment - 26/Jan/13 00:12 Absolutely do not use SecureRandom here. We're not using this to generate cryptographic keys, just some jitter for memstore flush timing, right?
SecureRandom will exhaust your locally generated entropy, which is much better used in cases where it is actually needed (and it can hang, on Linux at least, if not enough entropy has been collected).

Lars Hofhansl
added a comment - 26/Jan/13 00:38 I think the delay should be algorithmically related to the flush interval (like interval / 3 or something; we could make the jitter factor configurable).
Could we fold the jitter into shouldFlush() rather than actually waiting in chore()?
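
Folding the jitter into the check itself might look like this minimal sketch (class and field names are invented, and the 1/3 fraction echoes the suggestion above; this is not the actual patch):

```java
import java.util.Random;

public class JitteredFlushCheck {
    private final long intervalMs;
    private final long jitterMs; // fixed per region, renewed after each flush

    // jitterFraction = 1.0/3 gives a per-region offset in [0, interval/3).
    JitteredFlushCheck(long intervalMs, double jitterFraction, Random rand) {
        this.intervalMs = intervalMs;
        this.jitterMs = (long) (rand.nextDouble() * intervalMs * jitterFraction);
    }

    // Regions draw different jitter offsets, so even regions written (and last
    // flushed) at the same moment cross the threshold at different times.
    boolean shouldFlush(long nowMs, long lastFlushMs) {
        return nowMs - lastFlushMs >= intervalMs + jitterMs;
    }

    public static void main(String[] args) {
        JitteredFlushCheck check =
                new JitteredFlushCheck(3_600_000L, 1.0 / 3, new Random());
        System.out.println(check.shouldFlush(7_200_000L, 0L)); // true: past interval + any jitter
        System.out.println(check.shouldFlush(1_800_000L, 0L)); // false: only half the interval
    }
}
```

With this shape the chore never sleeps; it just polls shouldFlush() more often than the interval and lets the per-region offset do the staggering.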

Devaraj Das
added a comment - 26/Jan/13 00:52 I think the delay should algorithmically related to the flush interval
I think what I currently have has a certain advantage: if the configured value of cacheflushinterval is too low, the chore will be triggered very often, but the sleep interval (0 - 2 minutes) would keep the number of flushes under control. But yeah, I can always enforce a minimum delay before each flush.
Could we fold the jitter into shouldFlush() rather than actually waiting in chore()?
I think that shouldFlush shouldn't be involved in determining how much to delay the flush. Do you see any issues in waiting in the chore?

Devaraj Das
added a comment - 26/Jan/13 01:06 Should upper bound for the sleep take length of MemStoreFlusher.flushQueue into consideration ?
Ted Yu, I think we don't have to worry about this one as much. The reason is that there is a random delay before each flush is inserted in the queue (as opposed to inserts coming in at a rate faster than what the flusher can handle).
Also, the sleep should be bounded by the remaining time w.r.t. cacheFlushInterval - we don't want the loop in chore() to outlast cacheFlushInterval.
This should be fine. I don't see issues with this one.

Lars Hofhansl
added a comment - 28/Jan/13 07:28 Hmm... This is a bit more difficult than I thought.
I think what we want to limit is this: The maximum time an unflushed edit will remain in the memstore. Otherwise one could trickle in edit 1 every hour and get very old data in the memstore.
(Doing that could potentially also be cheaper as we do not need to retrieve the current time on each edit, just the first one after a flush).
If that is true, then what we want track is not the time of the newest edit, but the time of oldest unflushed edit, and flush if that gets too old.
In order to avoid flushing all memstores at the same time, we want to offset the memstores flush times.
We can do it the way you have it.
(but it seems natural to me to do that at the place where we detect that the memstore needs to be flushed. For this to work the chore needs to wake up more frequently than the flush interval.)
Btw. the flush interval you have is 10 mins, not 1h.
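
The oldest-edit tracking described above could be sketched like this (a hypothetical class, not HBase code; note that only the first edit after a flush touches the clock, which is the cheapness argument made in the comment):

```java
public class OldestEditTracker {
    private static final long NO_EDITS = Long.MAX_VALUE;
    private long firstEditAfterFlushMs = NO_EDITS;

    // Write path: only the first edit after a flush records a timestamp,
    // so there is no per-edit System.currentTimeMillis() cost.
    void onEdit(long nowMs) {
        if (firstEditAfterFlushMs == NO_EDITS) firstEditAfterFlushMs = nowMs;
    }

    void onFlush() { firstEditAfterFlushMs = NO_EDITS; }

    // Flush once the oldest unflushed edit is older than maxAgeMs,
    // regardless of how recently later edits trickled in.
    boolean shouldFlush(long nowMs, long maxAgeMs) {
        return firstEditAfterFlushMs != NO_EDITS
                && nowMs - firstEditAfterFlushMs >= maxAgeMs;
    }

    public static void main(String[] args) {
        OldestEditTracker t = new OldestEditTracker();
        t.onEdit(0L);            // oldest unflushed edit at t=0
        t.onEdit(3_500_000L);    // later trickle does not reset the clock
        System.out.println(t.shouldFlush(3_600_000L, 3_600_000L)); // true
        t.onFlush();
        System.out.println(t.shouldFlush(3_600_001L, 3_600_000L)); // false: nothing unflushed
    }
}
```

This directly handles the trickle-one-edit-per-hour scenario: the age of the first edit, not the last one, decides the flush.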

Devaraj Das
added a comment - 29/Jan/13 01:57 Lars Hofhansl, I think the patch addresses the use case you mention. The shouldFlush call in HRegion first does a global check on whether there was a recent flush, and if so, returns. If not, it goes over the memstores, and if there is a memstore edit that happened after the last flush (and, due to the previous global check, implicitly after the time interval), it returns true, indicating that a flush is needed now. In general, if a memstore has at least one edit after the last flush, and the last flush happened before the time interval (let's say configured to 60 minutes), then a flush will be triggered now. The flushes are offset by a random sleep of up to 2 minutes.
The attached patch addresses a few of Ted's comments (setting the interrupt status back in the PeriodicMemstoreFlusher chore). It also puts a minimum delay of 1 sec between flushes, and sets the default flush interval to 60 minutes.
Lars Hofhansl, I would appreciate it if you could take a look and let me know if I missed anything in the patch w.r.t. the scenario you outlined.
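
The two-step check described above might be sketched as follows (invented names; a paraphrase of the description, not the actual patch):

```java
import java.util.List;

public class RegionFlushCheck {
    // Step 1: a global check against the last flush time for the region.
    // Step 2: a per-store check for any edit newer than that flush.
    static boolean shouldFlush(long nowMs, long lastFlushMs, long flushIntervalMs,
                               List<Long> perStoreLastEditMs) {
        // Global check: flushed recently enough, nothing to do.
        if (nowMs - lastFlushMs < flushIntervalMs) return false;
        // Per-store check: any edit newer than the last flush forces a flush.
        for (long lastEdit : perStoreLastEditMs) {
            if (lastEdit > lastFlushMs) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        // Last flush 90 minutes ago, one store edited since then: flush.
        System.out.println(shouldFlush(2 * hour, hour / 2, hour,
                List.of(hour, 0L)));     // true
        // Last flush 90 minutes ago, but no store edited since: no flush.
        System.out.println(shouldFlush(2 * hour, hour / 2, hour,
                List.of(hour / 4, 0L))); // false
    }
}
```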

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

Devaraj Das
added a comment - 05/Feb/13 22:31 Attaching another rev of the patch. Changed the flush policies somewhat. Here is a summary:
0. If the last flush happened in the last 60 minutes, don't request another flush now for the region in question.
1. Otherwise, a flush is requested for a region as follows:
1.1. If the region is the meta region and there is at least one edit after the last flush, request another flush now.
1.2. If the region is not one of the meta regions, a flush is requested:
1.2.1. if the last edit is more than 1 minute old, or,
1.2.2. if the region wasn't flushed in the last two flush cycles (this was Enis's suggestion).
I also added a new API in FlushRequester to make flush requests with a delay. That is used by the chore (earlier I throttled the flushes by sleeping in the chore, which seemed a little odd).
Thoughts?
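
The numbered policy above might be encoded like this (a paraphrase with invented names; the 60-minute and 1-minute thresholds are from the summary, and real HBase code would look different):

```java
public class FlushPolicy {
    static boolean shouldFlush(boolean isMetaRegion, long nowMs, long lastFlushMs,
                               long lastEditMs, long flushIntervalMs,
                               long editIdleMs, boolean missedTwoCycles) {
        // 0. Flushed within the interval: do nothing.
        if (nowMs - lastFlushMs < flushIntervalMs) return false;
        // No edits since the last flush: nothing to write out.
        if (lastEditMs <= lastFlushMs) return false;
        // 1.1. Meta region with any post-flush edit: flush now.
        if (isMetaRegion) return true;
        // 1.2.1. Writes have gone quiet for a while, or
        // 1.2.2. the region was skipped for two whole flush cycles.
        return nowMs - lastEditMs >= editIdleMs || missedTwoCycles;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L, minute = 60_000L;
        // Last flush 2h ago but the last edit is only 30s old: writes are
        // still active and the region wasn't skipped twice, so no flush yet.
        System.out.println(shouldFlush(false, 2 * hour, 0L, 2 * hour - 30_000L,
                hour, minute, false)); // false
        // Same region once the last edit is over a minute old: flush.
        System.out.println(shouldFlush(false, 2 * hour, 0L, 2 * hour - 2 * minute,
                hour, minute, false)); // true
    }
}
```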

Devaraj Das
added a comment - 06/Feb/13 18:16 I should add that making the checks as described before prevents some potentially unneeded flushes, while bounding the max duration an edit lives in the memstore...
Lars Hofhansl, could you please take a look at the patch?

Lars Hofhansl
added a comment - 24/Apr/13 03:46 I'd have to reread the patch.
The semantics that we should achieve is a maximum time for an unlogged KV to remain in the memstore (this is different from periodic flushing - I misnamed this issue).

Lars Hofhansl
added a comment - 24/Apr/13 19:19 - edited Sorry if I seem difficult with this one... How about a theme like this:
We record the time of the first edit made to the memstore after a flush. We can even improve this and only record the time of the first unlogged edit made.
Periodically we run the chore, if the recorded time of that first edit is older than a configurable X then we flush the memstore.
That would:
be simpler
clearly limit the maximum time an edit will stay in the memstore without being flushed

Devaraj Das
added a comment - 25/Apr/13 02:55
Lars, I'd appreciate it if you could kindly take a look at the patch. The patch as it stands is simple and also limits the maximum time an edit will stay un-flushed. The high-level outline is here: https://issues.apache.org/jira/browse/HBASE-5930?focusedCommentId=13571813&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13571813 (I have also put detailed comments in the shouldFlush method added in the patch).

Lars Hofhansl
added a comment - 25/Apr/13 04:26 Why is the approach in the patch better than what I have described?
I believe the approach I described is better in the following ways:
The logic is simpler
We directly measure the age of oldest edit in the memstore, which is the exact metric we want to limit
We only have to track the current time for the first KV inserted into the memstore after a flush (System.currentTimeMillis() is not free)
I'm happy to make a sample patch, then we can decide on the merit of the two patches.

Devaraj Das
added a comment - 25/Apr/13 07:27 Cool. This looks like it exactly does what you stated in your last comment. +1 (not sure why releaseaudit failed, even though the structure of the patch is essentially the same as what I submitted before). I'd like to commit this tomorrow morning unless I hear otherwise from anyone.

Devaraj Das
added a comment - 25/Apr/13 18:38 Don't think the releaseaudit warning is caused by the last patch. It seems to be there in the other builds prior to this pre-commit build as well (is HBASE-8431 the fix?). I ran the releaseaudit test locally and it passed.

Lars Hofhansl
added a comment - 25/Apr/13 19:02 Whoa. You guys are fast. This was more of a sample patch
I'll do a bit more double checking today.
I would also like to have this in 0.94. Any objections to that?

Devaraj Das
added a comment - 25/Apr/13 22:19 Thanks, Elliott for catching this. I was testing releaseaudit - randomly chose a java file and removed the header with an intent to make releaseaudit fail. Forgot to revert the change before my commit.

Enis Soztutar
added a comment - 25/Apr/13 23:06 This is causing the test at patch HBASE-2231 to fail. The reason is that we had a residual configuration in hbase-site.xml from earlier:
<name>hbase.regionserver.optionalcacheflushinterval</name>
<value>1000</value>
I also was not able to understand when we put a delayed flush request to the queue, and later another request with no delay comes, it seems we will just ignore the latter one. Shouldn't we honor the no-delay one before waiting on the delay?

Devaraj Das
added a comment - 25/Apr/13 23:21 I also was not able to understand when we put a delayed flush request to the queue, and later another request with no delay comes, it seems we will just ignore the latter one. Shouldn't we honor the no-delay one before waiting on the delay?
Makes sense to honor the no-delay one. I'll open an issue to take this into account.

Devaraj Das
added a comment - 25/Apr/13 23:26 Lars, I think the addendums fix the issues with rat and the test, but I'd like to open a separate issue on the queuing of flush requests. I'll get a patch by today (worst case tomorrow) and will do a 0.94 patch for that as well.

Devaraj Das
added a comment - 26/Apr/13 06:29 I'd like to open a separate issue on the queuing of flush requests
After some deliberation and code browsing, it seems we don't need to do this. The reasons:
1. We already have the emergency flush procedure in place (MemStoreFlusher.flushRegion takes an emergencyFlush boolean argument and is called when there is an urgent need to flush).
2. In the normal steady state, entries from the queue are processed asynchronously anyway. Flushes are done by a thread and there is no guarantee when the thread picks a flush request up. Adding a delay does not worsen this situation much, especially in combination with (1) above.
3. To address the issue raised, I was thinking that when a new flush request comes without a delay, we could remove an existing delayed-flush-request entry that was queued earlier and queue a new one without a delay. However, there is a complication: logic in MemStoreFlusher.flushRegion(final FlushRegionEntry) queues up entries with delayed flushes, and it would be impacted by that change.
All in all, this doesn't seem like an issue. Enis Soztutar also acked this in our offline discussions.
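The replacement idea in point 3 can be sketched with java.util.concurrent.DelayQueue. This is a hypothetical illustration only: FlushRequest, requestFlush, and pending are made-up names, not HBase's actual MemStoreFlusher API. A no-delay request for a region evicts a previously queued delayed request for that region, while a second delayed request for an already-queued region is ignored.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Illustrative flush-request entry; eligible once its delay has elapsed.
class FlushRequest implements Delayed {
    final String region;
    final long whenMs; // absolute time at which this request becomes eligible

    FlushRequest(String region, long delayMs) {
        this.region = region;
        this.whenMs = System.currentTimeMillis() + delayMs;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(whenMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                            other.getDelay(TimeUnit.MILLISECONDS));
    }
}

public class FlushQueueSketch {
    private final DelayQueue<FlushRequest> queue = new DelayQueue<>();

    // A no-delay request removes any pending delayed request for the same
    // region, so the flush is not held up behind the earlier delay.
    public synchronized void requestFlush(String region, long delayMs) {
        if (delayMs == 0) {
            queue.removeIf(r -> r.region.equals(region));
        } else if (queue.stream().anyMatch(r -> r.region.equals(region))) {
            return; // a request for this region is already queued
        }
        queue.add(new FlushRequest(region, delayMs));
    }

    public synchronized int pending() {
        return queue.size();
    }
}
```

The complication noted above is real even in this sketch: any code path that relies on its delayed entry staying in the queue would be affected by the eviction.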

Devaraj Das
added a comment - 01/May/13 07:03 Sorry for the delay in getting back with a review on the 0.94 patch. I will get to it tomorrow. I am fine with a new jira for the 0.94 patch or using this one itself...

Lars Hofhansl
added a comment - 11/May/13 04:55 Added release notes. I would also like to add an option to disable this feature (for example, by setting hbase.regionserver.optionalcacheflushinterval to a value <= 0).
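A minimal sketch of that disable semantics, with illustrative names (PeriodicFlushPolicy and shouldFlush are not the actual chore code): a non-positive interval switches the periodic flush off entirely; otherwise a memstore qualifies for a flush once its oldest edit is older than the interval.

```java
// Hypothetical sketch of the periodic-flush decision; the class and method
// names are illustrative, not HBase's actual implementation.
public class PeriodicFlushPolicy {
    // Default for hbase.regionserver.optionalcacheflushinterval: one hour.
    public static final long DEFAULT_INTERVAL_MS = 3_600_000L;

    // A configured value <= 0 disables periodic flushing entirely.
    public static boolean shouldFlush(long intervalMs, long oldestEditTsMs, long nowMs) {
        if (intervalMs <= 0) {
            return false; // feature disabled
        }
        return nowMs - oldestEditTsMs >= intervalMs;
    }
}
```

Checking the interval inside the decision (rather than never starting the chore) is what makes per-table or per-CF overrides possible, as discussed below.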

Lars Hofhansl
added a comment - 13/May/13 19:50 Thanks, DD (you beat me to the patch). I was more thinking about not starting the chore at all when disabled.
Then again, we do have per-table (or per-CF) configs now, right? In that case this approach would be the right one.

Devaraj Das
added a comment - 13/May/13 23:40 Lars Hofhansl, yes, the configs could be per table/CF. All right, this patch has some unit tests, as you asked for earlier. If you are okay with it, I'll commit it to trunk/0.95 and post an updated one for 0.94.