Make flush decisions per column family

Description

Today, the flush decision is made using the aggregate size of all column families. When large and small column families co-exist, this causes many small flushes of the smaller CF. We need to make per-CF flush decisions.

Activity

Jean-Daniel Cryans added a comment - 25/Oct/10 21:20

I have been thinking about this one for some time... I think it makes sense in loads of ways, since a common problem of multi-CF setups is that during the initial import the user ends up with thousands of small store files because some family grows faster and triggers the flushes, which in turn generates incredible compaction churn. On the other hand, it means that we almost treat a family as a region, e.g. one region with 3 CFs can have up to 3x64MB in the memstores.

Karthik Ranganathan added a comment - 26/Oct/10 20:03

Yes, agreed that the memory implication is different.

Eventually, is it not better to enforce the memory limit by using a combination of flush sizes and restricting the number of regions we create? Ideally we should allow different flush sizes for the different CFs, as the KV sizes could be very different...

Shall I just make this an option in the conf for now, with the default the way it is?

Nicolas Spiegelberg added a comment - 18/Jan/11 18:31

Some interesting stats. We did some rough calculations internally to see what effect an uneven distribution of data across column families was having on our network IO. Our data distribution for 3 column families was 1:1:20. When we looked at the flush:minor-compaction ratio for each of the store files, the large column family had a 1:2 ratio, but the small CFs both had a 1:20 ratio! We are looking at roughly a 10% network IO decrease if we can bring those other 2 CFs down to a 1:2 ratio as well.

Nicolas Spiegelberg added a comment - 19/Jan/11 01:22

This is a substantial refactoring effort. My current development strategy is to break this down into 4 parts. Each one will have a diff + review board so you guys don't get overwhelmed...
1. Move flushcache() from Region -> Store; have Region.flushcache loop through the Store API.
2. Move locks from Region -> Store; figure out the flush/compact/split locking strategy.
3. Refactor HLog to store per-CF seqnum info.
4. Refactor MemStoreFlusher from regionsInQueue to storesInQueue.
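
A rough sketch of what step 1 might look like. All names here (StoreFlush, RegionFlush) are hypothetical stand-ins, not the actual HRegion/Store API:

```java
import java.util.List;

// Hypothetical stand-in for the per-Store flush API of step 1; the real
// HBase classes are HRegion and Store, with different signatures.
interface StoreFlush {
    boolean needsFlush();              // per-CF decision, e.g. memstore size
    long flushCache(long sequenceId);  // flush only this store; returns bytes flushed
}

class RegionFlush {
    private final List<StoreFlush> stores;

    RegionFlush(List<StoreFlush> stores) {
        this.stores = stores;
    }

    // Instead of snapshotting every memstore at once, loop over the stores
    // and flush only the ones that ask for it.
    long flushCache(long sequenceId) {
        long flushedBytes = 0;
        for (StoreFlush store : stores) {
            if (store.needsFlush()) {
                flushedBytes += store.flushCache(sequenceId);
            }
        }
        return flushedBytes;
    }
}
```

The point of the loop shape is that each Store makes its own decision, which is what the rest of the thread (locking, per-CF seqnums) has to support.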

Nicolas Spiegelberg added a comment - 19/Jan/11 03:39

@ryan: the main work in step #3 isn't HBASE-2856 work. It's roughly modifying HLog.lastSeqWritten from Map<region, long> => Map<store, long>, plus all the refactoring of the HLog code that this entails.

Nicolas Spiegelberg added a comment - 21/Jan/11 01:21

While our row-level ACID guarantees for gets/scans are slightly broken in trunk (HBASE-2856), per-CF flushes will severely aggravate this problem, because a single user-level Put could span both the current memstore and the snapshot memstore depending upon which CF the put shard resides in. Halting progress on this issue until that JIRA is completed.

Schubert Zhang added a comment - 31/Jan/11 15:34

This JIRA is very useful in practice.

In HBase, the horizontal partitions by rowkey range make regions, and the vertical partitions by column family make stores. Together, these horizontal and vertical partitioning schemes define a rectangular block of data: the store in HBase.

The memstore is based on the store, so flush and compaction should also be based on the store. The memstoreSize in HRegion should be in HStore.

For flexible configuration, I think we should be able to configure the memstore size (i.e. hbase.hregion.memstore.flush.size) at the column-family level (when creating a table). And if possible, I want the maxStoreSize to also be configurable per column family.

Mubarak Seyed added a comment - 17/Feb/12 23:42

@Nicolas,
Is there any update on this issue? We have a production use case wherein 80% of the data goes to one CF and the remaining 20% goes to two other CFs. I can collaborate with you if you are interested in pursuing a patch. Thanks.

Nicolas Spiegelberg added a comment - 18/Feb/12 00:21

@Mubarak: I think you are probably more interested in tuning the compaction settings. The initial reason for this JIRA was higher network IO. The actual problem was that the minimum unconditional compaction size was too high and caused bad compaction decisions. We fixed this by lowering the min size from the default of the flush size (256MB, for us) to 4MB.

<property>
  <name>hbase.hstore.compaction.min.size</name>
  <value>4194304</value>
  <description>
    The "minimum" compaction size. All files below this size are always
    included into a compaction, even if outside compaction ratio times
    the total size of all files added to compaction so far.
  </description>
</property>

We identified this a while ago and I thought we were going to change the default for 0.92, but it looks like it's still in the Store.java code. A better use of your time would be to verify that this reduces your IO and write up a JIRA to change the default.
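
To illustrate the rule that property describes, here is a loose, simplified sketch of the selection logic. The real code in Store.java sorts files by sequence id and handles many more cases; the class and method names here are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of hbase.hstore.compaction.min.size: files at or
// below minSize are always taken; bigger files only if they fit within
// ratio * (total size selected so far). Not the actual Store.java logic.
public class CompactSelectionSketch {
    public static List<Long> select(List<Long> fileSizes, long minSize, double ratio) {
        List<Long> picked = new ArrayList<>();
        long sumSoFar = 0;
        for (long size : fileSizes) {
            if (size <= minSize || size <= ratio * sumSoFar) {
                picked.add(size);
                sumSoFar += size;
            }
        }
        return picked;
    }
}
```

With a low minSize like 4MB, the many tiny flush files from a small CF get swept up cheaply, while the large flush files of the big CF are no longer unconditionally rewritten.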

stack added a comment - 19/Feb/12 04:39

Thanks @Nicolas (and thanks @Mubarak – sounds like something to indeed get into 0.92).

At the same time, I'd think this issue is still worth some time; if there are lots of CFs and only one is filling, it's silly to flush the others as we do now just because one is over the threshold.

Lars Hofhansl added a comment - 20/Feb/12 00:15

@Nicolas: Interesting bit about hstore.compaction.min.size. I'm curious: is 4MB something that works specifically for your setup, or would you generally recommend setting it that low? It probably depends on whether compression is enabled, how many CFs there are and their relative sizes, etc.

Maybe instead of defaulting it to the flush size, we could default it to flushsize/2 or flushsize/4...?

stack added a comment - 20/Feb/12 22:20

@Nicolas I wonder about this hbase.hstore.compaction.min.size. When we compact, don't we have to take adjacent files as part of our ACID guarantees? Wouldn't this frustrate that? (I'll take a look... tomorrow.) I'm wondering because I want to figure out how to make it so we favor reference files, so they are always included in a compaction.

Lars George added a comment - 21/Feb/12 20:49

bq. At the same time, I'd think this issue is still worth some time; if there are lots of CFs and only one is filling, it's silly to flush the others as we do now just because one is over the threshold.

I thought so too. Setting hbase.hstore.compaction.min.size to 4MB while having the flush size at 256MB means you will never compact flushed files larger than 4MB. So, in other words, only if you are flushing small files (say, from a small, dependent column family) are you running a minor compaction on them. For the larger family you typically do not run those at all, right?

This surely seems like a setting specific to this use case, and there are others that need a slightly different setting. If you mix those two on the same cluster, then having only one global setting to adjust this seems restrictive? Should this be a setting per table, like the flush size?

It still seems to me that decoupling is what we should have available as well. But I thought about it for a while and also discussed this with various people: it seems that decoupling brings its own set of issues; for example, you might end up with too many HLog files because the small family is flushed only rarely.

Lars Hofhansl added a comment - 22/Feb/12 18:57

@Lars G: Now I am perfectly confused. The description says that "All files below this size are always included into a compaction". I had assumed this setting is there to quickly get rid of the smaller store files. Did I misunderstand?

Nicolas Spiegelberg added a comment - 22/Feb/12 19:03

@Lars/Stack: note that the number of StoreFiles necessary to store N amount of data is O(log N) with the existing compaction algorithm. This means that setting the compaction min size to a low value will not result in significantly more files. Furthermore, what's hurting performance is not the number of files but the size of each file. The extra files will be very small and take up only a minority of the space in the LRU cache. Every time you unnecessarily compact files, you have to repopulate that StoreFile in the LRU cache and incur a lot of disk reads in addition to the obvious write increase. This is all to say that I would recommend defaulting it that low, because the downsides are very minimal and the benefit can be substantial IO gains.

bq. At the same time, I'd think this issue is still worth some time; if there are lots of CFs and only one is filling, it's silly to flush the others as we do now just because one is over the threshold.

Why is this silly? With cache-on-write, the data is still cached in memory. It's just migrated from the MemStore to the BlockCache, which has comparable performance. Furthermore, BlockCache data is compressed, so it then takes up less space. Flushing also minimizes the number of HLogs and decreases recovery time. Flushing would be bad if it meant we weren't optimally using the global MemStore size, but we currently are.

bq. This surely seems like a setting specific to this use case, and there are others that need a slightly different setting. If you mix those two on the same cluster, then having only one global setting to adjust this seems restrictive? Should this be a setting per table, like the flush size?

I think this is a better default, not a one-size-fits-all setting. I agree that this should be toggleable on a per-CF basis, hence HBASE-5335.
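
The O(log N) claim can be illustrated with a toy model: if compaction keeps store-file sizes roughly geometric (each file about twice the next smaller one), the file count grows only logarithmically with the data size. This is purely an illustration of the growth rate, not the actual compaction algorithm:

```java
// Toy model: store totalBytes of data in files of geometrically increasing
// size (1x, 2x, 4x, ... the smallest file). Returns how many files that takes.
public class StoreFileCount {
    public static int filesForData(long totalBytes, long smallestFile) {
        int files = 0;
        long size = smallestFile;
        long stored = 0;
        while (stored < totalBytes) {
            stored += size;
            size *= 2;
            files++;
        }
        return files;
    }
}
```

Under this model, filesForData(1L << 30, 1L) needs only 31 files for a gigabyte of data, which is why lowering the unconditional min size adds few extra files.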

stack added a comment - 23/Feb/12 06:30

@Nicolas I think I follow. I opened HBASE-5461. Let me try it.

bq. Why is this silly?

Because I was seeing a plethora of small files as a problem, but given your explanation above, I think I grok that it's not the many small files that's the problem; it's that with the way-high min size, our selection was too inclusionary and so we ended up doing loads of rewriting.

Gaurav Menghani added a comment - 24/Oct/13 19:53

The basic idea is to maintain the smallest LSN among the edits present in a particular column family's memstore. When we decide to flush a set of memstores, we find the smallest LSN among the memstores that we are not flushing, say X, and we can then remove the logs for any edits with LSN less than X. We choose a particular memstore to be flushed if it occupies more than 't' bytes, where the global memstore size threshold is 'T' (and t/T = 1/4 in our configuration). If there is no memstore with >= t bytes but the total size of all the memstores is above T, we flush all the memstores.
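
That description boils down to something like the following sketch. Names and shapes are hypothetical; the actual patch lives in the 89-fb branch:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the selective-flush decision described above:
// flush any family whose memstore exceeds t; if none does but the total
// exceeds T, flush everything. Separately, logs whose edits all have LSN
// below the smallest LSN of the *unflushed* memstores can be removed.
public class SelectiveFlushSketch {

    public static Set<String> selectFamiliesToFlush(
            Map<String, Long> memstoreSizes, long t, long T) {
        Set<String> chosen = new TreeSet<>();
        long total = 0;
        for (Map.Entry<String, Long> e : memstoreSizes.entrySet()) {
            total += e.getValue();
            if (e.getValue() > t) {
                chosen.add(e.getKey());
            }
        }
        if (chosen.isEmpty() && total > T) {
            chosen.addAll(memstoreSizes.keySet()); // fall back to a full flush
        }
        return chosen;
    }

    public static long lowestUnflushedLsn(
            Map<String, Long> smallestLsnPerFamily, Set<String> flushing) {
        long min = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : smallestLsnPerFamily.entrySet()) {
            if (!flushing.contains(e.getKey())) {
                min = Math.min(min, e.getValue());
            }
        }
        return min; // logs with all edits below this LSN are safe to archive
    }
}
```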

Ted Yu added a comment - 25/Oct/13 00:31

Unfinished port to trunk. There is no log.appendNoSync() call in 89-fb, so that's a difference we need to resolve.

Talking with Gaurav, I think it would be nice to provide different heuristics (policies) for choosing which column families to flush when there is no single column family whose size exceeds the threshold t.

stack added a comment - 18/Dec/13 19:17

Gaurav Menghani, have you deployed this change? If so, what do you see in operation? When you talk about different heuristics, what are you thinking? Thanks, boss.

Gaurav Menghani added a comment - 21/Dec/13 01:15

Stack Yes, we have deployed this, with selective flushing disabled for now, since we haven't seen any aggregate benefits yet. The heuristics I was thinking about concern which column families to flush when no column family is above the flushing threshold. E.g. if the memstore limit is 128 MB and the flushing threshold for a CF is 32 MB, there might be a case with 7-8 CFs where none of them is above 32 MB.

In that case, there are a couple of heuristics you can choose from, like: flush the top N column families, or flush only as few column families as needed to free up 1/4 of the memstore. The main benefit I see is that the time spent compacting the smaller CFs will be much lower, since the number of files created would be much smaller. This is weighed against bigger column families being flushed earlier than before and having smaller files than without this change, but with the right heuristics we can find a good balance.
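
The two heuristics mentioned could look roughly like this. The class and method names are hypothetical; a real implementation would live in MemStoreFlusher:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketches of the two fallback heuristics mentioned above,
// used when no single family crosses the per-family flush threshold.
public class FlushHeuristicsSketch {

    // Heuristic 1: flush the N largest memstores.
    public static List<String> topN(Map<String, Long> sizes, int n) {
        List<String> families = new ArrayList<>(sizes.keySet());
        families.sort((a, b) -> Long.compare(sizes.get(b), sizes.get(a)));
        return new ArrayList<>(families.subList(0, Math.min(n, families.size())));
    }

    // Heuristic 2: flush as few families as possible (largest first)
    // to free at least targetBytes, e.g. 1/4 of the global memstore limit.
    public static List<String> freeTarget(Map<String, Long> sizes, long targetBytes) {
        List<String> chosen = new ArrayList<>();
        long freed = 0;
        for (String family : topN(sizes, sizes.size())) {
            if (freed >= targetBytes) {
                break;
            }
            chosen.add(family);
            freed += sizes.get(family);
        }
        return chosen;
    }
}
```

The trade-off Gaurav describes is visible here: topN bounds the number of flushes per cycle, while freeTarget bounds how much memory each cycle reclaims.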