Hello,

I'd like to collect opinions from HBase experts on query uniformity, and on whether any advanced technique currently exists in HBase to cope with the problems of query uniformity beyond just keeping the key distribution uniform.

I know we start with the statement that in order to scale queries, we need them uniformly distributed over the key space. The next advice people get is to use uniformly distributed keys. Then, the thinking goes, the query load will also be uniformly distributed among regions.

For what seems to be an embarrassingly long time, however, I was missing the point that using uniformly distributed keys does not equate to a uniform distribution of the queries, since it doesn't account for skewness of the queries over the key space itself. This skewness can be bad enough under some circumstances to create query hot spots in the cluster which could have been avoided if region splits were balanced based on query loads rather than on data size per se (a sort of dynamic query distribution sampling to equalize the load, similar to how TotalOrderPartitioner does random data sampling to build a distribution of the key skewness in the incoming data).
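The TotalOrderPartitioner analogy above can be sketched in plain Java: pick region split points at quantiles of *sampled request keys* rather than stored keys, so each region receives roughly equal query traffic. This is a hypothetical standalone sketch, not an HBase API; the class and method names are invented.

```java
import java.util.*;

// Sketch: choose region split points from a sample of request keys
// (not stored keys), so each region receives roughly equal query load.
// Analogous to TotalOrderPartitioner's input sampling; hypothetical code.
public class LoadQuantileSplitter {
    /** Returns (numRegions - 1) split points at load quantiles of the sampled request keys. */
    public static List<String> splitPoints(List<String> sampledRequestKeys, int numRegions) {
        List<String> sorted = new ArrayList<>(sampledRequestKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // The i-th boundary sits at the i/numRegions quantile of observed requests.
            splits.add(sorted.get(i * sorted.size() / numRegions));
        }
        return splits;
    }
}
```

Note that with extreme skew two quantiles can coincide on the same key, which is exactly the case (one key hot enough to dominate a region) that splitting alone cannot fix.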

To cut a long story short: is region size the only current HBase technique to balance load, especially w.r.t. query load? Or are there perhaps some more advanced techniques for that?

Thank you very much.
-Dmitriy

Dmitriy,

If I understand you right, what you're asking about might be called "Read Hotspotting". For an obvious example, if I distribute my data nicely over the cluster but then say:

for (long x = 0; x < 10000000000L; x++) {
    htable.get(new Get(Bytes.toBytes("row1")));
}

Then naturally I'm only putting read load on the region server that hosts "row1". That's contrived, of course; you'd never really do that. But I can imagine plenty of situations where there's an imbalance in query load w/r/t the leading part of the row key of a table. It's not fundamentally different from "write hotspotting", except that it's probably less common (it happens frequently in writes because ascending data in a time series or number sequence is a common thing to insert into a database).

I guess the simple answer is, if you know of non-even distribution of read patterns, it might be something to consider in a custom partitioning of the data into regions. I don't know of any other technique (short of some external caching mechanism) that'd alleviate this; at base, you still have to ask exactly one RS for any given piece of data.
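The "custom partitioning" suggestion here is usually done by pre-splitting the table at known-hot key ranges. The sketch below only builds the `byte[][]` split-key array from a set of hot prefixes; in the 0.92-era client that array is the shape `HBaseAdmin.createTable(desc, splitKeys)` accepts. The prefixes and class name are made-up examples.

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Sketch: derive explicit region split keys from read patterns you
// already know about, so hot key ranges land in their own regions.
// Hypothetical helper; only the byte[][] shape matches the HBase API.
public class ManualSplits {
    public static byte[][] splitKeys(SortedSet<String> hotPrefixes) {
        byte[][] splits = new byte[hotPrefixes.size()][];
        int i = 0;
        for (String p : hotPrefixes) {
            // Place a region boundary exactly at each known-hot prefix.
            splits[i++] = p.getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }
}
```

This only helps, of course, when the skew is known in advance and stable, which is precisely what Dmitriy says is not true in his case.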

Ian

On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:


I am talking about the situation where, even when we have uniform keys, the query distribution over them is still non-uniform and impossible to predict without sampling, and the skewness is surprisingly great (e.g. the least active and most active user may differ in activity 100 times, and there is no way one could know in advance which users are going to be active and which are not). Assuming there are a few very active users but many low-activity users, if two active users land in the same region, it creates a hotspot which could have been avoided if the region balancer took note of the number of hits the regions have been getting recently.

Like I pointed out before, such a skewness balancer could be fairly easily implemented externally to HBase (as in TotalOrderPartitioner), except that it would interfere with HBase's own balancer, so in that case it would have to be integrated with the balancer.

Another distinct problem is the time parameters of such a balance controller. The load may be changing quickly or slowly enough that the sampling must itself be time-weighted.
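One common way to get the time-weighting described here is an exponentially weighted moving average of per-region hit counts, so recent traffic dominates and stale hotspots decay. This is a standalone hypothetical sketch; a real balancer would feed it from region server metrics.

```java
import java.util.*;

// Sketch: exponentially weighted per-region hit counters. Recent
// sampling intervals dominate; old hotspots decay geometrically.
public class DecayingRegionLoad {
    private final double alpha;                      // weight of the newest interval, in (0, 1]
    private final Map<String, Double> ewma = new HashMap<>();

    public DecayingRegionLoad(double alpha) { this.alpha = alpha; }

    /** Fold one sampling interval's raw hit count into the region's time-weighted load. */
    public void record(String region, long hitsThisInterval) {
        double prev = ewma.getOrDefault(region, 0.0);
        ewma.put(region, alpha * hitsThisInterval + (1 - alpha) * prev);
    }

    public double load(String region) { return ewma.getOrDefault(region, 0.0); }
}
```

Choosing alpha is exactly the "time parameters" problem: too high and the balancer chases noise, too low and it reacts after the hotspot has already hurt.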

All these technicalities make it difficult to implement this outside HBase, or via key manipulation (the dynamic nature makes it difficult to re-assign keys to match a newly discovered load distribution).

OK, I guess there's nothing in HBase like that right now, otherwise I would've seen it in the book, I suppose...

Thanks.
-d

On Fri, May 25, 2012 at 10:42 AM, Ian Varley <[EMAIL PROTECTED]> wrote:

Yeah, I think you're right Dmitriy; there's nothing like that in HBase today as far as I know. If it'd be useful for you, maybe it would be for others, too; work up a rough patch and see what people think on the dev list.

Ian

On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:


Hi,

Jumping in on this late...

So maybe I'm missing something, but I don't see the problem.

In terms of writing data to be evenly/randomly distributed, you would hash the key (md5 or SHA-1 as examples). This works well if you're doing get()s and not a lot of scan()s.
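The hashing Michael describes is typically done by prepending a short digest prefix to the row key: point get()s can recompute the prefix from the logical key, but range scan()s over the logical order are lost. A minimal sketch, assuming a 2-byte MD5 prefix (the class name and prefix length are arbitrary choices, not an HBase convention):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: salt the row key with a short MD5 prefix so rows spread
// evenly across regions. get() can rebuild the key; scan order is lost.
public class HashedKey {
    public static byte[] rowKey(String logicalKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(logicalKey.getBytes(StandardCharsets.UTF_8));
            byte[] key = logicalKey.getBytes(StandardCharsets.UTF_8);
            byte[] salted = new byte[2 + key.length];
            System.arraycopy(digest, 0, salted, 0, 2);    // 2-byte hash prefix
            System.arraycopy(key, 0, salted, 2, key.length);
            return salted;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always present in the JDK
        }
    }
}
```

Note this spreads *keys*, which is the point of the thread: it does nothing for skew in how often individual keys are read.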

But on reads, how do you get 'hot spotting'?

Should those rows be cached in memory?

So what am I missing? Besides another cup of coffee?

-Mike

On May 25, 2012, at 1:23 PM, Ian Varley wrote:


I gather that Dmitriy is asking whether there are any smarts in the region balancer based on heavy *read* traffic (i.e. if it turns out that your read load is heavily skewed towards a small subset of regions). Which there aren't, but could be if someone wanted to write the infrastructure for it (which would likely be complex, as you'd have to persist information about read traffic somewhere other than the logs). Then read-hot regions would be candidates for splitting, not just based on their size but also based on their read traffic.
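The policy Ian describes (regions become split candidates based on read traffic, not size) could look something like the following. The class name, the median baseline, and the threshold factor are all invented for illustration; nothing like this exists in HBase per the thread.

```java
import java.util.*;

// Sketch: flag a region as a split candidate when its recent read
// traffic far exceeds the cluster median, regardless of its size.
public class ReadHotspotPolicy {
    /** Regions whose read count exceeds factor x the median count. */
    public static List<String> splitCandidates(Map<String, Long> readsPerRegion, double factor) {
        List<Long> counts = new ArrayList<>(readsPerRegion.values());
        Collections.sort(counts);
        double median = counts.get(counts.size() / 2);
        List<String> hot = new ArrayList<>();
        for (Map.Entry<String, Long> e : readsPerRegion.entrySet()) {
            if (e.getValue() > factor * median) hot.add(e.getKey()); // read-hot, not size-hot
        }
        return hot;
    }
}
```

As Ian notes, the hard part is not this policy but persisting the read-traffic input somewhere durable, since it isn't in the logs.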

Caching is relevant to help read performance, for sure, but there could still be scenarios where your read traffic is all stuck in one region, and even after all other optimizations, it still leaves one region hot and the rest cold.

To be totally clear, Dmitriy: I think this is a pretty advanced feature that's not high on the overall priority list, because in such a rare situation you could always manually split that region.

Ian

On May 26, 2012, at 11:25 AM, Michael Segel wrote:


If you have records that are being read that frequently, they would be cached in memory.

I think you could use some concept of a systems table and then using coprocessors you could update the table with the read patterns.

There'd be a performance hit. (I don't know how much of one, but it will exist...) But it's possible...
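The counting side of Michael's coprocessor idea could keep the per-read cost down to one atomic increment: a RegionObserver's post-get hook bumps an in-memory counter, and a background task periodically drains the totals into the stats table. This sketch shows only the counter; the coprocessor wiring and the stats-table write are omitted, and all names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: in-memory per-region read counters that a coprocessor's
// post-get hook could bump, flushed to a stats table periodically.
public class ReadCounter {
    private final ConcurrentHashMap<String, AtomicLong> hits = new ConcurrentHashMap<>();

    /** Called from the read path, e.g. a post-get coprocessor hook. */
    public void onRead(String regionName) {
        hits.computeIfAbsent(regionName, k -> new AtomicLong()).incrementAndGet();
    }

    /** Drain and reset the current total for one region, e.g. into a stats table. */
    public long drain(String regionName) {
        AtomicLong c = hits.remove(regionName);
        return c == null ? 0 : c.get();
    }
}
```

Batching the flush is what keeps the performance hit Michael mentions small relative to counting every read synchronously in a table.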

On May 26, 2012, at 11:45 AM, Ian Varley wrote:

