If I use an md5 hash + timestamp rowkey, would HBase automatically detect the difference in ranges and perform splits? How does splitting work in such cases, or is it still advisable to manually split the regions?

On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

Yes.

On how split works: when a region hits the maximum configured size, it splits in two.
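[The "maximum configured size" mentioned above is governed by the `hbase.hregion.max.filesize` property. A minimal hbase-site.xml fragment, assuming an example threshold of 2 GB; the value shown is illustrative, not from the thread:]

```xml
<!-- hbase-site.xml: store-file size at which a region is split in two.
     2 GB here is only an example value; pick one for your workload. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>2147483648</value>
</property>
```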

Manual splitting can be useful when you know your distribution and you'd save on HBase doing it for you. It can speed up bulk loads, for instance.

St.Ack
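[Since md5 output is effectively uniform, pre-split points for an md5-keyed table can simply divide the hex keyspace evenly. A minimal sketch; the region count and lowercase-hex row-key encoding are assumptions, not from the thread:]

```python
def uniform_split_points(num_regions, key_bytes=16):
    """Return num_regions - 1 evenly spaced boundaries over the md5
    (16-byte) key space, as lowercase hex strings suitable for the
    SPLITS list of a pre-split table."""
    max_key = 2 ** (8 * key_bytes)
    step = max_key // num_regions
    width = 2 * key_bytes
    return [format(i * step, "0{}x".format(width)) for i in range(1, num_regions)]

print(uniform_split_points(4))
# ['40000000000000000000000000000000',
#  '80000000000000000000000000000000',
#  'c0000000000000000000000000000000']
```

[With, say, 30 region servers one would generate 29 such boundaries and pass them to the table-creation call.]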


What logic would you recommend to split the table into multiple regions when using an md5 hash?

On Wed, Aug 29, 2012 at 9:38 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

It's hard to know how well your inserts will spread over the md5 namespace ahead of time. You could try sampling, or just let HBase take care of the splits for you. (Is there a problem with letting HBase do the splits?)

St.Ack

From what I've read it's advisable to do manual splits, since you are able to spread the load in a more predictable way. If I am missing something please let me know.

On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> From what I've read it's advisable to do manual splits since you are able
> to spread the load in a more predictable way.

Where did you read that?
St.Ack

The Facebook devs have mentioned in public talks that they pre-split their tables and don't use automated region splitting. But as far as I remember, the reason for that isn't predictability of spreading load, so much as predictability of uptime & latency (they don't want an automated split to happen at a random busy time). Maybe that's what you mean, Mohit?

Ian

On Aug 30, 2012, at 5:45 PM, Stack wrote:

On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> From what I've read it's advisable to do manual splits since you are able
> to spread the load in a more predictable way. [...]

Where did you read that?
St.Ack

Also, you might have read that an initial loading of data can be better distributed across the cluster if the table is pre-split, rather than starting with a single region and splitting (possibly aggressively, depending on the throughput) as the data loads in. Once you are in a stable state with regions distributed across the cluster, there is really no benefit in terms of spreading load by managing splitting manually vs. letting HBase do it for you. At that point it's about what Ian mentioned: predictability of latencies, by avoiding splits happening at a busy time.



On Thu, Aug 30, 2012 at 5:04 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> In general isn't it better to split the regions so that the load can be
> spread across the cluster to avoid hotspots?

Time series data is a particular case [1], and the sematextians have tools to help with that particular loading pattern. Is time series your loading pattern? If so, yes, you need to employ some smarts (tsdb schema and write tricks, or the hbasewd tool) to avoid hotspotting. But hotspotting is an issue apart from splits; you can split all you want, and if your row keys are time series, splitting won't undo them.
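[The hbasewd trick mentioned above amounts to prefixing each sequential key with a small deterministic bucket id so consecutive writes land in different regions. A rough sketch; the bucket count and string key format are assumptions, not hbasewd's actual wire format:]

```python
import hashlib

NUM_BUCKETS = 8  # assumption: tune to roughly the number of write-hot regions

def salted_key(sequential_key):
    """Prefix a sequential key (e.g. a timestamp) with a bucket id derived
    from the key itself, so sequential writes spread over NUM_BUCKETS
    ranges. Range reads must then fan out across all buckets and merge."""
    digest = hashlib.md5(sequential_key.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return "{:02d}|{}".format(bucket, sequential_key)

print(salted_key("20120830120000"))  # same input always yields the same bucket
```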

You would split to distribute load over the cluster, and HBase should be doing this for you without need of human intervention (caveat the reasons you might want to manually split, as listed above by AK and Ian).

St.Ack
1. http://hbase.apache.org/book.html#rowkey.design

> I read about pre-splitting here:
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

My data is time series, and to get random distribution and still have the keys in the same region for a user, I am thinking of using md5(userid) + reverse timestamp as a row key. But with this type of key, how can one do pre-splits? I have 30 nodes.
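[One way to pre-split for such a composite key: since only the md5 prefix varies, and it varies uniformly, split boundaries never need to include the timestamp suffix. A sketch assuming a 16-byte digest prefix, a 64-bit millisecond timestamp, and one region per node; none of these details are from the thread:]

```python
import hashlib
import struct

def row_key(user_id, ts_millis):
    """16-byte md5(userid) prefix keeps one user's rows in a contiguous
    range; the 8-byte reversed timestamp sorts those rows newest-first."""
    prefix = hashlib.md5(user_id.encode()).digest()
    reverse_ts = struct.pack(">Q", 2 ** 64 - 1 - ts_millis)
    return prefix + reverse_ts

def prefix_split_points(num_regions):
    """Evenly spaced 16-byte boundaries over the uniform md5 prefix."""
    step = 2 ** 128 // num_regions
    return [(i * step).to_bytes(16, "big") for i in range(1, num_regions)]

print(len(row_key("user42", 1346281200000)))  # 24-byte key
print(len(prefix_split_points(30)))           # 29 boundaries for 30 regions
```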

On Fri, Aug 31, 2012 at 7:55 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> My data is time series [...] with this type of key how can one do
> pre-splits? I have 30 nodes.

If you don't know the key spread ahead of time, let HBase do the splitting for you?

St.Ack

