> Hi,
>
> We have been working on implementing secondary index in HBase, and had
> shared an overview of our design in the 2012 Hadoop Technical Conference
> at Beijing (http://bit.ly/hbtc12-hindex). We are pleased to open source it
> today.
>
> The project is available on github.
> https://github.com/Huawei-Hadoop/hindex
>
> It is 100% Java, compatible with Apache HBase 0.94.8, and is open sourced
> under Apache Software License v2.
>
> Following features are supported currently.
> - multiple indexes on table,
> - multi column index,
> - index based on part of a column value,
> - equals and range condition scans using index, and
> - bulk loading data to indexed table (Indexing done with bulk load)
>
> We now plan to raise HBase JIRA(s) to make it available in Apache release,
> and can hopefully continue our work on this in the community.
>
> Regards
> Rajeshbabu

Sorry to be a debbie downer here, but really this is not a good idea. Here's why:

In terms of design, you have some serious scalability and performance issues when compared to alternatives. Let me try to give you a real-life example.*

CCCIS (CCC Information Services) is the middle man in the US between the auto repair shop and the insurance company. They have one competitor but they handle most of the accident claims in the US. So when you go to your authorized repair shop, they have this application called Pathways which takes down all of your information and the accident, the parts required to be replaced and sends it first to CCC which then sends it on to your insurance company. In short CCC collects a lot of information about the type of vehicles, the accidents, the cost of parts, labor to put your car back on the road. As the middle man they collect a lot of very useful information…

So imagine you have a large data warehouse in HBase of all of the claims. Your primary key is going to be a composite of the insurer and the claim_id.

But you're also going to want to index on make/model, type of accident, driver details, location, VIN…

This will allow your actuaries to figure out the average cost of a front end collision, by make and model, by state/zip. Or by age bracket, who's a better driver?

Imagine that the claim table will have a column for the claim in its entirety as an Avro doc (JSON) along with the important fields broken out separately. (For this example the schema isn't that important.)

So you want to find the average cost of a front end collision of a VOLVO S80 for the past 3 model years.

Now, you have an index based on manufacturer/model/year.

Using your index scheme, you now have to query every region server (RS) for the row keys in the index. Then you have to take these results and sort them in order to use the index.

Note: This isn't too bad if you're doing a simple query against one index. You can do the work by RS and then join the results from all RS.
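The single-index case in the note above can be sketched as a scatter-gather. This is a toy in-memory model, not hindex code: the region contents, the `insurer|claim_id` row keys and the function names are all made up for illustration, and each "region" is just a sorted list of (indexed value, base row key) pairs.

```python
import heapq
from bisect import bisect_left

def scan_local_index(local_index, value):
    """Return the base row keys in ONE region whose indexed value equals `value`.

    `local_index` is a sorted list of (indexed_value, base_row_key) tuples
    covering only the rows hosted by that region (hypothetical layout).
    """
    start = bisect_left(local_index, (value,))  # first entry with indexed_value >= value
    hits = []
    for indexed_value, row_key in local_index[start:]:
        if indexed_value != value:
            break
        hits.append(row_key)
    return hits

def scatter_gather(regions, value):
    """Ask every region, then merge the per-region sorted results."""
    partials = [scan_local_index(local_index, value) for local_index in regions]
    return list(heapq.merge(*partials)), len(regions)

# Three made-up regions, partitioned by base row key.
REGIONS = [
    sorted([("FORD/F150/2012", "allstate|0002"),
            ("VOLVO/S80/2011", "allstate|0001"),
            ("VOLVO/S80/2012", "allstate|0007")]),
    sorted([("HONDA/CIVIC/2010", "geico|0003")]),
    sorted([("VOLVO/S80/2011", "statefarm|0004")]),
]

rows, regions_contacted = scatter_gather(REGIONS, "VOLVO/S80/2011")
```

Every region is contacted even though only two of the three hold a match, which is the cost being pointed at here.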

However… what happens if you have two indexes and your result set is going to be the intersection of the indexes?

Or you're going to do a join between two tables using the indexes to limit the result set?

Now your design breaks down quickly.
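For reference, the intersection step itself is small. The sketch below assumes each index lookup has already produced a sorted list of base row keys (the sample keys are invented); it says nothing about where the intersection runs, per region or at the client, which is the part this thread argues about.

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted lists of base row keys."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical results of two separate index lookups.
by_model = ["allstate|0001", "geico|0009", "statefarm|0004"]
by_collision_type = ["allstate|0001", "allstate|0005", "statefarm|0004"]

both = intersect_sorted(by_model, by_collision_type)
```

The merge is linear in the size of the two lists, so the expensive part is collecting and sorting those lists across region servers first.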

And then there's another problem. Your index may be much smaller than your base table. In this example… the insurance claim is a huge record. I would say 2-3 orders of magnitude larger than the row key. Since you split your index at the same rate you split your table… you will have a lot of regions for your index.
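A back-of-the-envelope version of that argument, with made-up numbers: a 10,000 GiB claims table, 10 GiB maximum region size, and index entries 1,000 times smaller than claim rows. It assumes, as described above, one index region per base-table region; none of the figures come from a real deployment.

```python
import math

GIB = 1024 ** 3

def region_counts(table_bytes, max_region_bytes, size_ratio):
    """Compare an index split in lockstep with the base table against
    an index table that splits on its own size (illustrative model only)."""
    base_regions = math.ceil(table_bytes / max_region_bytes)
    index_bytes = table_bytes // size_ratio
    return {
        "base_regions": base_regions,
        # Assumption: one index region per base region.
        "colocated_index_regions": base_regions,
        # An independently managed index splits only when it is itself too big.
        "independent_index_regions": max(1, math.ceil(index_bytes / max_region_bytes)),
        "index_bytes_per_colocated_region": index_bytes // base_regions,
    }

counts = region_counts(10_000 * GIB, 10 * GIB, 1_000)
```

With these inputs the co-located index is spread over 1,000 regions of roughly 10 MB each, where an independently split index table would fit in a single region.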

Again, this may lead to other issues…

Is it better than doing a full table scan? Sure.

Are there better alternatives? Yes. Apply KISS. (Keep it simple)

Still using an inverted table, let HBase manage it rather than trying to tie it to the underlying base table. While it's not perfect, it's lighter, and will perform better in the general use cases. (You could even use Async HBase to decouple the write to the base table and the update to the index.)
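A minimal sketch of the inverted-table idea, assuming the common layout where the index row key is the indexed value, a separator, then the base row key. The class, the separator choice and the sample keys are illustrative only; a sorted Python list stands in for an HBase table that splits on its own.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed separator; must not occur in indexed values

class InvertedIndexTable:
    """Toy stand-in for a separate index table with sorted row keys."""

    def __init__(self):
        self._keys = []  # sorted "indexed_value SEP base_row_key" strings

    def put(self, indexed_value, base_row_key):
        key = indexed_value + SEP + base_row_key
        pos = bisect_left(self._keys, key)
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)

    def lookup(self, indexed_value):
        """Prefix scan: return base row keys for one indexed value."""
        prefix = indexed_value + SEP
        pos = bisect_left(self._keys, prefix)
        out = []
        while pos < len(self._keys) and self._keys[pos].startswith(prefix):
            out.append(self._keys[pos][len(prefix):])
            pos += 1
        return out

index = InvertedIndexTable()
index.put("VOLVO/S80/2011", "statefarm|0004")
index.put("VOLVO/S80/2011", "allstate|0001")
index.put("FORD/F150/2012", "allstate|0002")

claim_keys = index.lookup("VOLVO/S80/2011")
```

Because all entries for one indexed value are adjacent in key order, the lookup is a single contiguous scan, followed by gets against the base table for the returned row keys.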

Same model could be applied to a Lucene index as well.

Just Saying….

-Mike

*FULL DISCLOSURE
I am a consultant and CCC was a client of mine back in the late '90s. In one project I worked on ProEFT (now defunct) and an ODS, also now defunct. The example is a hypothetical of what I would do if I were CCC and wanted to use Big Data to help manage Auto claims. Any resemblance to any actual work being done by CCC in the Big Data space is pure coincidence. ;-)

On Aug 13, 2013, at 1:31 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> Thanks so much for the contribution!
>
> On Mon, Aug 12, 2013 at 11:19 PM, rajeshbabu chintaguntla <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>> [...]

Michael, I do not think it's a competitor to Solr, Solr/HBase or Cloudera Search, but it can be a good addition to an HBase SQL front-end such as Phoenix.

On Wed, Aug 14, 2013 at 8:45 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Guys,
>
> Sorry to be a debbie downer here, but really this is not a good idea.
> Here's why:
>
> In terms of design, you have some serious scalability and performance
> issues when compared to alternatives.
> [...]

The point I was trying to make was that if you are going to use an inverted table as your index, managing your index at the RS level is going to bite you in the ass and will cause more headaches down the road.

This is being done because they want to avoid the overhead of RPC calls. But you're in a distributed database where RPC is part of the ecosystem and it's something that you have to deal with. (And you can do some basic design to decouple the write to the index from the base table.)
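One way to read "decouple the write to the index from the base table" is a queue between the two writes. The sketch below is only that reading, with in-memory dicts and a worker thread standing in for the base table, the index table and an asynchronous client; all names and the sample claim are hypothetical.

```python
import queue
import threading

base_table = {}          # row key -> claim record (stand-in for the base table)
index_table = set()      # (indexed value, row key) pairs (stand-in for the index table)
pending = queue.Queue()  # index updates waiting to be applied

def index_worker():
    """Apply queued index updates in the background."""
    while True:
        indexed_value, row_key = pending.get()
        index_table.add((indexed_value, row_key))
        pending.task_done()

threading.Thread(target=index_worker, daemon=True).start()

def put_claim(row_key, claim):
    """Write the base row and return without waiting for the index update."""
    base_table[row_key] = claim
    pending.put((claim["model"], row_key))

put_claim("allstate|0001", {"model": "VOLVO/S80/2011", "cost": 4200})
pending.join()  # only needed here so the example can inspect the index
```

The trade-off is that the index lags the base table for a short window, so readers going through the index can briefly miss a row that has already been written.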

In addition to this, the use of an inverted table is just one of the options you have for a secondary index. You could also look at Lucene which we did a PoC a few years back.

Also beyond the secondary indexing, you have issues with coprocessors in general that should be addressed. But that's a different story.

Please don't misunderstand, but while secondary indexing is a very important thing, going down the path of tying the index to the region is going down the wrong path.

When you look at trying to integrate it into Phoenix, you'll start to see the problems…


On Wed, Aug 14, 2013 at 8:45 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> This isn't too bad if you're doing a simple query against one index. You
> can do the work by RS and then join the results from all RS.
>
> However… what happens if you have two indexes and your result set is going
> to be the intersection of the indexes?
>
> Or you're going to do a join between two tables using the indexes to limit
> the result set?
>
> Now your design breaks down quickly.

You may have just described their design assumptions.

I'm not endorsing this per se, but suggesting it is not a good idea on account it can't live up to the requirements of a pretty particular strawman seems a step too far.

Maybe someone from Huawei can talk a bit here about successful use cases?

> You could also look at Lucene which we did a PoC a few years back.

A certain large technology company has an HBase full text index built on Lucene that might be offered as a contribution at some point. From what I know of it, there is a different set of tradeoffs and it certainly won't work for everyone, and not because the people working on it were not smart enough to find a silver bullet.

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com

Michael, JOIN is not supported in Phoenix for very obvious reasons and probably never will be (except maybe JOIN against replicated tables).

On Wed, Aug 14, 2013 at 1:52 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 14, 2013 at 8:45 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
> > Or you're going to do a join between two tables using the indexes to
> > limit the result set?
> >
> > Now your design breaks down quickly.
>
> You may have just described their design assumptions.
> [...]


> bq. JOIN is not supported in Phoenix
>
> That is correct.
>
> See https://github.com/forcedotcom/Phoenix/wiki
>
> On Wed, Aug 14, 2013 at 2:04 PM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:
>
>> Michael, JOIN is not supported in Phoenix for very obvious reasons and will
>> probably never be (may be except of JOIN against replicated tables) .
>> [...]


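Yep.

1. highly selective indexes + point gets -> global inverted index tables
2. less selective indexes + queries returning many rows -> "local" indexes, such as the Huawei solution.

Of course it's not quite that black and white. Global indexes that serve index-covered queries (where the query can be answered from the index alone) would also work in many cases of non-selective queries.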

In the end it is quite simple (IMHO):
If a query retrieves data from only a single region, you want to be able to home in on that region quickly, via a piece of global information.
If on the other hand a query returns data from many regions, you're better off handling the filtering locally.
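That rule of thumb can be turned into a toy cost comparison. The model below is a deliberate oversimplification and not a measurement: it counts one round trip for the global index scan plus one unbatched get per matching row, against one round trip per region for a local index. The function names and numbers are invented.

```python
def rpcs_global(matching_rows):
    """One scan of the global index table, then one point get per hit (no batching assumed)."""
    return 1 + matching_rows

def rpcs_local(num_regions):
    """Every region filters its own rows, so every region is contacted once."""
    return num_regions

def cheaper_index(matching_rows, num_regions):
    """Pick the index style with fewer round trips under this toy model."""
    if rpcs_global(matching_rows) < rpcs_local(num_regions):
        return "global"
    return "local"

point_lookup = cheaper_index(matching_rows=3, num_regions=1000)
broad_query = cheaper_index(matching_rows=50_000, num_regions=1000)
```

Under these assumptions a query matching a handful of rows favors the global index, and a query matching rows across most regions favors local filtering; batching gets or covered queries would shift the crossover point.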


It's not a question of their solution not working.
It's that it takes a lot more resources on the read than the alternative.

On Aug 14, 2013, at 5:35 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Yep.
>
> 1. highly selective indexes + point gets -> global inverted index tables
> 2. less selective indexes + queries returning many rows -> "local" indexes, such as the Huawei solution.
> [...]
> Just my $0.02.
>
> -- Lars


> No it doesn't in case 2 below. Quite the opposite.
>
> From: Michael Segel <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Sent: Wednesday, August 14, 2013 4:57 PM
> Subject: Re: [ANNOUNCE] Secondary Index in HBase - from Huawei
>
> It's not a question of their solution not working.
> It's that it takes a lot more resources on the read than the alternative.
>
> On Aug 14, 2013, at 5:35 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > Yep.
> >
> > 1. highly selective indexes + point gets -> global inverted index tables
> > 2. less selective indexes + queries returning many rows -> "local" indexes, such as the Huawei solution.
> > [...]


Good to see this. Hope this will help bring more improvements and enhancements. :)

On Tue, Aug 13, 2013 at 12:14 PM, Anoop John <[EMAIL PROTECTED]> wrote:

> Good to see this Rajesh. Thanks a lot to Huawei HBase team!
>
> -Anoop-
>
> On Tue, Aug 13, 2013 at 11:49 AM, rajeshbabu chintaguntla <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> > [...]
