We already have code (based on an early version of Hive) that does the important work, but we need to integrate it with Hive QL and determine the candidate indexes for a query. Also, a way to create buckets (both hash and range) would really speed up the index processing. I hadn't realized it, but predicate pushdown makes things simpler.

Prasad Chakka
added a comment - 15/Apr/09 17:20

Ashish Thusoo
added a comment - 15/Apr/09 19:07
+1 to this...
We should also look at Katta
http://katta.wiki.sourceforge.net/
to see if we can leverage it to be a really fast index. I think that integration would really rock..
Thoughts?

HIVE-1230 has changed the interface for RecordReader and it no longer has a getPos() method; the older interfaces are deprecated. I used this method in the prototype to get the current position while creating the index and also while reading the actual data file. Even the SequenceFileRecordReader does not have this method.

Without getPos() and seek() methods on RecordReader it becomes tough to implement any kind of generic indexing.

Prasad Chakka
added a comment - 25/Apr/09 20:59

Here is a very rough outline of how this can be done (the prototype code has the creation and execution parts but not the HiveQL-related stuff)

hive indexing:
goal of hive indexing is to speed up lookup queries on certain columns of the table. currently queries with predicates like 'WHERE tab1.col1 = 10' have to load the complete table/partition and process all the rows. if there exists an index on col1 then only a small portion of the file needs to be loaded.

command to create index:

create index tab_idx1 on tab1 (col1, ...);

if the base table is partitioned then the index is also partitioned. indexes can be created on base tables whose file format supports getPos() (and possibly seek(), or equivalent methods).

format of index:
index is also a hive table with the following columns
col_1...col_k – key cols. base table columns on which this index is defined
list<offset> – positions of rows which contain these keys

offset is a combination of the following
file_name – relative path of the file in which this row is contained. (relative to the partition/table location)
byte_offset – byte offset of the row in the file. row can be found at this byte offset or in the block starting at this byte offset for Block Compressed Sequence Files.
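The index row layout above can be sketched in plain Python (not Hive code; the file names and rows here are invented for illustration): one index row per distinct key, holding the list of (file_name, byte_offset) offsets.

```python
# Toy illustration of the index layout described above: each distinct
# key of the indexed columns maps to a list of (file_name, byte_offset)
# pairs pointing into the base table's files.

def build_index(rows, key_cols):
    """rows: list of (file_name, byte_offset, record_dict) triples."""
    index = {}
    for file_name, byte_offset, record in rows:
        key = tuple(record[c] for c in key_cols)
        index.setdefault(key, []).append((file_name, byte_offset))
    return index

# invented sample data
rows = [
    ("part-00000", 0,   {"col1": 10, "col2": "a"}),
    ("part-00000", 120, {"col1": 20, "col2": "b"}),
    ("part-00001", 0,   {"col1": 10, "col2": "c"}),
]
idx = build_index(rows, ["col1"])
# idx[(10,)] -> [("part-00000", 0), ("part-00001", 0)]
```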

when to create index:
traditionally databases try to update the index when the table is loaded. hive doesn't process rows while loading tables using the 'LOAD DATA INPATH' command, and updating the index may slow down the actual loading for 'INSERT ... SELECT ... FROM ...' type statements. so users should have an option to choose whether the index is initialized during 'INSERT ... SELECT ...' or initialized separately. another command like 'update index tab_idx partition ..' can be provided.

how to create index:
index can be created using the following hive command augmented with 'offset'
'select col_1...col_k, offset from tab1'

offset can be provided as a built-in function which can be derived in HiveInputRecordReader, which will in turn use the specific FileFormat's Reader getPos() method and 'map.input.file' for the file name (or get it from the tableDesc or partitionDesc).
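As a rough Python analogue of what a getPos()-style offset provides (invented file contents; in Hive the position would come from the FileFormat Reader's getPos() and the file name from 'map.input.file'), recording the byte offset at the start of each row looks like:

```python
# Sketch: record the byte offset at the start of each row while scanning,
# the way an offset built-in backed by getPos() would. Data is made up.
import io

def rows_with_offsets(data: bytes):
    f = io.BytesIO(data)
    out = []
    while True:
        pos = f.tell()          # analogous to RecordReader.getPos()
        line = f.readline()
        if not line:
            break
        out.append((pos, line.rstrip(b"\n").decode()))
    return out

data = b"10,alice\n20,bob\n10,carol\n"
offs = rows_with_offsets(data)
# offs == [(0, "10,alice"), (9, "20,bob"), (16, "10,carol")]
```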

Algorithm For using index:
1) Hive QL needs to determine whether a particular query can use any existing indexes. This can be determined by examining the predicate tree. After predicate pushdown, all predicates which can use an index are in the child operator of a TableScanOperator. This predicate tree needs to be examined: if it contains any subset of the columns of an index then that index can be used. Until stats are available, it is not possible to guess whether using an index is beneficial. This needs to be fleshed out more to check both 'AND' and 'OR' predicates.

2) For each of the qualified indexes, a map/reduce job can be created using the predicates determined in step 1. The output of this job should have the following information
file_name – fully qualified file name that contains the data
byte_offset – position of row

3) If there is more than one qualified index then the outputs of step 2 need to be combined depending on whether the predicates on these indexes have 'AND' or 'OR' between them.

4) Modify the original plan to use only those FileSplits that appear in the output of step 3. This reduces the number of mappers spawned by the JobTracker.

5) Modify the original plan to use HiveIndexRecordReader instead of the regular record reader. The output of step 3 (which is sorted) is available to the HiveIndexRecordReader, which can skip to these locations instead of reading every record in the Mapper's input.
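Steps 3 and 4 can be sketched as set operations on the index outputs (toy Python with invented file names, not the actual Hive planner): each index lookup yields a set of (file_name, byte_offset) hits; AND intersects them, OR unions them, and the surviving file names decide which splits to keep.

```python
# Toy sketch of steps 3 and 4 above. File names and offsets are invented.

def combine(hits_a, hits_b, op):
    """Combine two indexes' outputs per the predicate connective."""
    return hits_a & hits_b if op == "AND" else hits_a | hits_b

def prune_splits(splits, hits):
    """Keep only the splits whose file appears in the combined hits."""
    wanted = {file_name for file_name, _ in hits}
    return [s for s in splits if s in wanted]

idx1_hits = {("f1", 0), ("f2", 100)}   # rows matching predicate on index 1
idx2_hits = {("f2", 100), ("f3", 40)}  # rows matching predicate on index 2

and_hits = combine(idx1_hits, idx2_hits, "AND")       # only ("f2", 100)
splits = prune_splits(["f1", "f2", "f3"], and_hits)   # only "f2" survives
```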

Prasad Chakka
added a comment - 15/May/09 17:37

Thanks Prasad for detailed description of the index design.
Several questions:
(1)
Is the index based on sort? A single-column index based on sort can be very useful for queries involving this column, both point queries and range queries. But a sort-based multi-column index cannot be utilized for queries not containing the column used as the primary sort order in the index. For example, we create a sort-based index on table1(col1,col2,col3). The index uses col1 as the primary sort order, col2 as the secondary sort order, and col3...
We can use this index to accelerate queries like:
1) select * from table1 where col1>2 and col2<34
2) select * from table1 where col1<34 and col3 >45
3) select * from table1 where col1>23
but, we can not use it for queries like:
4) select * from table1 where col2>34 and col3<3
5) select * from table1 where col2 =34
6) select * from table1 where col3 <45
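The leftmost-prefix rule behind these six examples can be sketched as a tiny check (plain Python; the column names come from the example above, and this is of course not the real optimizer logic): a sort-based index on (col1, col2, col3) narrows the scan only when the predicate touches col1, the primary sort column.

```python
# Sketch of the leftmost-prefix rule for a sort-based composite index.
def can_use_sorted_index(index_cols, predicate_cols):
    # the scan can be narrowed only if the primary sort column is constrained
    return index_cols[0] in predicate_cols

index_cols = ["col1", "col2", "col3"]
assert can_use_sorted_index(index_cols, {"col1", "col2"})      # query 1
assert can_use_sorted_index(index_cols, {"col1", "col3"})      # query 2
assert can_use_sorted_index(index_cols, {"col1"})              # query 3
assert not can_use_sorted_index(index_cols, {"col2", "col3"})  # query 4
assert not can_use_sorted_index(index_cols, {"col2"})          # query 5
assert not can_use_sorted_index(index_cols, {"col3"})          # query 6
```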

(2)
Should we consider using indexes to accelerate queries that join several tables? For example, we have two tables:
user(userid,name,address, age,title,company);
click(userid,url,datetime);
And now we have a query like:
select url from user, click where user.userid=click.userid and user.name="user_name" and datetime between last month; to select the list of urls the specified user visited in the last month.
If we have an index: create index user_url on table user(name), click(datetime) where user.userid=click.userid, then the above query can be accelerated.

(3)
Index can also be used in Group-by aggregation queries. Should we also consider them?
(4)
Another feature is to integrate a Lucene index with Hive. Ashish suggested integrating katta. I took a look at katta, and i think it may not be necessary to include katta. If we include it, hive users will have to deploy katta and zookeeper in their cluster. I think we can integrate lucene internally without touching katta.

He Yongqiang
added a comment - 17/May/09 14:59

1)
The question you raised applies only to B+Tree indexes. The index that I defined above is not really a traditional database index but a kind of summary table (or view), and any lookup/range-query on the table requires reading the whole index. So you can apply all predicates as long as the columns referenced in the predicates exist in the index, and we should be able to use an index on (col1, col2, col3) for all the queries above. Sorting order has no impact here since the whole index is read into memory anyway.

Since this index can be created in sorted order, we can create a sparse index (similar to the non-leaf nodes of a B+-Tree) if the index itself is too big (i.e., index sizes are an order of magnitude larger than the HDFS block size). But this can be done as a later optimization.
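A minimal sketch of such a sparse level, assuming the dense index is sorted by key (toy Python with invented keys and a deliberately tiny block size): keep the first key of every block of the dense index, then binary-search the sparse level to find which dense block may contain a key, mimicking the non-leaf level of a B+-Tree.

```python
# Sketch: sparse index over a sorted dense index. Keys are invented.
import bisect

dense = [(1, "..."), (4, "..."), (9, "..."), (15, "..."), (22, "..."), (30, "...")]
BLOCK = 2  # dense entries per block (tiny, for illustration)

# sparse level: first key of each dense block
sparse = [dense[i][0] for i in range(0, len(dense), BLOCK)]  # [1, 9, 22]

def dense_block_for(key):
    """Index of the dense block whose key range may contain `key`."""
    return max(bisect.bisect_right(sparse, key) - 1, 0)

# key 10 falls in the block starting at key 9, i.e. block 1
```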

2)
With the design above, indexes on joins will come for free, since predicate pushdown will push the 'user.name="user_name"' predicate above the join and only index-filtered rows participate in the join.

But creating indexes on the joined output may increase the index size enough to decrease the overall effectiveness. With sparse indexes this problem might be mitigated, so we can support this kind of join index along with support for sparse indexes.

3)
Yes, for some aggregation queries it may make sense to read the index (since it is a summary table as well). Aggregations, or any queries that involve only columns from the index, can operate on the index alone and not the main table.

4)
I also looked at it and am not sure how it fits into Hive. Katta is more like a distributed index server.

Prasad Chakka
added a comment - 17/May/09 16:39

Prasad, the index you designed is a kind of hash index? And it requires scanning the whole table for either a lookup or a range query.
A sort-based index only needs to scan part of the index table, but it also has the problems i posted in the previous comment.

> Since this index can be created in sorted order, we can create sparse index

how? I think a sort-based index cannot be a sparse index; it can only be a dense index.

> I also looked at it and not sure how it fits into Hive. Katta is more like an distributed index server.

yes, we do not need to fit katta into Hive. But we can integrate a lucene index into Hive. But right now i think we can put this in the last part of our schedule.

He Yongqiang
added a comment - 18/May/09 03:36

Prasad, I think the hash based index is more like a projection, because the number of rows with the same hash value of the indexed columns will be very low.
And in most cases, the block/pos list's size will only be 1.

He Yongqiang
added a comment - 18/May/09 03:46

The above index is not a hash index, since you can't do range queries on a hash index and lookups on a hash index are constant time. not sure what to call this except that it is a view (simple projection) of the base table with offsets into the base table.

on sparse index, i meant you can create a sparse index on top of the index i described above. but this can be done later.

> And in most cases, the block/pos list's size will only be 1

that is not the case if the index is on a non-primary-key column, and i think this is mostly the case where indexes will be used in data warehouses.

Prasad Chakka
added a comment - 18/May/09 22:12

>that is not the case if the index is on a non-primary key column. and i think, mostly this is the case where indexes will be used in data warehouses.
Yes. If the index is built on one column, the block/pos list's size will be large. But if it is built on many columns, i think the block/pos list's size will be small.
Anyway, we can build this index as the first step.
And after this is finished, we can try other kinds of index, like:
1) sort-based index
2) lucene index
3) block-scope B+Tree or R-tree or other advanced index data structures.

Prasad, you said you already wrote some code, would you please attach it?

He Yongqiang
added a comment - 19/May/09 03:44

Prasad Chakka
added a comment - 19/May/09 15:25
Yes, mostly the block/pos list size will be small, but I don't think we can assume that, since there will be enough cases where it is not true.
We can explore the other approaches later on; different indexes will be useful in different scenarios.
i will try to post some code this week.

patch built from the hive prototype code. it compiles with the hive code but needs a lot of work before it can fit nicely with hive.

Yongqiang, the code is hard to understand in the context of the open source hive code but makes sense given the prototype code. let me know if you have any questions. you might want to use this as a guide and start the hive2 indexing code from scratch.

Prasad Chakka
added a comment - 23/May/09 03:23

I checked how Mysql does with indexes and found mysql likewise cannot use an index to handle the situations in my earlier post:

but, we can not use it for queries like:
4) select * from table1 where col2>34 and col3<3
5) select * from table1 where col2 =34
6) select * from table1 where col3 <45

And now a basic idea for our index design, just as Prasad commented in a previous post:
1) index structure
use a mr job to create the index. the input is a file with all columns, and the mapper outputs kv pairs where the key is <indexed col1, indexed col2,...> and the value is the offset.
And we define a comparator for <indexed col1, indexed col2,...> to let the shuffle phase sort all the mappers' output. And in the reducer, we combine the kv-pairs into
<indexed col1, indexed col2,...> list_of_offsets
This is a dense sorted index; then we create a sparse index on the dense index. And we also collect column data distribution information (histograms) while doing this.
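The map/shuffle/reduce flow above can be simulated in a few lines of plain Python (invented sample records; in the real job the shuffle, with the custom comparator, would do the sorting):

```python
# Simulation of the index-building mr job: map emits (key, offset),
# the shuffle sorts by key, and the reducer folds equal keys into
# (key, list_of_offsets). Records are made up.
from itertools import groupby
from operator import itemgetter

records = [  # (offset, row) pairs, as a mapper would see them
    (0,  {"col1": "b", "col2": 2}),
    (50, {"col1": "a", "col2": 1}),
    (90, {"col1": "b", "col2": 2}),
]

# map phase: key = indexed columns, value = offset
mapped = [((r["col1"], r["col2"]), off) for off, r in records]
# shuffle phase: sort by key (where the custom comparator would apply)
mapped.sort(key=itemgetter(0))
# reduce phase: combine equal keys into one entry with a list of offsets
index = [(k, [off for _, off in grp]) for k, grp in groupby(mapped, key=itemgetter(0))]
# index == [(("a", 1), [50]), (("b", 2), [0, 90])]
```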
2)
we consider using an index for a query only when the query involves the leftmost columns of the index.
And we also need to consider index merging when a query involves two indexes, and a cost estimation to decide whether using an index will decrease query time (this is the work that needs to be done in the optimizer).

But as a first step, we can finish part 1 and the hive ql part, then consider part two (the optimizer part). After part 1 is finished, i will examine part 2 in more detail.

He Yongqiang
added a comment - 26/May/09 14:19

the plan looks good. i am not sure we need to create a sparse index on the dense index in phase 1. In most cases the size of the dense index will be small enough that an additional mr job for processing the sparse index is unnecessary. if the sparse index is not necessary then there is no need for the dense index to be sorted.

since the dense index is scanned completely while processing the query, we can use the index if any predicate column exists in the index definition.

Prasad Chakka
added a comment - 29/May/09 06:40

Hello Prasad and Yongqiang, Thank you very much for this great effort.

One of my suggestions would be: since we've done the indexing with Mapreduce, for some queries based on the generated indexes, can we just omit the time-consuming Mapreduce phase during querying? We've already got all of the files/offsets, so we can go to these specific file offsets directly to get the relevant rows of the table. This would greatly expedite the query process.

This would be helpful for the following case in one of my usages of Hive. With Hive, I've already sharded (by date) and bucketed (by column hashing) my log data into hierarchical files. I've also sorted each file by the hashing columns. As I may have many rows with the same column values but different timestamps, to minimize index size I'd like to treat these rows of same column values as a block and use only a single index entry for this block. This will greatly reduce the index size of my data, but still be very useful for my queries on those columns.
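This block idea can be sketched as follows (toy Python, invented offsets): in a file sorted by the key columns, equal keys are adjacent, so one index entry per run of equal keys, keyed by the offset of the first row in the run, is enough.

```python
# Sketch: block-level index entries for a file sorted by the key columns.
def block_entries(rows):
    """rows: sorted list of (key, byte_offset); one entry per run of equal keys."""
    entries = []
    prev_key = object()  # sentinel that never equals a real key
    for key, offset in rows:
        if key != prev_key:
            entries.append((key, offset))  # first row of a new run
            prev_key = key
    return entries

# five rows collapse to two index entries
rows = [("a", 0), ("a", 40), ("a", 80), ("b", 120), ("b", 160)]
# block_entries(rows) == [("a", 0), ("b", 120)]
```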

Seymour Zhang
added a comment - 29/May/09 16:52

Thanks for the suggestions, Seymour.
I have also thought about what you said: directly fetching the data instead of initializing a new mr job. I will try to include this, but it may be done in the second phase (the optimization phase).

>>I'd like to treat these rows of same col values as a block and only use a single index entry for this block
in the design, we indeed use only one index entry. And not only for continuous values: we use the same index entry for all rows with the same column value.

He Yongqiang
added a comment - 30/May/09 16:22

are we going to have one index file per hdfs file? (or one per partition?)

a related question is how this is going to interact with sampling? (i think currently the sampling predicate is optimized out for bucketed tables - although i'm not terribly sure).

i would love to see the api to invoke the index.

ideally we would like to plug in different indexing schemes - as well as with map-side joins - the hashmap storing the smaller table can be seen as an index on that table. It would seem that one should be able to replace a map-side join based on tables loaded into jdbm with tables with the indices proposed here (and thereby do joins based on indices almost trivially).

we should enable people to be able to plug in their own indices (since it's quite likely that over time there will be multiple indexing efforts on hadoop files).

Joydeep Sen Sarma
added a comment - 01/Jun/09 18:16

Joydeep, thanks for your interest.
>>are we going to have one index file per hdfs file?
yeah.
>>i would love to see the api to invoke the index
currently it is not settled. I will try to provide it next week.
>>enable people to be able to plug in their own indices
I think if we have a well-designed adaptable api, then this can be addressed.
>>we would like to plug in different indexing schemes
yes. i have proposed several schemes in previous posts. Can you give me some schemes, so i can compare and make a better design?

BTW, I will try to write a proposal next week. I have an important english exam this weekend. Sorry for the delay.

He Yongqiang
added a comment - 02/Jun/09 03:41

Can we also support exporting these index files as a table to some other storage system like HBase or Tokyo Cabinet? i.e., can these separate index files for each HDFS file be expressed as a single table in Hive?

Seymour Zhang
added a comment - 02/Jun/09 15:26
>>are we going to have one index file per hdfs file?

Prasad,
About your comments at 17/May/09 08:39 AM:
>> ... So we should be able use index on (col1, col2, col3) for all the queries above. Sorting order has no impact here since the whole index is read into memory anyways.

If the size of index is big out of memory size, how to read whole index into memory?

Schubert Zhang
added a comment - 22/Jun/09 19:01 Prasad,
About your comments at 17/May/09 08:39 AM.
>> ... So we should be able use index on (col1, col2, col3) for all the queries above. Sorting order has no impact here since the whole index is read into memory anyways.
If the index is bigger than available memory, how can we read the whole index into memory?

Prasad Chakka
added a comment - 22/Jun/09 20:27 Schubert,
We can run another map-reduce job that scans the index and builds the results file sorted by the index key. This file can be read sequentially to determine which input table HDFS blocks should be fed to the actual job for the query.
Another way is to build a sparse index on the index. But if the table itself is sorted, we can build the sparse index (a la MapFile) directly and use it. @Facebook, the use case we have doesn't have this sorting property, but I can envision this being useful for primary indexes where the index sort order and the table sort order are the same.
Can you think of any other ways? Of course, we can process index files using HBase or Tokyo Cabinet, but that requires another system to be set up and administered, and both systems need to be available for index processing. But in some cases these solutions also work. The indexing scheme described above should play well with HBase and Tokyo Cabinet, since the index is a file with rows containing a key and position parameters. In Hadoop we can store that in SequenceFile or maybe TFile, but if they have to be stored in external systems, we can plug in a custom SerDe and change the default location of these to a location where the external systems can access these files.
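The sparse index (a la MapFile) idea above can be sketched as a toy, in-memory illustration. All names here are invented, not from the patch or from Hadoop's MapFile itself: only every Nth key of a sorted file is retained with its offset, and a lookup binary-searches those entries to find where a sequential scan of the data file must start.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of a sparse index in the spirit of Hadoop's MapFile:
// only every Nth key of a *sorted* data file is kept in memory, together
// with the file offset where that key starts. A lookup finds the last
// sparse entry whose key is <= the search key; the caller then scans
// the data file sequentially from the returned offset.
public class SparseIndexSketch {
    private final List<String> keys = new ArrayList<>();
    private final List<Long> offsets = new ArrayList<>();
    private final int interval;
    private int seen = 0;

    public SparseIndexSketch(int interval) { this.interval = interval; }

    // Called once per (key, offset) while writing the sorted data file;
    // records only every `interval`-th entry.
    public void maybeAdd(String key, long offset) {
        if (seen % interval == 0) { keys.add(key); offsets.add(offset); }
        seen++;
    }

    // Returns the offset to start scanning from, or -1 if the key sorts
    // before the first indexed key.
    public long seekOffset(String key) {
        int lo = 0, hi = keys.size() - 1, ans = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys.get(mid).compareTo(key) <= 0) { ans = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return ans < 0 ? -1 : offsets.get(ans);
    }
}
```

A real implementation would persist the sparse entries alongside the data file; this only shows why a small in-memory structure suffices even when the full index does not fit in memory.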

Schubert Zhang
added a comment - 29/Jun/09 10:55 Prasad,
Thanks for your comments; now I understand them.
Yes, in one of our projects, we sorted the data table and built a sparse index which records the block keys and file offsets. Then we loaded the index files into HBase to serve queries. It works fine.

He Yongqiang
added a comment - 08/Jul/09 11:29 Had a talk with Prasad and Zheng about the index design.
And I wrote a summary of the meeting: http://docs.google.com/View?id=dc9jpfdr_6dkrt82c7
Any suggestions are appreciated!
BTW, in this meeting and the doc we did not include any other index types. I think we should first implement one index type; after that, the others will be clearer.

Prasad Chakka
added a comment - 08/Jul/09 21:56 One thing that isn't mentioned is that in List<bucketname, List<offset>> column, offsets are sorted.
Another thing missing is, when an index on a partition is built then a new partition will be created for that index table (similar to that of creating a partition for a regular table).
We can distinguish index tables and regular tables by having a table parameter.
We can skip partition-specific indexes in the first phase if it reduces the amount of work, and assume indexes defined on a table are created on all partitions.

He Yongqiang
added a comment - 18/Jul/09 14:33 A draft version of building (create/update) index.
Still needs:
1) The reducer sorts the set of positions of one key in memory, which may cause out-of-memory errors when an index key occurs too many times.
2) The index output format can only use IgnoreKeyTextOutputFormat right now.

Prasad Chakka
added a comment - 22/Jul/09 18:25 1) Are you worried about the sort phase of the reducer or the IndexBuilder's reducer code? I don't think the former issue will be a problem. The latter issue can be avoided by writing multiple rows for a key if the number of offsets exceeds a certain limit. So the reducer can flush the offsets periodically to disk, thus avoiding OutOfMemory exceptions in the reducer.
2) What are the other options for the index output format?
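The periodic-flush idea from point 1 can be sketched as follows (hypothetical names, not the patch's actual reducer): buffer offsets per key and emit an index row whenever a cap is reached, so a hot key produces several bounded rows instead of one unbounded in-memory list. Note that each emitted row keeps its offsets in arrival order only; the global-sort concern discussed in this thread would still need the offsets to arrive at the reducer already sorted.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a reducer that bounds memory per index key: instead of
// collecting every offset of a key into one list, it flushes an index
// row ("key -> bucket, offsets") every MAX_OFFSETS entries. A frequent
// key then yields multiple rows in the index table rather than one
// arbitrarily large in-memory set.
public class FlushingIndexReducer {
    static final int MAX_OFFSETS = 3; // tiny cap, for illustration only
    private final List<String> emittedRows = new ArrayList<>();
    private final List<Long> buffer = new ArrayList<>();

    // All offsets of one (key, bucket) group arrive here.
    public void reduce(String key, String bucket, Iterable<Long> offsets) {
        for (long off : offsets) {
            buffer.add(off);
            if (buffer.size() >= MAX_OFFSETS) flush(key, bucket);
        }
        if (!buffer.isEmpty()) flush(key, bucket); // final partial row
    }

    private void flush(String key, String bucket) {
        // Emit one index-table row; in the real job this would be a
        // collector.collect(...) call.
        emittedRows.add(key + "\t" + bucket + "\t" + buffer);
        buffer.clear();
    }

    public List<String> rows() { return emittedRows; }
}
```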

He Yongqiang
added a comment - 23/Jul/09 01:23 1) For a given key, we are using a sorted set for each bucket to store positions at the reducer. I am worried that "one sorted set for each bucket" may cause an out-of-memory problem,
as you commented earlier: "List<bucketname, List<offset>> column, offsets are sorted".
Think about one extreme situation: one file contains a single value a million times. So at the reducer we are storing a million positions in a sorted set.
>>So reducer can flush the offsets periodically to disk thus avoiding OutOfMemory exceptions in reducer.
If we do this, how can we guarantee they are sorted? I mean, that offsets after this flush are greater than offsets in the previous flush.
2) What are the other options for the index output format?
I think there are no other options. We need to discard the key part, and I think in Hive only IgnoreKeyTextOutputFormat does that. Of course all of Hive's custom HiveOutputFormats can discard the key part, but they cannot be specified in the map-reduce jobconf, since they do not extend OutputFormat.

Prasad Chakka
added a comment - 23/Jul/09 01:30 Well, the number of offsets can't exceed the number of SequenceFile blocks, since we can only index the SequenceFile block offsets. So the problem is not as dire as it could be. Also, if that many rows (i.e. more than 10% of rows in a traditional RDBMS, maybe more in the Hadoop case) have the same key, then the index may not be efficient after all, since it is better to read the whole table anyway.

Prasad Chakka
added a comment - 23/Jul/09 02:44 What I am trying to say is that for such frequent keys indexing may not be of much help, so maybe we can relax the 'sort' property? I don't think there is another easy way out other than doing a disk-based sort. Check whether you can reuse any of the Hadoop sorting code, or whether we can piggyback this sorting on top of the Hadoop reduce sort phase somehow.

He Yongqiang
added a comment - 23/Jul/09 04:15 I am about to update it, making some modifications according to our offline talk.
I also need to add some test cases and do more tests.
I will submit a new patch by this weekend, and will notify you of any updates.

He Yongqiang
added a comment - 23/Jul/09 06:59 This JIRA page is a bit too long to follow.
After discussions with Prasad and Zheng, I will open several sub-tasks: index building, generalizing the index interface to allow plugging in different index strategies, optimizing queries with indexes, etc.
Let's keep discussions about overall indexing here on this JIRA page, and move other discussions to separate JIRA pages.

Jeff Hammerbacher
added a comment - 22/Sep/09 01:02 Another type of index worth knowing about: the "negative index"/"storage index" from Exadata, described at http://blogs.oracle.com/datawarehousing/2009/09/500gbsec_and_database_machine.html .
We get some "negative indexing" for free with partitions, but this may be useful for more "distinctive" scans over columns for which we have not partitioned.

Prasad Chakka
added a comment - 22/Sep/09 01:27 @jeff, I think this is more suitable for storing along with the data, where blocks of data can be skipped while scanning rows. I think columnar storage might already be doing this.

Joydeep Sen Sarma
added a comment - 22/Sep/09 05:18 are there any references on this technique?
someone had earlier suggested this (apparently from reading Netezza documentation) - but i don't understand when it would work. why would a (fairly large) sequencefile block contain only a limited range of values (assuming the metadata stores a min-max range)? most cases i can imagine in our dataset would either have low cardinality columns (so most values would be present) or, for large cardinality ones, the distribution would be random (relative to the primary sort key) - and the range would seem ineffective.
unless there are columns that are closely related to how the data is sorted/partitioned (perhaps some product ids are limited to a specific range of time - but the partitioning is on time and not product id - and even that sounds dubious).
a bloom filter would seem much more plausible at allowing good filtering. even then i don't understand why this sort of metadata should be kept along with the block and not separately (much more flexible - can be added on demand) as this jira is headed towards.

Prasad Chakka
added a comment - 22/Sep/09 06:14 i don't think it makes much sense unless there is some clustering or sorting property. if there is clustering and sorting and the selectivity of a query is much higher than 10%, then storing this metadata along with the data makes sense instead of in a separate block. the 10% threshold may be larger for Hive, but the point still stands. in the OLAP case data changes seldom, and the size of this kind of metadata is much smaller than the data itself, so the overhead of storing it is negligible.
something similar to this is done in DB2 Multi-Dimensional Clustering where whole blocks (disk blocks) are skipped if the key value doesn't fit the query.
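The block-skipping technique discussed here (Exadata's "storage index", DB2 MDC) boils down to keeping a min/max pair per block and testing the query predicate against it. A minimal sketch, with invented names; as noted above, it only pays off when the column correlates with the clustering or sort order:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of min/max block skipping (a "storage index" / zone map):
// each data block records the min and max value of a column; an
// equality predicate can only match inside blocks whose [min, max]
// range covers the value, so every other block is skipped unread.
public class BlockRangeSkipper {
    public static class BlockStats {
        final int blockId; final long min; final long max;
        public BlockStats(int blockId, long min, long max) {
            this.blockId = blockId; this.min = min; this.max = max;
        }
    }

    // Returns the ids of blocks that might contain `value`.
    public static List<Integer> candidateBlocks(List<BlockStats> stats, long value) {
        List<Integer> hits = new ArrayList<>();
        for (BlockStats s : stats) {
            if (value >= s.min && value <= s.max) hits.add(s.blockId);
        }
        return hits;
    }
}
```

With random value distribution every block's range tends to cover everything and nothing is skipped, which is exactly the objection raised in the comment above.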

Prasad Chakka
added a comment - 22/Sep/09 07:12 yes they do, but they don't use it for table scans, which are done if the query selectivity is greater than 10% (or some such). they use the index for index scans and in joins. I wrote the table scan code.

Schubert Zhang
added a comment - 23/Sep/09 04:01
Prasad Chakka added a comment - 15/Apr/09 11:25 AM
Another way of doing it is to create a file format that contains index along with data... but i think that would take lot more time.
We are trying to store data in sorted and block-indexed files (such as HFile or TFile). Then I think we can know the startKey and lastKey of each file and each block. This block index (block summary) is just for the primary key.

Edward Capriolo
added a comment - 23/Dec/09 23:14 I am currently benchmarking an 11-node Hive cluster against a 16 TB MySQL 5.1 system (4x quad core, 32 GB RAM) with partitioning.
Hive destroys MySQL with any query like:
(date_id is my partition column.)
set mapred.map.tasks=34;
set mapred.reduce.tasks=11;
FROM pageviews
insert overwrite directory '/user/ecapriolo/hivetest4'
select sitename_id, user_id, count(user_id) WHERE date_id=20091250 group by sitename_id,user_id
12098855 Rows loaded to /user/ecapriolo/hivetest4
OK
Time taken: 185.528 seconds
The same query can take over 3000 seconds on MySQL because these large summary queries are always written to a temp table and then writes bottleneck your read queries.
However, if mysql has an index (and if the index is in memory, which is hard in a warehouse) on some other value in the where clause like:
select sitename_id, user_id, count(user_id) WHERE date_id=20091250 and sitename_id=400 group by sitename_id,user_id
MySQL gets a relative performance speed-up, while Hive ends up scanning the entire table.
I agree with dhruba,
>>This sounds really awesome! Make hadoop-hive suitable for things other than brute force table-scans!
If we had indexes helping stop some brute force scans, that would just open up other doors to what hive could do.

He Yongqiang
added a comment - 19/Feb/10 03:48 I talked with Prasad about this issue today.
I may not be able to finish this in the coming one or two months. I am now spending most of my time working on some other issues. I am sorry about that.
If anyone wants this feature in, please feel free to take over from me, and I will provide all the help that I can. If no one picks it up, I can finish it after finishing the issues at hand.
Thanks.

Prafulla T
added a comment - 08/Jun/10 14:29 He Yongqiang,
Have you started working on this one?
If not, I was interested in taking a look at it.
The patch link hive-417-2009-07-18.patch is not working; can you share the latest patch here?

He Yongqiang
added a comment - 08/Jun/10 22:48 With this patch, the index works, but it is not very intelligent yet.
This is how this patch works:
=== how to create the index table and generate index data ===
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
drop table src_rc_index;
//create an index table on table src_rc, and the index col is key.
//And the index table's data is stored using textfile (also work with seq, rcfile)
create index src_rc_index type compact on table src_rc(key) stored as textfile;
hive> show table extended like src_rc_index;
tableName:src_rc_index
owner:heyongqiang
location: file:/user/hive/warehouse/src_rc_index
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns
{ i32 key, string _bucketname, list<string> _offsets}
About the index table's schema: besides the index columns from the base table, the index table has two more columns (_bucketname string, _offsets array(string)).
//generate the actual index table's data (partitions are also supported)
update index src_rc_index;
====How to use the index table====
//find the offset for 'key=0' in the index table, and put the bucketname and offset list in a temp directory
insert overwrite directory "/tmp/index_result" select `_bucketname` , `_offsets` from src_rc_index where key=0;
set hive.exec.index_file=/tmp/index_result;
//use a new index file format to prune inputsplit based on the offset list
//stored in "hive.exec.index_file" which is populated in previous command
set hive.input.format=org.apache.hadoop.hive.ql.index.io.HiveIndexInputFormat;
//this query will not scan the whole base data
select key, value from src_rc where key=0;
Things done in the patch:
1) hql command for creating index table
2) hql command and map-reduce job for updating index (generating the index table's data).
3) a HiveIndexInputFormat to leverage the offsets got from index table to reduce number of blocks/map-tasks
Things need to be done:
1) Right now the index table is manually specified in queries. We need this to be more intelligent by automatically generating the plan using the index.
2) The HiveIndexInputFormat needs a new RecordReader to seek to a given offset instead of scanning the whole block.
3) Right now we use a map-reduce job to scan the whole index table to find hit offsets. But since the index table is sorted, we can leverage the sort property to avoid the map-reduce job in many cases. (The easiest way is to do a binary search in the client.)
The first todo is the most important part. I think the third may need much more work (maybe not true).
(Note: although this patch has been tested on a production cluster, it could still have bugs. We would really appreciate it if you report any bugs you find here.)
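The split pruning described above can be pictured roughly like this — a simplified stand-in for illustration, not the patch's HiveIndexInputFormat code: given the (bucketname, offsets) rows that the index query wrote to hive.exec.index_file, keep only the input splits whose byte range contains at least one hit offset for that file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Rough sketch of offset-based split pruning, in the spirit of the
// HiveIndexInputFormat described above: the index query produced, per
// file ("bucket"), the offsets where matching rows start; a split is
// only worth scheduling if some hit offset falls inside its range.
public class SplitPruner {
    public static class Split {
        final String file; final long start; final long length;
        public Split(String file, long start, long length) {
            this.file = file; this.start = start; this.length = length;
        }
    }

    public static List<Split> prune(List<Split> splits, Map<String, Set<Long>> hitsPerFile) {
        List<Split> kept = new ArrayList<>();
        for (Split s : splits) {
            Set<Long> hits = hitsPerFile.get(s.file);
            if (hits == null) continue; // no matches at all in this file
            for (long off : hits) {
                if (off >= s.start && off < s.start + s.length) {
                    kept.add(s); // at least one hit lies in this split
                    break;
                }
            }
        }
        return kept;
    }
}
```

This is what makes "select key, value from src_rc where key=0" avoid scanning the whole base data: map tasks are only launched for the surviving splits.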

Prafulla T
added a comment - 09/Jun/10 12:39 I was thinking of adding something called a query rewrite module.
It would be a rule-based query rewrite system and it would
rewrite the query into semantically equivalent query which is
more optimized and/or uses indexes (not just for scans, but
for other query operators, e.g. GroupBy etc.)
Eg.
select distinct c1
from t1;
If we have a dense index ('compact summary index' in this
Hive indexing patch) on c1, this query can be replaced with a query
on the index table itself.
select idx_key
from t1_cmpct_sum_idx;
Similar query transformation can happen for other queries.
Module will be placed just before optimizer and will help optimizer.
Module structure looks like below.
[Query parser]
[Query rewrites] --> new phase
[Query optimization]
[Query execution planner]
[Query execution engine]
The rewrite module is 'generic', not just for the above indexing case,
but for other cases too, e.g. OR predicates to unions (for efficiency?), outer join
to a union of anti & semi joins, moving 'order by' out of a union
subquery, etc.
The aim is to implement very simple, light-weight rewrite support,
implement the indexing-related rewrites (the above rewrite does not
even need a new run-time map-red operator) and integrate indexing
support quickly and cleanly. As noted above, this rewrite phase
is rule-based (not cost-based), a sort of early optimization.
Let me know what you think. I'll start with reading your patch.
This would do most of TODO 1;
TODO 2 and 3 will have to be looked into.

Ashish Thusoo
added a comment - 09/Jun/10 23:21 A couple of comments on this:
A complication that happens by doing a rewrite just after parse is that you lose the ability to report back errors that correspond to the original query. Also the
metadata that you need to do the rewrite is only available after phase 1 of semantic analysis. So in my opinion the rewrite should be done after semantic analysis but before plan generation. Is that what you had in mind...
so something like...
[Query parser]
[Query semantic analysis]
[Query optimization]
...

Prafulla T
added a comment - 10/Jun/10 03:45 Yes Ashish,
That's what I had in mind.
The rewrite system would need metadata, and hence it should be invoked
after the semantic analysis phase, which makes the metadata available.

Prafulla T
added a comment - 20/Jun/10 10:37 Hi All,
Here is the first set of changes for the query rewrite module in Hive and an
initial rewrite rule to transform group-by and distinct queries so that
they make use of indexes.
Note that the patch contains createIndex*-related changes from Yongqiang.
Here are the files that I have touched for rewrite related changes
ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java
– hasIndex and getIndex related APIs
ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java
ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
ql/src/java/org/apache/hadoop/hive/ql/rewrite/HiveRewriteEngine.java
– Rewrite engine infrastructure
ql/src/java/org/apache/hadoop/hive/ql/rewrite/rules/GbToCompactSumIdxRewrite.java
– Actual rewrite rule which does above mentioned optimization
ql/src/java/org/apache/hadoop/hive/ql/rewrite/rules/HiveRwRule.java
ql/src/java/org/apache/hadoop/hive/ql/rewrite/rules/HiveRwRuleContext.java
ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
– Unit test including various queries.
ql/src/test/results/clientpositive/ql_rewrite_gbtoidx.q.out
I've tested some queries end-to-end with this GbToIdx rewrite rule on.
It is working well.
Rewrite changes can be disabled with the hive.ql.rewrite boolean flag,
and this particular rewrite can be disabled/enabled with the
hive.ql.rewrite.gbtoidx boolean flag.
The default value for hive.ql.rewrite is true
and that for hive.ql.rewrite.gbtoidx is false.
There are currently following limitations (which are being worked upon)
1. Rewrite engine does not invoke rewrite in recursive manner.
It currently just rewrites topLevel QB.
2. The GbToIdx rewrite is disabled for all queries having a where clause; we
can support certain where clauses in which all column refs are index key
columns.
3. We need some API to know if indexes are up-to-date or not. We need to
do this rewrite only when index has up-to-date data.
Let me know your review comments.
I am using fork of apache-git repository for development on github.
All these changes are also available on http://github.com/prafullat/hive
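The distinct-to-index rewrite from the patch above can be illustrated with a deliberately simplified, string-level toy. The real GbToCompactSumIdxRewrite rule works on the semantic-analysis output, not on SQL text, and every name in this sketch is hypothetical: if the query is a bare SELECT DISTINCT on a single column that has a compact index, answer it from the index table's key column instead.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Deliberately simplified illustration of the distinct-to-index rewrite:
// "select distinct c1 from t1" becomes a scan of the index table's key
// column when t1 has a compact index on c1. The real rule operates on
// the analyzed query block, not on raw SQL text as done here.
public class DistinctToIndexRewrite {
    private static final Pattern DISTINCT =
        Pattern.compile("select\\s+distinct\\s+(\\w+)\\s+from\\s+(\\w+)\\s*;?",
                        Pattern.CASE_INSENSITIVE);

    // indexTables maps "table.column" -> name of the compact index table.
    public static String rewrite(String query, Map<String, String> indexTables) {
        Matcher m = DISTINCT.matcher(query.trim());
        if (!m.matches()) return query; // rule does not apply
        String idxTable = indexTables.get(m.group(2) + "." + m.group(1));
        if (idxTable == null) return query; // no usable index on this column
        return "select idx_key from " + idxTable + ";";
    }
}
```

The point of the sketch is only the shape of the rule: a match condition (query pattern plus available index) and a semantically equivalent replacement query.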

Namit Jain
added a comment - 30/Jun/10 22:27 Looking at the patch (not yet in detail) suggests the following:
1. The index file can only be a text file.
2. PROJECTION index is not used - I mean, to start, can we just get the basic COMPACT+SUMMARY and only support that.

Namit Jain
added a comment - 01/Jul/10 21:24 A few higher-level comments:
1. Populate the index at create index.
2. Instead of proposing a new syntax, why don't we use 'alter index <INDEX_NAME> ON <TABLE_NAME> REBUILD'?
3. Since the code is in a prototype stage, can we move the index code to contrib ?

He Yongqiang
added a comment - 01/Jul/10 21:56 1. Populate the index at create index.
Let's do this in a follow-up JIRA.
3. Since the code is in a prototype stage, can we move the index code to contrib?
Moving to contrib is not good. In the future we need to plug the index into queries automatically.

Jeff Hammerbacher
added a comment - 01/Jul/10 22:31 3. Since the code is in a prototype stage, can we move the index code to contrib ?
It's been the experience of other Hadoop-related projects that contrib gets messy. It has proven effective to either keep experimental features in mainline trunk or to put them up on github.

Namit Jain
added a comment - 01/Jul/10 22:37
if (work.getReducer() != null) {
    work.getReducer().jobClose(job, success, feedBack);
}
if (IndexBuilderBaseReducer.class.isAssignableFrom(this.getReducerClass())) {
    this.closeIndexBuilder(job, success);
}
Instead of the above code in ExecDriver, IndexBuilderBaseReducer/CompactSumReducer should have a jobClose - no code
change needed in ExecDriver.
I would still vote for the index code to be in contrib; it will take some time to clean it up - then it should be moved to the mainline.
Till then, it is usable, but in a prototype state.
What we should aim for is minimum changes in ql/, putting all changes in contrib for now. As they become stable, we can pull them
in - even the DDLSemanticAnalyzer should be factored into contrib.
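The refactoring suggested here - letting each reducer clean up after itself instead of ExecDriver branching on the reducer's class - can be sketched with a small callback interface. This is a hypothetical illustration, not Hive's actual classes; the names (JobCloseable, ExecDriverSketch) and the indexCommitted field are invented for the sketch:

```java
// Hypothetical sketch: give every reducer a uniform jobClose hook so the
// driver never needs an isAssignableFrom check for index builders.
interface JobCloseable {
    void jobClose(boolean success);
}

class CompactSumReducer implements JobCloseable {
    boolean indexCommitted = false;

    // Index-specific cleanup lives in the reducer, not the driver.
    @Override
    public void jobClose(boolean success) {
        if (success) {
            indexCommitted = true;  // e.g. finalize/merge index files here
        }
    }
}

class ExecDriverSketch {
    // The driver calls jobClose uniformly for any reducer type.
    static void close(JobCloseable reducer, boolean success) {
        if (reducer != null) {
            reducer.jobClose(success);
        }
    }
}
```

With this shape, adding a new index builder reducer requires no driver change; it only overrides jobClose.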

Ashish Thusoo
added a comment - 02/Jul/10 15:20 Looked at the code and have some questions...
Can you explain how the metastore object model is laid out. It seems that the table names of the index are stored in key value properties of the table that the index is created on. Is that correct? Would it be better to put a key reference from the index table to the base table instead (similar to what is done for partitions)?
Also, how would this be used to query the table? Can you give an example?
Is the idea here to select from the index and then pass the offsets to another query to look up the table? An example or a test which shows the query on the base table would be useful.

John Sichi
added a comment - 03/Jul/10 01:24 Had a chat with Ashish and Yongqiang offline, and came up with three alternatives.
1) "Shortest path to checkin": Treat current code as prototype and move it into contrib, providing a utility for creating/updating the index, and keeping changes to core classes to a minimum. As Yongqiang pointed out, this makes it harder to follow up with automatic use of the index due to the lack of metadata. If we do this, we should create a new JIRA issue for its limited scope.
2) "Full-fledged index support": change the JDO metamodel to add support for indexes as first class objects, and come up with a pluggable index creation+access design framework which can encompass a variety of index types likely to be needed in the future. Code from this patch would become the first such index implementation provided. If we do this, we should continue on in this truly epic JIRA issue.
3) "Rework as materialized view": keep the JDO metamodel as is (adding a new table type for MATERIALIZED_VIEW) but change the DDL to CREATE MATERIALIZED VIEW AS SELECT ... and then come up with the system functions needed (e.g. for accessing file offsets) in order to be able to express the index construction as SQL. We would then execute view materialization in a fashion similar to CREATE TABLE AS SELECT. This approach best reflects the way the current code models an index as an ordinary table, but requires some other changes (e.g. CTAS + dynamic partitioning, something we want anyway). If we do this, we should create a new JIRA issue since it's a different feature from the user POV.
We're aiming to reach a decision next week; input is welcome on whether these alternatives make sense (and on others we should consider).
Since this JIRA issue is already so overloaded, we would also like to treat the following two items as separate followup JIRA issues rather than trying to address it all at once:
rewrite framework
automatic usage of index or materialized view by optimizer

Prafulla T
added a comment - 08/Jul/10 12:54 Hi Yongqiang,
I am facing a problem creating SUMMARY indexes.
This index is not built by the update index command.
The COMPACT SUMMARY index works fine. Is there any problem with
the creation of the SUMMARY index table?

He Yongqiang
added a comment - 14/Jul/10 00:09 Attached a new patch for review. Added metastore remodeling based on offline discussion with John and Paul.
Also addressed Namit's suggestion: "Instead of proposing a new syntax, why don't we use 'ALTER INDEX <INDEX_NAME> ON <TABLE_NAME> REBUILD;'"
@Ashish,
>>Also, how would this be used to query the table? Can you give an example?
Added four test files in clientpositive. You can now refer to those files for examples.
An example from one qfile:
CREATE INDEX srcpart_index_proj TYPE COMPACT ON TABLE srcpart(key);
ALTER INDEX srcpart_index_proj ON srcpart REBUILD;
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
INSERT OVERWRITE DIRECTORY "/tmp/index_test_index_result" SELECT `_bucketname`, `_offsets` FROM default__srcpart_srcpart_index_proj__ x WHERE x.key=100 AND x.ds = '2008-04-08';
SET hive.exec.index_file=/tmp/index_test_index_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.io.HiveIndexInputFormat;
SELECT key, value FROM srcpart WHERE key=100 AND ds = '2008-04-08';
Will update the patch and open follow up jiras after get more comments.
Thanks
Yongqiang
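The two-step flow in the example above - probe the index table for (`_bucketname`, `_offsets`) pairs for a key, then use them to narrow the scan of the base table - can be mimicked in miniature. This is a conceptual sketch only (the map stands in for the index table; CompactIndex is not a Hive class):

```java
import java.util.*;

// Conceptual sketch of a compact index: key -> (bucket file, block offsets).
class CompactIndex {
    private final Map<String, Map<String, List<Long>>> index = new HashMap<>();

    // Index build: record the file and block offset where each key occurs.
    void add(String key, String bucketName, long offset) {
        index.computeIfAbsent(key, k -> new HashMap<>())
             .computeIfAbsent(bucketName, b -> new ArrayList<>())
             .add(offset);
    }

    // Query step 1: look up candidate (file, offsets) pairs for a key -
    // the analogue of selecting `_bucketname`, `_offsets` from the index table.
    // Step 2 (not shown) would seek to only those blocks in those files.
    Map<String, List<Long>> lookup(String key) {
        return index.getOrDefault(key, Collections.emptyMap());
    }
}
```

The payoff is that a selective predicate touches only the blocks named by the index, instead of scanning every bucket file.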

John Sichi
added a comment - 15/Jul/10 23:47 Attaching metastore data model diagram reverse-engineered from Yongqiang's latest patch (using Power Architect); can be used in a design doc. Cosmetic: I renamed INDEXS to IDXS for consistency with TBLS, likewise for IDX_ID and other attributes.
I marked one of the relationships between index and table as optional since we want to allow for indexes which don't store their data structure in tabular form. I'm not sure whether there's something we need to do in package.jdo to make sure the corresponding key is nullable.

John Sichi
added a comment - 16/Jul/10 22:45 Here are some preliminary comments on the metastore work. We can move on to the plugin design next week and start getting all of this into a doc.
We should support a property on the index which controls the name of the index table, and only generate an index table name automatically in the case where the user doesn't supply the property. For this, we'll need to add property key/values to the grammar (IDXPROPERTIES like TBLPROPERTIES and SERDEPROPERTIES?).
The grammar supports control over the tableFileFormat for the index table; what about other attributes such as row format, location, and TBLPROPERTIES? Some of these may be dictated by the index implementation, but it may be useful to override in some cases (same as tableFileFormat).
Is the partitioning for the index independent of the partitioning for the table? Don't we need to allow control over this in the grammar?
I think we should track the status of the index (when was the last time it was rebuilt, if ever) so that we know whether it is fresh with respect to the base table data. How should we model this in such a way that it takes per-partition indexing into account?
Some metastore followups to be logged separately: COMMENT clause on index definition; DESCRIBE INDEX; SHOW INDEXES; dealing with base table columns being dropped/renamed out from under the index
For generating the index table structure, we'll need to move that to plugin (rather than in Hive.java), since each index will need a different table structure (or no table structure at all).
Test queries: remember to add ORDER BY for determinism. Also, I'm not sure whether it is safe to use /tmp in the local file system (it may not exist, e.g. on Windows). I used it in hbase_bulk.m, but that uses a mini HDFS cluster (not the local file system).
Dropping a table with an index on it currently gives the exception below (in Derby; I didn't test MySQL yet). Same for attempting to drop an index table directly (instead of dropping the index). The second case should either fail with a meaningful exception, or implicitly drop the index definition as a trigger from dropping the table.
hive> create table t1(i int);
OK
hive> create index q type compact on table t1 ;
OK
hive> drop table t1;
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'INDEXS_FK3' for key (12). The statement has been rolled back.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> create table t5(i int);
OK
hive> create index r type compact on table t5 ;
OK
hive> drop table default__t5_r__;
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'INDEXS_FK2' for key (17). The statement has been rolled back.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

He Yongqiang
added a comment - 16/Jul/10 23:44 THANKS FOR THE DETAILED COMMENTS.
>>We should support a property on the index which controls the name of the index table, and only generate an index table name automatically in the case where the user doesn't supply the property.
will add this in the following patch.
>>For this, we'll need to add property key/values to the grammar (IDXPROPERTIES like TBLPROPERTIES and SERDEPROPERTIES?).
Let's do it in a followup jira.
>>The grammar supports control over the tableFileFormat for the index table; what about other attributes such as row format, location, and TBLPROPERTIES? Some of these may be dictated by the index implementation, but it may be useful to override in some cases (same as tableFileFormat).
We can add this when we see the requirement. For now we can leave this out.
>>I think we should track the status of the index (when was the last time it was rebuilt, if ever) so that we know whether it is fresh with respect to the base table data. How should we model this in such a way that it takes per-partition indexing into account?
I think it's the same as the key/value property, no?
>>Test queries: remember to add ORDER BY for determinism.
will add this in the following patch.
>>Also, I'm not sure whether it is safe to use /tmp in the local file system (it may not exist, e.g. on Windows). I used it in hbase_bulk.m, but that uses a mini HDFS cluster (not the local file system).
I think it should be ok because it's not local /tmp; it's the mini HDFS /tmp.
>>Dropping a table with an index on it currently gives the exception below (in Derby; I didn't test MySQL yet). Same for attempting to drop an index table directly (instead of dropping the index). The second case should either fail with a meaningful exception, or implicitly drop the index definition as a trigger from dropping the table.
Actually this was reported by Prafulla offline. Will address this in the following patch. For the second case, I am planning to report an error.

Ning Zhang
added a comment - 22/Jul/10 19:34 Based on some internal discussions below are some comments about the design doc:
1) the staleness (inconsistency) between the index and the base table should be addressed more precisely.
Since the current implementation allows the user to query the index table directly, we should guarantee that the index is consistent with the base table at query time. This means at the query START time, the index was built completely based on the data stored in the base table. The current design does not satisfy this criterion in that it only records the last_modification_time (LMT) of the base table and the index table, and checks if the latter is larger than the former. This breaks in the following example:
timestamp0: last update of partition P1
timestamp1: start create index on partition P1
timestamp2: start insert overwrite P1
timestamp3: finish insert overwrite P1
timestamp4: finish index creation on P1
timestamp 5: query on P1
The LMTs of the index and the base table are timestamp4 and timestamp3 respectively, so the optimizer will conclude the index is consistent with the base table. However, the index was built based on stale data, so at timestamp5 the index should not be used.
Instead of recording the LMT of the index table, we probably should record the LMT of the base table in the index metadata at the beginning of the index creation. In the above example, the timestamp recorded in the index metadata should be timestamp0. This means the index was created based on the base table at timestamp0. At query time, we should check timestamp0 against timestamp3, which correctly concludes the index is stale.
BTW, all the timestamp should be coming from some centralized clock such as the DFS directory update time (from the namenode).
2) The above consistency problem is not only present in the case of "DEFERRED REBUILD". Even if the index rebuild starts right away after INSERT OVERWRITE, there is still a time window during which the index is stale (before the index creation is complete). So we need the same mechanism to figure out stale indexes.
3) I think lock-based concurrency may not be the best choice either. If the index creation takes a long time, it defers the availability of the base table. If we have the optimizer, we should always query against the base tables and let the optimizer figure out whether an index is available and fresh. So if an index creation is not finished, we can just use the base table; otherwise we can use the index if the cost is less expensive.
4) Another case is that the index creation finished and the query is using the index, and then a DML happened on the base table and finished before the query finishes. Here we only guarantee snapshot consistency (results consistent with the data at the beginning of the query, not after the query).
5) If we have the mechanism to check consistency of the index, then the "index rebuild" command could just return if the index is consistent. We can also allow a "force" option in case we need to compensate for bad metadata.
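The rule in point 1 - remember the base table's LMT as observed when the rebuild STARTED, and compare it to the base table's current LMT at query time - can be sketched as follows. This is illustrative only; as noted above, real timestamps would come from a centralized clock such as the namenode, and the IndexMetadata name is invented here:

```java
// Sketch of the staleness check: the index metadata stores the base
// table's LMT at rebuild start, NOT the index table's own LMT.
class IndexMetadata {
    long baseTableLmtAtBuildStart = -1;   // -1 means never built

    void beginRebuild(long baseTableLmtNow) {
        baseTableLmtAtBuildStart = baseTableLmtNow;
    }

    // Fresh iff the base table has not been modified since the build began.
    // This also makes "index rebuild" from point 5 a cheap no-op when fresh.
    boolean isFresh(long baseTableLmtNow) {
        return baseTableLmtAtBuildStart >= 0
            && baseTableLmtNow <= baseTableLmtAtBuildStart;
    }
}
```

Replaying Ning's timeline: the rebuild begins while the base LMT is timestamp0; the INSERT OVERWRITE then moves the base LMT to timestamp3; at query time isFresh(timestamp3) compares 3 against the recorded 0 and correctly reports the index stale, even though the index finished building later.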

He Yongqiang
added a comment - 26/Jul/10 06:42 Attached a new diff for review.
Update:
Based on some offline discussions, this patch takes a virtual column approach to build index.
Added 2 virtual columns in this patch, one is input filename, and the other is file-wise block_offset.
An example of querying virtual columns:
select INPUT__FILE__NAME, key, BLOCK__OFFSET__INSIDE__FILE from src;

John Sichi
added a comment - 27/Jul/10 02:18 First pass of review comments on latest patch (I'll probably have more tomorrow).
INDEX_NAME precision in the metastore should be 128 characters (not 767), following convention for other identifiers
I don't think we need INDEX_TABLE_NAME at all in the metastore; it should only be used during CREATE INDEX and then forgotten
Move HiveIndexInputFormat and HiveIndexResult to package org.apache.hadoop.hive.ql.index.compact, and add Compact in their names (I'd still prefer to move this entire package out to a new subproj, but I guess we can skip that part now since most of the code went away with the virtual column approach); rename property hive.exec.index_file to hive.index.compact.file
Support WITH DEFERRED REBUILD, and require this to be specified for now to avoid confusion (per discussion in design meeting)
when generating reentrant INSERT, need to quote identifiers such as table/column names (use HiveUtils.unparseIdentifier), and may need extra escaping for special characters in getPartKVPairStringArray (I'm not sure--check with Paul)
thread_local should be private (and named threadLocal); go through public IOContext.get() instead; likewise use public getter/setter methods on IOContext instead of accessing its data members directly
need ORDER BY in virtual_column.q
remove extra semicolon in other ORDER BY's, and make sure they cover a unique key in all cases
don't need TYPE and UPDATE as keywords in grammar
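The thread_local comment in the list above describes the usual per-thread singleton pattern: a private ThreadLocal behind a public static get(), with private data members behind getters/setters. A minimal sketch of that shape (field names follow the review comment; this is not necessarily Hive's final code):

```java
// Sketch of the requested encapsulation: the ThreadLocal is a private
// implementation detail; callers only ever see IOContext.get().
class IOContext {
    private static final ThreadLocal<IOContext> threadLocal =
        ThreadLocal.withInitial(IOContext::new);

    public static IOContext get() {
        return threadLocal.get();
    }

    // Data members stay private; access goes through getters/setters.
    private String inputFile;
    private long blockOffset;

    public String getInputFile() { return inputFile; }
    public void setInputFile(String f) { inputFile = f; }
    public long getBlockOffset() { return blockOffset; }
    public void setBlockOffset(long o) { blockOffset = o; }
}
```

Each thread sees its own instance, so a record reader can stash the current file and offset without synchronization or leaking the ThreadLocal itself.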

John Sichi
added a comment - 27/Jul/10 03:24 Whoops, forgot two leftover from a private diff review:
metastore/if/hive_metastore.thrift:102 instead of including the full indexTable structure inside the Index structure, can we omit it but then pass it as an additional parameter to add_index?
ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java:86 Move generic partition analysis out into Hive, since it will be the same for all plugins.
We can talk more about these tomorrow if it's not clear.

John Sichi
added a comment - 27/Jul/10 18:37 Regarding the mkset function: can we rename this to collect_array to hint that it is a UDAF? The @Description should also make this clear.
Collect is the standard SQL name for this aggregate function, but the standard version returns a multiset rather than an array, so let's call it collect_array to be specific.
Also, it will need its own independent unit tests (open a followup JIRA issue for this).
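Semantically, collect_array gathers a group's values into an array (keeping duplicates, per standard SQL COLLECT), while the original mkset behavior drops them. A plain-Java sketch of the two behaviors - illustrating the semantics, not the UDAF implementation:

```java
import java.util.*;

// Sketch of collect semantics over one group's values.
// collectArray keeps duplicates (array flavor of SQL COLLECT);
// collectSet (the original mkset behavior) eliminates them.
class Collect {
    static List<String> collectArray(List<String> groupValues) {
        return new ArrayList<>(groupValues);            // duplicates kept
    }

    static List<String> collectSet(List<String> groupValues) {
        // LinkedHashSet dedups while preserving first-seen order
        return new ArrayList<>(new LinkedHashSet<>(groupValues));
    }
}
```

The naming point above falls out of this: since the function returns an array rather than a true multiset, collect_array says exactly what it does.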

John Sichi
added a comment - 27/Jul/10 22:29 Another followup needed: REBUILD should be propagating lineage and read/write info from the reentrant INSERT statement up to the top-level statement so that hooks get called with the right information.

He Yongqiang
added a comment - 27/Jul/10 22:46 @Ashish
>>why is virtualcolumn class in the serde2 package?
Will put it in the ql.io package. I put it in the serde2 package just because I thought it might be needed by the serde layer. Since the code is almost done and it is not accessed by serde,
it makes sense to move it to ql.
>>how is the file name populated
The filename and block offset are both populated by the record reader. The filename is populated by looking at the split path when we construct the record reader; the offset is generated at runtime by the record reader.
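The mechanism described here - the reader knows the file from the split it was constructed with, and tracks the offset as it reads - can be mimicked with a toy line reader that emits (filename, offset, value) triples. This is a standalone illustration, not Hive's record reader; the class and method names are invented:

```java
import java.util.*;

// Sketch: a reader that populates the two virtual columns itself.
// The filename comes from the "split" it was given; the offset is the
// byte position of each record, accumulated while reading.
class VirtualColumnReader {
    static List<String[]> read(String fileName, String contents) {
        List<String[]> rows = new ArrayList<>();
        long offset = 0;   // analogue of BLOCK__OFFSET__INSIDE__FILE
        for (String line : contents.split("\n", -1)) {
            if (!line.isEmpty()) {
                // row = { INPUT__FILE__NAME analogue, offset, record value }
                rows.add(new String[] { fileName, Long.toString(offset), line });
            }
            offset += line.getBytes().length + 1;  // +1 for the '\n'
        }
        return rows;
    }
}
```

This is exactly what makes the virtual-column approach attractive for the compact index: the index build is just a query selecting these columns, with no custom index-writing job.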

John Sichi
added a comment - 29/Jul/10 04:45 Yongqiang, I passed tests on Hadoop 0.20, but Ning has committed HIVE-1408, which conflicts, so you'll need to rebase against that and then I'll try again.

John Sichi
added a comment - 29/Jul/10 07:06 Since another patch is needed, here are the review comments I mentioned above.
Javadoc for Hive.createIndex needs parameters fixed
Javadoc for HiveIndexHandler.analyzeIndexDefinition: remove storageDesc
In HiveUtils.getIndexHandler: the message should be "Error in loading index handler" rather than "Error in loading storage handler"
GenericUDAFCollectSet @Description : "with no duplication elements" should be "with duplicate elements eliminated"
DDLSemanticAnalyzer.analyzeCreateIndex: hanlder is misspelled
Property AbstractIndexHandler.INDEX_COLS_KEY is never used; get rid of it?
For HiveIndex.INDEX_TABLE_CREATETIME property name, spell out "lastModifiedTime" instead of "lmt"