Ashish Thusoo
added a comment - 29/Jul/09 19:04 Also it would be great if you could comment on how you plan to map the hbase data model to the sql data model (i.e. tables, columns, etc.).
This will be a cool contribution....
SerDe would be the right way to go...
Thanks,
Ashish

Sijie Guo
added a comment - 31/Jul/09 08:49 The key problem in letting hive analyse hbase tables is how to map hbase's data model to hive's sql data model.
As we know, hbase data is accessed by <key, column_family:column_name, timestamp>. So a meta-data mapping should be recorded in hive's metadata, as below:
-------------------------------------------------------
hbase's tablename -> hive's tablename
hbase's columns -> hive's columns
hbase's key -> hive's first column
hbase's timestamp -> hive's second column
-------------------------------------------------------
The key and timestamp of an hbase table will be mapped automatically to the first two default columns in the hive table. So an hbased-hive table will look like <.key, .timestamp, ..., other columns defined by users>.
For example, an hbase table 'webpages' has columns <contents:page_content, anchors:>. There are 2 column families, "contents" and "anchors". The content of table 'webpages' is stored in column 'contents:page_content', so that data is dense. The anchors of a given page vary from page to page, so the data in 'anchors:' will be sparse.
The columns of an hbase table will be mapped manually by programmers: we can map a full column <column_family:column_name> in hbase to a primitive_type column in hive, while mapping a column family <column_family:> in hbase to a map_type column in hive. So the hive schema of the hbase table 'webpages' will be (.key, .timestamp, page_content, anchors).
When setting up the schema mapping between an hbase table and a hive table, we need to consider how to record the schema mapping, serialize hive objects into the hbase table, and deserialize hbase data into hive objects.
The proposal is to add a new HBaseSerDe that records the schema mapping in its SerDe properties. The SerDe can then use the mapping to serialize hive objects into the hbase table and deserialize hbase data into hive objects.
The properties in HBaseSerDe will be:
1) "hbase.key.type" : the type of the .key column in the hive table, defining how to deserialize the .key field from the hbase key (the hbase key is a byte array).
2) "hbase.schema.mapping" : a comma-separated string defining the schema mapping. The columns are mapped in order, one by one.
These properties should be provided when creating an hbased-hive table. If "hbase.key.type" is not defined, we treat the key as a string. But if "hbase.schema.mapping" is not defined, we should fail the table creation, because we do not know how to deserialize hive objects from hbase's raw bytes.
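As a sketch of how those two rules might be enforced (the class name and property handling here are hypothetical, not the actual patch), the SerDe could validate its properties at initialization:

```java
// Hypothetical sketch: applying the two rules above -- default the key
// type to "string", and reject a table definition with no schema mapping.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class HBaseSerDeProps {
    public final String keyType;
    public final List<String> columnMapping;

    public HBaseSerDeProps(Properties tbl) {
        // Rule 1: a missing "hbase.key.type" is treated as a string key.
        this.keyType = tbl.getProperty("hbase.key.type", "string");

        // Rule 2: a missing "hbase.schema.mapping" must fail table creation,
        // since we cannot deserialize raw bytes without it.
        String mapping = tbl.getProperty("hbase.schema.mapping");
        if (mapping == null || mapping.isEmpty()) {
            throw new IllegalArgumentException(
                "hbase.schema.mapping must be set for an hbase-backed table");
        }
        // The mapping is comma-separated and applied to hive columns in order.
        this.columnMapping = Arrays.asList(mapping.split(","));
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("hbase.schema.mapping", "contents:page_content,anchors:");
        HBaseSerDeProps props = new HBaseSerDeProps(p);
        System.out.println(props.keyType);              // string (defaulted)
        System.out.println(props.columnMapping.size()); // 2
    }
}
```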
The operations on an hbased-hive table are shown below:
1. Using an existing hbase table as an external table in hive
The 'create' command will be as below:
-----------------------------
CREATE EXTERNAL TABLE webpages(page_content STRING, anchors MAP<STRING, STRING>)
COMMENT 'This is the pages table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
WITH SERDEPROPERTIES (
"hbase.key.type" = "string",
"hbase.columns.mapping" = "contents:page_content,anchors:",
)
STORED AS HBASETABLE
LOCATION '<hbase_table_location>'
-----------------------------
Here the hbase_table_location identifies the location of the hbase cluster and the hbase table name, such as "hbase:/hbase_master:port/hbase_tablename".
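A location string of that shape could be split apart like this (an illustrative sketch; the class name and error handling are hypothetical, not from the patch):

```java
// Hypothetical sketch: splitting "hbase:/hbase_master:port/hbase_tablename"
// into the master address and the table name.
public class HBaseLocation {
    public final String master;    // e.g. "hbase_master:port"
    public final String tableName;

    public HBaseLocation(String location) {
        if (!location.startsWith("hbase:/")) {
            throw new IllegalArgumentException("not an hbase location: " + location);
        }
        String rest = location.substring("hbase:/".length());
        int slash = rest.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("missing table name: " + location);
        }
        this.master = rest.substring(0, slash);
        this.tableName = rest.substring(slash + 1);
    }

    public static void main(String[] args) {
        HBaseLocation loc = new HBaseLocation("hbase:/master.example.com:60000/webpages");
        System.out.println(loc.master);    // master.example.com:60000
        System.out.println(loc.tableName); // webpages
    }
}
```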
And after creating an external table over an existing hbase table, we can run analysis on it like a normal hive table.
A. Get all the urls and their pages that were added after a specified time t1.
SELECT .key, page_content FROM webpages WHERE .timestamp > t1;
B. Get the revisions of a specified url <www.apache.org> from a specified time t1 to a specified time t2.
SELECT page_content FROM webpages WHERE .timestamp > t1 AND .timestamp < t2 AND .key = 'www.apache.org';
2. Creating a new hbase table as a hive table.
The 'create' command will be as below:
-----------------------------
CREATE TABLE webpages(page_content STRING, anchors MAP<STRING, STRING>)
COMMENT 'This is the pages table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
WITH SERDEPROPERTIES (
"hbase.key.type" = "string",
"hbase.columns.mapping" = "contents:page_content,anchors:",
)
STORED AS HBASETABLE
LOCATION '<hbase_table_location>'
-----------------------------
After invoking the 'create' command, the hive client will also create an hbase table in the specified hbase cluster. The created hbase table will have the two column families defined in the HBaseSerDe properties, "contents:" and "anchors:".
3. Loading data into tables.
As we have two default hidden columns (.key, .timestamp) in an hbased-hive table, we must account for these two columns when inserting data.
We can load data into an hbased-hive table either by inserting data from other tables or by loading data from the local filesystem.
A. Inserting data from other tables.
For example, we have a 'crawled_pages' table collecting all the pages crawled from the internet. The schema of 'crawled_pages' is simple: <url, crawled_date, page_content>.
I. If we want to load all this data into the 'webpages' table, we invoke the command below:
FROM crawled_pages cp
INSERT TABLE webpages
SELECT cp.url, cp.crawled_date, cp.page_content, null;
II. If we do not want to specify the time when inserting the data, we can simply set the .timestamp column to 'null', as below:
FROM crawled_pages cp
INSERT TABLE webpages
SELECT cp.url, null, cp.page_content, null;
III. And if the .key column provided is null, should we throw an error back to the client or just skip the bad records?
FROM crawled_pages cp
INSERT TABLE webpages
SELECT null, null, cp.page_content, null;
B. Loading data from local filesystem (or hdfs)
Currently hive just copies/moves the file into the specified dir of a hive table. We should forbid that when loading data into an hbased-hive table.
If we want to load data from files in the local filesystem (or hdfs) into hbased-hive tables, we can do as below:
I. Create a temp external table over the original data (files).
II. Load data into the hbased-hive table using 'insert' from the temp external table.
4. Performance Improvements
Some improvements may be considered when analysing hbase tables. For example, the hbase key is an index for accessing data and could be used to accelerate hive. Not clear yet.
-----------------------------
Forgive my poor English, and comments are welcome.

Sijie Guo
added a comment - 09/Aug/09 07:05 Attaching my patch.
There is a small difference from my previous proposal.
Creating a table will be:
-------------------------------------
CREATE EXTERNAL TABLE webpages(pageURL STRING, page_content STRING, anchors MAP<STRING, STRING>)
COMMENT 'This is the pages table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "contents:page_content,anchors:",
)
STORED AS HBASETABLE
LOCATION '<hbase_table_location>'
--------------------------------------
The first field defined in the hive table will be mapped to the hbase table key, and the remaining fields will be mapped to the hbase columns specified in the serde property named "hbase.columns.mapping".
The timestamp field is not included for now. I just retrieve the latest version of each hbase cell.
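The revised pairing — first hive column as the row key, the rest matched in order against "hbase.columns.mapping" — might be sketched like this (hypothetical names, not the patch code):

```java
// Hypothetical sketch: the first hive column is the hbase row key; the
// remaining columns line up, in order, with the comma-separated entries
// of "hbase.columns.mapping".
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnsMapping {
    /** Maps each non-key hive column to its hbase column (or family). */
    public static Map<String, String> pair(List<String> hiveColumns,
                                           String mappingProperty) {
        String[] targets = mappingProperty.split(",");
        if (targets.length != hiveColumns.size() - 1) {
            throw new IllegalArgumentException(
                "mapping must cover every hive column except the key");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 1; i < hiveColumns.size(); i++) {
            result.put(hiveColumns.get(i), targets[i - 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("pageURL", "page_content", "anchors");
        Map<String, String> m = pair(cols, "contents:page_content,anchors:");
        System.out.println(m.get("page_content")); // contents:page_content
        System.out.println(m.get("anchors"));      // anchors:
    }
}
```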


Schubert Zhang
added a comment - 10/Aug/09 18:24 Hi Samuel,
Thanks for your great job.
In your patch, we found many java files are modified; it is really a big change. I don't know if there is any way to avoid such a big modification.
Regarding the schema mapping between the HBase table and the Hive SQL table, I have the following considerations.
1. We just want to use HBase as a scalable structured data store, or key-value store.
2. The performance was not good when we mapped SQL columns to HBase columns in our past experience. For example, if we have a table with 20 columns, then each read or write of a row will comprise 20 key-value operations. That is inefficient.
How about considering a more flexible schema mapping:
1. one HBase column can map to multiple hive-SQL columns with a SerDe. e.g. cf1:q1 =>
{(col1, col2, col3), Default SerDe}
2. one HBase column family can map to multiple hive-SQL columns with a SerDe. e.g. cf2: =>
{(col3, col5, col6), Default SerDe}
3. Your MAP column (in the Hive table) for sparse column families. [Optional] Since Hive is a structured data analysis front-end, we can omit this feature at the beginning.
For example:
CREATE EXTERNAL TABLE hive_table (pkey STRING, col1 STRING, col2 INT, col3 STRING, col4 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MyHBaseSerDe'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf1:(col1,col2,col3) with DefaultSerDe, cf2:c1 (col4) with DefaultSerDe",
)
STORED AS HBASETABLE
LOCATION '<hbase_table_location>'
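One way to realize items 1 and 2 — several hive columns packed into a single hbase cell by a delimiting "default SerDe" — might look like the following sketch (hypothetical; it uses hive's default ^A field delimiter purely for illustration):

```java
// Hypothetical sketch of the idea above: several hive columns serialized
// into one hbase cell, so a row read is one key-value operation instead of N.
import java.util.Arrays;
import java.util.List;

public class PackedCell {
    private static final char DELIM = '\u0001'; // hive's default field delimiter

    /** Serialize e.g. (col1, col2, col3) into one cell value for cf1:q1. */
    public static byte[] pack(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(DELIM);
            sb.append(fields.get(i));
        }
        return sb.toString().getBytes();
    }

    /** Deserialize the cell back into hive column values. */
    public static List<String> unpack(byte[] cell) {
        return Arrays.asList(new String(cell).split(String.valueOf(DELIM), -1));
    }

    public static void main(String[] args) {
        byte[] cell = pack(Arrays.asList("a", "b", "c"));
        System.out.println(unpack(cell)); // [a, b, c]
    }
}
```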
Usually, we want a more advanced data store backend than HDFS, to achieve more flexible data placement and indexing. HBase's data model is very good for this requirement, but we may not need the full features of HBase here.
–
Looking forward to more communication with you, in Chinese at your convenience.
Schubert


Sijie Guo
added a comment - 11/Aug/09 04:05 @schubert,
Thank you for your comment.
>> In your patch, we found many java files are modified; it is really a big change. I don't know if there is any way to avoid such a big modification.
An HBase table is quite different from a file in HDFS, and the original Hive code is based on files. For example, when outputting the reduce results to the target table, Hive uses a FileSinkOperator to write the results to temp files in HDFS, and uses a MoveTask to rename the temp files to the target table dir. But when the target table is based on an HBase table, we do not need these file operations and can just output to the target HBase table.
The modification of the original java files is to tell hive to deal with an hbase table in a different way.
I will try to look into the code and find a way to avoid the modification.
>> 2. The performance was not good when we mapped SQL columns to HBase columns in our past experience. For example, if we have a table with 20 columns, then each read or write of a row will comprise 20 key-value operations. That is inefficient.
A good point. The schema mapping does not affect performance when creating a hive table. Performance suffers if we scan all the mapped columns out of the hbase table in an actual query. Some code will be added to do column pruning during the hbase table scan.
For example, an hbase table (cf1:(col1,col2,col3), cf2:(col4,col5,col6), ..., cfn:(colk,colj,coll)) is mapped to a hive table (column1, column2, column3, column4, ..., columnN).
If the query "select column3, column4 from hbasedhivetable" is invoked, we should not let hbase scan out all the columns. We know all the hive columns used in the query, map them back to hbase columns, and get the scanning list "cf1:col3 cf2:col4". We set that list in the HBaseInputFormat so that HBase scans out only the useful columns.
The code will be added in the new patch.
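The pruning step described above might be sketched like this (hypothetical names; per the comment above, the resulting list would then be handed to HBaseInputFormat):

```java
// Hypothetical sketch: given the hive->hbase column mapping and the
// columns a query actually touches, build the scan list so hbase only
// materializes the useful columns.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ScanListBuilder {
    public static String buildScanList(Map<String, String> hiveToHBase,
                                       List<String> queriedHiveColumns) {
        List<String> scan = new ArrayList<>();
        for (String hiveCol : queriedHiveColumns) {
            String hbaseCol = hiveToHBase.get(hiveCol);
            if (hbaseCol != null) {
                scan.add(hbaseCol);
            }
        }
        // Emit a space-separated column list, as in the example above.
        return String.join(" ", scan);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("column3", "cf1:col3");
        mapping.put("column4", "cf2:col4");
        mapping.put("column5", "cf2:col5");
        // "select column3, column4 from hbasedhivetable" should scan
        // only cf1:col3 and cf2:col4.
        System.out.println(buildScanList(mapping, List.of("column3", "column4")));
    }
}
```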
>> cf2: =>
{(col3, col5, col6), Default SerDe}
Cool. Letting different SerDes work on different hbase columns. I will try it in the new patch.
>> Look forward to have more communication with you in Chinese, by your convenience.
My Gtalk is : sijie0413@gmail.com


Ashish Thusoo
added a comment - 11/Aug/09 23:49 The data model mapping works. I have one suggestion though: can we infer the column list of the hive table from the hbase table instead of explicitly stating it in the create command? My concern is that adding a column family in hbase will require an alter table in hive, and if we can avoid that it would be great.


Sijie Guo
added a comment - 12/Aug/09 04:02 @Ashish
Thank you for your comment.
It is difficult to infer the column list from a sparse-column hbase table, because we do not know exactly how many columns a given hbase table has. We only know its column families.
Also, the data in hbase are all raw bytes. If we do not explicitly state the schema mapping, we lose the information about how to serialize/deserialize the data from raw bytes.
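A quick illustration of that last point: the same raw bytes are a valid value under several types, and hbase itself records no type information, so without a declared hive type there is no way to pick a deserialization (plain JDK calls, just for illustration):

```java
// The same four bytes read as an int and as a string give different,
// equally valid values -- only the declared schema disambiguates them.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RawBytes {
    public static void main(String[] args) {
        byte[] raw = {0x31, 0x32, 0x33, 0x34};
        int asInt = ByteBuffer.wrap(raw).getInt();
        String asString = new String(raw, StandardCharsets.US_ASCII);
        System.out.println(asInt);    // 825373492
        System.out.println(asString); // 1234
    }
}
```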

java.lang.NullPointerException
at org.apache.hadoop.hbase.mapred.TableInputFormat.configure(TableInputFormat.java:52)
at org.apache.hadoop.hive.ql.io.HiveHBaseTableInputFormat.configure(HiveHBaseTableInputFormat.java:36)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputFormatFromCache(HiveInputFormat.java:184)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:211)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
at org.apache.hadoop.mapred.Child.main(Child.java:158)

There is another issue: this query returns nothing.
hive> select * from hbase_table_1;
OK
Time taken: 2.952 seconds

Sijie Guo
added a comment - 13/Aug/09 06:12 @kula @stephen
Thank you all for your comments.
1) As stephen mentioned, the NullPointerException is thrown because the COLUMN_LIST is set in the wrong job configuration.
I will fix it in the new patch.
2) It seems the "select *" statement is buggy now. I will find the problem and fix it.


Sijie Guo
added a comment - 23/Aug/09 13:20 Attach a new patch.
1) Move the related hbase code into the contrib package, as hbase is just an optional storage backend for hive, not a necessary one.
I tried to avoid modifying the original hive code and just add an hbase serde to connect hive with hbase. But the hbase storage model is quite different from the file storage model. For example, a loadwork is used to rename/copy files from the temp dir to the target table's dir when a query's target is a hive table, but for an hbased hive table we can't rename a table now. So it's hard to make an hbased hive table follow the logic of a normal file-based hive table, and I added some code (HiveFormatUtils) to distinguish a file-based table from a non-file-based one.
2) Fix some bugs in the draft patch, such as "select *" returning nothing.
----------------------------------------------------------------------------------------------
How to use the hbase as hive's storage?
1) Remember to add the contrib jar and the hbase jar to hive's auxPath, so that m/r can ship the necessary hbase-related jars to the whole hadoop m/r cluster.
> $HIVE_HOME/bin/hive -auxPath ${contrib_jar},${hbase_jar}
2) modify the configuration to add the following configuration parameters.
"hbase.master" : pointer to the hbase's master.
"hive.othermetadata.handlers" : "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler"
"hive.othermetadata.handlers" collects the metadata handlers to handle the other metadata operations in the not-file-based hive tables. Take hbase as an example. HBaseMetadataHandler will create the neccessary hbase table and its family columns when we create a hbased hive table from hive's client. It also drop the hbase table when we drop the hive table.
The metastore read the registered handlers map from the configuration file during initialization. The registered handlers map is formated as "table_format_classname:table_metadata_handler_classname,table_format_classname:table_metadata_handler_classname,...".
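Parsing that handlers map might look like the following sketch (illustrative only; the actual registration code in the patch may differ):

```java
// Hypothetical sketch: parsing "formatClass:handlerClass,formatClass:handlerClass,..."
// into a map from table format class name to metadata handler class name.
import java.util.LinkedHashMap;
import java.util.Map;

public class HandlerRegistry {
    public static Map<String, String> parse(String value) {
        Map<String, String> handlers = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return handlers;
        }
        for (String entry : value.split(",")) {
            // Class names contain dots but no colons, so ':' separates the pair.
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("bad handler entry: " + entry);
            }
            handlers.put(parts[0], parts[1]);
        }
        return handlers;
    }

    public static void main(String[] args) {
        Map<String, String> h = parse(
            "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:"
          + "org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler");
        System.out.println(h.size()); // 1
    }
}
```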
3) enjoy "hive over hbase"!
------------------------------------------------------------------------
Other problems.
1) Altering an hbased-hive table is not supported now.
Renaming a table is not supported in hbase now, so I simply do not support the rename operation. (Maybe if we rename a hive table, we do not need to rename the underlying hbase table.)
Adding/replacing columns:
Now we need to specify the schema mapping explicitly in the SerDe properties. To add columns, we need to call 'alter' twice: once to change the serde properties and once to change the hive columns. Changing either first will fail now, because we validate the schema mapping during SerDe initialization. One of the hbase serde validations checks that the count of hive columns matches the count of hbase mapping columns; if we change the hive columns first, there will be more hive columns than mapped hbase columns, and HBaseSerDe initialization will fail the alter operation. (Maybe we need to remove the validation from HBaseSerDe initialization and do it elsewhere?)
2) more flexible schema mapping?
As Schubert mentioned before, a more flexible schema mapping will be useful for users. This feature will be added later.
Comments are welcome~

java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:110)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:165)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:345)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:330)
at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:58)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:345)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:330)
at org.apache.hadoop.hive.ql.exec.Operator.initializeOp(Operator.java:316)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:289)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:308)
at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:82)
... 7 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:88)
... 19 more

Sijie Guo
added a comment - 06/Sep/09 11:54 @stephen:
Did you set the configuration parameter ' "hive.othermetadata.handlers" : "org.apache.hadoop.hive.contrib.hbase.HiveHBaseTableInputFormat:org.apache.hadoop.hive.contrib.hbase.HBaseMetadataHandler" '?
I am sorry, I have other things to handle these days. I will fix the bug as soon as I have time.


Sijie Guo
added a comment - 08/Sep/09 04:51 @stephen
I have run the patch on my notebook. But I did not encounter the NullPointerException mentioned in your comment.
Can you send me the hive log and the userlogs of the mr job 'FROM src INSERT OVERWRITE TABLE hbase_table_1 SELECT *;' ?
Thanks.

stephen xie
added a comment - 08/Sep/09 06:24 Thanks very much for Samuel's help.
The issue above has been resolved.
In the distributed test environment, the hive command must be run with the parameter --auxpath hive_contrib.jar,hbase.jar.


John Sichi
added a comment - 11/Feb/10 23:31 Here's the result of rebasing the old patch to apply against latest trunk. This is NOT intended for submission; it's just a checkpoint of the rebasing work for anyone who needs it. For the real submission, I'll be doing quite a bit of refactoring to generalize the concept of plugging in external storage, and possibly other concepts from HIVE-1133 based on pending discussions.
Notable changes from the old patch:
update to require HBase 0.20.3, resulting in new zookeeper dependency (I'm testing with zookeeper 3.2.2)
eliminated parser changes; they'll probably come back in a more general form something like STORED BY 'storage-handler-class' which will encapsulate the combination of inputformat, outputformat, metastore hooks, and optimizer interaction such as filter/predicate pushdown


Jonathan Ellis
added a comment - 21/Feb/10 05:10 ISTM that merging the HBase columnfamilies into a single Hive table is the wrong approach and could lead to poor performance; rather, each HBase CF should be its own Hive table, which may of course be joined with others as necessary. (I think using the word "table" for HBase's "collection of CFs" is unfortunate in the first place since they are different animals; fundamentally, the basic unit of data access in HBase is the CF.)
I'm interested because Cassandra is also looking at adding Hive support, and we also implement a ColumnFamily data model.

John Sichi
added a comment - 22/Feb/10 18:54 Jonathan, thanks for the input. I think we should be able to come up with a mapping feature which encompasses what you've proposed plus what's in HIVE-806 so that it will be up to the user to decide how to map a particular set of HBase tables into Hive.
We can do this by allowing the HBase table name to be specified as part of mapping it into Hive. That way, you can have
Hive t1(c1, c2) -> HBase t.cf1(c1, c2)
Hive t2(c3, c4) -> HBase t.cf2(c3, c4)
or
Hive t(c1,c2,c3,c4) -> HBase t(cf1(c1, c2), cf2(c3, c4))
or
Hive t(cf1map, cf2map) -> HBase t(cf1, cf2)
or variations. I'm going to write up a proposal in the Hive wiki and solicit feedback.
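To make the first variation concrete, here is one way such a mapping might be declared, using the STORED BY clause John mentions earlier in the thread. This is an illustrative sketch only: the property names and handler class shown here are assumptions about how the proposal could look, not final syntax.

```sql
-- Hypothetical sketch: one Hive table per HBase column family.
-- Hive t1(c1, c2) -> HBase t.cf1(c1, c2)
CREATE TABLE t1 (rowkey STRING, c1 STRING, c2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- ':key' maps the HBase row key; 'cf1:c1' maps family cf1, qualifier c1
  "hbase.columns.mapping" = ":key,cf1:c1,cf1:c2"
)
-- Naming the HBase table explicitly lets t1 and t2 share backing table 't'
TBLPROPERTIES ("hbase.table.name" = "t");
```

A second table t2 declared the same way against "cf2" would then give Jonathan's one-table-per-CF layout, while a single table mapping both families would give the merged layout.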

John Sichi
added a comment - 03/Mar/10 03:06 First draft of the patch ready for review. Reviewers, please read these two accompanying docs:
http://wiki.apache.org/hadoop/Hive/HBaseIntegration
http://wiki.apache.org/hadoop/Hive/StorageHandlers
Note that for this to be committed, it needs the accompanying jars, which I have also attached:
hbase-0.20.3.jar
hbase-0.20.3-test.jar
zookeeper-3.2.2.jar
These should be committed to trunk/hbase-handler/lib

Prasad Chakka
added a comment - 03/Mar/10 19:38 John, why are pre, commit, and rollback functions needed in MetaHook? Isn't it enough just to drop the table as the rollback for create, and to do the drop after Hive's own drop table? With the current definition, the MetaHook implementation needs to keep state around, which Hive itself doesn't do.
Also, alter table on external tables should be allowed, since the underlying storage format for external tables is not managed by Hive itself. In such cases alter table is just changing metadata inside Hive.

John Sichi
added a comment - 03/Mar/10 20:13 Prasad, the MetaHook interface is defined that way so that if a handler wants to, it can carry out the operation in a stateful fashion (e.g. if its underlying catalog supports transactions), but there is no requirement for it to keep state, and in fact the HBaseStorageHandler implementation is itself stateless (and has a NOP for three of its method implementations).
Alter table: yes, I'm planning to create a followup task for this. The original patch had alter table support in the meta hook interface too, but I trimmed it down for now to limit the scope of the first commit.
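The pre/commit/rollback shape John describes can be sketched as below. The interface and class names here are illustrative, not Hive's actual API; the point is that the three-phase contract permits a stateful handler but does not require one, since a stateless handler can make rollback the plain inverse of commit and leave the rest as NOPs.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative three-phase metastore hook for table creation
// (names are assumptions, not the real Hive interface).
interface CreateTableHook {
    void preCreateTable(String name);      // validate before the metastore write
    void commitCreateTable(String name);   // create the backing external table
    void rollbackCreateTable(String name); // undo a failed create
}

// Stateless implementation: rollback simply reverses commit, so nothing
// needs to be remembered between the calls (mirroring John's point that
// HBaseStorageHandler leaves several of its hook methods as NOPs).
class ExternalCatalog implements CreateTableHook {
    private final Set<String> tables = new HashSet<>();

    public void preCreateTable(String name) { /* NOP */ }
    public void commitCreateTable(String name) { tables.add(name); }
    public void rollbackCreateTable(String name) { tables.remove(name); }

    boolean contains(String name) { return tables.contains(name); }
}
```

A handler whose catalog supports transactions could instead open a transaction in pre and resolve it in commit/rollback; the interface accommodates both styles.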

John Sichi
added a comment - 04/Mar/10 01:47 While testing, found a few bugs in HBaseSerDe.serialize for the case where a Hive map is being converted into an HBase column family; I'll fix these together with whatever comes out of review.
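The conversion John is testing can be sketched as follows: each entry of a Hive MAP column becomes one HBase cell, with the map key serving as the column qualifier within the target family. The types and names below are illustrative, not Hive's actual SerDe API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap; // used in the usage example below

// Hypothetical helper: flatten a Hive map into (family, qualifier, value)
// triples, one per map entry, ready to be written as HBase cells.
class MapToColumnFamily {
    static List<String[]> toCells(String family, Map<String, String> hiveMap) {
        List<String[]> cells = new ArrayList<>();
        for (Map.Entry<String, String> e : hiveMap.entrySet()) {
            // Cell addressed as family:key with the map value as contents
            cells.add(new String[] { family, e.getKey(), e.getValue() });
        }
        return cells;
    }
}
```

For example, a Hive map {"a" -> "1", "b" -> "2"} mapped to family "anchors" would yield the cells anchors:a = 1 and anchors:b = 2.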

Jonathan Ellis
added a comment - 10/Mar/10 15:37 Thanks John, I read your wiki notes and it does look like this will work fine for Cassandra at least at the conceptual level.
Is HIVE-806 redundant w/ your latest patchset now?

John Sichi
added a comment - 10/Mar/10 18:59 @Jonathan: I haven't seen any patch uploaded for HIVE-806. The comments indicate that they have a way to customize the serialization per column in HBase, which could be interesting, but it's non-essential. Once HIVE-705 gets committed, I'll post a comment on HIVE-806 and ask whether they want to keep it open or abandon it.
@Namit: will do.