Add Hive support

Details

Description

Hive ( http://hadoop.apache.org/hive/ ) is a project that runs SQL queries against Hadoop map/reduce clusters. (It is meant for analytics; its latency is too high to run applications against it directly.) HIVE-705 added support for storage backends other than HDFS, with HBase as the first. Cassandra support should now be doable too.

Activity

Jonathan Ellis
added a comment - 30/Mar/10 15:43 Starting points:
The Cassandra InputFormat for Hadoop is org.apache.cassandra.hadoop.ColumnFamilyInputFormat; the record reader and input split are in the same package. There is an example of using these in contrib/word_count, and Pig integration in contrib/pig.
You can look at the .7 patch to HIVE-705 to see how HBase support was added. Unfortunately, it is not split into "Hive infrastructure refactoring" and "HBase support"; the two are mixed together.

John Sichi
added a comment - 13/Apr/10 04:07 Regarding HIVE-705, all files under hbase-handler constitute the HBase support, and the rest is Hive infrastructure refactoring, so you can use that split to review them separately.

p shirish reddy
added a comment - 15/Apr/10 06:38 I have submitted a proposal for this as part of a GSoC project. The link to the proposal is http://socghop.appspot.com/gsoc/student_proposal/private/google/gsoc2010/shirish_reddy_89/t127072582147 . I'd like suggestions and comments.

Nicolas Lalevée
added a comment - 08/Nov/11 12:50 I cannot reopen this issue, so I'll just comment.
As suggested by Jonathan in HIVE-1434, a Hive/Cassandra bridge may fit better here.
I have finally found the source of Brisk's implementation ( https://github.com/riptano/hive ). The patch I am submitting here (CASSANDRA-913-r1199213.patch) is based on their work, so I cannot grant any license here.
What I did to the original source:
I changed the package names (some classes needed package-level access).
I added ASL2 headers for the ASF.
I formatted the code according to the Cassandra standard.
I changed some loggers from log4j and commons-logging to slf4j.
It did not handle nulls in Hive tables well; I have fixed that for the few tests I ran.
About the build: it needs the Hive jars in contrib/hive/lib. I don't know a better way to set this up, since those jars are not available in the Maven repo.
About runtime: I had a lot of trouble with a conflict between the Thrift library used by Hive and the one used by Cassandra. Hive 0.7 uses Thrift 0.5, while Cassandra uses 0.6. A Cassandra external table could not be declared in Hive because of a NoSuchMethodException.
As far as I understand Hive, it needs Thrift at job runtime only to handle dynamic column serialization. My use case did not need that, so I used a hack: I removed every org.apache.thrift class from hive-exec.jar. After that it works nicely (for my use case).
There were some tests in the GitHub repo. They are Hive-oriented, and I have not tried to make them work in Cassandra's source tree.
Hive 0.8 will use Thrift 0.7 (hopefully backward compatible with 0.6), and Hive artifacts will be published to the Maven repository ( HIVE-1095 ). So it is probably best to wait for that, which should make integration into Cassandra easier?
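
A minimal sketch of the jar workaround described in the comment above (removing every org.apache.thrift class from hive-exec.jar), assuming Python 3 is available. The helper name strip_package and the output file name are illustrative and not part of Hive or Cassandra; only hive-exec.jar and the org.apache.thrift package come from the comment. It writes a new jar rather than modifying the original, so the change is easy to undo.

```python
import zipfile


def strip_package(src_jar, dst_jar, prefix="org/apache/thrift/"):
    """Copy src_jar to dst_jar, skipping every entry whose path starts
    with prefix. Returns the list of entry names that were left out.

    Hypothetical helper for illustration; not a Hive or Cassandra tool.
    """
    removed = []
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename.startswith(prefix):
                # Entry belongs to the conflicting package: drop it.
                removed.append(info.filename)
                continue
            # Copy everything else unchanged, keeping the entry metadata.
            dst.writestr(info, src.read(info.filename))
    return removed


if __name__ == "__main__":
    # File names are assumptions; adjust to the local Hive install.
    dropped = strip_package("hive-exec.jar", "hive-exec-nothrift.jar")
    print("removed %d entries" % len(dropped))
```

As the comment notes, this only makes sense when the job does not rely on Hive's Thrift-based dynamic column serialization; otherwise the removed classes will be missing at runtime.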

T Jake Luciani
added a comment - 08/Nov/11 13:34 Hi Nicolas,
Thanks for looking at this. As you mention, we need to figure out how to get the tests working locally. This probably requires the Hive test artifacts to be deployed in Maven.
We are currently using the cassandra-1.0 branch on GitHub, so that should have the latest changes. Cassandra 1.1 will upgrade to Thrift 0.7 ( CASSANDRA-3213 ), at which point we should work with Hive 0.8 without conflicts.