Using Hive to interact with HBase, Part 2

This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.

“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”

This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase 0.94.6.1.

Grab some data and register it in Hive

We’ll need some data to work with. For this purpose, grab some traffic stats from Wikipedia. Once we have some data, copy it up to HDFS.
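The exact commands aren’t critical; here’s a minimal sketch of the fetch-and-upload steps. The URL, file names, and HDFS path below are illustrative guesses, not gospel.

[bash]
# Illustrative only: exact URLs, file names, and HDFS paths will vary.
# Download a handful of hourly pagecount files, one per hour.
$ wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
$ wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-010000.gz
# ... and so on for the remaining hours ...

# Copy the gzipped files up to HDFS.
$ hadoop fs -mkdir /tmp/wikistats
$ hadoop fs -put pagecounts-*.gz /tmp/wikistats/
[/bash]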

As I understand it, each record is a count of page views of a specific page on Wikipedia. The first column is the language code, the second is the page name, the third is the number of page views, and the fourth is the size of the page in bytes. Each file contains an hour’s worth of aggregated data, and none of the pages we spot-checked were particularly popular that hour.

Now that we have data and understand its raw schema, create a Hive table over
it. To do that, we’ll use a DDL script that looks like this.
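The script itself isn’t reproduced here, but it would amount to something like the sketch below, assuming the files landed in /tmp/wikistats and we call the raw table pagecounts. Table and column names are my own choices.

[sql]
-- Sketch of the raw-data DDL: an external table over the space-delimited files.
-- Hive reads the gzipped text files transparently.
CREATE EXTERNAL TABLE pagecounts (
  projectcode STRING,
  pagename    STRING,
  pageviews   STRING,
  bytes       STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/tmp/wikistats';
[/sql]

Once the table is registered, a plain SELECT COUNT(*) FROM pagecounts; produces the record count mentioned next.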

Hive says the 10 files we downloaded contain just over 36.5 million records. Let’s confirm things are working as expected by getting a second opinion. This isn’t that much data, so confirm on the command line.

[bash]
$ zcat * | wc -l
36668549
[/bash]

The record counts match up — excellent.

Transform the schema for HBase

The next step is to transform the raw data into a schema that makes sense for HBase. In our case, we’ll create a schema that allows us to calculate aggregate summaries of pages according to their titles. To do this, we want all the data for a single page grouped together. We’ll manage that by creating a Hive view that represents our target HBase schema. Here’s the DDL.
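The DDL isn’t reproduced here, but based on the description that follows it looks roughly like this sketch. The view name pgc matches the 01_pgc.ddl script used below; the regex and the raw table name come from the earlier sketch and are assumptions on my part.

[sql]
-- Sketch of 01_pgc.ddl: a view that builds a compound HBase rowkey.
-- INPUT__FILE__NAME is Hive's virtual column holding the source file path;
-- the regex pulls the date portion out of names like pagecounts-20081001-010000.gz.
CREATE VIEW pgc AS
SELECT concat_ws('/',
         projectcode,
         pagename,
         regexp_extract(INPUT__FILE__NAME, 'pagecounts-([0-9]{8}-[0-9]{6})', 1)) AS rowkey,
       pageviews,
       bytes
FROM pagecounts;
[/sql]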

The SELECT statement uses Hive to build a compound rowkey for HBase. It concatenates the project code, page name, and date, joined by the '/' character. A handy trick: it uses a simple regex to extract the date from the source file names. Run it now.

[bash]
$ hive -f 01_pgc.ddl
OK
Time taken: 2.712 seconds
[/bash]

This is just a view, so the SELECT statement won’t be evaluated until we query it for data. Registering it with Hive doesn’t actually process any data. Again, make sure it works by querying Hive for a subset of the data.
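Something as simple as this spot check will do; the LIMIT is arbitrary.

[sql]
-- Illustrative spot check: pull a few rows through the view.
SELECT * FROM pgc LIMIT 10;
[/sql]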

Register the HBase table

Now that we have a dataset in Hive, it’s time to introduce HBase. The first step is to register our HBase table in Hive so that we can interact with it using Hive queries. That means another DDL statement. Here’s what it looks like.
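A sketch that matches the description below; the script file name is my own invention.

[sql]
-- Sketch of 02_pagecounts_hbase.ddl: register an HBase-backed table in Hive.
CREATE TABLE pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts');
[/sql]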

This statement will tell Hive to go create an HBase table named pagecounts with the single column family f. It registers that HBase table in the Hive metastore by the name pagecounts_hbase with 3 columns: rowkey, pageviews, and bytes. The SerDe property hbase.columns.mapping makes the association from Hive column to HBase column. It says the Hive column rowkey is mapped to the HBase table’s rowkey, the Hive column pageviews to the HBase column f:c1, and bytes to the HBase column f:c2. To keep the example simple, we have Hive treat all these columns as the STRING type.

In order to use the HBase library, we need to make the HBase jars and configuration available to the local Hive process (at least until HIVE-5518 is resolved). Do that by specifying a value for the HADOOP_CLASSPATH environment variable before executing the statement.
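For example, something along these lines; the script name is hypothetical, and `hbase classpath` simply prints HBase’s full classpath.

[bash]
# Illustrative: expose the HBase jars and config to the local Hive CLI for this invocation.
$ HADOOP_CLASSPATH=$(hbase classpath) hive -f 02_pagecounts_hbase.ddl
[/bash]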

Populate the HBase table

Now it’s time to write data to HBase. This is done using a regular Hive INSERT statement, sourcing data from the view with SELECT. There’s one more bit of administration we need to take care of though. This INSERT statement will run a mapreduce job that writes data to HBase. That means we need to tell Hive to ship the HBase jars and dependencies with the job.

Note that this is a separate step from the classpath modification we did previously. Normally you can do this with an export statement from the shell, the same way we specified the HADOOP_CLASSPATH. However there’s a bug in HDP-1.3 that requires me to use Hive’s SET statement in the script instead.

[sql]
$ cat 03_populate_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- Populate the HBase-backed table from the view.
-- The WHERE/LIMIT here are illustrative; they just keep the example small.
INSERT OVERWRITE TABLE pagecounts_hbase
SELECT * FROM pgc WHERE rowkey LIKE 'en/q%' LIMIT 10;
[/sql]

Note there’s a big ugly bug in Hive 0.12.0 which means this doesn’t work with that version. Never fear though, we have a patch in progress. Follow along at HIVE-5515.

If you choose to use a different method for setting Hive’s auxpath, be advised that it’s a tricky process — depending on how you specify it (HIVE_AUX_JARS_PATH, --auxpath), Hive will interpret the argument differently. HIVE-2349 seeks to remedy this unfortunate state of affairs.
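With the script in hand, run it the same way as before (HADOOP_CLASSPATH still set), then peek at the result from the HBase side with a scan in the hbase shell. A sketch, using the table name from the DDL above:

[bash]
# Illustrative: run the populate script, then inspect the table from the hbase shell.
$ HADOOP_CLASSPATH=$(hbase classpath) hive -f 03_populate_hbase.hql
$ hbase shell
hbase(main):001:0> scan 'pagecounts'
[/bash]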

The scan shows 10 rows, each with two columns, holding the data we loaded through Hive. That data is now accessible to your online world through HBase. For example, perhaps you receive an updated data file and have a corrected value for one of the stats. You can update the record in HBase with a regular put command.
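Something like this from the hbase shell; the rowkey and value here are hypothetical.

[bash]
# Illustrative: correct a single stat in place. Rowkey and value are made up.
hbase(main):002:0> put 'pagecounts', 'en/Some_Page/20081001-000000', 'f:c1', '42'
[/bash]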

Verify data from Hive

The HBase table remains available to your Hive world; Hive’s HBaseStorageHandler works both ways, after all.
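For instance, a simple read of the HBase-backed table will pick up the corrected value; a sketch:

[sql]
-- Illustrative: read the HBase-backed table, updated value and all, from Hive.
SELECT * FROM pagecounts_hbase;
[/sql]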

Note that this command expects HADOOP_CLASSPATH to still be set, and HIVE_AUX_JARS_PATH (or the SET statement from earlier) as well if your query is complex enough to launch a mapreduce job.

Continue using Hive for analysis

Since the HBase table is accessible from Hive, you can continue to use Hive for your ETL processing with mapreduce. Keep in mind that the auxpath considerations apply here too, so I’ve scripted out the query instead of just running it directly at the command line.
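A sketch of what such a script might contain; the aggregation itself is just an example I made up.

[sql]
-- Sketch of an analysis script over the HBase-backed table.
-- Repeat the SET hive.aux.jars.path line from 03_populate_hbase.hql here,
-- since this query launches a mapreduce job against HBase.
SELECT SUM(CAST(pageviews AS BIGINT)) AS total_views
FROM pagecounts_hbase;
[/sql]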

There you have it: a hands-on, end-to-end demonstration of interacting with HBase from Hive. You can learn more about the nitty-gritty details in Enis’s deck on the topic, or see the presentation he and Ashutosh gave at HBaseCon. If you’re inclined to make the intersection of these technologies work better (faster, stronger), I encourage you to pick up any of the JIRA issues mentioned in this post or the previous.

Happy hacking!


Comments

How do I query HBase tables through Hive using a custom SerDe? For example, let’s say I want to store tweets in HBase to avoid the small-files problem; how do I integrate the Hive JSON SerDe with HBase for querying tweet data?

I assume you have tweets stored as JSON blobs in a single column qualifier in HBase. You can’t provide a custom SerDe because the regular HBaseStorageHandler already has its own implementation. One option might be to extend that one with your own custom logic. Doing so would make that table definition specific to your query. Instead, I’d read the JSON data as a String and then “hydrate” the JSON using something else — maybe the get_json_object function or a custom UDF perhaps?
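For instance, a hedged sketch, assuming a hypothetical tweets_hbase table whose tweet_json column holds the raw JSON string:

[sql]
-- Illustrative: map the JSON blob as a STRING column, then pull fields out at query time.
-- Table and column names here are hypothetical.
SELECT get_json_object(tweet_json, '$.user.screen_name') AS screen_name,
       get_json_object(tweet_json, '$.text')             AS tweet_text
FROM tweets_hbase;
[/sql]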

2014-03-26 09:14:57,341 ERROR exec.DDLTask (DDLTask.java:execute(435)) – java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.setConf(HBaseStorageHandler.java:249)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.ql.metadata.HiveUtils.getStorageHandler(HiveUtils.java:290)
at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:285)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3549)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:252)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1437)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1215)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1043)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
… 24 more

That stack trace looks like the hive cli doesn’t have hbase-common in its classpath. Perhaps your value for HADOOP_CLASSPATH isn’t being respected. Do you mind taking this question to our Hive support forums where we can get you on the right track?

That was a good explanation! If you could help me understand how the SerDe works with joins, group by, etc., that would be really great. I have a Hive HBase table. When I run a group-by query, “select columna, count(*) from tablename group by columna;”, I get NullPointerExceptions.
Diagnostic Messages for this Task:
Error: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.toReadColumnIDString(ColumnProjectionUtils.java:149)
at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.appendReadColumns(ColumnProjectionUtils.java:89)
at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.appendReadColumns(ColumnProjectionUtils.java:102)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:480)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:234)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:573)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:172)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:414)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Do you have any idea why this is happening? I suspect this might be because of null values. Does the SerDe not handle this?
