To take a few steps back, Apache Hive is a Hadoop-family project that provides a “data warehouse” layer over Hadoop, through a metadata layer not unlike OBIEE’s RPD together with a SQL-like language called HiveQL. Coupled with ODBC and JDBC database drivers, BI tools like OBIEE use Hive to get access to big data sources, as the HiveQL language that Hive uses is very similar to the SQL used to access databases such as Oracle, SQL Server or MySQL. Delving a bit deeper into the Hive product architecture, as shown in the diagram below, Hive has a number of components including a “database engine”, a metadata store, APIs for client access, and a link through to Hadoop to actually load, process and retrieve data in HDFS (Hadoop Distributed File System).

So what’s HDFS then? HDFS is a fault-tolerant, distributed filesystem that’s a core part of Apache Hadoop, and stores the data that MapReduce jobs then process via job trackers, task trackers and all the other Hadoop paraphernalia. HDFS is accessed through a URI (URL) rather than through your Linux filesystem browser, but distributions such as Cloudera’s CDH3 and CDH4 ship with tools such as Hue, shown below, that provide a web-based interface into HDFS so that you can browse HDFS like a regular OS-level filesystem.

Notice how there’s a “user” folder like we’d get with Linux, and within that folder there’s a home folder for Hive? Generally, the data you manage with Hive is loaded into a directory structure under the “hive” user, either using data taken from another directory area in HDFS or from external files. Hive’s data is still in file form and accessed via MapReduce and Hadoop, but it’s in a directory area away from everything else. You can, however, tell Hive to create tables using data held elsewhere in HDFS, analogous to Oracle’s external tables feature, which then skips the data loading process and just maps table structures onto files held elsewhere in the Hadoop filesystem.
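To make this concrete, an external table definition looks something along these lines; the HDFS path and column definitions here are illustrative assumptions, not taken from the environment in the screenshots:

```sql
-- Illustrative external table over files already sitting in HDFS;
-- the LOCATION path and columns are assumptions for this sketch.
-- Hive just maps the table structure onto the files - no data is moved.
CREATE EXTERNAL TABLE ratings (
  user_id  STRING,
  movie_id STRING,
  rating   INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/oracle/movielens_src/';
```

Dropping an external table like this removes only the metadata; the underlying files stay where they are in HDFS.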

In most cases when we’re considering OBIEE accessing Hadoop data via Hive, the data would have been loaded into Hive-managed tables beforehand, though it’s possible that Hive table metadata could have been mapped onto other data in HDFS. Assuming you’ve got Hue installed in your own Hive implementation, along with Beeswax, a table browser for Hive that usually comes with Hue, you can see where each individual table within your Hive metastore is actually held; in the examples below, the dwh_customer Hive table is a managed table and has its data stored within the /user/hive/warehouse/ HDFS directory, whilst the ratings table has its data stored outside of Hive’s directory structure, but still within the HDFS managed filesystem.

So how does one create a Hive table, load data into it and get it ready for OBIEE access, and can ODI help with this, as we asked earlier? Before we get into ODI then, let’s take a look at how a Hive table is created and loaded, and then we’ll see how ODI does the same job.

With thanks to the ODI product development team’s David Allan, who put together some great Hive and ODI examples in this blog post, let’s start by creating a Hive table against the same movie ratings data in the right-hand screenshot below, but this time with the data actually loaded into Hive’s directory structure (i.e. a “managed” table). From the Hive command-shell, I type in the following commands to create the managed table, after SSH’ing into the VM running Hive:
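The exact DDL isn’t reproduced here, but a managed-table definition for this kind of movie ratings data would look something like the following; the column names and tab delimiter are assumptions based on the MovieLens-style dataset, not the exact commands from the screenshot:

```sql
-- Hypothetical DDL for the managed movie_ratings table;
-- column names and the tab delimiter are assumptions based
-- on the MovieLens-style ratings data used in this example.
CREATE TABLE movie_ratings (
  user_id     STRING,
  movie_id    STRING,
  rating      INT,
  rating_time STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```

Because there’s no EXTERNAL keyword or LOCATION clause, Hive will store this table’s data under its own warehouse directory, /user/hive/warehouse/movie_ratings/.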

So how do we get the data into this table, without any tools such as ODI? I can either load data straight from files on my local workstation, or I can upload them, for example using Hue, into the HDFS filesystem first.
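If you’d rather stay in the Hive command shell than switch to Hue, Hive’s built-in dfs command lets you push local files into HDFS directly; the source and target paths below are illustrative:

```sql
-- Upload a local ratings file into HDFS from within the Hive shell;
-- both paths here are illustrative, not from the actual environment.
dfs -mkdir /user/oracle/movielens;
dfs -put /home/oracle/movielens/u.data /user/oracle/movielens/;
dfs -ls /user/oracle/movielens;
```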

Now I can use the HiveQL LOAD DATA command to load from one of these HDFS files into Hive, and then count how many rows have been loaded, like this:
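The commands follow this general pattern; the HDFS source path is an assumption for this sketch:

```sql
-- Load the uploaded file from HDFS into the managed table.
-- Note that LOAD DATA INPATH *moves* (not copies) the file into
-- Hive's warehouse directory; the source path is an assumption.
LOAD DATA INPATH '/user/oracle/movielens/u.data'
INTO TABLE movie_ratings;

-- Check how many rows were loaded (this runs as a MapReduce job)
SELECT COUNT(*) FROM movie_ratings;
```

Note that even a simple COUNT(*) kicks off a MapReduce job here, which is why Hive queries have noticeably higher latency than the equivalent query against a relational database.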

So how does this process look when using ODI to do the Hive data loading? Let’s start with importing the Hive table metadata for the movie_ratings table I just created from the Hive command-line shell, by going over to the Topology navigator in ODI 11.1.1.6 – note that you’ll need to configure ODI to connect to your Hive, HDFS and Hadoop environment beforehand, using the Oracle Data Integrator for Hadoop documentation as a guide, with this adapter being an extra-cost license option on top of base ODI Enterprise Edition.

Hive has its own technology type within the Topology navigator, and you create the connection through to Hive using the HiveJDBC driver, first adding the connection to the Hive server and then specifying the particular Hive database / namespace, in this case selecting the “default” database for my Hive system.
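For reference, a HiveServer1-era connection (as used by ODI 11.1.1.6) typically uses driver and URL settings along these lines; the hostname here is an assumption for your own environment, and 10000 is just the default Hive server port:

```
Driver: org.apache.hadoop.hive.jdbc.HiveDriver
URL:    jdbc:hive://bigdatalite:10000/default
```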

Now I can reverse-engineer the Hive table structures into a Designer navigator model, just like any other relational table structure.

Within the ODI Topology navigator you can then create File technology connections, either to files held in HDFS or, more likely with ODI, to files on your workstation or server filesystem, like this:

and then add the file datastores to the Designer navigator model list, entering the correct delimiter information and reversing the column definitions into the datastore definition.

Now it’s a case of creating an interface to load the Hive table. In this instance, I map each of the source file “columns” into the Hive table’s columns, as the source file is delimited with an easily-usable structure.

Then, over in the Flows tab for the interface, I make sure the IKM File to Hive knowledge module is selected, keep the default values for the KM options (more on these in a moment), and then save the interface.

Now it’s a case of running the interface, and checking the results. Notice in the Operator navigator code panel, the LOAD DATA command that ODI is generating dynamically, similar to the one I wrote manually earlier on in the article.

Going back to my Hive command-line session, I can see that there’s now 100,000 rows in the movie_ratings Hive table.

Now in many cases the data going into a Hive table isn’t neatly arranged into columns within delimited files; it could be, for example, web log data that you’ll need to parse using regular expressions or other APIs or standard parsers. When that’s the case, you can use an option with the IKM File to Hive knowledge module to override the normal column-to-column mappings and instead use an expression, something Oracle have done in their demo environment for parsing these types of log files.

“ROW FORMAT SERDE” is a reference to Hive’s “Serializer – Deserializer”, or row-formatting feature, that gives you the ability to use regular expressions and other data manipulation techniques to, in this case, allocate incoming file data to the proper columns in the target Hive table.
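As an illustration (this is a sketch, not the exact expression from Oracle’s demo environment), an Apache access log could be mapped onto columns with the RegexSerDe like this; the jar path, HDFS location and regex are all assumptions:

```sql
-- Illustrative RegexSerDe table over Apache access logs; the jar path,
-- LOCATION and input.regex are assumptions for this sketch. On
-- CDH3/4-era Hive the RegexSerDe ships in the contrib jar, which may
-- need to be added to the session first.
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

CREATE EXTERNAL TABLE apache_log (
  host    STRING,
  ts      STRING,
  request STRING,
  status  STRING,
  size    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per target column: host, timestamp,
  -- request line, HTTP status, response size
  "input.regex" = "([^ ]*) [^ ]* [^ ]* \\[([^\\]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*).*"
)
LOCATION '/user/oracle/weblogs';
```

Each capture group in the regular expression is assigned, in order, to a column in the table definition, which is how unstructured log lines end up as queryable rows.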

So now we’re at the point where we can use ODI to populate the Hive tables that OBIEE in turn uses to access Hadoop data sources. But what if the data we want to load into Hive isn’t in the format or shape we need, and we need to join, filter or otherwise work with Hive data and tables before we can report on it? And what if we want to get data out of Hive and into regular tables, if a relational data store makes more sense than Hadoop for a particular reporting requirement? Check back tomorrow for the final part in this series, where we’ll answer these remaining questions.