A few months ago I posted a series of articles about connecting OBIEE 11.1.1.7, Exalytics and ODI to Apache Hadoop through Hive, an SQL-interface layer for Hadoop. Hadoop/Hive connectivity is a cool new feature in OBIEE 11g but suffers from the problem common to Hive – it’s actually quite slow, particularly when you’re used to split-second response times from your BI tool. The reason it’s slow is because Hive, in the background, generates MapReduce jobs which in-turn query Hadoop, batch processing-style, for each of your queries; each one of these MapReduce jobs requires a Java program to be written, and then submitted to the Hadoop job control framework and run within a Java JVM, which adds latency to your query. It’s not uncommon for a Hive query to take a minute or so to retrieve even a small set of “rows” from the Hadoop cluster, which isn’t really what it’s for – Hive and Hadoop are typically used with very large datasets spread over multiple servers – but you can’t help thinking there must be a better way to query Hadoop data.

And there is a better way – several organisations and companies have come up with improvements to Hive, the idea being to take the central idea of SQL-access-over-Hadoop, but remove the dependency on writing MapReduce jobs on the fly, and instead create separate, typically in-memory, server processes that provide similar functionality but with much improved response times. Probably the most well-known product like this is Cloudera’s “Impala”, an open-source but commercially-sponsored Hive replacement that’s available as part of Cloudera’s “Quickstart” demo VM, downloadable for free from the Cloudera website. The architecture image below is from the Cloudera Impala datasheet, which nicely explains the differences between Hive and Impala, and where it fits in as part of the overall Hadoop framework.

Now whilst Impala isn’t officially supported by Oracle as a data source, Hive is, so I thought it’d be interesting to see if we could swap-out Hive for Impala and connect more efficiently to a Hadoop datasource. I managed to get it working, with a couple of workarounds, so I thought I’d share it here – note that in a real-world installation, where the server is on Linux/Unix, or where your server isn’t on the same machine as your BI Administration client, it’s a bit more involved as you’ll need both server-side and client-side ODBC driver install and configuration.

Keeping it simple for now though, to get this working you’ll need:

OBIEE 11.1.1.7, for either Windows or Linux – in my case, I’ve used Windows. Oracle’s recommendation is you use Linux for Hadoop access but Windows seems to work OK.

The Cloudera Quickstart CDH4 VM – you’ll need to add some data to Impala, I’ll leave this to you – if this article makes sense to you, I’m sure you can add some sample data ;-)

The Cloudera ODBC Driver for Impala – the Windows 64-bit ones are here, and the rest of the drivers are on this page.

Once you’ve downloaded the Quickstart VM and got Impala up and running, and set up OBIEE 11.1.1.7 on a separate server, start by installing the ODBC drivers so you’re ready to configure them. In my case, my Impala tables were held in the standard “default” schema, and my Quickstart VM was running on the hostname cdh4.rittmandev.com, so my ODBC configuration settings looked like this:

“cloudera” is the default username on the Quickstart VM, with a password also of “cloudera”, so when I press the Test… button and put in the password, I see the success message:

So far so good. So now over to the BI Administration tool, where the process to import the Impala table metadata is the same as with Hive. First, select the new Impala DSN from the list of ODBC connections, then bring in the Impala tables that you want to include in the RPD – in this case, two tables called “product” and “product_sales” that I added myself to Impala.

Next, double-click on the new physical database entry that the import just created, and set the Database type from ODBC Basic to Apache Hadoop, like this:

When you’re prompted to change the connection pool settings as well – ignore this and press No, and leave them as they are.

Then, create your business model and presentation layer subject area as you would do normally – in my case, I add a primary key to the products table, join it in the physical layer to the product_sales fact table, and then create corresponding BMM and Presentation Layer models so that it’s then ready to report on.

Running a quick test on the datasource, displaying some sample rows from the Impala tables, indicates it’s working OK.

So over to the dashboard. I run a simple query that sums up sales by product, and … it doesn’t work.

If you take a look at the logical and physical SQL that the BI Server is generating for the query, it all looks OK …

But the problem is that Impala doesn’t like ORDER BY clauses without a corresponding LIMIT clause, something the product insists on presumably because of the potential size of Impala/Hadoop datasets and the need to contain the rows returned in-memory. If you’re running Impala queries directly within the Impala shell, you can issue the command set DEFAULT_ORDER_BY_LIMIT = x; to provide a default LIMIT value when one isn’t specified, but I can’t as yet see how to provide that setting over an ODBC connection, so the workaround I used was to uncheck the ORDERBY_SUPPORTED database setting in the physical database properties dialog, so that the BI Server leaves-off the ORDER BY clause entirely, and does the result ordering itself after Impala returns the (now un-ordered) result set.

Saving the RPD again now, and refreshing the results, brings back the query as expected – and in under a second or so, rather than the 30, 60 seconds etc that Hive would have taken.

So – there you go. As I said, Impala’s not officially supported which means it may work, but Oracle haven’t tested it properly and you won’t be able to raise SRs etc – but it’s an interesting alternative to Hive if you’re serious about connecting OBIEE 11g to your Hadoop datasources.

Comments

I have also tried,but there are many problems:when you try two dimension tables join,you will find it will fail,impala can’t support some sql like hive.2.large table join order,impala need big table first.Could you please describe the steps configured on linux.I failed to configure the odbc on linux but success on windows.

Hi Mark,
In addition to Impala, Facebook development has just released another alternative to Hive.
To quote from their blog entry: ‘Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions.’

It would be also interesting to explore how this tool could be leveraged from OBIEE.

Thanks Mark.
Concerning “… you can issue the command set DEFAULT_ORDER_BY_LIMIT = x; to provide a default LIMIT value when one isn’t specified… ”

Did you try putting “DEFAULT_ORDER_BY_LIMIT = x”
into the connection script tab of the connection pool? Details i.e. described here: “http://gerardnico.com/wiki/dat/obiee/connection_scripthttp://gerardnico.com/wiki/dat/obiee/connection_script”

I was able to configure with HIve odbc for a POC. However joining dimension and fact table was an issue. I could run a select statement from either dimension for fact table.
When I run a query that has both fields, I get an error below.

Are you trying to connect to Hive, or Impala? Also can you paste the full SQL (logical and HiveQL / Impala SQL) into the comment, from the nqquery.log file? Finally, are your tables in the default database, or another one?