This blog is intended to give budding MapReduce developers a start in developing Hadoop-based applications. It covers development tips and tricks for Hadoop MapReduce programming, tools that use MapReduce under the hood, and some practical applications of Hadoop using these tools. Most of the code samples provided here have been tested on a Hadoop environment, but do let me know if you find any that don't work.

Monday, October 24, 2011

Include values at execution time in Hive QL / dynamically substitute values in Hive

When you play around with data warehousing, it is very common to come across scenarios where you'd like to supply values at run time. In production environments, when we have to enable a hive job we usually write our series of hive operations in HQL in a file and trigger it using the hive -f option from a shell script or a workflow management system like Oozie. Let's limit this discussion to triggering the hive job from the shell, as it is the most basic approach.

Say I have a hive job called hive_job.hql; normally from a shell I'd trigger it as

hive -f hive_job.hql

If I need to set some hive config parameters (say I need to enable compression in hive), then I'd include the corresponding arguments along with the command.
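The original arguments appear to have been lost from the post; a typical invocation enabling output compression would look something like the following sketch (the Gzip codec here is just one illustrative choice):

```shell
# Enable compressed output and pick a codec via -hiveconf,
# then run the script as before (property names per Hive 0.x-era configs)
hive -hiveconf hive.exec.compress.output=true \
     -hiveconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
     -f hive_job.hql
```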

Now for the final one: say my hive QL does some operations on a date range, and this date range is varying/dynamic and is to be accepted from the CLI each time. We can achieve this with the following steps

1. Pass the variables as config parameters

2. Refer to these config parameters in your hive query

Let me make it more specific with a small example. I need to perform some operation on records in a table called 'test_table' which has a column/field named 'creation_date' (i.e. I need to filter records based on creation_date). The date range for the operation is supplied at run time. It is achieved as follows.

1. Pass the variables as config parameters

Here we need to pass two parameters, the start date and the end date, to get all the records within a specific date range.
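A sketch of what this looks like end to end (the variable names and date values here are made-up illustrations). From the shell, pass the two dates as -hiveconf parameters:

```shell
# Pass start and end dates to the script as hiveconf variables
hive -hiveconf start_date='2011-10-01' \
     -hiveconf end_date='2011-10-24' \
     -f hive_job.hql
```

Inside hive_job.hql, refer to these parameters using Hive's ${hiveconf:variable} substitution syntax:

```sql
-- Filter test_table on the date range supplied from the CLI
SELECT *
FROM test_table
WHERE creation_date >= '${hiveconf:start_date}'
  AND creation_date <= '${hiveconf:end_date}';
```

Hive substitutes the values textually before the query is compiled, so each run can use a different date range without editing the script.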

Alternatively, you can use BeeTamer for this. BeeTamer is a macro extension to the Apache Hive and Cloudera Impala engines that lets you store a result (or part of it) in a variable and use that variable later in your code.

select avg(a) from abc;
%capture MY_AVERAGE;
select * from abc2 where avg_var = #MY_AVERAGE#;

Here you save the average value from the first query into the macro variable MY_AVERAGE and then reuse it in the second query.