Tom's playground

Hive with Python

This short note describes how an HDFS file can be registered in a Hive context. Once stored in a Hive context, the data can be accessed from outside via ODBC and queried as if it were a SQL-compliant database. The idea is to create an abstraction on top of the HDFS datasets, so that one can access them much like an ordinary database.
We will use the Python language via Spark. This avoids the performance bottleneck of classic MapReduce jobs.
One starts Python via Spark with the command “pyspark”. If everything goes correctly, the shell starts with the Spark welcome banner and a Python prompt.
Two variables are important: sc, which is an anchor point for the methods that can be used within Spark, and a HiveContext, which can be used as a starting point for Hive methods.

We may now approach this dataset as a table; the table name is HiveTom. One possibility is to access the table via ODBC. Each Hadoop distribution (Cloudera, MapR, Hortonworks) provides an ODBC connector that can be downloaded. Once the connector is installed, we may retrieve the data in any ODBC-compliant tool, for example Excel.