Introduction to Apache Hive

18 December 2017

Apache Hive is a data warehousing infrastructure built on top of Hadoop that provides a table abstraction over data resident in HDFS, as described on its official page. It is used to provide data summarization, querying, and analysis of large data sets.

Hive is used for summarization, ad-hoc analysis, and data mining, as well as for tasks such as spam detection and ad optimization. One of the main advantages of Hive is that it is SQL-like in nature. Hive stores its database metadata in the Hive metastore database.

It is best used for batch jobs over large sets of immutable data (like web logs). Hive is schema-on-read rather than schema-on-write, meaning it does not check data during loading, but only when a query is issued, which makes it well suited for ad-hoc analysis.

Hive Architecture

Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.

The tables in Apache Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are composed of tables, which are made up of partitions. Data can be accessed via a simple SQL-like query language, and Hive supports overwriting or appending data.

Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
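A minimal sketch of this layout in HiveQL (the table, column, and partition names here are illustrative, not from the text):

```sql
-- Each partition value becomes a sub-directory of the table's HDFS directory;
-- each bucket becomes a file within that directory, with rows assigned by hash.
CREATE TABLE web_logs (
  view_time TIMESTAMP,
  user_id   BIGINT,
  url       STRING
)
PARTITIONED BY (log_date STRING)        -- e.g. .../web_logs/log_date=2017-12-18/
CLUSTERED BY (user_id) INTO 32 BUCKETS; -- rows hashed on user_id into 32 files
```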

Hive supports all the common primitive data types such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps, and arrays.
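For example (an illustrative schema, not taken from the text), primitive and complex types can be combined like this:

```sql
CREATE TABLE user_profile (
  userid     BIGINT,
  name       STRING,
  emails     ARRAY<STRING>,                              -- ordered list of values
  properties MAP<STRING, STRING>,                        -- key/value pairs
  address    STRUCT<street:STRING, city:STRING, zip:INT> -- named fields
);
```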

Figure: Hive Architecture

Data Units in Hive

Apache Hive data can be organized into:

Databases: Namespaces that separate tables and other data units to avoid naming conflicts.

Tables: Homogeneous units of data which have the same schema.

An example of a table could be a page_views table, where each row comprises the following columns (schema): {timestamp: INT, userid: BIGINT, page_url: STRING, referer_url: STRING, ip_address: STRING}
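Expressed as HiveQL DDL, that schema would look roughly like the following (`timestamp` is back-quoted because it collides with a type keyword):

```sql
CREATE TABLE page_views (
  `timestamp` INT,
  userid      BIGINT,
  page_url    STRING,
  referer_url STRING,
  ip_address  STRING
);
```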

Components of Hive

There are three different Hive components, which are as follows:

(1) Hadoop cluster:

The cluster of inexpensive commodity computers on which the large data set is stored and all processing is performed.

(2) Metadata store:

The location in which the description of the structure of the large data set is kept. The important point is that a standard database is used to store this metadata; it does not store the large data set itself.

(3) Warehouse directory:

This is a scratch-pad storage location where Hive is permitted to store/cache working files. It includes:

Newly created tables

Temporary results from user queries.

For processing/communication efficiency, it is typically located on a Hadoop Distributed File System (HDFS) located on the Hadoop Cluster.
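The warehouse location is governed by the `hive.metastore.warehouse.dir` property; as a sketch, it can be inspected from the Hive shell (the default path shown is typical but deployment-specific):

```sql
-- Print the current warehouse directory setting.
SET hive.metastore.warehouse.dir;
-- Commonly resolves to /user/hive/warehouse on HDFS.
```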

Pig vs Hive

Pig and Hive data models are different from each other. In the case of Pig, all data objects exist and are operated on within the script. Once the script is complete, all data objects are deleted unless you explicitly store them.

In the case of Hive, we are operating on the Apache Hadoop data store. Any query you make, any table you create, and any data you copy persists from query to query. You can think of Hive as providing a data workbench where you can examine, modify, and manipulate the data in Apache Hadoop.

Ways to connect to Hive Server

The following are the ways in which you can connect to the Hive server:

Thrift Client: Using Thrift, you can invoke Hive commands from various programming languages, e.g. C++, Java, PHP, Python, and Ruby.

JDBC Driver: Hive supports the Type 4 (pure Java) JDBC driver.

ODBC Driver: It supports the ODBC protocol.

Apache HCatalog

HCatalog is a table and data management layer for Hadoop built on top of the Hive metastore that incorporates Hive's Data Definition Language (DDL).
Data stored in HCatalog can be processed by Apache Pig, Apache MapReduce, and Apache Hive.

One does not need to worry about the location and format of data stored using HCatalog. It can display data from RCFile (Row Columnar File) format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata.

External and Internal/Managed Table in Hive

Managed Table: When Hive creates a managed (default) table, it follows schema-on-read and loads the complete file as is, without any parsing or modification, into the Hive data warehouse directory. The table's schema information is saved in the Hive metastore for later operational use. When an internal or managed table is dropped, both the data files in the warehouse directory and the schema in the metastore are dropped.

INTERNAL tables are used when the data is temporary and we want Hive to completely manage the lifecycle of the table and data.

External Table: When we create a Hive external table, Hive does not load the source file into its data warehouse directory; it only adds the schema information to the metastore.
When an external table is dropped, Hive does not remove the data from the source location but drops only the schema from the Hive metastore. We use external tables when data needs to remain in the underlying location even after a DROP TABLE.
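A minimal sketch of an external table (the HDFS path and column names are assumptions for illustration):

```sql
-- Only the schema goes into the metastore; the files stay at LOCATION.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_events';  -- pre-existing HDFS directory (hypothetical)

-- DROP TABLE raw_events;  -- removes metadata only; /data/raw_events is untouched
```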

When to use external and internal tables in Hive

Use EXTERNAL tables when:

The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.

Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.

Hive should not own the data and control settings, directories, etc.; you may have another program or process that will do those things.

You are not creating a table based on an existing table (AS SELECT).

Use INTERNAL tables when:

The data is temporary

You want Hive to completely manage the lifecycle of the table and data

Serialization and Deserialization (SerDe) in Hive

Hive uses a SerDe to read and write data from tables. Hive does not own the HDFS format in which the data is stored. Hadoop users can write files to HDFS with any tool they like and use Hive to correctly parse those files into a format that Hive can ingest.

The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. Deserializers are used to read data when a SELECT statement is issued.

The Serializer, by contrast, takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.

Serializers are used to write data into HDFS with statements such as INSERT INTO and INSERT OVERWRITE.
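A SerDe is attached to a table at creation time with the ROW FORMAT SERDE clause. As an illustration, this uses Hive's bundled OpenCSVSerde (available from Hive 0.14 onward; verify the class name against your Hive version):

```sql
-- The named SerDe class handles CSV parsing on read and formatting on write.
CREATE TABLE csv_table (
  id   INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
```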

Existing Hive SerDes

MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records like CSV, tab-separated, and Ctrl-A-separated records (quoting is not supported yet).

ThriftSerDe: This SerDe is used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first.

DynamicSerDe: This SerDe also reads/writes Thrift-serialized objects, but it understands Thrift DDL (Data Definition Language), so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data as delimited records).