Wednesday, 3 February 2016

What is Cloudera Impala? Impala vs Hive

Cloudera Impala is an open source analytic massively parallel processing (MPP)
SQL query engine, and one of the leading ones, that runs natively in Apache
Hadoop. The Cloudera Impala project was announced in October 2012 and, after a
successful beta test distribution, became generally available in May 2013. Its
intended users are analysts running ad hoc queries over the massive data sets
stored in Hadoop.

The main feature of Impala is that it lets us run low-latency ad hoc SQL
queries directly on the data stored in a cluster, whether that data sits in
unstructured flat files in the file system or in structured HBase tables,
without requiring data movement or transformation. Performance improves because
we do not need to migrate data sets to dedicated processing systems or convert
data formats before analysis.

Another important feature of Impala is that it works with the same data
formats, metadata, security and resource management frameworks used by
MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.

Impala also supports the common Hadoop file formats, including the newer
Apache Parquet format. Apache Parquet is a columnar storage format for the
Hadoop ecosystem. It was created to make a compressed, efficient columnar data
representation available to any project in the Hadoop ecosystem, regardless of
the choice of data processing framework, data model, or programming language.
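
To make the idea of a columnar representation concrete, here is a minimal Python sketch. It is only an illustration of the general principle, not Parquet's actual on-disk format; the sample rows, column names, and helper functions are all hypothetical.

```python
from itertools import groupby

# Hypothetical sample rows: (user_id, country, amount)
rows = [
    (1, "US", 10.0),
    (2, "US", 12.5),
    (3, "US", 7.0),
    (4, "IN", 3.0),
    (5, "IN", 8.5),
]


def to_columns(rows, names):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["user_id", "country", "amount"])

# A query that needs only one column reads only that column's values.
total_amount = sum(columns["amount"])

# Similar values sit next to each other in a column, so they compress well.
encoded_country = run_length_encode(columns["country"])

print(total_amount)     # 41.0
print(encoded_country)  # [('US', 3), ('IN', 2)]
```

The aggregate touches one list instead of every field of every row, and the repeated country values shrink to two pairs. Those are the two benefits usually attributed to columnar layouts.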

Impala queries are executed as follows:

Queries are submitted using the impala-shell command-line tool, or from a business application through an ODBC or JDBC driver.

Impala's distributed query engine builds the query plan and distributes it across the cluster.

A separate Impala daemon (impalad) runs on the data nodes and responds to requests from impala-shell. These daemons can return data quickly without having to go through a whole MapReduce job.

impalad is a process that runs on designated nodes in the cluster. It coordinates and runs queries.
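
The flow above can be sketched as a small, self-contained Python simulation. This is not Impala code and uses no Impala API. The node names, the sample data, and the functions `run_fragment` and `coordinate` are hypothetical stand-ins for the daemons and the coordinating step.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data blocks, keyed by the node that stores them.
NODE_DATA = {
    "node1": [5, 12, 7],
    "node2": [20, 3],
    "node3": [9, 15, 1],
}


def run_fragment(node, threshold):
    """Work done by one daemon: scan local data, return a partial count."""
    local_rows = NODE_DATA[node]
    return sum(1 for value in local_rows if value > threshold)


def coordinate(threshold):
    """Coordinator: run the same fragment on every node, then merge."""
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda node: run_fragment(node, threshold), NODE_DATA)
        )
    return sum(partials)


# Roughly: SELECT COUNT(*) FROM t WHERE value > 8
print(coordinate(8))  # 4
```

Each node scans only the data it holds, and only the small partial counts travel back to the coordinator. That is the reason no data movement is needed before the query runs.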

Comparison With Hive

Compared with Hive and MapReduce, both of which are optimized for long-running,
batch-oriented tasks such as ETL (Read more: What is ETL), Impala is better
suited to interactive analytical SQL queries that touch small portions of a
very large data set. What makes it different from Hive is that Impala does not
rely on MapReduce. It avoids the start-up overhead of MapReduce jobs and
instead uses its own set of execution daemons, which need to be installed
alongside your data nodes.

Hive, in the Hadoop ecosystem, is intended as a data warehouse system that
supports easy data aggregation and ad hoc queries over large data sets stored
in HDFS, whereas Cloudera Impala is a query engine for data stored in HDFS and
HBase.

Because Impala and Hive share the same metastore database, their tables are often used interchangeably. This cross-compatibility applies to Hive tables that use Impala-compatible types for all columns.

Partitions in Impala

Just as partitioned tables (Read more on: Partitions in Oracle) are used to
speed up queries in a large-scale data warehouse, partitioned tables are used
in Impala. Data is partitioned based on the values in one or more columns.
Instead of looking up one row at a time from widely scattered items, the rows
with identical partition keys are physically grouped together, so a query that
filters on the partition key reads only the matching groups. Impala also takes
advantage of the partitioning present in Hive tables.
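
A minimal Python sketch of why grouping rows by partition key speeds up a filtered query. The sales rows and the helper names `partition_by_year` and `total_for_year` are hypothetical. In a real cluster each partition is a separate directory of files, not an in-memory list.

```python
from collections import defaultdict

# Hypothetical rows: (year, product, amount)
sales = [
    (2014, "a", 10),
    (2015, "b", 20),
    (2015, "a", 5),
    (2016, "c", 7),
    (2014, "b", 3),
]


def partition_by_year(rows):
    """Group rows physically by their partition key (the year column)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return partitions


def total_for_year(partitions, year):
    """Scan only the matching partition; return (total, rows scanned)."""
    scanned = partitions.get(year, [])
    return sum(amount for _, _, amount in scanned), len(scanned)


partitions = partition_by_year(sales)
total, rows_scanned = total_for_year(partitions, 2015)
print(total, rows_scanned)  # 25 2
```

Only two of the five rows are read to answer the query for 2015. The other partitions are skipped without being opened.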

Cloudera Impala makes use of the following two technologies:

Columnar Storage: Because data is stored in a columnar fashion, it achieves a high compression ratio and can be scanned efficiently.

Tree Architecture: The architecture forms a massively parallel, distributed, multi-level serving tree. A query is pushed down the tree, and the results are then aggregated upward from the leaves.
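
The push-down-and-aggregate idea of a serving tree can be illustrated with a short Python sketch. It is a conceptual toy under stated assumptions, not Impala's implementation. The tree shape, the sample values, and the functions `leaf_partial`, `merge`, and `evaluate` are hypothetical.

```python
# Hypothetical two-level tree: inner nodes have "children", leaves hold "values".
tree = {
    "children": [
        {"children": [{"values": [4, 6]}, {"values": [10]}]},
        {"children": [{"values": [2, 8, 12]}]},
    ]
}


def leaf_partial(values):
    """Leaf: compute a partial (sum, count) over its local slice of data."""
    return (sum(values), len(values))


def merge(partials):
    """Inner node: combine the (sum, count) pairs returned by its children."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))


def evaluate(node):
    """Push the query down to the leaves, then aggregate results upward."""
    if "values" in node:
        return leaf_partial(node["values"])
    return merge([evaluate(child) for child in node["children"]])


# Roughly: SELECT AVG(value) FROM t
total, count = evaluate(tree)
print(total / count)  # 7.0
```

Carrying (sum, count) pairs instead of averages keeps the merge step correct at every level, because averages of unequal groups cannot simply be averaged again.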

Unlike Hive, Impala does not provide fault tolerance. If a node fails in the
middle of processing, the whole query has to be re-run. However, Impala queries
typically run fast enough that, even when a node fails and the query starts
over, the total runtime still compensates for the lost time.

Time is also saved because data does not have to be moved around and Impala
does not write intermediate results to disk.