SQL Differences Between Impala and Hive

Impala's SQL syntax follows the SQL-92 standard, and includes many industry extensions in areas such as built-in functions. See Porting
SQL from Other Database Systems to Impala for a general discussion of adapting SQL code from a variety of database systems to Impala.

Because Impala and Hive share the same metastore database and their tables are often used interchangeably, the following section covers differences between Impala and Hive in detail.

Multiple DISTINCT clauses per query, although Impala includes some workarounds for this limitation.
Note:

By default, Impala only allows a single COUNT(DISTINCT columns) expression in each query.

If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by specifying NDV(column); a query can contain multiple instances of NDV(column). To make Impala automatically rewrite
COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query option.

To produce the same result as multiple COUNT(DISTINCT) expressions, you can use the following technique for queries involving a single table:

INSERT OVERWRITE DIRECTORY; use INSERT OVERWRITE table_name or
CREATE TABLE AS SELECT to materialize query results into the HDFS directory associated with an Impala table.

Impala respects the serialization.null.format table property only for TEXT tables and ignores the property for Parquet and other formats. Hive respects the
serialization.null.format property for Parquet and other formats and converts matching values to NULL during the scan. See for using the table property in Impala.

Semantic Differences Between Impala and HiveQL Features

This section covers instances where Impala and Hive have similar functionality, sometimes including the same syntax, but there are differences in the runtime semantics of those
features.

The Hive component included in CDH 5.1 and higher now includes Sentry-enabled GRANT, REVOKE, and CREATE/DROP ROLE statements. Earlier Hive releases had a privilege system with GRANT and REVOKE statements that were
primarily intended to prevent accidental deletion of data, rather than a security mechanism to protect against malicious users.

Impala can make use of privileges set up through Hive GRANT and REVOKE statements. Impala has its own GRANT and REVOKE statements in Impala 2.0 and higher. See Enabling Sentry
Authorization for Impala for the details of authorization in Impala, including how to switch from the original policy file-based privilege model to the Sentry service using privileges stored in
the metastore database.

SQL statements and clauses:

The semantics of Impala SQL statements varies from HiveQL in some cases where they use similar SQL statement and clause names:

Impala uses different syntax and names for query hints, [SHUFFLE] and [NOSHUFFLE] rather than MapJoin or StreamJoin. See Joins in Impala SELECT Statements for the Impala details.

Impala does not expose MapReduce specific features of SORT BY, DISTRIBUTE BY, or CLUSTER
BY.

Impala does not require queries to include a FROM clause.

Data types:

Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected casting behavior.

Impala does not implicitly cast between string and numeric or Boolean types. Always use CAST() for these conversions.

Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert
a SMALLINT to a BIGINT or FLOAT, but to convert from DOUBLE to FLOAT or INT to TINYINT requires a call to CAST() in the query.

Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal formats for the TIMESTAMP data type and the
from_unixtime() format string; see TIMESTAMP Data Type for details.

Impala does not store or interpret timestamps using the local timezone, to avoid undesired results from unexpected time zone issues. Timestamps are stored and interpreted relative to
UTC. This difference can produce different results for some calls to similarly named date/time functions between Impala and Hive. See Impala Date and Time Functions for details about the Impala functions. See TIMESTAMP
Data Type for a discussion of how Impala handles time zones, and configuration options you can use to make Impala match the Hive behavior more closely when dealing with Parquet-encoded
TIMESTAMP data or when converting between the local time zone and UTC.

The Impala TIMESTAMP type can represent dates ranging from 1400-01-01 to 9999-12-31. This is different from the Hive date range, which is 0000-01-01 to
9999-12-31.

Impala does not return column overflows as NULL, so that customers can distinguish between NULL data and overflow conditions
similar to how they do so with traditional database systems. Impala returns the largest or smallest value in the range for the type. For example, valid values for a tinyint range from -128 to 127. In Impala, a tinyint with a value of -200 returns -128 rather than NULL. A
tinyint with a value of 200 returns 127.

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.