IMPALA-7225 - Fixed an issue where the REFRESH...PARTITION statement caused statistics for the
refreshed partition to be automatically reset to -1 (unknown). With the fix, statistics change only when an explicit COMPUTE STATS statement is issued for an object.
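A minimal sketch of the behavior, using a hypothetical sales table partitioned by year:

REFRESH sales PARTITION (year=2018);
SHOW PARTITIONS sales;   -- with the fix, #Rows for year=2018 is no longer reset to -1
COMPUTE STATS sales;     -- statistics change only through an explicit COMPUTE STATS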

IMPALA-7272 - Fixed a crash caused by a memory management problem when query execution
requires finding strings inside a range defined by less-than and greater-than comparisons.

IMPALA-7360 - Fixed an issue where Impala could incorrectly skip data if a record separator in
a sequence-based file (Avro, RCFile, or SequenceFile) straddled an HDFS block boundary.

IMPALA-7537 - Fixed a security issue where REVOKE ALL ON SERVER did not have a permanent effect
if the ALL privilege was granted using the WITH GRANT OPTION clause. Running INVALIDATE METADATA no longer causes the revoked privilege to reappear.
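The affected sequence, sketched with a hypothetical role name:

GRANT ALL ON SERVER TO ROLE admin_role WITH GRANT OPTION;
REVOKE ALL ON SERVER FROM ROLE admin_role;
INVALIDATE METADATA;   -- with the fix, the revoked privilege no longer reappears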

IMPALA-7585 - Fixed an issue in KRPC that could cause slow or hung queries on non-secured
clusters.

Issues Fixed in Impala for CDH 5.15.1

For the full list of fixed issues for all CDH components in CDH 5.15.1, see Issues Fixed in CDH 5.15.x. The
following list represents the subset of fixed Impala JIRAs from the CDH fixed issues.

IMPALA-4433 - Always generate test data using the same time zone setting

IMPALA-4449 - Revisit table locking pattern in the catalog. Fixes an issue where multiple long-running operations on the same catalog object (for example, a table) could block other catalog operations from making progress.

Issues Fixed in Impala for CDH 5.8.0

The following list contains the most critical fixed issues (priority='Blocker') from the JIRA system. For the full list of fixed issues in
CDH 5.8.0 / Impala 2.6.0, see this report in the Impala JIRA tracker.

RuntimeState::error_log_ crashes

A crash could occur, with stack trace pointing to impala::RuntimeState::ErrorLog.

Stress test failure: sorter.cc:745] Check failed: i == 0 (1 vs. 0)

String data coming out of agg can be corrupted by blocking operators

If a query plan contains an aggregation node producing string values anywhere within a subplan (that is, if in the SQL statement the aggregate function appears within an inline view over
a collection column), the results of the aggregation may be incorrect.

Crash on inserting into table with binary and parquet

RowBatch::MaxTupleBufferSize() calculation incorrect, may lead to memory corruption

A crash could occur while querying tables with very large rows, for example wide tables with many columns or very large string values. This problem was identified in Impala 2.3, but had
low reproducibility in subsequent releases. The fix ensures the memory allocation size is correct.

Issues Fixed in Impala for CDH 5.7.0

The following list contains the most critical fixed issues (priority='Blocker') from the JIRA system. For the full list of fixed issues in CDH 5.7.0 / Impala
2.5.0, see this report in the Impala JIRA tracker.

Stress test hit assert in LLVM: external function could not be resolved

PAGG hits mem_limit when switching to I/O buffers

A join query could fail with an out-of-memory error despite the apparent presence of sufficient memory. The cause was the internal ordering of operations that could cause a later phase
of the query to allocate memory required by an earlier phase of the query. The workaround was to either increase or decrease the MEM_LIMIT query option, because the
issue would only occur for a specific combination of memory limit and data volume.

Referring to the same column twice in a view definition could cause the view to omit rows where that column contained a NULL value. This could cause
incorrect results due to an inaccurate COUNT(*) value or rows missing from the result set.

Fix migration/assignment of On-clause predicates inside inline views

Some combinations of ON clauses in join queries could result in comparisons being applied at the wrong stage of query processing, leading to incorrect
results. Wrong predicate assignment could happen under the following conditions (illustrated by the sketch after this list):

The query includes an inline view that contains an outer join.

That inline view is joined with another table in the enclosing query block.

That join has an ON clause containing a predicate that only references columns originating from the outer-joined tables inside the inline view.
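A sketch of the problematic shape, with hypothetical tables t, a, and b; the predicate v.b_col = 10 in the enclosing ON clause references only the outer-joined table b inside the inline view:

SELECT t.id
FROM t
JOIN (SELECT a.id, b.col AS b_col
      FROM a LEFT OUTER JOIN b ON a.id = b.id) v
  ON t.id = v.id AND v.b_col = 10;   -- this predicate could be applied at the wrong stage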

Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate

Planner doesn't set the has_local_target field correctly

MemPool allocation growth behavior

Previously, the MemPool always doubled the size of the last allocation. This could lead to bad behavior if the MemPool transferred ownership of all its data except the last chunk:
the next allocated chunk would then double the size of that large chunk, which could be undesirable.

Drop partition operations don't follow the catalog's locking protocol

The CatalogOpExecutor.alterTableDropPartition() function violated the locking protocol used in the catalog, which requires catalogLock_ to be acquired before any table-level lock. This could cause deadlocks when ALTER TABLE DROP PARTITION was executed concurrently with other
DDL operations.

DataStreamSender::Channel::CloseInternal() does not close the channel on an error.

Some queries did not close an internal communication channel on an error, causing the node on the other side of the channel to wait indefinitely and the query to hang. For
example, this issue could happen on a Kerberos-enabled system if the credential cache was outdated. Although the affected query hung, the impalad daemons
continued processing other queries.

Issues Fixed in Impala for CDH 5.5.6

Note: Impala 2.3.x is available as part of CDH 5.5.x and is not available for CDH 4. Cloudera does not intend to release future versions of Impala
for CDH 4, apart from patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends that all customers migrate to
a recent CDH 5 release.

Issues Fixed in Impala for CDH 5.5.4

Note: Impala 2.3.x is available as part of CDH 5.5.x and is not available for CDH 4. Cloudera does not intend to release future versions of Impala
for CDH 4, apart from patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends that all customers migrate to
a recent CDH 5 release.

Referring to the same column twice in a view definition could cause the view to omit rows where that column contained a NULL value. This could cause
incorrect results due to an inaccurate COUNT(*) value or rows missing from the result set.

Fix GRANTs on URIs with uppercase letters

A GRANT statement for a URI could be ineffective if the URI contained uppercase letters, for example in an uppercase directory name. Subsequent statements,
such as CREATE EXTERNAL TABLE with a LOCATION clause, could fail with an authorization exception.
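An illustrative sequence with a hypothetical mixed-case HDFS path:

GRANT ALL ON URI 'hdfs://namenode:8020/Data/Staging' TO ROLE etl_role;
-- before the fix, this could fail with an authorization exception
-- because the granted URI contains uppercase letters:
CREATE EXTERNAL TABLE staged (id INT)
  LOCATION 'hdfs://namenode:8020/Data/Staging/staged';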

Avoid sending large partition stats objects over thrift

The catalogd daemon could encounter a serious error when loading the incremental statistics metadata for tables with large numbers of partitions and
columns. The problem occurred when the internal representation of metadata for the table exceeded 2 GB, for example in a table with 20K partitions and 77 columns. The fix causes a COMPUTE INCREMENTAL STATS operation to fail if it would produce metadata that exceeded the maximum size.

Throw AnalysisError if table properties are too large (for the Hive metastore)

CREATE TABLE or ALTER TABLE statements could fail with metastore database errors due to length limits on the SERDEPROPERTIES and TBLPROPERTIES clauses. (The limit on key size is 256, while the limit on value size is 4000.) The fix makes Impala handle these
error conditions more cleanly, by detecting too-long values rather than passing them to the metastore database.
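A sketch of a statement that now fails at analysis; the key and value are hypothetical placeholders:

-- rejected with an analysis error if the value exceeds 4,000 characters
-- (or the key exceeds 256), instead of failing inside the metastore database:
ALTER TABLE t SET TBLPROPERTIES ('my_key' = '...a value longer than 4,000 characters...');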

Make MAX_PAGE_HEADER_SIZE configurable

Impala could fail to access Parquet data files with page headers larger than 8 MB, which could occur, for example, if the minimum or maximum values for a column were long strings. The
fix adds a configuration setting --max_page_header_size, which you can use to increase the Impala size limit to a value higher than 8 MB.

reduce scanner memory usage

Queries on Parquet tables could consume excessive memory (potentially multiple gigabytes) due to producing large intermediate data values while evaluating groups of rows. The workaround
was to reduce the size of the NUM_SCANNER_THREADS query option, the BATCH_SIZE query option, or both.
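The workaround, expressed as impala-shell query options (the values shown are illustrative; 0 restores the built-in default):

SET NUM_SCANNER_THREADS=2;
SET BATCH_SIZE=512;
-- run the affected Parquet query, then restore the defaults:
SET NUM_SCANNER_THREADS=0;
SET BATCH_SIZE=0;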

Handle error when star based select item and aggregate are incorrectly used

Refactor MemPool usage in HBase scan node

Queries involving HBase tables used substantially more memory than in earlier Impala versions. The problem occurred starting in Impala 2.2.8, as a result of the changes for IMPALA-2284.
The fix for this issue involves removing a separate memory work area for HBase queries and reusing other memory that was already allocated.

Fix migration/assignment of On-clause predicates inside inline views

Some combinations of ON clauses in join queries could result in comparisons being applied at the wrong stage of query processing, leading to incorrect
results. Wrong predicate assignment could happen under the following conditions:

The query includes an inline view that contains an outer join.

That inline view is joined with another table in the enclosing query block.

That join has an ON clause containing a predicate that only references columns originating from the outer-joined tables inside the inline view.

DCHECK in parquet scanner after block read error

PAGG hits mem_limit when switching to I/O buffers

A join query could fail with an out-of-memory error despite the apparent presence of sufficient memory. The cause was the internal ordering of operations that could cause a later phase
of the query to allocate memory required by an earlier phase of the query. The workaround was to either increase or decrease the MEM_LIMIT query option, because the
issue would only occur for a specific combination of memory limit and data volume.

Issues Fixed in Impala for CDH 5.5.1

The version of Impala that is included with CDH 5.5.1 / Impala 2.3.1 is identical to CDH 5.5.0 / Impala 2.3.0. There are no new bug fixes, new features, or incompatible changes.

Issues Fixed in Impala for CDH 5.5.0

This section lists the most serious or frequently encountered customer issues fixed in CDH 5.5.0 / Impala 2.3.0. Any issues already fixed in CDH 5.4 maintenance releases (up through CDH
5.4.8) are also included. Those issues are listed under the respective CDH 5.4 sections and are not repeated here. For the full list of fixed Impala issues, see Issues Fixed in CDH 5.5.0.

Fixes for Serious Errors

A number of issues were resolved that could result in serious errors when encountered. The most critical or commonly encountered are listed here.

Query return empty result if it contains NullLiteral in inlineview

HBase scan node uses 2-4x memory after upgrade to Impala 2.2.8

Queries involving HBase tables used substantially more memory than in earlier Impala versions. The problem occurred starting in Impala 2.2.8, as a result of the changes for IMPALA-2284.
The fix for this issue involves removing a separate memory work area for HBase queries and reusing other memory that was already allocated.

Fix migration/assignment of On-clause predicates inside inline views

Some combinations of ON clauses in join queries could result in comparisons being applied at the wrong stage of query processing, leading to incorrect
results. Wrong predicate assignment could happen under the following conditions:

The query includes an inline view that contains an outer join.

That inline view is joined with another table in the enclosing query block.

That join has an ON clause containing a predicate that only references columns originating from the outer-joined tables inside the inline view.

Fix wrong predicate assignment in outer joins

Avoid sending large partition stats objects over thrift

The catalogd daemon could encounter a serious error when loading the incremental statistics metadata for tables with large numbers of partitions and
columns. The problem occurred when the internal representation of metadata for the table exceeded 2 GB, for example in a table with 20K partitions and 77 columns. The fix causes a COMPUTE INCREMENTAL STATS operation to fail if it would produce metadata that exceeded the maximum size.

Analysis exception when a binary operator contains an IN operator with values

Make MAX_PAGE_HEADER_SIZE configurable

Impala could fail to access Parquet data files with page headers larger than 8 MB, which could occur, for example, if the minimum or maximum values for a column were long strings. The
fix adds a configuration setting --max_page_header_size, which you can use to increase the Impala size limit to a value higher than 8 MB.

Some queries that activated the spill-to-disk mechanism could produce a serious error if there was insufficient memory to set up internal work areas. Now those queries produce normal
out-of-memory errors instead.

Impala is unable to read hive tables created with the "STORED AS AVRO" clause

make Parquet scanner fail query if the file size metadata is stale

If a Parquet file in HDFS was overwritten by a smaller file, Impala could encounter a serious error. Issuing an INVALIDATE METADATA statement before a
subsequent query would avoid the error. The fix allows Impala to handle such inconsistencies in Parquet file length cleanly regardless of whether the table metadata is up-to-date.

Warn if table stats are potentially corrupt.

Impala now warns if it detects a discrepancy in table statistics: a table considered to have zero rows even though data files are present. In this case, Impala also skips query
optimizations that are normally applied to very small tables.

Set the output smap of an EmptySetNode produced from an empty inline view.

Set an InsertStmt's result exprs from the source statement's result exprs.

A CREATE TABLE AS SELECT or INSERT ... SELECT statement could produce different results than a SELECT statement, for queries including a FULL JOIN clause and including literal values in the select list.
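A sketch of the affected pattern, with hypothetical tables t1 and t2; the literal 'flag' in the select list combined with the FULL JOIN triggered the discrepancy:

CREATE TABLE result AS
SELECT t1.id, t2.id AS id2, 'flag' AS marker
FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;
-- before the fix, the rows in result could differ from running the SELECT by itself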

Row count not set for empty partition when spec is used with compute incremental stats

A COMPUTE INCREMENTAL STATS statement could leave the row count for an empty partition as -1, rather than initializing the row count to 0. The missing
statistic value could result in reduced query performance.
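A minimal illustration with a hypothetical partitioned table:

COMPUTE INCREMENTAL STATS sales PARTITION (year=2015);
SHOW PARTITIONS sales;   -- an empty year=2015 partition now shows #Rows 0, not -1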

When the Impala COMPUTE STATS statement was run on a partitioned Parquet table that was created in Hive, the table subsequently became inaccessible in
Hive, although it was still accessible to Impala. Regaining access in Hive required creating a new table as a workaround.

Avoiding a DCHECK of NULL hash table in spilled right joins

A query could encounter a serious error if it contained a RIGHT OUTER, RIGHT ANTI, or FULL
OUTER join clause and approached the memory limit on a host so that the "spill to disk" mechanism was activated.

Declaring a partition key column as a TINYINT caused problems with the COMPUTE STATS statement. The associated partitions
would always have zero estimated rows, leading to potential inefficient query plans.

Where clause does not propagate to joins inside nested views

A query that referred to a view that in turn referred to another view containing a join could return incorrect results. WHERE clauses from the outermost
query were not always applied, causing the result set to include additional rows that should have been filtered out.
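A sketch of the nested-view shape, with hypothetical tables and views:

CREATE VIEW v1 AS SELECT a.id, b.val FROM a JOIN b ON a.id = b.id;
CREATE VIEW v2 AS SELECT id, val FROM v1;
-- before the fix, this predicate was not always applied,
-- so extra rows could appear in the result:
SELECT * FROM v2 WHERE val > 10;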

Make UTC to local TimestampValue conversion faster.

Workaround IMPALA-1619 in BufferedBlockMgr::ConsumeMemory()

A join query could encounter a serious error if the query approached the memory limit on a host so that the "spill to disk" mechanism was activated, and data
volume in the join was large enough that an internal memory buffer exceeded 1 GB in size on a particular host. (Exceeding this limit would only happen for huge join queries, because Impala could
split this intermediate data into 16 parts during the join query, and the buffer only contains compact bookkeeping data rather than the actual join column data.)

Enable using Isilon as the underlying filesystem.

The certification of CDH and Impala with the Isilon filesystem involves a number of fixes to performance and flexibility for dealing with I/O using remote reads. See Using Impala with Isilon Storage for details on using Impala and Isilon together.

Fix wrong warning when insert overwrite to empty table

Expand parsing of decimals to include scientific notation

DECIMAL literals can now include e scientific notation. For example, now CAST(1e3 AS
DECIMAL(5,3)) is a valid expression. Formerly it returned NULL. Some scientific expressions might have worked before in DECIMAL
context, but only when the scale was 0.

Note: Impala 2.2.0 is available as part of CDH 5.4.0 and is not available for CDH 4. Cloudera does not intend to release future versions of Impala
for CDH 4, apart from patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends that all customers migrate to
a recent CDH 5 release.

Altering a column's type causes column stats to stop sticking for that column

When the type of a column was changed in either Hive or Impala through ALTER TABLE CHANGE COLUMN, the metastore database did not correctly propagate that
change to the table that contains the column statistics. The statistics (particularly the NDV) for that column were permanently reset and could not be changed by
Impala's COMPUTE STATS command. The underlying cause is a Hive bug (HIVE-9866).

Impala may leak or use too many file descriptors

Spurious stale block locality messages

Impala could issue messages stating the block locality metadata was stale, when the metadata was actually fine. The internal "remote bytes read" counter was not
being reset properly. This issue did not cause an actual slowdown in query execution, but the spurious error could result in unnecessary debugging work and unnecessary use of the INVALIDATE METADATA statement.

DROP TABLE fails after COMPUTE STATS and ALTER TABLE RENAME to a different database.

When a table was moved from one database to another, the column statistics were not pointed to the new database. This could result in lower performance for queries due to unavailable
statistics, and also an inability to drop the table.

IMPALA-1556 causes memory leak with secure connections

impalad daemons could experience a memory leak on clusters using Kerberos authentication, with memory usage growing as more data was transferred
across the secure channel, either to the client program or between Impala nodes. The same issue affected LDAP-secured clusters to a lesser degree, because the LDAP security only covers data
transferred back to client programs.

unix_timestamp() does not return correct time

Impala incorrectly handles text data missing a newline on the last line

Some queries did not recognize the final line of a text data file if the line did not end with a newline character. This could lead to inconsistent results, such as a different number of
rows for SELECT COUNT(*) as opposed to SELECT *.

Impala's ACLs check do not consider all group ACLs, only checked first one.

If the HDFS user ID associated with the impalad process had read or write access in HDFS based on group membership, Impala statements could still
fail with HDFS permission errors if that group was not the first listed group for that user ID.

Query return empty result if it contains NullLiteral in inlineview

Fix edge cases for decimal/integer cast

A value of type DECIMAL(3,0) could be incorrectly cast to TINYINT, producing a wrong out-of-range result.
After the fix, the smallest type allowed for this cast is INT, and attempting to use DECIMAL(3,0) in a TINYINT context produces a "loss of precision" error.
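An illustration of the DECIMAL(3,0)-in-TINYINT context described above; the table name is hypothetical:

CREATE TABLE tiny_t (c TINYINT);
-- TINYINT holds only -128..127, so 999 cannot be represented; after the fix
-- this statement produces a "loss of precision" error instead of a wrong value:
INSERT INTO tiny_t VALUES (CAST(999 AS DECIMAL(3,0)));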

Constant filter expressions are not checked for errors and state cleanup on exception / DCHECK on destroying an ExprContext

An invalid constant expression in a WHERE clause (for example, an invalid regular expression pattern) could fail to clean up internal state after raising a
query error. Therefore, certain combinations of invalid expressions in a query could cause a crash, or cause a query to continue when it should halt with an error.

Properly unescape string value for HBase filters

Avoiding a DCHECK of NULL hash table in spilled right joins

A query could encounter a serious error if it contained a RIGHT OUTER, RIGHT ANTI, or FULL
OUTER join clause and approached the memory limit on a host so that the "spill to disk" mechanism was activated.

Avoid calling ProcessBatch with out_batch->AtCapacity in right joins

Issues Fixed in the 2.1.4 Release / CDH 5.3.4

This section lists the most significant Impala issues fixed in Impala 2.1.4 for CDH 5.3.4. Because CDH 5.3.5 does not include any code changes for Impala, Impala 2.1.4
is included with both CDH 5.3.4 and 5.3.5.

Crash: impala::TupleIsNullPredicate::Prepare

Expand parsing of decimals to include scientific notation

DECIMAL literals can now include e scientific notation. For example, now CAST(1e3 AS
DECIMAL(5,3)) is a valid expression. Formerly it returned NULL. Some scientific expressions might have worked before in DECIMAL
context, but only when the scale was 0.

FIRST_VALUE rewrite fn type might not match slot type

AnalyticEvalNode cannot handle partition/order by exprs with NaN

A query using an analytic function could encounter an error if the evaluation of an analytic ORDER BY or PARTITION
expression resulted in a NaN value, for example if the ORDER BY or PARTITION contained a division operation where both operands were
zero.
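A sketch of the affected shape; per the description above, x / y evaluates to NaN when both (hypothetical) columns are zero:

SELECT id, ROW_NUMBER() OVER (ORDER BY x / y) AS rn
FROM t;   -- rows where x = 0 AND y = 0 put a NaN into the ORDER BY expression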

Add compatibility flag for Hive-Parquet-Timestamps

When Hive writes TIMESTAMP values, it represents them in the local time zone of the server. Impala expects TIMESTAMP values
to always be in the UTC time zone, possibly leading to inconsistent results depending on which component created the data files. This patch introduces a new startup flag, -convert_legacy_hive_parquet_utc_timestamps for the impalad daemon. Specify -convert_legacy_hive_parquet_utc_timestamps=true to make Impala recognize Parquet data files written by Hive and automatically adjust TIMESTAMP
values read from those files into the UTC time zone for compatibility with other Impala TIMESTAMP processing. Although this setting is currently turned off by default,
consider enabling it if practical in your environment, for maximum interoperability with Hive-created Parquet files.

Use snprintf() instead of lexical_cast() in float-to-string casts

Fix partition spilling cleanup when new stream OOMs

Certain calls to aggregate functions with STRING arguments could encounter a serious error when the system ran low on memory and attempted to activate the
spill-to-disk mechanism. The error message referenced the function impala::AggregateFunctions::StringValGetValue.

Impala's ACLs check do not consider all group ACLs, only checked first one.

If the HDFS user ID associated with the impalad process had read or write access in HDFS based on group membership, Impala statements could still
fail with HDFS permission errors if that group was not the first listed group for that user ID.

external-data-source-executor leaking global jni refs

Spurious stale block locality messages

Impala could issue messages stating the block locality metadata was stale, when the metadata was actually fine. The internal "remote bytes read" counter was not
being reset properly. This issue did not cause an actual slowdown in query execution, but the spurious error could result in unnecessary debugging work and unnecessary use of the INVALIDATE METADATA statement.

IMPALA-1556 causes memory leak with secure connections

impalad daemons could experience a memory leak on clusters using Kerberos authentication, with memory usage growing as more data was transferred
across the secure channel, either to the client program or between Impala nodes. The same issue affected LDAP-secured clusters to a lesser degree, because the LDAP security only covers data
transferred back to client programs.

Kerberos fetches 3x slower

Compressed file needs to be hold on entirely in Memory

Queries on gzipped text files required holding the entire data file and its uncompressed representation in memory at the same time. SELECT and COMPUTE STATS statements could fail or perform inefficiently as a result. The fix enables streaming reads for gzipped text, so that the data is uncompressed as it is read.

Add compatibility flag for Hive-Parquet-Timestamps

When Hive writes TIMESTAMP values, it represents them in the local time zone of the server. Impala expects TIMESTAMP values
to always be in the UTC time zone, possibly leading to inconsistent results depending on which component created the data files. This patch introduces a new startup flag, -convert_legacy_hive_parquet_utc_timestamps for the impalad daemon. Specify -convert_legacy_hive_parquet_utc_timestamps=true to make Impala recognize Parquet data files written by Hive and automatically adjust TIMESTAMP
values read from those files into the UTC time zone for compatibility with other Impala TIMESTAMP processing. Although this setting is currently turned off by default,
consider enabling it if practical in your environment, for maximum interoperability with Hive-created Parquet files.

Anti join could produce incorrect results when spilling

An anti-join query (or a NOT EXISTS operation that was rewritten internally into an anti-join) could produce incorrect results if Impala reached its memory
limit, causing the query to write temporary results to disk.
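A sketch of a NOT EXISTS query of the kind Impala rewrites internally into an anti-join; table and column names are hypothetical:

SELECT c.id
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.cust_id = c.id);
-- before the fix, results could be wrong if this query spilled temporary data to disk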

GROUP BY on STRING column produces inconsistent results

Fix leaked file descriptor and excessive file descriptor use

Impala could encounter an error from running out of file descriptors. The fix reduces the amount of time file descriptors are kept open, and avoids leaking file descriptors when read
operations encounter errors.

unix_timestamp() does not return correct time

The unix_timestamp() function could return a constant value 1 instead of a representation of the time.

Workaround: Upgrading to CDH 5.2.1, or another level of CDH that includes the fix for HIVE-8627, prevents the problem from affecting future COMPUTE STATS statements. On affected levels of CDH, or for Impala tables that have become inaccessible, the workaround is to disable the hive.metastore.try.direct.sql setting in the Hive metastore hive-site.xml file and issue the INVALIDATE METADATA
statement for the affected table. You do not need to rerun the COMPUTE STATS statement for the table.

Hive-created Avro tables with columns specified by a JSON file or literal could produce errors when queried in Impala, and could not be used with the COMPUTE
STATS statement. Now you can create such tables in Impala to avoid such errors.

Certain Avro fields for byte data could cause Impala to be unable to read an Avro data file, even if the field was not part of the Impala table definition. With this fix, Impala can now
read these Avro data files, although Impala queries cannot refer to the "bytes" fields.

Support specifying a custom AuthorizationProvider in Impala

The --authorization_policy_provider_class option for impalad was added back. This option specifies a custom
AuthorizationProvider class rather than the default HadoopGroupAuthorizationProvider. It had been used for internal testing, then removed
in Impala 1.4.0, but it was considered useful by some customers.

Failed DCHECK in disk-io-mgr-reader-context.cc:174

The serious error in the title could occur during cancellation of a query involving HDFS cached data, with the supplemental message:

num_used_buffers_ < 0: #used=-1

The issue was due to the use of HDFS caching with data files accessed by Impala. Support for HDFS caching in Impala was introduced in Impala 1.4.0 for CDH 5.1.0. The fix for this issue
was backported to Impala 1.3.x, and is the only change in Impala 1.3.2 for CDH 5.0.4.

Workaround: On CDH 5.0.x, upgrade to CDH 5.0.4 with Impala 1.3.2, where this issue is fixed. In Impala 1.3.0 or 1.3.1 on CDH 5.0.x, do not use HDFS caching
for Impala data files in Impala internal or external tables. If some of these data files are cached (for example because they are used by other components that take advantage of HDFS caching), set
the query option DISABLE_CACHED_READS=true. To set that option for all Impala queries across all sessions, start impalad with the
-default_query_options option and include this setting in the option argument, or on a cluster managed by Cloudera Manager, fill in this option setting on the
Impala Daemon options page.

Resolution: This issue is fixed in Impala 1.3.2 for CDH 5.0.4. The addition of HDFS caching support in Impala 1.4 means that this issue does not apply to
any new level of Impala on CDH 5.

Impala forgets about partitions with non-existent locations

CREATE TABLE LIKE fails if source is a view

The CREATE TABLE LIKE clause was enhanced to be able to create a table with the same column definitions as a view. The resulting table is a text table
unless the STORED AS clause is specified, because a view does not have an associated file format to inherit.
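A brief illustration, using a hypothetical view:

CREATE VIEW v AS SELECT id, name FROM t;
CREATE TABLE t_text LIKE v;                       -- text table by default
CREATE TABLE t_parquet LIKE v STORED AS PARQUET;  -- explicit file format overrides the default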

Improve partition pruning time

Improve compute stats performance

The performance of the COMPUTE STATS statement was improved substantially. The efficiency of its internal operations was improved, and some statistics are
no longer gathered because they are not currently used for planning Impala queries.

Deadlock in scan node

Issues Fixed in the 1.3.3 Release / CDH 5.0.5

Impala 1.3.3 includes fixes to address what is known as the POODLE vulnerability in SSLv3. SSLv3 access is disabled in the Impala debug web UI.

Note: Impala 1.3.3 is only available as part of CDH 5.0.5, not under CDH 4.

Issues Fixed in the 1.3.2 Release / CDH 5.0.4

This backported bug fix is the only change between Impala 1.3.1 and Impala 1.3.2.

Note: Impala 1.3.2 is only available as part of CDH 5.0.4, not under CDH 4.

Failed DCHECK in disk-io-mgr-reader-context.cc:174

The serious error in the title could occur during cancellation of a query involving HDFS cached data, with the supplemental message:

num_used_buffers_ < 0: #used=-1

The issue was due to the use of HDFS caching with data files accessed by Impala. Support for HDFS caching in Impala was introduced in Impala 1.4.0 for CDH 5.1.0. The fix for this issue
was backported to Impala 1.3.x, and is the only change in Impala 1.3.2 for CDH 5.0.4.

Workaround: On CDH 5.0.x, upgrade to CDH 5.0.4 with Impala 1.3.2, where this issue is fixed. In Impala 1.3.0 or 1.3.1 on CDH 5.0.x, do not use HDFS caching
for Impala data files in Impala internal or external tables. If some of these data files are cached (for example because they are used by other components that take advantage of HDFS caching), set
the query option DISABLE_CACHED_READS=true. To set that option for all Impala queries across all sessions, start impalad with the
-default_query_options option and include this setting in the option argument, or on a cluster managed by Cloudera Manager, fill in this option setting on the
Impala Daemon options page.

Resolution: This issue is fixed in Impala 1.3.2 for CDH 5.0.4. The addition of HDFS caching support in Impala 1.4 means that this issue does not apply to
any new level of Impala on CDH 5.

Incorrect result with group by query with null value in group by data

Drop Function does not clear local library cache

When a UDF was dropped through the DROP FUNCTION statement and then re-created with a new .so library or JAR
file, the original version of the UDF was still used when the UDF was called from queries.
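The affected sequence, sketched with a hypothetical UDF and library path:

DROP FUNCTION my_upper(STRING);
CREATE FUNCTION my_upper(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libudfs_v2.so' SYMBOL='MyUpper';
-- before the fix, queries calling my_upper() could still run the old library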

Compute stats doesn't propagate underlying error correctly

If a COMPUTE STATS statement encountered an error, the error message was simply "Query aborted" with no further detail. Common reasons why a
COMPUTE STATS statement might fail include network errors causing the coordinator node to lose contact with other impalad
instances, and column names that match Impala reserved words. (Currently, if a column name is an Impala reserved word,
COMPUTE STATS always returns an error.)

IO Mgr should take instance memory limit into account when creating io buffers

Workaround: Before issuing a COMPUTE STATS statement for a Parquet table, reduce the number of threads used in that operation
by issuing SET NUM_SCANNER_THREADS=2 in impala-shell. Then issue UNSET NUM_SCANNER_THREADS before
continuing with queries.

Impala should provide an option for new sub directories to automatically inherit the permissions of the parent directory

Previously, when new subdirectories were created underneath a partitioned table by an INSERT statement, the new subdirectories always used the default HDFS
permissions for the impala user, which might not be suitable for directories intended to be read and written by other components as well.

INSERT column reordering doesn't work with SELECT clause

Issues Fixed in the 1.3.0 Release / CDH 5.0.0

This section lists the most significant issues fixed in Impala 1.3.0, primarily issues that could cause wrong results, or cause problems running the COMPUTE
STATS statement, which is very important for performance and scalability.

Inner join after right join may produce wrong results

Workaround: Including the STRAIGHT_JOIN keyword in the query prevented the issue from occurring.

Incorrect results with codegen on multi-column group by with NULLs.

A query with a GROUP BY clause referencing multiple columns could introduce incorrect NULL values in some columns of the
result set. The incorrect NULL values could appear in rows where a different GROUP BY column actually did return NULL.

Using distinct inside aggregate function may cause incorrect result when using having clause

Aggregation on union inside (inline) view not distributed properly.

An aggregation query or a query with ORDER BY and LIMIT could be executed on a single node in some cases, rather than
distributed across the cluster. This issue affected queries whose FROM clause referenced an inline view containing a UNION.
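A sketch of the affected query shape, with hypothetical tables:

SELECT COUNT(*)
FROM (SELECT id FROM t1
      UNION ALL
      SELECT id FROM t2) v;
-- before the fix, this could execute on a single node instead of being distributed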

Wrong expression may be used in aggregate query if there are multiple similar expressions

Incorrect results when changing the order of aggregates in the select list with codegen enabled

Referencing the same columns in both a COUNT() and a SUM() call in the same query, or some other combinations of aggregate
function calls, could incorrectly return a result of 0 from one of the aggregate functions. This issue affected references to TINYINT and SMALLINT columns, but not INT or BIGINT columns.
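An illustration of the aggregate combination, assuming a hypothetical TINYINT column:

SELECT COUNT(tiny_col), SUM(tiny_col) FROM t;
-- one of the two aggregates could incorrectly return 0 for TINYINT
-- and SMALLINT columns; INT and BIGINT columns were unaffected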

COMPUTE STATS should update partitions in batches

Fail early (in analysis) when COMPUTE STATS is run against Avro table with no columns

If the columns for an Avro table were all defined in the TBLPROPERTIES or SERDEPROPERTIES clauses, the COMPUTE STATS statement would fail after completely analyzing the table, potentially causing a long delay. Although the COMPUTE STATS statement still
does not work for such tables, now the problem is detected and reported immediately.

Impala cannot load tables with more than Short.MAX_VALUE number of partitions

Various issues with HBase row key specification

Queries against HBase tables could fail with an error if the row key was compared to a function return value rather than a string constant. Also, queries against HBase tables could fail
if the WHERE clause contained combinations of comparisons that could not possibly match any row key.

Resolution: Queries now return appropriate results when function calls are used in the row key comparison. For queries involving non-existent row keys, such
as WHERE row_key IS NULL or where the lower bound is greater than the upper bound, the query succeeds and returns an empty result
set.
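Illustrative queries against a hypothetical HBase-backed table:

SELECT * FROM hbase_t WHERE row_key = concat('user_', '42');  -- function call in the comparison now returns correct results
SELECT * FROM hbase_t WHERE row_key IS NULL;                  -- now succeeds and returns an empty result set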

Issues Fixed in the 1.2.3 Release

This release is a fix release that supersedes Impala 1.2.2, with the same features and fixes as 1.2.2 plus one additional fix for compatibility with Parquet files generated outside of
Impala by components such as Hive, Pig, or MapReduce.

Impala cannot read Parquet files with multiple row groups

The parquet-mr library included with CDH 4.5 writes files that are not readable by Impala, due to the presence of multiple row groups. Queries involving
these data files might result in a crash or a failure with an error such as "Column chunk should not contain two dictionary pages".

This issue does not occur for Parquet files produced by Impala INSERT statements, because Impala only produces files with a single row group.

Order of table references in FROM clause is critical for optimal performance

Impala does not currently optimize the join order of queries; instead, it joins tables in the order in which they are listed in the FROM clause. Queries that contain one or more large
tables on the right hand side of joins (either an explicit join expressed as a JOIN statement or a join implicit in the list of table references in the FROM clause) may run slowly or crash Impala due
to out-of-memory errors. For example:

SELECT ... FROM small_table JOIN large_table

Resolution: Fixed in Impala 1.2.2.

Workaround: In Impala 1.2.2 and higher, use the COMPUTE STATS statement to gather statistics for each table involved in the
join query, after data is loaded. Prior to Impala 1.2.2, modify the query, if possible, to join the largest table first. For example:

SELECT ... FROM small_table JOIN large_table

should be modified to:

SELECT ... FROM large_table JOIN small_table

Parquet in CDH 4.5 writes data files that are sometimes unreadable by Impala

Other components could generate some Parquet files that Impala could not read.

Scanners use too much memory when reading past scan range

While querying a table with long column values, Impala could over-allocate memory leading to an out-of-memory error. This problem was observed most frequently with tables using
uncompressed RCFile or text data files.

Join node consumes memory way beyond mem-limit

A join query could allocate a temporary work area that was larger than needed, leading to an out-of-memory error. The fix makes Impala return unused memory to the system when the memory
limit is reached, avoiding unnecessary memory errors.

Impala could encounter an out-of-memory condition setting up work areas for Parquet tables with many columns. The fix reduces the size of the allocated memory when not actually needed to
hold table data.

Views Sometimes Not Utilizing Partition Pruning

Update the serde name we write into the metastore for Parquet tables

The SerDes class string written into Parquet data files created by Impala was updated for compatibility with Parquet support in Hive. See Incompatible Changes Introduced in Impala 1.1.1 for the steps to update older Parquet data files for Hive compatibility.

Impala stopped to query AVRO tables

Queries against Avro tables could fail depending on whether the Avro schema URL was specified in the TBLPROPERTIES or SERDEPROPERTIES field. The fix causes Impala to check both fields for the schema URL.

10-20% perf regression for most queries across all table formats

This issue is due to a performance tradeoff between systems running many queries concurrently, and systems running a single query. Systems running only a single query could experience
lower performance than in early beta releases. Systems running many queries simultaneously should experience higher performance than in the beta releases.

planner fails with "Join requires at least one equality predicate between the two tables" when "from" table order does not match "where" join order

A query could fail if it involved 3 or more tables and the last join table was specified as a subquery.

INSERT statements for tables partitioned on columns involving datetime types could appear to succeed, but cause errors for subsequent queries on those
tables. The problem was especially serious if an improperly formatted timestamp value was specified for the partition key.

DDL statements (CREATE/ALTER/DROP TABLE) are not supported in the Impala Beta Release

Resolution: Fixed in 0.7

Avro is not supported in the Impala Beta Release

Resolution: Fixed in 0.7

Workaround: None

Impala does not currently allow limiting the memory consumption of a single query

It was not possible to limit the memory consumption of a single query. All tables on the right-hand side of JOIN clauses needed to fit in memory; if they did not,
Impala could crash due to out-of-memory errors.

Resolution: Fixed in 0.7

Aggregate of a subquery result set returns wrong results if the subquery contains a 'limit' and data is distributed across multiple nodes

Aggregate of a subquery result set returns wrong results if the subquery contains a 'limit' clause and data is distributed across multiple nodes. From the query plan, it looks like we
are just summing the results from each worker node.

Partition pruning for arbitrary predicates that are fully bound by a particular partition column

We currently cannot utilize a predicate like "country_code in ('DE', 'FR', 'US')" to do partition pruning, because that requires an equality predicate or a binary comparison.

We should create a superclass of planner.ValueRange, ValueSet, that can be constructed with an arbitrary predicate, and whose isInRange(analyzer, valueExpr) constructs a literal
predicate by substitution of the valueExpr into the predicate.

Issues Fixed in Version 0.6 of the Beta Release

Impala reads the NameNode address and port as command line parameters

Impala reads the NameNode address and port as command line parameters rather than reading them from core-site.xml. Updating the NameNode address in the
core-site.xml file does not propagate to Impala.

Severity: Low

Resolution: Fixed in 0.6 - Impala reads the namenode location and port from the Hadoop configuration files, though setting -nn and -nn_port overrides this. Users are advised not to set -nn or -nn_port.

Queries may fail on secure environment due to impalad Kerberos ticket expiration

Queries may fail on secure environment due to impalad Kerberos tickets expiring. This can happen if the Impala -kerberos_reinit_interval flag is set to a value ten minutes or less. This may lead to an impalad requesting a ticket with a lifetime that is less
than the time to the next ticket renewal.

Concurrent queries may fail when Impala uses Thrift to communicate with the Hive Metastore

Concurrent queries may fail when Impala is using Thrift to communicate with part of the Hive Metastore, such as the Hive Metastore Service. In such a case, the error "get_fields failed: out of sequence response" may occur, because Impala shared a single Hive Metastore client connection across threads. With Impala 0.6, a separate connection is
used for each metadata request.

Backend impalads do not cache connections to the coordinator. On a secure cluster, this introduces a latency proportional to the number of backend clients involved in query execution, as
the cost of establishing a secure connection is much higher than in the non-secure case.

UNIX_TIMESTAMP format behaviour deviates from Hive when format matches a prefix of the time value

The Impala UNIX_TIMESTAMP(val, format) operation compares the length of format and val and returns NULL if they do not match. Hive instead effectively truncates val to the length of the
format parameter.
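A sketch of the difference, using a format string that matches only a prefix of the value:

SELECT UNIX_TIMESTAMP('2015-04-01 10:30:00', 'yyyy-MM-dd');
-- Impala returns NULL because the lengths of val and format differ;
-- Hive effectively truncates val to '2015-04-01' and returns its timestamp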

Issues Fixed in Version 0.4 of the Beta Release

Impala fails to refresh the Hive metastore if a Hive temporary configuration file is removed

Impala is impacted by Hive bug HIVE-3596 which may cause metastore refreshes to fail if a Hive
temporary configuration file is deleted (normally located at /tmp/hive-<user>-<tmp_number>.xml). Additionally, the impala-shell will incorrectly report that
the failed metadata refresh completed successfully.

Anticipated Resolution: To be fixed in a future release

Workaround: Restart the impalad service. Use the impalad log to check for metadata refresh
errors.

Queries with large limits would hang.

Order by on a string column produces incorrect results if there are empty strings

Resolution: Fixed in 0.4

Issues Fixed in Version 0.3 of the Beta Release

All table loading errors show as unknown table

If Impala is unable to load the metadata for a table for any reason, a subsequent query referring to that table will return an unknown table error message,
even if the table is known.

Resolution: Fixed in 0.3

A table that cannot be loaded will disappear from SHOW TABLES

After failing to load metadata for a table, Impala removes that table from the list of known tables returned in SHOW TABLES. Subsequent attempts to query
the table return 'unknown table', even if the metadata for that table is fixed.

Resolution: Fixed in 0.3

Impala cannot read from HBase tables that are not created as external tables in the hive metastore.

Attempting to select from these tables fails.

Resolution: Fixed in 0.3

Certain queries that contain OUTER JOINs may return incorrect results

Queries that contain OUTER JOINs may not return the correct results if there are predicates referencing any of the joined tables in the WHERE clause.

Resolution: Fixed in 0.3.

Issues Fixed in Version 0.2 of the Beta Release

Subqueries which contain aggregates cannot be joined with other tables or Impala may crash

Joining a subquery that contained an aggregate with another table could cause Impala to crash. For example:

SELECT * FROM (SELECT sum(col1) FROM some_table GROUP BY col1) t1 JOIN other_table ON (...);

Resolution: Fixed in 0.2

An insert with a limit that runs as more than one query fragment inserts more rows than the limit.
