Table and Column Statistics

Impala can better optimize complex or multi-table queries when it has access to statistics about
the volume of data and how the values are distributed. Impala uses this information to help parallelize and
distribute the work for a query. For example, optimizing join queries requires a way of determining if one
table is "bigger" than another, which is a function of the number of rows and the average row size
for each table. The following sections describe the categories of statistics Impala can work
with, and how to produce them and keep them up to date.

Note:
Originally, Impala relied on the Hive mechanism for collecting statistics, through the Hive ANALYZE
TABLE statement, which initiates a MapReduce job. For better user-friendliness and reliability,
Impala implements its own COMPUTE STATS statement in Impala 1.2.2 and higher, along with the
DROP STATS, SHOW TABLE STATS, and SHOW COLUMN STATS
statements.

Overview of Table Statistics

The Impala query planner can make use of statistics about entire tables and partitions.
This information includes physical characteristics such as the number of rows, number of data files,
the total size of the data files, and the file format. For partitioned tables, the numbers
are calculated per partition, and as totals for the whole table.
This metadata is stored in the metastore database, and can be updated by either Impala or Hive.
If a number is not available, the value -1 is used as a placeholder.
Some numbers, such as number and total sizes of data files, are always kept up to date because
they can be calculated cheaply, as part of gathering HDFS block metadata.

The following example shows table stats for an unpartitioned Parquet table.
The values for the number and sizes of files are always available.
Initially, the number of rows is not known, because it requires a potentially expensive
scan through the entire table, and so that value is displayed as -1.
The COMPUTE STATS statement fills in any unknown table stats values.
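As a minimal sketch of that sequence, assuming a hypothetical unpartitioned Parquet table named parquet_snappy (the name is illustrative only, and output is omitted because it depends on your data):

```sql
-- Hypothetical table name. Before COMPUTE STATS, the file count and size
-- are already populated, while the number of rows is shown as -1.
SHOW TABLE STATS parquet_snappy;

-- Scan the table and store the row count (plus column stats) in the metastore.
COMPUTE STATS parquet_snappy;

-- The number of rows should now be an actual count instead of -1.
SHOW TABLE STATS parquet_snappy;
```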

Impala performs some optimizations using this metadata on its own, and other optimizations by
using a combination of table and column statistics.

To check that table statistics are available for a table, and see the details of those statistics, use the
statement SHOW TABLE STATS table_name. See
SHOW Statement for details.

If you use the Hive-based methods of gathering statistics, see the Hive wiki for information about
the required configuration on the Hive side. Where practical, use the Impala COMPUTE STATS
statement to avoid potential configuration and scalability issues with the statistics-gathering process.

If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS,
Impala can only use the resulting column statistics if the table is unpartitioned.
Impala cannot use Hive-generated column statistics for a partitioned table.

Overview of Column Statistics

The Impala query planner can make use of statistics about individual columns when that metadata is
available in the metastore database. This technique is most valuable for columns compared across tables in
join queries, to help estimate how many rows the query
will retrieve from each table. These statistics are also important for correlated
subqueries using the EXISTS() or IN() operators, which are processed
internally the same way as join queries.

The following example shows column stats for an unpartitioned Parquet table.
The values for the maximum and average sizes of some types are always available,
because those figures are constant for numeric and other fixed-size types.
Initially, the number of distinct values is not known, because it requires a potentially expensive
scan through the entire table, and so that value is displayed as -1.
The same applies to maximum and average sizes of variable-sized types, such as STRING.
The COMPUTE STATS statement fills in most unknown column stats values.
(It does not record the number of NULL values, because currently Impala
does not use that figure for query optimization.)
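A sketch of the same before-and-after check for column statistics, again using the hypothetical table name parquet_snappy (output omitted because it depends on the table's columns and data):

```sql
-- Before COMPUTE STATS: fixed-size types already show maximum and average
-- sizes, while distinct-value counts (and sizes for STRING columns) show -1.
SHOW COLUMN STATS parquet_snappy;

-- Gather table and column statistics in one pass.
COMPUTE STATS parquet_snappy;

-- After COMPUTE STATS: most of the -1 placeholders should be filled in.
SHOW COLUMN STATS parquet_snappy;
```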

For column statistics to be effective in Impala, you also need to have table statistics for the
applicable tables, as described in Overview of Table Statistics. When you use
the Impala COMPUTE STATS statement, both table and column statistics are automatically
gathered at the same time, for all columns in the table.

Note: Prior to Impala 1.4.0,
COMPUTE STATS counted the number of
NULL values in each column and recorded that figure
in the metastore database. Because Impala does not currently use the
NULL count during query planning, Impala 1.4.0 and
higher speeds up the COMPUTE STATS statement by
skipping this NULL counting.

To check whether column statistics are available for a particular set of columns, use the SHOW
COLUMN STATS table_name statement, or check the extended
EXPLAIN output for a query against that table that refers to those columns. See
SHOW Statement and EXPLAIN Statement for details.

How Table and Column Statistics Work for Partitioned Tables

When you use Impala for "big data", you are highly likely to use partitioning
for your biggest tables, the ones representing data that can be logically divided
based on dates, geographic regions, or similar criteria. The table and column statistics
are especially useful for optimizing queries on such tables. For example, a query involving
one year might involve substantially more or less data than a query involving a different year,
or a range of several years. Each query might be optimized differently as a result.

The following examples show how table and column stats work with a partitioned table.
The table for this example is partitioned by year, month, and day.
For simplicity, the sample data consists of 5 partitions, all from the same year and month.
Table stats are collected independently for each partition. (In fact, the
SHOW PARTITIONS statement displays exactly the same information as
SHOW TABLE STATS for a partitioned table.) Column stats apply to
the entire table, not to individual partitions. Because the partition key column values
are represented as HDFS directories, their characteristics are typically known in advance,
even when the values for non-key columns are shown as -1.
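The statements involved might look like the following sketch. The table name year_month_day is an assumption for illustration; it stands for any table partitioned by year, month, and day.

```sql
-- Per-partition table stats: one output row per partition, plus a totals row.
SHOW TABLE STATS year_month_day;

-- For a partitioned table, this displays the same information.
SHOW PARTITIONS year_month_day;

-- Column stats apply to the whole table. The partition key columns
-- typically have known values even before COMPUTE STATS runs.
SHOW COLUMN STATS year_month_day;

-- Fill in the remaining -1 placeholders for all partitions and columns.
COMPUTE STATS year_month_day;
SHOW TABLE STATS year_month_day;
SHOW COLUMN STATS year_month_day;
```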

Note:
Partitioned tables can grow so large that scanning the entire table, as the COMPUTE STATS
statement does, is impractical just to update the statistics for a new partition. The standard
COMPUTE STATS statement might take hours, or even days. That situation is where you switch
to using incremental statistics, a feature available in Impala 2.1 and higher.
See Overview of Incremental Statistics for details about this feature
and the COMPUTE INCREMENTAL STATS syntax.

Overview of Incremental Statistics

In Impala 2.1.0 and higher, you can use the syntax COMPUTE INCREMENTAL STATS and
DROP INCREMENTAL STATS. The INCREMENTAL clauses work with incremental
statistics, a specialized feature for partitioned tables that are large or frequently updated with new
partitions.

When you compute incremental statistics for a partitioned table, by default Impala only processes those
partitions that do not yet have incremental statistics. By processing only newly added partitions, you can
keep statistics up to date for large partitioned tables, without incurring the overhead of reprocessing the
entire table each time.

You can also compute or drop statistics for a single partition by including a PARTITION
clause in the COMPUTE INCREMENTAL STATS or DROP INCREMENTAL STATS
statement.
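A brief sketch of the three forms, assuming a hypothetical table sales_data partitioned by year and month (both the table name and the partition values are illustrative):

```sql
-- Process only those partitions that do not yet have incremental stats.
COMPUTE INCREMENTAL STATS sales_data;

-- Compute incremental stats for one specific partition.
COMPUTE INCREMENTAL STATS sales_data PARTITION (year=2014, month=6);

-- Drop the incremental stats for that same partition.
DROP INCREMENTAL STATS sales_data PARTITION (year=2014, month=6);
```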

The metadata for incremental statistics is handled differently from the original style of statistics:

If you have an existing partitioned table for which you have already computed statistics, issuing
COMPUTE INCREMENTAL STATS without a partition clause causes Impala to rescan the
entire table. Once the incremental statistics are computed, any future COMPUTE INCREMENTAL
STATS statements only scan any new partitions and any partitions where you performed
DROP INCREMENTAL STATS.

The SHOW TABLE STATS and SHOW PARTITIONS statements now include an
additional column showing whether incremental statistics are available for each partition. A partition
could already be covered by the original type of statistics based on a prior COMPUTE
STATS statement, as indicated by a value other than -1 under the
#Rows column. Impala query planning uses either kind of statistics when available.

COMPUTE INCREMENTAL STATS takes more time than COMPUTE STATS for the
same volume of data. Therefore it is most suitable for tables with large data volume where new
partitions are added frequently, making it impractical to run a full COMPUTE STATS
operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once
and not updated with new partitions, use the original COMPUTE STATS syntax.

COMPUTE INCREMENTAL STATS uses some memory in the catalogd process,
proportional to the number of partitions and number of columns in the applicable table. The memory
overhead is approximately 400 bytes for each column in each partition. This memory is reserved in the
catalogd daemon, the statestored daemon, and in each instance of
the impalad daemon.
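To get a feel for that overhead, here is a rough worked estimate. The table shape (20,000 partitions and 50 columns) is a made-up assumption; only the figure of approximately 400 bytes comes from the description above.

```sql
-- Approximate incremental-stats memory for a hypothetical table with
-- 20,000 partitions and 50 columns, at about 400 bytes per column per partition:
-- 400 * 50 * 20000 = 400,000,000 bytes, or roughly 381 MiB in each daemon.
SELECT 400 * 50 * 20000 AS approx_bytes,
       400 * 50 * 20000 / (1024 * 1024) AS approx_mib;
```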

In cases where new files are added to an existing partition, issue a REFRESH statement
for the table, followed by a DROP INCREMENTAL STATS and COMPUTE INCREMENTAL
STATS sequence for the changed partition.
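That sequence could be sketched as follows, assuming a hypothetical table sales_data whose partition (year=2014, month=6) received new data files:

```sql
-- Make Impala aware of the newly added data files.
REFRESH sales_data;

-- Discard the now-stale incremental stats for the changed partition only.
DROP INCREMENTAL STATS sales_data PARTITION (year=2014, month=6);

-- Recompute incremental stats for that partition.
COMPUTE INCREMENTAL STATS sales_data PARTITION (year=2014, month=6);
```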

The DROP INCREMENTAL STATS statement operates only on a single partition at a time. To
remove statistics (whether incremental or not) from all partitions of a table, issue a DROP
STATS statement with no INCREMENTAL or PARTITION clauses.

The following considerations apply to incremental statistics when the structure of an existing table is
changed (known as schema evolution):

If you use an ALTER TABLE statement to drop a column, the existing statistics remain
valid and COMPUTE INCREMENTAL STATS does not rescan any partitions.

If you use an ALTER TABLE statement to add a column, Impala rescans all partitions and
fills in the appropriate column-level values the next time you run COMPUTE INCREMENTAL
STATS.

If you use an ALTER TABLE statement to change the data type of a column, Impala
rescans all partitions and fills in the appropriate column-level values the next time you run
COMPUTE INCREMENTAL STATS.

If you use an ALTER TABLE statement to change the file format of a table, the existing
statistics remain valid and a subsequent COMPUTE INCREMENTAL STATS does not rescan any
partitions.
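As an illustration of the add-column case described above, here is a sketch using a hypothetical table sales_data and a hypothetical new column discount_code:

```sql
-- Add a column to a partitioned table that already has incremental stats.
ALTER TABLE sales_data ADD COLUMNS (discount_code STRING);

-- Per the considerations above, the next run rescans all partitions
-- to fill in column-level values for the new column.
COMPUTE INCREMENTAL STATS sales_data;
```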

Generating Table and Column Statistics (COMPUTE STATS Statement)

To gather table statistics after loading data into a table or partition, you typically use the
COMPUTE STATS statement. This statement is available in Impala 1.2.2 and higher.
It gathers both table statistics and column statistics for all columns in a single operation.
For large partitioned tables, where you frequently need to update statistics and it is impractical
to scan the entire table each time, use the syntax COMPUTE INCREMENTAL STATS,
which is available in Impala 2.1 and higher.

If you use Hive as part of your ETL workflow, you can also use Hive to generate table and
column statistics. You might need to do extra configuration within Hive itself, the metastore,
or even set up a separate database to hold Hive-generated statistics. You might need to run
multiple statements to generate all the necessary statistics. Therefore, prefer the
Impala COMPUTE STATS statement where that technique is practical.
For details about collecting statistics through Hive, see
the Hive wiki.

For your very largest tables, you might find that COMPUTE STATS or even COMPUTE INCREMENTAL STATS
takes so long to scan the data that it is impractical to use regularly. In such a case, after adding a
partition or inserting new data, you can update just the number of rows property through an
ALTER TABLE statement.
See Setting the NUMROWS Value Manually through ALTER TABLE for details.
Because the column statistics might be left in a stale state, do not use this technique as a replacement
for COMPUTE STATS. Only use this technique if all other means of collecting statistics are impractical, or as a
low-overhead operation that you run in between periodic COMPUTE STATS or COMPUTE INCREMENTAL STATS operations.

Detecting Missing Statistics

You can check whether a specific table has statistics using the SHOW TABLE STATS statement
(for any table) or the SHOW PARTITIONS statement (for a partitioned table). Both
statements display the same information. If a table or a partition does not have any statistics, the
#Rows field contains -1. Once you compute statistics for the table or
partition, the #Rows field changes to an accurate value.

The following example shows a table that initially does not have any statistics. The SHOW TABLE
STATS statement displays different values for #Rows before and after the
COMPUTE STATS operation.

The following example shows a similar progression with a partitioned table. Initially,
#Rows is -1. After a COMPUTE STATS operation,
#Rows changes to an accurate value. Any newly added partition starts with no statistics,
meaning that you must collect statistics after adding a new partition.

Note:
Because the default COMPUTE STATS statement creates and updates statistics for all
partitions in a table, if you expect to frequently add new partitions, use the COMPUTE INCREMENTAL
STATS syntax instead, which lets you compute stats for a single specified partition, or only for
those partitions that do not already have incremental stats.
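The progression for a newly added partition might look like this sketch. The table name logs and its partition key column year are assumptions for illustration.

```sql
-- Add a partition to a hypothetical table, then load data into it
-- by whatever means you normally use.
ALTER TABLE logs ADD PARTITION (year=2013);

-- The new partition starts with no statistics, so it shows -1 under #Rows.
SHOW PARTITIONS logs;

-- Collect stats for just the new partition.
COMPUTE INCREMENTAL STATS logs PARTITION (year=2013);

-- The new partition should now show an actual row count.
SHOW PARTITIONS logs;
```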

If checking each individual table is impractical, due to a large number of tables or views that hide the
underlying base tables, you can also check for missing statistics for a particular query. Use the
EXPLAIN statement to preview query efficiency before actually running the query. Use the
query profile output available through the PROFILE command in
impala-shell or the web UI to verify query execution and timing after running the query.
Both the EXPLAIN plan and the PROFILE output display a warning if any
tables or partitions involved in the query do not have statistics.
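In an impala-shell session, that workflow might look like the following sketch. The table t1 and its year column are placeholders, and the exact wording of the warning is not reproduced here.

```sql
-- Preview the plan; look for a missing-statistics warning in the output.
EXPLAIN SELECT COUNT(*) FROM t1 WHERE year = 2012;

-- Run the query itself.
SELECT COUNT(*) FROM t1 WHERE year = 2012;

-- impala-shell command: show the profile of the query that just ran,
-- which also flags tables or partitions without statistics.
PROFILE;
```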

Because Impala uses the partition pruning technique when possible to only evaluate certain
partitions, if you have a partitioned table with statistics for some partitions and not others, whether or
not the EXPLAIN statement shows the warning depends on the actual partitions used by the
query. For example, you might see warnings or not for different queries against the same table:

-- No warning because all the partitions for the year 2012 have stats.
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;
-- Missing stats warning because one or more partitions in this range
-- do not have stats.
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;

To confirm whether any partitions at all in the table are missing statistics, you can run EXPLAIN on a
query that scans the entire table, such as SELECT COUNT(*) FROM table_name.

Keeping Statistics Up to Date

When the contents of a table or partition change significantly, recompute the stats for the relevant table
or partition. The degree of change that qualifies as "significant" varies, depending on the absolute
and relative sizes of the tables. Typically, if you add more than 30% more data to a table, it is
worthwhile to recompute stats, because the differences in number of rows and number of distinct values
might cause Impala to choose a different join order when that table is used in join queries. This guideline
is most important for the largest tables. For example, adding 30% new data to a table containing 1 TB has a
greater effect on join order than adding 30% to a table containing only a few megabytes, and the larger
table has a greater effect on query performance if Impala chooses a suboptimal join order as a result of
outdated statistics.

If you reload a complete new set of data for a table, but the number of rows and number of distinct values
for each column is relatively unchanged from before, you do not need to recompute stats for the table.

If the statistics for a table are out of date, and the table's large size makes it impractical to recompute
new stats immediately, you can use the DROP STATS statement to remove the obsolete
statistics, making it easier to identify tables that need a new COMPUTE STATS operation.
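A sketch of that approach, using a hypothetical table name big_table:

```sql
-- Remove obsolete table and column statistics.
DROP STATS big_table;

-- #Rows reverts to -1, which marks the table as needing new stats.
SHOW TABLE STATS big_table;

-- Later, when the full scan is affordable:
COMPUTE STATS big_table;
```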

For a large partitioned table, consider using the incremental stats feature available in Impala 2.1.0 and
higher, as explained in Overview of Incremental Statistics. If you add a new
partition to a table, it is worthwhile to recompute incremental stats, because the operation only scans the
data for that one new partition.

Setting the NUMROWS Value Manually through ALTER TABLE

The most crucial piece of data in all the statistics is the number of rows in the table (for an
unpartitioned or partitioned table) and for each partition (for a partitioned table). The COMPUTE STATS
statement always gathers statistics about all columns, as well as overall table statistics. If it is not
practical to do a full COMPUTE STATS or COMPUTE INCREMENTAL STATS
operation after adding a partition or inserting data, or if you can see that Impala would produce a more
efficient plan if the number of rows were different, you can manually set the number of rows through an
ALTER TABLE statement:

-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
alter table table_name set tblproperties('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK'='true');
-- Set total number of rows for a specific partition. Applies to partitioned tables only.
-- You must specify all the partition key columns in the PARTITION clause.
alter table table_name partition (keycol1=val1,keycol2=val2...) set tblproperties('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK'='true');

This statement avoids re-scanning any data files. (The requirement to include the STATS_GENERATED_VIA_STATS_TASK property is relatively new, as a
result of the issue HIVE-8648
for the Hive metastore.)

create table analysis_data stored as parquet as select * from raw_data;
-- Inserted 1000000000 rows in 181.98s
compute stats analysis_data;
insert into analysis_data select * from smaller_table_we_forgot_before;
-- Inserted 1000000 rows in 15.32s
-- Now there are 1001000000 rows. We can update this single data point in the stats.
alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');

For a partitioned table, update both the per-partition number of rows and the number of rows for the whole
table:

-- If the table originally contained 1 million rows, and we add another partition with 30 thousand rows,
-- change the numRows property for the partition and the overall table.
alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
alter table partitioned_data set tblproperties ('numRows'='1030000', 'STATS_GENERATED_VIA_STATS_TASK'='true');

In practice, the COMPUTE STATS statement, or COMPUTE INCREMENTAL STATS
for a partitioned table, should be fast and convenient enough that this technique is only useful for the very
largest partitioned tables.
Because the column statistics might be left in a stale state, do not use this technique as a replacement
for COMPUTE STATS. Only use this technique if all other means of collecting statistics are impractical, or as a
low-overhead operation that you run in between periodic COMPUTE STATS or COMPUTE INCREMENTAL STATS operations.

Setting Column Stats Manually through ALTER TABLE

In Impala 2.6 and higher, you can also use the SET COLUMN STATS
clause of ALTER TABLE to manually set or change column statistics.
Only use this technique in cases where it is impractical to run
COMPUTE STATS or COMPUTE INCREMENTAL STATS
frequently enough to keep up with data changes for a huge table.

You specify a case-insensitive symbolic name for the kind of statistics:
numDVs, numNulls, avgSize, maxSize.
The key names and values are both quoted. This operation applies to an entire table,
not a specific partition. For example:
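A minimal sketch, assuming a hypothetical table t1 with an integer column x and a STRING column s (the table, the columns, and the values are all illustrative):

```sql
-- Set the number of distinct values and the NULL count for column x.
ALTER TABLE t1 SET COLUMN STATS x ('numDVs'='2', 'numNulls'='0');

-- Key names are case-insensitive. Set the distinct-value count
-- and maximum size for column s.
ALTER TABLE t1 SET COLUMN STATS s ('numdvs'='3', 'maxsize'='4');

-- Confirm the manually set values.
SHOW COLUMN STATS t1;
```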

Examples of Using Table and Column Statistics with Impala

The following examples walk through a sequence of SHOW TABLE STATS, SHOW COLUMN
STATS, ALTER TABLE, and SELECT and INSERT
statements to illustrate various aspects of how Impala uses statistics to help optimize queries.

This example shows table and column statistics for the STORE table used in the
TPC-DS benchmarks for decision
support systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are
gathered by a COMPUTE STATS statement, most of the numeric fields show placeholder values
of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily
countable or deducible at the physical level, such as the number of files, total data size of the files,
and the maximum and average sizes for data types that have a constant size such as INT,
FLOAT, and TIMESTAMP.

With the Hive ANALYZE TABLE statement for column statistics, you had to specify each
column for which to gather statistics. The Impala COMPUTE STATS statement automatically
gathers statistics for all columns, because it reads through the entire table relatively quickly and can
efficiently compute the values for all the columns. This example shows how after running the
COMPUTE STATS statement, statistics are filled in for both the table and all its columns:
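The statements for this walkthrough might be sketched as follows, assuming the TPC-DS table is available under the name store in the current database (output omitted):

```sql
-- Before: most numeric fields show -1.
SHOW TABLE STATS store;
SHOW COLUMN STATS store;

-- One statement gathers table stats and stats for every column.
COMPUTE STATS store;

-- After: statistics are filled in for the table and all of its columns.
SHOW TABLE STATS store;
SHOW COLUMN STATS store;
```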

The following example shows how statistics are represented for a partitioned table. In this case, we have
set up a table to hold the world's most trivial census data, a single STRING field,
partitioned by a YEAR column. The table statistics include a separate entry for each
partition, plus final totals for the numeric fields. The column statistics include some easily deducible
facts for the partitioning column, such as the number of distinct values (the number of partition
subdirectories).
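One way such a table could be set up is sketched below. The table name census, the column name name, the SMALLINT partition key type, and the sample rows are assumptions for illustration.

```sql
-- A single STRING field, partitioned by a YEAR column.
CREATE TABLE census (name STRING) PARTITIONED BY (year SMALLINT);

-- Create a couple of partitions by inserting sample rows.
INSERT INTO census PARTITION (year=2010) VALUES ('Smith'), ('Jones');
INSERT INTO census PARTITION (year=2011) VALUES ('Smith'), ('Jones'), ('Doe');

COMPUTE STATS census;

-- One entry per partition, plus totals for the numeric fields.
SHOW TABLE STATS census;

-- The partitioning column has easily deducible stats,
-- such as its number of distinct values.
SHOW COLUMN STATS census;
```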

For examples showing how some queries work differently when statistics are available, see
Examples of Join Order Optimization. You can see how Impala executes a query
differently in each case by observing the EXPLAIN output before and after collecting
statistics. Measure the before and after query times, and examine the throughput numbers in before and
after SUMMARY or PROFILE output, to verify how much the improved plan
speeds up performance.