Using Statistics with Splice Machine

This topic introduces how to use database statistics with Splice
Machine. Database statistics are a form of metadata (data about data)
that assists the Splice Machine query optimizer; the statistics help the
optimizer select the most efficient approach to running a query, based
on information that has been gathered about the tables involved in the
query.

Statistics are inexact; in fact, some statistics like table cardinality
are estimated using advanced algorithms, due to the resources required
to compute these values. It’s important to keep this in mind when basing
design decisions on values in database statistics tables.

It’s also important to note that the statistics for your database are
not automatically refreshed when the data in your database changes,
which means that when you query a statistical table or view, the results
you see may not exactly match the data in the actual tables.

Collecting Statistics

You can collect statistics on a schema or table using the splice> Analyze
command.

Once collection of statistics has completed, the Splice Machine query
optimizer will automatically begin using the updated statistics to
optimize query execution plans.

When Should You Collect Statistics?

We advise that you collect statistics after you have:

Created an index on a table.

Modified a significant number of rows in a table with update, insert,
or delete operations.

A general rule-of-thumb is that you should collect statistics after
modifying more than 10% of data.

On Which Columns Should You Collect Statistics?

By default, Splice Machine collects statistics on all columns in a
table.

To reduce the operational cost of analyzing large tables (such as fact
tables), you can tell Splice Machine to not collect statistics on
certain columns by running the
SYSCS_UTIL.DISABLE_COLUMN_STATISTICS built-in
system procedure:

SYSCS_UTIL.DISABLE_COLUMN_STATISTICS( schema, table, column);

Splice Machine strongly recommends that you always collect statistics
on small tables, such as a table that has hundreds of rows on each
region server.

How to Determine if You Should Collect or Drop Statistics

You can use Explain Plan in your
development or test environment to determine how dropping or collecting
statistics changes the execution plan for a query.

Dropping Statistics

If you subsequently wish to drop statistics for a schema, you can use
the
SYSCS_UTIL.DROP_SCHEMA_STATISTICS
procedure to drop statistics for an entire schema. For example:

splice> CALL SYSCS_UTIL.DROP_SCHEMA_STATISTICS('SPLICEBBALL');

Enabling and Disabling Statistics on Specific Columns

When you collect statistics, Splice Machine automatically collects
statistics on keyed columns, which are columns in a primary key and
columns that are indexed.

Once you’ve enabled or disabled statistics collection for one or more
table columns, you should update the query optimizer by collecting
statistics on the table or schema.

Selecting Columns for Statistics Collection

You can only collect statistics on columns containing data that can be
ordered. This includes all numeric types, Boolean values, some
CHAR and BIT data types, and date and timestamp values.

When selecting columns on which statistics should be collected, keep
these ideas in mind:

The process of collecting statistics requires both memory and compute
time to complete; the more statistics you collect, the longer it takes
and the more of your computing resources that it uses.

You should collect statistics for any column that is used as a
predicate in a query.

You should collect statistics for any column that is used in a
select distinct, Group by, order by, or join clause.

You do not need to enable statistics for columns that are merely
carried through the computation; however, doing so may improve heap
size estimations, which in turn can make broadcast joins more likely
to be chosen.

As you can see, selecting columns for statistics is a tradeoff between
the resources required to collect the statistics, and the improvements
in optimization that result from having the statistics collected.

Viewing Collected Statistics

Splice Machine provides two system tables you can query to view the
statistics that have been collected for your database: