WL#8943: Extend ANALYZE TABLE with histogram support

This worklog specifies how a user can manage histograms statistics. This
includes adding histogram statistics to one or more columns, as well as removing
existing histogram statistics. We will extend the syntax for "ANALYZE TABLE" to
accept two new clauses:
- UPDATE HISTOGRAM ON column [, column] WITH n BUCKETS
- DROP HISTOGRAM ON column [, column]
When this worklog is complete, the user will be able to generate histogram
statistics for all columns with a supported data type, as well as removing them.
The histogram statistics will be stored in the dictionary table
"column_statistics" and accessible through the view
information_schema.COLUMN_STATISTICS. However, this worklog will not cover
the process of using histogram statistics in the optimizer.

F-1: The parser MUST accept the following new syntaxes for ANALYZE TABLE:
ANALYZE [NO_WRITE_TO_BINLOG | LOCAL] TABLE tbl_name
[UPDATE HISTOGRAM ON column [, column] WITH n BUCKETS]
ANALYZE [NO_WRITE_TO_BINLOG | LOCAL] TABLE tbl_name
[DROP HISTOGRAM ON column [, column]]
F-2: When either "UPDATE HISTOGRAM" or "DROP HISTOGRAM" is specified, only one
table can be specified in the table list.
F-3: The server MUST be able to properly build and store histogram statistics
for the following data types:
BOOLEAN, BIT, TINYINT, SMALLINT, MEDIUMINT, INTEGER, BIGINT
ENUM, SET
FLOAT, DOUBLE, DECIMAL
DATE, TIME, YEAR, DATETIME, TIMESTAMP
CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT
BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB
F-4: The server MUST NOT try to build and store histogram statistics for the
following data types:
GEOMETRY, POINT, LINESTRING, POLYGON, MULTIPOINT, MULTILINESTRING,
MULTIPOLYGON, GEOMETRYCOLLECTION
JSON
F-5: If the new optional part "UPDATE HISTOGRAM" is not specified, the server
MUST NOT build and/or store any histogram statistics.
F-6: If "UPDATE HISTOGRAM" is specified, the server MUST build and store
histogram statistics only for those columns specified in the column list.
F-7: If the column list contains a non-existing column, the server MUST send an
error message back to the client saying that the column does not exist.
F-8: If the column list contains a column with an unsupported data type and
"UPDATE HISTOGRAM" is specified, the server MUST send an error message back
to the client saying that the column has an unsupported data type.
F-9: The server MUST NOT build/store histogram statistics for temporary tables.
If such table is specified, an error must be reported to the user.
F-10: The server MUST NOT build/store histogram statistics for encrypted tables.
If such table is specified, an error must be reported to the user.
F-11: The server MUST NOT build/store histogram statistics for views.
If a view is specified, an error must be reported to the user.
F-12: If histogram statistic does not exist in the dictionary table
"column_statistics" for the specified column(s) and "UPDATE HISTOGRAM"
is specified, the server MUST insert a new row with a valid histogram
given that the above requirements are fulfilled.
F-13: If histogram statistic exists in the dictionary table
"column-statistics" for the specified column(s) and "UPDATE HISTOGRAM"
is specified, the server MUST update the existing row(s) given that
the above requirements are fulfilled.
F-14: The server MUST NOT build/store histogram statistics for columns that are
covered by single-part unique indexes. If such column is specified, an
error must be reported to the user.
F-15: The server MUST support histogram statistics for both stored and virtual
generated columns.
F-16: If "ANALYZE TABLE tbl DROP HISTOGRAM ON col" refers to both a valid table
and a valid column which contains histogram statistics, the existing
histogram statistics for that column MUST be removed.
F-17: If "ANALYZE TABLE tbl DROP HISTOGRAM ON col" refers to both a valid table
and a valid column which does not contain histogram statistics, the server
SHOULD send a message back to the client saying that histogram statistics
did not exist for that column.
F-18: The server MUST create at most "n" buckets when given the query
"ANALYZE TABLE tbl UPDATE HISTOGRAM ON col WITH n BUCKETS;".
F-19: The number of buckets MUST be in the range [1, 1024]. If the number
provided is outside this range, the server MUST raise the error
ER_DATA_OUT_OF_RANGE:
"Number of buckets value is out of range in 'ANALYZE TABLE'"
F-20: The server SHOULD NOT use more memory for creating histogram statistics
than specified by the session variable
"histogram_generation_max_mem_size".
F-21: The session variable "histogram_generation_max_mem_size" MUST be
adjustable only by the super user.
F-22: If the user runs "UPDATE HISTOGRAM ON col1, col2" and only one of the
columns exists and/or has a supported data type, the server SHOULD still
create a histogram for the other column.
F-23: When the user runs UPDATE or DROP HISTOGRAM, the server SHOULD not stop
processing even though it encounters one or more invalid columns.
F-24: If a column name is specified more than once, the server SHOULD raise the
error ER_DUP_FIELDNAME:
"Duplicate column name 'COLUMN'"
F-25: The minimum, maximum and default value for the variable
'histogram_generation_max_mem_size' MUST be 1000000, 4294967295 and
20000000 bytes respectively for 32-bit platforms.
F-26: The minimum, maximum and default value for the variable
'histogram_generation_max_mem_size' MUST be 1000000,
18446744073709551615 and 20000000 bytes respectively for 64-bit
platforms.
F-27: Executing a DDL command that removes the schema "db" (for instance
"DROP DATABASE db;") MUST remove any histogram statistics that belongs
to the schema named "db".
F-28: Executing a DDL command that removes the table "tbl" (for instance
the command "DROP TABLE tbl;") MUST remove any histogram
statistics that belongs to the table named "tbl".
F-29: Executing a DDL command that renames the "tbl" (for instance the
command "RENAME TABLE tbl TO tbl2;") MUST rename any
histogram statistics that belongs to the table named "tbl" so that
the statistics are kept.
F-30: Executing a DDL command that alters an existing column (for instance the
command "ALTER TABLE tbl CHANGE COLUMN col1 col2 INTEGER;") MUST remove
any histogram statistics that belongs to the column named "col1".
F-31: Executing a DDL command that removes an existing column (for instance the
command "ALTER TABLE tbl DROP COLUMN col1;") MUST remove any
histogram statistics that belongs to the column named "col1".
F-32: The statement "ALTER TABLE tbl CONVERT TO CHARACTER SET character_set"
SHOULD remove histogram statistics only for those columns where the
character set is changed. Histogram statistics for any non-character
columns should remain unchanged/untouched.

Creating histogram statistics
=============================
In order to generate histogram statistics, the user has to run "ANALYZE TABLE"
with the new optional clause "UPDATE HISTOGRAM ON column [, column] WITH n
BUCKETS". If this is not specified, histogram statistics won't be
updated/created for the table specified. This clause is introduced due to
the fact that creating histograms statistics may be very expensive for big
tables. Note that when adding the clause "UPDATE HISTOGRAM", the server won't
update any other statistics.
Simple example: ANALYZE TABLE t1 UPDATE HISTOGRAM ON column_a, column_b WITH
100 BUCKETS;
A current limitation with ANALYZE TABLE is that it is only supported for the
storage engines MyISAM, InnoDB and NDB. In order for it to work with other
storage engines, this limitation must be lifted. We will not lift this
limitation in this worklog, due to the fact that our main focus is InnoDB.
Removing histogram statistics
=============================
In order to remove histogram statistics, the user has to run "ANALYZE TABLE"
with the new optional clause "DROP HISTOGRAM ON column [, column]". Note that
when adding the clause "DROP HISTOGRAM", the server won't update any other
statistics.
Simple example: ANALYZE TABLE t1 DROP HISTOGRAM ON column_a, column_b;
We will also remove any histogram statistics that exists for a schema, table
and/or column when doing certain DDL operations. Below is a list of what kind
of DDL operations that will force a automatic removal of histogram statistics:
DROP DATABASE db
DROP TABLE tbl
ALTER TABLE tbl CHANGE/MODIFY COLUMN
ALTER TABLE tbl DROP COLUMN
ALTER TABLE tbl CONVERT TO CHARACTER SET
Future worklogs might improve this to see if we can issue a rename of the
existing histogram instead of removing it completely.
Note that for the statement "ALTER TABLE tbl CONVERT TO CHARACTER SET", we
only remove histogram statistics for character columns (TEXT, VARCHAR, CHAR).
Columns with non-character data types will remain unchanged (INT, BLOB, DOUBLE
and similar).
Executing a RENAME TABLE will rename any existing histograms.
Status/error reporting
======================
Status messages are sent back to the client using the existing method that
ANALYZE TABLE currently use, which is sending a result set back in case of both
error and normal messages. For histogram operations, the server will send back
one row for each column processed.
Data dictionary and system view
===============================
In WL#8706, we created the system table mysql.column_stats for persistent
storage of histogram data. With the new data dictionary in place, we will
replace mysql.column_stats into a dictionary table for persistent storage.
Since we are converting mysql.column_stats to a dictionary table, we will
take advantage of the dictionary machinery to insert, update and drop histogram
data from the dictionary table.
Data dictionary tables are hidden from users, so in order to inspect the data
we will provide the view information_schema.COLUMN_STATISTICS.
Example usage
=============
CREATE TABLE posts (post_id INT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50),
date_posted DATETIME,
post_data JSON);
--- EXAMPLE 1 ---
mysql> ANALYZE TABLE posts;
-- This will work exactly as it currently does without histogram support
implemented. No change in existing behavior must occur.
--- EXAMPLE 2 ---
mysql> ANALYZE TABLE posts UPDATE HISTOGRAM ON username WITH 100 BUCKETS;
-- When the clause "UPDATE HISTOGRAM" is added, the query will NOT run
the "ANALYZE"-part as it does today without histogram support enabled. The
command will only do the following steps:
- We first find out if the specified column "username" has a supported data
type. VARCHAR is a supported data type for histogram statistics, so we
continue.
- When we know that the column has a supported data type, we simply read the
needed data from table 'posts' and create a histogram for the column. Note
that only the 42 first characters will be considered/used (see WL#8707 for
details). After the new histogram is built in memory, we write it to
the data dictionary for persistent storage. As explained later, we might
use sampling to read the data in order to avoid running out of memory.
Since we have specified "WITH 100 BUCKETS", the resulting histogram will
have at most 100 buckets.
- The ANALYZE command will give the following result set back to the user:
+-------------+-----------+----------+------------------------------+
| Table | Op | Msg_type | Msg_text |
+-------------+-----------+----------+------------------------------+
| blog.posts | histogram | status | Histogram statistics created |
| | | | for 'username' |
+-------------+-----------+----------+------------------------------+
--- EXAMPLE 3 ---
mysql> ANALYZE TABLE posts UPDATE HISTOGRAM ON date_posted, post_data WITH 20
BUCKETS;
-- As described above, the query will only perform the following new behavior:
- Check if the columns 'date_posted' and 'post_data' both have a supported
data type. The server will find out that 'post_data' is a JSON column, and
will return an error back to the client for this column.
- Since 'date_posted' has a supported data type (DATETIME), we build and
store a histogram for this column as described in the example above. The
following result set is returned back to the client.
+-------------+-----------+----------+-----------------------------------+
| Table | Op | Msg_type | Msg_text |
+-------------+-----------+----------+-----------------------------------+
| blog.posts | histogram | error | The column 'posts.post_data' has |
| | | | an unsupported data type. |
| blog.posts | histogram | status | Histogram statistics created for |
| | | | 'date_posted' |
+-------------+-----------+----------+-----------------------------------+
Note that even though one of the columns resulted in an error, we still
create and store histogram statistics for the other column
--- EXAMPLE 4 ---
mysql> ANALYZE TABLE posts DROP HISTOGRAM ON username;
-- The above query will remove the histogram from the data dictionary using
the data dictionary machinery.
The following result set will be returned to the user:
+-------------+-----------+----------+----------------------------------+
| Table | Op | Msg_type | Msg_text |
+-------------+-----------+----------+----------------------------------+
| blog.posts | histogram | status | Histogram statistics removed for |
| | | | 'username' |
+-------------+-----------+----------+----------------------------------+
--- EXAMPLE 5 ---
mysql> ANALYZE TABLE posts DROP HISTOGRAM ON post_id, date_posted;
-- The above query will remove the histogram from the data dictionary using
the data dictionary machinery.
The following result set will be returned to the user:
+-------------+-----------+----------+-----------------------------------+
| Table | Op | Msg_type | Msg_text |
+-------------+-----------+----------+-----------------------------------+
| blog.posts | histogram | status | No histogram statistics found for |
| | | | 'posts' |
| blog.posts | histogram | status | Histogram statistics removed for |
| | | | 'date_posted' |
+-------------+-----------+----------+-----------------------------------+
Since no histogram statistics exists for 'post_id', the server returns an
notification back to the user saying that nothing is removed for this column.
But since we already have histogram statistics for 'date_posted', the client
returns a row saying that the histogram is removed.
DISCUSSION
==========
There is one important design decision that needs to be made; how to properly
create histogram statistics for very large tables? The main goal is to create
histogram statistics without running out of memory, but it should also be
reasonably efficient in terms of execution time.
The steps needed to build a histogram are as follows:
1) Read the data from the specified column(s).
2) Sort the data.
3) Count up the frequency for each distinct values.
4) Create an appropriate histogram (singleton or equi-height).
We propose six different solutions/architectures that have come up after various
discussions and brainstorming. Combinations of the suggestions below may also be
possible.
== Option 1 ==
This option tries to do everything in memory. The advantage is that it's
probably the most efficient solution, as well as reasonably easy to implement.
We first introduce a new variable, which controls the maximum amount of memory
that server can use while creating histogram statistics. Let us call this
variable "histogram_max_mem".
Since we know the approx. number of rows in the table, we can estimate (with
reasonable accuracy) the amount of memory needed to read the entire column into
memory. Since we have an upper limit on the amount of memory we are allowed to
use (histogram_max_mem), we calculate the number of rows we can read given the
amount of memory we have available. Since we now have sampling available in the
handler (see WL#9127), we can then use sampling to read the wanted amount of
rows.
The downside of this method is that we cannot build a histogram based on the
complete data set if the data set is very big. The impact of this however should
not be too big, since research has shown that creating a histogram over 3-4%
of the original data set can be sufficient. And histogram statistics is only an
estimate anyways.
This is the option we will implement in this worklog. The others are kept below
just for the reference.
== Option 2 ==
Use filesort to read data ordered from the table. This will ensure that if the
table is very big, it will do the sorting on disk. It also has the advantage of
sorting all the different data types correctly, since it's the sorting method
used by the server. There are however a few problems;
1) We must run filesort for each column we want a histogram for. If we have a
2TB table with 40 columns we want histograms on, it will read the 2TB table
40 times...
2) It might be more difficult to use sampling (I think...), and for very large
tables we most certainly want to use sampling due to efficiency.
3) If there are very many distinct values, we still might end up using a lot of
memory.
The two first problems may be solved by moving a sample (10% for instance) of
the data we want over to a temporary table. We then run filesort on the sampled
data. The main, big table is then only read once, while the smaller temporary
table is read multiple times.
== Option 3 ==
A different approach could be to take advantage of the SQL Service API (see
WL#8187). We could run a query like "SELECT col, COUNT(*) FROM tbl GROUP BY col
ORDER BY col;" to get out the values both counted and sorted. This would take
advantage of already existing machinery for both sorting, grouping and counting,
and big tables would be handled like any normal SQL query.
The downside of this approach is that sampling will be difficult to implement
(if possible at all...) without the TABLESAMPLE syntax described above. It
would do a full table scan regardless of the table size. Also, the SQL Service
API is mainly intended to be used by plugins. Using it in the server might be
very difficult and far from optimal. We have also considered using the
Ed_connection (Execute Direct), but this class does not seem to be maintained
anymore.
== Option 4 ==
If an index doesn't exists for the column we want to create a histogram
statistics for, create an index temporarily. We can then read this index
(sorted), and create buckets "on the fly". This will work with any sized table
(given that there is enough space on disk), and will sort all data types
correctly.
This won't work however if the table already has 64 indexes, since there is a
hard upper limit on the number of indexes per table.
== Option 5 ==
Create a temporary table, and insert the data we want to create histogram
statistics for there. Then, create index and do the same as described for option
4. This will (partially) overcome the problem with the limit of 64 indexes. We
could also possible only read a sample of the data into the temporary table to
decrease the data size.

-- Temporary table --
We will not support creating histograms for any temporary tables. This
limitation might be lifted in the future, but it would increase the complexity
of this worklog.
-- Encrypted tables --
We will not support histograms for encrypted tables. Unless special care is
taken, it is possible to expose encrypted data since the histogram data is
stored unencrypted. To be on the safe side, we don't support histograms for
encrypted tables.
-- New session variable --
ulonglong histogram_generation_max_mem_size;
This session variable controls how much memory (approximately) the server is
allowed to use when creating histogram statistics. By adjusting the amount of
memory the server is allowed to use, it is possible to control both the accuracy
of the histogram as well as complexity of building the histogram (more memory
allowed == (usually) more accurate histogram, but also longer construction
time).
The variable can only be adjusted by the super user (root etc). It is a session
dynamic variable, and the default, minimum and maximum are as following;
32-bit platforms:
Default 20000000 (20 MB)
Min 1000000 ( 1 MB)
Max 4294967295 (~4 GB)
64-bit platforms:
Default 20000000 ( 20 MB)
Min 1000000 ( 1 MB)
Max 18446744073709551615 (~18 EB)
The default might change during development if anything else is found to be
reasonable.
-- SQL syntax --
Extend the syntax of "ANALYZE TABLE" so that the optional clauses "UPDATE
HISTOGRAM" and "DROP HISTOGRAM" can be specified. After "UPDATE/DROP
HISTOGRAM", a column list MUST be specified.
ANALYZE [NO_WRITE_TO_BINLOG | LOCAL] TABLE tbl_name
[UPDATE HISTOGRAM ON column [, column] WITH n BUCKETS]
ANALYZE [NO_WRITE_TO_BINLOG | LOCAL] TABLE tbl_name
[DROP HISTOGRAM ON column [, column]]
Neither "HISTOGRAM" or "BUCKETS" will be added as reserved words in the parser;
they will be introduced as non-reserved keywords.
When "UPDATE/DROP HISTOGRAM" is specified, only one table can be specified.
This simplification lets us skip handling of ambigous column names.
-- Histogram interface --
We will extend the existing histogram classes with the following functions and
variables.
namespace histograms {
class Histogram
{
public:
/**
Store the histogram to persistent storage.
This function will store the current histogram into the data dictionary
table "column_statistics". If a histogram already exists in the storage
for this (database, table, column), it will be updated/overwritten.
@return true on error, false otherwise.
*/
bool store_histogram(THD *original_thd) const;
/**
Remove a histogram from persistent storage.
This function will only return error if the storage engine returns any
errors. Even though if the column specified contains unsupported data type
or it doesn't have any histogram statistics, success will be returned.
@return true on error, false otherwise.
*/
static bool drop_histogram(THD *original_thd, std::string &database_name,
std::string &table_name, std::string &column_name);
};
} // namespace histogram
-- Data dictionary and system view --
In WL#8706 we created the system table mysql.column_stats for persistent storage
of histogram data. In this worklog, we will convert this table into a pure data
dictionary table. The reason is that it will give us some major advantages:
1) Data dictionary tables are hidden from users by default, so it is easier to
control access to this table through views.
2) We can take advantage of the data dictionary machinery for inserting,
updating and removing histogram data from the persistent storage. In future
worklogs, we will also make use of the data dictionary caching so we don't
have to implement our own caching mechanism.
3) Its easier to handle various situations such as dropping histograms while we
are in LOCK TABLES mode, READ ONLY mode and other server modes.
The data dictionary table will be named column_statistics, and will consist of
the following fields:
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT
catalog_id BIGINT UNSIGNED NOT NULL
name VARCHAR(255) NOT NULL
schema_name VARCHAR(64) NOT NULL
table_name VARCHAR(64) NOT NULL
column_name VARCHAR(64) NOT NULL
histogram JSON NOT NULL
- "id" is required by the data dictionary as the primary key
- "catalog_id" will always be set to "1", since MySQL only supports one catalog.
- "name" will be a concatenation of schema_name, table_name and column_name with
a defined separator (ASCII 31, FIELD SEPARATOR) between the fields. This will
be a single, unique identified which will be used to identify histograms. We
need this field because the dictionary client can only fetch entities by a
single name or ID.
To inspect this data, we will provide a view named COLUMN_STATISTICS:
CREATE OR REPLACE DEFINER=`root`@`localhost` VIEW
information_schema.COLUMN_STAITSTICS AS
SELECT schema_name AS SCHEMA_NAME,
table_name AS TABLE_NAME,
column_name AS COLUMN_NAME,
histogram AS HISTOGRAM
FROM mysql.column_statistics;
This view will be accessible by anyone, but the results include only columns
for which the user could otherwise access.
We implement a new metadata lock namespace, "COLUMN_STATISTICS", to protect
column statistic definitions in the cache. MDL keys have namespace
"COLUMN_STATISTICS", DB/schema "" (empty string),
and the MDL name will be a SHA1 hash of the column statistics name. The reason
we use a SHA1 hash is that MDL name can only be 64 characters long, but a
column statistic name can be up to 194 characters long (schema_name + table_name
+ column_name + two separators).
-- New files/functions --
sql/sql_admin.cc
----------------
In sql_admin.cc, we extend the class Sql_cmd_analyze_table with the following
fields, methods and enums:
class Sql_cmd_analyze_table : public Sql_cmd
{
public:
enum class Histogram_command
{
NONE, UPDATE, DROP
}
private:
/// Which histogram action to perform, if any.
Histogram_command m_histogram_command { Histogram_command::NONE };
/**
The fields the user has specified in the ANALYZE TABLE command. For
instance, if the user entered the query "ANALYZE TABLE people UPDATE
HISTOGRAM ON first_name WITH 2 BUCKETS;", the array will contain a pointer
to one Item_field.
*/
Mem_root_array<Item_field *> m_histogram_fields;
public:
/**
Method that creates/updates histogram statistics. This is invoked by the
query "ANALYZE TABLE tbl UPDATE HISTOGRAM ON col [, col] WITH n BUCKETS;".
@param thd thread handler
@param table the table specified in the ANALYZE TABLE command
@param num_buckets maximum number of buckets to be created in the resulting
histogram
@return true on error, false otherwise
*/
bool update_histogram(THD *thd, TABLE_LIST *table, int num_buckets);
/**
Method that drops existing histogram statistics. This is invoked by the
query "ANALYZE TABLE tbl DROP HISTOGRAM ON col [, col];".
@param original_thd thread handler
@param tables the table specified in the ANALYZE TABLE command
@return true on error, false otherwise
*/
bool drop_histogram(THD *original_thd, TABLE_LIST *table);
Histogram_command get_histogram_command() const
{ return m_histogram_command; }
void set_histogram_command(Histogram_command value)
{ m_histogram_command= value; }
/// Add a field to the "m_histogram_fields" array.
bool add_histogram_field(Item_field *field);
};
sql/histogram/value_map.h
-------------------------
Value_map is a new class, which acts as a wrapper around a map structure. It
simplifies things like duplicate checking, handling null values, gives us a
cleaner interface as well as abstracting away most of the underlying container.
The histogram interface will be adjusted to use Value_map.
/// Base class for Value_map
class Value_map_base
{
public:
Value_map_base();
virtual ~Value_map_base();
/// @return the number of distinct values in this Value_map
virtual size_t size() const = 0;
/**
Add a given number of the same value.
@param value Value to add
@param count The number of "value" to add
@return true if error, false otherwise
*/
template <class T>
bool add_values(const T& value, const ha_rows count);
/**
Add a given number of SQL NULL values
@param num_null_values The number of SQL NULL values to add
*/
void add_null_values(const ha_rows num_null_values);
/// @return the number of SQL NULL values
ha_rows get_num_null_values() const;
/**
Build a histogram from this Value_map.
@param mem_root Where to allocate the histogram and its contents
@param num_buckets The maximum number of buckets to use
@param db_name Database name
@param tbl_name Table name
@param col_name Column name
@return nullptr if error. Otherwise, a histgram with at most "num_buckets"
buckets.
*/
virtual Histogram *build_histogram(MEM_ROOT *mem_root,
const size_t num_buckets,
std::string db_name, std::string tbl_name,
std::string col_name) const = 0;
};
template <class T>
class Value_map : public Value_map_base
{
public:
Value_map();
virtual ~Value_map();
size_t size() const override;
/// @return an iterator pointing to the beginning of the Value_map
typename value_map_type::const_iterator begin() const;
/// @return an iterator pointing to the end of the Value_map
typename value_map_type::const_iterator end() const;
virtual Histogram *build_histogram(MEM_ROOT *mem_root, size_t num_buckets,
std::string db_name, std::string tbl_name,
std::string col_name) const override;
};
Each Value_map will have its own MEM_ROOT to allocate its contents on.
We have several choices when it comes to choosing the underlying container in
the Value_map, each with their own advantages and drawbacks. We have boiled it
down to two main choices; std::map and std::vector.
std::map - Is simple to work with (takes care of duplicates and sorting), and
has a consistent performance. The main drawback is that is has a
rather big memory overhead per node (32 bytes on GCC), while the
data it contains might only be 16 bytes.
std::vector - It is a bit harder to work with (we must take care of duplicates
and sorting ourselves). However, it has very low/no memory
overhead per node, and with a low number of distinct values
(less than 20 000) it has a higher insertion rate on my machine
compared to std::map.
For now, we choose to use std::vector as the undelying container due to the
fact that it is much simpler to calculate its overhead. The performance
characteristics might also be better suited in most use cases. However, we might
change this to a different container during development if anything else is
found to be better suited.
-- Error/status reporting --
Since we are extending the existing SQL command ANALYZE TABLE, we will follow
the same approach that it has when it comes to reporting error/status messages
back to the client.
ANALYZE TABLE will return a result set back to the client with a status message.
When updating histogram statistics, we will use the same layout and structure
for the result set, but extend it with new possible messages. The layout of
the result set is as follows:
- Table
- Op
- Msg_type ("status", "error", "info", "note", or "warning")
- Msg_text
"Op" is as of now always set to "analyze". We will set this to "histogram" for
all histogram operations. The following new messages may be returned to the
client:
(see F-9)
Msg_type: "error"
Msg_text: Cannot create histogram statistics for a temporary table.
When: If running UPDATE HISTOGRAM on temporary tables.
(see F-10)
Msg_type: "error"
Msg_text: Cannot create histogram statistics for an encrypted table.
When: If running UPDATE HISTOGRAM on encrypted tables.
(see F-11)
Msg_type: "error"
Msg_text: Cannot create histogram statistics for a view.
When: If running UPDATE HISTOGRAM on a view.
(see F-14)
Msg_type: "error"
Msg_text: The column 'COL' is covered by a single-part unique index.
When: If running UPDATE HISTOGRAM on columns that are covered by a single-part
unique index.
(see F-7)
Msg_type: "error"
Msg_text: The column 'COL' does not exist.
When: If running UPDATE HISTOGRAM on a column that doesn't exists.
(see F-8)
Msg_type: "error"
Msg_text: The column 'COL' has an unsupported data type.
When: If running UPDATE HISTOGRAM on a column with an unsupported data type
(see F-3 for a list of supported data types).
(see F-2)
Msg_type: "error"
Msg_text: Only one table can be specified while modifying histogram
statistics.
When: If running UPDATE or DROP histogram and specifying more than one table.
Example: ANALYZE TABLE t1, t2 DROP HISTOGRAM ON t1.col1, t2. col2;
Msg_type: "status"
Msg_text: Histogram statistics created for COLUMN.
When: A histogram is successfully created for a column.
Msg_type: "status"
Msg_text: Histogram statistics removed for COLUMN
When: A histogram is removed for a column.
Msg_type: "status"
Msg_text: No histogram statistics found for COLUMN.
When: If running DROP HISTOGRAM on a column that does not exist.
Note that we return row per column specified, given that the table check passes.
-- Sampling --
Since we now have a sampling interface (see WL#9127), we will use sampling in
order to limit the amount of memory used when creating histogram statistics.
We know how much memory we are allowed to used (variable
histogram_generation_max_mem_size), and we know approximately how many rows
there are in the table we are reading from. We also know how much memory is
required for each data type, so we can estimate the number of rows we can fit in
memory using the following method:
size_t row_size_bytes= 0;
for (int i = 0; i < FIELDS_COUNT; i++)
{
// Row count variable.
row_size_bytes+= sizeof(ha_rows);
/*
Data type size. For instance, sizeof(double). For Strings, we are
pessimistic and do:
sizeof(String) + (charset.mbmaxlen * HISTOGRAM_MAX_COMPARE_LENGTH)
*/
row_size_bytes+= sizeof(field[i].DATA_TYPE);
/*
Overhead for each node in the Value_map. This depends on the underlying
container. For instance, std::map on GCC has a overhead of 32 bytes per
node, while std::vector has little/no overhead.
*/
row_size_bytes+= VALUE_MAP_NODE_OVERHEAD;
}
size_t rows_in_memory= histogram_generation_max_mem_size / row_size_bytes;
With this data, we can set the sampling rate to the proper value so that we
don't use more memory than specified by the variable
histogram_generation_max_mem_size.
Note that this is a very pessimistic approach. It assumes that ALL values in
the table are unique. It also assumes that for string/blob columns, the content
is always 42 characters or longer. In many cases, this won't be true. However,
if we want to be sure that we don't exceed the amount of memory specified by
histogram_generation_max_mem_size we have to be pessimistic. We will also take
care of some special cases, ENUM/SET with few entries and such. If we know that
the number of distinct values is very low, we can lower the estimated size.
As mentioned above, the overhead for each node in the Value_map depends a lot
on the underlying container. std::vector has very low/no overhead, while
std::map has a much higher overhead. As mentioned together with the interface
for Value_map, we have initially chosen std::vector as the container. However,
if testing during development shows that a different container might be better,
we will reconsider this.
A side effect by using sampling, is that counting the number of distinct values
in each equi-height bucket isn't trivial anymore. To account for this, we will
use the "unsmoothed first-order jackknife estimator" [1] to estimate the number
of distinct values in each bucket. As the sampling rate decreases, it becomes
more difficult to estimate the correct number of distinct values. However, this
estimator seems to do a fairly OK job. It is also one of the algorithm that
postgresql uses [2].
-- Performance Schema --
All allocations done while creating histogram statistics (Value_map allocations
and allocations done when building the histogram) will be instrumented by the
Performance Schema. We will add a new PSI_memory_key, and use this to instrument
all memory allocations.
The instrumentation event will be named 'memory/sql/histograms'.
-- Testing --
- In order to test the Value_map, we will create new unit tests.
- For the extended ANALYZE TABLE syntax, we will create new MTR tests that
both verifies the syntax, and that histograms are built for all supported
data types.
- Testing of the variable histogram_generation_max_mem_size will be done by
using MTR. We will run ANALYZE TABLE UPDATE HISTOGRAM, and check the amount of
memory used against performance schema.
-- Histogram storage format --
Histogram data is stored as a JSON object in persistent storage. We will extend
the JSON object with four new fields for both equi-height and singleton
histograms:
"null-values": A number between 0.0 and 1.0 telling us the fraction
of values in the column that are SQL NULL values. If 0,
there are no NULL values in the column.
"sampling-rate": A number between 0.0 and 1.0 telling us the fraction of data
that was sampled to create this histogram. A value of 1 means
that all of the data was read (no sampling).
"number-of-buckets-specified": The number of buckets that the user originally
specified/requested.
"data-type": The type of data that this histogram contains. This is needed
when reading and parsing histograms from persistent storage into
memory.
"charset-id": ID of the character set for the data that this histogram
contains. It is mostly meaningful when the "data-type" is set
to "string".
A complete JSON object may look like this:
{
"buckets": [
[
1,
0.3333333333333333
],
[
2,
0.6666666666666666
],
[
3,
1
]
],
"null-values": 0,
"last-updated": "2017-03-24 13:32:40.000000",
"sampling-rate": 1,
"histogram-type": "singleton",
"number-of-buckets-specified": 128,
"data-type": "int",
"charset-id": 8
}
-- Automatic removal of histogram --
Certain DDL operations will cause histogram statistics that belongs to the
affected schema/table/column to be removed.
DROP DATABASE db;
DROP TABLE tbl;
Both of these DDL commands will invoke the function drop_base_table in
sql_table.cc. We will hook histogram removal in this function.
ALTER TABLE tbl CHANGE/MODIFY COLUMN
ALTER TABLE tbl DROP COLUMN
ALTER TABLE tbl CONVERT TO CHARACTER SET
All of these DDL commands will invoke the function Sql_cmd_alter_table::execute
in sql_alter.cc. We will hook histogram removal in this function.
We will also ensure that removing histogram statistics is atomic. That is, if
a DDL operation fails, histogram statistics is not removed. If a DDL operation
is successful, histogram statistics must be removed.
-- Automatic rename of histogram --
RENAME TABLE tbl TO tbl2;
This command will invoke the function do_rename in sql_rename.cc. We will hook
histogram rename in this function.
-- Dump/restore --
We will extend mysqldump/pump to include an option for dumping
"ANALYZE TABLE"-statements for each column that has histogram statistics. The
option is named "--column-statistics", and is set to OFF by default (re-creating
histogram statistics for large tables might take a very long time).
The ANALYZE TABLE statement will also include the number of buckets that the
user originally specified in the ANALYZE TABLE-statement.
The SQL output for mysqldump and mysqlpump might look like this:
/*!80002 ANALYZE TABLE `t1` UPDATE HISTOGRAM ON `col1` WITH 10 BUCKETS; */
The XML output for mysqldump might look like this:
<column_statistics table_name="t1">
<field name="col1" num_buckets="10"></field>
</column_statistics>
-- Summary --
So, what will happen internally when the server receives query "ANALYZE TABLE
posts UPDATE HISTOGRAM ON username WITH 128 BUCKETS;"?
1) The method Sql_cmd_analyze_table::update_histogram gets called.
2) Open and lock the table we will read data from.
3) Create one Value_map for each column we are creating histogram statistics
for. Each Value_map will have its own MEM_ROOT to allocate the contents on.
4) Estimate how many rows we can read into the allowed memory, by using the
method described above under "sampling".
5) Initialize the sampling by calling ha_sample_init() with the appropriate
arguments.
6) Read data from the table "posts" from all the specified columns into a
Value_map. Remember that we have one Value_map for each column. In this
example, we only read data from the column posts.username.
The Value_map contains pairs of "value, count". If the column contains the
following values [1 1 1 2 2 2 2 2 3 3 3 3 3 3 3], the Value_map will
contain the values [(1 3), (2 5), (3 7)]. This is done so that we (in most
cases hopefully) will save some memory, since we expect that duplicate values
will occur.
7) After we have read the data into our Value_map, we build the histograms one
by one by calling Value_map::build_histogram. When the first histogram
is built, we store it to persistent storage ASAP. Then, we build the next and
store it and so on. In this example, we only have one column to build
histogram statistics for; posts.username.
The resulting histograms will have at most 128 buckets since we specified
"WITH 128 BUCKETS" in the SQL command.
What will happen internally when the server receives query "ANALYZE TABLE posts
DROP HISTOGRAM ON username;"?
1) The method Sql_cmd_analyze_table::drop_histogram gets called.
2) Delete the existing entry in the dictionary table "column_statistics" using
the data dictionary machinery.
References
----------
[1] Estimating the Number of Classes in a Finite Population, Peter J. Haas and
Lynne Stokes
[2] https://wiki.postgresql.org/wiki/Estimating_Distinct