Last week, yours truly has pushed a new feature into MariaDB 10.1 tree: ANALYZE statement.

The idea of this feature is to make it easy to compare query plan with query execution. ANALYZE statement will run the statement, and produce EXPLAIN-like output, where optimizer’s estimates are followed by numbers that were observed when running the query. The output looks like this:

Here,

Next to rows there is r_rows column which shows how many records were read from the table.

Next to filtered there is r_filtered column which shows which fraction of records was left after the part of the WHERE condition attached to the table was checked.

I think this should explain the feature. If you want more details, please refer to the KB article ANALYZE statement. It also discusses the meaning of the above EXPLAIN output.

Technical details and further plans

ANALYZE currently uses its own counters. Counting is done for all queries, including non-ANALYZE queries. This should be okay (not have visible overhead) as long as counting just increments integer variables in the query plan, without doing any atomic operations or making syscalls.

The upside of this approach is that it’s now trivial to make Explain in the slow query log also print ANALYZE output. When a query runs slowly, you will be able to know where exactly the optimizer was wrong.

The downside is that getting more data will not be as easy. So far, the most requested numbers beyond r_rows and r_filtered were r_time(amount of time spent in reading the table) and r_io(amount of IO that we did on the table). Counting the amount of time that was spent while reading each row will impose CPU overhead, it is a known problem. Counting IO is just incrementing a counter, but it will require interaction between ANALYZE code and storage engine(s) code, which will add complexity.

There is PERFORMANCE_SCHEMA feature, where others have already spent a lot of effort to count wait time and IO. It’s tempting to reuse it. The problem is, P_S collects the wrong data. P_S counters are global, while ANALYZE needs to count IO for each table reference separately. Consider a self-join. From P_S point of view, it is reading from the same table. From ANALYZE point of view, it is reads from two different table references. I’m currently not sure whether ANALYZE should/could rely on PERFORMANCE_SCHEMA.

A totally different angle is that tabular EXPLAIN output doesn’t allow to show much data (for example, how many rows were there before/after GROUP BY?). Here the solution is clear, I think: support EXPLAIN FORMAT=JSON and then add ANALYZE FORMAT=JSON where we can provide lots of detail.

I had been involved with subquery optimizations fairly closely, but last week I was surprised to find out that MySQL 5.6 does not support derived table merging. This feature was among the subquery features in the abandoned MySQL 6.0. In MariaDB, it was finished and released as part of MariaDB 5.3/5.5. As for MySQL, neither MySQL 5.6, nor MySQL 5.7 has this feature.

So what is this “derived merge”? It’s simple to understand. When one writes complex queries, it is common to use FROM-clause subqueries as a way to structure the query:

select
sum(o_totalprice)
from(select * from orders where o_orderpriority=’1-URGENT’)as high_prio_orders
where
o_orderdate between ‘1995-01-01′ and ‘1995-01-07′

MySQL optimizer processes this syntax very poorly. The basic problem is that FROM-subqueries are always materialized exactly as-specified. Conditions from outside the subquery are applied only after the materialization.

In our example, table orders has an index on o_orderdate, and there is a highly selective condition o_orderdate BETWEEN ... which one can use for reading through the index. But the condition is located outside the subquery, so it will not be used when reading the table. Instead, we will get the following plan:

Note that we see only one line, and the table orders is accessed through an index on o_orderdate. Running EXPLAIN EXTENDED will show why:
Message: select sum(`dbt3sf1`.`orders`.`o_totalprice`) AS `sum(o_totalprice)` from `dbt3sf1`.`orders` where ((`dbt3sf1`.`orders`.`o_orderpriority` = ‘1-URGENT’) and (`dbt3sf1`.`orders`.`o_orderDATE` between ‘1995-01-01′ and ‘1995-01-07′))

There is no FROM-clause subquery anymore. It has been merged into the upper select. This allowed the optimizer to avoid doing materialization, and also to use the condition and index on o_orderdate to construct a range access.

Query execution time for this particular example went down from 15 sec to 0.25 sec, but generally, the difference can be as big as your table is big.