mardi 10 avril 2012

Faster subqueries with materialization

In a previous post, I analyzed how a query of the famous DBT3 benchmark was
optimized by MySQL. It was this query, named "Q16" in the DBT3 jargon:

select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#23'
and p_type not like 'LARGE PLATED%'
and p_size in (43, 1, 25, 5, 35, 12, 42, 40)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;

Here is a brief recap of conclusions I had drawn:

for this query, MySQL tranforms the IN condition to EXISTS and then
evaluates it with the "unique_subquery" technique, which does an index
lookup into the subquery's table.

IN is evaluated 120,000 times (once per combined row of the outer tables).

The total execution time of query Q16 is 0.65 seconds.

If you look at the original subquery, before IN becomes EXISTS, you will see that it's not correlated, which means that it does not mention columns of tables of the top query's FROM clause ('partsupp' and 'part'). Thus, its resultset is constant throughout execution of the entire top query; here it is:

The transformation to EXISTS, because it injects equalities like `partsupp`.`ps_suppkey` = supplier`.`s_suppkey`
into the subquery's WHERE clause, makes the subquery correlated: it thus has to be executed 120,000 times, so we do 120,000 times this:

an index lookup in 'supplier' (which has 10,000 rows)

a test of the found row(s) against the LIKE condition.

Intuitively, determining the 4-row resultset once for all, and injecting it into the top query should yield better performance - it is fast to evaluate ps_suppkey not in (358, 2820, 3804, 9504) .
Starting from the just released MySQL 5.6.5, this transformation is automatically done by the Optimizer, and is called subquery materialization. The subquery's resultset is determined once for all and stored into an in-memory temporary table. If the temporary table has only 4 rows as in our example, searching for a match in it can be done with a scan; but if it had more rows, a hash index would help greatly. So a hash index is always created on the temporary table's columns. Last, this index is unique: there is no point in storing duplicates, and they would make the table bigger. After this one-time setup has been completed, each evaluation of IN simply does a hash index lookup into the 4-row temporary table.

My former colleague Timour Katchaounov started developing this feature years ago, when working for MySQL/Sun. In the last months, after a round of intensive QA, we have fixed some last bugs in it, in preparation for releasing in 5.6.5. But the feature still had one limitation: it was applicable only if IN was placed at certain positions in the query. For example, it couldn't be used with NOT IN. And query Q16 has a NOT IN! so the Optimizer could not apply subquery materialization to it, and was thus stuck with using EXISTS. Sad!

Why it could not work with NOT IN, is not very easy to explain. It has to do with NULL values, because they sometimes prevent using the hash index. To give an idea, look at this:(NULL, 1) NOT IN (SELECT ...)
Per the SQL standard, if the subquery's resultset contains at least one row of the form (x,1) where x is any number (or NULL), then the IN condition is neither TRUE, nor FALSE, it is UNKNOWN. So is the NOT IN condition, because it is the negation of IN, and NOT(UNKNOWN) is UNKNOWN.
Here are example of such rows: (NULL,1), (421,1), (236,1), (5329,1), ad infinitam.
We can see that those rows will not be found by a lookup in the hash index: this index is defined on the two columns, it has a usual "prefix-only" behaviour, which means that it cannot be used to search for "any value in first column, then 1 in second column". As long as the sentence starts with "any value in first column" a table scan is necessary; we should read each row of the temporary table and compare its second column with 1 until we find a matching row. And that:

will drag subquery materialization's performance down

will drag subquery materialization's code complexity up.

And I have even not covered all problems here: there can be more than two columns, there can be more than one NULL in the left argument of IN, there can also be NULLs inside the subquery's resultset.

In some lucky cases, the scan can be avoided, for example: SELECT * FROM table1 WHERE (a,b) IN (SELECT ...)
If (a,b) is (NULL,1), the IN will be UNKNOWN or FALSE. It will be UNKNOWN if the subquery's resultset contains one (x,1) as seen above; otherwise it will be FALSE. No matter what, it will not be TRUE, and this is all that WHERE wants to know - (a,b) can thus be rejected without doing a scan.
Now, for SELECT * FROM table1 WHERE (a,b) NOT IN (SELECT ...)
i.e. SELECT * FROM table1 WHERE NOT ((a,b) IN (SELECT ...))
things are different: if (a,b) is (NULL,1), the IN will be UNKNOWN or FALSE, as we said. So NOT IN will be UNKNOWN or TRUE. "Hum, can you be more specific?? I need to know if it's TRUE", will the WHERE evaluation code ask. Then we have to do the scan...

So now you understand why subquery materialization was restricted to certain placement of IN.

What I have done recently is to lift this restriction in two simple, however common, cases:

If all outer and inner expressions are not nullable, then no NULL can get in the way, so there is no problem.

If there is only one outer expression (and thus there is only one inner expression), figuring out the correct TRUE/FALSE/UNKNOWN answer is immediate. Understanding why... is left as an exercise to the reader :-)

Those two cases are independent: as long as one is satisfied, subquery materialization can apply to IN, no matter where it is placed (NOT IN, etc).

It turned out to be very easy to code this: I had a working prototype in an afternoon.

Q16 happens to meet the criteria of both cases: columns 'ps_suppkey' and 's_suppkey' are declared NOT NULL (first case), and the subquery has only one outer and one inner expression (second case).

So nowadays MySQL can, and does, use subquery materialization for query Q16; thanks to it, the execution time is down from 0.65 seconds to 0.47 seconds, which is a 25% improvement!

The new technique is visible in EXPLAIN. I want to first show how EXPLAIN was with the EXISTS transformation, so I temporarily disable subquery materialization, then run EXPLAIN, enable subquery materialization, run EXPLAIN:

If you compare, the big difference is that the third line says SUBQUERY and not DEPENDENT SUBQUERY anymore. DEPENDENT SUBQUERY means that it has be executed once per row of the top query. SUBQUERY means that it is executed only once.EXPLAIN FORMAT=JSON, another new feature in MySQL 5.6.5, shows more details of materialization:

Don't forget to scroll the box above to the right, because lines are long. This shows that:

For the top query, we read a row from 'part', then one row from 'partsupp', then execute the subquery.

the very first execution of the subquery materializes (<materialize>) select `dbt3`.`supplier`.`s_suppkey` from `dbt3`.`supplier` where (`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') into a temporary table

Each subquery execution does a lookup on the primary key of this temporary table (<primary_index_lookup> ... in <temporary table>)

Going further down, we see how the temporary table will be filled: it will be the resultset of a table scan ("access_type": "ALL") of 'supplier' with a filtering LIKE condition.

This is the end of this post. I hope that it puts in good light the work we have put into 5.6. There are many other Optimizer features in this version, like EXPLAIN FORMAT=JSON and others; they are described in my colleagues' blog posts.

3 commentaires:

If the data is large, in-memory materialization won't work. Even if it is small, it may be preferable to create a permanent materialized view, that is maintained by triggers. See www.materialized.info. Such a permanent view typically yields a 5x to 10x performance improvement for slow queries.

hello Cliff, thanks for your comment! Note that, if the data is large, the Optimizer will do on-disc materialization. Regarding the use of a materialized view, it can certainly be faster, at the expense of rewriting the query (to use the view instead of the subquery) and creating the triggers. Subquery materialization gives a performance improvement "out-of-the-box".