mardi 29 novembre 2011

Understanding the unique_subquery optimization

If you use the EXPLAIN SELECT statement to see how your subqueries are treated by MySQL, you may sometimes meet the "unique_subquery" optimization. Here is how the manual describes it:

"unique_subquery: this type replaces ref for some IN subqueries of the following form: value IN (SELECT primary_key FROM single_table WHERE some_expr); unique_subquery is just an index lookup function that replaces the subquery completely for better efficiency".

Few weeks ago, while I was reviewing a patch fixing a bug in unique_subquery, I got a "simplification" pulsion. I told myself that:

unique_subquery is an optimization for a special case of simple subqueries (single inner table, using index, no aggregates);

we have a more general system around, used for more complex subqueries, naturally capable of handling simple ones too if we wanted;

this general system does not have the bug in question...

Then I wondered: what if we removed the unique_subquery optimization, and let the general system handle this simple subquery? This would certainly simplify code, and thus maintainance...But before removing it, of course, we should check whether unique_subquery brings a significant performance benefit.

So today I'm testing unique_subquery against the DBT3 benchmark. I grab a copy of MySQL 5.6.3, and focus on the sixteenth query of DBT3, which contains a subquery (in red) suitable for handling by unique_subquery:

select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#23'
and p_type not like 'LARGE PLATED%'
and p_size in (43, 1, 25, 5, 35, 12, 42, 40)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;

The only change, as expected, is that "unique_subquery" becomes "eq_ref". The used index is the same (the primary key of the "supplier" table). The optimizer has the same notion of unicity: "unique_subquery" and "eq_ref" both denote that a single lookup is needed, as the index is UNIQUE. Same index, same number of lookups: execution could well be as fast with "eq_ref" as it was with "unique_subquery".
But... no. Query now executes in 0.80 seconds. 23% slower than with unique_subquery!

Finer-grained timing shows that the extra 0.15 seconds are indeed lost in the subquery evaluation code.

To understand this, let's follow the execution in detail, based on EXPLAIN output above.

First line of EXPLAIN output: we do a table scan on the "part" table ("type=ALL" means "table scan") . The "rows" column of EXPLAIN suggests that we are going to have 199,498 rows of "part".

Second line of EXPLAIN output: for each row from the "part" table, we do an index lookup ("ref") into the "i_ps_partkey" index of the "partsupp" table; apparently such lookup will find two rows ("rows=2").

At this point, we have a row made of needed columns of "part" and of "partsupp". An upper estimate of the number of those rows is 199,498 multiplied by 2: 400,000. Actually, the real number is around 120,000 (there has been filtering going on, as the "Using where" indicates).

Then we evaluate the WHERE clause and thus the "NOT IN (subquery)" predicate (the "DEPENDENT SUBQUERY"). 120,000 evaluations of such predicate. And that's where the difference is.

EXPLAIN EXTENDED and then SHOW WARNINGS show how the predicate
looks like. Let's start with the case where unique_subquery is disabled:

/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(/* select#2 */ select 1 from `dbt3`.`supplier` where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`

Above, the part in red says that

ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)

has been transformed from "IN(non correlated subquery)" to "EXISTS(correlated subquery)", yielding this:

not exists (
select
1
from
supplier
where
s_comment like '%Customer%Complaints%'
AND s_suppkey = ps_suppkey
)

Evaluating this EXISTS() evaluates the new subquery. This means all the subquery evaluation machinery: calls to JOIN::exec(), sub_select(), evaluate_join_record()... Sure, deep down it does an index lookup like unique_subquery does, but all those function calls have a cost, and so has all the logic which is lying around ready to handle any complexity in the subquery, as this is generic subquery evaluation code ("if group_by do this", "if order_by do this", "if left_join do this": none of those if()s are entered, but deciding to enter them or not has a cost). Plus some initialization code. Plus some de-initialization code. This overhead, repeated 120,000 times, amounts to 0.15 seconds...

Now, EXPLAIN EXTENDED when unique_subquery is enabled:

/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`) in supplier on PRIMARY where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`)))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`

The optimizer has first done the same transformation (IN to EXISTS) as we saw before, then has done one more transformation, and EXISTS has become, as written in red above:

<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`)
in supplier on PRIMARY
where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%')
and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))

which is directly an index lookup ("<primary_index_lookup>"), followed by an additional WHERE clause. So the overhead of full-blown subquery evaluation is
avoided. And this overhead is not neglectable, compared to the index lookup (assuming the relevant index pages are already in memory).

So the conclusion of my experiment is that unique_subquery is worth having. I'll have to direct simplification pulsions to some other code!

Note that there also exists a similar "index_subquery" optimization applying to non-unique indices. And it's worth having, for the same reasons.