Are you using the operator IN with subqueries in MySQL and wondering about poor performance? Have you tried EXPLAINing your queries and noticed that they are reported as "DEPENDENT SUBQUERY" despite your best intent to use an uncorrelated subquery? If so, read on:

This post describes a MySQL bug (some call it a feature) which has
been open for many years and according to official documentation still
exists in version 5.6 (note that I only checked 5.0.x and 5.1.x myself
and didn't bother to look for info about 6.0, but I have no high
hopes based on the bug's age and character).

Consider the following two SQL statements:

SELECT id FROM t1 WHERE id IN (1, 2, 3)

SELECT id FROM t1 WHERE id IN (SELECT id FROM t2)

In MySQL they have very different performance characteristics:
the first one will work fast as expected, the second one will actually
cause index lookups to be executed repeatedly for each row from t1
rather than just once for the entire statement. Why?

Blame the "optimizer", as explained in the manual. Our fine uncorrelated query becomes forcingly "optimized" into an EXISTS query and thus correlated (rougly speaking; it's a bit more complicated, but you can view the result by using EXPLAIN EXTENDED followed by SHOW WARNINGS). MySQL of course must handle correlated queries per row, leading to disastrous algorithmic performance in what one would assume is a very common case.

I have no clever workaround - you can't "convince" the MySQL optimizer to behave differently. What you can do is use temporary tables to simulate what you really want to happen. But if you do, beware that you may only refer to a temporary table once in a SELECT statement due to another hugely annoying limitation.

2 comments:

In MySQL 5.6, your given query will be converted to a semi-join, or if t2.id column is unique, an ordinary join. That way, MySQL optimizer may process the tables in opposite order, that is, start with t2. This will make it possible to do index lookups repeatedly for each row from t2, instead of t1. This will of course be less total work if t2 is smaller than t1.

(I doubt that any database will be able execute your query with just one lookup for the entire statement; unless t2 contains just one row.)

Thanks for chiming in, Øystein. The execution plan for 5.6 indeed looks as you say, with a materialized t2 subquery and a join between the two tables. The join order would then be picked depending on index selectivity and what not. Still, I can't see a reason why the second form cannot be optimized into the first one if the t2 query returns a reasonably small number of rows (which can be checked after the query has been materialized).