Description:
First, I am reporting this as a bug, not a performance request, because in this case we have a query that executes very quickly, but once certain indexes are added the query takes over 20 times as long.
It has nothing to do with disk seeks: as the EXPLAIN will show, the time is lost in the "statistics" phase, so it should not be a configuration issue either.
It happens when you have multi-column indexes and perform range scans on a huge list of values ("value IN (huge list)").
It becomes more severe if you have two or more ranges (all of which are part of the index), and/or if the optimizer can choose between multiple indexes.
By making the value lists big enough, I managed to keep the optimizer busy for a full 30 seconds on a 3 GHz CPU, while a full table scan takes less than a second.
I use "mysqladmin proc" to monitor which state the query is in, but "explain" works too, since its runtime shows how long the optimizer takes.
By the way, analyzing the table makes no difference at all, nor does preloading the index into memory.
Also, during the "statistics" phase you get 100% CPU load. Since each query spends a while in "statistics", several such queries can easily end up executing at the same time, making the effect far worse (I have seen queries stuck in "statistics" for 2 or 3 minutes).
This can easily be used to perform a DoS attack on a server.
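As a rough back-of-the-envelope model (my own numbers below, not MySQL's actual cost model): with equality lists on both columns of a multi-column index, the range optimizer can end up building and cost-estimating one range per combination of list values, and it repeats the exercise for each candidate index. For the repro script's list sizes that is already tens of thousands of ranges:

```python
# Illustrative model of why the "statistics" phase blows up: one range per
# combination of IN-list values, per candidate multi-column index.
# (These are the list sizes from the repro script below; the multiplication
# is my own simplification, not MySQL's real cost model.)
n_values = 201   # size of the "n IN (...)" list
t_values = 351   # size of the "t IN (...)" list
indexes  = 2     # index(n,t) and index(t,n)

ranges_per_index = n_values * t_values
total_ranges = ranges_per_index * indexes
print(ranges_per_index)  # 70551 range combinations for a single index
print(total_ranges)      # 141102 combinations across both indexes
```

A full table scan, by contrast, only has to look at the 274700 rows once, which is why skipping the optimizer's range analysis is so much faster here.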
How to repeat:
The Perl script below creates and populates a table on the local DB.
It prints an SQL SELECT statement.
EXPLAIN the SELECT statement to measure the time the optimizer spends,
then run the SELECT with "IGNORE INDEX (n,t)" to force a full table scan (thanks to IGNORE INDEX, the optimizer does not get in the way).
---
use DBI;
my $dbh = DBI->connect("DBI:mysql:test", "root");
$dbh->do("create table i1 (n varchar(20), t int(11), index(n,t), index(t,n))");
for $n1 (1..50000) {
    # create 50k pseudo-random identifiers
    my $n = substr((sqrt($n1 + 0.5) . $n1) x 30, 4, 15);
    $n{$n}++;
    for $t1 (0..(substr $n, 1, 1)) {
        # create up to 10 tags per id (the digit ranges over 0..9)
        $t = substr($n, 1, 3) + $t1;
        $dbh->do("insert into i1 values (\"$n\", $t)");
        $t{$t}++;
        print $c unless $c++ % 100;  # progress; will insert 274700 rows in total
    }
}
print "select count(*) from i1 where n in ("
    . join(",", map { "\"$_\"" } (keys %n)[0..200])
    . ") and t in ("
    . join(",", (keys %t)[0..350])
    . ");";
---
Using the SQL printed by the script:
explain select count(*) from i1 where n in (......) and t in (....);
# takes 11 seconds on my PC in this example
select count(*) from i1 IGNORE INDEX (t,n) where n in (......) and t in (....);
# takes 0.3 seconds
select count(*) from i1 where n in (......) and t in (....);
# executed with the optimizer: takes 12 seconds
# about 30 times as long with the indexes
And this is a case where EXPLAIN even reports "Using index", so when the index is used, the table data does not get read at all.
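For what it's worth, the covering-index behaviour ("Using index") is easy to illustrate outside MySQL too. Here is a small SQLite sketch (SQLite chosen purely because it is easy to script; its EXPLAIN QUERY PLAN reports "COVERING INDEX" in the analogous situation, meaning the table rows are never touched):

```python
import sqlite3

# Sketch (SQLite, not MySQL): when every column the query references is part
# of one index, the engine can answer from the index alone and never reads
# the table rows -- the analogue of "Using index" in MySQL's EXPLAIN.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE i1 (n TEXT, t INTEGER)")
con.execute("CREATE INDEX i1_nt ON i1 (n, t)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT count(*) FROM i1 WHERE n IN ('a','b') AND t IN (1,2)"
).fetchall()
print(plan)  # the plan mentions the COVERING INDEX i1_nt
```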
Suggested fix:
I don't know how the optimizer works on this,
but whichever optimization step it is, it should not be applied to lists beyond a certain size.
---
There is one more oddity. If you ignore only one index ("IGNORE INDEX (n)") or use "FORCE INDEX", you will still notice an unreasonable amount of time spent in the optimizer. I find this odd, because:
- there is only one index that can be used, so the decision is index vs. table scan;
- and the index covers all fields of the query, so the optimizer will find a covering "Using index" plan, which means no table scan is needed.
So maybe there are two different optimizations that could/should be checked in a different order.

- This is not a bug: for some queries, ANY DBMS can be forced to spend much more time in the optimizer phase than in the execution phase.
- There is no CSC bound to this case.
- The problem has always existed in all MySQL servers and was first reported only in 2006.
- Any query that requires long range optimization can always be rewritten to avoid this problem
(e.g. the predicate a IN <long list> can be replaced with a+0 IN <long list>;
range optimization is never performed for the latter expression).
- Currently we don't have a complete and easy solution that would guarantee a bounded time for the range optimizer phase.
For the above reasons, I am moving the bug to 'To be fixed later' and marking it as a 'Feature request'.
Product management will decide in what version this optimization appears.
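The a+0 rewrite works because wrapping the column in any expression makes the predicate opaque to the range optimizer, so no index ranges are built for it at all. The same mechanism can be observed in SQLite (used here only because it is easy to script; the table and index names are mine):

```python
import sqlite3

# Sketch of the a+0 workaround mechanism (SQLite, not MySQL): an expression
# on the column hides it from the range optimizer, so no per-value index
# ranges are built and the engine falls back to a plain scan.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE i1 (n TEXT, t INTEGER)")
con.execute("CREATE INDEX i1_tn ON i1 (t, n)")

with_index = con.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM i1 WHERE t IN (1,2,3)"
).fetchall()
rewritten = con.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM i1 WHERE t+0 IN (1,2,3)"
).fetchall()

print(with_index)  # a SEARCH using ranges on t
print(rewritten)   # a SCAN: the ranges on t are gone
```

The trade-off is the same as in MySQL: the rewritten predicate avoids the expensive range analysis, but it also gives up the index lookups at execution time, so it only helps when the optimizer phase dominates.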

[4 Nov 2008 18:56]
Eric DeCosta

One idea might be to have a parameter or option that would set a maximum time limit that the optimizer can spend evaluating a query.

[7 Jul 2010 10:51]
Michal Kovacik

Is there any chance that this will be fixed in the future? We just upgraded MySQL from 4.0.x to 5.0.51, so this type of query became a problem. We have temporarily fixed this by appending +0 to a column in one of the IN conditions, but I believe this is not supposed to be the correct fix for the problem.

[8 Oct 2015 14:44]
Andrii Nikitin

Adding a few more keywords about the symptoms to improve search:
Query is slow when IN() clause has many elements