
A new dimension to MySQL query optimizations – part 2

This is part 2 of “A new dimension to MySQL query optimizations”. If you haven’t read part 1 yet, I suggest you skim through it before reading on.

To summarize, the problem at hand is this: Given a query with a join between two or more tables, the MySQL optimizer’s mission is to find the combination of join order and access methods that gives the lowest possible response time. The optimizer does this by calculating the cost of each combination and then picking the cheapest one.

Consider the following query:

SELECT *
FROM employee JOIN department ON employee.dept_no = department.dept_no
WHERE employee.first_name = "John" AND
      employee.hire_date BETWEEN "2012-01-01" AND "2012-06-01"

The optimizer will calculate the cost of the alternative plans as follows:

total cost = cost(access_method_table1) +
             prefix_rows_table1 * cost(access_method_table2)

As explained in part 1, the problem with this calculation is that the cost of accessing table2 should not be multiplied by the number of rows returned by the chosen access method on table1, but rather by the number of rows in table1 that evaluate to true for all conditions. Up until 5.6, MySQL had this wrong.
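To make the difference concrete, here is a minimal Python sketch of the formula above. The function name join_cost and all cost values are hypothetical and chosen only for illustration; they are not taken from MySQL's cost model. Only the shape of the calculation follows the formula.

```python
def join_cost(cost_table1, prefix_rows_table1, cost_table2):
    """Total cost of a two-table plan: access table1 once, then
    access table2 once per prefix row coming out of table1."""
    return cost_table1 + prefix_rows_table1 * cost_table2


# Hypothetical cost units (assumptions, not MySQL numbers).
COST_TABLE1 = 10.0
COST_TABLE2 = 2.0

# Prefix rows taken from the access method's row estimate.
cost_from_access_rows = join_cost(COST_TABLE1, 8, COST_TABLE2)

# Prefix rows taken from the rows that satisfy all conditions.
cost_from_filtered_rows = join_cost(COST_TABLE1, 1, COST_TABLE2)

print(cost_from_access_rows, cost_from_filtered_rows)
```

With these made-up numbers, the first variant charges for eight accesses to table2 and the second for only one, so a plan can look far more expensive than it really is.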

And that’s where condition filtering comes in: it provides a much better prefix rows estimate by taking into account not only conditions that are used by the chosen access method but all other relevant conditions as well.

How it works

Before we start with the examples, here are the most important things you need to know:

The conditions are investigated for each table, and a condition will contribute to the filtering estimate for that table only if:

it refers to the table at hand, and

the condition depends only on constant values or values from tables earlier in the join sequence, and

the condition is not in use by the access method

If a condition contributes to the filtering estimate, the estimate will be based on the range optimizer’s analysis since this is very accurate. If that is not available, index statistics are used instead. If those are not available either, heuristic numbers are used.

Conditions are assumed to have no correlation.

The condition filter estimate is shown in the “filtered” column of EXPLAIN as a percentage. While “rows” shows the estimated number of rows fetched by the chosen access method, prefix rows for the next table is “rows” multiplied by “filtered”.

Condition filtering is only calculated if it can cause a change of plan. Since it only affects the cost of accessing tables later in the join sequence, it is not calculated for the last table. Thus, by definition it is not calculated for single-table queries. However, there is one exception: it is always calculated for EXPLAIN so that you can see its value.

It can be turned on and off with the optimizer_switch flag condition_fanout_filter (“SET optimizer_switch='condition_fanout_filter=on'” etc).
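The "no correlation" assumption in the list above can be sketched in a few lines of Python. This is only an illustration of what independence implies (the individual selectivities are multiplied), not MySQL's actual code; the function name and the selectivity values are hypothetical.

```python
from math import prod


def combined_filter_percent(selectivities):
    """Combine per-condition selectivities (fractions between 0 and 1)
    into one filter percentage, assuming the conditions are independent."""
    return prod(selectivities) * 100.0


# Hypothetical selectivities for two conditions on the same table that
# are not used by the chosen access method.
selectivities = [0.5, 0.25]

filtered = combined_filter_percent(selectivities)
print(filtered)
```

If the conditions are in fact correlated, a product like this will underestimate or overestimate the real fraction of matching rows, which is the price of the simplifying assumption.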

MySQL estimates that it will read 8 rows through ref access. Now let’s try to join with department. MySQL 5.6 now assumes that prefix rows for department is 8 and the chosen access method of department therefore has to be executed 8 times. However, we already know that the correct number is 1 since there is only one row that matches both conditions. Although we can’t see this from the EXPLAIN, the cost of accessing department is greatly exaggerated because of this.

Now let’s take a look at MySQL 5.7. Notice that prefix rows (rows * filtered = 8 * 16.31% = 1.3) is now much closer to reality. Just like before, 8 in the “rows” column is the estimated number of rows that will be read by ref access, while the new condition filtering information is shown in the “filtered” column. Since first_name="John" is used by the ref access method, 16.31% is the condition filtering effect estimated from the remaining BETWEEN condition. When joined with department, prefix rows for department is now 1.3 instead of 8. In turn, the cost calculation is much more accurate.

If we force a table scan, none of the conditions are used by the access method and the filtered column is updated accordingly. Now we get rows * filtered = 1024 * 0.12% = 1.23, which is also pretty close to the correct value of 1.
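The two prefix rows estimates quoted above can be reproduced with a short Python sketch. The helper name prefix_rows is hypothetical; the input numbers are the “rows” and “filtered” values from the two examples.

```python
def prefix_rows(rows, filtered_percent):
    """Prefix rows for the next table: rows read by the access method,
    scaled by the percentage that passes the remaining conditions."""
    return rows * filtered_percent / 100.0


# ref access on first_name: 8 rows, 16.31% pass the BETWEEN condition.
ref_access = prefix_rows(8, 16.31)

# Forced table scan: 1024 rows, 0.12% pass both conditions.
table_scan = prefix_rows(1024, 0.12)

print(round(ref_access, 2), round(table_scan, 2))
```

Both access methods start from very different “rows” values but arrive at roughly the same prefix rows estimate, which is what you would hope for, since the number of rows that satisfy all conditions does not depend on how the table is read.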

These are of course only basic examples to illustrate how it works. It gets much more interesting once we look at many-table joins, e.g. the queries in DBT-3 that show up to 88% reduction in response time. I might follow up with a part 3 to explain these bigger queries later. In the meantime, you can experiment with your own data by downloading the MySQL 5.7 labs release.

Oh, and by the way: “Condition filtering” is only one of many planned steps towards a new and improved cost model which includes brand new features and a lot of refactoring. There are some subtle traces of this work in the 5.7.4 release; a few new APIs that don’t do much on their own. Stay tuned for more info on this subject!

3 thoughts on “A new dimension to MySQL query optimizations – part 2”

I’d like to test this feature, but I cannot find a MySQL version that has it. The mentioned http://labs.mysql.com/ page lists several builds, but none is marked to contain new optimizer features. In 5.7.4-m14 the “condition_fanout_filter” optimizer switch doesn’t exist.

So where can I find a MySQL release (preferably source code) *with* this feature?

As explained in part 1, the problem with this calculation is that the cost of accessing table2 should not be multiplied with the number of rows returned by the chosen access method on table1 but rather the number of rows in table1 that evaluate to true for all conditions. Up until 5.6, MySQL had this wrong.

Can you explain the assertion that “until 5.6, MySQL had this wrong”? It might help me to understand something I’m investigating.