Multi Column indexes vs Index Merge

The mistake I commonly see among MySQL users is how indexes are created. Quite commonly people just index individual columns as they are referenced in where clause thinking this is the optimal indexing strategy. For example if I would have something like AGE=18 AND STATE=’CA’ they would create 2 separate indexes on AGE and STATE columns.

The better strategy is often to have combined multi-column index on (AGE,STATE). Lets see why it is the case.

MySQL indexes are (with few exceptions) BTREE indexes – this index type is very good to be able to quickly lookup the data on any its prefix and traversing ranges between values in sorted order. For example when you query AGE=18 with single column BTREE index MySQL will dive into the index to find first matching row and when will continue scanning index in order until it runs into the value of AGE more than 18 when it stops doing so assuming there are no more matching. The RANGES such as AGE BETWEEN 18 AND 20 work similar way – MySQL just stops at different value.

The enumerated ranges such as AGE IN (18,20,30) are more complicated as this is in fact several separate index lookups.

So we spoke about how MySQL uses the index but not exactly what it gets from the index – typically (unless it is covering index) MySQL gets a “row pointer” which can be primary key value (for Innodb tables) physical file offset (for MyISAM tables) or something else. It is important internally storage engine can use that value to lookup the full row data corresponding to the given index entry.

So what choices does MySQL have if you have just 2 separate indexes ? I will either use just one of them to look up the data (and check remaining portion on WHERE clause after data is read) or it can lookup the row pointers from all indexes, intersect them and when lookup the data.

Which method works better depends on selectivity and correlation. If where clause from first column selects 5% of the rows and applying where clause on second column brings it down to 1% using intersection makes sense. If second where clause just brings it down to 4.5% it is most likely better to use single index only and simply filter out rows we do not need after lookup.

Lets look at some examples now:

Shell

1

2

3

4

5

6

7

8

CREATE TABLE`idxtest`(

`i1`int(10)unsignedNOTNULL,

`i2`int(10)unsignedNOTNULL,

`val`varchar(40)DEFAULTNULL,

KEY`i1`(`i1`),

KEY`i2`(`i2`),

KEY`combined`(`i1`,`i2`)

)ENGINE=MyISAM DEFAULTCHARSET=latin1

I made columns i1 and i2 independent in this case each selecting about 1% rows from this table which contains about 10M rows.

Now if we do not allow MySQL optimizer to use second index and hence index merge, what does it turn to ? It is not combined index but single index on another column. This is because MySQL is able to estimate number of rows it will find using both indexes and as they are about the same it picks smaller index. The query takes 290ms which is exactly what we’ve seen before.

We can see here MySQL estimates 20% more rows to traverse, which is wrong of course – it can’t be more than if only index prefix is used. MySQL does not know it as it looks at stats from different indexes independently not trying to reconcile them some way.

Because index is longer query execution takes a bit longer – 300ms

So in this case we see index merge is chosen even though it turns out to be the worst plan. Though technically it is right plan considering the statistics MySQL had available.

It is very easy to disable index merge if you do not want it to run, however I do not know of the hint in MySQL which would allow forcing using index merge when MySQL does not think it should be used. I hope hint would be added at some point.

Finally let me mention the case when Index merge works much better than multiple column indexes. This is in case you’re using OR between the columns. In this case the combined index is useless and MySQL has an option of doing full table scan or doing the Union (instead of intersection) on values it gets from the single table.

I have reverted table to have i1 and i2 as independent columns in this case to look at more typical case:

Note MySQL puts i1 and combined into “possible_key” while it has no way to use them for this query.

The query takes 3370ms if this plan is used.

Note the query takes about 5 times longer even though in case of full table scan about 50 times more rows are scanned. This reflects very large performance difference between full table scan and access through the index, which seems to be about 10x (in terms of cost access per row) even though it is in memory workload.

For UNION case the optimizer is more advanced and it is able to deal with ranges:

As a summary: Use multi column indexes is typically best idea if you use AND between such columns in where clause. Index merge does helps performance but it is far from performance of combined index in this case. In case you’re using OR between columns – single column indexes are required for index merge to work and combined indexes can’t be used for such queries.

Related

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

33 Comments

I realize that it was probably just done for clarity in this example, but creation of both i1 and combined indexes is not necessary (and seems a little redundant). A combined index for columns (a,b,c) is the same as also having indexes for (a,b) and (a).

In your summary, “In case you’re using OR between columns – single column indexes are required for index merge to work and combined indexes can’t be used for such queries.”

Are you saying that if index (a) were not created, MySQL could not perform an index merge (union) with (a,b) and (b) during an OR operation? Would this condition require the creation of (a),(b) and (a,b) indexes if you need to perform AND and OR queries.

Indeed if you want queries which conditions A=5 and B=5 as well as A=5 OR B=5 ran most efficiently you need indexes
(A), (A,B) and (B) but as (A) is prefix of (A,B) it will be enough only to have indexes (A,B) and (B)

Did you ever do the same testing for MySQL 5.0.x instead of 5.1/5.4ï¼Ÿ As in my website, the MySQL query optimizer in different version work quite differently and 5.0.x’s is much better, and sometimes I have to add some query hints to help 5.1 using the right index. Thus, until now, although new features in innodb plugin and xtradb are quite attractive for me, but I cannot push it to our production servers .

Interesting article! I tried to use your examples on one of my already existing table. On that table I had only 2 seperate indexes, so I created a 3th wich is the combination of these 2. But this combined index seems no te be used.

The sql I executed:
explain SELECT * FROM myTableName where publisher_entity_id=0 AND vurl_id=0;

Lance, we’ve seen a lot of that, too. This is why we’re working hard on a new Maatkit tool, called mk-upgrade. I think that one of the side effects of creating this tool will be a quicker feedback loop for the optimizer team — if users are finding the bugs quickly, they will have a much better chance of being fixed quickly.

I was dealing with the same problem today and got to the same conclusion by my self in the end. I wish I checked this page two hours ago 🙂
Anyway, I have a burning question.

I have a table with about 2M rows and only 2 queries:
Query A) SELECT id, title, body FROM messages WHERE type_id = __TYPEID__ AND is_active = __IS_ACTIVE__ ORDER BY date_created DESC LIMIT __PAGE__,__OFFSET__
and Query B) SELECT id, title, body FROM messages WHERE type_id = __TYPEID__ ORDER BY date_created DESC LIMIT __PAGE__,__OFFSET__

The only difference between the two queries is that the second lacks the is_active flag. I have a combined index on (type_id, is_active, date_created).

Good catch. For sorting to work using index you have to have it following = exactly so if you have (A,B,C) index you will be able to use index for sort in case A=5 ORDER BY B
or in case A=5 AND B=5 ORDER BY C but not in case A=5 ORDER BY C.

So in your case if you want to have sorting done using index by second query you will need index (type_id,date_created)

I have a question about the previous post. An index (A,B,C) would work when the query looks like ..WHERE A=5 AND B=5 ORDER BY C. Would it also work when the A and B are ordered differently, thus the query looks like WHERE B=5 AND A=5 ORDER BY C ?

I have an issue that I can’t figure out. I have a huge query that is like: where a,b,c,d,e,f pulling from a DB of nearly 5 million records. I can make a multi column index on a,b,c,d,e,f but the query is dynamic. It may search on a,b,e,f or b,d,f. The amount of indexes I would need to create for this would probably be 36 (because of the leftmost prefix rule). I know that is rather absurd. So, how can I make sure everything is indexed so searches are fast on this large query and I don’t have to make all the indexes?

Indeed this is a hard choice. You may end up having a lot of indexes if you need to have a lot of different search combinations. Some of them (for example double ranges) may not be handled effectively by MySQL at all.

One of solutions we used when indexes can’t be constructed is using Sphinx (www.sphinxsearch.com) which can be thrown on the cluster easily and which can serve queries very fast even without having perfect indexes.

I have a table with over 80,000,000 rows containing GPS information latitude, longitude, date_time, vehicle_id and a few more columns.

I can’t seem to find a good index combination for a query bringing back around 7 days of results for a vehicle. Tried (date_time,vehicle_id) and MYSQL won’t use it, (date_time) and its bringing back thousands of rows under explain. Using innodb. Inserting up to 100 per second.

where
g.date_time between ‘2009-10-15 00:00:00’ and ‘2009-10-15 23:59:58’
and g.vehicle_id = 6261
and ( (io.high is not null and ios.gpio = 8) or i.id is not null or m.id is not null or o.id is not null )
order by g.date_time;

I have a scenario here on multiple column keys. I created an partial index for person table of 2M+ records:

CREATE INDEX firstlastname_idx ON person(firstname(2), lastname(2));

The requirement is for wildcard searching on the first character on the firstname and lastname.

select firstname, lastname from person where firstname like ‘p%’ and lastname like ‘s%’;

It’s working fine but I also notice that when the server is idle for some time and then execute a query it takes more than a minute to finished the query. All subsequent query is a snap. So the problem is on the first query execution.

I do run explain and it using the partial index. I know there is a issue about mysql on warming up but I haven’t found any concrete solution/advice to this.

The issue in my application is I am getting “504 Gateway Time-out” error because of this slow first query.

For the same query”EXPLAIN SELECT avg(length(val)) FROM idxtest WHERE i1=50 AND i2=50;”, why the explain result is different after you set i1=i2.
I think mysql should still choose multi col index not index merge.

PLEASE EXPLAIN SELECT ‘guys commenting above that they need to pay for support, because you are not fed by International Red Cross database department and the cheapest way to pay is buying your book. If that is too much for them, they should at least buy some bananas, rent some monkeys and start sorting their 20M rows table on plain paper.’;

I dont think the conclusion of this post is right. I have a myisam table with a primary key spanning 5 columns. I do a select using a WHERE on every of those 5 columns ANDed. Using the primary key (multicolumn index) it takes 25s, using a single index in one of the columns it takes 1 sec. I did a profiling and most of the 25s is taken in “Sending data” stage. The primary key has cardinality of about 7M and the single column about 80. Am i missing somehting?

I have a question, i know that index or many index on a table improve high perfomance in select queries, but worsen perfomance in Insert, Update and Delete queries. Is it true?
And if is it, how could i balance the choice??? too many indexes and high performance select or high perfomance update with few indexes??? how balance it???

I had a question regarding index merge (intersect). Below are two explains that I executed for the same SQL query with a couple of mins gap on the same mysql prod instance . The first SQL query executed in 0.03 secs whereas the second SQL query when using index merge executes in 3.8 secs. I dont understand why MySQL uses index merge after a couple of mins when previously it was using a single index. I would greatly appreciate your answer.