A rule of thumb for choosing column order in indexes

I wanted to share a little rule of thumb I sometimes use to decide which columns should come first in an index. This is not specific to MySQL; it's generally applicable to any database server with B-tree indexes. There are a bunch of subtleties, but I'll ignore those for the sake of simplicity.

Let’s start with this query, which returns zero rows but does a full table scan. EXPLAIN says there are no possible_keys.

SELECT * FROM tbl WHERE status='waiting' AND source='twitter'
  AND no_send_before <= '2009-05-28 03:17:50' AND tries <= 20
ORDER BY date ASC LIMIT 1;

Don’t try to figure out the meaning of the query, because that’ll add complexity to the example 😉 In the simplest case, we want to put the most selective column first in the index, so that the number of possible matching rows is the smallest, i.e. we find the rows as quickly as possible. Assuming that all the columns have an even distribution of values, we can just count the number of matching rows for each criterion.


select sum(status='waiting'), sum(source='twitter'),
  sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
from tbl\G
*************************** 1. row ***************************
sum(status='waiting'): 550
sum(source='twitter'): 37271
sum(no_send_before <= '2009-05-28 03:17:50'): 36975
sum(tries <= 20): 36569
count(*): 37271

This is pretty simple — all I did was wrap each clause in a SUM() function, which in MySQL is equivalent to COUNT(number_of_times_this_is_true). It looks like the most selective criterion is “status=waiting”. Let’s put that column first in the index. Now, pull it out of the SELECT list and put it into the WHERE clause, and run the query again to get numbers within the subset of rows that match:


select sum(source='twitter'),
  sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
from tbl where status='waiting'\G
*************************** 1. row ***************************
sum(source='twitter'): 549
sum(no_send_before <= '2009-05-28 03:17:50'): 255
sum(tries <= 20): 294
count(*): 549

So we’re down to a reasonable number of rows (the count(*) is changing because I’m running this on live data, by the way). It looks like ‘source’ is no more selective within this subset; that is, it won’t filter out any more rows, so adding it to the index would not be useful. We can filter this set further by either the ‘no_send_before’ or the ‘tries’ column. Filtering on either one reduces the count of matches for the other to zero:


select sum(source='twitter'),
  sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
from tbl where status='waiting' and no_send_before <= '2009-05-28 03:17:50'\G
*************************** 1. row ***************************
sum(source='twitter'): 255
sum(no_send_before <= '2009-05-28 03:17:50'): 255
sum(tries <= 20): 0
count(*): 255

select sum(source='twitter'),
  sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
from tbl where status='waiting' and tries <= 20\G
*************************** 1. row ***************************
sum(source='twitter'): 294
sum(no_send_before <= '2009-05-28 03:17:50'): 0
sum(tries <= 20): 294
count(*): 294

That means we can add an index on either (status, tries) or (status, no_send_before), and we will find the zero rows pretty efficiently. Which is better depends on what this table is really used for, which is a question I’m avoiding.
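As a sanity check, the whole technique can be sketched end-to-end in Python using SQLite instead of MySQL. The table, data distribution, and index name below are all invented to mirror the post’s example; like MySQL, SQLite evaluates comparisons to 0/1, so the SUM()-as-count trick carries over unchanged.

```python
import sqlite3

# In-memory table shaped like the post's example (made-up data).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tbl (
    status TEXT, source TEXT, no_send_before TEXT, tries INTEGER, date TEXT)""")

rows = []
for i in range(1200):
    status = "waiting" if i % 100 == 0 else "sent"   # rare value -> selective
    rows.append((status, "twitter", "2009-05-01 00:00:00", i % 30,
                 "2009-05-28 03:17:50"))
con.executemany("INSERT INTO tbl VALUES (?,?,?,?,?)", rows)

# Count how many rows match each criterion, exactly as in the post.
counts = con.execute("""SELECT sum(status='waiting'), sum(source='twitter'),
    sum(no_send_before <= '2009-05-28 03:17:50'), sum(tries <= 20), count(*)
    FROM tbl""").fetchone()
print(counts)  # -> (12, 1200, 1200, 840, 1200): status='waiting' is most selective

# Put the most selective column first in the composite index.
con.execute("CREATE INDEX idx_status_tries ON tbl (status, tries)")
plan = con.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM tbl WHERE status='waiting' AND tries <= 20""").fetchall()
print(plan[0][-1])  # reports a SEARCH using idx_status_tries
```

This is only a sketch; in MySQL you would use EXPLAIN on the real query against the real data instead, since the optimizer’s choice depends on the actual value distribution.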

Cary: thanks for the recommendation I’m adding it to my to-buy list for my next book order. For the benefit of the readers, if there’s some deadly sin I’m committing in my simplified example above (aside from all the assumptions I’ve sidestepped), would you please let me know?

Live data can change dramatically within a few minutes, and a compound index that previously looked efficient can easily turn quite inefficient for the same query against the changed data set. So this technique is more suitable for immutable tables, IMHO.

I have done similar things in the past, though I have mostly looked at cardinality, as I assumed that would be the better general case. (Note: I think this is a foolish assumption when most queries are looking for the same values…)

I have wondered whether the size of the columns should be taken into account. For example, let’s say an INT is slightly less selective than a VARCHAR(50). I would think the smaller datatype should be somewhat preferred, since it creates a smaller index, which I imagine is traversed more quickly.

Also worth noting is the effect of range-type restrictions, for users who are unfamiliar with how they affect queries that might make use of composite indexes.

Rob, it’s a good point about size of the column. This was meant to be kind of an all-things-being-equal example. I didn’t want to get into all the subtleties of range criteria. That’s a big mess to explain clearly 🙂

I think it also makes sense to look at things in a cost/benefit way. Extending the index makes it larger. In your case, adding an extra column that halves the number of rows to be traversed for the query probably makes sense. In other cases, where the gain is only 10%, it does not.

Another thing it often makes sense to consider is covering-index potential, particularly when trying to target several query patterns with a single index.

A covering index slows down queries that can’t use it as a covering index, because it is longer, but it dramatically speeds up queries that can, particularly in I/O-bound scenarios.
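The covering-index effect can be seen directly in a small sketch, here using SQLite rather than MySQL (the table and index names are invented for illustration); SQLite’s EXPLAIN QUERY PLAN explicitly says COVERING INDEX when the index alone satisfies the query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER, c TEXT)")
con.executemany("INSERT INTO t VALUES (?,?,?)",
                [(i, i % 10, "x" * 50) for i in range(1000)])
con.execute("CREATE INDEX idx_a_b ON t (a, b)")

# Every column this query touches is in the index, so the table itself
# never has to be read:
covered = con.execute(
    "EXPLAIN QUERY PLAN SELECT b FROM t WHERE a = 5").fetchone()[-1]
print(covered)       # ... USING COVERING INDEX idx_a_b ...

# SELECT * also needs column c, which is not in the index, so the index
# only locates the row and the table must still be visited:
not_covered = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 5").fetchone()[-1]
print(not_covered)   # ... USING INDEX idx_a_b ... (no "COVERING")
```

In MySQL/InnoDB the equivalent signal is “Using index” in the Extra column of EXPLAIN output.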

Is it possible to use two servers, one for inserting data and one for reporting (running SELECT queries), where the first server replicates to the second, the first has no indexes, and the second does? Indexes slow you down when inserting, updating, or deleting data, but speed you up when selecting.
How can I set up the indexes differently when replicating data?

Wouldn’t it be better to check the cardinality of the columns instead of the selectivity of specific values?
When you want to select the rows with another STATUS, a different column order in the index might be better.

Right. That’s another half-written blog post from long ago, and as people are asking about all the nuances around this I’m regretting trying to simplify things too much. There are a dozen things to think about, though — cardinality has skew, queries use certain values more often than others, etc etc. Ultimately you really need statistical information about the values in the table and the values that are used in WHERE clauses. In the example, the query came from a real application that queries for twitter/waiting FAR more often than anything else, by the way.

For whatever it is worth, I regret my comment. It is easy to be a critic (which is a hair away from what I did: asking questions I knew were hard to answer).

This technique is very good for beginners and a very good option for dealing with an unfamiliar DBMS. One way or another, seat time and understanding of whatever optimizer you are dealing with cannot be replaced with a single blog post. Understanding the optimizer is really key, and that cannot easily be summarized.

It is a pipe dream for many (most?) databases to see a process list whose queries hit all portions of the tables evenly. Much more likely is getting hit with many queries of the same form (possibly exactly the same) on a table that changes just enough for the query cache to be a bad idea. In that case the above technique, with some knowledge of the nuances, is I think very useful…

I ran into an interesting multi key issue last week, wondering if anyone had any suggestions.
Table sent_items looked like
id int(10) autoincrement unsigned
sender_id int(11) unsigned
receiver_id int(11) unsigned
item_id tinyint(3)
sent_time int(11) (stored unix timestamp, stamped by application not server)
in InnoDB
This issue started popping up somewhere around 20 million rows in the table, possibly earlier though we didn’t notice.
There was a single key on receiver_id and a combo key on (sender_id, sent_time) due to a frequent query
to get the 50 most recently received items.
The average user had about a dozen items sent.
select * from sent_items where sender_id = '12345678' limit 50;
would give 12 rows in 0.5 seconds under heavy server load.
select * from sent_items where sender_id = '12345678' ORDER BY sent_time DESC limit 50;
would give the same 12 rows in 5 full seconds under heavy load.
Both queries would be nearly instant under light server load.
A query on
select * from sent_items where receiver_id = '12345678' ORDER BY sent_time DESC limit 50;
With or without ORDER would be 0.5 seconds under load.
Now, sender_id has a very different cardinality than receiver_id, and records are generally grouped by sender, since a person can send many items to a number of different receivers in an instant.
Going to try changing the multi_key to a single key later, but very strange results. Wondering if you have any ideas?