Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community.

The situation involved a simple InnoDB table in MySQL comprising three columns and 10k rows. One of the columns, an integer, was indexed—and the OP sought to retrieve his entire table sorted on that column:

SELECT * FROM person ORDER BY age

He attached EXPLAIN output showing that this query was resolved with a filesort (rather than the index) and asked why that would be.

Despite the hint FORCE INDEX FOR ORDER BY (age) causing the index to be used, someone answered (with supporting comments/upvotes from others) that an index is only used for sorting when the selected columns are all read from the index (i.e. as would normally be indicated by Using index in the Extra column of EXPLAIN output). An explanation was later given that traversing the index and then fetching columns from the table results in random I/O, which MySQL views as more expensive than a filesort.
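A minimal reconstruction of the scenario (the table had three columns and 10k rows; the name of the third column is an assumption, and exact EXPLAIN output will vary by MySQL version and statistics) might look like:

```sql
-- Hypothetical reconstruction of the table from the question
CREATE TABLE person (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)  NOT NULL,  -- assumed third column
  age  INT          NOT NULL,
  KEY age (age)
) ENGINE=InnoDB;

-- Without a hint, the optimizer chooses a table scan plus a sort;
-- Extra typically shows "Using filesort":
EXPLAIN SELECT * FROM person ORDER BY age;

-- With the hint, the age index is walked instead:
EXPLAIN SELECT * FROM person FORCE INDEX FOR ORDER BY (age)
ORDER BY age;
```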

This appears to fly in the face of the manual chapter on ORDER BY Optimization, which not only conveys the strong impression that satisfying ORDER BY from an index is preferable to performing an additional sort (indeed, filesort is a combination of quicksort and mergesort and therefore must have a lower bound of Ω(n log n), whilst walking through the index in order and seeking into the table ought to be O(n)—so this makes perfect sense), but also neglects to mention this alleged "optimisation" whilst stating:

The following queries use the index to resolve the ORDER BY part:

SELECT * FROM t1
ORDER BY key_part1,key_part2,... ;

To my reading, that is precisely the case in this situation (yet the index was not being used without an explicit hint).

My questions are:

Is it indeed necessary for all selected columns to be indexed in order for MySQL to choose to use the index?
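For reference, the contrast being asked about can be sketched like this (person and age are from the question; the plans described in the comments are what the answer above claims, not guaranteed output on every version):

```sql
-- Only indexed columns are selected: the index covers the query, and
-- EXPLAIN reportedly shows "Using index" with no filesort:
EXPLAIN SELECT age FROM person ORDER BY age;

-- SELECT * requires a row lookup for every index entry, and the
-- optimizer reportedly falls back to a table scan plus filesort:
EXPLAIN SELECT * FROM person ORDER BY age;
```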

Analysis for InnoDB

When the data was loaded as InnoDB, please note that all four EXPLAIN plans used the gender index. The third and fourth EXPLAIN plans used the gender index even though the requested data was id. Why? Because id is in the PRIMARY KEY and all secondary indexes have reference pointers back to the PRIMARY KEY (via the gen_clust_index).

Analysis for MyISAM

When the data was loaded as MyISAM, please note that the first three EXPLAIN plans used the gender index. In the fourth EXPLAIN plan, the Query Optimizer decided not to use an index at all. It opted for a full table scan instead. Why?

Regardless of DBMS, Query Optimizers operate on a very simple rule of thumb: if an index is being screened as a candidate for performing the lookup, and the Query Optimizer computes that it must look up more than 5% of the total number of rows in the table, then:

a full index scan is done if all needed columns for retrieval are in the selected index

a full table scan otherwise
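A sketch of those two cases, using the gender index from the analysis above (the mf table's full column list is an assumption, and the 5% figure is the rule of thumb stated here, not a documented constant):

```sql
-- gender matches roughly half the rows, far above any ~5% threshold.

-- Case 1: id is stored inside every InnoDB secondary index entry,
-- so a full scan of the gender index alone can answer this:
EXPLAIN SELECT id FROM mf WHERE gender = 'M';

-- Case 2: non-indexed columns are needed, so a full table scan is
-- cheaper than an index traversal plus a row lookup per entry:
EXPLAIN SELECT * FROM mf WHERE gender = 'M';
```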

CONCLUSION

If you do not have proper covering indexes, or if the key population for any given tuple is more than 5% of the table, six things must happen:

UPDATE 2012-11-14 13:05 EDT

I took another look at your question and at the original SO post. Then, I thought about my Analysis for InnoDB I mentioned before. The same pattern applies to the person table. Why?

For both tables mf and person

Storage Engine is InnoDB

Primary Key is id

Table access is by secondary index

If table was MyISAM, we would see a completely different EXPLAIN plan

Now, look at the query from the SO question: select * from person order by age\G. Since there is no WHERE clause, you explicitly demanded a full table scan. The default sort order of the table would be by id (the PRIMARY KEY) because of its auto_increment, and the gen_clust_index (aka the Clustered Index) is ordered by that key. When you order by the secondary index, keep in mind that InnoDB secondary indexes have the primary key value (or internal rowid) attached to each index entry. This produces the internal need for full row access each time.

Setting up ORDER BY on an InnoDB table can be a rather daunting task if you ignore these facts about how InnoDB indexes are organized.

Going back to that SO query, since you explicitly demanded a full table scan, IMHO the MySQL Query Optimizer did the correct thing (or at least, chose the path of least resistance). When it comes to InnoDB and the SO query, it is far easier to perform a full table scan and then some filesort rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry.

I am not an advocate of using Index Hints because it ignores the EXPLAIN plan. Notwithstanding, if you really know your data better than InnoDB, you will have to resort to Index Hints, especially with queries that have no WHERE clause.

UPDATE 2012-11-14 14:21 EDT

The data is stored in a special structure called a clustered index, which is a B-tree with the primary key acting as the key value, and the actual record (rather than a pointer) in the data part. Thus, each InnoDB table must have a primary key. If one is not supplied, a special row ID column not normally visible to the user is added to act as a primary key. A secondary key will store the value of the primary key that identifies the record. The B-tree code can be found in innobase/btr/btr0btr.c.
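The two cases the manual describes can be sketched in DDL (table and column names here are illustrative, not from the question):

```sql
-- No PRIMARY KEY and no UNIQUE NOT NULL key: InnoDB generates a
-- hidden 6-byte row ID to act as the clustered index key.
CREATE TABLE no_pk (val INT) ENGINE=InnoDB;

-- With an explicit PRIMARY KEY, each entry in val_idx stores
-- (val, id), so reaching a full row through val_idx means two
-- B-tree descents: one in val_idx, one in the clustered index.
CREATE TABLE with_pk (
  id  INT NOT NULL PRIMARY KEY,
  val INT,
  KEY val_idx (val)
) ENGINE=InnoDB;
```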

This is why I stated earlier : it is far easier to perform a full table scan and then some filesort rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry. InnoDB is going to do a double index lookup every time. That sounds kind of brutal, but that's just the facts. Again, take into consideration the lack of WHERE clause. This, in itself, is the hint to the MySQL Query Optimizer to do a full table scan.

Rolando, thank you for such a thorough and detailed answer. However, it does not appear to be relevant to selecting indexes FOR ORDER BY (which is the specific case in this question). The question did state that in this case the storage engine was InnoDB (and the original SO question shows that the 10k rows are fairly uniformly distributed across 8 items, cardinality shouldn't be an issue here either). Sadly, I don't think that this answers the question.
– eggyal, Nov 14 '12 at 10:34

This is interesting, as the first part was my first instinct as well (it didn't have a good cardinality so mysql chose to use full scan). But the more I read, that rule didn't seem to apply for order by optimization. Are you sure it orders by primary key for innodb clustered indexes? This post indicates the primary key gets added to the end, so wouldn't the sort still be on the explicit column(s) of the index? In short, I'm still stumped!
– Derek Downey, Nov 14 '12 at 18:20

There's a lot of information in your answer and I'd be lying if I claimed to understand it all; however, I still don't feel you've addressed a few issues at hand: 1) filesort must have worse time complexity than traversing the index and seeking into the table; 2) it is possible to sort using the index (by forcing it); 3) the manual is (to my reading) quite clear that the index should be used in this situation - and certainly makes no mention of an "optimisation" that would prefer filesort in this situation. Are you saying that MySQL prefers this poor execution plan by design?
– eggyal, Nov 14 '12 at 19:49


The filesort selection was decided upon by the Query Optimizer for one simple reason: It lacks foreknowledge of the data that you have. If your choice to use index hints (based on issue #2) brings you satisfying running time, then by all means, go for it. The answer I provided was just an academic exercise to show how temperamental the MySQL Query Optimizer can be as well as suggesting courses of action.
– RolandoMySQLDBA, Nov 14 '12 at 19:56

I've read and re-read through this and other posts, and I can only agree that this has to do with innodb ordering on the primary key since we're selecting all (and not a covering index). I am surprised there's no mention of this InnoDB-specific oddity in the ORDER BY optimization document page. Anyway, +1 to Rolando
– Derek Downey, Nov 14 '12 at 20:28

Adapted (with permission) from Denis' answer to another question on SO:

Since all records (or nearly all) will be fetched by the query, you're usually better off with no index at all. The reason for this is, it actually costs something to read an index.

As you're going for the entire table, sequentially reading the table and sorting its rows in memory may be your cheapest plan. If you only need a few rows and most will match the where clause, going for the smallest index will do the trick.

To understand why, picture the disk I/O involved.

Suppose you want the whole table without an index. To do this, you read data_page1, data_page2, data_page3, etc., visiting the various disk pages involved in order, until you reach the end of the table. You then sort and return.

If you want the top 5 rows without an index, you'd sequentially read the entire table as before, while heap-sorting the top 5 rows. Admittedly, that's a lot of reading and sorting for a handful of rows.

Suppose, now, that you want the whole table with an index. To do this, you read index_page1, index_page2, etc., sequentially. This then leads you to visit, say, data_page3, then data_page1, then data_page3 again, then data_page2, etc., in a completely random order (that by which the sorted rows appear in the data). The I/O involved makes it cheaper to just read the whole mess sequentially and sort the grab bag in memory.

If you merely want the top 5 rows of an indexed table, in contrast, using the index becomes the correct strategy. In the worst case scenario you load 5 data pages in memory and move on.
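Applied back to the question's table, the difference is just a LIMIT clause (person and age are from the question; whether the optimizer actually switches plans depends on version and statistics, which is why EXPLAIN is worth checking in both forms):

```sql
-- Whole table: index traversal would mean ~10k random row lookups,
-- so a table scan plus filesort tends to win:
EXPLAIN SELECT * FROM person ORDER BY age;

-- Top 5 rows: at most a handful of index entries and row lookups
-- are needed, so walking the age index becomes attractive:
EXPLAIN SELECT * FROM person ORDER BY age LIMIT 5;
```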

A good SQL query planner, btw, will make its decision on whether to use an index or not based on how fragmented your data is. If fetching rows in order means zooming back and forth across the table, a good planner may decide that it's not worth using the index. In contrast, if the table is clustered using that same index, the rows are guaranteed to be in order, increasing the likelihood that it'll get used.

But then, if you join the same query with another table and that other table has an extremely selective where clause that can use a small index, the planner might decide it's actually better to, e.g. fetch all IDs of rows that are tagged as foo, hash join the tables, and heap sort them in memory.