We see that the dump is taken in "block_id" key order, not in primary key order. Later, when we load this table, the inserts arrive in random order relative to the primary key, and that gives us the fragmentation we see here.

You may ask what happens now with Q2? Yes, it gets slow now, because the rows for key "block_id" were inserted out of order.

+--------------------------+
| count(distinct username) |
+--------------------------+
|                  5903053 |
+--------------------------+
1 row in set (2 min 8.92 sec)

mysql> SHOW STATUS LIKE 'Innodb_scan_pages%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| Innodb_scan_pages_contiguous | 45    |
| Innodb_scan_pages_jumpy      | 35904 |
+------------------------------+-------+
2 rows in set (0.00 sec)

As for mysqldump, you can use the --order-by-primary option to force the dump in primary key order.
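For example (the database and table names here are illustrative, not from the post):

```
$ mysqldump --order-by-primary mydb tracker > tracker.sql
```

With this option mysqldump appends an ORDER BY on the primary key to the SELECT it issues for each table, so rows are written to the dump in primary key order.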

So, notes to highlight:

InnoDB fragmentation may hurt your queries significantly, especially when the data is not in the buffer pool and execution has to read from disk.

Fragmentation by secondary key is much more likely than by primary key, and you cannot really control it (though it is possible in XtraDB / InnoDB-plugin with FAST INDEX creation), so be careful with queries that scan many records by secondary key.
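As a sketch of the FAST INDEX creation route mentioned above (the table and index names are hypothetical), dropping and re-adding the secondary key in XtraDB / the InnoDB plugin rebuilds it by sorting rather than by row-by-row insertion, leaving a compact, unfragmented tree:

```
mysql> ALTER TABLE tracker DROP INDEX idx_block_id;
mysql> ALTER TABLE tracker ADD INDEX idx_block_id (block_id);
```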


Author

Vadim Tkachenko co-founded Percona in 2006 and serves as its Chief Technology Officer. Vadim leads Percona Labs, which focuses on technology research and performance evaluations of Percona’s and third-party products. Percona Labs designs no-gimmick tests of hardware, filesystems, storage engines, and databases that surpass the standard performance and functionality scenario benchmarks.
Vadim's expertise in LAMP performance and multi-threaded programming helps optimize MySQL and InnoDB internals to take full advantage of modern hardware. Oracle Corporation and its predecessors have incorporated Vadim's source code patches into the mainstream MySQL and InnoDB products.
He also co-authored the book High Performance MySQL: Optimization, Backups, and Replication 3rd Edition.


Comments (12)

What happened to these variables in later versions? In Server version 5.1.55-rel12.6-log Percona Server (GPL), 12.6, Revision 200, the "innodb_scan%" status variables do not exist, but in Server version 5.0.92-50-log Percona SQL Server, Revision 85 (GPL), they do exist.

It should not be "using index" but rather the primary key. The data is clustered by primary key, so it is generally faster.
Also, SELECT * FROM table can only be "using index" if there is a key which (together with the primary key) covers all columns, which means it is not going to be shorter than the clustered (primary) key, and so even if they are both fragmented it should not be slower.

I suspect the problem here may come from costs: because InnoDB stats are inexact, it may look like there are a lot fewer records to scan using one index than the other, while of course if you're doing a full table scan or full index scan the number of records is going to be the same.

Interesting case.
Technically it was an equal choice in terms of index selection, since the indexed column (block_id) plus the composite primary key covers all the columns. So it was just as valid to pick that index as it was to just scan the primary key. And it would be just as fast.

Of course, when using the output from that query to insert into another table (via a dump or just INSERT … SELECT), and the destination table is InnoDB, then it’s more beneficial to have the select in primary key order. But hey, we all know that unless you specify an ORDER BY, result set order is undefined.
I don't think this is really an optimiser bug. If you want the select ordered, use an ORDER BY clause.

And indeed, mysqldump offers the --order-by-primary option, and using it with InnoDB is good. For other engines, the situation can be quite different. For MyISAM, ordering by primary key may in fact trigger a filesort (which can even go to disk), since a full table scan is likely not in primary key order. It's arbitrary (well, based on insert order plus gap filling).

Right. From a size perspective the indexes are about the same. The primary key is even a bit larger, as it carries more information than the secondary key. However, it is still better to scan in primary key order in most cases because the primary key is less fragmented. It is very common to see an auto-increment primary key or other sequential insert patterns.

Even if the dump occurred in non-PK order, the PK order should have been restored during the import, no? I thought the price for the out-of-order restore would be excessive I/O during the restore, as the B-tree is continually reorganized to cope with the out-of-order inserts. After the restore the table must be in PK order, no?

Migael, in InnoDB’s architecture, it works out faster to insert in primary key order. Yes the B+Tree gets rebalanced either way and a B+tree is by definition sorted, but there’s just less work in this particular scenario. That’s why it’s desirable that a dump for InnoDB tables has the rows in PK order.

Indeed, InnoDB clusters data by primary key. This clustering is, however, per page. For example, if we were using MyISAM we could get a disk seek and an IO for each row in the worst-case scenario; in InnoDB that does not happen. The data is always clustered on a per-page basis, so at least 16K worth of data is read each time. The number of rows per page of course depends on row length and page fill factor; in this case we had a few hundred rows per page.

Now note that in the worst-case scenario no read-aheads will trigger, and all IO will be done by a single thread in 16K blocks. Considering 200 IOPS for a legacy (non-flash) hard drive, you will be looking at about 3 MB/sec read speed, which is 30-50 times slower than the sequential read speed of the same drive.
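The back-of-the-envelope number above can be reproduced directly; the 200 IOPS and 16 KB page size are the assumptions stated in the comment:

```shell
# Worst case: one random IO per 16 KB page, single-threaded.
# 200 IOPS * 16 KB = 3200 KB/sec, i.e. about 3.1 MB/sec.
awk 'BEGIN { iops = 200; page_kb = 16; printf "%.1f MB/sec\n", iops * page_kb / 1024 }'
```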

Vadim, in this case / with that special purpose in mind, why do you compare the runtimes of different queries?
Q1=
SELECT count(distinct username) FROM tracker where TIME_ID >= ‘2009-07-20 00:00:00’ AND TIME_ID = ‘2009-07-20 00:00:00’ AND TIME_ID <= '2009-10-21 00:00:00'

At first glance one would use identical queries and FORCE INDEX usage, wouldn't one?

After several trials, I am doing "defragmentation" of InnoDB 10M+ row tables by exporting to a CSV file and then loading into a new table. Then I rename them, copy over all changes that occurred during the export/import, and delete the old table. Quite fast compared to other methods.
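A sketch of that export/import cycle (the database, table, and file names are hypothetical; changes made between the export and the rename still have to be replayed by hand, as the commenter notes):

```
mysql> SELECT * FROM big_table INTO OUTFILE '/tmp/big_table.csv';
mysql> CREATE TABLE big_table_new LIKE big_table;
mysql> LOAD DATA INFILE '/tmp/big_table.csv' INTO TABLE big_table_new;
mysql> RENAME TABLE big_table TO big_table_old, big_table_new TO big_table;
mysql> DROP TABLE big_table_old;
```

RENAME TABLE swaps the two tables in a single atomic operation, so readers never see the table missing.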