The basic idea is to compare the size of the file(s) on disk with the size in the database metadata. The DATA_LENGTH and INDEX_LENGTH fields of the TABLES table in the INFORMATION_SCHEMA database have size information calculated by the storage engine. There are a few points to note:

MyISAM tables – In MyISAM, the metadata is exact, and the files on disk are split into data (.MYD) and index (.MYI). Compare the size of the .MYD file to DATA_LENGTH and the size of the .MYI file to INDEX_LENGTH. You can calculate exactly how fragmented each table is, and decide which tables to defragment. Sample query:

SELECT (DATA_LENGTH+INDEX_LENGTH)/1024/1024 AS sizeMb FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='MyISAM';
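Because MyISAM metadata is exact, the wasted space can be computed by straight subtraction once you have the file sizes and the metadata values. A minimal sketch in Python, assuming you have already fetched DATA_LENGTH and INDEX_LENGTH from INFORMATION_SCHEMA and can see the .MYD/.MYI files (the function name and inputs are illustrative, not part of any tool):

```python
import os

def myisam_fragmentation(myd_path, myi_path, data_length, index_length):
    """Return (data_waste, index_waste) in bytes for one MyISAM table.

    data_length and index_length come from INFORMATION_SCHEMA.TABLES;
    the .MYD/.MYI sizes come straight from the filesystem.
    """
    data_waste = os.path.getsize(myd_path) - data_length
    index_waste = os.path.getsize(myi_path) - index_length
    return data_waste, index_waste
```

Tables whose waste is large relative to their size are the candidates for defragmentation.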

InnoDB tables when using a centralized file for data and indexes (aka “not using innodb_file_per_table”) – Because all of the data and indexes are in one file, you have to compare the sum of the size of all InnoDB tables with the size of the ibdata file(s). However, because InnoDB has estimated metadata, you cannot use a direct comparison. What I do is get a percentage difference between the size on disk and the size reported by the metadata, and if it’s more than about a 20% difference, then the table is likely fragmented enough to justify defragmentation. Sample query:

SELECT SUM(DATA_LENGTH+INDEX_LENGTH)/1024/1024 AS sizeMb FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='InnoDB';
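Since the metadata total is only an estimate, the comparison here is a percentage difference rather than an exact subtraction. A sketch of that arithmetic, where the 20% threshold is the guideline from the text and the function names are my own:

```python
def pct_difference(disk_bytes, metadata_bytes):
    """Percentage by which the on-disk size exceeds the size InnoDB reports."""
    return (disk_bytes - metadata_bytes) / float(disk_bytes) * 100

def likely_fragmented(disk_bytes, metadata_bytes, threshold=20.0):
    # disk_bytes: total size of the ibdata file(s) on disk
    # metadata_bytes: SUM(DATA_LENGTH+INDEX_LENGTH) from INFORMATION_SCHEMA
    return pct_difference(disk_bytes, metadata_bytes) > threshold
```

For example, an ibdata file of 1000 MB against a metadata sum of 700 MB is a 30% gap, which would cross the threshold.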

InnoDB tables that use one file per table for data and indexes (aka “using innodb_file_per_table”) – As explained in the previous example, InnoDB uses estimated sizes for metadata, not exact numbers like MyISAM. So again I figure out the percentage difference, this time between each table’s .ibd file and its metadata size, and if it’s more than 20% then I will recommend defragmentation. Sample query:

SELECT (DATA_LENGTH+INDEX_LENGTH)/1024/1024 AS sizeMb FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='InnoDB';
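With file-per-table, the same check can be applied one table at a time, pairing each table’s metadata size with its .ibd file size. A sketch under the assumption that you have already gathered both numbers per table (the input shape and function name are hypothetical):

```python
def tables_to_defragment(sizes, threshold=20.0):
    """sizes: {table_name: (ibd_bytes, metadata_bytes)}.

    Returns the tables whose .ibd file exceeds the InnoDB size
    estimate by more than `threshold` percent.
    """
    flagged = []
    for name, (ibd_bytes, meta_bytes) in sizes.items():
        pct = (ibd_bytes - meta_bytes) / float(ibd_bytes) * 100
        if pct > threshold:
            flagged.append(name)
    return flagged
```

This produces a shortlist of tables worth defragmenting rather than a single yes/no answer for the whole server.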

Note that every database is different. I use 20% as a guideline, not a hard-and-fast rule. It really depends on how “wrong” InnoDB’s estimate of the size is.

Fragmentation occurs when variable-length data is updated, and when variable-length rows are deleted. You can probably guess which tables have the most fragmentation if you know what queries are being run. If you want to get a better sense of how often updates and deletes are being run, I highly recommend using mk-query-digest against a sample set of binary logs.