Slow DROP TABLE

It is a known fact that ext3 is not the most efficient file system out there; file removals, for example, can be painfully slow and cause a lot of random I/O. However, as it turns out, it can sometimes have a much more severe impact on MySQL performance than it would seem. When or why? When you run

DROP TABLE

there are several things that need to happen – a write lock is taken on the table so it cannot be used by any other thread, the data file(s) are removed by the storage engine, and in the end MySQL destroys the table definition file (.frm). That's not all that happens; there is one other thing:

VOID(pthread_mutex_lock(&LOCK_open));
error=mysql_rm_table_part2(thd,tables,if_exists,drop_temporary,0,0);
pthread_mutex_unlock(&LOCK_open);

The entire code which removes a table is surrounded by the LOCK_open mutex. The mutex is used in various places inside MySQL, but primarily when tables are being opened or closed. It means that while LOCK_open is locked, no other query can be executed, because all threads are stopped from accessing any table.

This is where the slow file removal on the ext3 file system starts to be a pain. Deleting a 10GB file can take a few seconds, and if that file is a MySQL table, the mutex remains locked for all that time, stalling all queries.

The only alternative seems to be changing the file system – to XFS, for example, which handles file removals much more efficiently:

EXT3:

mysql> drop table large_table;
Query OK, 0 rows affected (7.44 sec)

XFS:

mysql> drop table large_table;
Query OK, 0 rows affected (0.29 sec)

A better solution, through MySQL internals, could be to simulate the table drop by renaming the corresponding data file or files and physically removing them outside of the mutex lock. However, it may not be that simple, because the actual removal is performed by the storage engine, so it's not something the MySQL code can control directly.

This is certainly not a common situation, but it may become a problem for anyone when it's least expected (e.g. when dropping old unused tables).

About Maciej Dobrzanski

Maciek is a former Percona employee. He is an IT consultant with a primary focus on the performance and scalability of systems, databases, and application stacks, and an expert on open source technologies such as Linux, BSD, Apache, nginx, MySQL, and many more. He is a co-author of dba square – a blog about how to manage, scale, and optimize MySQL performance!

Surely the correct solution is, as you say, to change the metadata (quickly) and let the file system operations happen asynchronously. Changing filesystems to solve this sounds pretty backasswards. (That said, how does reiser3 do here?)

The developers of mythtv (a TV recording system) achieved better performing deletes of large files on ext3 through the “slow delete” feature. Typically TV recordings are 1-12GB per hour, so a movie could be 20+GB in size. Without the feature, a delete of the file on ext3 would lock up the entire system for about 20-30 seconds.

The “slow delete” simply truncates the files in chunks in the background.

Generally, temp tables are going to show up in /tmp, which is usually tmpfs. I wouldn't expect to see slow delete mechanics at work in that case, since you're basically memory mapped, although I'll admit to not having run a test :).

For me there are 3 cases where tables are created in /tmp:
– when you use "create temporary table"
– when MySQL needs a temporary table which uses a BLOB or TEXT column
– when MySQL needs a temporary table which is bigger than tmp_table_size/max_heap_table_size

Right ?

And because of this last case, we can’t use /dev/shm to store temporary tables : it may use too much memory.

This issue has been a big problem for us for a long time as we constantly create and drop schemas. I worked around the issue by having a special method that drops tables one at a time and sleeps if the process count gets too high on the master or slaves. Though the solution is far from ideal as large tables still lock things up for too long.

See the mysql Worklog task for this at: forge.mysql.com/worklog/task.php?id=3983

On our production redhat machines at least it seems to be ext3, not sure if that’s default or a quirk of our particular setup though.

/dev/vg00/lvtmp /tmp ext3 defaults 1 2

And indeed tmpfs is virtual memory (RAM+swap) backed (http://en.wikipedia.org/wiki/TMPFS), however for a dedicated MySQL machine that doesn't make too much sense, since if you had the RAM you'd just increase your tmp_table_size etc. to use that RAM directly, I imagine. I don't mind that these large sorts use disk, but I do mind if they block my other select queries for no good reason. As a general rule, locks should absolutely never be held around any I/O operation…

I haven’t examined the sources for temp tables, so I do not know if they suffer from a similar problem. I suppose since they do not require such locking, because there is no possibility for any concurrent access to them, MySQL may do things differently. However obviously if there is a large temporary table located on ext3 partition, it will cause a lot of random I/O on delete thus affecting the system performance.

Actually typically /tmp is created on the system storage (i.e. local disk) using whatever file system chosen for the installation. To boost some occasional but rather smallish temporary tables some people move it to tmpfs (which I think is unreasonable design, because you would rather want MySQL tmpdir on tmpfs rather than the system-wide /tmp).

To explain our rationale for not using tmpfs, and using a disk store like ext3:

1) Our database is typically 100% cached (both data and indexes), 16–20 GB in a few hundred MyISAM tables (out of 24 GB RAM on the machine). It is 99.9999% reads on a very static but highly random set of data.
2) Most sorts are small or medium sized; we have max_heap_table_size=100M and tmp_table_size=100M in order to do these in memory.
3) We have occasional large (2–4 GB+) sorts that we'd like to spill to disk. It's okay if these are somewhat slow. I/O contention is not an issue since all reads are cached and the few writes we have are not performance critical.

I'm just wondering if in this case this lock will bite us for these large sorts (which, AFAICS, we definitely do need on disk, since our RAM is mostly dedicated to caching the db/indexes). I'd go spelunking, but I have no experience with the MySQL codebase really. Just a concerned user here.

Are you testing this with ext3's data=journal, data=ordered, or data=writeback mode? The default is data=ordered, and the closest behavior to XFS can be achieved via the writeback mode (which, btw, will become the default soon).
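For reference, the journaling mode is selected per mount via the data= option; a hypothetical /etc/fstab line (device and mount point are placeholders) would look like:

```
/dev/sda3  /var/lib/mysql  ext3  defaults,noatime,data=writeback  1 2
```

Note that data=writeback only relaxes the ordering guarantees for file data relative to metadata; whether it meaningfully improves unlink latency is worth benchmarking before relying on it.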

Note that ext4 as it is included in RHEL5.3 is clearly marked as a technology preview, not production quality. But since they refused to support XFS in favor of the outdated ext3 for years now, the best bet for heavy I/O machines is Novell SLES anyways

I've found doing a TRUNCATE on a MyISAM table will resize the .MYD file without a lock. You can then drop the table instantly. With InnoDB tables, the .ibd file doesn't get resized/deleted until you OPTIMIZE or DROP.