Debugging a MySQL stall

After debugging a max_user_connections issue and a cache-expiration problem in the customer's application, we were still experiencing stalls of the mysqld process: MySQL refused all connections for 5 to 20 seconds, with no obvious reason. Let me explain what I did to solve it.

– there is no visible pattern in when the problem happens (confirmed with a graph over 2 days);

– CPU usage increases for about a minute, and this usage is system (kernel) time.

I created a small bash script to collect all the data I needed when the problem happened:

while sleep 1; do
    mysql -e "SELECT 1" > /dev/null 2>&1
    if [ "$?" != "0" ]; then
        STAMP=$(date "+%H%M%S")
        echo "Problem! Start collecting"
        # add here the command lines to be executed
        sleep 5
    else
        echo -ne ".\r"
    fi
done

I first used this script to record a 'top' session, to see which processes were eating all the CPU. To capture top output for 10 seconds, I used the following command line:

top -b -d 1 -n 10 > "${STAMP}_top" &

Remember $STAMP is the timestamp used in my bash script to group outputs.

The result? mysqld is the top process, along with the Linux kernel thread [kworker]. Not useful information by itself, but it definitely shows that something wrong is happening in the kernel.

How do you debug a kernel on a production server? After checking various websites, I decided to use oprofile (http://oprofile.sourceforge.net). It consists of a kernel driver, a daemon for collecting sample data, and several post-profiling tools for turning that data into information. To my mind, the pros of this tool are:

– it samples, rather than tracing every call (compared to tools like strace);

– it can sample the kernel too (again, unlike strace);

– it gathers data in one place, and lets you archive the samples to perform whatever analysis you want on another server.

Unfortunately oprofile was not installed on this server, so we decided to reboot the server in the middle of the night, with the customer's approval. The day after, we were able to sample some data.
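For completeness, here is a sketch of how such a sampling session can be driven with oprofile's opcontrol front end (0.9.x era; run as root; the sleep duration is illustrative, not what was actually used):

```shell
# No uncompressed vmlinux at hand: kernel samples are then attributed
# to the kernel image as a whole, which is enough to spot a hot spot.
opcontrol --no-vmlinux

opcontrol --start    # load the driver and start the collection daemon
sleep 60             # sample while the stall happens (duration is arbitrary)
opcontrol --dump     # flush sample buffers to /var/lib/oprofile
opcontrol --stop     # stop the daemon
opreport -l          # symbol-level report of where CPU cycles went
```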

Note: I was not able to resolve symbols on the same server because of a buggy binutils library, a known problem affecting oprofile. The solution was simple: copy /var/lib/oprofile to another server and use opreport with --session-dir, like this:
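The exact invocation is not reproduced in this copy of the article; a hedged reconstruction, with an illustrative destination path, would be:

```shell
# On the analysis server, after copying /var/lib/oprofile from production:
opreport -l --session-dir=/tmp/oprofile-from-prod
```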

Very interesting! A lot of CPU cycles were spent in compaction_alloc. What is this function? Let's read the kernel source code, mm/compaction.c, around line 410:

406 /*
407  * This is a migrate-callback that "allocates" freepages by taking pages
408  * from the isolated freelists in the block we are migrating to.
409  */
410 static struct page *compaction_alloc(struct page *migratepage,
411                                      unsigned long data,
412                                      int **result)

From experience, I know this can be related to the 'huge pages' feature. Is it enabled?

$ sysctl -a | fgrep vm.nr_hugepages
vm.nr_hugepages = 0

Seems not, but wait a second: what about 'transparent huge pages'? This is closely related and, after some thought, may fit our current problem even better. If you want to read more about THP, this article by Jonathan Corbet is really interesting: http://lwn.net/Articles/423584/. You can also take a look at the official in-kernel documentation, written by the same guy: http://lwn.net/Articles/423592/

To check if THP is enabled, do the following:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
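The active mode is the bracketed word. A small sketch for extracting it in a script (the sample string mirrors the output above; on a real host you would read the sysfs file instead):

```shell
# Sample sysfs content; on a real host, replace with:
#   enabled=$(cat /sys/kernel/mm/transparent_hugepage/enabled)
enabled="[always] madvise never"

# Keep only the bracketed token and strip the brackets.
active=$(echo "$enabled" | grep -o '\[[a-z]*\]' | tr -d '[]')
echo "$active"    # prints "always"
```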

This was confirmed by seeing a khugepaged process in the process list 😉

"So, what happened?"

This server has 128 GB of RAM. Memory can get very fragmented, and once in a while the Linux kernel tries to migrate transparent huge pages to defragment memory. This process seemed too aggressive in our case and resulted in stalls.
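A hedged way to observe this kind of compaction activity (assuming a kernel built with CONFIG_COMPACTION, as here) is to watch the compaction counters in /proc/vmstat around a stall:

```shell
# Counters such as compact_stall only ever grow, so comparing two
# snapshots taken before and after a stall shows activity in between.
grep '^compact' /proc/vmstat
```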

"Can it be fixed?"

The long-term solution would be to make this migration less aggressive. After a few reads on the Linux kernel mailing list, I found that someone had already pointed out that the current code is too drastic and could be made less aggressive: https://lkml.org/lkml/2011/7/26/103

This patch is neither in kernel 3.2 nor in the current trunk, and I do not want to maintain kernel patches in our build process 🙂 We could also tune khugepaged behaviour via the keys in /sys/kernel/mm/transparent_hugepage/khugepaged/.
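For instance, khugepaged can be told to scan less often and to back off longer after a failed allocation (the values below are illustrative, not recommendations):

```shell
# Sleep longer after a failed huge-page allocation, and between scans.
echo 60000 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
echo 30000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
```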

But I needed a quick fix, so we disabled THP completely, like this:

echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
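Note that this echo does not survive a reboot. To make the change persistent, you can either replay it from a boot script or pass the corresponding kernel boot parameter (both are standard mechanisms; the rc.local path is distribution-dependent):

```shell
# Option 1: replay at boot, e.g. from /etc/rc.local
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Option 2: kernel command line in the bootloader configuration:
#   transparent_hugepage=never
```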

We then waited 24 hours: no more stalls! Problem solved \o/

Conclusion

During this process, I discovered that oprofile can be a great ally when you are fighting strange kernel behaviour, or even when you just want to know what MySQL is doing. I could not use gdb or strace because they crashed MySQL every time (this is a very big server, with 64 threads and a heavy load). Oprofile gave me the results I wanted and has no impact on performance when it is not active.

Once again, solving this bug required a broad range of skills, and I'm very proud of it 🙂 Fortunately, 99% of the problems encountered every day are not as tough as this one.

3 thoughts on "Debugging a MySQL stall"

Just a quick note from Okta to thank you for publishing this article: http://www.olivierdoucet.info/blog/2012/05/19/debugging-a-mysql-stall/. We too had a problem with very high system CPU spikes and MySQL stalls. In our case they came 6 or more hours apart and froze the database for minutes at a time. Neither Percona nor Pythian consultants could figure out the cause. After finding your article through Google, we set 'transparent_hugepage=never' and our database has been incident-free for days.

Thank you too for translating your article, as none of the engineers involved in investigating the problem spoke French (and I didn't have the patience to translate it word for word ;).

Please come visit Okta if you are in San Francisco, we owe you a very nice lunch.