Small disruptions can cause great disaster

At the company I work for we are still running Percona Server 5.1 in production and are slowly heading towards a Percona Server 5.5 rollout. It took a lot of preparation over the past few months (writing a my.cnf conversion script, for example) and a lot of testing. A couple of machines have already been upgraded to 5.5 this week to compare performance and stability. So far those machines have proved stable enough to keep on 5.5 and, even better, we already see a couple of benefits! However, the title wouldn't have mentioned the blues if everything had been a breeze, right? 😉

The first problem we ran into was that our Cacti templates broke due to the changed InnoDB status output. So I headed to the Cacti templates' issue tracker to see whether the problem was already known. It was, but unfortunately not fixed yet. Luckily, writing the fixes myself wasn't much of a problem.

Secondly, we ran into the issue that the history list grew from a "steady" 200 to 4000 after the upgrade. Searching on this topic revealed a problem with the purge operations, but it was not clear to me what exactly the problem was. According to the MySQL documentation the defaults should suffice. Uhm, right?

Now, I knew some things had changed in 5.5 regarding purging: a separate purge thread had already been introduced in 5.1, but could it have become so different since then? So I tried to find out what each and every purge variable does:
innodb_max_purge_lag = 0
innodb_purge_batch_size = 20
innodb_purge_threads = 1
relay_log_purge = ON
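To see these values and the current history list length on your own server, you can query them directly (assuming you have privileges to run SHOW statements on the instance):

```sql
-- current purge-related settings
SHOW GLOBAL VARIABLES LIKE '%purge%';

-- the history list length is reported in the TRANSACTIONS section
-- of the InnoDB status output ("History list length NNN")
SHOW ENGINE INNODB STATUS\G
```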

At first I assumed that increasing the batch size would make purging more efficient: the larger the batch, the more it could handle, right?
Wrong: the larger you set it, the later purging kicks in! The MySQL documentation says this about it:

The granularity of changes, expressed in units of redo log records, that trigger a purge operation, flushing the changed buffer pool blocks to disk.

So the name is actually confusing! In our case we went from 20 to 40, making things worse, and then from 40 to 10, which brought the history list down from 4000 to 1800.
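In practice that meant lowering the value in my.cnf — a minimal sketch, assuming a server restart (the variable was not reliably dynamic in 5.5):

```ini
[mysqld]
# purge is triggered after this many redo log records,
# so a smaller value makes purge run more often
innodb_purge_batch_size = 10
```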

Then I decided to see what the purge lag would do. Changing the purge lag as described by Peter did indeed lower the history list for a short while, but MMM also kicked the 5.5 server out of its pool because it started lagging behind in replication! So this is definitely something to keep in the back of your mind!
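For reference, innodb_max_purge_lag is dynamic, so it can be experimented with at runtime. The value below is only an example; a non-zero value throttles DML once the history list exceeds it, which is exactly the behavior that caused our replication lag:

```sql
-- delay DML operations once the history list grows past this many transactions
SET GLOBAL innodb_max_purge_lag = 100000;

-- revert to the default (no throttling)
SET GLOBAL innodb_max_purge_lag = 0;
```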

I did not change innodb_purge_threads to 0, since this machine runs in our production environment. Also confusing is the deprecated innodb_use_purge_thread, which could be set to values other than 0 and 1 but is marked as experimental.

This graph shows best how it worked out:

In the end I lowered the purge batch size to 1 and the history list went back from 4000 to its “normal” 200.

In my line of work the number of visitors grows fast near the end of the year, which means the number of simultaneous users grows even faster. In our architecture that also means the inter-communicating processes cause the number of database connections to grow exponentially.

Something that wasn't a problem at all during the past few months all of a sudden became a problem we were unable to put a finger on. All we found was that it sometimes took a bit longer to make a database connection. We started timing it, and 70% of the slow connections (longer than 1 second) turned out to take just about 3 seconds. In the end we always got a connection, so what was there to worry about?

Well, with concurrency causing the number of connections to a single database to grow from 20 to 1024, there are lots of bursts going on. So naturally we started tuning the thread cache (the thread_cache_size directive), which had a small positive effect but did not resolve the issue.
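thread_cache_size is dynamic in 5.5, so it can be raised without a restart (the value below is illustrative); whether the cache is large enough can be judged by how fast Threads_created keeps growing:

```sql
-- let bursts of incoming connections reuse cached threads
SET GLOBAL thread_cache_size = 64;

-- if this counter keeps climbing under steady load, the cache is still too small
SHOW GLOBAL STATUS LIKE 'Threads_created';
```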

Then, after three days of searching, I found this little blog post by Don MacAskill, where he ran into a similar problem with the same 3 seconds (that "magic" 3 seconds is the classic TCP SYN retransmission timeout: a connection attempt dropped from a full listen queue is silently retried by the client about 3 seconds later). Consulting Percona pointed him towards increasing the back_log directive (no idea to what extent). Reading that, some of our engineers facepalmed; we first doubled (later on quadrupled) the back_log directive, and this solved the issue for us.
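A minimal my.cnf sketch of that change — the value is illustrative, since we simply multiplied our previous setting; note that the effective queue length is also capped by the OS (net.core.somaxconn on Linux):

```ini
[mysqld]
# size of the listen queue for incoming TCP connections;
# back_log is not dynamic, so this requires a restart
back_log = 512
```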

Especially during these busy days you run into these nice little, not-so-well-known (nor well-documented) directives that just make the difference. Thanks, Percona! 🙂