Small disruptions can cause great disaster

In my line of work the number of visitors grows fast near the end of the year, which means the number of simultaneous users grows even faster. In our architecture, the processes that communicate with each other then cause the number of database connections to grow exponentially.

Something that had not been a problem at all during the past few months suddenly became a problem we could not put a finger on. All we found was that it sometimes took a bit longer to make a database connection. We started timing it, and 70% of the slow connections (those taking longer than 1 second) turned out to take just around 3 seconds. In the end we always got a connection, so what is there to worry about?
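For anyone who wants to take the same kind of measurement, here is a minimal sketch in Python. It is not the tooling we used. It only times the bare TCP connect (not the MySQL handshake or login), and the function names, the 1 second threshold taken from above, and the 2.5 to 3.5 second window for "around 3 seconds" are assumptions made for illustration.

```python
import socket
import time
from collections import Counter


def time_connect(host, port, timeout=10.0):
    """Return the seconds needed to open (and immediately close) one TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start


def bucket(seconds, slow_threshold=1.0, window=(2.5, 3.5)):
    """Label one timing as 'fast', 'slow ~3s' or 'slow other'."""
    if seconds <= slow_threshold:
        return "fast"
    low, high = window
    return "slow ~3s" if low <= seconds <= high else "slow other"


def summarize(timings):
    """Count how many timings fall into each bucket."""
    return Counter(bucket(t) for t in timings)
```

Calling `time_connect("db.example.internal", 3306)` in a loop (the host name is hypothetical, 3306 is the default MySQL port) and passing the results to `summarize` shows at a glance whether the slow connections cluster around 3 seconds.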

Well, when concurrency makes the number of connections to one single database grow from 20 to 1024, there are a lot of bursts going on. So naturally we started tuning the number of cached threads (the thread_cache_size directive). This had a small positive effect but did not resolve the issue.

Then, after three days of searching, I found this little blog post by Don MacAskill, in which he ran into a similar problem with the same 3 seconds. Percona, whom he consulted, pointed him to increasing the back_log directive (by how much, I have no idea). On reading that, some of our engineers facepalmed. We first doubled and later quadrupled the back_log directive, and this solved the issue for us.
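Why 3 seconds of all numbers? One plausible explanation, which is an assumption on my part and not something we verified: back_log sets how many pending connection requests MySQL lets queue up, and when that queue overflows during a burst the connection attempt may be silently dropped. The client's TCP stack then retries after its initial retransmission timeout, which was 3 seconds on older Linux kernels (newer ones use 1 second), doubling on every further retry. The sketch below only does that arithmetic. The function name and the defaults are illustrative.

```python
def syn_retry_delays(dropped_attempts, initial_rto=3.0):
    """Total wait in seconds at each retry, with the timeout doubling every time.

    initial_rto=3.0 is an assumption that matches older Linux kernels.
    """
    delays = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(dropped_attempts):
        elapsed += rto          # wait one full timeout before retrying
        delays.append(elapsed)
        rto *= 2                # exponential backoff
    return delays


print(syn_retry_delays(3))  # [3.0, 9.0, 21.0]
```

If this is indeed the cause, a single dropped attempt costs almost exactly 3 seconds, which would fit the cluster we measured, and a larger back_log makes such drops less likely during a burst.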

Especially during these busy days you will find these nice little directives, not so well known (nor well documented), that make just that difference. Thanks Percona! 🙂