We have a MySQL-based application that is very response time sensitive. A MySQL stored procedure needs to respond on average in less than 10 milliseconds and the maximum acceptable response time is 50 milliseconds. We are hitting the average response time with flying colors (more like 3 milliseconds) but over a five hour load test we will get some (~50) responses greater than 50 milliseconds with a maximum response time sometimes as high as 265 milliseconds.

There is a lot of information out there about tuning MySQL and Linux for optimal performance, and I believe we've applied most of the best practices, but I can find very little about how to reduce the incidence of outliers and reduce maximum response time.

How can I diagnose the cause of these outlier response times and what Linux and MySQL tunings can I apply to reduce them?

Your question lacks detail so it's impossible to provide any reasonable answers. What are your queries like? What is the read/write ratio? What hardware are you running on? Are the outliers similar queries? What does your my.cnf currently look like? What version of MySQL? One thing I can guarantee you is that you will have better and more consistent performance running Percona Server than running community MySQL.
–
Aaron BrownNov 9 '11 at 2:22

What is the level of deviation here? If you are meeting your average SLA then there will always be some outliers that are three and four and five standard deviations out there. Something you will need to come to grips with on the SLA front is what level of exception is allowable against your maximum?

Several years ago there was a great commercial from motorola speaking about six sigma manufacturing principals and using a golfer as an analogy, noting that their level of defects was similar to missing one put in 100 years. The key here is that the one exception in 100 years was an allowable exception level. The cost to move from 1:100,000 to 1:1,000,000 can be extraordinary for little payoff.

You note 50 exceptions over the 50 millisecond level. This is out of a total sample size of ??? Is this rate of exception allowable/desirable from a business perspective taking into account the cost required to cut the rate?