Small disruptions can cause great disaster

Primary Menu

Tag omconfig

Some days ago I got a call from our support engineer on duty that MySQL on one of our database servers was lagging more than 1000 seconds behind in replication and the server got kicked out of the pool because of the delay. He was unable to find out why and there was absolutely nothing in the mysql log files. When I got the call it was still lagging behind but the lag was slowly decreasing again.

After a quick peek in all our monitoring systems I isolated it to this message:Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]
Apparently not that harmless! 😛

Obviously we did encountered this situation a couple of times before but apparently there was no detection on this machine.

The relearn cycle happens every 90 days and gets first scheduled when the machine gets powered on. Now imagine this happening in a master-master setup where both machines were powered on at the same time. Lucky enough you can use omconfig to reschedule the cycle up to 7 days, but then you obviously need to have detection in place.

Why did nobody come up with the idea to have a dual battery backed up cache with alternating relearn cycles? That way you can have your battery relearn without the controller going back into write-through mode. 😉