This morning’s outage

My colocation facility gave me a heads-up that they were replacing one of their giant UPSes yesterday, but that no problems were expected.

It turns out they had a problem, and my cluster lost power this morning around 6:30am EST. Things are configured for automatics failover, but what I think happened is the servers came up before the storage array did, so they couldn’t start properly, so they just sat there. It took me powering down and restarting things in the right order to resolve the problem, and once I did that everything came up cleanly and automatically, just like it was supposed to.

I’ll adjust the restart timers and see if that prevents this from happening in the future.