We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in a different building from the one that houses our machine room, so the UPSs had exhausted their fairly short on-battery time by the time he got there. It was a perfect storm in that Matt and I (who work a few feet from the machine room) were both out.

Most machines came through OK, but three did not. Lando, an older administrative workhorse (and splitter machine), appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database and its replica suffered irreparable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both have battery-backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not.

In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup. A simple recovery from the backup did not work. After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage. Then we took a fresh backup of the database in case the next step did more harm than good. The next step was to run an extensive table check/repair on all tables in both the production and beta databases. All tables reported OK. Good! We then brought the projects up and used the fresh backup to restore the replica.
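For readers curious about the point-in-time step, here is a minimal sketch of how the binary log replay could be assembled. It assumes the standard `mysqlbinlog` client and its `--stop-datetime` option; the log file names and the timestamp are placeholders, not the values used at the lab, and the helper name `binlog_replay_argv` is made up for illustration.

```python
from typing import List, Sequence


def binlog_replay_argv(binlog_files: Sequence[str], stop_datetime: str) -> List[str]:
    """Build a mysqlbinlog command that replays logs up to stop_datetime."""
    # Binary logs must be replayed in sequence, so sort on the numeric suffix.
    ordered = sorted(binlog_files, key=lambda name: int(name.rsplit(".", 1)[1]))
    return ["mysqlbinlog", f"--stop-datetime={stop_datetime}", *ordered]


if __name__ == "__main__":
    argv = binlog_replay_argv(
        ["binlog.000012", "binlog.000010", "binlog.000011"],  # hypothetical names
        "2000-01-01 12:00:00",  # placeholder, not the real outage time
    )
    print(" ".join(argv))
```

The output of such a command is normally piped into the `mysql` client against the freshly restored database. Taking a fresh backup immediately afterwards, as described above, gives a fallback before the table check/repair runs.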

One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. To make a long story short, our server complex has enough cross dependencies that if machines come down in the "wrong" order, other machines can hang. Plus, some of our old UPSs would hiccup and cause a spurious shutdown (I'm not sure if our current crop has this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery-backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending IO. This very simple design has served us well but, as we see, not in all cases.

Eric came up with a good compromise. We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition. Nothing is dependent on this machine, so a spurious shutdown would not be a disaster. This should prevent a disaster of this magnitude from recurring.
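The sequence described above (stop the database, unmount the file system, then shut down) might be sketched roughly as follows. This is only an illustration: the commands, the service name `mysql`, and the mount point `/data` are assumptions, not the lab's actual configuration, and how the on-battery event is detected is left out entirely.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Sequence[str]]

# Hypothetical commands and mount point; substitute whatever the host really uses.
REPLICA_STEPS: List[Step] = [
    ("stop database", ["systemctl", "stop", "mysql"]),
    ("unmount data file system", ["umount", "/data"]),
    ("shut down", ["shutdown", "-h", "now"]),
]


def run_command(argv: Sequence[str]) -> bool:
    """Run one command; return True if it exited with status 0."""
    try:
        return subprocess.run(list(argv)).returncode == 0
    except OSError:
        return False  # command missing or not executable


def handle_on_battery(
    steps: Sequence[Step], run: Callable[[Sequence[str]], bool] = run_command
) -> List[Tuple[str, bool]]:
    """Run every step in order, recording success or failure of each.

    A failed step does not stop the sequence: the final shutdown is still
    attempted, since staying up on a draining battery is the worse outcome.
    """
    results = []
    for label, argv in steps:
        results.append((label, run(argv)))
    return results


if __name__ == "__main__":
    def dry_run(argv: Sequence[str]) -> bool:
        print("would run:", " ".join(argv))
        return True

    handle_on_battery(REPLICA_STEPS, run=dry_run)
```

Stopping the database and unmounting first means the data files are already quiescent if the shutdown itself hangs, which is the point of the ordering.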

Would it be possible to write a script, living on the machine that needs to be shut down last, that sends instructions to the others to shut down in the proper order? Maybe it could even get feedback when each one is down, so it knows when to send the next command?
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.
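
A rough sketch of the kind of script David describes: send a shutdown command to each host in turn and wait for confirmation before moving on. How a command is sent and how "down" is detected (ssh, ping, and so on) is site-specific, so both are passed in as functions; the names `send_shutdown` and `is_down` and the host names in the test are hypothetical.

```python
import time
from typing import Callable, Iterable, List


def shutdown_in_order(
    hosts: Iterable[str],
    send_shutdown: Callable[[str], None],
    is_down: Callable[[str], bool],
    timeout_s: float = 120.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Shut hosts down one at a time, waiting for each to confirm.

    Raises TimeoutError if a host does not go down within timeout_s,
    so that later hosts are not shut down out from under it.
    """
    confirmed = []
    for host in hosts:
        send_shutdown(host)
        deadline = time.monotonic() + timeout_s
        while not is_down(host):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{host} did not confirm shutdown")
            sleep(poll_s)
        confirmed.append(host)
    return confirmed
```

Whether to stop or carry on after a timeout is a judgment call. With a battery draining, carrying on regardless may be the better choice, in which case the `raise` could become a logged warning.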

Let me add my thanks for the detailed explanation -- and knowing that you've reconfigured an on-battery shutdown process to reduce database rebuild scenarios is nice to see. It seems that power outages on the campus are like 100 year storms -- the years are not 'human years' but rather equivalent years for some much shorter lived entity.

My most heart-felt thanks to the Boys With The Baling Wire, once again.

While some scientists may be doing frivolous work on multi-million dollar equipment, the SETI team continues to attempt to answer an age-old question on equipment (in some cases) that other teams would throw out as obsolete.

It just goes to show that it's not the toys you have to do it with, it's the spirit of the adventurers.

To echo others, THANKS, Jeff, for the extensive write-up of the problem. It is a very well-written detailed explanation of the situation that should satiate (almost) all of us geeks out here. Thanks to you and Eric for the careful troubleshooting to get the databases back on track. Matt must be crazy with envy that he missed all the excitement! (NOT)
Whit

Thanks very much, Jeff, for this detailed report.
Well done!
It was a run of bad luck that hit the lab.
I think many members should read this post so they can judge the extent of the problem and stop extrapolating without knowing the facts.

Maybe there is a way to do something like a solar-powered battery backup system. I know there is not much funding for the project, and I am sure the solution would cost a pretty penny. But I was thinking that with all the volunteers, there might be a way to do a separate fundraiser for this. A friend of mine lives far out from the city, in an area where the power tends to go out even in a sprinkle of rain. We ended up installing a system like this to power his entire house in case the power goes out again, and sure enough it has, many times. The longest he had to run his house on solar battery backup was about 5 to 6 days. I know a house's power demand is nothing near the demand of all the computer systems. I just thought I would suggest something that might be possible, and maybe help spark ideas from others as well. ;)

Just a suggestion for you. Because all the servers are dependent on one another, why don't you use a "UPS monitor client/server" setup? The best approach is to use a laptop to monitor all the UPSs involved; if one fails, the software orders all the other servers to shut down in the right order. The downside is that all the servers have to be restarted manually in the right order, or via WOL (Wake-on-LAN) if you can script it.
Hope it helps!
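
To make "the right order" concrete, one option is to record which server depends on which and derive both orders from that. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The server names in the test are invented for illustration and say nothing about the real dependencies in the lab.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List


def startup_order(depends_on: Dict[str, Iterable[str]]) -> List[str]:
    """Return servers ordered so that each one starts after what it depends on.

    depends_on maps a server to the servers it needs in order to run.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on: Dict[str, Iterable[str]]) -> List[str]:
    """Return the reverse: dependents go down before what they depend on."""
    return list(reversed(startup_order(depends_on)))
```

A monitoring machine could walk `shutdown_order` when a UPS reports on-battery, and walk `startup_order` (for example with Wake-on-LAN) once power is back.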

I appreciate the information as to what happened with the outage, however - proper system design and architecture should never allow a system to be brought to its knees.
I understand that this is pretty much a volunteer operation and such - but, imagine - if I told my boss that a 20 minute power outage would result in approx a week of downtime - well - I'm fired. Volunteer or not - it is a lot of downtime for a simple failure. Is it a matter of being able to dedicate time or lack of funding? Time? Get more / another volunteer in charge - funding / equipment - say the word and we fund it.

This is not a criticism of you - only a criticism that it wasn't prevented. Identify what you need to prevent this type of issue in the future and let us know. We pay your bills, so to speak, with our computers - and, when needed, our checkbooks. As a systems admin, I have a hard time with a 20 minute power outage causing this disruption in service - my boss would kill me.
Let us know WHAT you need - and don't be shy. The first obvious need - is a reliable UPS - let's say - at least an hour battery and safe shutdown. How much?

It may be wise to replace them every 3 to 5 years to ensure that you have the capacity to do what you need, and in this case more is always better.

Also, a monitor left plugged into a UPS is a waste of backup time until help arrives. A conveniently placed power strip plugged into the UPS can be used to re-power anything needed for shutdown once help arrives. A monitor that is turned on can drain a UPS in minutes, when the computer alone might have stayed up for an hour or more.

I'd say, once they've run themselves down once, replace them -- they won't give anything like the same amount of run time for the next outage. My UPS lasted about 2 hours running one computer (which, I think, has BOINC set to stop when on battery), my DSL modem and router, and a couple of radio scanners before it shut down the computer. The next time my power went out, just a few months later, it only lasted 20 minutes.

However, I have yet to take my own advice. ;-)

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.