Well, shoot. Right at the end of the work day yesterday the air conditioning unit failed. What's worse is that the cause is still a complete mystery. When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.

But that was after a panicked fury of shutting down every server possible to save their lives. Eric was the first on the scene and smelled burned plastic, heard broken fans, and quickly started unplugging everything he could. I came up later after the A/C was on to get the web servers going again (so people could at least see we were still alive).

This morning we rolled up our sleeves and surveyed the damage, which actually wasn't too bad. We definitely lost one UPS, and possibly a power supply in one of our file servers (though it seems okay for now). Eric's hydrogen survey server seemed to take the brunt of the damage, and he was ready to reinstall the OS on what disks remained visible to the system, when suddenly after the nth reboot all drives were visible again and all data was still intact. Well, that was a pleasant surprise.

Still, there was a bit of RAID and database recovery on various servers, which is why the project largely remained offline until the end of the day today. This is still going on, so we probably won't be fully back to normal until tomorrow morning at the earliest.

- Matt-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Uploads have been problematic since well before the air conditioning failure, and well before the Tuesday maintenance window, too - since around 09:00 PST Monday morning, judging by the first post in Number Crunching.

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date, as the power-down will be a hard one; in my case the RAID often lost a drive when it was powered down (very old drives).

Plug a UPS into it that has the ability to trigger a graceful shutdown of the systems when the power fails. So long as the UPS has the capacity to keep power to the systems during the shutdown, you should be in good shape.
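The same idea can be scripted by hand when the UPS itself isn't smart. A minimal sketch of the watch-loop logic; the polling function and shutdown hook here are hypothetical stand-ins for whatever your monitoring tool actually exposes (apcupsd, NUT, etc.):

```python
import time

def watch_ups(on_battery, shutdown, poll_s=5.0, grace_s=60.0):
    """Poll the UPS; if we stay on battery past the grace period,
    trigger a graceful shutdown while there is still charge left."""
    on_battery_since = None
    while True:
        if on_battery():                      # True -> mains power is gone
            if on_battery_since is None:
                on_battery_since = time.monotonic()
            elif time.monotonic() - on_battery_since >= grace_s:
                shutdown()                    # e.g. run "shutdown -h now"
                return
        else:
            on_battery_since = None           # mains is back; reset the timer
        time.sleep(poll_s)
```

In real use `on_battery` would query the UPS status and `shutdown` would invoke the system's halt command; the grace period just rides out short glitches so a one-second blip doesn't take the servers down.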

It was a P390 running OS/2 Warp and VM/ESA. It was so old it didn't have any idea what a smart UPS was. The hard drive failure would happen just because it stopped turning. On the other hand, I would have to do a cold start on VM/ESA, but we never lost a byte of data with that setup. I am not sure other operating systems would be as forgiving, so I provided a warning.
We did have a UPS, but its main function was to filter power glitches. One danger of putting the switch on the UPS is that additional heat will be generated while the UPS reaches its shutdown point. My room was not much larger than a closet, so when things overheated, they needed to be shut down fast.
The system was up 24 hours a day and often unattended, so a failure would most likely happen when no one was around to lay hands on the system.

I remember that box!! <g> In my 'previous life' we were running one of those and we had a 'UPS on steroids' that would power the machine for, I think, 2 hours. It might even have powered our 'server farm', but that was 6.5 years ago and my memory is iffy.

Most anything semi-modern supports some sort of "dumb" signaling from a UPS.

It uses a normal serial port, and only the handshake lines. A line goes "low" to signal "low battery" and the UPS waits for the system to drop a handshake line back when it is safe for the UPS to turn off.

One could build a "UPS" whose only job was to signal low battery when the temperature rose above a certain threshold, and kill power when the system said "okay."
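The handshake described above is simple enough to model in a few lines of state. A sketch of the protocol logic (the class and attribute names are mine, not any real driver's; the two booleans stand in for the serial handshake pins):

```python
class DumbUps:
    """Contact-closure UPS: pulls a line low for 'low battery' and
    cuts output only after the host drops its line back ('okay')."""
    def __init__(self):
        self.low_battery = False   # UPS -> host (e.g. a DSR/CD pin)
        self.host_okay = False     # host -> UPS (e.g. a DTR/RTS pin)
        self.output_on = True

    def tick(self):
        # Only safe to kill power once the host has acknowledged.
        if self.low_battery and self.host_okay:
            self.output_on = False

class Host:
    def __init__(self, ups):
        self.ups = ups
        self.halted = False

    def poll(self):
        if self.ups.low_battery and not self.halted:
            self.halted = True         # sync disks, stop services, halt
            self.ups.host_okay = True  # drop the handshake line back

# The temperature variant is the same machine with a different trigger:
def thermostat(ups, temp_f, limit_f=90.0):
    if temp_f >= limit_f:
        ups.low_battery = True  # "low battery" really means "too hot"
```

The key property is that neither side acts alone: the host never loses power mid-write, and the UPS never stays on after the host has finished shutting down.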

Blasted A/C! Now we're out of the frying pan, can we just avoid the fire this time? ;-)

Good job guys.

Trying to recover some data off a laptop drive for someone at the moment. Of course, there isn't a backup, and this is the 4th system I've tried to recover just recently. The battery has gone in the laptop, and seeing as it is a normal P4 @ 3.00 GHz, the PSU is struggling to supply everything now too.

I've had to take the HDD out and attach it to a desktop. After the usual virus checks etc., I started a chkdsk over 10 hours ago and it's less than halfway through!

Oh well. At least it keeps me busy while you guys were up to your eyeballs in it.

Nice to hear everything is almost back to normal. Unfortunate that a lot of work units were aborted while trying to upload them, as their deadlines had passed during the downtime. I have a feeling more will be aborted, as they still can't be uploaded.

Kinda disappointed but what can ya do aye? You win some, you lose some - gotta keep on truckin' ! :)