Kind of a bumpy weekend. So we moved that database (which handles the seti.berkeley.edu website) from Dan's new but oddly crashy desktop onto my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection, both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

But speaking of outages: completely separate from those previous power issues (which have since been fixed), there was a brand new problem affecting just this building (and all the projects within it, including SETI@home/BOINC). This one was worse. It started in the middle of the night, and by the time anybody could do anything, power had gone up and down several times, with some outlets delivering half power, etc.

The repairs were much faster this time, and we were stable again around noon, but upon turning everything back on we found we had completely lost thinman, the main web server. Totally dead. However, quite luckily, we happened to have a spare old frankenstein machine kicking around, and I was able to do a "brain transplant," i.e. swap the drives from thinman into this other machine. Now this other machine thinks it is thinman and is working quite well as a web server. Dodged a major bullet there.

I also happened to have my old desktop nearby, so I'm using that as I diagnose the new crashy one. Not sure who is responsible for all this damage and lost time, but it definitely shouldn't be us.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school? ;)

Or at the very least, some line conditioners, which are usually built into UPS units. Line conditioners will clean up noisy power, and most of the time they also handle very strong surges just fine. That may help keep weird power scenarios from taking out machines... or dirty/noisy power may be what is causing those strange and random crashes.

One of my long-since retired crunchers continues to do other things for me around the house, and it was acting weird and would randomly crash. Sometimes it would go weeks between crashes; other times it would crash repeatedly for an hour or so. I ran memtest on it and discovered the RAM needed more voltage. Instead of the 2.6 V it wanted, I already had the board set for 2.8, so I had to crank it to 2.9, and that fixed it.

Might just be a power issue, either internal or external.
____________
Linux laptop:
Record uptime: 1484d 22h 42m
Ended due to UPS failure, as discovered 14 hours later

... Dan's new but oddly crashy desktop on my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. ...

One relatively new possibility, in addition to the usual checks, that's quick & easy to eliminate: there's been a general trend lately of supplying XMP-profile (or other high-frequency, tight-latency) memory that defaults to a "normal" undervolt.

After a typical burn-in period of 14 hours or so, the crashy symptoms appear & gradually worsen over time. Heavy RAM usage patterns in particular then throw either the controller or the RAM modules over the edge, while memtests often come up clear.

The quick check is to make sure the DIMM voltage matches the XMP profile spec, and that VID (the memory controller voltage in the CPU) is set to about 70% of that (which is for impedance-matching purposes, maximising signal integrity & stopping the memory controller from sinking excessive current).
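As a rough illustration of that 70% rule of thumb (the ratio comes from the post above; the DIMM voltages below are just common XMP-era examples, not specs for any particular kit):

```python
# Sketch of the ~70% memory-controller voltage rule of thumb.
# DIMM voltages here are typical XMP values, purely illustrative.
def target_controller_voltage(dimm_voltage: float, ratio: float = 0.70) -> float:
    """Memory-controller (VID) voltage at roughly 70% of the DIMM voltage."""
    return round(dimm_voltage * ratio, 3)

for vdimm in (1.35, 1.50, 1.65):
    print(f"DIMM {vdimm:.2f} V -> controller ~ {target_controller_voltage(vdimm):.3f} V")
```

So a kit whose XMP profile calls for 1.50 V on the DIMMs would want the controller somewhere around 1.05 V, per this rule of thumb.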

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

I have wondered this out loud before, but doesn't the campus have some kind of comprehensive insurance coverage that might cover the loss of equipment in cases like this?
I find it hard to believe that lab and computer equipment might not be covered.
Even most basic homeowner's insurance covers this kind of thing, for example in the case of a lightning strike.

It might be worthwhile to ask some serious questions of the proper authorities.....

Just sayin'.
____________
***************************************
I am still the kittyman.
Accept no imitations.

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school?

I agree. I don't know whether the Seti server closet and other kit has rack-mounted UPSs, but if not then it really should. No UPS will last through a 5 or 6 hour outage, but they will shut kit down gracefully well before then without any damage, and they protect against the brownouts mentioned by Matt. Seti having its own automatic backup diesel generator would probably be unrealistic.

But if these power problems and outages are likely to continue over the summer then the project has to take steps to protect its kit. If UPS's are needed then I am sure we could start an emergency fund raising drive once we know what is needed and the cost. I'll most certainly chip in what I can afford.

They can, but it takes big batteries.
The main use for UPSs is protection from surges, brownouts & power failures. If the failure is long enough, then it allows the hardware to be shut down normally.
Larger UPS units are designed to keep systems up until a backup generator can come online, and then keep things up when that shuts down & the system switches back to mains power.
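The "big batteries" point is easy to make concrete with a back-of-envelope runtime estimate (a sketch only; the battery capacity, load, and inverter efficiency figures are assumptions, not measurements of any Seti kit):

```python
# Back-of-envelope UPS runtime estimate; all figures are illustrative assumptions.
def ups_runtime_minutes(battery_wh: float, load_w: float,
                        inverter_eff: float = 0.85) -> float:
    """Usable battery energy divided by the load, converted to minutes."""
    return battery_wh * inverter_eff / load_w * 60

# A small rack unit with ~100 Wh of battery feeding a 400 W server:
print(round(ups_runtime_minutes(100, 400), 1))
```

That works out to roughly 13 minutes: plenty for a graceful shutdown, nowhere near a 5-6 hour outage.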
____________
Grant
Darwin NT.

I do not think it is reasonable to try to get a UPS system that will do more than protect the machines and allow them enough time to gracefully power off after a short time running with no power. Maybe 10 minutes.
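For what "graceful power-off after maybe 10 minutes" can look like in practice, apcupsd (a common open-source UPS monitoring daemon) supports shutdown triggers like these; the directive names are real, but the values here are illustrative assumptions, not a recommendation:

```
# /etc/apcupsd/apcupsd.conf (excerpt) -- illustrative values only
TIMEOUT 600       # begin shutdown after 600 s (10 min) on battery, regardless of charge
BATTERYLEVEL 10   # ...or when the reported battery charge drops to 10%
MINUTES 3         # ...or when estimated runtime remaining falls to 3 minutes
```

Whichever condition is hit first triggers the shutdown, so the machines power off cleanly well before the batteries are flat.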

Power conditioning and voltage regulation, if they are not already a part of the lab's UPS system, should be considered. Every time you have an outage like this one (especially in an older building), some other part of the electrical system gets stressed. You might have cascading problems every few weeks for the next year before everything is all ironed out.

We've floated the idea of power-stabilizing hardware to the lab; I'll let everyone know if they decide they'd like some of the same.

It's heartbreaking that our two new workstations got crippled, but given the past few weeks it's understandable. We'll replace the damaged components ASAP once Matt et al. figure out the issues.
____________

Even some small UPS equipment for the PCs would help keep the power gremlins from disturbing circuitry and shortening lifespans. I have all my gear at home on UPS for graceful shutdown and power conditioning at all times...

Find out who your campus engineer is and raise hell... Let people know they are destroying equipment with their shenanigans. This should be up-channeled as much as possible to let management know this is costing them money, time, equipment...
____________
Never engage stupid people at their level, they then have the home court advantage.....