Wrapping up the weekly "extended outage." Jeff's actually out today, but will be back to turn the servers on tomorrow (i.e. Friday, when I'm usually out).

I finally got around to testing a drive on mork (the mysql server) that the RAID card deemed "failed" at some point, but maybe that was a transient problem as it seems fine now. Nevertheless I went through the rigmarole of pulling that drive, putting a new one in, testing it, making it a new hot spare, etc.

That's all good, but the week has been tainted by mork issues in general. It had one of its regular mystery crashes on Tuesday (followed by a long recovery). Then last night, and again this morning, the RAID mirror of two solid state drives (where we keep the innodb logs) started going flaky on us. The partition would just disappear, sending mysql into fits. We were able to recover quickly, but we're abandoning the solid state drives for now. Honestly, they weren't adding all that much to the i/o picture because we were cautious about how we implemented them. Now I'm glad we were cautious. The upshot of all the above is that we've had to recover the replica as many as four times so far from the weekly backup. What a pain. The latest replica recovery is happening as I type this. All I hope is that all systems are normal and stable by tomorrow.
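Since the failure mode here is a partition silently vanishing out from under mysql, one cheap defense is a watchdog that notices the mount is gone before mysql does. Below is a minimal sketch in Python; the mountpoint path is invented for illustration (the actual layout of mork's filesystems isn't described here), and it just parses /proc/mounts-style text:

```python
# Hedged sketch of a mount watchdog for the innodb-log partition.
# The mountpoint name "/var/lib/mysql-logs" is a made-up example.

def is_mounted(mountpoint: str, mounts_text: str) -> bool:
    """Return True if mountpoint appears in /proc/mounts-style text.

    Each line of /proc/mounts is: device mountpoint fstype options dump pass.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mountpoint:
            return True
    return False

def check_innodb_log_partition(mountpoint: str = "/var/lib/mysql-logs") -> bool:
    """Read the live mount table and report whether the partition is still there."""
    with open("/proc/mounts") as f:
        return is_mounted(mountpoint, f.read())
```

Run from cron every minute or so, a check like this could page someone (or preemptively stop mysql cleanly) the moment the mirror drops, instead of letting mysql discover the hole at write time.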

Everything else is fine. In fact, more than fine as a set of very generous participants donated $6000 towards a new server that will become the new science database server. THANK YOU!! We're still spec'ing out said server, but will go ahead sooner than later now that we don't have to set up a funding drive!

Meanwhile I'm still chipping away at various data analysis projects, while Jeff's been fighting the data synchronization issues that have been creeping in more and more lately. We also had a "design meeting" regarding where to go with public involvement in candidate selection. I'm finding some plug-n-play visualization utilities online, but pretty much I'm finding (like always) it might just be easier and better if I do it all myself with tools I already know. However, some improvements go beyond that scope, so I'm digging into AJAX, which is good stuff to know, I guess.
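For what an AJAX-driven candidate page needs from the server side, a sketch may help: the front end periodically fetches a small JSON payload and redraws, so the back end only has to serialize candidates. Everything below is hypothetical — the field names, the endpoint, and the use of Python are illustrative assumptions, not the project's actual design:

```python
import json

# Hedged sketch: the server half of a simple AJAX setup for a
# candidate-selection page. The candidate fields ("id", "score")
# are invented for illustration.

def candidate_payload(candidates):
    """Serialize (id, score) pairs as the JSON an AJAX call would receive."""
    return json.dumps(
        {"candidates": [{"id": cid, "score": score} for cid, score in candidates]}
    )
```

The browser side would then be a routine XMLHttpRequest (or, today, fetch) loop that GETs this payload and hands it to whatever visualization utility ends up being used.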

- Matt-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Matt.. if I might share.. from my experiences of keeping together antiques that were often poorly "refurbished"..

Many problems clear permanently upon "re-seating".. unplugging, and plugging back in. Other times, taking things out, some surprise drops loose (seen or unseen, 50/50).. and things are then magically "fixed". Whether it was a dirty connection, a bit of dust, someone's Raisinet.. does not really matter as long as it clears. A bad connection invisible to the eye might just need a "bump".. and could be gone forever.

We came up with things such as the "pencil test", where, while monitoring the signal, we tapped the outside case to see if it had any effect. And some of the equipment was old enough to even contain mercury relays, where the mercury would vaporize, re-solidify in obscure places, and refuse to work until we "bounced" the part (hold the edge of the component 3-4" above an anti-static surface, drop and catch on the first bounce, re-insert) to clear it.

These are also good reasons why "fault tolerance" is a good (although expensive) principle.

On the reports going back.. all of these were jotted down as "re-seat to clear."

Because if we told the truth, the whole truth, and nothing but the truth... it would have been the Salem Witch trials all over again.
Janice

Avoid Kingston and consumer OCZ products when it comes to SSDs. Intel is only good if it's SLC memory, and any SSD used in a server must have a supercapacitor to handle server IOs per second. Pretty much the only choice when it comes to server SSDs is the SandForce SF-1500 controller chips with a supercapacitor.

Shouldn't you name that new server after the benefactors? Or is MRJHJT too difficult to pronounce in the office? ;-)

Well, I don't believe we would find a word that represents the six sponsors.

But I would love to see a sticker on the server saying something like "Mainly sponsored by Mark, Richard, Josef, Helli, John and T.A." ;-)
A picture in the SETI@home Photo Album would also be nice, so we can say years later: "Hey, look, 1/6 of this rig was sponsored by me." :-)

One thing that used to work on CRT terminals, back in the '80s, was to give them a "slap upside the screen". Some terminals would come back to life for a time after the slap. Location (and force) was brand-dependent, and with one of the brands there were two methods that worked, depending on symptom: the slap, directed at the upper right of the CRT, or lifting the front of the CRT about an inch and dropping it. IBM 3278s were pretty reliable, but when they went, they could (sometimes...) be brought back by slapping the back right corner, or picking up the back about half an inch and dropping it...

As a (mostly) mechanical engineer, it does my heart good to see my electronic colleagues adopting the time-honoured and tested ways of the mech eng.

Data-wise, we were able to get back to merging our various spike tables together full bore.

How far through merging the spike tables are you now?

The BOINC replica database says "Running" on the left-hand side of the Server Status page, yet beside "Replica seconds behind master" it says "Offline". Is it still recovering after its various crashes throughout the week?
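A likely explanation for the "Running"/"Offline" mismatch: in MySQL, SHOW SLAVE STATUS reports Seconds_Behind_Master as NULL whenever the replication SQL thread is stopped or the replica is mid-restore, so a status page that maps NULL to "Offline" would show exactly this while the process itself is still up. A minimal sketch of that mapping (Seconds_Behind_Master is a real MySQL status column; the display format here is an assumption about how the status page might render it):

```python
# Hedged sketch: how a status page might turn SHOW SLAVE STATUS output
# into the "Replica seconds behind master" field. Seconds_Behind_Master
# is NULL (None here) when replication isn't applying events.

def replica_lag_display(status: dict) -> str:
    """Render the lag field; None (SQL NULL) means replication is not running."""
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return "Offline"
    return f"{lag} s"
```

So "Offline" beside the lag figure is consistent with a replica that is still being rebuilt from the weekly backup, even while the mysqld process shows as running.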