I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?
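For the curious, a minimal sketch of what a janitor pass like that might look like. The real splitter_janitor's internals aren't described here, so the directory, the 30-day "ready for deletion" test, and the addresses below are all made up for illustration:

<?php
// Hypothetical sketch of a splitter_janitor-style pass: find old files,
// take some action on them, and mail a summary of what was done.
$dir    = '/disks/thumper/raw_data';   // assumed data directory
$cutoff = time() - 30 * 86400;         // assume "ready for deletion" = 30+ days old
$report = array();
foreach (glob("$dir/*.dat") as $file) {
    if (filemtime($file) < $cutoff) {
        unlink($file);                 // the "some action" - here, deletion
        $report[] = basename($file);
    }
}
// Mail the results, so that *no* mail arriving is itself a warning sign.
mail('matt@example.com, jeff@example.com', 'splitter_janitor report',
     count($report) . " files removed:\n" . implode("\n", $report));
?>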

Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock, to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold, but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but Informix has always been perfect at recovering from such ugly shutdowns). Even with an empty process queue we still had mounting problems.

Normally one of the first things to try is a reboot, as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news; the bad news is that our fears were realized, and we're in the middle of another long, painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.

Well, that ate up my whole morning. Then I moved on to my PowerPoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, and we finally found that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader (a much easier proposition than rebooting thumper), and that broke another dam.

- Matt -- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Asking the following sort of question usually results in an interesting and occasionally entertaining reply, but I need to ask it here because the tone on the NC board is getting more and more emotional of late.

Looking at the amount of fix-up/patch-up that goes on at Berkeley, I wondered if things would be smoother if one of the many machines were removed and its function hosted on one of the other boxes, reducing the number of project servers from N to N-1. Error rates and the like go up with the complexity of a system, so reducing the complexity a bit would reduce the theoretical peak performance but might be a step forward in the long run if the maintenance overhead comes down. Any thoughts?

I had thoughts along the same lines but decided that since Matt and company deal with this on a daily basis, they certainly must know the best way to utilize the equipment they have available. Too bad SETI is not a government project where throwing more and more money at the problem is acceptable.

Boinc....Boinc....Boinc....Boinc....

Has anyone noticed what seems to be a language-file PHP script echoed directly at the top of the page?
My browser language preferences ask the server for a French locale before an English one, so maybe it only appears when your browser locale is something other than English.

With more and more queries now running against the replica server instead of the live one, it's getting quite difficult (and all the more important) to spot whether website data is live or pre-recorded.

With that in mind, would it be possible to code something in 'sah_status.html' (the Server status page) to compare the timestamp behind '10 May 2009 22:20:08 UTC' with 'now' (or now(), or gstate.now, or whatever webservers use), and if there's an unreasonable discrepancy - say, more than an hour - flag a warning box saying "data delayed - may not be reliable"?
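As a rough sketch of the idea (the page is generated by PHP, but the variable name and warning markup here are my own invention - the real code presumably already has the timestamp handy in some form):

<?php
// Hypothetical staleness check: compare the status page's "last updated"
// timestamp against the current time and warn if it is over an hour old.
$status_updated = '10 May 2009 22:20:08 UTC';   // however the page stores it
$age = time() - strtotime($status_updated);     // age of the data in seconds
if ($age > 3600) {
    $hours = round($age / 3600, 1);
    echo '<div class="warning">Data delayed ', $hours,
         ' hours - may not be reliable</div>';
}
?>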

Supported, but a bit academic at this moment as the Status page hasn't been updated since 10 May 2009 22:20:08 UTC.

F.

On the contrary, now is exactly the time when we need it - it's so easy to let your eye slide over the update time and process the rest of the data as if it were current. My old eyes need a big cartoon STOP sign - especially if it's still stalled in four and a half hours' time, when the time will be correct again and only one digit in the date will give the game away.