As expected, it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday, hence the lack of an update from me, but nothing could get done until the result copy finished anyway.

Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.
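
As a rough back-of-the-envelope check on the spindle comparison above, here is a minimal sketch. It assumes every spindle delivers about the same random i/o rate and ignores RAID level, caching and controller differences, so treat the result as a ballpark only. The function name is made up for illustration.

```python
def relative_random_io(spindles_new: int, spindles_old: int) -> float:
    """Ratio of random-i/o capacity between two arrays.

    Assumption: identical per-spindle random i/o rate on both machines,
    so capacity scales linearly with the number of working spindles.
    """
    if spindles_new <= 0 or spindles_old <= 0:
        raise ValueError("spindle counts must be positive")
    return spindles_new / spindles_old


# 4 working spindles on synergy vs. 20 on bruno (figures from the post).
ratio = relative_random_io(4, 20)
print(f"synergy has roughly {ratio:.0%} of bruno's random i/o capacity")
```

Under that assumption synergy gets about a fifth of bruno's random i/o capacity, which is why the little random reads and writes hurt so much.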

Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy online. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now.

By the way, bruno was named after Giordano Bruno.

Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split across two systems, but this wasn't helping; in fact it was making things worse. The bottleneck is not network bandwidth but disk i/o. The results have to live somewhere, and they require lots of random reads/writes, so it's best if the upload server saves the results on directly attached storage. If it is also serving them over NFS (or something equivalent) so that a second upload server can write to them, it's too much of an overhead drag. So the upload server has to be a single server which also (1) holds the results and (2) does as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have each upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and the NFS/mounting bottleneck is simply pushed down the pike to the validators, which need to read both piles at once.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Performance-wise I'd expect Synergy to manage about 10-20% of Bruno's throughput. This means that the catch-up from an outage will be slower, with a higher retry count. So we have to sit here and be patient for a bit longer. So what? The data we are processing is already a few months old, and not "time critical".
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Uploads and downloads seem to have settled down nicely, but there's quite a backlog growing for validations - also running on Synergy (aka 'the new Bruno'). They'll be held back by the lack of disk I/O, too - every validation attempt will require finding and retrieving at least two, and possibly several, previously uploaded result files.

Either way, synergy is way more powerful than the demands of the upload server require. Look at the old specs of bruno compared to synergy. The RAM is different by a factor of twelve! To replace bruno completely probably wouldn't cost nearly as much as carolyn and oscar, nor even synergy.
____________

Synergy does not have the ability to install additional drives in its chassis.
To add drives would require a new RAID card to allow external connections, a suitable drive array chassis and the drives. You could ballpark this at $4k.

At this point you could purchase a SuperMicro storage server, motherboard, RAID card, memory and drives for around $5.5k (I'll donate the processors again). I've already been looking at this as an option.

Synergy was never intended to have the duties of Bruno - it was a compute server with 5x 1TB SAS2 Hard drives to allow reliable operation by using RAID 6 (Which has a bunch of overhead but excellent reliability)
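
To put rough numbers on that RAID 6 overhead, here is a small sketch. The capacity formula (two drives' worth of space goes to parity) and the write penalty of 6 back-end i/o's per small random write are the usual textbook figures for RAID 6. The 150 IOPS per drive is purely an assumed placeholder, not a measured value for synergy's disks, and the function names are made up for illustration.

```python
# Textbook RAID 6 small-write penalty: read data, read P, read Q,
# then write data, write P, write Q = 6 back-end i/o's per host write.
RAID6_WRITE_PENALTY = 6


def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """Usable capacity of a RAID 6 set: two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drives - 2) * drive_tb


def raid6_random_write_iops(drives: int, iops_per_drive: float) -> float:
    """Rough ceiling on small random writes, ignoring controller cache."""
    return drives * iops_per_drive / RAID6_WRITE_PENALTY


# 5x 1TB drives as described in the post.
print(raid6_usable_tb(5, 1.0), "TB usable")

# ASSUMPTION: 150 random IOPS per drive, a placeholder figure only.
print(raid6_random_write_iops(5, 150.0), "random write IOPS (rough ceiling)")
```

So the array gives up 2 of its 5TB to parity, and every small random write costs several disk operations, which is a poor match for an upload server workload.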

There was a significant need to extend the overall science of S@H, and this server fit the bill to provide that. Other resources have been diverted away to meet the demands of the users.

Todd

I suppose we should wait and see if Bruno is still viable.

____________
*********************************************
Behold the power of kitty!!

Oh, I fully agree! If it was the RAID card I was prepared to just order one and have it drop-shipped to Berkeley, but it sounds like the problem is more than just that.

Drive arrays require regular maintenance, and the drives should be swapped out on a schedule to prevent failures.

When I worked at Cray Research we had two large storerooms of full height 5.25" 1GB Micropolis SCSI drives and one storeroom would be empty in a month. Drives were used on average for 2500 - 3000 hours before they were replaced in the array. Granted they got beat up pretty hard with insane throughput needs and were in constant operation. But this is not unlike the needs of S@H.

Todd

You worked at Cray???

I was impressed with your knowledge, and your generosity.

Now I am REALLY impressed. That explains a lot.
____________
*********************************************
Behold the power of kitty!!