This one could probably go in the technical news, but since I haven't blogged in a while, I decided to jot it down here.

Following the large outage, bruno's been having some problems keeping up. Lots of dropped connections. I guess most of you noticed that. It's not a lack of hardware this time, just an over-abundance of connection attempts.

Some of the dropped connections were local file-server connections, which caused some of the http processes to wait around, which in turn caused more dropped connections. Changing some of the TCP tuning parameters helped, but didn't solve the problem.

We did some brainstorming before the outage and came up with some tactics to combat these issues.

We're setting up our router to proxy the SYN/ACK handshakes. That way if we are flooded, the connections will be dropped before they get to bruno. That'll in turn prevent the NFS connections from getting dropped.

We're also getting rid of some configuration remnants from earlier BOINC server code. Currently bruno handles all of the incoming connections and forwards them to other machines when appropriate for uploads and downloads. We can designate other machines as upload or download handlers so that bruno won't have to touch those connections at all.

If that's not enough, we'll set up web servers on some of the other machines and get back to round robin DNS for the upload and download servers.
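For anyone unfamiliar with round robin DNS: one hostname maps to several addresses, and successive lookups get spread across them. Here is a minimal Python sketch of the rotation idea. The `RoundRobin` class and the 192.0.2.x addresses are made up for illustration and are not our actual setup.

```python
from itertools import cycle


class RoundRobin:
    """Hand out server addresses in rotation, roughly the way round robin
    DNS spreads successive lookups across several A records."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("need at least one address")
        self._ring = cycle(list(addresses))

    def next_address(self):
        return next(self._ring)


# Hypothetical upload servers (documentation-range addresses, not real hosts).
upload_pool = RoundRobin(["192.0.2.10", "192.0.2.11"])
first_four = [upload_pool.next_address() for _ in range(4)]
print(first_four)
```

Note that plain DNS rotation does no health checking, so a machine that is down but still listed keeps receiving its share of connection attempts.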

Well, that's enough typing for now. This weekend, one of my fingers had an unfortunate meeting with the leading edge of a 120mm fan blade inside a server case. Fortunately the fan blade broke and it doesn't look like I'll lose the fingernail. I've learned my lesson: always approach case fans from the trailing edge.

--
Eric

Yow. Just did that myself three months ago, and lost half the nail. It's grown back since, but damn was that annoying (I type a lot).

On the up/download issue, good plan on dropping connections at the router vs. the host itself - hopefully that will have the desired effect and give NFS a kick in the pants. Thanks again for all your and your colleagues' hard work in resurrecting Thumper!

Unfortunately the router couldn't handle the load, so we're back to dropping connections at bruno. I spent the last few hours getting a bruno clone, which I have tentatively named Ptolemy, up and running. (It's not quite a clone: dual 3.06 GHz hyperthreaded processors rather than dual 2.8 GHz non-hyperthreaded. Where it came from is a story for another time.) I've got the OS installed and am at the point where Matt and/or Jeff need to work some Apache magic in order to have it be usable in a round robin DNS with bruno.

I'm going to go get some dinner, then I'll mail Matt and Jeff with a progress report. I think they'll be surprised how far I've gotten this evening.

Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the Apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.

If you look at our [url=http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=/router-interfaces/inr-250/gigabitethernet2_3&ranges=d%3Aw&view=Octets]network traffic[/url], you can see what happened when I lowered that to 30 seconds. We're sending about 4 times as much work as we were when I got in this morning.
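A back-of-the-envelope sketch of why the timeout matters so much. The slot count below is invented for illustration and is not a measurement from bruno; the only real numbers are the two timeouts. It looks at the worst case, where every connection slot is held by a stalled client until the timeout fires.

```python
def stalled_turnover_per_hour(slots, timeout_seconds):
    """Worst case: every slot is held by a stalled client until the
    timeout fires. Returns how many slots are freed per hour."""
    return slots * 3600 / timeout_seconds


SLOTS = 256  # hypothetical number of httpd connection slots

old = stalled_turnover_per_hour(SLOTS, 20 * 60)  # 20 minute timeout
new = stalled_turnover_per_hour(SLOTS, 30)       # 30 second timeout

print(f"20 min timeout: {old:.0f} slots freed per hour")
print(f"30 s timeout:   {new:.0f} slots freed per hour")
print(f"ratio: {new / old:.0f}x")
```

In this worst case a stalled client ties up a slot 40 times longer with the old setting. The gain we actually saw was closer to 4x, presumably because not every connection was a stalled one.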

It's good to see the progress... Hopefully soon things will be better. For the time being, uploading is still an exercise in futility on my machine.

OOPS lol
you lot are only human
How is Ptolemy coming along?
[edit]jipee just got a WU reported[/edit]

The quick, but unsatisfying answer is "I dunno." It's certainly worth looking into, so I'll mention it to Matt and Jeff. They're the experts...

In my former job, we used it for a brief test period on a Hughes satellite link. It performed admirably, even though the decision was made to go to 56K burst frame. While I know that slow link optimization isn't exactly the same goal as what you need, the product isn't just for slow links... It might help. It might not.

Edit: Additionally, SkyX looks like another possible help for the TCP/XML/HTTP acceleration.

Hi!
The 30-second timeout might be too low:
I've noticed several new WUs on my "Results for user" list, while nothing is on my PCs. It looks like the WUs are allocated, but the connection is terminated before the client realizes there is something to be fetched.

The result is a long delay until the same WU times out and is re-sent to another client.

BR, 73
Iztok

Today some of my hosts managed to upload and report almost all their WUs, vs. an average of 1-2/day/host before. The timeout change certainly seems to have eased the situation somewhat.

Still, what Iztok mentioned is worth looking into - unless there is a way for BOINC to recover that WU download, it'll put all low-bandwidth users at a disadvantage while reducing overall project efficiency.

We've moved the scheduler to bruno (from galileo) and both bruno and ptolemy are handling uploads. Only penguin is on download duty, but that may change if downloads start becoming a problem.

We'll round-robin the scheduler once we can get round-robin capable feeders built. Matt wasn't able to do it before he left for vacation.

Validators and assimilators are offline while Jeff tracks down a strange segfault. The std::vector<>::size() method is reporting an incorrect value, even though the pointers to the start and end of data are correct. IBTHOOM.

Apache on bruno hung last night in a weird state. Lots of httpd processes running, but no connections getting through. We'll need to come up with a way to detect that state and fix it without human intervention.
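One way such a watchdog could look, as a rough Python sketch and not what we actually run. The function names, the threshold of three failures, and the restart command passed in by the caller (for example `["apachectl", "restart"]`) are all assumptions; a truly hung server might need something more forceful.

```python
import http.client
import subprocess
import urllib.request


def probe(url, timeout=10.0):
    """Return True if the server answers an HTTP request within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # OSError covers URLError, HTTP error statuses, socket timeouts
        # and refused connections.
        return False


def needs_restart(results, threshold=3):
    """True when the most recent `threshold` probes all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


def watchdog_step(results, url, restart_cmd, threshold=3):
    """Probe once, record the result, and restart the server if needed.

    `restart_cmd` is whatever the caller chooses to run (an assumption
    here, e.g. ["apachectl", "restart"]). Returns True if it was run.
    """
    results.append(probe(url))
    if needs_restart(results, threshold):
        subprocess.run(restart_cmd, check=False)
        results.clear()  # start counting afresh after a restart
        return True
    return False
```

Calling `watchdog_step` once a minute from cron or a simple loop would be enough. Requiring several failed probes in a row matters here, since a busy server that is merely dropping some connections should not be restarted over a single missed probe.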

Eric

Thanks for the update, Eric. :-)

Matt's on vacation? How lucky for him. And how bad for those of you who are left in the lab. I guess you both, Jeff and you, look forward to getting rid of this sign: