Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it affects uploads as well: the basic SYN/ACK handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam.

After discussions with Eric and Jeff, here's what we gather is happening. We use Coral Cache to reduce our bandwidth needs. Coral Cache is an easy-to-use, free, third-party system that does some nice distributed caching just by redirecting the right Apache requests to its servers. For example, somebody wants to download the latest Astropulse client, they go to our download server, and they're redirected automatically to the Coral Cache server. The redirect is formed such that, if the Coral Cache server hasn't done so already, it downloads the latest Astropulse client from us, caches it, and then sends it to the requester. Once it's cached, requesters don't need to contact our servers again. So, in essence, all but the first client download request is served from sources outside our lab, thus saving us lots of bandwidth.
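To make the mechanism above concrete, here's a rough sketch of how a "Coralized" URL is formed. Coral Cache works at the DNS level by appending ".nyud.net" to the hostname; the function name and example URL here are my illustration, not the actual rewrite rule on our download server:

```python
from urllib.parse import urlsplit, urlunsplit

def coralize(url: str) -> str:
    """Rewrite a URL so the request is served through Coral Cache.

    Appending ".nyud.net" to the hostname routes the request to a
    nearby Coral proxy, which fetches the original file from us once,
    caches it, and serves all later requests itself.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc + ".nyud.net", path, query, fragment))

print(coralize("http://setiathome.berkeley.edu/download/astropulse.exe"))
# -> http://setiathome.berkeley.edu.nyud.net/download/astropulse.exe
```

On the server side this is just an Apache redirect to the rewritten URL, which is why a single configuration change can turn the whole scheme on or off.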

That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is that a client downloads a new application, but instead of getting the actual executable it gets a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable's checksum, resulting in an application download checksum error. This has been a known problem, so we've only been using Coral Cache during the first couple of weeks after a new application is made available, to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time once Coral Cache is turned off after the initial "wave."
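The checksum error itself is simple to picture. This is a minimal sketch of the kind of check the client performs, not the actual BOINC code - the function name and sample bytes are invented for illustration:

```python
import hashlib

def verify_download(data: bytes, expected_md5: str) -> bool:
    """Return True if the downloaded bytes match the checksum the
    server advertised for the executable."""
    return hashlib.md5(data).hexdigest() == expected_md5

exe_bytes = b"\x4d\x5a fake executable payload"
good_md5 = hashlib.md5(exe_bytes).hexdigest()

# Normal case: the client received the real executable.
assert verify_download(exe_bytes, good_md5)

# Intercepted redirect: the client received an HTML error page instead,
# so the checksum comparison fails and a download error is reported.
html_blob = b"<html><body>this ISP doesn't like third party redirects</body></html>"
assert not verify_download(html_blob, good_md5)
```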

But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior in older, yet still commonly used, BOINC clients. Dave said most of that has been addressed, but if there are still bugs they'll be fixed.
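To spell out what (a) costs us: with exponential backoff, each failed retry waits roughly twice as long as the last, so failing clients back off instead of hammering the server. This is a generic sketch of the idea - the constants and jitter range are illustrative, not BOINC's actual values:

```python
import random

def backoff_delay(n_failures: int, base: float = 60.0,
                  cap: float = 4 * 3600.0) -> float:
    """Seconds to wait before the next retry: doubles with each
    consecutive failure, capped, with random jitter so thousands of
    failing clients don't all retry at the same instant."""
    delay = min(cap, base * (2 ** n_failures))
    return delay * random.uniform(0.5, 1.0)

for n in range(6):
    print(n, round(backoff_delay(n)))
```

Without this, a client whose download keeps failing retries at full speed - multiply that by a large fleet of clients and the pipe stays pegged.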

In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself.

Eric actually had most of this figured out before we arrived today, and had already turned off Coral Cache, so at least the broken redirects would stop spiraling out of control. He also adjusted the TCP settings on the upload server to help get uploads partially working again (instead of only 2% of uploads getting through, now it's about 50%).
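For the curious, the kinds of knobs involved on a Linux upload server look like this. These exact parameter names and values are my illustration of typical tuning for a box dropping SYN/ACKs under load - the actual settings Eric changed weren't specified:

```shell
# Hypothetical sysctl tuning - illustrative values only.
sysctl -w net.ipv4.tcp_max_syn_backlog=4096  # hold more half-open connections
sysctl -w net.core.somaxconn=1024            # deeper accept() queue for Apache
sysctl -w net.ipv4.tcp_synack_retries=3      # give up on dead peers sooner
```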

The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

So if I'm understanding this, the upload/download freeze was caused by software?

We, or rather I, started a fundraising effort to try to help with the connection problems we're seeing more and more of lately.

We talked and debated over what would be the best cure for this.
1. A better connection for the servers to the net.
or..
2. A faster server to deal with the higher demands caused by CUDA
or..
3. ?

Matt, you are the expert here and no one will dispute that so please..
What is the number one thing that would help fix the jam ups?
There are a few of us, including me, who want to help in any way we can. Even though we are mostly broke, we can help pull some donations in and rally for a specific goal by a specific date.
Tell us what you need and maybe the community can find a way to make it so.

One more thing.. THANK YOU FOR YOUR HARD WORK and I mean HARD WORK on this project!
It has to be pure insanity at times.

So if I'm understanding this, the upload/download freeze was caused by software?

Well, not software so much as the implementation of a service used to get around our bandwidth limitations. So yeah, the main bottleneck when problems like this arise is our connection to the internet, which maxes out at 100Mb/sec (though we are paying for 1Gb/sec - long story). We discussed several solutions today at our general meeting. Each has its major cons, and all will be quite expensive in time and/or dollars. So the key right now is to work with what we've got as best we can - which, 99% of the time, is plenty - while exploring improvements for the future.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

I did some quick googling over the weekend, and was surprised to come up with a quote of around $3,000 for the most visible raw ingredient - a mile of armoured, rodent-proof, direct-burial, 24-core single mode 9/125 LSOH cable (Optix CST).

Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go?

Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go?

Is that cable the kind that buries itself? And eventually terminates itself in the fiber patch panel? :)
Now, digging would not cost too much - if you can do it by renting a "Ditch Witch" and a smart contractor - but if pavement is in the way, things start to get ugly.
And yes, you need to know what is already buried there so you don't dig through electrical cables or irrigation lines.

Thanks Matt for the info..
So the hardware is capable 99% of the time. That's comforting news :)
As most of us had figured, it's the 100-meg plug.
At $80k to utilize the 1G, that's going to be hard to raise any time soon.
HOWEVER..
I will try :)
Also.. there are many that would like to make small donations, under 10 dollars. Is there a way that Blurf can set up a PayPal account so people can donate $5.00 if they can? It's not much, but every dime helps. I know there may be legal issues.. but Blurf is a very trusted member of the community. If Blurf held the money and then just made a lump donation in his name, wouldn't that skip that issue? There would need to be a tax shelter, however.. a non-profit org?
Also, a post office box where people can mail small checks would probably be convenient for others. He could fund the PO box with part of the donations gathered?
Of course, Blurf would need to be willing to take on the responsibility of handling the donations.
I'm just trying to find revenue in any way that's legal ;)

Eric.. you're OK in my book too! Can't wait to see the fruits of your labors!

My recommendation for raising funds in a quick manner would be to approach companies with a major sponsorship drive. The tax year is coming to a close, and making a donation to an educational institution is tax deductible.

It would need a cruncher or group of crunchers with a good solid background in financial planning to be able to offer larger companies anything. If a guy can make himself a millionaire by selling pixels on a screen as advertising, then something new could be done to help Berkeley in some way.

16,000 people donating $5 = $80,000 - a small amount of money per person, but a large number of people required.

It also depends on whether campus would allow anything like that in the first place; SETI staff are all part of the bigger Berkeley picture. A faster server with a huge number of drives and enough RAM to cope with the constant I/O I'm sure wouldn't go amiss either. As Matt mentioned some time ago, they also have to check that the "lab" can take any more high-power equipment on the current electrical supply they have; overload = disaster.

Kudos to each of you at Berkeley, though, for the amazing hard work you guys have done to get things running smoothly again. Hope everything goes equally smoothly tomorrow during the outage.
____________

I agree with being able to donate via PayPal. I helped with one of the presidential campaigns. Every request for money was for $5, $10, $25, not $1K or $10K. Donations could be made in many forms, including at one point PayPal. It was ridiculous the amount of money that was raised this way.

It's certainly nice to see this communication and such but the question remains, when are we going to be able to resume operations? I can upload 1 or 2 completed tasks an hour but nothing downloads.

It depends on who "we" is. I have hosts with very low SETI work fractions which were able to get fresh work even late yesterday--as they did not have an excessive number of pending uploads.

According to this message from the guy who maintains it, the relevant piece of software inhibits work requests when a host has a pending upload count greater than twice its number of CPUs. For a fast host with a high fraction of time devoted to SETI, this means it has to finish uploading the great majority of the work it completed but was unable to upload during the recent unpleasantness. Exponential backoff means that some of the older work won't even retry very often.

Overall this is likely a good thing, as it spreads out the mass attack of work requests compared to what would otherwise occur.
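The gating rule described above boils down to a one-line check. The function name is my own illustration, not the actual BOINC identifier:

```python
def allow_work_request(pending_uploads: int, n_cpus: int) -> bool:
    """A host may ask for new work only while its count of results
    waiting to upload is no more than twice its CPU count (per the
    message cited above)."""
    return pending_uploads <= 2 * n_cpus

assert allow_work_request(pending_uploads=3, n_cpus=2)       # 3 <= 4: may fetch work
assert not allow_work_request(pending_uploads=9, n_cpus=4)   # 9 > 8: clear backlog first
```

This is why a lightly loaded host got fresh work yesterday while a heavy cruncher with a big upload backlog did not.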

But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent.

This would explain why I have processed virtually nothing for the past month while my computer has been running almost 24x7?! After a week of processing an AP WU it gets errored out?!

I am very upset about having all the time and computer resources I have donated for the past month wasted. Now that I have learned how to disable AP, I won't be wasting any more effort on it. Please tell me that it is not true; maybe I'll turn it back on.