Random day today for me. Catching up on various documentation/sysadmin/data pipeline tasks. Not very glamorous.

The question was raised: Why don't we compress workunits to save bandwidth? I forget the exact arguments, but I think it's a combination of small things that, when added together, make this a very low priority. First, the programming overhead to the splitters, clients, etc. - however minor it may be it's still labor and (even worse) testing. Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80). Third, the amount of bandwidth we'd gain by compressing workunits is relatively minor considering the possible effort of making it so. Fourth, this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage). We might be actually implementing better download logic to prevent coral cache from being a redirect, so that may solve this latter issue. Anyway.. this idea comes up from time to time within our group and we usually determine we have bigger fish to fry. Or lower hanging fruit.

Oh - I guess that's the end of this month's thread title theme: names of lakes in or around the Sierras that I've been to.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

I understand that when the bandwidth only spikes for 'short' periods of time, it may not be a priority to fix it. However, per a not-so-recent news release, S@H's need for more computers and volunteers was going to increase many-fold. In light of this hoped-for need, can one actually afford to deprioritize the bandwidth issue?

More generally, if a virtual stress test could be performed with 10x more hosts than currently are involved (the news release suggests a potential factor of 500 - yikes!), what parts of S@H wouldn't scale and would need to be re-engineered?

I understand that when the bandwidth only spikes for 'short' periods of time, it may not be a priority to fix it. However, per a not-so-recent news release, S@H's need for more computers and volunteers was going to increase many-fold. In light of this hoped-for need, can one actually afford to deprioritize the bandwidth issue?

Well, to clarify - bandwidth is definitely a priority. Solving it by compressing workunits? Not so much... It wasn't a priority in the past for reasons I stated earlier, and it isn't a priority in the near future since we'll need a lot more bandwidth than compressing workunits will provide. This is why we're exploring the bigger options mentioned in an earlier thread.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage).

Question: could the client binaries themselves be located off-site (or closer to your feed), perhaps a single server just for those files?

That technique was causing quite a bit of the grief. The problem was the redirect - which apparently a large number of firewalls and ISPs do not allow. The ISP substitutes a page noting that this is not allowed, and that substitute page downloads successfully. BOINC thinks it has the file until it runs a checksum, at which point all tasks that relied on that file error out; new ones are then downloaded and run into the same problem. Churning like this is a good way of filling the bandwidth. If the BOINC client knew where to look for the files directly, instead of following a redirect, it would work better. Another possibility would be to have something like BitTorrent, where BOINC would ask the tracker for the files and the tracker would tell BOINC where to fetch them.
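The failure mode described above can be sketched in a few lines of Python (a simplified model, not BOINC's actual code): the client happily stores whatever bytes the proxy hands back, and only the checksum reveals the substitution - so every retry burns bandwidth and fails the same way.

```python
import hashlib

def verify_download(data: bytes, expected_md5: str) -> bool:
    """Return True only if the payload matches the catalog checksum."""
    return hashlib.md5(data).hexdigest() == expected_md5

# The real file vs. the page an ISP substitutes for a blocked redirect.
real_file = b"<workunit>...</workunit>"
expected = hashlib.md5(real_file).hexdigest()
blocked_page = b"<html>This redirect is not permitted.</html>"

print(verify_download(real_file, expected))      # True
print(verify_download(blocked_page, expected))   # False - and every retry
                                                 # fetches the same bad page
```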
____________
BOINC WIKI

I understand that when the bandwidth only spikes for 'short' periods of time, it may not be a priority to fix it. However, per a not-so-recent news release, S@H's need for more computers and volunteers was going to increase many-fold. In light of this hoped-for need, can one actually afford to deprioritize the bandwidth issue?

Well, to clarify - bandwidth is definitely a priority. Solving it by compressing workunits? Not so much... It wasn't a priority in the past for reasons I stated earlier, and it isn't a priority in the near future since we'll need a lot more bandwidth than compressing workunits will provide. This is why we're exploring the bigger options mentioned in an earlier thread.

- Matt

The A2010 ALFALFA Spring 2009 observations will continue until the end of April, so the midrange work they provide gives a convenient period to decide how to handle the increase in VHAR 'shorties' after that. I guess the delivery pipeline would make the critical time the second or third week of May.

I agree compression wouldn't be enough for a permanent solution, but it might ease problems temporarily.

this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage).

Question: could the client binaries themselves be located off-site (or closer to your feed), perhaps a single server just for those files?

That technique was causing quite a bit of the grief. The problem was the redirect - which apparently a large number of firewalls and ISPs do not allow. The ISP substitutes a page noting that this is not allowed, and that substitute page downloads successfully. BOINC thinks it has the file until it runs a checksum, at which point all tasks that relied on that file error out; new ones are then downloaded and run into the same problem. Churning like this is a good way of filling the bandwidth. If the BOINC client knew where to look for the files directly, instead of following a redirect, it would work better. Another possibility would be to have something like BitTorrent, where BOINC would ask the tracker for the files and the tracker would tell BOINC where to fetch them.

Wearing my ISP hat for a moment, I'm not sure why an ISP would block the redirect. Corporate America is a different story.

Do the applications have to come from the same download servers as the work? Seems like the easiest solution would be to tell BOINC to look for clients.setiathome.ssl.berkeley.EDU or somesuch and download from there.

Then that could be one, or several, distributed servers around the planet just based on the number of "A" records in DNS.

... or it could be mapped to the same IP on smaller projects.
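The multiple-"A"-record idea amounts to round-robin load spreading: successive lookups hand out successive addresses. A minimal sketch (the hostname and addresses here are hypothetical, standing in for whatever DNS would return):

```python
from itertools import cycle

# Hypothetical addresses that multiple DNS "A" records for a name like
# clients.setiathome.ssl.berkeley.edu might resolve to.
mirrors = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]

rotation = cycle(mirrors)
picks = [next(rotation) for _ in range(5)]
# Successive clients land on successive mirrors, spreading the load:
print(picks)  # ['192.0.2.10', '192.0.2.11', '192.0.2.12', '192.0.2.10', '192.0.2.11']
```

On a smaller project, all the records could simply point at one IP, as noted above - the client-side behaviour is identical either way.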
____________

Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80).

But binary data is already going over port 80. A lot of web pages already get compressed by the web server before they get transferred to the browser, which then proceeds to uncompress that stream to render it.

That's why libcurl includes zlib: to unpack the data stream. The client appears to already be ready to at least receive, if not also send, compressed streams. Caveat: I'm not a programmer, and looking at the documentation I don't see a way of handling it on the sending side - not that it would be necessary.

And if the server is already doing THAT processing, zipping the file isn't going to help the bandwidth one little bit.
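That last point is easy to demonstrate: compressing already-compressed data buys nothing. A quick check with Python's stdlib gzip module:

```python
import gzip

# A repetitive payload compresses well on the first pass...
payload = gzip.compress(b"some workunit xml " * 500)

# ...but a second pass cannot shrink it further; it typically grows
# slightly from the extra gzip header/framing overhead.
second_pass = gzip.compress(payload)
print(len(second_pass) > len(payload))  # True
```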
____________

I'd have to check the BOINC code, but this should be very straightforward. The one disadvantage to this feature is that CPU load on the web servers will increase. Memory requirements for compression are minimal.

But binary data is already going over port 80. A lot of web pages already get compressed by the web server before they get transferred to the browser, which then proceeds to uncompress that stream to render it.

That technique was causing quite a bit of the grief. The problem was the redirect - which apparently a large number of firewalls and ISPs do not allow. The ISP substitutes a page noting that this is not allowed, and that substitute page downloads successfully. BOINC thinks it has the file until it runs a checksum, at which point all tasks that relied on that file error out; new ones are then downloaded and run into the same problem. Churning like this is a good way of filling the bandwidth. If the BOINC client knew where to look for the files directly, instead of following a redirect, it would work better. Another possibility would be to have something like BitTorrent, where BOINC would ask the tracker for the files and the tracker would tell BOINC where to fetch them.

The problem wasn't so much with the location of the files on another server, it was with the technique used to tell the BOINC client where to collect the files from. Matt made the (very reasonable) point that 'caching' (of any description) means that the relief server doesn't have to be manually loaded with the new files when anything changes.

But subject to that limitation, and the need to issue Matt with a pair of long-range kicking boots in case anything goes wrong, then establishing an application download server with a different URL closer to the head-end of the 1Gb/s feed might buy a bit of time. Einstein does something like this - even supplying BOINC with a set of multiple download URLs - and it seems to work well in general, with just occasional glitches if a mirror site goes off-air. Comparing notes with Einstein might throw up some more ideas.
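The Einstein-style multiple-URL scheme boils down to simple failover: try each mirror in order and take the first one that answers. A sketch (the URLs and the `fetch` callable are illustrative stand-ins, not real BOINC internals):

```python
def fetch_with_mirrors(urls, fetch):
    """Try each mirror URL in turn; return the first successful payload."""
    last_error = None
    for url in urls:
        try:
            return fetch(url)
        except IOError as err:
            last_error = err  # mirror off-air: fall through to the next one
    raise IOError(f"all mirrors failed: {last_error}")

# Simulated mirrors: the first is down, the second answers.
responses = {"http://mirror-b.example/app": b"app-binary"}

def fake_fetch(url):
    if url in responses:
        return responses[url]
    raise IOError(f"{url} unreachable")

data = fetch_with_mirrors(
    ["http://mirror-a.example/app", "http://mirror-b.example/app"], fake_fetch)
print(data)  # b'app-binary'
```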

However... Have you considered HTTP based compression? Apache could compress the work unit downloads realtime using gzip or deflate. You'll get about 27% smaller downloads this way.

Gzip compression over HTTP is great for servers generating significant HTML/XML/CSS/JS traffic, and while it would be advisable to enable on this forum, that traffic is a tiny fraction of the WU bandwidth requirements. Its share of total bandwidth would be small.

Compressibility is related to entropy, and in terms of compression a given block of data has varying degrees of 'sponginess' - you could take a large sponge, compress it, and feed it through a gas pipe, and it expands again on reaching the exit. Same principle.

However, we are analysing noise that is random and has no identifiable redundancy - to any compressor it is indistinguishable from concrete!

The only viable solution that I recommend is to repackage the XML workunits with binary CDATA instead of base64 encoding.

I don't agree that this will cause any problems with routing or content filtering - http has to handle all forms of binary transmission, or we would have no web!

If the apps were properly XML compliant and did not make assumptions about content encoding, they should not need to be rewritten. Transmission should be transparent regardless of content encoding, and much bandwidth could be saved, as with Astropulse WUs.
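Both halves of the argument above - random noise is incompressible, but base64 encoding adds removable overhead - can be checked directly with Python's stdlib (the sizes here are illustrative; real workunit payloads will behave similarly but not identically):

```python
import base64
import gzip
import os

noise = os.urandom(50_000)            # stands in for raw telescope samples
encoded = base64.b64encode(noise)     # XML-safe text, ~33% larger

# Raw noise is essentially incompressible - gzip can't beat its entropy:
print(len(gzip.compress(noise)) / len(noise))     # ~1.0 (no saving)

# But gzip claws back most of the base64 inflation (6 useful bits per
# text byte), which is roughly the saving binary CDATA would give directly:
print(len(gzip.compress(encoded)) / len(noise))   # ~1.0-1.1
print(len(encoded) / len(noise))                  # ~1.33 uncompressed
```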

That technique was causing quite a bit of the grief. The problem was the redirect - which apparently a large number of firewalls and ISPs do not allow. The ISP substitutes a page noting that this is not allowed, and that substitute page downloads successfully. BOINC thinks it has the file until it runs a checksum, at which point all tasks that relied on that file error out; new ones are then downloaded and run into the same problem. Churning like this is a good way of filling the bandwidth. If the BOINC client knew where to look for the files directly, instead of following a redirect, it would work better. Another possibility would be to have something like BitTorrent, where BOINC would ask the tracker for the files and the tracker would tell BOINC where to fetch them.

I was kind of thinking more along the lines of: don't try to download any work for an application until that application has itself downloaded successfully. The problem was that WUs were being assigned, BOINC was saying "hey, I need this application to run these tasks", it would then try to get the application, the app download would fail, all the WUs would be dumped, and the process would start over again at the next connect interval.

Another thing that would have greatly helped keep that situation from spiraling out of control is what I believe would be a better way to do the quota system. Instead of doubling the quota for each good result returned, it should just be +2. It was pointed out that if you are on an 8-CPU system and you turn in 800+ bad tasks, it only takes eleven (11) good results to bring the quota back to 800. 4-CPU takes 10, 2-CPU only takes 8. Then there are the CUDA quotas that were thrown into the mix as well, with the multiply factor for that. I think +2 instead of 2x would keep problem computers at bay very nicely. It doesn't even have to be +2 - it can be +5 - just as long as it's not multiplication.
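The asymmetry between the two recovery schemes is easy to quantify. A simplified model (BOINC's exact floor and cap rules differ slightly, which is why the post counts 11 where this toy version counts 10):

```python
def results_to_recover(max_quota, start=1, double=True):
    """Good results needed to climb from `start` back to `max_quota`."""
    quota, good = start, 0
    while quota < max_quota:
        quota = min(quota * 2, max_quota) if double else quota + 2
        good += 1
    return good

# Hypothetical 8-CPU host (quota cap 800) whose quota collapsed to 1:
doubling = results_to_recover(800)                # 10 under 2x growth
additive = results_to_recover(800, double=False)  # 400 under +2 growth
print(doubling, additive)
```

Exponential recovery lets a problem host regain its full quota almost immediately, while additive recovery forces it to prove itself over hundreds of results - which is the whole point of the suggestion.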
____________
Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

This may be a dumb question, but...
How do I know if I have a problem computer and how can I fix it if I do?

(Not wanting to be part of the problem), Thanks.

A problem computer is one that returns an excessive number of errors. Each error reduces your daily CPU quota by one. Typically you want to be at 100 all the time, but something as simple as missing a deadline will count as an error and reduce the quota.