To briefly summarise complex back-end concerns: something has broken or changed recently on a particular machine, it is still unclear what triggered the problem, and it is under active investigation as we speak. A number of options are being looked at, including transferring the relevant download work to another device. We can but hope! :-0

Cheers, Mike.

( edit ) I'll add that 'download server' is a functional label which hides much important detail, to wit: the hardware and software are custom/bespoke, tuned to a specific purpose, so procuring a replacement is not a trivial task.
____________"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

The root cause is not yet fully understood, but we have re-enabled BRP4 and FGRP1 task distribution (S6Bucket has remained active throughout), albeit at a very conservative level while we monitor the download server.

This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.

ROFL! Oh, yeah. Right. So the people who cannot reach us can tell us that? :-)

Cheers, Mike.

( edit ) For the rest of us: those who hold the validators for editing the web content are currently incommunicado... but the next time my car runs out of petrol I'll be sure to drive it to the next town to fill up.

Now, as to the original issue: it seems E@H may be a victim of its own success. Again, alas. Posters may recall analogous problems in the past when there has been a change in workflow patterns due to new workunit types etc. Thresholds get reached, bandwidths peak..... that sort of thing. AFAIK a key problem is maintaining logical coherence of activities across separated hardware. Naturally, in a perfect world with infinite funds, plenty of staff and an accurate crystal ball, these scenarios would be escaped or never entered. :-)

In any case, please bear with us. Most likely temporising measures will be put in place, followed later by more lasting ones. Right now there's a lot of back-end discussion on a wide range of alternatives. Your patience is very much appreciated, but now might be the time (and I can't think of a better sort of occasion) to switch to a backup BOINC project of your choice in the meantime, if that suits your mindset.

Cheers, Mike.

Would you mind telling us what it turned out to be, in case the experience might be useful for other BOINC projects?

Sure! A few months ago we noticed that Apache could no longer handle the BRP/FGRP download requests and switched to lighttpd, which turned out to be more suitable for our specific setup, data type and access pattern. The load increased further still, and we seem to have crossed a crucial threshold last week such that lighttpd was no longer up to the task either. Various filesystem/network/daemon tests revealed that the web server was in fact the bottleneck, and we have now moved to nginx, the very efficient web server that powers Facebook, WordPress, SourceForge and GitHub, for instance (the third, almost second, most popular web server).
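The thread doesn't show the project's actual configuration, but for readers curious what "more suitable for this access pattern" looks like in practice, here is a generic sketch of the kind of static-file serving setup nginx is typically chosen for (all paths and numbers are illustrative assumptions, not the E@H values):

```nginx
# Illustrative only -- not the project's actual configuration.
worker_processes auto;           # one event-driven worker per core

events {
    worker_connections 8192;     # many concurrent downloaders per worker
}

http {
    sendfile    on;              # kernel copies file -> socket, no userspace buffer
    tcp_nopush  on;              # send full packets for large sequential files

    server {
        listen 80;

        # Hypothetical location for workunit data files
        location /download/ {
            root /data/einstein;
            autoindex off;
        }
    }
}
```

The relevant design point is that nginx (like lighttpd) uses a small fixed pool of event-driven workers rather than a process or thread per connection, so thousands of slow, long-running downloads don't exhaust server memory the way a prefork Apache setup can.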

This sort of thing should be posted to the front page news section. That way also gets the word out via RSS.

ROFL! Oh, yeah. Right. So the people who cannot reach us can tell us that? :-)

I don't understand your point. Everyone could get to the web site (and RSS) just fine. It was only the upload/download of tasks that wasn't working. It would have been good to announce the issue, so that crunchers would know to redirect their machines to other projects for the duration. And it helps head off all the posts from people asking "what's up?".
____________
Dublin, California
Team: SETI.USA

We had some trouble with BRP4 workunit generation earlier today. The problem has been solved. It will take a few hours to build a buffer of unsent tasks, though. Currently all generated tasks are immediately sucked up by hungry clients.

BM

Edit: Sorry, the problem is only partially solved so far. We are still working on it.

The download server has been working ok this weekend. We did have a problem with generating workunits for BRP4 (the other searches being unaffected). This problem has been solved for now, we are generating and sending out BRP4 work again.

The workunit generation for BRP4 is a chain of various pieces of software running on a couple of machines. It had worked well and reliably since the end of September. The reboot of one of the machines on Friday morning then led to a chain of oddities and errors that resulted in no work being generated any more.

A couple of errors in that chain still need some investigation in order to prevent this from happening again, but we won't do it today. It's Advent weekend after all, and most of the people involved (Ben, Carsten, Oliver, me) are spending these days with their families.

Would it be an idea to have the FGRP work coming from a different download server? That could free up some bandwidth, relieve the BRP download server of some load, and remove the single point of failure (i.e. by having two download servers).

The network / server load from FGRP1 is negligible. There is a single large file that should be downloaded only once per host for all workunits, the actual data files are just a few kB, and should also be used for many tasks.

What would make more sense would be to have two download servers for BRP4, each one fed by a single workunit generator. But currently we don't need that.

We are currently investigating different ways to encode (effectively compress) the BRP4 timeseries data, such that we need to ship fewer bytes per task. This should help both server and clients.
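The post doesn't say which encoding is being considered, but a minimal sketch of the general idea is below: quantising float32 timeseries samples down to 16-bit integers (accepting a small, controlled precision loss) before handing the stream to a generic compressor. Everything here, including the use of NumPy and zlib, is an illustrative assumption, not the actual E@H pipeline:

```python
# Hypothetical sketch (not the actual E@H code): shrink a radio
# timeseries by quantising float32 samples to int16, then compress.
import zlib

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000).astype(np.float32)  # stand-in timeseries

raw = samples.tobytes()  # what shipping the data uncompressed would cost

# Map the observed dynamic range onto the int16 range; the scale factor
# would be shipped alongside the data so the client can reconstruct.
scale = np.abs(samples).max() / 32767.0
quantised = np.round(samples / scale).astype(np.int16)

packed = zlib.compress(quantised.tobytes(), level=9)

print(len(raw), len(packed))  # packed is noticeably smaller than raw
```

The quantisation alone halves the payload (16 bits per sample instead of 32), and the compressor can then exploit any remaining regularity in the signal; the client reverses the process with `quantised.astype(np.float32) * scale`.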

This material is based upon work supported by the National Science
Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by
the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the investigators
and do not necessarily reflect the views of the NSF or the MPG.