Hello again. Today was the usual outage day, but we got a *lot* done, so I figured I'd report on a bit of it.

Everything in the server closet is now on the new Foundry X448 switch. Of course this is all internal traffic - the workunits/results are still going over our Hurricane Electric network. Still, it's a major improvement in quality and may actually grease several wheels. In fact, we may use it to replace the HE router as well at some point.

The download servers have been trading off for a bit - we have now settled on using vader and georgem as the download server pair. I also just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!
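A minimal sketch of the kind of nginx server block such a swap typically involves, assuming a generic static-file setup; the hostname, paths, and tuning directives below are illustrative guesses, not the project's actual configuration:

    server {
        listen 80;
        server_name downloads.example.edu;   # hypothetical hostname
        location /download_fanout/ {         # hypothetical fanout path
            root /var/www;                   # hypothetical document root
            sendfile on;                     # let the kernel copy file data
            tcp_nopush on;                   # send full packets with sendfile
            access_log off;                  # skip per-request disk writes
        }
    }

nginx's event-driven worker model tends to handle thousands of slow concurrent downloads with far less memory than apache's traditional prefork (process-per-connection) model, which is the usual motivation for a move like this.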

Otherwise, Jeff and Eric worked pretty hard today to align the beta and public projects - for the first time in a while (years?) their database configurations match, which will make the immediate future of development a lot easier (we've been dealing with having several code sandboxes and so forth for a while).

In less great news, carolyn (the mysql server) crashed for no known reason. Probably a linux hiccup of some sort, which is common for us these days. The very silver lining is that it crashed right after the backup finished, and in such a manner that it didn't cause any corruption or even leave the replica server in a funny state. It's as if nothing happened, really.
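One quick way to confirm a replica really did come through a master crash unscathed is to inspect its replication status. A minimal sketch, assuming a standard mysql command-line client and a hypothetical replica hostname:

    import subprocess

    def replica_status(host="replica.example.edu"):  # hypothetical host
        """Return the raw output of SHOW SLAVE STATUS from the replica."""
        return subprocess.run(
            ["mysql", "-h", host, "-e", "SHOW SLAVE STATUS\\G"],
            capture_output=True, text=True, check=True,
        ).stdout

    # Fields worth eyeballing after an unclean shutdown on the master:
    for line in replica_status().splitlines():
        if any(key in line for key in ("Slave_IO_Running", "Slave_SQL_Running",
                                       "Seconds_Behind_Master", "Last_Error")):
            print(line.strip())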

However, one sudden crisis at the end of the day today: the air conditioning in the building seems to have gone kaput. Our server closet is just fine (phew!) but we do have several servers not in the closet and they are burning up. We are shutting a few of the less necessary ones off for the evening. Hopefully the a/c will be fixed before too long.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Thanks for the update on the current state of things, and good luck trying to keep cool.

I was interested to hear whether you have any thoughts on the network problems of the past 7-10 days. To a novice, it looks like there was, over this last week, a correlation between the AP splitters running and the whole SETI project ceasing to respond.

The download servers have been trading off for a bit - we have now settled on using vader and georgem as the download server pair. I also just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!

Downloads appear to be pretty much like normal - the maxed-out network traffic makes connections & downloads difficult (although at present not impossible).
NB: I connect to only the one download server (208.68.240.13), as the other (208.68.240.18) is always timing out.
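A connection probe along these lines reproduces that comparison, assuming plain HTTP on port 80 (the IPs are from the post above; the 10-second timeout is arbitrary):

    import socket
    import time

    for ip in ("208.68.240.13", "208.68.240.18"):
        start = time.monotonic()
        try:
            with socket.create_connection((ip, 80), timeout=10):
                print(f"{ip}: connected in {time.monotonic() - start:.2f}s")
        except OSError as err:
            print(f"{ip}: failed after {time.monotonic() - start:.2f}s ({err})")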

What is of concern is that for the last 3 weeks or so there have been serious issues with uploading, and frequent issues getting work from the Scheduler.

Uploads either all time out straight away, or they just sit there, elapsed time ticking away & nothing happening. After 1-5 minutes they'll time out, to try again later. Usually once they start to upload, they continue OK. However, they will often sit at 100% for anything up to 3 minutes & either finally complete, or time out & have to start from scratch again.
Whatever was done (I think Sunday, your time) sorted it out, but the problem is back again right now.
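That time-out-and-retry pattern is consistent with the exponential backoff the BOINC client applies to stalled transfers; a rough sketch of the idea, with illustrative delays rather than the client's actual schedule:

    import random
    import time

    def retry_with_backoff(transfer, max_tries=5, base_delay=60):
        """Attempt a transfer, backing off exponentially between failures."""
        for attempt in range(max_tries):
            try:
                return transfer()
            except OSError:
                # Double the wait after each failure, plus jitter so thousands
                # of clients don't all retry in sync after an outage.
                time.sleep(base_delay * 2 ** attempt + random.uniform(0, 30))
        raise RuntimeError("transfer still failing after retries")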

And the problem with the Scheduler is that requests for work are often met with "Project has no tasks available", "No tasks sent" or "Timeout was reached", so when we are finally able to upload enough work to request more, we can't get any.

This all seemed to happen around the time all the shortie WUs started going through the system. E.g. on one card it usually takes 15-20 min to process 3 WUs; with the shorties it's doing 3 WUs in 4-5 min. That's 3 to 4 times the throughput.
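A quick sanity check of that figure, using the midpoints of the times quoted above:

    normal_rate = 3 / 17.5  # WUs per minute at 15-20 min per 3 WUs
    shorty_rate = 3 / 4.5   # WUs per minute at 4-5 min per 3 WUs
    print(f"ratio: {shorty_rate / normal_rate:.1f}x")  # ~3.9x, i.e. 3 to 4 times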
____________
Grant
Darwin NT.

Thank you for the update,
I like these little technical background notes; it's good to know that I'm not the only one living with good fortune and/or setbacks while working with computers and their technical surroundings ... ;-)

... the air conditioning in the building seems to have gone kaput...

The only good news on this is that Americans use German words - though the word is "kaputt - kaputter - am kaputtesten".
I hope it's not "am kaputtesten", so the servers get back to lower temperatures soon ;-)

Matt - I think you are trying to jam too much down the internet pipe by using two download servers. Some years ago, one of the two download servers was out of commission for a day or two, and this let the internet connections proceed smoothly. I would like to suggest that you suspend one of the d/l servers for a day or two to see if the connections smooth out and actually result in better overall throughput. Please and Thanks !!!

I'm with Swibby Bear and Grant: it's time something was done about the scheduler. 100 WUs held in memory (as last I heard) is too few, now that you have (as I see it) 4 to 6 different types of WUs (CPU MB and AP, NVIDIA MB & AP, and ATI/OpenCL MB & AP), with the assignment of type apparently (to me) being done BEFORE the WU gets into the scheduler's memory. A doubling of the number of WUs to 200, if not a quadrupling to 400, seems to be in order... I've mentioned this before, BTW, and got shot down with "It's working the way it is" - but it wasn't working as well as it could then (as I saw it), it's not working too well now, and it'll only get worse from here as more and/or faster computers come along!
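For reference, the size of the feeder's shared-memory work array is a standard knob in a BOINC server's config.xml (shmem_work_items, which defaults to 100), so the change being asked for would look something like this; the value is the poster's suggestion, not a tested recommendation:

    <!-- In the project's config.xml; the default is 100 slots. -->
    <config>
      <shmem_work_items>400</shmem_work_items>
    </config>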
____________

Matt - I think you are trying to jam too much down the internet pipe by using two download servers. Some years ago, one of the two download servers was out of commission for a day or two, and this let the internet connections proceed smoothly. I would like to suggest that you suspend one of the d/l servers for a day or two to see if the connections smooth out and actually result in better overall throughput. Please and Thanks !!!

The only time I recall a single download server running, everything almost came to a grinding halt.
The main download problem is bandwidth - there just isn't enough of it with the 100Mb/s connection.
However, the present problems are due to underlying system issues. People aren't going to find out how well the new server situation is working until they can return all their present work & get new work.
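To put numbers on that ceiling: assuming roughly 370 KB per multibeam workunit (an approximation, not an official figure), a saturated 100 Mb/s link tops out at a few dozen downloads per second:

    link_bytes_per_sec = 100e6 / 8  # 100 Mb/s is 12.5 MB/s
    wu_bytes = 370e3                # assumed multibeam workunit size
    print(f"ceiling: ~{link_bytes_per_sec / wu_bytes:.0f} workunits/second")  # ~34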

Until the upload problem & the Scheduler problem are resolved, that's not going to happen.
____________
Grant
Darwin NT.

How much would it cost to get the gigabit line up the hill?
It might help donations if there were a specific equipment list and dollar amount.

There have been a few oblique references to that in the past. The line is there; it's just a matter of being connected. And that requires the University's agreement.
It would appear campus politics is involved; until those issues (whatever they are) are resolved, it's not going to happen.

Matt, in response to your request for glitch reports - uploads are extremely sticky to non-existent. This is resulting in my main host finishing tasks far faster than they are being uploaded.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

I think it's working well, but if any of you notice weird behavior let me know!

Matt,

There is something fairly new, and terrible, happening. I'm surprised that you seem unaware.

The short version: There's been something major wrong for weeks that was not wrong six weeks ago.

We're bone-dry out here, having worked through our multi-day caches while the project was up and running.

We have Managers unable to communicate with the Client due to gigantic backlogs of un-uploaded, un-reported, and un-downloaded tasks. Lately, uploads stick after reporting 100% progress. "Update" requests are behaving badly. Lots of "transient HTTP errors," lots of scheduler requests that time-out, lots of work units taking far longer to download (after getting stuck) than to crunch (and I'm not referring to "shorties" but to the incredibly slow and usually interrupted downloads).

We are accustomed to "the usual difficulties." This is new.

Your faithful are losing hope.

Please wave a dead chicken over the racks in the server closet, or maybe out at Hurricane Electric (we can't tell). Soon, please.

Whatever you think my biases may be, try to hear what I'm telling you with fresh ears. This is relatively new, but not yesterday new.

...and it's bad.

If the things you changed Tuesday were supposed to help, as of this writing, they haven't. Things may be worse, in fact; but they were so bad to start with that isolating a "new" bad condition isn't possible.

I am having the same problems and it's getting very old. Why can't the problem at least be talked about? My CPU hasn't had enough work to keep it busy for more than 1 hour. I can't get any work for the CPU, only the GPU. What's with this?
____________

Matt,
Update on my earlier comment - the rate of failure of uploads is getting higher as time wears on.

I can't comment on downloads as I haven't had any for at least a day, probably due to the vast pile of uploads....
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

I usually remain silent regarding the project's failures, hiccups and just plain strange behavior, but the last couple of weeks HAVE BEEN RIDICULOUS!

I just looked at my GPU temp graph for the past 24 hours and noted that it showed a TOTAL run time of less than 30 minutes. During that period I managed to get about 40 CUDA-FERMI tasks by manually 'prodding' the system and aborting some uploads that had been trying to U/L for more than 18 hours, thereby dropping the number of stalled uploads to 8 or fewer on this quad-core machine and allowing it to get 20 new tasks for about 15 minutes of work. That's roughly representative of the previous two weeks.