I'm starting to get the hang of this. If your cache is a long way below normal, it helps to reduce your cache size settings - that way you're not asking for so much in one go.

When you're recovering from dehydration, take small sips of water, not great big gulps.

Very true (in both cases). A smaller setting, for instance 3 days (or less) plus an additional 2 (or 1), works better and gives a shorter turnaround time: less to report in one go and less work needed per day. If we *all* ask for 10 + 10 days, we're surely in for SERVER trouble... :-\

In Holland we have a saying: a donkey doesn't bump into the same stone twice.
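
To put some numbers on the cache advice above, here's a rough back-of-the-envelope sketch in Python (my own illustration, not the BOINC client's actual logic; the function and its parameters are made up for the example) of how the "connect every X days" plus "additional days" settings translate into the size of a single work request:

# Illustrative only - not BOINC client code.
def work_request_seconds(min_days, additional_days, buffered_days=0.0, ncpus=1):
    """Seconds of work a host might ask for in one scheduler request."""
    target_days = min_days + additional_days
    shortfall_days = max(0.0, target_days - buffered_days)
    return shortfall_days * 86400 * ncpus

# An empty 4-core host asking for 10 + 10 days versus 3 + 2 days:
print(work_request_seconds(10, 10, ncpus=4))  # ~6.9 million seconds of work
print(work_request_seconds(3, 2, ncpus=4))    # ~1.7 million seconds of work

Smaller settings mean each scheduler contact asks for (and later has to report) far less in one go - the "small sips" idea.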

I think we're going to need to at least temporarily go back to restricting workunits in progress on a per-host and per-RPC basis, regardless of what complaints we get about people being unable to keep their hosts busy.

The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.

Richard,

The LAST thing I want to do is get into some sort of trouble, but I read this several hours ago and it's been bugging me ever since.

Does Eric know you couldn't report 6 tasks any better than you could report 6,000?

I'm not talking about "limiting" the reporting to 6 at a time. I'm saying that if all I had was 6 tasks, I couldn't report them.

If there's some really esoteric reason why limiting a machine to 20 work units means that another machine would be able to report 6, I can't fathom it.

I can't even make up a story that sounds plausible.

Nor do I understand why using a proxy would eliminate the problem with reporting. I can't invent a reason why this would be better or worse than restricting work units in progress.

I already KNOW I don't know what I'm talking about, but it would make me feel better if someone would explain in layman's terms how Eric's fix might fix a problem that can be overcome by using a proxy.

Well, the kitties won't be happy having their caches limited, but I guess if that's what it takes to right the ship........
____________
*********************************************
Behold the power of kitty!!

I didn't say I agreed with it, or that I understand the logic behind it.
But I am not Eric.
Things went south after last Tuesday's outage.
And personally, I don't see what cache sizes have to do with it.
All was working fine with AP out of the picture. Caches were filled, comms were good, all appeared to be well.
AP fired up, and everything went to Hades in a handbasket.
Could be coincidence, I dunno.

Splitters are off now, and the bandwidth is probably gonna stay maxed out resending ghost tasks for quite a while.

And I stand corrected, tbret....
It does appear that there is a gremlin in the scheduler, and bandwidth is NOT the only problem right now.
____________
*********************************************
Behold the power of kitty!!

Well, all's right with the world now. Ghosts have been downloaded and scheduler requests are working. Okay, there aren't any new units being made right now, but the odd updating behavior and ghost generation are fixed, at least for now. Just waiting for the cricket graph to drop off as the download backlog is cleared up.

Edit: My only problem now is that I have a full 6-day queue for my ATI MB cruncher but only about 2 days' worth for the CPU MB cruncher.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

I doubt that the number of tasks awaiting validation is an issue - there is plenty of disk space, and the server doing the validation is well up to it.
Just now there are about 10,000,000 tasks "out in the field" and about 7,600,000 tasks awaiting validation, with no new tasks being created as all the splitters are down for one reason or another. What is bemusing is that the query rate is sitting at about 1200 qps against the norm of 700-800 qps, and it has sat around there since the last outage...
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

So does that mean it will take 10 minutes for it to time out now?
I figure if it's not going to respond within 5 minutes, that's as good a point as any for it to time out.
When it did respond while the timeouts were at their worst, it was usually within a couple of minutes; when things are going well, most responses come within 20 seconds or so.
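
For what it's worth, here's a minimal Python sketch of the idea (purely illustrative - the URL, the 5-minute figure, and the back-off are my assumptions, not the client's real settings):

import time
import requests

SCHEDULER_URL = "http://example.invalid/sched_cgi"  # placeholder, not the real scheduler address

def scheduler_rpc(request_xml, timeout=300, max_tries=3):
    """POST a scheduler request, giving up on each attempt after `timeout` seconds."""
    delay = 60
    for attempt in range(max_tries):
        try:
            return requests.post(SCHEDULER_URL, data=request_xml, timeout=timeout)
        except requests.Timeout:
            print(f"attempt {attempt + 1} timed out after {timeout}s; waiting {delay}s")
            time.sleep(delay)
            delay *= 2  # simple exponential back-off between retries
    return None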

Do we know why the Scheduler is having such a hard time keeping up with the load? More RAM required, a faster disk subsystem, a new system?
____________
Grant
Darwin NT.

I found these graphs instructive. I've made a fixed copy - I don't know whether the site is happy about live linking - so this is a snapshot of the position just after 12:00 UTC today (graph times are UTC+1).

We've brought down 'Results in Progress' by over a million overnight, which can only be good for the health of the servers.

We can also see clearly how we got into such a mess yesterday. Somewhere round about 5 am UTC Sunday morning (late Saturday evening in Berkeley), some 300,000 tasks suddenly jumped from 'Ready to Send' to 'Results in Progress'. My guess is that they all became ghosts, but I've no idea why - late Halloween party in the server closet, perhaps? I'd love to be a fly on the wall in this morning's staff meeting while they scratch their heads over that one.

Anyway, back to the present. I'm finding that for hosts which have ghosts in the database (mainly fast hosts with large caches), I'm able to get them resent reasonably easily - provided I don't ask for too much at once. Large work requests are still hitting the timeout. But slower hosts or hosts with smaller caches - which haven't got any ghosts - aren't able to get any new work at all.

Mark Sattler posted an interesting theory yesterday. He wondered whether asking Synergy to run the Scheduler, several MB splitters, and several AP splitters all at the same time might have been too much, and caused the initial slowdown we saw after maintenance last week. Sounds plausible to me.

I've passed it on to the staff, and suggested that they might consider restarting the splitters on Lando - two of each - to provide a trickle of new work for smaller users who are currently getting nothing, while the power users amongst us work our way through the rest of the lost results. We'll see what they make of it.

Richard

Congrats, now you're on the right path; I was talking about this months ago. The problem always returns when the AP splitters start.

Maybe a clue: put fewer AP splitters to work for a while and see what happens; we could all be surprised by the results.

Another clue: during the last problem, I was able to download (>150 kbps), upload, and report everything with the help of a proxy, with no problem (without the proxy: download <1 kbps, upload OK, report NO). That's interesting, because it points away from a bandwidth problem (the proxy uses the same bandwidth). Talk about that with the others at the lab; this could show another path to follow too.
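
To make that comparison concrete, here's a small Python sketch (my own illustration, nothing to do with the BOINC client; the URL and proxy address are placeholders) of timing the same download directly and through an HTTP proxy:

import time
import requests

URL = "http://example.invalid/workunit.dat"       # placeholder download URL
PROXIES = {"http": "http://my.local.proxy:3128"}  # placeholder proxy address

def timed_fetch(proxies=None):
    """Download URL (directly or via a proxy) and return the effective speed in KB/s."""
    start = time.time()
    r = requests.get(URL, proxies=proxies, timeout=300)
    return len(r.content) / 1024.0 / (time.time() - start)

print("direct :", timed_fetch())         # reportedly crawled at under 1 kbps
print("proxied:", timed_fetch(PROXIES))  # reportedly ran at over 150 kbps

If both paths share the same physical link and only the proxied one performs well, that does suggest the bottleneck isn't raw bandwidth.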