Over the last couple of days we have been having some overload problems on the server. To help reduce the server load I have temporarily disabled the batch_status and server_status cron jobs; I will regenerate those pages manually about once a day. Sorry for the inconvenience. I also reduced "max_wus_in_progress" from 24 to 6, so you may see fewer WUs in your job queues.
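
For anyone curious where that knob lives: it is a setting in the project's config.xml. A minimal sketch of the relevant bit (everything other than the limit itself is a placeholder, and depending on the server version the limit may be applied per CPU rather than per host):

    <boinc>
      <config>
        <!-- cap on in-progress jobs handed out to a host; reduced from 24 to 6 -->
        <max_wus_in_progress>6</max_wus_in_progress>
      </config>
    </boinc>

As far as I know the scheduler re-reads config.xml on each request, so the change should take effect without restarting anything.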

For those who are interested in the cause of the overload, it was a combination of things:
1. One of the drives in our RAID is failing, causing HD performance issues. We are working to get this fixed.
2. There were 10k+ WUs that all timed out at about the same time, and the server then had to generate new results for all of them, causing a backlog. This is the reason for reducing max_wus_in_progress.
3. The transitioner had a DB timeout which caused it to crash, increasing the backlog further.
4. Not understanding why WUs were not transitioning, I stupidly ran the "transition_all" admin script, since it said it would "unstick" jobs. Big mistake - this script changed the transition time of all 600k WUs in the DB, forcing the transitioner to reprocess every WU and making the backlog even worse.
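
For those wondering what "changed the transition time" means: every row in the workunit table has a transition_time field that tells the transitioner when to next examine that WU. My understanding is that transition_all boils down to something like this (illustrative SQL, not the actual script):

    -- mark every workunit as due for transitioning right now
    UPDATE workunit SET transition_time = UNIX_TIMESTAMP();

which is why the transitioner suddenly had all 600k WUs back on its plate.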

The good news is the server is almost caught up with the backlog. I will keep you posted.

That's a new error I haven't seen before. Maybe it was a temporary glitch, because I can't find anything in the feeder log - it just shows the feeder constantly adding results to slots.

At the time you posted this there was a backlog of about 15k WUs needing validation; now it shows only about 1000. Is it possible the reason you couldn't report was the validation backlog? I don't see how a feeder problem would prevent you from reporting completed tasks.
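
For what it's worth, the validation backlog can also be checked straight from the database rather than waiting for the status page; a rough sketch, assuming the standard BOINC schema:

    -- count workunits flagged as waiting for validation
    SELECT COUNT(*) FROM workunit WHERE need_validate > 0;

I believe that is the same count the status page reports as "waiting for validation".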

Now we are having another problem... The tmp drive is full, which is causing some of the daemons to crash, including the feeder. All the files on the tmp drive are owned by root, so there is nothing I can do about it myself; I will have to get IT to look into it. It might be related to the failing hard drive.

It's fairly well known that a stopped feeder puts the server into a form of maintenance mode - nothing gets done, and volunteer hosts are backed off for 1 hour (as my log shows). I presume this is deliberate, to stop the situation from getting worse until it can be inspected.

What that doesn't say is why the feeder stopped in the first place - your second post about the lack of temp file space seems as good an explanation as any.

At intervals throughout the day, I've seen the project come back up fully (so I could report and refill); then go into full maintenance mode with these boards down as well; then return to normal working (the current state: I just reported and received new tasks). Which sounds like a good moment to go out and start celebrating the new year...

I'm surprised you were even able to post... The database crashed two days ago and we have been working hard to restore it. I am at work now, but the sys admin and a grad student are still working on the issue.

The fact that the message boards are back up is a good sign. Hopefully we can get the workunit and results tables back up too.

For the most part, things have been very good since we got the new SSD installed. BOINC is highly dependent on file I/O (the create-work scripts copy files to the download directory, incoming results are copied to the upload directory, assimilated results are copied to a final staging area, etc.), so the new drive has sped things up remarkably. For example, I have noticed the create-work scripts are at least 10 times faster with the new drive.

However, every now and then I see an error in the various logs: "Lost connection to MySQL server during query".
The MySQL connection problem causes the associated daemon to shut down. This is probably what caused the feeder to stop.
Since the daemons are run as cron jobs, they automatically restart within 5 minutes, so the project quickly recovers (Richard - I'm sure you already know this, so this info is for others reading this post).
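
For readers less familiar with how that works: if I remember the stock BOINC arrangement right, it is a single crontab entry that invokes the project's start script every five minutes, and (as described above) that is what brings a crashed daemon back within a few minutes. Roughly, with the project path as a placeholder:

    # every 5 minutes: run the project's periodic tasks and restart any exited daemons
    */5 * * * * cd /home/boincadm/projects/myproject && ./bin/start --cron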

I am not a database expert, so I am not sure what the root cause of the database connection problem is. Maybe someone reading this has a suggestion. Maybe I can adjust a timeout value or some other database parameter?

I was in a conference call this evening with a couple of *very* experienced BOINC server administrators. One thought that on a lightly-loaded project a database connection might well time out between active requests; the other had never seen such a thing. On reflection, both thought there might be some useful information about a possible database server stoppage in the MySQL logs.

They suggested I copy your question to the boinc_projects mailing list, both to remind them to look again, and to get some broader responses from the community. OK if I do that in the morning?

First of all, the MySQL logs didn't give any clues, but that could have been because the verbosity level was not set appropriately.

I followed some information I found online and set connect_timeout to 10 seconds (it had been 5). I did this on March 2nd, and now, a week later, there have been no lost connections, so I think the problem has been resolved.
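
For anyone hitting the same "Lost connection" error, the change amounts to something like this (the MySQL config file location varies by distribution):

    -- check the current values
    SHOW VARIABLES LIKE '%timeout%';
    -- raise connect_timeout on the running server (was 5, now 10 seconds)
    SET GLOBAL connect_timeout = 10;

and, to make it survive a MySQL restart, a connect_timeout = 10 line under [mysqld] in my.cnf.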

I'm not sure how you have BOINC set up, but something looks strange. Those WUs in your post were successfully returned earlier today by user ID 77798, username grcpool.com-3. It also turns out that both you (ID 81451) and grcpool.com-3 have the same IP address. Maybe the log you posted was for the grcpool.com-3 account and the connection problem has resolved itself? Because if that log was for the Igor account, then your two accounts have got their wires crossed.