In my experience with an overheated, heavily used main DB server, the hardware will probably be totally dead within a year (first the RAM, then a disk, and finally the motherboard). At the same time the replica server (same hardware, in the same rack) stayed alive, and alive, ... and made a good spare for the main server :-)

I expect the admins will try and clear the backlog of uploading & reporting tasks before turning on the validator(s). As long as a given result is found when the parent WU goes up for validation, having been reported late makes no difference. To put it another way, the “deadline” can be understood, for all practical purposes, as the earliest moment that a task will be liable to validation—rather than automatic rejection. It’s even possible for a result to be accepted after missing a validator pass: if the validation is unsuccessful (whether due to errors or other missing results) or was inconclusive, resulting in a ‘resend’, it effectively gets a deadline extension to match the replacement tasks.

Anyway, the short version is that I wouldn’t give up on any work until I saw that the corresponding WUs had been validated without it.
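That reading of the deadline can be sketched as a toy model. This is not BOINC's actual validator code; the class names, the quorum of 2, and the 7-day extension are all made-up illustrations of the idea that a late result still counts if it arrives before the WU actually reaches validation, and that a resend effectively extends the deadline:

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    host: str
    reported: bool = False   # has this result been uploaded & reported?

@dataclass
class Workunit:
    deadline: float                      # earliest moment the validator may act
    results: list = field(default_factory=list)

def validator_pass(wu: Workunit, now: float, quorum: int = 2) -> str:
    """Toy model: the deadline is the earliest validation time,
    not an automatic-rejection time."""
    if now < wu.deadline:
        return "waiting"                 # deadline not reached; late reports still fine
    reported = [r for r in wu.results if r.reported]
    if len(reported) >= quorum:
        return "validate"                # late-but-reported results made it in time
    # Not enough results: issue a resend, which effectively extends the
    # deadline for stragglers to match the replacement task (value arbitrary).
    wu.deadline = now + 7.0
    return "resend"
```

Running it through the scenario from the post: a result reported after its nominal deadline, but before the validator pass that finally has a full quorum, is still accepted.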

BS

We had a failure of the AC in our main data center over a year ago (May 2009), while everything was in full use during business hours.
Temperature inside the cabinets exceeded 48 degrees (we measure in Celsius) and many servers shut down on their overheat protection.

We discovered that we had only one casualty when we brought everything up again.
In the months following, no other machines failed.
If the event had any noticeable impact on the whole server park, it could be a slight increase in drive failures.
On the other hand, that could just as well be due to the age of the machines, so there's no hard evidence.

Across 300 pieces of hardware there has not been a single RAM or motherboard failure.

Don't call 'BS' in this forum. Whatever the source of the hardware problems the project is experiencing, they are real, whether caused by the AC failure or not.

I personally have had rigs die from overheating.

It does happen. So your personal 'claimed' experience does not mean that the current SETI problems might not have been caused in part by the AC failure.
It certainly did not enhance the reliability of any of the servers in the closet.

Always remember ..... kitties are all Angels with fur.
'Cat lives matter.'

48C? I *WISH* my laptop would run at 48C... It's currently crunching at 68-70C.

Yeah, but stick your laptop in a cupboard that's at 48C and see what happens. Raise the ambient temp by 20C, and you pretty much raise the component temp by 20C. Hopefully things shut down, or else something breaks down, often in unpredictable ways. The system board or power supply capacitors are a good example: high temperatures speed up their ageing, and they may not simply die outright. They can just lose capacitance and make the machine hang at random.

It's certainly possible that the heat treatment has prematurely aged a motherboard in one of the servers. Before the cookup it was just within spec, now it's just outside and weird things happen.

I have to say that for those of us who only have computers at work, where everything is shut down over the weekend, the fact that the weekly outage always falls in the middle of the week means a very reduced window to upload/download. I have 3 big Astropulses that my machine hammered through in great time and finished on Monday evening, but they may not report until next Monday.

I know this is a bad time but the 3 day outage hits me like this every week.
I appreciate that work can only be done when people are available but I thought I should point out that my long term results (7 years) are slowly dwindling away.

Firstly, thanks Jeff and Eric for taking the time to let us know what is going on. It might not seem so on occasions, but it IS appreciated.

Sometimes shutting down is the worst thing you can do to a system that is supposed to be available 24x7.

Yep, but as Seti isn't meant to be up 24/7 that's not a problem here.

Exactly. Boinc was introduced on the basis of "donating your idle computer time to science projects". Note idle computer time, it was never originally envisaged that power crunchers would want to run it 24/7. However most projects are actually up 24/7 and try to maintain that, but it has never been an agreed part of the offering.

That's appalling
news of what's up over a week apart
come on
That's no way to treat CONTRIBUTORS

That's a bit of a harsh comment there. The guys are doing their level best with minimal funding and old equipment. And they also have their own life and families to be part of as well. Those are my principles, and if you don't like them ... well, I have others.
Groucho Marx 1895-1977

I also have mine, and if you don't like them ... tough, live with it.
Chris S 2017

.... I have a question. I've just looked at the server status page and in the list is db_purge.x86_64. Now, I've also noticed that during the downtimes when the upload / download servers are offline, unlike the other stat lines, "Workunits waiting for db purging" and "Results waiting for db purging" never seem to zero out. I'm curious as to why this is. Surely it would make sense that if the db_purge server is up, the purging zeroes out, with no new records being added to the queue.

I'm sure you already know this, and the equipment you're using is very likely much better than what I've used, but overheating problems followed by "hangs" might be caused by bad motherboard capacitors. Check the tops of the caps to see if they're bulging, or open and leaking; if they are, the motherboard is a goner. Sorry if this is obvious to you folks, but I figured I would throw this in since I've run across it in the past. Good luck.

I have personally seen this problem in recent years, on a friend's Athlon XP system. It happened first when plugging or unplugging USB devices caused a system hang, with Windows BSOD. A few weeks later the Blue Screens got more frequent, and an examination of the motherboard revealed about 50% of the capacitors had bulging or brown stained tops.

While I am sure that server grade hardware should be built to higher quality standards, and better tolerances than ordinary consumer equipment, I think some of the kit that SETI@home uses is pre-production or prototype.

Add to that the fact that the ageing of the capacitors may have been accelerated by recent overheating.

I hope Matt, Jeff and Eric may find this post helpful, I would certainly put flaky capacitors near the top of my suspect list.
I have just done a Google search on "bulging capacitors" that produced lots of results, including many images.
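The accelerated-ageing point can be put in rough numbers with the usual Arrhenius rule of thumb for aluminium electrolytic capacitors: expected life roughly doubles for every 10 °C below the rated temperature, and halves for every 10 °C above it. The part rating and temperatures below are purely illustrative, not SETI's actual hardware:

```python
def cap_life_hours(rated_life_h: float, rated_temp_c: float,
                   actual_temp_c: float) -> float:
    """Rule-of-thumb Arrhenius estimate for electrolytic capacitor life:
    life doubles per 10 degC below the rated temperature (halves above)."""
    return rated_life_h * 2 ** ((rated_temp_c - actual_temp_c) / 10)

# A typical 2000 h @ 105 degC part:
# at 45 degC inside the case it could last 2000 * 2**6 = 128000 hours,
# but at 65 degC during an AC failure only 2000 * 2**4 = 32000 hours.
```

So a cabinet spending hours at 48 °C ambient (and correspondingly more inside the chassis) plausibly eats a meaningful fraction of a capacitor's remaining life, which fits the "just within spec before, just outside now" picture above.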
