I noticed today in my server's event log:
"Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)"
There were about 50 identical entries like this over the past week. (Funny enough, I've been pushing Boinc harder on this server in the past two weeks than ever before.)
The errors occur in batches, with 5-10 repeats within a one hour period, and then no more errors for as much as two days until the next batch.

I'm wondering if these errors could be meaningless, perhaps just tied to whatever work unit happens to be running at the time?

Checking Boinc's log during the times in question yields nothing to be concerned about; no errored tasks or invalids occurred in conjunction with the RAM errors.

I'm just curious because it's a new error, so it's NOT a normal thing for my server. Hell, it's the first error to ever appear in the event log that wasn't related to unimportant things.

Referring to ECC RAM in servers, I found this online: "In addition, a DIMM should be replaced whenever more than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs."

I'm not quite at 24 in 24 hours, but there were a couple of days where I was halfway there or more.
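For what it's worth, that 24-in-24 rule is easy to check mechanically. Here's a rough Python sketch; the timestamped log lines and the `max_ces_in_window` helper are my own invention for illustration, not the server's actual log format:

```python
import re
from datetime import datetime, timedelta

# Hypothetical log lines: a timestamp followed by the event text seen above.
LOG_LINES = [
    "2014-03-01 02:15:00 Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)",
    "2014-03-01 02:40:00 Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)",
    "2014-03-02 19:05:00 Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)",
]

PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*Correctable ECC@(DIMM\w+)"
)

def max_ces_in_window(lines, dimm, window=timedelta(hours=24)):
    """Return the largest number of CEs from one DIMM inside any 24 h window."""
    times = []
    for line in lines:
        m = PATTERN.match(line)
        if m and m.group(2) == dimm:
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    times.sort()
    best = 0
    for i, start in enumerate(times):
        # Count events falling within [start, start + window).
        count = sum(1 for t in times[i:] if t - start < window)
        best = max(best, count)
    return best

print(max_ces_in_window(LOG_LINES, "DIMM1B"))  # 2 in this toy data
```

If the result ever crosses 24 for one DIMM while the others stay clean, that's the replacement threshold from the quote above.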

What do you guys think? Should I wait it out and see, or should I try (probably without success) to find a single DIMM that matches my kit?
____________-Dave#2

You might try shutting down, swapping the DIMMs between their sockets, and restarting.
This will do a couple of things. First, it will reseat the DIMMs, in case some light corrosion or dirt has made contact with the sockets poor. Second, if the error moves with the DIMM to the different socket, it would tend to confirm that the DIMM itself is having issues.
____________
***************************************
I am still the kittyman.
Accept no imitations.

Mark that module, then put it in a different slot. If the errors go away, then you just have one of those odd electrical things happening. After a while the errors will either reappear or not. If they reappear, replace the module; there's not much point keeping it around.
____________

I will mark the DIMM and randomly swap it into another slot. Seems the best way to start...

I will report back when I see what happens.

Ugh. Rebooting my server is no fun. It's been running nicely for many months. I'll do it tonight after work if I have the time. It always figures that things wanna get funny on me when I'm the busiest.
____________-Dave#2

Depends on whether you wish to be proactive or reactive. You could always just wait it out and see whether the errors increase. Since they are correctable and modest in number, right now they don't seem to be causing you issues. However, if the DIMM is going bad, you might want to know now so you have time to find replacements.
____________
***************************************
I am still the kittyman.
Accept no imitations.

One of my old Dell boxes is starting to complain about one of the DIMMs as well. It is actually memory that one of the other servers was complaining about a few months ago, so I swapped it into a less important box. Now once again that memory is getting flagged.
So I might have to break down and replace it.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!

I would first try memtest86+. Burn the ISO to a CD (actually, if you have/use any Linux distros, the install CD/DVD almost always has memtest86 on it), boot from the CD, and let it run.

Observe the errors that appear. If the same addresses come up over and over, you have a bad DIMM. If the bad addresses are random, you have a power or heat issue. Power source doesn't necessarily mean the PSU itself, but it can be. I have an old machine (Abit NF7-S v.2.0) that defaults to 2.6v for RAM, but the board has been pushed so hard for so long (I finally shut it down about a month ago) that I had to run it at 2.9v just to get the hardware monitor to show something above 2.6.

Heat could also be an issue. If they are getting too warm, they will throw errors and forget what certain bits were supposed to be.

Also, look at the address range for the list of errors. That will help you determine whether it is one specific DIMM or not. Dual-channel will complicate that, though, so you may end up having to drop down to one DIMM at a time until you find the culprit.
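To make the "same addresses vs. random addresses" check concrete, here's a quick Python sketch; the address list is made up for illustration, you'd copy yours down from the memtest86+ screen:

```python
from collections import Counter

# Hypothetical failing addresses noted down from a memtest86+ run.
errors = [0x1A3F_5000, 0x1A3F_5000, 0x1A3F_5008, 0x7C00_0000, 0x1A3F_5000]

counts = Counter(errors)
repeats = {addr: n for addr, n in counts.items() if n > 1}

if repeats:
    print("Repeating addresses (suggests a bad cell on one DIMM):")
    for addr, n in sorted(repeats.items()):
        print(f"  {addr:#010x} failed {n} times")
else:
    print("Errors scattered across addresses (suggests power or heat).")
```

Tight clusters around one address range point at one module; a scatter across the whole map points at voltage or temperature, as described above.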
____________
Linux laptop:
Record uptime: 1484d 22h 42m
Ended due to UPS failure, as discovered 14 hours later

Will do. I haven't had a good chance to shut down the server yet, so I'll run memtest when I can actually shut it down and move the DIMMs around.

And heat is a STRONG possibility. The DIMMs are right above the CPU heatsink in my board's layout. Add in the fact that I've been running 3.5 out of 4 cores at 100% with a nice toasty CPU temp around 80°C, and I think it's likely. This is hotter/faster than I've run Boinc on this machine previously.

And, 3 days now with no further errors.
If I can rule out heat, I won't even move the DIMMs around.
If I get another error anytime soon, I'll drop Boinc down to a nice cool heat level and see if I get any more errors.
____________-Dave#2

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

LOL

And I've still been anything but proactive here. I haven't moved the modules around to different slots yet.
Fortunately, I have had no repeats of the error yet!

I guess I'll just hope it was a fluke, and I'll blame it on some SETI work units so that I can rest easy about it. :-)

When I first started seeing ECC messages on one of my servers, it was only every few weeks. Then they became more frequent. I waited until I was getting 3 or 4 a day to do anything about it, as it was more of a warning than a "This part has failed. Replace it now!" message.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!

Gee. I never get those messages. Just usually the BSOD, LOL.
____________
***************************************
I am still the kittyman.
Accept no imitations.

IIRC these servers will actually stop using memory if they think it is starting to go wonky. However, that might require it to be set up in a mirrored memory configuration, which I don't have enough DIMMs to do on the old machines, and I don't think I could get a PO for $4,000 of old ECC DDR2 past my boss.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!

My previous rig had to use ECC memory since it was a 2P Opteron setup. For the most part, it handled memory errors fairly well. If the offending error was in an application's memory, it would hang for a minute and either recover or terminate unexpectedly. If it was in the kernel or something important, it would hang and not recover. I didn't get BSODs; it would just lock up after 3-5 minutes (oddly, you could alt+tab to other programs and continue working or saving them, but the Windows GUI would get really angry if you tried closing any programs) and eventually just need to have the reset button pressed.

ECC memory stores extra check bits alongside the data, sort of like RAID parity but not totally redundant. On a read, the check bits are recomputed and compared against what's stored: a single flipped bit can be corrected on the fly (that's the "Correctable ECC" warning you see logged), while a multi-bit error can only be detected, which is when things get ugly.
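For the curious, here's how the "correctable" part works in miniature. This is a toy Hamming(7,4) encoder/decoder in Python, a simplified stand-in for the wider SECDED codes real DIMMs use, showing how a single flipped bit can be located and fixed:

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct any single-bit error, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3   # 1-based position of the bad bit; 0 = clean
    if pos:
        c[pos - 1] ^= 1          # this fix is the "correctable error"
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[4] ^= 1                          # simulate one bit flipping in memory
print(hamming74_decode(word) == data)  # True: the flip was corrected
```

The syndrome bits literally spell out the position of the flipped bit, which is why the hardware can fix it silently and just log a warning.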

Really expensive setups will let you hot-swap memory modules. Actually, I did some research on that a while back, and there are apparently consumer-level programs/utilities for Windows 7/8 that will ask the kernel to vacate all the memory from a module so you can hot-swap it, but I personally wouldn't trust it.
____________
Linux laptop:
Record uptime: 1484d 22h 42m
Ended due to UPS failure, as discovered 14 hours later