Hi,
> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
> found the system to be quite unstable. After BIOS updates and kernel
> changes we still get random kernel panics when under load.
Me too :-(
We've got an 85-node dual Opteron cluster. I've documented most of the
crashes on
http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .
Our equipment:
* Dual AMD Opteron DP270 (2.0 GHz)
* MB: TYAN S2882G3-DNR
* Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
(12 nodes have 8*2GB)
* PS: EMACS P1 6400P
* HD: 250 GB SATA from Western Digital
Dist: Debian/Sarge amd64
Kernel: various, currently 2.6.15.3 from kernel.org
BIOS: (most recent, as far as I know)
When a node crashes, we typically see an MCE (machine check exception)
followed by a kernel panic. We get about 2 crashes per week on our
85-node cluster. Some nodes seem to be more unstable
than others but we also see instabilities on nodes that had been stable so
far. The instabilities are very hard to reproduce: we have nodes that crashed
once and ran stable afterwards. Crashes seem to occur mostly when the system
is under heavy CPU (memory?) load.
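To put the crash rate in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, for simplicity, that crashes are spread evenly over all nodes (which is not quite true, since some nodes look worse than others); the function names are just illustrative.

```python
# Rough per-node crash rate from the cluster-wide figures quoted above.
# Assumption: crashes are spread uniformly over all nodes.

CRASHES_PER_WEEK = 2.0  # cluster-wide, as observed
NODES = 85


def per_node_rate(crashes_per_week: float, nodes: int) -> float:
    """Expected crashes per node per week under the uniform assumption."""
    return crashes_per_week / nodes


def weeks_between_crashes(crashes_per_week: float, nodes: int) -> float:
    """Average number of weeks a single node runs between two crashes."""
    return nodes / crashes_per_week


if __name__ == "__main__":
    rate = per_node_rate(CRASHES_PER_WEEK, NODES)
    gap = weeks_between_crashes(CRASHES_PER_WEEK, NODES)
    print(f"per-node rate: {rate:.4f} crashes/week")
    print(f"average gap per node: {gap:.1f} weeks")
```

Under that assumption a single node crashes on average only once in roughly 42 weeks, which fits with how hard the problem is to reproduce on any one machine.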
Far too many correctable ECC errors are reported (on a subset of about 10-20
nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
the memory modules within one node. There seems to be a weak correlation
between the instabilities and the tendency to exhibit ECC errors. memtest86
runs fine on the memory modules.
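For anyone who wants to try the same module rotation, here is a small Python sketch of the bookkeeping. It is only an illustration: the module labels, the shift by one slot, and the function names are assumptions, not a record of exactly what was done on our nodes.

```python
# Bookkeeping for a cyclic DIMM swap within one node.
# Assumptions: modules are labelled by hand (M0..M7) and each module
# is moved to the next slot, with the last one wrapping to slot 0.

from typing import List, Optional


def rotate_modules(slots: List[str], shift: int = 1) -> List[str]:
    """Return the slot layout after moving every module `shift` slots up."""
    n = len(slots)
    return [slots[(i - shift) % n] for i in range(n)]


def classify(before: List[str], after: List[str],
             err_slot_before: int, err_slot_after: Optional[int]) -> str:
    """Say whether ECC errors stayed with the slot or moved with the module."""
    if err_slot_after is None:
        return "gone"      # errors disappeared after the swap
    if err_slot_after == err_slot_before:
        return "slot"      # same slot, different module: suspect board/CPU
    if after[err_slot_after] == before[err_slot_before]:
        return "module"    # errors followed the module: suspect the DIMM
    return "unclear"


if __name__ == "__main__":
    before = [f"M{i}" for i in range(8)]
    after = rotate_modules(before)
    print(after)
    print(classify(before, after, err_slot_before=3, err_slot_after=4))
```

If the errors move with the module, the DIMM is suspect; if they stay with the slot, the board or the memory controller is the more likely culprit. The "gone" case is the puzzling one we have seen several times.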
It seems that the last BIOS upgrade has reduced the ECC error rate
somewhat.
We definitely have no temperature problem. As far as I can see (libsensors)
the voltages are OK, too.
Cheers, Thomas
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf