----- "Chris Samuel" <csamuel at vpac.org> wrote:
> Does anyone have any bright ideas ?
Wow, thanks so much to everyone who responded on this
both to the list and in private, very much appreciated!
Given there were so many of these I thought I'd try and
comment on the main points that people raised rather than
reply individually.
1) Power (lots of people)
The vendor swapped in a new PSU in one of these nodes
this morning, so we are resuming attempts to reproduce
this failure now.
The odd thing that we've noticed is that this often
seems to happen when the node is only partly loaded
(though not exclusively); for instance at one point
we saw a node fail with Fluent running on 4 cores and
a home-grown code on another core (3 cores idle).
2) HT lockups (Scott and potentially Don)
We've seen the same "System Firmware Error" messages
on some of our nodes, sometimes associated with a
system lockup, so we're going to look into BIOS
upgrades.
3) Fluent
Well, we had a node power off this morning that wasn't
running Fluent, but instead had a 4-CPU Gaussian job,
some NAMD processes from various jobs and some random
user-compiled code.
I don't know whether to be glad that Fluent isn't
so special or worried that other code can kill nodes. :-/
4) IPMI (Bogdan)
We wondered if the IPMI/BMC module might have done the
power off too, but we would hope that we would see
something in the logs.
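
For anyone wanting to check the same thing, here is a minimal
sketch of scanning the BMC's System Event Log for power-related
entries. It assumes ipmitool is installed and that
`ipmitool sel list` works on the node. The keyword list and the
function names (filter_sel, read_sel) are made up for illustration,
and the exact wording of SEL entries varies between BMC vendors,
so treat the keywords as a starting point only.

```python
import subprocess

# Assumed keywords, not an official list; adjust them to whatever
# your BMC actually writes into its event log.
KEYWORDS = ("power", "shutdown", "thermal", "temperature")


def filter_sel(text, keywords=KEYWORDS):
    """Return the SEL lines that mention any keyword (case-insensitive)."""
    hits = []
    for line in text.splitlines():
        lowered = line.lower()
        if any(k in lowered for k in keywords):
            hits.append(line.strip())
    return hits


def read_sel():
    """Run `ipmitool sel list` on the local node and return its output.

    Needs ipmitool installed and usually root privileges; raises
    CalledProcessError if the command fails.
    """
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling filter_sel(read_sel()) on a node after an unexplained
power off should show whether the BMC recorded anything; an empty
result would suggest the BMC did not log the event, at least not
under these keywords.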
Anyway, we'll carry on with this using the hints and
tips that people have provided and when (if?) we solve
this I'll certainly update the list with what we find!
Once again thanks so much to all of you who took the
time to reply.
All the best,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency