Hi folks,
I have a 5-node Beowulf cluster with 4 identical "compute" nodes
(IDE disk, VIA processor, etc.) and 1 "master" node (more RAM, more
powerful VIA processor), connected by an unmanaged Ethernet switch.
When I run either Scyld 28cz7 (bproc version 3.1.9) or ClusterMatic 4
(bproc version 4.0.0pre3) with the associated beoserv/beoboot tools on
the master node, only 2 of the 4 identical compute nodes come up and
stay up in the cluster. The other 2 nodes reboot every 2-6 minutes,
either during node_up (apparently during insmod/bpsh of some module or
library) or after coming up. These 2 nodes stay up fine if I boot them
from an on-disk Linux image with networking enabled. However, as soon
as I use the beo tools to control booting from the master node, they
show this strange reboot behavior,
and the master notices the lost connection soon after. The hardware
is relatively new (I guess in this case only the CPU, RAM and NIC
really matter), the BIOS RAM tests succeed every time, and the OS
images get downloaded via PXE/beoboot and boot the phase 2 image
fine. The strange thing is that it is always the same 2 physical
compute nodes that fail in this way under both software systems. I
have stripped down the config and fstab files for the compute nodes
to bare minimums.
Has anyone seen such behavior before? Any hints on how to debug this
problem? Any help in turning my current 3-node cluster into the full
5-node cluster would be greatly appreciated!
Thanks.
V