Compiling Beowulf software

Daniel Ridge wrote:
> Your fresh Scyld Beowulf machine probably does not have LAM installed --
> we ship a very slightly modified MPICH instead.
> BTW: I had a conversation with Jeff Squyres on this point and I think we
> might be able to get LAM to support the Scyld Beowulf platform with only a
> small amount of work.
That would be rather nice. In our tests, LAM has generally performed
better than MPICH (which has 36% higher latency). Also, LAM shared
memory performance using the usysv transport on our SMP boxes is about as
good as the hardware can deliver (1 microsecond latency, 266 Mbyte/s
peak bandwidth). MPICH shared memory performance is not as good (16
microsecond latency, 235 Mbyte/s peak bandwidth). On the minus side,
LAM requires an auxiliary process lamd on each node.
While we are on the topic of daemons on nodes, PBS uses some (pbs_mom).
Also, some networks require daemons on nodes (e.g. Giganet cLAN uses
clanmgr and clanagent). It would be nice if this could be incorporated
into a Scyld cluster on a per-node basis (e.g. some nodes may need such
daemons, others not). Given that clusters are often built with
different generation hardware sets, there may be other specific
requirements (e.g. different lm_sensors or ECC monitoring modules). A
mechanism similar to using /var/beowulf/boot.img.# to load node # with
its own boot file may help (we were able to use this to load SMP or
uniprocessor kernels into nodes as appropriate).
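To illustrate the boot.img.# convention, here is a rough sketch of the
per-node lookup we rely on: serve /var/beowulf/boot.img.<node#> when it
exists, otherwise fall back to the default boot.img. (The helper name
boot_image_for and the BOOT_DIR override are mine, not Scyld code.)

```shell
#!/bin/sh
# Hypothetical helper mirroring the boot.img.# convention:
# prefer a node-specific image, fall back to the default one.
BOOT_DIR="${BOOT_DIR:-/var/beowulf}"

boot_image_for() {
    node="$1"
    if [ -f "$BOOT_DIR/boot.img.$node" ]; then
        echo "$BOOT_DIR/boot.img.$node"
    else
        echo "$BOOT_DIR/boot.img"
    fi
}
```

With this layout, dropping an SMP kernel image in as boot.img.3 changes
only what node 3 boots, and removing the file reverts that node to the
cluster-wide default.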
Could the Scyld node_up script be augmented to carry out node-specific
hardware initialization and daemon startup? Perhaps the default
node_up could try starting a node_specific.# script after finishing the
default setup...
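A minimal sketch of that idea, assuming node_up is handed the node
number as its first argument (the hook function, the node_specific.#
naming, and the script directory are all assumptions on my part, not
actual Scyld code):

```shell
#!/bin/sh
# Hypothetical tail end of node_up: after the default setup, run a
# per-node script (node_specific.<node#>) if one has been provided.
NODE="${1:-}"
SCRIPT_DIR="${SCRIPT_DIR:-/etc/beowulf}"

run_node_specific() {
    node="$1"
    script="$SCRIPT_DIR/node_specific.$node"
    if [ -x "$script" ]; then
        # e.g. load the right lm_sensors module, start clanmgr, etc.
        "$script" "$node"
    fi
}

run_node_specific "$NODE"
```

Nodes without a node_specific.# file would be untouched, so the default
node_up behavior stays exactly as it is today.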
Sincerely,
Josip
--
Dr. Josip Loncaric, Senior Staff Scientist mailto:josip at icase.edu
ICASE, Mail Stop 132C PGP key at http://www.icase.edu/~josip/
NASA Langley Research Center mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA Tel. +1 757 864-2192 Fax +1 757 864-6134