The Beowulf mailing list provides detailed discussions about
issues concerning Linux HPC clusters.
In this article I review some postings to the
Beowulf list on
Clos networks, Numactl and Multi-Core chips, and Parallel Storage.
I found some threads that I thought
were very useful to highlight. Although the discussions are several years old, I think there is still
useful information that can be used by everyone.

Clos networks

There are many, many types of networks - almost as many as there are
opinions about clusters! One that has been used for a while and perhaps
people did not know about it, is the Clos network. A
Clos network
was originally designed for telephone switching and are used in
computing networks for growing switch infrastructures to large port
counts. On
June 7, 2005,
Joe Mack, a long time cluster expert, asked a question about Clos
networks. In particular, he wanted to know if smaller Clos networks
(3 layers and below) are blocking and if they have 5 or more layers,
if they are non-blocking.

Greg Lindahl was the first to reply and explain that blocking and
non-blocking Clos networks are all a myth. He explained that there is
always interference on the network because of traffic from other
nodes (the blocking and non-blocking comments were for the telecom
industry).

{mosgoogle right}

There was some discussion about what is meant by "blocking" and
"non-blocking" but then Patrick Geoffray from Myricom
responded
with a clear, but long, response. Patrick explained that
blocking for Clos networks, particularly for Myricom's implementation,
you can get a "classical" wormhole blocking where the FIFO
(First In, First Out) on the input port is full and the requested
output port is used by another packet at that precise time. But he
pointed out that this is independent of the dimension of the Clos.
Patrick also explained how blocking can happen and how networks are
designed to perhaps alleviate some of the possibility of blocking.
Interestingly, he also pointed ways to reduce the overall
contention in a Clos network:

Load balance the routes over redundant paths

Fragment messages (i.e. pass smaller packets if possible)

Use route dispersion (Use multiple routes per destination)

Patrick went on to say, "There is just a maximum number of nodes you can
connect on a Clos of diameter n with N-port crossbars. In 2002, at
the time of the referenced email, the Myrinet crossbar we
shipped had 16 ports. So the scaling table was the following:"

Clos diameter: 1 3 5 7
Max nodes count: 16 128 1024 8192

Then Patrick said, "We now use a 32-port crossbar in the 256 ports
box, so the crossover points have shifted a bit:"

Then Patrick finished up with some theoretical discussion about
Clos networks. It's a good email to read if you are interested in
networking.

This was a nice, albeit short, discussion about Clos networking. The
interesting part is that there is more than one way to skin a network cat,
so to speak, for cluster network design. If you are at all interested
in networking for clusters, be sure to read this thread.

Numactl and Multi-Core chips

The entire world is multi-core now. I fully expect the next Christmas
toy craze, whatever it is, will have toys with a couple of cell chips
in there
performing image recognition, speech synthesis, and robotic movement.
(Could I make a cluster out of Teddy Ruxpin 2? Never mind).

A discussion started on the beowulf list about performance on a dual-core
Opteron 275 node (the discussion even dared to talk about Gaussian
performance, but that is verboten). But the discussion started talking about
processor affinity, locking a process to a specific core, and the tools
that surround this. While not the beginning of the discussion, a
post by
Igor Kozin, a very knowledgeable poster to the beowulf list, talked about using
1 core per socket for some performance testing. He mentioned using
'taskset'
for that purpose. Then Mikhail Kuzminsky
mentioned
the command,
numactl.
Both tools can be used to produce the desired outcome - pinning a process
to a specific core and/or to make sure computationally intense tasks
don't end up sharing the same socket (or core) if another socket (or core)
is free.

Another project that is worth looking into is the Portable Linux Processor Affinity (PLPA). Started as part of the OpenMPI effort, the PLPA is a single API that attempts to solve the problem that there are multiple API's for processor affinity within the Linux family. Specifically, the functions sched_setaffinity() and sched_getaffinity() have numbers and types of parameters depending on your Linux vendor and/or version of glibc.

This can
be an issue with NUMA (Non-Uniform Memory Access) architectures such as
the Opteron. With the Opteron you have a bank of memory tied to each
socket. Plus the CPU has the memory controller on-board. But each
socket (core) can access the memory on the other socket.
You may have a numerically intensive process
on one socket, but you may be accessing memory on another socket (not
good for latency).

Mark Hahn,
posted
that the sched_setaffinity function actually does most of the work
But he
also pointed out that using affinity for all tasks on a node might not be
a good idea. For example, he wondered if having one of the cores do the
interrupt handling would be worthwhile (but he did point that many Opteron
boards at the time tied the IO bridge to a single socket).

Then the discussion branched off to talk more about numactl and
what it does for you. Vincent Diepeveen
posted
an email about some testing he did with numactl and measuring
latency in his own program. He saw some differences in latency when using
the various cores on the node (there were 4 cores on the board).
Stuart Midgley
then pointed out that numactl and indeed all affinity tools are
not designed to help latency. He said that the 2.6 kernel does a pretty
good job of putting the pages on the memory controller attached to the core
that the process is using. But things are not perfect, so occasionally the
memory pages can get spread around. He also pointed out that, "... the
system buffer cache will get spread around effecting everyone." Then he
went on to say, "With numactl tools you will force the pages to be allocated
on the right memory/cpu. The processes buffer cache will also be locked
down (which is another VERY important issue)... ." Stuart mentioned that
he has used numactl tools to double the performance of his codes.

I have also seen the Linux kernel scheduler move processes around the cores.
In particular, when a daemon wakes up and needs some CPU time, the kernel
can move a numerically intensive task from one core and put it on another
so the daemon can use the first core. This move can be very costly in terms of
performance since you now have two numerically intensive tasks sharing
the same core. But, once the daemon does what it needs to and goes back
to sleep, the kernel is pretty good about moving one of the numerically
intensive tasks back to the free core. If you have a number of daemons on
the node, then this is more likely to happen. For MPI jobs this can impact
overall performance. It can also affect the repeatability of performance.
For example, if you run the same code a number of times, the spread in
the performance is much larger when you don't use numactl than
if you use it. So using numactl can help your performance and
help get more repeatable timings. You can also help yourself by turning
off as many daemons as possible on the compute nodes.

Vincent thanked Stuart but also wanted to know why the memory latency
for dual-core CPUs was worse than for single-core CPUs.
Stuart
responded that, "The first thing to note is that as you add cpu's the
cost of the
cache snooping goes up dramatically. The latency of a 4 cpu (single
core) Opteron system is (if my memory serves me correctly) around
120ns. Which is significantly higher than the latency of a dual
processor system (I think it scales roughly as O(n^2) where n is the
number of cpu's)." People generally don't know about "cache snooping"
that Stuart mentions. This is a term to describe the cores all telling
each other what they have in cache (accessing data in cache is faster
than accessing in main memory even if the data is on another chip so
they share data in their cache).
Cache snooping is one reason you don't see large socket Opteron systems
much anymore (I've been told that SGI uses hardware to help their cache
snooping).

Stuart went to say, "Now, with a dual core system, you are effectively
halving the
bandwidth/cpu over the hyper transport AND increasing the cpu count,
thus increasing the amount of cache snooping required. The end
result is drastically blown-out latencies." A very good explanation
of what is going on in the system.
Mikhail
Kuzminsky went on to discuss some details about the cache snooping
(or cache coherency) on the Opteron. In his estimation,the 30% increase
in memory latency that Vincent was seeing on the dual-core compared to
a single-core is due to the cache coherency effects.

Unfortunately Vincent has an "abrupt" email style and disagreed with
Stuart's explanation and accused him of mixing latency and bandwidth
concepts. But Stuart
went on
to give some explanation of cache snooping. In particular he also
discussed how latency and bandwidth can be linked together. A number of
other people chimed in to say that Stuart was correct.

This was a very good discussion, albeit brief, about the importance of
thinking of placement of numerically intensive tasks on multi-core
systems, particular NUMA architectures. And guess what, this problem
is only going to get more important as the number of cores per socket
increases. In the case of MPI codes, some libraries have the ability
to schedule jobs on relatively low loaded CPUs (Scali MPI Connect is
one that comes to mind, but I know there are more). Also, Doug
Eadline has talked about this problem in a number of his columns in
Linux Magazine.

Parallel Storage

On Aug 11, 2005, Brian R. Smith,
asked
about using a parallel file system as a centralized file system for their
clusters. He went on to mention that they wanted a common /home
for all of their clusters to limit the copying of data. Brian also
mentioned that they had been using
PVFS2 with great success. In particular,
he mentioned that it has been very stable (no unscheduled outages in
8 months as of the posting of the email).

Joe Mack
responded
that he thought AoE (ATA over Ethernet) was cheap (although he had no
experience with it.

Then one of the PVFS2 developers, Rob Latham,
jumped
on the thread. He pointed out the typical access pattern of /home
is much different and probably not appropriate for PVFS2. But, then again,
PVFS2 wasn't designed for that. He suggested an NFS mounted /home
on the clusters and then a /scratch using PVFS2.

{mosgoogle right}

Mark Hahn, also
responded
that he forced users on his clusters to put their "real" data in appropriate
places. They gave each user a small quotes for /home but lots of
space in other file systems: lots of space in /work which is shared
across clusters, /scratch which is local to the nodes, etc. Mark
also pointed out that you should pay attention to RAID schemes based upon
the usage pattern intended for the file system.

Then David S. and Guy Coates started talking about creating a "storage cluster"
that is a set of disparate hardware with some sort of cluster file system
on top. David
initially
thought about using
NBD,
iSCSI, or something else
to tie everything into a file system and then using
GFS. Guy
responded
that you could indeed do this and they have done it using
GPFS
instead of GFS. He said that they have good reliability with this system.

This thread is very short, but it does point out some interesting things.
First, cluster users and administrators, even 2 years ago, were starting to think
about a centralized file system for their clusters. This makes a lot of
sense because as data files grow, it doesn't make sense to move large amounts
of data to and from various clusters, file systems, and even to the desktop
of the users. It makes much more sense to have a centralized pool of storage
and use that for the clusters. Since you have one pool of storage it also
makes sense for it to be a high performing file system that can scale in
terms of capacity and performance.

I think one of the more interesting ideas in this thread was how to organize
your data. Mark Hahn pointed out that there is more than one way to skin
the storage cat and his way seems to work well for his users. There have been other
discussions on the Beowulf list about organizing your data. There is really
something to be said for sitting down and determining what most of the data
looks like (size, type, etc.) and then match the file system layout to your
data pattern.

This article was originally published in ClusterWorld Magazine. It has been
updated and formatted for the web. If you want to read more about HPC
clusters and Linux you may wish to visit
Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home
computer (donations gladly accepted) so he can store all of the postings to
all of the mailing lists he monitors. He can sometimes be found lounging
at a nearby Fry's, dreaming of
hardware and drinking coffee (but never during working hours).

Unfortunately you have Javascript disabled, please enable Javascript in order to experience the comments correctly