Grids, disks, directories, and weird OS tricks (wheee!)

The Beowulf mailing list provides detailed discussions about issues
concerning Linux HPC clusters. In this column I report on using
semi-public PCs for grid-type applications and on how to handle large
numbers of files. We also turn to the ganglia-developers mailing
list to report on how one can add a "disk alive" metric to ganglia.
You can consult the Beowulf archives, the bioclusters archives, and
the ganglia archives for the actual conversations.

Using Semi-Public PCs

There was an interesting discussion a few months back on the
bioclusters mailing list about using semi-public PCs for heavy
computational jobs. On Feb. 15, 2004, Arnon Klein asked about running
his jobs on semi-public machines running various flavors of
Windows. Arnon asked because he is doing his graduate research and
needs computational power. He had already exhausted the machines
easily available to him, so he was looking for suggestions about what
to do next.

The first response came from Chris Dwan. Chris responded that he was
in a similar boat but had managed to pull together systems from various
campuses into something like a grid. He also provided a very useful
ranking of systems in terms of access difficulty. For example, systems
that he maintains were easiest to get into, followed by systems running
Linux or OS X (which Chris also runs). The two lowest-ranked categories
were Windows machines that either could be rebooted at night or could
not be rebooted at all. Chris went on to talk about schedulers
that can steal cycles from idle workstations (e.g., SGE, Torque, LSF),
although he cautioned that integrating disparate schedulers can be very
difficult. He did mention
Condor from the University of Wisconsin as
a possible solution. He also mentioned the grid software from
United Devices, which runs
on Windows machines but can use compute cycles
from other machines.

Farud Ghazali mentioned that he, too, was looking for a solution to
this type of problem. He pointed out that there were many practical
difficulties, including authentication across disparate resources.
Chris Dwan jumped in to explain how he had hacked up something to
do authentication for him.

Ron Chen joined the conversation to mention that SGE (Sun Grid Engine)
version 6.0 would integrate with JXTA, which in turn offers Jgrid,
providing P2P (peer-to-peer) workload management in a fashion similar
to SETI@home. However, he did say that SGE 6.0 wouldn't be out until
May of 2004 (and might slip slightly from then). Until then, Ron
recommended using
BOINC. This package starts
jobs and transmits data using port 80, which makes it easier to
get in and out of a firewall than other approaches. It also has
versions for Windows, Linux, Solaris, and OS X. John van Workum
also mentioned
GreenTea, which offers a Java
P2P client that provides grid capabilities for running jobs. Bruce
Moxon added that the Cornell Theory Center has some tools that might
help with Windows machines.

While this discussion was short, it did offer some ideas that could
help people in similar situations. There are many people and groups
thinking about the same problems that Arnon raised in his first posting.

Disk Alive Metric

I'm sure many readers are aware of
ganglia. It is a scalable
distributed monitoring system for high performance computing
systems such as clusters and grids. It is open source and in use
on over 500 clusters throughout the world. On December 22, 2003,
on the
Ganglia Developers
mailing list, Federico Sacerdoti asked
about a metric that ganglia could watch to report whether a
disk was alive or not. It seems that Federico had been talking to a
Purdue (my alma mater) system administrator about a cluster put
together from old PCs. The disks in the machines keep failing, but
ganglia fails to report the disks as down since the ganglia
daemon will still report a heartbeat even if the node is effectively
down. Federico posted a possible solution that he worked out with
the administrator but had not yet tried.

Brooks Davis replied that he didn't think it would work, at least
in FreeBSD, because of the way Unix and Unix-like systems work. He
did offer another solution that read random blocks from a file
system to make sure the drive was still functioning.
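
No code accompanied that suggestion, but the idea is simple enough to
sketch. Below is a minimal Python sketch of a random-block read check;
the probe file path, block size, and retry count are illustrative
assumptions, and a real probe would also want to defeat the OS page
cache (e.g., via O_DIRECT), which is platform specific.

    import os
    import random

    def disk_responds(probe_file, block_size=4096, tries=3):
        """Read a few random blocks from probe_file to check that the
        underlying drive still answers. Seeking to widely spaced
        offsets makes buffer-cache hits less likely, but does not
        guarantee the read touches the physical disk."""
        size = os.path.getsize(probe_file)
        if size < block_size:
            return False
        try:
            with open(probe_file, "rb") as f:
                for _ in range(tries):
                    f.seek(random.randrange(0, size - block_size + 1))
                    if len(f.read(block_size)) != block_size:
                        return False
        except OSError:
            return False
        return True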

Robert Walsh responded that he had been trying to get information
from the SMART (Self-Monitoring, Analysis, and Reporting Technology)
data in most hard drives into ganglia. Brooks Davis
mentioned that he thought integrating
smartmontools
with ganglia might offer a solution.
smartmontools is a package that allows you to control and monitor
the SMART data contained in virtually all modern hard drives.
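
The thread did not settle on a particular implementation, but gluing
the two together is easy to imagine using smartmontools' smartctl
command and ganglia's gmetric command-line tool. In the hypothetical
Python sketch below, the device list and the disk_health metric name
are my inventions, and smartctl generally needs to run as root:

    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]  # hypothetical device list

    def smart_healthy(dev):
        """Ask smartctl for the overall SMART health assessment."""
        out = subprocess.run(["smartctl", "-H", dev],
                             capture_output=True, text=True).stdout
        return "PASSED" in out

    for dev in DEVICES:
        value = "1" if smart_healthy(dev) else "0"
        # Publish through ganglia's command-line metric tool; the
        # metric name is an invention, not a ganglia standard.
        subprocess.run(["gmetric",
                        "--name", "disk_health_" + dev.split("/")[-1],
                        "--value", value, "--type", "uint8"])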

The discussion spilled over into January of 2004, when Sander van
Vliet announced that he had a preliminary working version of
gmetric code that would test whether the drives were alive. The
code walks the /proc/mounts file looking for drives that are
mounted and then attempts to write 4 bytes to the end of the currently
used file system to determine if the disk is alive. If there are
no errors along the way, the disk is alive. Sander then posted
that he had a version of his code working that used the SMART data,
but the job had to be run as root. This problem was sorted out fairly
quickly, though. Throughout the conversation there was an effort
to make the code work under Linux and the various BSD flavors,
especially FreeBSD. At this point the thread died out, but it appears
as though the code was working correctly for Linux and FreeBSD.
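
Sander's actual code isn't reproduced here, but the description above
translates into a short Python sketch. It walks /proc/mounts and
attempts a small write on each file system backed by a real device;
writing (and removing) a separate probe file with an arbitrary name
stands in for appending 4 bytes to an existing file:

    import os

    def alive_filesystems(probe_name=".disk_alive_probe"):
        """Walk /proc/mounts and try a tiny write on each mounted
        file system. Returns a dict of mount point -> True/False.
        The probe file name is arbitrary."""
        status = {}
        with open("/proc/mounts") as mounts:
            for line in mounts:
                device, mountpoint = line.split()[:2]
                if not device.startswith("/dev/"):
                    continue  # skip proc, tmpfs, and friends
                probe = os.path.join(mountpoint, probe_name)
                try:
                    with open(probe, "wb") as f:
                        f.write(b"ping")  # the 4-byte write from the thread
                        f.flush()
                        os.fsync(f.fileno())  # push it past the page cache
                    os.unlink(probe)
                    status[mountpoint] = True
                except OSError:
                    status[mountpoint] = False
        return status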

Large Number of Files

In some cases, the bioinformatics world has a need for handling
large numbers of files. This need can become a problem when you are
trying to address over 10,000 files in one directory! People
with large MP3 collections can sympathize. The
bioclusters mailing
list had a brief but very interesting discussion about how to handle this.
On Jan. 28, 2004, Dan Bolser posted a question looking for new
information on an old problem: working with directories containing over
10,000 files. Dan had some tools to get around the problem of
handling this number of files in bash scripts, but felt that the
file system was sluggish in working with the files. He said that the
file systems used a linear, unindexed search of directories to find
files. He also said that he had accidentally created a directory with more
than 300,000 files, which he referred to as a
"... death trap for the system." He posted some quick thoughts about
using a hash table to access the files, with each node in the hash
table being a directory. You would then follow the directory structure
to find the file.

Elijah Wright posted that
ReiserFS was designed to cope with exactly
this problem (accessing files in directories with a large number of
files). Joe Landman said that he liked
XFS because it uses B*-trees,
which can easily handle this situation. He said that, in theory, XFS
can handle more than 10**7 files per directory. He thought
JFS could
handle on the order of 10**4 files per directory. Joe felt that none
of the other file systems could handle this problem. Arnon Klein offered
the possibility of using MySQL in a file system manner. In particular,
he mentioned
LinFS, which is a file
system of sorts that uses MySQL as a backend.

Dan, the original poster, mentioned that he would try to persuade the
administrators to try ReiserFS or XFS. Joe Landman offered the opinion
that if the administrators would not switch, then the hash table
idea that Dan originally mentioned should work well. Joe also mentioned
that he had been badly burned by ReiserFS in the past. Elijah Wright
and Joe Landman also pointed out that XFS and ReiserFS are not really
"new" file systems in that they have been around for several years.
Joe Landman also posted some information about ext3. He said that under
heavy journal pressure (performing lots of I/O to files) ext3 had
problems. He said the journal can become a liability because he
felt it wasn't optimized yet. Joe said that he had several customers
who were regularly seeing problems when using ext3 and software RAID.

To end the discussion, Tim Cutts posted a nice short Perl script for
hashing filenames. It has a hash depth of two directories, and Tim
said it was good for up to about 64 million files.
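
Tim's Perl script isn't reproduced here, but the scheme translates
directly into other languages. Here is a minimal Python sketch of the
same depth-two idea (the function names are mine): two levels of 256
directories each give 65,536 leaf directories, and at roughly 1,000
files per leaf that works out to about 64 million files.

    import hashlib
    import os

    def hashed_path(root, filename):
        """Map a filename into a two-level directory hash, e.g.
        root/a3/7f/filename, so no single directory grows huge."""
        digest = hashlib.md5(filename.encode()).hexdigest()
        return os.path.join(root, digest[0:2], digest[2:4], filename)

    def store(root, filename, data):
        """Write data under the hashed path, creating the two
        intermediate directories on demand."""
        path = hashed_path(root, filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)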

The discussion was interesting in that it shows how one can use file
systems to improve application performance and, if that doesn't
work or isn't possible, how one can use simple user-space scripts to
get around problems. While writing scripts to handle such problems may
not be the ideal solution for many people, it does allow you to
solve your problems.

Hypothetical Situation

Brent Clements posted an interesting conundrum to the
Beowulf mailing
list. He had received requests from researchers to use a queuing/scheduling
system to submit kernel builds and reboots. Preferably, a normal user
could compile a customized kernel and boot a cluster node with it.
When the job finished, or if the node failed to boot, it would reboot
to the baseline kernel.

A variety of solutions were proposed. Many thought UML (User
Mode Linux -- Linux running Linux) might do the trick, but they were
not sure how to incorporate it into a batch system. Others thought
diskless nodes with PXE/DHCP booting were the way to go. After
considering all the input, Brent proposed a series of "stock" kernels
known to work with their cluster. The researchers could then modify
the source and submit their job using a Perl script his group had
developed. The script allows the users to reboot the allocated
nodes with the new kernel via DHCP and TFTP. If the nodes don't
respond within 15 minutes, they are rebooted with a stock
kernel.
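
Brent's Perl script wasn't posted, so the sketch below is only the
general shape of the approach in Python. The pxelinux config layout,
the "custom" and "stock" label names, and the use of ssh and ping are
all assumptions for illustration; a production version would use
out-of-band power control (e.g., IPMI) to recover a node that hung
before its network came up.

    import os
    import subprocess
    import time

    TFTP_CFG = "/tftpboot/pxelinux.cfg"  # assumed pxelinux layout
    TIMEOUT = 15 * 60                    # the 15-minute window above

    def point_node_at(node_cfg, kernel_label):
        """Rewrite the node's pxelinux config to boot a given label.
        A real config would also define the labels; elided here."""
        with open(os.path.join(TFTP_CFG, node_cfg), "w") as f:
            f.write("default %s\nprompt 0\n" % kernel_label)

    def reboot_and_wait(node, node_cfg):
        point_node_at(node_cfg, "custom")
        subprocess.run(["ssh", node, "reboot"])  # assumes ssh access
        # Crude: a robust version would wait for pings to stop first.
        time.sleep(60)
        deadline = time.time() + TIMEOUT
        while time.time() < deadline:
            # One ping probe; return code 0 means the node answered.
            if subprocess.run(["ping", "-c", "1", "-W", "2", node],
                              stdout=subprocess.DEVNULL).returncode == 0:
                return True
            time.sleep(30)
        # The node didn't come back in time: fall back to the stock
        # kernel on the next boot (power-cycling needs IPMI or similar).
        point_node_at(node_cfg, "stock")
        return False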

This discussion is interesting for several reasons. The first is
that you do see performance differences between kernels. I have seen
differences between the RH series of kernels and the SLES series of
kernels (never mind the difference between 2.4 and 2.6). The second
reason is that there may be some interest in having users run in
a "sandbox" on the nodes so that if they crash the OS, they won't
crash the node. There is likely to be some performance penalty to
pay for this capability, but it does allow the node to stay up so that
it doesn't require a reboot. The third reason is that it would be very
simple to write a queuing/scheduling script that installed an OS on a
node before running a job on it. The nodes would run some base OS that
is known to be stable and robust, with all of the latest security
patches. Then, when a job is run on a node, an OS is installed using
something like Xen or UML before the job starts. If the "virtual" OS
is fairly small, then the lost time is not too important, and you
ensure there is no OS skew among the nodes running the job (believe
it or not, this is always a problem).

This is a very cool subject and one that we are likely to see more
of in the future.

This article was originally published in ClusterWorld Magazine. It has been
updated and formatted for the web. If you want to read more about HPC
clusters and Linux you may wish to visit
Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far
too much time reading mailing lists. He can be found hanging around the Monkey
Tree at ClusterMonkey.net (don't stick your arms through the bars though).
