My answers to Will's comments and questions are interspersed below.
-- Rob
William Cohen wrote:
> Nathan Tallent wrote:
>
>>
>> We would like to make OProfile work on Clustermatic, a type of Beowulf
>> cluster (www.clustermatic.org). Such a cluster consists of a master
>> node along with several *diskless* slave nodes, where the master node
>> contains support for a global process space across all the nodes
>> (BProc). For current purposes, the key is that the slave nodes are
>> diskless, with all file system support passing through a very small
>> RAM disk. One way to handle large amounts of I/O is to have the
>> master node (which has a disk) act as an NFS server for the slave
>> nodes (where an NFS mount point would exist in each node's RAM disk).
>>
>> Our problem is that OProfile currently invariably stores configuration
>> information in /root/.oprofile/daemonrc and samples in /var/lib/oprofile.
>>
>> We think we have a temporary workaround using a startup script that
>> creates a symbolic link from /var/lib/oprofile to /NFS/oprofile/nodeXX,
>> where /NFS is the NFS mount. Since the config file is small, we can
>> temporarily just store it in the RAM disk.
>>
>> To support Clustermatic in the long run, we would like to add a
>> configuration option that allows samples and configuration information
>> to be stored in a different 'base directory' (e.g.
>> /NFS/oprofile/nodeXX). DCPI behaves in a similar fashion and we have
>> found the ability to choose the location of the profile database to be
>> very useful. (We do know the OProfile databases can be moved *after*
>> the fact using oparchive, but it doesn't address our core problem of
>> diskless nodes.)
>
>
>
> There would need to be some modifications to op_mangle_filename(),
> which converts the file name into a path for the sample file. Right
> now the path to the current sample directory is compiled in.
>
> oparchive cheats by making a tree structure that mimics the original
> file tree on the machine taking the data. oparchive includes the needed
> binaries. Thus, it and the analysis tools just prepend the path to the
> tree.
>
> For clusters, the approach used by oparchive leaves something to be
> desired. It would be preferable for the software not to make a copy of
> the executable for each node in the cluster; that would be a waste in
> a single-image environment.
For our existing tools, we've assumed that sources and binaries will be
available, but not necessarily at the same paths that existed either
when the application was built or when it was run. Our solution is to
give paths explicitly to the tools and to provide substitution rules for
replacing one path prefix with another.
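For example (purely illustrative syntax, not our tools' actual
options), an analysis run might pass the binary's current location plus
a rule mapping the build-time source prefix onto the one visible now:

    # Hypothetical command line, for illustration only.
    analyze --binary=/shared/apps/foo/bin/foo \
            --path-subst='/home/joe/foo/src=/shared/apps/foo/src' \
            profile-data/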
>
> Are the cluster's processors homogeneous? OProfile currently expects
> that all the processors in the machine support the same performance
> events. It would be quite possible to build heterogeneous clusters,
> e.g. Pentium M and Pentium 4. Even with the same processor
> architecture, processors can have different clock rates. This would
> affect event selection and analysis.
BProc clusters are homogeneous. We could handle heterogeneous clusters
by adding architecture/implementation specificity to the scripts that
start the daemon.
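A crude sketch of what that could look like follows; the event names,
counts, and the use of opcontrol are illustrative, and the real choices
would come from ophelp on each node type:

    #!/bin/sh
    # Illustrative: pick a time-based event per CPU type, then start
    # the daemon.  Event names and counts are examples only.
    case "`grep -m1 'model name' /proc/cpuinfo`" in
        *'Pentium(R) 4'*) EVENT=GLOBAL_POWER_EVENTS:100000 ;;
        *)                EVENT=CPU_CLK_UNHALTED:100000 ;;
    esac
    opcontrol --setup --event=$EVENT --vmlinux=/boot/vmlinux
    opcontrol --start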
>
> What kind of analysis is being done on the collected data? Accumulating
> samples for a function across all the processors? Just looking at the
> performance of individual nodes? Finding which nodes were outliers with
> many more (or fewer) samples than the average?
Here's a typical scenario we currently use on clusters with DCPI, PAPI,
or OProfile on systems with local disks:
1) On each node of the job, the batch scheduler:
     starts the daemon,
     runs the user's job, and
     stops the daemon.
   (Where there are local disks, optionally run preprocessing
   filters in parallel on compute nodes to extract data in our
   format.)
   A copy step, or the optional filtering, writes data to a place like
   /scratch/username/jobname.number/node_xx, i.e. one directory per
   node. The script creates these with the right ownership and
   permissions. (A skeleton of this per-node driver appears after
   this list.)
2) On non-PAPI systems, a high-level analysis is done to look at how
   time is spent on each node, with breakdowns for the application,
   DSOs loaded by the app, MPI threads, system, etc. We're looking
   for gross anomalies such as nodes running dramatically slower
   than the rest, or processes that shouldn't be running on these
   nodes.
3) An "interesting" multi-profile (line-level profile with multiple
   metrics) is extracted from the data for each node.
4) Statistical analyses are applied to the collection of
   multi-profiles to identify groups of nodes that behave similarly.
   This clustering can be systematic, e.g., boundary vs. interior
   nodes, or there can be anomalies, e.g., load imbalances,
   speed/heat issues, etc. (This is a current research thrust.)
5) If there are problems, do detailed browsing/analysis of
   representatives of the major statistical clusters and of the
   outliers to diagnose and fix performance problems.
All of the processing in steps 1-4 can be automated with scripts, so
the user/programmer can focus on the analysis/interpretation issues.
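For concreteness, the per-node portion of step 1 might look roughly
like the following on a node with a local disk. The paths, the
job-launch command, and the sample location are placeholders, and the
opcontrol calls are just one way of driving the daemon:

    #!/bin/sh
    # Rough skeleton of the per-node driver for step 1; paths and the
    # job-launch command are placeholders, not a finished script.
    DEST=/scratch/$USER/$JOBNAME.$JOBID/node_`hostname`
    mkdir -p "$DEST" && chmod 700 "$DEST"
    opcontrol --start                          # start the daemon
    run_the_users_job                          # placeholder for the app/MPI launch
    opcontrol --dump && opcontrol --shutdown   # flush samples, stop the daemon
    # Optionally filter locally; otherwise just copy the raw samples off:
    cp -r /var/lib/oprofile/samples "$DEST"/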
>
> Each node would need its own sample directory. How many nodes are in
> the clusters? I am just wondering whether there are going to be issues
> with having tens of thousands of directories in a single directory and
> having lots of open file descriptors for the processing nodes. What
> about bandwidth issues when saving the sample files off the processing
> nodes?
On the DCPI clusters the number of nodes can be hundreds or thousands,
but each compute node should only have a few file descriptors open.
Whether or not the parallel file system can handle this is another
issue. We worry about this and the bandwidth issue, but don't intend to
spend a large amount of time on them until we know how bad the problems
are.
When the data movement cost is incurred is an issue. If data is only
moved at the end of the job, then the main concern is that the system
not fall over. On the other hand, instrumentation overhead that
competes with the application is a problem.
One important mode of operation for big, long-running applications
(colliding black holes, dinosaur-killing asteroids) is to collect data
for a 5-minute window every couple of hours and to take a look to
ensure that nothing horrible has happened to performance. Printfs
within the application can detect the onset of problems, but looking
at profiling data is necessary for diagnosis.
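As a sketch, that windowed mode can be driven by a simple loop on each
node; the interval lengths and the step that stashes each window are
placeholders:

    #!/bin/sh
    # Illustrative only: a 5-minute profiling window every ~2 hours.
    while true; do
        opcontrol --reset && opcontrol --start
        sleep 300                    # the 5-minute collection window
        opcontrol --dump && opcontrol --stop
        stash_this_window            # placeholder: copy or summarize samples
        sleep 6900                   # wait out the rest of the ~2-hour period
    done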
>
> How do you start up the tasks on the processor nodes? I would like to
> know how each process gets its unique directory. Does a node compute
> the name locally based on its processor name?
Startup on conventional clusters is via a batch script.
On BProc systems, the script runs on a head node and spawns parallel
processes on the compute nodes. `hostname` on compute node 23 returns
"n23", so /scratch/foo/bar/`hostname` would generate a unique path.
Assuming that the driving script ensures that /scratch/foo/bar exists,
is mounted, and has suitable ownership and permissions, we would then
propose to run, e.g.,
"oprofile --destdir /scratch/foo/bar/`hostname` ..." on each node.
>
>> Since we'd ultimately like any work we do to make its way into the
>> official OProfile sources, we wanted to get your comments and blessing
>> on the proposal.
>
>
> It is not much fun maintaining divergent branches or patch sets to apply
> to existing packages. Making the work suitable to be included in the
> upstream package is much more desirable.
>
> -Will
