This article introduces the new features of the Sun HPC ClusterTools 4 software, including best practices for configuration and for mixed clusters. It describes how to configure a checkpointing and migration environment using both Sun Grid Engine and the Condor standalone checkpointing libraries. It also discusses administrative best practices.

This article briefly introduces the features that are new in the Sun HPC
ClusterTools 4 software and discusses administration practices for
configuring the software successfully. The first administration practice
covered in this article has long been requested by HPC customers: granting root
privileges to regular HPC users so that they can maintain the Sun HPC
ClusterTools software. The second practice relates to configuring mixed HPC
clusters. This article also introduces the latest release of the Sun Grid
Engine (Sun GE) software, release 5.2.3.1, and the Condor standalone user-level
checkpointing library. Best practices are given on how to configure a
checkpointing and migration environment by using both the Sun GE software and
the Condor standalone checkpointing libraries.

Introduction to Grid Computing

The products covered in this article are among the fundamental components
needed to build a grid infrastructure. Grid computing has been making headlines
lately and is touted as the new computing paradigm for this decade because it
can increase the return on your computing assets by using your existing
hardware more effectively. The Sun GE software handles compute and resource
management at the cluster level and provides the hooks required to access the
computing grid through established application programming interfaces (APIs),
such as those of Globus and Avaki. The Sun HPC ClusterTools 4 software provides
the distributed parallel programming environment that enables users to execute
their message passing interface (MPI) programs on a Sun UltraSPARC based
cluster. The Condor standalone libraries allow serial, single-threaded programs
to be checkpointed so that they can be restarted later if the need arises.
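
To make the idea of checkpoint and restart concrete, the following minimal Python sketch saves the state of a serial loop to a file after every step and resumes from that file on the next run. It is an application-level illustration only: the Condor standalone library checkpoints at the process level and does not require code like this. The function name, the `stop_after` parameter, and the checkpoint file layout are all hypothetical.

```python
import os
import pickle


def run_with_checkpoints(total_steps, ckpt_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    stop_after simulates an interruption: the function returns None once
    that many steps have been executed in this invocation.
    """
    # Resume from an earlier checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None  # interrupted; the latest state is already on disk
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        # Write to a temporary file, then rename it, so that a crash
        # during the write never corrupts the previous checkpoint.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, ckpt_path)

    # Finished: the checkpoint is no longer needed.
    if os.path.exists(ckpt_path):
        os.remove(ckpt_path)
    return state["total"]
```

Calling the function once with `stop_after` set and then again without it on the same checkpoint path continues the computation where the first call stopped instead of starting over.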

Sun HPC ClusterTools 4 Software Overview

The Sun HPC ClusterTools 4 software is designed specifically for
compute-intensive, technical computing environments and enables the execution of
serial and parallel high-performance applications. It provides middleware to
facilitate and manage a workload of highly resource-intensive applications on
Sun servers, as well as clusters of these servers. Additionally, it provides the
software development environment for creating and debugging MPI applications
that are parallelized and optimized for Sun servers and clusters.

The Sun HPC ClusterTools 4 software is the follow-on to the Sun HPC
ClusterTools 3.1 release. Both versions can be installed on the same system, but
only one release can be active at a time; a reconfig command selects the active
release. The Sun HPC ClusterTools 4 software includes the following new features:

Cluster nodes can span subnets.

Administrators can use the sudo utility to set superuser (root)
privileges.

The software has been optimized for better visualization.

Loadable protocol modules are supported.

UltraSPARC III processors are supported.

The next-generation high-performance interconnect is supported.

The Sun HPC ClusterTools 4 software supports the Solaris™ 8 Operating
Environment. It is also released under the Sun Community Source License. The Sun
HPC ClusterTools 4 software consists of the following components:

The Sun Cluster Runtime Environment (Sun CRE) is a principal component of the
Sun HPC ClusterTools 4 software because it provides the job launching and load
balancing capabilities for MPI-based C, C++, and Fortran programs. The Sun HPC
ClusterTools 4 software supports up to 2048 processes and up to 64 nodes in a
cluster. The software also supports Platform Computing's Load Sharing
Facility (LSF) as a distributed resource manager. LSF provides batch queuing
capabilities and integrated launching of MPI applications. The Sun GE software
supports only the external launcher mechanism for MPI programs and is loosely
integrated with the Sun HPC ClusterTools 4 software (refer to the Sun HPC
ClusterTools 4 software documentation). A Portable Batch System (PBS) that is
fully integrated with Sun CRE was recently made available through a patch.
FIGURE 2 shows the distributed resource management integration types
currently available with the Sun CRE software.
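
The cluster limits quoted above (2048 processes and 64 nodes) can be checked before a job is submitted. The following Python sketch shows one way to do that. It is a hypothetical helper for illustration and is not part of Sun CRE or any resource manager; only the two limit values come from this article.

```python
MAX_PROCESSES = 2048  # per-cluster process limit stated in this article
MAX_NODES = 64        # per-cluster node limit stated in this article


def check_job_request(num_processes, num_nodes):
    """Return a list of problems; an empty list means the request fits."""
    problems = []
    if num_processes < 1:
        problems.append("a job needs at least one process")
    if num_nodes < 1:
        problems.append("a job needs at least one node")
    if num_processes > MAX_PROCESSES:
        problems.append(
            f"{num_processes} processes exceeds the limit of {MAX_PROCESSES}"
        )
    if num_nodes > MAX_NODES:
        problems.append(f"{num_nodes} nodes exceeds the limit of {MAX_NODES}")
    return problems
```

A request for 2048 processes on 64 nodes returns an empty list, while a request for 4096 processes on 128 nodes returns one message for each exceeded limit.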