The existing model for cluster computing at Duke is one of many locally centralized, generally autonomous, cluster computing
operations. This model works, and it works for certain very good
reasons. Well designed clusters, located in facilities that provide
adequate infrastructure such as physical space, power, cooling capacity,
and networking, scale extremely well in their system management
requirements. That is, barring hardware failure, a cluster node should
require full-time equivalent (FTE) labor on the order of an hour a
year or even less to install, update, and operate. In a department
that already has a competent systems manager or systems management
group, it is often possible to install and operate a cluster using
opportunity cost labor provided by the local manager as just another
aspect of managing the departmental LAN.

This is a particularly efficient solution, as the LAN manager already provides most of the core services required by the cluster
(e.g. account management, disk and backup services, software
installation and management services, and security) for the departmental
groups utilizing the cluster resource. These services can be extended
to the cluster nodes for essentially zero marginal cost, making the
labor cost for installing and maintaining the nodes the only cost
that scales with the size of the cluster, and this cost scales in a
particularly predictable way.

This model is also efficient for a second reason. Since there are many clusters on campus, each engineered according to the needs of its
local users and being perpetually built and rebuilt as new moneys become
available, there is an evolutionary optimization that naturally
occurs as new ideas are tried out, good ideas and bad ideas are
discovered in small-scale experiments, and these ideas and
experiences are shared across campus. This model works well in the rapidly
changing world of computer and networking hardware, where
``revolutionary'' changes occur every year and are an accepted part of
doing business.

This should be compared to the likely efficiency of a monolithic model
where all cluster computer operations on campus were organized and
managed by a single, centralized authority. Bad ideas would be costly
on an institutional scale instead of a departmental or group
scale; good ideas would have to diffuse into the institution from other
institutions; change would necessarily proceed at a much slower rate.
Worst of all, the cluster managers would likely become increasingly
dissociated from their client base and increasingly narrow in their
support of the wide range of user environments likely to be familiar to
the cluster users. Accountability and flexibility would be lost.

These negative elements associated with monolithic models can all be
observed now in those existing computer operations at Duke that
are heavily centralized, especially in the realms of mainframe computing
and in the generally homogeneous academic computing
clusters. Those of us who have been associated in some
way with computing on campus over decades recall well the days of the
Triangle Universities Computation Center (TUCC) and its campus
equivalent (DUCC), and the inefficiencies that actively drove the
primary computer users on campus to abandon this model altogether in
favor of organization at the departmental scale.

For all of these reasons, the model proposed herein for improved
institutional support of cluster computing remains a model that is
centralized locally, at the departmental level where that makes
sense and in a number of distributed cluster sites where it does not
make sense. It avoids the creation of any sort of monolithic
centralized cluster facility that might become the Duke Supercomputing
Center (DSC) to mirror the North Carolina Supercomputing Center (NCSC)
as DUCC once mirrored TUCC. It relies on institutional organization and
coordination enabled by technology to achieve the desired support at the
institutional scale while retaining the flexibility and cost efficiency
of the localized management model.

The primary features of the proposed model are thus:

Mostly decentralized clusters, in a number of "cluster facilities"
in reasonable physical proximity to their users, where those users
themselves tend to be clustered, e.g. physics, math, computer science,
chemistry, engineering, and other science and engineering milieus with
long-term needs for High Performance Computing (HPC). This simply
recognizes that the existing model is fundamentally sound and should not
be radically changed.

As an »extension« of the model, one or more cluster facilities
(both existing ones and new ones) can successfully house clusters
belonging to otherwise isolated groups that »don't« need to be in
immediate proximity to their clusters. Again, as the examples of Math
and ISDS show, this is a viable model, but it needs to be promoted by Duke at the
institutional level where a cost-benefit analysis or a lack of local
infrastructure makes it appropriate. There are two possible models for
managing these remote clusters. Both are likely to make sense for
different kinds of clusters and cluster owners.

One is the ``owner managed'' model, where the cluster is remotely
sited but still managed by a departmental LAN manager of the department
to which the owning group belongs. This is the only remote
management model possible and in use (by Math and ISDS) at this time.
It is clearly successful, largely because it retains most of the
zero-marginal-cost advantages associated with local cluster
administration.

There are some additional cost penalties, however. The cost of
physically managing and installing the nodes is considerably higher than
with strictly local nodes, as it takes a relatively long time for the
departmental manager to travel from their primary
departmental LAN to the cluster site to perform those maintenance
and installation duties that require physical presence. During this
time offsite, their management of their departmental LAN is obviously
somewhat less responsive. Similarly, they are necessarily less
responsive to the needs of the cluster owners when those needs require a
trip off site over to where the cluster is physically located. At a
guess, offsite management by the systems manager of the owning group is
roughly twice as costly per node as onsite management by the local
systems manager of the owning group.
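
The per-node estimates quoted in this document can be combined into a rough labor model. The sketch below is illustrative only: the function and constant names are hypothetical, the one-hour-per-node-per-year figure is the order-of-magnitude estimate given earlier for a well designed cluster, and the factor of two for offsite management is the guess stated above. It ignores hardware failures and any fixed costs.

```python
# Illustrative labor model. All names are hypothetical and the default
# values are the rough estimates quoted in the text, not measurements.
ONSITE_HOURS_PER_NODE_YEAR = 1.0  # "on the order of an hour a year" per node
OFFSITE_MULTIPLIER = 2.0          # "roughly twice as costly per node" (a guess)


def annual_labor_hours(nodes: int,
                       offsite: bool = False,
                       hours_per_node: float = ONSITE_HOURS_PER_NODE_YEAR,
                       offsite_multiplier: float = OFFSITE_MULTIPLIER) -> float:
    """Estimate yearly administration hours for a cluster of `nodes` nodes.

    The cost is assumed to scale linearly with cluster size; offsite
    management by the owning group's manager scales the per-node cost.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    per_node = hours_per_node * (offsite_multiplier if offsite else 1.0)
    return nodes * per_node


if __name__ == "__main__":
    # Compare onsite and offsite estimates for a few cluster sizes.
    for n in (16, 64, 256):
        onsite = annual_labor_hours(n)
        remote = annual_labor_hours(n, offsite=True)
        print(f"{n:4d} nodes: onsite {onsite:6.1f} h/yr, offsite {remote:6.1f} h/yr")
```

Under these assumptions the gap between onsite and offsite management grows in direct proportion to the number of nodes, which is why the choice of management model matters more for larger remote clusters.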

An additional model proposed for the management of these offsite
clusters is that they be managed by ``the university''. This
alternative model is one that we wish to architect and implement for a
variety of reasons. Some research groups that might wish to operate
clusters are in departments that lack the human infrastructure to
support an offsite cluster, or the departmental LAN infrastructure to be
able to realize any sort of economy of scale if they did. In addition,
groups may find advantages in the resource sharing that is enabled if
they locate their cluster under a common, university-level
administrative umbrella with several other architecturally similar
clusters. The construction of a suitable university management model
for offsite clusters is a primary focus of this white paper, although
that should not be construed as any sort of abandonment of the
local management model (onsite or offsite) where it makes the most
sense.

The existing local management model is not without flaws. Local
managers at some sites have in the past been relatively untrained
graduate students or postdocs, who have sometimes proven spectacularly
incompetent or untrustworthy. Even when the work is done by competent and
professional local managers, and the considerable advantages
associated with zero-marginal-cost extension of the existing LAN
services are obtained, the labor cost associated with running one or more
onsite or offsite clusters is not necessarily either trivial or acceptable
in any given departmental environment.

Running a cluster in addition to a LAN involves tradeoffs that affect
productivity in many ways, the most obvious one being that in many cases
an administrator must choose to do one or the other, performing a sort
of task prioritization or triage as needs for services and support
emerge. If the LAN manager is relatively underutilized, this is not
generally a problem. If they are already heavily burdened, it can
easily overburden them and result in a reduction in the quality of
services.

Also, these local systems administrators are (generally) well-trained in
LAN administration but may lack expertise germane to cluster management
per se (where it differs). The construction of a university-level
mechanism to better support and to better train onsite and offsite local
managers is also a primary focus of the model proposed in this
white paper.

In order to accomplish these goals of providing clusters that are
fully managed by the University (offsite as far as the cluster owners
are concerned), providing operational support to both onsite and offsite
local managers, and providing improved training for local managers, the
University will clearly need some sort of centralized cluster
organization. This organization can improve productivity and efficiency
at the institutional level in many much-needed ways. For example, in
addition to the above, it can also help manage cluster siting and the
building or remodeling of facilities as needed; cluster tracking and
inventory; grant-writing support; cluster architecture and standards;
personnel support (both centralized and owner/local); application
support; information coordination and dissemination; cluster integration
both on campus and off (at NCSC, for example); and the management of the
university-managed clusters.

This, then, is an outline for a campus cluster support model that is
fleshed out in more detail below. In it, clusters will continue to be
both managed and physically located locally where it makes obvious sense
to do so, as this results in by far the greatest economies of scale.
Nevertheless, a University-level cluster computing operation will be
proposed that will remain at least partly delocalized itself, and which
will be responsible for providing a variety of levels and kinds of
support to groups operating or hoping to operate clusters for many
purposes throughout the University.