I use ssh for system management, and it works fine.
Why should I use Starfish?

Answer:

When all you have is a hammer, everything looks like a nail. Secure
login tools are suitable for individual remote sessions. They don't
operate at the scale of administering site infrastructure.

As a remote login tool, ssh is fine. It provides terminal
access for a single user to a single remote host for a single session.
Significantly, though, the scaling factor of ssh is
unity. Indeed, much of its behavior involves managing terminal
characteristics and environment variables. These features make sense
for a single login session. They make no sense when running multiple
sessions concurrently.

Its authentication is likewise built around the native user login model
rather than being based on a model of remote system administration.
This puts system management at risk in the event of problems with name
services and other common resources. In a robust architecture, these
services are layered on top of more fundamental system management services,
not codependent with them. The use of ssh for system
management creates an undesirable coupling between these layers.

A number of experimental system management tools depend on
ssh to provide a secure communications layer. We see some
architectural problems with this approach to modularity, first because
the communication between the tool and ssh is external to
both, second because it takes place in the clear, and third
because ssh makes no provision for transferring user
interactions between the modules. Further problems with
ssh lie in the area of certificate management. Here,
programs such as stunnel provide a richer authentication
model, although we note that they share the same architectural
problems as ssh with respect to integration. In more
tightly integrated designs, the application internally manages its own
communications and security.

In summary, ssh has no specific capabilities for system
management. It has no features to manage multiple sessions, nor to
control remote computation, nor to perform certificate management, nor
does it provide the means to integrate with other applications where
these features might be available.

We see the essentially interactive nature of ssh as being a
strength in some respects and a weakness in others. A number of system
administration tasks are geared toward interactivity rather than
scalability, which means that ssh has a viable place in the
system administrator's toolkit. But we argue that gearing administration
toward interactivity at the expense of scalability is not sustainable.

At our site, management wants to buy a commercial system administration
product. There are many choices, among them OpenView, Sun Cobalt
Control Station, Tivoli, Unicenter, and BMC Patrol. These are
sophisticated products with huge development resources behind them.
Starfish looks pretty modest by comparison.

Answer:

It can be a challenge to decide whether or not such a product is
appropriate for your site. Marketing claims are not a reliable
basis for evaluation.

Most enterprise-class system administration tools address the problems
of system administration from a business perspective, rather than a
software engineering perspective. In this context, many factors apart
from technical suitability have to be considered. A solution may be
presented attractively to management while making technical issues such
as site integration difficult to assess. Coming from a software
engineering culture, we tend to be suspicious of solutions which
reveal nothing about how they are implemented. We wonder what purpose
is served by making these questions difficult to answer, particularly
where security is concerned.

Very large sites may be able to justify the licensing expense and
integration effort of adopting a large commercial system management
product for sitewide use. They may also have to consider industrial
relations and similar factors. Such a decision has lasting consequences
and should never be made lightly. Meanwhile, the proprietary nature of
the software, as well as its sheer size and complexity, makes it very
difficult to evaluate from a security perspective. The accuracy and
completeness of technical information on the product must also be
evaluated.
Bearing these concerns in mind, a commercial product may prove to be the
most appropriate solution for very large sites, particularly when
technical expertise is not available internally, generic features are
acceptable, security does not need to be verified, and expense is not a
dominant factor.

Starfish is a very different value proposition. It is consciously
designed to be small and secure. Though intended for use in
sophisticated environments, Starfish is easy to evaluate and extend. Of
the commercial offerings, the Cobalt system seems to come closest to
sharing these motivations, and it goes further than Starfish in
providing abstractions such as patch management. For this very reason,
however, it has only limited support for multiple platforms.

We deeply believe that no tool is a substitute for expert system
administration, but a good tool can certainly make expertise more
effective. Starfish may be the right framework in which to develop
techniques for system administration that work at your site. It does
not take a major commitment to find out.

Starfish is evidently a tool for performing ad hoc system
management. This technique is doomed because of several factors,
notably that neither human performance nor centralized management is
able to scale.
Mark Burgess, Steve Traugott and others have instead argued that
automated system deployment is the only way to ensure consistency
in a scalable manner.

Answer:

The immediate barrier to progress at most sites is how to make sense
of the existing chaos.

Burgess and Traugott are right to propose techniques which can be scaled
to manage large computing environments. A reasonable goal for any site
would be to arrive at a condition where these techniques could be
applied. However, achieving this condition often proves to be an
exceptionally difficult challenge. Both Burgess and Traugott are
primarily concerned with the challenges of maintaining an ideal
computing environment.

The unfortunate problem is that most sites do not maintain an adequate
model of their own infrastructure. In other words, the ideal has not
yet been expressed, let alone realized. Sites tend to start small and
to evolve chaotically. Cleverness is often substituted for thoughtful
design. Not surprisingly, systems often develop complex
interdependencies which, over time, are decreasingly well understood.
With every adaptation to new requirements, site complexity not only
increases, but also becomes more difficult to model.

Having reached this state, a production computing environment faces
numerous technical and political barriers to change.
There is a common perception that technical staff are not essential
unless they are engaged in some visible activity. Their workload thus
encourages a reactive mindset in which crisis management is primary, and
design and planning are secondary functions. Once established, this
state of mind is difficult to overcome.
Change is also difficult in its own right. Owing to complexity or
imperfect knowledge, it may take a concerted effort to identify the
characteristics and relationships of individual systems, to convince the
affected parties that conversion lies in their best interests, to
develop consensus on classification and authority, to implement and test
specifications, and of course to deploy management software in a secure
manner.

In order to make the transition from chaos to order in a production
environment, an ad hoc system management tool can be
indispensable. While policy and specifications are important for
building critical infrastructure, without some tool for performing
inspection and intermediate cleanup, important features and dependencies
in the existing environment may not find their way into the model.
We argue, therefore, that such a model rarely arrives in a neat package,
much as we would like it to. Although it would be unwise to persist
indefinitely in using ad hoc approaches to system management,
their adaptability is a singular advantage when working with environments
which are not fully modelled.

Starfish has the further advantage that it provides some help with
classification and convergence. It can be used to inspect system state,
to reveal symmetries and to search for anomalies across groups of
systems. It can also be used to perform incremental restructuring and
other tasks beyond the scope of autonomous agents, to develop and test
specifications, to deploy and manage the agents, and to monitor systems
independently for policy compliance.
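
As a rough illustration of what searching for anomalies across a group
of systems can mean, the sketch below groups hosts by the value each one
reported and flags those that disagree with the majority. It is written
in Python for brevity and is not Starfish code; the function names and
the sample data are hypothetical.

```python
from collections import defaultdict


def classify(results):
    """Group hosts by the value each reported.

    `results` maps host name -> reported value (for example, a checksum
    or version string). Returns (value, hosts) pairs, largest group first.
    """
    groups = defaultdict(list)
    for host, value in results.items():
        groups[value].append(host)
    # Largest group first; ties are broken by value so the order is stable.
    return sorted(
        ((value, sorted(hosts)) for value, hosts in groups.items()),
        key=lambda pair: (-len(pair[1]), pair[0]),
    )


def anomalies(results):
    """Return the hosts whose value differs from the most common one."""
    ranked = classify(results)
    return [host for _, hosts in ranked[1:] for host in hosts]


# Hypothetical inspection results: a configuration checksum per host.
sample = {
    "foo": "a1f3",
    "bar": "a1f3",
    "cat": "a1f3",
    "dog": "9c0e",
}
```

Here `classify(sample)` puts foo, bar and cat in one group and dog in
another, and `anomalies(sample)` singles out dog for a closer look.
Deciding whether the outlier is a fault or a legitimate exception is
still left to the administrator.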

These activities are notationally more expressive, and thus potentially
more dangerous, than those exercised by declarative agents operating
autonomously. However, ad hoc management has some compensating
advantages. Autonomous agents tend to be much more complex than managed
agents, yet their design and behavior must also be extremely
conservative. Starfish has an expert human in the loop, which permits
a much more liberal scope in responding to disordered environments.
Its agents are simple and lightweight, and because they are centrally
managed, effects are immediately reported, which is not at all the case
for autonomous agents.

Most professions rely on a range of specialized tools and techniques,
because not all problems can be approached in the same way. The
typically complex and interdependent computing environment is a case in
point. It should be evident that, far from being harmful, ad hoc
system management is both necessary and complementary to other
techniques, especially during times of transition.

I hope that:
  a) there is some kind of certification of command validity before
     executing it.
  b) there is some provision for rollback in case of extreme stupidity.
  c) there is some provision for non-disruptive behavior during repetitive
     acts and thus
  d) this is better than just using

       foreach host (foo bar cat dog)
           ssh -l root $host "something to do"
       end

Answer:

System management is not classical distributed computation.

System management is fundamentally not safe.

In an environment consisting of very few systems, and given a set of
simple and logically independent commands, it might be sufficient
simply to iterate over them as illustrated above. Such a strategy is more
consistent than manually issuing commands to individual systems.
However, it does not provide a mechanism to handle unexpected results.
On a small scale, if something goes wrong, the problem may be evident by
inspection while the iteration is underway. It may also be acceptable
to defer repairs until the entire iteration has completed, or it may be
safe to interrupt the iteration while partially complete. When these
techniques cause divergence, the effect is often simply tolerated in
small computing environments.

As computing environments grow in size and complexity, this simple model
becomes increasingly brittle and difficult to maintain. It is a
qualitatively different management activity when large numbers of
systems are involved. Economies of scale, predictable behavior and
security all depend on maintaining consistency among systems. At the
same time, consistency is difficult to automate because of evolving,
and sometimes contradictory, requirements overlaid onto changing
technologies.

In other words, human judgement must often be exercised, regardless of
the scale of activity. Starfish has features intended for these
specific conditions:

It issues expressions in parallel.

Its session management is lightweight and scales well.

Expression syntax is platform independent.

It provides a framework for managing expressions and results.

Error recovery is part of the control model.

Grouping is part of the control model.

It operates within an integrated security envelope.

Authentication is based on certificates, not on user logins.
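
To make the first few points concrete, here is a minimal sketch of
issuing one action to a group of hosts in parallel while keeping the
results and the failures separate, so that one unreachable host does
not stall or hide the rest. It is written in Python and does not
reflect how Starfish is implemented; `run_everywhere` and `fake_uptime`
are hypothetical names, and the "remote call" is simulated locally.

```python
from concurrent.futures import ThreadPoolExecutor


def run_everywhere(hosts, action, max_workers=16):
    """Apply `action(host)` to every host concurrently.

    Returns (succeeded, failed): `succeeded` maps host -> return value,
    `failed` maps host -> the exception raised for that host.
    """
    succeeded, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(action, host) for host in hosts}
        for host, future in futures.items():
            try:
                succeeded[host] = future.result()
            except Exception as exc:  # record the failure and keep going
                failed[host] = exc
    return succeeded, failed


def fake_uptime(host):
    """Stand-in for a real remote call; fails for one host on purpose."""
    if host == "dog":
        raise ConnectionError("no route to host")
    return f"{host}: up"


ok, bad = run_everywhere(["foo", "bar", "cat", "dog"], fake_uptime)
```

After the call, `ok` holds the three successful results and `bad` holds
the error for dog. The administrator can then decide what to do about
the failed subset, which is the step the simple loop in the question
leaves out.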

The question expresses the premise that system management can somehow be
implemented as a safe application. We are not aware of any
reason to believe this should be the case. Sandboxing, for example, is
a technique used to limit the damage that can be caused by application
behavior by limiting its scope. However, we are left with the problem
of how to implement and manage the sandbox itself, which is
nothing other than traditional system management.

System management is what we do to sustain a computing infrastructure
through the life cycles of its components. The ordinary activity of
system management involves creating and destroying systems, as well as
many other physical transformations which are not amenable to techniques
such as validation, guarding, and rollback. Indeed, it would be
difficult to imagine how these activities might be considered
safe under any interpretation, which is why we rely on the
judgement of expert professionals to conduct them. Starfish is a tool
for use by these professionals, not a substitute for their expertise.

This is not to say that a site may not choose to limit privilege or
capability under certain conditions, but we believe that such limits
should be imposed by the site as a policy decision, not within
Starfish as a design decision. Starfish functions on the
principle of strong authentication, not weak privilege.

Starfish is licensed under the GNU General Public License, which commits
it to being distributed as open source. Starfish itself is designed on
the principles of simplicity and clarity, which we recognize
to be especially important to software security.

These two factors work in combination to encourage peer review of the
software. Starfish consists of a few thousand lines of well structured
and readable code. A casual inspection should be possible over a cup of
coffee. A rigorous analysis of the Starfish code might take a couple of
days. In short, you do not need to take its security on faith.

Starfish itself is a young and evolving software product. However,
most of its capabilities are provided by mature software layers,
in particular OpenSSL and Tcl/Tk. These, too, are distributed as
open source, and have seen extremely widespread use. Starfish
benefits from the exposure to field testing these layers have
received over a number of years and at hundreds of thousands of
sites.

Tcl/Tk is an interpreted scripting language. Doesn't that impact
performance?

Answer:

This was an open question when we began to design Starfish. We wanted
the agents in particular to be very lightweight. In practice, we
find that scripting does not harm performance, and indeed may contribute
a net benefit by encouraging clean and simple design.

System management is extremely well suited to scripting languages,
because most capabilities already lie in the system being managed.
A good language gives us a unifying framework for accessing these
capabilities.

In terms of performance, we have found that more compute time is spent
in making a single SSL/TLS connection than in the entire overhead of
launching the Starfish agent. Bearing in mind that the connection is
performed entirely in native code, this speaks well for scripting
overheads.

In terms of memory usage, the Starfish agent is smaller than
snmpd, despite having full SSL/TLS capability.
The Starfish manager, for its part, is half the size of xemacs.
Its memory usage of course depends on the number of sessions it has
open, but this is primarily related to connection overheads, not the
scripting environment.

We have to support a number of different platforms. How does Starfish help
us to do that?

Answer:

Platform variation is a difficult problem. A variation between
platforms in providing a given service may be irrelevant to one site and
critical to another.

Starfish provides a modular extension mechanism so that sites can adapt
it to their specific needs. By adding agent modules to Starfish, you
create abstractions which expose or hide exactly those platform details
which make sense at your site.

Application software strives, for the most part, to disguise
platform differences. System administration, on the other hand, is
fundamentally concerned with managing platform differences.

Platform variation is an intrinsically difficult problem, as every site
depends on a particular combination of platform services, bases
different abstractions upon them, and makes different design and
management tradeoffs around them.

Your organization is unique. Extensibility in Starfish is therefore
very important. By adding agent modules, you can expose or hide exactly
those platform details which make sense at your site, and you can do so
incrementally. We believe that's about as far as a system management
tool can go before it starts actually contributing to the complexity
problem rather than helping to solve it.
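
The idea of a module that hides a platform detail can be sketched as a
small lookup table behind a single function. The example below is
Python, not a Starfish agent module, and the table is illustrative
only: it assumes an RPM-based Linux, Solaris `pkginfo` and FreeBSD
`pkg`. A real site would list only the platforms it actually runs.

```python
import platform

# Illustrative table: the command that lists installed packages on each
# platform. The entries are assumptions about a hypothetical site.
PACKAGE_LIST_COMMANDS = {
    "Linux": ["rpm", "-qa"],      # assumes an RPM-based distribution
    "SunOS": ["pkginfo"],
    "FreeBSD": ["pkg", "info"],
}


def package_list_command(system=None):
    """Return the argument vector that lists packages on `system`.

    `system` defaults to the local platform as reported by
    platform.system(). Unknown platforms raise LookupError rather than
    silently guessing.
    """
    system = system or platform.system()
    try:
        return list(PACKAGE_LIST_COMMANDS[system])
    except KeyError:
        raise LookupError(
            f"no package listing method for {system!r}"
        ) from None
```

Callers ask for the package list without caring which platform they are
talking to, and supporting a new platform means adding one table entry.
A site that does care about the difference can expose it by writing a
narrower module instead.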

You have several advantages that the industry at large does not have.
Foremost among these, you don't have to solve the general problem of
platform variation, but can adapt methods to the specific needs at your
site. You should not have to settle for a general solution fitted to
the lowest common denominator. You have rich experience with your own
infrastructure, and you understand what drives your system management
priorities. Let your management tools embody these insights.

Of course, as a consulting firm, we would be delighted to help you with
any of this activity. The point is, you don't need us. You
always have the power to do it yourself, for free.