Publications

2014

The performance profile of local area networks has changed over the last
decade, but many practical group communication and ordered messaging tools rely
on core ideas invented prior to this change. We present the
Accelerated Ring protocol, a novel ordering protocol that improves on the
performance of standard token-based protocols by allowing processes to pass the
token before they have finished multicasting. This performance improvement is
obtained while maintaining the correctness and other beneficial properties
of token-based protocols.

On 1-gigabit networks, a single-threaded daemon-based implementation of the
protocol reaches network saturation, and can reduce latency by 45% compared to
a standard token-based protocol while simultaneously increasing throughput by
30%. On 10-gigabit networks, the implementation reaches throughputs of 6 Gbps,
and can reduce latency by 30-35% while simultaneously increasing throughput by
25-40%. A production implementation of the Accelerated Ring protocol has been
adopted as the default ordering protocol for data center environments in the
Spread toolkit, a widely-used open-source group communication system.
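The core idea, passing the token before multicasting finishes, can be illustrated with a toy latency model. The process counts, message counts, and the assumption that leftover sends fully overlap token travel are illustrative, not taken from the paper:

```python
def rotation_time(n_processes, msgs_per_process, pass_delay, sent_before_pass):
    # Token hold time at each process: only the messages sent while
    # holding the token delay the token's forward progress.
    hold = min(sent_before_pass, msgs_per_process)
    return n_processes * (hold + pass_delay)

# Standard ring: each process passes the token only after all 10 sends.
standard = rotation_time(4, 10, 1, 10)
# Accelerated ring: pass after 4 sends; the remaining 6 are assumed to
# overlap the token's travel to the next process.
accelerated = rotation_time(4, 10, 1, 4)
```

In this toy model a full token rotation drops from 44 time units to 20, which is the intuition behind the latency reductions reported above.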

The increasing number of cyber attacks against critical infrastructures, which
typically require large state and long system lifetimes, necessitates the
design of systems that are able to work correctly even if part of them is
compromised.

We present the first practical survivable intrusion tolerant replication
system, which defends across space and time using compiler-based diversity and
proactive recovery, respectively. Our system supports large-state applications,
and utilizes the Prime BFT
protocol (providing performance guarantees under attack) with a
compiler-based diversification engine. We devise a novel theoretical model that
computes how resilient the system is over its lifetime based on the
rejuvenation rate and the number of replicas.

This model shows that we can achieve a confidence in the system of 95% over 30
years even when we transfer a state of 1 terabyte after each rejuvenation.
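The paper's analytical model is not reproduced here, but a simplified stand-in conveys the shape of the computation: assume each replica is compromised independently with some probability per rejuvenation interval, and the system fails in any interval where more than f of n replicas are compromised. All parameter values below are hypothetical:

```python
from math import comb

def interval_failure_prob(n, f, p):
    # Probability that more than f of n replicas are compromised within
    # a single rejuvenation interval, assuming independent compromises.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(f + 1, n + 1))

def lifetime_confidence(n, f, p, intervals):
    # Probability that the fault threshold is never exceeded in any
    # interval over the whole system lifetime.
    return (1 - interval_failure_prob(n, f, p)) ** intervals

# Hypothetical numbers: 10 replicas tolerating 3 faults, 1% per-interval
# compromise probability, daily rejuvenation for 30 years.
confidence = lifetime_confidence(10, 3, 0.01, 30 * 365)
```

Raising the rejuvenation rate shrinks the per-interval compromise probability, which is why the model can sustain high confidence over decades.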

Providers of critical infrastructure services strive to maintain the high
availability of their SCADA systems. This paper reports on our experience
designing, architecting, and evaluating the first survivable SCADA system
- one that is able to ensure correct behavior with minimal performance
degradation even during cyber attacks that compromise part of the system. We
describe the challenges we faced when integrating modern intrusion-tolerant
protocols with a conventional SCADA architecture and present the techniques
we developed to overcome these challenges. The results illustrate that our
survivable SCADA system not only functions correctly in the face of a cyber
attack, but that it also processes in excess of 20,000 messages per second
with a latency of less than 30 ms, making it suitable for even large-scale
deployments managing thousands of remote terminal units.

Networks with homogeneous routing nodes are constantly at risk as any vulnerability found
against a node could be used to compromise all nodes. Introducing diversity
among nodes can be used to address this problem. With few variants, the choice of assignment
of variants to nodes is critical to the overall network resiliency.

We present the Diversity Assignment Problem (DAP), the assignment of
variants to nodes in a network, and we show how to compute the optimal solution in medium-size
networks. We also present a greedy approximation to DAP that scales well to large networks.
Our solution shows that a high level of overall network resiliency can be obtained even
from variants that are weak on their own.

We provide two variations of our problem to meet real-world system needs.
First, for networks with knowledge of higher-level protocols we offer a technique to create
assignments that maximize the needs of a specific application (e.g., Paxos and BFT). Second,
for networks with knowledge of the value of traffic between each communicating pair of nodes,
we offer a weighted version that can increase resiliency between important communicating pairs
while sacrificing resiliency for the less important pairs.

Our assignments are based on assumed compromise probabilities and independence of compromises
between different diverse variants. We provide analysis when these assumed probabilities or
independence are inaccurate.
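The paper's exact greedy approximation is not given here; the following sketch uses a coloring-style heuristic, assigning each node the variant least used among its already-assigned neighbors, as one plausible illustration of diversity assignment:

```python
def greedy_diversity(adj, variants):
    # adj: node -> list of neighboring nodes; variants: available variants.
    # Greedily give each node the variant least represented among the
    # neighbors assigned so far, spreading diversity across the topology.
    assignment = {}
    for node in sorted(adj):
        counts = {v: 0 for v in variants}
        for neighbor in adj[node]:
            if neighbor in assignment:
                counts[assignment[neighbor]] += 1
        assignment[node] = min(variants, key=lambda v: counts[v])
    return assignment
```

On a small path topology this heuristic alternates variants, so compromising any single variant never disconnects the whole path.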

Networks with homogeneous routing nodes are
constantly at risk as any vulnerability found against a node could
be used to compromise all nodes. Introducing diversity among
nodes can be used to address this problem. With few variants,
the choice of assignment of variants to nodes is critical to the
overall network resiliency.

We present the Diversity Assignment Problem (DAP), the
assignment of variants to nodes in a network, and we show how
to compute the optimal solution in medium-size networks. We
also present a greedy approximation to DAP that scales well to
large networks. Our solution shows that a high level of overall
network resiliency can be obtained even from variants that are
weak on their own.

For real-world systems that grow incrementally over time, we
provide an online version of our solution. Lastly, we provide a
variation of our solution that is tunable for specific applications
(e.g., BFT).

We propose an intrusion-tolerant
overlay messaging service for cloud monitoring and control.
Our service, Controlled Authenticated Overlay Flooding,
combines authentication mechanisms and
flooding with an
overlay architecture to provide optimal resiliency in the presence
of compromised nodes. While Internet flooding is not
feasible because of the number of nodes and links involved,
our approach is practical as it imposes a maximal overlay
topology and confines the communication to that topology's
overlay links, limiting the cost of flooding.
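As a sketch of combining authentication with topology-confined flooding, assume a shared per-source HMAC key (a simplification of the service's actual authentication mechanisms); each node verifies the tag before forwarding, forwards a message at most once, and only along overlay links:

```python
import hashlib
import hmac

def flood(adj, keys, source, payload):
    # adj: node -> overlay neighbors; keys: per-source verification keys
    # (assumed shared for illustration). The sender tags the payload and
    # every node verifies the tag before accepting or forwarding.
    tag = hmac.new(keys[source], payload, hashlib.sha256).digest()
    delivered, queue = set(), [source]
    while queue:
        node = queue.pop()
        if node in delivered:
            continue  # forward each message at most once
        expected = hmac.new(keys[source], payload, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            continue  # drop messages that fail authentication
        delivered.add(node)
        queue.extend(adj[node])  # flood only along overlay links
    return delivered
```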

This paper reports on our experience designing and implementing the first
survivable SCADA system - one capable of providing correct behavior with minimal
performance degradation even during a cyber attack that compromises part of the system.
We describe the challenges we faced while attempting to integrate modern
intrusion-tolerant protocols with the SCADA architecture and present the techniques
we developed to overcome these challenges.

Existing Byzantine-resilient replication protocols satisfy two standard correctness criteria,
safety and liveness, even in the presence of Byzantine faults. The runtime performance of these
protocols is most commonly assessed in the absence of processor faults and is usually good
in that case. However, in some protocols faulty processors can significantly degrade performance,
limiting the practical utility of these protocols in adversarial environments. This paper
demonstrates the extent of performance degradation possible in some existing protocols that
do satisfy liveness and that do perform well absent Byzantine faults. We propose a new
performance-oriented correctness criterion that requires a consistent level of performance,
even when the system exhibits Byzantine faults. We present a new Byzantine fault-tolerant
replication protocol that meets the new correctness criterion and evaluate its performance in
fault-free executions and when under attack.

Wireless mesh networks extend the connectivity range of mobile devices
by using multiple access points, some of them connected to the
Internet, to create a mesh topology and forward packets over multiple
wireless hops. However, the quality of service provided by the mesh is
impaired by the delays and disconnections caused by handoffs, as
clients move within the area covered by multiple access points. We
present the architecture and protocols of SMesh, the first transparent
wireless mesh system that offers seamless, fast handoff, supporting
real-time applications such as interactive VoIP. The handoff and
routing logic is done solely by the access points, and therefore
connectivity is attainable by any 802.11 device. In SMesh, the entire
mesh network is seen by the mobile clients as a single, omnipresent
access point, giving the mobile clients the illusion that they are
stationary. We use multicast for access point coordination and,
during handoff transitions, we use more than one access point to
handle the moving client. SMesh provides a hybrid routing protocol
that optimizes routes over wireless and wired links in a multi-homed
environment. Experimental results on a fully deployed mesh network
demonstrate the effectiveness of the SMesh architecture and its
intra-domain and inter-domain handoff protocols.

Push-to-Talk (PTT) is a useful capability for rapidly deployable wireless
mesh networks used by first responders. PTT allows several users to speak
with each other while using a single, half-duplex, communication channel,
such that only one user speaks at a time while all other users listen.
This paper presents the architecture and protocol of a robust distributed
PTT service for wireless mesh networks. The architecture supports any
802.11 client with SIP-based (Session Initiation Protocol) VoIP software
and enables the participation of regular phones. Collectively, the mesh
nodes provide the illusion of a single third party call controller,
enabling clients to participate via any reachable mesh node. Each PTT
group instantiates its own logical floor control manager that is highly
available and resilient to mesh connectivity changes such as node crashes
and recoveries and network partitions and merges. Experimental results
on a fully deployed mesh network consisting of 14 mesh nodes and tens of
emulated clients demonstrate the scalability and robustness of the system.
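A minimal floor-control sketch conveys the half-duplex rule, one floor holder at a time with other requesters queued; the class and method names are hypothetical, and the actual manager is replicated for high availability:

```python
from collections import deque

class FloorManager:
    # Minimal sketch of PTT floor control: at most one speaker holds
    # the floor; later requesters wait in FIFO order.
    def __init__(self):
        self.holder = None
        self.waiting = deque()

    def request(self, client):
        # Grant the floor if free, otherwise queue the requester.
        if self.holder is None:
            self.holder = client
        else:
            self.waiting.append(client)
        return self.holder

    def release(self, client):
        # Pass the floor to the next waiting requester, if any.
        if self.holder == client:
            self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder
```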

This paper presents the first hierarchical Byzantine
fault-tolerant replication architecture suitable to systems that
span multiple wide area sites. The architecture confines the
effects of any malicious replica to its local site, reduces message
complexity of wide area communication, and allows read-only
queries to be performed locally within a site for the price
of additional standard hardware. We present proofs that our
algorithm provides safety and liveness properties. A prototype
implementation is evaluated over several network topologies
and is compared with a flat Byzantine fault-tolerant approach.
The experimental results show considerable improvement over
flat Byzantine replication algorithms, bringing the performance
of Byzantine replication closer to existing benign fault-tolerant
replication techniques over wide area networks.

This paper presents the first architecture for large-scale, wide-area
intrusion-tolerant state machine replication that is specifically
designed to perform well even when some of the servers are Byzantine.
The architecture is hierarchical and runs attack-resilient state
machine replication protocols within and among the wide-area sites.
Given the constraints of the wide-area environment, we explore the
challenges and tradeoffs of building inter-site communication
protocols that use wide-area bandwidth efficiently yet can resist
attempts to degrade performance. The paper provides evidence that the
optional use of simple dependable components, whose compromise or
malfunction cannot cause inconsistency in the replicated service, can
significantly improve performance when the system is under attack.

Existing Byzantine-resilient replication protocols satisfy two
standard correctness criteria, safety and liveness, in the presence of
Byzantine faults. In practice, however, faulty processors can, in
some protocols, significantly degrade performance by causing the
system to make progress at an extremely slow rate. While ``correct''
in the traditional sense, systems vulnerable to such performance
degradation are of limited practical use in adversarial environments.
This paper argues that techniques for mitigating such performance
attacks are needed to bridge this ``practicality gap'' for
intrusion-tolerant replication systems. We propose a new
performance-oriented correctness criterion, and we show how failure to
meet this criterion can lead to performance degradation. We present a
new Byzantine replication protocol that achieves the criterion and
evaluate its performance in fault-free configurations and when under
attack.

The aim of this paper is to demonstrate the feasibility of authenticated throughput-efficient
routing in an unreliable and dynamically changing synchronous network in which the majority
of malicious insiders try to destroy and alter messages or disrupt communication in any way.
More specifically, in this paper we seek to answer the following question: Given a network
in which the majority of nodes are controlled by a node-controlling adversary and whose
topology is changing every round, is it possible to develop a protocol with polynomially-bounded
memory per processor that guarantees throughput-efficient and correct end-to-end communication?
We answer the question affirmatively for extremely general corruption patterns: We only request
that the topology of the network and the corruption pattern of the adversary leave at least
one path each round connecting the sender and receiver through honest nodes (though this
path may change at every round).

This paper presents PICO, a generic infrastructure for secure group
communication in mobile ad-hoc networks (MANETs). PICO provides an
intrusion-tolerant group management service, allowing clients to join
or leave a logical group and enabling group members to communicate
securely using a dynamically generated group encryption key. Since
MANETs are characterized by relatively high message loss and frequent
network partitions, PICO is built around a new Byzantine
fault-tolerant agreement protocol designed to cope with these
conditions. The agreement protocol leverages weak (commutative)
semantics to allow multiple partitions to continue operating in
parallel without sacrificing correctness, and it uses threshold
cryptography to provide efficient reconciliation and coordination
without the need for reliable communication links.

Push-to-Talk (PTT) is a useful capability for rapidly deployable
wireless mesh networks used by first responders. PTT allows several
users to speak with each other while using a single, half-duplex,
communication channel, such that only one user speaks at a time while
all other users listen. Furthermore, enabling regular PSTN phone users
(e.g., cell phones) to seamlessly participate in the wireless mesh PTT
session is key to supporting the heterogeneous environment commonly
found in such settings.

This paper presents the architecture and protocol of a distributed PTT
service for wireless mesh networks. The architecture supports any 802.11
client with SIP-based VoIP software and enables the participation of
regular phones. Collectively, the mesh nodes provide the illusion of a
single third party call controller, enabling clients to participate via
any reachable mesh node. Each PTT group instantiates its own logical
floor control manager that is highly available and resilient to mesh
connectivity changes such as node crashes and recoveries and network
partitions and merges.

Experimental results on a fully deployed mesh network consisting of 14
mesh nodes and tens of emulated clients demonstrate the scalability and
robustness of the system.

This paper presents an overview of Paxos for System Builders, a
complete specification of the Paxos replication protocol such that
system builders can understand it and implement it. We evaluate the
performance of a prototype implementation and detail the safety and
liveness properties guaranteed by our specification of Paxos.
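The single-decree core of Paxos can be sketched as follows; this is a didactic rendering without real message passing, not the paper's complete specification:

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot promised
        self.accepted = None    # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        # Phase 1b: promise not to accept lower ballots; report any
        # previously accepted value.
        if ballot > self.promised:
            self.promised = ballot
            return ('promise', self.accepted)
        return ('nack', None)

    def accept(self, ballot, value):
        # Phase 2b: accept unless a higher ballot was promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return 'accepted'
        return 'nack'

def propose(acceptors, ballot, value):
    # Phase 1a: collect promises from a majority.
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [p for p in promises if p[0] == 'promise']
    if len(granted) <= len(acceptors) // 2:
        return None
    # Adopt the value of the highest-ballot prior acceptance, if any.
    prior = [p[1] for p in granted if p[1] is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a: decided only if a majority accepts.
    acks = [a.accept(ballot, value) for a in acceptors]
    return value if acks.count('accepted') > len(acceptors) // 2 else None
```

The second assertion below shows the safety-critical rule: a later proposer with a higher ballot must adopt the already-chosen value rather than its own.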

Low-cost wireless routers are changing the way people connect to
the Internet. They are also very cheap, albeit quite limited, Linux
boxes. These attributes make them ideal candidates for wireless mesh
routers.
This paper presents a minimally invasive mechanism for redundant multipath
routing in kernel-space to achieve high reliability with high throughput
in a mesh network. This service is essential for achieving fast,
lossless handoff as mobile devices roam throughout the wireless mesh
coverage area. However, redundant multipath is not natively supported by
current operating systems, limiting the routing mechanisms that can be
used in these networks to user-level implementations, which can greatly
degrade performance.
We show an architecture that integrates this mechanism in a wireless
mesh system, resulting in a high-throughput 802.11 mesh network with
fast handoff over low-cost routers.

This paper presents a complete specification of the Paxos replication
protocol such that system builders can understand it and implement it.
We evaluate the performance of a prototype implementation and detail
the safety and liveness properties guaranteed by our specification of
Paxos.

Constructing logical machines out of collections of physical machines
is a well-known technique for improving the robustness and fault
tolerance of distributed systems. We present a new, scalable
replication architecture, built upon logical machines specifically
designed to perform well in wide-area systems spanning multiple sites.
The physical machines in each site implement a logical machine by
running a local state machine replication protocol, and a wide-area
replication protocol runs among the logical machines. Implementing
logical machines via the state machine approach affords free
substitution of the fault tolerance method used in each site and in
the wide-area replication protocol, allowing one to balance
performance and fault tolerance based on perceived risk.
We present a new Byzantine fault-tolerant protocol that establishes a
reliable virtual communication link between logical machines. Our
communication protocol is efficient (a necessity in wide-area
environments), avoiding the need for redundant message sending during
normal-case operation and allowing a logical machine to consume
approximately the same wide-area bandwidth as a single physical
machine. This dramatically improves the wide-area performance of our
system compared to existing logical machine based approaches. We
implemented a prototype system and compare its performance and fault
tolerance to existing solutions.

This paper presents a routing protocol for multi-homed wireless mesh
networks that provide uninterrupted connectivity and fast handoff.
Our approach integrates wireless and wired connectivity, using multicast
groups to coordinate decisions and seamlessly transfer connections
between several Internet gateways as mobile clients move between
access points. The protocol optimizes the use of the wireless medium
by short-cutting wireless hops through wired connections, paying a very
low overhead during handoffs. The paper demonstrates that inter-domain
handoffs occur instantaneously, with virtually no loss or delay, for
both TCP and UDP connections.

This paper presents the architecture and protocols of SMesh,
a completely transparent wireless mesh system that offers
seamless, fast handoff, supporting VoIP and other real-time
application traffic for any unmodified 802.11 device.
In SMesh, the entire mesh network is seen by the
mobile clients as a single, omnipresent access point.
Fast handoff is achieved by ensuring that each client is served
by at least one access point at any time. Mobile clients are
handled by a single access point during stable connectivity times.
During handoff transitions, SMesh uses more than one access point
to handle the moving client. Access points continuously monitor
the connectivity quality of any client in their range and
efficiently share this information with other access points in
the vicinity of that client to coordinate which of them should
serve the client.
Experimental results on a fully deployed mesh network consisting
of 14 access points demonstrate the effectiveness of the SMesh
architecture and its handoff protocol.
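The coordination decision can be sketched as a hysteresis rule over shared link-quality metrics; the function and margin value are hypothetical simplifications of SMesh's actual protocol:

```python
def elect_access_point(metrics, current, hysteresis=0.2):
    # metrics: access point -> reported link quality for one client.
    # Pick the best access point, but keep the current one unless a
    # competitor is better by a clear margin, avoiding handoff flapping.
    best = max(metrics, key=metrics.get)
    if current in metrics and metrics[current] + hysteresis >= metrics[best]:
        return current
    return best
```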

This paper presents the first hierarchical Byzantine fault-tolerant
replication architecture suitable to systems that span multiple wide
area sites. The architecture confines the effects of any malicious
replica to its local site, reduces message complexity of wide area
communication, and allows read-only queries to be performed locally
within a site for the price of additional hardware. A prototype
implementation is evaluated over several network topologies and is
compared with a flat Byzantine tolerant approach.

The cost savings and novel features associated with Voice over IP
(VoIP) are driving its adoption by service providers.
Unfortunately, the Internet's best effort service model provides no
quality of service guarantees. Because low latency and jitter are the key
requirements for supporting high quality interactive conversations,
VoIP applications use UDP to transfer data, thereby subjecting
themselves to quality degradations caused by packet loss and network
failures.

In this paper we describe an architecture to improve the performance
of such VoIP applications. Two protocols are used for localized packet
loss recovery and rapid rerouting in the event of network
failures. The protocols are deployed on the nodes of an
application-level overlay network and require no changes to the
underlying infrastructure. Experimental results indicate that the
architecture and protocols can be combined to yield voice quality on
par with the PSTN.
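The benefit of localized recovery can be sketched with a per-hop retry model; the function and retry limit are illustrative, not the protocol's actual parameters:

```python
def deliver(hops, lost, max_local_retries=2):
    # lost(hop, attempt) -> True if that transmission attempt is lost.
    # Each overlay hop retransmits locally on loss instead of relying on
    # an end-to-end retransmission from the original sender.
    for hop in range(hops):
        for attempt in range(max_local_retries + 1):
            if not lost(hop, attempt):
                break  # this hop succeeded
        else:
            return False  # hop exhausted its local retries
    return True
```

A transient loss on one hop is repaired within a single hop's round-trip time, which is what keeps recovered packets inside a VoIP playback deadline.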

In this paper we identify a major security vulnerability in distributed
systems: compromised clients under adversarial control can use the
system within their authorized access rights and authenticated
channels to deliberately insert incorrect data. A significant problem
is that when
a malicious client insider is discovered, it is hard to quickly assess
the scope of the damage, and identify corrupt and suspected updates.

We propose Accountability Graph, a mechanism
that can assist applications in coping with and recovering from such attacks.
The tool provides accountability enforcement and causality tracking of
updates and their dependencies. Upon detection of incorrect data (e.g.
by an external intrusion detection mechanism or human assessment),
the Accountability Graph will quickly classify all updates in the system
as either corrupted, suspected or not affected. The practicality and
usefulness of the approach is demonstrated based on the requirements
of three different applications: an open source software development project,
a military common operation picture application, and a national emergency
response system. The Accountability Graph can also be used for risk
assessment and vulnerability analysis with respect to the above attack.

When many parties share network resources on an
overlay network, mechanisms must exist to allocate the resources
and protect the network from overload. Compared to large
physical networks such as the Internet, in overlay networks the
dimensions of the task are smaller, so new and possibly more
effective techniques can be used. In this work we take a fresh
look at the problem of flow control in multi-sender multi-group
reliable multicast and unicast and explore a cost-benefit approach
that works in conjunction with Internet standard protocols such
as TCP.

In contrast to existing window-based flow control schemes we
avoid end-to-end per sender or per group feedback by looking
only at the state of the virtual links between participating nodes.
This produces control traffic proportional only to the number of
overlay network links and independent of the number of groups,
senders or receivers. We show the effectiveness of the resulting
protocol through simulations and validate the simulations with
live Internet experiments. We demonstrate near optimal utilization
of network resources, fair sharing of individual congested
links and quick adaptation to network changes.
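The cost-benefit rule can be sketched as follows, assuming a hypothetical cost function that grows with the buffer occupancy of each virtual overlay link:

```python
def link_cost(occupancy, capacity):
    # Cost of a virtual overlay link rises as its buffer fills.
    return occupancy / capacity

def may_send(path, benefit):
    # path: list of (occupancy, capacity) pairs for the links traversed.
    # Cost-benefit rule: admit a message only if its benefit exceeds the
    # total cost of the overlay links it will cross, so congested links
    # price out low-value traffic.
    return benefit > sum(link_cost(o, c) for o, c in path)
```

Because the decision depends only on per-link state, control traffic scales with the number of overlay links rather than the number of groups, senders, or receivers.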

The cost savings and novel features associated with
Voice over IP (VoIP) are driving its adoption by service providers.
Such a transition however can successfully happen only if the
quality and reliability offered is comparable to the existing
PSTN. Unfortunately, the Internet's best effort service model
provides no inherent quality of service guarantees. Because
low latency and jitter are the key requirements for supporting
high quality interactive conversations, VoIP applications use
UDP to transfer data, thereby subjecting themselves to quality
degradations caused by packet loss and network failures.
In this paper we describe an architecture to improve the
performance of such VoIP applications. Two protocols are used
for localized packet loss recovery and rapid rerouting in the event
of network failures. The protocols are deployed on the routers of
an application-level overlay network and require no changes to
the underlying infrastructure. Experimental results indicate that
the architecture and protocols can be combined to yield voice
quality on par with the PSTN.

Group communication systems are high-availability
distributed systems providing reliable and ordered message
delivery as well as a membership service to group-oriented
applications. Many such systems are built using a distributed
client-server architecture where a relatively small set of
servers - sharing information about the groups in the system - provide
service to numerous clients.
In this work, we show how group communication systems can
be enhanced with security services without sacrificing robustness
and performance. More specifically, we propose several integrated
security architectures for distributed client-server group
communication systems. In an integrated architecture, security
services are implemented in servers, in contrast to a layered
architecture where the same services are implemented in clients.
We discuss performance and accompanying trust issues of each
proposed architecture and present experimental results that
demonstrate the superior scalability of an integrated architecture.

The cost savings and novel features associated with Voice over IP
(VoIP) are driving its adoption by service providers. Such a transition
however can successfully happen only if the quality and reliability
offered is comparable to the existing PSTN. Unfortunately,
the Internet's best effort service model provides no inherent quality
of service guarantees. Because low latency and jitter are the
key requirements for supporting high quality interactive conversations,
VoIP applications use UDP to transfer data, thereby subjecting
themselves to performance degradations caused by packet loss
and network failures.
In this paper we describe two algorithms to improve the performance
of such VoIP applications. These mechanisms are used for
localized packet loss recovery and rapid rerouting in the event of
network failures. The algorithms are deployed on the routers of
an application-level overlay network and require no changes to the
underlying infrastructure. Initial experimental results indicate that
these two approaches can be composed to yield voice quality on
par with the PSTN.

Group key agreement is a fundamental building block for secure peer group communication systems.
Several group key management techniques were proposed in the last decade, all assuming
the existence of an underlying group communication infrastructure to provide reliable and ordered
message delivery as well as group membership information. Despite analysis, implementation and
deployment of some of these techniques, the actual costs associated with group key management
have been poorly understood so far. This resulted in an undesirable tendency: on the one hand,
adopting sub-optimal security for reliable group communication, while, on the other hand, constructing
excessively costly group key management protocols.
This paper presents a thorough performance evaluation of five notable distributed key management
techniques (for collaborative peer groups) integrated with a reliable group communication
system. An in-depth comparison and analysis of the five techniques is presented based on experimental
results obtained in actual local- and wide-area networks. The extensive performance
measurement experiments conducted for all methods offer insights into their scalability and practicality.
Furthermore, our analysis of the experimental results highlights several observations which
are not obvious from the theoretical analysis.

Although telephony subscribers are accustomed to the
consistent voice quality and high reliability
of the traditional PSTN, the promise of a single converged
IP network to carry voice and data - and the cost savings therein -
motivates the interest to adopt voice-over-IP
(VoIP) technologies. However, the Internet provides best
effort delivery, without any inherent quality of service
guarantees. Low latency is a key factor in supporting high
quality interactive conversations, and as such contemporary
VoIP solutions use UDP to transfer data over the IP
layer, despite being subject to network loss and failures.

This paper describes the use of an overlay network
through which streams of voice packets are transmitted.
Flexible application level overlay routers can understand
the stringent requirements of VoIP and implement new
algorithms that mask the limitations of the underlying Internet.
We describe two protocols that facilitate localized
recovery for lost packets and rapid rerouting in the event of
network failures. Experimental results indicate that these
two approaches can be combined to yield a quantitative
improvement to voice communication quality.

Contributory group key agreement protocols generate group keys based on contributions of all group
members. Particularly appropriate for relatively small collaborative peer groups, these protocols are
resilient to many types of attacks. Unlike most group key distribution protocols, contributory group key
agreement protocols offer strong security properties, such as key independence and perfect forward secrecy.
This paper presents the first robust contributory key agreement protocol resilient to any sequence
of group changes. The protocol, based on the Group Diffie-Hellman contributory key agreement, uses
the services of a group communication system supporting Virtual Synchrony semantics. We prove that
it provides both Virtual Synchrony and the security properties of Group Diffie-Hellman, in the presence
of any sequence of (potentially cascading) node failures, recoveries, network partitions and heals.

We implemented a secure group communication service, Secure Spread, based on our robust key
agreement protocol and Spread group communication system. To illustrate its practicality, we compare
the costs of establishing a secure group with the proposed protocol and a protocol based on centralized
group key management, adapted to offer equivalent security properties.
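
The contributory principle behind Group Diffie-Hellman can be illustrated with a toy sketch. The parameters below are tiny and insecure, and this is not the robust protocol from the paper; it shows only the core idea that the shared key depends on every member's secret contribution.

```python
# Toy illustration of contributory key agreement in the Group
# Diffie-Hellman style; deliberately insecure parameters.
P = 23  # toy prime modulus
G = 5   # toy generator of the multiplicative group mod P

def group_key(secrets):
    """Fold each member's secret exponent into the key. The result is
    G**(x1*x2*...*xn) mod P, so the order in which contributions are
    applied is irrelevant and no single member determines the key."""
    k = G
    for x in secrets:
        k = pow(k, x, P)
    return k

# Every member computes the same key from the partial values it
# receives, regardless of the order the exponents were applied in.
assert group_key([6, 15, 7]) == group_key([7, 6, 15])
```

Because exponents multiply commutatively, each member can apply its own secret last to the partial value it receives, which is what makes the scheme contributory rather than centralized.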

The Spread toolkit is a group communication system available from
www.spread.org. Spread provides a range
of reliability, ordering and stability guarantees for message delivery.
Spread supports a rich fault model that includes process crashes and
recoveries and network partitions and merges under the extended
virtual synchrony semantics. The standard virtual synchrony semantics
is also supported.

In this paper we design a generic, consistent replication architecture that
enables transparent database replication and we present the optimizations
and tradeoffs of the chosen design. We demonstrate the practicality of
our approach by building a prototype that replicates a PostgreSQL
database system. We provide experimental results for consistent wide-area
database replication. We claim that the use of an optimized synchronization
engine is the key to building a practical synchronous replication system
for wide-area network settings.

The Domain Name System (DNS) is undergoing
fundamental changes in both design and operations,
but these changes are mostly taking place in piecemeal
extensions. In this paper we consider how to maintain
a simple and robust DNS in the face of these inevitable
and essential changes. We consider some of the features
that the modern DNS is trying to incorporate and
we look at the ensuing problems from a systematic perspective.
We identify some key architectural issues and
design principles that we believe are essential to the
successful integration of such features in the DNS infrastructure.
Following these principles, we sketch two
possible mechanisms that would improve the availability
and timeliness, and decouple the zone management
from the query/response data-path of DNS while being
deployable in parallel and incrementally with respect
to the existing infrastructure.

Maintaining the availability of critical servers and
routers is an important concern for many organizations.
At the lowest level, IP addresses represent the global
namespace by which services are accessible on the Internet.
We introduce Wackamole, a completely distributed
software solution based on a provably correct algorithm
that negotiates the assignment of IP addresses among the
currently available servers upon detection of faults. This
reallocation ensures that at any given time any public IP
address of the server cluster is covered exactly once, as
long as at least one physical server survives the network
fault. The same technique is extended to support highly
available routers.
The paper presents the design considerations, algorithm
specification and correctness proof, discusses the
practical usage for server clusters and for routers, and
evaluates the performance of the system.
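
The coverage invariant at the heart of the system can be sketched as follows. The assignment rule here (round-robin over the sorted membership) is my own simplification; the real protocol negotiates the assignment with a provably correct algorithm over an agreed membership, but the invariant it preserves is the same.

```python
# Sketch of the invariant Wackamole maintains: after any fault, every
# public IP address is claimed by exactly one surviving server.
def assign_ips(public_ips, live_servers):
    """Deterministic map from each public IP to one live server; all
    servers sharing the same membership view compute the same map,
    so no per-IP negotiation is shown here."""
    if not live_servers:
        return {}
    servers = sorted(live_servers)
    return {ip: servers[i % len(servers)]
            for i, ip in enumerate(sorted(public_ips))}

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
assign_ips(ips, ["A", "B", "C"])       # one IP per server
after = assign_ips(ips, ["A", "C"])    # server B fails
# Every IP is still covered exactly once by a surviving server.
assert sorted(after) == sorted(ips)
assert set(after.values()) <= {"A", "C"}
```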

Reliable point-to-point communication is usually
achieved in overlay networks by applying TCP on the end
nodes of a connection. This paper presents a hop-by-hop
reliability approach that considerably reduces the latency
and jitter of reliable connections. Our approach is feasible
and beneficial in overlay networks that do not have the
scalability and interoperability requirements of the global
Internet.
The effects of the hop-by-hop reliability approach are
quantified in simulation as well as in practice using a newly
developed overlay network system that is fair with the external
traffic on the Internet. The experimental results show
that the overhead associated with overlay network processing
at the application level is not a significant factor
compared with the considerable gain of the approach.
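
A toy expected-latency model (my own simplification, not the paper's analysis) shows why recovering a loss on the hop where it happened is cheaper than recovering it end to end. Assumptions: h hops of one-way delay d, independent per-hop loss probability p, and at most one retransmission per loss.

```python
# Compare expected delivery latency under end-to-end vs hop-by-hop
# loss recovery on an overlay path.
def expected_latency(h, d, p, hop_by_hop):
    base = h * d  # loss-free delivery time across the path
    if hop_by_hop:
        # each lossy hop is retried over that single hop's round trip
        penalty = h * p * (2 * d)
    else:
        # a loss anywhere forces a retry over the full path round trip
        loss_anywhere = 1 - (1 - p) ** h
        penalty = loss_anywhere * (2 * h * d)
    return base + penalty

e2e = expected_latency(h=5, d=10.0, p=0.01, hop_by_hop=False)
hbh = expected_latency(h=5, d=10.0, p=0.01, hop_by_hop=True)
assert hbh < e2e  # hop-by-hop recovery lowers expected latency
```

The gap widens with longer paths, since the end-to-end retry penalty grows with the full path round trip while the hop-by-hop penalty grows only with the single-hop round trip.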

This paper proposes several integrated security architecture
designs for client-server group communication systems. In an
integrated architecture, security services are implemented in
servers, in contrast to a layered architecture where the same services
are implemented in clients. We discuss the performance
and accompanying trust issues of each proposed architecture and
present experimental results that demonstrate the superior scalability
of an integrated architecture.

A fundamental challenge in database replication is to maintain a low cost
of updates while assuring global system consistency. The difficulty of the
problem is magnified for wide-area network settings due to the high latency
and the increased likelihood of network partitions. As a consequence,
most of the research in the area has focused either on improving the
performance of local transaction execution or on replication models
with weaker semantics, which rely on application knowledge to resolve
potential conflicts. In this work we identify the performance bottleneck
of the existing synchronous replication schemes as residing in the
update synchronization algorithm. We compare the performance of several
such synchronization algorithms and highlight the large performance gap
between various methods. We design a generic, synchronous replication
scheme that uses an enhanced synchronization algorithm and demonstrate
its practicality by building a prototype that replicates a PostgreSQL
database system. We claim that the use of an optimized synchronization
engine is the key to building a practical synchronous replication system
for wide-area network settings.

An ad hoc wireless network is an autonomous self-organizing
system of mobile nodes connected by wireless links where
nodes not in direct range can communicate via intermediate
nodes. A common technique used in routing protocols for ad hoc wireless networks is to establish the
routing paths on demand, as opposed to continually maintaining a complete
routing table. A significant concern in routing is the ability to function in the presence of
byzantine failures, which include nodes that drop, modify, or mis-route packets in an
attempt to disrupt the routing service.
We propose an on-demand routing protocol for ad hoc wireless networks that provides
resilience to byzantine failures caused by individual or colluding nodes. Our adaptive
probing technique detects a malicious link after log n faults
have occurred, where n is the length of the path. These
links are then avoided by multiplicatively increasing their
weights and by using an on-demand route discovery protocol that finds a least weight path to the destination.
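
The logarithmic detection bound comes from binary-searching the path. The sketch below is my own rendering of that idea; the probe oracle stands in for the paper's authenticated probes, and node names are illustrative.

```python
# Localize a faulty link on a path of n hops after O(log n) fault
# detections by probing the midpoint of the suspected segment.
def locate_faulty_link(path, delivers_up_to):
    """path is a list of nodes; delivers_up_to(i) reports whether
    probes acknowledged by node path[i] return, i.e. whether the
    prefix ending at that node is fault-free."""
    lo, hi = 0, len(path) - 1          # the fault lies on (lo, hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if delivers_up_to(mid):
            lo = mid                   # prefix is clean; fault is later
        else:
            hi = mid                   # fault is inside the prefix
    return (path[lo], path[hi])        # the link dropping packets

path = ["s", "a", "b", "c", "d", "t"]
bad = 3                                # link (b, c) misbehaves
link = locate_faulty_link(path, lambda i: i < bad)
assert link == ("b", "c")
```

Once localized, the link's weight is increased multiplicatively and the least-weight route discovery naturally steers subsequent traffic around it.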

This paper presents in detail an efficient and provably
correct algorithm for database replication over partitionable networks.
Our algorithm avoids the need for end-to-end acknowledgments for each action while supporting
network partitions and merges and allowing dynamic instantiation of new replicas.
One round of end-to-end acknowledgments is required only upon a membership change event
such as a network partition. New actions may be introduced to the system at any point,
not only while in a primary component. We show how performance can be further improved for
applications that allow relaxation of consistency requirements. We provide experimental results that
demonstrate the efficiency of our approach.

We propose a secure messaging infrastructure for unicast and multicast with near-optimal
performance and stronger semantics, beyond what is naturally achieved over the Internet, and
completely transparent to the application. This infrastructure is constructed by long-lived
daemons, running as user-space programs. The daemons dynamically create and maintain a logical
overlay network. The algorithms on this overlay network are not bounded by the standard
Internet protocols. Rather, they exploit the more limited scalability of the overlay network
(as opposed to the global Internet) to optimize performance (e.g. latency, throughput). Our
infrastructure can still scale to much higher numbers than, for example, group communication
systems, as we maintain much weaker global guarantees.

This paper presents a flow control protocol for
multi-sender multi-group multicast and unicast in wide area
overlay networks. The protocol is analytically grounded and
achieves real world goals, such as simplicity, fairness and
minimal resource usage. Flows are regulated based on the
"opportunity" costs of network resources used and the benefit provided by the flow.
In contrast to existing window-based flow control schemes, we avoid end-to-end per sender
or per group feedback by looking only at the state of the
virtual links between participating nodes. This produces
control traffic proportional only to the number of overlay
network links and independent of the number of groups,
senders or receivers. We show the effectiveness of the resulting protocol through simulations and
validate the simulations with live Internet experiments.

We present an algorithm for persistent consistent distributed commit (distributed database
commit) in a dynamic, asynchronous, peer to peer network. The algorithm has constant overhead
in time and space and almost constant communication complexity, allowing it to scale to very large
size networks. Previous solutions required linear overhead in communication and space, making
them unscalable.
We introduce a modular solution based on several well defined blocks with clear formal
specifications. These blocks can be implemented in a variety of ways and we give examples of possible
implementations. Most of the existing solutions require acknowledgments from every participant
for each action. Our algorithm is highly efficient by aggregating these acknowledgments. Also,
in contrast with existing solutions, our algorithm does not require any membership knowledge.
Components are detected based on local information and the information is disseminated on an
overlay spanning tree.
The algorithm may prove to be more suited for practical implementation than the existing
ones, because of its simplicity.
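
The acknowledgment aggregation can be sketched as a single pass up the overlay spanning tree. This is my own simplified rendering, not the paper's full block structure: each node reports the minimum action id acknowledged anywhere in its subtree, so the root learns the globally stable point without per-participant, per-action acks.

```python
# Aggregate acknowledgments on an overlay spanning tree: each node
# combines its own ack with its children's subtree minima.
def aggregate_acks(tree, acks, node):
    """tree maps node -> list of children; acks maps node -> highest
    action id that node has locally acknowledged. Returns the highest
    action id acknowledged by every node in the subtree."""
    sub = [aggregate_acks(tree, acks, child)
           for child in tree.get(node, [])]
    return min([acks[node]] + sub)

tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": []}
acks = {"root": 9, "a": 7, "a1": 5, "a2": 8, "b": 6}
# Every action up to id 5 is stable across the whole component.
assert aggregate_acks(tree, acks, "root") == 5
```

Control traffic is one message per tree edge rather than one per participant pair, which is where the near-constant communication complexity comes from.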

This paper explores the architecture, implementation and performance of a wide and
local area database replication system. The architecture provides peer replication,
supporting diverse application semantics, based on a group communication paradigm.
Network partitions and merges, computer crashes and recoveries, and message omissions
are all handled. Using a generic replication engine and the Spread group communication
toolkit, we provide replication services for the PostgreSQL database system. We define
three different environments to be used as test-beds: a local area cluster, a wide area
network that spans the U.S.A., and an emulated wide area test bed. We conduct an
extensive set of experiments on these environments, varying the number of replicas and
clients, the mix of updates and queries, and the network latency. Our results show that
sophisticated algorithms and careful distributed systems design can make symmetric,
synchronous, peer database replication a reality for both local and wide area networks.

This paper presents in detail an efficient and provably correct algorithm for database replication
over partitionable networks. Our algorithm avoids the need for end-to-end acknowledgments for
each action while supporting network partitions and merges and allowing dynamic instantiation
of new replicas. One round of end-to-end acknowledgments is required only upon a membership
change event such as a network partition. New actions may be introduced to the system at any
point, not only while in a primary component. We show how performance can be further improved
for applications that allow relaxation of consistency requirements. We provide experimental results
that demonstrate the superiority of this approach.

Group key agreement is a fundamental building block for
secure peer group communication systems. Several group
key agreement protocols were proposed in the last decade,
all of them assuming the existence of an underlying group
communication infrastructure.
This paper presents a performance evaluation of five notable key agreement
protocols for peer groups, integrated
with a reliable group communication system (Spread).
They are: Centralized Group Key Distribution (CKD),
Burmester-Desmedt (BD), Steer et al. (STR), Group Diffie-Hellman (GDH)
and Tree-Based Group Diffie-Hellman
(TGDH). The paper includes an in-depth comparison and
analysis of conceptual results and is the first to report practical results in real-life local
and wide area networks. Our analysis of these protocols' experimental results offers
insights into their scalability and practicality.

Researchers have made much progress in designing secure and scalable protocols
to provide specific security services, such as data secrecy, data integrity,
entity authentication and access control, to multicast and group applications.
However, less emphasis has been put on how to integrate security protocols with
modern, highly efficient group communication systems and what issues arise in
secure group communication systems. In this paper, we present a flexible and
modular architecture for integrating many different authentication and access
control policies and protocols with an existing group communication system,
while allowing applications to provide their own protocols and control the
policies. This architecture maintains, as much as possible, the scalability
and performance characteristics of the unsecure system. We discuss some of the
challenges when designing such a framework and show its implementation in
the Spread wide-area group communication toolkit.

This paper presents a flow control protocol for
multi-sender multi-group multicast and unicast in wide area
overlay networks. The protocol is analytically grounded and
achieves real world goals, such as simplicity, fairness and
minimal resource usage. Flows are regulated based on the
"opportunity" costs of network resources used and the benefit provided by the flow.
In contrast to existing window-based flow control schemes, we avoid end-to-end per sender
or per group feedback by looking only at the state of the
virtual links between participating nodes. This produces
control traffic proportional only to the number of overlay
network links and independent of the number of groups,
senders or receivers. We show the effectiveness of the
resulting protocol through simulations and validate the simulations
with live Internet experiments.

View-oriented group communication systems (GCSs) are powerful tools for building
distributed applications. Over the past fifteen years, group communication researchers
developed a multitude of group communication semantics and implementations. Today,
researchers commonly design their group communication algorithms on top of simple
existing services such as a network membership service or a reliable FIFO multicast
framework. A natural extension of this idea is to implement one set of group
communication semantics using another. This approach is not usually utilized due to the
expensive overhead of running one set of group communication algorithms on top of
another.
This thesis argues that the Extended Virtual Synchrony (EVS) model of group
communication, implemented using a client-daemon architecture, is of such high
performance that the overhead of constructing another group communication model on
top of it is acceptable. It demonstrates that the strong safety properties provided by the
EVS model can be leveraged to create very simple algorithms that implement more
powerful group communication models.
This thesis presents several EVS algorithms for implementing a partitionable Virtual
Synchrony (VS) model of group communication. It first explicitly defines the VS and
EVS models through the presentation of their safety and liveness properties. Then, one
simple algorithm is formally proved to implement the VS model by utilizing the safety
and liveness properties of the underlying EVS system. Finally, the paper discusses
several other simple variants and algorithms that were developed during the course of this
work.

Researchers have made much progress in designing secure and scalable protocols to provide
specific security services, such as data secrecy, data integrity, entity authentication and access control,
to multicast and group applications. However, less emphasis has been put on how to integrate security protocols
with modern, highly efficient group communication systems and what issues arise in secure group communication systems. In this paper, we present a flexible and
modular architecture for integrating many different authentication and access control policies and protocols with an existing group communication system, while
allowing applications to provide their own protocols and control the policies. This architecture maintains, as much as possible, the scalability and performance
characteristics of the unsecure system. We discuss some of the challenges when designing such a framework and show its implementation in the Spread wide-area group
communication toolkit.

Secure group communication is crucial for building distributed applications
that work in dynamic environments and communicate over unsecured networks
(e.g. the Internet). Key agreement is a critical part of providing security
services for group communication systems. Most of the current contributory key agreement
protocols are not designed to tolerate failures and membership changes during
execution. In particular, nested or cascaded group membership
events (such as partitions) are not accommodated.
In this paper we present the first robust contributory
key agreement protocols resilient to any sequence of events
while preserving the group communication membership
and ordering guarantees.

Flow control, especially in multicast networks, is a problem of significant theoretical and practical
interest. We present a protocol that is analytically grounded, yet also achieves real world goals,
such as simplicity, fairness and minimal resource usage. We base our flow control protocol on the
Cost-Benefit algorithmic framework for resource management. We base decisions on the "opportunity" costs
of network resources, comparing the cost of each individual resource to the benefit it
provides. As opposed to existing window-based flow control schemes, we avoid end-to-end feedback
by basing decisions on the state of the links between participating nodes. This produces control traffic
proportional only to the number of overlay network links and independent of the number of groups.
We apply this algorithm to an existing wide area group communication system; however, it can
be applied to other multicast services as well. We show the effectiveness of the resulting protocol
through simulations and live Internet experiments.
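
The opportunity-cost rule can be sketched as follows. In the Cost-Benefit framework a link's cost grows exponentially with its utilization; a flow is admitted only if its benefit exceeds the total cost of the links it would cross. The cost base and benefit scale below are illustrative, not the paper's tuned constants.

```python
# Opportunity-cost admission: decisions use only local link state,
# not per-sender or per-group end-to-end feedback.
MU = 64.0  # cost base; cost is exponential in utilization

def link_cost(used, capacity):
    return MU ** (used / capacity)

def admit(flow_links, benefit):
    """flow_links: list of (used, capacity) pairs for each overlay
    link the flow would cross. Admit only if the flow's benefit
    exceeds the summed cost of the resources it consumes."""
    total = sum(link_cost(u, c) for u, c in flow_links)
    return benefit > total

# A lightly loaded path is cheap; a nearly saturated one is not.
assert admit([(10, 100), (20, 100)], benefit=10.0)
assert not admit([(95, 100), (90, 100)], benefit=10.0)
```

Because the cost is a function of link state alone, control traffic scales with the number of overlay links rather than with the number of groups, senders or receivers.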

Search engines bring order to the content on the World Wide Web (WWW).
They are the primary interface for Internet users to find information. They currently take
30 days or more to find the changes that occur on the WWW due to the enormous amount
of data that must be crawled to find these changes. Yet the WWW is experiencing an
explosive growth of content, which will either greatly increase the cost of maintaining
search engines or progressively worsen the rate at which changes are found.
HyperDog is a proposed solution to this problem. It uses a large Internet-wide
metacomputer to track the changes on the WWW. This system filters out only the
changes that occur and feeds them back to the search engines and other indexers.
Because the percentage of new and changed pages per day is relatively small, HyperDog
represents a significant bandwidth improvement. When search engines only have to
manage the changes on the WWW, they can scale better with the growth of the WWW.
HyperDog represents a key application to take advantage of the new global
metacomputers. By distributing the crawlers to many points in the network through these
metacomputers, the crawling algorithm gains some advantages. The crawlers each scan
a small part of the WWW, perform the necessary processing to determine if the pages
have changed, and return only the small portion of the WWW that is new or has
changed. Together the crawlers perform the majority of the work and provide the search
engines with a much smaller, condensed form of the information they need.
This paper also examines the properties of these metacomputers in an attempt to
define these properties and encourage the development of other key constructive
applications.

A metacomputer is a set of machines networked together for increased computational
performance. To build an efficient metacomputer, one must assign jobs to the various
networked machines intelligently. A poor job assignment strategy can result in heavily
unbalanced loads and thrashing machines. This cripples the cluster's computational
power. A strong job assignment strategy helps a metacomputer complete all of its jobs
swiftly.
Resource heterogeneity makes job assignment more complex. Placing a job on one
machine might risk depleting its small memory. Another machine might have more free
memory but a heavily burdened CPU. Bin packing on memory protects the system
against thrashing. Load balancing protects the system against high CPU loads.
Combining the two approaches, however, gives an ad hoc heuristic algorithm with no
clear theoretical merit.
The Cost-Benefit Framework, developed in this work, offers a new approach to job
assignment on metacomputers. It smoothly handles heterogeneous resources by
converting them into a unitless cost. We assign (and possibly reassign) jobs to greedily
minimize this cost.
This approach gives us an online strategy provably competitive with the optimal
offline algorithm in the maximum usage of each resource. It has a weak competitive ratio,
logarithmic in the number of machines in the cluster, but even this weak ratio is
unprecedented in the literature. No other known method offers any competitive guarantee
on more than one resource.
We present experimental evidence that this strategy performs extremely well in
practice, comparing it to two important benchmarks: the default round robin strategy of
the popular PVM metacomputing system, and the powerful adaptive strategy of the
Mosix system.
Metacomputing environments are not homogeneous. In some environments, the
scheduler has a great deal of information about submitted jobs. In other cases, it has very
little. Some systems can migrate jobs without interrupting their execution. Others cannot.
We develop variants of the basic "opportunity cost" strategy of the Cost-Benefit
Framework for various metacomputing environments, and prove all of them highly
efficient.
Finally, we provide two metacomputing systems, a prototype and a complete system,
based on these ideas. The Java Market prototype is a metacomputer built atop Java and
web technologies, able to integrate any consenting Internet-connected machine. The
Frugal System transforms any Jini network into a metacomputer.
This work was carried out under the supervision of Professor Yair Amir.
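
The "opportunity cost" assignment rule described above can be sketched in a few lines. The cost base, resource names and machine states below are illustrative assumptions, not the thesis's exact formulation: each machine's heterogeneous resource usage is mapped to a single cost that is exponential in utilization, and a job goes to the machine whose cost rises least.

```python
# Greedy opportunity-cost job assignment over heterogeneous resources.
MU = 64.0  # cost base; cost is exponential in utilization

def machine_cost(usage):
    """usage maps each resource (e.g. 'cpu', 'mem') to a utilization
    fraction in [0, 1]; heterogeneous resources become one number."""
    return sum(MU ** u for u in usage.values())

def assign(machines, job):
    """Pick the machine whose marginal cost of taking the job
    (cost after minus cost before) is smallest."""
    def marginal(m):
        after = {r: machines[m][r] + job.get(r, 0.0)
                 for r in machines[m]}
        return machine_cost(after) - machine_cost(machines[m])
    return min(machines, key=marginal)

machines = {
    "m1": {"cpu": 0.9, "mem": 0.2},   # busy CPU, ample memory
    "m2": {"cpu": 0.3, "mem": 0.8},   # idle CPU, tight memory
}
# A CPU-heavy job lands on the machine with the idle CPU, because the
# exponential cost punishes pushing an already-hot resource higher.
assert assign(machines, {"cpu": 0.4, "mem": 0.1}) == "m2"
```

This single-number cost is what lets the framework balance bin packing on memory against load balancing on CPU without an ad hoc combination of the two.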

Secure group communication is crucial for building distributed applications
that work in dynamic environments and communicate over unsecured networks
(e.g. the Internet). Key agreement is a critical part of providing security
services for group communication systems. Most of the current contributory key agreement
protocols are not designed to tolerate failures and membership changes during
execution. In particular, nested or cascaded group membership
events (such as partitions) are not accommodated.
In this paper we present the first robust contributory
key agreement protocols resilient to any sequence of events
while preserving the group communication membership
and ordering guarantees.

Numerous applications requiring information delivery from one sender to many receivers are based on a group
communication model. Group communication systems are used in industry and military systems where
reliability and high availability are required. With the growth of the Internet, the number of applications that can
take advantage of a group communication infrastructure has increased (teleconferences, white-boards, video
conferences, distributed interactive simulation, collaborative work). Over wide area networks the need for providing
confidentiality, integrity, and authenticity of messages is essential.
In this paper we present Secure Spread, a secure version of the Spread Toolkit. Secure Spread is a group
communication system that utilizes the contributory group key management developed by the Cliques project and
the Blowfish symmetric encryption algorithm. Its modular design allows drop-in replacement of the encryption and/or
key agreement protocol. This work does not go into the details of a complete solution that handles every possible
combination of network events; rather, it focuses on performance evaluation in the general case. The results
give a good indication and insight into the overall cost of security in a group communication environment.

A new method is presented for job assignment to and reassignment between machines in a
computing cluster. Our method is based on a theoretical framework that has been
experimentally tested and shown to be useful in practice. This "opportunity cost" method
converts the usage of several heterogeneous resources in a machine to a single homogeneous
"cost." Assignment and reassignment are then performed based on that cost. This is in
contrast to traditional, ad hoc methods for job assignment and reassignment. These treated
each resource as an independent entity with its own constraints, as there was no clean way to
balance one resource against another. Our method has been tested by simulations as
well as real executions and was found to perform well.

Group communication systems are proven tools upon
which to build fault-tolerant systems. As the demands for
fault-tolerance increase and more applications require reliable distributed computing over
wide area networks, wide
area group communication systems are becoming very useful. However, building a wide area
group communication system is a challenge. This paper presents the design of
the transport protocols of the Spread wide area group communication system.
We focus on two aspects of the system:
first, the value of using overlay networks for application-level
group communication services; second, the requirements and design of effective low-latency
link protocols used to construct wide area group communication. We support
our claims with the results of live experiments conducted
over the Internet.

Group communication systems are used to build fault-tolerant and replicated services.
However, they can also be very useful for building tools that support a highly available
web cluster, or help manage a large collection of machines. This paper explores the
features of one particular group communication system, Spread, that are useful in this
environment and describes two specific applications that we built and actively use:
a distributed system logging tool and an IP-based high-availability fail-over tool for clusters.

Managing a large collection of networked machines, with a series of incoming jobs,
requires that the jobs be assigned to machines wisely. A new approach to this problem is
presented, inspired by economic principles: the Cost-Benefit Framework. This framework
simplifies complex assignment and performs well in practice. We demonstrate this framework in
the context of an Internet-wide market for computational services and verify its
utility for a classic network of workstations.

The increasing popularity and diversity of collaborative
applications prompts a need for highly secure and reliable
communication platforms for dynamic peer groups. Security mechanisms for such
groups tend to be both expensive
and complex and their integration with reliable group communication services
presents a formidable challenge.
This paper discusses some important integration issues,
reports on our implementation experience and provides experimental results.
Our approach utilizes distributed group
key management developed by the Cliques project. We enhance it to handle
processor and network faults (under a
fail-stop or crash-and-recover model) and asynchronous
membership events (such as joins, leaves, merges and network partitions).
Our approach leverages the strong properties provided by the Spread group communication system,
such as message ordering, clean failure semantics and a
membership service. The result of this work is a secure
group communications layer and an API that provide the
application programmer with both standard group communication services and flexible security services.

mod_backhand is a dynamically loadable module for the Apache web server developed at The Center for
Networks and Distributed Systems (CNDS) at The Johns Hopkins University. This module aims to
seamlessly allow for load balancing within a cluster of machines. This paper will discuss the concepts,
implementations, advantages/disadvantages, practical applications and future directions.
mod_backhand is a part of the Backhand project initiated at CNDS. The purpose of the Backhand
project is to develop tools for effective resource management and utilization. The purpose of
mod_backhand, specifically, is to provide the infrastructure to reallocate HTTP requests to any machine
within a cluster and the framework for effective and flexible decision making.
The concept behind mod_backhand is the amalgamation of several important deviations from
"standard" practice. Many network appliances used for balancing web clusters have their foundations in
the networking world. This leads to a design drawn in the image of a router, an approach we
purposely avoided to prevent bottlenecks and single points of failure. The approach we use allows for
maximum utilization of a network's egress points and survives link failures extremely well. The flexibility
of the implementation provides a tool that can be used to build both a single and a multiple entry point
cluster.
The Backhand project was designed to tackle all aspects of resource allocation and management.
This was done for two reasons. First, the issues that must be tackled are far out of the scope of an Apache
module. Second, it allows for intelligent separation of concepts and functionality. mod_backhand, in its
current form, attempts to solve resource allocation and management issues within a cluster of machines on
a relatively low latency, high-throughput network.
The problems involved in finding the best web server or cluster for a specific client are better
tackled at a lower level. Specific approaches, such as optimization algorithms based on DNS and IP routing,
are not in the scope of mod_backhand, but fit well in the more general Backhand project.
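The request-reallocation decision described above can be imagined as a chain of candidacy functions, each filtering or reordering the list of candidate servers until one is chosen. The sketch below is illustrative only; the function names, server fields, and load metrics are assumptions, not mod_backhand's actual API.

```python
# Hypothetical sketch of candidacy-function-style decision making:
# each function in the chain filters or reorders the candidate list,
# and the request is handed to the first surviving server.

def by_age(servers, max_age=5):
    """Drop servers whose resource announcements are stale."""
    return [s for s in servers if s["age"] <= max_age]

def by_load(servers):
    """Order the remaining servers by reported load, lightest first."""
    return sorted(servers, key=lambda s: s["load"])

def choose_server(servers, chain=(by_age, by_load)):
    for f in chain:
        servers = f(servers)
    return servers[0] if servers else None

cluster = [
    {"name": "web1", "load": 0.70, "age": 1},
    {"name": "web2", "load": 0.20, "age": 2},
    {"name": "web3", "load": 0.10, "age": 9},  # stale announcement
]
print(choose_server(cluster)["name"])  # web2
```

Because each candidacy function is independent, new decision criteria can be added to the chain without touching the proxying machinery, which is the flexibility the framework aims for.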

The PNNI is an umbrella protocol that allows considerable routing flexibility. While it
specifies the network signalling, it leaves route generation undefined. Any non-trivial
route generation requires the collection and dissemination of network metrics. Furthermore,
hierarchical networks require aggregation and compression of these metrics in order to be
scalable. Aggregation and compression methods are also left unspecified by the PNNI,
although the PNNI recommends using star graphs for aggregation. In contrast, recent work
by Awerbuch, Du, Khan and Shavitt [ADKS98] suggests that the minimum spanning tree
is a better choice. Their work also shows that exponential weight routing is superior to
(the ubiquitous) min-hop routing. In this paper, we extend the work of [ADKS98] to
multicast. We define a multicast protocol, and model it and incorporate it into their
simulator. Surprisingly, our results are not identical to those of [ADKS98]. However,
the recommended schemes of [ADKS98] perform very well, and significantly outperform the
standard min-hop routing. This study further strengthens the case for including the minimum
spanning tree and exponential weight routing in the PNNI standards.
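The difference between min-hop and exponential weight routing can be seen in a small sketch. Under exponential weights an edge's cost grows as mu to the power of its utilization, so heavily loaded links are avoided even at the price of extra hops. The base mu and the example topology below are assumptions for illustration, not values from the PNNI or [ADKS98].

```python
# Contrast min-hop routing with exponential-weight routing on a
# three-node network whose direct link is nearly saturated.
import heapq

def dijkstra(graph, src, dst, weight):
    """graph: {node: [(neighbor, utilization), ...]}; returns a path."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, util in graph[u]:
            nd = d + weight(util)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

MU = 16.0
min_hop  = lambda util: 1.0          # every edge costs the same
exp_cost = lambda util: MU ** util   # cost grows sharply as a link fills

# The direct A-C link is 90% utilized; the detour via B is lightly loaded.
g = {"A": [("C", 0.9), ("B", 0.1)],
     "B": [("C", 0.1), ("A", 0.1)],
     "C": [("A", 0.9), ("B", 0.1)]}
print(dijkstra(g, "A", "C", min_hop))   # ['A', 'C']
print(dijkstra(g, "A", "C", exp_cost))  # ['A', 'B', 'C']
```

Min-hop routing insists on the congested direct link, while the exponential weight routes around it, which is exactly the behavior that makes it attractive for throughput.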

We present an O(log^2 n log M) throughput competitive algorithm for online multicast, where n
is the number of nodes in the graph, and M is the number of multicast groups. This improves
upon the previous best O((log n + log log M)(log n + log M) log n) competitive algorithm by Goel,
Henzinger and Plotkin, and closes the gap between the upper and lower bounds to O(log n).
This algorithm uses the recent extension of the maximal dense tree algorithm to graphs with
dynamic edge weights. Our algorithm is modular, and does not involve an integrated analysis
of its winner picking and routing components.

In this paper we provide a polylogarithmic competitive online algorithm for the maximal dense
tree problem on a graph with dynamic edge weights. The previous work on the topic,
[AS97], only handled static edge weights. We expect that the inclusion of dynamic edge
weights will significantly improve the utility of our algorithm. Specifically, we expect
simpler and better solutions to the important throughput competitive online multicast
problem.

Today, most of the popular web sites are served from
single locations. This basic Web client-server model
is easy to deploy and maintain and thus is very
successful. It suffers, however, from several efficiency
and availability problems. This paper presents
Walrus, a low-latency, high-throughput Web service
that addresses some of these problems. Under
Walrus, a single logical Web server can be replicated
to several clusters of identical servers where each
cluster resides in a different part of the Internet. An
important aspect of Walrus is its ability to
transparently direct the web browser to the best
replica without any changes to the web server, web
client, and network infrastructure. "Best" is a
relative term, dependent on where the client is
located on the network, the load on each replica, and
more. Walrus employs an elegant algorithm that
balances these considerations.
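As a toy illustration only (this is not Walrus's actual algorithm, and the weights and metrics are invented here), one way to balance client proximity against replica load is a weighted score minimized over the candidate replicas:

```python
# Hypothetical "best replica" selection: combine client-to-replica
# distance (RTT) and replica load into one score, pick the minimum.
ALPHA, BETA = 1.0, 100.0   # invented relative weights: distance vs. load

def best_replica(replicas, client_rtt_ms):
    """replicas: {name: load in [0, 1]}; client_rtt_ms: {name: ms}."""
    def score(name):
        return ALPHA * client_rtt_ms[name] + BETA * replicas[name]
    return min(replicas, key=score)

replicas = {"us-east": 0.8, "eu-west": 0.3}
rtt      = {"us-east": 20.0, "eu-west": 45.0}
print(best_replica(replicas, rtt))  # eu-west: farther, but far less loaded
```

The point of the sketch is the trade-off itself: a nearby but overloaded cluster can lose to a more distant, lightly loaded one.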

The explosion of the web has led to a situation where a majority of the
traffic on the Internet is web related. Today, practically all of the popular web sites
are served from single locations. This necessitates frequent long-distance network
transfers of data (potentially repeated), which results in high response times for
users and wastes the available network bandwidth. Moreover, it commonly
creates a single point of failure between the web site and its Internet provider.
This paper presents a new approach to web replication, where each of the replicas
resides in a different part of the network, and the browser is automatically and
transparently directed to the "best" server. Implementing this architecture for popular
web sites will result in a better response-time and a higher availability of these sites.
Equally important, this architecture will potentially cut down a significant fraction of
the traffic on the Internet, freeing bandwidth for other uses.

Building a wide area group communication system is a challenge. This paper presents the design and protocols
of the Spread wide area group communication system. Spread integrates two low-level protocols: one for local
area networks called Ring, and one for the wide area network connecting them, called Hop. Spread decouples the
dissemination and local reliability mechanisms from the global ordering and stability protocols.
This allows many optimizations useful for wide area network settings. Spread is operational and publicly
available on the Web.

Managing a large collection of networked machines, with a series of incoming jobs,
requires that the jobs be assigned to machines wisely. A new approach to this problem is
presented, inspired by economic principles: the Cost-Benefit Framework. This framework
simplifies complex assignment and performs well in practice. We demonstrate this framework in
the context of an Internet-wide market for computational services and verify its
utility for a classic network of workstations.

A new method is presented for job assignment to and
reassignment between machines in a computing cluster.
Our method is based on a theoretical framework that
has been experimentally tested and shown to be useful
in practice. This "opportunity cost" method converts the
usage of several heterogeneous resources in a machine
to a single homogeneous "cost". Assignment and
reassignment is then performed based on that cost. This
is in contrast to previous methods for job assignment
and reassignment, which treat each resource as an
independent entity with its own constraints. These
previous methods were intrinsically ad hoc, as there
was no clean way to balance one resource against
another.
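The conversion of heterogeneous resource usages into a single cost can be sketched with a marginal-cost rule: a machine's cost grows exponentially with each resource's utilization, and a job is assigned to the machine whose cost would increase the least. The cost function's base and the two-resource setup below are assumptions for illustration.

```python
# Sketch of opportunity-cost job assignment: fold CPU and memory
# utilization into one exponential cost, then assign each job to the
# machine with the smallest marginal (added) cost.
N = 2 ** 10  # assumed steepness of the cost curve

def cost(cpu, mem):
    """Single homogeneous cost of a machine at given utilizations."""
    return N ** cpu + N ** mem

def assign(machines, job):
    """machines: {name: (cpu, mem) utilization}; job: (cpu, mem) demand.
    Return the machine whose cost rises least if it takes the job."""
    def marginal(name):
        cpu, mem = machines[name]
        return cost(cpu + job[0], mem + job[1]) - cost(cpu, mem)
    return min(machines, key=marginal)

machines = {"m1": (0.8, 0.1),   # CPU-heavy machine
            "m2": (0.4, 0.4)}   # evenly loaded machine
job = (0.1, 0.1)                # needs 10% CPU and 10% memory
print(assign(machines, job))    # m2
```

The exponential makes a nearly full resource prohibitively expensive, so the rule naturally balances all resources at once instead of treating each one as an independent constraint.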

Most of the machines that are connected to the Internet are idle a significant fraction of the time.
This paper presents the Java Market, a system that allows organizations and Internet users to make use
of this wasted computational power.
Using the Java programming language and Web technology, the Java Market is the first
metacomputing system that can seamlessly take advantage of machines of any architecture, anywhere
on the Internet. Every user on the Internet can contribute their machine's computational resources just
by pointing a Java-capable browser to the Java Market web page. Similarly, users can launch jobs to
the system by posting them on the Web and registering them with the Java Market.
In contrast, other systems that allow sharing of computational resources between machines over the
network require a homogeneous system architecture. They often involve extensive installation or even
kernel-level modifications to the operating system.