Building a Hierarchical Content Distribution Network with Unreliable Nodes
Jared Friedman and Erik Garrison
Harvard University

{jdfriedm,garris}@fas.harvard.edu
Abstract

Running a popular website is expensive, due to hardware and bandwidth
costs. For some websites, users might want to donate their idle computing resources to help host the website, reducing the site’s hosting costs.
Unfortunately, there is currently no easy way for a website to spread
its hosting burden to ordinary PCs, which are geographically dispersed,
unreliable, possibly malicious, and may have low system resources and
be behind NAT. We present a design and simulation for Jellyfish, a distributed web caching system which is intended to run across ordinary
unreliable PCs while maintaining very high performance. We propose
solutions for the problems of security, diverse performance characteristics, NAT, and hotspots. We also take a detailed look at the object
placement problem, propose a new algorithm for object placement, and
show through simulation that it has better performance than the commonly used CARP algorithm.

1.1.1 Mirroring

Volunteers often run mirror sites to help overloaded primary
sites host data. Unfortunately, mirror sites generally need to be
dedicated server-class machines with static IP addresses, which
greatly restricts the pool of volunteers who can contribute to
website hosting in this way. Given that typical access to web objects follows a Zipf-like distribution [6], blindly mirroring content without respect to access frequency can be quite wasteful
relative to caching schemes which selectively cache the most
popular content [5, 13].
Such caching systems require a degree of coordination
greater than that typically associated with mirroring arrangements; at the same time they effectively lower the barrier to entry
in a cooperative caching system. Even a donated system capable of caching a handful of the most popular objects can aid the
cause of the cooperative system it is allied to. We believe that
problems associated with security and reliability have prevented
such systems from developing.

1 Introduction
Running a popular website is an expensive proposition.
Wikipedia, globally the 17th most popular website,1 requires
200 dedicated servers to keep the site online. The hardware
and bandwidth costs associated with a high profile site like
Wikipedia run into the hundreds of thousands of dollars. For
websites run by small companies, not-for-profit organizations,
or individuals, these costs may prove quite burdensome, and
it is only by running constant fund-raising drives that they
can maintain an acceptable quality of service to the community which they serve. Computer resource sharing projects like
SETI@home [4], Folding@home [2], BOINC [1], and Google
Compute [3] have demonstrated that many people are willing to
donate their idle computer resources to projects with common
benefit.
In light of the success of these projects, it is probable that
many people would be willing to donate their spare computing resources to help keep popular not-for-profit websites online. However, currently no suitable content distribution network
(CDN) exists to enable such activity.

1.1 Supporting websites cooperatively

There are a variety of methods by which the hosting of a website
can be shared amongst the resources of a number of individuals:
1 Alexa ranking, as of May 2006

1.1.2 Commercial CDNs

Commercial Content Distribution Networks like Akamai and
Digital Island have shown that it is possible to distribute the
hosting of a high profile website around the world, with widely
dispersed servers each caching small pieces of the website, and
that doing so can greatly improve performance and absorb flash
crowds. These services are defined by their closed architectures
and high cost (some estimates have placed costs-per-byte as high
as 1.75 times that of non-CDN bandwidth [20]).

1.1.3 Cooperative caching

A wide variety of cooperative caching systems have been proposed; some are in use today. Two principal classes of cooperative storage systems exist: those designed primarily for data
storage, such as Freenet [10] (a distributed storage system aimed
at anonymity) and Riptide/Oceanstore [11] (a similar system optimized for reliable and highly-available data storage), and those
designed for temporarily caching data, such as CoralCDN [12],
a peer-to-peer system designed to distribute the costs of hosting
a website across a network of peer-based caches.

1.1.4 CoralCDN

CoralCDN is a relatively new project that comes closest to the
goals we have in mind. Coral is a fully distributed peer-to-peer
content distribution network, intended to allow users
with almost no hosting capacity to host a very high profile website. One of the benefits of Coral is that it is very easy to use.
Because Coral uses a DNS-based system for handling queries
directed at the CDN, one only needs to append .nyud.net:8090
to the domain of any URL and the result will be delivered from
the Coral network. This makes Coral ideal for handling the so-called “Slashdot effect”, in which very small web hosting systems can be completely overloaded when content they distribute
is referenced by popular websites. Coral is currently hosted on
PlanetLab and available for use, but not open to donated computer resources due to security concerns.
While Coral stands as a major contribution to volunteer-based
CDN technology, we believe that due to its current design it
may be inapplicable to the problem space in which we are interested. Generally, Coral is ideal for websites with virtually
no hosting capacity that could be quickly overloaded through a
Slashdot effect. Coral is less suited to websites like Wikipedia
which already have substantial hosting capacity and have very
high expectations for their users' experience. For such websites,
we believe that Coral has the following problems:
• Security: Coral currently has no security system. We
are most concerned with an internal attack on the system,
whereby volunteer nodes replace requested content with
content of their own choosing. The authors of Coral recognize this attack as a problem, but do not solve it. If Coral
is to remain completely distributed in its operation, doing
so might require the implementation of a complex “web of
trust” on top of the existing system.
• Openness: That any website can use Coral to dramatically
reduce its hosting costs is both a curse and a blessing. On
the plus side, it means that users can “register” a website
for Coral without that website even knowing of Coral’s existence by surfing to the “Coralized” URL corresponding
to the site.2 On the other hand, this openness allows the
network to be used for whatever purpose its decentralized
userbase desires.
In short, the entirely decentralized nature of Coral provides
no guarantees to users that their donated resources will be
spent in the manner which they support. Given that the content of a mature CoralCDN will match that of the Internet,
donated resources are just as likely to aid the publication of
low-budget, commercial pornography sites as they are free,
publicly editable encyclopedias. This possibility limits the
draw which the application has to users, in turn throttling
the effectiveness of the entire network.
2 This is happening already, as some links on Slashdot are now posted in Coralized form.
• Performance: In their description of Coral, Freedman et al.
describe median client latencies of 0.19 seconds, ostensibly a value low enough to make Coral quite usable for
web content delivery. In our experience, however, the actual latencies for Coralized webpages were much higher,
usually at least several seconds. Many factors could cause
this disparity, such as overloading of the Coral network
(likely, since it is only running on PlanetLab), or the fact
that the test mentioned in the paper requests a single file,
whereas a typical website consists of many, and thus requires the initiation of a number of separate requests. We
feel that the overlay structure of Coral, which can cause requests to travel through many volunteer nodes before ultimately terminating in an origin server, will likely lead
to higher latencies than Jellyfish’s hierarchical approach,
which bounds the maximum number of hops for a cache hit
at 2.
• Optimization: As a distributed caching system grows, the
content creators will likely want some guarantee that it is
making efficient use of the possible hosting capacity of its
volunteered machines. As we discuss later in the paper,
the goal of efficient use of resources leads to mathematically difficult object placement and request routing problems. There is also the issue of the communications overhead of keep-alives and other maintenance messages involved in implementing a particular solution, which further
complicates analysis. We have found it difficult to compare the optimality of Coral with that of Jellyfish, but given
our work with the request routing problem, we suspect that
it may be easier to build a smart heuristic algorithm when
knowledge of nodes is more centralized.

1.2 Jellyfish

In this paper, we present a design for Jellyfish,3 a cooperative
object caching system for the web. We implement Jellyfish in
simulation and discuss the results of our simulated tests. While
we do not present a working prototype of Jellyfish, we have
taken steps towards an implementation, and additionally propose
a plan for adapting the Squid caching engine as the base of Jellyfish. By way of an analysis of literature and a simulation of a
set of caching protocols in a similar (but not identical) problem
space, we relate some heuristic methods which might be used
in a deployed system. We present some methods for securing
a distributed caching system and a different and possibly more
efficient overall system design.

2 Related work

Generally speaking, Jellyfish aims to be a peer-based, decentralized caching system for web content. This problem space
is unusual in that virtually all previous research in distributed caching of which the authors are aware assumed that the
nodes in a given caching system were reliable in terms of their
performance, trusted to not carry out malicious behavior, and
(generally) homogeneous in their performance characteristics.
Research in peer-to-peer systems has tended to relax these
assumptions most readily. However, related work in peer-to-peer systems often imposed arbitrarily high restrictions on the
elements of systems which could remain centralized, thus producing systems which are most applicable to situations in which
systems must truly be distributed, but less applicable to cases in
which some degree of relative centralization exists and can be
used to improve system performance.
3 So-named in keeping with the naming tradition of caching systems: Coral,
CARP, Squid, and so forth.

In most cases this decentralization implies the existence of
a shared routing or object placement protocol that provides the
system with important characteristics like scalability and robustness. These structures (most notably distributed hash tables, or
DHTs) provide deterministic, non-hierarchical routing for object
lookup and retrieval. For DHTs like Tapestry [14] and Chord
[22], a lookup takes O(log n) overlay hops, where n is the population of
nodes in the overlay. Such structures thus excel in the case of a
system which cannot have an explicit central authority, or where
that authority has extremely limited resources, and have thus
been deployed in support of applications like BitTorrent (where
they help to spread responsibility for the legal ramifications of
copyright-infringing file transfer). In such cases, the RTT engendered by the overlay protocol matters little relative to the reliability and decentralized qualities of the system.
By contrast, the needs of a system used for the distribution
of web content require stronger guarantees of responsiveness
than such overlay networks have yet demonstrated. Studies have
shown that users treat the web as a kind of interactive computing
environment, and tolerate wait times of little more than 8 seconds before giving up their attempt to access content [18]. Some
of the problems associated with the maintenance of DHT-based
peer-to-peer systems, such as overlay partitions resulting from
periodic, distributed routing anomalies [19], can cause unanticipated and undesirable system behavior which could cripple interactive systems.
Although they may be capable of adjusting to such issues,
purely distributed overlays which exhibit no centralization lack
methods to guarantee the real-time robustness required to maintain an acceptable quality of service. In contrast, a hierarchical
overlay structure provides routing in the order of a constant proportional to the depth of the hierarchical overlay. Additionally,
when a system is geared for deployment by a single institution or
entity (or alternatively a “friend-to-friend” based system established among such entities), it is sensible that the system utilize
the centralized properties of the human networks which it serves
to the benefit of its users.
An overview of prior work in hierarchical caching systems
will clarify the design principles we utilize in our system.
Of principal concern are the trade-offs between various object
placement and retrieval algorithms, and a variety of security systems which might be used in a network of untrusted peers.

2.1 Distributed object placement algorithms

In the most general sense, web caching is a subset of a class of
computationally difficult problems known as object placement
problems. Generally, an answer to an object placement problem describes a method of distributing resources in an efficient
manner across a graph topology. In some limited cases in which
placement is constrained by a small number of factors, the object
placement problem is soluble [23]. However, we are unaware of
any wide-area use of precise object placement algorithms. We
suspect that network operators are unlikely to employ these algorithms given the high cost of updating placement schemas after alterations to the network are made. While benefits can be
derived from a perfectly tuned object placement algorithm, in
practical situations a distributed, adaptive solution can be engineered to distribute data with little detriment to quality of
service. We also believe that the added constraints
of unreliable and highly heterogeneous nodes imply the need for
an adaptive solution to the problem.
Of these adaptive solutions to the object placement problem,
the most commonly used is the so-called greedy, or “least recently used” (LRU) caching algorithm. The algorithm is conceptually simple and computationally efficient, and has been employed by a wide variety of content distribution systems in ISPs,
commercial CDNs, and caching systems at the site of content
creation and at the site of consumption (e.g. in web browsers).
In the LRU algorithm, the cache maintains a queue of all objects
it currently has in store. On receipt of a request, it returns the object, stores it (if it has not already) and places a reference to the
object on the top of its queue. When the cache is full, it simply
replaces the least-recently-used objects (i.e. those on the bottom
of the queue) with the new objects.
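
The LRU policy described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of any particular cache: the class name `LRUCache`, the `fetch` callback, and the choice to count capacity in objects instead of bytes are all assumptions made for brevity.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU object cache (illustrative; capacity counted in objects)."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity      # maximum number of objects held
        self.fetch = fetch            # hypothetical callback that retrieves a missing object
        self.store = OrderedDict()    # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1
            self.store.move_to_end(url)      # mark as most recently used
            return self.store[url]
        self.misses += 1
        obj = self.fetch(url)                # cache miss: retrieve from the origin
        self.store[url] = obj
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used object
        return obj


cache = LRUCache(capacity=2, fetch=lambda url: "body of " + url)
for url in ["/a", "/b", "/a", "/c", "/b"]:
    cache.get(url)
```

In this short trace only the second request for `/a` is a hit; `/b` is evicted when `/c` arrives and must be fetched again.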
Other adaptive solutions seek to optimize the method by
which content is pushed to distributed caches. While uncoordinated pull-based caching is an effective method to reduce
bandwidth consumption, uncoordinated local caches provide an
unnecessarily high degree of redundancy in a networked environment, and can be supplemented with proxy caches placed at
the junctions between stub-networks and the transit links which
connect them. These caches can store the most popular objects
accessed by the users of the stub, thus saving all users storage.
These systems have been proposed since the early history of the
web [7, 21], and have in practice been used by ISPs to reduce
bandwidth costs and improve perceived latencies.
In an early paper on the topic, Gwertzman and Seltzer noted
that push-caching could be more efficient than proxy-based,
request-driven caching in a distributed cache hierarchy when the
full network topology was known [13]. Otherwise, request-driven caching provided a far greater reduction in network usage
(around 65%), and a hybrid method only improved performance
by an additional 2%. More recent work has demonstrated that
adaptive clustering solutions can be found which achieve greater
gains relative to un-cached resource consumption [9].
The problems of organizational coordination associated with
establishing a global push-caching scheme would probably
negate the slight gain in efficiency which it would provide over
the highly optimal pull-based caching systems currently deployed. However, in a relatively closed content distribution
environment, where changes to content can be tracked and information about participating caches is available, a hybrid system could easily be implemented to reduce stress on central resources. Thus we have considered pushing in the case of Jellyfish.

2.2 Cache coordination systems and routing protocols

Caches which are tightly coupled in a clustered environment can
divide their labor in a variety of ways. One of the oldest systems
for coordinating caches is the Internet Cache Protocol (ICP). On
receipt of a request which it cannot fulfill (a cache miss), a cache
in an ICP-based caching cluster iteratively sends messages to
every other cache in the group. If one of these caches can fulfill
the request, it forwards the object to the first cache (a near miss).
If not, the sibling cache forwards the request to its destination.
Because the number of messages required for the coordination
of an ICP-based cache cluster grows in O(n^2), the system is not
recommended for deployment in clusters larger than four or five,
and thus serves merely as a counterexample to our approach.
Other intercommunication protocols are implicit in the object
placement system employed by the cache cluster. One widely
used system is the Cache Array Routing Protocol (CARP),
which simply splits the hash space of the URI namespace into
ranges configured to be commensurate to the capacity of the individual servers in the cluster. When a request for an object
is received by the cluster, the result of the hash of the URI of
the object is used to direct the request to one of the coordinated
caches, which obtains the object and caches it. In this way, the
namespace is split into a set of distinct buckets, and coordination
occurs not through gossip between the servers, but a preconfigured algorithm. While CARP and similar algorithms can be used
to evenly spread load across a large cluster of caches, the hash
functions require updating when servers in the array fail [15].
As we show in our simulation, this approach has serious problems adapting to high rates of churn which might be found in
a distributed cluster of volunteer nodes. Furthermore, heuristics
based on known system factors (such as memory, bandwidth,
and storage capacity) which can be used to obtain hash range
values for caches in a closed environment prove problematic
when system capacities must be inferred from records of their
behavior.
Other approaches have attempted, with some success, to
solve the problem of request routing more adaptively. Kawai
et al. approach the problem of fault tolerance by allowing hash
ranges to be duplicated between nodes in the system, and optimize their caching algorithm to allow caches to store certain objects requested by their local clients [16]. Kaiser et al. construct
an adaptive distributed algorithm for object placement that has
the same goals as CARP, but relies on the convergence of routing
tables among a set of caches [15]. The authors demonstrate that,
over time, it outperforms a CARP-like protocol in terms of the
number of hops required for request resolution and the average
cache hit ratio.

3 Design

Jellyfish is designed to use volunteered computer resources
to greatly amplify the hosting capacity of a website's primary
servers while maintaining extremely high performance, security, and an excellent overall user experience. This high-level
overview describes the utilization of the system in terms of content providers, volunteers, and users:
Content providers: To use the software, content providers
(e.g., an organization or individual hosting some website) supply at least one trusted and reliable machine to run the Jellyfish server-side software. They will need to set up this serverside software and reconfigure their nameservers to make use
of Jellyfish, but do not need to change their actual code. As
a trial, they can create a separate Jellyfished subdomain (e.g.,
cached.wikipedia.org) and make changes to their primary domain only when the setup has been tested.
Volunteers: Anyone with a broadband connection and an interest in helping downloads the Jellyfish client, thus becoming a
volunteer. A volunteer selects which websites he or she would
like to help from a list of centrally approved choices. When the
volunteer's computer is idle, it will connect to the main Jellyfish
servers and register itself as an available node. Requests will
immediately begin to be directed to the volunteer's computer.
Users: Users, or browsers of a website, should see no apparent difference. In contrast to Coral, there is no way for users to
opt in or out of the caching system—it is under the control of
the content provider. However, content providers may elect to
use Jellyfish to make an independently accessible cached site,
affording users the ability to choose the distribution system they
use.

3.1 Request Routing

Jellyfish is based on a node/supernode structure. We chose this
structure over the aforementioned DHT-based overlay structures
because we felt that for low latency applications open to arbitrary computers this would ultimately give the best performance.
A key difficulty with distributed caching systems is that a very
large portion of personal computers are behind restrictive NATs
and firewalls and thus require the assistance of unrestricted peers
to initiate communication with normal users. Another difficulty
is the immense range of hardware and in particular bandwidth
capacities observed in real life networks. Successful low latency
applications like Skype have chosen a node/supernode structure
to exploit this immense spread of capacities instead of fighting
it.
In Jellyfish, volunteer nodes which are not behind NATs or
firewalls and have reasonably high capacities and uptimes are
promoted to being supernodes. All other volunteers are ordinary
nodes.
Each ordinary node is assigned to a nearby supernode using
existing network location techniques. In our current Jellyfish
design, there is a sharp delineation of roles—supernodes act as
DNS servers, whereas ordinary nodes act only as HTTP servers.
However, there is no reason that a computer acting as a supernode cannot simultaneously act as an ordinary node, running both
the DNS server and the HTTP server, or alternatively, change
roles over time. For the purposes of this discussion it is probably helpful to think of them as distinct machines.
Much like both Akamai and Coral, request routing in Jellyfish is done using DNS. Typically, the entire cached website is assigned its own subdomain, such as ‘cached’.
Each conceptual webpage, which includes the main HTML file
and associated image, JavaScript and CSS files, is assigned a
unique subdomain which is a simple translation of its URL. For
example, the current en.wikipedia.org/wiki/Main_Page would be rewritten as wiki_Main_Page.cached.en.wikipedia.org/wiki/Main_Page, and its associated images would follow the same subdomain. These links can
be rewritten permanently or using on-the-fly URL rewriting, as
Akamai does.
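
As a rough illustration of the per-page subdomain scheme, the sketch below rewrites a URL the way the Wikipedia example suggests. The function name `jellyfish_url` and the slash-to-underscore translation are assumptions inferred from that single example; the text does not spell out the translation rule, and a real deployment would also have to handle characters that are not legal in DNS labels.

```python
from urllib.parse import urlsplit


def jellyfish_url(url, cache_label="cached"):
    """Rewrite a page URL so that its host carries a per-page subdomain.

    Assumption: the page label is the URL path with slashes turned into
    underscores, as in the en.wikipedia.org/wiki/Main_Page example.
    """
    # urlsplit only recognizes the host if the URL has a "//" prefix.
    parts = urlsplit(url if "//" in url else "//" + url)
    page_label = parts.path.strip("/").replace("/", "_")
    labels = [page_label, cache_label, parts.netloc] if page_label else [cache_label, parts.netloc]
    return ".".join(labels) + parts.path


rewritten = jellyfish_url("en.wikipedia.org/wiki/Main_Page")
```

Here `rewritten` is `wiki_Main_Page.cached.en.wikipedia.org/wiki/Main_Page`, matching the example in the text; a URL with no path maps to the bare `cached.` subdomain.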
In this example, the URL cached.en.wikipedia.org
is resolved by primary DNS servers owned by Wikipedia to
point to a local supernode. The supernode chosen is based on
the IP address of the request and the round trip times to various geographically close supernodes, in addition to the current
loads on the various supernodes (avoiding overloaded servers
is the first priority, ensuring good geographic location is second). When the user decides to go to the Wikipedia main page, the user's machine
sends a request to resolve wiki_Main_Page.cached.en.wikipedia.org to its assigned supernode. The supernode
resolves this address to a node that it wants to serve the page
request. The algorithm the supernodes use for doing this is the
subject of the next section. The supernode will also, if necessary,
coordinate the necessary firewall/NAT traversal to establish bidirectional communication between the user and the ordinary node
at this time. The user then requests the page from the assigned
node, which acts as an ordinary HTTP server.
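
The priority order used when the primary DNS servers pick a supernode (avoid overloaded supernodes first, then prefer proximity) might be sketched as follows. The dictionary keys, the `load_limit` threshold, and the fallback to the least-loaded supernode when every candidate is overloaded are assumptions for illustration; the text does not specify them.

```python
def choose_supernode(candidates, load_limit=0.8):
    """Pick a supernode for a client: avoid overload first, then minimize RTT.

    candidates: list of dicts with hypothetical keys "name", "load"
    (fraction of capacity in use) and "rtt_ms" (estimated round trip
    time to the client).
    """
    not_overloaded = [c for c in candidates if c["load"] < load_limit]
    if not_overloaded:
        return min(not_overloaded, key=lambda c: c["rtt_ms"])["name"]
    # Assumed fallback: every supernode is overloaded, so take the least loaded.
    return min(candidates, key=lambda c: c["load"])["name"]


supernodes = [
    {"name": "sn-near", "load": 0.95, "rtt_ms": 12},
    {"name": "sn-mid", "load": 0.40, "rtt_ms": 35},
    {"name": "sn-far", "load": 0.10, "rtt_ms": 180},
]
chosen = choose_supernode(supernodes)
```

With these made-up figures the closest supernode is skipped because it is overloaded, and the next-closest one is chosen.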
If the node that the supernode resolves the DNS request to
has the document, it simply responds with it. If it does not, then
a cache miss has occurred. In this case, the node will need to
retrieve the documents from the central server before it can forward them to the client. Clearly, if no node that is assigned to
a supernode has the requested document, the supernode will not
be able to forward to a child node without creating a cache miss.
On the one hand, if this happens infrequently enough, it might
be fine. On the other hand, if a nearby node that is assigned to
some other supernode is caching the data, then it seems like it
might make sense to try to get it from that node, avoiding the
additional request to the central servers. CoralCDN takes this
idea to an extreme, compared to Jellyfish, by guaranteeing that
cached documents will always be found. But if the document is
cached on the other side of the world, it may be very time consuming to retrieve it, and due to the somewhat different design
goals of Jellyfish, we would prefer such documents to simply
be cache misses. Still, a compromise between the two seems
possible.
The compromise we use relies on cache digests, a Bloom filter-based method of efficiently reporting the content of a cache. Each supernode is aware of a few other nearby
supernodes. Every few minutes, it builds a Bloom filter from its
record of the cache contents of its child nodes to obtain
a cache digest, and sends copies of the result to its neighbor
supernodes. When a supernode receives a request that none of
its own children can handle, it checks the cache digests of its
neighbors to see if they might be storing the file. If one is, it
forwards the client’s DNS request to the supernode whose child
node stores the file. The client will therefore receive the object from
a node belonging to a different supernode. We should note that
a simple Bloom filter can produce false positives, but
these can be limited by changing the parameters of the filter, or ignored, as they merely result in a cache miss at the falsely
identified node and thus cause no greater damage than
would have occurred had digests not been used for inter-cluster
communication. New methods, such as the “bloomier” filter, can
alternatively be used to eliminate the possibility of false positives, but they come at greater computational cost [8].
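
A cache digest of this kind can be illustrated with a small Bloom filter. The sketch below is not the digest format of Squid or any other cache; the class name `CacheDigest`, the filter size, the number of hash functions, and the use of salted SHA-256 are assumptions chosen to keep the example short.

```python
import hashlib


class CacheDigest:
    """Bloom-filter summary of the URLs cached beneath one supernode."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # False means "definitely not cached"; True means "possibly cached".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


digest = CacheDigest()
digest.add("/wiki/Main_Page")
```

A neighbor supernode would consult `might_contain` before forwarding a request. A `True` answer can be a false positive, which costs only a cache miss at the node that was wrongly identified.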
To prevent the problem from recurring, the supernode will
send a message to one of its own nodes telling it to cache the document by retrieving it from the node to which the near miss was
directed. This provision ensures that the central server pushes
a minimum number of document replicas to the Jellyfish array,
utilizing volunteer resources for the replication of data whenever
possible. By employing cache digests we are able to share state
between cache clusters without employing costly gossip-based
notification systems.

3.2 Object Placement

When a supernode gets a request for a particular subdomain, it
must decide which of its ordinary nodes (or a neighbor supernode’s nodes) to forward the request to. More generally, the supernode needs to decide what nodes should store what objects.
These are variants of the ’object placement problem’ and the ’request routing problem’, which are, unfortunately, difficult problems. From a qualitative point of view, the supernode has at least
the following considerations.
1. Ordinary nodes must be load balanced: no node can be
overloaded.
2. However, ordinary nodes may have vastly different capacities, and these capacities may change suddenly.
3. Popular files cannot generally be served by a single node; they will need to be replicated to share the load.
4. However, too much replication is bad too. Disk capacities
are finite, and it is important for caches to store different
files in order to maximize hit ratios.
Object placement is a difficult problem in general. Given the
difficulty of finding closed form solutions for even much simpler
cases, we feel it is very unlikely that a closed-form optimal solution exists for our case. Instead, we compared three heuristic-based algorithms: a very simple one included for comparison purposes only, the commonly used CARP algorithm, and a
new heuristic algorithm designed specifically for Jellyfish.

3.2.1 Load balanced CARP

This algorithm is a variant of the well-known Cache Array Routing Protocol (CARP) algorithm. CARP is a very simple system, but it is
used quite frequently in practice. Each node is assigned a weight
between 0 and 1, and the weight vector is normalized to sum to
1. Given these values, each node is assigned a unique interval
on [0,1] with size equal to its weight. CARP uses a hash function which hashes document URLs to values that fall uniformly
between 0 and 1. To route a document, the hash value of the
document URL is computed, and the document is routed to the
node whose assigned interval contains the hash value.
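
The interval scheme just described can be sketched as below. This follows the description in this section rather than any particular CARP implementation: the function names are hypothetical, and SHA-256 is only a stand-in for whatever hash function a deployment would use.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate


def url_hash(url):
    """Map a URL to a float in [0, 1); SHA-256 is a stand-in hash."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def route(url, nodes, weights):
    """Return the node whose interval on [0, 1) contains the URL's hash."""
    total = sum(weights)
    # Right endpoints of each node's interval after normalizing weights to sum to 1.
    bounds = list(accumulate(w / total for w in weights))
    index = bisect_right(bounds, url_hash(url))
    return nodes[min(index, len(nodes) - 1)]   # guard against rounding at the top end


nodes = ["node-a", "node-b", "node-c"]
target = route("/wiki/Main_Page", nodes, [0.5, 0.3, 0.2])
```

Because the mapping depends only on the URL and the weights, every request for the same URL reaches the same node until the weights change.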
Ordinarily, CARP is used in situations where the caches involved are quite static. In this case, the weights for the caches are
pre-set to reasonable values and remain constant. For Jellyfish,
however, the weights must clearly be set dynamically to reflect
changing conditions on the volunteer nodes. To determine the
weights, we first measure the response time of the peer to a simple TCP request which is handled by the Jellyfish software. We
assume that the response time to this request is highly indicative
of the total load the node is experiencing. This assumption will
hold if the node is bandwidth or CPU limited, as we expect it to
be. If it were somehow limited by disk I/O or memory access,
this might not hold. If this appears to be a problem, a more complex timing test could be run periodically. We use the following
algorithms for calculating response times and weights.

3.2.2 Calculating response times

Response times are measured for each node every time interval $t_1$. However, the stored response times are not simply the
last observed response times. Instead, we recognize that there
is some random variation in response times, and we want to
get a value which reflects the ‘typical’ response time observed
recently for each node. A simple way to do this is to use
a weighted average of the most recent time with the previous
stored average time. For node $i$, let the most recent response
time be $r_i$. Let the stored typical response time be $\bar{r}_i$. To update
the typical response times, we do
$$\bar{r}_i' = \alpha\, r_i + (1 - \alpha)\, \bar{r}_i,$$
where $\alpha$, with $0 < \alpha \le 1$, denotes the weight placed on the most recent measurement.
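
The weighted-average update of the typical response time can be sketched as follows. The smoothing weight `alpha` and its default value are assumptions; the text says only that a weighted average of the newest measurement and the stored value is used.

```python
def update_typical_response_time(stored, latest, alpha=0.25):
    """Blend the latest measurement into the stored typical response time.

    alpha is a hypothetical smoothing weight in (0, 1]; larger values make
    the stored value track recent measurements more closely.
    """
    return alpha * latest + (1.0 - alpha) * stored


typical = 100.0                      # milliseconds
for sample in [100.0, 300.0, 100.0]:
    typical = update_typical_response_time(typical, sample)
```

A single slow sample (300 ms) raises the stored value only part of the way, which is the smoothing effect the text is after.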

Every time interval t2 >= t1 , the weights for the nodes are recomputed. The weight calculation is based on a pre-specified
maximum threshold maxt for the response time of a node. This
number derives ultimately from the time we find acceptable for a
user to wait for a webpage. Intuitively, if a node is going slower
than this, we want to reduce the weight on that node. A real life
value for maxt might be something like 200ms. In our simulations, we generally use t1 = t2 ≈ 30ms. The algorithm to
recompute the weights is as follows.
For each node,
1. If the node’s response time ri > maxt , its weight is reduced
by a percentage p1 .
2. Otherwise
(a) If its weight is less than the average weight of all
the nodes, its weight is increased by a percentage p2 ,
where generally, p2 < p1 .

3.2.4 Weighted Random

To test how well our weight calculation worked, we created a
very simple object placement algorithm. To route a file, the
weighted random algorithm simply chooses a peer randomly,
with each peer chosen with a probability equal to its weight.
Weighted random should load balance between the peers very
well, since it considers nothing but response times when choosing a peer. However, weighted random does not remember what
nodes have what files, causing it to create an unreasonable number of cache misses. For this reason, it is not a serious competitor
to the other two algorithms, but it provides a good benchmark for
comparing the load balancing performance.
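A minimal Python sketch of weighted random selection follows. The function name `choose_peer` and the peer identifiers are hypothetical; the only behavior taken from the text is that each peer is chosen with probability equal to its weight.

```python
import random


def choose_peer(weights, rng=random):
    """Pick one peer id with probability equal to its weight.

    weights maps peer id -> non-negative weight; at least one weight
    must be positive.
    """
    peers = list(weights)
    return rng.choices(peers, weights=[weights[p] for p in peers], k=1)[0]


# Over many requests, the share of traffic approaches each peer's weight.
rng = random.Random(0)
draws = [choose_peer({"fast": 0.8, "slow": 0.2}, rng) for _ in range(10_000)]
fast_share = draws.count("fast") / len(draws)
```

Because the choice ignores the requested URL entirely, the same file ends up scattered across many peers, which is the source of the cache misses noted above.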

3.2.5 Weighted Replication

In the load balanced CARP algorithm, we run the ordinary
CARP algorithm and simply use the dynamically adjusted
weights. Intuitively, there are some problems with this algorithm. First, like CARP, it does no replication - as far as the supernode is concerned, each file is stored on only one node. Second, because the weights are dynamically changing, the intervals
that nodes are assigned to also change. Intuitively, files which
are on the ’edges’ near the endpoints of intervals will tend to
get moved between nodes. When this move happens, however,
the supernode does not remember that the old node still has the
file; this is simply a limitation due to CARP being a hash-based
algorithm. Instead, the file is uselessly cached on the old node,
while requests for that file are sent to the new node, where they
are cache misses.
To solve these two problems, we created our own algorithm,
weighted replication. Instead of hashing URLs and considering
only the hashes, weighted replication stores a table of all URLs
the supernode has seen. Each such URL is associated with a
set of nodes, which is the set of nodes that are currently storing
that web page. We reuse the same weight-based system that we
used in the previous two algorithms. We use the same response
times, and we define an overloaded node to be a node which has
ri > maxt . The routing algorithm, however, now follows these
steps:

3. Finally, all the weights are renormalized to 1.
The motivation for this algorithm is the following. In step 1,
we want to make sure that nodes which are currently overloaded
or are going slow for some other reason have their weights reduced quickly. In step 2, we want to be constantly trying to increase weights of non-overloaded nodes. It is possible that these
nodes have excess capacity, and the only way to find out is to
increase their weights until we see signs of a capacity shortage.
The restriction we place on step 2 to not increase weights
which are already above average is not important if the demand
is very close to capacity. However, if there is a great deal of excess capacity, then there are many possible weights which could
be used that would still cause no overloading. In this circumstance, we prefer the most equal weight distribution possible
without overloading, because in the event of node failure, this
will have the least impact on the overall system. The restriction
in step 2 guarantees that if demand is well below total capacity,
the weight distribution will be exactly equal. Nodes will only get
a higher than average share if other nodes are being overloaded.
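The weight update steps (reduce overloaded nodes, raise below-average nodes, renormalize) can be sketched in Python as follows. This is an illustration under stated assumptions: the function name, the dictionary representation, the example values of `p1` and `p2`, and the choice to compute the average weight once before any node is changed are ours, not details given in the text.

```python
def recompute_weights(weights, typical_ms, max_t=200.0, p1=0.20, p2=0.05):
    """One pass of the weight update over all nodes.

    weights and typical_ms are dicts keyed by node id. p1 and p2 are
    fractions (0.20 means 20%); their values here are placeholders.
    """
    average = sum(weights.values()) / len(weights)  # taken before any change
    updated = {}
    for node, w in weights.items():
        if typical_ms[node] > max_t:
            w *= 1.0 - p1            # step 1: overloaded, cut quickly
        elif w < average:
            w *= 1.0 + p2            # step 2(a): probe for spare capacity
        updated[node] = w
    total = sum(updated.values())
    return {node: w / total for node, w in updated.items()}  # step 3


weights = {"a": 0.5, "b": 0.3, "c": 0.2}
typical_ms = {"a": 250.0, "b": 80.0, "c": 90.0}
new_weights = recompute_weights(weights, typical_ms)
```

In this example node `a` exceeds `max_t` and loses share, while `b` and `c` are below the average weight and gain share. If every node is fast and the weights are already equal, no weight is below average and the distribution is left unchanged.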

1. Given a URL, look up the nodes currently storing the URL.
2. If there aren’t any nodes currently storing this file, then:
(a) Look in cache digests to see if a neighbor supernode
is storing the file.
(b) If so, send the DNS request to the supernode that has
control of the file, tell a local node to replicate the file
from the node the DNS request resolves to, and send
the user back the IP of the node that belongs to the
other supernode.
(c) If not, choose a node from the probability distribution
consisting of the node weights, send the request to this
node and add the node-file pair to the table.
3. If some nodes are storing this file, then:
(a) Look to see if these nodes are overloaded. Specifically, rank the nodes in order of their response times.
Take the 80th percentile response time out of these
(or the closest to it). If this response time is greater
than maxt , the nodes are overloaded. If the nodes
are overloaded, find a node that isn’t overloaded, and
tell that new node to cache this file also, and tell it to
download its copy from a peer currently caching this
file.
(b) Each of the nodes caching this file has an associated
weight. These weights will not in general sum to 1,
because the weights for all the nodes in the cluster
are set to sum to 1. Nevertheless, they can still be
used as a probability distribution. Using temporary
re-normalized versions of these weights, choose one
of the nodes currently caching the file according to
this probability distribution.
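The routing steps above can be sketched in Python as follows. This is a simplified illustration, not the simulated implementation: the neighbor supernode lookup of steps 2(a) and 2(b) is omitted, all function and variable names are hypothetical, and two details the text leaves open are filled in by assumption (the new replica is drawn by weight from the non-overloaded nodes, and it is immediately eligible to serve the request).

```python
import random


def percentile_80(times):
    """Return the value closest to the 80th percentile of a non-empty list."""
    ranked = sorted(times)
    index = min(len(ranked) - 1, round(0.8 * (len(ranked) - 1)))
    return ranked[index]


def pick(candidates, weights, rng):
    """Choose among candidates by weight (choices() renormalizes implicitly)."""
    return rng.choices(candidates, weights=[weights[n] for n in candidates], k=1)[0]


def route(url, table, weights, typical_ms, max_t=200.0, rng=random):
    """Return the node that should serve url, updating table in place.

    table maps url -> set of node ids believed to cache it. All weights
    are assumed to be positive.
    """
    holders = table.get(url)
    if not holders:
        node = pick(list(weights), weights, rng)            # step 2(c)
        table[url] = {node}
        return node
    ordered = sorted(holders)
    if percentile_80([typical_ms[n] for n in ordered]) > max_t:   # step 3(a)
        spare = [n for n in weights
                 if n not in holders and typical_ms[n] <= max_t]
        if spare:
            holders.add(pick(spare, weights, rng))          # copy from a peer
            ordered = sorted(holders)
    return pick(ordered, weights, rng)                      # step 3(b)
```

Since `random.choices` accepts unnormalized weights, the temporary renormalization in step 3(b) does not need to be computed explicitly.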
Here is the motivation behind weighted replication. We want
to reuse the successful weighting system from the previous two algorithms. However, we also want to guarantee that no document
will be a cache miss if a known node is currently caching the
document. Rather than using hash intervals, which are fundamentally approximate, we simply use a table to store all cached
URLs. When a request comes in, if the requested file is currently
cached by some node, we clearly want a node which is caching
it to respond. But since multiple nodes may be caching it, we
need to pick which one. To do this, we simply use the adaptive
weights system used in the first two algorithms.
We also want to avoid hotspots by replicating files which are
heavily used. If most of the caches which are currently storing
a file are overloaded, then it may well be the case that that file
is heavily requested, and so we want to replicate it. It might
also be the case that that file is not heavily requested, but the
caches that are storing it are overloaded because they are storing other heavily requested files. Even in this case, though, it
makes sense to replicate the file to an unloaded cache in order
to reduce response times on this file while the problems with the
nodes currently storing it are worked out. When replicating files,
we obviously do not want to go back to the origin servers, but
instead to send the file directly from node to node.
Each node clearly has a finite storage space for data. Generally, Jellyfish users will choose a maximum size for the cache to
grow to on their hard disk. When that cache is full, Jellyfish will
free up space for new files using an LRU algorithm. The node
will of course no longer be caching the files it has gotten rid of.
In the algorithm as specified above, the supernode does not keep
track of this. Therefore, the supernode may create a cache miss
by sending a file request to a node that has evicted the requested
file. Our simulation uses the algorithm as specified above, and
these cache misses do occur. One could imagine resolving this
by having the supernode try to calculate what files have been
dumped. Alternatively, each node could periodically send the
supernode a list of all the files it has evicted.
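The second option, in which a node reports its evictions, could look roughly like the following Python sketch. The class `NodeCache` and its methods are hypothetical names introduced for illustration; the only behavior taken from the text is a size-bounded cache with LRU eviction and a list of evicted files that can be sent to the supernode.

```python
from collections import OrderedDict


class NodeCache:
    """Size-bounded LRU cache that records what it evicts."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()   # url -> size, least recently used first
        self.evicted = []            # urls dropped since the last report

    def get(self, url):
        """Return True on a hit and mark the file as recently used."""
        if url not in self.files:
            return False
        self.files.move_to_end(url)
        return True

    def put(self, url, size):
        """Store a file, evicting least recently used files to make room."""
        if size > self.capacity:
            return                   # too large to cache at all
        if url in self.files:
            self.used -= self.files.pop(url)
        while self.used + size > self.capacity:
            old_url, old_size = self.files.popitem(last=False)
            self.used -= old_size
            self.evicted.append(old_url)
        self.files[url] = size
        self.used += size

    def drain_evictions(self):
        """Return and clear the eviction list, e.g. for a periodic report."""
        report, self.evicted = self.evicted, []
        return report
```

On receiving such a report, the supernode would remove the node from the table entry of each listed URL, so that later requests are not routed to a node that no longer holds the file.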

3.3 Security
3.3.1 The Security Problem
The key security issue with Jellyfish is how to ensure that volunteer nodes serve unaltered data to users. Given our experience
with email and the Internet, one can only expect that, if volunteer nodes were simply trusted not to alter the cached data, certain organizations would jump at the chance to redirect people
seeking wikipedia.org to websites peddling prescription drugs
and pornography. To ensure this does not happen, Jellyfish uses
a security system which ultimately relies on trusted servers run
by the content provider, and also on the difficulty for an attacker
of acquiring a large number of machines and human labor.

3.3.2 Security System Design

The idea behind Jellyfish’s security system is to get the volunteer nodes to check each other. When a node gets a request for
a document, in addition to handling the request, with a random
probability p, it also requests the document from another peer,
pretending to be an ordinary user. It then requests the signature
of that document, which is cached as an ordinary document by
the Jellyfish system. The signature is generated by a private key
known only to the content provider. If a node consistently returns documents that do not match their signatures, it is reported
to the central server as possibly suspect, and it may be banned
from the system.
Jellyfish’s security system begins when volunteers download
the Jellyfish client software. Each Jellyfish volunteer has a Jellyfish ID and password, which cannot be automatically generated,
requiring a unique email address and perhaps the solution of a
captcha. The Jellyfish ID is used to identify volunteer nodes and
is required to sign on to the system. Use of the Jellyfish ID prevents attackers from automatically creating an unlimited number
of nodes and allows us to ban malicious users. Users are generally banned by Jellyfish ID, but may in addition be banned by
IP address, making it more difficult for an attacker to get a large
number of nodes.
In order for nodes to check on other nodes, each node needs
to know the IP address of at least one supernode. A simple implementation would be to just tell each node the IP addresses
of all the nodes. However, this would present a critical security
flaw, because a malicious node could return correct replies to
all IP addresses in the node list, and false replies otherwise. Instead, each node should only know the IP address of some of the
supernodes, and none of the ordinary nodes. Furthermore, the
request traffic from an ordinary node must be indistinguishable
from the request traffic from an actual user, otherwise malicious
nodes could play the same trick. At a minimum, this means that
the request frequency and distribution should match that of an
actual user. The easiest way to handle this is for nodes to send
requests to any given supernode not too frequently.
An attack we worry about is a kind of denial of service attack in which the attacker runs false nodes which accuse innocent nodes of being malicious. To prevent this attack, we don’t
immediately ban some Jellyfish ID just because a single node
said it was malicious. Instead, we require that a number of different Jellyfish nodes find the same node to be serving incorrect replies. Furthermore, each Jellyfish node is only allowed to
blacklist nodes at a certain rate. Nodes which try many more
blacklists than average may reasonably be suspected of being
malicious, and can have this rate limited further.
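The two safeguards described above, requiring several distinct accusers and limiting how often each node may accuse, can be sketched in Python as follows. The class name, method names, and threshold values are hypothetical; the text does not specify how the central server stores reports or what the rate limit is.

```python
from collections import defaultdict


class AccusationLog:
    """Tracks 'bad node' reports at the central server (illustrative only)."""

    def __init__(self, reports_needed=5, max_reports_per_reporter=3):
        self.reports_needed = reports_needed
        self.max_reports = max_reports_per_reporter
        self.accusers = defaultdict(set)   # accused id -> distinct reporter ids
        self.sent = defaultdict(int)       # reporter id -> reports this period

    def report(self, reporter, accused):
        """Record a report; return True once the accused should be banned."""
        if self.sent[reporter] >= self.max_reports:
            return self.is_banned(accused)  # rate limit: ignore the report
        self.sent[reporter] += 1
        self.accusers[accused].add(reporter)
        return self.is_banned(accused)

    def is_banned(self, accused):
        """A node is banned once enough distinct reporters have accused it."""
        return len(self.accusers[accused]) >= self.reports_needed

    def new_period(self):
        """Reset per-reporter counters at the start of each rate window."""
        self.sent.clear()
```

Storing accusers as a set means repeated reports from one Jellyfish ID count only once, so a single false node cannot ban an innocent node on its own.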
One detail is that both nodes and supernodes could potentially be malicious, and if a node finds a response to be incorrect,
the checking node needs to know whether the fault lies with the serving node or
the supernode. The difficult way to handle this is to try to do statistical analysis on the bad responses and determine overall if the
supernode seems to be at fault or if the problem lies only with
the node. This would work, but an easier way is to take advantage of the fact that the central server knows the IP addresses of
all the legitimate nodes. The node that found the invalid document can simply send both the node and supernode IP addresses
to the central server. If the node IP address does not belong to any
node currently or very recently signed in, then the supernode is
the problem; otherwise the problem is the node.
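The central server's check reduces to a set lookup, as in this minimal Python sketch. The function name and return format are assumptions for illustration, and the IP addresses come from reserved documentation ranges.

```python
def attribute_fault(node_ip, supernode_ip, signed_in_ips):
    """Return which party to blame for an invalid document.

    signed_in_ips is the central server's set of IPs for nodes that are
    currently or very recently signed in.
    """
    if node_ip not in signed_in_ips:
        return ("supernode", supernode_ip)   # supernode pointed at a bogus node
    return ("node", node_ip)


known = {"192.0.2.10", "192.0.2.11"}
blame_node = attribute_fault("192.0.2.10", "198.51.100.1", known)
blame_supernode = attribute_fault("203.0.113.5", "198.51.100.1", known)
```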

traffic is about twice the size of the data traffic, making SSL
splitting highly inefficient.
In a real-world test on a particular website, the overhead of
the authentication was such that even with no cache misses, the
caching system reduced the load on the main servers by at best
90% [17]. This might be appropriate for some systems, but we
feel that this is too much overhead to take as a starting point for
Jellyfish. SSL splitting is certainly worthy of a detailed investigation, however, and could be very useful as a backup system
used to survive a massive attack on the main security system.

3.4

3.3.3 Overhead of Security System

This security system clearly adds overhead to the system. It requires nodes to make fake requests which use capacity without
directly helping users. However, a simple analysis can show that
it does not require much overhead to be quite secure. Consider
a Jellyfish network with 1000 nodes and 100 supernodes, which
is receiving 10,000 requests per second. Assume, for simplicity,
that just one of these nodes is run by a spammer, and responds to
all requests with advertisements. If we make the security overhead five percent of the total traffic to the site, and require five
’bad node’ messages before we block a node, then it will take,
on average, approximately ten seconds to block the bad node.
If instead we had a bad supernode, it would take only 1 second
on average to find and block the supernode, but admittedly, it
would have served ten times as many users bad pages. Also, the
overhead for this system, unlike for SSL splitting, is added only
to the peer network, not to the origin server.
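The arithmetic behind these estimates can be reproduced with a short Python sketch. It assumes, as a simplification of ours, that check requests are spread uniformly over the machines of a given kind and that every check of the bad machine produces a report; the function name is hypothetical.

```python
def seconds_to_block(requests_per_s, overhead, targets, reports_needed):
    """Expected time until one bad target collects enough reports.

    Assumes check requests are spread uniformly over `targets` machines
    and that every check of the bad machine yields a report.
    """
    checks_per_target_per_s = requests_per_s * overhead / targets
    return reports_needed / checks_per_target_per_s


# 10,000 requests/s with 5% overhead gives 500 check requests per second.
node_time = seconds_to_block(10_000, 0.05, targets=1000, reports_needed=5)
supernode_time = seconds_to_block(10_000, 0.05, targets=100, reports_needed=5)
```

Each of the 1000 nodes is checked about 0.5 times per second, so five reports take about ten seconds; each of the 100 supernodes is checked about 5 times per second, so five reports take about one second.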

3.3.4 Encryption Layer

In addition to the above security system, we will add one small
layer of additional security. Instead of storing the HTML and
image files on the nodes in plain text, like ordinary webservers
and caching proxies do, we will lightly encrypt the information.
Since the Jellyfish software must contain the decryption key, this
will not stop a serious attack - it is merely security through obfuscation. Nevertheless, we feel it will help greatly to keep ordinary users from casually modifying pages for fun.

3.3.5 SSL Splitting - An Alternative

An alternative possibility for Jellyfish’s security system is a technique called SSL splitting. SSL splitting is a clever idea which
is mentioned briefly in the Coral paper as a possible way to solve
the security problem [12]. The idea behind SSL splitting is that
the SSL authentication traffic and data traffic can be split. With
SSL splitting, the user makes an SSL connection to the central
server with the caches as an intermediary. SSL authentication
traffic must still be handled by the central server, but SSL data
traffic is handled by the proxy cache. The savings on central
server load this caching system can produce is clearly dependent
on the proportion of authentication traffic to data traffic. Since
SSL authentication traffic size is approximately constant per file,
the proportional overhead of the authentication traffic is closely
tied to the file size. For files of 1MB, the authentication traffic
is about 0.05% of the data traffic, making SSL splitting highly
efficient [17]. However, for files of 100 bytes, the authentication

File Aggregation

In Jellyfish, each conceptual webpage is assigned its own subdomain. This guarantees that all the component files of that webpage will be served from the same ordinary node. Prior work has
shown that this causes tremendous performance improvements
compared to many tiny HTTP requests which go to scattered
servers. If requests are not aggregated in this way, a single node
which is experiencing transient delay can stop a webpage from
loading, even if all the other files are complete. In addition,
aggregation allows the system to take advantage of persistent
connections and HTTP pipelining, which can significantly reduce download times by eliminating TCP connection creations
and tear-downs. The fact that Coral does not do this yet may
currently be detrimental to Coral’s performance. A downside to
this is a repetition of files - nearly every client will have to cache
certain standard files like logos and CSS files. Fortunately, these
files are usually small and heavily accessed anyway.

4 Simulation

4.1 Design
4.1.1 Motivation
To test our system design, we built a simulation of many aspects
of Jellyfish. So far, we have used the simulation primarily to test
and compare object placement algorithms. However, we hope
to ultimately use it to test update strategies, supernode placement and promotion, security, and system reliability. Because
we wanted to first focus on object placement, however, we simulate only the subset of Jellyfish which is relevant to the choice
of object placement algorithms.

4.1.2 Implementation

Object placement decisions are made by a single supernode about
its child ordinary nodes. Therefore, our simulation looks at only
a single cluster - one supernode and a cluster of ordinary nodes.
Simulating multiple clusters is critical for testing update strategies and security schemes, but it will not affect object placement
results.
Our simulation is implemented in about 1500 lines of Erlang.
Erlang is a language designed for distributed computing, and
it features very convenient message passing between processes
and user-level threading, which is important for simulating many
entities. Erlang is based on the notion of a process. Each process is essentially a user-level thread, but processes cannot directly share data and must pass messages to communicate. This
worked nicely with our simulation, in which every computer
simulated - every node, supernode, webserver, and user - was
a separate process.

4.1.3 Simplifying Assumptions

To make the problem tractable, our simulation makes a number
of simplifications. Simulated nodes have an associated bandwidth and hard drive space, which is generated randomly and
is different for each node. However, we do not simulate nodes’
apparent bandwidths changing, as might occur due to network
congestion or other programs running on the node. We assume
that requests for web pages are zipf distributed, as much research
has suggested that this distribution closely approximates the actual access frequency of many real-world web sites. Currently
we use an extremely simple Internet topology, in which Internet
latencies are simply randomly generated every time a message
is sent. Using a more accurate Internet topology will be crucial
when exploring multi-cluster designs which are intended to be
spread out over the globe. Within a single cluster, however, the
nodes are supposed to be close together, and the simulation of
nodes which have slow connections is already handled by assigning each node a bandwidth. In these simulations we also
assume that there are no malicious or malfunctioning nodes and
we do not look at document updates.
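A Zipf-distributed request stream of the kind assumed here can be generated with a few lines of Python. This is a sketch, not the simulation's Erlang code: the function name, the exponent value of 1.0, and the fixed seed are assumptions, since the text does not state the parameters used.

```python
import random


def zipf_requests(n_pages, n_requests, exponent=1.0, seed=0):
    """Draw page indices (0 = most popular).

    The page of rank k (1-based) is requested with probability
    proportional to 1 / k**exponent.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_pages + 1)]
    return rng.choices(range(n_pages), weights=weights, k=n_requests)


requests = zipf_requests(n_pages=100, n_requests=20_000)
top_share = requests.count(0) / len(requests)   # share of the most popular page
```

With these illustrative parameters the most popular page receives a far larger share of requests than a page of middling rank, which is what makes hotspots a concern for hash-based placement.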

4.2 Results

We first used the simple weighted random algorithm to test our
adaptive weight finding system. In our simulation, we do not
give the supernode process access to actual capacities or loads of
the ordinary nodes. Instead, we use our adaptive weight finding
system to try to deduce these capacities using response times.
To make sure that this was working correctly, we compared the
weights assigned by the supernode with the actual simulated capacities of the nodes. Ideally, under heavy load, the weight for
node i should equal the bandwidth of node i divided by the sum
of the bandwidths.
When the system is under light load, however, this relation
should not hold at all. When there is much excess capacity, the
weights will tend to a more equal relationship than proportional
to bandwidth. But to test whether the weight finding system
worked under heavy load, we ran a simulation with the demand
nearly equal to the total capacity. We found that the weights
cluster around the correct values without bias (Figure 1). We
also found that the algorithm ran stably, accurately finding the
weights, until the system was running at about 90% of theoretical capacity. After that, the weight finding algorithm basically
broke down and no longer gave accurate results. 90% may be
sufficient, but further work should investigate how to increase
this threshold.
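The comparison between learned weights and bandwidth shares described above can be expressed as a small check. The following Python sketch uses hypothetical function names and made-up bandwidth figures; it only illustrates the ideal heavy-load relation, weight equal to a node's bandwidth divided by the total bandwidth.

```python
def ideal_weights(bandwidths):
    """Weight each node by its share of the total bandwidth."""
    total = sum(bandwidths.values())
    return {node: bw / total for node, bw in bandwidths.items()}


def max_abs_error(weights, bandwidths):
    """Largest gap between a learned weight and its ideal value."""
    target = ideal_weights(bandwidths)
    return max(abs(weights[node] - target[node]) for node in target)


bandwidths = {"a": 10, "b": 30}                 # arbitrary units
learned = {"a": 0.27, "b": 0.73}                # example learned weights
error = max_abs_error(learned, bandwidths)
```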
The main purpose of our simulation, however, was to compare the load balancing CARP algorithm to the weighted replication algorithm. We compared the two algorithms using three performance metrics: the frequency of ’bad misses’, the frequency
of client timeouts, and the average response time.
Bad misses are a subset of cache misses. The idea behind bad
misses is that some cache misses are unavoidable. Specifically,
the first time in the simulation that a document is requested, it
will inevitably be a cache miss. The number of bad misses is calculated by
taking all cache misses and removing the ones that were the first
request for a document. The original cache miss number
can be used without any change in the results; however, since
many cache misses are not bad misses, it is easier to see the
change in the data when subtracting the noise of the non-bad
misses.
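The bad miss count can be computed from a request log as in the following Python sketch. The event format, a sequence of `(url, was_hit)` pairs, and the function name are assumptions made for illustration.

```python
def count_misses(events):
    """Count (all_misses, bad_misses) from (url, was_hit) request events.

    A miss on the first request ever made for a url is unavoidable and
    is therefore not counted as a bad miss.
    """
    seen = set()
    misses = bad = 0
    for url, was_hit in events:
        if not was_hit:
            misses += 1
            if url in seen:
                bad += 1
        seen.add(url)
    return misses, bad


log = [("a", False), ("a", True), ("b", False), ("a", False), ("b", False)]
all_misses, bad_misses = count_misses(log)
```

In this example the first misses on `a` and `b` are unavoidable, so only the two later misses count as bad.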
In our simulation, we assume that each client has a limited
patience for waiting for responses. Specifically, we say that any
request that takes longer than 1s from initiation to completion
took inappropriately long, and it is labeled a user timeout. The
response times of all requests that did not time out are recorded,
and we use these values to calculate an average response time.
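The timeout and response time metrics can be sketched in Python as follows. The function name and the sample latencies are hypothetical; the 1 s threshold and the exclusion of timed-out requests from the average come from the text.

```python
def summarize(latencies_s, timeout_s=1.0):
    """Return (timeout_count, mean_latency_of_completed_requests).

    Requests slower than timeout_s count as user timeouts and are left
    out of the mean; the mean is None if every request timed out.
    """
    completed = [t for t in latencies_s if t <= timeout_s]
    timeouts = len(latencies_s) - len(completed)
    mean = sum(completed) / len(completed) if completed else None
    return timeouts, mean


timeouts, mean_latency = summarize([0.2, 0.4, 1.5, 3.0])
```

Because timed-out requests are excluded, an algorithm with many timeouts can still show a low average response time, so the two metrics need to be read together.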
We primarily ran the simulation at fairly high load - between
40 and 80% of the system’s theoretical capacity. We found that
weighted replication gave a considerably lower bad miss rate
at both 40% and 80% load (Figure 2). However, the difference was greater at 80%. CARP causes more bad misses because files which are near the boundaries of caches wind up being swapped between them, causing cache misses. At higher
loads, the weights tend to change more rapidly, causing more of
this swapping to occur. Currently, bad misses only occur with
weighted replication once the node caches have filled up and
begun evicting items; this is why no bad misses for weighted
replication can be seen at the beginning of the graph.
We also found that client timeouts happened less frequently
with weighted replication (Figures 3, 4). Client timeouts occur
when load-balancing fails and some peers become overloaded.
Because CARP does not replicate files, it tends to load balance
less well. Also, CARP’s hashing scheme implicitly assumes
a uniform distribution of file access. In real life, file access
frequencies are far from uniform. In our simulation, we use
a zipf distribution to try to approximate realistic file accesses.
Because the access pattern is not uniform, CARP load balancing
does not work very well - a small adjustment of weights can lead
to a large change in traffic. In all cases, there is an initial burst
of failed requests at the start. This occurs while the supernode
learns the correct weights of the nodes. It is to a certain extent
an artifact of our simulation beginning with a bang instead of
having nodes gradually join the supernode, as would occur in
real life.

Figure 2: CARP causes more bad misses than weighted replication because documents near the edges of hash intervals get
switched between nodes.

At a 40% load, we found that the response times for weighted
replication were significantly better than the response times for
CARP (Figure 5). At 80%, weighted replication still did better,
but only slightly (Figure 6). Note however that these response
times do not include responses which timed out. Including
these response times might have affected this result. Weighted
replication appears to do better at 40% load because it tends
to prefer high bandwidth, low latency nodes by replicating files
onto them. At higher loads, it is forced to use all nodes at close
to their actual capacities, and no longer prefers these nodes.

Figure 3: Failed requests occur less frequently with weighted
replication. Because CARP does not replicate documents, and
because request frequencies are not even approximately uniformly distributed, as hash range based approaches assume,
CARP does not load balance as effectively as weighted replication. This leads to overloaded nodes and user timeouts.

Overall, our results indicate that under the parameters we
tried, the weighted replication algorithm seems to correctly load
balance between nodes of widely varying capacities, avoid overloaded nodes, and replicate popular files to avoid hotspots. It appears to load balance better and give fewer cache misses
than load balanced CARP. However, benchmarking is admittedly a difficult exercise, and a different simulation design or parameter choices might have led to a different result. We did not
try to examine parameter combinations exhaustively, but looking
at the sensitivity of our results to parameter and design decisions
would be worthwhile future work.

5 Future work
5.1 Squid

The deployable version of Jellyfish would contain many components that have been built before, such as an HTTP request handler, a highly efficient caching and retrieval engine, and an Edge
Side Includes implementation. We looked for a project that we
could build our system on top of, saving the replication of these
components. We found that the Squid caching engine seemed to
be our best option. Squid is a popular open-source caching system derived from the Harvest project. Squid is normally run by
a web host on a single server or on a small hierarchy of servers
that have been carefully configured to talk to each other.

Figure 4: Failed requests occur less frequently with weighted
replication than with CARP. However, at 40% load, relatively
few failed requests occur with either algorithm.

Squid contains a highly efficient web caching system with
respect to data storage and retrieval, but none of the features
required to build a large network of untrusted caches. It does
contain some support for inter-cache communication. However,
this is basically designed for organizations that want to distribute
their Squid caches across a handful of trusted, identical computers. We think that building Jellyfish on top of Squid will save
duplication of some difficult technical effort. We also think that
the project will benefit from an attachment to an existing and
active open source community with closely aligned interests.

Figure 6: Weighted replication gives lower response times at
80% load, but the difference is less pronounced than at 40%
load. However, the variability of the response times is still considerably reduced.