Replication and Caching Position Statement

Since the Web started in 1990 it has shown exponential growth, and now
- five years later - the growth is still exponential. No one knows
how or when the growth will slow.

The simplicity of the Web model was one of the main reasons it succeeded.
The Web model is based on URIs, HTML, and HTTP. However, HTTP by itself,
as it currently exists, cannot meet the robustness or scaling requirements
that the Web's success clearly demands. HTTP does not provide for
recovery in the face of network problems, and information is often unavailable
due to single points of failure.

There is an urgent need to make the Web mature enough to scale
to at least 100 times its current size, and robust enough for
mission- and life-critical applications. Network hot spots (or "flash crowds")
often appear in the Internet when particular information attracts large topical
interest, causing major problems both for end users (poor response) and for
users of the network near the hot spot (denial of service due to the resulting
congestion in nearby parts of the Internet). Efficient techniques for replication
and caching are a cornerstone of achieving the goals of scale and robustness
as well as responsiveness; how much bandwidth a good caching and replication
system could save is currently an open research question.

Problem Statement

There are currently a number of distribution protocols on the Internet, each
suited to different patterns of readership; see also the problem statement
written by Tim Berners-Lee.
The protocols differ in the conditions under which information is transferred:

SMTP (Simple Mail Transfer Protocol)

Transfer on author demand. Readership is defined at time of publication,
in the 1..10,000 range.

NNTP (Network News Transport Protocol)

Transfer unconditionally to all sites. Wastes resources globally when
used for small readerships, and doesn't scale very high. The replication of
individual newsgroups relies on social bindings between the system administrators
of News servers.

When an individual makes a piece of information available, currently that
individual chooses the application, and thereby the protocol, that will
be used. Misuse of resources or poor performance occurs when the wrong protocol
is used. Here are three examples of using one protocol where another would
have saved resources.

The use of mailing lists to distribute information to a very wide audience,
most of whom will not be interested, wastes time, disk space, and bandwidth.

The use of global network newsgroups for low-readership interpersonal messages
wastes disk space and network bandwidth on all participating sites.

The use of HTTP to distribute Shoemaker-Levy comet pictures or Olympics results
may cause overload on the server and congestion in nearby parts of
the network. This is often called the flash-crowd phenomenon.

Recent studies
by Margo Seltzer show that flash crowds are very common. Documents on
a single server generally show good locality; that is, popular documents
are really popular! Other studies have shown that documents are often
transmitted thousands of times between changes to the document; a system
which caches popular documents throughout the network, rather than just at
its periphery, should therefore show good hit rates even when the caches are
(conceptually) at the far end of a communications link. Research is underway
to prove or disprove this hypothesis.
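The locality claim can be illustrated with a small simulation: when requests follow a skewed (Zipf-like) popularity distribution, a cache holding only a few percent of the documents serves a large share of the requests. The distribution and sizes here are assumptions for illustration, not data from the studies cited:

```python
import random

def simulate_hit_rate(num_docs=1000, num_requests=20000, cache_size=50, seed=1):
    """Simulate requests drawn from a Zipf-like popularity distribution
    and measure the hit rate of a small LRU cache."""
    rng = random.Random(seed)
    # Zipf-like weights: document i is requested with weight 1/(i+1).
    weights = [1.0 / (i + 1) for i in range(num_docs)]
    docs = rng.choices(range(num_docs), weights=weights, k=num_requests)

    cache = []          # most recently used at the end
    hits = 0
    for doc in docs:
        if doc in cache:
            hits += 1
            cache.remove(doc)       # refresh recency
        elif len(cache) >= cache_size:
            cache.pop(0)            # evict least recently used
        cache.append(doc)
    return hits / num_requests

# A cache holding 5% of the documents serves well over a third of requests.
```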

The Web is moving towards more dynamic content. Web applications are becoming
customized on a per-user basis. This already has a significant impact on
the efficiency of caching throughout the Web. Dynamic content can be expected
to continue to grow; however, it is possible to break content down into
smaller static parts which are in turn cachable. Java and similar mobile
code systems enable the building of Web applications in which the content
as presented to the user may be highly customized, while most of the program
and data required to perform the customization remain cachable. The long-term
effect of dynamic content on the cachability of information is therefore
uncertain; current trends have been reducing cachability, but the longer-term
outlook is possibly more favorable. Study of Web traffic is vital to
understanding these trends.

Many content providers need to have some way of determining how much their
information is being used. While providing useful or interesting content
may be supported by advertising, even people providing information for free
need basic information to support their activities. People who are marketing
products, or providing public information (e.g. government agencies) need
to be able to justify their expenditures. Due to inadequacies in HTTP/1.0,
information providers have often disabled caching (with bad results for
users) just to get elementary usage statistics. Mechanisms such as the hit-count
proposal of Jeff Mogul (Digital) and Paul Leach (Microsoft), being discussed
in the IETF HTTP working group, may in combination with HTTP/1.1 facilities
enable more of the Web's content to be cachable, though content providers
desiring full demographic information may still generate significant network
load.
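The hit-count idea can be sketched as a proxy that serves cached copies locally, counts those hits per URL, and reports the accumulated count to the origin server when it next revalidates the entry. The class and method names below are illustrative; they are not the wire format of the actual proposal:

```python
class MeteringProxy:
    """Toy proxy cache that counts local hits per URL and reports
    the accumulated count back to the origin on revalidation."""

    def __init__(self):
        self.cache = {}       # url -> body
        self.unreported = {}  # url -> hits served since last report

    def get(self, url, fetch_from_origin):
        if url in self.cache:
            # A local hit: the origin never sees this request,
            # so remember it for the next report.
            self.unreported[url] = self.unreported.get(url, 0) + 1
            return self.cache[url]
        body = fetch_from_origin(url)
        self.cache[url] = body
        return body

    def revalidate(self, url, report_to_origin):
        # Piggyback the hit count on the revalidation request,
        # then reset the local counter.
        count = self.unreported.pop(url, 0)
        report_to_origin(url, count)
```

This gives the provider aggregate usage numbers without forcing every request back to the origin, which is the trade-off the proposal aims at.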

The Web provides an escape route for introducing efficient caching systems
through its support for proxies. The CERN server
experienced enormous demand as a result of being the first proxy server.
Current proxy servers are placed at the leaves of the distribution tree, a
long way from the origin server. As a result, most proxy servers seldom achieve
a cache hit rate of more than 50%.

Requirements

Correctness

HTTP/1.0's caching facilities are poor and error-prone. Application builders
have often had to disable caching (at the expense of performance for the end
user). Getting a slightly stale answer quickly may be acceptable for many
applications, but some applications (e.g. electronic commerce) need
absolutely correct answers. One of HTTP/1.1's major design goals is to allow
reliable caching of content, and to let applications work reliably in the face
of caching proxies. A proxy network must be able to provide the level of
consistency that the application developer has specified.
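A minimal sketch of the consistency machinery HTTP/1.1 provides: a cache serves an entry while it is fresh (Cache-Control: max-age) and revalidates a stale entry with a conditional request, where a 304 Not Modified response lets it reuse the cached body. The helper names and the shape of `conditional_get` are assumptions for illustration:

```python
import time

class Entry:
    def __init__(self, body, etag, max_age):
        self.body, self.etag = body, etag
        self.expires = time.time() + max_age   # Cache-Control: max-age

def serve(cache, url, conditional_get):
    """Serve from cache while fresh; revalidate stale entries with an
    If-None-Match conditional request (a 304 keeps the cached body)."""
    entry = cache.get(url)
    if entry and time.time() < entry.expires:
        return entry.body                      # fresh: no network traffic
    status, body, etag, max_age = conditional_get(url, entry.etag if entry else None)
    if status == 304:                          # Not Modified: reuse cached body
        entry.expires = time.time() + max_age
        return entry.body
    cache[url] = Entry(body, etag, max_age)
    return body
```

An application needing absolute correctness would simply set a zero freshness lifetime, forcing revalidation on every access, while tolerant applications get fast local answers.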

Efficiency

One could define the efficiency of the system as the ratio between the
reference network load and the actual network load, where the reference network
load is the sum over network hops of the bytes transferred using simple HTTP
with no proxy network. The actual load is the cost in actual byte-hop products,
which in a caching system must be averaged over all connected accesses to
the source document.
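Under this definition, efficiency can be computed directly from byte-hop totals. The topology and numbers in this sketch are invented purely to make the ratio concrete:

```python
def byte_hops(transfers):
    """Total network load as a sum of bytes x hops products."""
    return sum(bytes_ * hops for bytes_, hops in transfers)

# Illustrative scenario: a 100 KB document fetched by 10 clients,
# each 5 hops from the origin server.
reference = byte_hops([(100_000, 5)] * 10)       # no proxy: 5,000,000 byte-hops

# With a proxy 1 hop from the clients and 4 hops from the origin:
# one transfer origin->proxy, then 10 transfers proxy->client.
actual = byte_hops([(100_000, 4)] + [(100_000, 1)] * 10)   # 1,400,000 byte-hops

efficiency = reference / actual   # roughly 3.6x in this scenario
```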

Response time

A fundamental requirement of the Web is that access time for a random document
should be low, whatever the reader's reading history to date may have been.
Studies indicate that human problem solving is impaired when retrieval times
rise above 100ms. Global network latencies are easily of this order. Preloading
of documents likely to be fetched may be needed to meet response-time
constraints, but the bandwidth wasted on documents that are never used
militates against blind use of this technique.

In practice in the wide-area Internet, as bandwidth and CPU speeds rise, the
number and length of round-trip delays become the limiting factor. The response-time
requirement translates into a protocol requirement to minimize the
number of serial transactions with different parties, and the number and
length of round trips for each.
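The impact of round trips can be made concrete with a crude latency model, in which total response time is dominated by serial round trips rather than bandwidth. The function and numbers below are illustrative assumptions, not measurements; they compare opening a new TCP connection per request (as early HTTP commonly did) against reusing one persistent connection:

```python
def response_time(requests, rtt, new_connection_each_time):
    """Crude model: each request costs one round trip; a fresh TCP
    connection adds one more round trip for the handshake.
    Transfer time is ignored entirely."""
    per_request = 2 * rtt if new_connection_each_time else rtt
    setup = 0 if new_connection_each_time else rtt  # one handshake up front
    return setup + requests * per_request

# A page with 10 inline images (11 requests) over a 100 ms RTT path:
per_request_connections = response_time(11, 0.1, True)    # 2.2 s
persistent_connection = response_time(11, 0.1, False)     # 1.2 s
```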

Storage Limitations

We must assume that any participating node in the system has a limited amount
of storage to make available for caching objects and metainformation involved
in the protocol. As this storage reaches capacity, the system should degrade
gracefully.
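Graceful degradation under a storage bound can be sketched as a cache with a fixed byte budget that evicts least-recently-used entries, rather than refusing service, when the budget is reached. The class below is an illustrative sketch, not part of any protocol:

```python
from collections import OrderedDict

class BoundedCache:
    """Cache with a fixed byte budget: when full, evict least recently
    used entries rather than failing, so service degrades gracefully."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.entries = OrderedDict()   # url -> body, oldest first

    def put(self, url, body):
        if len(body) > self.budget:
            return                     # too big to cache; serve uncached
        if url in self.entries:
            self.used -= len(self.entries.pop(url))
        while self.used + len(body) > self.budget:
            _, old = self.entries.popitem(last=False)   # drop the oldest
            self.used -= len(old)
        self.entries[url] = body
        self.used += len(body)

    def get(self, url):
        body = self.entries.get(url)
        if body is not None:
            self.entries.move_to_end(url)   # refresh recency
        return body
```

A full cache simply becomes a smaller cache for the most popular objects; nothing fails outright.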

There is an overall web requirement for scalability. Given finite storage
at each node, scalability rules out systems in which complete network maps
or hypertext maps or name tables are kept at any one node.

Limited Human Intervention

A human must at some point dedicate resources (storage, processing, bandwidth)
to the use of the system. Experience with successful systems in the global
Internet has shown that end users and system administrators should not have
to be involved in routing or optimization configuration in large-scale
systems: not only is such configuration expensive in human time, but the
configurations become obsolete, and systems fail as the network evolves unless
the system is dynamic in nature. For a large-scale system, most replication
and caching decisions must be completely automatic, though local policy decisions
may be imposed over such an infrastructure.

The Essence of the Caching Problem

There are two complex problems involved.

One is the multicast routing problem, which is the generation of the distribution
tree for optimum efficiency for a given readership pattern (known or expected)
and a given network topology.

The second is the name resolution problem. When a client first requires access
to an object, for which a multicast route may already exist, it must find
the location of the nearest copy in a short time. Possible solutions might
include an initial query to the origin server, so long as future accesses
can be directed to a closer copy when one is available. Most cache systems
involve grouping objects and assuming a correlation between accesses to
documents in the same group. HTTP's basic rule is that the URL of a document,
once the initial domain name has been resolved, is an opaque string to which
no semantics may be attached. There is a provision for the use of hierarchical
spaces (used with relative addressing), but there are no numerical constraints
on the number of objects at the same point in a hierarchy: indeed, there
are webs containing an effectively infinite number of virtual documents.

In the future we expect URIs to become more like names and less like addresses.
This means that any control information which enables cache efficiency may
not be encoded in the URIs, and so must be transferred in metadata, essentially
in lists of names. The less structure there is in a name, the more difficult
it is to scale the storage of the name tables; the more structure there is,
the more likely it is that the name will later have to be changed as the
storage structure changes. This is an essential problem of naming.

By combining information about both object access and link traversal, an
information source may be able to supply a replication/caching system with
hints that improve user performance, by preloading documents likely to be
accessed next into downstream caches. The topology of the Web, as well as
that of the Internet, may therefore be useful to such a system.
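As a sketch of how link-traversal statistics might drive preloading, consider a policy that, after serving a page, preloads the few outgoing links that have historically been followed most often. The thresholds and data structure are illustrative assumptions:

```python
def prefetch_candidates(link_counts, k=2, min_prob=0.2):
    """Given counts of how often each outgoing link of a page was
    followed, return up to k links worth preloading: those followed
    on at least min_prob of all traversals."""
    total = sum(link_counts.values())
    if total == 0:
        return []
    ranked = sorted(link_counts, key=link_counts.get, reverse=True)
    return [url for url in ranked[:k] if link_counts[url] / total >= min_prob]

# e.g. for a page whose links were followed 60, 25, 10 and 5 times,
# preload only the two clearly popular links.
```

The probability threshold is what keeps preloading from degenerating into fetching everything, which would waste the bandwidth the cache is meant to save.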

Any solution is likely to combine sound engineering compromises with
good algorithms. This is an area for both directed research and grass-roots
experimentation, and it will be interesting to see which produces results
most effectively.

Next Steps

Short Term

The new caching facilities of HTTP/1.1 need full testing before HTTP/1.1
becomes a full Internet standard. We believe the new facilities of HTTP/1.1
allow robust applications to be built in the face of caching. This belief
needs proving, and our Jigsaw server is one of the first servers providing
both basic HTTP/1.1 and proxy support. Full validation of the HTTP/1.1 design
needs to be complete to enable graceful evolution of the Web, so that future
caching and replication systems can be deployed incrementally via the proxy
mechanisms built into all Web software.

We have also implemented HTTP/1.1 in both our library (libwww) and the Jigsaw
HTTP server, now available for anyone to use for testing of HTTP/1.1
implementations. W3C is fostering a mailing list for implementors of HTTP/1.1
to exchange implementation experience and solutions to implementation problems.
A few remaining small HTTP/1.1 design problems are currently being discussed
on the IETF HTTP working group mailing list.

We believe HTTP/1.1 fixes HTTP's worst bugs and inefficiencies. Having completed
this work, we should spend some time to understand to what degree further
compression and/or reengineering of HTTP itself would actually be beneficial
versus work on the general caching/replication problem.

Work in the distributed authoring and versioning area must be tracked to
understand its implications for caching and replication of content.

Medium Term

Many major sites today (e.g.
AltaVista,
Netscape,
Microsoft) are now replicated on a
worldwide basis, generally using a combination of ad hoc facilities (e.g.
AFS, DCE DFS, rdist, rcp, tapes delivered via overnight delivery, etc.).
For example, AltaVista is replicated on three continents
(North America,
Australia,
Europe), and each replica may
actually consist of a set of replicas. We can expect this trend to continue
as successful sites cope with the scaling load.

However, the end user has to know which replica to use, and in the case of
failure or network difficulties the user may still find that replica
unavailable. Naive users, now by far the majority of Web users, are
often unable to work around problems when they arise, and discovering which
replica is "closest" is left to users, who have little information
to guide their choice.

Therefore, facilities for automated, transparent selection of the "closest"
available replica of a service, without user intervention, would go
a long way toward reducing long-haul and intercontinental network load (by
far the most expensive), while providing end users with much higher performance
and much better robustness. Whether these facilities will be built into end-user
browsers or into ISPs' proxy caches, and what exactly the algorithms
might be, are currently urgent, open questions.
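One deliberately naive client-side answer to this open question is sketched below: probe each known replica and order them by measured round-trip time, so the client uses the fastest and fails over down the list. The probe function is a stand-in for a real measurement mechanism:

```python
def pick_replica(replicas, probe_rtt):
    """Return replicas ordered by measured round-trip time; unreachable
    replicas (probe returns None) sort last as failover candidates."""
    measured = []
    for url in replicas:
        rtt = probe_rtt(url)
        measured.append((rtt if rtt is not None else float("inf"), url))
    measured.sort()
    return [url for _, url in measured]

# The client uses the first entry and fails over down the list.
```

A real system would have to amortize probing cost and track changing network conditions, which is exactly where knowledge from the routing infrastructure could help.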

The Internet infrastructure has much better knowledge of which site may
be "closest" to a user; however, there has been little communication to date
between developers of the Web and developers of the Internet routing
infrastructure. We believe that solving this problem will require cross-fertilization
between many parts of the IETF (Internet
Engineering Task Force) and the Web community. We plan to spend significant
effort fostering such ties, and evaluating possible solutions as they appear.

While W3C might prototype possible solutions to this problem, it is as likely
or more likely that a solution to this problem will come from elsewhere,
as the expertise to help with solving the "closest copy" problem is in the
general network community rather than the Web community.

Long Term

Over the last 20 years a large number of different techniques have been developed
for handling caching and replication in a distributed environment. Such
techniques can be applied to the Web. It is clear that not one single technology
will solve the problem but some of the key technologies involved are:

distributed replication and error recovery

multicast based replication

cooperative caches

W3C sees it as essential to bring together research and industry
in order to create a shared understanding of the problems facing the Web.
In April 1996, W3C organized a workshop on
Web robustness involving many of the inventors of the Internet and
key representatives from industry. Fostering further interaction and
experimentation with prototype systems (e.g. Squid) needs to continue until
one or more solutions are found.

We need answers to basic questions, for example:

How much of the Web content is cachable?

How much of the Web content could be cachable?

Preloading of content is a possible way to improve response time; however,
preloading everything will clearly increase network load. How can this
load be minimized, while improving the Web's interactive performance?

Where should caches/replicas be placed in the network? Current proxy caches
are only on the very edges of the network, where only low hit rates seem
likely. Embedding caches inside the network itself may work much better.

These are just a few of the open questions. W3C will foster research on
these problems and experimental deployment of such systems.