Network Working Group Mike O'Dell
Internet-Draft UUNET Technologies
1997/02/24 01:32:32GMT
GSE - An Alternate Addressing Architecture for IPv6
1. Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''
To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ftp.is.co.za (Africa) , nic.nordu.net (Europe),
munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast ), or
ftp.isi.edu (US West Coast).
2. Abstract
This document presents an alternative addressing architecture for
IPv6 which controls global routing growth by very aggressive
topological aggregation. It includes support for scalable multi-
homing as a distinguished service. It provides for future
independent evolution of routing and forwarding models with
essentially no impact on end systems. Finally, it frees sites and
service resellers from the tyranny of CIDR-based aggregation by
providing transparent re-homing of both.
3. Introduction
This alternative IPv6 addressing architecture addresses several
scalability issues with the current IPv6 addressing proposals.
Scaling of the global route computation
Ease of re-homing (both leaf Sites and upstream Resellers)
Economic scalability of of Multi-homing
O'Dell v3.7 [Page 1]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
The current IPv6 addressing proposals address route and topology
aggregation by continuing to rely on CIDR-style "Provider-based
Addressing" coupled with a powerful new dynamic address assignment
mechanism which is intended to make renumbering more palatable.
However, CIDR-style provider-based aggregation breaks down in the
face of the accelerating growth of multi-homed sites (leaf sites or
regional networks). Worse, renumbering an entire Site to accomplish
a simple topological re-homing such as changing ISPs is a problem
whose magnitude can only grow over time. It will remain increasingly
difficult to explain this renumbering requirement to customers with
the spectre of a complete failure of this aggregation approach a
distinct possibility.
While the large IPv6 addresses provide for a huge increase in the
number of end systems which can be accommodated, it also portends a
huge increase in the number of routes required to reach them. Even if
CIDR aggregation were to continue at current levels (maintaining
current efficiency is relatively unlikely), this still presents a
serious problem for the growth of the the global route computations.
This document presents a new proposal for using the 16 byte IPv6
address which mitigates the route scaling problem and with it a
number of collateral issues. This model provides for aggressive
topological aggregation while controlling the complexity of flat-
routed regions. It exploits and supports the dynamic address
assignment machinery in IPv6 but makes the exact role of that
machinery a decision local to a Site. It is therefore subject to
engineering cost and benefit analysis rather than being mandatory for
simple Site re-homing situations.
This new model also identifies the special work done by the global
Internet infrastructure on behalf of multi-homed sites. Rather than
continuing the current "Tragedy of the Commons", the multi-homing is
isolated into a specific mechanism which is then traceable to and
incurred by only those sites wishing to subscribe to this capability.
Again, this makes it possible for sites to make informed cost-benefit
decisions about multi-homing.
4. Central Concepts of the Architecture
The architecture is based upon a few central concepts.
A strong distinction between Public and Private Topology
A strong distinction between system identity and location
GSE - Global, Site, and End-system address elements
O'Dell v3.7 [Page 2]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
The deep similarity of Re-homing and Multi-homing
Rewriting address prefixes at Site boundaries
Very aggressive hierarchical network topology aggregation
Optimizing actual forwarding paths by limited-scope
cut-throughs
This model draws a strong distinction between the Public Topology
which forms the transit infrastructure of the Global Internet and a
"Site" which can contain a rich but strictly private local network
topology which cannot "leak" into the global routing machinery. The
Site is the fundamental unit of attachment to the Global Internet and
is therefore strictly a leaf, even if possibly multi-homed.
This model also draws a very strong distinction between the identity
of a computer system and where it attaches to the the Public
Topology. In IPv4 and current IPv6 models, these notions of identity
and location are deeply co-mingled and this is the fundamental reason
why simple topology changes have such wide-ranging impact on address
assignment (if aggregation is to be maintained at all).
The 16 byte IPv6 address is split into 3 pieces:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| Routing Goop | STP| End System Designator |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
6+ bytes ~2 bytes 8 bytes
Routing Goop signifies where the Site attaches to the Global
Internet. The Site Topology Partition (STP) is Site-private "LAN
segment" information. The End System Designator (ESD) specifies an
interface on an end-system.
One surprising notion is that re-homing and multi-homing are very
deeply related. Multi-homing can be viewed as rather like several
simultaneous re-homings happening at once. Achieving both painless
re-homing and scalable multi-homing rely on the same set of
fundamental mechanisms, each with a few distinct details.
Rewriting IPv6 addresses by Site Border Routers is by far the most
controversial, but also most critical part of this proposal. To
control the complexity of routing information which must be managed
within a Site and to isolate end systems and interior routers from
external topology changes, the RG of some addresses is modified by
Site Border Routers. Packets exiting a site have the RG for the Site
O'Dell v3.7 [Page 3]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
egress point inserted into source addresses, while packets entering a
Site have the RG in all destination addresses replaced with a
canonical prefix signifying "within this Site" (the "Site-local
prefix").
One immediate result is that upper-layer protocols must use only the
ESD for purposes such as pseudo-header checksums and the like. The
ESD is the invariant token, the RG is possibly transient topology
information subject to change.
Topology aggregation is accomplished by partitioning the Global
Internet into a set of tree-shaped regions anchored by "Large
Structures". The Routing Goop in an address specifies a path from
the root of the tree (the Large Structure) to a point in the
topology; in the terminal case this is a Site. Large Structures are
chosen by their ability to aggregate topology and no particular
advantage flows from "being one"; actually quite the contrary. Large
Structures are responsible for subdividing the space under them and
managing that delegation. Large Structures provide a "forwarding
token of last resort" which can always be used for selecting a valid
next-hop when no other information is available. This significantly
limits the minimally-sufficient information required for a "default-
free" router. Any additional route information kept is the result of
path optimizations from cut-throughs.
While it is useful to think of the Large Structures as trees, the
collection is actually a DAG (Directed Acyclic Graph) because the
trees can touch each other via cut-throughs. By cross-propagating
selected details via a cut-through, a locally-controlled region can
learn of alternative paths to some destinations. The distance this
optimization information is propagated and the radius of the
optimization region advertised are the business of the collaborating
regions.
5. The Structure of End System Designators - the ESD
End System Designators denote every computer system in the GSE
Internet regardless of whether it is a host, router, or other network
element. While a given system can have more than one ESD, each ESD
is globally unique. This is critical for their utility to the
upper-level protocols. This uniqueness can be induced several ways
as will be seen.
A crucial design decision is whether an ESD identifies a system,
invariant of its interfaces as in the XNS architecture, or an
interface on a system as in the existing IPv4 and IPv6 architecture.
An ESD designates an interface on a computer system and that
O'Dell v3.7 [Page 4]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
interface can be either physical or virtual.
When processing a GSE address, a computer system need only examine
the ESD portion of the address to determine whether a packet is
destined for that system.
There are circumstances when it is quite useful to have "an address"
for a computer system which is independent of any particular physical
interface on that system. It has become commonplace in IPv4 practice
to use a distinguished virtual interface to provide a system with
such an "interface independent identity". This technique affords the
same architectural utility of XNS while still allowing the
flexibility of the IPv4 "addressed interface" model. This model
retains the successful IPv4/IPv6 model.
NOTE: We remain intentionally vague about exactly what constitutes an
"interface" and a "computer system". The malleability of those
notions in IPv4 has proven manifestly useful in practice.
To summarize the ESD uniqueness characteristics:
(1) an ESD is globally unique
(2) an ESD designates an "interface" on "a computer system"
(3) an Interface may have more than one ESD
(current IPv6 already requires implementations to support
multiple addresses per interface)
(4) an ESD may not necessarily designate a particular
physical computer (Neighbor Discovery continues to provide
a level of virtual address translation and considerable
cleverness can be disguised therein)
There are two forms of ESD, both 8 bytes long, one a subcase of the
other.
It is clear that with the impending onslaught of the IEEE-1394
technology that 8-byte IEEE MAC addresses are simply fait accompli
and many devices will be provided with a unique identity in that
format at the time of manufacture. The 8-byte IEEE MAC Address
format includes the current 6-byte MAC Addresses as a proper
subspace. Using the 8-byte IEEE MAC address will be very convenient
for many network builders.
There are at least two issues with using *only* the IEEE 8-byte MAC
addresses as ESDs: There are point-to-point link interfaces which
have no IEEE MAC address assigned for them, and the 8-byte IEEE MAC
addresses assigned to the interfaces of a system are essentially
random. For some, there is also the issue of whether the IEEE MAC
address is "unique enough" for the purposes at hand.
O'Dell v3.7 [Page 5]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
We clearly need a space for generating ESDs for interfaces which
don't come equipped with one. Some have also suggested there might
be great utility in enabling inverse lookups on just the ESD part of
an address. Assigning ESDs in semantic clusters (like current IPv4
addresses) would be a signficant aid to this end. Finally if a
network designer decides not to trust the uniqueness of the IEEE MAC
addresses, he could always use the Dynamic Numbering machinery of
IPv6 to assign ESDs.
We propose that the IETF seek a large (7 bytes or greater) subspace
of the IEEE 8-byte MAC space for allocation as IETF-NodeIDs in
semantic clusters to provide a pool of addresses which can be used
for any of the above reasons, as required. However, it is expected
that most network builders will exploit the intrinsic IEEE MAC
addresses present in many network interfaces whenever possible.
The IETF-NodeID space should be partitioned into two regions - one
exactly isomorphic to the existing IPv4 address space to provide
instant grandfathering of IPv4 addresses, and another space which is
simply larger but allocated in a similar manner.
A few comments on "global uniqueness" are in order because in
previous discussions, some have asserted that unless "uniqueness" can
be accomplished with absolute and complete mathematical perfection,
any scheme using the concept is unworkable. This extreme view
inconsistent with mass-market experience.
IEEE MAC addresses are globally unique by nature of the delegation
process where they are assigned to interfaces by the manufacturers.
Both XNS and IPX rely on this uniqueness and it works very well in
practice. IETF-NodeID values will be globally unique by nature of
the same kind of assignment mechanism. IPv4 addresses must be
globally unique for the Internet to function, and it does, mostly, by
nature of exactly the same kind of assignment mechanism.
While accidents and manufacturing defects do occasionally violate the
uniqueness of IEEE MAC address assignment, humans routinely make
errors in assigning IPv4 addresses to systems with equally mystifying
results. Given the reliance of IEEE-1394 Firewire interconnects on
these unique MAC addresses, it is likely that the frequency of these
occurence (relative to the total number of objects with assigned
addresses) will only decrease. The economic pressure to insure this
will be intense.
6. The Structure of a Site
The GSE global routing architecture ultimately views a Site as a leaf
O'Dell v3.7 [Page 6]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
of the topology and doesn't concern itself with the interior of this
private topology. However, the internal topology of a Site is
extremely important to the management and operation of the Site so
the GSE address architecture provides for a rich set of
organizational alternatives with different cost-benefit tradeoffs.
The GSE address structure provides for 16384 distinct Site Topology
Partitions (STPs) within a Site. This is the number of SEGMENTS in
the internal topology, not hosts. The number of attached hosts is
limited strictly by available local network technology, and the
Site's ability to buy enough machines to exhaust the available IEEE
8-byte MAC address space, or the available 7-byte IETF-NodeID space.
Using this structure, a single Site can develop an internal topology
which is a very significant fraction of the total CIDR routes in the
IPv4 Global Internet.
An organization is not constrained to being structured as a single
Site. The trade-off is that the inter-Site topology must then be
part of the Public Topology. While the individual Sites can retain
considerable independence in topological structure and attachment to
the Global Internet, they must be aware of changes between the
constituent Sites and that re-homing of constituent Sites will
potentially impact long-running sessions. That is the cost of
exploiting the routing machinery available to the Public Topology.
Given the generous flexibility available for organizing a Site, it is
worthwhile to examine a few examples. Note that none of these
organizational approaches is exclusive. A large Site might well mix
these approaches to good effect and indeed the goal is to provide the
designer of private Site topology with a broad spectrum of design
alternatives.
The simplest structure to imagine is a Site using all IEEE MAC
Addresses with all the systems connected in a single Private Topology
Partition (i.e., all the GSE addresses carry the same STP value which
is assigned by the local network administration). Given the
sophistication of current LAN-switching technology, a Site like this
could be both large and internally complex yet have simple IPv6
addressing. The complexity is absorbed into the LAN infrastructure
and it appears to be only one partition from the GSE Site Topology
view. This structure has one very significant advantage: long-
running TCP sessions will will survive arbitrary changes in the local
topology. This works, of course, because the single STP is a virtual
topology with the real topology hidden by the LAN Switching
machinery.
The second Site model is like the one just described, except it would
O'Dell v3.7 [Page 7]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
have multiple STPs with routers moving traffic between the segments.
This is very close to the common IPv4 structure of a CIDR block being
subnetted to assign a prefix to each STP. This approach has the
advantage of familiarity, but it has the disadvantage that long-lived
TCP connections don't necessarily survive arbitrary changes to the
private topology. This arises because even though the ESD is
invariant, reachability will fail because a change in the STP of one
of the system doesn't get injected into the protocol stack of the
communicating systems when they move. The existing IPv6 dynamic
address assignment machinery will serve to make such internal changes
much less painful than with IPv4, however.
One point worth noting is that even with multiple STPs routed within
a Site, a "Private Topology Partition" need not correspond to a
"physical" LAN cable. The STP values could be used to label larger
organizational structures like "Engineering" or "Finance". This
could reduce the likelihood that common internal topology changes
break long-lived connections.
The third Site model uses IETF-NodeID ESDs based on existing IPv4
address assignments. In this case, all the IPv4-style ESDs could be
placed in a single STP and then routed internally on the IPv4 address
in the lowest 4 bytes of the ESD. It must be emphasized that the
IPv4 addresses used in IPv4-style ESD must be an officially-
registered, public-use IPv4 address and NOT an RFC-1918 private-use
address. Using an RFC-1918 private-use address violates the global
uniqueness properties required of an ESD.
In all of the multi-segment cases, an IETF-NodeID ESD could be used
to designate any point-to-point link endpoint, the loopback addresses
in routers, or any other IP-accessible network elements which don't
naturally have IEEE MAC address for forming an ESD. And in all of
the cases, an IETF-NodeID ESDs could be used universally, although it
is more appropriate to use IEEE ESD form whenever possible.
In all of the cases where the real topology is not completely
virtualized by the LAN technology, there will be "Internal
Renumbering" events caused by moving systems between infrastructure
segments (STPs). This will have the effect of killing long-running
off-Site connections unless provisions are made to allow the systems
(and the routing infrastructure) to carry the previous ESDs as
synonyms for a while. Given that most significant topology moves
involve powering off the end system in question, this is hardly a
hardship. However, the powerful renumbering support already
developed for IPv6 can make those other moves considerably less
impacting.
Most importantly, external re-homing of a Site to the global
O'Dell v3.7 [Page 8]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
infrastructure can be made completely transparent.
7. Dynamic Address Re-writing by Site Border Routers
A critical component of this architecture is the modification of
addresses when packets leave or enter a Site. Re-writing source
addresses to insert appropriate Routing Goop at the Site egress point
was part of the 8+8 proposal, but this proposal extends this to re-
writing destination addresses when inbound packets arrive at a Site
Border Router.
The reasons for both re-writings are the same: to insulate the
interior of the Site from external topology changes and egress policy
details.
When a Site Border Router inserts the correct RG in the source
address of outbound packets, it frees the end-systems in the Site
from having to know the RG for the Site. This is especially important
if the site is Multi-homed and the Site implements a complex egress
selection policy.
In the case of inbound packets, if the destination address were not
converted to a canonical form, the Site interior routers would have
to be aware of all the different RG which could be used to reach the
site, essentially creating aliasing of the destination addresses. In
the singly-homed case, this doesn't seem like a significant issue,
but in complex Multi-homing scenarios there could be a significant
problem managing this information.
This symmetric re-writing essentially isolates the Site from the
Global Internet just as the hard boundary between RG and STP
components insulates the Global Internet from the Site topology.
8. The Structure of Routing Goop
Routing Goop, or "RG" is the upper 6+ bytes of a GSE address. This
somewhat non-technical term was chosen because all the other
alternatives seem to have various degrees of conceptual baggage which
would be as much work to neutralize as the new notions are to explain
in the first place.
Fundamentally, RG is a Locator. It encodes the topological
connectivity of the Site containing the computer system identified by
the ESD in the lower 8 bytes. In the case of a singly-homed Site,
re-homing to a new attachment to the Public Topology will change ONLY
the RG in full GSE addresses for computer systems at that Site. One
example of such a re-homing would be a change of the Site's Internet
Service Provider. This change-over can be made essentially
O'Dell v3.7 [Page 9]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
completely transparent to users both inside and outside the Site,
although it does involve a practical limit on the transition duration
relating to how long the departing ISP is willing to extend
transitional courtesies. During a changeover, though, all new
connections will be initiated via the new ISP connection.
This brings up the deep structure of the topology information carried
in RG and how it is encoded. More specifically, RG is a hierarchical
locator which is a rooted path-expression of flat-routed regions
which are tangent. Each element in the path-expression includes only
enough detail to negotiate the flat-routed region.
It has been observed before that the graph of the Global Internet is
not obviously a hierarchy so how can this work?
We start with the observation that every connected graph has at least
one labeling which forms a spanning tree covering the nodes. The
hierarchy is induced by a labeling function which partitions the
global graph into regions and recursively into subregions. This
function is only globally visible at the top-level where an initial
partitioning of the graph is used to form the first level of what
will become the hierarchy. Within each partition there is a local
sub-partition function which assigns labels, and we proceed
recursively. The nested recursions directly induce the hierarchy.
This decomposition of the Global Internet produces a recursive graph
where each level is composed of a set of subgraphs which are
explicitly connected (i.e., explicitly routed between the subgraphs)
while the structure within each subgraph is assumed to be flat-routed
(at least as seen at that level).
From an abstract viewpoint, a hierarchical partitioning can be
induced with an arbitrary choice of labeling function (as long as the
function produces the minimally-required partitioning). However, we
desire the partitions to have several important properties which
effects the choice of labeling function.
The general goal is to produce a global labeling which represents the
topology as compactly as possible, yet allows rich connectivity while
bounding the complexity of the discrete regions which are flat-
routed.
The top level objects in the GSE graph hierarchy are called "Large
Structures". These are objects chosen for their ability to naturally
represent significant topological aggregation of substructure (not
geographical, political, or geometric). The number of Large
Structures is explicitly limited to bound the complexity at the top
level of the aggregation graph.
O'Dell v3.7 [Page 10]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
Within Large Structures, the (sub-)partition function is a trade-off
between the flat-routing complexity within a region and minimizing
total depth of the substructure. This is driven by the internal
topology of a Large Structure and the choices in different Large
Structures will not necessarily be the same. This is why Routing Goop
only has one hard bit boundary; Large Structures are free to
internally subdivide as they chose. They are only required to
encapsulate a significant portion of the Public Topology.
One obvious candidate for Large Structures is large networks which
already represent considerable aggregation based on existing CIDR
deployment. Another good candidate might be "Exchange Points". The
GSE model can accommodate both of these simultaneously, allowing
IPv6-style "Network-anchored Prefixes" and "Exchange-anchored
Prefixes" like that proposed by some to coexist and be subsumed into
a unified notion of "Aggregator-anchored Prefixes." Of course, these
aren't prefixes strictly in the IPv4 CIDR sense, but the left-
anchored substrings of the Routing Goop are intuitively quite
similar.
Large Structures are assigned a Large Structure Identifier, known as
an LSID. The total number of LSIDs is intentionally limited as we
assume the paths between Large Structures are only flat-routed.
Two consenting Large Structures remain free to share a tangency below
the top level and exchange routes so as to provide for improved
routing between the two of them (formalizing cut-throughs in the
natural hierarchy). The goal is to provide for manageable complexity
of the ultimate default-free zone (the top level of the global
hierarchy) while allowing for controlled circumvention of the natural
hierarchical paths.
O'Dell v3.7 [Page 11]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
Bit-level structure of Routing Goop:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| xxx | 13 Bits of LSID | Upper 16 bits of Goop |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3 4 5 6
2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Bottom 18 bits of Routing Goop | 14 bits of Site Topology |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
NOTE: The Routing Goop structure above assumes that the GSE proposal
is designated by a 3-bit type of IPv6 address. If a GSE address is
identified by two upper bits, the LSID would expand to 14 bits. If
identified by one bit, the LSID would stay at 14 bits and the Upper
16 bits of Goop would expand to 17 bits.
Routing between two interior points of two different Large Structures
is always possible based solely on the LSID. This provides a
"forwarding strategy of last resort" for a router running "default-
free". From one point of view, the LSID partitions the Global
Internet into a set of regions such that an interior router only need
carry a "per-LSID default" pointing at an appropriate boundary router
which knows how to to handle traffic bound outside the containing
Large Structure for a point in the other Large Structure.
If two Large Structures share a tangency somewhere below the top
level, then some interior routers of both Large Structures will share
routes to exploit the tangency for optimizing paths. How this cut-
through information is distributed within the two Large Structures is
not revealed elsewhere in the global topology. The exact "shape" of
the optimization region is controlled by the decisions about which
routes to advertise across the cut-through. These decisions are made
by the collaborators and the optimized region need not be symmetric
with respect to the cut-through. The size of the optimization area
is controlled by how far routes learned via the cut-through are
propagated within the sub-graphs tangent via the cut-through. Again,
this is a matter of engineering choices made by the collaborators
operating the cut-through.
While the LSID is may appear similar to the Autonomous System Number
currently used in IPv4 policy-based routing machinery, the LSID is
quite distinct from the AS number and the two identifiers play very
different roles. AS Numbers will continue be used for policy routing
information exchange and must remain distinct from the LSID space.
O'Dell v3.7 [Page 12]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
9. The "Flow" of Routing Goop
It is intuitively useful to think about Routing Goop as "flowing
downhill" through the hierarchy from the topmost Large Structures,
through the intermediate levels of the Public Topology, and
ultimately down to the Site. As the RG propagates downward, the
prefix extends to the right, just like in IPv4 CIDR, with each
extension navigating the nested flat-routed subgraphs, eventually
terminating at the Site, which then descends invisibly into the
Private Topology of that Site.
The nested flat-routed areas correspond to transit subnetworks of the
Large Structure. One very important example of such subnets is the
"reseller" or "wholesale transit customer" of a Large Structure.
(Note that whether the Large Structure is a network or an exchange
point doesn't matter.) The reseller network provides transit for
Sites, so must be part of the Public Topology and appears as a
substring within the Routing Goop, usually the right-most extension
unless the reseller has further reseller customers. In that case,
the next level reseller will have his own extension to record his
place in the Public Topology and to provide for navigating through it
as well.
The overall picture can now be drawn as a forest of trees
distributing Routing Goop down to the Sites, with each tree being a
Large Structure and the Large Structures connected arbitrarily at the
top level. This structure will be mirrored by the actual machinery
for distributing Routing Goop to the Sites as will be discussed a bit
later, but this mental image of the prefixes "flowing" from the
anchoring Large Structures is critical to understanding fundamental
self-organizing abilities in the GSE model.
While the GSE machinery is intended to be adequate for almost
completely automated self-organization with respect to the
construction and propagation of Routing Goop on an Internet-wide
basis, we proceed for now closely following current practice
(admitting manual configuration of certain information like Routing
Goop) because of the additional complexity of the self-organization
functions. Initial deployment following current practice would not
preclude eventual deployment of a fully self-organizing Global
Internet.
10. The Distribution of Routing Goop
There are two cases to consider for how Routing Goop gets
distributed: source addresses and destination addresses. In both
cases RG is part of the address, one way or another, so we show how a
full 16-byte address with the right RG gets created in these two
O'Dell v3.7 [Page 13]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
cases.
10.1 RG for Source Addresses
The initial RG of a source address is almost always the Site-local
prefix. If the destination address is not within the Site, the
packet will leave the Site via one of possibly several Site Boundary
Routers. The egress Site Border Router will insert the correct RG in
the source address based on the path the destination should use to
return a packet to the sender. Except in unusual circumstances this
will be the RG which corresponds to the attachment path of that
egress Site Boundary Router to the Global Internet.
If the Site is multi-homed via just one Site Boundary Router, then
the router is free to apply whatever local policy suits. It simply
must fill in a valid RG path which leads back to a Site Boundary
Router for that Site. If the Site is multi-homed via more than one
Site Boundary Router, which router provides egress is purely local
policy and which RG gets applied is likewise local policy.
The dynamic insertion of RG upon Site egress accomplishes a number of
things.
(1) It means that for most purposes, a computer system at a Site need
not concern itself with egress policy matters which can be
particularly tricky in Multi-homed Sites.
(2) It means that computer systems are essentially not impacted at
all by topological re-homing of the Site.
(3) It means that more complex multi-homing scenarios with multiple
Site Boundary Routers each with multiple connections to the Global
Internet can execute arbitrarily complex path recovery policy without
concern for how it might impact a computer system doing source
address selection.
(4) It means that while a computer systems might forge the ESD in a
source address, it CANNOT forge the point of injection into the
Public Topology. This is not strong authentication down to the
particular computer system, but it is probably a strong deterrent to
certain obnoxious activities due to the dramatically improved
traceability. We also note that the first-hop attachment router in
the Public Topology is free to insert or override the RG if somehow
an errant packet escapes a Site carrying invalid RG, thereby
enforcing traceability. Of course, the Public first-hop router could
always just drop a packet carrying inappropriate source RG as well.
But to make it very clear, we put the burden of inserting correct RG
in exiting source addresses squarely and solely on the Site and the
O'Dell v3.7 [Page 14]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
Site Border Router. Any other location of the task has bad
performance scaling.
The Site Border Router acquires the necessary RG from the first-hop
attachment router in the Public Topology. Alternately, as an initial
mechanism the RG could be statically configured, but the real goal is
completely automated propagation down the tree so that an entire
complex subtree can be rehomed without human intervention or service
disruption.
10.2 RG for Destination Addresses
Currently, an IPv6 address lookup for a DNS name returns the
information in a "AAAA" record which is the full 16 bytes of the IPv6
address.
The GSE design proposes synthesizing the 16 bytes of information in a
query response from two different sources: an "AAA" record and an
"RG" record. The "AAA" record carries the 8-byte ESD + ~2 byte STP
for the DNS name in question and the "RG" record carries 6+ bytes of
the appropriate Routing Goop.
One interesting question is how the AAA record gets paired with an RG
record in a given nameserver. One simpleminded implementation would
be to pair an RG record with a zone, but that has the problem of
requiring all the systems in that zone to use the same Routing Goop
and hence be in the same Site.
A better scheme is to carry an "RG Name" in the "AAA" record which
would allow a nameserver to concatenate an arbitrary RG prefix to the
ESD+STP producing the full 16 byte response. The "RG Name" would be
a full DNS name which could be recursively translated (and the result
cached). Structured as an "upward delegation" with an appropriate
Time-to-Live, a Site could import the Routing Goop information from
their service provider completely automatically. This capability
will be used to great advantage in the discussions of re-homing which
follows. [Interactions between RG TTL and zone TTL is an issue to be
explored more.]
Alternately, one special case for an RG record could be a delegation
to a Site Border Router which could supply the correct RG
automatically, at least in single-homed cases, and possibly in
multi-homed cases.
The result of this structure is that individual zone entries for
individual nodes (AAA records) do NOT change when a Site rehomes.
The only thing which changes (logically) is the RG information which
is composed with the node's AAA record to produce a full 16-byte
O'Dell v3.7 [Page 15]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
response. This means the general Dynamic DNS machinery is NOT
required to support Site re-homing.
One implication of the special Site-local Prefix RG for intra-Site
traffic is that Sites will have to provide at least two "faces" on
their nameservice - one that returns Site-local as the RG for queries
from inside the site, and another that returns full RG responses for
requests originating outside the Site. This can be readily
accomplished by inspecting the source address - if the source address
contains the Site-local Prefix as RG, then return the same.
Otherwise, return a fully-general RG-based response (possibly based
on egress-path selection policy).
10. Re-homing A Site
When a Site changes its point of attachment to the Global Internet,
it is said to "rehome". One of the significant criticisms of IPv4
CIDR and IPv6 "Provider-based Addressing" is the requirement to
"renumber" a Site when it rehomes. One of the explicit goals of the
GSE architecture is to eliminate, or at least mitigate, the impact of
this.
It is important to reiterate the notion that the Routing Goop of a
GSE address is not just a Locator, but that it encodes a PATH from
the top level of the global hierarchy down to the Site. Changing
that path is what makes Re-homing and Multi-homing essentially
equivalent operations. We proceed with the simple case first.
When a Site wishes to rehome, it must establish a new attachment
point to the Global Internet, and hence establish a new access path.
Then it must start using that new path before the old path is
removed. The procedure is as follows:
A Site establishes a connection with a new ISP and it becomes able to
carry the traffic. At that point, the Site alters the upward
delegation of the DNS RG records. Henceforth, all new connections
made with the new translations will follow the new path to the Site.
The new connection path is then made the preferred egress path and
source addresses in packets exiting the Site immediately start being
marked with the new return path. The old connection should be
maintained for some administratively determined grace period to allow
DNS timeouts to transition new sessions to the new path and for
long-running sessions to terminate.
At first blush, it might appear that when the egress path for the
Site switches over to the new path and the Site Border Router starts
marking packets with the new RG, the return path for long-running
sessions would automatically switch over to the new path. Alas, this
O'Dell v3.7 [Page 16]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
is not so because a long-running session will be using destination
address containing the old RG acquired when the session first
started.
Consideration was given to providing some kind of "path redirect"
which would allow the other end to deal with "flying cutovers" of a
running session, but the security implications of this mechanism are
too far-reaching to consider as part of initial deployment. If at
some later point it becomes clear how to accomplish this safely, then
it could be added. But the complexity, security risks, and the
magnitude of the added value do not seem worthwhile at present
(although the author would love to be convinced otherwise).
Alternately, the Site could request a "Re-homing Courtesy" from their
old ISP which would effectively make it a multi-homed Site for some
period of time. After multi-homing was established, the old
connection could be taken down and the long-running sessions would
continue to survive as long as the Site was multi-homed by way of the
Re-homing Courtesy.
Note that at no time did the re-homing effect anything internal to
the Site's Private Topology. The only change was the attachment to
the Public Topology and the Routing Goop which records that
attachment location.
11. Multi-homing a Site
One of the curiosities of IPv4 is that the network does a lot more
work for a multi-homed site but it is very hard to pin it down so
that the instigator of the effort can compensate the workers.
In the GSE model, Multi-homing is an explicit service which is
performed for a Site by the agents of the Public Topology which
provide the access for the Site. This mechanism can be made more
sophisticated, but the notion is most readily explained by
considering a Site which is dual homed to two different ISPs and
hence has two distinct access paths represented by two distinct blobs
of Routing Goop.
The Site is attached to each ISP via some link and we postulate some
kind of keep-alive protocol which determines when reachability to the
Site's border router is lost. The ISP routers serving the dual-homed
Site are identified to each other (via static configuration
information in the simplest case or a dynamic protocol in the more
general case), and when a link to the Site is lost, the ISP router
anchoring the dead link simply tunnels any traffic destined for the
Site via the other ISP router.
O'Dell v3.7 [Page 17]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
This approach clearly requires coordination between the two serving
ISPs. This is not a new constraint - multi-homing already requires
considerable coordination between the Site and is providers. Of
course, creating a protocol for dynamically creating a "homing group"
is probably a very worthwhile investment but it is not absolutely
necessary at the outset.
It should be obvious now that the "Re-homing Courtesy" in the
previous section is simply doing the router-pair coordination with
the new ISP for some period of time.
[Note: Yakov and Bates are working on a draft for a Site-side
implementation of aggregation-efficient multi-homing which may
simplify this even further.]
12. Re-homing a Reseller
Re-homing a Reseller is a slightly more general case of re-homing a
Site, primarily characterized by more lead time, a longer grace
period, and some necessary coordination with customer Sites to insure
that the Routing Goop propagates correctly.
The Reseller will establish a new connection which will not only
result in a new path for the Reseller's topology, but for that of his
customer Sites. When the Reseller alters his upward delegation of
Routing Goop, it will ripple downward to his customer Sites by nature
of their upward delegations. The downward ripple of Routing Goop via
the upward delegations should cause the Site zone TTLs to be reduced
appropriately to insure caches expire well within the dual-homed
transition grace period for the Reseller.
This essentially rehomes all the Reseller's customer Sites all at the
same time the Reseller's infrastructure is re-homing and should be
completely transparent except for long-lived sessions which do not
terminate by the end of the grace period.
13. Multi-homing a Reseller
There are two parts to multi-homing a Reseller - one part similar to
the multi-homed Site case above, and one part which is quite
different.
For this discussion, assume a Reseller which is dual-homed and hence
has two different Routing Goop prefixes (remember that each path to
the top level of the hierarchy has a distinct prefix). The reseller
can solicit multi-homed tunneling services from his two access point
routers to provide alternate path service just like a multi-homed
Site. Why traffic is coming to any particular router, though, is
O'Dell v3.7 [Page 18]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
influenced entirely by what routes are advertised out that particular
connection via BGP5 (or IDRP). This is rather different from the
multi-homed Site case where the ESD is the object of interest and the
RG simply gets the traffic to the Site boundary.
The question arises, however, as to which prefix gets used for
extending downward to his customer Sites. The answer in the simplest
case is to pick one and use it, making the Sites "natural" in the
chosen prefix. The alternate prefix can, of course, be advertised
out the alternate path if desired. But this work can be ascribed to
the instigator and the superior attachment points can charge for this
service. (This is somewhat akin to charging for routes, but only
routes which create a discontinuity in the routing space.)
15. A Comment on NAT Boxes
In discussions about requiring destination address re-writing for
inbound packets, Brian Carpenter remarked that with the advent of
symmetric re-writing (both inbound and outbound), the GSE
architecture is essentially "NAT that works." To some, this would be
the ultimate insult, but I think it is essentially correct. NAT
Boxes provide for isolating a Site from topology changes but severely
compromise the end-to-end model. GSE affords very similar
operational topological isolation but without violating the end-to-
end model, at least not nearly as much. If a Site wishes the
additional isolation afforded by NAT Boxes, a firewalls will
accomplish that task.
15. General Comments
While some of GSE is a radical departure from IPv6 as we currently
know it, in general it relies deeply on all the IPv6 underpinnings
which contribute so much to the attractiveness of IPv6: Neighbor
Discover, all the dynamic configuration machinery designed to make
renumbering palatable even using "provider-based addressing", and the
flexibility of the "salami headers" which make tunneling and security
attractive. The general forwarding operations based on longest-
match-under-prefix-mask and the policy-based routing machinery of
BGP5/IDRP are also simply assumed.
16. Closing Comments and Acknowledgments
This document presents a revision of the "8+8" addressing model which
has been under construction by the author since before Fall of 1995,
at least. Conversations with a great many people have contributed to
the design presented in this document. A skeletal version of this
proposal first appeared in some email from Dave Clark of MIT who
planted the seed and provided the original monicker "8+8". A great
O'Dell v3.7 [Page 19]
Internet-Draft GSE for IPv6 1997/02/24 01:32:32GMT
many others have contributed ideas and observations, all of which
went into the stew pot for the synthesis contained here.
The original "8+8" draft cited the following individuals for a
special thank-you: Vadim Antonov, Ran Atkinson, Scott Bradner, Brian
Carpenter, Noel Chiappa, Steve Deering, Sean Doran, Joel Halpern,
Christian Huitema, Tony Li, Peter Lothberg, Louis Mamakos, Radia
Perlman, Yakov Rekhter, Paul Traina.
This draft has benefited greatly from conversations with Masataka
Ohta, who convinced the author of the importance of the IETF-NodeID
in addition to the 8-byte IEEE MAC addresses, as well as Brian
Carpenter, Scott Brander, Ran Atkinson, all the people who so
graciously provided invaluable comments on the original "8+8" draft,
and of course Steve Deering, Bob Hinden, and the IPng Working Group.
17. Security Considerations
More than can be imagined.
18. Author's Address
Mike O'Dell
UUNET Technologies, Inc.
3060 Williams Drive
Fairfax, VA 22031
voice: 703-206-5890
fax: 703-206-5471
email: mo@uu.net
O'Dell v3.7 [Page 20]