ITDK 2015-08

For the latest ITDK as well as other historical releases, see
the main ITDK page.

This page describes the historical ITDK release 2015-08, which
consists of

two related router-level topologies,

an IPv6 router-level topology,

router-to-AS assignments,

geographic location of each router, and

DNS lookups of all observed IP addresses.

This ITDK is produced from active measurements conducted on
our Archipelago (Ark)
measurement infrastructure. For the IPv4 router-level topology, we
used a subset of the
IPv4 Routed /24 Topology Dataset, which is collected continuously.
Specifically, we obtained the raw IPv4 topology by
performing traceroutes to randomly-chosen destinations in each routed
/24 BGP prefix using 94 Ark monitors located in 36 countries on
Aug 16-31, 2015.
For the IPv6 router-level topology, we used a subset of the
IPv6 Topology Dataset for the same time period (to line up with
the IPv4 data above). We obtained raw IPv6 topology data from
26 monitors located in 15 countries.

Router-Level Topologies

The two included router-level topologies are generated from the same
IP-level topology but differ in the accuracy and
completeness of the alias resolution performed to create them. The
first topology is derived from aliases resolved with
MIDAR
and
iffinder,
which yield the highest-confidence aliases with a very low false-positive rate.
The second topology also uses MIDAR and iffinder but further includes
aliases resolved with
kapar,
which significantly increases the coverage
of aliases but at the cost of more false positives (which inflate the
size of routers and decrease the router count). Researchers should
choose a topology according to the relative importance they
place on accuracy vs. comprehensiveness of alias resolution.
If uncertain, choose the first (MIDAR + iffinder) topology, which has the most accurate alias resolution.

The IPv6 router-level topology is derived from aliases resolved by
speedtrap, which is part of the scamper suite of tools.

Each router-level topology is provided in two files, one giving the nodes
and another giving the links. There are additional files that assign
ASes to each node, provide the geographic location of each node, and
provide the DNS name of each observed interface.

Nodes File

The nodes file
lists the set of
interfaces that were inferred to be on each router.

File format:
node <node_id>: <i1> <i2> ... <in>

Each line indicates that a node node_id has
interfaces i1 to in.
Interface addresses in 224.0.0.0/3 (IANA space reserved for multicast
and for future use) are not real addresses.
They were
artificially generated
to identify potentially unique non-responding interfaces in
traceroute paths.

NOTE: In ITDK release 2013-04 and earlier, we used addresses in
0.0.0.0/8 instead of 224.0.0.0/3 for these non-real addresses.
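
As a rough illustration of the nodes file format described above, the following Python sketch parses node lines and flags the artificially generated addresses in 224.0.0.0/3. The function names (parse_node_line, is_placeholder) are hypothetical, and the sketch assumes that lines beginning with "#" are comments that can be skipped.

```python
import ipaddress

# 224.0.0.0/3 holds the artificially generated (non-real) interface addresses.
PLACEHOLDER_NET = ipaddress.ip_network("224.0.0.0/3")

def parse_node_line(line):
    """Parse 'node <node_id>: <i1> <i2> ... <in>' into (node_id, [addresses]).

    Returns None for blank lines, comments, and lines that are not node records.
    """
    line = line.strip()
    if not line or line.startswith("#") or not line.startswith("node "):
        return None
    # Split on the first colon only; everything after it is the address list.
    head, _, rest = line.partition(":")
    node_id = head.split()[1]
    return node_id, rest.split()

def is_placeholder(addr):
    """True if an IPv4 address falls in the generated, non-real range."""
    return ipaddress.ip_address(addr) in PLACEHOLDER_NET

def load_nodes(lines):
    """Build {node_id: [addresses]} from an iterable of nodes-file lines."""
    parsed = (parse_node_line(line) for line in lines)
    return {node_id: addrs for node_id, addrs in filter(None, parsed)}
```

Filtering with is_placeholder before any per-address analysis (for example, DNS or geolocation lookups) avoids treating the generated addresses as observed ones.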

Links File

The links file lists the set of routers and router interfaces that
were inferred to be sharing each link.
Note that these are IP layer links, not physical cables or
graph edges. More than two nodes can share the same IP link
if the nodes are all connected to the same layer 2 switch
(POS, ATM, Ethernet, etc.).

Each line indicates that a link link_id connects nodes
N1 to Nm.
If it is known which router interface is connected to the link, then the
interface address is given after the node ID separated by a
colon (e.g., "N1:1.2.3.4"); otherwise, only
the node ID is given (e.g., "N1").
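
A minimal parsing sketch for link records follows. The exact line layout is an assumption inferred from the description above: a "link <link_id>:" prefix followed by whitespace-separated entries of the form "N1:1.2.3.4" or "N1". The function name parse_link_line is hypothetical.

```python
def parse_link_line(line):
    """Parse an assumed 'link <link_id>: <N1>[:<i1>] <N2>[:<i2>] ...' line.

    Returns (link_id, [(node_id, address_or_None), ...]), or None for
    lines that are not link records (blank lines, comments, other types).
    """
    line = line.strip()
    if not line.startswith("link "):
        return None
    head, _, rest = line.partition(":")
    link_id = head.split()[1]
    endpoints = []
    for token in rest.split():
        # Split on the first colon only, so an IPv6 address stays intact.
        node_id, _, addr = token.partition(":")
        endpoints.append((node_id, addr or None))
    return link_id, endpoints
```

An endpoint whose address is None corresponds to a router that is known to be on the link but whose interface on that link was not observed.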

By joining the node and link data, one can obtain the
known and inferred interfaces of each router.
Known interfaces actually appeared in some traceroute path.
Inferred interfaces arise when we know that some router N1
connects to a known
interface i2 of another router N2, but we
never saw an actual interface on the former router.
The interfaces on an IP link are
typically assigned IP addresses from the same prefix, so we assume
that router N1 must have an inferred interface
from the same prefix as i2.
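
The join described above can be sketched as follows, assuming the nodes and links files have already been parsed into dictionaries. The data layout and the function name classify_interfaces are hypothetical. The sketch interprets a link entry that carries only a node ID as an inferred interface, and it records the link on which the interface sits rather than guessing an address.

```python
from collections import defaultdict

def classify_interfaces(nodes, links):
    """Split each router's interfaces into known and inferred.

    nodes: {node_id: [addresses]} taken from the nodes file.
    links: {link_id: [(node_id, address_or_None), ...]} from the links file.

    Returns (known, inferred):
      known    maps node_id -> set of observed interface addresses
      inferred maps node_id -> set of link_ids on which the router must have
               an interface whose address was never observed
    """
    known = defaultdict(set)
    inferred = defaultdict(set)
    for node_id, addrs in nodes.items():
        known[node_id].update(addrs)
    for link_id, endpoints in links.items():
        for node_id, addr in endpoints:
            if addr is None:
                inferred[node_id].add(link_id)
            else:
                known[node_id].add(addr)
    return dict(known), dict(inferred)
```

To estimate the prefix of an inferred interface, one could look at the known addresses of the other endpoints on the same link, following the same-prefix assumption stated above.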

Node-AS File

The node-AS file assigns an AS to each node found in
the nodes file. We use our final
Election+Degree assignment heuristic
to infer the owner AS of each node.

File format:
node.AS <node_id> <AS> <method>

Each line indicates that the node node_id is owned/operated by
the given AS, as inferred with the given method.
There are three inference methods:

single: a router has only a single choice of AS

election: multiple ASes are present on a router, and one AS occurs more frequently than the rest

election+degree: multiple ASes are present on a router, but no single AS occurs most frequently, so the choice is based on AS degree
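
As an illustration of the node-AS format above, this sketch reads node.AS records and tallies how many routers were assigned by each inference method. The function names are hypothetical, and the AS field is kept as a string rather than converted to an integer, as a precaution.

```python
from collections import Counter

def parse_node_as_line(line):
    """Parse 'node.AS <node_id> <AS> <method>' into (node_id, AS, method).

    Returns None for lines that do not match this four-field layout.
    """
    parts = line.split()
    if len(parts) != 4 or parts[0] != "node.AS":
        return None
    _, node_id, asn, method = parts
    return node_id, asn, method

def method_counts(lines):
    """Count how many node.AS records used each inference method."""
    counts = Counter()
    for line in lines:
        record = parse_node_as_line(line)
        if record is not None:
            counts[record[2]] += 1
    return counts
```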

Hostnames File

The hostnames file contains the hostname for every IP
address in the router-level topology for which a reverse
DNS lookup succeeded.

File format:
<timestamp> <IP_address> <hostname>

Node-Geolocation File

The node-geolocation file contains the geographic location
for each node in the nodes file. We use MaxMind's
GeoLite City
database for the geographic mapping.

Request Data Access

Note that two historical
ITDK releases made in 2002 and 2003
are also available as public datasets. These datasets should be used
with caution, as they were constructed using completely different procedures
and using topology data collected on the now decommissioned skitter
measurement infrastructure.

Below we describe the various steps involved in producing the datasets that
are part of the ITDK.

Alias resolution

For alias resolution, we rely on several CAIDA tools: iffinder,
kapar, and
MIDAR
(described in a recent tech report).
MIDAR (Monotonic ID-based Alias Resolution, a tool we hope
to release soon) expands on the IP velocity techniques of
RadarGun,
while kapar expands the analytical techniques of
APAR.
We use the traceroute dataset as input to MIDAR and iffinder,
which generate output files used as input to kapar.
kapar heuristically infers the set of interfaces that
belong to the same router, and the set of two or more routers on
the same IP link (a construct that represents either a
point-to-point link, or LAN or cloud with multiple attached IP
addresses).

DNS hostnames

We have an in-house bulk
DNS lookup service called HostDB that can look up millions of addresses per day. We look up all
intermediate addresses and responding destinations seen in the Topology Dataset. Each
ITDK contains a list of the successful lookups for each IP address found in the nodes dataset.

BGP

To assign IP addresses to ASes, we used a publicly available BGP
dump provided by Routeviews.
BGP (Border Gateway Protocol) is the protocol for
exchanging interdomain routing information among ASes in the
Internet. A single origin AS typically announces
("originates") each routable prefix via BGP. We perform
IP-to-AS mapping by assigning an IP address to the origin AS of
the longest matching prefix for that IP address in the BGP tables.
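
The longest-prefix-match step can be sketched with Python's standard ipaddress module. This is a simple IPv4-only illustration, not the pipeline used to build the ITDK; the prefix table layout and the function names (build_table, ip_to_as) are hypothetical, and the input is assumed to be a mapping from prefix to origin AS already extracted from a BGP dump.

```python
import ipaddress

def build_table(prefix_to_origin):
    """Group prefixes by prefix length: {length: {network: origin_as}}."""
    table = {}
    for prefix, origin in prefix_to_origin.items():
        net = ipaddress.IPv4Network(prefix)
        table.setdefault(net.prefixlen, {})[net] = origin
    return table

def ip_to_as(table, addr):
    """Return the origin AS of the longest matching prefix, or None."""
    # Try the most specific prefix lengths first; the first hit is the
    # longest match.
    for length in sorted(table, reverse=True):
        # Mask the address down to this prefix length.
        candidate = ipaddress.IPv4Network(f"{addr}/{length}", strict=False)
        origin = table[length].get(candidate)
        if origin is not None:
            return origin
    return None
```

Grouping by prefix length keeps each lookup to at most 33 dictionary probes; a radix trie would be the usual choice for large-scale processing.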

The goal of the AS assignment process is to determine the AS that
owns each router. For each router r, we first create an AS
frequency matrix that counts the number of interfaces (known
and inferred) from each AS that appears on r. The ASes in this
frequency matrix represent the set of possible owner ASes of r.
We considered the following AS assignment heuristics to assign a router r
to an AS.

The Election heuristic assigns router r to the AS with the
highest frequency in r's AS frequency matrix. The intuition behind
this heuristic is that routers will tend to have more interfaces in
the address space of their owner. If two ASes from r's AS frequency
matrix have the same count, then Election cannot decide an
owner.
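
A minimal sketch of Election, assuming the AS frequency matrix of a single router is represented as the list of ASes of its interfaces (one entry per known or inferred interface). The function name and the use of None to signal an undecided result are illustrative choices.

```python
from collections import Counter

def election(interface_ases):
    """Return the AS with the strictly highest interface count, else None.

    interface_ases: iterable with one AS per interface on the router.
    None means Election cannot decide (no interfaces, or a tie at the top).
    """
    ranked = Counter(interface_ases).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # two ASes share the highest count
    return ranked[0][0]
```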

The Customer heuristic uses the AS relationship dataset to
assign relationships to each pair of ASes from r's AS frequency
matrix. Customer assigns r to the AS inferred to be a customer
of every other AS in r's AS frequency matrix. This heuristic is based
on the common practice that customer and provider routers typically
interconnect using addresses from the provider's address
space. Consequently, a router with interfaces from both the customer
and provider address spaces is assigned to the customer.
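
The Customer rule can be sketched as below. The representation of the AS relationship data as a set of (provider, customer) pairs is an assumption made for this example, as is the function name; the sketch returns None when no AS, or more than one AS, qualifies.

```python
def customer(candidate_ases, provider_customer):
    """Return the AS that is a customer of every other candidate AS.

    candidate_ases:    ASes appearing on the router.
    provider_customer: set of (provider_as, customer_as) pairs
                       (hypothetical layout for the relationship data).
    """
    candidates = set(candidate_ases)
    winners = [
        asn
        for asn in candidates
        if all((other, asn) in provider_customer
               for other in candidates if other != asn)
    ]
    # Exactly one AS must qualify; otherwise the heuristic is undecided.
    return winners[0] if len(winners) == 1 else None
```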

For the Degree heuristic, we first generate an AS-level graph
by assuming full-mesh connectivity among ASes from each router's AS
frequency matrix. We then use this graph to generate an AS degree for
each AS. Degree assigns router r to the smallest-degree AS from
r's AS frequency matrix, i.e., the AS most likely to be the customer
AS, based on similar intuition as the Customer heuristic.
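
The Degree computation can be sketched as follows, assuming each router is represented by the set of ASes in its frequency matrix. The function names are hypothetical, and returning None when the two smallest degrees are equal is an assumption of this sketch, since tie handling for Degree is not specified above.

```python
from itertools import combinations

def as_degrees(router_ases):
    """Compute AS degrees from a full mesh among the ASes on each router.

    router_ases: {router_id: set of ASes appearing on that router}.
    Returns {asn: number of distinct neighbor ASes}.
    """
    neighbors = {}
    for ases in router_ases.values():
        for asn in ases:
            neighbors.setdefault(asn, set())
        # Full mesh: every pair of ASes on the same router becomes an edge.
        for a, b in combinations(ases, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {asn: len(adj) for asn, adj in neighbors.items()}

def degree(candidate_ases, degrees):
    """Return the smallest-degree candidate AS, or None on a tie or no input."""
    ranked = sorted(candidate_ases, key=lambda asn: degrees.get(asn, 0))
    if not ranked:
        return None
    if len(ranked) > 1 and degrees.get(ranked[0], 0) == degrees.get(ranked[1], 0):
        return None
    return ranked[0]
```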

For the Neighbor heuristic, we first determine the set of
single-AS routers to which r is connected (its single-AS
neighbors). We create a new AS frequency matrix that counts the number
of single-AS neighbors of r from each AS. The Neighbor
heuristic assigns r to the AS with the largest frequency (most
single-AS neighbors), based on the intuition that a router is
connected to a larger number of single-AS routers in its owner
AS. Neighbor produces an ambiguous assignment when multiple
ASes have the same (highest) frequency.
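
A sketch of Neighbor under assumed data structures: an adjacency map from each router to its neighboring routers, and a map from each router to the set of ASes on it. Both layouts and the function name are hypothetical. The sketch does not restrict the result to ASes that appear on r itself, since the description above does not state such a restriction.

```python
from collections import Counter

def neighbor(router, adjacency, router_ases):
    """Return the AS with the most single-AS neighbors of `router`.

    adjacency:   {router_id: iterable of neighboring router_ids}
    router_ases: {router_id: set of ASes appearing on that router}
    Returns None if there are no single-AS neighbors or the top count is tied.
    """
    counts = Counter()
    for other in adjacency.get(router, ()):
        ases = router_ases.get(other, set())
        if len(ases) == 1:  # only single-AS neighbors get a vote
            counts[next(iter(ases))] += 1
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: multiple ASes share the highest frequency
    return ranked[0][0]
```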

When one of the primary heuristics described above cannot
produce an AS assignment, we break the tie using one of
the other heuristics. Our evaluation in the paper
shows that Neighbor was the best stand-alone heuristic, while
Election+Degree was the best combination.
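
One way the Election+Degree combination might look in code is sketched below. The returned method labels mirror the three methods listed for the node-AS file. Two details are assumptions of this sketch rather than documented behavior: Degree is applied only to the ASes tied for the highest count, and any remaining tie is broken by the lowest AS number to keep the result deterministic.

```python
from collections import Counter

def election_plus_degree(interface_ases, degrees):
    """Assign a router to an AS; returns (asn, method) or (None, None).

    interface_ases: iterable with one AS per interface on the router.
    degrees:        {asn: AS degree}, e.g. from a full-mesh AS graph.
    """
    counts = Counter(interface_ases)
    if not counts:
        return None, None
    if len(counts) == 1:
        return next(iter(counts)), "single"
    top = max(counts.values())
    leaders = [asn for asn, n in counts.items() if n == top]
    if len(leaders) == 1:
        return leaders[0], "election"
    # Election is tied: fall back to the smallest-degree AS among the leaders
    # (assumption), using the AS number as a final deterministic tie-breaker.
    winner = min(leaders, key=lambda asn: (degrees.get(asn, 0), asn))
    return winner, "election+degree"
```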

Geolocation

We use MaxMind's free
GeoLite City
database to provide the geographic location (at city granularity)
of routers in the router-level graph. Because this database maps
individual IP addresses to locations, we take the following steps to
find the location of each router (which by definition has multiple
interfaces). We first map each interface on a router to a location.
If all interfaces map to the same location, then we assign that
location to the router; otherwise, we do not assign any location to
the router (that is, the router does not appear in the geolocation
file).
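
The consensus rule above can be sketched as follows, assuming a per-address lookup table has already been built from the geolocation database. The table layout and the function name router_location are hypothetical, and the sketch treats an interface with no database entry as preventing a consensus.

```python
def router_location(interfaces, ip_location):
    """Return the location shared by all interfaces of a router, else None.

    interfaces:  iterable of interface addresses on the router.
    ip_location: {address: location}, where a location is any hashable value
                 such as a (country, city) tuple.
    """
    # Unmapped interfaces contribute None, which blocks a consensus unless
    # every interface is unmapped (in which case the result is still None).
    locations = {ip_location.get(addr) for addr in interfaces}
    if len(locations) == 1:
        return next(iter(locations))
    return None
```

Routers for which this function returns None would simply be omitted from the output, matching the behavior described for the geolocation file.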