Minimizing Wide-Area Performance Disruptions in Inter-Domain Routing

Abstract:

The Internet is the platform for most of our communications needs
today. The networks underlying the Internet undergo continual change
-- both planned changes (e.g., adding a new router) and unplanned
failures. Unfortunately, these changes can lead to performance
disruptions, which affect the user experience. Because of this,
network operators have to quickly diagnose and fix any problems that
arise. Diagnosing wide-area performance disruptions is challenging:
first, each network has limited visibility into other networks, so
network operators must collect and analyze measurements of routing and
traffic data in order to infer the root cause of the disruption;
second, many potential factors can lead to performance disruptions,
and these factors are usually interdependent; third, there is no
formalized way to define metrics and classify performance disruptions
by cause, so network diagnosis is usually done in an ad hoc manner.

The thesis conducts two case studies to diagnose wide-area performance
disruptions from the perspectives of a large tier-1 Internet Service
Provider (ISP) and a large content distribution network (CDN):
i) From the ISP's perspective, we designed and implemented a system
that tracks inter-domain route changes at scale and in real time. Our
system can be used as a building block for many of the ISP's
diagnosis tools.
ii) From the CDN's perspective, we focus on diagnosing wide-area
network changes that increase the latency of accessing the CDN's
services. We designed a method for automatically classifying large
latency increases, and evaluated our techniques on one month of
measurement data to identify the major sources of high latency for
the CDN.

Stepping back, the difficulties in network diagnosis can be traced
back to the inter-domain routing protocol itself. Based on the lessons
learned from the case studies, we refactor the Border Gateway Protocol
(BGP), the main inter-domain routing protocol, in two ways: first,
since a network operator has visibility into its own network and
only limited visibility into neighboring networks, we propose to
select routes based only on the next-hop AS (rather than on networks
further away); second, BGP was designed as a way to exchange path
availability information between independent networks, not to address
the operational challenges of performance, security, and traffic
engineering. This has led many to propose additional BGP attributes
that satisfy these operational needs. Such proposals make the protocol
and its configuration more complicated, and thus more error-prone and
harder for network operators to diagnose. Instead, we propose
simplifying the protocol, which in effect allows the operational
challenges to be addressed outside the protocol. Our
proposal of next-hop BGP not only simplifies the protocol, but also
has the benefits of fast convergence, incentive compatibility, and
easier support for multi-path routing.
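
To make the idea of selecting a route based only on the next-hop AS
concrete, the following is a minimal illustrative sketch, not the
implementation developed in the thesis. The names Route, select_route,
and neighbor_rank are hypothetical, and the AS numbers and prefix are
taken from ranges reserved for documentation. The sketch assumes the
operator expresses its policy as a ranking over neighboring ASes
(lower is better) and that every AS path is non-empty.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    """A candidate route (hypothetical structure for illustration)."""
    prefix: str
    as_path: Tuple[int, ...]  # assumed non-empty; first entry is the next-hop AS

    @property
    def next_hop_as(self) -> int:
        return self.as_path[0]


def select_route(candidates: List[Route],
                 neighbor_rank: Dict[int, int]) -> Optional[Route]:
    """Pick a route using only the operator's ranking of next-hop ASes.

    The rest of the AS path (length, ASes further away) is ignored.
    Routes via neighbors absent from neighbor_rank are treated as
    unusable. Ties go to the earliest candidate in the list.
    """
    usable = [r for r in candidates if r.next_hop_as in neighbor_rank]
    if not usable:
        return None
    return min(usable, key=lambda r: neighbor_rank[r.next_hop_as])


if __name__ == "__main__":
    routes = [
        Route("203.0.113.0/24", (64501, 64510)),
        Route("203.0.113.0/24", (64502, 64505, 64506, 64507)),
    ]
    # Assumed policy: neighbor 64502 is preferred over neighbor 64501.
    rank = {64502: 0, 64501: 1}
    best = select_route(routes, rank)
    print(best.next_hop_as if best else None)
```

In this example the route through AS 64502 is chosen even though its
AS path is longer, because the decision depends only on the ranking of
the next-hop AS.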