Department of Electrical and Computer Engineering, 465 Northwestern Avenue, West Lafayette, IN 47907-2035, United States Department of Computer Science, 305 N. University St., West Lafayette, IN 47907-2107, United States

a r t i c l e

i n f o

a b s t r a c t
The Border Gateway Protocol (BGP) maintains inter-domain routing information by announcing and withdrawing IP preﬁxes. These routing updates can cause preﬁxes to be unreachable for periods of time, reducing preﬁx availability observed from different vantage points on the Internet. The observed preﬁx availability values may not meet the standards promised by Service Level Agreements (SLAs). In this paper, we develop a framework for predicting long-term availability of preﬁxes, given short-duration preﬁx information from publicly available BGP routing databases like RouteViews, and prediction models constructed from information about other Internet preﬁxes. We compare three prediction models and ﬁnd that machine learning-based prediction methods outperform a baseline model that predicts the future availability of a preﬁx to be the same as its past availability. Our results show that mean time to failure is the most important attribute for predicting availability. We also quantify how preﬁx availability is related to preﬁx length and update frequency. Our prediction models achieve 82% accuracy and 0.7 ranking quality when predicting for a future duration equal to the learning duration. We can also predict for a longer future duration, with graceful performance reduction. Our models allow ISPs to adjust BGP routing policies if predicted availability is low, and are generally useful for cloud computing systems, content distribution networks, P2P, and VoIP applications. Ó 2010 Elsevier B.V. All rights reserved.

1. Introduction The Border Gateway Protocol (BGP), the de-facto Internet inter-domain routing protocol, propagates reachability information by announcing paths to preﬁxes, which are aggregates of IP addresses. Autonomous Systems (ASes) maintain paths to preﬁxes in their routing tables, and (conditionally) update this information when route update messages are received. These update messages can be announcements, which announce an AS path to a preﬁx, or withdrawals, which indicate that no path is available

to the preﬁx. Continuous preﬁx reachability over time is crucial for the smooth operation of the Internet. This is captured using the metric of availability, deﬁned as the time duration when the preﬁx is deemed reachable divided by the total time duration we are interested in. While typical system availability metrics for telephone networks exceed ﬁve 9s, i.e., 99.999%, computer networks are known to have lower availability [1–3]. The ﬁve 9s availability value amounts to the system being down for about ﬁve minutes in a year’s period and is usually too stringent a requirement for Internet preﬁxes. Preﬁxes belonging to highly popular services such as CNN, Google, and YouTube need to be highly available, and a disruption of more than a few minutes is generally unacceptable. Internet Service Providers (ISPs) such as AT&T and Sprint usually provide availability guarantees on their backbone network through Service Level Agreements (SLAs) [4,5]. However, content providers are more

e. [13] found that data plane failures can be predicted using routing updates with about 80–90% accuracy for about 60–70% of the preﬁxes. at the cost of increased burden on the prober and elevated network trafﬁc. / Computer Networks 55 (2011) 873–889
interested in their website availability as observed from various points in the Internet.874
R. we compute attributes during a short duration observation period of publicly available routing information (e. One can modify the incentive mechanisms of BitTorrent [20] by unchoking the BitTorrent peers which are parts of a highly available preﬁx. Asia.. several reachability problems have occurred.e. Wang et al. BGP routing dynamics have been used to predict data plane failures in previous work [13. as they depend on intermediate ASes between the source and the destination. since we are concerned with the long term availability metric considering at least a few days at a time. Transient events like routing convergence and forwarding loops result in temporary reachability loss in the data plane. The reachability estimate they compute increases in accuracy as the sampling interval is made smaller. In this work. from RouteViews [16]) and develop a prediction model based on information on other Internet preﬁxes. of preﬁxes of an ISP. Currently. [9. We can enhance this technique by using the preﬁxes for which the predicted availability falls below a threshold as the potentially problem preﬁxes. the availability of the paths to preﬁxes as advertised by BGP.g. Additionally. cloud computing applications. We will make our prediction tool publicly available through a web page so it can be used for monitoring the predicted availability. data plane measurements are inherently discontinuous. the percentage of time that the control plane and data plane paths mismatch should be insigniﬁcant compared to the time over which our availability values are computed. establishing the correlation between the two planes is by itself a challenging topic [12] and detailed study of this is beyond the scope of this work. e. Our work also complements a data plane loss rate prediction system such as iPlane [19]. i. As can be seen from the discussion above. Moreover. which caused signiﬁcant disruptions and increase in web latencies to much of the Middle East. and North Africa for a period of several weeks [11]. Measuring preﬁx availability is non-trivial without an extensive measurement infrastructure comprising many vantage points. and a routing path being advertised is critical to maintaining trafﬁc ﬂow to their data centers. Thus.. However. [14] studied the correlation between control plane and data plane events and found that control plane changes mostly result in data plane performance degradation. the observations need to be made over a long period of time to obtain a reasonable estimate. Khosla et al. it is not possible to predict the existence of default routes. where ensuring content availability is a primary concern amid extensive peer churn.10]. Changing BGP attributes such as MED and community.e. previous work has shown that the control plane-advertised paths may not always imply that the paths are usable in the data plane [12–14]. This indicates that the control plane does indeed have a positive correlation with the data plane. [15] predict end-to-end path failure using the number of BGP messages observed during a 15 minute window. VoIP implementations can use predicted availability of relay nodes along with latency and loss rate estimates for better performance. most of which last less than 300 seconds [13].. Our work can optimize Hubble [18] – a system that studies black holes in the Internet by issuing traceroutes to potentially problem preﬁxes. CDNs and cloud computing applications can use the highest predicted availability replica/server to redirect the clients to. in addition to considering their download rate and latency/ loss rate estimates.g. A shortfall in measured availability requires a reactive approach that corrects the problem after the fact. Our work considers only control plane availability and hence actual preﬁx availability could be higher in the data plane if default routes are present. as they take reachability samples at periodic time instants. one can increase the penalty threshold associated with route ﬂap damping for the routes to a high availability requirement preﬁx (like a business customer) to ensure higher availability [17]. and several undersea cable cuts. can increase the perceived preﬁx availability or aid trafﬁc engineering [17]. our approach does not need additional measurement infrastructure apart from RouteViews [16]. such as the YouTube preﬁx hijack which lasted about two hours [8]. and P2P networks.. Feamster et al. There is no agreed upon method to detect the existence of default routes. This will increase detection accuracy of black holes. The key premise in this paper is that Internet preﬁx characteristics convey valuable information about preﬁx availability. A predicted long-term advertised availability value which falls short of requirements could lead to changes in BGP policies of the ISP regulating the advertisement of these preﬁxes to the rest of the Internet. Our system eliminates the need for storing information about peers at clients that are not currently downloading from these peers but may do so in the future. i. Zhang et al. Applications of our work include content distribution networks (CDNs). We argue that prediction models are viable even if preﬁxes whose availability is to be predicted and
. showing that the two planes are correlated. However. or aggregating preﬁxes. Attempts at deﬁning policies so that SLAs can be extended to several ISPs [6] and at deﬁning and estimating service availability between two end points [7] in the Internet have had limited success. Hubble uses BGP updates for a preﬁx as one of the potential indicators of problems. and then analyzing the results to identify data plane reachability problems. However. though some initial efforts have been made by the authors of [12] by controlling announcements and
withdrawals of certain preﬁxes allocated to their ASes. Our work can also be applied to peer networks. as observed from multiple vantage points in the Internet. Our work takes a predictive approach to solve the availability prediction problem. For example. predicting the advertised availability of preﬁxes. Meanwhile.. Our framework predicts long-term control plane availability.g. VoIP applications. which has been maintained by the University of Oregon for several years. Data plane reachability can exist even when control plane paths are withdrawn due to the presence of default routes [12]. focusing on withdrawals and AS path changes.15].

2]. This technique is used for summarizing the data and improving understanding of the data features. since we predict control plane availability (or existence of routing paths) to a preﬁx from multiple Internet vantage points. as opposed to [21].3). Our work is complementary to iPlane [19] and iPlane Nano [30]. applied predictive modeling in the context of BGP. whereas we consider preﬁx availability indicated by routing update messages. We therefore use randomly selected preﬁxes from RouteViews to learn models. We formulate hypotheses about how attributes of a preﬁx such as preﬁx length and update frequency relate to its availability. To the best of our knowledge. e. where one month of data was considered at a time. Our availability predictions from three models are compared to measured availability values from RouteViews. / Computer Networks 55 (2011) 873–889
875
preﬁxes used for learning prediction models are unrelated (e. we use predictive modeling to predict preﬁx behavior. As availability is a long-term metric. We ﬁnd that the performance of this model is better or worse than bagged decision trees. and prove or refute them based on our data.. (2) We consider an additional machine learning model. and growth. respectively.. a few studies. which can improve the performance of several applications like VoIP. Section 2 summarizes related work. These studies focused on end-to-end loss. and we investigate three prediction models that may learn from other preﬁxes. and that AS policy changes. This paper extends our previous work [21] as follows: (1) In addition to varying the ratio of the learning duration to the prediction duration as in [21]. update count. We also conduct a more thorough investigation of the prediction models used in [21].
2. This theme is common in other disciplines. [28] cluster routing updates into events based on the announcing peers and similarity of AS paths using descriptive modeling as the data mining technique. combined with the fact that operator reaction to path failures is relatively standard.4).g. AS de-peering. since some preﬁxes are only visible for short time periods. namely the Naïve Bayes model. while iPlane samples data plane metrics like latency. with the preﬁx popularity being a feature that can be used to predict stability. e. This is a popular model in the machine learning literature [22]. given the observed values of preﬁx attributes.29. Their results ﬁt into our prediction framework. Section 7 describes our results of applying prediction models to large sets of combinations.
The remainder of this paper is organized as follows. measured by active probes. Khosla et al. In cases where no responsive hosts within a preﬁx can be found
. utilizing four simple attributes computed from observing RouteViews data. Hubble [18] and iPlane [19. which we address in this paper. learning and predicted preﬁxes are not in the same AS). [13] predict the impact of routing changes on the data plane. can increase knowledge about preﬁxes and end hosts in the Internet. preﬁx churn. This. using statistical tests. no other work has exploited the similarity of preﬁxes in the Internet. These predictions. In Section 6. In contrast. They aim to predict reachability problems based on problematic ASes in AS paths in the routing updates observed for a preﬁx. e. caused by BGP reaction to path failure or policy changes. This is important because the availability distribution depends on the duration over which it is computed. Section 8 concludes the paper and discusses future work. reachability from various monitors.. Related work Rexford et al.. thereby showing that the prediction models are scalable (Section 7). for availability prediction. Recently. such as medicine. and throughput. taken together.30] have been developed at University of Washington for detecting data plane reachability problems and predicting data plane paths and their properties. Our work goes a step further by predicting preﬁx availability. using easily computable attributes. and experience fewer and shorter update events. (5) All results presented in this paper are for the time period of January to October 2009. Chang et al.. have been studied. our prediction framework can be easily extended to predict other preﬁx properties of interest. P2P and CDNs. but they only examined problem ASes in the path to a particular preﬁx. Preﬁx attributes like activity.g. not just the events associated with a preﬁx. where one uses known symptoms of patients with a diagnosed disease to try to diagnose patients with an unknown condition. Section 4 describes our datasets. We show that past availability of a preﬁx is inadequate for accurately predicting future availability. typically affect several preﬁxes at a time. (4) We predict availability of a large number of preﬁxes. We deﬁne the problem that we study in Section 3. but the attributes are not used to classify preﬁxes or predict preﬁx features. speciﬁcally availability. and Section 5 describes our methodology and metrics. supports this premise. known to be simpler than the bagged decision trees considered in [21] but potentially less accurate. we vary the learning duration itself. bandwidth and loss rates to end hosts in the preﬁx at a low frequency. depending on the learning duration (Section 6. This is because an important factor causing paths to preﬁxes from various vantage points to go up or down is BGP path convergence.g. delay. (3) We study the distribution of the preﬁx attributes and show. Predictability of network performance has been extensively studied. and hence this impacts prediction performance. [23] ﬁnd that highly popular preﬁxes have relatively stable BGP routes. speciﬁcally the number of update events associated with a preﬁx.g. and then predict availability of other preﬁxes. that the attributes indeed demarcate availability classes (Section 5. in [3. Zhang et al. in [24–27].g. While we focus on predicting preﬁx availability using observed routing updates. This leads to higher diversity of the visible preﬁxes. this 10-month evaluation of the prediction models is more realistic.R. we compare results from three prediction models and study the effect of classiﬁcation attributes and using certain more predictable preﬁxes on prediction results. [13]. Our work is orthogonal to theirs in the sense that we consider control plane availability. e.

We compute the availability of these combinations and use that for building our prediction models. and the availability and attribute information of other preﬁxes. we have about 14–18 GB of gzipped RIB and update ﬁles per month of data. which implies that the preﬁx was observed by the peer in the RouteViews dataset. our interest lies in predicting whether the availability value is above or below an acceptable threshold (e. and predict the availability class of a preﬁx for some time period in the future. except when studying additional attributes of Announcements in Section 6. preﬁx. it will give accurate results on unseen preﬁxes? (5) How long should one observe preﬁx attributes so that its availability can be accurately inferred? 4. We only keep the timestamp. Going back to our patient analogy. We reduce the storage space required by removing unused ﬁelds. for diagnosis or detection purposes. we discretize availability. which total about 25 GB per month of data. Methodology We deﬁne a combination as a (peer. we seek answers to the following questions for our framework: (1) How to discretize availability? How many classes and what threshold values should be used? (2) Given a set of preﬁxes with their associated attributes and availability classes.g. and the type of update (announcement or withdrawal).. Advertised availability is critical in maintaining smooth trafﬁc ﬂow to these preﬁxes. to assess prediction performance. one can use test results of a new patient to diagnose the new patient’s disease.g. iPlane cannot make predictions.. which are used to predict its long-term availability. Khosla et al. false positives. We preprocess the data using libbgpdump version 1. these observation points are the peers. In this paper. just our predictions. which passively collect data about routing tables and updates from the AS routers (peers) which actually observe preﬁxes. that advertised in SLA). e.g.4. We utilize data from January to October 2009 to build and test our prediction models. this is purely for validation of our prediction schemes – in a real deployment. This deﬁnition yields 41–43 peers in our dataset. we validate our predictions by computing the ‘‘future’’ availability class and comparing it with the predicted class. Note that these peers are not the same as the RouteViews monitors. 3. It is these peers and the preﬁxes they observe that we refer to as
.. peer IP. They are fairly well spread out over the world. and not the speciﬁc value of the availability. For the RouteViews data. Discretizing also gives us an added advantage of use of confusion matrix-based measures. preﬁx) tuple. Spurious announcements and withdrawals are ﬁltered as described in Section 4. Using continuous availability values causes problems in deﬁning error measures because a miss in high availability values (e.g. We executed scripts (contributed by the authors of [32]) from the point of view of every peer in our dataset. A problem with using raw updates from RouteViews is that they also include routing table transfers which are caused by session resets between a monitor and a peer [32].bz2 format with typical sizes of 0. We deﬁne a peer as any vantage point gathering routing information which is present in any routing table entry and at least one update. a predicted 35% instead of 40%) because of attached importance to higher values. are more speciﬁc preﬁxes (ones with longer length) less available than less speciﬁc ones? Do preﬁxes that generate more updates have lower availability?
(4) How large should a set of preﬁxes be such that if we learn our prediction model from this set.876
R. 5. through RouteViews). In this paper. given its attributes computed by observing BGP updates (for example.99. collected for a short duration of time. After preprocessing and ﬁltering table transfers (as described below).1. based on information collected from the past that is used to train prediction models. The notion of availability of a preﬁx is with reference to an observation point in the Internet. This is because. Our ‘‘test results’’ are the updates observed for a preﬁx for a limited period of time. / Computer Networks 55 (2011) 873–889
by iPlane.5. Rather than predicting continuous values of availability.8 GB per day of RIB ﬁles (sampled every two hours) and about 25 MB per day of update ﬁles (written every minutes). In this paper. we compute availability in the control plane by marking the time of an announcement of a preﬁx as the time when it goes up and a withdrawal as the time when it goes down and matching our predictions against this computed availability. These spurious updates are an artifact of the update collection methodology.. we will not have the availability classes of the future. However. We developed a script that removes the table transfer updates from the update ﬁles obtained from RouteViews. in which case our availability predictions will be the only available ones for applications.7 [31] to convert the ﬁles from the MRT format to text. and which prediction models work best? (3) How to extract and represent preﬁx attributes from RouteViews data? Which attributes of a preﬁx are most important in predicting availability? For example. Zhang et al. how accurately can one predict the availability classes of other preﬁxes. We use these ﬁltered updates for all further processing. given the symptoms and known diseases of some patients. Datasets The routing tables (RIB ﬁles) and updates available from RouteViews [16] are in . Problem deﬁnition We deﬁne the availability prediction problem to be the prediction of the BGP-advertised availability of a preﬁx. enabling one to observe the availability of preﬁxes from various points in the Internet. [32] developed the Minimum Collection Time (MCT) algorithm to identify BGP routing table transfers. 99% predicted as 94%) counts more than a miss in lower values (e. The months span a reasonable time period to prevent biasing our model selection process towards datasets from a particular timeframe when some event (such as an undersea cable cut) may have occurred.

7. However. and 0. distinguishing the time when the preﬁx is unannounced because of a covering preﬁx or when it is withdrawn due to BGP or network conditions is difﬁcult. the relationship between the preﬁxes and their subpreﬁxes can be complicated since these can be announced from different origin ASes. computing an aggregate long-term availability of preﬁxes which are subpreﬁxes of other preﬁxes is a challenging task. we treat each announced preﬁx separately as a part of the (peer. Thus. 0. We use Weka [22]. but different values of tl. Formulation and computation of aggregate availability across more speciﬁc preﬁxes and their covering preﬁxes is left as future work.e. This random selection of combinations prevents biasing our prediction results towards a speciﬁc group of combinations which may be related. If one aims to ﬁnd preﬁxes which have both high and low availability values. and low classes. a combination has an equal chance of appearing in the training and the test sets if it appears
at least once in the ﬁrst routing table of tl or an update in the period tl + tp. Discretizing availability We discretize the continuous availability value into availability classes which we predict using observed attributes. Routing policies can change over time and the announcements of the subpreﬁxes can vary depending upon transient conditions like network load. medium. We maintain the state of each combination at each point in time. the combinations present in the training and the test sets are randomly chosen from the set of combinations ‘‘visible’’ in the training and test durations tl and tp. The learning and future prediction durations are contiguous. In this work. we add an extra up or downtime depending upon the last state of the combination.75. the training and test sets are disjoint in both the combinations used and in the time period they span.g. For example. If one aims to ﬁnd preﬁxes that do not meet high availability requirements. the learning duration of tl considers all the combinations found in the ﬁrst routing table of January 09 and in the updates recorded in the duration tl. In what follows.5. We deﬁne the percentage learning duration as the ratio tl/(tl + tp) which evaluates the percentage of the duration tl + tp that is used in learning. for evaluating the models. and if the data period ended at time t2. For each of the values of this ratio.1. We deﬁne a combination to be visible in a time duration t if it exists in the ﬁrst routing table of the period t (for preventing boundary effects) or in any of the updates in the time duration. Thus.. respectively. After the prediction model is learned using the combinations from the training set and the information from the learning period. tl + tp represents the total time starting from the ﬁrst update of tl to the last one of tp. This can happen if the customer of an ISP announces a subportion of the preﬁx allocated to the ISP. We learn the prediction models from a training set. The ﬁrst routing table of the period is used to initialize the state of each combination present in the table to up (or announced). The rationale behind this is that the availability distribution may be different when computed over different periods of time. We want to investigate this difference and the effect it has on prediction for the same values of tl/(tl + tp).. 5. then for each test combination. Weka provides implementations of standard prediction models and data mining techniques for analyzing the performance of the models. The disjointness is necessary to prevent overﬁtting [22] so that the model performs well on unseen test data and to permit a realistic evaluation of the model. We then predict the availability of a disjoint set of combinations. making it indistinguishable from the announcements caused by trafﬁc engineering. which we call the test set. If the state of a combination changes from announced (A) to withdrawn (W). Since the learning period tl and prediction period tp are contiguous.9. Since routing policies are unknown. one should use two thresholds to discretize availability into high. Thus. we have 20 data points for evaluating each prediction model. we experiment with values of tl. The choice of these parameters is based on the prediction goal.25. After processing all update ﬁles. In a routing table. In this paper. The prediction models considered in this paper are described in detail in Section 6. if the last state change was to W and was reported at time t1. e. Misconﬁgurations can also cause a subpreﬁx to be announced for a short duration. We evaluate the quality of our prediction models by varying this ratio among 0. it is applied to the attributes of the combinations of the test set (computed during the learning period) to predict their availability classes in the future. / Computer Networks 55 (2011) 873–889
877
combinations. and this needs to be handled by an availability metric aggregated across preﬁxes. the prediction duration starts right after the learning duration ends. a combination is up or down when the peer associated with the combination has the corresponding preﬁx in an announced or withdrawn state.R. whereas a change from W to A leads to the recording of a downtime. The larger this ratio. Khosla et al. BGP supports aggregation of preﬁxes [33]. there can be several preﬁxes which are more speciﬁc versions (subpreﬁxes) of other preﬁxes in the table [24]. the easier the prediction since less of the future is unknown. an uptime is recorded. and at the time of each state change (as indicated by an update). Hence. a Java open-source data mining software. i. 0. and preﬁxes are frequently aggregated and deaggregated for implementing routing policies like trafﬁc engineering [17]. which consists of the combinations with known attributes computed during the learning period and availability class labels during the period. we apply the prediction model to its attributes learned from tl and we validate the availability prediction by comparing it to its availability during tp. we record a downtime or an uptime. The process of discretization uses thresholds as parameters.1. The computation of the availability of a combination for a particular time period proceeds as follows. a single threshold can discretize availability into high and low classes. we add a downtime with
. Thus. combinations containing preﬁxes from a speciﬁc AS may make it easier to predict availability of combinations containing preﬁxes from the same AS. If we denote the learning period as tl and the future prediction duration as tp. preﬁx) tuple deﬁned above. 0. the number and values of which have to be decided. where tl = 1. 19 and 30 days.

e. tl/(tl + tp) = 0.89 91..
These statistics play an important role in the choice of discretization thresholds. However. We use data from January 09 to study the effect of discretization. since there is a higher chance of combinations that have high availability in the learning period to fall below the 0. with most combinations (around 91–94%) having high availability.99999 threshold in the test period. where one value corresponds to one combination. The updates collected from RouteViews have the advantage that (barring errors) all the updates for a particular combination will be recorded.74 3. The table shows a steady increase in the number of combinations as more days are considered. 50%.999988 0.5 threshold for the low class label.9853 0.
Table 2 Percentage difference of mean availability between the training and test sets for different tl. tp À1. The availability of the combination is computed by noting the time that the combination was up (cumulative uptime) divided by the total time in which we are interested.7996 % Difference in availability of tl w.09 % High with 0.99999 and a binary class label.9975 0. The second column shows the number of (non-trivial) availability values that are considered in computing these statistics.8367 0.1.99999 threshold for high than with 0.50). but also on the value of tl. and low). We veriﬁed this observation by evaluating the prediction models of Section 6 on datasets with the two thresholds for the high class and found that the model performance for 0. and we are interested in availability. preﬁx) combinations to be extracted from the RouteViews data. we can easily extend our framework to two thresholds and a ternary class label. with the higher threshold demarcating high and medium and the lower one differentiating the medium and low classes. Median and 3rd quartile are 1 for all tl. We choose not to use the routing tables because they provide time snapshots of preﬁxes which can be reached by peers. median. These are expected to be low availability combinations. However.50) and (0. The ﬁrst quartile. to ﬁnd combinations with very low availability.9743 0.9975 0. Our goal is to compute the attributes from publicly available information from RouteViews. The medium percentage can be easily calculated since the percentages add up to 100%.99. about 1–4% of combinations fall into that category.99999. and third quartile are the values below which 25%. Computing attributes We now investigate the attributes of the (peer. tl Mean availability of learning duration tl 0.878
R. the prediction problem is more difﬁcult with a 0.9897 0. e.9743 0. If we choose a relatively lower valued threshold for high. tl 1 day 7 days 19 days 30 days % High with 0.2.99 threshold is indeed higher than that with 0. and 75% of the availability ordered combinations lie. since previously undiscovered combinations are found in newer update ﬁles. tl = 19 days means data from January 1 to 19). Comparing this computed availability with the predicted availability validates prediction results.9604 Variance 0. medium..99 threshold 93. This enables us to compare the percentage share of high under the two threshold sets.68 2. The attributes are computed for the learning period with the aim of predicting (future) availability classes for the test set.19 66.99999.92 67. and choose two different threshold sets of (0. This effect is seen in Table 2 which shows the percentage difference in the mean availability of the learning and test durations (tl and tp) for different values of tl and the same value of tl/(tl + tp). This trend of lower availability values with longer durations motivates us to study four different values of tl with the same tl/(tl + tp) ratio. / Computer Networks 55 (2011) 873–889
Table 1 Availability statistics for January 09 for different values of tl.02041 0. Khosla et al. To study this.02966
value t2 À t1.22 À6. respectively.9604 Mean availability of test duration tp 0.5 threshold 1. which is a continuous time metric.99999 threshold 67.99 À14.09 91.t. Table 1 shows the availability statistics for four values of tl starting from the beginning of January 09 (i.59
1 day 7 days 19 days 30 days
. validating that the former is an easier prediction problem. we can easily compute its availability. These new combinations were not present in the ﬁrst routing table of the month.g. which is listed in Table 3. a combination that only appears in the ﬁrst routing table of the month and has no updates for the duration under consideration will have an availability of 1.r. which contains both routing tables and updates for various combinations. otherwise they would have been found for tl = 1 day. With a 0.13 % Low with 0. Hence.13 À16. 0.0078 0. we use a single threshold of 0.999882 Mean 0. tl 1 day 7 days 19 days 30 days Number of combinations 10545170 10700675 10959231 11476218 1st Quartile 1 1 0. the class distribution will be highly skewed. Knowing the announcement and withdrawal times for a combination. we start with a ternary class label (values high.9897 0.9204 0. 5. 99%.99. Based on these observations and the signiﬁcance of ‘‘ﬁve nines’’ availability [1]. The difference in availability distributions for durations tl and tp not only depends on the value of tl/(tl + tp). This is validated by the fact that the ﬁrst quartile and mean of the combinations show a decreasing trend with these newly added combinations. The variance of the availability increases as lower values are added to the set of predominantly higher availability values.74
Table 3 Class distributions when discretizing availability.89 93.25 66.0018 0.02 1. 0.

7339 Low class Mean 22. We conjecture that this is because shorter preﬁxes represent larger. We show the means and variances of
Table 4 Attribute statistics of each class for learning period of tl = 19 days. we downsample this set of combinations (of about 11 million as in Table 1) to a set of 10. with the higher value for the low class. Assuming update frequency distribution in each month is a normally distributed random variable (valid because of
5. We divide the 10.000 combinations into ones that have high and low availability for the month and compute statistics for the attributes of each of the two groups. From [23]. We opt not to use information about to which AS a preﬁx belongs or the AS path to a preﬁx in this work. It is important to note that MTTF and MTTR are computed for the learning period. Furthermore. mean uptime and mean downtime. We evaluate the performance of the models with increasing number of learning instances in Section 6 and on larger test sets in Section 7. The MTTF of a high availability combination is higher than that of a low availability one by about 85% on average.3. and (4) mean time to recovery (MTTR). Although we compute the attributes of every combination with at least one recorded uptime or downtime. This shows that the attributes show a statistically signiﬁcant correlation with the availability class labels. Demarcating availability using attributes In this section. Further. it quickly comes back up (well within one second on average).844 58882. whereas the MTTR is almost 100% lower. which is the average number of updates observed for the combination in a time window of one hour (averaged over the learning period). we also compute update frequency. it is more likely that a longer preﬁx representing a smaller network goes down than a larger network. Downsampling does not signiﬁcantly affect the accuracy of models since prediction models typically learn well with a few hundred instances. We ﬁnd that the means of each of the four attributes are signiﬁcantly different at 1% signiﬁcance level for each of the four learning periods. or about 3% (which is statistically signiﬁcant) while the median and ﬁrst quartile differ in length by 1 and 2.e. it is known that popular destinations. Khosla et al.72 777. / Computer Networks 55 (2011) 873–889
879
The attributes of a combination are selected to relate to its availability (Section 5. It is important to note that the attributes we select do not necessarily cause high/low availability.41 3.R.2 0. have fewer updates. The consistency of the results across each of the four values of tl is convincing of the correlation between preﬁx length and availability class.76E+10 3. and hence the predicted availability for the time-disjoint test set is not a direct function of these values.7. we compute the following attributes for the learning period from routing updates observed through RouteViews: (1) preﬁx length. For most of the attributes.
all the attributes for a typical value of tl = 19 days in Table 4. respectively.0795915 Variance 4.52 0. we are looking for correlation not causality. more stable networks while small portions of the address space can be announced and withdrawn frequently for multihoming or load balancing purposes.92E+11 4. are stable. showing that these attributes correlate well with the availability class since availability computed over a longer duration is more indicative of the actual availability. We defer the investigation of how preﬁxes are similar across the same AS or neighboring ASes in the AS topology to future work. by recording the time when a combination goes up/down.57E+10 0.3).000 combinations with their attributes. We hypothesize that longer preﬁxes will have lower availability since they represent smaller networks which are more likely to go up or down. We use the paired t-test to test for equality of the means of each of the attributes of the two classes. and hence has fewer updates. We employ the Welch t-test [34. in addition to preﬁx length. The average frequency of updates observed for a high availability preﬁx is about 77% lower than for low availability ones. Correlation is sufﬁcient for a prediction model to be successful. and to be easily computable given the observed updates for the learning period so that the learning system is fast.5694
.04 1587480 0. Our intuition that the combinations with longer preﬁx lengths have lower availability is conﬁrmed.201002 0.. The mean preﬁx lengths of the high and the low availability classes usually differ by about 0. Hence. which are expected to have high availability. The difference becomes larger as tl increases.0244 Variance 6. The period of one hour is chosen so that the update frequency numbers are neither too large nor too small. (3) mean time to failure (MTTF). In summary. The difference in attribute values of the high and low classes increases with tl. The normality assumption is valid due to the Central Limit Theorem (CLT) and because we have about 3000–7000 samples in each class. their variances for the low class are higher because the class covers a wider range of availability values. (2) update frequency. These results are explained by the fact that a high availability combination stays up for a long period of time. we quantify whether the four attributes discussed in the last section indeed convey information about the availability class. called the mean time to failure (MTTF) and mean time to recovery (MTTR). An advantage of downsampling is the computational efﬁciency of building and testing the models. This result is intuitive: a high availability preﬁx has a long uptime before it fails. and when it does fail.35] which assumes that the two populations have normal distributions and their variances cannot be assumed to be equal (which is true for our data). Attribute High class Mean Preﬁx length MTTF (s) MTTR (s) Update frequency (/h) 22. This is because we want to keep our prediction model free from constraints of speciﬁc ASes or AS paths that can change. and use that to build and test models. respectively. i. we compute two additional attributes.25 2.

and P(e) is the proportion of times the values are expected to agree by chance. we construct a 99% t-Conﬁdence Interval (CI) for the average update frequency of a combination. The model is then learned using the known attributes and class labels of n À 1 folds (called the training set). Khosla et al. downsampled from all the combinations in each of four different months.36]. which plot the TPR versus the FPR. thus n = 10. but with diminishing returns. We describe the details of this scheme in Section 6.2. We do 10-fold incremental cross-validation as described in Section 5. and then ﬂattens after a certain number of instances is reached. where TP and TN are the true positives and negatives.03/hour and the variance is about 0.5. The dataset is divided randomly into n parts. While ROC curves work well for most classiﬁers. respectively. hence. The algorithm is run k times. 5. A better metric is obtained by using Receiver Operating Characteristic (ROC) curves [22]. each time with a different random seed so that different n folds are constructed in each run. generating a different set of 10 folds each time.880
R. Complete agreement corresponds to j = 1. since it has about as many TPs as FPs. The confusion matrix can be used to compute several performance measures. It is computed as: j ¼ PðoÞÀPðeÞ. The performance of each model is studied using n-fold incremental cross-validation. Thus. 5. respectively. A perfect classiﬁer has an AUC of 1. we work with 10. supervised sampling). deﬁned as: Accuracy ¼ TPþTNþFPþFN. Performance metrics We now describe the performance metrics used to evaluate a model when it is applied to the test set. The reason for comparison to a random classiﬁer is that we need to be sure that any learning-based model performs better than the random classiﬁer. which render these measures inappropriate. a trivial algorithm which predicts every availability value as high will have 90% accuracy on a dataset which has 90% high values.4 updates an hour. the most common of which is TPþTN accuracy. an option is to randomly order the instances predicted as high and low. As the number of instances to learn a model increases. The measures use data from both columns of a confusion matrix. they are not directly applicable for models which do not produce any ranking of instances in terms of probabilities of being classiﬁed as high and low. which acts as a baseline for comparison. and P FP FPR ¼ FP ¼ FPþTN. where 1ÀPðeÞ P(o) is the proportion of observed agreement between the observed and predicted values. Such a classiﬁer has an AUC of 0. called folds. The conclusion from this section is that the selected preﬁx attributes perform well in demarcating the availability classes. / Computer Networks 55 (2011) 873–889
CLT). we observe that there can be signiﬁcant class skew. we are 99% certain that it will have low availability. the model performance on test data typically improves. For example. the performance increases.
. Learning and evaluation We learn several models in this paper to predict the availability class of combinations. the class label high is treated as a positive class. for each training set size. and j = À1 indicates complete disagreement between the values. confusion matrix-based measures can be misleading with a skewed class distribution. and applied to predict the class labels of the remaining fold (the test set). The AUC of a classiﬁer is equivalent to the probability that it will rank a randomly chosen high instance higher than a randomly chosen low instance. The correlation of the attributes with the availability class is consistent with our intuition. A random classiﬁer randomly chooses either of the class labels with equal probability. and we report the mean value. and the label low is treated as a negative class. The true positive rate (TPR) and the false positive TP rate (FPR) are deﬁned as: TPR ¼ TP ¼ TPþFN. The mean update frequency of a combination. Model evaluation In this section. ROC curves are independent of class skew because they use a strict columnar ratio from the confusion matrix [37.5. one could effectively toss a coin and decide the class label. on average. A model which does not produce instance ranking has no threshold to vary.4. We use the area under the ROC curve (AUC) as a performance metric. The upper bound of the CI is computed to be about 1. and FP and FN are the false positives and negatives.5. which will be the best predictor. and hence are sensitive to the proportion of instances in the two columns [36]. which gives all possible combinations of the true and predicted class. correcting for agreement
that occurs by chance. if we observe more than an update for a combination in about 43 minutes. TPR) points.
6. From Table 3. We compare the results from our prediction models to those obtained using a random classiﬁer.1. A model is successively learned using increasing training set sizes (from each of the n training sets) and its performance on the test set is plotted against the training set size. whereas j = 0 for a random predictor. Thus. making a trivial predictor the best one. Each fold is left out at a time. resulting in n learned models and corresponding performance results. averaged over all 11. A typical shape of a learning curve is an increasing exponential. As mentioned in Section 5.. In what follows. N The Kappa statistic measures the agreement between predicted and observed values. The training and the test sets are disjoint in order to get an unbiased estimate of model error. which happens when the proportion of high availability (positive) and low availability (negative) instances in the sample are unequal. Unfortunately. we have nk performance values.4. it gives a single point in the ROC space instead of a curve.5 million combinations of Table 1 is about 0. we study three prediction models using the metrics in Section 5.28. and then rank them to produce a ROC curve. We conduct k = 5 runs. For such a model. while maintaining the class distribution of the dataset in the fold (i.e. This is because one plots a ROC curve by varying the threshold that decides between high and low to produce various (FPR. We study this using learning curves. Otherwise.000 combinations and their attributes. Any classiﬁcation algorithm can be studied using a confusion matrix.

3481 0. com-
Table 5 Results with the simple prediction model.10 67.9 0.9900 0. otherwise it is low. For example.1803 0.75 0.6606 0.1353 0. As tl/(tl + tp) increases.6728
7 days
19 days
30 days
puted for various values of tl and tl/(tl + tp) and averaged over nk = 50 runs.98 76.8 1
1 0.8 0. if the past availability exceeds 99.5 0.79 97.5315 0.2 0 0 0.6 0.6414 0. As the average accuracy in Table 5 is reasonably high.37 91.4 ROC plot 0.4082 0. Hence.01 78. and then the instances are ranked with the (predicted) highs higher than the lows.4 ROC plot 0. Simple prediction The simplest approach to predict the availability of a combination is based on the simplistic assumption that the future is the same as the past. As tl increases.9940 0. using Algorithm 3 of [36]. This is especially true when tl is small.2816 0.2242 0. tl 1 day tl / (tl + tp) 0. the typical run will give different AUC values when run with different random seeds.7154 FPR 0. we have 50 performance measures for each model averaged to give an output measurement. The performance of simple prediction is clearly better than a random classiﬁer for most cases. 1.4752 0.6713 0.9928 0. Khosla et al. We now use ROC-based metrics to evaluate this classiﬁer. the average of 50 different AUC values is reported in Table 5.1. Hence. so we modify the algorithm to draw a ROC curve.9 Accuracy (%) 88.3242 0.8085 0.4 0. tl/(tl + tp) = 0.2403 0. are listed in Table 5.9322 0. The plots show the original model performance (in Table 5) as a point (‘‘star’’) on the ROC plots. We take a typical run of the model with confusion matrix measures close to their average values. we use ROC metrics. 1.6 0.08 TPR 0.5231 0. this emphasizes the inadequacy of accuracy as a metric to evaluate performance models. Therefore.5261 0.5 0.7272 0.8933 0.5962 0.6 0. and record the TPR and FPR for each threshold.3133 0. 1(a).2 0 0 0.7961 0. the availability distribution becomes more diverse and hence the model typically performs worse.9271 0. These measures.5599 0.8 1
False Positive Rate (FPR)
False Positive Rate (FPR)
False Positive Rate (FPR)
Fig. The instances which are classiﬁed as high and low by the model are randomly reordered within their respective groups.R.25 0. The results show that while the TPR of the simple model is high. for a typical run (confusion matrix based measures close to their average values of 50 runs). This is a model where no instance ranking is performed. and simply predicts the availability of one combination at a time.9502 0. and hence future availability is quite different from past availability.9 0.8 1
True Positive Rate (TPR)
1 0.25 0.2 0 0 0.3207 0. This model does not learn based on other combinations.9 0. ROC plots for the simple prediction model. The model gives a single point in the ROC space (since it does not perform instance ranking).1575 0. However.8201 0.6 0.1 0.30 78.53 70.5720 0.6548 0.2726 0.8620 0. the accuracy of the model increases. The past availability of a combination is its availability during the learning period tl.4766 0.90 60.4 ROC plot 0.75 0.999%.1281 0.1355 0.9717 0.17 65.1917 0. the prediction problem becomes easier as more data is available for learning.6777 0. We vary the prediction threshold.45 71.5 0.7953 0.5877 0.6 0.6281 0.7410 0.1 0.7015 0.9950 0. We then investigate more sophisticated machine learning based models. The results highlight the importance of ROC curves.66 57.1 0.98 89. but merely predicts the same availability for a combination as the discretized value of its past availability.81 73.25 0.9224 0. only hard classiﬁcations are made. along with the performance of a random classiﬁer. like area under the ROC curve (AUC).02 73. as in Algorithm 2 of [36] to compute the points on a ROC curve.2641 0.8451 0. / Computer Networks 55 (2011) 873–889
881
Hence.6641 0.5813 0.6748 0.2 Average Performance Random Classifier 0. 6.1928 0.2 Average Performance Random Classifier 0. but there are occasions when it performs as good as or slightly worse than a random one as in Fig.75 0.7544 0.22 54.6163 0.99 98.1
True Positive Rate (TPR)
True Positive Rate (TPR)
1 0.8613 0.7160 0.87 83.1346
AUC 0.6224 0. classiﬁer A for tl = 30 days.8 0.6 0.1.25 0. this simple classiﬁer outperforms a random classiﬁer (as indicated by the j statistic) and hence forms a baseline model to which other sophisticated models can be compared.9107 0. We start with a simple baseline prediction model in Section 6.2955 0.3175 0.8 0. the predicted class label is high. its FPR is high as well.9911 0.4652 0.75 0.69 98. Because of inherent randomness in reordering and ranking the instances.6900 0.5774 0. This prediction approach does not learn a model based on other combinations. The AUC is computed. The ROC curves for the simple prediction model for some typical values of tl and tp are depicted in Fig.3641
j
0.4 0.4 0.6895 0.5670 0.2 Average Performance Random Classifier 0.7242 0. Thus.5 0. we compute confusion matrix-based measures. while its FPR reduces.8013 0.9421 0.8379 0.1022 0.60 96.3723 0.6326 0.5907 0.1 0.
.

3).9 0.6290 0.9 (Fig. and investigate whether the higher AUC values of the Naïve Bayes model are statistically signiﬁcant. The model uses the frequencies in the training set as estimates of the probability distributions of the attributes.6568 0.84 12.82 59.12 9.7
75 70 65 200 400
AUC Accuracy
0.6 0. The model computes.06 87. the period of the training and test sets differ by a few days to months (except when tl/(tl + tp) = 0.20 84.70 97. We ﬁnd that the null hypothesis of equality of the means is rejected for every month at 5% signiﬁcance level.85 0. Hence.95 54.7538
19 days
30 days
90 85
0.14 59. The accuracy values of Naïve Bayes are close to those of the simple model.94 98. . learning period.7394 0. The overall inferior performance of classiﬁer A is because it performs similar to a random classiﬁer for low FPRs.04 6. Naïve Bayes learning curves for tl = 30 days.12 8.5 0. and round it to the nearest integer for t table lookup using [38]. Table 7 shows the details of the test for some typical values of tl and tp.1 0.85 74.65 0. we compare the Naïve Bayes model to the simple model using ROC curves. Xm} based on Bayes rule [22].29 À9.61 À0. Naïve Bayes model The Naïve Bayes model predicts a high or low class label using the attribute vector X {X1.04 76. This ensures that the model is trained to its potential. and its performance is evaluated on the 10 different test sets produced by incremental crossvalidation.24 12.9
88.33 AUC % Change in AUC from simple model 14.09 77. and the sample variances computed using the nk = 50 data points. classiﬁer B is better.09 À0. classiﬁer B outperforms classiﬁer A.8 0. if our operating region is at low FPR.23 70. 1(c)) using AUC as the metric.9 0. Khosla et al. we see that for higher FPRs (around 0.75 0.6473 0.6956 0.66 98. tl/(tl + tp) = 0.59 11.034 À0. is not too different from the prediction period.25 0.35 77. except when learning from 30 days of data.6044 0. However.25 8. where for a smaller prediction period tp. .7173 0. If so.6990 0.288 0.89 11.9 0.e.95 8.1 0. where C is the class label.70 9. AUC.6341 0.40
0.8). instance ranking is naturally produced by the model which can be used to produce ROC curves.882
R. using both accuracy and AUC as performance measures.5 0.04 7.10 À1. For the same FPR.1.2. However.1 0. / Computer Networks 55 (2011) 873–889 Table 6 Results with Naïve Bayes model and% change from simple model.95 17.25 0. Hence. its TPR is higher and hence
Accuracy
.05
(Fig.75 0.75 0. When tl = 30 days.10 7.9.5 0. We now compare the Naïve Bayes model to the simple prediction model of Section 6. examining the ROC curve.03 83. Naïve Bayes would be a better prediction model than the simple model.25 0. 3. However.1 0.65 À1. The accuracy initially increases at a fast rate when the number of training instances is increased.6355 0. The results are given in Table 6. The plot for tl = 30 days and tl/(tl + tp) = 0.1 is illustrated in Fig.02 14.21 À1.09 À0.73 8. for each instance.7924 0. 2.9 0. . We plot a typical learning curve in Fig. We consider the better metric.35 4. using the mean values shown in Tables 5 and 6.29 80. This means that the AUC of the Naïve Bayes model indeed exceeds that of the simple model at 5% signiﬁcance level.38 46.8159 0.01 6.61 68. tl/(tl + tp) = 0. We use the Welch t-test [34.074 À0. We compute the degree of freedom m using the Welch–Satterthwaite equation.10 À1. the AUC remains relatively stable with the increase in number of training instances.
AUC
80
0.51 96. 1(b)) is worse than classiﬁer B for tl = 30 days.76 14.7173 0.75 0.7097 0. We use the accuracy and AUC as measures for comparison. We evaluate the model on each of the values of tl and tp using learning curves. i. we use the entire training set to train the Naïve Bayes model to achieve the maximum accuracy without sacriﬁcing AUC.444 0.6930 0.6761 0. and tapers off afterwards. tl tl/ (tl + tp) Accuracy (%) % Change in accuracy from simple model À0. The plots for the other time durations lead to similar conclusions. The results show that the Naïve Bayes model yields a higher AUC than the simple model for all cases.25 0.3). Finally. . This leads to different accuracies since this metric is highly dependent on class skew.5 1000
600
800
Training Set Size
Fig.7009 0. the model is often used even when its assumption does not hold due to its simplicity.96 6.7304 0.08 0. the accuracy is signiﬁcantly better with a high variance (around 26.
1 day
6. The model is learned on increasing size training sets.7009 0. using the training set to estimate P(XijC) and P(C).5 0.35] to test for equality of the performance measures (means) of the distributions of the two samples (simple and the Naïve Bayes). the probability of each class label given its attribute set and the independence assumption.. These estimated distributions are valid only when the period of parameter estimation. The ﬁgure shows that the Naïve Bayes model dominates the simple model throughout most of the ROC space.045 À0.31 4. It makes the ‘‘naïve’’ assumption that the attributes are conditionally independent given the class label.55 0.89 89. X2. whereas classiﬁer A is better for high FPRs. We perform the test on the AUCs of the two models for each of the four months. the accuracy is sig7 days
0.94 0. In what follows. while for a higher prediction period tp. 2.75
niﬁcantly lower with a low variance (around 2.5) and hence have different distributions.19 13.63 À19. This is because this model assumes that the attributes are conditionally independent given the class label.39 0.

2 0 0 0. tl/ (tl + tp) = 0. and for constructing a general enough tree to perform well on unseen test data. which uses reduction in entropy (measure of randomness) when splitting the instance set S based on an attribute A as the information gain metric to build the tree. This is a typical property of decision trees. It is also worth noting that this better performance in terms of TPR and FPR in the ROC space again points to the inadequacy of accuracy as a metric: even though the Naïve Bayes model has much lower accuracy than the simple model for these values of tl and tp.4 Naive Bayes 0.25 0. the disadvantage is that it uses less data for tree building. A method to reduce the variance of decision trees is to use bootstrap aggregating (bagging) [22]. Decision trees A decision tree is a recursive divide-and-conquer classiﬁer.99
1 0.1 0. However. ROC plots for Naïve Bayes and simple model for tl = 30 days.1 constructed with 200 training instances.75 7. While the decision trees shown all use MTTR as their root node.984 1.28
m
98 64 98 81
t-Value for 5% signiﬁcance 1. and rules to classify an instance can be read off the decision tree. An alternative is to use Reduced Error Pruning (REP) [40].8 0.5 0. since a small difference in the training data can cause different branches to be constructed. different trees use different numbers and values of attributes to make decisions. and ﬁnd that the accuracy and AUC metrics are not signiﬁcantly different among them. which divides the instances based on one attribute at a time in a top-down fashion until the leaves of the tree are reached [22]. despite its naïve assumptions. high variance.5 algorithm [22]. at very small training set sizes. Khosla et al. C4. Each baseline high variance predictor is learnt from different pseudosamples obtained by sampling with replacement from the training set. The right branches of all nodes are for a ‘‘Yes’’ decision and the left branches are for a ‘‘No’’ decision.
importance. 4. Nonetheless. holding out instances for REP can lead to insufﬁcient training data. which holds back some of the training data as a fold and uses that to estimate the error. We use the C4. This increases variance in classiﬁcation results.6 0. we decided to use REP because of the advantages of a tree which avoids overﬁtting and because we will work with sufﬁciently large datasets.984 1. We use the unpruned.5-pruned. causing mean results to appear worse. and one can choose to consider the unpruned tree.R.46 8.4 Simple Model 0.1. The implication of these results is that a model which learns based on other preﬁx combinations like the Naïve Bayes classiﬁer will typically outperform prediction without learning.2 Average Performance 0. it is better in ROC space. 4). For example.75 Statistic value 15. C4.998 1. tl 1 day 7 days 19 days 30 days tl/(tl + tp) 0. A decision tree can be easily transformed into if-then rules. with 200 training instances in each of the 10 folds.5 pruning (the default) uses an estimate of the error on the training set.3.8 classiﬁer implements the C4. Our results have a high variance.
6. the J4.8 1
False Positive Rate (FPR)
Fig.5 algorithm developed by Quinlan [39] to build decision trees. This classiﬁer has the advantage that it is interpretable. we ﬁnd decision trees with different structure and attribute values (two are shown in Fig. We apply the bagged decision tree classiﬁer to predict availability with the underlying baseline classiﬁer chosen to be decision
True Positive Rate (TPR)
Fig. tl/(tl + tp) = 0. This conﬁrms that availability is predictable using the attributes we measure. In Weka. 3. / Computer Networks 55 (2011) 873–889
883
it is closer to the ideal point in ROC space. Pruning the tree is necessary to avoid overﬁtting to the training data. predictors into a stable predictor. which results in lower AUC.
. and REP trees. or prune it based on different criteria. since the attributes of the classiﬁer are ranked from the root node downwards in the order of
Table 7 Paired t-test results of comparing AUC of Naïve Bayes model and the simple model. Decision trees for tl = 30 days.6 0. The advantage of REP is that it can lead to more unbiased estimates of the error. Bagging combines an ensemble of unstable.80 8.

.g.8 0. preﬁx length and update frequency MTTF. The results for the other values of tl and tp were similar albeit with different values. This result implies that one can predict long-term availability by learning from only a short learning period.e. and remove certain attributes of the combinations. If one desires higher performance. the performance of all the models improves. Based on the results.R.66.t.5 0. The plots show that for the same percentage learning duration. The decrease is much more steep when the learning duration tl is low. and this effect almost disappears when tl reaches 30 days. one should reduce the prediction duration for the same learning duration. The degradation in various performance metrics is studied. tl/ (tl + tp).67 À15. The plots also show that the crossover point between the performance of bagged decision trees and Naïve Bayes is about three weeks. i. as indicated above.8 À1. For example. We present typical results of removal of some of the attributes for tl = 30 days. 6. All percentage changes are w. as long as the period spans a few days. This gives further credence to the feasibility of availability prediction. We start with the bagged decision tree results from Table 8. e. respectively.75 À4. results of the corresponding models from Tables 6 and 8.00 À0.94 À3. 7..65 À1.5.r.
. a week. tl/(tl + tp) = 0. Lowering the percentage learning duration means that we have a shorter time to learn the attributes of various combinations.1. i. If we are learning our prediction models from the most recent week of data.12 À3.18 À9.4 Bagged Decision Trees Naive Bayes 0 5 Simple Model 10 15 20 25 30
Learning Duration (Days)
Learning Duration (Days)
Fig. Our prediction framework allows the system administrator to trade off prediction performance and prediction duration.93 À24. Classiﬁcation attributes We now study the importance of attributes in the prediction process. If these performance measures are acceptable. we can conclude that an availability prediction system using bagged decision trees can learn from a few days of routing data logs.96 À7. Beyond that value of tl. / Computer Networks 55 (2011) 873–889
885
There are two facets to this problem: the learning duration as a percentage of the overall period of interest. We choose these values of tl and tp as this represents a long enough learning period and a hard prediction problem: predicting availability for 9 times the learning duration. Effect of learning duration tl on prediction performance for different values of tl/(tl + tp). if we learn from a week of data. We also plot the change in the AUC for each of the models when increasing the learning duration tl.43 À14. we can slide our learning window by a day at a time to always learn from the most recent past week. 6. since there is more information available. e. as more learning data is available (higher tl).e.g. Two typical plots are shown in Fig. The results show that the prediction performance gracefully degrades as the amount of data available for learning is reduced. Naïve Bayes performs best.22 À18.4 0 5 Simple Model 10 15 20 25 30
0. as degradation increases. the importance of the removed attribute subset increases.1 in Table 9.7
0. tl/(tl + tp) = 0.33 0.6 0. so that less data is available to the prediction model.. increase the percentage learning duration. The plot of AUC against percentage learning duration for various values of the learning duration tl is shown in Fig. 7.03 À4. We choose AUC for comparison because of
Table 9 Percentage change in performance metrics with subsets of attributes for tl = 30 days..25
Past availability MTTF MTTR Preﬁx length Update frequency Preﬁx length and update frequency MTTF and MTTR MTTR. 1 day. leading to a reduction in prediction accuracy. one can triple the prediction duration at every stage. The bagged decision tree model performs the best for all learning duration percentages when the learning duration tl is less than around 3 weeks. we can predict the availability for three
weeks into the future maintaining this level of performance. Khosla et al.7
0. whereas increasing this percentage improves prediction results.97 À1.03 % Change in AUC for bagged decision trees À9. Our system can be adapted to real time deployment by sliding the time window of the learning period to always learn from the most recent data.6
AUC
AUC
0. by studying the effect of using different sets of attributes on the output metrics of Naïve Bayes and bagged decision trees.5 Bagged Decision Trees Naive Bayes 0.13 À1. Attributes used for prediction % Change in AUC for Naïve Bayes À12. Predicting the availability for about three times the learning duration gives accuracy and AUC of around 75% and 0. keeping the percentage learning duration tl/(tl + tp) constant. preﬁx length and update frequency
0. and the value of the learning duration itself. except when tl = 30 days.

we give the rationale for the selection of attributes to be the relation between these attributes and the availability of the preﬁx.. these results are poor compared to the prediction performance achieved previously. For example. and all preﬁxes which are announced according to similar policies will exhibit more predictable availability. we also use the attribute of past availability to build prediction models. unless the prediction attribute is weak.t. preﬁx) combinations for training the prediction models and for testing the effectiveness of the prediction techniques. Along with using subsets of the four attributes from Section 5. MTTF combined with the preﬁx length and update frequency give very close results to those obtained when MTTR was also added to the set. attributed to several known and unknown causes. If the preﬁx ﬂaps frequently with announcements and withdrawals.2. and are less correlated with changes in the announced or withdrawn state of preﬁxes. community. repeated announcements with different AS path do not change the Announced or Withdrawn status of preﬁxes.2. we conclude that past availability is not an adequate metric for prediction of future availability. Using either MTTF or MTTR with preﬁx length or update frequency. preﬁx length. (2) fraction of times AS path length changes over all announcements. Combining this with the results of the simple model. and all these ﬁve together for availability prediction using bagged decision trees. 6. We now investigate whether certain combinations are more predictable than others. Predictability of preﬁxes Thus far. This is explained by the fact that these attribute changes are due to AS policies for diverting trafﬁc to the inbound preﬁxes by modifying existing announcements. making a particular preﬁx group more predictable. There exists no subset or superset of the four attributes used that would cause signiﬁcantly better results than the four attribute set we have chosen. [13] predicted data plane failures using control plane updates and also observed that certain preﬁxes are more predictable than others. (4) fraction of time the community attribute changes over all announcements.g. and preﬁx length. MED. Although better than a random classiﬁer.886
R. 6. which we have not considered in this work. the old AS path. or MTTF and MTTR together. a high availability combination should have a high MTTF and low MTTR. / Computer Networks 55 (2011) 873–889
its strength as a performance metric as described earlier. It is also interesting to note that the Naïve Bayes model is much more sensitive to the removal of attributes than bagged decision trees. e. In Section 5. The intuition behind this is that the availability of a combination is more predictable from the attributes chosen in our work for certain kinds of preﬁxes than for others.. Preﬁx length and update frequency are weak attributes. we compute these attributes for each combination and use them for availability prediction.t. Additional attributes We now investigate whether the prediction accuracy can be improved if we add additional attributes that we have not considered in this work so far. The authors of [41] discovered both daily and weekly patterns in preﬁx announcements. and (5) fraction of time MED attribute changes over all announcements. we have used a random set of (peer. only Naïve Bayes (Section 6. The average AUC of prediction comes out to be only 0.g. can make better use of a single or a few attributes than Naïve Bayes by having ﬁner grained splits across the different training sets. Khosla et al. This is because we believe that these attributes are not signiﬁcantly related to availability as much as MTTF. further conﬁrming that MTTF is the strongest attribute (complemented by the use of preﬁx length and update frequency). Our methodology is motivated by [13]. MTTR. since the time to fail or recovery will characterize the availability of a combination. Performance signiﬁcantly degrades (AUC is 10–12% lower) when only past availability is used. These results lead us to the following conclusions. causes the AUC and accuracy to be within 4% of their values when no attribute is removed.55. update fre-
quency. It is intuitive that MTTF is the most important prediction attribute and MTTR is the next important. There can be several causes of BGP routing dynamics [41]. While attribute removal does hurt its performance. in contrast. averaged over all announcements.2) gives a probability of prediction of preﬁx availability as high or low based on its attributes. We also experimented with adding past availability to these attribute subsets and found that the performance did not change signiﬁcantly. There are other attributes of BGP updates [41. This attribute is used in the simple model and we seek to study the performance of machine learning-based prediction models which use this attribute to predict availability. with preﬁx length being the weakest since using it alone causes the AUC to decline by 7–25%. the trees formed based on other attributes are still reasonably accurate. Zhang et al. The additional attributes that we consider in this section are: (1) average AS path length percentage change of the changed AS path w. the results in Table 8 .r. affecting availability. We consider these attributes one at a time. The bagged decision tree model. this will be captured by our update frequency metric.42] such as AS path.1. The ﬁrst column of the table indicates which attributes of the combinations were used for prediction. a preﬁx can be withdrawn and announced with a speciﬁc pattern (e. dependent on time of day) for trafﬁc engineering purposes. and some causes are likely to be more correlated with availability. For example. (3) fraction of time the aggregator attribute changes over all announcements.5. While we leave detailed investigation of exact predictability classes of preﬁxes to future work. which affects availability. We use an option in Weka [22] to output the class prediction probabilities for each of the instances along with the true
. As usual. This is because Naïve Bayes model uses conditional independence of attributes and the Bayes rule to predict the class label. MTTF is the most important attribute since using it alone causes the least drop in AUC among any single item attribute set. we investigate whether there are more predictable combinations in our dataset. We ﬁnd that the AUC results are 16% poorer on the average across these six prediction cases w. Out of all the prediction models considered in this work. and aggregation.6.r.

6 0.25 0. 8.1.8565 0.62 31.6113 0. if one is interested in predicting the availability of a set of preﬁxes from a large number of vantage points in the Internet.5311 0.75. as illustrated in Table 11.5226 0. These results indicate a close to bimodal distribution of predictability of combinations. a 40. We chose this threshold of 0.4058 0.08 38.9 0. On the average.4581 0. The prediction takes only about 2 minutes to complete for each of the models on a 3.8837 0.95% difference in AUC exists between the two preﬁx sets.27. 3 54. We now investigate the scalability of our models. tl/(tl + tp) = 0. we run the bagged decision tree model from Section 6.16 59.4783 0.45]. This gives credence to the fact that some preﬁxes in combinations are very poor in predictability compared to others.9188 0. Pinc can never be less than 0. The plot shows that about 91% of the incorrectly classiﬁed instances have a class prediction probability above 0. the case is not borderline – the model almost surely predicts the incorrect class label.9 0.5394 % Difference in AUC of col.5 since a label is only predicted if its probability is greater than the other class label. this gives 15.4033 0. The results indicate a large difference in predictability between the two types of combinations.25 0.9 0.6624 0. we investigate the instances which were classiﬁed incorrectly. Across all the 20 cases.27 30.5 0.14 28.4893 0.5).75 0. To evaluate scalability. For each of the 20 sets of results from Table 6.5 0.45 2. 722 ‘‘poorly predicted’’ combinations.36
1 day
0.8558 0.01 29.5 0.593 0.6 0.8 0.25 0.5 0.36 30. We note the probability of incorrect classiﬁcation Pinc as the P(predicted class label – low or high) for the incorrectly classiﬁed instances output from the Naïve Bayes model. / Computer Networks 55 (2011) 873–889
887
availability class.45 32.73 24.R.25 0.12
.84 44.6 GHz single-core machine.5128 0.55 0.72 47.4848 0. Understanding the reasons behind varying preﬁx predictability has been
1 0.93 when they are incorrectly classiﬁed. We leave detailed investigation of the causes behind preﬁx predictability to future work.8608 0. We now seek to isolate the combinations which have poor prediction performance.7807 0.
Frequency
0.8626 0. To evaluate the prediction performance of the poorly predictable combinations versus the predictable ones.56 48.5501 0.74 51.33% of the total number of unique combinations for the 20 cases.75 0.75 since it is midway between 0.6971 0.3236 0. but predict the availability of all the remaining combinations in each month (about 11. which is a hard problem as well [43.40 34.1 0.8751 AUC for poorly predictable combinations 0. on the average.7 0.5777 0. Larger test datasets So far in this paper. tl tl / (tl + tp) AUC for predictable combinations 1 1 1 1 1 0.57 38.5 0.3 on both sets of combinations.7046 0. where we apply the learned models to a large number of combinations.75 0.52 À0. This implies that when a prediction error is made.5321 0.95% higher prediction performance (measured in terms of AUC) than the poorly predictable combinations.01 61. The CDF of Pinc is shown in Fig. we learn Naïve Bayes and bagged decision trees from 10.5 and 1 and we want to ignore combinations for which a slight prediction error is made.75 0. We look at the instances incorrectly classiﬁed by Naïve Bayes for all the 20 cases of Table 6 and isolate the combinations which have a probability of prediction of the (incorrect) class label exceeding 0.1 0. 000 combinations. We show the performance as indicated by AUC for both the predictable and poorly predictable combinations in Table 10. We conjecture that this due to the two types of reasons behind BGP dynamics: planned preﬁx trafﬁc engineering leading to speciﬁc update patterns.8384 0. We therefore conclude that our models are scalable
Table 11 Percentage change in performance metrics of a large prediction dataset from Tables 6 and 8 for tl = 30 days.8269 0.9
7 days
19 days
30 days
shown to be a difﬁcult problem [13] because of lack of information about AS policies and limited visibility to BGP updates from vantage points.3766 0.44.000 combinations using 10-fold cross-validation.85
0.1 0.864) and some which are poorly predictable (average AUC of around 0.4 0.75 0.65 0. the predictable combinations have 40. 4 from col.19 48. we have used training and test sets which are constructed out of a sample of 10.5589 0.
Performance metric AUC Accuracy
% Change for Naïve Bayes 1. 7.6074 0.8122 0. and.44 31.25 54. This may be required of a typical prediction application.1 0.5 million).2 0 0. Khosla et al. CDF of class label prediction probability for incorrectly classiﬁed instances using Naïve Bayes. There are some combinations which are highly predictable (having an average AUC of 0. and the nonstationary nature of link failures [13]. tl/(tl + tp) = 0.5144 0. 8.95
1
P(Class Label) on erroneous decisions
Fig.01
% Change for bagged decision trees À1.8
Table 10 Results for predictable and poorly predictable combinations obtained from bagged decision tree model.1) show about a 1–2% difference from the results in Tables 6 and 8.8469 0. This is similar to root cause identiﬁcation for BGP updates. The prediction results for a typical case (tl = 30 days. which is about 39.9 0.

Keralapura. Service availability: a new approach to characterize IP backbone topologies. is highly susceptible to a change in attributes. Naïve Bayes performs best. D. one can predict AS path ASP between the vantage point and the preﬁx P using iPlane [19] and then choose a RouteViews vantage point that has sufﬁcient overlap with ASP in AS path to P or a preﬁx in the same BGP atom as P. [2] M. ACM. Wang. leveraging previous work in this area. USA. 2005. [8] E. This extension (similar to the one used in [19]) may be inaccurate because of iPlane’s inaccuracy and the incomplete overlap of AS paths.sprintworldwide. <http://www.foxnews. has a high false positive rate and low AUC. Studying the inherent causes of predictability of preﬁxes is another topic of future work. H. 242– 253. End-to-end WAN service availability. which is also a reason for the worse performance of the simple model. but it provides a rough availability estimate. Kim. Our simple prediction model is a good baseline with high true positive rate and accuracy. These attributes are easily computable from public RouteViews data observed for a short period of time. 2008. [11] Akamai. as opposed to considering a single (peer. Internet optometry: assessing the broken glasses in internet reachability. The model. We recommend the use of bagged decision trees. http://www.com/english/solutions/sla/> (accessed April 2010). <http://www. We also aim to study availability across preﬁxes which are subpreﬁxes of other preﬁxes. Roughan.com/generic/0. pp. C. and update frequency as attributes.-N. New York. The results are promising given that we are using only public information about preﬁxes and that we are building our model using a random set of preﬁxes.outage/index. L. preﬁx) combination at a time. We can extend our framework to predict availability of an arbitrary end point as viewed by an arbitrary vantage point by using techniques similar to those used in [19]. pp. We also ﬁnd that mean time to failure is the most important attribute for prediction followed by mean time to recovery. [15] N. Balakrishnan. [16] University of Oregon. http://www. is unknown to RouteViews. 00. Computer Networks 46 (2004) 3–18. [10] Fox News. S. Nayate. ACM. pp. Feamster. Bhattacharyya. [9] Cable News Network (CNN).
. and the preﬁx. in: SIGCOMM ’06: Proceedings of the 2006 conference on Applications. <http:// searchnetworking. These results also show that the 10-fold cross-validation methodology does not suffer because of using a relatively low number (1000) of combinations in the test set. pp. and the latter performs best especially when the learning period tl is shorter than about 3 weeks. Zhang. improve the prediction quality.html>. in: Twelfth IEEE International Workshop on Quality of Service (IWQOS). 232–241. Conclusions and future work In this paper. [12] R. New York. ACM.M. 2008.B. O.att. Paxson.470224. S. 2004. learned from a moderate learning period of a week or two. IEEE/ACM Transactions on Networking 11 (2003) 300– 313. in: SIGMETRICS ’03: Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. We hope to uncover interesting patterns. A framework for measuring and predicting the impact of routing changes. Our framework learns a prediction model from a set of Internet preﬁxes. this is the ﬁrst work that uses the similarity of preﬁx behavior in the Internet to predict properties such as availability. Bush. A measurement study on the impact of routing events on end-to-end internet path performance. Zmijewski. Sprint service level agreements. The Naïve Bayes model. The learning period can be a sliding window which slides with a granularity of a few days so as to feed the model with the most recent data for learning. and protocols for computer communications. Providing end-to-end service level agreements across multiple ISP networks.
Other future work plans include considering availability of a preﬁx from multiple peers. Kaashoek.com/2008/WORLD/meast/02/01/ internet. H. 2008. Mao.G. If the vantage point is also arbitrary.sid7_gci1064318. NY. <http://edition. R. we can use any other known preﬁx in the same BGP atom [46] as the original one. We plan to understand the correlation between the control and data planes. pp. a study on control plane availability is not complete without studying the data plane. NY. If the vantage point contributes to RouteViews. Gao. Wang.routeviews. Past availability is inadequate to predict future availability. 2007. we have developed a long-term availability prediction framework that uses mean time to recovery. Route Views Project. Finally.
8.M. 2008. USA. M. AT&T High Speed Internet Business Edition Service Level Agreements. Internet failure hits two continents. For longer learning periods. J. 375–386. however. and possibly exploit locality for feature prediction or for preﬁx clustering. Mideast outage. Pongpaibool. Wang. in: Proceedings of the 20th Annual FIRST Conference. Severed Cables Cut Egypt’s Internet Access Once Again. without signiﬁcant degradation in prediction quality. End-to-end routing behavior in the Internet. We will also investigate additional preﬁx attributes in our prediction model. References
[1] John Shepler. Z. mean time to failure. [13] Y. [7] R. M. however.com/story/0. such as the ASes to which the preﬁxes belong. B. Andersen.V. New York. 339–347. L. Chuah.F. IEEE/ACM Transactions on Networking 5 (1997) 601–615. Gao.00. 2009. since it gives a deeper insight into why some preﬁxes are more predictable than others. Uhlig. <http:// www. 126–137. architectures. To the best of our knowledge. which should be true unless there is a different BGP policy. to predict availability for longer future durations.com/mideastoutage>. [3] V. Khosla et al. Chandra. Dahlin. 2003. preﬁx length. Bush.akamai. 2006. modifying the long-term availability metric to incorporate the time varying nature of announcement of these preﬁxes. org/ (accessed April 2010). NY. Measuring the effects of internet path faults on reactive routing.html>.2933.html>. The Holy Grail of ﬁve-nines reliability. Mao. [14] F. Our results show that future availability is indeed predictable.techtarget. and the AS paths to the preﬁxes. [6] P. in: IMC ’09: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. [5] Sprint. A. G. Z.295582. Iannaccone. Maennel.com/gen/general?pid=6622 (accessed April 2010). Naïve Bayes and bagged decision trees improve on these metrics. J. / Computer Networks 55 (2011) 873–889
for availability prediction of a large number of combinations. This assumes that updates for two preﬁxes with the same AS path to a vantage point will be the same. We quantify how preﬁx availability is related to preﬁx length and update frequency.S.cnn. for which availability prediction is desired. USA. in: INFOCOM.888
R. Threats to Internet routing and global connectivity. [4] AT&T. technologies. and uses that model to predict availability of other preﬁxes.