Comparing Ways to Analyze Network Failures

San Diego, Oct. 21, 2013 -- "Unfortunately, the Internet's architecture does not include comprehensive failure measurement as a first-class capability, so network administrators use a variety of tools to track and understand failures," according to a paper* to be presented this Friday, Oct. 25, at the ACM Internet Measurement Conference (IMC) in Barcelona.

[Photo: Recent CSE Ph.D. recipient Daniel Turner]

Recent Computer Science and Engineering (CSE) alumnus Daniel Turner (Ph.D. '13) will make the presentation, and most of his co-authors will be on hand. They include advisor Stefan Savage and research scientist Kirill Levchenko (CSE Ph.D. '08). The one co-author who will not be in Barcelona is Turner's other Ph.D. advisor, professor Alex Snoeren.

In exploring the most efficient ways to report and analyze network failures, the group had previously proposed that a combination of common data sources could do the trick. By stitching together data from router configuration logs and syslog messages, they demonstrated that a relatively detailed picture of network failures could be resolved, with backstopping from trouble tickets filed by operators of a network.

In the current paper, "A Comparison of Syslog and IS-IS for Network Failure Analysis," they go a step further, comparing syslog analysis with an analysis of contemporaneous Intermediate System to Intermediate System (IS-IS) routing protocol messages. IS-IS is a link-state interior gateway protocol (IGP): routers within a network use it to tell one another about the state of their links, so its messages record when a link goes down or comes back up. The CSE researchers had previously looked at syslog for five years' worth of failures on the Corporation for Education Network Initiatives in California (CENIC) regional network.

This time, the authors used a shorter, 13-month period of activity on the CENIC network and found tradeoffs between router syslog data and real-time IS-IS routing protocol updates. They found that syslog analysis did not capture 20% of the failures identified by IS-IS data sources. The syslog approach also fell short in identifying failures lasting more than 24 hours, which often turn out to be "false positives" caused by lost syslog messages. In one reported case, syslog indicated that a site was isolated from the network for 17 hours -- yet according to IS-IS analysis, the site was actually isolated for less than one minute. In another case, a site was isolated for seven hours, but syslog only identified the problem a scant nine seconds before it ended!
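To see why a lost message can inflate a failure, consider a minimal sketch in Python that rebuilds failure intervals by pairing each "down" event with the next "up" event on the same link. This is not the authors' code: the function name failure_intervals, the link name "a-b", the event format, and all timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def failure_intervals(events):
    """Pair each 'down' event with the next 'up' event on the same link.

    events: iterable of (timestamp, link, state), where state is 'down' or 'up'.
    Returns a list of (link, start, end) tuples, ordered by end time.
    """
    open_failures = {}  # link -> timestamp of the earliest unmatched 'down'
    intervals = []
    for ts, link, state in sorted(events):
        if state == "down":
            # Keep the earliest 'down'. A second 'down' before any 'up'
            # is treated as part of the same failure.
            open_failures.setdefault(link, ts)
        elif state == "up" and link in open_failures:
            intervals.append((link, open_failures.pop(link), ts))
    return intervals

# Hypothetical data: two 30-second failures on one link, six hours apart.
t0 = datetime(2013, 1, 1)
complete = [
    (t0, "a-b", "down"),
    (t0 + timedelta(seconds=30), "a-b", "up"),
    (t0 + timedelta(hours=6), "a-b", "down"),
    (t0 + timedelta(hours=6, seconds=30), "a-b", "up"),
]
# The same log with the first 'up' message lost in transit.
lossy = [e for i, e in enumerate(complete) if i != 1]

for name, log in (("complete", complete), ("lossy", lossy)):
    durations = [str(end - start) for _, start, end in failure_intervals(log)]
    print(name, durations)
```

With the complete log, the sketch reports two 30-second failures. With the "up" message missing, it reports a single failure lasting more than six hours, the same kind of overstatement described above.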

"Our comparison of syslog to IS-IS is intended to be both descriptive and prescriptive," the authors note. In the end, roughly one-quarter of all events reported by one data source did not appear in the other. While IS-IS was determined to be more accurate, the authors recognized that syslog omissions largely involved short failures, while other network properties obtained through syslog analysis "are reasonably accurate."
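One simple way to arrive at a figure like "events reported by one source that did not appear in the other" is to match failures on the same link whose time ranges overlap. The Python sketch below does that under stated assumptions: the function name unmatched_fraction, the 10-second slack, and the sample data are hypothetical, and the matching rule is not necessarily the one used in the paper.

```python
def unmatched_fraction(source, reference, slack_seconds=10.0):
    """Fraction of failures in `source` with no counterpart in `reference`.

    Each failure is a (link, start, end) tuple with times in seconds.
    Two failures match when they are on the same link and their time
    ranges overlap, allowing `slack_seconds` of clock disagreement.
    """
    def matches(a, b):
        same_link = a[0] == b[0]
        overlap = a[1] <= b[2] + slack_seconds and b[1] <= a[2] + slack_seconds
        return same_link and overlap

    if not source:
        return 0.0
    missing = sum(1 for f in source if not any(matches(f, g) for g in reference))
    return missing / len(source)

# Made-up failure lists: the syslog list lacks two short failures.
isis = [("a-b", 0, 30), ("a-b", 100, 105), ("c-d", 200, 260),
        ("c-d", 500, 503), ("e-f", 900, 960)]
syslog = [("a-b", 1, 29), ("c-d", 201, 262), ("e-f", 905, 958)]

print(unmatched_fraction(isis, syslog))   # share of IS-IS failures syslog missed
print(unmatched_fraction(syslog, isis))   # share of syslog failures IS-IS missed
```

In this toy data, the two failures missing from syslog are both only a few seconds long, mirroring the observation that syslog omissions largely involve short failures.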

Their conclusion? "In sum, syslog-based analyses may be useful for capturing aggregate failure characteristics where IGP data is not available," according to the paper. "It is less well-suited to situations requiring more precise failure-to-failure accounting."