Findings from Site Studies
Traffic mix (which protocols are used; how many connections/bytes they contribute) varies widely from site to site. The mix also varies at the same site over time.
Most connections have much heavier traffic in one direction than the other:
–Even interactive login sessions (20:1)

Findings from Site Studies, cont.
Many random variables associated with connection characteristics (sizes, durations) are best described by log-normal distributions:
–But often these are not particularly good fits
–And often their parameters vary significantly between datasets
The largest connections in bulk transfers are very large:
–Tail behavior is unpredictable
Many of these findings differ from assumptions used in 1990s traffic modeling.
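As a concrete (toy) illustration of what such a fit involves, the sketch below draws synthetic connection sizes from a log-normal and recovers the parameters from the logarithms of the observations. The parameters (mu = 8, sigma = 2) are invented for illustration, not taken from any study.

```python
import math
import random
import statistics

random.seed(42)

# Synthetic connection sizes in bytes; parameters are illustrative only.
sizes = [random.lognormvariate(8.0, 2.0) for _ in range(10_000)]

# Fitting a log-normal reduces to estimating the mean and standard
# deviation of the *logarithms* of the observations.
logs = [math.log(s) for s in sizes]
mu_hat = statistics.mean(logs)       # estimate of mu    (true value: 8.0)
sigma_hat = statistics.stdev(logs)   # estimate of sigma (true value: 2.0)

# The log-normal median is exp(mu); comparing it to the empirical
# median is one quick sanity check of the fit.
fitted_median = math.exp(mu_hat)
empirical_median = statistics.median(sizes)
```

On real datasets the same procedure yields parameters that shift noticeably between sites and epochs, which is exactly the caveat this slide raises.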

Burstiness Over Many Time Scales
Real traffic has strong, long-range correlations.
Power spectrum:
–Flat for Poisson processes
–For measured traffic, diverges to ∞ as f → 0
Building Poisson-based models that capture this characteristic takes many parameters.
But given the great variation in Internet traffic, we are desperate for parsimonious models (few parameters).
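One standard way to see the difference is a variance-time check: for a Poisson process, the variance of the block-averaged counts falls as 1/m with aggregation level m, while self-similar traffic with Hurst parameter H > 1/2 decays more slowly, as m^(2H-2). The toy sketch below (synthetic Poisson traffic only; the rate is illustrative) shows the Poisson baseline:

```python
import random
import statistics

random.seed(1)

def poisson_counts(rate, n_intervals):
    """Per-interval packet counts for a Poisson arrival process,
    generated by walking exponential inter-arrival times."""
    counts = [0] * n_intervals
    t = random.expovariate(rate)
    while t < n_intervals:
        counts[int(t)] += 1
        t += random.expovariate(rate)
    return counts

def aggregated_variance(counts, m):
    """Variance of the block-averaged process at aggregation level m."""
    blocks = [sum(counts[i:i + m]) / m
              for i in range(0, len(counts) - m + 1, m)]
    return statistics.pvariance(blocks)

counts = poisson_counts(rate=5.0, n_intervals=100_000)
v1 = aggregated_variance(counts, 1)
v100 = aggregated_variance(counts, 100)
# For Poisson (H = 1/2), v1/v100 is about 100; long-range dependent
# traffic would show a much smaller ratio (slower variance decay).
```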

Self-Similarity & Heavy Tails, cont.
We find heavy-tailed sizes in many types of network traffic: just a few extreme connections dominate the entire volume.
Theorems then give us that this traffic aggregates to self-similar behavior.
While self-similar models are parsimonious, they are not (alas) simple:
–You can have self-similar correlations for which the magnitude of variations is small
–It is still possible to have a statistical multiplexing gain, especially at very high aggregation
Smaller time scales behave quite differently:
–When very highly aggregated, they can appear Poisson!
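The dominance of a few extreme connections can be made concrete with synthetic data: below, a heavy-tailed (Pareto, shape 1.2) sample is contrasted with a light-tailed (exponential) sample of the same scale. All parameters are illustrative.

```python
import random

random.seed(7)

def top_share(sizes, fraction=0.01):
    """Fraction of total bytes contributed by the largest `fraction`
    of connections."""
    ranked = sorted(sizes, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

n = 100_000
# Heavy-tailed: Pareto with shape alpha = 1.2 (finite mean, infinite
# variance), sampled by inverting the CDF.
heavy = [1000.0 / (1.0 - random.random()) ** (1 / 1.2) for _ in range(n)]
# Light-tailed: exponential with the same scale, for contrast.
light = [random.expovariate(1 / 1000.0) for _ in range(n)]

heavy_share = top_share(heavy)   # large: the tail dominates the volume
light_share = top_share(light)   # small: no single connection matters much
```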

End-to-End Dynamics
Ultimately what the user cares about is not what's happening on a given link, but the concatenation of behaviors along all of the hops in an end-to-end path.
Measurement methodology: deploy measurement servers at numerous Internet sites and measure the paths between them.
Exhibits N² scaling: as the number of sites grows, the number of paths between them grows rapidly.
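The scaling claim is simple arithmetic, sketched below; paths are counted as ordered pairs because, as a later slide shows, Internet routes are often asymmetric.

```python
def ordered_paths(n_sites):
    """Distinct ordered source/destination pairs among n measurement
    sites; ordered because routes are often asymmetric, so the A->B
    and B->A paths must be measured separately."""
    return n_sites * (n_sites - 1)

# Roughly doubling the sites roughly quadruples the measurement work:
small = ordered_paths(18)   # 306 ordered paths
full = ordered_paths(37)    # 1332 ordered paths for a 37-site mesh
```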

End-to-End Routing Dynamics
Analysis of 40,000 traceroute measurements between 37 sites, covering 900+ end-to-end paths.
Route prevalence:
–Most end-to-end paths through the Internet are dominated by a single route
Route persistence:
–2/3 of routes remain unchanged for days/weeks
–1/3 of routes change on time scales of seconds to hours
Route symmetry:
–More than half of all routes visited at least one different city in each direction
Very important for tracking connection state inside the network!

End-to-End Packet Dynamics
Analysis of 20,000 TCP bulk transfers of 100 KB each between 36 sites, each traced at both ends using tcpdump.
Benefits of using TCP:
–Real-world traffic
–Can probe fine-grained time scales, albeit subject to congestion control
Drawbacks of using TCP:
–Endpoint TCP behavior is a major analysis headache
–TCP's loading of the transfer path also complicates analysis

End-to-End Packet Dynamics: Loss
Half of all 100 KB transfers experienced no loss:
–2/3 of paths within the U.S.
The other half experienced significant loss:
–An average of 4-9%, but with wide variation
TCP loss is not well described as independent; losses were dominated by a few long-lived outages.
(Keep in mind: this is 1994-1995!)
Subsequent studies:
–Loss rates have gotten much better
–Loss episodes are well described as independent
–The same holds for regions of stable delay and throughput
–Time scales of constancy are minutes or more

There is No Such Thing as Typical
Heterogeneity in:
–Traffic mix
–Range of network capabilities
  Bottleneck bandwidth (orders of magnitude)
  Round-trip time (orders of magnitude)
–Dynamic range of network conditions
  Congestion / degree of multiplexing / available bandwidth
–Proportion of traffic that is adaptive/rigid/attack
Immense size & growth:
–Rare events will occur
–New applications explode on the scene

The Search for Invariants
In the face of such diversity, identifying things that don't change has immense utility.
Some Internet traffic invariants:
–Daily and weekly patterns
–Self-similarity on time scales of 100s of msec and above
–Heavy tails, both in activity periods and elsewhere (e.g., topology)
–Poisson user session arrivals
–Log-normal sizes (excluding tails)
–Keystrokes have a Pareto distribution

Versus the Power of Modeling to Open Our Eyes
Fowler & Leland, 1991: traffic spikes (which cause actual losses) ride on longer-term ripples, which in turn ride on still longer-term swells.
–They lacked the vocabulary that came from self-similar modeling (1993).
Similarly, the 1993 self-similarity paper: "We did so without first studying and modeling the behavior of individual Ethernet users (sources)."
–Modeling led to the suggestion to investigate heavy tails.

Measurement Soundness
How well-founded is a given Internet measurement? We can often use additional information to help calibrate.
One source: protocol structure.
–E.g., was a packet dropped by the network ... or by the measurement device?
–For TCP, we can check: did the receiver acknowledge it?
  If yes, then it was dropped by the measurement device
  If no, then it was dropped by the network
We can also calibrate using additional information.
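A sketch of the TCP-based calibration check described above (the function and argument names are hypothetical, and a real analysis must also rule out a retransmission having produced the ACK):

```python
def classify_missing_packet(seq, length, receiver_acks):
    """Classify a data packet absent from the monitor's trace.

    seq, length:    TCP sequence number and payload length of the packet.
    receiver_acks:  cumulative ACK values subsequently seen from the receiver.

    If the receiver later acknowledged past the end of this segment, the
    data must have arrived, so the measurement device dropped the packet;
    otherwise we attribute the drop to the network.  (Simplified: a real
    analysis must also check whether a retransmission filled the gap.)
    """
    end = seq + length
    if any(ack >= end for ack in receiver_acks):
        return "dropped by measurement device"
    return "dropped by network"
```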

Reproducibility of Results (or Lack Thereof)
It is rare, though it sometimes occurs, that raw measurements are made available to other researchers for further analysis or confirmation.
It is rarer that analysis tools and scripts are made available, particularly in a coherent form that others can actually get to work.
It is even rarer that measurement glitches, outliers, analysis fudge factors, etc., are detailed.
In fact, researchers often cannot reproduce their own results.

Towards Reproducible Results
We need a systematic approach to data reduction and analysis:
–I.e., a paper trail for how the analysis was conducted, particularly when bugs are fixed
A methodology for doing this:
–Enforce the discipline of using a single (master) script that builds all analysis results from the raw data
–Maintain all intermediary/reduced forms of the data as explicitly ephemeral
–Maintain a notebook of what was done and to what effect
–Use version control for scripts & notebook
–But we also really need ways to visualize what's changed in analysis results after a re-run
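One possible (entirely hypothetical) realization of this discipline in miniature: a single driver rebuilds every result from the raw data, treats intermediates as regenerable and disposable, and appends a provenance record to a notebook.

```python
import hashlib
import json
import time

NOTEBOOK = []        # append-only paper trail (in practice, a versioned file)
INTERMEDIATES = {}   # reduced data: explicitly ephemeral, never hand-edited

def run_stage(name, func, data):
    """Run one analysis stage from scratch and record its provenance."""
    result = func(data)
    blob = json.dumps(result, sort_keys=True)
    INTERMEDIATES[name] = blob                    # regenerated on every run
    digest = hashlib.sha256(blob.encode()).hexdigest()[:12]
    NOTEBOOK.append(f"{time.strftime('%Y-%m-%d')} {name} sha256:{digest}")
    return result

# The "master script": every result derives from the raw data, in order,
# so a bug fix simply means re-running the whole pipeline.
raw_sizes = [512, 1048576, 40, 73000]             # toy "raw measurements"
reduced = run_stage("reduce", sorted, raw_sizes)
summary = run_stage("summarize",
                    lambda xs: {"n": len(xs), "max": max(xs)}, reduced)
```

Hashing each stage's output also gives a cheap way to see *whether* a re-run changed any result, a first step toward the visualization need noted above.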

Magnitude of Internet Attacks
As seen at Lawrence Berkeley National Laboratory on a typical day in 2004:
–More than 70% of Internet connections (20 million out of 28 million) reflect clear attacks
–60 different remote hosts each scan one of LBL's two blocks of 65,536 addresses in its entirety
–More than 10,000 remote hosts engage in scanning activity
Much of this activity reflects worms; much of the rest reflects automated scan-and-exploit tools.

Design Goals for the Bro Intrusion Detection System
–Monitor traffic in a very high performance environment
–Real-time detection and response
–Separation of mechanism from policy
–Ready extensibility of both mechanism and policy
–Resistance to evasion

How Bro Works
Bro passively taps a GigEther fiber link, which sends up a copy of all network traffic.
[Diagram: Network]

The Problem of Evasion
A fundamental problem with passively measuring traffic on a link: network traffic is inherently ambiguous.
This is generally not a significant issue for traffic characterization, but it is in the presence of an adversary: attackers can craft traffic to confuse/fool the monitor.

The Problem of Crud
There are many such ambiguities attackers can leverage. Unfortunately, they occur in benign traffic, too:
–Legitimate tiny fragments, overlapping fragments
–Receivers that acknowledge data they did not receive
–Senders that retransmit different data than they originally sent
In a diverse traffic stream, you will see all of these.
Approaches for defending against evasion:
–Traffic normalizers that actively remove ambiguities
–Mapping of local hosts to determine their behaviors
–Active participation by local hosts in intrusion detection
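The third ambiguity above, a sender retransmitting different data than it originally sent, can be detected by remembering payload bytes per sequence number, roughly as a monitor or normalizer might. A sketch with hypothetical names:

```python
def detect_inconsistent_retransmission(stream, seq, payload):
    """Record payload bytes by TCP sequence number and report any byte
    that is later retransmitted with *different* contents -- exactly the
    ambiguity an attacker can exploit, since the monitor may accept one
    version while the end host accepts the other."""
    conflicts = []
    for i, byte in enumerate(payload):
        prev = stream.get(seq + i)
        if prev is not None and prev != byte:
            conflicts.append(seq + i)
        stream[seq + i] = byte      # keep the most recent version
    return conflicts

stream = {}
first = detect_inconsistent_retransmission(stream, 100, b"GET /index")
# A "retransmission" of sequence range 104-107 with different bytes:
bad = detect_inconsistent_retransmission(stream, 104, b"evil")
```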

What is a Worm?
Self-replicating/self-propagating code that spreads across a network by exploiting flaws in open services.
–As opposed to viruses, which require user action to spread
Not new: the Morris Worm (Nov. 1988) infected 6-10% of all Internet hosts.
Many worms since, but none on that scale ... until ...

Code Red, cont.
A revised version was released July 19, 2001.
Payload: a flooding attack on www.whitehouse.gov.
A bug led to the worm dying on the 20th of each month.
But: this time the random number generator was correctly seeded. Bingo!

What if Spreading Were Well-Designed?
Observation (Weaver): much of a worm's scanning is redundant.
Idea: coordinated scanning.
–Construct a permutation of the address space
–Each new worm instance starts at a random point
–A worm instance that encounters another instance re-randomizes
Greatly accelerates the worm in its later stages.
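The coordination idea can be sketched as a toy simulation (sizes and seeds are arbitrary; for simplicity, instances keep walking rather than re-randomizing, and we track address-space coverage rather than actual infection):

```python
import random

def permutation_scan(space_size, n_worms, seed=0):
    """Toy model of coordinated permutation scanning: every instance
    walks the *same* pseudorandom permutation of the address space,
    each from its own random starting point, so instances mostly cover
    disjoint stretches.  (A real worm would derive the permutation from
    a block cipher and re-randomize on meeting another instance.)"""
    rng = random.Random(seed)
    perm = list(range(space_size))
    rng.shuffle(perm)                                  # shared permutation
    positions = [rng.randrange(space_size) for _ in range(n_worms)]
    scanned, probes = set(), 0
    while len(scanned) < space_size:
        for i in range(n_worms):
            scanned.add(perm[positions[i] % space_size])
            positions[i] += 1
            probes += 1
            if len(scanned) == space_size:
                break
    return probes

def random_scan(space_size, seed=0):
    """Uncoordinated scanning: every probe picks a uniformly random
    address, so late-stage probes are mostly redundant
    (coupon-collector behavior, roughly n ln n probes)."""
    rng = random.Random(seed)
    scanned, probes = set(), 0
    while len(scanned) < space_size:
        scanned.add(rng.randrange(space_size))
        probes += 1
    return probes

coordinated = permutation_scan(2000, 8)    # covers the space efficiently
uncoordinated = random_scan(2000)          # wastes many late-stage probes
```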

What if Spreading Were Well-Designed?, cont.
Observation (Weaver): accelerate the initial phase using a precomputed hit-list of, say, 1% of the vulnerable hosts.
–At 100 scans/worm/sec, a huge population can be infected in a few minutes.
Observation (Staniford): compute a hit-list of the entire vulnerable population and propagate via divide & conquer.
–At 10 scans/worm/sec, infection takes only 10s of seconds!
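These back-of-the-envelope claims follow from the classic logistic ("random constant spread") model of worm growth. The sketch below uses illustrative numbers (300,000 vulnerable hosts, an IPv4-sized address space), so the absolute times are only indicative:

```python
import math

def time_to_infect(scans_per_sec, vulnerable, initial=1,
                   address_space=2**32, target=0.99):
    """Seconds for a random-scanning worm to reach `target` of the
    vulnerable population under logistic growth: each infected host
    finds fresh victims at rate scans_per_sec * vulnerable /
    address_space, and growth saturates as victims are exhausted."""
    r = scans_per_sec * vulnerable / address_space
    a0 = initial / vulnerable
    # Invert a(t) = a0 e^{rt} / (1 - a0 + a0 e^{rt}) for a(t) = target.
    return math.log((target / (1 - target)) * ((1 - a0) / a0)) / r

VULN = 300_000   # illustrative vulnerable population (Code Red scale)

single_seed = time_to_infect(100, VULN, initial=1)
hit_list = time_to_infect(100, VULN, initial=VULN // 100)  # 1% hit-list
```

The hit-list's benefit is removing the slow exponential ramp-up from a single seed; Staniford's full hit-list goes further by eliminating blind scanning altogether.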

Next-Generation Worm Authors
Potential for major damage with nastier payloads :-(
–Military (cyberwarfare)
–Criminals:
  Denial-of-service, spamming for hire
  "Access for Sale: A New Class of Worm" (Schechter/Smith, ACM CCS WORM 2003)
Money on the table leads to an arms race.

Summary
Internet measurement is deeply challenging:
–Immense diversity
–The Internet never ceases to be a moving target
–Our mental models can betray us: the Internet is full of surprises!
Seek invariants.
Many of the last decade's measurement questions -- What are the basic characteristics and properties of Internet traffic? -- have returned ... but now regarding Internet attacks.
What on Earth will the next decade hold??