Abstract

Real-world measurements play an important role in understanding the characteristics and in improving the operation of BitTorrent, which is currently a popular Internet application. Much like measuring the Internet, the complexity and scale of the BitTorrent network make a single, complete measurement impractical. While a large number of measurements have already employed diverse sampling techniques to study parts of the BitTorrent network, until now there has been no investigation of their sampling bias, that is, of their ability to objectively represent the characteristics of BitTorrent. In this work we present the first study of the sampling bias in BitTorrent measurements. We first introduce a novel taxonomy of sources of sampling bias in BitTorrent measurements. We then investigate the sampling bias among fifteen long-term BitTorrent measurements completed between 2004 and 2009, and find that different data sources and measurement techniques can lead to significantly different measurement results. Last, we formulate three recommendations to improve the design of future BitTorrent measurements, and estimate the cost of using these recommendations in practice.

Citations

...he last decade into empirical measurements of P2P file-sharing systems including BitTorrent, with the purpose of understanding and improving their use. Similarly to early Internet measurement efforts =-=[7, 8]-=-, due to the size of the complete network all BitTorrent measurements have employed data sampling techniques, from periodic measurements to the focus on specific BitTorrent communities. Despite this s...

... 1 Introduction Peer-to-Peer file-sharing networks such as BitTorrent serve tens of millions of users daily and are responsible for a significant percentage of the total Internet traffic. Much effort =-=[1, 2, 3, 4, 5, 6]-=- has been put in the last decade into empirical measurements of P2P file-sharing systems including BitTorrent, with the purpose of understanding and improving their use. Similarly to early Internet me...

...g the measurement results. In the Internet community, this “search for invariants” process [7] fostered many new research opportunities [8]. From the large number of empirical BitTorrent measurements =-=[2, 3, 9, 10]-=-, few [9, 10] consider even aspects of the sampling bias problem. Second, understanding sampling biases leads to better understanding of the usage of measurement techniques, which is key to designin...

... the complete dataset. – The Error/deviation of values metric, which mimics traditional statistical approaches for comparing probability distributions of random variables. The Kolmogorov-Smirnov test =-=[15]-=- uses the D characteristic to estimate the maximum distance between the cumulative distribution functions (CDFs) of two random variables. Similarly, we use the D characteristic to compare the measured...

...ataset comprising every message exchanged between the peers of a BitTorrent community of significant size. Thus, and similarly with the situation of exposing sampling biases for Internet measurements =-=[8, 14]-=-, we need to trace the presence of sampling bias without a ground truth. Instead, we make the observation that if measurements are unbiased, the measured characteristics should remain the same regardl...

...s existence, none covers the full set of sampling bias sources addressed in this work. In general, under the assumption that “more is better”, these studies obtained data over long periods of time =-=[10,22]-=-, from more peers [6,9,10], for more files [9, 10, 23] and communities [10, 23], and filtered the raw data before analysis to eliminate some of the measurement biases [6, 10]. Closest to our work, Stu...

...bias sources addressed in this work. In general, under the assumption that “more is better”, these studies obtained data over long periods of time [10,22], from more peers [6,9,10], for more files =-=[9, 10, 23]-=- and communities [10, 23], and filtered the raw data before analysis to eliminate some of the measurement biases [6, 10]. Closest to our work, Stutzbach et al. [24] assess the bias incurred by samplin...

...functions (CDFs) of two random variables. Similarly, we use the D characteristic to compare the measured and the complete dataset values. Following traditional work on computer workload modeling (see =-=[16]-=- and the references within), we say that measurements resulting in errors above 10% (D metric above 0.1) have very low accuracy, and that measurements with 5–10% error have low accuracy. 3.2 Data Sour...
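The D characteristic and the accuracy bands quoted above can be illustrated with a minimal sketch. It assumes that the measured and the complete datasets are each given as a plain list of numeric values; the function names and the "acceptable" label are hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Return D, the maximum distance between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum distance
    # is reached at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def accuracy_label(d):
    """Map D to the accuracy bands quoted in the excerpt above."""
    if d > 0.10:
        return "very low accuracy"
    if d >= 0.05:
        return "low accuracy"
    # Assumption: the excerpt names only the two low bands.
    return "acceptable"


if __name__ == "__main__":
    complete = [1, 2, 3, 4]   # toy stand-in for the complete dataset
    sampled = [3, 4]          # toy stand-in for a biased sample
    d = ks_distance(complete, sampled)
    print(d, accuracy_label(d))
```

The sketch uses only the standard library so that the comparison itself stays visible; a statistics package would normally also report a significance level, which the excerpt does not discuss.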

...BitTorrent measurement techniques that there is no agreement on the Internet traffic share due to BitTorrent, though caching companies have put forth estimates of over 50% in 2008 [11] and 30% in 2005 =-=[12]-=-. Towards understanding sampling biases in BitTorrent measurements, our main contribution is threefold: 1. We propose a method for exposing the sampling biases in BitTorrent measurements that focuses ...

...itiated contacts. In contrast to passive measurements, active measurements require that the other peers are accessible, for example, they are not behind a firewall. The 2007 measurement by Xie et al. =-=[18]-=- shows that up to 90% of the peers in a live streaming application are firewalled, and that less than 20% of them bypass the firewalls...

... based on the type of content they share, either general or specific content. The specific content may be further divided into content sub-types such as video, operating system, etc.; Garbacki et al. =-=[17]-=- have identified around 200 content sub-types for the SuprNova community. 3. The Passive vs. Active Measurements. Following the terminology introduced in our previous work [9], peer-level measurements...

...t for most of the traffic. Thus, including in the measurement fewer communities and swarms may reduce the volume of acquired data without reducing accuracy. Until the recent study of four communities =-=[10]-=-, measurements have often focused on one community [3, 9], and even on only one swarm [2]. 3. Long-term dynamics Many BitTorrent communities have changed significantly over time or even disappeared. T...

...]. Traces studied in this work are available at the Peer-to-Peer Trace Archive (http://p2pta.ewi.tudelft.nl); for more details and analysis results of the Archive please refer to our technical report =-=[20]-=-. To ensure heterogeneity among the limited number of traces, we have taken into account the following controllable factors when collecting the traces. The traces cover different community types (sh...

...the full set of sampling bias sources addressed in this work. In general, under the assumption that “more is better”, these studies obtained data over long periods of time [10,22], from more peers =-=[6,9,10]-=-, for more files [9, 10, 23] and communities [10, 23], and filtered the raw data before analysis to eliminate some of the measurement biases [6, 10]. Closest to our work, Stutzbach et al. [24] assess ...

...surements, and estimate the costs of implementing these recommendations (Section 6). This work is further motivated by the needs of two ongoing initiatives. First, we are continuing our previous work =-=[13]-=- on building a publicly-accessible P2P Workloads Archive, which will include in a first phase the tens of P2P measurement datasets we have acquired since 2003, and in particular the fifteen datasets w...

...13 petabytes of data. Overall, this paper investigates the largest number of BitTorrent datasets to date, as summarized in Table 1; for a complete description of the traces see our technical report =-=[19]-=-. Traces studied in this work are available at the Peer-to-Peer Trace Archive (http://p2pta.ewi.tudelft.nl); for more details and analysis results of the Archive please refer to our technical report [...
