PlanetLab Data

Experiments were performed from 102 PlanetLab nodes in 81 unique locations in June 2016. Data from each PlanetLab
node is in a separate folder inside the PL-raw-data folder. Folder names are the IP addresses of the PlanetLab nodes. The file
pl-nodes-16Jun2016.txt lists all PlanetLab nodes and their information.

The data in each folder is presented as downloaded from the corresponding PlanetLab node, i.e., in small chunks
of 1-2 MB each. The files contain measurements performed using the URL list in the file test-urls.txt.
The URL file contains an identifier for each website, the final unique URL used in fetches, and the website's global
Alexa rank. Some lines in the URL file have a 4th field, mooc, which indicates the URLs that were also used for
the MOOC-recruited end-user experiments described in Section 3.7 of the paper. The URLs were obtained from Alexa's
top 500 websites listed for each country. The URL list was crawled in June 2016. URLs which caused errors when fetched
over cURL and URLs of adult websites were discarded, the latter exclusion being arbitrary rather than necessary.
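
A line of test-urls.txt might be parsed as sketched below. The whitespace separator and the literal string "mooc" for the optional 4th field are assumptions for illustration, as is the example line; check test-urls.txt itself for the exact format.

```python
from typing import NamedTuple

class UrlEntry(NamedTuple):
    identifier: str
    url: str
    alexa_rank: int
    mooc: bool  # True if the URL was also used in the MOOC experiments

def parse_url_line(line: str) -> UrlEntry:
    # Assumes whitespace-separated fields: id, URL, rank, optional "mooc".
    fields = line.split()
    return UrlEntry(
        identifier=fields[0],
        url=fields[1],
        alexa_rank=int(fields[2]),
        mooc=len(fields) > 3 and fields[3] == "mooc",
    )

entry = parse_url_line("42 http://www.example.com/ 137 mooc")
```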

Data collection in each experiment using a specific URL includes the following:

Following the header, the destination URL, time of test, and destination IP address are given:

DEST http://www.yepi.com/
TIME_OF_TEST 1464604288
DESTIP 72.21.91.39
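
Assuming each header line is a key followed by a single space and a value (as in the example above), these lines can be turned into a dictionary with a minimal sketch like this:

```python
def parse_header(lines):
    # Split each "KEY value" line at the first space.
    header = {}
    for line in lines:
        key, _, value = line.partition(" ")
        header[key] = value
    return header

hdr = parse_header([
    "DEST http://www.yepi.com/",
    "TIME_OF_TEST 1464604288",
    "DESTIP 72.21.91.39",
])
```

TIME_OF_TEST is a Unix timestamp, so it can be converted with int() or time.gmtime() as needed.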

Each HTML page is fetched twice, and the timings of each fetch are recorded with cURL. The HTML page is sometimes served
from a different server in the second fetch. Pings and traceroutes are run only towards the IP address of the
web server which served the HTML during the first fetch. Information obtained from cURL during each fetch is printed
in the following form:

For the meaning of these items and values, please consult the cURL manual at https://curl.haxx.se/docs/manpage.html,
in particular the section describing the -w option, i.e., --write-out. Fetches that didn't result in an HTTP 200
status code or which caused redirects were discarded, since they are not useful for the purposes of our measurements.
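
The exact layout of the cURL section is not reproduced here, but if it follows the common --write-out convention of one "name: value" pair per line (an assumption), the timing values can be collected into floats with a short sketch; the example variable names below are standard cURL --write-out variables, while the values are made up:

```python
def parse_curl_timings(lines):
    # Split each "name: value" line at the first colon and parse the
    # value as a float (cURL reports times in seconds).
    timings = {}
    for line in lines:
        name, _, value = line.partition(":")
        timings[name.strip()] = float(value)
    return timings

t = parse_curl_timings([
    "time_namelookup: 0.004",
    "time_connect: 0.052",
    "time_total: 1.310",
])
```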

The ### START TCP DATA ### header starts the section which includes information captured with tcpdump. We were only
interested in detecting packet loss, so we recorded only the arrival times of the TCP segments along with the sequence
numbers marking the beginning and end of the received window.
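
One simple heuristic consistent with this description is to flag loss whenever the recorded sequence numbers go backwards, which suggests a retransmission. This sketch is a simplified illustration, not the exact logic of the provided analysis code, and the record layout (arrival time, start sequence, end sequence) is assumed:

```python
def estimate_loss(records):
    # records: iterable of (arrival_time, seq_start, seq_end) tuples.
    # A seq_start below the highest sequence number seen so far means
    # earlier bytes were resent, i.e., a probable packet loss.
    highest_seq = -1
    for _t, seq_start, seq_end in records:
        if seq_start < highest_seq:
            return True
        highest_seq = max(highest_seq, seq_end)
    return False

lossless = [(0.0, 0, 1448), (0.1, 1448, 2896)]
lossy = [(0.0, 0, 1448), (0.1, 1448, 2896), (0.3, 1448, 2896)]
```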

Following the TCP data, the output of running 30 pings and one traceroute is dumped, marked with self-explanatory section
header names.
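
The minimum ping time (Field 18 of the summary CSV below) can be extracted from the ping dump. This sketch assumes the standard Linux ping summary line ("rtt min/avg/max/mdev = ... ms"); the exact format depends on the ping build on each node, and the example values are made up:

```python
import re

def min_ping_seconds(ping_output: str):
    # Pull the "min" entry out of the rtt summary line and convert
    # milliseconds to seconds; return None if no summary is present.
    m = re.search(r"= ([\d.]+)/[\d.]+/[\d.]+/[\d.]+ ms", ping_output)
    return float(m.group(1)) / 1000.0 if m else None

summary = "rtt min/avg/max/mdev = 10.123/12.456/20.789/2.100 ms"
```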

For the inflation analysis as presented in our paper, we need the geolocations of the PlanetLab nodes, the web servers, and the
router IP addresses seen in the traceroute output. Geolocations of all the unique IP addresses seen in the data are obtained
from 6 different commercial geolocation services, and their majority vote is also computed for comparison. This data is
in the folder called geolocations.
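
A majority vote over per-service locations for one IP might be computed as sketched below. The vote granularity (exact coordinates vs. city or country) is an assumption here, as are the example coordinates; consult the geolocations folder for the actual representation:

```python
from collections import Counter

def majority_location(locations):
    # locations: list of (lat, lon) pairs, one per geolocation service.
    # Return the location reported by a strict majority of services,
    # or None if no strict majority exists.
    if not locations:
        return None
    location, votes = Counter(locations).most_common(1)[0]
    return location if votes > len(locations) // 2 else None

locs = [(40.7, -74.0), (40.7, -74.0), (40.7, -74.0),
        (34.1, -118.2), (40.7, -74.0), (51.5, -0.1)]
```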

Analysis Code

The provided Python source code analyzes all the files given in the folder PL-raw-data using a geolocation service.
The particular geolocation service has to be chosen in the file Main.py. The code produces a comma-separated file with
the following format:

Field 1 - Time of test
Field 2 - PlanetLab node hostname
Field 3 - PlanetLab node IP address
Field 4 - Fetched page
Field 5 - Destination server IP address
Field 6 - Boolean indicating whether prot. is https (True = prot. is https)
Field 7 - Site rank
Field 8 - Boolean indicating whether page was ALSO used in MOOC end-user measurements
Field 9 - Distance (in kilometers) between origin and destination
Field 10 - Estimated loss, Boolean
Field 11 - Number of fetched bytes
Field 12 - Name resolution time (DNS) in seconds
Field 13 - TCP handshake time in seconds
Field 14 - SSL handshake time in seconds, 0 if field 6 is False
Field 15 - Request response time in seconds (time between HTTP request sent and first byte received)
Field 16 - TCP transfer time in seconds
Field 17 - Total fetch time in seconds
Field 18 - Minimum ping time in seconds between origin and destination
Field 19 - Router path latency in seconds
Field 20 - cRtt in seconds (minimum possible RTT between origin and destination)
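
Fields 9 and 20 can be related as sketched below: the great-circle distance via the haversine formula, and the corresponding speed-of-light RTT lower bound. Whether the analysis code uses c in vacuum (as here) or the slower propagation speed in fiber (~2c/3) is an assumption, as are the example coordinates:

```python
import math

EARTH_RADIUS_KM = 6371.0
C_KM_PER_S = 299_792.458  # speed of light in vacuum

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def c_rtt_seconds(distance_km):
    # Out-and-back at light speed: the minimum possible RTT.
    return 2 * distance_km / C_KM_PER_S

d = haversine_km(40.7128, -74.0060, 51.5074, -0.1278)  # NYC -> London
```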

Precomputed CSV files using all 7 geolocation services are provided in the zipped file Summary_CSV.tar.gz.
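
A row of these CSVs can be mapped to the 20 fields listed above as sketched below. The short field names are paraphrased from the list, the CSVs are assumed to have no header row, and the example row values (including the hostname) are made up:

```python
import csv, io

FIELDS = ["time_of_test", "pl_hostname", "pl_ip", "page", "dest_ip",
          "is_https", "rank", "mooc", "distance_km", "loss",
          "bytes", "dns_s", "tcp_s", "ssl_s", "request_s",
          "transfer_s", "total_s", "min_ping_s", "path_latency_s",
          "c_rtt_s"]

# Hypothetical example row in the 20-field layout described above.
row_text = ("1464604288,node1.example.org,10.0.0.1,http://www.yepi.com/,"
            "72.21.91.39,False,123,True,5570.2,False,51200,0.004,0.052,0,"
            "0.2,0.9,1.31,0.08,0.09,0.037")
row = dict(zip(FIELDS, next(csv.reader(io.StringIO(row_text)))))
```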

Contact Information

For questions and comments related to the data and the source code, please contact Ilker Nadi Bozkurt at
ilker at cs dot duke dot edu.