In the previous blog post, we concluded that providing an Amazon AWS-based NTP server that was a member of the NTP Pool Project was incurring ~$500/year in bandwidth charges.

In this blog post we examine the characteristics of NTP clients (mostly virtualized). We are particularly interested in the NTP polling interval, the frequency with which the NTP client polls its upstream server. The frequency with which our server is polled correlates directly with our costs (our $500 in Amazon AWS bandwidth corresponds to 46 billion NTP polls[1]). Determining which clients poll excessively may provide us a tool to reduce the costs of maintaining our NTP server.

This blog post describes the polling intervals of several clients running under several hypervisors, and of one client running on bare metal (OS X). It also describes our methodology in gathering those numbers.

NTP Polling Intervals

The polling intervals of ntpd vary from 64 seconds (the minimum) to 1024 seconds (the maximum)—as much as sixteenfold (note that these values can be overridden in the configuration file, but for purposes of our research we are focusing solely on the default values).
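For reference, those bounds are set per server line in ntp.conf, where minpoll and maxpoll are expressed as powers of two. A sketch that writes the defaults out explicitly (the pool hostname is just an example):

```
# /etc/ntp.conf (excerpt)
# minpoll 6 = 2^6 = 64 seconds; maxpoll 10 = 2^10 = 1024 seconds
server 0.pool.ntp.org iburst minpoll 6 maxpoll 10
```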

We discover that polling behavior correlates strongly with the hypervisor (e.g. the VirtualBox NTP clients frequently poll at the default minimum poll interval, 64 seconds).

A close-up of the 64-second polling interval (“minpoll”). Notice the dots are mostly VirtualBox with a sprinkling of KVM. NTP clients perform poorly under those hypervisors.

By examining the chart (the chart and the underlying data can be viewed on Google Docs), we can see the following:

The guest VMs running under VirtualBox perform the worst (with one exception: Windows). Note that their polling intervals are clustered around the 64-second mark—the minimum allowed polling interval.

The Windows VM appears to query for time but once a day. It doesn’t appear to be running ntpd; rather, it appears to set the time via the NTP protocol with a proprietary Microsoft client.

The OS X host only queried its NTP server once during a 3-hour period. Since this value (10800 seconds) is more than the default maxpoll value (1024 seconds), we suspect that OS X uses a proprietary daemon and not ntpd.

The guest VM running under ESXi performs quite well; although its datapoint is obscured in the chart, if one were to browse the underlying data, one would see that its datapoints are clustered around maxpoll, i.e. 1024 seconds.

The guest VM running under Xen (AWS) also performs quite well; its datapoints are also clustered around maxpoll.

The guest VM running under KVM performs better than the VirtualBox VMs, which is admittedly damning with faint praise. Its polling intervals tend to cluster around 128 seconds, with smaller clusters at 64 and 256 seconds.

Why We Are Not Characterizing NTP Clients on Embedded Systems

We’re ignoring embedded systems, a fairly broad category which covers everything from a modest home WiFi Access Point to a complex high-end Juniper router.

There are two reasons we are ignoring those systems.

We don’t have the resources to test them (we don’t have the time or the money to purchase dozens of home gateways, configure them, and measure their NTP behavior, let alone the more-expensive higher-end equipment).

The operating systems of many embedded systems have roots in the Open Source community (e.g. DD-WRT is Linux-based, Juniper’s JunOS is FreeBSD-based). There’s reason to believe that the NTP clients of those systems behave the same as the systems upon which they are based.

We wish we had the resources to characterize embedded systems—sometimes they are troublemakers:

The operating systems of embedded systems that do not have roots in the Open Source community have a poor track record of providing good NTP clients. Netgear, SMC, and D-Link, to mention a few, have had their missteps.

Why Windows and OS X NTP Clients Don’t Matter

Windows and Apple clients don’t matter. Why?

They are not our NTP clients. Both Microsoft and Apple have made NTP servers available (time.windows.com and time.apple.com, respectively) and have made them the default NTP servers for their operating systems.

They rarely query for time: Windows 7 only once a day, and OS X every few hours.

We suspect that fewer than 1% of our NTP clients are either Windows or OS X (but we have no data to confirm that).

Regardless of its usefulness, we characterize the behavior of their clients anyway.

2. Setting Up the NTP Clients

The ESXi, Xen (AWS), and KVM (Hetzner) clients have already been set up (not for characterizing NTP, but we’re temporarily borrowing them to perform our measurements); however, the VirtualBox clients (specifically the Ubuntu and FreeBSD guest VMs) need to be set up.

The 3 VirtualBox and 1 Bare-Iron NTP Clients

We choose one machine for each of the four primary Operating Systems (OS X, Windows, Linux, *BSD). We define hostnames, IP addresses, and, in the case of FreeBSD and Linux, ethernet MAC addresses (we use locally-administered MAC addresses[3]). Strictly speaking, creating hostnames, defining MAC addresses, and creating DHCP entries is not necessary. We put in the effort because we prefer structure:

hostname↔IP address mappings are centralized in DNS (which is technically a distributed, not centralized, system, but we’re not here to quibble)

IP address↔MAC address mappings are centralized in one DHCP configuration file rather than being balkanized in various Vagrantfiles.

We want the Ubuntu VM to have an IP address that is distinct from the host machine’s. This will enable us to distinguish the Ubuntu VM’s NTP traffic from the host machine’s (the host machine, by the way, is an Apple Mac Pro running OS X 10.9.3).

We want the Ubuntu VM to run NTP.

The former is accomplished by modifying the config.vm.network setting in the Vagrantfile to use a bridged interface (in addition to Vagrant’s default use of a NAT interface); the latter is accomplished by creating a shell script that installs and runs NTP and modifying the Vagrantfile to run said script.
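A sketch of the relevant Vagrantfile changes, assuming a stock Ubuntu box; the provisioning-script name is ours:

```ruby
# Vagrantfile (excerpt): add a bridged ("public") interface alongside
# Vagrant's default NAT interface, and provision ntpd via a shell script
Vagrant.configure("2") do |config|
  config.vm.network "public_network",
    use_dhcp_assigned_default_route: true   # send outbound traffic via the bridge
  config.vm.provision "shell", path: "install_ntp.sh"  # installs and starts ntpd
end
```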

3. Capturing NTP Traffic

we passed the -W 1 -G 10800 flags to tcpdump; this is to enable packet capture for 10800 seconds (i.e. 3 hours) and then stop. This allows us to capture the same duration of traffic from each of our machines, which makes certain comparisons easier (e.g. the number of times upstream servers were polled over the course of three hours).

we used the -w flag (e.g. -w /tmp/ntp_vbox.pcap) to save the output to a file. This enables us to make several passes at the capture data.

We filtered for ntp traffic (port ntp)

for machines that were NTP servers as well as clients, we restricted traffic capture to the machines that were its upstream server(s) (e.g. the ESXi’s Ubuntu VM’s upstream server is 91.189.94.4, so we appended and host 91.189.94.4 to the filter)
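Assembled, the capture invocation looked roughly like this (the interface name and upstream address are examples; running it requires root):

```shell
# capture NTP traffic to/from the upstream server for 10800 s (3 h),
# writing exactly one file, then exit
tcpdump -i eth0 -W 1 -G 10800 -w /tmp/ntp_vbox.pcap \
    'port ntp and host 91.189.94.4'
```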

4. Converting NTP Capture to CSV

We need to convert our output into .csv (comma-separated values) files to enable us to import them into Google Docs.

tcpdump's -tt flag prints timestamps as raw seconds since the Epoch, so that we may easily calculate the amount of time between each response

tcpdump's src host parameter restricts the packets to NTP responses and not NTP queries (it's simpler if we pay attention to only half the conversation)

the first awk command prints the interval (in seconds) between each NTP response

the tail command strips the very first response whose time interval is pathological (i.e. whose time interval is the number of seconds since the Epoch, e.g. 1404857430)

the sort and uniq commands tell us the number of times a response was made for a given interval (e.g. “384 NTP responses had a 64-second polling interval”)

the second sort command sorts the output by seconds, lexically (not numerically). The reason we sort lexically is that the join command, which we will use in the next step, requires lexical collation, not numerical (in other words, "1 < 120 < 16 < 2", not "1 < 2 < 16 < 120")

the second awk command puts the data in a format that’s friendly for Google spreadsheets
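Assembled, the pipeline above looks roughly like the sketch below. The real input is the text output of tcpdump -tt reading the saved .pcap; here three synthetic timestamp lines stand in for it so the sketch is self-contained:

```shell
# three synthetic "epoch-timestamp ..." lines stand in for:
#   tcpdump -tt -r /tmp/ntp_vbox.pcap src host 91.189.94.4
printf '%s\n' '1404857430 IP ...' '1404857494 IP ...' '1404857558 IP ...' |
  awk '{ print $1 - prev; prev = $1 }' |  # seconds since the previous response
  tail -n +2 |                            # drop the pathological first interval
  sort | uniq -c |                        # count responses per interval
  sort -k 2 |                             # lexical sort on interval, for join
  awk '{ print $2 "," $1 }'               # emit "interval,count" for the spreadsheet
```

The two 64-second gaps collapse into a single "64,2" row: two responses at a 64-second polling interval.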

we use the join command to merge the proper fields together; this is so our scatterplot will display properly. The join-field is the polling interval in seconds

we use 3 iterations of join

the first one merges the fields with common polling intervals

the second one merges the polling intervals that are present in the first file but not the second

the final one merges the polling intervals that are present in the second file but not the first

we invoke sort in order to keep our temporary files lexically collated, a requirement of join

we create a series of temporary files, the last one of which (e.g. 5192.17.csv) we will import into Google Docs
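The three join passes can be sketched as follows; the two synthetic, lexically-sorted "interval count" tables stand in for the real per-client temporary files:

```shell
# two made-up per-client tables: polling interval, then response count
printf '1024 3\n128 5\n64 9\n' > vbox.txt
printf '1024 11\n64 2\n'       > esxi.txt

join      vbox.txt esxi.txt  > merged.txt   # intervals present in both files
join -v 1 vbox.txt esxi.txt >> merged.txt   # intervals only in the first file
join -v 2 vbox.txt esxi.txt >> merged.txt   # intervals only in the second file
sort merged.txt                             # re-collate lexically for further joins
```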

we need to perform one final sort before import (we need to sort numerically, not lexically):

sort -g < 5192.17.csv > final.csv

6. Mastering Google Docs

In order to create our scatterplot, we must comply with Google’s requirements. For example, each column needs at least 1 datapoint.

we add a single datapoint, a polling interval of 10800 seconds, to the OS X column. During our 3-hour packet capture, our OS X host only queried its NTP server once, and we removed that packet (we measure intervals between packets, and we need at least 2 packets to measure an interval). Our data now indicates that OS X queries once every 3 hours.

we remove the column VB/FB/72.20.40.62. That NTP server is unreachable/broken and has no data points.

we add a single datapoint, a polling interval of 86400 seconds, to the VB/W7 column. Windows 7 appears to query for time information only once per day (discovered not in this packet capture but in an earlier one)

Footnotes

2 The inclusion of FreeBSD in the list of Operating Systems is made less for its prevalence (it is vastly overshadowed by Linux in terms of deployments) than for the strong emotional attachment the author has for it.

3 To define our own addresses without fear of colliding with an existing address, we set the locally administered bit (the second least significant bit of the most significant byte) to 1.
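A sketch of flipping that bit on in a made-up MAC address:

```shell
# set the locally-administered bit (mask 0x02) in the most significant
# byte of a hypothetical MAC address
mac="00:11:22:33:44:55"
msb=${mac%%:*}                                  # most significant byte: "00"
rest=${mac#*:}                                  # remaining five bytes
printf '%02x:%s\n' $(( 0x$msb | 0x02 )) "$rest" # prints 02:11:22:33:44:55
```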

4 The term “host” has a specific connotation within the context of virtualization, and we are deliberately mis-using that term to achieve poetic effect (i.e. “hosts” sounds similar to “horsemen”). But let’s be clear on our terms: a “host” is an Operating System (usually running on bare-iron, but optionally running as a guest VM on another host) running virtualization software (e.g. VirtualBox, Fusion, ESXi, Xen); a “guest” is an operating system that’s running on top of the virtualization software which the host is providing.

In our example only one of the 4 hosts is truly a host—the OS X box is a true host (it provides the virtualization software (VirtualBox) on top of which the remaining 3 operating systems (Ubuntu, FreeBSD, and Windows 7) are running).

5 We’d like to point out the shortcomings of the FreeBSD setup versus the Ubuntu setup: in the Ubuntu setup, we were able to use a directive (use_dhcp_assigned_default_route) to configure Ubuntu to send outbound traffic via its bridged interface. Unfortunately, that directive didn’t work for our FreeBSD VM, so we used a script to set the default route. That script is not executed when the FreeBSD VM is rebooted, however; after a reboot the FreeBSD VM reverts to using the NAT interface instead of the bridged interface, which means we can no longer distinguish the FreeBSD VM’s NTP traffic from the OS X host’s.

The workaround is to never reboot the FreeBSD VM. Instead, we use vagrant up and vagrant destroy when we need to bring up or shut down the FreeBSD VM. We incur a penalty in that it takes slightly longer to boot our machine via vagrant up.

Also note that we modified the config.vm.network to use a host-only network instead of the regular NAT network. That change was necessary for the FreeBSD guest to run the required gateway_and_ntp.sh script. VirtualBox was kind enough to warn us:

NFS requires a host-only network to be created.
Please add a host-only network to the machine (with either DHCP or a
static IP) for NFS to work.

I'm a systems administrator at Pivotal Labs. I've worked at a slew of startups and with a slew of UNIXes (OS X, Linux, FreeBSD, OpenBSD, HP-UX, AIX, Solaris/SunOS UTS, Xenix, Ultrix, and even the original UNIX). In my spare time I play rugby.