Why Is My NTP Server Costing $500/Year? Part 1

We investigated and discovered that our public NTP server was heavily loaded. Over a typical 45-minute period, our instance provided time service to 248,777 unique clients (possibly more, given that a firewall may “mask” several clients), sending a total of 247,581,892 bytes (247 MB) outbound. Over the course of a month this ballooned to 332 GB of outbound traffic, which cost ~$40.
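To connect the monthly figure to the yearly one in the title, here is a minimal sketch of the arithmetic. It assumes the ~$40 monthly charge stays constant all year; the per-GB rate is only what the two numbers above imply, not a quoted AWS price.

```python
MONTHLY_OUTBOUND_GB = 332   # outbound traffic for the month, from the text
MONTHLY_COST_USD = 40       # approximate monthly charge, from the text


def annual_cost(monthly_cost_usd):
    """Yearly cost, assuming every month looks like this one."""
    return monthly_cost_usd * 12


def implied_rate_per_gb(monthly_cost_usd, monthly_gb):
    """Cost per GB implied by the bill (an inference, not a published rate)."""
    return monthly_cost_usd / monthly_gb


print(annual_cost(MONTHLY_COST_USD))                                          # 480
print(round(implied_rate_per_gb(MONTHLY_COST_USD, MONTHLY_OUTBOUND_GB), 2))   # 0.12
```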

This blog post discusses the techniques we used to investigate the problem. A future blog post (Part 2) will discuss how we fixed the problem.

Clue #1: The Amazon Bill

What had happened? Our account has only one instance, and it’s a t1.micro instance, Amazon’s smallest, least expensive instance. Had we deployed other instances and forgotten to shut them down? Had someone broken into our AWS account and used it to mine bitcoins? Had our instance been used in an NTP amplification attack? We scrutinized our Amazon AWS bill:

This month’s Amazon AWS bill. The bandwidth charges are surprisingly large, given that the bulk of the traffic is DNS and NTP.

We determined the following:

It’s unlikely that someone had broken into our Amazon AWS account—there was only one instance spun up during the month, and it was our t1.micro.

The unexpected charges were in one area only—bandwidth.

The bandwidth was symmetrical (i.e. the total inbound bandwidth was within 5% of the total outbound bandwidth). This indicates that our instance was not used in an NTP amplification attack, which would show up as lopsided bandwidth usage: outbound bandwidth would have been much higher than inbound.

Clue #2: The Usage Report

We downloaded the report and imported it into a spreadsheet (Apple’s Numbers). We noticed that outbound traffic increased more than a thousand-fold on March 30, 2014: it climbed from 4.2 MB to 8.9 GB.

We graphed the spreadsheet to get a closer look at the data:

Amazon AWS Outbound Bandwidth Usage, by Day

We used a logarithmic scale when creating the graph. A logarithmic scale has two advantages over a linear scale:

it smooths bumps

it does a better job of displaying data that spans multiple orders of magnitude

We noticed the following:

Through 3/29/2014, daily outbound bandwidth was fairly consistent at 2-6 MB / day

From 3/30/2014 onward, daily outbound bandwidth was fairly consistent at 8-12 GB / day

What happened on 3/30?

Clue #3: git log

We use git to track changes in our instance’s /etc/ directory, so we ran git log to see what changed around 3/30.

The change we found was made on the same day we registered our server with the NTP Pool, a volunteer effort in which users make their NTP servers available to the public for time services. The project has been so successful that it’s the default “time server” for most of the major Linux distributions.

It takes about a day for the NTP Pool to satisfy itself that your newly-added server is functional and to make it available to the public—that is likely why we didn’t see any traffic until a day later, 3/30.

Could NTP be the culprit? Had our good friend NTP stabbed us in the back? It didn’t seem possible. Furthermore, the documentation on the www.pool.ntp.org website states that typical traffic is “…roughly equivalent to 10-15Kbit/sec (sic)[2] with spikes of 50-120Kbit/sec”.

But we’re seeing 740-1000kbit/sec [3]: seventy times more than what we should be seeing. And note that we’re being generous—we assume that they are referring to outbound traffic only when they suggest it should be 10-15kbit/sec; if they meant inbound and outbound combined, then our NTP traffic is one hundred forty times more than what we should be seeing.
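The conversion from the daily bandwidth figures to kbit/sec can be sketched as follows. This assumes decimal units (1 GB = 10^9 bytes, 1 kbit = 1000 bits), which may not be exactly how the original figures were derived; the function name is ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def gb_per_day_to_kbit_per_sec(gb_per_day):
    """Convert a daily transfer volume (decimal GB) to an average rate in kbit/sec."""
    bits_per_day = gb_per_day * 1e9 * 8
    return bits_per_day / SECONDS_PER_DAY / 1000


# Low end of the observed 8-12 GB/day range.
print(round(gb_per_day_to_kbit_per_sec(8)))   # 741

# Ratio of observed to documented traffic, low end vs. low end
# and high end vs. high end. Both land near seventy.
print(round(740 / 10))    # 74
print(round(1000 / 15))   # 67
```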

Clue #4: tcpdump

We need to examine packets. We decide to run a packet trace on our instance for a 45-minute period.

We don’t concern ourselves with the 8665 packets that were dropped by the kernel—they represent less than 0.15% of the overall traffic, and are thus inconsequential.

We copy the file (/tmp/aws.pcap) to our local workstation (doing traffic analysis on a t1.micro instance is painfully slow).

Is our packet trace representative? Yes.

We need to make sure our packet trace is representative of typical traffic to our server, at least in terms of throughput (kbits/sec). In other words, our packet trace should have an outbound throughput on the order of 740-1000kbit/sec.

We have a dilemma: tcpdump’s units of measurement are packets, not bytes. We will address this in two steps:

We will run tcpdump to create a .pcap file that contains only the outbound packets.

We will use pcap_len [4] to determine the total aggregate size of those packets in bytes.

Once we have the total number of bytes, we can determine the throughput.

We have 248453677 bytes / 2693 seconds, which works out to [5] 738 kbits / sec. That is in line with our typical outbound traffic (740-1000 kbits/sec).
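The throughput arithmetic is short enough to check directly. A minimal sketch, assuming 1 kbit = 1000 bits:

```python
def throughput_kbit_per_sec(total_bytes, seconds):
    """Average throughput in kbit/sec for a capture of the given size and duration."""
    return total_bytes * 8 / seconds / 1000


# Outbound bytes and capture duration from the packet trace.
print(round(throughput_kbit_per_sec(248_453_677, 2693)))   # 738
```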

What percentage of our outbound traffic is NTP? 99.6%

We want to confirm that NTP is the bulk of our traffic. We want to make sure that NTP is the bad guy before we point fingers. Once again, we use tcpdump in conjunction with pcap_len to determine how many bytes of our outbound traffic are NTP.

We don’t care how many seconds passed; we merely care how many bytes we sent outbound (248453677) and how many of them were NTP (247581892). We determine that NTP accounts for 99.6% [6] of our outbound traffic.
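The percentage follows from the two byte counts alone:

```python
def share_percent(part, whole):
    """Percentage of `whole` accounted for by `part`."""
    return 100 * part / whole


TOTAL_OUTBOUND_BYTES = 248_453_677   # all outbound traffic in the trace
NTP_OUTBOUND_BYTES = 247_581_892     # outbound NTP traffic in the trace

print(round(share_percent(NTP_OUTBOUND_BYTES, TOTAL_OUTBOUND_BYTES), 1))   # 99.6
```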

NTP is the bad guy.

Clue #5: Compare against a control

We want to see whether we have an excessive number of NTP clients compared with other NTP pool members. Fortunately, we have another machine in the NTP pool: our home network’s FreeBSD firewall, which is also an NTP server and is connected to the Comcast network. We’ll pull statistics from there.

Our AWS server is dishing out 11.8 times the NTP traffic that our home server is. But that’s not quite the metric we want; the metric we want is “how much bandwidth per unique client (unique IP address)”.

We determine the number of unique NTP clients that each host (AWS and home) has.

Our AWS server is handling 7.1 times the number of unique NTP clients that our home server is handling. This is troubling: the traffic ratio (11.8) is larger than the client ratio (7.1), which means that certain AWS NTP clients are using more bandwidth than they should. Furthermore, we ran tcpdump longer on our AWS server (2693.13 seconds) than on our home server (2209.59 seconds), giving AWS more time to collect unique clients. In other words, the ratio is probably worse: if we extrapolate based on the number of seconds, the AWS server is spending twice as much bandwidth per client as the home server.
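One way to arrive at the "twice" figure is sketched below. It rests on two assumptions that the text does not spell out: that the 11.8 traffic ratio is already a per-second rate, and that the number of unique clients seen grows linearly with capture time (a crude approximation).

```python
AWS_SECONDS = 2693.13    # capture duration on the AWS server
HOME_SECONDS = 2209.59   # capture duration on the home server
TRAFFIC_RATIO = 11.8     # AWS NTP traffic relative to home
CLIENT_RATIO = 7.1       # AWS unique clients relative to home


def per_client_bandwidth_ratio(traffic_ratio, client_ratio, aws_seconds, home_seconds):
    """AWS bandwidth per client relative to home, after scaling the AWS
    client count down to the shorter home capture (linear-growth assumption)."""
    normalized_client_ratio = client_ratio * home_seconds / aws_seconds
    return traffic_ratio / normalized_client_ratio


# Without the time adjustment.
print(round(TRAFFIC_RATIO / CLIENT_RATIO, 2))   # 1.66

# With the time adjustment.
print(round(per_client_bandwidth_ratio(
    TRAFFIC_RATIO, CLIENT_RATIO, AWS_SECONDS, HOME_SECONDS), 2))   # 2.03
```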

Clue #6: Greedy (broken?) clients

We suspect that broken or poorly configured clients may account for much of the traffic. The grand prize goes to an IP address located in Puerto Rico (162.220.96.14), which managed to query our AWS server 18,287 times over the course of 2693 seconds. That works out to about 6.8 queries / second.

The runner-up was also located in Puerto Rico (70.45.91.171), with 12,996 queries at a rate of 4.8 queries / second. Which raises the question: what the heck is going on in Puerto Rico?
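The per-client query rates are the query counts divided by the capture duration:

```python
CAPTURE_SECONDS = 2693   # duration of the AWS packet trace


def queries_per_second(query_count, seconds=CAPTURE_SECONDS):
    """Average query rate for one client over the capture."""
    return query_count / seconds


print(round(queries_per_second(18_287), 2))   # 6.79
print(round(queries_per_second(12_996), 2))   # 4.83
```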

Let’s take a brief moment to discuss how we generated these numbers: we lashed together a series of pipes to list each unique IP address and the number of packets our instance sent to that address, converted the output into CSV (comma-separated values), and then imported the data into a spreadsheet for visual examination.
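We used shell pipes, but the same counting step can be sketched in Python. This is not the pipeline we ran: it assumes the trace has already been printed as text in tcpdump's usual `IP src.port > dst.port:` form, and the sample lines and addresses below are made up for illustration.

```python
import csv
import io
import re
from collections import Counter

# Matches the "IP src.port > dst.port:" part of a tcpdump text line.
LINE_RE = re.compile(r"\bIP (\S+) > (\S+):")


def destination_ip(line):
    """Return the destination IPv4 address of a tcpdump text line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # The destination looks like "a.b.c.d.port"; drop the trailing port.
    return match.group(2).rsplit(".", 1)[0]


def packets_per_destination(lines):
    """Count how many packets were sent to each destination address."""
    counts = Counter()
    for line in lines:
        ip = destination_ip(line)
        if ip is not None:
            counts[ip] += 1
    return counts


def to_csv(counts):
    """Render the counts as CSV, busiest destination first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ip", "packets"])
    for ip, packets in counts.most_common():
        writer.writerow([ip, packets])
    return buffer.getvalue()


# Hypothetical sample input (documentation-range addresses, not real clients).
SAMPLE_LINES = [
    "12:00:00.000001 IP 10.0.0.5.123 > 192.0.2.7.40001: NTPv4, Server, length 48",
    "12:00:00.100000 IP 10.0.0.5.123 > 198.51.100.9.123: NTPv4, Server, length 48",
    "12:00:00.200000 IP 10.0.0.5.123 > 192.0.2.7.40001: NTPv4, Server, length 48",
    "12:00:00.300000 IP 10.0.0.5.123 > 192.0.2.7.40002: NTPv4, Server, length 48",
]

print(to_csv(packets_per_destination(SAMPLE_LINES)))
```

The resulting CSV has one row per destination address and can be imported into a spreadsheet as described above.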

And let’s discuss the graph below. It shows the correlation between the number of unique clients and the number of queries each one makes. We want to see, for example, how much money we would save if we blocked any client that makes more than 289 queries in a 45-minute period. (In this example, we would save $48 over the course of the year.)

This chart is tricky, so let’s use an example. The 25% on the Y-axis crosses 23 on the X-axis, which means “25% of the NTP traffic is to machines which have made 23 or fewer queries.”
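To make the reading of the chart concrete, here is a small sketch of how one point on such a curve could be computed from per-client query counts. It assumes every NTP packet is the same size, so packet counts stand in for bytes; the client names and counts are invented for illustration.

```python
def traffic_share_at_or_below(queries_per_client, threshold):
    """Percentage of all queries that come from clients which made
    `threshold` or fewer queries."""
    total = sum(queries_per_client.values())
    kept = sum(n for n in queries_per_client.values() if n <= threshold)
    return 100 * kept / total


# Hypothetical data: three quiet clients and one noisy one.
SAMPLE_COUNTS = {"client-a": 1, "client-b": 2, "client-c": 3, "client-d": 14}

# Clients with 3 or fewer queries account for 6 of the 20 queries.
print(traffic_share_at_or_below(SAMPLE_COUNTS, 3))   # 30.0
```

Evaluating this function at every threshold gives the curve: the X-axis is the threshold and the Y-axis is the returned percentage.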

Now we have some tools to make decisions. As with most engineering decisions, economics has a powerful say:

| If we block IP addresses that query time more than (over the course of 45 minutes)… | …then we cut our bandwidth by… | …and spend this much annually |
| --- | --- | --- |
| 0 | 100% | $0 |
| 3 | 90% | $48 |
| 23 | 75% | $120 |
| 51 | 50% | $240 |
| 79 | 25% | $360 |
| 289 | 10% | $432 |
| 864 | 5% | $456 |
| 18287 | 0% | $480 |
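The annual-spend column follows from the bandwidth cut and the roughly $480 yearly cost. A minimal sketch, assuming cost scales linearly with outbound bandwidth:

```python
ANNUAL_COST_USD = 480   # approximate yearly cost with no blocking

# (query threshold over 45 minutes, percentage of bandwidth cut), from the table.
THRESHOLDS = [
    (0, 100), (3, 90), (23, 75), (51, 50),
    (79, 25), (289, 10), (864, 5), (18287, 0),
]


def annual_spend(cut_percent, annual_cost=ANNUAL_COST_USD):
    """Yearly spend remaining after cutting `cut_percent` of the bandwidth."""
    return annual_cost * (100 - cut_percent) / 100


for threshold, cut_percent in THRESHOLDS:
    print(threshold, cut_percent, annual_spend(cut_percent))
```

The savings for a given threshold are the difference from $480; blocking clients above 289 queries, for instance, saves $480 - $432 = $48.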

But we are uncomfortable with this heavy-handed approach of blocking IPs based on nothing more than the number of queries. Our assumption that a large number of queries indicates a faulty NTP client is misguided: a corporate firewall, for example, would look like a faulty client merely because it’s passing along hundreds of queries from many internal workstations.

Footnotes

1 The yearly cost is slightly less than $500, closer to $480; however, we decided to exercise poetic license for a catchy headline. Yes, in the finest tabloid tradition, we sacrificed accuracy on the altar of publicity.

One comment

I wonder if Puerto Rico has run out of its pool of IPv4 addresses. After Europe and Asia, just this month Latin America as well, have exhausted their IPv4 pools, many local ISPs have resorted to using NAT to deal with the scarcity of addresses (of course, after years procrastinating IPv6 and pretending that this day wouldn’t come about). Given that the source is a Puerto Rican ISP, and one of the offending addresses from a small /21 network, it’s possible that NAT is to blame. As ISP NAT increasingly becomes more prevalent, this is going to be rather touchy to deal with abuses. For is it an abuser or just several innocent users behind a NAT?

I'm a systems administrator at Pivotal Labs. I've worked at a slew of startups and with a slew of UNIXes (OS X, Linux, FreeBSD, OpenBSD, HP-UX, AIX, Solaris/SunOS UTS, Xenix, Ultrix, and even the original UNIX). In my spare time I play rugby.