I've got about 70 linux instances running on an OpenStack cluster that currently consists of two compute nodes and one controller. Also, these machines live in a RackSpace DC as part of their 'Private Cloud' program, so all of our resources are dedicated.

Previously we were using only RackSpace's NTP servers to synchronize the clocks on all of our instances, but Check_MK was frequently notifying us that the instances were syncing to themselves [stratum 10], implying that the NTP servers were not responding. Given that only 4 of our 70+ instances have public IP addresses, I assumed that RackSpace's NTP servers were rate-limiting us, since they would be seeing 35+ times the normal rate of NTP queries originating from our two compute hosts. This seemed logical since the 4 instances with public IPs never generated any NTP complaints.

To address this I changed ntpd.conf on our instances to include our controller node alongside the RackSpace servers, so that we would at least have a fallback when the RS servers stopped responding. [The NTP cookbook we are using does not allow us to set a preference.] However, this has not stopped, or even reduced, the number of NTP complaints. I've been seeing `when` values in `ntpq -p` in excess of 60 minutes for all three hosts. I can't see how IP-based rate limiting could come into play with the controller node, since the instances and the controller reside on, and communicate through, a private network where every instance has its own IP address.
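For reference, the relevant part of the cookbook-generated ntpd.conf now looks something like this (the hostnames and the controller's private IP are placeholders, not our actual values):

```
# Upstream RackSpace NTP servers (hostnames are illustrative placeholders)
server ntp1.example.rackspace.com iburst
server ntp2.example.rackspace.com iburst
# Fallback: our OpenStack controller node on the private network
server 10.0.0.1 iburst
```

Since the cookbook cannot emit the `prefer` keyword, ntpd treats all three sources as equal candidates and picks among whichever ones respond.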

What could be causing this? As far as I've been able to tell there is nothing in the restrict default line that would cause what we're experiencing.

Since you are having problems with your controllers, can you fire up tcpdump and capture ntp on the controllers and clients? Are you seeing all the requests from the clients on the controllers?
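A capture along these lines would show whether the client queries actually reach the controller; the interface name is a guess and should be adjusted to whatever carries the private network:

```shell
# Capture NTP traffic (UDP port 123) on the controller's private interface.
# eth1 is a placeholder; -n skips DNS lookups so client IPs are shown raw.
tcpdump -ni eth1 udp port 123
```

Running the same capture on one of the complaining instances at the same time makes it easy to see whether requests are being sent but never answered, or never arrive at all.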
– Zoredache, Apr 9 '14 at 21:07

Can you append `ntpq -pcrv` output from a host and a controller node? Have you verified that there are no other firewall rules in play? FYI, all of your `restrict server` lines are redundant: they don't do anything that isn't already covered by the default restrict lines, and they make reading your config painful. Peering is different from requesting time. If you really want different restrictions for your servers you can use `restrict source` once and it will cover all associations.
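A minimal sketch of what dfc is describing, assuming ntpd's `restrict source` template (available in ntp 4.2.6 and later); the pool hostnames here are illustrative:

```
# Default policy for all other hosts
restrict default kod nomodify notrap nopeer noquery

# Template applied automatically to every configured time source,
# replacing a separate restrict line per server
restrict source nomodify notrap noquery

server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
```

With `restrict source`, ntpd instantiates the restriction for each association as it is created, so adding or removing `server` lines never requires touching the restrict section.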
– dfc, Apr 9 '14 at 22:30

@Zoredache fired it up on the controller and an instance, and it looks like what I thought was one private network is actually two: one virtual and one actual. The controller is seeing requests only from the compute nodes, not from the individual instances. All the requests I generated were answered as expected, but the trouble comes in spurts. I'll fire up tcpdump again once it flares up.
– Sammitch, Apr 9 '14 at 22:54

@dfc yep, most of that config is redundant trash, but it's what the Opscode NTP cookbook generates, so I'm stuck with it. I'm adding the requested output now...
– Sammitch, Apr 9 '14 at 23:00

Everything looks good from what you posted. Maybe check_mk is braindead? Can you post the `ntpq -crv` output when the problem arises again? P.S. your leap-seconds file is stale.
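The leap-seconds staleness can be checked by hand; a sketch assuming the file lives at the path named by the `leapfile` directive in ntp.conf (both paths below are common defaults, not confirmed from the posted config):

```shell
# Find where ntpd expects the leap-seconds file
grep leapfile /etc/ntp.conf

# In the IETF leap-seconds.list format, the "#@" line carries the
# file's expiry timestamp in NTP-era seconds; if that date has passed,
# the file is stale and should be refreshed.
grep '^#@' /etc/leap-seconds.list
```

A stale leapfile won't break time sync by itself, but it does generate warnings worth silencing.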
– dfc, Apr 9 '14 at 23:28