I'm getting a huge offset (see below) and a drift rate greater than NTP can correct: 'frequency error 672 PPM exceeds tolerance 500'. The error appears to be bigger than ntpd can correct.

From googling, it appears that there was a problem with the system clock of Sun Blade 100s and some other Sun boxes of the same era. Sun released a Solaris patch to fix the problem, but I can't find a solution under Linux.

Does anyone else have a Sun with the same problem? Have you found a fix?

One possible "fix" would be to change NTP_MAXFREQ to a higher value so that ntp does not panic on your system. You might consider changing it to:

#define NTP_MAXFREQ 700e-6

before building your ntp, and this should allow ntp to work. The 500 PPM limit is mainly a sanity check, as ntpd will actually work with higher frequency offsets. The author of the ntp code would likely say something about the code possibly not being stable with larger frequency offsets, but you are close to 500 PPM so it will likely be OK. If it is not stable you will know it soon enough.

Looking at your ntpq output I see some other things that might be issues. For example, your jitter numbers are in the 100 to 180 millisecond range. This is way too much and is an indication of a serious problem with either your network connection or your selected ntp servers. You should be seeing jitter numbers that are closer to 1 millisecond. Your offsets are also very high at around 45 seconds. It appears that you have two ntp servers on your local network (192.168.6.2 and 192.168.6.1). Do these servers also have high jitter and offset numbers?

Things are looking better with your current setup. All of your offsets are < 10 milliseconds, which is not too bad, but it might be possible to make some additional improvements. In addition, your jitter numbers are much improved and are about as good as you can expect when using Internet time servers. Your delay numbers are a little high and you might get these down by selecting better servers. I currently use a local reference clock so I am used to seeing very low delay, offset and jitter numbers, but back when I used Internet time servers I was able to locate a good set of them where my delay numbers were all < 20 milliseconds. This improved my offsets to the 1 to 3 millisecond range, which is about as good as you can get using Internet based time servers.

You might also consider using one of your servers as your primary time server and having all of your other servers sync to it. This will reduce Internet traffic since only one server will be talking to the remote time servers (I am sure that the owners of those servers would rather that only one machine from your network was hitting them). Since your other servers would be talking to the single local master time server over local network connections, they would still be nearly as accurate because of very low delay and jitter numbers.

It is better than it was but is still having major issues. ntp should only need to make a step adjustment when it first starts, and the fact that it is doing this about every 15 minutes is an indication that something is very wrong. By default ntpd will make a step adjustment when the time is off by more than 128 milliseconds. When you do an ntpq -c rv, what is the frequency? I have a gut feeling that it will be 700 PPM. If that is the case then you need to make a bigger adjustment to NTP_MAXFREQ, but at best this "fix" is a hack. An error of 2.8 sec/hour (0.7 sec per 15 minutes) is roughly 780 PPM, and if ntp already has an adjustment of 700 PPM in place then the uncorrected raw clock frequency is off by about 1500 PPM (or more). Not good. Those who are "time nuts" try to find machines that will drift by less than 40 PPM when free running (i.e. no ntp) and preferably under 20 PPM. Machines with higher levels of free running drift are generally not good time keepers, and your machine is way outside of the normal range (i.e. almost all machines are < 100 PPM).
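The drift arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation: the 0.7 seconds per 15 minutes comes from the post, the 700 PPM correction is the guess made above rather than a measured value, and the function name is made up for illustration.

```python
def drift_ppm(error_seconds, interval_seconds):
    """Clock error accumulated over an interval, in parts per million."""
    return error_seconds / interval_seconds * 1e6


# Residual error that ntpd keeps having to step away: 0.7 s every 15 minutes.
residual = drift_ppm(0.7, 15 * 60)

# Assumption: ntpd is already applying its (raised) maximum correction.
applied = 700.0

# The uncorrected clock is off by the sum of the two.
raw = residual + applied

print(f"residual drift:      {residual:.0f} PPM")
print(f"estimated raw drift: {raw:.0f} PPM")
```

For scale, 1 PPM is about 86 milliseconds per day, so a raw drift in this range amounts to roughly two minutes per day.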

Have you checked the kernel bugzilla to see if there are any bugs open for this issue? I know that there has been lots of work on time/timing/timer code in the kernel over the past year or so and that there is lots of work in this area being done right now. For example, 2.6.26 is the first Linux kernel to be a nanosecond kernel and there is a chance that the LinuxPPS patches will become part of 2.6.29. I don't know what version of the kernel you are running, but perhaps a newer version would work better. If not, perhaps you should open a bug report if there is not one already. If there is already an open bug report for this, there is a chance that you will be able to find a patch.

Also what clock source is being used by the system? Run these to find out which clock you are using and what clock sources are available.
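On kernels that expose the clocksource interface in sysfs, the current and available clock sources can be read from two files. The following is a minimal sketch, assuming the usual sysfs location; the function name is made up for illustration, and the files may not exist on older kernels.

```python
from pathlib import Path

# Usual sysfs location on kernels that provide the clocksource interface.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current clock source, list of available clock sources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    try:
        current, available = read_clocksources()
    except OSError as exc:
        print(f"could not read clock source information: {exc}")
    else:
        print("current:  ", current)
        print("available:", " ".join(available))
```

From a shell, running cat on the same two files gives the same information.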

If the current clock source is the tsc clock you might be having issues with power management messing with the clock. Since you mentioned that the machine is lightly loaded and the clock is running way too slow this is a very real possibility. No matter what clock source is being used you should try other clock sources to see if this improves things. In general these should be preferred in this order:

tsc - if you have an invariant tsc timer - newer processors or processors with no power management features.
hpet
acpi_pm
pit
jiffies

For some of these timers you might need to change your kernel configuration to make them available, and you might also need to change your kernel boot parameters. Only the tsc timer can be affected by power management; all of the others should keep relatively constant time no matter what power mode the processor is in. The big advantages of the tsc timer are that it has higher resolution than the others (some have a resolution of less than 1 nanosecond) and much lower CPU overhead to read (on the order of 1000 to 1). But these advantages are only relevant if the timer is power state invariant. In some cases turning off power management features will make the tsc timer invariant.

I've noticed that SPARC systems often are quite good at having time creep in this fashion (even with Solaris IIRC). To help combat this in the past, I've used ntp plus a daily cron job to change the time to match the NTP server's time, in case the creep has gotten too bad.

Doing this is a desperate action to take. It causes your system time to jump, and if it jumps backwards it can cause problems for processes that need the system time to be monotonic. You should be raising hell with the hardware vendor over this issue if it is in fact as widespread as you say it is. FYI the correct term is drift, not creep.

I notice that the OP is using kernel version 2.6.26.7. Starting with version 2.6.26 the Linux kernel went from being a microsecond kernel to being a nanosecond kernel. The issue is that glibc contains the time related headers that are used by ntp, and glibc has not changed its time related headers since Linux was version 2.2.something. So glibc is out of sync with these newer kernels. When ntp is built using the older timex.h headers from glibc with a newer kernel ntp will be trying to make adjustments that are off by 3 orders of magnitude. It may not have much effect on your issue, but have you tried using a slightly older kernel (anything before the 2.6.26 series)?

I think that this problem could be adjusted in the kernel if you didn't mind hacking it. There was a long thread earlier this year on the kernel mailing list dealing with issues related to clock tick granularity and incorrect clock rates. The thread is long and contentious but it does have some patches that relate to this (perhaps indirectly) that may point you to the correct place to do what needs to be done. It can be located here:

When ntp is built using the older timex.h headers from glibc with a newer kernel ntp will be trying to make adjustments that are off by 3 orders of magnitude.

There is a flag that is used to tell the kernel whether the offset being given is in microseconds or nanoseconds. The kernel will scale the offset if need be. With the Blade 100 it does not seem to be hard to find information saying they have bad clocks.

Multiply the offset given by the 2nd ntpdate by 10000 and you have a rough estimate of the drift. Below are various values given to tickadj with the approximate drift rates I get on my Blade 100. This is done with ntp not running.

I do "tickadj 10012" at boot time. Ntp is reporting a drift of -20.2. Currently using a 2.6.26-gentoo-r1 kernel.
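As a rough guide to what a tickadj value means, the tick is the number of microseconds added to the clock per timer tick, so each step away from the nominal value shifts the clock rate by a fixed number of PPM. The sketch below assumes a nominal tick of 10000 microseconds (100 ticks per second as seen by tickadj); the function names are made up for illustration. The offset helper reflects the "multiply by 10000" rule above, which works out if the two ntpdate runs are 100 seconds apart; that interval is inferred from the factor, not stated in the post.

```python
NOMINAL_TICK_US = 10000  # assumption: 100 ticks per second


def tick_to_ppm(tick_us, nominal=NOMINAL_TICK_US):
    """Rate change (PPM) that a given tick value applies to the clock."""
    return (tick_us - nominal) * 1e6 / nominal


def ppm_to_tick(slow_by_ppm, nominal=NOMINAL_TICK_US):
    """Nearest tick value that compensates for a clock running slow_by_ppm slow."""
    return nominal + round(slow_by_ppm * nominal / 1e6)


def ppm_from_offset(offset_seconds, interval_seconds=100):
    """Drift estimate from the offset accumulated between two ntpdate runs."""
    return offset_seconds * 1e6 / interval_seconds


print(tick_to_ppm(10012))  # 1200.0, i.e. the clock is sped up by 1200 PPM
print(ppm_to_tick(1200))   # 10012
```

One tick step is 100 PPM under this assumption, so tickadj can only get the clock into the right neighbourhood; ntp then handles the remainder, which fits the small residual drift reported above.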

edit: replaced ultra 100 with blade 100

edit2: Last night I put FreeBSD 7.1 on my Blade 100. The drift was 398. Even if 2.6.26 adds a static component to the drift value as per the link hvengel gave, the clock is still borderline with FreeBSD. My time server has a 2.6.24 kernel and it did have the static component in the drift value as per that thread.

Yes, there is a flag that allows ntp and the kernel to agree on whether things are nano or micro. But at least with the current nanosecond Linux kernels and a micro only ntp (i.e. ntp built with an unpatched version of glibc) the clock is not very stable. The instability probably will not be apparent if you are using public Internet ntp servers as your time source, since this will typically result in offsets that are significantly larger than the error introduced by this issue. But if you are using a high accuracy local reference clock like a GPS the problem is apparent, and the time will swing back and forth across a zero offset by about 500 microseconds. With ntp compiled to work in nanosecond mode (i.e. using a patched glibc) this is reduced to around +-20 microseconds max and it is usually less.

For the OP's server rootdispersion=2.325, which is typical when using public Internet time servers. Rootdispersion is a measure of the amount of variance or noise in the time source. This is caused by things like variable network latency, asymmetric network delays and other factors. In this case it is about 2.3 milliseconds. With a correctly set up local reference clock/glibc/ntp this will be closer to 0.35 most of the time. For example this is a machine running a LinuxPPS patched 2.6.26 kernel, a nanosecond patched glibc and using a Motorola Oncore UT+ reference clock:

As you can see, the offset in the above ntp queries is less than 6 microseconds and rootdispersion is close to 0.35. This system is close to the bleeding edge as far as time keeping on Linux is concerned. In this case rootdispersion mainly reflects things like variable latencies in the PPS interrupt handler and temperature fluctuations that affect the quartz oscillator on the computer motherboard, which is probably the biggest factor. The reference clock PPS pulses are within +-50 nanoseconds of the UTC seconds epoch, which is almost two orders of magnitude below the other factors that affect rootdispersion.

I have run this same basic system with both a stock glibc and a nanosecond patched glibc, and this is where I am getting my +-500 microsecond offset number from. Most of those using the LinuxPPS patches have seen similar results, and the general consensus is that these newer kernels need a nanosecond patched glibc and an ntp built against this version to give optimum timekeeping results.

It is not just current kernels and a micro ntp. The slow convergence goes back to 2.6.18 and the offsets you are complaining about were not uncommon. Great for dialup but I really disliked it when using pps. 2.6.17 and earlier could react quite quickly.

If your offset is 2345 ns and STA_NANO is set then 2345 is used. If STA_NANO is not set, ntp passes 2 microseconds and the kernel uses 2 * NSEC_PER_USEC (i.e. 2000). I don't see that making a lot of difference. All I can think of at this late hour is that ntp gives the kernel different time constants depending on nano (time constant unmodified) or micro (ntp time constant - 4). For micro the kernel then adds 4, so with micro the kernel is not using the time constant that ntp is telling it. When I modified the kernel so that it used the time constant it was given with micro, I found convergence was much better.

As for a glibc patch, I did not do much for nano support in ntp. All I did was add STA_NANO and friends to the glibc sys/timex.h. This is with a Garmin GPS 25, unpatched (except for timex.h) glibc 2.6.1 and a 2.6.24 kernel + LinuxPPS + other mods.
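The scaling described here can be mimicked in a few lines. This is only a model of the behaviour as described in this post, not kernel or ntp code: the function names are invented, and the conversion to whole microseconds truncates simply to match the 2345 example.

```python
NSEC_PER_USEC = 1000


def offset_seen_by_kernel(true_offset_ns, sta_nano):
    """Offset (ns) the kernel works with for a given true offset."""
    if sta_nano:
        # Nanosecond mode: the value is passed through unchanged.
        return true_offset_ns
    # Microsecond mode: ntp hands over whole microseconds (truncated here,
    # as in the example), and the kernel scales them back up.
    offset_us = int(true_offset_ns / NSEC_PER_USEC)
    return offset_us * NSEC_PER_USEC


def time_constant_seen_by_kernel(ntp_time_constant, sta_nano):
    """Time constant the kernel ends up using, per the description above."""
    if sta_nano:
        return ntp_time_constant
    passed = ntp_time_constant - 4  # ntp subtracts 4 in microsecond mode
    return passed + 4               # the kernel adds 4 back


print(offset_seen_by_kernel(2345, sta_nano=True))   # 2345
print(offset_seen_by_kernel(2345, sta_nano=False))  # 2000
```

In this model the only thing lost in microsecond mode is the sub-microsecond part of the offset, which is far too small to explain swings of hundreds of microseconds.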

Back onto the original topic. My blade is being put back into use and I am watching the stability of the clock after using tickadj. The only thing I have seen so far is the effects of it being in a room that changes by 10 - 15C during the day. Based on the tickadj data I have previously posted for my blade I used tickadj 10012.

hvengel, where that thread you linked to really bit me was with my Alpha. Drift was 210 PPM or so, and when that problem was introduced it changed to the high 490s. A hot day would have it hitting the limit of what ntp could handle.