After a cold boot of a Debian 6.0.8 server (HP ProLiant), ntpd played havoc with the system time: offset and jitter with respect to the usual, reliable reference time servers grew without limit. (Note that an identical twin server had no problem at all.) After many unsuccessful attempts to fix the problem on the ntpd side, I decided to try a reboot, and everything went back to normal.

While investigating the problem, I found this discrepancy, which could explain my clock problems:

root@n1:~# zgrep Detected /var/log/dmesg*
/var/log/dmesg:[    0.004000] Detected 2400.110 MHz processor.
/var/log/dmesg.0:[    0.004000] Detected 2383.579 MHz processor.
/var/log/dmesg.1.gz:[    0.004000] Detected 2400.036 MHz processor.
/var/log/dmesg.2.gz:[    0.004000] Detected 2400.298 MHz processor.
/var/log/dmesg.3.gz:[    0.004000] Detected 2400.165 MHz processor.
/var/log/dmesg.4.gz:[    0.004000] Detected 2400.410 MHz processor.

Note that in the second-to-last boot (the problematic one) the detected CPU frequency is a clear outlier. Without the outlier, the error and standard deviation of the detected frequency with respect to the nominal one are +0.15 MHz ± 0.25 MHz. For the problematic boot the error is -16.4 MHz, which is about 100 times greater than expected.

My questions:

  1. Can an error of this type make the ntp time discipline unstable or unusable? Is this the reason for my clock problems?

  2. Is this type of behavior a symptom of flaky hardware? Should the server go in for hardware maintenance?

Update

Some useful data:

  • kernel is 2.6.32-5-amd64 (Debian 2.6.32-48squeeze4)
  • current_clocksource is tsc
  • error for lpj is (of course) consistent with error on CPU freq

Some context lines for the above grep

[    0.000000] hpet clockevent registered
[    0.000000] Fast TSC calibration using PIT
[    0.004000] Detected 2400.110 MHz processor.
[    0.000008] Calibrating delay loop (skipped), value calculated using timer frequency.. 4800.22 BogoMIPS (lpj=9600440)
Stefano M

2 Answers

I convinced myself that the problem was a misidentified time stamp counter (TSC) frequency.

Apparently the kernel calibrates the TSC against the programmable interval timer (PIT). Usually the identified CPU frequency is 2400.204 ± 0.134 MHz, which corresponds to an accuracy of about 56 ppm. After the problematic boot the CPU frequency was estimated as 2383.579 MHz, which corresponds to an error of about 6900 ppm. ntpd was not able to compensate for this, since it can only correct frequency errors of up to 500 ppm. In fact, during the first 10h30m of operation the system clock gained about 4m30s, which is about 7000 ppm.

Since the error in the TSC frequency corresponds to the drift in the system clock I would conclude that the abnormal clock behaviour was caused by a wrong TSC calibration.
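The arithmetic behind these figures can be rechecked with a short script. This is only a sketch: the frequencies are copied from the dmesg output in the question, the spread is taken as the population standard deviation, and the variable names are mine.

```python
from statistics import mean, pstdev

# Detected frequencies (MHz) from the dmesg logs quoted in the question.
normal_boots = [2400.110, 2400.036, 2400.298, 2400.165, 2400.410]
bad_boot = 2383.579  # the problematic boot (dmesg.0)


def ppm(delta, reference):
    """Express delta as parts per million of reference."""
    return delta / reference * 1e6


ref = mean(normal_boots)        # typical calibrated frequency
spread = pstdev(normal_boots)   # population standard deviation

# Calibration error of the bad boot relative to the typical value.
calib_error_ppm = ppm(ref - bad_boot, ref)

# Observed drift: the clock gained about 4m30s over 10h30m.
drift_ppm = ppm(4 * 60 + 30, 10 * 3600 + 30 * 60)

print(f"typical frequency: {ref:.3f} +/- {spread:.3f} MHz "
      f"({ppm(spread, ref):.0f} ppm)")
print(f"calibration error: {calib_error_ppm:.0f} ppm")
print(f"observed drift:    {drift_ppm:.0f} ppm")
```

The two ppm values come out within a few percent of each other, which is about as close as the rough "4m30s in 10h30m" estimate allows.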

However, I have never seen such a large error before, and I am still wondering about the possible causes (hardware or software?) of this wrong calibration.

Stefano M

This type of behavior is atypical. A good check would be to monitor the value in the ntp.drift file to see whether it changed significantly while the problem was showing up. If it kept changing significantly, NTP was trying to compensate for a problem. In that case, it's a sign that the kernel misidentified the true clock frequency on startup, or that the clock itself was running slow at the wrong moment during boot. Unfortunately, this one event isn't a clear signal of hardware problems.

If it happens again, watch that ntp.drift file.
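A minimal sketch of such a check, assuming the drift file holds a single frequency value in ppm and lives at /var/lib/ntp/ntp.drift (the path may differ on your system). The function name `drift_status` and the 80% warning threshold are made up for illustration; the 500 ppm figure is the limit of what ntpd can correct.

```python
import os

# ntpd cannot correct frequency errors beyond +/-500 ppm.
NTPD_MAX_FREQ_PPM = 500.0


def drift_status(text, warn_fraction=0.8):
    """Classify the contents of a drift file (one frequency value in ppm).

    warn_fraction is an arbitrary illustrative threshold, not an ntpd setting.
    """
    value = float(text.split()[0])
    magnitude = abs(value)
    if magnitude >= NTPD_MAX_FREQ_PPM:
        return value, "at limit"
    if magnitude >= warn_fraction * NTPD_MAX_FREQ_PPM:
        return value, "suspicious"
    return value, "ok"


if __name__ == "__main__":
    path = "/var/lib/ntp/ntp.drift"  # assumed location, adjust as needed
    if os.path.exists(path):
        with open(path) as f:
            print(drift_status(f.read()))
```

A value sitting at or near ±500 would suggest that ntpd is trying to correct a frequency error larger than it can handle.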

sysadmin1138
  • After the problematic boot ntpd never reached a stable PLL, so `ntpdc -c loopinfo` never gave me a frequency drift value. Now, after the reboot, everything appears to be in order, with a stable drift value... BTW your suggestion is correct, and I'm monitoring `log/loopstats` for abnormal behaviour. – Stefano M Nov 20 '13 at 21:35