Question

How can I fix transient, high NTP jitter?

Background information

I have an NTP server on my private network. My servers synchronize from this clock, and usually all is well. An example set of output:

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.10.10.249    10.10.100.20     3 u  367 1024  377    0.096    0.145   0.142
ntpq> as

ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  2378  962a   yes   yes  none  sys.peer    sys_peer  2
ntpq> rv 2378
associd=2378 status=962a conf, reach, sel_sys.peer, 2 events, sys_peer,
srcadr=10.10.10.249, srcport=123, dstadr=10.10.200.1, dstport=123,
leap=00, stratum=3, precision=-18, rootdelay=1.190, rootdisp=37.155,
refid=10.10.100.20,
reftime=df134714.c026b762  Mon, Aug  6 2018 22:15:48.750,
rec=df134a04.507b5ad6  Mon, Aug  6 2018 22:28:20.314, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=10, ppoll=10, headway=0, flash=00 ok,
keyid=0, offset=0.145, delay=0.096, dispersion=15.187, jitter=0.142,
xleave=0.052,
filtdelay=     0.10    0.10    0.05    0.08    0.09    0.11    0.11    0.11,
filtoffset=    0.14    0.16    0.19    0.12    0.02   -0.02   -0.04   -0.10,
filtdisp=      0.00   15.57   31.37   47.42   63.65   79.41   95.27  110.72

However, every once in a while a system's jitter jumps to a much larger value. When we dig into it, we see a single spike in the delay and offset filter values. Example:

filtdelay=     0.06    0.11  250.20    0.07    0.04    0.10    0.07    0.09,
filtoffset=    0.05   -0.01  124.95   -0.05   -0.05   -0.07   -0.05   -0.03,

Note that in this case the offset (usually, but not always) stays within ±0.5:

# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.10.10.249    10.10.100.20     3 u  711 1024  377    0.112   -0.006  47.230

Sometimes the high jitter value persists, mostly unchanged, for a few hours. The elevated jitter ranges from 1 ms to over 100 ms. Eventually it drops back below 1 ms.
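The 47.230 reading is about what a single bad sample in the filter would produce. A minimal sketch, assuming ntpd derives peer jitter as the RMS difference between the selected filter sample and the other samples (the formula given in RFC 5905), and simplifying sample selection to "lowest delay". The function name `peer_jitter` is made up for illustration:

```python
import math

def peer_jitter(offsets, best_index):
    """RMS of the differences between the selected sample and all others.

    Assumption: this mirrors the peer jitter formula in RFC 5905; the real
    ntpd clock filter also weighs sample age when picking the best sample.
    """
    best = offsets[best_index]
    others = [o for i, o in enumerate(offsets) if i != best_index]
    return math.sqrt(sum((o - best) ** 2 for o in others) / len(others))

# Filter registers from the example above (milliseconds).
filtdelay = [0.06, 0.11, 250.20, 0.07, 0.04, 0.10, 0.07, 0.09]
filtoffset = [0.05, -0.01, 124.95, -0.05, -0.05, -0.07, -0.05, -0.03]

# Simplification: treat the lowest-delay sample as the selected one.
best_index = min(range(len(filtdelay)), key=filtdelay.__getitem__)
jitter = peer_jitter(filtoffset, best_index)

# One outlier stays in an 8-sample shift register for 8 poll intervals.
poll_seconds = 1024
hours_in_filter = len(filtoffset) * poll_seconds / 3600

print(f"jitter ~ {jitter:.2f} ms, outlier lingers ~ {hours_in_filter:.1f} h")
```

With these values the result comes out near 47 ms, and the outlier would take roughly 2.3 hours to shift out of the filter at a 1024 s poll interval, which would fit jitter that stays high and nearly constant for a few hours.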

Addendum We are seeing a correlation between system load and NTP jitter. As a first guess, NTP packets might be colliding with NFS traffic.

EDIT It's not a GPS clock source.

EDIT It's definitely a problem. The jitter we see roughly correlates to high offset values.

EdwardTeach
  • New information since I last looked at this: some of it may be due to heat change, which caused the system's internal clock to speed up as the system warmed, or slow down as the system cooled. Some of it may also be due to relatively less precise (worse) timekeeping hardware. – EdwardTeach Oct 01 '20 at 13:26

1 Answer

On the Mars 2003 project at JPL, I was responsible for the software phase-locked loop that kept the ground-based spacecraft simulation in sync with the downlinked clock signal from the spacecraft. Based on that experience, aliasing is the only phenomenon I can think of that might cause transient jitter. Aliasing happens when the time signal client loses the association between what it thinks a signal "tick" represents and what the tick really is. If your clients ("my servers" in your question) use an anti-aliasing algorithm to get back in sync after a loss of connectivity, it might take them a while to re-sync.

The Mars'03 clock signal was 8 Hz, meaning that there were 8 signals per second. If the client falls behind in its sampling by more than 1/8 of a second, it will miss one of the signals and get confused. To combat this, I made the phase-locked loop as robust and elastic as possible, so that it was practically impossible under normal circumstances for it to lose sync with the incoming signal. If it did lose sync (which I never saw it do unless I forced it using an oscilloscope), it would have to start over by waiting for the well-known sync pattern to come in, whereupon it could reset the phase-locked loop, just as it does at startup.

Based on this experience, I'm guessing that your transient jitter results from transient losses of connectivity on the time sync network, which may be compounded by packet storms if your time protocol guarantees delivery, as TCP does. If a guaranteed-delivery protocol falls behind the clock signal, aliasing results. The clients must then do whatever they do to re-sync, and trying to guarantee delivery under these circumstances might kick up a packet storm that makes things worse before they get better. If the anti-aliasing logic is sound, you might want to check whether your time protocol runs over TCP (which guarantees delivery) or UDP (which doesn't, but is much leaner), and use UDP to eliminate the packet storms.

  • The NTP protocol always uses UDP. The problem with just a single source is, the jitter might be caused by your clock source being in error, or your local system clock having errors (sleep states for power saving, perhaps?), or a congested network between you and the time source might cause random packet delays, and without a second independent time source, determining the cause of the problem can be difficult or impossible. – telcoM Aug 07 '18 at 04:20
  • We are seeing "missing" peerstats and loopstats log lines. That is: I was expecting both of those logs to update at the polling interval (approx 1024). However sometimes I'll see a gap of over an hour between synchronizations. Maybe this matches what you were seeing with missed updates. Is there any way to interrogate ntpd to find out if it's skipping those updates? – EdwardTeach Sep 17 '18 at 21:02
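
One way to look for the gaps described in the last comment is to scan the peerstats file directly. A minimal sketch, assuming the usual peerstats layout in which the first two fields are the modified Julian day and the seconds past midnight. The `find_gaps` helper, the `slack` factor, and the sample lines are all hypothetical and only for illustration:

```python
def find_gaps(lines, poll_seconds=1024, slack=1.5):
    """Return (previous_time, current_time, gap) for each gap between
    consecutive records that exceeds poll_seconds * slack.

    Assumption: each line starts with "<MJD> <seconds past midnight>".
    Times are returned as seconds since MJD 0.
    """
    gaps = []
    prev = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        t = int(fields[0]) * 86400 + float(fields[1])
        if prev is not None and t - prev > poll_seconds * slack:
            gaps.append((prev, t, t - prev))
        prev = t
    return gaps

# Made-up sample records for a single peer; not real log output.
sample = [
    "58336 1000.000 10.10.10.249 962a 0.000145 0.000096 0.015187 0.000142",
    "58336 2024.000 10.10.10.249 962a 0.000160 0.000100 0.015570 0.000150",
    "58336 6120.000 10.10.10.249 962a 0.000050 0.000060 0.015300 0.047230",
    "58336 7144.000 10.10.10.249 962a -0.000010 0.000110 0.015400 0.047230",
]

gaps = find_gaps(sample)
for start, end, length in gaps:
    print(f"gap of {length:.0f} s ending at second {end % 86400:.0f} of the day")
```

In practice the lines would come from the peerstats file itself, for example `find_gaps(open(path))`, filtered to one peer address if several are logged. This only shows where records are missing; it does not say why ntpd did not write them.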