How should I manage and troubleshoot NTP issues?

Question

For some time now I've been fighting NTP issues on my company's network, and I'm having a hard time understanding how the commands interact with the service. For example, the server's /etc/ntp.conf contains this line:

server IP_of_internal_ntp_server

But when I type ntpq -p it shows me a different server's IP. In addition, over time I've learned that the way to re-sync a server's time with the NTP server is this:

service ntpd stop && ntpdate ntp_server && service ntpd start

My questions are:

Do the ntpd daemon and the ntpdate command work together? If so, why do I have to stop the ntpd daemon in order to sync the time?

Is the output of the ntpq -p command affected by the /etc/ntp.conf file?

On some servers a Nagios NTP check returns NTP OK: Offset unknown, while on all the other servers, which are configured exactly the same, I get a proper response. Why is that?

Thanks in advance, Itai

Edit #1: /etc/ntp.conf:

driftfile /var/lib/ntp/drift
fudge   127.127.1.0 stratum 10  
keys /etc/ntp/keys
restrict 0.centos.pool.ntp.org mask 255.255.255.255 nomodify notrap noquery
restrict 127.0.0.1 
restrict 1.centos.pool.ntp.org mask 255.255.255.255 nomodify notrap noquery
restrict 2.centos.pool.ntp.org mask 255.255.255.255 nomodify notrap noquery
restrict -6 ::1
restrict default kod nomodify notrap nopeer noquery
server 127.127.1.0
server 130.117.52.203

Output of ntpq -p:

[root@nyproxy15 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 38.74.128.71    .INIT.          16 u    -   64    0    0.000    0.000   0.000
*LOCAL(0)        .LOCL.          10 l   45   64  377    0.000    0.000   0.001
[root@nyproxy15 ~]#

Please ignore the stratum 16, I know it needs to be fixed.

Edit #2: I've edited /etc/ntp.conf and commented out the lines you mentioned.

[root@nyproxy15 ~]# service ntpd stop ; ntpdate 130.117.52.203 ; service ntpd start
Shutting down ntpd:                                        [  OK  ]
30 Sep 08:16:30 ntpdate[31192]: adjust time server 130.117.52.203 offset -0.078324 sec
ntpd: Synchronizing with time server:                      [  OK  ]
Starting ntpd:                                             [  OK  ]
[root@nyproxy15 ~]# ntpq -p
localhost.localdomain: timed out, nothing received
***Request timed out
[root@nyproxy15 ~]# ps -ef |grep ntp
root     31210     1  0 08:16 ?        00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid

Edit #3:

It seems that now, after a few minutes, ntpq -p returns the correct response:

[root@nyproxy15 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*130.117.52.203  46.4.54.78       3 u    9   64  377   80.633   -9.950   1.420
[root@nyproxy15 ~]#

Could we see the whole of the `ntp.conf` on the server that gives wrong information, plus the whole of the output of `ntpq -p`? — MadHatter, Sep 30 '14 at 07:55
I agree that's a bit odd. Could you get rid of the `server 127.127.1.0` and `fudge 127.127.1.0 stratum 10` from the conf (it's pointless telling it to bind to itself, if its clock was any good you wouldn't be running ntpd), restart ntpd, update the `ntpq` output above, and show us the entry in the `ps` output, in case it's running with any odd flags? — MadHatter, Sep 30 '14 at 08:14
@MadHatter: Done, please check Edit #2. So restarting ntp the way I'm doing it is the right way? — Itai Ganot, Sep 30 '14 at 08:19
Yes, I believe so. The restart is great, thanks, and the flags look good - but it's very interesting that `ntpq` then returns an error. You didn't touch any other lines in the `ntp.conf` file? Is it still returning that error after a few minutes? — MadHatter, Sep 30 '14 at 08:28
ntpdate does not change the time in small increments to correct drift, but rather forces it directly to the correct value. Depending on what you are running, this might be an issue. Normally the daemon is responsible for gradually adjusting the clock until the drift has been corrected. — Dennis Nolte, Sep 30 '14 at 08:47
@MadHatter: Seems like it's solved, so I guess the problem was with the two lines I commented in `/etc/ntp.conf`. Please create an answer and I will accept it. Thanks again MadHatter. — Itai Ganot, Sep 30 '14 at 08:50

score 1 · Accepted Answer · answered Sep 30 '14 at 08:53

If you want an NTP server to do anything reliably, you must not lie to it about the reliability of its own clock; the lines

server 127.127.1.0

and

fudge 127.127.1.0 stratum 10

do exactly that, and it looks like getting rid of them has fixed things.
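
If you want to check both the config and the result, something like the following might do as a rough sketch. The helper names `has_local_clock` and `selected_peer` are made up for illustration; the first looks for active (uncommented) `server` or `fudge` lines for 127.127.1.0 in a config file, and the second reads `ntpq -p` output and prints the peer marked with `*`, which is the one ntpd has selected as its sync source.

```shell
#!/bin/sh
# Sketch only: has_local_clock and selected_peer are hypothetical helper names.

# Succeeds (exit 0) if the given ntp.conf still has an active (uncommented)
# server or fudge line for the local clock driver 127.127.1.0.
has_local_clock() {
    grep -Eq '^[[:space:]]*(server|fudge)[[:space:]]+127\.127\.1\.0([[:space:]]|$)' "$1"
}

# Reads `ntpq -p` output on stdin and prints the peer whose line starts
# with '*', i.e. the peer ntpd has selected as its sync source.
# Prints nothing if no peer is selected.
selected_peer() {
    awk 'substr($0, 1, 1) == "*" { sub(/^\*/, "", $1); print $1 }'
}

# Example usage on a host running ntpd:
#   has_local_clock /etc/ntp.conf && echo "local clock driver still configured"
#   ntpq -pn | selected_peer
```

Fed the first `ntpq -p` output from the question, `selected_peer` would print `LOCAL(0)`; fed the output from Edit #3, it would print `130.117.52.203`.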

As for stopping ntpd before brute-forcing the time with ntpdate: by default ntpdate binds to the NTP port (UDP 123), and a running ntpd already holds that port. As long as ntpd is there, ntpdate can't get a look in and exits, complaining that the NTP socket is in use; so it's necessary to take ntpd out of the picture long enough for ntpdate to work.

But my understanding's strictly from running pool servers, and I could be wrong about the details.
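
The stop, set, restart sequence from the question could be wrapped up roughly like this. It's only a sketch: `resync_clock`, `run` and the `DRY_RUN` variable are names invented here, and it assumes the SysV-style `service ntpd` wrapper used in the question. Unlike the `&&` chain, it starts ntpd again even if ntpdate fails, so the host isn't left without a running daemon.

```shell
#!/bin/sh
# Sketch only: resync_clock, run and DRY_RUN are hypothetical names.
# Assumes a SysV-style `service ntpd` wrapper, as used in the question.

# Runs a command, or just prints it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stops ntpd, sets the clock once from the given server, then starts ntpd.
# ntpd is started again even if ntpdate fails; the ntpdate exit status
# is what the function returns in that case.
resync_clock() {
    server="$1"
    status=0
    run service ntpd stop || return 1
    run ntpdate "$server" || status=$?
    run service ntpd start || return 1
    return "$status"
}

# Example (as root): resync_clock 130.117.52.203
# Preview only:      DRY_RUN=1 resync_clock 130.117.52.203
```

With `DRY_RUN=1` the function only prints the three commands it would run, which is a cheap way to check the sequence before trying it on a production box.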
