
In my company's infrastructure there are 5 data centers in remote locations.

In each remote location there's a pair of servers which provide DNS and NTP services, and every other server in that location is configured to use these two servers for DNS and NTP.

All servers are CentOS 6.x machines.

We want these two servers to be redundant for both DNS and NTP.

The DNS part is covered; my only problem is with NTP.

What is the correct method to make sure that when one NTP server fails, the second (or any remaining) server continues serving the clients as if nothing had happened?

I've Googled about it and found a Red Hat solution that sets one of the servers as primary (by configuring it in the clients as "true"), but in case the "true" (primary) server fails, clients would stop getting NTP updates from it, so it's not a purely redundant solution.
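
For context, the configuration that article describes would look something like this in a client's `/etc/ntp.conf` (a sketch from my reading, not a verbatim copy of the Red Hat article; `10.X.X.38`/`10.X.X.39` stand in for the local server pair):

# /etc/ntp.conf on a client -- sketch of the "preferred server" approach.
# "prefer" marks 10.X.X.38 as the favoured peer; the related "true" keyword
# goes further and forces ntpd to always accept that server as a truechimer.
server 10.X.X.38 iburst prefer
server 10.X.X.39 iburst
driftfile /var/lib/ntp/drift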

I wanted to know if anyone has experience with configuring such a solution.

Edit #1:

To test MadHatter's answer I've done the following:

  1. I stopped NTPd on the server which is configured as "preferred" on each of the NTP clients.
  2. I'm waiting for the NTP client to stop working against this server and start working against its partner NTPd server.
  3. I'm running `ntpq -p` on the client to see the change. This is the output of `ntpq -p`:
[root@ams2proxy10 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.X.X.38      .INIT.          16 u    -  128    0    0.000    0.000   0.000
*10.X.X.39      131.211.8.244    2 u    2   64  377    0.123    0.104   0.220

What is "as in ntpq" ? which command shall I run please?

Edit #2: The output of `as`:

[root@ams2proxy10 ~]# ntpq
ntpq> as

ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 64638  8011   yes    no  none    reject    mobilize  1
  2 64639  963a   yes   yes  none  sys.peer    sys_peer  3
ntpq>

The output of `pe`:

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 10.X.X.38      .INIT.          16 u    -  512    0    0.000    0.000   0.000
*10.X.X.39      131.211.8.244    2 u   36   64  377    0.147    0.031 18874.7
ntpq>
Itai Ganot
  • "A man with one watch knows what time it is. A man with two watches is never sure." It is unfortunate that nobody pointed out that **two NTP servers is the worst possible number of NTP servers.** When the servers disagree about the correct time, there is no way for clients to break the tie. You should have at least three servers configured, and preferably four, so that you can sustain a temporary outage of one of the time servers. The Red Hat documentation is remarkably negligent if it did not point this out. – dfc Nov 19 '15 at 19:42

2 Answers


I suspect this is a non-problem: NTP is resilient to this already.

You don't have a "primary" NTP server, and some secondaries: you have a set of configured servers. NTPd will decide which is reliable, which is most likely to offer a good time signal, and it will constantly re-evaluate its decisions.
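
In client terms that's just a flat list of `server` lines in `/etc/ntp.conf`, with no `prefer` or `true` keywords; a minimal sketch (addresses are placeholders for your local pair):

# Minimal client /etc/ntp.conf -- ntpd samples all listed servers and
# elects its own system peer, re-evaluating that choice continuously.
server 10.X.X.38 iburst
server 10.X.X.39 iburst
driftfile /var/lib/ntp/drift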

This is the set of bindings from my NTP pool server over the past month or so:

[Image: munin NTP state graph]

As you can see, most of the time state 6 (system peer) is occupied by the green line, ntp0.jonatkins.com, which is a stratum 1 server to which I bind with permission (all my other servers are stratum 2, so NTPd prefers the lower-stratum server if no other factors apply).

But you can see a dip in that line early in week 44, and the numerical values below the image confirm that during the period of the graph, ntp0.jonatkins.com fell to state 4 (outlyer), while linnaeus.inf.ed.ac.uk, which spent much of its time at state 5 (candidate), nevertheless maxed out at 6 (system peer). (The lines don't go all the way down to 4 / up to 6 because these are 2-hour averages of 5-minute raw data; presumably whatever happened lasted noticeably less than 2 hours, and has therefore been smoothed out.)

This shows that, without any input on my part, NTPd decided at some point that its usual peer was insufficiently reliable, and picked the best alternative source during the "outage". As soon as its preferred peer passed its internal QA tests again, it was restored to peer status.
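
If you want to watch that re-selection happen on a client, you can poll the peer table from the shell (stock `ntpq`; the tally character in the first column shows each source's current role):

# Re-run the peer billboard every 10 seconds: '*' marks the system peer,
# '+' a candidate, and ' ' or 'x' a rejected or falseticking source.
watch -n 10 ntpq -pn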

MadHatter
  • Thanks for your answer; how can I test it? If I stop (`service ntpd stop`) NTPd on one of the NTP servers, will the NTP client automatically contact the next server in the steptickers/ntp.conf file? – Itai Ganot Nov 15 '15 at 12:12
  • Assuming all servers are listed in `ntp.conf` (`steptickers` doesn't enter into this, afaik), the NTP client should already be contacting **all** of them. If you stop the daemon on one of the listed servers, the client should quite quickly start to use the other servers as system peer. I would expect the "failover" time to be on the order of the initial binding time, ie six minutes or so. – MadHatter Nov 15 '15 at 12:14
  • Will issuing `ntpq -p` on an NTP client which is configured to work against the NTPd server where I've stopped NTPd show me what I'm looking for? – Itai Ganot Nov 15 '15 at 12:16
  • It's hard to answer this in the abstract. Could you do an `as` in `ntpq` on one of the current clients, cut-and-paste the answer into your question, and then perhaps we can talk more concretely. – MadHatter Nov 15 '15 at 12:21
  • I've edited my question to add more information, please check Edit #1, thanks. – Itai Ganot Nov 15 '15 at 12:25
  • Please check Edit #2. – Itai Ganot Nov 15 '15 at 12:31
  • Assuming you stopped `ntpd` on `10.x.x.38`, the client has already elevated `10.x.x.39` to system peer. Unless I misunderstand, you seem to be showing me output that demonstrates that `ntpd` has solved your problem. – MadHatter Nov 15 '15 at 12:32
  • @MadHatter how did you generate that graph? – haneefmubarak Nov 15 '15 at 21:49
  • @haneefmubarak `ntp_states`, it's a standard [munin](http://munin-monitoring.org/) plugin. – MadHatter Nov 16 '15 at 07:23

Four or more NTP peers provide falseticker detection and n+1 redundancy. This is also Red Hat's recommendation (although it appears to be subscriber-only content now).

Select 4 or more Internet sources, or use the NTP Pool project. Add non-Internet sources like GPS clocks if you have them. Configure all of your NTP servers to use all of these sources.
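
On each of your NTP servers that might look something like this (a sketch assuming public pool sources; substitute your own upstreams as appropriate):

# /etc/ntp.conf on an internal NTP server -- four independent sources.
# A local GPS refclock would be added as a driver-specific 127.127.t.u
# pseudo-address, which depends on your hardware.
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 3.pool.ntp.org iburst
driftfile /var/lib/ntp/drift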

Verify that your NTP servers are spread throughout your infrastructure with as few shared single points of failure as possible: different racks, power distribution, network and Internet connectivity, data centers, etc.

Configure all "client" hosts to use all of your NTP servers. Configure at least 4 per client.

This configuration is very resilient. You can lose any single NTP peer and still detect falsetickers, throwing out one crazy clock.

John Mahowald