5

I have two NTP stratum 3 servers running and want to create a simple check that tells me if either server's time has drifted, and alerts me when it is not properly synced with the public stratum 2 servers.

My first thought was to pull time from multiple stratum 2 servers and compare it with what my NTP servers are sending, then alert if the difference exceeds some threshold X.

Is there a more standard way or better method for verifying that an NTP server is sending the correct time?

krizzo

2 Answers

6

TL;DR:

  1. Configure your NTP server according to best current practices.
  2. (Shameless self-promotion warning.) Use my ntpmon check if your monitoring solution uses collectd, Nagios, or telegraf.

Long version:

Configuration

The most important foundation for good NTP monitoring is good NTP configuration. To understand this properly, read the NTP Best Current Practices (BCP 223/RFC 8633). Here's a condensed summary of its configuration recommendations:

  1. Keep your NTP software up-to-date
  2. Use between 4 and 10 sources
  3. Ensure you have a diversity of reference clocks represented in those sources
  4. Don't allow unauthenticated remote control (should be the default on most distros)
  5. Use the pool responsibly (should also be the default on most distros)
  6. Don't mix leap-smeared and non-leap-smeared sources
  7. Don't use unauthenticated broadcast mode
  8. Don't use anycast or load-balancing when you're serving time

Where to measure

Once you have a good local configuration, the main thing to remember is that your check should query the local NTP server for its metrics, rather than trying to manually measure offset from remote servers. The major NTP servers (ntpd and chronyd) already collect all the metrics you need, so checks which compare the clock against remote servers are ignoring a lot of NTP's built-in goodness.
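
As a rough illustration of querying the local server rather than remote ones, the sketch below parses the system variables that `ntpq -c rv` prints for a local ntpd. The function names and the sample text are hypothetical, the exact set of variables varies between ntpd versions, and chronyd would need a different command, so treat this as a starting point rather than a finished check.

```python
import re
import subprocess


def parse_ntpq_rv(text):
    """Parse 'name=value' pairs from `ntpq -c rv` output into a dict of strings."""
    # Values are either double-quoted (may contain spaces) or run up to the
    # next comma or whitespace.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', text)
    return {name: value.strip('"') for name, value in pairs}


def read_local_ntpd_variables():
    """Ask the local ntpd for its system variables (needs ntpq on the PATH)."""
    result = subprocess.run(
        ["ntpq", "-c", "rv"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_ntpq_rv(result.stdout)


# Illustrative sample only; it was not captured from a real server.
SAMPLE = (
    'associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,\n'
    'version="ntpd 4.2.8p10", stratum=3, rootdelay=12.345, rootdisp=25.678,\n'
    'offset=0.421, sys_jitter=0.105\n'
)

variables = parse_ntpq_rv(SAMPLE)
print(variables["offset"], variables["rootdisp"])
```

On a host running ntpd, `read_local_ntpd_variables()` would replace the sample text; the values are left as strings so the caller decides how to convert and validate them.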

Metric selection

So to your question, the metrics you should be most interested in are:

  • system offset: the calculated best guess of the local clock's offset from the one true time
  • root dispersion: the calculated maximum offset of the local clock from the stratum 0 sources
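
To make the two metrics concrete, here is a minimal sketch of how a check might turn them into an alert state. The function name and every threshold value are assumptions chosen for illustration; they are not taken from ntpmon or any other tool, and sensible limits depend on your environment.

```python
def evaluate_sync(offset_ms, root_dispersion_ms,
                  warn_offset_ms=10.0, crit_offset_ms=50.0,
                  warn_rootdisp_ms=100.0, crit_rootdisp_ms=500.0):
    """Return 'OK', 'WARNING' or 'CRITICAL' for the given metrics.

    All thresholds are hypothetical defaults, in milliseconds.
    """
    # The offset can be negative (clock behind) or positive (clock ahead),
    # so compare its magnitude.
    offset = abs(offset_ms)
    if offset >= crit_offset_ms or root_dispersion_ms >= crit_rootdisp_ms:
        return "CRITICAL"
    if offset >= warn_offset_ms or root_dispersion_ms >= warn_rootdisp_ms:
        return "WARNING"
    return "OK"


print(evaluate_sync(0.4, 25.7))    # small offset, small dispersion
print(evaluate_sync(-12.0, 25.7))  # offset magnitude above the warning limit
print(evaluate_sync(0.4, 750.0))   # dispersion above the critical limit
```

The worse of the two metrics decides the result, so a server with a tiny offset but a large root dispersion still raises an alert.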

Monitoring

There are a few monitoring solutions for NTP; depending on what monitoring you already have in place, some might suit you better than others. I wrote an overview of these on my blog; here's a summary:

  1. Nagios:
     • check_ntp_peer: decent basic check; doesn’t check a wide enough variety of metrics; a little too liberal in what offsets it allows
     • check_ntp_time: not recommended; checks only the offset from a given remote NTP server
     • check_ntpd: reasonable check coverage; use it if you prefer perl over python
     • ntpmon's nagios check
  2. collectd:
     • NTP plugin: some of the metrics it collects are unclear
     • ntpmon in collectd mode
  3. prometheus/influxdb:
     • prometheus node exporter: not recommended; checks only the offset from a given remote NTP server
     • telegraf ntpq input plugin: a direct translation of ntpq output to telegraf metrics; this is probably too detailed if you just want to know, "Is my NTP server OK?"
     • ntpmon in telegraf mode

Caveats

  1. The above is a summary of the state as at October 2016 when I did my alerting and telemetry review. Things may have improved since.
  2. ntpmon is my project which I think overcomes the deficiencies of the checks which were available at the time. It supports both ntpd and chronyd, and the above-listed alerting and telemetry systems.
Paul Gear
  • 1
    Thanks for the information; that is a great resource and it answers my question perfectly. I'll have to take a look at your project for monitoring, as that's essentially what I was looking at creating. – krizzo Feb 12 '19 at 21:56
  • 1
    @LF4 - you're welcome; if you use another monitoring tool which isn't supported yet, I'm happy to work with you on getting support added. – Paul Gear Feb 12 '19 at 21:58
  • 1
    Just to update, the mentioned collector has been removed from Prometheus' node_exporter. It's replaced with the timex (adjtimex) + ntp collectors. The former collector offers a decent way to get the estimated and max time error. See https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md#0150--2017-10-06 – Eugene Chow Apr 15 '19 at 08:55
  • The BCP which was in draft when this question was first asked has been ratified as BCP 223/RFC 8633. I've updated the link to point to it. – Paul Gear Oct 27 '19 at 20:54
3

Sure, the standard approach is to use ntpq, the query utility bundled with NTP. It displays the configured servers along with their reachability, time offset and jitter. Here's an example:

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*metasntp12.admi .MRS.            1 u  274 1024  377   64.445    1.086   0.450
+cecar.ddg.lth.s 130.149.17.8     2 u  811 1024  377   48.143   -0.810   0.175
 dir.mcc.ac.uk   85.199.214.100   2 u   7d 1024    0   76.708   -1.654   0.000

Here you can see that three servers are configured. Two are okay: their reach value of 377 is octal and expands to binary 11 111 111, where 1 means a successful answer and 0 means no answer, so 377 means 100% reachability over the last 8 polls. The last one, with a reach of 0, is probably dead for some reason. Offset is the time offset in milliseconds, and jitter is the variability of that offset.
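
The reach column can also be decoded programmatically. The following is a small sketch with hypothetical helper names; it relies only on the reach value being an 8-bit register printed in octal, with the most recent poll in the lowest bit.

```python
def decode_reach(reach_field):
    """Decode ntpq's octal 'reach' column into 8 booleans, oldest poll first."""
    value = int(reach_field, 8)  # the column is printed in octal
    if not 0 <= value <= 0o377:
        raise ValueError(f"reach out of range: {reach_field!r}")
    bits = format(value, "08b")  # 8 bits, most recent poll is the last bit
    return [bit == "1" for bit in bits]


def reachability_percent(reach_field):
    """Percentage of the last 8 polls that received an answer."""
    polls = decode_reach(reach_field)
    return 100.0 * sum(polls) / len(polls)


for reach in ("377", "0", "17"):
    print(reach, format(int(reach, 8), "08b"), reachability_percent(reach))
```

A value such as 17 is what a server shows when only its last four polls have succeeded, for example shortly after it becomes reachable again.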

drookie
  • 2
    Minor correction: 377 is octal; it's 3 bits per digit, and so corresponds to binary 11 111 111, meaning 100% reachability (8 out of the last 8 polls). (They really should have encoded it as hex rather than octal, but that decision was made so long ago that it really can't be changed now.) – Paul Gear Jan 30 '19 at 01:15