TL;DR:
- Configure your NTP server according to best current practices.
- (Shameless self-promotion warning.) Use my ntpmon check if your monitoring solution uses collectd, Nagios, or telegraf.
Long version:
Configuration
The most important foundation for good NTP monitoring is good NTP configuration. To understand why, read the NTP Best Current Practices (BCP 223/RFC 8633). Here's a condensed summary of its configuration recommendations:
- Keep your NTP software up-to-date
- Use between 4 and 10 sources
- Ensure you have a diversity of reference clocks represented in those sources
- Don't allow unauthenticated remote control (should be the default on most distros)
- Use the pool responsibly (should also be the default on most distros)
- Don't mix leap-smeared and non-leap-smeared sources
- Don't use unauthenticated broadcast mode
- Don't use anycast or load-balancing when you're serving time
Where to measure
Once you have a good local configuration, the main thing to remember is that your check should query the local NTP server for its metrics, rather than trying to manually measure offset from remote servers. The major NTP servers (ntpd and chronyd) already collect all the metrics you need, so checks which compare the clock against remote servers are ignoring a lot of NTP's built-in goodness.
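As a rough illustration of querying the local daemon rather than a remote server, the sketch below parses the system variables that ntpd prints for `ntpq -c rv`. The function names and the sample text are made up for this example, and it assumes ntpd's key=value output format, with offset and rootdisp reported in milliseconds. chronyd users would need a different parser for chronyc output.

```python
import re
import subprocess

# Matches key=value pairs such as offset=0.123 or version="ntpd 4.2.8p10".
_PAIR = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_system_variables(text):
    """Turn ntpq 'rv'-style output into a dict of strings."""
    return {key: value.strip('"') for key, value in _PAIR.findall(text)}


def read_local_ntpd():
    """Ask the local ntpd for its system variables (requires ntpq on PATH)."""
    result = subprocess.run(
        ["ntpq", "-c", "rv"], capture_output=True, text=True, check=True
    )
    return parse_system_variables(result.stdout)


# Hypothetical sample output, shortened for illustration.
SAMPLE = '''associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p10", stratum=2, rootdelay=1.234, rootdisp=12.345,
offset=0.123456, sys_jitter=0.5'''

variables = parse_system_variables(SAMPLE)
offset_ms = float(variables["offset"])
rootdisp_ms = float(variables["rootdisp"])
```

On a host running ntpd, calling `read_local_ntpd()` instead of parsing `SAMPLE` should return the same kind of dict from the live daemon.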
Metric selection
So, to answer your question, the metrics you should be most interested in are:
- system offset: the calculated best guess of the local clock's offset from the one true time
- root dispersion: the calculated maximum offset of the local clock from the stratum 0 sources
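To make these two metrics concrete, here is a minimal sketch of a threshold check that returns Nagios-style states. The thresholds and function names are illustrative assumptions only, not values taken from ntpmon or from BCP 223; pick limits that match your own accuracy requirements.

```python
# Nagios-style states; a higher number is worse.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value_ms, warn_ms, crit_ms):
    """Compare the magnitude of a metric against warning/critical limits."""
    magnitude = abs(value_ms)  # offset can be negative
    if magnitude >= crit_ms:
        return CRITICAL
    if magnitude >= warn_ms:
        return WARNING
    return OK


def check_sync(offset_ms, rootdisp_ms,
               offset_limits=(10.0, 50.0),       # assumed limits, in ms
               rootdisp_limits=(100.0, 500.0)):  # assumed limits, in ms
    """Return the worst state across system offset and root dispersion."""
    return max(
        classify(offset_ms, *offset_limits),
        classify(rootdisp_ms, *rootdisp_limits),
    )
```

For instance, under these assumed limits, an offset of -15 ms with a root dispersion of 20 ms would be reported as a warning, because the offset magnitude exceeds the 10 ms warning limit.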
Monitoring
There are a few monitoring solutions for NTP; depending on what monitoring you already have in place, some might suit you better than others. I wrote an overview of these on my blog; here's a summary:
- Nagios:
  - check_ntp_peer: decent basic check; doesn’t check a wide enough variety of metrics; a little too liberal in what offsets it allows
  - check_ntp_time: not recommended; checks only the offset from a given remote NTP server
  - check_ntpd: reasonable check coverage; use it if you prefer Perl over Python
  - ntpmon's Nagios check
- collectd:
- Prometheus/InfluxDB:
  - prometheus node exporter: not recommended; checks only the offset from a given remote NTP server
  - telegraf ntpq input plugin: a direct translation of ntpq output to telegraf metrics; this is probably too detailed if you just want to know, "Is my NTP server OK?"
  - ntpmon in telegraf mode
Caveats
- The above is a summary of the state as at October 2016 when I did my alerting and telemetry review. Things may have improved since.
- ntpmon is my project, which I think overcomes the deficiencies of the checks that were available at the time. It supports both ntpd and chronyd, as well as the alerting and telemetry systems listed above.