How come one of my switches is off by two minutes in spite of ntp?

Question

I just noticed by pure chance that one of my Cisco 4500 switches has its clock going wrong: it is more than 2 minutes behind in spite of seemingly functional ntp. In my opinion, even a single second should not be considered acceptable for the systems involved. Also, I wouldn't have noticed the difference from diagnostics, had I not compared it to a simple wall-clock.

Some details

Here's ntp information for some of my hosts (10.0.99.1, 10.0.99.2, 10.0.1.119, 10.0.99.241) that partly reference one another for fallback, but should all ultimately be syncing with 10.0.0.1, which in turn pulls the time from outside. So the time discrepancy cannot result from different original time sources. As the observations made me somewhat paranoid, "has correct time" in the following means: show clock (or date) produced an output that matches my wall-clock and my local system clock (which is fine according to http://time.is) with an error certainly below 1 second (the accuracy of me hitting ENTER while watching my local clock).

10.0.1.119 (Ubuntu) has correct time

$ ntpq -np
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+10.0.99.1       10.0.0.1         3 u  855 1024  377    0.904   -2.658   0.113
*10.0.0.1        130.149.17.8     2 u  266 1024  377    0.253    0.909   0.127

10.0.99.241 (Cisco 2960) has correct time

#sho ntp associations 

  address         ref clock       st   when   poll reach  delay  offset   disp
*~10.0.99.1       10.0.0.1         3     28     64   377  1.462  85.288 19.758
+~10.0.99.2       10.0.1.119       4     29     64   377  1.297  83.515  5.369
 * sys.peer, # selected, + candidate, - outlyer, x falseticker, ~ configured

10.0.99.2 (Cisco 4500) has correct time

#sho ntp associations 

  address         ref clock       st   when   poll reach  delay  offset   disp
+~10.0.99.1       10.0.0.1         3      6   1024   111  1.148  -1.618 42.875
*~10.0.1.119      10.0.0.1         3     31   1024   377  0.043   1.687  1.064
 * sys.peer, # selected, + candidate, - outlyer, x falseticker, ~ configured

10.0.99.1 (Cisco 4500) lags behind by about 2 minutes 6 seconds

#sho ntp associations 

  address         ref clock       st   when   poll reach  delay  offset   disp
*~10.0.0.1        130.149.17.8     2    274   1024   377 15.625   3.681 30.403
+~10.0.99.2       10.0.1.119       4    415   1024   376 15.625   0.855 33.276
 * sys.peer, # selected, + candidate, - outlyer, x falseticker, ~ configured

#sho ntp status 
Clock is synchronized, stratum 3, reference is 10.0.0.1      
nominal freq is 250.0000 Hz, actual freq is 249.9988 Hz, precision is 2**6
reference time is DAD8B428.54C6BAEA (20:36:24.331 MESZ Sat May 7 2016)
clock offset is 3.6818 msec, root delay is 32.80 msec
root dispersion is 71.74 msec, peer dispersion is 30.40 msec
loopfilter state is 'CTRL' (Normal Controlled Loop), drift is 0.000004720 s/s
system poll interval is 1024, last update was 683 sec ago.
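
As a cross-check on the output above, two of its fields can be recomputed by hand. This is a minimal Python sketch, assuming the reference time is a standard 64-bit NTP timestamp (32 bits of seconds since 1900-01-01 UTC, 32 bits of binary fraction) and that MESZ is UTC+2; the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# NTP timestamps count seconds from 1900-01-01 00:00:00 UTC.
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def decode_ntp_timestamp(hex_stamp: str) -> datetime:
    """Decode 'SSSSSSSS.FFFFFFFF' (hex seconds and hex fraction) to a UTC datetime."""
    seconds_hex, fraction_hex = hex_stamp.split(".")
    seconds = int(seconds_hex, 16)
    fraction = int(fraction_hex, 16) / 2**32  # fraction of one second
    return NTP_EPOCH + timedelta(seconds=seconds,
                                 microseconds=round(fraction * 1_000_000))


def relative_frequency_error(nominal_hz: float, actual_hz: float) -> float:
    """Relative difference between nominal and actual tick frequency."""
    return (nominal_hz - actual_hz) / nominal_hz


if __name__ == "__main__":
    mesz = timezone(timedelta(hours=2))  # assumption: MESZ = UTC+2
    ref = decode_ntp_timestamp("DAD8B428.54C6BAEA")
    # Should agree with the "reference time is ..." line of sho ntp status.
    print(ref.astimezone(mesz))

    err = relative_frequency_error(250.0, 249.9988)
    print(f"frequency error: {err * 1e6:.2f} ppm, "
          f"about {err * 86400:.2f} s per day if left uncorrected")
```

The decoded timestamp matches the 20:36:24.331 MESZ shown by the switch, and the frequency difference comes out at roughly 4.8 ppm, close to the reported drift of 0.000004720 s/s. In other words, these fields are internally consistent, and none of them reveals the two-minute error in the displayed clock.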

Questions

How come 10.0.99.1 is so far off?
How come systems that sync to 10.0.99.1 are correct?
How can I tell from the output of sho ntp status on 10.0.99.1 that the clock is actually out of sync (compared to all hosts and reference clocks mentioned in sho ntp asso)? To me the output reads like a very elaborate "I am totally happy".

EDIT: By popular demand, the output of sho clock detail

10.0.99.1

#sho clock detail 
13:06:38.605 MESZ Tue May 10 2016
Time source is NTP
Summer time starts 02:00:00 MEZ Sun Mar 27 2016
Summer time ends 03:00:00 MESZ Sun Oct 30 2016

10.0.99.2

#sho clock detail 
13:10:54.083 MESZ Tue May 10 2016
Time source is NTP
Summer time starts 02:00:00 MEZ Sun Mar 27 2016
Summer time ends 03:00:00 MESZ Sun Oct 30 2016

I can't spot any system in the choice of IP addresses you have configured as ntp servers on each device. And I spot a loop as well as a couple of devices using each other as ntp servers. I believe in those cases you are supposed to specify them as ntp peers rather than servers, though I must admit that I don't know exactly what difference it makes whether you specify a host as peer or server. Also, I am not convinced it is a good idea to let everything synchronize through a single host (`10.0.0.1`). But I don't think any of my observations can directly explain the cause of your current problem. — kasperd, May 07 '16 at 21:40
One glaring problem with your ntp configuration is that each host is configured with *the worst possible number of time sources.* "A man with one watch knows what time it is, a man with two watches is never sure..." Any other number is better than two, four is probably the best choice, it gives a cushion if one is unavailable and still leaves three sources. — dfc, May 08 '16 at 04:52
What does `show clock detail ` show on the problem switch and on one of the synchronised switches? — Paul Haldane, May 08 '16 at 08:40
Interesting oddity: 99.2 polling 99.1 shows a reach of 111, which means it's losing a bunch of NTP packets. (octal 111 = binary 1001001, so ⅔ of the polls are being lost). Also interesting, 99.2 is getting correct time from 99.1 — derobert, May 09 '16 at 18:33
@derobert This is interesting in itself. Actually, I now observed a transition from 110 to 111 (shouldn't it have gone to 221?) — Hagen von Eitzen, May 10 '16 at 14:03
Your whole NTP configuration needs to be reconsidered. You need to work with stratum levels. As @kasperd pointed out, you could have a problem with a loop. You should only synchronize to servers with a lower stratum level, and those at the same stratum level could be peered, but not use each other as servers. Peered devices still need one or more servers at a lower stratum level as authoritative source(s), but will try to align themselves to other peers. Don't use busy devices (e.g. core switches) as NTP servers. — Ron Maupin, May 10 '16 at 15:03
Something very odd is going on. All the ntp output is reasonably normal and shows good sync. Yet your command to get the time from the device gave a time that's way off. That suggests that for some reason, the device with the time that's off is not setting its system clock from its ntp subsystem. — David Schwartz, May 10 '16 at 16:18
Set up at least one system to get the time from at least three outside servers. Then use only that system as source for the other systems. — Martin Schröder, May 11 '16 at 22:49
@MartinSchröder This tip seems unrelated to the problem at hand. My one system to get time from outside servers is 10.0.0.1 (showing an external refid above). Even if I remove the peering with 10.0.99.2 from 10.0.99.1, it will still sync to 10.0.0.1, it will still tell me that it is in sync, and it will still be inexplicably off by two minutes — Hagen von Eitzen, May 12 '16 at 06:07
It really sounds like you've found a bug, and probably the only way forward is to reboot it and hope it goes away or to contact Cisco. — derobert, May 14 '16 at 06:53
NTP itself is definitely doing the right thing as DavidSchwartz said, and this probably is a bug, per derobert's comment. Whilst having only a single internal time source is non-ideal, I would still suggest following @MartinSchröder's advice and using 10.0.0.1 (which appears to be synced with external sources) as your main time source. The other sources aren't adding any value. — Paul Gear, Jun 08 '16 at 20:24
Stupid question, did you try querying all the ntp servers with something like "ntpdate -d ip.addr.here"? I bet one of them is off, but I didn't read your question or the followup comments very closely. Probably the 130.* one specifically. Test it! -d will prevent time from being set. — Some Linux Nerd, Jun 23 '16 at 01:25
@SomeLinuxNerd Nope. With all IPs mentioned in the question, I get offsets in the millisecond range at most — Hagen von Eitzen, Jun 23 '16 at 06:14
Hm, so much for that theory. If you ever figure it out, can you post a followup? — Some Linux Nerd, Jul 11 '16 at 22:03
Check these symptoms: - The switch powers down and stays down for a few minutes to a few days for no clear reason. - The output-fail LEDs on the power supplies are red and no power is delivered to the chassis. The other LEDs on the power supplies are green. - The Status LEDs on the switching modules and the supervisor engine are flashing green. - CPU Utilization LEDs are flashing green or off. — Jose Raul Barreras, Jul 30 '16 at 18:12
Hate to state the obvious but I didn't see it mentioned here, have you checked for firmware updates or filed a trouble ticket with Cisco support? — htm11h, Aug 01 '16 at 13:24
@htm11h I'll look into that - after all `Uptime for this control processor is 3 years, 41 weeks, 4 days, 12 hours, 39 minutes` suggests that some update might be available :) — Hagen von Eitzen, Aug 01 '16 at 20:15
I have seen flaky behavior like this where a firmware update immediately fixes the issue. I had a PowerConnect with date time issues a few years back and it was fixed by firmware update. — htm11h, Aug 01 '16 at 20:28
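
The reach values discussed in the comments above can be decoded mechanically. This is a minimal Python sketch, assuming the usual NTP convention that reach is an 8-bit shift register printed in octal, shifted left once per poll, with the lowest bit set when a reply arrives; the helper names are hypothetical.

```python
def reach_bits(reach_octal: str) -> str:
    """Return the reach register as 8 bits, oldest poll on the left."""
    return format(int(reach_octal, 8) & 0xFF, "08b")


def answered_polls(reach_octal: str) -> int:
    """Count how many of the last eight polls received a reply."""
    return reach_bits(reach_octal).count("1")


def next_reach(reach_octal: str, reply_received: bool) -> str:
    """Apply one poll to the register: shift left, then record the result."""
    value = (int(reach_octal, 8) << 1) & 0xFF
    if reply_received:
        value |= 1
    return format(value, "o")


if __name__ == "__main__":
    for reach in ("377", "376", "111"):
        print(reach, reach_bits(reach), answered_polls(reach), "of 8 answered")
    # One successful poll applied to a register that currently reads 110:
    print(next_reach("110", reply_received=True))
```

Under this convention a reach of 111 means 3 of the last 8 polls were answered, 376 means only the most recent poll was missed, and a successful poll after 110 would indeed be expected to yield 221 rather than 111.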

score 2 · Answer 1 · edited Jan 02 '17 at 02:56

I am a bit reluctant to post this as an answer because the original cause is still unclear. Nevertheless, the problem seems to be solved - at least for the moment.

Following the comments made by htm11h, I decided to update the firmware. And indeed, now that I am running with a newer firmware, the clock seems to match the correct time.

But does that mean the new firmware was the solution? Unfortunately, no. In my first attempt to load the new firmware, I forgot to change the config register, which was still on its factory default. Therefore, my first reboot ended up in the same original ROM image the switch had been running for almost four years (i.e., since its initial power-on). And yet, this was sufficient for the clock to make one huge adjustment and then stay in sync. This suggests that a mere reboot might have helped, at least temporarily. In turn, this means that the now correct time shown with the newer firmware may still drift away from ntp time over the years to come. It will take a few days until I can safely tell whether or not the clock loses about 5 seconds per day ...

For now, the case is closed.

score 1 · Answer 2 · answered Dec 27 '16 at 22:55

I've done quite a bit of work with the NTP Pool project since the mid-90's and run several GPS-synced Stratum-1 NTP servers here. As others have stated, you need more than 2 servers to get time from. I usually use 4 here for the reasons stated by Ron Maupin above. Also, as noted, you need to look out for loops and for setting things up as servers vs. peers.

The time drift could be due to a known bug in IOS, fixed in this IOS update, where the ntp.drift is not deleted or updated correctly, which would explain the drift. Also, 4 YEARS with no reboot or update must have left you in a pretty bad spot security-wise, as IOS security updates come out fairly frequently.

Here's an excellent post on setting up NTP on Cisco IOS http://packetlife.net/blog/2011/mar/28/cisco-ios-clocks-and-ntp/

Hope this is helpful. Please ask if you have more questions or issues.

score 0 · Answer 3 · answered Jan 01 '17 at 23:49

Full disclosure: I've only occasionally fiddled with switch configs at all, and I'm not by any means an NTP expert.

That said, I used to see the NTP daemon on RHEL 5.x systems (yes, I'm going back, but you did say your switch had a ~4 year old image...) get stuck in a "happy" state, where it seemed to think it was perfectly synchronized but was clearly not. We would use a ClusterSSH session to run "date" on all of the systems simultaneously, and that would sometimes show as much as 5 minutes of drift between systems. If I recall correctly, we could only seem to fix the problem by restarting the daemon, and ultimately just made cron restart the service every night...

Not by any means an ideal solution, but you might be able to adopt a similar approach with a cron job to connect to the switch and initiate a reboot, or somehow "kick" the NTP daemon on the switch?
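
Before resorting to scheduled reboots, a periodic check could at least detect the condition. The sketch below covers only the comparison step. It assumes a `show clock` line in the format shown in the question, a fixed UTC offset for the zone name (MESZ taken as UTC+2), and a trusted reference time obtained elsewhere; how the line is fetched from the switch is left out, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_show_clock(line: str, utc_offset_hours: float) -> datetime:
    """Parse e.g. '13:06:38.605 MESZ Tue May 10 2016' into an aware datetime."""
    # IOS can prefix the time with '*' or '.' when the clock is not
    # considered authoritative or synchronized; drop that marker.
    fields = line.strip().lstrip("*.").split()
    time_part, _zone, _weekday, month, day, year = fields
    # %b expects English month abbreviations (the default C locale).
    naive = datetime.strptime(f"{time_part} {month} {day} {year}",
                              "%H:%M:%S.%f %b %d %Y")
    return naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))


def clock_error_seconds(line: str, reference: datetime,
                        utc_offset_hours: float = 2) -> float:
    """Device time minus reference time; negative means the device lags."""
    return (parse_show_clock(line, utc_offset_hours) - reference).total_seconds()


if __name__ == "__main__":
    # Hypothetical reference instant, chosen to reproduce the 2 min 6 s lag.
    reference = datetime(2016, 5, 10, 11, 8, 44, 605000, tzinfo=timezone.utc)
    error = clock_error_seconds("13:06:38.605 MESZ Tue May 10 2016", reference)
    if abs(error) > 1.0:
        print(f"clock is off by {error:+.3f} s despite NTP claiming sync")
```

The measured error also includes the time it takes to fetch the line from the switch, so a threshold of about a second is the best such a check can realistically offer.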

Hope this helps!