8

I saw a strange system time changing behavior in some (hardware) servers: in /var/logs/syslog, the date time preceding each log message sometimes change to a random one and gets back to normal in the next message, like the following:

Feb 22 2018 09:09:30 ...
Feb 22 2018 09:09:32 ...
Jan 13 2610 15:37:42 ...
Feb 22 2018 09:09:33 ...
Feb 22 2018 09:09:34 ...

As in the example, the sudden change of date time can be as far as hundreds of years away.

I can confirm that the log messages having the strange time stamps does not come from any specific process - it just can happen randomly for every one.

And duration between 2 abnormal time changes varies between a few minutes to a few hours (however, I suspect the abnormal time changes could happen more frequently but many of them are not revealed in the syslog, since it is not writing logs every second).

Also, since it happens on more than one server, I assume it is not a hardware problem.

More info about the severs: they are an openstack installation with one controller and a few compute nodes. Each server has ntp service running. The controller is configured to take time from its own hardware clock, and the compute node servers sync time from the controller. Note that each server have abnormal time changes at its own pace - looks like the "wrong time" is not synchronized from the controller through ntp.

I was suspecting the guest systems (virtual machines) on compute nodes could affect their host system time. But this can not explain why the controller has the same problem while not running any virtual machine.

I need a method to detect: who changed the system time and how does it happen?

RalfFriedl
  • 3,008
  • 4
  • 12
  • 17
Zhaohui Yang
  • 111
  • 5
  • 2
    Can you show the output of a `hwclock` loop? Something as: `while true; do hwclock; sleep 5; done` – shodanshok Aug 09 '18 at 08:27
  • each server has ntp service running: as client or as server? via systemd or outside of systemd via "old" ntp service? for me this looks like a providing ntp time issue. we had this issue that we did write logfiles before our time was synced (before having network connectivity, resulting in jumps of timestamps) systemd has a target you might want to rely on systemd[1]: Time has been changed systemd[1]: Reached target System Time Synchronized. – Dennis Nolte Aug 14 '18 at 10:27
  • it looks like some date fetching is running as a cron and doesn't have very good time checking. Find it, remove it and replace with ntpd which doesn't respond to large time drifts. – danblack Aug 18 '18 at 06:11
  • We have new findings and found the problem can be narrowed down to CRON messages being delayed in syslog. So I posted [another question](https://serverfault.com/questions/927278/cron-messages-being-delayed-for-arbitrary-long-time-in-syslog) . Please take a look there. – Zhaohui Yang Aug 21 '18 at 04:25
  • @shodanshok Thanks and we have been tracking software and hardware time per second for some time, using similar scripts as yours. And can confirm that we see no system time change at all while "wrong time"s still appear in syslog. Please refer to my [other question](https://serverfault.com/questions/927278/cron-messages-being-delayed-for-arbitrary-long-time-in-syslog). – Zhaohui Yang Aug 21 '18 at 04:43
  • 3
    Maybe this is your error: [Unexplainable time jumps in CRON](https://bugs.launchpad.net/ubuntu/+source/rsyslog/+bug/1429427) it was patched in **rsyslog - 7.4.4-1ubuntu2.7**. – Stone Sep 03 '18 at 12:50

2 Answers2

1

This script will tell you when a time drift occurs and the difference in the process tree, and this should help identify this if it is caused by a process changing the system time. It will print to the terminal as well as log into timedrift.log inside the current working directory.

#!/bin/bash

oldTime="$(date +%s)"
oldPsOutput="$(ps faux)"
while true; do
  sleep 1;
  currentTime="$(date +%s)"
  oldTimeplusfive="$((($oldTime+5)))"
  currentPsOutput="$(ps faux)"
  if [[ "$currentTime" -lt "$oldTime" ||  "$currentTime" -gt "$oldTimeplusfive"  ]]
  then
    (
        echo -e '\n\n======================='
        echo "currentTime=$currentTime oldTime=$oldTime oldTimeplusfive=$oldTimeplusfive"
        echo '-----------------------'
        echo "$oldPsOutput"
        echo '::::::::::::::::::::::::::'
        echo "$currentPsOutput"
    ) | tee -a timedrift.log
  fi
  oldPsOutput=$currentPsOutput
  oldTime=$currentTime
done

Credit to the original script in the Unexplainable time jumps in CRON bug that Stone mentioned as a comment.

Can you also comment as if you are using rsyslog and if so what version? Do you see it outside the realm of rsyslog (ie. apache logs, etc). This bug looks simmlar, and it would nice to confirm it or rule it out either way.

Citizen Kepler
  • 222
  • 1
  • 4
0

Actually this is a duplicate of @Stone 's comment. Just make it clear to everyone this do have an answer.

In short, there is a bug in the version of rsyslog I'm using. Which will delay syslog message it received for arbitrary length of time. Bug report is here. And upgrading rsyslog solved the problem. It is not the kernel or CRON's fault.

Zhaohui Yang
  • 111
  • 5