I have one NTP server whose clock is set 7 hours in the future (the timezone was changed after the machine shipped, but the time was not). The server itself is not synchronized to anything; it only has its local clock. More than 10 clients synchronize their clocks to this server, which leaves a whole group of servers with the wrong time.

How can I change the time on the NTP server so that the correction is slewed and all clients get corrected, too? I first tested a plain fix via "date MMDDhhmm", which led the clients to disconnect from the server (the asterisk in front of the server name in ntpq disappeared).

I do not know how all the synchronized services will behave if I change the time on all servers manually by setting the clock back 7 hours, which would leave the systems with files from the future. There may be crashes, and the systems provide services for fab production.

  • This is why you set the HWCLOCK to UTC and then have system clock do the tz conversion. – dfc Oct 22 '15 at 01:18
  • If this were only 1 client that was off, then there is a way to slowly update the time. As this is the server, you will likely have to take the hit during a maintenance window that you plan with your org. Stop ntp on all the servers. Ensure your server is sync'd up with some stratum 1 servers and stable. Then use ntpdate -b4 ip.of.your.ntp.server and start ntpd on the clients. Make note of the time you did this in your internal tracking system to answer questions during your various audits as log timestamps will change. Check on clusters that are sensitive to chrono time vs. internal. – Aaron Oct 22 '15 at 02:52
  • It is even worse: the machines I am talking about are all VMs which are not host synchronized (and the hosts of course have no time sync at all). I also have no chance to connect the NTP server to another time source, because the machines are isolated from the customer network. BTW: I did not set this up, I only found the issue... ;-) Does anyone have a better idea than switching everything off? In a 24/7 environment it is not a good solution. Hopefully not the best. – Rick-Rainer Ludwig Oct 22 '15 at 04:34
  • If you learn anything from this - and it looks like it'll involve downtime - **learn not to lie to NTP**. You shouldn't advertise a machine as being an authoritative server unless it has a stratum-0 time source directly attached to it, or is synced to upstream servers that eventually are so sync'ed. – MadHatter Oct 30 '15 at 15:21
  • +1 You are right. I would not set up something like that. I am only the messenger in danger of getting shot for this. ;-) – Rick-Rainer Ludwig Oct 30 '15 at 17:22

2 Answers

When you talk about slewing the time, you are usually talking about small amounts of time. The fix is performed with a call to adjtime(), or on Linux possibly adjtimex().

From the ntpd man page:

   -x     Normally, the time is slewed if the offset is less than the step
          threshold, which is 128 ms by default, and stepped if above the
          threshold. This option sets the threshold to 600 s, which is
          well within the accuracy window to set the clock manually.
          Note: Since the slew rate of typical Unix kernels is limited to
          0.5 ms/s, each second of adjustment requires an amortization
          interval of 2000 s. Thus, an adjustment as much as 600 s will
          take almost 14 days to complete. This option can be used with
          the -g and -q options. Note: The kernel time discipline is
          disabled with this option.

I doubt, then, that you will want to wait for a 7 hour correction to happen at this speed: it would take over a year. On Linux, adjtime on a 32-bit system is effectively constrained to a delta of about 2000 seconds. 64-bit systems probably make that a non-issue, but the speed at which the change takes effect is still a concern.
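As a rough check of that estimate, the arithmetic can be sketched in Python. The only input taken from the man page is the 0.5 ms/s slew limit; the function name is made up for illustration.

```python
# Maximum slew rate quoted in the ntpd man page: 0.5 ms of correction per second.
SLEW_RATE = 0.0005

def slew_duration_days(offset_seconds, slew_rate=SLEW_RATE):
    """Days needed to slew away offset_seconds at the given rate (hypothetical helper)."""
    return abs(offset_seconds) / slew_rate / 86400

print(round(slew_duration_days(600), 1))     # the man page's 600 s example
print(round(slew_duration_days(7 * 3600)))   # the 7 hour error from the question
```

This gives roughly 13.9 days for 600 s, matching the man page, and roughly 583 days for 7 hours.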

So there's a threshold in the Linux implementation, and presumably in others, below which you get a 'slew', which is very slow. Above it, the system clocks on master and clients will be stepped, which can proceed much faster.

There will also be another threshold where if the time difference between master and client is too large, the client will assume an error and not update. From the ntpd man page:

   -g     Normally, ntpd exits with a message to the system log if the
          offset exceeds the panic threshold, which is 1000 s by default.
          This option allows the time to be set to any value without
          restriction; however, this can happen only once. If the
          threshold is exceeded after that, ntpd will exit with a message
          to the system log. This option can be used with the -q and -x
          options.

Note that the -g option is almost certainly not set for a daemon. It's usually used as ntpd -gq, run as a one-off at system start-up or manually, in which case it behaves much like ntpdate. The panic threshold is presumably configurable at compile time though, so check the man page for your OS vendor(s).

It is pretty straightforward to write a program which will make a series of time adjustments using any frequency and size of adjustment you choose. You can do this on the NTP master, and it will serve the adjusted time to its clients, but you need to know what maximum size of adjustment the client systems will accept, and what minimum threshold will cause them to perform a very slow slew. To be safe, you should survey the NTP implementations on the client systems.

If you are updating systems with characteristics similar to default ntpd on Linux without the -x option, then you could use a regime like making a half second adjustment every 5 seconds, and you'd get into sync over the course of about 3 days. Making sub-second adjustments that do not cross a second boundary might help to avoid things like triggering cron jobs twice, but expect that you'll probably find some sort of side effects.
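A minimal sketch of how such a regime could be sized, assuming the default thresholds quoted from the man page above (128 ms step threshold, 1000 s panic threshold). The function `plan_steps` and its parameters are hypothetical, and the sketch only does the planning arithmetic; it does not touch the clock.

```python
import math

def plan_steps(total_offset, step_size, interval,
               step_threshold=0.128, panic_threshold=1000.0):
    """Return (number_of_steps, total_days) for a series of small clock steps.

    The thresholds default to the ntpd man page values; real clients may
    differ, so treat them as assumptions to verify.
    """
    if not step_threshold < step_size < panic_threshold:
        raise ValueError("step size must lie between the step and panic thresholds")
    steps = math.ceil(abs(total_offset) / step_size)
    return steps, steps * interval / 86400

# Half a second every 5 seconds, for a 7 hour error.
steps, days = plan_steps(7 * 3600, step_size=0.5, interval=5)
print(steps, round(days, 1))
```

For the 7 hour error this works out to 50400 adjustments spread over about 2.9 days.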

If you wind up in a situation where your servers are no longer all in sync with each other, then it gets messier. If feasible, I'd want to monitor the time differences, and if some servers are no longer following along, automatically stop the periodic updates and raise an alert.
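The stop condition could look something like the following sketch. How the client clock readings are collected is left out, and the function names and the 1 second tolerance are assumptions chosen for illustration.

```python
def lagging_clients(master_time, client_times, tolerance=1.0):
    """Names of clients whose clock differs from the master by more than tolerance seconds."""
    return sorted(name for name, t in client_times.items()
                  if abs(t - master_time) > tolerance)

def should_continue(master_time, client_times, tolerance=1.0):
    """True while every client is still following the master."""
    return not lagging_clients(master_time, client_times, tolerance)

# Hypothetical readings, in seconds since the epoch.
readings = {"vm1": 1000.2, "vm2": 1000.4, "vm3": 1007.9}
print(lagging_clients(1000.0, readings))   # clients that fell behind
print(should_continue(1000.0, readings))   # whether to keep stepping
```

With these readings only vm3 is reported, so the adjustment loop would stop and an alert could be raised.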

mc0e
  • "-g option is almost certainly not set for a daemon" Bold claim, bolder disagreement. `-g` is great for operating ntp daemons. It lets ntp start and figure things out on its own without manual intervention. Why would I want ntp to not start? Same answer if you are managing more than 3 or 4 servers? – dfc Nov 02 '15 at 04:05
  • I granted the +50 bounty. It is not the answer I expected, but a good explanation why the wanted solution is not possible. Thanks! – Rick-Rainer Ludwig Nov 02 '15 at 12:02
  • @dfc: There are devices for which that's appropriate, but hopefully you wouldn't be running production web apps on them. NTP is mostly for ongoing minor tuning. If you are 1000 seconds out, someone should be looking into what went wrong, and the error might be upstream not local. – mc0e Nov 02 '15 at 14:02
As you know, the clients will remain synchronised if the clock change is within a small interval. On some systems this is as little as five minutes. Yours may be 10 minutes. You can jump the clock within that interval and the clients will slew to keep track.

I can see four options:

  1. Do nothing, and live with the incorrect time indefinitely.

  2. Reset the clock by four minutes (or nine minutes if you've got a 600 second interval) and repeat ad nauseam during the year that mc0e has calculated is necessary. You would really want to do this with a script. Allow for the time being incorrect for much of this year. Take copious notes of the time offset to correlate against production reports.

  3. Take the servers down for a seven hour maintenance period (Christmas Day, anyone?) and fix all the clocks properly, in one sitting.

  4. Jump the clocks and ensure that everyone knows there will be a seven hour reporting overlap. However, these same people should already know that the production times are off by seven hours, so you may find this is acceptable. (Obviously I don't know what impact this would have on your fab processes.)

None of the solutions is ideal. If production reporting times are important then option 2 is probably the worst of a bad bunch.

roaima
  • I agree that a 'slew' probably isn't the best option, but there's a difference between a slew and a series of small steps, which can happen quicker. I'll clarify this a bit in my answer. – mc0e Oct 30 '15 at 14:41
  • @mc0e I totally agree with you. However, without knowing the implications of stepping the time backwards in the OP's scenario I can only recommend it with caution. (If it was stepping forwards it would be so much easier.) – roaima Oct 30 '15 at 14:52
  • Agreed. Various things could happen twice. We certainly don't have the info to make that assessment, but your word of caution is well placed. – mc0e Oct 30 '15 at 15:06
  • The whole situation is messed up. I talked to some people and it will be either solution 3 or 4. Solution 4 has the issue that we do not know how robust the services are. There is a lot of software involved of questionable quality. :-( – Rick-Rainer Ludwig Oct 30 '15 at 17:24