During testing of our Red Hat Cluster Suite configuration, NTP stepped the time by 16 seconds, and soon afterwards the cluster software locked up.

ntpd[30917]: time reset -16.332117 s

I need to reproduce the failure to make sure it was not just a coincidence. My plan is to get NTP to step the time back repeatedly until either (a) I give up or (b) the cluster hangs again.

If ntpd uses the same mechanism as /bin/date to set the time, then this is easy. If it uses a different mechanism and I need to trick ntpd into stepping the clock, then I am stuck.

What is the easiest way to do this testing?

Martin

1 Answer

If you want to test how a system reacts to time changes, use date to mess with the clock. This is probably sufficient for your case: my instinct says the cluster software doesn't like time() going backwards.

If you want to test how a system reacts to time changes initiated by ntpd, set up an NTP server, synchronize the cluster nodes to it, then change the time on the NTP server (and let the client daemons do the right thing).
This isn't a huge amount of effort, so it's probably worth doing anyway.

Backward jumps are usually more likely to cause problems than forward jumps, but both should be tested for completeness.
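The first approach (stepping the clock with date) can be sketched roughly as below. This is only an illustration, not the script used in the thread: the variable names STEP, PAUSE, ROUNDS and DRY_RUN and the helper functions are made up for the example, and it assumes GNU date, which accepts `-s @SECONDS` to set the clock from an epoch value. By default it only prints what it would do. Setting DRY_RUN=0 really changes the clock and needs root.

```shell
#!/bin/sh
# Sketch: step the system clock forward and back by a fixed offset.
# STEP, PAUSE, ROUNDS and DRY_RUN are illustrative names, not from the thread.
STEP=${STEP:-20}        # seconds per jump
PAUSE=${PAUSE:-30}      # seconds to wait between jumps (real runs only)
ROUNDS=${ROUNDS:-3}     # number of forward/backward cycles
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = really set the clock

step_clock() {
    # $1 is a signed offset in seconds, e.g. 20 or -20
    target=$(( $(date +%s) + $1 ))
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: date -s @$target  (offset $1 s)"
    else
        date -s "@$target" >/dev/null   # GNU date: @N is seconds since the epoch
    fi
}

pause() {
    # Give the cluster time to react between jumps; skipped in a dry run
    [ "$DRY_RUN" = "1" ] || sleep "$PAUSE"
}

i=0
while [ "$i" -lt "$ROUNDS" ]; do
    step_clock "$STEP"      # jump forward
    pause
    step_clock "-$STEP"     # jump back
    pause
    i=$(( i + 1 ))
done
```

If ntpd is still running on the node it will try to correct the clock between jumps, so it is probably simplest to stop it for the duration of the test.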

voretaq7
  • I followed your advice and set up a script on one node of a 3 node cluster to step time 20 seconds forwards and backwards repeatedly. This totally hosed the _whole_ cluster in short order. Thanks. – Martin Jul 15 '10 at 15:15
  • Solution-wise you'll probably have to bug Red Hat for a fix, but you can probably work around this issue by telling ntpd to never step the clock (`-x` option). Cluster software is often predicated on monotonic time (clock never runs backwards), and this will force ntpd to slew the clock rather than step it. The end result is time may pass slightly slower or faster, but the value of time() and friends will always be increasing. – voretaq7 Jul 15 '10 at 18:55
  • I did some more testing, just using /bin/date to alter the time. Forward jumps of 20 seconds cause an instant recovery event: forward jumps probably cause timeouts to expire prematurely. Backward jumps also cause recovery to kick off, but it takes a lot longer to happen. I haven't managed to completely crash the cluster, but these simple trip-ups do not inspire confidence. – Martin Jul 17 '10 at 10:56
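
For the `-x` workaround mentioned in the comments, a minimal sketch of the change on a Red Hat style system might look like this. It assumes the init script reads its flags from /etc/sysconfig/ntpd. The other flags shown are placeholders for whatever that file already contains, so only the added `-x` matters here.

```shell
# /etc/sysconfig/ntpd  (assumed location; keep whatever flags are already set)
# -x tells ntpd to slew instead of step for offsets up to 600 seconds
OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"
```

ntpd has to be restarted for the new options to take effect. Note that slewing is slow: ntpd corrects at no more than 500 ppm, so an offset like the 16 seconds in the question would take roughly nine hours to work off.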