10

Resolved: The problem was Hyper-V on that machine. I removed Hyper-V, installed VMware Server, ran the same VM. Time sync issues went away (< 100ms difference after a day).


My setup is like this:

HYV1 - HyperV machine (non domain) - sync irrelevant
AD1  - VM AD server on HYV1, sync'd to time.nist.gov. HyperV time sync off.
S1   - Physical machine, sync'd to domain. 
S2   - Physical machine running HyperV, sync'd to domain.
V1   - Linux VM machine on S2, sync'd to AD1. No HyperV integration.

AD1 and S1 have fine sync -- stripchart shows less than 100ms difference.

S2 drifts like crazy. Here's a bit of the stripchart against AD1:

18:33:22 d:+00.0010138s o:+05.4101899s 
18:33:24 d:+00.0010138s o:+05.4319765s 
18:33:26 d:+00.0000000s o:+05.4788429s 
18:33:28 d:+00.0000000s o:+05.6089942s 
18:33:30 d:+00.0010138s o:+05.7240269s 
18:33:32 d:+00.0000000s o:+06.0421911s 
18:33:34 d:+00.0081104s o:+06.5613708s 
18:33:37 d:+00.0000000s o:+06.9096594s 
18:33:39 d:+00.0000000s o:+06.8867838s 
18:33:41 d:+00.0010127s o:+06.8936401s 

In 20 seconds, it drifted over a second. If I manually reset it to within 1s, within a few minutes it'll be back drifting about 2 seconds. Overnight it went from ~2s to ~5s. The Linux VM inside S2 has perfect sync with AD1.
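
To put a number on that drift, the first and last stripchart samples above can be turned into a rate. This is only a rough sketch using the two endpoint values quoted in the output; the helper name `drift_rate` is made up for illustration.

```python
from datetime import datetime

def drift_rate(t0, off0, t1, off1):
    """Return the change in offset per elapsed second between two samples."""
    fmt = "%H:%M:%S"
    elapsed = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
    return (off1 - off0) / elapsed

# First and last lines of the stripchart against AD1 (19 seconds apart).
rate = drift_rate("18:33:22", 5.4101899, "18:33:41", 6.8936401)
print(f"{rate:.4f} s/s, about {rate * 1e6:.0f} ppm, {rate * 60:.1f} s per minute")
```

That works out to roughly 0.078 seconds of drift per second (about 78,000 ppm). For comparison, an ordinary PC clock is usually off by tens of ppm.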

Here's the config:

C:\Users\mgg>w32tm /dumpreg /subkey:Parameters

Value Name                 Value Type          Value Data
------------------------------------------------------------

ServiceDll                 REG_EXPAND_SZ       %systemroot%\system32\w32time.dll
ServiceMain                REG_SZ              SvchostEntry_W32Time
ServiceDllUnloadOnStop     REG_DWORD           1
Type                       REG_SZ              NT5DS
NtpServer                  REG_SZ              ad01.mydomain ad02.mydomain


C:\Users\mgg>w32tm /dumpreg /subkey:Config

Value Name                Value Type          Value Data
-----------------------------------------------------------

FrequencyCorrectRate      REG_DWORD           4
PollAdjustFactor          REG_DWORD           5
LargePhaseOffset          REG_DWORD           50000000
SpikeWatchPeriod          REG_DWORD           900
LocalClockDispersion      REG_DWORD           9
HoldPeriod                REG_DWORD           5
PhaseCorrectRate          REG_DWORD           1
UpdateInterval            REG_DWORD           30000
EventLogFlags             REG_DWORD           2
AnnounceFlags             REG_DWORD           5
TimeJumpAuditOffset       REG_DWORD           28800
MinPollInterval           REG_DWORD           2
MaxPollInterval           REG_DWORD           8
MaxNegPhaseCorrection     REG_DWORD           -1
MaxPosPhaseCorrection     REG_DWORD           -1
MaxAllowedPhaseOffset     REG_DWORD           300
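
Several of these values are not plain seconds, which makes the dump easy to misread. The sketch below converts a few of them, assuming the units documented for W32Time: poll intervals are powers of two in seconds, `LargePhaseOffset` is in 100 ns units, and `MaxAllowedPhaseOffset` is in seconds. The dictionary simply copies values from the dump above.

```python
# Values copied from the "w32tm /dumpreg /subkey:Config" output above.
config = {
    "MinPollInterval": 2,
    "MaxPollInterval": 8,
    "LargePhaseOffset": 50000000,
    "MaxAllowedPhaseOffset": 300,
}

# Assumption: poll intervals are stored as log2(seconds).
min_poll_s = 2 ** config["MinPollInterval"]
max_poll_s = 2 ** config["MaxPollInterval"]

# Assumption: LargePhaseOffset is expressed in units of 100 ns (1e-7 s).
large_phase_s = config["LargePhaseOffset"] / 10_000_000

print(f"poll interval: {min_poll_s} s to {max_poll_s} s")
print(f"LargePhaseOffset: {large_phase_s} s")
print(f"MaxAllowedPhaseOffset: {config['MaxAllowedPhaseOffset']} s")
```

Under those assumptions, S2 should be polling at least every 256 seconds, and the offsets of 5.4 s and up in the stripchart are already larger than the 5 s `LargePhaseOffset`.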

I looked at the event log, and apart from warnings about sync (after it gets way out of sync), there's no other warnings.

How can I go about troubleshooting this? It's the only machine that is having this problem. All the other machines (physical and virtual) are doing fine.

Edit: To clarify: The VM (AD1) has integration turned off and syncs to time.nist.gov. AD1 is fine. It's the physical machine S2 that can't sync to AD1 and drifts all over. All the other physical servers are able to sync to AD1 just fine.

Update: So, it appears to be an issue of running the VM. The clock slips slowly with the VM off. Turned on, it immediately starts losing seconds. I set the VM to only use half the resources, and that seems to have slightly mitigated it, for now. Thanks!

MichaelGG

7 Answers

5

From your description, it sounds like there is an actual hardware problem with the RTC (http://en.wikipedia.org/wiki/Real-time_clock) on the motherboard of server S2.

The Hyper-V guest gets its clock initially from the host (HYV1), but as you have Hyper-V time sync disabled, it gets all further clock updates from NIST (which is working fine). Your Linux VM is not integrated with Hyper-V, so it is getting its time from the domain, which is also working fine. Your other physical machines are working fine; it is just a single physical server that is having 1 second of drift every 20 seconds (which is a crazy amount of drift). The time is drifting much quicker than the network time sync can reset the clock to the right time (which, if I recall correctly, takes place every 8 hours).

If you want to rule out Hyper-V as a cause for the error on S2, create a "no Hypervisor" boot entry, reboot without Hyper-V, and see if the time drift persists. Instructions here: http://blogs.msdn.com/virtual_pc_guy/archive/2008/04/14/creating-a-no-hypervisor-boot-entry.aspx

-Sean

Sean Earp
  • OK I'll try that out. – MichaelGG May 20 '09 at 00:13
  • OK, I shut down the VM (didn't disable HyperV). Clock is much better now. After about 3 minutes, it's only lost about 100ms. It's still losing, but much less than before. As soon as I turn on the VM, it goes nuts. It lost 1 second in a few seconds. Maybe because the VM doesn't have integration services? – MichaelGG May 20 '09 at 09:15
  • Michael- This may seem out of left field here, but are you running any sort of multimedia application on the parent partition of S2? -Sean – Sean Earp May 21 '09 at 21:51
  • Nope. Problem ended up being Hyper-V. Took off Hyper-V, put on Vmware Server, ran the same VM -- no problems. Time sync is < 100ms. – MichaelGG May 22 '09 at 16:52
3

The problem is with the virtual implementation of the various clock sources (tsc, jiffies, acpi_pm, cmos_trc). The best way I have found to fix this problem with HyperV is to turn off the HyperV provided clock sync for your guest machine, then use adjtimex to adjust the time. On an Ubuntu guest OS do this...

# rm /var/log/clocks.log
# /etc/init.d/ntp-server stop
# ntpdate ntp.ubuntu.com
# hwclock -u --systohc
# adjtimex -l -u -h ntp.ubuntu.com

and answer No to both questions

# while [ /bin/true ] ; do yes | adjtimex -l -u -h ntp.ubuntu.com ; sleep 60 ; done

leave that to run for a few hours to calibrate, hit Ctrl-C to exit it.

# adjtimex -r -a -u -h ntp.ubuntu.com

this will do a least squares analysis of your clock and will find the right adjustment

# ntpdate ntp.ubuntu.com
# hwclock -u --systohc
# /etc/init.d/ntp-server start

this will resync the time on your machine and ntp should then be able to keep it in sync because it shouldn't drift too much anymore.
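
To show what a least squares fit of clock drift looks like, here is a small sketch. It is not adjtimex's actual code: the sample data is hypothetical and the function name `least_squares_slope` is made up. It fits a straight line through (elapsed time, measured offset) pairs and reports the slope as a drift rate in ppm.

```python
def least_squares_slope(times, offsets):
    """Slope of the best-fit line through (time, offset) points."""
    n = len(times)
    mean_t = sum(times) / n
    mean_o = sum(offsets) / n
    num = sum((t - mean_t) * (o - mean_o) for t, o in zip(times, offsets))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den

# Hypothetical samples: seconds since the first check, clock error in seconds.
times = [0, 60, 120, 180, 240]
offsets = [0.000, 0.013, 0.023, 0.037, 0.048]

slope = least_squares_slope(times, offsets)   # seconds of drift per second
ppm = slope * 1e6
print(f"estimated drift: {ppm:.0f} ppm")
```

The adjtimex man page describes one unit of `--tick` as roughly 100 ppm on a USER_HZ=100 kernel, so a fitted drift of this size would correspond to a tick change of about 2, with the remainder handled by the finer `--frequency` value.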

2

This seems to be a very common issue with VMs. See the following websites:

http://www.vmwareinfo.com/2008/04/enabling-ntp-on-esx-servers.html

http://social.technet.microsoft.com/Forums/en-US/winserverhyperv/thread/6fff3eef-1b5b-4059-8618-22ab3f5c293c

My suggestion would be to sync with just an external time server and disable any integration time sync'ing.

Hopefully this helps.

rmwetmore
  • That's exactly what I have done. The VM (AD1) has integration turned off and syncs to time.nist.gov. AD1 is fine. It's the physical machine S2 that loses sync to AD1. – MichaelGG May 19 '09 at 20:45
  • Like this chap says - to set MaxAllowedPhaseOffset to 1. http://www.jaylee.org/post/2009/10/14/Hyper-V-CPU-Load-and-System-Clock-Drift.aspx – gbjbaanb Dec 23 '09 at 00:06
2

We have been running Hyper-V on Core for a while. At first we had time sync issues, so I reverted to a best practice from my old Windows NT days.

I look at the servers by OS. I create a Linux, Router, Windows, Novell master.

You might not have Novell now but bear with me.

Each "master" server syncs to the router. The router to stratum. Then each member server has its master OS server and a secondary of one of the other Masters.

  • Linux to Router, then to Novell
  • Novell to Router, then to Windows
  • Windows to router, then to Linux
  • Router to Stratum, then to Core switch
  • Core Switch to Stratum, then to Router

The last piece of this strategy is: EVERYTHING has a time server. If it does not have a time server, then it is not going to be plugged into the network. From toaster to switch to phone PBX to servers.

One of the first things I do when I get to a new job is spend the time to map the network and set the time. I can then just check it here and there and eliminate time sync as an issue from that point on.

Flimzy
Thomas Denton
  • Hmm, I'll try adding a manual secondary and see if that helps. But everything else works fine -- just this one physical machine drifts. – MichaelGG May 19 '09 at 20:47
  • What kind of machine is it? Dell/HP/IBM - Other? I have had Dell boxes that just always need to be tuned. – Thomas Denton May 19 '09 at 21:03
  • Dell PowerEdge 850 with a Pentium D920 in it (or something around there -- 2.8GHz, does Intel VT.) – MichaelGG May 20 '09 at 08:51
  • The PE 350s would drift very badly, but that was years ago. I have not used an 850, but the SC1435 servers that are the cheaper analog to the 850 do fine. Maybe look at the environment: is the server vibrating and the CMOS battery loose, or something crazy like that? – Thomas Denton May 20 '09 at 11:40
1

Time drifts all over the place in VMs. You really want to make sure that the NTP server is not using the local clock in any 'server' statements, as the local clock is too unreliable. One thing I've done to help is to set the "maxpoll" attribute for servers on VMed machines. This forces the ntp service to check with its upstream clocks much more often than the configured default, which helps keep it true.

server [timeserver] maxpoll 12

Try a few settings to see how far down you need to get to keep time relatively reliable. 12 works for me, but each environment is different.
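
When trying different values, keep in mind that ntpd's `minpoll` and `maxpoll` are exponents of two, in seconds. The sketch below just does that conversion for a few candidate values. The stock ntpd defaults (`minpoll 6`, `maxpoll 10`) are stated here from the ntpd documentation, not from the answer, and `poll_seconds` is a made-up helper name.

```python
def poll_seconds(exponent):
    """Convert an ntpd poll exponent to an interval in seconds."""
    return 2 ** exponent

# 6 and 10 are the usual ntpd defaults for minpoll and maxpoll; 12 is the value above.
intervals = {exponent: poll_seconds(exponent) for exponent in (6, 8, 10, 12)}

for exponent, seconds in intervals.items():
    print(f"maxpoll {exponent}: up to {seconds} s between polls")
```

By this arithmetic, `maxpoll 12` allows gaps of up to 4096 s, which is longer than the 1024 s allowed by the default of 10, so it is the values below 10 that force more frequent polling.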

sysadmin1138
1

This may sound funny, but I bet you are running a multi-processor setup? There are known clock-drift issues with certain manufacturers cough AMD cough that happen with multi-core/multi-socket motherboards. Heavy interrupt activity - like say, running a virtual machine or two - makes the drift worse. The drift you are experiencing sounds very suspiciously like this.

For what it's worth, I do prefer AMD's offerings over Intel, so don't take this as a knock against them.

Avery Payne
1

Assuming that AD1 was a domain controller, I think the problem here may have been related to your Hyper-V server setting its time from one of its own guest VMs. That's why the problem went away when you switched to VMware: the VMware server does not feel compelled to synchronize its clock with a Windows domain controller.

Skyhawk