How "real" is the POSIX clock in a virtual machine?

Question

Introduction:

Time in an OS like Linux is typically derived from a clock chip (RTC), or maintained by software using either periodic interrupts or hardware registers (e.g. the CPU's TSC cycle counter).

Obviously in a virtual machine there is no direct hardware access (e.g. to RTC), so keeping the correct time may be tricky.

Specifically I'm wondering about the two POSIX clock implementations: CLOCK_REALTIME and CLOCK_MONOTONIC (there are more).

Disturbances

There are two major "disturbances" I'm considering:

"CPU overcommitting": giving more virtual CPUs to VMs than there are physical ones
"Live Migration": Moving a VM from one machine to another "without" affecting operations

Normal operation

Processes running in an operating system on bare hardware are interrupted only by the operating system (which then has control). So the operating system can keep the time easily.

VM operation

An operating system running in a VM does not continuously have control over the CPU. For example if the OS "does not have the CPU", it cannot process timer interrupts. In turn that could cause the timer interrupts to be lost completely, be delayed by some seemingly random amount (jitter), or maybe even be processed in rapid sequence (processing "delayed" interrupts now). Likewise the clock would not progress as linearly as expected.

Choices

CLOCK_REALTIME: If the OS is missing CPU, the real-time clock could either be slowed down (lag behind), or jump forward occasionally to keep up
CLOCK_MONOTONIC: If the OS is missing CPU, the monotonic clock could either be slowed down (in relation to other VMs or wall time), or jump forward occasionally to keep up

Effects

CLOCK_REALTIME: Obviously if the real-time clock is slow, it cannot be used as an absolute timing measure, but it would look consistent within the VM. If the clock keeps up by jumping forward variable amounts of time, it could be used as an absolute measure, but it would be bad for measuring any performance (duration) within the VM.
CLOCK_MONOTONIC: Advancing the monotonic clock only if the VM "has the CPU" will provide a consistent view of elapsed time within the VM. Making the clock jump forward variable amounts of time would prevent use for performance (duration) measurements within the VM.
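To check from inside the guest how the two clocks behaved during a measurement, one option is to sample both around the same piece of work. This is only a minimal sketch, assuming Linux and Python's `time.clock_gettime`; `compare_clocks` is a hypothetical helper, not something from the question. A skew far from zero suggests that CLOCK_REALTIME was stepped while the work ran. It says nothing about whether either clock matches the time that passed outside the VM.

```python
import time

def compare_clocks(work):
    """Run work() once and report how far each POSIX clock advanced."""
    rt_start = time.clock_gettime(time.CLOCK_REALTIME)
    mono_start = time.clock_gettime(time.CLOCK_MONOTONIC)
    work()
    mono_end = time.clock_gettime(time.CLOCK_MONOTONIC)
    rt_end = time.clock_gettime(time.CLOCK_REALTIME)

    realtime = rt_end - rt_start
    monotonic = mono_end - mono_start
    return {
        "realtime": realtime,            # wall-clock view, can be stepped
        "monotonic": monotonic,          # never goes backwards
        "skew": realtime - monotonic,    # large values hint at a clock step
    }

if __name__ == "__main__":
    report = compare_clocks(lambda: time.sleep(0.1))
    for name, seconds in report.items():
        print(f"{name:>9}: {seconds:.6f} s")
```

For duration measurements the "monotonic" figure is normally the one to use; the "skew" figure is mostly useful as a warning sign.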

Live-Migration

When live migration requires copying gigabytes of RAM from one node to another, there will be some "freezing time" when the VM cannot run, let's say 3 seconds.

Now should the real-time clock jump forward by 3 seconds as well, or should it lose the three seconds until it is corrected manually or automatically at some later time? Likewise, when the monotonic clock is being used to measure "uptime", should it take those three seconds into account by adding them, or should it only account for the time when the VM actually had the CPU?

Over-committing CPU

Like above, but there are more frequent short delays instead of occasional larger ones.

Questions

What approach does Xen use?

How does VMware handle that? Are there configurable options? (I know that in Xen the VMs can be synced from the hypervisor, or run independently, e.g. synced externally using NTP.)

Are there any "best practices"?

score 1 · Answer 1 · edited Mar 25 '22 at 10:37

POSIX (and Linux in general) never really guarantees timers in the sense that if you put something to sleep, you can expect it to wake up at exactly a certain time. You can only ever guarantee that the wakeup occurred AFTER said time, not exactly on it and never before it*.

Linux isn't meant to be realtime and really just tries its best.

From man 2 nanosleep which is POSIX compliant:

nanosleep() suspends the execution of the calling thread until either at least the time specified in *req has elapsed, or the delivery of a signal that triggers the invocation of a handler in the calling thread or that terminates the process.
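The "at least" in that description is easy to observe. The sketch below uses Python's `time.sleep` as a stand-in for `nanosleep()` (an assumption on my part; `sample_oversleep` is a hypothetical helper) and records how far each wakeup overshoots the request. The overshoot should not be meaningfully negative, but there is no upper bound on it.

```python
import time

def sample_oversleep(requested_s, count):
    """Sleep `count` times and return how far each wakeup overshot the request."""
    overshoots = []
    for _ in range(count):
        start = time.monotonic()
        time.sleep(requested_s)  # may return late, should not return early
        elapsed = time.monotonic() - start
        overshoots.append(elapsed - requested_s)
    return overshoots

if __name__ == "__main__":
    results = sample_oversleep(0.005, 200)
    print(f"min overshoot: {min(results) * 1e6:.0f} us")
    print(f"max overshoot: {max(results) * 1e6:.0f} us")
```

Running this on an idle host, a loaded host, and an overcommitted VM gives a rough feel for how much worse the tail gets.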

If you're expecting the ticks to be reliable, then the more likely issue is that you don't have a heuristic in place to manage drift within a given window.

My suggestion here would be to rethink your application design to be less reliant on exact wakeups, or to have a failsafe in case of an unexpected delay.

For example:

The software aborts due to some delay anomaly.
The software on wakeup notices a difference in comparison to some other authoritative time source and 'steps' its idea of the next wakeup to compensate.
You print a warning or provide some other notice.
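The three options above can be combined in one periodic loop. The following is a rough sketch, not a definitive implementation: `check_wakeup` and `run_periodic` are hypothetical names, and it assumes the monotonic clock is the reference, whereas the second option above leaves the choice of authoritative source open. On a late wakeup it skips the missed periods instead of firing them in a burst, warns, and optionally aborts.

```python
import sys
import time

def check_wakeup(deadline, now, period, tolerance):
    """Return (next_deadline, late_by, anomaly) for a wakeup observed at `now`."""
    late_by = now - deadline
    if late_by <= tolerance:
        return deadline + period, late_by, False
    # Delay anomaly: step past every period we missed so the
    # next deadline lies in the future again.
    missed = int(late_by // period)
    return deadline + (missed + 1) * period, late_by, True

def run_periodic(task, period, tolerance, iterations, max_late=None):
    """Call task() roughly every `period` seconds, coping with late wakeups."""
    deadline = time.monotonic() + period
    for _ in range(iterations):
        time.sleep(max(0.0, deadline - time.monotonic()))
        now = time.monotonic()
        deadline, late_by, anomaly = check_wakeup(deadline, now, period, tolerance)
        if anomaly:
            if max_late is not None and late_by > max_late:
                raise RuntimeError(f"wakeup was {late_by:.3f} s late, aborting")
            print(f"warning: wakeup was {late_by:.3f} s late", file=sys.stderr)
        task()

if __name__ == "__main__":
    run_periodic(lambda: print("tick"), period=0.5, tolerance=0.05, iterations=4)
```

Whether skipping missed periods is right depends on the application; some workloads would rather catch up than drop work.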

It's not really plausible to think of time as being reliable in a preemptible system. Even on bare metal.

Non-Maskable Interrupts cannot be blocked.
High load means you're just scheduled out for a long time.
Interrupts to the CPU invoked by hardware can cause delays.
Minor and Major page faults can produce very long delays between timer wakeups.
Memory allocation on memory banks that are not local to the CPU adds delays.

This is really just a function of modern x86 computing.

At least on KVM, there is a clocksource called 'kvm-clock' which is supposed to represent ticks from the underlying hypervisor irrespective of any unknown delays in a VM. You can see which clocksource is currently set at /sys/devices/system/clocksource/clocksource*/current_clocksource and what your options are at /sys/devices/system/clocksource/clocksource*/available_clocksource.
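If you want to check this from a script rather than by hand, something along these lines should work. It is a small sketch that only reads the two sysfs files named above; `read_clocksources` is a hypothetical helper, and it returns an empty result on systems where the path does not exist.

```python
import glob

SYSFS_GLOB = "/sys/devices/system/clocksource/clocksource*"

def read_clocksources(base_glob=SYSFS_GLOB):
    """Map each clocksource directory to its current and available sources."""
    result = {}
    for directory in sorted(glob.glob(base_glob)):
        entry = {}
        for name in ("current_clocksource", "available_clocksource"):
            try:
                with open(f"{directory}/{name}") as handle:
                    entry[name] = handle.read().split()
            except OSError:
                entry[name] = []  # file missing or unreadable
        result[directory] = entry
    return result

if __name__ == "__main__":
    for directory, entry in read_clocksources().items():
        print(directory)
        print("  current:  ", " ".join(entry["current_clocksource"]))
        print("  available:", " ".join(entry["available_clocksource"]))
```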

But again, the underlying host can have its own delays. So it's just turtles all the way down.

Don't rely on realtime guarantees where none exist. Build the software to either cope with unexpected delays or at least know about them.

NTP in general is a whole protocol meant to handle the problem of time, what time is 'correct', and what to do about handling changes to time. It's a pretty complicated problem.

The best practice is to set the system up so that the problem is statistically unlikely, think about what (if anything) would constitute a reliable authority for time in your application, and then decide how you want to deal with the unlikely events where time does change.

Maybe you set up some SLA saying that the time will be incorrect for 1 check in 1000000 samples. That is, it's possible, albeit statistically unlikely, that the ticks are off.
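Checking such a budget against measured data is simple arithmetic. The sketch below assumes you already collect a lateness value (in seconds) per check; `late_fraction`, `meets_budget`, and the threshold are hypothetical and only illustrate the idea.

```python
def late_fraction(lateness_samples, threshold_s):
    """Fraction of samples whose lateness exceeds threshold_s."""
    if not lateness_samples:
        raise ValueError("need at least one sample")
    late = sum(1 for value in lateness_samples if value > threshold_s)
    return late / len(lateness_samples)

def meets_budget(lateness_samples, threshold_s, budget=1e-6):
    """True if the observed late fraction stays within the budget."""
    return late_fraction(lateness_samples, threshold_s) <= budget

if __name__ == "__main__":
    samples = [0.0002, 0.0001, 0.2500, 0.0003]  # made-up lateness values
    print(late_fraction(samples, threshold_s=0.1))
    print(meets_budget(samples, threshold_s=0.1))
```

Keep in mind that confirming a one-in-a-million budget takes well over a million samples; a short run can only show that the budget is violated, not that it is met.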

The way I think about time when working with groups of related systems is that it's more important that their time locality* is within a small window of difference. To that end I'd have a local time server set up which itself uses some authoritative source, then have all computers in that group sync to that local system. The very low latency to the local time server serves to reduce the local jitter, and all the hosts should remain very closely synchronized.

* Some timer implementations use a signal handler to trap events, e.g. SIGALRM. If you send a process an ALRM signal outside of the timer, it would wake up before the timer expires.
* Locality here means that all computers logically related to one another are within perhaps a few milliseconds of each other. But they could differ wildly from another locality, e.g. a group of systems which is latency-wise 500ms away.