I have several servers of different ages running Ubuntu 22.04 with QEMU and libvirt.
When migrating VMs between some of these servers, the VM immediately freezes on the target system.
Tested guests were HVM machines running current Ubuntu or CentOS/Rocky Linux.

Previously the servers ran Ubuntu 18.04 and were reinstalled recently. On 18.04 there were no problems with live migration; the freezes only started on 22.04.

Details about the hardware:
I have multiple servers, but I picked four of them for comparison:
1: AMD EPYC 7513 @ 1.50 GHz
2: Intel Xeon Silver 4114 CPU @ 2.20GHz
3: Intel Xeon CPU E5-2630 v4 @ 2.20GHz
4: Intel Xeon CPU E5-2620 v4 @ 2.10GHz

Migration works between 1<->2 and 3<->4. Migration fails from 1->3 and from 1->4, but still works in the opposite direction.

Details about configuration:
Virtual machines are configured to use the "SandyBridge" CPU model, the same setting that worked under the older Ubuntu version.
Also, my test guest has no hard drive, no audio device, no networking. Just a Live CD device via a virtual IDE controller.

What I debugged so far:
This seems to be a timing- or interrupt-related problem. The machine freezes and does not react to keyboard input (via VNC).
Also, the following is logged on the target host:
qemu-system-x86_64: warning: TSC frequency mismatch between VM (2600083 kHz) and host (3207181 kHz), and TSC scaling unavailable
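To put a number on the mismatch in that warning, the two frequencies can be pulled out of the log line and compared. This is only a minimal sketch: the function name `parse_tsc_mismatch` and the returned dictionary layout are made up for illustration and are not part of QEMU or libvirt.

```python
import re

# Matches the QEMU warning quoted above. The helper name and the dict
# layout are illustrative assumptions, not a QEMU or libvirt API.
TSC_WARNING = re.compile(
    r"TSC frequency mismatch between VM \((\d+) kHz\) and host \((\d+) kHz\)"
)

def parse_tsc_mismatch(line):
    """Return guest and host TSC frequencies (kHz) plus their ratio, or None."""
    match = TSC_WARNING.search(line)
    if match is None:
        return None
    vm_khz, host_khz = (int(group) for group in match.groups())
    return {
        "vm_khz": vm_khz,
        "host_khz": host_khz,
        "host_to_vm": host_khz / vm_khz,
    }

line = (
    "qemu-system-x86_64: warning: TSC frequency mismatch between VM "
    "(2600083 kHz) and host (3207181 kHz), and TSC scaling unavailable"
)
info = parse_tsc_mismatch(line)
print(info)
```

For the line above, the target host's TSC runs roughly 23% faster than the frequency the guest was started with.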
I stumbled across this report and tried to set a fixed TSC frequency for the guest (with different values), but without success so far.
I also tried to disable TSC completely for the guest, making it rely on HPET. That makes the VM really laggy and produces warnings about stuck IRQ events on the console, and sometimes even a kernel panic triggered by the NMI watchdog. The VM also still froze completely after migration.
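For readers unfamiliar with the setting: a fixed guest TSC frequency is normally expressed in the libvirt domain XML as a `<timer name='tsc' frequency='...'/>` element inside `<clock>`. The sketch below only generates such a fragment. It assumes the `frequency` attribute is given in Hz, uses a hypothetical helper name, and is not the exact configuration that was tested.

```python
import xml.etree.ElementTree as ET

def tsc_clock_element(freq_khz, offset="utc"):
    """Build a libvirt-style <clock> fragment with a fixed TSC frequency.

    Assumption: libvirt expects the timer frequency in Hz, so the kHz
    value reported by QEMU is multiplied by 1000. The helper name is
    hypothetical.
    """
    clock = ET.Element("clock", offset=offset)
    ET.SubElement(clock, "timer", name="tsc", frequency=str(freq_khz * 1000))
    return ET.tostring(clock, encoding="unicode")

# 2600083 kHz is the guest frequency from the warning on the target host.
fragment = tsc_clock_element(2600083)
print(fragment)
```

The resulting fragment would replace the existing `<clock>` element of the domain definition (for example via `virsh edit`), and the guest has to be restarted for it to take effect.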

What I discovered:
If the guest is still sitting in the GRUB bootloader or a Live CD menu (so it has not booted a kernel yet), the VM can be migrated without problems. Booting the guest into recovery mode still leads to a frozen VM.
I tried the noapic acpi=off notsc boot options inside the guest, but nothing changed.

What could be causing this?
At the moment, I'm about 50% convinced that this is related to QEMU's use of the TSC, which leads to a frozen system if the TSC frequencies of the servers deviate too much. But it seems I can't simply forbid QEMU from using the TSC, even if I boot the host with the notsc kernel parameter.
But, of course, it could be related to something totally different.
Is there a way to force QEMU to use HPET or any other clock source? Or is there anything else I could try to make migrations work again?

Thanks!

Max
  • If I remember correctly, the issue might be the difference between the host CPUs. They must be identical; that is what I remember Proxmox saying about the same problem. – djdomi Sep 09 '22 at 18:07