I got extremely confused and decided to rewrite the question from scratch, so if some of the comments do not make sense, that is why. Apologies to anyone whose time I wasted.

I am having a problem with my entire system freezing when using KVM virtual machines (both with PCI passthrough). The host is Ubuntu 20.04, running on a Threadripper 1950X with an ASRock X399 Taichi motherboard.

It seems that there are a lot of ways to trigger it, but one of the most reliable is to use my Kubuntu VM while playing audio in my Windows VM. It rarely if ever happens when using just the Kubuntu VM. When using the Windows VM alone, it usually happens more slowly, but happens eventually.

Probably not important, but, this entire problem started while watching a video in my Windows VM last night. The guest froze, so I forced it off, and then subsequently the guest brought down the whole system.

In any case, my strategy is to insert descriptions of what I am doing into journalctl, flagged as "###Admin note:" by using

 echo '###Admin note: test' | systemd-cat
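A slightly more defensive variant of the same idea, as a sketch only: the `admin_note` function name and the `admin-note` tag are made up for illustration. It also prints the note to the terminal and then asks journald to flush to disk with `journalctl --sync`, which may need root and is skipped quietly if it fails.

```shell
#!/bin/sh
# Hypothetical helper: tag a note, send it to the journal, then request a flush
# so the note has a better chance of surviving a hard freeze.
admin_note() {
    note="###Admin note: $*"
    printf '%s\n' "$note"    # show the note in the terminal as well
    if command -v systemd-cat >/dev/null 2>&1; then
        # -t sets the syslog identifier, -p the priority of the message
        printf '%s\n' "$note" | systemd-cat -t admin-note -p notice 2>/dev/null \
            || echo "warning: could not write note to the journal" >&2
        # Ask journald to write unwritten entries to disk (may require root)
        journalctl --sync >/dev/null 2>&1 || true
    fi
}
```

Usage would look like `admin_note now starting the Windows guest`; filtering later with `journalctl -b -1 -t admin-note` should then show only the notes.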

The output as viewed with journalctl -b -1 after reboot: https://pastebin.com/fi1iR8Zx

The last message I sent to journalctl does not seem to have been saved, but did appear in the terminal:

(Screenshot: the last note I sent to systemd, visible in the terminal.)

And the output of journalctl -k -b -1 https://pastebin.com/1t38WsWh

Are there any other logs I should look in? What additional information would be helpful? I was thinking to post info about my PCI devices and my virsh XML file, but I don't want to over-clutter this post with irrelevant stuff.

Edit: I notice that I have a high temperature on something called SMBUSMASTER. So far I have found only speculation about what exactly that is. If the maximum safe temperature is the same as for CPUTIN (68C), then I guess I have a problem. Is this bad?

Also, other than PCI passthrough of GPUs, one unusual feature of my system is that Windows' USB devices are all connected to a PCI USB-C card, which is passed through. I don't see anything to implicate it in the log, but it seemed worth mentioning.

Possible solution?

I think I may have figured it out.

In looking over my logs, I noticed many entries like this:

Jul 10 23:43:20 virtland kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

Googling that led me to this Ryzen bug: https://bugzilla.kernel.org/show_bug.cgi?id=196683

which led me to this thread: http://forum.asrock.com/forum_posts.asp?TID=11690&title=x370-taichi-c-states

Now I could have sworn I turned off suspend to memory long ago, but sure enough it was set to auto. In any case, I turned it off and now I am listening to music from Windows and working in Kubuntu.

Perhaps the setting got changed back to default when I did a firmware update before I set up the host, but I still don't get why it would be a problem. Could it be related to power? We are having a heatwave, and the lights sometimes flicker. I am behind an APC UPS for exactly this reason, but it has been making a lot of clicking noises when my air conditioning is on. I certainly was not trying to suspend to RAM at any point, but could a power fluctuation that the UPS did not correct fast enough have somehow triggered something related to this?

In any case, I won't be confident that this is fixed until I have gone a couple days without it, but this is very promising.

Indeed I did speak too soon, here are the logs from that session. I was using Windows and Kubuntu for a few hours without issue, which was a record. The MWAIT errors are still there:

journalctl -b -1 : https://pastebin.com/1feLq19U

Edit: I noticed that the last log seems to cut off well before the crash actually happens, so I reproduced the crash one more time. The output of `journalctl -b -1` after rebooting cuts off more than ten minutes before the crash. Fortunately, I was running `journalctl -f`, and that actually captured everything until my last comment (the crash happened less than a minute later). Unfortunately, I don't see any errors right at the time of the crash.

I am puzzled as to why the journalctl as saved to the system logs and viewable later cut off ten minutes before the output as sent to a file live. A few seconds I can understand, but ten minutes?
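One possible contributor, offered as a guess rather than a diagnosis: journald only syncs journal files to disk every `SyncIntervalSec` (5 minutes by default), apart from an immediate sync after messages of priority CRIT or higher, so a hard freeze can lose whatever was still unwritten. A minimal sketch of a drop-in that shortens that interval follows; the file name `10-crash-debug.conf`, the `write_sync_dropin` function, and the 10s value are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch: write a journald drop-in that shortens the disk sync interval.
# DROPIN_DIR is overridable so the script can be tried in a scratch directory first.
DROPIN_DIR="${DROPIN_DIR:-/etc/systemd/journald.conf.d}"

write_sync_dropin() {
    mkdir -p "$DROPIN_DIR" || return 1
    cat > "$DROPIN_DIR/10-crash-debug.conf" <<'EOF'
[Journal]
# Keep logs on disk across reboots.
Storage=persistent
# Default is 5m; a shorter interval loses less on a hard freeze.
SyncIntervalSec=10s
EOF
}

# Usage (as root): write_sync_dropin && systemctl restart systemd-journald
```

A shorter interval means more frequent disk writes, so this is probably something to keep only while debugging.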

Anyway, for comparison, here are the logs plus my picture of the crashed screen.

  1. Output of journalctl -b -1 : Cuts off 10 minutes early, but included for comparison. https://pastebin.com/mKu4bBzL

  2. Text saved from journalctl -f > file.txt, which is complete: https://pastebin.com/kSXDRkBp

  3. Picture of the screen: Today's crash

Edit: Another detail that might matter is that the Windows VM was created with a previous host installation (it was created with Arch, but with an earlier version of QEMU, so that shouldn't be a problem, right?)

Edit: Disabled C-states on my motherboard, no luck

Edit: I set up Kernel Crash Dumps as described here: https://ubuntu.com/server/docs/kernel-crash-dump There is a lot there, but one thing that caught my eye is this:

[  463.983070] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  463.983308] vfio-pci 0000:09:00.0: vfio_ecap_init: hiding ecap 0x1e@0x370
[  465.131213] vfio-pci 0000:0a:00.0: vfio_ecap_init: hiding ecap 0x19@0x280
...
[ 1047.680144] vfio-pci 0000:43:00.0: enabling device (0000 -> 0003)
[ 1047.680422] vfio-pci 0000:43:00.0: vfio_ecap_init: hiding ecap 0x19@0x900

Those are the PCI IDs of all the cards I am passing through (43 is to Kubuntu, 09 and 0a to Windows). Googling "hiding ecap" turned up threads about problems with hardware similar to mine (AMD CPU + AMD GPU passthrough): https://forums.gentoo.org/viewtopic-t-1070816-start-0.html https://forum.level1techs.com/t/rx-470-stuck-in-d3-single-gpu-passthrough/147401
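For anyone wanting to confirm which driver those functions are bound to on the host, here is a small sketch that reads the `driver` symlink in sysfs for the three addresses from the log above. The `show_driver` function and the `SYSFS_PCI` variable are illustrative names, not part of any tool.

```shell
#!/bin/sh
# Sketch: report the kernel driver bound to each passed-through PCI function.
# SYSFS_PCI is overridable so the function can be exercised against a fake tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

show_driver() {
    dev="$SYSFS_PCI/$1"
    if [ -L "$dev/driver" ]; then
        # "driver" is a symlink; the last path component of its target is the driver name
        printf '%s %s\n' "$1" "$(basename "$(readlink "$dev/driver")")"
    else
        printf '%s %s\n' "$1" "(no driver bound or device absent)"
    fi
}

# Addresses taken from the vfio-pci log lines above
for addr in 0000:09:00.0 0000:0a:00.0 0000:43:00.0; do
    show_driver "$addr"
done
```

While the guests are running, each line would be expected to end in `vfio-pci`; anything else might be worth including in the question.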

I passed through my RX470 to Windows and was able to shut down or start again without issue for months. I gave up on trying to make it work with Linux long ago due to the AMD reset bug. Is it possible that this bug is showing itself again in a different form?

Stonecraft
  • The system crashed so hard it could not log those entries to disk. But there is some stuff that scrolled off the top that is probably necessary to figuring out what is going on. If that terminal is scrollable, scroll back up and get the rest of it. – Michael Hampton Jul 10 '20 at 23:41
  • Actually, I could get the log to match the screen by editing `journald.conf` to set `RateLimitIntervalSec=1s`. More complete logs, had to pastebin due to size: https://pastebin.com/ZBP678Ph – Stonecraft Jul 11 '20 at 00:17
  • Huh? That pastebin doesn't have anything relevant to the crash. – Michael Hampton Jul 11 '20 at 00:23
  • Well, it crashed a few seconds after the " Admin note: now going to use kubuntu guest some more, while FB2K plays in Windows" Here is a pastebin to the previous log that you asked for: https://pastebin.com/pe0x4ABh – Stonecraft Jul 11 '20 at 00:26
  • Nothing in that pastebin is relevant either. It's the actual crash information that we need to look at. You took a screenshot of part of it but the important bits were scrolled off the top. – Michael Hampton Jul 11 '20 at 00:39
  • I am confused.. that log covers the entire virt-manager session, from opening virt-manager to the crash. I am doing another test now, when it crashes I will paste the entire `journalctl -xe`, not just from the start of using VMs. – Stonecraft Jul 11 '20 at 00:44
  • Again, the information is not in the journal! It is only in [one of your screenshots](https://i.stack.imgur.com/GwyLd.jpg). But you need to scroll back to get the rest of it. – Michael Hampton Jul 11 '20 at 00:53
  • @MichaelHampton I have re-written the question and added what I think is all the relevant information (at least it is the complete journalctl). – Stonecraft Jul 11 '20 at 20:28
  • This smells like a driver crapping out, but given that the whole scenario is way out of what is likely supported by ANY driver manufacturer on Windows client machines and there is a really non-standard (for Windows) hypervisor used, I would say that there likely is no solution. As in: you are going to end up in a redirect hell where everyone tells you to update different parts. There likely are VERY few people using KVM in this way on Windows, especially now that Hyper-V is "standard". Not saying you should use it, just saying that using something arcane is likely to lead to limited support – TomTom Jul 12 '20 at 18:05
  • Windows is the guest, the host is Ubuntu. – Stonecraft Jul 12 '20 at 18:15
  • As this seems video/audio related. Could this be triggered by some DRM mechanism in the windows VM locking up your PC? Maybe you can block some CPU features like secure enclave or mainboard DRM functions to the VM. – Gerrit Jul 13 '20 at 08:15
  • @Gerrit DRM as in "Digital Rights Management" or "Direct Rendering Manager"? – Stonecraft Jul 13 '20 at 08:46
  • Also, it has happened when not playing media, that's just a reliable way to trigger it in fairly short order. – Stonecraft Jul 13 '20 at 08:54
  • I meant the "Rights" management variant. There seems to be a pretty informed forum here: https://forum.level1techs.com/t/threadripper-gpu-passthrough-working-with-vega/120594/4 – Gerrit Jul 13 '20 at 10:19
  • Does SysRq still respond on the host when the lock up happens? If not this will be tough to troubleshoot as it sounds like the kernel has locked entirely and it will be difficult to see if it spat out any messages before it gasped its last... – Anon Jul 18 '20 at 17:05
  • No, nothing did. I installed kernel crash dump, and that maybe pointed to something network related. I ended up first using ZFS to roll back to before the problem started. Then I reformatted my zvol and reinstalled Windows, and performance was crap, but it no longer affected the host. Turns out there is extra stuff I need in my XML file with later versions of QEMU. I still have not finished setting up though, so I will wait a week or two and see if the problem comes back, and if it is fixed, answer my question. – Stonecraft Jul 19 '20 at 09:53

0 Answers