Virtual disk extremely slow for KVM guests


I have a relatively small server with a quad core CPU (Intel i5-7400) and 16 GB RAM (DDR4 though), running a couple of virtualised guests using libvirt. I'm not using any other intermediate layer such as Proxmox. The OSes in use are about 90% Linux, 5% macOS (Mojave and up) and 5% Windows (10/2016). I never use desktop environments on Linux. The host (Ubuntu Bionic) uses ZFS with a raidz1 config to store the virtual disk files. When creating guests I always use virt-install with the proper --os-variant flag.
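For reference, a typical guest here gets created with something along these lines (names, sizes and ISO paths are just placeholders):

virt-install --name linux-guest --memory 2048 --vcpus 2 --os-variant ubuntu18.04 \
  --disk path=/vm/linux-guest.qcow2,size=20,format=qcow2,bus=virtio \
  --cdrom /vm/iso/ubuntu-18.04.iso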

For all guests, disk performance was extremely low, barely ever going above 10 MB/s write speeds (even with VirtIO drivers). This occurred regardless of virtual disk type: QCOW2, raw, QCOW2 with a 4K cluster size, and an entirely preallocated QCOW2 disk all had the same issue. When writing about 200 MB to a file the guest would simply lock up, and I had to wait a couple of minutes after Ctrl+C'ing the command for it to become usable again. After some further research/testing I found that the writeback cache mode significantly improves performance, at least for the Linux guests. No more lock-ups, and they can even write 1 GB to a file in just a couple of seconds, even on a brand-new sparse/thin QCOW2 disk attached via a SATA bus.
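In case anyone wants to apply that: the cache mode is set per disk, either at install time via virt-install or by editing the existing domain with virsh edit (guest/file names below are placeholders):

virt-install ... --disk path=/vm/guest.qcow2,format=qcow2,bus=virtio,cache=writeback ...

or, in the guest's XML, on each disk's driver element:

<driver name='qemu' type='qcow2' cache='writeback'/>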

However, the GUI guests still have extremely slow boot times, and when they finally do boot they're pretty much unusable (the mouse pointer moves maybe once every 5 seconds, keyboard input is severely delayed, opening an application takes forever, etc.). I can wait an hour for Windows to boot and it'll still be stuck on the black boot screen with the Windows logo and a loading icon below it, even though I managed to load the VirtIO drivers during the Windows installation. macOS will usually boot after 30 minutes or so, but that's on a SATA bus because I can't even install VirtIO drivers there. Linux guests boot in a matter of seconds, for comparison.

For macOS I once managed to SSH into the guest from my own computer and run a disk speed test from there; even with the writeback cache mode it barely reached 10 MB/s write speeds.

All problems occur even if e.g. macOS is the only guest currently running, so I don't think it's a CPU or RAM bottleneck. Memory isn't overcommitted anyway, because in my experience that only causes issues. I also tried giving the guest both 2 and 4 vCPUs, with no noticeable difference. Also, the full qemu-system-* command line properly contains the KVM acceleration flags (accel=kvm / -enable-kvm), so it is not doing the virtualisation purely in software.
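For completeness, this is roughly how I checked that (guest name is a placeholder):

virsh dumpxml some-guest | grep '<domain'    # should say type='kvm', not type='qemu'
ps -ef | grep qemu-system                    # the command line should contain accel=kvm or -enable-kvm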

It's probably some stupid configuration thing somewhere, because even on my ancient virtualisation rig (rocking DDR2 memory) using ESXi I could boot Windows 7 guests in a reasonable period of time.

Sahbi

Posted 2019-12-29T12:49:36.393

Reputation: 1

Answers


Been messing around a whole lot more and found some useful/interesting stuff.

zfs set atime=off <dataset>

This was originally enabled for mypool/rootfs, which as the name indicates holds the root filesystem of my host OS. The VM files are stored under a different dataset (mypool/vm) for which the option was already off, but its mountpoint (/vm) still sits under rootfs. I have about 10 other datasets unrelated to both, and the option was off for all of those as well. I don't really care about access times anyway, so I decided to just turn it off everywhere.

zfs set xattr=sa <dataset>

This was set to on for all datasets, which apparently means extended attributes are stored as files in hidden directories instead of in the inodes themselves, resulting in additional I/O; xattr=sa stores them in the dnode/system attributes instead. I'm aware this change only affects files created (or, as far as I can tell, modified) afterwards, but it seems to be the recommended setting for ZFS on Linux, so I changed it regardless.
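For anyone following along, verifying and changing both properties pool-wide is straightforward; child datasets inherit the value unless they override it (mypool is simply my pool's name):

zfs get -r atime,xattr mypool
zfs set atime=off mypool
zfs set xattr=sa mypool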

Since a virtual disk file actually gets modified whenever the guest OS writes e.g. a log file, this should have some form of noticeable impact. I proceeded to turn on the Mac VM and initiated a regular boot (so not in recovery or single-user mode); after "only" 10m54s I had the login screen in front of me. After logging in I can actually use the OS in a normal fashion. It no longer takes 5 seconds for the mouse pointer to move 2 pixels; it's pretty much real-time. If I quickly drag some windows around they become slightly choppy, but that's apparently because libvirt's guest console isn't all that fast: when I use a physical Mac's Screen Sharing client, even quick drags are rendered pretty smoothly. Ditto for the default screensaver; it's a bit blocky but there's no lag or frame drops.

Note: all of the above was done while my "standard" Linux guests were all running (8 of 'em). And very much unlike before, I can even dd a 10 GB file within macOS without the VM locking up; it takes about 23 seconds to write (466226214 bytes/s, or 444.63 MiB/s). The Mac guest currently has 2 vCPUs and 4 GB of RAM.
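The dd test itself was nothing fancy, just something along these lines from a terminal inside the guest (file name is arbitrary; BSD/macOS dd wants a lowercase size suffix):

dd if=/dev/zero of=/tmp/ddtest bs=1m count=10240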

Despite all that, however, Windows is still barely crawling forwards. It took 55m1s just to get the login screen background and another 4m18s before the password input box came up. But after logging in the experience is the same as, or even better than, macOS. A winsat disk -drive c shows 778.95 and 742.12 MB/s for sequential read and write, respectively.

So I decided to try a more risky setting just for the hell of it:

zfs set sync=disabled mypool/vm

Of course the default for this is standard. I found the source of Proxmox's pveperf and ran it for both sync values, looking at the fsync rate in particular:

  • standard: barely 50 fsync/s
  • disabled: a whopping 36138 fsync/s (which is to be expected)
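For reference, the comparison was roughly this (pveperf takes the dataset's mountpoint as its argument):

zfs get sync mypool/vm            # standard by default
pveperf /vm                       # ~50 fsyncs/second
zfs set sync=disabled mypool/vm
pveperf /vm                       # ~36000 fsyncs/second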

When it's disabled, the macOS VM takes about 2 minutes to finish the initial boot process (Apple logo with loading bar), but then it's stuck on a black screen for 10+ minutes before finally showing the login window. That's actually slower than with standard, so I flipped it back.

It looked like I wasn't quite done digging yet, so I rebooted the entire host so that the two zfs set changes would also take effect for a bunch of host files (which get modified/rewritten during boot). This seems to have had quite a major impact:

  • Mac: 47 seconds until login window pops up, shuts down in 10 seconds
  • Windows: 58 seconds for login window, shuts down in 13

These boot times are perfectly acceptable, especially considering that the storage is made up of good ol' spinning-rust drives. I can reliably reproduce the roughly one-minute boot time every time I cold-boot either VM.

So yeah, just set those 2 ZFS properties and the writeback cache mode for QEMU right at the start and save yourself a lot of time. =]
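In other words, on a similar setup the whole fix boils down to something like this (pool name is mine; the cache mode goes on each disk's driver line via virsh edit, as shown further up):

zfs set atime=off mypool
zfs set xattr=sa mypool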

Sahbi

Posted 2019-12-29T12:49:36.393

Reputation: 1