
Using indirect locks with virtlockd (which is used by libvirtd) requires a cluster-wide shared filesystem like OCFS2. In turn this means that virtlockd must be started after the shared filesystem is mounted (otherwise the locks created would be local at best). Naturally libvirtd must be started after virtlockd, and any VM after libvirtd.

So the start order I want is: pacemaker, DLM, OCFS2 mount, virtlockd, libvirtd, VMs...

For stop I want the reverse order.

I have configured all those primitives (specifically systemd:libvirtd.service and systemd:virtlockd), clones and constraints correctly (I hope), but I'm still having issues with virtlockd.

On a system like SLES 15 these services are controlled by systemd, and systemd seems to have a life of its own, starting services even when they are all disabled.

So the question: Did anybody manage to succeed with such a setup?

Update (2021-02-04)

I found this "Drop-In" in the status output for virtlockd.service: /run/systemd/system/virtlockd.service.d/50-pacemaker.conf

It contains:

[Unit]
Description=Cluster Controlled virtlockd
Before=pacemaker.service pacemaker_remote.service

[Service]
Restart=no

A corresponding file /run/systemd/system/libvirtd.service.service.d/50-pacemaker.conf exists:

[Unit]
Description=Cluster Controlled libvirtd.service
Before=pacemaker.service pacemaker_remote.service

[Service]
Restart=no

Could these cause the problems I'm seeing (systemd starting libvirtd-ro.socket, libvirtd-admin.socket and libvirtd.service, then starting virtlockd)?
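
A few standard systemd queries can help to find out what actually pulled in virtlockd (nothing here is specific to this setup):

# show the unit file together with all active drop-ins (including the Pacemaker one)
systemctl cat virtlockd.service
# list the units that depend on virtlockd, i.e. candidates for having triggered it
systemctl list-dependencies --reverse virtlockd.service
# show the current state and recent journal lines of the unit
systemctl status virtlockd.service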

Update (2021-02-05)

It seems the resources are started in the correct order when the node boots (e.g. after being fenced), but when Pacemaker is restarted (e.g. via crm cluster restart), systemd interferes and starts virtlockd before Pacemaker wants to start it. Maybe the difference is the /run directory.

Update (2021-02-08)

Another issue I found is that even though /etc/libvirt/libvirtd.conf contains listen_tls = 1, starting libvirtd through Pacemaker as described results in libvirtd not opening the TLS socket, which prevents VM live migration.
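
A quick check whether the TLS listener is actually open (a sketch; 16514 is libvirt's default TLS port, and the hostname is only an example):

# is libvirtd listening on the TLS port?
ss -tlnp | grep -w 16514
# optionally try a TLS connection from another node
virsh -c xen+tls://node1.domain.org/system list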

U. Windl
  • What is the problem you are having? – Michael Hampton Feb 03 '21 at 14:48
  • systemd is starting libvirtd before Pacemaker wants to start it. – U. Windl Feb 03 '21 at 15:08
  • In general when using Pacemaker you *disable* affected local system service management from running services or otherwise interfering with Pacemaker operations. In this case, systemd must start only corosync and pacemaker, and then Pacemaker must start libvirt and its friends. Pacemaker has all options to serialize service startup cluster-wide. – Nikita Kipriyanov Feb 08 '21 at 11:32

2 Answers


There is still a locking issue during live migration that might be a bug in libvirtd, but I think I have the solution:

Parts of this solution are found in https://bugzilla.redhat.com/show_bug.cgi?id=1750340.

The first thing is not to use systemd's "socket activation" for libvirtd. It's not enough to disable all the socket units (libvirtd.socket, libvirtd-ro.socket, libvirtd-admin.socket, libvirtd-tcp.socket, libvirtd-tls.socket); you have to mask them.
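
For reference, masking could look like this (the unit names are the ones listed above; not all of them may be active on every system):

systemctl disable --now libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket libvirtd-tcp.socket libvirtd-tls.socket
systemctl mask libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket libvirtd-tcp.socket libvirtd-tls.socket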

Even then libvirtd does not open the TLS socket, even if listen_tls = 1 is set. To enable it you must (on SLES 15 SP2) edit /etc/sysconfig/libvirtd and set LIBVIRTD_ARGS="--listen". At the same time you must deactivate LIBVIRTD_ARGS="--timeout 120" to prevent automatic termination of libvirtd.
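
The relevant part of /etc/sysconfig/libvirtd then looks roughly like this (a sketch; the comments in the shipped file differ per release):

# open TCP/TLS listeners itself instead of relying on socket activation
LIBVIRTD_ARGS="--listen"
# keep the distribution default commented out, otherwise libvirtd exits when idle
#LIBVIRTD_ARGS="--timeout 120"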

Finally you'll have to start and stop virtlockd and libvirtd (once they are configured) at the right points in time. I am using a multipath SAN device to host my VM images, while I use a clustered RAID1 (/dev/md10) to hold the locks. OCFS2 is used as the filesystem on top of both. Clustered MD and OCFS2 both need the DLM. Cluster-wide resources use clones to distribute them. I run three test VMs once libvirtd and the image path /cfs/VMI are ready. I won't explain the steps to configure lockd with indirect locking here, just as I won't explain how to set up a clustered MD RAID or an OCFS2 filesystem.

Here is the three-node cluster configuration in crm syntax (the fencing resource is omitted):

node 1: node1 \
    attributes standby=off
node 2: node2 \
    attributes standby=off
node 3: node3 \
    attributes standby=off
primitive prm_CFS_VMI Filesystem \
    params device="/dev/disk/by-id/dm-name-VMI-PM" directory="/cfs/VMI" fstype=ocfs2 options="acl,user_xattr" \
    op start timeout=90 interval=0 \
    op stop timeout=90 interval=0 \
    op monitor interval=120 timeout=90
primitive prm_DLM ocf:pacemaker:controld \
    op start timeout=90 interval=0 \
    op stop timeout=120 interval=0 \
    op monitor interval=60 timeout=60
primitive prm_libvirtd systemd:libvirtd.service \
    op start timeout=100 interval=0 \
    op stop timeout=100 interval=0 \
    op monitor interval=60 timeout=100
primitive prm_lockspace_ocfs2 Filesystem \
    params device="/dev/md10" directory="/var/lib/libvirt/lockd" fstype=ocfs2 options="acl,user_xattr" \
    op start timeout=90 interval=0 \
    op stop timeout=90 interval=0 \
    op monitor interval=120 timeout=90
primitive prm_lockspace_raid_md10 Raid1 \
    params raidconf="/etc/mdadm/mdadm.conf" raiddev="/dev/md10" force_clones=true \
    op start timeout=90s interval=0 \
    op stop timeout=90s interval=0 \
    op monitor interval=300 timeout=90s \
    op_params OCF_CHECK_LEVEL=10
primitive prm_virtlockd systemd:virtlockd \
    op start timeout=100 interval=0 \
    op stop timeout=100 interval=0 \
    op monitor interval=60 timeout=100
primitive prm_xen_test-jeos1 VirtualDomain \
    params param config="/etc/libvirt/libxl/test-jeos1.xml" hypervisor="xen:///system" remoteuri="xen+tls://%n.domain.org" \
    op start timeout=120 interval=0 \
    op stop timeout=180 interval=0 \
    op monitor interval=600 timeout=90 \
    op migrate_to timeout=300 interval=0 \
    op migrate_from timeout=300 interval=0 \
    meta allow-migrate=true resource-stickiness=1000
primitive prm_xen_test-jeos2 VirtualDomain \
    params param config="/etc/libvirt/libxl/test-jeos2.xml" hypervisor="xen:///system" remoteuri="xen+tls://%n.domain.org" \
    op start timeout=120 interval=0 \
    op stop timeout=180 interval=0 \
    op monitor interval=600 timeout=90 \
    op migrate_to timeout=300 interval=0 \
    op migrate_from timeout=300 interval=0 \
    meta allow-migrate=true resource-stickiness=1000
primitive prm_xen_test-jeos3 VirtualDomain \
    params param config="/etc/libvirt/libxl/test-jeos3.xml" hypervisor="xen:///system" remoteuri="xen+tls://%n.domain.org" \
    op start timeout=120 interval=0 \
    op stop timeout=180 interval=0 \
    op monitor interval=600 timeout=90 \
    op migrate_to timeout=300 interval=0 \
    op migrate_from timeout=300 interval=0 \
    meta allow-migrate=true resource-stickiness=1000
clone cln_CFS_VMI prm_CFS_VMI \
    meta interleave=true
clone cln_DLM prm_DLM \
    meta interleave=true
clone cln_libvirtd prm_libvirtd \
    meta interleave=true
clone cln_lockspace_ocfs2 prm_lockspace_ocfs2 \
    meta interleave=true
clone cln_lockspace_raid_md10 prm_lockspace_raid_md10 \
    meta interleave=true
clone cln_virtlockd prm_virtlockd \
    meta interleave=true
colocation col_CFS_VMI__DLM inf: cln_CFS_VMI cln_DLM
colocation col_clustered-MD__DLM inf: cln_lockspace_raid_md10 cln_DLM
colocation col_libvirtd__virtlockd inf: cln_libvirtd cln_virtlockd
colocation col_lockspace_ocfs2__DLM inf: cln_lockspace_ocfs2 cln_DLM
colocation col_lockspace_ocfs2__raid_md10 inf: cln_lockspace_ocfs2 cln_lockspace_raid_md10
colocation col_test-jeos1__CFS_VMI inf: prm_xen_test-jeos1 cln_CFS_VMI
colocation col_test-jeos2__CFS_VMI inf: prm_xen_test-jeos2 cln_CFS_VMI
colocation col_test-jeos3__CFS_VMI inf: prm_xen_test-jeos3 cln_CFS_VMI
colocation col_virtlockd__lockspace_fs inf: cln_virtlockd cln_lockspace_ocfs2
colocation col_vm__libvirtd inf: ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 ) cln_libvirtd
order ord_CFS_VMI_xen_test-jeos1 Mandatory: cln_CFS_VMI prm_xen_test-jeos1
order ord_CFS_VMI_xen_test-jeos2 Mandatory: cln_CFS_VMI prm_xen_test-jeos2
order ord_CFS_VMI_xen_test-jeos3 Mandatory: cln_CFS_VMI prm_xen_test-jeos3
order ord_DLM__CFS_VMI Mandatory: cln_DLM cln_CFS_VMI
order ord_DLM__clustered-MD Mandatory: cln_DLM cln_lockspace_raid_md10
order ord_DLM__lockspace_ocfs2 Mandatory: cln_DLM cln_lockspace_ocfs2
order ord_libvirtd__vm Mandatory: cln_libvirtd ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3  )
order ord_lockspace_fs__virtlockd Mandatory: cln_lockspace_ocfs2 cln_virtlockd
order ord_raid_md10__lockspace_ocfs2 Mandatory: cln_lockspace_raid_md10 cln_lockspace_ocfs2
order ord_virtlockd__libvirtd Mandatory: cln_virtlockd cln_libvirtd

So the essential ordering is:

order ord_lockspace_fs__virtlockd Mandatory: cln_lockspace_ocfs2 cln_virtlockd
order ord_virtlockd__libvirtd Mandatory: cln_virtlockd cln_libvirtd
order ord_libvirtd__vm Mandatory: cln_libvirtd ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 )
order ord_CFS_VMI_xen_test-jeos1 Mandatory: cln_CFS_VMI prm_xen_test-jeos1
order ord_CFS_VMI_xen_test-jeos2 Mandatory: cln_CFS_VMI prm_xen_test-jeos2
order ord_CFS_VMI_xen_test-jeos3 Mandatory: cln_CFS_VMI prm_xen_test-jeos3

The remaining issue is that libvirtd claims the VM is not locked on the original node during live migration, shortly before the migration succeeds.

U. Windl

Windl,

To close the gap between the virtualization and HA product documentation, I set up a highly available virtualization environment. The detailed steps are below.

SLE HA hardware environment details:

Cluster node hardware     DELL Precision Tower 5810
Shared disk               iSCSI (50 GB)
SLE HA version            SLE 15 HA SP2
Number of cluster nodes   2
SBD partition             iSCSI (50 MB)
Filesystem partition      iSCSI (20 GB)
VM image partition        iSCSI (30 GB)

Installation and setup process:

1. Attach the iSCSI disk to both cluster nodes and divide it into 3 partitions (e.g. 50 MB for SBD, 20 GB for OCFS2, the remainder for VM images).
    Add a network bridge br0 on each cluster node (it will be used when installing/running virtual machines).
    Set up password-free SSH login for the root user between the cluster nodes.
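
    A rough sketch of the disk and SSH preparation (device names, partition sizes and host names are only examples; the bridge can be created with YaST or wicked):

    # partition the shared iSCSI disk (example device /dev/sdb)
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart sbd 1MiB 51MiB
    parted -s /dev/sdb mkpart ocfs2 51MiB 20GiB
    parted -s /dev/sdb mkpart vmimages 20GiB 100%

    # password-free root SSH between the nodes (repeat on both nodes)
    ssh-keygen -t ed25519
    ssh-copy-id root@sle15sp2-test2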

2. Install the HA and virtualization-related packages on each cluster node.
    # zypper in -t pattern ha_sles
    # zypper in -t pattern kvm_server kvm_tools

3. Set up the HA cluster and add the SBD device.
    Refer to the HA guide at http://docserv.suse.de/documents/en-us/sle-ha/15-SP2/single-html/SLE-HA-guide/#book-sleha-guide
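
    For a two-node cluster the bootstrap could look roughly like this (the SBD device path and node name are only examples; see the guide above for details):

    # on the first node: initialize the cluster with the SBD partition
    crm cluster init -s /dev/sdb1
    # on the second node: join the existing cluster
    crm cluster join -c sle15sp2-test1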

4. Set up the DLM and OCFS2 resources in crm.

    e.g.

    primitive dlm ocf:pacemaker:controld \
        op monitor interval=60 timeout=60
    primitive ocfs2-2 Filesystem \
        params device="/dev/disk/by-id/scsi-149455400000000004100c3befec3dc9a81f9ce28f7a8b8de-part1" directory="/mnt/shared" fstype=ocfs2 \
        op monitor interval=20 timeout=40
    group base-group dlm ocfs2-2
    clone base-clone base-group \
        meta interleave=true
 
5. Set up the virtlockd and libvirtd services on each cluster node.

    Edit /etc/libvirt/qemu.conf and set lock_manager = "lockd".

    Edit /etc/libvirt/qemu-lockd.conf and set file_lockspace_dir = "/mnt/shared/lockd" (note: /mnt/shared is the OCFS2 file system mount point).

    Restart/enable the libvirtd service (note: the virtlockd service will be started by libvirtd according to this configuration).
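
    In short (only the two settings mentioned above are changed; the commands are a sketch):

    # /etc/libvirt/qemu.conf
    lock_manager = "lockd"

    # /etc/libvirt/qemu-lockd.conf
    file_lockspace_dir = "/mnt/shared/lockd"

    # create the lockspace directory on the shared filesystem and (re)start libvirtd
    mkdir -p /mnt/shared/lockd
    systemctl enable --now libvirtd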

6. Install a virtual machine (e.g. sle15-nd) on the shared partition from one cluster node and dump the domain configuration to an XML file.
    Move the virtual machine configuration file to the OCFS2 file system (e.g. /mnt/shared).
    Note: please make sure the XML configuration file does not include any references to unshared local paths.
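
    A sketch of dumping the configuration (shutting the VM down before the cluster takes it over is my addition, in line with remark 3 below):

    # dump the domain definition onto the shared OCFS2 filesystem
    virsh dumpxml sle15-nd > /mnt/shared/sle15-nd.xml
    # stop the manually started VM so that only the cluster starts it from now on
    virsh shutdown sle15-nd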

7. Set up the VirtualDomain resource and ordering in crm.

    e.g.

    primitive vm_nd1 VirtualDomain \
        params config="/mnt/shared/sle15-nd.xml" remoteuri="qemu+ssh://%n/system" \
        meta allow-migrate=true \
        op monitor timeout=30s interval=10s \
        utilization cpu=2 hv_memory=1024

    order ord_fs_virt Mandatory: base-clone vm_nd1

8. Check all your changes with the show command in crm, then commit:

    e.g. 

    crm(live/sle15sp2-test1)configure# show
    node 172167755: sle15sp2-test2
    node 172168091: sle15sp2-test1
    primitive dlm ocf:pacemaker:controld \
        op monitor interval=60 timeout=60
    primitive ocfs2-2 Filesystem \
        params device="/dev/disk/by-id/scsi-149455400000000004100c3befec3dc9a81f9ce28f7a8b8de-part1" directory="/mnt/shared" fstype=ocfs2 \
        op monitor interval=20 timeout=40
    primitive stonith-sbd stonith:external/sbd \
        params pcmk_delay_max=30s
    primitive vm_nd1 VirtualDomain \
        params config="/mnt/shared/sle15-nd.xml" remoteuri="qemu+ssh://%n/system" \
        meta allow-migrate=true \
        op monitor timeout=30s interval=10s \
        utilization cpu=2 hv_memory=1024
    group base-group dlm ocfs2-2
    clone base-clone base-group \
        meta interleave=true
    order ord_fs_virt Mandatory: base-clone vm_nd1
    property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version="2.0.3+20200511.2b248d828-1.10-2.0.3+20200511.2b248d828" \
        cluster-infrastructure=corosync \
        cluster-name=hacluster \
        stonith-enabled=true
    rsc_defaults rsc-options: \
        resource-stickiness=1 \
        migration-threshold=3
    op_defaults op-options: \
        timeout=600 \
        record-pending=true

Verify that the VM resource works in the HA cluster:

1. Verify that the VM resource is protected across cluster nodes.

    Test result: the VM cannot be started manually via virsh while it is running on another cluster node.

2. Verify that the VM resource is taken over by another cluster node when the current cluster node crashes.

    Test result: after a few seconds (the cluster fencing time), the VM is started on another cluster node.

3. Verify that the VM resource is taken over by another cluster node when the current cluster node reboots.

    Test result: the VM is migrated to another cluster node.

4. Check whether the VM resource can be migrated between cluster nodes.

    Test result: yes, the remote SSH connection to the VM is not broken during the whole migration.
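
    The migration in test 4 can be triggered with crmsh, for example (the target node name is only an example):

    # move vm_nd1 to the other node; this is a live migration because allow-migrate=true
    crm resource migrate vm_nd1 sle15sp2-test2
    # afterwards remove the location constraint created by the migrate command
    crm resource unmigrate vm_nd1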

Remarks:

1. In an actual production environment, cluster communication/management should use a separate network.

2. In an actual production environment, the cluster SBD device should use a separate shared disk to avoid I/O starvation.

3. Do not start a VM instance manually until the OCFS2 file system is mounted, since the file lockspace directory is on the OCFS2 file system. In other words, you should let the cluster (Pacemaker) manage the start and stop of all virtual machines.

  • Suggestions for improvement: In installation between step 1 and step 4: If you'd explain the meaning of the partitions (maybe `fdisk -l ...`) from step 1, step 4 would be a bit easier to understand. Step 5: `/mnt/shared` most likely is a bad choice for a permanent mount. – U. Windl Apr 07 '21 at 06:16
  • I wonder what happens during boot: `libvirtd` will start before the cluster, so also before OCFS2 is mounted. Probably "probes" for the VMs will be started also before virtlockd will have a shared filesystem. Will virtlockd handle the situation correctly when its lock filesystem is mounted after start? (If it does a chdir into it, it probably won't work). Also I think as a proof of concept you should define at least two VMs so that each node can run one. – U. Windl Apr 07 '21 at 06:16
  • virtlockd will access the lock file when virsh starts the VM image; that means we need to make sure all VirtualDomain resources start after the OCFS2 shared file system. – Gang He Apr 08 '21 at 06:14