7

I need to build a 2-node cluster(-like?) solution in active/passive mode, that is, one server is active while the other is a passive standby that continuously gets the data replicated from the active one. KVM-based virtual machines would run on the active node.

If the active node becomes unavailable for any reason, I would like to manually switch to the second node (which then becomes active, the other passive).

I've seen this tutorial: https://www.alteeve.com/w/AN!Cluster_Tutorial_2#Technologies_We_Will_Use

However, I'm not brave enough to build something that complex, trust fully automatic failover, and rely on it to operate correctly. There is too much risk of a split-brain situation, the complexity failing somehow, data corruption, etc., while my maximum downtime requirement is not so severe as to require immediate automatic failover.

I'm having trouble finding information on how to build this kind of configuration. If you have done this, please share the info / HOWTO in an answer.

Or maybe it is possible to build highly reliable automatic failover with Linux nodes? The trouble with Linux high availability is that there seems to have been a surge of interest in the concept about 8 years ago, and many tutorials are quite old by now. This suggests that there may have been substantial problems with HA in practice and that some/many sysadmins simply dropped it.

If that is possible, please share the info on how to build it and your experiences with clusters running in production.

LetMeSOThat4U

5 Answers

6

Why not use something that has been checked by thousands of users and has proven its reliability? You can just deploy the free Hyper-V Server with, for example, StarWind VSAN Free and get true HA without any issues. Check out this manual: https://www.starwindsoftware.com/resource-library/starwind-virtual-san-hyperconverged-2-node-scenario-with-hyper-v-server-2016

batistuta09
4

I have a very similar installation to the setup you described: a KVM server with a standby replica via DRBD active/passive. To keep the system as simple as possible (and to avoid any automatic split-brain, e.g. due to my customer messing with the cluster network), I also ditched automatic cluster failover.

The system is 5+ years old and has never given me any problems. My volume setup is the following:

  • a dedicated RAID volume for VM storage;
  • a small overlay volume containing QEMU/KVM config files;
  • bigger volumes for virtual disks;
  • a DRBD resource managing the entire dedicated array block device.

I wrote some shell scripts to help me in case of failover. You can find them here.

Please note that the system was architected for maximum performance, even at the expense of features such as fast snapshots and file-based (rather than volume-based) virtual disks.

Rebuilding a similar active/passive setup now, I would heavily lean toward using ZFS and continuous async replication via send/recv. It is not real-time, block-based replication, but it is more than sufficient for 90%+ of cases.
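
For illustration, assuming a dataset named data/vm replicated to a standby host called node2 (names are placeholders, not from my actual setup), one async replication cycle boils down to something like:

zfs snapshot -r data/vm@repl-new                  # recursive snapshot of the dataset
zfs send -R -i data/vm@repl-prev data/vm@repl-new | \
    ssh node2 zfs receive -F data/vm              # ship only the delta since the previous snapshot

A small cron job taking and rotating these snapshots is enough to keep the standby a few minutes behind the active node.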

If realtime replication is really needed, I would use DRBD on top of a ZVOL + XFS; in fact, I tested such a setup + automatic pacemaker switch in my lab with great satisfaction. If using third-party modules (as ZoL is) is not possible, I would use a DRBD resource on top of an lvmthin volume + XFS.
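
If you want a rough idea of the ZVOL-backed variant (the pool name, resource name, device and size below are placeholders, and I assume the resource file maps r0 to /dev/drbd0), it would look something like this:

zfs create -s -V 200G tank/drbd_backing           # sparse zvol used as the DRBD backing device
# in /etc/drbd.d/r0.res, point the disk at the zvol: disk /dev/zvol/tank/drbd_backing;
drbdadm create-md r0                              # initialize DRBD metadata on the zvol
drbdadm up r0
drbdadm primary --force r0                        # only on the node seeding the initial data
mkfs.xfs /dev/drbd0                               # XFS on top of the replicated device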

shodanshok
  • Thank you so much for the scripts, I've read them and they gave me a much better idea of what to do. However, you mention an "overlay volume" and I'm not sure what it really is. Googling "overlay drbd" or kvm gave me no results; could you please elaborate on how you do that and what its purpose is? I prefer LVM volumes as KVM disks for reasons of management rather than performance (and you can snapshot LVM-based KVM too), so I don't see why qcow2 would be better. Anyway, my goal is maximum reliability and I would gladly trade some performance for it. If you add those bits of info, I'll accept your answer. – LetMeSOThat4U Oct 09 '18 at 16:00
  • What I called the "overlay volume" is the volume hosting the qemu/kvm/libvirt configuration. The point is that for a KVM standby server you need to replicate not only the vdisk images, but the qemu config also. When writing about file-based disks I was not referring to qcow2 files, but to (possibly sparse) raw disk images (with snapshots done via lvm / lvmthin / ZFS / whatever). – shodanshok Oct 09 '18 at 20:08
3

You can totally set up DRBD and use it in a purely manual fashion. The process should not be complex at all. You would simply do what a Pacemaker or Rgmanager cluster does, but by hand. Essentially:

  • Stop the VM on the active node
  • Demote DRBD on the active node
  • Promote DRBD on the peer node
  • Start the VM on the peer node

Naturally, this will require that both nodes have the proper packages installed and that the VM's configuration and definition exist on both nodes.
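
As a rough sketch only (the guest name vm1, the DRBD resource r0 and the mount point below are placeholders; adapt them and the DRBD device to your own resource file), the manual switch amounts to:

# on the currently active node
virsh shutdown vm1                        # cleanly stop the guest
umount /var/lib/libvirt/images/vm1        # only if a filesystem sits on the DRBD device
drbdadm secondary r0                      # demote the DRBD resource

# on the peer node
drbdadm primary r0                        # promote the DRBD resource
mount /dev/drbd0 /var/lib/libvirt/images/vm1
virsh define /etc/libvirt/qemu/vm1.xml    # only if the definition is not already present
virsh start vm1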

I can assure you that the Linux HA stack (Corosync and Pacemaker) is still actively developed and supported. Many guides are old because the software has been around for over 10 years. When done properly, there are no major problems or issues. It is not abandoned, but it is no longer "new and exciting".

Dok
  • Thanks! It's nice to hear that about the HA stack. One thing I'm not clear on is how to reliably stop the VMs on the new passive node and start them on the new active one. Do you happen to know how to do this well? If you added that to the answer, I'd appreciate it. – LetMeSOThat4U Oct 08 '18 at 19:12
2

Active/passive clusters are still heavily used in many places and are running in production. Please find our production setup below; it is working fine, and you can either let it run in manual mode (orchestrate=start) or enable automatic failover (orchestrate=ha). We use zfs to benefit from zfs send/receive and zfs snapshots, but it is also possible to use drbd if you prefer synchronous replication.

Prerequisites:

  • 2 nodes (in my setup, 2 physical nodes 400 kilometers apart)
  • internal disks
  • 1 zfs pool on each node
  • stretched vlan (in my setup we use the "vrack" of the OVH hosting provider)

Steps:

  • install opensvc agent on both nodes (https://repo.opensvc.com)
  • form opensvc cluster (3 commands needed, described in the screencast at https://www.opensvc.com)
  • create a root ssh trust between both nodes
  • create 1 opensvc service per kvm guest [service config file below]

root@node1:~$ svcmgr -s win1 print config

[DEFAULT]
env = PRD
nodes = node1.acme.com node2.acme.com
id = 7a10881d-e5d5-4817-a8fe-e7a2004c5520
orchestrate = start

[fs#1]
mnt_opt = rw,xattr,acl
mnt = /srv/{svcname}
dev = data/{svcname}
type = zfs

[container#0]
type = kvm
name = {svcname}
guestos = windows
shared = true

[sync#1]
src = data/{svcname}
dst = data/{svcname}
type = zfs
target = nodes
recursive = true
schedule = @12h

A few explanations:

  • the service is named "win1" and each {svcname} in the service config file is a reference pointing to the actual service name (win1)
  • service start does the following:
    • mount the zfs dataset data/win1 on the mountpoint /srv/win1
    • start the kvm container win1
  • resource sync#1 is used to declare an asynchronous zfs dataset replication to the slave node (data/win1 on node1 is sent to data/win1 on node2), every 12 hours in the example (zfs send/receive is managed by the opensvc agent)
  • the opensvc agent also takes care of replicating the kvm qemu config, and of defining it when the service is relocated to the slave node

Some management commands:

  • svcmgr -s win1 start : start the service
  • svcmgr -s win1 stop : stop the service
  • svcmgr -s win1 stop --rid container#0 : stop the container referenced as container#0 in the config file
  • svcmgr -s win1 switch : relocate the service to the other node
  • svcmgr -s win1 sync update : trigger an incremental zfs dataset copy
  • svcmgr -s win1 sync full : trigger a full zfs dataset copy

Some services I manage also need zfs snapshots on a regular basis (daily/weekly/monthly), with retention. In that case I add the following config snippets to the service configuration file, and the opensvc agent does the job.

[sync#1sd]
type = zfssnap
dataset = data/{svcname}
schedule = 23:00-23:59@61
keep = 7
name = daily
recursive = true
sync_max_delay = 1d

[sync#1sw]
type = zfssnap
dataset = data/{svcname}
schedule = 23:00-23:59@61 sun
keep = 4
name = weekly
recursive = true
sync_max_delay = 7d

[sync#1sm]
type = zfssnap
dataset = data/{svcname}
schedule = 23:00-23:59@61 * *:first
keep = 6
name = monthly
recursive = true
sync_max_delay = 31d

As requested, I am also adding one lvm/drbd/kvm config:

drbd resource config /etc/drbd.d/kvmdrbd.res:

resource kvmdrbd {
    device /dev/drbd10;
    disk /dev/drbdvg/drbdlv;
    on node1 {
        address 1.2.3.4:12345;
        meta-disk internal;
    }
    on node2 {
        address 4.3.2.1:12345;
        meta-disk internal;
    }
}

opensvc service config file /etc/opensvc/kvmdrbd.conf:

root@node1# svcmgr -s kvmdrbd print config
[DEFAULT]
env = PRD
nodes = node1.acme.com node2.acme.com
id = 7a10881d-f4d3-1234-a2cd-e7a2018c4321
orchestrate = start

[disk#1]
type = lvm
vgname = {env.drbdvgname}
standby = true

[disk#2]
type = drbd
standby = true
shared = true
res = {svcname}

[fs#0]
mnt = {env.basedir}/{svcname}
type = ext4
dev = /dev/{env.drbddev}
shared = true

[container#0]
type = kvm
name = {svcname}
shared = true

[sync#i0]
schedule = @1440

[env]
basedir = /srv
drbddev = drbd10
drbdvgname = drbdvg

Some explanations:

  • in my setup, I replicate an lvm lv with drbd. I create a filesystem on the drbd block device. In this filesystem, I create 1 flat file per disk I want to present to the kvm guest (a small sketch of this layout follows the list below).
  • disk#1 : the lvm vg hosting the big logical volume. It should be at least 5GB.
  • disk#2 : the drbd disk pointed to by the drbd resource name. If the opensvc service is named "foo", you should have /etc/drbd.d/foo.res; otherwise change the disk#2.res parameter in the service config file.
  • fs#0 : the main filesystem hosting all the disk files for the kvm guest
  • container#0 : the kvm guest, with the same name as the opensvc service in the example. The agent must be able to DNS-resolve the kvm guest in order to do a ping check before agreeing to start the service (if the ping answers, the kvm guest is already running somewhere and it is not a good idea to start it; this double-start protection is ensured by the opensvc agent)
  • standby = true : means that this resource must remain up when the service is running on the other node. In our example, it is needed to keep drbd replication running fine
  • shared = true : https://docs.opensvc.com/latest/agent.service.provisioning.html#shared-resources
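
For illustration only (the device, mountpoint and size below follow the example config above; they are not exact commands from my setup), preparing the shared filesystem and one flat disk file looks roughly like this:

mkfs.ext4 /dev/drbd10                                    # filesystem on the DRBD device (fs#0)
mount /dev/drbd10 /srv/kvmdrbd                           # normally handled by the opensvc fs#0 resource
qemu-img create -f raw /srv/kvmdrbd/vm1-disk0.raw 50G    # one flat (raw) file per guest disk
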
Chaoxiang N
0

I'm currently running an extremely similar system: 2 servers, one active, one backup, and they both have a few VMs running inside them. The database is being replicated and the fileservers are in constant sync with rsync (but only one way). In case of emergency, the secondary server takes over. There was the idea of using Pacemaker and Corosync, but since this has to be 100% reliable, I didn't have the courage to experiment. My idea is to have NginX watch over the servers. This could be done because I'm using a web application, but in your case, I don't know if you could use it. DRBD is a mess for me. The previous servers were using it, and while it seemingly worked, it felt like I was trying to dissect a human body.

Check this out, it might help you: http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs

It doesn't look hard; in fact, I've already tried it in a small environment and it worked. Easy to learn, easy to make, easy to maintain. Actually, I think this is what you are looking for.
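
To give an idea of what that involves (these commands are from memory for the CentOS 7 / pcs 0.9 era the tutorial covers; node names, cluster name and IP are placeholders), the cluster bootstrap is roughly:

yum install -y pacemaker corosync pcs             # on both nodes, then set the hacluster password
systemctl enable pcsd
systemctl start pcsd
pcs cluster auth node1 node2                      # authenticate as the hacluster user
pcs cluster setup --name webcluster node1 node2
pcs cluster start --all
pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 op monitor interval=30s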

Bert