DRBD as DR: syncing datastores of 2 ESXI hosts, vmdk consistency?

Question

Does anyone have experience with using DRBD (protocol C) to sync parts of the datastores of two ESXi hosts for disaster recovery of selected guests?

I have 2-3 guests that should be able to recover from hardware failure of the host in as little time as possible, but still with manual intervention and without losing too much data.

I'd like to build something like this:

One DRBD VM on each of the two ESXi hosts, syncing their local SAS storage (primary/secondary, active/passive).

This mirrored storage should be attached to only one ESXi host at a time via iSCSI or NFS and hold those guests, so that their vmdks are synced to the second, "passive" ESXi host. In the event of a hardware failure, the second ESXi host should then attach the DRBD storage and power up those VMs (done manually, of course).

I have found some information about doing this on the net, but what I haven't found any information for is consistency of the vmdks.

While this is of course not meant as a replacement for backups, backup tools for hypervisors usually make sure that the guests' filesystems and databases are quiesced before taking the snapshot or backup.

With this continuous sync this wouldn't be possible though. That's why I wonder if this is even worth doing.

What if the vmdks themselves get damaged because the hardware failure occurs at a bad time? I know DRBD discards writes that aren't complete, but is that sufficient to have a consistent vmdk? By "consistent" I mean "working" from ESXi's point of view, apart from guest filesystem consistency, which of course cannot be guaranteed this way.

I hope that, in the event of a crash, a guest brought up on the second esxi could behave as if the VM just ungracefully shut down (with all the possible drawbacks this usually might have in other scenarios), but would that really be the case? Couldn't the vmdks as a whole get damaged?

Thank you very much for reading and your thoughts.

Max

score 4 · Answer 1 · edited Apr 13 '17 at 12:14

I did extensive tests with scenarios like the one you describe. I tried having a storage server with failover capability using DRBD, then using iSCSI to attach that storage to Debian machines running Xen. I quickly gave up on that, because I had too many problems. Part of those could be me, though. One of them was that the iSCSI block devices weren't created.

Then I proceeded to try to make two Debian Xen machines, and have the LVM block devices on which the VMs reside replicated with DRBD. I did have file system barrier errors to overcome.

Then my performance was bad, which I tracked down to the al-extents option. The version of DRBD I used, 8.3, had too low a default value. I upped it to 3833, since I don't really care about the slightly longer resync time.
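For reference, a minimal sketch of where that setting lives in a DRBD 8.3 resource definition. The resource name "r0" is a placeholder, and the per-host device, disk and address settings are left out. As far as I know, in 8.3 the option belongs in the syncer section, while later versions (8.4) moved it to the disk section, so check the drbd.conf man page for your version.

```
resource r0 {            # "r0" is a placeholder resource name
  protocol C;            # synchronous replication, as in the question
  syncer {
    al-extents 3833;     # activity-log extents; value taken from the text above
  }
  # per-host sections (device, disk, address, meta-disk) omitted
}
```

A larger activity log means fewer metadata updates during normal writes, at the cost of more data to resync after a primary crash, which is the trade-off mentioned above.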

I also did a whole bunch of experiments with killing power to nodes. DRBD was very graceful with that. The VM did respond as you hope: bringing it online on the other node went fine, with the guest just reporting that it was recovering its journal. Restarting the failed node also simply resynced the device. Of course, real node failure is often ugly, with half-working disks, flaky network traffic, etc., which is hard to predict. You're smart to promote the secondary only manually.

I have been running the setup for about 2 years. Node hasn't failed yet :), nor has the DRBD.

In my tests, I found it hugely more convenient not to have a central storage server with failover, but to run Xen on both the DRBD primary and the secondary. The iSCSI setup is something I'd like to try again, but that won't happen anytime soon.

Also, I don't work with image files, but with LVM block devices instead. This has proven more robust for me. Snapshotting on LVM is possible, for one.

One funny thing to note: DRBD has a mode that allows it to run diskless on the primary node. I once had a disk failure on my primary Xen node that went unnoticed (kernel MD didn't kick the drive, but I had constant ATA errors). For weeks, without me knowing, the VM just ran fine in diskless mode, using the other server as storage :)

Thank you very much for sharing your experience! This is already very helpful, although I won't mark it as an answer because you used LVM devices instead of image files and, to be specific, no vmdks. I hope that's nitpicking and doesn't really make any difference, but as I have absolutely no idea about the vmdk format, I'm still worried they could become corrupted. The whole thing is influenced by other risk factors anyway, though (the guest's filesystem, for one). Thanks again! :) — mx82, Dec 20 '14 at 16:39
About the diskless mode: I've read about that and it sounds amazing! Glad to hear that you successfully made use of that feature. I guess it would be good to be notified when that happens though :D although maybe it's sufficient to notice the sudden performance degradation. Depends on the scenario of course :) DRBD really is amazing. — mx82, Dec 20 '14 at 16:45
Since then, I have put a notice on our central rsyslogd server to report any ATA exception to me. I have more than a handful of examples where it (retroactively) predicted disk failure. I'm a bit amazed that the kernel MD doesn't kick the drive out in those situations, or at least notify you. — Halfgaar, Dec 21 '14 at 08:23
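A rough sketch of that kind of check, for anyone who wants to try it on collected logs. This is not the rsyslog rule mentioned above. The function name and the regular expression are assumptions, based on kernel messages that typically look like "ata3.00: exception Emask 0x0 ...".

```python
import re

# Assumed format of kernel libata error reports, e.g.
# "ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen".
# The pattern is illustrative and may need adjusting for your logs.
ATA_EXCEPTION = re.compile(r"\bata\d+(?:\.\d+)?: exception\b")


def find_ata_exceptions(lines):
    """Return the log lines that look like kernel ATA exception reports."""
    return [line.rstrip("\n") for line in lines if ATA_EXCEPTION.search(line)]
```

Feeding it the lines of a syslog file and mailing yourself any non-empty result would give a similar early warning, under the assumption that your kernel logs ATA errors in this form.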
