
My Debian 8.9 DRBD 8.4.3 setup has somehow got into a state where the two nodes can no longer connect over the network. They should replicate a single resource r1, but immediately after drbdadm down r1; drbdadm up r1 on both nodes, their /proc/drbd describes the situation as follows:

On the 1st node (the connection state is either WFConnection or StandAlone):

1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
   ns:0 nr:0 dw:0 dr:912 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:20

On the 2nd node:

1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:48
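
For completeness, the same information can also be queried through drbdadm itself (assuming the resource name r1 from above):

drbdadm cstate r1   # connection state (WFConnection / StandAlone / Connected ...)
drbdadm role r1     # local/peer roles (Primary/Secondary/Unknown)
drbdadm dstate r1   # disk states (UpToDate/DUnknown ...)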

The two nodes can ping each other on the IP addresses given in /etc/drbd.d/r1.res, and netstat shows that both are listening on the configured port.
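
These are the kinds of checks I mean (a sketch; 10.0.0.2 and port 7789 are placeholders for whatever address and port r1.res actually specifies):

ping -c 3 10.0.0.2                  # placeholder peer address: is the peer reachable?
sudo netstat -tlnp | grep 7789      # placeholder port: is DRBD listening on it?
nc -z -v 10.0.0.2 7789              # can the peer's DRBD port actually be reached? (nc may need installing)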

How can I (further diagnose and) get out of this situation so that the two nodes can become Connected and replicate over DRBD again?

BTW, at a higher level of abstraction this problem currently manifests itself in systemctl start drbd never returning, apparently because it gets stuck in drbdadm wait-connect all (as suggested by /lib/systemd/system/drbd.service).
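
Side note: if I understand the defaults correctly, that wait is unbounded unless a timeout is configured; it can apparently be limited via the startup section of the resource (a sketch only, with arbitrary example values):

resource r1 {
  startup {
    wfc-timeout      120;   # wait at most 120s for the peer during normal startup
    degr-wfc-timeout  60;   # shorter wait if the cluster was already degraded
  }
  # ... rest of the resource definition unchanged ...
}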

rookie09

1 Answer


The situation was apparently caused by a case of split-brain.

I had not noticed this because I had only inspected recent journal entries for drbd.service (sudo journalctl -u drbd), but the problem had apparently been reported slightly earlier, in the kernel log (sudo journalctl | grep Split-Brain).
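
In command form, the checks look roughly like this (a sketch; -k restricts the journal to kernel messages, and the exact wording/casing of the log entry may differ):

sudo journalctl -u drbd                      # service log only: did not show the problem here
sudo journalctl -k | grep -i 'split-brain'   # kernel messages, where DRBD reports split-brain
sudo journalctl | grep Split-Brain           # the query used above, over the whole journal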

With that, manually solving the split-brain (as described here or here) also resolved the troublesome situation as follows.

On split-brain victim (assuming the DRBD resource is r1):

drbdadm disconnect r1                   # tear down the current (failed) connection attempt
drbdadm secondary r1                    # make sure the victim is not Primary
drbdadm connect --discard-my-data r1    # reconnect, discarding this node's changes; it will be resynced from the peer

On split-brain survivor:

drbdadm primary r1    # only needed if this node is not already Primary
drbdadm connect r1    # only needed if the node is StandAlone; in WFConnection it reconnects by itself
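
To confirm that the nodes are talking again, the state can then be checked on either node (a sketch; the exact output will differ):

cat /proc/drbd       # cs: should go through SyncSource/SyncTarget and end up Connected
drbdadm cstate r1    # should eventually report Connected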
rookie09
    It's best to include your steps in your answer versus linking to a site that might move later. I imagine you just needed `drbdadm disconnect r1` on both nodes, then `drbdadm connect r1 --discard-my-data` on the victim, and `drbdadm connect r1` on the survivor. – Matt Kereczman Aug 25 '17 at 14:44
  • @MattKereczman Done now. – rookie09 Aug 31 '17 at 06:11