Ganeti disks degraded drbd cs:NetworkFailure

Question

I have an instance (with 2 disks) on Ganeti with both disks degraded (probably due to a connection problem?). This instance had been working correctly for many years until this morning.

On my master

$ gnt-instance info myinstance
...
   -disk/0
      on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
      on secondary: /dev/drbd4 (147:4) in sync, status *DEGRADED*
      child devices:
        - child 0: lvm, size 20.0G
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
        - child 1: lvm, size 128M
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

...

On primary node

$ cat /proc/drbd
 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

On secondary node

$ cat /proc/drbd
 4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:678340009 dw:678340009 dr:0 al:0 bm:14884 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

I can neither reboot nor shut down the instance (Operation timed out).

I think it is NOT a split-brain issue because neither node shows "StandAlone": the primary node reports "Primary/Unknown" and the secondary reports "Secondary/Unknown".
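The check described above (looking at the `cs:` field on each node for a StandAlone state) can be sketched in Python. This is a minimal sketch, assuming the DRBD 8.3-style `/proc/drbd` layout shown in this post; the function names `parse_drbd_status` and `standalone_minors` are made up for illustration, and the absence of StandAlone is only a hint, not proof, that there is no split-brain.

```python
import re

# Matches a DRBD 8.3-style status line such as
# " 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----"
# The ro: and ds: fields are optional because unconfigured minors only show cs:.
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"(?:\s+ro:(?P<ro>\S+))?(?:\s+ds:(?P<ds>\S+))?"
)


def parse_drbd_status(text):
    """Return {minor: {"cs": ..., "ro": ..., "ds": ...}} for each device line."""
    devices = {}
    for line in text.splitlines():
        m = STATUS_RE.match(line)
        if m:
            devices[int(m.group("minor"))] = {
                "cs": m.group("cs"),
                "ro": m.group("ro"),
                "ds": m.group("ds"),
            }
    return devices


def standalone_minors(devices):
    """Minors whose connection state is StandAlone."""
    return sorted(minor for minor, d in devices.items() if d["cs"] == "StandAlone")


if __name__ == "__main__":
    with open("/proc/drbd") as f:
        status = parse_drbd_status(f.read())
    for minor, fields in sorted(status.items()):
        print(minor, fields)
    print("StandAlone minors:", standalone_minors(status))
```

Run on each node, it prints one entry per DRBD minor and lists any minor in StandAlone; for the output pasted above the list would be empty on both nodes.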

I tried to run "drbdadm connect all" on the secondary node, but that did nothing.

I tried to replace the disks, but it failed:

gnt-instance replace-disks -s myinstance
Thu Jun  2 11:32:00 2016 Replacing disk(s) 0, 1 for myinstance
Thu Jun  2 11:36:00 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=False, pass=1): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Thu Jun  2 11:38:01 2016  - WARNING: Could not prepare block device disk/0 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd4: cannot activate, unknown or unhandled reason
Thu Jun  2 11:40:02 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Failure: command execution error:
Disk consistency error

And now it looks like this:

$ gnt-instance info myinstance
...
    -disk/0 
      on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
      (no more secondary)
      child devices:
        - child 0: lvm, size 20.0G
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
        - child 1: lvm, size 128M
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

On primary node

$ cat /proc/drbd
 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

And on secondary node:

$ cat /proc/drbd
...
4: cs:Unconfigured
5: cs:Unconfigured

Any idea how to solve this?

DRBD version: 8.3.7

Ganeti version: 2.4.5

OS: Debian 6.0

DRBD version: 8.3.7; Ganeti version: 2.4.5; OS: Debian 6.0. Node:8 — dalf, Jun 02 '16 at 06:36
Wow...that's quite some versions back..anyway, did you try to use: `gnt-instance replace-disks -n ` (not sure if this exists in 2.4.5 already, can't check right now)? If so, what does this give? — gxx, Jun 02 '16 at 06:48
yes I tried this command but same result as when I tried with "-s": Failure: command execution error: Disk consistency error — dalf, Jun 02 '16 at 06:54
Do you know, on which node the data is still correct? If so, could you run on the other node (this is the node, where the data should be overwritten) `drbdsetup /dev/drbd4 down`, and after this, `gnt-instance replace-disks -n `. — gxx, Jun 02 '16 at 11:24

score 2 · Accepted Answer · answered Jun 03 '16 at 02:22

After investigating a little more, I found that there was a kvm zombie process on the primary node:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                      
17520 root    20   0     0    0    0 Z  613  0.0  13922:24 kvm <defunct>

I was not sure how to get rid of it properly.
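For context, a zombie (state `Z`) cannot be killed with a signal; its entry only disappears once its parent reaps it or the parent itself exits. A minimal sketch, assuming a Linux `/proc` filesystem, for listing zombies together with their parent PID is shown below. The helper names `parse_stat` and `find_zombies` are hypothetical, and this only identifies the processes; it is not a fix for the stuck DRBD device.

```python
import os


def parse_stat(stat_text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    rest = stat_text[close_paren + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def find_zombies(proc_root="/proc"):
    """Return a list of (pid, comm, ppid) for every process in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process vanished or stat was unreadable; skip it
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} comm={comm} parent={ppid}")
```

Knowing the parent PID at least shows which process would have to reap the zombie.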

I tried to migrate all primary instances (I had just 2) away from this node, but this failed (with an error related to drbd), so I rebooted the node. During shutdown, the machine got stuck because of drbd. The message was something like:

No response from the DRBD driver! Is the module loaded?

So I pushed the button to shut down the machine. The machine rebooted (without any error), and after a few minutes the Ganeti instances started automatically.

On the master node I ran:

$ gnt-instance info myinstance
...
     on primary:   /dev/drbd4 (147:4) *RECOVERING* 12.80%, ETA 288s, status *DEGRADED*
     on secondary: /dev/drbd4 (147:4) *RECOVERING* 12.80%, ETA 275s, status *DEGRADED* *UNCERTAIN STATE*
....

After a few minutes the recovery finished, and the disks are now in sync.
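Instead of re-reading the output by eye, the resync progress can be pulled out of the `*RECOVERING*` lines. This is a minimal sketch, assuming the line format printed by `gnt-instance info` in the output above (other Ganeti versions may format it differently); `parse_recovery` is a hypothetical helper name.

```python
import re

# Matches e.g. "*RECOVERING* 12.80%, ETA 288s" as printed in the output above.
RECOVERING_RE = re.compile(
    r"\*RECOVERING\*\s+(?P<pct>\d+(?:\.\d+)?)%,\s+ETA\s+(?P<eta>\d+)s"
)


def parse_recovery(line):
    """Return (percent_done, eta_seconds), or None if the line shows no resync."""
    m = RECOVERING_RE.search(line)
    if m is None:
        return None
    return float(m.group("pct")), int(m.group("eta"))


if __name__ == "__main__":
    sample = (
        "     on primary:   /dev/drbd4 (147:4) "
        "*RECOVERING* 12.80%, ETA 288s, status *DEGRADED*"
    )
    print(parse_recovery(sample))
```

A line that no longer contains `*RECOVERING*` yields `None`, which can serve as the signal that the resync has finished.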

Conclusion: everything is OK now but I wish I didn't have to reboot the node.

Thanks gf_ for your help.
