I have an OCFS2 cluster running on top of a dual-primary DRBD setup on Ubuntu 16.04. Yesterday I put this cluster into production, and it ran well for a while. Today, however, the cluster seems to have died: after rebooting a node, I can no longer mount the OCFS2 filesystem. When I run:

mount.ocfs2 /dev/drbd0 /mnt/drbd

The command just sits there and never completes the mount. OCFS2 itself seems to start fine, judging by the dmesg -H output:

[ +12.308685] ocfs2: Registered cluster interface o2cb
[ +0.012233] OCFS2 User DLM kernel interface loaded
[Feb24 14:34] o2net: Connected to node edmure (num 0) at 192.168.2.11:7777
[ +4.092023] o2dlm: Joining domain CCEFD26343174950A6BEF9A2F83B6735 ( 0 1 ) 2 nodes

It connects correctly to the other node on the LAN and joins the domain. The DRBD resource is also up and running without any problems:

% cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
srcversion: 2A6B2FA4F0703B49CA9C727 
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:403 nr:4529 dw:4932 dr:1006 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
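
For anyone who wants to check this state from a script rather than by eye, here is a minimal sketch that pulls the cs/ro/ds fields out of the /proc/drbd text. It assumes the DRBD 8.4 layout shown above (one line per minor starting with "N:"); the function names and the SAMPLE constant are made up for illustration, and in practice the text would be read from /proc/drbd.

```python
import re

# Sample text copied from the /proc/drbd output above (illustration only).
SAMPLE = """version: 8.4.5 (api:1/proto:86-101)
srcversion: 2A6B2FA4F0703B49CA9C727
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:403 nr:4529 dw:4932 dr:1006 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
"""


def parse_drbd_status(text):
    """Map each minor number to the key:value fields on its status line."""
    resources = {}
    for line in text.splitlines():
        # Status lines start with the minor number, e.g. " 0: cs:Connected ..."
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if not match:
            continue
        tokens = match.group(2).split()
        fields = dict(tok.split(":", 1) for tok in tokens if ":" in tok)
        resources[int(match.group(1))] = fields
    return resources


def is_dual_primary_healthy(fields):
    """True if the resource is connected, Primary on both sides and UpToDate."""
    return (
        fields.get("cs") == "Connected"
        and fields.get("ro") == "Primary/Primary"
        and fields.get("ds") == "UpToDate/UpToDate"
    )


if __name__ == "__main__":
    for minor, fields in parse_drbd_status(SAMPLE).items():
        print(minor, fields, is_dual_primary_healthy(fields))
```

This only confirms what the output already shows; it says nothing about why the mount blocks.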

However, if I run the mount command, it just hangs. Every two minutes, I get this message in the dmesg output:

[ +23.059786] INFO: task mount.ocfs2:1788 blocked for more than 120 seconds.
[ +0.000932] Not tainted 4.4.0-64-generic #85-Ubuntu
[ +0.000681] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.000697] mount.ocfs2 D ffff880035ccba08 0 1788 1787 0x00000000
[ +0.000005] ffff880035ccba08 ffff8800a9b02000 ffff88013abf0000 ffff8800a9996600
[ +0.000002] ffff880035ccc000 ffff880035ccbbb0 ffff880035ccbba8 ffff8800a9996600
[ +0.000002] 0000000000000000 ffff880035ccba20 ffffffff818384d5 7fffffffffffffff
[ +0.000002] Call Trace:
[ +0.000010] [] schedule+0x35/0x80
[ +0.000002] [] schedule_timeout+0x1b5/0x270
[ +0.000003] [] wait_for_completion+0xb3/0x140
[ +0.000004] [] ? wake_up_q+0x70/0x70
[ +0.000042] [] __ocfs2_cluster_lock.isra.34+0x415/0x750 [ocfs2]
[ +0.000011] [] ? ocfs2_add_lockres_tracking+0x59/0xb0 [ocfs2]
[ +0.000011] [] ocfs2_super_lock+0xa5/0x250 [ocfs2]
[ +0.000014] [] ocfs2_fill_super+0xbda/0x1280 [ocfs2]
[ +0.000004] [] mount_bdev+0x26d/0x2c0
[ +0.000013] [] ? perf_trace_ocfs2_initialize_super+0x210/0x210 [ocfs2]
[ +0.000003] [] ? alloc_pages_current+0x8c/0x110
[ +0.000011] [] ocfs2_mount+0x15/0x20 [ocfs2]
[ +0.000002] [] mount_fs+0x38/0x160
[ +0.000002] [] vfs_kern_mount+0x67/0x110
[ +0.000003] [] do_mount+0x25f/0xda0
[ +0.000002] [] SyS_mount+0x9f/0x100
[ +0.000002] [] entry_SYSCALL_64_fastpath+0x16/0x71

The process is in the D (uninterruptible sleep) state, so there's nothing I can do with it and it just remains in this state. I'm not really sure what to make of this. Other than dmesg, I didn't find any useful logs on the systems. Running strace on the mount process doesn't reveal anything either: it just seems to be waiting, but there is no clue as to what it's waiting for.

My cluster config looks like this:

cluster:
        node_count = 2
        name = media-ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.2.11
        number = 0
        name = edmure
        cluster = media-ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.2.12
        number = 1
        name = brynden
        cluster = media-ocfs2
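
To rule out a simple mistake in this file, a config in this layout can be sanity-checked with a few lines of Python. This is only a sketch: it assumes the stanza format shown above ("cluster:" or "node:" headers followed by "key = value" lines), the checks are my own consistency checks rather than anything o2cb itself performs, and the function names and SAMPLE_CONF constant are hypothetical.

```python
# Sample text copied from the cluster config above (illustration only).
SAMPLE_CONF = """cluster:
        node_count = 2
        name = media-ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.2.11
        number = 0
        name = edmure
        cluster = media-ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.2.12
        number = 1
        name = brynden
        cluster = media-ocfs2
"""


def parse_o2cb_conf(text):
    """Return a list of (stanza_kind, fields) tuples in file order."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # Stanza header such as "cluster:" or "node:"
            stanzas.append((line[:-1], {}))
        elif "=" in line and stanzas:
            key, value = (part.strip() for part in line.split("=", 1))
            stanzas[-1][1][key] = value
    return stanzas


def check_conf(stanzas):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    clusters = [fields for kind, fields in stanzas if kind == "cluster"]
    nodes = [fields for kind, fields in stanzas if kind == "node"]
    for cluster in clusters:
        name = cluster.get("name")
        members = [n for n in nodes if n.get("cluster") == name]
        if cluster.get("node_count") != str(len(members)):
            problems.append(f"{name}: node_count does not match the number of node stanzas")
        for key in ("number", "name", "ip_address"):
            values = [n.get(key) for n in members]
            if len(set(values)) != len(values):
                problems.append(f"{name}: duplicate node {key}")
    return problems


if __name__ == "__main__":
    print(check_conf(parse_o2cb_conf(SAMPLE_CONF)) or "no problems found")
```

Running the same check against the file on each node and comparing the parsed results is a quick way to confirm the two copies are identical.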

Does anybody have any idea how I can fix or further debug this issue?

Oldskool