I am running Ceph (created by the rook-ceph operator v0.9.3) on Kubernetes v1.13. After an unclean shutdown of our cluster, some processes randomly end up in uninterruptible sleep, and after a while the Kubernetes cluster fails to schedule new Pods. Looking through dmesg, I found this:
[ 3021.890423] INFO: task tp_fstore_op:22689 blocked for more than 120 seconds.
[ 3021.890456] Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.890480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.890504] tp_fstore_op D 0 22689 20967 0x00000000
[ 3021.890508] ffff93c0a5dc0080 0000000000000000 ffff93d137954540 ffff93c1fe8d8980
[ 3021.890510] ffff93bf42e823c0 ffffb9ae3834b7b0 ffffffff9e0144b9 0000000000008000
[ 3021.890512] 0000000000000040 ffff93c1fe8d8980 ffff93c0a9156300 ffff93d137954540
[ 3021.890515] Call Trace:
[ 3021.890524] [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.890571] [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890574] [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.890576] [<ffffffff9e017d4d>] ? schedule_timeout+0x1dd/0x380
[ 3021.890602] [<ffffffffc0b8556d>] ? _xfs_log_force_lsn+0x22d/0x320 [xfs]
[ 3021.890613] [<ffffffff9daf107e>] ? ktime_get+0x3e/0xb0
[ 3021.890635] [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890638] [<ffffffff9e01421d>] ? io_schedule_timeout+0x9d/0x100
[ 3021.890659] [<ffffffffc0b71e24>] ? __xfs_iunpin_wait+0xd4/0x160 [xfs]
[ 3021.890662] [<ffffffff9dabd3f0>] ? wake_atomic_t_function+0x60/0x60
[ 3021.890681] [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890699] [<ffffffffc0b6970e>] ? xfs_reclaim_inodes_ag+0x1de/0x300 [xfs]
[ 3021.890702] [<ffffffff9db91885>] ? node_dirty_ok+0x125/0x170
[ 3021.890704] [<ffffffff9dd53419>] ? list_del+0x9/0x30
[ 3021.890707] [<ffffffff9dbe599a>] ? page_is_poisoned+0xa/0x20
[ 3021.890709] [<ffffffff9db8ba0e>] ? get_page_from_freelist+0x88e/0xb20
[ 3021.890712] [<ffffffff9daae1ff>] ? select_task_rq_fair+0x51f/0x7e0
[ 3021.890714] [<ffffffff9daad9d5>] ? select_idle_sibling+0x25/0x330
[ 3021.890716] [<ffffffff9daa5674>] ? try_to_wake_up+0x54/0x3c0
[ 3021.890734] [<ffffffffc0b6a771>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[ 3021.890736] [<ffffffff9dc0eed8>] ? super_cache_scan+0x188/0x190
[ 3021.890738] [<ffffffff9db97a0a>] ? shrink_slab.part.38+0x21a/0x440
[ 3021.890740] [<ffffffff9db9c3ca>] ? shrink_node+0x10a/0x340
[ 3021.890742] [<ffffffff9db9c6f1>] ? do_try_to_free_pages+0xf1/0x310
[ 3021.890744] [<ffffffff9dd38b6a>] ? __next_node_in+0x3a/0x50
[ 3021.890745] [<ffffffff9db9cb73>] ? try_to_free_mem_cgroup_pages+0xc3/0x1a0
[ 3021.890748] [<ffffffff9dbfd147>] ? try_charge+0x147/0x6f0
[ 3021.890750] [<ffffffff9dc01237>] ? mem_cgroup_try_charge+0x67/0x1b0
[ 3021.890752] [<ffffffff9dbbb1d2>] ? handle_mm_fault+0x10e2/0x1310
[ 3021.890755] [<ffffffff9dc0ac30>] ? new_sync_write+0xe0/0x130
[ 3021.890758] [<ffffffff9da622f5>] ? __do_page_fault+0x255/0x4f0
[ 3021.890760] [<ffffffff9e01a618>] ? page_fault+0x28/0x30
Immediately after that, accesses to the RBDs produce similar errors:
[ 3021.890820] INFO: task xfsaild/rbd2:23307 blocked for more than 120 seconds.
[ 3021.890845] Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.890867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.890896] xfsaild/rbd2 D 0 23307 2 0x00000000
[ 3021.890898] ffff93c182e46480 0000000000000000 ffff93d0d3a4ca00 ffff93d1fdb58980
[ 3021.890900] ffff93d1f6a4a180 ffffb9ae24e07d80 ffffffff9e0144b9 0000000000000246
[ 3021.890903] 00ffffff9dae787d ffff93d1fdb58980 e182622c538e97d5 ffff93d0d3a4ca00
[ 3021.890905] Call Trace:
[ 3021.890909] [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.890911] [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.890948] [<ffffffffc0b8508c>] ? _xfs_log_force+0x15c/0x2b0 [xfs]
[ 3021.890949] [<ffffffff9daa5a70>] ? wake_up_q+0x70/0x70
[ 3021.890973] [<ffffffffc0b92895>] ? xfsaild+0x1a5/0x7a0 [xfs]
[ 3021.890994] [<ffffffffc0b926f0>] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 3021.890996] [<ffffffff9da9a5d9>] ? kthread+0xd9/0xf0
[ 3021.890998] [<ffffffff9e019364>] ? __switch_to_asm+0x34/0x70
[ 3021.891000] [<ffffffff9da9a500>] ? kthread_park+0x60/0x60
[ 3021.891002] [<ffffffff9e0193f7>] ? ret_from_fork+0x57/0x70
[ 3021.891004] INFO: task xfsaild/rbd3:23438 blocked for more than 120 seconds.
[ 3021.891027] Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.891050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.891074] xfsaild/rbd3 D 0 23438 2 0x00000000
[ 3021.891075] ffff93c0fb0464c0 0000000000000000 ffff93d0a88f61c0 ffff93d1fdd18980
[ 3021.891077] ffff93d1f6a80340 ffffb9ae24e37d80 ffffffff9e0144b9 0000000000000246
[ 3021.891080] 00ffffff9dae787d ffff93d1fdd18980 10168cfc448e06f4 ffff93d0a88f61c0
[ 3021.891081] Call Trace:
[ 3021.891084] [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.891086] [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.891108] [<ffffffffc0b8508c>] ? _xfs_log_force+0x15c/0x2b0 [xfs]
[ 3021.891109] [<ffffffff9daa5a70>] ? wake_up_q+0x70/0x70
[ 3021.891130] [<ffffffffc0b92895>] ? xfsaild+0x1a5/0x7a0 [xfs]
[ 3021.891151] [<ffffffffc0b926f0>] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 3021.891153] [<ffffffff9da9a5d9>] ? kthread+0xd9/0xf0
[ 3021.891154] [<ffffffff9e019364>] ? __switch_to_asm+0x34/0x70
[ 3021.891156] [<ffffffff9da9a500>] ? kthread_park+0x60/0x60
[ 3021.891158] [<ffffffff9e0193f7>] ? ret_from_fork+0x57/0x70
There are more errors in dmesg, but they all follow the same pattern: a process tries to perform some operation on XFS, the kernel task gets stuck, and the process remains in uninterruptible sleep.
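For completeness, this is roughly how the stuck tasks can be enumerated (standard tools, nothing rook-specific; treat it as a sketch):

# List processes currently in uninterruptible sleep (state D)
ps -eo state,pid,ppid,comm,wchan | awk '$1 == "D"'
# Show all hung-task reports collected so far
dmesg -T | grep -B1 -A2 "blocked for more than"
# Ask the kernel to dump the stacks of all blocked tasks (output goes to dmesg)
echo w > /proc/sysrq-trigger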
Shortly after, libceph reports that the OSDs are down:
[ 4218.521314] libceph: osd0 down
Journalctl does not report any additional errors.
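From the Ceph side, the cluster state can be inspected through the Rook toolbox; roughly like this (assuming the rook-ceph-tools pod from the Rook examples is deployed and labelled app=rook-ceph-tools):

# Overall cluster health and OSD tree, executed inside the toolbox pod
TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
kubectl -n rook-ceph exec -it "$TOOLS_POD" -- ceph status
kubectl -n rook-ceph exec -it "$TOOLS_POD" -- ceph osd tree
# Logs of the rook agent that maps the RBDs (namespace/label as in the v0.9 examples)
kubectl -n rook-ceph-system logs -l app=rook-ceph-agent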
The unclean shutdown had become necessary because of similar problems that occurred when a Kubernetes Pod tried to write a file that was too large for the attached Volume (also provisioned by rook-ceph). This is the config I am using:
Cluster config:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: "ceph/ceph:v13.2.5-20190319"
  dataDirHostPath: "/var/rook/data"
  dashboard:
    enabled: True
    port: 80
    ssl: False
  network:
    hostNetwork: False # use SDN (Canal) as network
  mon:
    count: 3
    allowMultiplePerNode: True
  resources: # http://docs.ceph.com/docs/mimic/start/hardware-recommendations/
    mgr:
      requests:
        cpu: 4
        memory: "2Gi"
      limits:
        cpu: 4
        memory: "2Gi"
    mon:
      requests:
        cpu: 0.5
        memory: "2Gi"
      limits:
        cpu: 0.5
        memory: "2Gi"
    osd:
      requests:
        cpu: 2
        memory: "5Gi"
      limits:
        cpu: 2
        memory: "5Gi"
  storage:
    useAllNodes: False
    nodes:
      - name: "kubernetes-master" # matches node label: kubernetes.io/hostname
        useAllDevices: False
        directories:
          - path: "/var/rook/filestore"
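Since the OSD is a directory-based filestore under /var/rook/filestore, the backing filesystem and its remaining space can be checked directly on the node; a quick sketch (run on kubernetes-master):

# Filesystem type, device and usage of the OSD directory
df -hT /var/rook/filestore
# Mount options of the filesystem backing that path
findmnt -T /var/rook/filestore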
BlockPool config:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: volatile-replicapool
  namespace: rook-ceph
spec:
  failureDomain: osd
  replicated:
    size: 1
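The pool intentionally has no replication (size 1); the effective values can be cross-checked against the running cluster from the toolbox, e.g.:

# Confirm the replication factor and min_size of the pool
ceph osd pool get volatile-replicapool size
ceph osd pool get volatile-replicapool min_size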
And the StorageClasses:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-development
provisioner: ceph.rook.io/block
parameters:
  blockPool: volatile-replicapool
  clusterNamespace: rook-ceph
  fstype: xfs
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-production
provisioner: ceph.rook.io/block
parameters:
  blockPool: volatile-replicapool
  clusterNamespace: rook-ceph
  fstype: xfs
reclaimPolicy: Retain
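The affected Volumes are ordinary PVCs against these StorageClasses; for illustration, such a claim looks roughly like this (the claim name and size below are made up):

# Hypothetical PVC against the development StorageClass
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim   # made-up name, for illustration only
spec:
  storageClassName: ceph-block-development
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF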
I am running Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64.
Any pointers as to how to debug this issue would be greatly appreciated.
Thanks in advance.