I am running Ceph (created by the rook-ceph operator v0.9.3) on Kubernetes v1.13. After an unclean shutdown of our cluster, some processes randomly end up in uninterruptible sleep, and after a while the Kubernetes cluster fails to schedule new Pods. Looking through dmesg, I found this:
[ 3021.890423] INFO: task tp_fstore_op:22689 blocked for more than 120 seconds.
[ 3021.890456] Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.890480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.890504] tp_fstore_op D 0 22689 20967 0x00000000
[ 3021.890508] ffff93c0a5dc0080 0000000000000000 ffff93d137954540 ffff93c1fe8d8980
[ 3021.890510] ffff93bf42e823c0 ffffb9ae3834b7b0 ffffffff9e0144b9 0000000000008000
[ 3021.890512] 0000000000000040 ffff93c1fe8d8980 ffff93c0a9156300 ffff93d137954540
[ 3021.890515] Call Trace:
[ 3021.890524] [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.890571] [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890574] [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.890576] [<ffffffff9e017d4d>] ? schedule_timeout+0x1dd/0x380
[ 3021.890602] [<ffffffffc0b8556d>] ? _xfs_log_force_lsn+0x22d/0x320 [xfs]
[ 3021.890613] [<ffffffff9daf107e>] ? ktime_get+0x3e/0xb0
[ 3021.890635] [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890638] [<ffffffff9e01421d>] ? io_schedule_timeout+0x9d/0x100
[ 3021.890659] [<ffffffffc0b71e24>] ? __xfs_iunpin_wait+0xd4/0x160 [xfs]
[ 3021.890662] [<ffffffff9dabd3f0>] ? wake_atomic_t_function+0x60/0x60
[ 3021.890681] [<ffffffffc0b69321>] ? xfs_reclaim_inode+0x131/0x340 [xfs]
[ 3021.890699] [<ffffffffc0b6970e>] ? xfs_reclaim_inodes_ag+0x1de/0x300 [xfs]
[ 3021.890702] [<ffffffff9db91885>] ? node_dirty_ok+0x125/0x170
[ 3021.890704] [<ffffffff9dd53419>] ? list_del+0x9/0x30
[ 3021.890707] [<ffffffff9dbe599a>] ? page_is_poisoned+0xa/0x20
[ 3021.890709] [<ffffffff9db8ba0e>] ? get_page_from_freelist+0x88e/0xb20
[ 3021.890712] [<ffffffff9daae1ff>] ? select_task_rq_fair+0x51f/0x7e0
[ 3021.890714] [<ffffffff9daad9d5>] ? select_idle_sibling+0x25/0x330
[ 3021.890716] [<ffffffff9daa5674>] ? try_to_wake_up+0x54/0x3c0
[ 3021.890734] [<ffffffffc0b6a771>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
[ 3021.890736] [<ffffffff9dc0eed8>] ? super_cache_scan+0x188/0x190
[ 3021.890738] [<ffffffff9db97a0a>] ? shrink_slab.part.38+0x21a/0x440
[ 3021.890740] [<ffffffff9db9c3ca>] ? shrink_node+0x10a/0x340
[ 3021.890742] [<ffffffff9db9c6f1>] ? do_try_to_free_pages+0xf1/0x310
[ 3021.890744] [<ffffffff9dd38b6a>] ? __next_node_in+0x3a/0x50
[ 3021.890745] [<ffffffff9db9cb73>] ? try_to_free_mem_cgroup_pages+0xc3/0x1a0
[ 3021.890748] [<ffffffff9dbfd147>] ? try_charge+0x147/0x6f0
[ 3021.890750] [<ffffffff9dc01237>] ? mem_cgroup_try_charge+0x67/0x1b0
[ 3021.890752] [<ffffffff9dbbb1d2>] ? handle_mm_fault+0x10e2/0x1310
[ 3021.890755] [<ffffffff9dc0ac30>] ? new_sync_write+0xe0/0x130
[ 3021.890758] [<ffffffff9da622f5>] ? __do_page_fault+0x255/0x4f0
[ 3021.890760] [<ffffffff9e01a618>] ? page_fault+0x28/0x30
Immediately after that, accesses to the RBDs produce similar errors:
[ 3021.890820] INFO: task xfsaild/rbd2:23307 blocked for more than 120 seconds.
[ 3021.890845] Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.890867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.890896] xfsaild/rbd2 D 0 23307 2 0x00000000
[ 3021.890898] ffff93c182e46480 0000000000000000 ffff93d0d3a4ca00 ffff93d1fdb58980
[ 3021.890900] ffff93d1f6a4a180 ffffb9ae24e07d80 ffffffff9e0144b9 0000000000000246
[ 3021.890903] 00ffffff9dae787d ffff93d1fdb58980 e182622c538e97d5 ffff93d0d3a4ca00
[ 3021.890905] Call Trace:
[ 3021.890909] [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.890911] [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.890948] [<ffffffffc0b8508c>] ? _xfs_log_force+0x15c/0x2b0 [xfs]
[ 3021.890949] [<ffffffff9daa5a70>] ? wake_up_q+0x70/0x70
[ 3021.890973] [<ffffffffc0b92895>] ? xfsaild+0x1a5/0x7a0 [xfs]
[ 3021.890994] [<ffffffffc0b926f0>] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 3021.890996] [<ffffffff9da9a5d9>] ? kthread+0xd9/0xf0
[ 3021.890998] [<ffffffff9e019364>] ? __switch_to_asm+0x34/0x70
[ 3021.891000] [<ffffffff9da9a500>] ? kthread_park+0x60/0x60
[ 3021.891002] [<ffffffff9e0193f7>] ? ret_from_fork+0x57/0x70
[ 3021.891004] INFO: task xfsaild/rbd3:23438 blocked for more than 120 seconds.
[ 3021.891027] Tainted: G O 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[ 3021.891050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.891074] xfsaild/rbd3 D 0 23438 2 0x00000000
[ 3021.891075] ffff93c0fb0464c0 0000000000000000 ffff93d0a88f61c0 ffff93d1fdd18980
[ 3021.891077] ffff93d1f6a80340 ffffb9ae24e37d80 ffffffff9e0144b9 0000000000000246
[ 3021.891080] 00ffffff9dae787d ffff93d1fdd18980 10168cfc448e06f4 ffff93d0a88f61c0
[ 3021.891081] Call Trace:
[ 3021.891084] [<ffffffff9e0144b9>] ? __schedule+0x239/0x6f0
[ 3021.891086] [<ffffffff9e0149a2>] ? schedule+0x32/0x80
[ 3021.891108] [<ffffffffc0b8508c>] ? _xfs_log_force+0x15c/0x2b0 [xfs]
[ 3021.891109] [<ffffffff9daa5a70>] ? wake_up_q+0x70/0x70
[ 3021.891130] [<ffffffffc0b92895>] ? xfsaild+0x1a5/0x7a0 [xfs]
[ 3021.891151] [<ffffffffc0b926f0>] ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
[ 3021.891153] [<ffffffff9da9a5d9>] ? kthread+0xd9/0xf0
[ 3021.891154] [<ffffffff9e019364>] ? __switch_to_asm+0x34/0x70
[ 3021.891156] [<ffffffff9da9a500>] ? kthread_park+0x60/0x60
[ 3021.891158] [<ffffffff9e0193f7>] ? ret_from_fork+0x57/0x70
There are more errors in dmesg, but they all follow the same pattern: a process tries to perform some operation on XFS, the kernel task gets stuck, and the process remains in uninterruptible sleep.
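For completeness, this is roughly how the stuck tasks can be enumerated (standard tools, nothing rook-specific; treat it as a sketch):

# List processes currently in uninterruptible sleep (state D)
ps -eo state,pid,ppid,comm,wchan | awk '$1 == "D"'
# Show all hung-task reports collected so far
dmesg -T | grep -B1 -A2 "blocked for more than"
# Ask the kernel to dump the stacks of all blocked tasks (output goes to dmesg)
echo w > /proc/sysrq-trigger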
Shortly after, libceph reports that the OSDs are down:
[ 4218.521314] libceph: osd0 down
Journalctl does not report any additional errors.
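From the Ceph side, the cluster state can be inspected through the Rook toolbox; roughly like this (assuming the rook-ceph-tools pod from the Rook examples is deployed and labelled app=rook-ceph-tools):

# Overall cluster health and OSD tree, executed inside the toolbox pod
TOOLS_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
kubectl -n rook-ceph exec -it "$TOOLS_POD" -- ceph status
kubectl -n rook-ceph exec -it "$TOOLS_POD" -- ceph osd tree
# Logs of the rook agent that maps the RBDs (namespace/label as in the v0.9 examples)
kubectl -n rook-ceph-system logs -l app=rook-ceph-agent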
The unclean shutdown had become necessary because of similar problems that occurred when a Kubernetes Pod tried to write a file that was too large for the attached Volume (also provisioned by rook-ceph). This is the config I am using:
Cluster config:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: "ceph/ceph:v13.2.5-20190319"
  dataDirHostPath: "/var/rook/data"
  dashboard:
    enabled: True
    port: 80
    ssl: False
  network:
    hostNetwork: False # use SDN (Canal) as network
  mon:
    count: 3
    allowMultiplePerNode: True
  resources: # http://docs.ceph.com/docs/mimic/start/hardware-recommendations/
    mgr:
      requests:
        cpu: 4
        memory: "2Gi"
      limits:
        cpu: 4
        memory: "2Gi"
    mon:
      requests:
        cpu: 0.5
        memory: "2Gi"
      limits:
        cpu: 0.5
        memory: "2Gi"
    osd:
      requests:
        cpu: 2
        memory: "5Gi"
      limits:
        cpu: 2
        memory: "5Gi"
  storage:
    useAllNodes: False
    nodes:
      - name: "kubernetes-master" # matches node label: kubernetes.io/hostname
        useAllDevices: False
        directories:
          - path: "/var/rook/filestore"
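Since the OSD is a directory-based filestore under /var/rook/filestore, the backing filesystem and its remaining space can be checked directly on the node; a quick sketch (run on kubernetes-master):

# Filesystem type, device and usage of the OSD directory
df -hT /var/rook/filestore
# Mount options of the filesystem backing that path
findmnt -T /var/rook/filestore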
BlockPool config:
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: volatile-replicapool
  namespace: rook-ceph
spec:
  failureDomain: osd
  replicated:
    size: 1
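The pool intentionally has no replication (size 1); the effective values can be cross-checked against the running cluster from the toolbox, e.g.:

# Confirm the replication factor and min_size of the pool
ceph osd pool get volatile-replicapool size
ceph osd pool get volatile-replicapool min_size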
And the StorageClasses:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-development
provisioner: ceph.rook.io/block
parameters:
  blockPool: volatile-replicapool
  clusterNamespace: rook-ceph
  fstype: xfs
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-production
provisioner: ceph.rook.io/block
parameters:
  blockPool: volatile-replicapool
  clusterNamespace: rook-ceph
  fstype: xfs
reclaimPolicy: Retain
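The affected Volumes are ordinary PVCs against these StorageClasses; for illustration, such a claim looks roughly like this (the claim name and size below are made up):

# Hypothetical PVC against the development StorageClass
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim   # made-up name, for illustration only
spec:
  storageClassName: ceph-block-development
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF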
I am running Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64.
Any pointers as to how to debug this issue would be greatly appreciated.
Thanks in advance.