
Our setup is a 3-node RHEL 7.3 bare-metal Kubernetes cluster running on Docker.

We have a multipath FC SAN block device discovered on all three nodes. This device backs a Kubernetes PersistentVolume with an ext4 filesystem. The object definition is:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2019-01-04T13:49:42Z
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    ...
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 15Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: ...
    namespace: ...
    resourceVersion: ...
    uid: ...
  fc:
    fsType: ext4
    lun: 1
    targetWWNs:
    - ...04
    - ...15
  persistentVolumeReclaimPolicy: Retain
status:
  phase: Bound

The pod using this volume crashed, and on restart the kubelet began complaining about a filesystem inconsistency, asking for fsck to be run:

Warning  FailedMount            1m (x13 over 26m)  kubelet, node2     MountVolume.WaitForAttach failed for volume "rtbm-prod-influxdb-pv" : fc: failed to mount fc volume /dev/dm-9 [ext4] to /var/lib/kubelet/plugins/kubernetes.io/fc/50060e801232d404-lun-1, error 'fsck' found errors on device /dev/dm-9 but could not correct them: fsck from util-linux 2.23.2
k8s-san-0 contains a file system with errors, check forced.
k8s-san-0: Entry '675' in /data/_internal/monitor (262147) has an incorrect filetype (was 2, should be 1).


k8s-san-0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

However, we were unable to kick off the fsck. Even after undeploying the pod, fsck still complained:

# fsck.ext4 /dev/mapper/mpathb
e2fsck 1.42.9 (28-Dec-2013)
/dev/mapper/mpathb is in use.
e2fsck: Cannot continue, aborting.

I tried to see what exactly was using the device:

# mount -l | grep -i mpathb
# lsof /dev/mapper/mpathb
# grep mpathb /proc/mounts
# fuser -m /dev/mapper/mpathb

But the usage was invisible to all of these tools. What else could I check to find out what's holding my block device?
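One avenue worth trying is to ask device-mapper itself, since a dm device can be held open by another dm table stacked on top of it rather than by a process or a mount. A sketch (the `dm-9` name is taken from the kubelet error above; names may differ on your system):

```shell
# Show the open count for the multipath map. An open count > 0 with no
# visible mounts usually points at a holder such as a partition mapping,
# an LVM PV, or another dm table stacked on top.
dmsetup info /dev/mapper/mpathb

# List device-mapper devices layered on top of this one
ls /sys/block/dm-9/holders

# Render the whole device-mapper stack as a tree
dmsetup ls --tree

# A mount inside a container's mount namespace will not show up in the
# host's /proc/mounts; scan every process's mountinfo instead
grep -l mpathb /proc/*/mountinfo 2>/dev/null
```

The last check matters on a Docker-based node: a container that still has the filesystem mounted in its own namespace keeps the device busy while `mount -l` and `/proc/mounts` on the host show nothing.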

  • Are you using multipath storage drivers? Such as multipath scsi? Maybe you can stop the service with `service multipathd stop`. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/dm_multipath/mpio_setup – zymhan Jan 04 '19 at 14:34
  • 1
    Yes, I am. Thanks for the hint. In the meantime we've stopped docker and kubelet services which allowed us to fix the issue, but I was wondering if there's a more fine-grained approach that doesn't require a full node outage (we have multiple SAN devices on this node and stopping the multipath daemon would affect them all). – Bernard Halas Jan 04 '19 at 14:45
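A more surgical option than stopping multipathd, suggested by the comment above, is to flush only the affected map once nothing holds it. This is a sketch, untested against this exact setup, and `sdX` is a placeholder for one of the underlying paths:

```shell
# List the SCSI paths behind the map before flushing it
multipath -ll mpathb

# Flush only this one multipath map; other maps stay online
multipath -f mpathb

# Run the repair against one of the underlying paths
# (replace sdX with a path shown by `multipath -ll mpathb` above)
fsck.ext4 -f /dev/sdX

# Rebuild the multipath maps afterwards
multipath -r
```

This avoids the full node outage, since `multipath -f` operates on a single map while the multipathd service and the node's other SAN devices keep running.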

0 Answers