Currently trying to handle this right over my 3-day Memorial Day weekend :D
- Ceph 13.2.4 (Filestore)
- Rook 0.9
- Kubernetes 1.14.1
https://gist.github.com/sfxworks/ce77473a93b96570af319120e74535ec
My setup is a Kubernetes cluster with Rook handling Ceph. Using 13.2.4, one of my OSDs keeps restarting. This started recently; there was no power failure or anything else unusual on the node.
2019-05-25 01:06:07.192 7fb923359700 3 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.4/rpm/el7/BUILD/ceph-13.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1929] Compaction error: Corruption: block checksum mismatch: expected 862584094, got 1969278739 in /var/lib/rook/osd1/current/omap/002408.sst offset 15647059 size 3855
There are a few more entries in the gist with a similar error message. The only other distinct one states:
2019-05-25 01:06:07.192 7fb939a4a1c0 0 filestore(/var/lib/rook/osd1) EPERM suggests file(s) in osd data dir not owned by ceph user, or leveldb corruption
I checked this on the node: all the files are owned by root, the same as on the other OSDs. It's also containerized, and deleting the pod so the operator recreates it did not help.
The only thing I was able to find was https://tracker.ceph.com/issues/21303, but that seems to be over a year old, and I am not sure where to begin. Any leads, pointers to documentation to follow, or a solution if you have one, would be a great help. I see some tools for BlueStore, but I do not know how applicable they are to Filestore, and I want to be very careful given the situation.
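For what it's worth, the Filestore-era counterpart of those BlueStore tools is ceph-objectstore-tool, which can at least take a safety copy of PGs off the bad OSD before anything destructive. A rough sketch, assuming the OSD daemon is stopped first and using a placeholder PG id (0.1a) that would come from the list-pgs output:

```shell
# With the osd.1 pod/daemon stopped, list the PGs held on the Filestore OSD.
# (A non-default journal location would also need --journal-path.)
ceph-objectstore-tool --data-path /var/lib/rook/osd1 --op list-pgs

# Export one PG to a file as a backup before destroying anything.
# 0.1a is a hypothetical PG id taken from the list above.
ceph-objectstore-tool --data-path /var/lib/rook/osd1 --pgid 0.1a \
    --op export --file /tmp/pg-0.1a.export
```

Whether the exports are even readable given the omap corruption is another question, but it costs little to try before wiping the disk.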
In the worst-case scenario, I have backups. Willing to try things within reason.
Edit:
If it's just the one OSD, am I safe to destroy it and have Rook remake it?
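In case it helps frame the question, this is the standard replacement flow I'd expect to follow, sketched with osd.1 assumed to be the bad one (verify with `ceph osd tree` first); it should only be safe while the surviving OSDs still hold a copy of every PG:

```shell
# Mark the OSD out so data rebalances off it (it already shows as down/out here).
ceph osd out osd.1

# After recovery settles, remove it entirely.
# purge = crush remove + auth del + osd rm in one step (available since Luminous).
ceph osd purge 1 --yes-i-really-mean-it

# Then wipe the backing disk and let the Rook operator re-create the OSD.
```

Mostly asking whether this is sane with 48 PGs already undersized, or whether I risk losing the last good copy of something.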
Here's ceph status as of late:
sh-4.2# ceph status
  cluster:
    id:     e5a100b0-6abd-4968-8895-300501aa9200
    health: HEALTH_WARN
            Degraded data redundancy: 3407/13644 objects degraded (24.971%), 48 pgs degraded, 48 pgs undersized

  services:
    mon: 3 daemons, quorum c,a,e
    mgr: a(active)
    osd: 3 osds: 2 up, 2 in

  data:
    pools:   1 pools, 100 pgs
    objects: 6.82 k objects, 20 GiB
    usage:   109 GiB used, 792 GiB / 900 GiB avail
    pgs:     3407/13644 objects degraded (24.971%)
             52 active+clean
             48 active+undersized+degraded

  io:
    client: 366 KiB/s wr, 0 op/s rd, 42 op/s wr