
Currently trying to handle this right in the middle of my 3-day Memorial Day weekend :D

  • Ceph 13.2.4 (Filestore)
  • Rook 0.9
  • Kubernetes 1.14.1

https://gist.github.com/sfxworks/ce77473a93b96570af319120e74535ec

My setup is a Kubernetes cluster with Rook handling Ceph. Running 13.2.4, I have an issue where one of my OSDs keeps restarting. This started recently; no power failure or anything else notable occurred on the node.

2019-05-25 01:06:07.192 7fb923359700  3 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.4/rpm/el7/BUILD/ceph-13.2.4/src/rocksdb/db/db_impl_compaction_flush.cc:1929] Compaction error: Corruption: block checksum mismatch: expected 862584094, got 1969278739  in /var/lib/rook/osd1/current/omap/002408.sst offset 15647059 size 3855

There are a few more entries in the gist with a similar error message. The only other distinct one states:

2019-05-25 01:06:07.192 7fb939a4a1c0  0 filestore(/var/lib/rook/osd1) EPERM suggests file(s) in osd data dir not owned by ceph user, or leveldb corruption

I checked this on the node. Everything is owned by root, the same as on the other OSDs. It's also containerized, and deleting the pod so the operator would recreate it did not help.
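
Roughly what I ran to check ownership and recycle the pod (a sketch; the rook-ceph namespace and the pod label are Rook 0.9 defaults and may differ in other deployments):

    # on the node: confirm ownership of the OSD data dir
    ls -ln /var/lib/rook/osd1 | head

    # delete the OSD pod so the operator recreates it
    kubectl -n rook-ceph get pods -l app=rook-ceph-osd
    kubectl -n rook-ceph delete pod <osd-1-pod-name>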

The only thing I was able to find that might help was https://tracker.ceph.com/issues/21303, but that seems to be a year old. I am not sure where to begin with this. Any leads, pointers to documentation to follow, or a solution if you have one would be a great help. I see some tools for BlueStore, but I do not know how applicable they are to Filestore and want to be very careful given the situation.

In the worst-case scenario, I have backups. Willing to try things within reason.

Edit:

If it's just an OSD, am I safe to destroy it and have Rook remake it? Here's ceph status as of late:

sh-4.2# ceph status
  cluster:
    id:     e5a100b0-6abd-4968-8895-300501aa9200
    health: HEALTH_WARN
            Degraded data redundancy: 3407/13644 objects degraded (24.971%), 48 pgs degraded, 48 pgs undersized

  services:
    mon: 3 daemons, quorum c,a,e
    mgr: a(active)
    osd: 3 osds: 2 up, 2 in

  data:
    pools:   1 pools, 100 pgs
    objects: 6.82 k objects, 20 GiB
    usage:   109 GiB used, 792 GiB / 900 GiB avail
    pgs:     3407/13644 objects degraded (24.971%)
             52 active+clean
             48 active+undersized+degraded

  io:
    client:   366 KiB/s wr, 0 op/s rd, 42 op/s wr
sfxworks
  • First things first: you have too few PGs. – BMDan May 25 '19 at 06:37
  • What is the recommended amount of pages to have? I admit a lot of these are set to however Rook defaulted them; I've not yet had the time to configure this all properly. – sfxworks May 25 '19 at 17:15
  • `pg`s are Placement Groups, not pages. You probably want to read up a bit on Ceph internals; PGs are pretty fundamental to grokking what's going on under the hood, and you'll have a pretty rough time fixing problems without that knowledge. – BMDan May 25 '19 at 23:59
  • Ah thanks for that. I will definitely be doing some online studying sooner than I planned due to this. – sfxworks May 26 '19 at 05:49

1 Answer


If it's just an OSD, am I safe to destroy it and have Rook remake it?

Based on your ceph status, you have degraded data, but no stuck/down data. So yes, you can kill the third OSD, but note that in doing so you leave yourself vulnerable to anything that could take either of the remaining OSDs offline while you work to bring up a replacement third.
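
If you do go that route, the removal looks roughly like this (a sketch only: it assumes the failing OSD is osd.1, Rook's default rook-ceph namespace, and Rook 0.9's per-OSD deployment naming, so verify the names on your cluster first):

    # from the rook toolbox (or any pod with ceph admin access):
    ceph osd out osd.1
    ceph osd purge 1 --yes-i-really-mean-it    # drops it from the CRUSH map, auth, and OSD map

    # then remove the old deployment and data dir so the operator re-provisions:
    kubectl -n rook-ceph delete deployment rook-ceph-osd-1
    # on the node, once you're sure: rm -rf /var/lib/rook/osd1

Ceph will then backfill onto the replacement; watch ceph status until it returns to HEALTH_OK.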

EPERM

Are you doing something very silly, like running this on top of NFS? What does df /var/lib/rook/osd1 show, and what about grep /var/lib/rook/osd1 /proc/mounts?
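
Concretely, something like this (the grep may come back empty if osd1 isn't its own mount point; check the parent directory in that case):

    df /var/lib/rook/osd1
    grep /var/lib/rook/osd1 /proc/mounts
    # an nfs/nfs4 filesystem type here would confirm the NFS hypothesis;
    # a local device (ext4/xfs on /dev/sdX, /dev/mdX, etc.) mostly rules it out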

block checksum mismatch

This also aligns with the NFS hypothesis, but it could also be caused by bad hardware, bad drivers, bad FS drivers, (admittedly, only very) bad VFS config, or a few other things that I can't think of at the moment. A few shots in the dark (rough commands for each are sketched after the list):

  • Any chance more than one daemon is occupying the same data directory by accident?
  • What's the uptime on the machine?
  • Is the hardware under any particular hardship (e.g. overclocking)? Does a CPU/memory stress-test complete successfully?
  • You don't specify VM vs. hardware, but regardless, does any level of your stack have -o nobarrier set on a relevant FS?
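
For reference, rough commands for those checks (stress-ng is just one option for the stress test; substitute whatever tool you prefer):

    # anything else holding files open on the filesystem backing the OSD dir?
    fuser -vm /var/lib/rook/osd1

    # uptime and any unexpected reboots
    uptime
    last reboot | head

    # kernel complaints about hardware (MCEs, disk resets, memory errors)
    dmesg -T | grep -iE 'error|mce|ata|nvme' | tail -n 50

    # CPU/memory stress test, if stress-ng is installed
    stress-ng --cpu 4 --vm 2 --vm-bytes 75% --timeout 10m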

Followups

@quantomworks asked:

I am unsure what you mean by "-o nobarrier".

With Ceph running atop md devices (Linux software RAID), you have at least two filesystems, and as many as infinity. Specifically, you have:

  1. The filesystem you're hosting atop Ceph.
  2. The filesystem that hosts the OSD files themselves.
  3. (through ∞) The filesystem(s) that underlie the filesystems above, e.g. the filesystem that hosts a file mounted via -o loop upon which an OSD is hosted.

This can be rather difficult to track down without specialized tools, and even when you've done so, you can't actually guarantee that barriers are being honored, because drivers lie, firmware lies, and hardware lies. Basically, the fact that I/O happens at all is a minor miracle. What I was asking for here is probably most easily solved by grep -i barrier /proc/mounts on all relevant machines, rather than actually trying to work out which FSes are relevant in truth.
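
For example (the commented line is only an illustration of what a problematic mount entry would look like, not output from your machine):

    grep -i barrier /proc/mounts
    # a hit like this would be a red flag:
    #   /dev/md3 /var/lib/rook ext4 rw,noatime,nobarrier 0 0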

Anyway, one of the easier ways to make Ceph OSDs very grumpy is to provide unreliable write semantics. A "barrier" is a tool used in a stream of writes to ensure that, regardless of any downstream batching, nothing after the barrier will ever be persisted to disk before everything before the barrier is persisted. A simple example, a bank transfer:

Write 1: Reduce Abby's account balance by $100
Write 2: Increase Bobby's account balance by $100

In this scenario, if, due to batching with some previous writes, positioning of heads over magnetic media, or solar flares, Write 2 happens first and then the machine loses power, little Bobby's just gotten some "free" money. Therefore, at Legal's insistence, we insert a barrier request between Writes 1 and 2, thereby guaranteeing that while we might time-travel backwards a bit if we lose power, we'll never have lost a cent of our own money. (Abby's loss would also be reimbursed if we lived in a transactionally-consistent world like a database, of course, but barriers are one of the ways such worlds are constructed in the first place.)
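
In shell terms, the barrier is just a flush forced between the two writes (a toy sketch; real code would fsync() the specific file descriptor rather than calling a global sync):

    # write 1 must be durable before write 2 is issued
    echo "debit  abby  100"  >> ledger.txt
    sync                      # the barrier: flush dirty data to stable storage
    echo "credit bobby 100"  >> ledger.txt

With barriers disabled (or with hardware that lies about completing the flush), that sync becomes a polite suggestion and the ordering guarantee evaporates.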

Ceph relies on barriers (amongst other tricks) to try to simultaneously deliver throughput and some semblance of data consistency in the face of unclean shutdowns. This latter point is why I also asked for your uptime; while there are other ways to repeatedly shut down an OSD uncleanly (kill -9 or oom_killer come to mind), a pretty reliable one is a flaky box that reboots itself once an hour.

BMDan
  • Thank you for this. Following https://gist.github.com/cheethoe/49d9c1d0003e44423e54a060e0b3fbf1 along with your initial answer has resolved my issues. df on /var/lib/rook yields /dev/md3. I use an OVH dedicated server configured with RAID 0. My wish is to get access to the 2 NVMe drives and switch to BlueStore. However, I am having a bit of trouble with that given their templates and interface when setting up a machine. Uptime - 17:13:52 up 49 days. CPU is set to be able to overclock. I am unsure what you mean by "-o nobarrier" – sfxworks May 25 '19 at 17:14