
After an ungraceful shutdown of our Kubernetes cluster, the API server did not come back up. After some investigation, I found the following errors from the etcd member on each of the three master nodes.

How can I recover the cluster?

Master1

2021-01-12 13:34:54.273559 I | etcdserver: recovered store from snapshot at index 143742270
2021-01-12 13:34:54.281853 I | mvcc: restore compact to 127098354
2021-01-12 13:34:54.310003 C | mvcc: store.keyindex: put with unexpected smaller revision [{127097381 0} / {127099854 0}]
panic: store.keyindex: put with unexpected smaller revision [{127097381 0} / {127099854 0}]
# ... stack trace ...

Master2

panic: freepages: failed to get all reachable pages (page 3630520571184623672: out of bounds: 11503)
# ... stack trace ...

Master3

2021-01-13 12:10:35.428458 I | etcdserver: recovered store from snapshot at index 143735303
2021-01-13 12:10:35.437350 I | mvcc: restore compact to 127098354
2021-01-13 12:10:35.481940 C | mvcc: store.keyindex: put with unexpected smaller revision [{127097229 0} / {127099849 0}]
panic: store.keyindex: put with unexpected smaller revision [{127097229 0} / {127099849 0}]
# ... stack trace ...
  • That's a pretty common on-disk etcd corruption, although I'm stunned to see it affect **all** of your etcd nodes. Running on a Raspberry Pi, I'm guessing? Anyway, following the [etcd disaster recovery guide](https://etcd.io/docs/v3.4.0/op-guide/recovery/) is the only recourse – mdaniel Jan 13 '21 at 16:55
  • They're running on Debian on ESXi. I looked over the guide, but it doesn't seem straightforward as we don't have a snapshot. – Ahmad Ahmadi Jan 13 '21 at 18:02
  • Heh, yes, recovering etcd is almost never "straightforward"; if you still have the data directory (on master1 and 3 in your case), it is likely possible to start a _new_ single node etcd cluster then create a snapshot from it, but "possible" is the key word there – mdaniel Jan 14 '21 at 03:42
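
Expanding on the last comment, here is a minimal sketch of what "start a new single-node etcd cluster from a surviving data directory, then snapshot it" could look like on master1 or master3. The paths, member names, and IPs (`/var/lib/etcd`, `/etc/kubernetes/manifests`, `master1`, `MASTER1_IP`, ...) are assumptions based on kubeadm defaults, and etcd may still panic on the corrupted keyspace, so treat it as a starting point rather than a guaranteed fix.

```sh
# Assumes kubeadm defaults; adjust paths, names, and IPs to your environment.

# 0. If etcd runs as a kubeadm static pod, move its manifest out of the way
#    so the kubelet stops the pod, then back up the data directory untouched.
mv /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
cp -a /var/lib/etcd /var/lib/etcd.bak

# 1. Start a throw-away single-member etcd from the surviving data directory.
#    --force-new-cluster discards the old membership so the node comes up alone.
etcd --name recovery \
  --data-dir /var/lib/etcd \
  --force-new-cluster \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://127.0.0.1:2379

# 2. If it stays up, snapshot the keyspace.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 \
  snapshot save /tmp/etcd-snapshot.db

# 3. Rebuild each member from the snapshot (repeat per master with its own
#    name, IP, and data dir), then point the etcd manifests at the restored
#    data directories and move them back into place.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --name master1 \
  --initial-cluster master1=https://MASTER1_IP:2380,master2=https://MASTER2_IP:2380,master3=https://MASTER3_IP:2380 \
  --initial-advertise-peer-urls https://MASTER1_IP:2380 \
  --data-dir /var/lib/etcd-restored
```

The point of the snapshot/restore round trip is to rebuild a clean keyspace for all three members rather than trying to repair the damaged bbolt files in place; as noted in the comments, "possible" is the key word, and whether step 1 even starts depends on how far the corruption goes.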

0 Answers