
I have a bare-metal (kubeadm) Kubernetes cluster that's really unstable, and I traced it back to an etcd issue.

From the etcd pod's description I get:

Image: k8s.gcr.io/etcd:3.4.13-0
Liveness: ... #success=1 #failure=8
Startup:  ... #success=1 #failure=24
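
For reference, those thresholds come from the etcd static pod manifest; a quick way to check them on the control-plane node (assuming the default kubeadm path /etc/kubernetes/manifests/etcd.yaml) is:

```
# Default kubeadm location of the etcd static pod manifest (assumption).
sudo grep -A 10 -E 'livenessProbe|startupProbe' /etc/kubernetes/manifests/etcd.yaml
```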

In the logs, the startup sequence looks fine (compared to another cluster), but then I get a lot of warnings:

etcdserver: [...] request ... took too long to execute

But I don't think it's hardware-related, because the 99th percentile of etcd_disk_backend_commit_duration_seconds is at 16ms, which is fine according to the etcd FAQ.
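
The raw histogram behind that number can be pulled straight from etcd's metrics endpoint; this sketch assumes the default kubeadm setting of --listen-metrics-urls=http://127.0.0.1:2381 and is run on the control-plane node:

```
# Assumption: metrics exposed on http://127.0.0.1:2381 (kubeadm default).
curl -s http://127.0.0.1:2381/metrics \
  | grep -E '^etcd_disk_(backend_commit|wal_fsync)_duration_seconds'
```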

Anyway, this goes on for a few minutes, and then I guess this is what causes the restart:

etcdserver/api/etcdhttp: /health error; QGET failed etcdserver: request timed out (status code 503)
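
The same check can be run by hand against the member; here is a sketch that assumes the default kubeadm certificate paths under /etc/kubernetes/pki/etcd and an etcd pod named etcd-<node-name>:

```
# Query member health/status from inside the etcd static pod.
# Certificate paths are the kubeadm defaults (assumption); adjust if yours differ.
kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table   # or: endpoint health
```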

Any idea what further steps I can take to diagnose the issue and fix etcd?

Antoine
  • Did you see this [issue](https://github.com/etcd-io/etcd/issues/11809)? Is it similar to yours? – Mikołaj Głodziak Sep 29 '21 at 14:21
  • Well, it has some similarities, but in the issue you mention the timeouts start just after startup, whereas in my case they start after a few minutes of uptime. Also, it isn't clear if there is a crash in the other issue, whereas for me there is for sure. But I'll continue to look into disk performance until I get a better idea... – Antoine Sep 30 '21 at 07:00
  • Which version of Kubernetes did you use? Can you provide steps how exactly did you set up the cluster? – Mikołaj Głodziak Oct 07 '21 at 08:54
  • Hello @Antoine. Any updates? – Wytrzymały Wiktor Oct 12 '21 at 09:53
  • Thanks, I was able to get help on GitHub and resolve the issue: https://github.com/etcd-io/etcd/issues/13373. I think at some point my node changed its private IP because of hardware issues, and upon upgrading etcd it caused configuration issues. The fix was to dump and restore the etcd data (roughly the snapshot save/restore flow sketched below). – Antoine Oct 21 '21 at 14:23
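
For anyone landing here with the same symptoms, a rough sketch of the dump-and-restore approach from the last comment, assuming a single-member kubeadm etcd with its data in /var/lib/etcd (the exact steps in the linked issue may differ):

```
# (etcdctl may need to be run from the etcd container image if it isn't on the host.)

# 1. Take a snapshot of the current data (kubeadm default cert paths assumed).
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /root/etcd-backup.db

# 2. Stop the static pod by moving its manifest out of the way.
sudo mv /etc/kubernetes/manifests/etcd.yaml /root/

# 3. Restore into a fresh data directory. This regenerates member metadata,
#    which helps when the node's private IP has changed; <node-name> and
#    <node-ip> are placeholders and must match the values in the manifest.
ETCDCTL_API=3 etcdctl snapshot restore /root/etcd-backup.db \
  --name <node-name> \
  --initial-cluster <node-name>=https://<node-ip>:2380 \
  --initial-advertise-peer-urls https://<node-ip>:2380 \
  --data-dir /var/lib/etcd-restored

# 4. Swap the data directory and bring the pod back.
sudo mv /var/lib/etcd /var/lib/etcd.old
sudo mv /var/lib/etcd-restored /var/lib/etcd
sudo mv /root/etcd.yaml /etc/kubernetes/manifests/
```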
