I'm currently working with a Kubernetes cluster, hosted on AWS with EKS, that is encountering failures that are strange to me. Our nodes (instance type c5.2xlarge, AMI ami-0f54a2f7d2e9c88b3/amazon-eks-node-v25) run fine until, with no apparent change in load, huge volumes of errors begin to surface in the kubelet logs (I'm viewing them via journalctl -u kubelet).
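
For reference, this is roughly how I've been pulling the logs on an affected node (the time window below is just an example around one incident):

# on the node, over SSH
journalctl -u kubelet --since "2018-12-05 21:30" --until "2018-12-05 21:45" --no-pager

# or follow live output while the node is degrading
journalctl -u kubelet -f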

The error messages don't paint a consistent story -- different nodes show different sets of events prior to failure -- but eventually each node enters a NotReady state. Sometimes a node will recover of its own accord, but whether and when that happens is unpredictable.
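
For what it's worth, this is how I've been watching the status flap from my workstation (nothing exotic, and <node-name> is a placeholder):

# watch node conditions change in real time
kubectl get nodes -w

# inspect the Ready condition and recent Events on a flapping node
kubectl describe node <node-name>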

Here's a sample of logs immediately preceding the status change on one node:

Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: W1205 21:41:57.671381    4051 fs.go:571] Killing cmd [nice -n 19 du -s /var/lib/docker/overlay2/2af435b23328675b6ccddcd29da7a8681118ae90c78755933916d15c247653cc/diff] due to timeout(2m0s)
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.673113    4051 remote_runtime.go:434] Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.676913    4051 kubelet.go:2114] Container runtime sanity check failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: E1205 21:41:57.809324    4051 remote_runtime.go:332] ExecSync e264b31c91ae2d10381cbebd0c4a1e3b0deeefcc60dd5762b7f6f3ac9a7c5d1a '/bin/bash -c pgrep python >/dev/null 2>&1' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.833254    4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.843768    4051 kubelet_node_status.go:814] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-12-05 21:41:57.843747845 +0000 UTC m=+6231.746946646 LastTransitionTime:2018-12-05 21:41:57.843747845 +0000 UTC m=+6231.746946646 Reason:KubeletNotReady Message:container runtime is down}
Dec 05 21:41:57 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:57.933579    4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:58 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:58.159892    4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:58 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:58.561026    4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:41:59 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:41:59.381016    4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
Dec 05 21:42:00 ip-10-0-18-250.us-west-2.compute.internal kubelet[4051]: I1205 21:42:00.985015    4051 kubelet.go:1799] skipping pod synchronization - [container runtime is down]
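
Next time I catch a node in this state, my plan is to poke at the container runtime directly and re-run the same du that cAdvisor appears to be timing out on (the overlay2 path is just the one from the log above):

# check the Docker daemon the EKS workers are running
systemctl status docker
docker info

# repeat the du that the kubelet killed after 2m0s
time nice -n 19 du -s /var/lib/docker/overlay2/2af435b23328675b6ccddcd29da7a8681118ae90c78755933916d15c247653cc/diff

# look for disk or I/O pressure at the same time
df -h /var/lib/docker
iostat -x 5 3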

In other cases, things go sideways amid a flood of warnings of the form NetworkPlugin cni failed on the status hook for pod "<pod-name>": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "<container-ID>".
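
I haven't dug far into the CNI side yet; my rough checklist for the next occurrence is below (aws-node is the VPC CNI DaemonSet that EKS ships; the pod name and the ipamd log path are what I expect to find, so treat them as assumptions):

# find the aws-node pod running on the affected node
kubectl -n kube-system get pods -o wide | grep aws-node

# its logs, plus the ipamd log on the node itself
kubectl -n kube-system logs <aws-node-pod-on-the-affected-node>
tail -n 200 /var/log/aws-routed-eni/ipamd.log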

A third scenario doesn't surface the node status change at all, but terminates in

kubelet_node_status.go:377] Error updating node status, will retry: error getting node "<node-private-ip>": Unauthorized

This happens after other errors of the form

cni.go:227] Error while adding to cni network: add cmd: failed to assign an IP address to container

and

raw.go:87] Error while processing event ("/sys/fs/cgroup/devices/system.slice/run-27618.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/run-27618.scope: no such file or directory
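
Around the IP-assignment and Unauthorized errors above, these are the checks I intend to run (<instance-id> is a placeholder, and us-west-2 is just our region):

# how many secondary IPs remain on the instance's ENIs
aws ec2 describe-network-interfaces \
    --filters Name=attachment.instance-id,Values=<instance-id> \
    --region us-west-2

# confirm the node role mapping the kubelet authenticates with
kubectl -n kube-system get configmap aws-auth -o yaml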

This has been a real head-scratcher, as the nodes drop out of service on a seemingly unpredictable cadence and without consistent error behavior. What could be a unifying cause (or causes) behind these failures?

I'm happy to provide more information on the cluster or more detailed logs -- just let me know. Thanks in advance for any assistance!

cwh