I've deployed a pod running Apache httpd (official image, tag 2.4.41), serving plaintext HTTP/2 (h2c) on port 8082. It works "fine", but the container restarts every few hours; kubectl get pod/mypod shows 60 restarts over 5 days.

The logs of the previous container instance (-p) always show "caught SIGWINCH, shutting down gracefully":

$ kubectl logs -c httpdcontainer -p pod/mypod
  [...]
  127.0.0.1 - - [15/Jan/2020:11:12:27 +0000] "GET / HTTP/2.0" 200 9578
  [Wed Jan 15 11:12:40.433582 2020] [mpm_event:notice] [pid 1:tid 139963289400448] AH00492: caught SIGWINCH, shutting down gracefully
  127.0.0.1 - - [15/Jan/2020:11:12:37 +0000] "GET / HTTP/2.0" 200 9578

The SIGWINCH is presumably Docker's stop signal (the official httpd Dockerfile sets STOPSIGNAL SIGWINCH), triggered here by a failing Kubernetes liveness probe? The GET requests for / are the configured liveness/readiness probes, and as you can see Apache answers them with 200 OK just fine.
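For reference, the probes are defined roughly like this in the Deployment (a minimal sketch; the timing values below are placeholders, not necessarily what my manifest actually uses):

  containers:
  - name: httpdcontainer
    image: myregistry.mydomain.tld/foo/bar/myimage   # based on httpd:2.4.41
    ports:
    - containerPort: 8082
    livenessProbe:
      httpGet:
        path: /
        port: 8082
      initialDelaySeconds: 5    # placeholder
      periodSeconds: 10         # placeholder
      timeoutSeconds: 1         # placeholder
      failureThreshold: 3       # placeholder
    readinessProbe:
      httpGet:
        path: /
        port: 8082
      periodSeconds: 10         # placeholder
      timeoutSeconds: 1         # placeholder
      failureThreshold: 3       # placeholder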

The kubelet, however, seems to disagree with Apache's 200 OK and deliberately restarts the container:

$ kubectl describe pod/mypod
[...]
Type     Reason     Age                       From                                    Message
----     ------     ----                      ----                                    -------
Warning  Unhealthy  38m (x208 over 5d)        kubelet, node01.kube.mydomain.tld  Readiness probe failed: Get http://192.168.87.178:8082/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Normal   Killing    38m (x60 over 4d23h)      kubelet, node01.kube.mydomain.tld  Container docs failed liveness probe, will be restarted
Warning  Unhealthy  38m (x221 over 5d)        kubelet, node01.kube.mydomain.tld  Liveness probe failed: Get http://192.168.87.178:8082/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Normal   Pulled     38m (x60 over 4d23h)      kubelet, node01.kube.mydomain.tld  Container image "myregistry.mydomain.tld/foo/bar/myimage@sha256:<checksum>" already present on machine
Normal   Created    38m (x61 over 5d19h)      kubelet, node01.kube.mydomain.tld  Created container docs
Normal   Started    38m (x61 over 5d19h)      kubelet, node01.kube.mydomain.tld  Started container docs
Warning  Unhealthy  13m (x1644 over 5d19h)    kubelet, node01.kube.mydomain.tld  Readiness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy  3m16s (x1717 over 5d19h)  kubelet, node01.kube.mydomain.tld  Liveness probe failed: HTTP probe failed with statuscode: 503

I have no clue how to tell what is to blame here. I'm using Kubernetes 1.16.4, deployed with kubeadm and Calico CNI.
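
In case it helps: since kubeadm runs the kubelet as a systemd service, the kubelet's side of the probe failures can be inspected on the node with something like this (just a sketch of what I mean by digging into it):

  $ journalctl -u kubelet --since "1 hour ago" | grep -i probe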


1 Answer

After digging into this more and more, it appears that the kernel's OOM killer was terminating httpd inside the container for exceeding its memory limit, as the node's system logs show:

Jan 15 12:12:40 node01 kernel: [2411297.634996] httpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=998
[...]
Jan 15 12:12:40 node01 kernel: [2411297.672084] oom_reaper: reaped process 519 (httpd), now anon-rss:0kB, file-rss:0kB, shmem-rss:68kB
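
These entries can be found on the node with something like the following (assuming systemd-journald; plain dmesg works too):

  $ journalctl -k | grep -iE 'oom|out of memory'
  $ dmesg -T | grep -i oom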
  • Why httpd suddenly exceeds the memory limit remains an open question, but that is out of scope here.
  • Why Kubernetes doesn't report the container as killed for exceeding its memory limit (the lastState report as per the docs) also remains a question to me; see the check sketched right after this list.
  • The logs probably don't show any 503 responses because httpd is killed before it has written them to stdout/stderr.
  • I still fail to understand the sequence of events here if out-of-memory is the cause, because the container first receives a graceful shutdown signal and the kubelet logs the responses as 503 (not as timeouts).
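
The kind of check I mean for that second point (assuming a single container in the pod):

  $ kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].lastState}'

For an OOM-killed container I would expect a terminated state with reason OOMKilled and exit code 137 there (also visible as "Last State: Terminated, Reason: OOMKilled" in kubectl describe pod).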

Even if this turns out to be the cause, it is very bad UX: as a Kubernetes admin you have to dig through the node's kernel logs to hunt it down.
