I've been trying to diagnose this issue for several days and have a pretty good picture of what is happening, but still no idea why.
The symptom is that TCP connections from other Pods to various Services fail (`EHOSTUNREACH`, `ECONNREFUSED`, "Connection reset by peer", "No route to host", "Connection refused", "Connection timed out", etc). I've gone through the logs in detail around one incident, and for some reason no request was sent downstream to any of the Pods in the ReplicaSet backing the Service for 9 seconds.
I couldn't find any obvious signs of why traffic to the Pods stopped or resumed, and this is what I'd need some help with; I'm not really sure where else to look or what to try.
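To capture the exact failure timeline from the client side, I've been thinking of running something like the sketch below inside an affected Pod (the Service hostname and port are placeholder assumptions, not our actual setup). It resolves the Service name on every iteration, so DNS failures would show up separately from TCP failures:

```python
# Minimal TCP probe sketch; hostname and port are placeholders, not our
# actual Service.
import datetime
import errno
import socket
import time

HOST, PORT = "my-service.default.svc.cluster.local", 80

while True:
    ts = datetime.datetime.utcnow().isoformat()
    try:
        # Resolve first so DNS failures are distinguishable from TCP failures.
        ip = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)[0][4][0]
    except OSError as e:
        print(f"{ts} DNS failed: {e}")
        time.sleep(0.5)
        continue
    try:
        with socket.create_connection((ip, PORT), timeout=2):
            print(f"{ts} connect ok ({ip})")
    except OSError as e:
        # Map errno to its symbolic name (e.g. 113 -> EHOSTUNREACH).
        name = errno.errorcode.get(e.errno, str(e)) if e.errno else str(e)
        print(f"{ts} connect to {ip} failed: {name}")
    time.sleep(0.5)
```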
There are a few more things which might be relevant:
- Readiness probe requests went through and were answered successfully every second; these were the only requests reaching the Pods
- DNS resolution seems to be working, as some Pods logged being unable to connect to the Service's resolved IP
- Cluster logs show `io.k8s.core.v1.endpoints.update` API calls around the outage, but Node IPs were in the `addresses` (not `notReadyAddresses`) list; to confirm the timing of these updates I'd use something like the watch sketch after this list
- The issue seems to happen more often when Pods are being destroyed or created (during deployments, auto-scaling, or Node preemption), but it did happen with a stable number of replicas too
- We're running Kubernetes 1.15 and 1.16 on GKE, unfortunately not VPC-native (alias IP), as the clusters were created a few years ago
- We don't have Istio or any other service mesh running, but I'm very tempted now
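As referenced above, one way to confirm whether the Endpoints object actually loses addresses during an incident would be a watch like this (Python client; `my-service` and `default` are placeholders for the affected Service and its namespace):

```python
# Watch sketch: log every Endpoints update for one Service with ready /
# not-ready address counts ("my-service"/"default" are placeholders).
import datetime
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside a Pod
v1 = client.CoreV1Api()

for event in watch.Watch().stream(
        v1.list_namespaced_endpoints,
        namespace="default",
        field_selector="metadata.name=my-service"):
    ep = event["object"]
    ready = sum(len(s.addresses or []) for s in (ep.subsets or []))
    not_ready = sum(len(s.not_ready_addresses or []) for s in (ep.subsets or []))
    print(f"{datetime.datetime.utcnow().isoformat()} "
          f"{event['type']}: ready={ready} notReady={not_ready}")
```

If the `addresses` list stays populated for the whole 9 seconds while connections keep failing, I'd guess that points more towards kube-proxy/iptables programming on the Nodes than towards the control plane, but I may be wrong about that.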
Thanks in advance for any suggestions!