
I've been trying to diagnose this issue for several days and have a pretty good picture of what is happening, but still no idea why.

The symptom is that TCP connections from Pods to various Services fail (EHOSTUNREACH, ECONNREFUSED, Connection reset by peer, No route to host, Connection refused, Connection timed out, etc.). I've gone through the logs in detail around one incident, and for some reason no request was sent downstream to any of the Pods in the ReplicaSet backing the Service for 9 seconds.

I couldn't find any obvious signs of why traffic to the Pods stopped or resumed, and this is what I need help with; I'm not really sure where else to look or what to try.

There are a few more things which might be relevant:

  • Readiness probe requests went through and were responded to successfully every second - these were the only requests reaching the Pods
  • DNS resolution seems to be working as some Pods logged being unable to connect to the Service's resolved IP
  • Cluster logs show io.k8s.core.v1.endpoints.update API calls around the outage, but the Node IPs were in the addresses (not notReadyAddresses) list - see the sketch after this list for one way to watch this
  • The issue seems to happen more often when Pods are being destroyed or created (e.g. during a deployment, auto-scaling, or Node preemption), but it did happen with a stable number of replicas too
  • We're running Kubernetes 1.15 and 1.16 on GKE, unfortunately not VPC-native (alias IP) as the clusters were created a few years ago
  • We don't have Istio or any other service mesh running, but I'm very tempted now
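For reference, here is a minimal sketch (using the official kubernetes Python client) of one way to watch the Endpoints object behind an affected Service and log which addresses are listed as ready vs not ready around an incident. SERVICE_NAME and NAMESPACE are placeholders rather than names from my actual setup:

```python
# Minimal sketch: poll a Service's Endpoints object once a second and log which
# addresses are listed as ready vs. not ready, to correlate with the connection
# failures. SERVICE_NAME and NAMESPACE are placeholders - adjust for your cluster.
import time
from datetime import datetime

from kubernetes import client, config

SERVICE_NAME = "my-service"   # placeholder, replace with the affected Service
NAMESPACE = "default"         # placeholder, replace with its namespace

config.load_kube_config()     # or config.load_incluster_config() when run inside a Pod
v1 = client.CoreV1Api()

while True:
    ep = v1.read_namespaced_endpoints(SERVICE_NAME, NAMESPACE)
    ready, not_ready = [], []
    for subset in ep.subsets or []:
        ready += [a.ip for a in (subset.addresses or [])]
        not_ready += [a.ip for a in (subset.not_ready_addresses or [])]
    print(f"{datetime.utcnow().isoformat()} ready={ready} notReady={not_ready}")
    time.sleep(1)
```

If the ready list stays populated while connections fail, that would point away from the Endpoints controller and towards kube-proxy / iptables programming on the Nodes.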

Thanks in advance for any suggestions!

dain
  • Is it possible for you to provide a detailed description of your configuration so that it can be reproduced on another test GKE cluster? Does it happen on both 1.15 and 1.16 GKE clusters? – mario Oct 23 '20 at 16:42
  • @mario I'm not sure if it would be possible to share a simple reproduction, as it's difficult to reliably reproduce even with the actual clusters and services. It does happen on both versions. At this point I was just looking for possible causes to explore, i.e. why would a Service not send traffic to healthy Pods, and how to diagnose it. – dain Oct 23 '20 at 19:52
  • Hi @dain, I think in such a situation it makes sense to contact [GCP Support](https://cloud.google.com/support). If this is something hard or even impossible to reproduce on another GKE test cluster, it would be rather difficult to find appropriate help here. – mario Oct 30 '20 at 10:25
  • Thanks @mario, I'm trying now to switch over to new, VPC-native clusters to see if that solves the problem. Even if not, at least I can start sampling network traffic with flow logs and hopefully understand the issue. – dain Oct 30 '20 at 20:54
  • @dain Did you solve this issue? Does VPC-native cluster help? Do you still encounter this issue? Any steps to replicate? – PjoterS Feb 24 '21 at 09:16
  • @PjoterS Haven't managed to fully solve the issue yet. I don't think VPC-native made much difference, but increasing the timeouts and failure thresholds on the probes did help a lot. Still occasionally getting it; my best guess is that Kubernetes for whatever reason thinks all Pods in a Service are unable to serve a request and just blocks the connection. Don't know how to replicate it - it happens too rarely to establish the cause. We're going to add some more instrumentation soon, will report back if there are any new findings. – dain Feb 25 '21 at 00:17

1 Answer


This issue is very complex and hard to replicate. It would require replicating the OP's environment with the same configuration.

If you encounter a situation which might be caused by a bug in GKE or GCP and it's almost impossible to replicate (like this one), the best approach would be to reach out to Google Support for a deep analysis.

Alternatively, you could report it on the Public Issue Tracker.

PjoterS