We've experienced 4 AUTO_REPAIR_NODES events (revealed by the command gcloud container operations list) on our GKE cluster during the past month. The consequence of node auto-repair is that the node gets recreated with a new external IP. Because the new external IP was not whitelisted by third-party services, the services running on the new node eventually failed.

I noticed that we have "Automatic node repair" enabled in our Kubernetes cluster and am tempted to disable it, but before I do, I need to understand the situation better.

My questions are:

  1. What are some common causes that make a node unhealthy in the first place? I'm aware of this article https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-repair#node_repair_process which says that "a node reports a NotReady status on consecutive checks over the given time threshold" would trigger auto-repair. But what could cause a node to become NotReady?
  2. I'm also aware of this article https://kubernetes.io/docs/concepts/architecture/nodes/#node-status which gives the full list of node conditions: {OutOfDisk, Ready, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable, ConfigOK}. If any of {OutOfDisk, MemoryPressure, PIDPressure, DiskPressure, NetworkUnavailable} becomes true for a node, would that node become NotReady?
  3. What negative consequences could I face after disabling "Automatic node repair" in the cluster? I'm basically wondering whether we could end up in a worse situation than auto-repaired nodes with newly attached, non-whitelisted IPs. Once "Automatic node repair" is disabled, would Kubernetes create new pods on other nodes to replace the pods running on an unhealthy node that would otherwise have been auto-repaired?

1 Answer

  1. The master essentially performs a health check on the node. If the node can't respond, or if it declares itself NotReady, node auto-repair will repair it. GKE nodes also run a Node Problem Detector, which can detect issues at the OS level.

  2. Any of the mentioned conditions can cause the node to go into NotReady. There are other possible factors as well, such as repeated errors at the OS level.

  3. Turning off node auto-repair can lead to nodes going NotReady and staying that way. On many occasions the node will try to address the issue itself by killing pods or processes, but it is possible for a node to get stuck in NotReady.
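
If you want to see which conditions your nodes are actually reporting before deciding, a minimal sketch like the one below may help. It assumes you have saved or piped the output of kubectl get nodes -o json; the SAMPLE data and the summarize_nodes helper are made up for illustration and are not taken from a real cluster. It lists each node's Ready status next to any other condition currently reported as "True".

```python
import json

# Hypothetical sample in the shape of `kubectl get nodes -o json` output.
# Node names and condition values are invented for illustration.
SAMPLE = """
{"items": [
  {"metadata": {"name": "node-a"},
   "status": {"conditions": [
     {"type": "MemoryPressure", "status": "False"},
     {"type": "DiskPressure", "status": "True"},
     {"type": "Ready", "status": "False"}]}},
  {"metadata": {"name": "node-b"},
   "status": {"conditions": [
     {"type": "MemoryPressure", "status": "False"},
     {"type": "Ready", "status": "Unknown"}]}}
]}
"""


def summarize_nodes(node_list):
    """Return {node name: (Ready status, [other conditions that are "True"])}."""
    summary = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = "Unknown"  # used when no Ready condition is reported at all
        problems = []
        for cond in conditions:
            if cond["type"] == "Ready":
                ready = cond["status"]
            elif cond["status"] == "True":
                problems.append(cond["type"])
        summary[name] = (ready, problems)
    return summary


if __name__ == "__main__":
    for name, (ready, problems) in summarize_nodes(json.loads(SAMPLE)).items():
        print(name, "Ready=" + ready, "problems=" + ",".join(problems))
```

A Ready status of "False" or "Unknown" is what shows up as NotReady in kubectl get nodes, so running something like this periodically should show whether a pressure condition preceded the NotReady state on your nodes.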

Rather than disabling node auto-repair, I would recommend changing your setup so that whitelisting no longer depends on individual node IPs. You can set up a NAT gateway for all outbound GKE traffic, assign a static IP to the NAT, and whitelist only that IP. You won't have to worry about the nodes changing IPs anymore.
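
As a rough sketch of what that could look like, the snippet below builds the gcloud commands for one possible setup. It assumes the managed Cloud NAT service is used as the NAT gateway, which is only one way to do it. Cloud NAT only carries traffic for VMs that have no external IP of their own, so this also assumes the nodes are created without external IPs (for example in a private cluster). The resource names (nat-ip, nat-router, nat-config), the region, and the network are hypothetical placeholders. The script only prints the commands and does not run them.

```python
import shlex


def cloud_nat_commands(region, network, address="nat-ip",
                       router="nat-router", nat="nat-config"):
    """Build gcloud argv lists for a Cloud NAT gateway with one reserved static IP.

    All resource names are hypothetical placeholders.
    """
    return [
        # 1. Reserve a regional static external IP; this is the address to whitelist.
        ["gcloud", "compute", "addresses", "create", address,
         "--region", region],
        # 2. Create a Cloud Router in the cluster's network and region.
        ["gcloud", "compute", "routers", "create", router,
         "--network", network, "--region", region],
        # 3. Create the NAT config on that router, using only the reserved IP.
        ["gcloud", "compute", "routers", "nats", "create", nat,
         "--router", router, "--region", region,
         "--nat-external-ip-pool", address,
         "--nat-all-subnet-ip-ranges"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands for review instead of executing them.
    for argv in cloud_nat_commands("us-central1", "default"):
        print(shlex.join(argv))
```

Review the printed commands against your own network layout before running any of them; the region and network in the usage example are only stand-ins.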

Patrick W