
I am using multiple GKE-managed clusters on version 1.14.8-gke.12 in a shared VPC setup. Suddenly, one of my clusters has stopped reporting proper metrics for HPA. The metrics server is up and running, but this is the HPA output:

NAME                                    REFERENCE                                          TARGETS                        MINPODS   MAXPODS   REPLICAS   AGE
nginx-public-nginx-ingress-controller   Deployment/nginx-public-nginx-ingress-controller   <unknown>/50%, <unknown>/50%   2         11        2          93m
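
For reference, a few read-only checks that narrow this down; the HPA name is the one from the output above, and the last two commands only exercise the metrics.k8s.io API that metrics-server serves:

kubectl describe hpa nginx-public-nginx-ingress-controller   # the Events section shows why the targets are <unknown>
kubectl top nodes                                            # served through the metrics.k8s.io API (metrics-server)
kubectl get apiservice v1beta1.metrics.k8s.io                # should show Available=True when metrics-server is registered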

Checking the default metrics-server installation on GKE, I saw the following in its logs:

E1221 18:53:13.491188       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:NODE_NAME: unable to fetch metrics from Kubelet NODE_NAME (NODE_IP): Get http://NODE_IP:10255/stats/summary/: context deadline exceeded
E1221 18:53:43.421617       1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:NODE_NAME: unable to fetch metrics from Kubelet NODE_NAME (NODE_IP): Get http://NODE_IP:10255/stats/summary/: dial tcp NODE_IP:10255: i/o timeout

Running a manual curl against that address returns all the data within 10 milliseconds. I've checked the network configuration, and both the pod network range and the node network range have access to this port.
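
Roughly, the manual check looks like this (NODE_IP is a placeholder; 10255 is the kubelet read-only port from the log lines above):

curl -s -m 5 http://NODE_IP:10255/stats/summary | head -n 20   # -m caps the request time in seconds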

Questions:

  1. What is the default timeout on metrics-server? Can we change it on a Google-managed cluster?

  2. This is a production cluster, and I am unable to replicate the issue on any other cluster, but could disabling Google's Horizontal Pod Autoscaling support and installing metrics-server manually (sketched below) help here?
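
For question 2, a rough sketch of what that would involve, assuming the standard gcloud add-on flag; the metrics-server release below is an assumption and would need to match the cluster version, and GKE's addon manager may reconcile or re-create managed kube-system components:

gcloud container clusters update YOUR_CLUSTER_NAME \
    --zone YOUR_ZONE \
    --update-addons=HorizontalPodAutoscaling=DISABLED

# then install metrics-server from the upstream manifests (version/URL is an assumption)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.3.7/components.yaml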

Additionally, as somewhat expected, upgrading to 1.15 didn’t help here.

Aditya Aggarwal

1 Answer


First, I'd recommend checking whether you still have the default firewall rules under VPC network -> Firewall rules, to be sure that all metric requests can get through your firewall.
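
For example, to list the rules from the CLI (YOUR_VPC_NETWORK and the rule name are placeholders; the GKE-created rules typically start with gke-YOUR_CLUSTER_NAME-, and in a Shared VPC they live in the host project, so you may need --project=HOST_PROJECT_ID):

gcloud compute firewall-rules list --filter="network:YOUR_VPC_NETWORK"
gcloud compute firewall-rules describe gke-YOUR_CLUSTER_NAME-RULE_SUFFIX   # check source ranges and allowed protocols/ports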

Then try to reach each node of your cluster using curl and fetch the metrics.
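
A quick way to do that from inside the cluster network (which is closer to what metrics-server itself does) is a throwaway pod; the image choice is arbitrary and NODE_IP is a placeholder:

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
    curl -s -m 10 http://NODE_IP:10255/stats/summary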

After that, look at the logs in Stackdriver -> Logging with a filter like this:

resource.type="k8s_container"
resource.labels.project_id="YOUR_PROJECT_ID"
resource.labels.cluster_name="YOUR_CLUSTER_NAME"
resource.labels.namespace_name="kube-system"
labels.k8s-pod/k8s-app="metrics-server"
labels.k8s-pod/version="YOUR_VERSION_OF_METRICS_SERVER"
severity>=WARNING

and then again with an additional line:

resource.type="k8s_container"
resource.labels.project_id="YOUR_PROJECT_ID"
resource.labels.cluster_name="YOUR_CLUSTER_NAME"
resource.labels.namespace_name="kube-system"
labels.k8s-pod/k8s-app="metrics-server"
labels.k8s-pod/version="YOUR_VERSION_OF_METRICS_SERVER"
severity>=WARNING
"503"

and share them here.
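
The same filter can also be run from the CLI, which makes the output easy to paste here (a sketch; placeholders as above, and note that label keys containing "/" may need to be quoted in the query language):

gcloud logging read '
  resource.type="k8s_container"
  resource.labels.project_id="YOUR_PROJECT_ID"
  resource.labels.cluster_name="YOUR_CLUSTER_NAME"
  resource.labels.namespace_name="kube-system"
  labels."k8s-pod/k8s-app"="metrics-server"
  severity>=WARNING' \
  --project=YOUR_PROJECT_ID --freshness=1d --limit=50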

Serhii Rohoza