GKE nodes fail to upgrade

Question

I hope this is the right place to ask; I've followed all the support links from the GCP console, which have led me here!

In the past I frequently kept my GKE k8s master and nodes up to date using the GCP console. This all worked seamlessly... until now, and I have no idea why. I've found some GKE Stackdriver logs, but nothing in them obviously explains the failure. I can upgrade the master with no problem, but when I take the option to upgrade the nodes (there is only one node in the cluster), the upgrade starts and then stops seconds later. The node remains on the same version and I'm again notified that an upgrade is available.

I also notice that I'm no longer able to exec into running Pods. The error is:

Error from server: error dialing backend: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user "gke-<random-digits>"?

I'm not sure if this is related.

I've read through this: https://cloud.google.com/kubernetes-engine/docs/troubleshooting#kubect_commands_hang but it hasn't helped.

I tried resetting the node but that had no impact either. Well, the only effect I saw was that one of my containers could no longer mount a volume due to I/O errors!

Any ideas where to start?!

Edit
Current Master version: 1.11.6-gke.0
Current Node version: 1.10.7-gke.2
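
The gap between the two versions above can be checked with a minimal Python sketch. It assumes GKE version strings always have the form `MAJOR.MINOR.PATCH-gke.N`; the helper name `parse_gke_version` is hypothetical and not part of any Google tooling.

```python
import re

# Assumed format: MAJOR.MINOR.PATCH-gke.N (for example "1.11.6-gke.0").
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Return (major, minor, patch, gke_revision) as a tuple of ints."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognised GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


master = parse_gke_version("1.11.6-gke.0")
node = parse_gke_version("1.10.7-gke.2")

# Tuples compare element by element, so this is a plain ordering check.
node_is_behind = node < master
minor_gap = master[1] - node[1]
print(f"node behind master: {node_is_behind}, minor versions apart: {minor_gap}")
```

Here the node is one minor version behind the master, which is the state the failed node upgrade leaves the cluster in.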

Last operation from: gcloud beta container operations describe

endTime: '2019-01-15T23:14:21.936649805Z'
name: operation-1547594061714-c2750b4c
operationType: UPDATE_CLUSTER
selfLink: https://container.googleapis.com/v1beta1/projects/348462677705/zones/europe-west2-a/operations/operation-1547594061714-c2750b4c
startTime: '2019-01-15T23:14:21.714859214Z'
status: DONE
targetLink: https://container.googleapis.com/v1beta1/projects/348462677705/zones/europe-west2-a/clusters/clubbed-cluster
zone: europe-west2-a

The above UPDATE_CLUSTER operation occurs continually every minute of every day!
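
To confirm the once-a-minute pattern rather than eyeballing hundreds of entries, something like the following Python sketch could be used. It assumes the operations were exported as JSON (for example with gcloud's `--format=json` flag on the `operations list` command) and that each entry carries the same `operationType` and `startTime` fields shown above. The function names are hypothetical, and the sample entries are illustrative only: just the last `startTime` comes from the real output above.

```python
import json
from datetime import datetime


def parse_timestamp(value):
    """Parse a timestamp such as '2019-01-15T23:14:21.714859214Z'.

    The API reports nanoseconds but datetime only handles microseconds,
    so the fractional part is truncated to six digits.
    """
    value = value.rstrip("Z")
    if "." in value:
        whole, fraction = value.split(".", 1)
        return datetime.strptime(f"{whole}.{fraction[:6]}", "%Y-%m-%dT%H:%M:%S.%f")
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")


def update_intervals(operations):
    """Return the seconds between consecutive UPDATE_CLUSTER start times."""
    starts = sorted(
        parse_timestamp(op["startTime"])
        for op in operations
        if op.get("operationType") == "UPDATE_CLUSTER"
    )
    return [(later - earlier).total_seconds() for earlier, later in zip(starts, starts[1:])]


# Illustrative sample data, NOT real cluster output (except the last startTime).
sample = json.loads("""
[
  {"operationType": "UPDATE_CLUSTER", "startTime": "2019-01-15T23:12:21.700000000Z"},
  {"operationType": "UPGRADE_NODES",  "startTime": "2019-01-15T23:12:40.000000000Z"},
  {"operationType": "UPDATE_CLUSTER", "startTime": "2019-01-15T23:13:21.700000000Z"},
  {"operationType": "UPDATE_CLUSTER", "startTime": "2019-01-15T23:14:21.714859214Z"}
]
""")

intervals = update_intervals(sample)
print(intervals)
```

Intervals that all sit close to 60 seconds would back up the observation that the operation repeats every minute.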

I suspect your SSH connection issue is related to this [issue](https://issuetracker.google.com/119820482) that we are already aware of and working on at the moment. Concerning your upgrade, I just ran a test upgrade and it worked for me, but I am most likely not simulating the 100% exact environment you are using. — Milad Tabrizi, Jan 11 '19 at 21:04
You can also run this [command](https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster#check_status) to get a better idea of what is happening during the upgrade: `gcloud beta container operations list`, and then describe the operation with `gcloud beta container operations describe [OPERATION_ID]`. It might be best to file a support case directly with [GCP support](https://cloud.google.com/support/) or report the issue in our [Public Issue Tracker](https://issuetracker.google.com/issues/new?component=187077&template=0) for your issues upgrading GKE. — Milad Tabrizi, Jan 11 '19 at 21:05
Thanks @Milad. It seems a very interesting coincidence that the SSH issue started happening after upgrading the master; I've hit that broken 1.11 version. I was a good few revisions behind the latest, so I went slowly. I finally jumped to the latest hoping it would resolve the node upgrade problem, then hit the SSH problem. I have run the command to check the container operations and see something very odd: `UPDATE_CLUSTER` operations in a `DONE` state repeating once every minute throughout the entire day! And it's still going! — Paul, Jan 11 '19 at 22:07
We will need to establish that these 2 issues are not one and the same before we can start debugging one or the other. What is the status of your upgrade at the moment? I ask because upgrading past version 1.11.2-gke.24 (and any versions above) seems to have resolved the SSH issue for some users. Can you confirm this? — Patrick W, Jan 14 '19 at 21:36
Can you also run `gcloud beta container operations describe [OPERATION_ID]` (this will use server formatting) and post any logs or errors you have in the thread so we can review them? You can also keep track of what is going on with the investigation through the public issue thread in the interim. — Patrick W, Jan 14 '19 at 21:36
Hi @PatrickW, the master is on version 1.11.6-gke.0. I notice a newer 1.11.6-gke.2 version has appeared, which I could try later. — Paul, Jan 15 '19 at 19:13
1.11.6 shouldn't still have SSH tunnel issues. Are your nodes using 1.11.2? If so, you should report the issue through the [Public Issue Tracker](https://developers.google.com/issue-tracker/) — Patrick W, Jan 15 '19 at 20:09
The nodes are still stuck at 1.10.7-gke.2 as I can't upgrade them. I raised the issue here: https://issuetracker.google.com/issues/122676412. The node upgrade failures were occurring on 1.10 versions of the master too. I didn't get the SSH failures until I upgraded the master beyond 1.10; unfortunately, I'm not sure at which 1.11 version the SSH failures started. — Paul, Jan 15 '19 at 23:14
@PatrickW - I've edited my question to include the output of the `operations` command you requested. I've only supplied the last one given there are hundreds - maybe thousands of them! — Paul, Jan 15 '19 at 23:20
This was the [public issue tracker](https://issuetracker.google.com/119820482) we used to track this issue, and the one you listed is not 100% the same problem even though they might sound related. — James B, Jan 17 '19 at 21:25

score 1 · Accepted Answer · answered Jan 15 '19 at 23:43

Back in business! Both the node upgrade failure and SSH failures have been resolved after upgrading to 1.11.6-gke.2.

Could it have been related to the Endpoint API fixes noted in the Jan 14, 2019 release notes?

Thanks for the support!

Paul