
Summary

I am attempting to bootstrap a Kubernetes cluster on AWS using Kubeadm. Before you suggest them: I am not interested in using EKS or another bootstrapping solution like Kops, Kubespray, etc.

It appears that there is a lot of inaccurate information about the proper procedure out there, owing to the schism caused by Cloud Provider integrations being moved out of tree rather than managed in-tree. So I've been struggling to get a clear picture in my head of how to properly set this integration up.

The Requirements

The official repo indicates three requirements.

1) You must initialize kubelet, kube-apiserver, and kube-controller-manager with the --cloud-provider=external argument. If I understand things correctly, this allows you to use the out of tree provider. Using aws here instead would use the in-tree provider which is on a deprecation timeline.

2) You must create two IAM policies, associate them with IAM Instance Profiles, and launch your Kubernetes nodes with the appropriate profile attached.

3) Each node in the cluster must have a hostname that matches the Private DNS name of its underlying EC2 instance.
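
A quick way to sanity-check requirement 3 on each node is to compare the OS hostname with what the instance metadata service reports (a sketch; IMDSv1 shown for brevity, add a session token if IMDSv2 is enforced):

$> hostname -f
ip-10-0-10-91.us-gov-west-1.compute.internal
$> curl -s http://169.254.169.254/latest/meta-data/local-hostname
ip-10-0-10-91.us-gov-west-1.compute.internal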

In addition to this, I believe it was once required to attach the following tag to your EC2 instances, Route Tables, Security Groups, and Subnets, which I have done for good measure as well:

"kubernetes.io/cluster/${var.K8S_CLUSTER_NAME}" = "kubernetes.io/cluster/${var.K8S_CLUSTER_NAME}"

The Problem

Despite all of this, when my worker nodes come online after bootstrapping, they have the following taint applied:

node.cloudprovider.kubernetes.io/uninitialized: true
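
For reference, this is how it shows up when describing a node (output trimmed; the taint carries the NoSchedule effect):

$> kubectl describe node ip-10-0-10-91.us-gov-west-1.compute.internal | grep Taints
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule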

This obviously implies that the nodes have not been initialized by the Cloud Provider. I'm not really sure where to go from here. There is an open request for additional instructions on how to use the Cloud Provider integration with AWS, but it has not yet been addressed.

My Configuration

You might have noticed that I left a comment on that issue detailing my problem as well. Here is a summary of my environment showing that I should be in compliance with the listed requirements.

1) My Kubeadm config files set the cloud provider to external in four places:

InitConfiguration and JoinConfiguration

nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external

ClusterConfiguration

apiServer:
  extraArgs:
    cloud-provider: external

ClusterConfiguration

controllerManager:
  extraArgs:
    cloud-provider: external
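
For reference, these snippets fit together in a kubeadm config roughly like this (a sketch, not my exact file; the apiVersion and kubernetesVersion values are illustrative):

apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.18.2          # illustrative
apiServer:
  extraArgs:
    cloud-provider: external
controllerManager:
  extraArgs:
    cloud-provider: external

The worker nodes carry the same nodeRegistration.kubeletExtraArgs block, but under kind: JoinConfiguration in their own join config (discovery details omitted here).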

2) My EC2 instances were launched with an instance profile with the IAM policies outlined in the README:

$> aws ec2 describe-instances --instance-ids INSTANCE.ID | jq '.Reservations[].Instances[].IamInstanceProfile.Arn'
"arn:aws-us-gov:iam::ACCOUNT.ID:instance-profile/PROFILE-NAME"

3) The hostnames are the EC2 Private DNS names:

$> hostname -f
ip-10-0-10-91.us-gov-west-1.compute.internal

4) The EC2 instances, as well as my route tables, subnets, etc., are tagged with:

"kubernetes.io/cluster/${var.K8S_CLUSTER_NAME}" = "kubernetes.io/cluster/${var.K8S_CLUSTER_NAME}"

As a result, it looks like I am in compliance with all of the listed requirements, so I am unsure why my nodes are still left with that taint. Any help would be greatly appreciated!

EDIT

I have updated the tags on each instance to:

"kubernetes.io/cluster/${var.K8S_CLUSTER_NAME}" = "owned"

And added this tag to each Subnet:

"kubernetes.io/role/internal-elb" = 1

This has not resolved the situation, however.

EDIT 2

A user elsewhere suggested that the issue may be that I didn't apply the RBAC and DaemonSet resources present in the manifests directory of the cloud-provider-aws repo. After doing so using this image, I can confirm that this has NOT resolved my issue, since the aws-cloud-controller-manager appears to expect you to be using aws rather than external, as per the logs produced by the pod on startup:

Generated self-signed cert in-memory
Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
Version: v0.0.0-master+$Format:%h$
WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release
Building AWS cloudprovider
Zone not specified in configuration file; querying AWS metadata service
Cloud provider could not be initialized: could not init cloud provider "aws": clusterID tags did not match: "example-14150" vs "True"

EDIT 3

I built a new image using the repo as of commit 6a14c81. It can be found here. It appears to still default to the aws provider:

Cloud provider could not be initialized: could not init cloud provider "aws": clusterID tags did not match: "example-14150" vs "True"
  • It is entirely possible the cloud-provider is not gov-cloud aware, and you are fighting an uphill battle attempting to use `cloud-provider=external` with the shape it's currently in. Having said that, your question made it appear that you didn't deploy the [DaemonSet](https://github.com/kubernetes/cloud-provider-aws/blob/master/manifests/aws-cloud-controller-manager-daemonset.yaml) -- is that true? – mdaniel May 06 '20 at 23:18
  • I was not aware that I had to deploy that DaemonSet. However, after having done so, I do not see any changes. Additionally, I was wrong. Updating the Instance tags did NOT fix the tainting issue after all as per my edit. So the nodes are currently tainted as `uninitialized`. I assume I should just revert back to `aws` instead of `external` at this point? I assumed since the schism occurred about a year ago things would be more mature now? – TJ Zimmerman May 06 '20 at 23:28
  • Reverting back to `cloud-provider=aws` and then creating [`cloud_config` with at _least_ `kubernetesClusterId=`](https://github.com/kubernetes-sigs/kubespray/blob/v2.13.0/roles/kubernetes/node/templates/cloud-configs/aws-cloud-config.j2#L8) would be my strong recommendation (apologies for the kubespray link, it was the fastest example I could find; you'll have to manage that file via cloud-init, since kubeadm isn't prepared for external content like that) – mdaniel May 06 '20 at 23:47
  • Hm, that seems like a dangerous decision given the deprecation timeline but I understand why you say that. Any idea why my worker nodes throw a `failed to load Kubelet config file /var/lib/kubelet/config.yaml` on boot now? I can confirm that file doesn't exist. They were bootstrapping fine before changing to `cloud-provider=aws` and adding a simple cloud-config.yml file indicating the `kubernetesClusterId`. The Master fails too but with a lot more expansive amount of errors. Mostly mentioning `node "ip-10-0-10-55.us-gov-west-1.compute.internal" not found`. Which IS the correct internal AWS DNS – TJ Zimmerman May 07 '20 at 02:19
  • Your line of questioning seems to have pivoted from cloud-provider woes over to kubeadm join woes, so perhaps a fresh question wherein you describe out the steps you have taken and the errors you are experiencing will help. Good luck! – mdaniel May 07 '20 at 03:01
  • I disagree. The new behavior was triggered by changing cloud providers so this is still relevant. – TJ Zimmerman May 07 '20 at 03:06

1 Answer


The documentation does not mention that you are required to deploy the AWS Cloud Controller Manager along with its required RBAC policies. These can be found in the /manifests directory of the repo.
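
Assuming you have a local clone of the repo, applying everything in that directory looks roughly like this (check the directory contents first, since file names can change between commits):

$> git clone https://github.com/kubernetes/cloud-provider-aws.git
$> kubectl apply -f cloud-provider-aws/manifests/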

There is not currently a published AWS Cloud Controller Manager image, so you will need to build and host it yourself, or use my image built from the newest commit, found here.

You will notice that --cloud-provider=aws is passed as an argument. Despite this being the EXTERNAL cloud provider integration, it IS in fact necessary to pass aws, not external, here.
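
For context, the container spec in the DaemonSet looks roughly like this (heavily abbreviated; the image value is a placeholder for whatever you built and pushed yourself):

containers:
  - name: aws-cloud-controller-manager
    image: registry.example.com/aws-cloud-controller-manager:latest   # placeholder for your own build
    args:
      - --v=2
      - --cloud-provider=aws   # "aws", even though this is the out-of-tree controller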

Lastly, all of your instances must also be tagged with: "KubernetesCluster" = var.K8S_CLUSTER_NAME
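
If you are not managing tags through Terraform, the equivalent one-off CLI call looks like this (the instance ID is a placeholder):

$> aws ec2 create-tags \
     --resources i-0123456789abcdef0 \
     --tags "Key=KubernetesCluster,Value=${K8S_CLUSTER_NAME}"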

  • Zimmerman I am also trying to achieve the same setup with AWS and Kubespray. I have updated cloud provider and created roles and attached to all nodes. ```kubelet node "ip-10-0-11-45.ap-south-1.compute.internal" not found Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-0-11-45.ap-south-1.compute.internal" is forbidden: User "system:node:master1" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node``` – Sreejith May 31 '20 at 13:22
  • Have you deployed the RBAC resources as I mentioned in the answer you commented on? `The documentation does not mention it is required to deploy the AWS Cloud Controller Manager along with its required RBAC policies. These can be found in /manifests on the repo` – TJ Zimmerman Jun 01 '20 at 07:49
  • Issue was with the tagname, I forgot to update that. Now the issue is resolved. – Sreejith Jun 01 '20 at 08:16