
I am using eksctl to set up a cluster on EKS/AWS.

Following the guide in the EKS documentation, I use default values for pretty much everything.

The cluster is created successfully; I update the Kubernetes configuration from the cluster, and I can run the various kubectl commands successfully - e.g. "kubectl get nodes" shows the nodes in the "Ready" state.

I do not touch anything else: I have a clean, out-of-the-box cluster with no other changes made, and so far everything appears to be working as expected. I don't deploy any applications to it; I just leave it alone.

The problem is after some relatively short period of time (roughly 30 minutes after the cluster is created), the nodes change from "Ready" to "NotReady" and it never recovers.

The event log shows this (I redacted the IPs):

LAST SEEN   TYPE     REASON                    OBJECT        MESSAGE
22m         Normal   Starting                  node/ip-[x]   Starting kubelet.
22m         Normal   NodeHasSufficientMemory   node/ip-[x]   Node ip-[x] status is now: NodeHasSufficientMemory
22m         Normal   NodeHasNoDiskPressure     node/ip-[x]   Node ip-[x] status is now: NodeHasNoDiskPressure
22m         Normal   NodeHasSufficientPID      node/ip-[x]   Node ip-[x] status is now: NodeHasSufficientPID
22m         Normal   NodeAllocatableEnforced   node/ip-[x]   Updated Node Allocatable limit across pods
22m         Normal   RegisteredNode            node/ip-[x]   Node ip-[x] event: Registered Node ip-[x] in Controller
22m         Normal   Starting                  node/ip-[x]   Starting kube-proxy.
21m         Normal   NodeReady                 node/ip-[x]   Node ip-[x] status is now: NodeReady
7m34s       Normal   NodeNotReady              node/ip-[x]   Node ip-[x] status is now: NodeNotReady

Same events for the other node in the cluster.

Connecting to the instance and inspecting /var/log/messages shows this at the same time the node goes to NotReady:

Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.259207    3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.385044    3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621271    3896 reflector.go:270] object-"kube-system"/"aws-node-token-bdxwv": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.621320    3896 reflector.go:270] object-"kube-system"/"coredns": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.638850    3896 reflector.go:270] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1beta1.RuntimeClass: the server has asked for the client to provide credentials (get runtimeclasses.node.k8s.io)
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.707074    3896 reflector.go:270] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.711386    3896 reflector.go:270] object-"kube-system"/"coredns-token-67fzd": Failed to watch *v1.Secret: the server has asked for the client to provide credentials (get secrets)
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.714899    3896 reflector.go:270] object-"kube-system"/"kube-proxy-config": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.720884    3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868003    3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar  7 10:40:37 ip-[X] kubelet: E0307 10:40:37.868067    3896 controller.go:125] failed to ensure node lease exists, will retry in 200ms, error: Get https://[X]/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/ip-[x]?timeout=10s: write tcp 192.168.91.167:50866->34.249.27.158:443: use of closed network connection
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017157    3896 kubelet_node_status.go:385] Error updating node status, will retry: error getting node "ip-[x]": Unauthorized
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.017182    3896 kubelet_node_status.go:372] Unable to update node status: update node status exceeds retry count
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.200053    3896 controller.go:125] failed to ensure node lease exists, will retry in 400ms, error: Unauthorized
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.517193    3896 reflector.go:270] object-"kube-system"/"kube-proxy": Failed to watch *v1.ConfigMap: the server has asked for the client to provide credentials (get configmaps)
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.729756    3896 controller.go:125] failed to ensure node lease exists, will retry in 800ms, error: Unauthorized
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.752267    3896 reflector.go:126] object-"kube-system"/"aws-node-token-bdxwv": Failed to list *v1.Secret: Unauthorized
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.824988    3896 reflector.go:126] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Unauthorized
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.899566    3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Unauthorized
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963756    3896 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Unauthorized
Mar  7 10:40:38 ip-[X] kubelet: E0307 10:40:38.963822    3896 reflector.go:126] object-"kube-system"/"kube-proxy-config": Failed to list *v1.ConfigMap: Unauthorized

CloudWatch logs for the authenticator component show many of these messages:

time="2020-03-07T10:40:37Z" level=warning msg="access denied" arn="arn:aws:iam::[ACCOUNT_ID]:role/AmazonSSMRoleForInstancesQuickSetup" client="127.0.0.1:50132" error="ARN is not mapped: arn:aws:iam::[ACCOUNT_ID]:role/amazonssmroleforinstancesquicksetup" method=POST path=/authenticate

I confirmed that the role does exist via the IAM console.

Clearly this node is reporting NotReady because of these authentication failures.

Is this some authentication token that timed out after approximately 30 minutes, and if so, shouldn't a new token be requested automatically? Or am I supposed to set something else up?
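For reference, the authentication tokens involved here are short-lived by design; one way to inspect a token's expiry with the AWS CLI (the cluster name is a placeholder) is:

```shell
# Sketch: print the expiry timestamp of an EKS authentication token.
# "my-demo" is a placeholder cluster name.
aws eks get-token --cluster-name my-demo \
  --query 'status.expirationTimestamp' --output text
```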

I was surprised that a fresh cluster created by eksctl would show this problem.

What did I miss?

caprica
  • Hi @caprica, were you able to solve the issue? – Putnik Aug 12 '20 at 09:44
  • the posted upvoted answer just regurgitated the documentation i already followed and did not help solve my problem, i did solve it myself though - if i can find my notes i'll post an update. i moved on to Azure instead of AWS so it's no longer fresh in my mind sorry. – caprica Aug 12 '20 at 15:23

3 Answers


These are the steps I followed to resolve this issue...

  1. Connect to the failing instance via SSH.

  2. Execute "aws sts get-caller-identity"

  3. Note the ARN of the user; it will likely be something like this: arn:aws:sts::999999999999:assumed-role/AmazonSSMRoleForInstancesQuickSetup/i-00000000000ffffff

Note the role here is AmazonSSMRoleForInstancesQuickSetup, this seems wrong to me - but AFAIK I followed the guides to the letter when creating the cluster.
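As an aside - the ARN printed by get-caller-identity is an STS assumed-role ARN, while the aws-auth mapping shown later needs the underlying IAM role ARN. A rough conversion, assuming the role has no path:

```shell
# Sketch: derive the IAM role ARN from the STS assumed-role ARN.
# The STS ARN below is the placeholder value from step 3.
STS_ARN="arn:aws:sts::999999999999:assumed-role/AmazonSSMRoleForInstancesQuickSetup/i-00000000000ffffff"
ACCOUNT_ID=$(echo "$STS_ARN" | cut -d: -f5)   # 5th colon-separated field is the account id
ROLE_NAME=$(echo "$STS_ARN" | cut -d/ -f2)    # 2nd slash-separated field is the role name
IAM_ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/${ROLE_NAME}"
echo "$IAM_ROLE_ARN"
```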

Issues so far:

a) Why is this role being used for the AWS identity?

b) If this is the right role, why is it successful at first and only fails 30 minutes after cluster creation?

c) If this is the right role, what access rights are missing?

Personally, this feels like it is the wrong role to me, but I solved my problem by addressing point (c).

Continuing the steps...

  4. If this role is inspected via the IAM service in the AWS console, it can be seen that it does not have all of the required permissions; by default it has only:
  • AmazonSSMManagedInstanceCore
  5. Assuming this role is the correct role to use, it needs at least the following policy added to it:
  • AmazonEC2ContainerRegistryPowerUser

Attach that policy in the usual way. I admit this may grant more privileges than needed, but that's for another day.
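A sketch of attaching that policy from the CLI, assuming the role name seen in step 3 (names are placeholders for your own setup):

```shell
# Sketch: attach the ECR power-user managed policy to the instance role.
# The role name is the one observed via get-caller-identity, assumed here.
aws iam attach-role-policy \
  --role-name AmazonSSMRoleForInstancesQuickSetup \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPowerUser
```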

At this point, the AWS security configuration should now be correct, but this is not the end of the story.

  6. Kubernetes, through the kubelet process, has its own security role mappings to consider - these map Kubernetes users to IAM users or roles on AWS.

This configuration is maintained by editing a Kubernetes configmap.

Edit the configmap with "kubectl edit -n kube-system configmap/aws-auth".

This is the configuration immediately after creating the cluster, before making any changes:

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::999999999999:role/eksctl-my-demo-nodegroup-my-demo-NodeInstanceRole-AAAAAAAAAAAAA
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  [...whatever...]

The only role mapped here is the node instance role - this role was created automatically during the provisioning of the cluster via eksctl.

  7. Change the configmap:
apiVersion: v1
data:
  mapRoles: |
    - rolearn: arn:aws:iam::999999999999:role/eksctl-my-demo-nodegroup-my-demo-NodeInstanceRole-AAAAAAAAAAAAA
      username: system:node:{{EC2PrivateDNSName}}
      groups:
      - system:bootstrappers
      - system:nodes
    - rolearn: arn:aws:iam::999999999999:role/AmazonSSMRoleForInstancesQuickSetup
      username: MyDemoEKSRole
      groups:
      - system:masters
    - rolearn: arn:aws:iam::999999999999:role/MyDemoEKSRole
      username: CodeBuild
      groups:
      - system:masters
      - system:bootstrappers
      - system:nodes
kind: ConfigMap
metadata:
  [...whatever...]

I have mapped the AmazonSSMRoleForInstancesQuickSetup role as a Kubernetes masters role.

I have also mapped the MyDemoEKSRole cluster security role previously created for cluster provisioning to the various Kubernetes roles, for the case where Kubernetes is being invoked by a CodeBuild pipeline.
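For what it's worth, the same mappings can be added without hand-editing the configmap, using eksctl (the cluster, role, and username values below are placeholders matching the example above):

```shell
# Sketch: add an identity mapping via eksctl instead of "kubectl edit".
# Cluster name, role ARN and username are placeholder values.
eksctl create iamidentitymapping \
  --cluster my-demo \
  --arn arn:aws:iam::999999999999:role/AmazonSSMRoleForInstancesQuickSetup \
  --username MyDemoEKSRole \
  --group system:masters
```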

  8. Save this configmap and eventually the cluster will repair itself and report Ready.
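To see the repair happen, you can watch the node status with standard kubectl (nothing assumed here):

```shell
# Watch the nodes; they should flip from NotReady back to Ready
# within a few minutes of the configmap being saved.
kubectl get nodes --watch
```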

Conclusion:

After executing all of these cluster post-creation steps, my authentication failures ceased, and the cluster started reporting a successful status again, clearing the health-check and returning the node to a Ready status.

I freely admit this might not be the "right" way to solve my issue, and it definitely feels like I opened up the security way more than I should have, but it definitely worked and solved my problem.

As mentioned, shortly after this we transitioned to Azure instead of AWS, so I never took this any further - but I did end up with a fully working cluster with no more expiring credentials.

Naively I suppose I expected the tools to create a working cluster for me. There was no mention of this issue or these steps anywhere in any guide that I found.

caprica

It looks like your role is expiring.

You can get help from the Amazon EKS Troubleshooting section, Unauthorized or Access Denied (kubectl):

If you receive one of the following errors while running kubectl commands, then your kubectl is not configured properly for Amazon EKS or the IAM user or role credentials that you are using do not map to a Kubernetes RBAC user with sufficient permissions in your Amazon EKS cluster.

  • could not get token: AccessDenied: Access denied

  • error: You must be logged in to the server (Unauthorized)

  • error: the server doesn't have a resource type "svc"

This could be because the cluster was created with one set of AWS credentials (from an IAM user or role), and kubectl is using a different set of credentials.

When an Amazon EKS cluster is created, the IAM entity (user or role) that creates the cluster is added to the Kubernetes RBAC authorization table as the administrator (with system:masters permissions). Initially, only that IAM user can make calls to the Kubernetes API server using kubectl. For more information, see Managing Users or IAM Roles for your Cluster. Also, the AWS IAM Authenticator for Kubernetes uses the AWS SDK for Go to authenticate against your Amazon EKS cluster. If you use the console to create the cluster, you must ensure that the same IAM user credentials are in the AWS SDK credential chain when you are running kubectl commands on your cluster.

If you install and configure the AWS CLI, you can configure the IAM credentials for your user. If the AWS CLI is configured properly for your user, then the AWS IAM Authenticator for Kubernetes can find those credentials as well. For more information, see Configuring the AWS CLI in the AWS Command Line Interface User Guide.
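For example, you can confirm which identity the AWS CLI - and therefore the authenticator - resolves to, and compare it against the identity that created the cluster (a sketch; the cluster name is a placeholder):

```shell
# Sketch: check the caller identity the credential chain resolves to,
# then look up the cluster to compare ("cluster_name" is a placeholder).
aws sts get-caller-identity --query 'Arn' --output text
aws eks describe-cluster --name cluster_name --query 'cluster.arn' --output text
```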

If you assumed a role to create the Amazon EKS cluster, you must ensure that kubectl is configured to assume the same role. Use the following command to update your kubeconfig file to use an IAM role. For more information, see Create a kubeconfig for Amazon EKS.

aws --region region-code eks update-kubeconfig --name cluster_name --role-arn arn:aws:iam::aws_account_id:role/role_name

To map an IAM user to a Kubernetes RBAC user, see Managing Users or IAM Roles for your Cluster or watch a video about how to map a user.

You should read about Managing Cluster Authentication for AWS and Create a kubeconfig for Amazon EKS.

Keep in mind you should be using aws-iam-authenticator; the installation process is described in its documentation.

Crou

When you deploy a nodegroup, you must grant the nodegroup a role as described in the docs (aws-auth-cm.yaml).

My nodes went NotReady with errors like those in the question above, about 7 minutes after start, despite all pods looking OK.

The reason: the role of the nodegroup was not the same as the role in that yaml, due to a typo.
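A quick way to spot such a mismatch is to compare the role ARN mapped in aws-auth with the instance profile actually attached to a node (the instance id below is a placeholder):

```shell
# Sketch: list the role ARNs mapped in aws-auth, then the instance profile
# attached to one of the nodes, and eyeball them for a typo.
kubectl -n kube-system get configmap aws-auth -o yaml | grep rolearn
aws ec2 describe-instances --instance-ids i-00000000000ffffff \
  --query 'Reservations[].Instances[].IamInstanceProfile.Arn' --output text
```

Note that describe-instances returns the instance-profile ARN rather than the role ARN itself, but the trailing name should match the mapped role.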

Putnik