
My Kubernetes cluster is stuck in a terminating state. Below is the current state.

pods:

kubectl get po
NAME              READY   STATUS        RESTARTS   AGE
dashboard-0       1/1     Terminating   0          3h12m
data-cruncher-0   1/2     Terminating   0          3h12m
db-0              3/3     Terminating   0          3h12m
prometheus-0      3/3     Terminating   0          3h12m
register-0        3/3     Terminating   0          3h12m

Fetching pod logs fails with an authorization error.

kubectl logs dashboard-0
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
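
For reference, this authorization can normally be granted with a ClusterRoleBinding to the built-in kubelet API role, though I am not sure a missing binding is the actual root cause here:

# Possible workaround for the authorization error above (untested here);
# system:kubelet-api-admin is a built-in ClusterRole that covers nodes/proxy.
kubectl create clusterrolebinding kubelet-api-admin \
  --clusterrole=system:kubelet-api-admin \
  --user=kube-apiserver-kubelet-client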

statefulsets, deployments, daemonsets, events, services:

[ec2-user@ip-172-31-7-229 ~]$ kubectl get statefulset
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get deploy
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   172.20.0.1   <none>        443/TCP   3h20m
[ec2-user@ip-172-31-7-229 ~]$ kubectl get events
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get daemonset
No resources found in default namespace.

pvc and pv:

[ec2-user@ip-172-31-7-229 ~]$ kubectl get pvc
NAME                         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
db-persistent-storage-db-0   Bound         pvc-2d287652-c927-4c63-a463-e40b7da1686f   100Gi      RWO            ssd            3h14m
prometheus-pvc               Terminating   pvc-0327f200-5b88-412a-a029-bc302f09333d   20Gi       RWO            hdd            3h14m
register-pvc                 Terminating   pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6   20Gi       RWO            ssd            3h14m
[ec2-user@ip-172-31-7-229 ~]$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                STORAGECLASS   REASON   AGE
pvc-0327f200-5b88-412a-a029-bc302f09333d   20Gi       RWO            Delete           Bound    default/prometheus-pvc               hdd                     3h14m
pvc-2d287652-c927-4c63-a463-e40b7da1686f   100Gi      RWO            Delete           Bound    default/db-persistent-storage-db-0   ssd                     3h14m
pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6   20Gi       RWO            Delete           Bound    default/register-pvc                 ssd                     3h14m

nodes showing as:

kubectl get nodes
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-134-174.us-west-2.compute.internal   NotReady   <none>   3h17m   v1.21.12-eks-5308cf7
ip-10-0-142-12.us-west-2.compute.internal    NotReady   <none>   3h15m   v1.21.12-eks-5308cf7

pvc description:

 kubectl describe pvc prometheus-pvc
Name:          prometheus-pvc
Namespace:     default
StorageClass:  hdd
Status:        Terminating (lasts 3h11m)
Volume:        pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels:        app=prometheus
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
               volume.kubernetes.io/selected-node: ip-10-0-134-174.us-west-2.compute.internal
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       prometheus-0
Events:        <none>

And the pv:

 kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                STORAGECLASS   REASON   AGE
pvc-0327f200-5b88-412a-a029-bc302f09333d   20Gi       RWO            Delete           Bound    default/prometheus-pvc               hdd                     3h18m
pvc-2d287652-c927-4c63-a463-e40b7da1686f   100Gi      RWO            Delete           Bound    default/db-persistent-storage-db-0   ssd                     3h18m
pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6   20Gi       RWO            Delete           Bound    default/register-pvc                 ssd                     3h18m

And the PV description:

[ec2-user@ip-172-31-7-229 ~]$ kubectl describe pv pvc-0327f200-5b88-412a-a029-bc302f09333d
Name:              pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels:            topology.kubernetes.io/region=us-west-2
                   topology.kubernetes.io/zone=us-west-2b
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      hdd
Status:            Bound
Claim:             default/prometheus-pvc
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          20Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.kubernetes.io/zone in [us-west-2b]
                   topology.kubernetes.io/region in [us-west-2]
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2b/vol-00d432f06a2fbd806
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

I tried detaching the EBS volume from the description above via the AWS web console, but it was immediately attached again and stayed stuck in the same terminating state.
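
To confirm from the AWS side what the volume is attached to, something like this should work (a sketch using the volume ID from the PV description above):

# Show the attachment state and instance ID for the PV's EBS volume
aws ec2 describe-volumes --volume-ids vol-00d432f06a2fbd806 \
  --query 'Volumes[0].Attachments'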

[ec2-user@ip-172-31-7-229 ~]$ kubectl get events
LAST SEEN   TYPE     REASON                   OBJECT             MESSAGE
90s         Normal   SuccessfulAttachVolume   pod/prometheus-0   AttachVolume.Attach succeeded for volume "pvc-0327f200-5b88-412a-a029-bc302f09333d"
[ec2-user@ip-172-31-7-229 ~]$ kubectl describe po prometheus-0
Name:                      prometheus-0
Namespace:                 default
Priority:                  0
Node:                      ip-10-0-134-174.us-west-2.compute.internal/10.0.134.174
Start Time:                Sat, 25 Jun 2022 11:51:44 +0000
Labels:                    app=prometheus
                           controller-revision-hash=prometheus-5c84fc57f4
                           statefulset.kubernetes.io/pod-name=prometheus-0
Annotations:               kubectl.kubernetes.io/default-container: prometheus
                           kubernetes.io/psp: eks.privileged
                           seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:                    Terminating (lasts 3h20m)
Termination Grace Period:  30s
IP:                        10.0.129.199
IPs:
  IP:           10.0.129.199
Controlled By:  StatefulSet/prometheus
Containers:
  prometheus:
    Container ID:   docker://50f159a0e5e64502d614479791c0b6af381630dca28450dbc7fe237746998457
    Image:          809541265033.dkr.ecr.us-east-2.amazonaws.com/prometheus:nightlye2e
    Image ID:       docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/prometheus@sha256:d4602ccdc676a9211645fc9710a2668f6b62ee59d080ed0df6bbee7c92f26014
    Port:           9090/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 25 Jun 2022 11:51:59 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     5
      memory:  10Gi
    Requests:
      cpu:     25m
      memory:  500Mi
    Liveness:  http-get http://:9090/-/healthy delay=15s timeout=1s period=10s #success=1 #failure=3
    Environment:
      JOB_NAME:                     dev-default-nightlye2e
      AWS_DEFAULT_REGION:           us-west-2
      AWS_REGION:                   us-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /prometheus/ from prometheus-storage-volume (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
  pusher:
    Container ID:  docker://6b994ea0ac6fa15cb6ff51f582037a4c181818f98b1e000e112b731e73639ce5
    Image:         809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher:nightlye2e
    Image ID:      docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher@sha256:f410b40069f9d4a6e77fe7db03bd5f6a108c223b17f5d552b10838f073ae8c99
    Port:          <none>
    Host Port:     <none>
    Command:
      ./prometheus.sh
      900
      s3://project-n-logs-us-west-2/dev-default/project-n-dev-default-a2ea-829a
    State:          Running
      Started:      Sat, 25 Jun 2022 11:52:00 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     200m
      memory:  128Mi
    Environment:
      AWS_REGION:                   us-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /prometheus/ from prometheus-storage-volume (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
  resize-buddy:
    Container ID:  docker://ba259862b18d91ae475a13993afe3b377c069dc48153c42f3fdca4c97a60fad1
    Image:         809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher:nightlye2e
    Image ID:      docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher@sha256:f410b40069f9d4a6e77fe7db03bd5f6a108c223b17f5d552b10838f073ae8c99
    Port:          <none>
    Host Port:     <none>
    Command:
      ./pvc-expander.sh
      60
      prometheus-pvc
      /prometheus
      80
      2
    State:          Running
      Started:      Sat, 25 Jun 2022 11:52:04 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  256Mi
    Requests:
      cpu:     10m
      memory:  128Mi
    Environment:
      AWS_DEFAULT_REGION:           us-west-2
      AWS_REGION:                   us-west-2
      AWS_ROLE_ARN:                 arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /prometheus/ from prometheus-storage-volume (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  prometheus-storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-pvc
    ReadOnly:   false
  kube-api-access-v5clx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              nodeUse=main
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                  Age                    From                     Message
  ----    ------                  ----                   ----                     -------
  Normal  SuccessfulAttachVolume  2m15s (x2 over 3h26m)  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-0327f200-5b88-412a-a029-bc302f09333d"

The volume is automatically getting attached again:

[ec2-user@ip-172-31-7-229 ~]$ kubectl describe pv pvc-0327f200-5b88-412a-a029-bc302f09333d
Name:              pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels:            topology.kubernetes.io/region=us-west-2
                   topology.kubernetes.io/zone=us-west-2b
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      hdd
Status:            Terminating (lasts 4m7s)
Claim:             default/prometheus-pvc
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          20Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.kubernetes.io/zone in [us-west-2b]
                   topology.kubernetes.io/region in [us-west-2]
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2b/vol-00d432f06a2fbd806
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

How do I fix this cleanup? Removing the finalizers does not remove the linked resources.
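
For reference, this is roughly how I removed the finalizers (using the prometheus objects as an example):

# Clear the protection finalizers so the stuck objects can go away
kubectl patch pvc prometheus-pvc --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl patch pv pvc-0327f200-5b88-412a-a029-bc302f09333d --type=merge -p '{"metadata":{"finalizers":null}}'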

My new observations:

kubectl delete pod <podname> --force

The above command deletes the pods but leaves the PVCs and PVs behind, so I ran the same forced deletion on those as well. That removed the resources from Kubernetes but left the EBS volumes linked to these PVs behind in AWS.
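
Roughly the same forced deletion applied to the claims and volumes (prometheus as an example):

# --grace-period=0 with --force removes the objects without waiting
kubectl delete pvc prometheus-pvc --force --grace-period=0
kubectl delete pv pvc-0327f200-5b88-412a-a029-bc302f09333d --force --grace-period=0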

Those have to be deleted manually from the console.
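
Equivalently with the AWS CLI (a sketch; the volume ID here is the prometheus one from the PV description above):

# delete-volume fails while the volume is still attached, so detach
# it first (detach-volume also accepts --force if needed)
aws ec2 detach-volume --volume-id vol-00d432f06a2fbd806
aws ec2 delete-volume --volume-id vol-00d432f06a2fbd806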

When I delete the PVs with kubectl's force option, the message below is shown; it seems that with the force option kubectl does not check whether the deletion actually happened.

(screenshot: output of the forced PV deletion)

And the events show the PVs are still in place:

[ec2-user@ip-172-31-14-155 .ssh]$ kubectl get events
LAST SEEN   TYPE     REASON         OBJECT                                                      MESSAGE
37s         Normal   VolumeDelete   persistentvolume/pvc-2d4b48d7-4da1-4872-9bd8-0afb6b94420e   error deleting EBS volume "vol-0ceb4ee469f6a35ef" since volume is currently attached to "i-04fe5c6db79bea12e"
37s         Normal   VolumeDelete   persistentvolume/pvc-6d22eb5c-cdf2-40d2-8ce2-9472c979a1de   error deleting EBS volume "vol-01e2bbe04a7b30257" since volume is currently attached to "i-04fe5c6db79bea12e"
22s         Normal   VolumeDelete   persistentvolume/pvc-c438bd0c-b90b-4138-a3db-e517fabe4d66   error deleting EBS volume "vol-06791441b4084d524" since volume is currently attached to "i-04fe5c6db79bea12e"

Where is the conflict, and how do I fix it so that everything is cleaned up directly by terraform destroy?
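
The workaround I would like to avoid is a two-phase destroy, sketched below with hypothetical resource addresses: remove the Kubernetes objects while the cluster and its IAM permissions still exist, then destroy the rest.

# Hypothetical resource addresses from the Kubernetes provider;
# phase 1 deletes the workloads while EKS and IAM are still up.
terraform destroy -target=kubernetes_stateful_set_v1.prometheus \
                  -target=kubernetes_persistent_volume_claim.prometheus
terraform destroy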

  • Does this answer your question? [terraform destroy failing for kubernetes provider with pvc in aws eks, how to fix that?](https://serverfault.com/questions/1103575/terraform-destroy-failing-for-kubernetes-provider-with-pvc-in-aws-eks-how-to-fi) – SYN Jun 25 '22 at 19:07

1 Answer


First, when the output of "kubectl get nodes" shows your nodes in NotReady status, that is not a good sign, and you should fix it first.
This is what causes the error you see in the pod logs:

Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)

It is saying that the nodes can't be accessed because they are not available.
To fix the unavailable nodes, you should look at the CNI documentation.
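
As a starting point, something like this should show why the nodes are NotReady and whether the CNI pods are healthy (on EKS the CNI runs as the aws-node DaemonSet):

# The Conditions and Events sections usually name the NotReady reason
kubectl describe node ip-10-0-134-174.us-west-2.compute.internal
# aws-node (CNI) and kube-proxy should be Running on every node
kubectl get pods -n kube-system -o wide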

Salar
  • Hi Salar, the nodes were available at first, but terraform destroy has to remove the Kubernetes content before the nodes, since it was deployed using the Kubernetes Terraform provider. It got stuck while deleting the pods, and the leftover resources ended up in the state shown above, as I already explained. How would the deployment have happened without the nodes existing first? – Uday Kiran Reddy Jun 26 '22 at 09:00
  • If I connect to the cluster without running terraform destroy first and delete the individual resources, they get deleted without any force option, though one EBS volume remains in the web console in the Available state. But on the earlier cluster, where destroy was run and failed, we get authorization errors. So somewhere in terraform destroy, permissions are getting deleted first. Can you check that approach? I am not sure about it; can you look at these tf files and guide me? https://github.com/uday1kiran/logs/raw/master/aws.zip – Uday Kiran Reddy Jun 27 '22 at 14:17