My Kubernetes cluster is stuck in a terminating state. Below is the current state.
pods:
kubectl get po
NAME READY STATUS RESTARTS AGE
dashboard-0 1/1 Terminating 0 3h12m
data-cruncher-0 1/2 Terminating 0 3h12m
db-0 3/3 Terminating 0 3h12m
prometheus-0 3/3 Terminating 0 3h12m
register-0 3/3 Terminating 0 3h12m
The pod logs show an authorization error.
kubectl logs dashboard-0
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
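For context, this error usually means the API server's kubelet client identity (kube-apiserver-kubelet-client) is not authorized against the kubelet API, which is the path kubectl logs uses. On a self-managed cluster that access is normally granted by binding the built-in system:kubelet-api-admin ClusterRole, roughly as below (a sketch only; on EKS this binding is normally managed by the platform, so treat it just as an indication of where the error comes from):
kubectl create clusterrolebinding kubelet-api-admin --clusterrole=system:kubelet-api-admin --user=kube-apiserver-kubelet-client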
statefulsets, deployments, daemonsets, events, services:
[ec2-user@ip-172-31-7-229 ~]$ kubectl get statefulset
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get deploy
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 172.20.0.1 <none> 443/TCP 3h20m
[ec2-user@ip-172-31-7-229 ~]$ kubectl get events
No resources found in default namespace.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get daemonset
No resources found in default namespace.
pvc and pv:
[ec2-user@ip-172-31-7-229 ~]$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
db-persistent-storage-db-0 Bound pvc-2d287652-c927-4c63-a463-e40b7da1686f 100Gi RWO ssd 3h14m
prometheus-pvc Terminating pvc-0327f200-5b88-412a-a029-bc302f09333d 20Gi RWO hdd 3h14m
register-pvc Terminating pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6 20Gi RWO ssd 3h14m
[ec2-user@ip-172-31-7-229 ~]$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-0327f200-5b88-412a-a029-bc302f09333d 20Gi RWO Delete Bound default/prometheus-pvc hdd 3h14m
pvc-2d287652-c927-4c63-a463-e40b7da1686f 100Gi RWO Delete Bound default/db-persistent-storage-db-0 ssd 3h14m
pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6 20Gi RWO Delete Bound default/register-pvc ssd 3h14m
The nodes show as:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-134-174.us-west-2.compute.internal NotReady <none> 3h17m v1.21.12-eks-5308cf7
ip-10-0-142-12.us-west-2.compute.internal NotReady <none> 3h15m v1.21.12-eks-5308cf7
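Both nodes being NotReady is most likely why everything hangs in Terminating: the API server waits for the kubelets to confirm the pods have actually stopped, and they cannot. The node conditions can be inspected with something like:
kubectl describe node ip-10-0-134-174.us-west-2.compute.internal | grep -A 10 Conditions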
pvc description:
kubectl describe pvc prometheus-pvc
Name: prometheus-pvc
Namespace: default
StorageClass: hdd
Status: Terminating (lasts 3h11m)
Volume: pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels: app=prometheus
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
volume.kubernetes.io/selected-node: ip-10-0-134-174.us-west-2.compute.internal
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
VolumeMode: Filesystem
Used By: prometheus-0
Events: <none>
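The kubernetes.io/pvc-protection finalizer above is what holds the PVC in Terminating: it is only cleared once no pod uses the claim, and Used By still lists prometheus-0. The finalizer list can also be read directly, for example:
kubectl get pvc prometheus-pvc -o jsonpath='{.metadata.finalizers}'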
And the pv:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-0327f200-5b88-412a-a029-bc302f09333d 20Gi RWO Delete Bound default/prometheus-pvc hdd 3h18m
pvc-2d287652-c927-4c63-a463-e40b7da1686f 100Gi RWO Delete Bound default/db-persistent-storage-db-0 ssd 3h18m
pvc-dfd5deef-9f2d-4e60-a84b-55512e094cb6 20Gi RWO Delete Bound default/register-pvc ssd 3h18m
And the PV description:
[ec2-user@ip-172-31-7-229 ~]$ kubectl describe pv pvc-0327f200-5b88-412a-a029-bc302f09333d
Name: pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels: topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2b
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: hdd
Status: Bound
Claim: default/prometheus-pvc
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 20Gi
Node Affinity:
Required Terms:
Term 0: topology.kubernetes.io/zone in [us-west-2b]
topology.kubernetes.io/region in [us-west-2]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-west-2b/vol-00d432f06a2fbd806
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
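As a cross-check, the attachment state of the backing EBS volume (the VolumeID in the Source section above) can be queried from the AWS CLI:
aws ec2 describe-volumes --volume-ids vol-00d432f06a2fbd806 --query 'Volumes[0].{State:State,AttachedTo:Attachments[*].InstanceId}'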
I tried releasing the EBS volume from the description above via the AWS web console, but it immediately attached another volume and stayed stuck in the same terminating state.
[ec2-user@ip-172-31-7-229 ~]$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
90s Normal SuccessfulAttachVolume pod/prometheus-0 AttachVolume.Attach succeeded for volume "pvc-0327f200-5b88-412a-a029-bc302f09333d"
[ec2-user@ip-172-31-7-229 ~]$ kubectl describe po prometheus-0
Name: prometheus-0
Namespace: default
Priority: 0
Node: ip-10-0-134-174.us-west-2.compute.internal/10.0.134.174
Start Time: Sat, 25 Jun 2022 11:51:44 +0000
Labels: app=prometheus
controller-revision-hash=prometheus-5c84fc57f4
statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: kubectl.kubernetes.io/default-container: prometheus
kubernetes.io/psp: eks.privileged
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Terminating (lasts 3h20m)
Termination Grace Period: 30s
IP: 10.0.129.199
IPs:
IP: 10.0.129.199
Controlled By: StatefulSet/prometheus
Containers:
prometheus:
Container ID: docker://50f159a0e5e64502d614479791c0b6af381630dca28450dbc7fe237746998457
Image: 809541265033.dkr.ecr.us-east-2.amazonaws.com/prometheus:nightlye2e
Image ID: docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/prometheus@sha256:d4602ccdc676a9211645fc9710a2668f6b62ee59d080ed0df6bbee7c92f26014
Port: 9090/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 25 Jun 2022 11:51:59 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 5
memory: 10Gi
Requests:
cpu: 25m
memory: 500Mi
Liveness: http-get http://:9090/-/healthy delay=15s timeout=1s period=10s #success=1 #failure=3
Environment:
JOB_NAME: dev-default-nightlye2e
AWS_DEFAULT_REGION: us-west-2
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/prometheus/ from prometheus-storage-volume (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
pusher:
Container ID: docker://6b994ea0ac6fa15cb6ff51f582037a4c181818f98b1e000e112b731e73639ce5
Image: 809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher:nightlye2e
Image ID: docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher@sha256:f410b40069f9d4a6e77fe7db03bd5f6a108c223b17f5d552b10838f073ae8c99
Port: <none>
Host Port: <none>
Command:
./prometheus.sh
900
s3://project-n-logs-us-west-2/dev-default/project-n-dev-default-a2ea-829a
State: Running
Started: Sat, 25 Jun 2022 11:52:00 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 512Mi
Requests:
cpu: 200m
memory: 128Mi
Environment:
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/prometheus/ from prometheus-storage-volume (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
resize-buddy:
Container ID: docker://ba259862b18d91ae475a13993afe3b377c069dc48153c42f3fdca4c97a60fad1
Image: 809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher:nightlye2e
Image ID: docker-pullable://809541265033.dkr.ecr.us-east-2.amazonaws.com/pusher@sha256:f410b40069f9d4a6e77fe7db03bd5f6a108c223b17f5d552b10838f073ae8c99
Port: <none>
Host Port: <none>
Command:
./pvc-expander.sh
60
prometheus-pvc
/prometheus
80
2
State: Running
Started: Sat, 25 Jun 2022 11:52:04 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 50m
memory: 256Mi
Requests:
cpu: 10m
memory: 128Mi
Environment:
AWS_DEFAULT_REGION: us-west-2
AWS_REGION: us-west-2
AWS_ROLE_ARN: arn:aws:iam::775902114032:role/project-n-dev-default-a2ea-logs
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
Mounts:
/prometheus/ from prometheus-storage-volume (rw)
/var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v5clx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
aws-iam-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 86400
prometheus-storage-volume:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-pvc
ReadOnly: false
kube-api-access-v5clx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: nodeUse=main
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulAttachVolume 2m15s (x2 over 3h26m) attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-0327f200-5b88-412a-a029-bc302f09333d"
The volume automatically gets attached again (presumably the attach-detach controller keeps re-attaching it as long as the pod object still exists).
[ec2-user@ip-172-31-7-229 ~]$ kubectl describe pv pvc-0327f200-5b88-412a-a029-bc302f09333d
Name: pvc-0327f200-5b88-412a-a029-bc302f09333d
Labels: topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2b
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: hdd
Status: Terminating (lasts 4m7s)
Claim: default/prometheus-pvc
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 20Gi
Node Affinity:
Required Terms:
Term 0: topology.kubernetes.io/zone in [us-west-2b]
topology.kubernetes.io/region in [us-west-2]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://us-west-2b/vol-00d432f06a2fbd806
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
How do I fix this cleanup? Removing the finalizers does not remove the linked resources.
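For reference, the finalizer removal was attempted with patches along these lines (a sketch, using the resource names from the earlier output):
kubectl patch pvc prometheus-pvc -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl patch pv pvc-0327f200-5b88-412a-a029-bc302f09333d -p '{"metadata":{"finalizers":null}}' --type=merge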
My new observations:
kubectl delete pod <podname> --force
The above command deletes the pods but leaves behind the PVCs and PVs, so I ran the same force delete on them as well. That removed the resources from Kubernetes but left behind the EBS volumes linked to these PVs on AWS.
Those have to be deleted manually from the console.
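For completeness, the force deletions were along these lines, and the leftover volumes can also be removed with the AWS CLI instead of the console (a sketch; the volume ID is the one from the PV description, and delete-volume only succeeds once the volume is detached):
kubectl delete pvc prometheus-pvc --grace-period=0 --force
kubectl delete pv pvc-0327f200-5b88-412a-a029-bc302f09333d --grace-period=0 --force
aws ec2 delete-volume --volume-id vol-00d432f06a2fbd806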
When I delete the PVs with kubectl's --force option, the messages below are shown; it seems that with --force it does not check whether the deletion actually happened.
And the events show the PVs are still in place.
[ec2-user@ip-172-31-14-155 .ssh]$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
37s Normal VolumeDelete persistentvolume/pvc-2d4b48d7-4da1-4872-9bd8-0afb6b94420e error deleting EBS volume "vol-0ceb4ee469f6a35ef" since volume is currently attached to "i-04fe5c6db79bea12e"
37s Normal VolumeDelete persistentvolume/pvc-6d22eb5c-cdf2-40d2-8ce2-9472c979a1de error deleting EBS volume "vol-01e2bbe04a7b30257" since volume is currently attached to "i-04fe5c6db79bea12e"
22s Normal VolumeDelete persistentvolume/pvc-c438bd0c-b90b-4138-a3db-e517fabe4d66 error deleting EBS volume "vol-06791441b4084d524" since volume is currently attached to "i-04fe5c6db79bea12e"
What is the underlying conflict I need to fix so that terraform destroy can clean this up directly?