
We deploy our Kubernetes workloads with the Terraform kubernetes provider, in the same Terraform configuration that creates the EKS cluster itself.
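
Roughly, the configuration looks like the sketch below (resource names and variables here are illustrative, not our exact code): the kubernetes provider is wired to the EKS cluster created in the same configuration, and the claims are plain kubernetes_persistent_volume_claim resources.

resource "aws_eks_cluster" "this" {
  name     = "product-cluster"        # illustrative
  role_arn = var.cluster_role_arn     # illustrative
  vpc_config {
    subnet_ids = var.subnet_ids       # illustrative
  }
}

data "aws_eks_cluster_auth" "this" {
  name = aws_eks_cluster.this.name
}

# Kubernetes provider authenticated against the cluster created above
provider "kubernetes" {
  host                   = aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}

resource "kubernetes_persistent_volume_claim" "prometheus-pvc" {
  metadata {
    name = "prometheus-pvc"
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "hdd"
    resources {
      requests = {
        storage = "20Gi"
      }
    }
  }
}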

When we then ran terraform destroy (we haven't used the product yet, this is just a test of the destroy), it failed with the error below.

kubernetes_persistent_volume_claim.prometheus-pvc: Still destroying... [id=default/prometheus-pvc, 19m30s elapsed]
kubernetes_persistent_volume_claim.register-pvc[0]: Still destroying... [id=default/register-pvc, 19m30s elapsed]
kubernetes_persistent_volume_claim.register-pvc[0]: Still destroying... [id=default/register-pvc, 19m40s elapsed]
kubernetes_persistent_volume_claim.prometheus-pvc: Still destroying... [id=default/prometheus-pvc, 19m40s elapsed]
kubernetes_persistent_volume_claim.prometheus-pvc: Still destroying... [id=default/prometheus-pvc, 19m50s elapsed]
kubernetes_persistent_volume_claim.register-pvc[0]: Still destroying... [id=default/register-pvc, 19m50s elapsed]
╷
│ Error: Persistent volume claim prometheus-pvc still exists with finalizers: [kubernetes.io/pvc-protection]
│ 
│ 
╵
╷
│ Error: Persistent volume claim register-pvc still exists with finalizers: [kubernetes.io/pvc-protection]
│ 
│ 
╵
time=2022-06-17T19:38:38Z level=error msg=1 error occurred:
    * exit status 1
Error destroying Terraform 
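
For reference, kubernetes.io/pvc-protection is the finalizer that blocks deletion of a PVC while some Pod still uses it; Terraform simply waits for Kubernetes to clear it. A quick way to see what is holding the claim (and, strictly as a last resort, to drop the finalizer by hand) is something like:

# Which pods still reference the claim ("Used By:" in the output)
kubectl describe pvc prometheus-pvc

# Show the finalizers still set on the claim
kubectl get pvc prometheus-pvc -o jsonpath='{.metadata.finalizers}'

# Last resort only: strip the finalizer so the delete can complete
# (this may leave the backing EBS volume behind in AWS)
kubectl patch pvc prometheus-pvc -p '{"metadata":{"finalizers":null}}'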

Persistent volumes:

kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                STORAGECLASS   REASON   AGE
pvc-51256bfd-4e32-4a4f-a24b-c0f47f9e1d63   100Gi      RWO            Delete           Bound    default/db-persistent-storage-db-0   ssd                     171m
pvc-9453236c-ffc3-4161-a205-e057c3e1ba77   20Gi       RWO            Delete           Bound    default/prometheus-pvc               hdd                     171m
pvc-ddfef2b9-9723-4651-916b-2cb75baf0f22   20Gi       RWO            Delete           Bound    default/register-pvc                 ssd                     171m

Persistent volume claims:

kubectl get pvc
NAME                         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
db-persistent-storage-db-0   Bound         pvc-51256bfd-4e32-4a4f-a24b-c0f47f9e1d63   100Gi      RWO            ssd            173m
prometheus-pvc               Terminating   pvc-9453236c-ffc3-4161-a205-e057c3e1ba77   20Gi       RWO            hdd            173m
register-pvc                 Terminating   pvc-ddfef2b9-9723-4651-916b-2cb75baf0f22   20Gi       RWO            ssd            173m

And the events:

45m         Normal    VolumeDelete             persistentvolume/pvc-0e5c621a-529c-4458-b224-39ea22a783fc   error deleting EBS volume "vol-0e36ca327609ae963" since volume is currently attached to "i-0bff735f4c0871705"
46m         Warning   NodeNotReady             pod/quicksilver-pg2wb                                       Node is not ready
46m         Normal    Killing                  pod/reducer-0                                               Stopping container reducer
46m         Normal    Killing                  pod/reducer-0                                               Stopping container checkup-buddy
45m         Warning   Unhealthy                pod/reducer-0                                               Readiness probe failed: Get "http://10.0.130.242:9001/": dial tcp 10.0.130.242:9001: connect: connection refused
46m         Warning   NodeNotReady             pod/register-0                                              Node is not ready
44m         Normal    TaintManagerEviction     pod/register-0                                              Cancelling deletion of Pod default/register-0
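
Since the event says the EBS volume is still attached to i-0bff735f4c0871705, the attachment can be checked (and, if the instance is gone or hung, force-detached) from the AWS side, assuming the AWS CLI is configured for this account:

# Is the volume still attached, and to which instance?
aws ec2 describe-volumes --volume-ids vol-0e36ca327609ae963 --query 'Volumes[0].Attachments'

# Last resort: force-detach so the PV controller can delete the volume
aws ec2 detach-volume --volume-id vol-0e36ca327609ae963 --force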

And that instance seems to be part of the Kubernetes cluster's auto scaling group.

[ec2-user@ip-172-31-16-242 software]$ kubectl get po
NAME                          READY   STATUS        RESTARTS   AGE
auto-updater-27601140-4gqtw   0/1     Error         0          50m
auto-updater-27601140-hzrnl   0/1     Error         0          49m
auto-updater-27601140-kmspn   0/1     Error         0          50m
auto-updater-27601140-m4ws6   0/1     Error         0          49m
auto-updater-27601140-wsdpm   0/1     Error         0          45m
auto-updater-27601140-z2m7r   0/1     Error         0          48m
estimator-0                   3/3     Terminating   0          51m
reducer-0                     1/2     Terminating   0          51m
[ec2-user@ip-172-31-16-242 software]$ kubectl get pvc
NAME                         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
db-persistent-storage-db-0   Bound         pvc-ca829a02-9bf5-4540-9900-b6e5ab4624a2   100Gi      RWO            ssd            52m
estimator                    Terminating   pvc-e028acd5-eeb1-4028-89c2-a42c1d28091e   200Gi      RWO            hdd            52m
[ec2-user@ip-172-31-16-242 software]$ kubectl get logs estimator-0
error: the server doesn't have a resource type "logs"
[ec2-user@ip-172-31-16-242 software]$ kubectl logs estimator-0
error: a container name must be specified for pod estimator-0, choose one of: [postgres estimator resize-buddy]
[ec2-user@ip-172-31-16-242 software]$ kubectl logs estimator-0 -c estimator
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
[ec2-user@ip-172-31-16-242 software]$ kubectl logs estimator-0 -c postgres
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
[ec2-user@ip-172-31-16-242 software]$ kubectl logs estimator-0 -c resize-buddy
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
[ec2-user@ip-172-31-16-242 software]$

And the details of the reducer pod:

[ec2-user@ip-172-31-16-242 software]$ kubectl logs reducer-0
Default container name "data-cruncher" not found in pod reducer-0
error: a container name must be specified for pod reducer-0, choose one of: [reducer checkup-buddy] or one of the init containers: [set-resource-owner]
[ec2-user@ip-172-31-16-242 software]$ kubectl logs reducer-0 -c reducer
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
[ec2-user@ip-172-31-16-242 software]$ kubectl logs reducer-0 -c checkup-buddy
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)

I checked the auto-updater pods as well; there we are also getting a similar authorization error.

kubectl logs auto-updater-27601140-4gqtw
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)

And I tried to check a PVC's content using kubectl edit and got the info below:

volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
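
That annotation means the in-tree aws-ebs provisioner is in use (not the EBS CSI driver). The provisioner and reclaim policy of the classes can be confirmed with:

kubectl get storageclass ssd hdd
kubectl get storageclass ssd -o jsonpath='{.provisioner}{"\n"}{.reclaimPolicy}{"\n"}'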

[ec2-user@ip-172-31-16-242 ~]$ kubectl describe pv pvc-ca829a02-9bf5-4540-9900-b6e5ab4624a2
Name:              pvc-ca829a02-9bf5-4540-9900-b6e5ab4624a2
Labels:            topology.kubernetes.io/region=us-west-2
                   topology.kubernetes.io/zone=us-west-2b
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      ssd
Status:            Terminating (lasts 89m)
Claim:             default/db-persistent-storage-db-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          100Gi
Node Affinity:
  Required Terms:
    Term 0:        topology.kubernetes.io/zone in [us-west-2b]
                   topology.kubernetes.io/region in [us-west-2]
Message:
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2b/vol-02bc902640cdb406c
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

No volume attachments

[ec2-user@ip-172-31-16-242 ~]$ kubectl get volumeattachment
No resources found

Nodes:

kubectl get node
NAME                                         STATUS     ROLES    AGE    VERSION
ip-10-0-134-174.us-west-2.compute.internal   NotReady   <none>   3h8m   v1.21.12-eks-5308cf7
ip-10-0-142-12.us-west-2.compute.internal    NotReady   <none>   3h5m   v1.21.12-eks-5308cf7
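
Both nodes report NotReady, so their kubelets are not reporting to the control plane. One way to check from the AWS side whether the underlying instances are actually up, and to reboot them if needed (instance IDs taken from the providerID field):

# Map node names to EC2 instance IDs
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'

# Check the instance state, then reboot it if it is hung
aws ec2 describe-instance-status --instance-ids <instance-id> --include-all-instances
aws ec2 reboot-instances --instance-ids <instance-id>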

And no nodes show up in the Compute section of the EKS console.


Please suggest how to fix this.

  • What's the status in your cluster: are the PVCs gone? If not, it's likely they're still attached to a Pod. – SYN Jun 18 '22 at 14:54
  • Tried multiple times, but terraform destroy gets stuck there. – Uday Kiran Reddy Jun 20 '22 at 11:31
  • Have you checked what's the status of that PVC, though? Could you tell us more about the state of your cluster when facing this? – SYN Jun 20 '22 at 19:06
  • State of the cluster? I didn't understand what to look at there; the cluster is there, and that's what the above command is trying to destroy. – Uday Kiran Reddy Jun 21 '22 at 02:55
  • Are those PVCs gone? Any Pod that might have that PVC attached? Any event showing in the corresponding namespace? You need to troubleshoot Kubernetes here. Terraform won't do more than a kubectl delete, which is not guaranteed to complete, depending on the state of your cluster. – SYN Jun 21 '22 at 22:57
  • 37m Normal VolumeDelete persistentvolume/pvc-0e5c621a-529c-4458-b224-39ea22a783fc error deleting EBS volume "vol-0e36ca327609ae963" since volume is currently attached to "i-0bff735f4c0871705" – Uday Kiran Reddy Jun 23 '22 at 13:14
  • Alright: then you need to figure out why that PVC is still attached here. If no Pod shows as running that could still be using it, it could be due to force-deleting a Pod (doing so, the kubelet would most likely leave some bits behind). Are you using the EBS CSI or the in-tree driver? If CSI: any chance the volume-attacher is unresponsive? Check its logs. If CSI: do you still have a VolumeAttachment corresponding to this PV? – SYN Jun 23 '22 at 17:12
  • I am using the Kubernetes Terraform provider for deployment, and terraform destroy is getting stuck here. – Uday Kiran Reddy Jun 23 '22 at 19:48
  • Please, stop bringing terraform back ... it really is unrelated, describing your PVC: you have the first part of your answer. Volume is stuck. It all has to do with Kubernetes state. Now try to answer my previous questions. Or if anything's not clear, let me know. – SYN Jun 23 '22 at 20:05
  • I didn't get how to check the volume-attacher or the type of driver. Can you please give more details? I added a few more details above. – Uday Kiran Reddy Jun 24 '22 at 11:52
  • And no volume attachments. – Uday Kiran Reddy Jun 24 '22 at 13:50
  • The annotations on the PV suggest you're using the in-tree driver (`pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs`). In the events you showed us, we have some `Node is not ready`. Check kubectl get nodes: it looks like your `i-0bff735f4c0871705` is down ... At which point, I would try to reboot that node (AWS CLI or console). See if that un-blocks it. – SYN Jun 25 '22 at 06:23
  • Added a few more comments; can you check above? – Uday Kiran Reddy Jun 25 '22 at 15:00
  • I tried nuking that account completely and trying again, but got some other results; as it would pile up here, I created another post, can you see it? https://serverfault.com/questions/1104111/pods-stuck-at-terminating-stage-and-the-pvc-how-to-fix-that – Uday Kiran Reddy Jun 25 '22 at 15:14
  • As you can see, both nodes listed in "kubectl get nodes" are down. If you didn't start to panic by now, I would suggest you do, ... forget about deleting PVC. What's going on here?! Did you try to reboot them already? – SYN Jun 25 '22 at 19:02
  • Nodes age is 3h ... Last time I asked for the PVC list, the post was updated with ages of 173 minutes ... It's hard to follow what's going on here, although obviously you've been recreating that cluster in between troubleshooting ... Your cluster is broken right now. Forget about Terraform and the PVCs, stop deleting your cluster and start investigating: what's going on, why aren't your nodes ready? Are the instances down? Is kubelet communication with the control plane broken? ... There's something very wrong behind this ... And it has nothing to do with your PVC. – SYN Jun 25 '22 at 19:07
  • When we run terraform destroy, obviously resources will get destroyed. The control plane is not broken. And of course, recreating a cluster means cleaning up everything and trying again, and the error shows it is stuck at the PVC. I deleted the cluster completely, all the resources, and posted the logs in the other post. Can you check that once? – Uday Kiran Reddy Jun 25 '22 at 19:44
  • Alright. Then you should delete your own Kubernetes objects before shutting down the cluster. You could remove those objects from your state: at which point, you would have to clean up the corresponding resources from AWS (especially EBS volumes). Best would be to fix dependencies in your Terraform code (see the sketch after this thread). Maybe easier: split your code in two: deploying clusters, then deploying objects in the cluster. – SYN Jun 25 '22 at 20:33
  • Although splitting the code shouldn't be necessary. I currently work for a customer, with a large code base (over 1000 objects in state, in the same piece of terraform code). Managing OpenShift cluster deployments, from DNS records, to EC2/Azure/VMware instances, setting up kubernetes objects once API's up ... If your dependencies are correct, deletion should work in reverse order. See `depends_on`. – SYN Jun 25 '22 at 20:37
  • If I connect to the cluster without running terraform destroy first and delete the individual resources, they get deleted without any force option, but one EBS volume remains in the web console in the available state. With the earlier cluster, where destroy was run and failed, it seems we are getting authorization errors. So somewhere in the terraform destroy the permissions are getting deleted first. Can you check that approach once? I am not sure of it; can you check these tf files and guide me? https://github.com/uday1kiran/logs/raw/master/aws.zip – Uday Kiran Reddy Jun 27 '22 at 14:16
  • It seems the issue is with the order of deletion: the aws-auth configmap used for accessing the cluster is getting destroyed before the other resources. Even after that, EBS volumes are left behind; I need help on that also. Is there any option to ignore this particular resource during terraform destroy? – Uday Kiran Reddy Jun 28 '22 at 18:48
  • `terraform plan -target=module.x.y.z -destroy -out plan` – SYN Jul 01 '22 at 02:57
  • Thank you, but I need to add that in the templates themselves, not as a separate command. Is there any option to exclude resources from deletion, at least, so that if Terraform skips the configmap, it will be destroyed anyway once the cluster is deleted? – Uday Kiran Reddy Jul 01 '22 at 09:00
  • Sure: fix dependencies in your code. If your dependency tree is exhaustive, destroying resources would work in reverse order. If you have async deletions, check this to make sure those resources are gone before marking those destroyed: https://www.terraform.io/language/resources/provisioners/syntax#destroy-time-provisioners . Or just split your code separating k8s resources from infrastructure ones. – SYN Jul 03 '22 at 05:37
  • The configmap entry was added as a dependency of the PVC, so it is now deleted last. Somewhat resolved, but the EBS volumes are left undeleted in the AWS account. The reclaim policy is also already set to Delete in the Kubernetes PVC; can you suggest how to troubleshoot this? – Uday Kiran Reddy Jul 04 '22 at 20:16
  • ebs volumes left without deleting: sounds like your StorageClass has a reclaimPolicy set to "Retain". You should look into Delete. See https://kubernetes.io/docs/concepts/storage/storage-classes/#reclaim-policy . – SYN Jul 05 '22 at 17:00
  • It is actually set to Delete. When I clean up manually they do get deleted, but not when I destroy the terraform template. – Uday Kiran Reddy Jul 06 '22 at 16:19
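
A minimal sketch of the dependency ordering discussed in the comments above, assuming the aws-auth ConfigMap is managed in the same configuration as kubernetes_config_map.aws_auth (names and variables are illustrative): making every in-cluster object depend on it means terraform destroy removes the PVCs and other objects while cluster access still works, and only then deletes the ConfigMap and the cluster.

variable "aws_auth_map_roles" {
  type = string   # illustrative: rendered mapRoles YAML
}

resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }
  data = {
    mapRoles = var.aws_auth_map_roles
  }
}

resource "kubernetes_persistent_volume_claim" "prometheus-pvc" {
  metadata {
    name = "prometheus-pvc"
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "hdd"
    resources {
      requests = {
        storage = "20Gi"
      }
    }
  }

  # Destroyed before the ConfigMap, so aws-auth based access is still
  # valid while the claim (and its EBS volume) are being cleaned up.
  depends_on = [kubernetes_config_map.aws_auth]
}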

0 Answers