proper shutdown of a kubernetes cluster

Question

Imagine the following scenario:

You run a kubernetes cluster in your datacenter, which was deployed with kubeadm.
It consists of one master node (running etcd as a static pod, as deployed by kubeadm) and 3 worker nodes.
The nodes are virtual machines running on VMware.

Today, you open your e-mail and you are notified the datacenter will move to a new location. The physical servers will be turned off, moved to the new location and powered on again.

What is the correct shutdown procedure for your kubernetes cluster (without messing up your etcd data)?

This is what I did:

stopped the master server first (this includes etcd ofc), to prevent pods from being rescheduled to other nodes when I turn off the worker nodes.
stopped each worker node

After the migration:

powered on the worker nodes first
powered on the master node next

After doing this, I ended up with one of two scenarios:

etcd data is corrupt and the etcd pod exits with an error
getting errors like this: "Operation cannot be fulfilled on nodes "worker-002": the object has been modified; please apply your changes to the latest version and try again". my logs are getting flooded with these messages.

How could this have been prevented? I don't think running etcd in HA mode would help here, as all etcd nodes would have to be shut down at once too, so you end up with a similar situation as a single node scenario. I get the impression that Etcd is quite... fragile, compared to other K/V stores like Consul.

score 2 · Answer 1 · answered Jan 24 '18 at 15:23

On the master, you will need to stop:

kube-apiserver
kube-scheduler
kube-controller-manager
kubelet (if applicable)
kube-proxy (if applicable)
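On a kubeadm cluster the control plane components run as static pods, so there is no service to stop for them. One way to stop them is to move their manifests out of the directory the kubelet watches. Below is a minimal sketch of that idea, assuming the kubeadm default manifest directory and file names; the holding directory and the function name are made up for illustration. etcd.yaml is left in place on purpose, so etcd stays up until the snapshot has been taken.

```python
import shutil
from pathlib import Path

# Assumed kubeadm default; adjust if your cluster uses another path.
MANIFEST_DIR = Path("/etc/kubernetes/manifests")
# Hypothetical holding directory, not a kubeadm convention.
HOLDING_DIR = Path("/etc/kubernetes/manifests-stopped")

# etcd.yaml is deliberately not listed: etcd keeps running until it is backed up.
CONTROL_PLANE_MANIFESTS = [
    "kube-apiserver.yaml",
    "kube-controller-manager.yaml",
    "kube-scheduler.yaml",
]


def park_manifests(names, manifest_dir=MANIFEST_DIR, holding_dir=HOLDING_DIR):
    """Move the named static pod manifests out of the watched directory.

    Returns the names that were actually moved, in the order given.
    """
    manifest_dir, holding_dir = Path(manifest_dir), Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        src = manifest_dir / name
        if src.exists():
            shutil.move(str(src), str(holding_dir / name))
            moved.append(name)
    return moved
```

Calling `park_manifests(CONTROL_PLANE_MANIFESTS)` as root on the master should make the kubelet tear those pods down shortly afterwards. Moving the files back into the manifest directory starts them again.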

If you have federation also stop federation-apiserver

Run a backup (snapshot) of etcd, and stop etcd once it is done.
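For the snapshot itself, something along these lines should work with the etcd v3 API. This is only a sketch: the endpoint and the certificate paths are assumptions (they match what recent kubeadm versions generate, while older setups may run etcd without TLS, in which case the three certificate flags are not needed), and the function names are made up.

```python
import os
import subprocess


def etcd_snapshot_command(snapshot_path,
                          endpoint="https://127.0.0.1:2379",
                          pki_dir="/etc/kubernetes/pki/etcd"):
    """Build the etcdctl command line that saves a snapshot to snapshot_path.

    endpoint and pki_dir are assumed defaults, not values from this cluster.
    """
    return [
        "etcdctl",
        "--endpoints", endpoint,
        "--cacert", f"{pki_dir}/ca.crt",
        "--cert", f"{pki_dir}/server.crt",
        "--key", f"{pki_dir}/server.key",
        "snapshot", "save", snapshot_path,
    ]


def take_snapshot(snapshot_path, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = etcd_snapshot_command(snapshot_path)
    print(" ".join(cmd))
    if not dry_run:
        # ETCDCTL_API=3 selects the v3 API on etcdctl versions that still default to v2.
        env = dict(os.environ, ETCDCTL_API="3")
        subprocess.run(cmd, env=env, check=True)
    return cmd
```

Copy the resulting file off the master before powering anything down, so the backup does not live only on the machine it is meant to protect.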

For each node stop

kubelet
kube-proxy

Etcd is as robust as Consul. What do you mean by unstable?

When you restore, though, having the etcd data is not enough: it is not valid immediately ... you should read up on backups in Kubernetes.

FYI - this app will do a proper backup of your cluster - https://github.com/heptio/ark — silviud, Jan 28 '18 at 14:55

score 0 · Answer 2 · answered Jan 24 '18 at 14:28

In fact, etcd is rather resilient with its journal-based approach, but, as always, you should take a backup just prior to the migration / shutdown, just to be on the safe side. If there is an issue with etcd, just recover the backup and you're good to go.
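As a rough illustration of the recovery step, here is a sketch that restores a snapshot into a fresh data directory with `etcdctl snapshot restore`. The function name and the paths in the usage note are assumptions, and newer etcd releases move this subcommand to `etcdutl`, so check which tool your version ships.

```python
import os
import subprocess
from pathlib import Path


def restore_snapshot(snapshot_path, data_dir, dry_run=True):
    """Restore an etcd snapshot into a new, not yet existing data directory.

    Returns the command line; it is executed only when dry_run is False.
    """
    if not Path(snapshot_path).is_file():
        raise FileNotFoundError(f"snapshot not found: {snapshot_path}")
    # Restoring into a directory that already exists is refused here,
    # so an old data directory is never overwritten by accident.
    if Path(data_dir).exists():
        raise FileExistsError(f"data dir already exists: {data_dir}")
    cmd = ["etcdctl", "snapshot", "restore", str(snapshot_path),
           "--data-dir", str(data_dir)]
    if not dry_run:
        env = dict(os.environ, ETCDCTL_API="3")
        subprocess.run(cmd, env=env, check=True)
    return cmd
```

For example, `restore_snapshot("/backup/etcd.db", "/var/lib/etcd-restored", dry_run=False)` would write the restored data to a new directory. etcd then has to be pointed at that directory (or the directory swapped in for the old one while etcd is stopped) before it is started again.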

As you will restart your whole cluster, the order in which you do it is not really that important. All the containers will have to start again anyway, meaning the kubelet will have to connect to a working API server.

I have no idea where you got the impression that etcd is unstable.

I did a proper shutdown of the master node, and still the etcd data was corrupt. There was no disk issue and plenty of free space left. So yeah, when data went corrupt even after a proper shutdown, that's when I got the idea... — Jeroen Jacobs, Jan 24 '18 at 15:33
