Recover from failed etcd2 / CoreOS cluster

Question

I've got a cluster of 3 CoreOS machines running on Azure. I rebooted two at the same time and the cluster failed as expected.

I need to replace the discovery token. Cloud-config is read on every boot, but according to the CoreOS docs:

Once an instance is provisioned on Azure, the cloud-config cannot be modified.

Is there a way to recover from this, short of destroying the cluster and deploying a new one?

score 1 · Answer 1 · answered Aug 11 '15 at 08:09


The configuration file is located at:

/var/lib/waagent/CustomData

You should be able to edit it with:

sudo vim /var/lib/waagent/CustomData

After a reboot, the new configuration should be picked up.

answered Aug 11 '15 at 08:09

Tadas Šubonis


score 0 · Answer 2 · answered Jul 04 '15 at 20:28

You could try modifying the etcd service definition in /run/systemd/system/etcd.service.d/20-cloudinit.conf. You should see something like:

[Service]
Environment="ETCD_ADDR=10.1.1.1:4001"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/47fabddb4eed191a09bf5b70ba93426a"
Environment="ETCD_PEER_ADDR=10.1.1.1:7001"

Change the discovery URL to your new one, then reload systemd and restart etcd:

systemctl daemon-reload
systemctl restart etcd

You will need to test if this survives a reboot on Azure though!
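The manual edit described in this answer could be scripted roughly as follows. This is only a sketch: the helper name set_discovery_url is hypothetical, it assumes GNU sed, and it assumes the drop-in contains an Environment="ETCD_DISCOVERY=..." line exactly as shown above. It does not make the change persistent.

```shell
# Hypothetical helper: rewrite the ETCD_DISCOVERY line in a systemd drop-in.
# $1 = path to the drop-in file, $2 = new discovery URL.
# Assumes the URL contains no '|' or '&' characters, which would be
# interpreted by sed.
set_discovery_url() {
  conf="$1"
  url="$2"
  sed -i "s|^Environment=\"ETCD_DISCOVERY=.*\"\$|Environment=\"ETCD_DISCOVERY=${url}\"|" "$conf"
}

# Example use (as root), followed by the reload and restart from the answer:
#   set_discovery_url /run/systemd/system/etcd.service.d/20-cloudinit.conf "<new discovery URL>"
#   systemctl daemon-reload
#   systemctl restart etcd
```

Only the ETCD_DISCOVERY line is touched; the ETCD_ADDR and ETCD_PEER_ADDR lines are left as they are.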

It doesn't survive a reboot. I might start looking for a new host. — Mark, Jul 05 '15 at 22:30

score 0 · Answer 3 · answered Dec 13 '15 at 02:50

If you take down two nodes in a three-node cluster, you lose quorum; with 3 nodes you can only afford to lose one. For more information about fault tolerance, see the CoreOS etcd documentation:

Fault Tolerance Table

It is recommended to have an odd number of members in a cluster. Having an odd cluster size doesn't change the number needed for majority, but you gain a higher tolerance for failure by adding the extra member. You can see this in practice when comparing even and odd sized clusters:
Cluster Size    Majority    Failure Tolerance
1               1           0
3               2           1
4               3           1
5               3           2
6               4           2
7               4           3
8               5           3
9               5           4

https://coreos.com/etcd/docs/latest/admin_guide.html
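The numbers in the table follow from simple integer arithmetic: the majority is floor(N/2) + 1, and the failure tolerance is N minus the majority. As a small sketch (the function names majority and failure_tolerance are made up here for illustration, not part of etcd):

```shell
# majority N: smallest number of members that still forms a quorum.
majority() {
  echo $(( $1 / 2 + 1 ))
}

# failure_tolerance N: how many members can fail while a quorum remains.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For the three-node cluster in the question, failure_tolerance 3 gives 1, so rebooting two members at once drops the cluster below quorum.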
