I'm new to CoreOS and have been experimenting with it on DigitalOcean. Let me start by saying I'm not sure whether this is a DigitalOcean issue or a CoreOS issue.


How to replicate:

  • Spin up two CoreOS droplets and link them together into a cluster through cloud-config (sketched below).
  • In the DigitalOcean dashboard, power off both droplets and resize them.
  • Power on both droplets.
  • SSH into one of the droplets.
  • Run fleetctl list-machines.
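
For context, the cloud-config was along these lines (a minimal sketch rather than my exact file; the discovery token is a placeholder you generate at https://discovery.etcd.io/new, and $private_ipv4 is substituted by cloud-init when the droplet has private networking enabled):

#cloud-config

coreos:
  etcd:
    # placeholder token; each cluster needs its own from https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start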

You should get:

2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 400ms
2015/04/22 21:05:51 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:51 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 800ms
2015/04/22 21:05:51 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/04/22 21:05:51 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 1s
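
My read is that fleetctl is simply failing to reach etcd on its default client port (4001). A quick sanity check for that:

curl -L http://127.0.0.1:4001/version
systemctl status etcd

In my case both pointed the same way: etcd wasn't running at all.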

Executing journalctl -u etcd will show:

Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'leader' to 'follower'.
Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO | f507c71154cc47b1804558c7298d0313: term #7 started.
Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'f507c71154cc47b1804558c7298d0313' to ''.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.257 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.258 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.546 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.547 INFO | f507c71154cc47b1804558c7298d0313: term #9 started.
Apr 22 14:41:14 test etcd[578]: [etcd] Apr 22 14:41:14.847 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10004 events at index 10004 completed
Apr 22 14:53:45 test etcd[578]: [etcd] Apr 22 14:53:45.297 INFO | f507c71154cc47b1804558c7298d0313: warning: heartbeat near election timeout: 359.350151ms
Apr 22 14:55:22 test etcd[578]: [etcd] Apr 22 14:55:22.381 INFO | f507c71154cc47b1804558c7298d0313: warning: heartbeat near election timeout: 1.574255587s
Apr 22 15:31:17 test etcd[578]: [etcd] Apr 22 15:31:17.551 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10001 events at index 20005 completed
Apr 22 16:19:53 test etcd[578]: [etcd] Apr 22 16:19:53.870 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 30012 completed
Apr 22 17:08:00 test etcd[578]: [etcd] Apr 22 17:08:00.254 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 40019 completed
Apr 22 17:57:30 test etcd[578]: [etcd] Apr 22 17:57:30.622 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10008 events at index 50027 completed
Apr 22 18:48:04 test etcd[578]: [etcd] Apr 22 18:48:04.084 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10008 events at index 60035 completed
Apr 22 19:38:37 test etcd[578]: [etcd] Apr 22 19:38:37.641 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 70042 completed
Apr 22 20:07:41 test etcd[578]: [etcd] Apr 22 20:07:39.493 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.282 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.895 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.899 INFO | f507c71154cc47b1804558c7298d0313: term #13 started.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.269 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.302 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.631 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.632 INFO | f507c71154cc47b1804558c7298d0313: term #15 started.
Apr 22 20:11:18 test systemd[1]: Stopping etcd...
Apr 22 20:11:18 test systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 22 20:11:18 test systemd[1]: Stopped etcd.
Apr 22 20:11:18 test systemd[1]: Unit etcd.service entered failed state.
Apr 22 20:11:18 test systemd[1]: etcd.service failed.
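
Notably, the log just stops at that shutdown (presumably the power-off before the resize); nothing at all is logged after power-on, so etcd apparently never started again. To restrict the log to the current boot:

journalctl -u etcd -b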

and systemctl cat etcd.service will show the following:

# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
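
If I'm reading the coreos-cloudinit behavior right, that is just the stock unit with no drop-in attached. Normally cloud-init translates the coreos.etcd section into a runtime drop-in under /run/systemd/system/etcd.service.d/, which systemctl cat would print after the unit file. You can check for it directly:

ls /run/systemd/system/etcd.service.d/

If that directory is missing, etcd is starting with no discovery URL or peer addresses, which matches the systemctl cat output above.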

Is this a CoreOS issue? Pretty much the whole CoreOS cluster is broken: the machines are no longer linked up, and I can't figure out how to link them back up or how to prevent this from happening. I can't find anything about this online.


1 Answer


OK, after hours and hours of debugging, I came to the conclusion that DigitalOcean doesn't re-apply the cloud-config when you power off a droplet, resize it, and power it back on. I managed to link the instances up into a cluster again by recreating the cloud-config on each server instance and running the following command:
sudo coreos-cloudinit --from-file cloud-config.yml
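
In case it helps anyone else, the full sequence on each droplet was roughly this (a sketch: cloud-config.yml is the same kind of file shown in the question, and you may need a fresh discovery token from https://discovery.etcd.io/new if the old one has gone stale):

# recreate the cloud-config the droplet originally booted with
cat > cloud-config.yml <<'EOF'
#cloud-config
coreos:
  etcd:
    # placeholder token; use your cluster's real discovery URL
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start
EOF

# re-apply it; this regenerates the runtime drop-ins and starts etcd and fleet
sudo coreos-cloudinit --from-file cloud-config.yml

(If the $private_ipv4 substitution doesn't happen on a manual run, hardcode the droplet's private IP instead.) After that, fleetctl list-machines showed both machines again.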
