I'm new to CoreOS and have been experimenting with it on DigitalOcean. To start, I'm not sure whether this is a DigitalOcean issue or a CoreOS issue.
How to replicate:
- Spin up 2 CoreOS droplets and link them together into a cluster through cloud-config.
- In the DigitalOcean dashboard, power off both droplets and resize them.
- Power on both droplets.
- SSH into one of the droplets.
- Run fleetctl list-machines.
You should get
    2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
    2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
    2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
    2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
    2015/04/22 21:05:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
    2015/04/22 21:05:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 400ms
    2015/04/22 21:05:51 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
    2015/04/22 21:05:51 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 800ms
    2015/04/22 21:05:51 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
    2015/04/22 21:05:51 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 1s
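The "connection refused" errors above suggest that nothing is listening on the etcd client port, rather than a problem in fleet itself. A minimal sketch of a check for that, assuming bash (for its /dev/tcp pseudo-device) and the address 127.0.0.1:4001 taken from the errors above; the function name etcd_port_open is only illustrative:

```shell
# Sketch: report whether anything accepts TCP connections on the etcd
# client port that fleetctl is trying to reach (127.0.0.1:4001 above).
# etcd_port_open is a hypothetical helper name, not part of CoreOS.
etcd_port_open() {
  # Host and port default to the address from the fleetctl errors.
  local host="${1:-127.0.0.1}" port="${2:-4001}"
  # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connect.
  # The subshell keeps the descriptor from leaking into the caller,
  # and its exit status tells us whether the connect succeeded.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if etcd_port_open; then
  echo "etcd client port is open"
else
  echo "etcd client port is closed; check: systemctl status etcd"
fi
```

If the port is closed, the problem is in the etcd service rather than in fleet, which matches the journal output below.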
Executing
journalctl -u etcd
will show
    Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'leader' to 'follower'.
    Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO | f507c71154cc47b1804558c7298d0313: term #7 started.
    Apr 22 14:38:02 test etcd[578]: [etcd] Apr 22 14:38:02.471 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'f507c71154cc47b1804558c7298d0313' to ''.
    Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.257 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
    Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.258 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
    Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.546 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
    Apr 22 14:38:11 test etcd[578]: [etcd] Apr 22 14:38:11.547 INFO | f507c71154cc47b1804558c7298d0313: term #9 started.
    Apr 22 14:41:14 test etcd[578]: [etcd] Apr 22 14:41:14.847 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10004 events at index 10004 completed
    Apr 22 14:53:45 test etcd[578]: [etcd] Apr 22 14:53:45.297 INFO | f507c71154cc47b1804558c7298d0313: warning: heartbeat near election timeout: 359.350151ms
    Apr 22 14:55:22 test etcd[578]: [etcd] Apr 22 14:55:22.381 INFO | f507c71154cc47b1804558c7298d0313: warning: heartbeat near election timeout: 1.574255587s
    Apr 22 15:31:17 test etcd[578]: [etcd] Apr 22 15:31:17.551 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10001 events at index 20005 completed
    Apr 22 16:19:53 test etcd[578]: [etcd] Apr 22 16:19:53.870 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 30012 completed
    Apr 22 17:08:00 test etcd[578]: [etcd] Apr 22 17:08:00.254 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 40019 completed
    Apr 22 17:57:30 test etcd[578]: [etcd] Apr 22 17:57:30.622 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10008 events at index 50027 completed
    Apr 22 18:48:04 test etcd[578]: [etcd] Apr 22 18:48:04.084 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10008 events at index 60035 completed
    Apr 22 19:38:37 test etcd[578]: [etcd] Apr 22 19:38:37.641 INFO | f507c71154cc47b1804558c7298d0313: snapshot of 10007 events at index 70042 completed
    Apr 22 20:07:41 test etcd[578]: [etcd] Apr 22 20:07:39.493 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
    Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.282 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
    Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.895 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
    Apr 22 20:07:44 test etcd[578]: [etcd] Apr 22 20:07:44.899 INFO | f507c71154cc47b1804558c7298d0313: term #13 started.
    Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.269 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'follower' to 'candidate'.
    Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.302 INFO | f507c71154cc47b1804558c7298d0313: leader changed from 'fa61f58c81fd4e7abe9ac0b6585fafef' to ''.
    Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.631 INFO | f507c71154cc47b1804558c7298d0313: state changed from 'candidate' to 'follower'.
    Apr 22 20:09:39 test etcd[578]: [etcd] Apr 22 20:09:39.632 INFO | f507c71154cc47b1804558c7298d0313: term #15 started.
    Apr 22 20:11:18 test systemd[1]: Stopping etcd...
    Apr 22 20:11:18 test systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
    Apr 22 20:11:18 test systemd[1]: Stopped etcd.
    Apr 22 20:11:18 test systemd[1]: Unit etcd.service entered failed state.
    Apr 22 20:11:18 test systemd[1]: etcd.service failed.
and systemctl cat etcd.service shows
    # /usr/lib64/systemd/system/etcd.service
    [Unit]
    Description=etcd

    [Service]
    User=etcd
    PermissionsStartOnly=true
    Environment=ETCD_DATA_DIR=/var/lib/etcd
    Environment=ETCD_NAME=%m
    ExecStart=/usr/bin/etcd
    Restart=always
    RestartSec=10s
    LimitNOFILE=40000
Is this a CoreOS issue? Pretty much the whole CoreOS cluster is broken. The machines are no longer linked up, and I can't figure out how to link them back up or how to prevent this from happening. I can't find anything about this online.