0

Really hoping somebody can help me out here. I had a CoreOS etcd2 member's disk fill up. After a reboot etcd2 was in a bad way. Ultimately I removed the /var/lib/etcd2/member data directory and went through the instructions to remove and re-add the machine to my cluster as detailed here: https://coreos.com/etcd/docs/latest/runtime-configuration.html#remove-a-member

However, while I can run etcd2 manually, attempting to start it with systemctl does not work. This is the etcd2.service:

[Unit]
Description=etcd2
Conflicts=etcd.service

[Service]
User=etcd
Type=notify
Environment=ETCD_DATA_DIR=/var/lib/etcd2
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd2
Restart=always
RestartSec=10s
LimitNOFILE=40000
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

# /run/systemd/system/etcd2.service.d/10-oem.conf
[Service]
Environment=ETCD_ELECTION_TIMEOUT=1200
# /run/systemd/system/etcd2.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADVERTISE_CLIENT_URLS=http://172.31.9.22:2379,http://172.31.9.22:4001"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/567d080563e28e62cf886e48425f632b"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://172.31.9.22:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=http://172.31.9.22:2380"
Environment="ETCD_DEBUG=true"

I added the ETCD_DEBUG=true to try and improve the logging output. Speaking of which:

Feb 26 07:23:36 geo-coreos-database-02 systemd[1]: Starting etcd2...
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://172.31.9.22:2379,http://172.31.9.22:4001
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_DEBUG=true
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/567d080563e28e62cf886e48425f632b
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_ELECTION_TIMEOUT=1200
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://172.31.9.22:2380
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
Feb 26 07:23:36 geo-coreos-database-02 etcd2[2939]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://172.31.9.22:2380
Feb 26 07:23:37 geo-coreos-database-02 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Feb 26 07:23:37 geo-coreos-database-02 systemd[1]: Failed to start etcd2.
Feb 26 07:23:37 geo-coreos-database-02 systemd[1]: etcd2.service: Unit entered failed state.
Feb 26 07:23:37 geo-coreos-database-02 systemd[1]: etcd2.service: Failed with result 'exit-code'.

Not very helpful. However when I run it manually, basing the config on the etcd2.service, the server starts and runs in the foreground without issue:

export ETCD_NAME="e0a8edc41f634fcf9451b5c68e3442bd"
export ETCD_DATA_DIR=/var/lib/etcd2 
export ETCD_ADVERTISE_CLIENT_URLS=http://172.31.9.22:2379,http://172.31.9.22:4001
export ETCD_DISCOVERY=https://discovery.etcd.io/567d080563e28e62cf886e48425f632b
export ETCD_INITIAL_ADVERTISE_PEER_URLS=http://172.31.9.22:2380
export ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
export ETCD_LISTEN_PEER_URLS=http://172.31.9.22:2380
export ETCD_DEBUG=true
etcd2 

This starts the server, with debugging as I would expect and I can even run etcdctl commands on the server. I am at an absolute loss as to how to debug this further. I absolutely don't want to create a new cluster for something so trivial, but this always seems the way with etcd, it's the only thing that every causes problems for us and they are often like this, obscure and difficult to fix.

2 Answers2

1

Of course I figure this out straight after I post a question. When I first ran the command manually, I did it as the root user. Therefore the data directory was owned by root, not the etcd user. Changing the permissions fixed the issue. Still pretty terrible that it doesn't log that, even in debug mode...

  • Opened an issue on the etcd repo: https://github.com/coreos/etcd/issues/4637 – Rob Feb 27 '16 at 19:53
  • Thank you @Rob, I should have done this myself but after a long and frustrating day, I just posted my answer here and called it a night. – David Monagle Feb 28 '16 at 04:13
0

I was using cloud-config and I had to add initial-cluster-state: existing to my cloud-config. Without this initial-cluster-state was defaulting to new and it appears that my re-added node was picking up its old GUID which had been "permanently deleted" and caused etcd2 to fail to startup.

Here's the journal log without initial-cluster-state: existing in my cloud-config:

Jun 03 11:10:36 giscoreos2 systemd[1]: Stopped etcd2.
Jun 03 11:10:36 giscoreos2 systemd[1]: Starting etcd2...
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://giscoreos2.example.com:2379
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_DISCOVERY_SRV=example-corp.us
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://giscoreos2.example.com:2380
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://giscoreos2.example.com:2380
Jun 03 11:10:36 giscoreos2 etcd2[19970]: recognized and used environment variable ETCD_NAME=giscoreos2
Jun 03 11:10:36 giscoreos2 etcd2[19970]: etcd Version: 2.3.1
Jun 03 11:10:36 giscoreos2 etcd2[19970]: Git SHA: 2b67f52
Jun 03 11:10:36 giscoreos2 etcd2[19970]: Go Version: go1.5.3
Jun 03 11:10:36 giscoreos2 etcd2[19970]: Go OS/Arch: linux/amd64
Jun 03 11:10:36 giscoreos2 etcd2[19970]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jun 03 11:10:36 giscoreos2 etcd2[19970]: the server is already initialized as member before, starting as etcd member...
Jun 03 11:10:36 giscoreos2 etcd2[19970]: got bootstrap from DNS for etcd-server at http://giscoreos3.example.com:2380
Jun 03 11:10:36 giscoreos2 etcd2[19970]: got bootstrap from DNS for etcd-server at http://giscoreos1.example.com:2380
Jun 03 11:10:36 giscoreos2 etcd2[19970]: got bootstrap from DNS for etcd-server at http://giscoreos2.example.com:2380
Jun 03 11:10:36 giscoreos2 etcd2[19970]: listening for peers on http://giscoreos2.example.com:2380
Jun 03 11:10:36 giscoreos2 etcd2[19970]: listening for client requests on http://0.0.0.0:2379
Jun 03 11:10:36 giscoreos2 etcd2[19970]: name = giscoreos2
Jun 03 11:10:36 giscoreos2 etcd2[19970]: data dir = /var/lib/etcd2
Jun 03 11:10:36 giscoreos2 etcd2[19970]: member dir = /var/lib/etcd2/member
Jun 03 11:10:36 giscoreos2 etcd2[19970]: heartbeat = 100ms
Jun 03 11:10:36 giscoreos2 etcd2[19970]: election = 1000ms
Jun 03 11:10:36 giscoreos2 etcd2[19970]: snapshot count = 10000
Jun 03 11:10:36 giscoreos2 etcd2[19970]: advertise client URLs = http://giscoreos2.example.com:2379
Jun 03 11:10:36 giscoreos2 etcd2[19970]: restarting member d5f2eb850214f772 in cluster e1013a21e485d6ec at commit index 3
Jun 03 11:10:36 giscoreos2 etcd2[19970]: d5f2eb850214f772 became follower at term 242
Jun 03 11:10:36 giscoreos2 etcd2[19970]: newRaft d5f2eb850214f772 [peers: [], term: 242, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
Jun 03 11:10:36 giscoreos2 etcd2[19970]: starting server... [version: 2.3.1, cluster version: to_be_decided]
Jun 03 11:10:36 giscoreos2 systemd[1]: Started etcd2.
Jun 03 11:10:36 giscoreos2 etcd2[19970]: failed to find member c7f5106228d2b8a7 in cluster e1013a21e485d6ec
Jun 03 11:10:36 giscoreos2 etcd2[19970]: failed to find member c7f5106228d2b8a7 in cluster e1013a21e485d6ec
Jun 03 11:10:36 giscoreos2 etcd2[19970]: added member 2350c0e41172376a [http://giscoreos1.example.com:2380] to cluster e1013a21e485d6ec
Jun 03 11:10:36 giscoreos2 systemd[1]: Starting Network fabric for containers...
Jun 03 11:10:36 giscoreos2 etcd2[19970]: the member has been permanently removed from the cluster
Jun 03 11:10:36 giscoreos2 etcd2[19970]: the data-dir used by this member must be removed.
Jun 03 11:10:36 giscoreos2 etcd2[19970]: streaming request ignored (ID mismatch got b584365ea9e04f4d want d5f2eb850214f772)
Jun 03 11:10:36 giscoreos2 etcd2[19970]: streaming request ignored (ID mismatch got b584365ea9e04f4d want d5f2eb850214f772)
Jun 03 11:10:36 giscoreos2 etcd2[19970]: added member c7f5106228d2b8a7 [http://giscoreos3.example.com:2380] to cluster e1013a21e485d6ec
Jun 03 11:10:36 giscoreos2 etcd2[19970]: added local member d5f2eb850214f772 [http://giscoreos2.example.com:2380] to cluster e1013a21e485d6ec
Jun 03 11:10:36 giscoreos2 etcd2[19970]: aborting publish because server is stopped

Here's my journal with initial-cluster-state: existing in my cloud-config:

Jun 03 11:19:02 giscoreos2 systemd[1]: Starting etcd2...
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://giscoreos2.example.com:2379
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_DISCOVERY_SRV=exammple-corp.us
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://giscoreos2.example.com:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://giscoreos2.example.com:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: recognized and used environment variable ETCD_NAME=giscoreos2
Jun 03 11:19:02 giscoreos2 etcd2[20222]: etcd Version: 2.3.1
Jun 03 11:19:02 giscoreos2 etcd2[20222]: Git SHA: 2b67f52
Jun 03 11:19:02 giscoreos2 etcd2[20222]: Go Version: go1.5.3
Jun 03 11:19:02 giscoreos2 etcd2[20222]: Go OS/Arch: linux/amd64
Jun 03 11:19:02 giscoreos2 etcd2[20222]: setting maximum number of CPUs to 16, total number of available CPUs is 16
Jun 03 11:19:02 giscoreos2 etcd2[20222]: got bootstrap from DNS for etcd-server at http://giscoreos1.example.com:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: got bootstrap from DNS for etcd-server at http://giscoreos2.example.com:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: got bootstrap from DNS for etcd-server at http://giscoreos3.example.com:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: listening for peers on http://giscoreos2.example.com:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: listening for client requests on http://0.0.0.0:2379
Jun 03 11:19:02 giscoreos2 etcd2[20222]: resolving giscoreos1.example.com:2380 to 10.240.160.152:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: resolving giscoreos1.example.com:2380 to 10.240.160.152:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: resolving giscoreos2.example.com:2380 to 10.240.160.57:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: resolving giscoreos2.example.com:2380 to 10.240.160.57:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: resolving giscoreos3.example.com:2380 to 10.240.160.6:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: resolving giscoreos3.example.com:2380 to 10.240.160.6:2380
Jun 03 11:19:02 giscoreos2 etcd2[20222]: name = giscoreos2
Jun 03 11:19:02 giscoreos2 etcd2[20222]: data dir = /var/lib/etcd2
Jun 03 11:19:02 giscoreos2 etcd2[20222]: member dir = /var/lib/etcd2/member
Jun 03 11:19:02 giscoreos2 etcd2[20222]: heartbeat = 100ms
Jun 03 11:19:02 giscoreos2 etcd2[20222]: election = 1000ms
Jun 03 11:19:02 giscoreos2 etcd2[20222]: snapshot count = 10000
Jun 03 11:19:02 giscoreos2 etcd2[20222]: advertise client URLs = http://giscoreos2.example.com:2379
Jun 03 11:19:02 giscoreos2 etcd2[20222]: starting member b584365ea9e04f4d in cluster e1013a21e485d6ec
Jun 03 11:19:02 giscoreos2 etcd2[20222]: b584365ea9e04f4d became follower at term 0
Jun 03 11:19:02 giscoreos2 etcd2[20222]: newRaft b584365ea9e04f4d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Jun 03 11:19:02 giscoreos2 etcd2[20222]: b584365ea9e04f4d became follower at term 1
Jun 03 11:19:02 giscoreos2 etcd2[20222]: the connection with 2350c0e41172376a became active
Jun 03 11:19:02 giscoreos2 etcd2[20222]: starting server... [version: 2.3.1, cluster version: to_be_decided]
Jun 03 11:19:02 giscoreos2 systemd[1]: Started etcd2.
HeatfanJohn
  • 366
  • 5
  • 14