API-Server on master stops after adding second control-plane

Question

In my current test setup I have several VMs running Debian 11. All nodes have a private IP and a second WireGuard interface. In the future the nodes will be in different locations with different networks, and WireGuard is used to "overlay" all the different network environments. I want to install Kubernetes on all nodes.

node   private ip       wireguard ip
vm1    192.168.10.10    10.11.12.10
vm2    192.168.10.11    10.11.12.11
vm3    192.168.10.12    10.11.12.12
...

So I've installed Docker and kubeadm/kubelet/kubectl in version 1.23.5 on all nodes. I've also installed HAProxy on all nodes. It works as a load balancer by listening on localhost:443 and forwarding the requests to one of the online control planes.

Then I started the cluster with kubeadm:

vm01> kubeadm init --apiserver-advertise-address=10.11.12.10 --pod-network-cidr=10.20.0.0/16

After that I tested integrating either flannel or calico, in each case bound to the WireGuard interface: for flannel by adding --iface=<wireguard-interface>, for calico by setting ...nodeAddressAutodetectionV4.interface: <wireguard-interface> in the custom manifest.

When I add a normal node - everything is fine. The node is added, pods are created and the communication is done via the defined network interface.

When the cluster is set up without the WireGuard interface, I can also add further control planes with

vm2> kubeadm join 127.0.0.1:443 --token ... --discovery-token-ca-cert-hash sha256:...  --control-plane

Of course, before that I copied several files from /etc/kubernetes/pki on vm01 to vm02: ca.*, sa.*, front-proxy-ca.*, apiserver-kubelet-client.* and etcd/ca.*.
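The copy step can be sketched as a dry run like the one below. The patterns (ca.*, sa.*, ...) come from the question; expanding them to .crt/.key/.pub names assumes kubeadm's default file names, and the pki_files function and the TARGET destination are hypothetical names chosen for illustration. The script only prints the scp commands, it does not run them.

```shell
#!/bin/sh
# Sketch: list the PKI files the question says were copied from vm01 to vm02.
# Assumption: the globs expand to kubeadm's default .crt/.key/.pub file names.
pki_files() {
    for f in ca.crt ca.key sa.key sa.pub \
             front-proxy-ca.crt front-proxy-ca.key \
             apiserver-kubelet-client.crt apiserver-kubelet-client.key \
             etcd/ca.crt etcd/ca.key; do
        echo "/etc/kubernetes/pki/$f"
    done
}

# Hypothetical SSH destination (vm2's WireGuard IP from the table above).
TARGET="root@10.11.12.11"

# Dry run: print one scp command per file instead of executing it.
# Note: /etc/kubernetes/pki/etcd must already exist on the target.
pki_files | while read -r path; do
    echo "scp $path $TARGET:$path"
done
```

Removing the echo in the loop would perform the actual copy, provided root SSH access to the target is available.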

But when I use the flannel or calico network together with the wireguard interface, something strange happens after the join command.

root@vm02:~# kubeadm join 127.0.0.1:443 --token nwevkx.tzm37tb4qx3wg2jz --discovery-token-ca-cert-hash sha256:9a97a5846ad823647ccb1892971c5f0004043d88f62328d051a31ce8b697ad4a --control-plane
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost mimas] and IPs [192.168.10.11 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost mimas] and IPs [192.168.10.11 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local mimas] and IPs [10.96.0.1 192.168.10.11 127.0.0.1]
[certs] Using the existing "apiserver-kubelet-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
To see the stack trace of this error execute with --v=5 or higher

After that timeout, the API server on vm01 stops working as well: I cannot run any kubeadm or kubectl commands anymore, and the HTTPS service on port 6443 is dead. I neither understand why the API server on vm01 stops working when a second API server is added, nor can I find a reason why the output mentions the 192.168.... IPs, because the cluster should communicate only via the 10.11.12.0/24 WireGuard network.

score 0 · Answer 1 · answered Apr 05 '22 at 09:41

After finding a similar problem in https://stackoverflow.com/questions/64227042/setting-up-a-kubernetes-master-on-a-different-ip I think the same solution applies here. When I add --apiserver-advertise-address=<this-wireguard-ip> to the join command, the output changes (no 192.168.. IP) and the node joins. What I still don't understand is why the API server on vm01 stops working.
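As a minimal sketch, the adjusted join command for each additional control plane could be assembled like this. The join_cmd function is a hypothetical helper, <token> and <hash> are placeholders for the real values, and the IPs are the WireGuard addresses from the table in the question. The script prints the commands rather than executing them.

```shell
#!/bin/sh
# Build (but do not run) the control-plane join command for one node.
# $1 is that node's WireGuard IP; <token> and <hash> are placeholders.
join_cmd() {
    echo "kubeadm join 127.0.0.1:443 --token <token>" \
         "--discovery-token-ca-cert-hash sha256:<hash>" \
         "--control-plane --apiserver-advertise-address=$1"
}

join_cmd 10.11.12.11   # vm2
join_cmd 10.11.12.12   # vm3
```

With the flag set, the [certs] lines in the join output should list the WireGuard IP instead of the 192.168.10.x address.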

Whatever the join command is doing under the hood, it needs to create an etcd service on the second control plane, and that service must also run on the same IP as the flannel/calico network interface. When the primary network interface is used, this parameter is not necessary on the second/third control plane.
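One plausible explanation for the API server on vm01 stopping, offered as an assumption rather than something the log proves: the join announces the new etcd member before that member is healthy, so the etcd cluster grows from one member to two. etcd only accepts writes while a majority of members is reachable, and if the new member cannot be reached on the address it advertised, the original member loses its majority and the API server that depends on it fails. The quorum arithmetic can be sketched as follows (the quorum function is a hypothetical helper):

```shell
#!/bin/sh
# etcd needs a majority of members: quorum = floor(n / 2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

for members in 1 2 3; do
    q=$(quorum "$members")
    # Members that may fail while the cluster stays writable.
    tolerated=$(( members - q ))
    echo "members=$members quorum=$q tolerated_failures=$tolerated"
done
```

A two-member cluster tolerates no failures, so a second member that never becomes reachable is enough to take down the first one.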
