Connection timeouts when scaling more than one pod instance in Kubernetes

Question

I'm running Kubernetes with flannel on a local ESXi server with 3 VMs: a master and two nodes. All of them run Kubernetes 1.15.5, Ubuntu 18.04, and Docker 18.09.7. It is a greenfield install.

Nginx runs fine with a single pod on either node, but when scaling to two pods, curl randomly hangs for a long time and then fails with a connection timeout.

kubectl apply -f nginx.yaml

  deployment.apps/nginx configured
  service/nginx unchanged

cat nginx.yaml

  apiVersion: apps/v1beta2
  kind: Deployment
  metadata:
    name: nginx
  spec:
    selector:
      matchLabels:
        app: nginx
    replicas: 1
    template:
      metadata:
        labels:
          app: nginx
      spec:
        containers:
        - name: nginx
          image: nginx:1
          ports:
          - name: http
            containerPort: 80

  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: nginx
  spec:
    ports:
    - name: http
      nodePort: 32000
      port: 80
      protocol: TCP
      targetPort: 80
    selector:
      app: nginx
    type: NodePort

kubectl get services,pods,deployments,daemonsets -o wide

  NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE     SELECTOR
  service/kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP        6d17h   <none>
  service/nginx        NodePort    10.102.48.211   <none>        80:32000/TCP   45m     app=nginx

  NAME                         READY   STATUS    RESTARTS   AGE   IP          NODE          NOMINATED NODE   READINESS GATES
  pod/nginx-6d4fbdf4df-q7jdt   1/1     Running   0          45m   10.10.2.6   kubernetes3   <none>           <none>

  NAME                          READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES    SELECTOR
  deployment.extensions/nginx   1/1     1            1           45m   nginx        nginx:1   app=nginx

curl http://kubernetes3:32000 returns the nginx page

curl http://kubernetes2:32000 returns a connection timeout.

Scaling up to two pods

kubectl scale --replicas=2 deployment nginx

  deployment.extensions/nginx scaled

kubectl get services,pods,deployments,daemonsets -o wide

  NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE     SELECTOR
  service/kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP        6d17h   <none>
  service/nginx        NodePort    10.102.48.211   <none>        80:32000/TCP   48m     app=nginx

  NAME                         READY   STATUS    RESTARTS   AGE   IP          NODE          NOMINATED NODE   READINESS GATES
  pod/nginx-6d4fbdf4df-q7jdt   1/1     Running   0          48m   10.10.2.6   kubernetes3   <none>           <none>
  pod/nginx-6d4fbdf4df-zg2n5   1/1     Running   0          42s   10.10.1.5   kubernetes2   <none>           <none>

  NAME                          READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES    SELECTOR
  deployment.extensions/nginx   2/2     2            2           48m   nginx        nginx:1   app=nginx

curl http://kubernetes3:32000 works about half the time, and curl http://kubernetes2:32000 works almost half of the time. The rest of the time, I get a connection timeout. I get the same result if I run the commands from node 3 or node 2. Telnet hits the same random timeouts, although the ports are listening on all nodes and I have full connectivity between all nodes.

  curl: (7) Failed to connect to kubernetes2 port 32000: Connection timed out

telnet -d kubernetes3 32000

Trying <IP>...
setsockopt (SO_DEBUG): Permission denied

kubernetes3:~$ netstat -tulpn | grep 3200

(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp6       0      0 :::32000                :::*                    LISTEN      -

Why am I getting these timeouts when I scale up to two or more instances?

Dave Brunkow · Answer 1 · 2019-11-17T04:51:33.313

The flannel network CIDR and the pod network CIDR the cluster was initialized with were different.

I initialized the cluster with

kubeadm init --pod-network-cidr=10.10.0.0/16

But I applied the stock flannel manifest, which comes with a network of

"Network": "10.244.0.0/16"

My fix was to download the flannel file, then remove the installation.

kubectl delete -f kube-flannel.yml

I then modified kube-flannel.yml so that its network matched the CIDR I had initialized the cluster with.

"Network": "10.10.0.0/16"

Lastly, re-install flannel.

kubectl apply -f kube-flannel.yml
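
The manual edit of the "Network" value described above could also be scripted. This is only a minimal sketch: it assumes the manifest has been downloaded locally and contains a single "Network": "..." entry, as the stock file did here. The helper name set_flannel_network is made up for illustration and is not part of flannel or kubectl.

```shell
# Hypothetical helper: rewrite the "Network" value in a local flannel manifest.
# Usage: set_flannel_network <manifest-file> <new-cidr>
set_flannel_network() {
  manifest="$1"
  new_cidr="$2"
  tmp="$(mktemp)" || return 1
  # '#' is the sed delimiter because the CIDR itself contains '/'.
  # Only the quoted value after "Network": is replaced; the rest of the line is kept.
  if sed 's#\("Network"[[:space:]]*:[[:space:]]*"\)[^"]*"#\1'"$new_cidr"'"#' "$manifest" > "$tmp"; then
    # Write back through the original file so its permissions are preserved.
    cat "$tmp" > "$manifest"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

With the file name and CIDR from this answer, that would be called as set_flannel_network kube-flannel.yml 10.10.0.0/16 before re-applying the manifest.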

Now I can run the curl commands without error. There are other ways to do this, such as going with the subnet provided by flannel, but this was the easiest way I found.
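
To catch this kind of mismatch before applying a manifest, the "Network" value in the local file can be compared with the CIDR that was passed to kubeadm init. The following is a rough sketch under the same assumptions as above (a locally saved manifest with one "Network" entry). The function names flannel_network and check_cidr_match are hypothetical.

```shell
# Hypothetical helper: print the first "Network" value found in a flannel manifest.
flannel_network() {
  sed -n 's/.*"Network"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if the manifest's Network equals the expected CIDR.
# Usage: check_cidr_match <manifest-file> <cidr-used-with-kubeadm-init>
check_cidr_match() {
  actual="$(flannel_network "$1")"
  expected="$2"
  if [ "$actual" = "$expected" ]; then
    echo "OK: flannel Network matches $expected"
  else
    echo "MISMATCH: flannel Network is '$actual', cluster was initialized with '$expected'"
    return 1
  fi
}
```

Running check_cidr_match kube-flannel.yml 10.10.0.0/16 against the stock manifest from this answer would report a mismatch, since that file ships with 10.244.0.0/16.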
