
I'm having a very odd, reproducible issue with Docker Swarm. I'm attempting to deploy a Dgraph server cluster via Docker Swarm across four LXD containers. For context, Dgraph Zero is the control server, and each Dgraph Alpha server does the heavy lifting and must connect to a Zero server on launch. Ratel is just a web UI server for database queries and mutations. The topology looks like this:

  • Host: KDE Neon workstation
    • LXD Container: zero
      • Docker Node: dg-zero
        • Docker Container: dgraph_zero
        • Docker Container: dgraph_ratel
    • LXD Container: alpha1
      • Docker Node: dg-alpha1
        • Docker Container: dgraph_alpha1
    • LXD Container: alpha2
      • Docker Node: dg-alpha2
        • Docker Container: dgraph_alpha2
    • LXD Container: alpha3
      • Docker Node: dg-alpha3
        • Docker Container: dgraph_alpha3

These are all deployed as a swarm stack via docker stack deploy using the following docker-compose.yml config:

version: "3"
networks:
  dgraph:
services:
  zero:
    image: dgraph/dgraph:latest
    hostname: "zero"
    volumes:
      - data-volume:/dgraph
    ports:
      - "5080:5080"
      - "6080:6080"
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == dg-zero
    command: dgraph zero --my=zero:5080 --replicas 3 --bindall=true
  alpha1:
    image: dgraph/dgraph:latest
    hostname: "alpha1"
    volumes:
      - data-volume:/dgraph
    ports:
      - "8080:8080"
      - "9080:9080"
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == dg-alpha1
    command: dgraph alpha --my=alpha1:7080 --lru_mb=1024 --zero=zero:5080 --bindall=true
  alpha2:
    image: dgraph/dgraph:latest
    hostname: "alpha2"
    volumes:
      - data-volume:/dgraph
    ports:
      - "8081:8081"
      - "9081:9081"
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == dg-alpha2
    command: dgraph alpha --my=alpha2:7081 --lru_mb=1024 --zero=zero:5080 -o 1 --bindall=true
  alpha3:
    image: dgraph/dgraph:latest
    hostname: "alpha3"
    volumes:
      - data-volume:/dgraph
    ports:
      - "8082:8082"
      - "9082:9082"
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == dg-alpha3
    command: dgraph alpha --my=alpha3:7082 --lru_mb=1024 --zero=zero:5080 -o 2 --bindall=true
  ratel:
    image: dgraph/dgraph:latest
    hostname: "ratel"
    ports:
      - "8000:8000"
    networks:
      - dgraph
    command: dgraph-ratel
volumes:
  data-volume:
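
For reference, a stack like the one above would be deployed with something along these lines (the stack name `dgraph` is assumed here, consistent with the `dgraph_dgraph` network name that shows up in the inspect output below):

```shell
# Deploy (or update) the stack from a manager node:
docker stack deploy -c docker-compose.yml dgraph

# Verify all services were scheduled and have running replicas:
docker stack services dgraph
```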

All of the services deploy and run correctly. The problem is that they cannot communicate with one another, because the hostnames do not resolve to the correct IPs.

I have a swarm currently running, and when I run docker container inspect on my running Zero container I get the following output for the network config:

"NetworkSettings": {
            "Bridge": "",
            "SandboxID": "4548013a15833d086b281a8c2dd61ced6ea5c92f815a305f7337effe9b04a13a",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {
                "8080/tcp": null,
                "9080/tcp": null
            },
            "SandboxKey": "/var/run/docker/netns/4548013a1583",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "dgraph_dgraph": {
                    "IPAMConfig": {
                        "IPv4Address": "10.0.9.3"
                    },
                    "Links": null,
                    "Aliases": [
                        "8b48711ab0cd"
                    ],
                    "NetworkID": "lve3kr9vm42rwu1nci897zey7",
                    "EndpointID": "056ae62475da805ec212d9ec2b2e4a5c9e09e2405c15ad6e8b298e90669b512d",
                    "Gateway": "",
                    "IPAddress": "10.0.9.3",
                    "IPPrefixLen": 24,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:0a:00:09:03",
                    "DriverOpts": null
                },
                "ingress": {
                    "IPAMConfig": {
                        "IPv4Address": "10.0.0.157"
                    },
                    "Links": null,
                    "Aliases": [
                        "8b48711ab0cd"
                    ],
                    "NetworkID": "vjhpbsc1766lbvtu169fmh81l",
                    "EndpointID": "29bbc4de97e98b2e05a46dd42020dd1fbb75ff07d8c08a00b8ba6f2f4e00ec2a",
                    "Gateway": "",
                    "IPAddress": "10.0.0.157",
                    "IPPrefixLen": 24,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:0a:00:00:9d",
                    "DriverOpts": null
                }
            }
        }

As can be seen, the IP address of my Zero container is 10.0.9.3. However, if I log into the shell of any of my Alpha containers and run ping zero, it attempts to ping 10.0.9.2. This is consistently reproducible by destroying and re-deploying the stack: the final octet of my Zero container's IP address is always one greater than the IP that the hostname zero resolves to in the other three containers. I can ping all of my containers from within one another using the correct IP addresses, but I must use hostnames, because I don't know what the other containers' addresses will be before swarm creation and I need to configure the cluster members to speak to each other.
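
For what it's worth, Swarm's embedded DNS exposes two names per service: the bare service name resolves to the service's virtual IP (VIP), while `tasks.<service>` resolves to the individual task/container IPs. A quick check from inside one of the Alpha containers (a diagnostic sketch; assumes nslookup is available in the image):

```shell
# Compare what the two DNS names resolve to on the overlay network:
nslookup zero        # bare service name -> the service's virtual IP
nslookup tasks.zero  # tasks.<service>  -> the actual container IP(s)
```

If `zero` resolves to 10.0.9.2 while `tasks.zero` resolves to 10.0.9.3, the off-by-one is expected: the DNS is returning the VIP as designed, and the question becomes why traffic to the VIP isn't being forwarded to the container.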

There are no firewall rules in place. My LXD containers can all communicate with each other, and all LXD container IP addresses are listed as peers in the Docker network config. I don't know if it's related, but I'm also unable to reach my services from the host even though I've published ports. If that might be related, I can provide more of what I've found, but for now I'll leave it out.

How do I figure out why my swarm isn't resolving hostnames to the correct container IPs?

Docker version is 19.03.5 from the official repo. LXD containers are official 18.04 cloud containers from Ubuntu's LXD image server. Host is KDE Neon based on Ubuntu 18.04.

Chris
  • Hi Chris, Did you manage to solve this issue? – Fernando Martin May 21 '20 at 15:39
  • By default, a swarm service uses `endpoint_mode: vip`. The DNS name of a service resolves to its virtual IP, and incoming traffic is balanced across all the service tasks (~the replica containers) of the service. If you configure `endpoint_mode: dnsrr` instead, the DNS name returns the IPs of the service tasks in round-robin order. See https://docs.docker.com/compose/compose-file/compose-file-v3/#endpoint_mode for details on how to configure it in your compose file. – Metin Jan 14 '21 at 22:12
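
As a sketch of the comment above, `endpoint_mode` goes under `deploy` (compose file format 3.3 or later is assumed; shown here for the zero service only):

```yaml
services:
  zero:
    # ... image, volumes, ports, networks as before ...
    deploy:
      endpoint_mode: dnsrr   # service name resolves directly to task IPs
      placement:
        constraints:
          - node.hostname == dg-zero
```

One caveat: `dnsrr` is not compatible with ingress-mode published ports, so any `ports:` entries would need `mode: host` for this to deploy.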
