
I'm running on a dedicated server with Ubuntu 20.04.3 LTS (kernel 5.4.0-96-generic) and Docker 20.10.7 (build 20.10.7-0ubuntu5~20.04.2). The system is a fresh install.

I have a Dockerfile for one of my services, which pulls in some libraries with apt and go get. One of the intermediate containers always fails to connect to the internet, with either DNS or TCP timeout errors. Which container fails is completely random.

Also note that the problem is not specific to one service: I tried building a completely different service that runs on Node.js, and its npm install failed with the same errors.

Today the problem also hit my Nginx container, which became unreachable: all connections to it resulted in timeout errors.

Connections between containers using docker networks also don't work correctly.

Running `sudo systemctl restart docker` temporarily fixes the problem, but it reappears one or two builds later. When I build with the host network instead of the default bridge network, the problem is gone, which is why I suspect a faulty bridge configuration.
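For reference, building on the host network means bypassing the docker0 bridge for the build-time RUN steps, roughly like this (image tag is a placeholder):

```shell
# Build using the host's network stack instead of the default bridge;
# apt/go get/npm install in RUN steps then resolve and connect via the host.
docker build --network host -t myservice .
```

If this consistently succeeds while the default bridge build fails, that points at the bridge or its NAT/firewall path rather than at the Dockerfile itself.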

I've tried reinstalling Docker, resetting the iptables and bridge configs, and setting different DNS servers, all to no avail. The Docker log files show no errors.

What could be the cause of this issue?

Update:

I've disabled UFW, but without success. This is a dump from my dmesg log during a build that timed out; maybe it helps identify the cause:

[758001.967161] docker0: port 1(vethd0c7887) entered blocking state
[758001.967165] docker0: port 1(vethd0c7887) entered disabled state
[758001.967281] device vethd0c7887 entered promiscuous mode
[758002.000567] IPv6: ADDRCONF(NETDEV_CHANGE): veth7e3840a: link becomes ready
[758002.000621] IPv6: ADDRCONF(NETDEV_CHANGE): vethd0c7887: link becomes ready
[758002.000644] docker0: port 1(vethd0c7887) entered blocking state
[758002.000646] docker0: port 1(vethd0c7887) entered forwarding state
[758002.268554] docker0: port 1(vethd0c7887) entered disabled state
[758002.269581] eth0: renamed from veth7e3840a
[758002.293056] docker0: port 1(vethd0c7887) entered blocking state
[758002.293063] docker0: port 1(vethd0c7887) entered forwarding state
[758041.497891] docker0: port 1(vethd0c7887) entered disabled state
[758041.497997] veth7e3840a: renamed from eth0
[758041.547558] docker0: port 1(vethd0c7887) entered disabled state
[758041.551998] device vethd0c7887 left promiscuous mode
[758041.552008] docker0: port 1(vethd0c7887) entered disabled state
Twometer
  • Just a random guess, but could you also check your firewall service, see if there are any failures in there, and disable it and retry if required? I recently faced a similar issue with DNS resolution in a Kubernetes cluster, for which I had to disable the firewalld service completely. – sb9 Feb 01 '22 at 09:48
  • @sb9 I have some `dmesg` logs saying that UFW blocked some bridge connections. I disabled the UFW completely and restarted dockerd, but my docker builds still time out :( – Twometer Feb 01 '22 at 10:03
  • OK, please try checking with a dnsutils image: do an nslookup for any FQDN from within a container and from the host, and see if the results are the same: `docker run -it tutum/dnsutils nslookup` / `docker run -it tutum/dnsutils dig`. Also, do you have SELinux enabled on your Ubuntu machine? If so, disable it and restart your machine. Not sure if that might cause any issues. – sb9 Feb 01 '22 at 10:41
  • @sb9 sorry for the late reply, I had some stress going on. I've checked; SELinux is disabled on my machine. I've tried restarting, but that didn't help either. I did the tests you proposed, these are my results: https://pastebin.com/u3RTgxww - it seems to work for just one container after a restart – Twometer Feb 13 '22 at 00:30
  • @sb9 I dug around a bit and found out that after the first request, my `docker0` network loses its IPv4 address and is therefore unable to receive any more packets. I confirmed this by running `sudo ifconfig docker0 172.17.0.1`, which fixes the issue temporarily. – Twometer Feb 13 '22 at 01:41

2 Answers


If you have these in dmesg:

[15300.615904] neighbour: arp_cache: neighbor table overflow!

try this:

sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=30000
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=20000
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=10000
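Note that `sysctl -w` only changes the running kernel. If raising the thresholds helps, they can be persisted across reboots with a drop-in file (the file name below is an arbitrary convention, not required):

```ini
# /etc/sysctl.d/90-neigh-gc.conf
# Raise the ARP/neighbour table garbage-collection thresholds
net.ipv4.neigh.default.gc_thresh1 = 10000
net.ipv4.neigh.default.gc_thresh2 = 20000
net.ipv4.neigh.default.gc_thresh3 = 30000
```

Apply it without rebooting via `sudo sysctl --system`.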
Ole Tange

Finally, after a lot of digging around, I found the issue:

My docker0 network was losing its IPv4 address after the first request terminated, and was therefore unable to communicate with the rest of the internet.

This issue comment on GitHub finally fixed it for me: moby#40217. My systemd-networkd was managing the docker0 interface, and somehow the carrier-loss check was triggered, which then caused networkd to remove the IPv4 address. Marking the docker0 and br-* interfaces as unmanaged finally made everything work correctly.
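As a sketch of what "marking them as unmanaged" looks like in practice (file name is arbitrary; the `Unmanaged=yes` stanza is what matters):

```ini
# /etc/systemd/network/10-docker-unmanaged.network
# Tell systemd-networkd to leave Docker's bridge interfaces alone,
# so its carrier-loss handling never strips their IPv4 addresses.
[Match]
Name=docker0 br-*

[Link]
Unmanaged=yes
```

Then restart networkd with `sudo systemctl restart systemd-networkd` and verify with `networkctl list` that docker0 shows up as "unmanaged".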

Twometer