2

I'm running an Ubuntu 18.04 Virtual Machine hosted by Nutanix. Every so often, I will see an issue where a docker network will be created with docker-compose, and then the machine will go completely unresponsive. I've only seen this three times, with many months in between, so it's been difficult to find a pattern.

The last command I ran to bring about this issue was:

docker-compose -f /path/to/compose.yml up

Which started creating a network and then failed:

Creating network "compose_kong-ee" with the default driver
packet_write_wait: Connection to 10.120.160.100 port 22: Broken pipe

Now, if I restart the machine without stopping the docker daemon, the system will crash. Looking at kern.log, I see a br-a249 interface that was disabled right around the time that this occurred (~12:30 utc):

Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.561962] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.565381] veth6a978dc: renamed from eth0
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.583094] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.590855] device vethe00f673 left promiscuous mode
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.590860] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun  3 12:29:15 USDALXKADV01 kernel: [1397500.520269] IPv6: ADDRCONF(NETDEV_UP): br-342fcad19ff7: link is not ready
Jun  3 12:48:52 USDALXKADV01 kernel: [1398677.266687] vmxnet3 0000:03:00.0 ens160: intr type 3, mode 0, 2 vectors allocated
Jun  3 12:48:52 USDALXKADV01 kernel: [1398677.268869] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
Jun  3 12:48:52 USDALXKADV01 kernel: [1398677.271911] IPv6: ADDRCONF(NETDEV_UP): ens160: link is not ready
Jun  3 12:48:52 USDALXKADV01 kernel: [1398677.271929] IPv6: ADDRCONF(NETDEV_CHANGE): ens160: link becomes ready
Jun  3 12:50:57 USDALXKADV01 kernel: [1398801.703190] br-a249e56f08c5: port 2(vetha6450a1) entered disabled state
Jun  3 12:50:57 USDALXKADV01 kernel: [1398801.705254] veth1c11b8c: renamed from eth0
Jun  3 12:50:57 USDALXKADV01 kernel: [1398801.718831] br-a249e56f08c5: port 2(vetha6450a1) entered disabled state
Jun  3 12:50:57 USDALXKADV01 kernel: [1398801.726647] device vetha6450a1 left promiscuous mode
Jun  3 12:50:57 USDALXKADV01 kernel: [1398801.726652] br-a249e56f08c5: port 2(vetha6450a1) entered disabled state

And this interface was up when I shelled in, which correspond to one of two container I was running at the time. I issued docker stop to one of them to facilitate an upgrade:

  IP address for docker0:         172.17.0.1
  IP address for br-a14bcb10b447: 172.18.0.1
  IP address for br-a249e56f08c5: 172.22.0.1

Looking at syslog entries, I can see the time span of my entire ssh session:

Jun  3 12:22:33 USDALXKADV01 systemd[1]: Started Session 468 of user kong.
Jun  3 12:26:27 USDALXKADV01 containerd[1468]: time="2020-06-03T12:26:27.921166420Z" level=info msg="shim reaped" id=f3a678f2747a3398a15dd605299d1b18b9d173e6c68d4bb8c8c44e7a56c2ed2a
Jun  3 12:26:27 USDALXKADV01 dockerd[12217]: time="2020-06-03T12:26:27.933328211Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.561962] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun  3 12:26:28 USDALXKADV01 systemd-networkd[1398]: vethe00f673: Lost carrier
Jun  3 12:26:28 USDALXKADV01 systemd-timesyncd[499]: Network configuration changed, trying to establish connection.
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.565381] veth6a978dc: renamed from eth0
Jun  3 12:26:28 USDALXKADV01 systemd-udevd[14938]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.583094] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun  3 12:26:28 USDALXKADV01 systemd-networkd[1398]: vethe00f673: Link DOWN
Jun  3 12:26:28 USDALXKADV01 networkd-dispatcher[763]: WARNING:Unknown index 173 seen, reloading interface list
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.590855] device vethe00f673 left promiscuous mode
Jun  3 12:26:28 USDALXKADV01 kernel: [1397332.590860] br-a249e56f08c5: port 1(vethe00f673) entered disabled state
Jun  3 12:26:28 USDALXKADV01 networkd-dispatcher[763]: ERROR:Unknown interface index 173 seen even after reload
Jun  3 12:26:28 USDALXKADV01 systemd-timesyncd[499]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Jun  3 12:26:28 USDALXKADV01 systemd[1]: Starting OpenNebula delayed reconfiguration script...
Jun  3 12:26:28 USDALXKADV01 systemd[1]: Started OpenNebula delayed reconfiguration script.
Jun  3 12:27:28 USDALXKADV01 one-contextd[15036]: Started for type all to reconfigure
Jun  3 12:27:28 USDALXKADV01 one-contextd[15040]: Acquiring lock /var/run/one-context/one-context.lock
Jun  3 12:27:28 USDALXKADV01 one-contextd[15042]: Acquired lock /var/run/one-context/one-context.lock
Jun  3 12:27:29 USDALXKADV01 one-contextd[15055]: Reading context via vmtoolsd
Jun  3 12:27:29 USDALXKADV01 one-contextd[15064]: Comparing /var/run/one-context/context.sh.0KXIk7 and /var/run/one-context/context.sh.local for changes
Jun  3 12:27:29 USDALXKADV01 one-contextd[15066]: No changes in context, skipping
Jun  3 12:27:29 USDALXKADV01 one-contextd[15067]: Comparing /var/run/one-context/context.sh.0KXIk7 and /var/run/one-context/context.sh.network for changes
Jun  3 12:27:29 USDALXKADV01 one-contextd[15069]: No changes in context, skipping
Jun  3 12:27:29 USDALXKADV01 one-contextd[15070]: Done
Jun  3 12:27:29 USDALXKADV01 one-contextd[15071]: Unmounting /var/run/one-context/mount.VEaypB
Jun  3 12:27:29 USDALXKADV01 one-contextd[15075]: Releasing lock /var/run/one-context/one-context.lock
Jun  3 12:29:15 USDALXKADV01 systemd-udevd[15148]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jun  3 12:29:15 USDALXKADV01 systemd-networkd[1398]: br-342fcad19ff7: Link UP
Jun  3 12:29:15 USDALXKADV01 systemd-timesyncd[499]: Network configuration changed, trying to establish connection.
Jun  3 12:29:15 USDALXKADV01 networkd-dispatcher[763]: WARNING:Unknown index 177 seen, reloading interface list
Jun  3 12:29:15 USDALXKADV01 kernel: [1397500.520269] IPv6: ADDRCONF(NETDEV_UP): br-342fcad19ff7: link is not ready
Jun  3 12:29:16 USDALXKADV01 systemd-timesyncd[499]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Jun  3 12:29:16 USDALXKADV01 systemd[1]: Starting OpenNebula delayed reconfiguration script...
Jun  3 12:29:16 USDALXKADV01 systemd[1]: Started OpenNebula delayed reconfiguration script.
Jun  3 12:29:16 USDALXKADV01 systemd-timesyncd[499]: Network configuration changed, trying to establish connection.
Jun  3 12:29:16 USDALXKADV01 systemd-timesyncd[499]: Synchronized to time server 91.189.89.198:123 (ntp.ubuntu.com).
Jun  3 12:30:16 USDALXKADV01 one-contextd[15249]: Started for type all to reconfigure
Jun  3 12:30:16 USDALXKADV01 one-contextd[15253]: Acquiring lock /var/run/one-context/one-context.lock
Jun  3 12:30:16 USDALXKADV01 one-contextd[15255]: Acquired lock /var/run/one-context/one-context.lock
Jun  3 12:30:16 USDALXKADV01 one-contextd[15268]: Reading context via vmtoolsd
Jun  3 12:30:16 USDALXKADV01 one-contextd[15277]: Comparing /var/run/one-context/context.sh.JXXPUJ and /var/run/one-context/context.sh.local for changes
Jun  3 12:30:16 USDALXKADV01 one-contextd[15279]: No changes in context, skipping
Jun  3 12:30:16 USDALXKADV01 one-contextd[15280]: Comparing /var/run/one-context/context.sh.JXXPUJ and /var/run/one-context/context.sh.network for changes
Jun  3 12:30:16 USDALXKADV01 one-contextd[15282]: No changes in context, skipping
Jun  3 12:30:16 USDALXKADV01 one-contextd[15283]: Done
Jun  3 12:30:16 USDALXKADV01 one-contextd[15284]: Unmounting /var/run/one-context/mount.arNJ1s
Jun  3 12:30:16 USDALXKADV01 one-contextd[15288]: Releasing lock /var/run/one-context/one-context.lock

What would cause this problem? I'd rather not have to do a fresh docker install, since this seems to be a band-aid and doesn't get at the crux of the issue. It seems that this is related to how docker sets up networks, but I'm not 100% positive that this is the problem.

I'm using Docker version 19.03.8, build afacb8b7f0

Update

I removed one of the networks by doing a Very Bad Thing and editing the local-kv.db file in /var/lib/docker/network/files. Removing the network allows docker to start without a hitch.

While this fixes the issue, it still doesn't give me a clear answer as to what's going on here. It appears that with the default docker configuration on this particular machine, there's a cap to the number of networks you can have at any given time

C.Nivs
  • 123
  • 1
  • 6
  • I'm seeing this, too. Upgraded from docker 19.03 to 20.10, but still seeing it. I can get console access and see that the machine is up and running, but network access is not working. – jdhildeb Feb 24 '22 at 16:31

1 Answers1

2

maybe:

indeed you should tune:

Bash Stack
  • 420
  • 2
  • 6
  • A lot of possibilities here, but good info. I'll go down the respective rabbit holes and see what I can find – C.Nivs Jun 03 '20 at 16:42
  • you could use ssh from another machine , then `tail -f /var/log/*.log` to see what is going on , but that might also fail when your route breaks – Bash Stack Jun 07 '20 at 10:53
  • 1
    Wound up being the docker network configuration, the machine by default only supports 5 networks, so I had to make changes to the docker daemon.json file – C.Nivs Jun 18 '20 at 15:18
  • @C.Nivs can you be more specific about "machine only supports 5 networks" and what changes you made in the daemon.js file? – jdhildeb Mar 02 '22 at 18:56