I've got a mystery for you today. We run a small, three-node Elasticsearch cluster based on CoreOS (2023.5.0 / Linux 4.19.25-coreos) on Azure. Elasticsearch runs inside a Docker container in host network mode. After running almost completely maintenance-free for over a year, we have been seeing machines enter a very interesting state.
Update
This issue was solved by a fix to a driver in the Linux kernel. See answer below.
Symptoms
Basically, networking between the affected machine and the other two nodes dies. All are in the same virtual network and the same subnet and can usually communicate with each other. The affected node can still be reached from other subnets (I can ssh into it) and from a different, peered virtual network. The machine also has a (very spotty) connection to the Internet, but most requests just time out.
We have observed that on an affected node, the number of "sockets used" reported by /proc/net/sockstat is very high (~4.5k instead of ~300 on a healthy node). Monitoring shows that this number rises rapidly from the moment the node becomes unavailable.
The fun thing is that we cannot seem to identify the source of these used sockets:
# cat /proc/net/sockstat
sockets: used 4566
TCP: inuse 2 orphan 0 tw 2 alloc 98 mem 4
UDP: inuse 1 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
# cat /proc/net/sockstat6
TCP6: inuse 98
UDP6: inuse 1
UDPLITE6: inuse 0
RAW6: inuse 1
FRAG6: inuse 0 memory 0
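To make the comparison concrete, here is a rough sketch in C (not the exact tooling we used; run it as root so all /proc/&lt;pid&gt;/fd directories are readable) that counts the socket file descriptors held by every process, which can then be compared with the "sockets used" value above:

```c
/* Rough sketch (not the exact tooling we used): count the socket file
 * descriptors held by every process by walking /proc/<pid>/fd, so the result
 * can be compared with the "sockets: used" line from /proc/net/sockstat.
 * Run as root so all fd directories are readable. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    DIR *proc = opendir("/proc");
    if (!proc) { perror("opendir /proc"); return 1; }

    long total = 0;
    struct dirent *p;
    while ((p = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)p->d_name[0]))
            continue;                            /* only PID directories */

        char fddir[64];
        snprintf(fddir, sizeof fddir, "/proc/%s/fd", p->d_name);
        DIR *fds = opendir(fddir);
        if (!fds)
            continue;                            /* process gone or not permitted */

        struct dirent *f;
        while ((f = readdir(fds)) != NULL) {
            char link[128], target[64];
            snprintf(link, sizeof link, "%s/%s", fddir, f->d_name);
            ssize_t n = readlink(link, target, sizeof target - 1);
            if (n <= 0)
                continue;                        /* "." and ".." are not symlinks */
            target[n] = '\0';
            if (strncmp(target, "socket:[", 8) == 0)
                total++;                         /* fd refers to a socket inode */
        }
        closedir(fds);
    }
    closedir(proc);

    printf("socket fds held by processes: %ld\n", total);
    return 0;
}
```

Duplicated or inherited file descriptors are counted once per process, so this is an upper bound on the sockets owned by userspace.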
Other than that, the machine seems fine. There are no suspicious processes running, CPU usage is minimal, and there is plenty of available memory.
Pinging an "unreachable" VM in the same subnet results in a few EAGAIN responses to recvmsg and then ENOBUFS coming back from sendmsg. strace ping output here
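For anyone who wants to poke at this without strace, here is a minimal stand-in for the ping probe (a sketch, not the exact probe we used): it sends ICMP echo requests on a raw socket, the way classic ping does, and prints the errno from sendto()/recv(). The peer address comes from argv[1] and it needs root/CAP_NET_RAW.

```c
/* Minimal stand-in for "strace ping": send ICMP echo requests on a raw socket
 * and print the errno from sendto()/recv(), to see where EAGAIN and ENOBUFS
 * show up. Peer address is argv[1]; needs root (CAP_NET_RAW). */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/ip_icmp.h>
#include <sys/socket.h>

/* RFC 1071 Internet checksum over the ICMP header we send. */
static unsigned short csum(const void *buf, int len) {
    unsigned long sum = 0;
    const unsigned short *p = buf;
    for (; len > 1; len -= 2) sum += *p++;
    if (len == 1) sum += *(const unsigned char *)p;
    sum = (sum >> 16) + (sum & 0xffff);
    sum += sum >> 16;
    return (unsigned short)~sum;
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <peer-ip>\n", argv[0]); return 1; }

    int s = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
    if (s < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = { .sin_family = AF_INET };
    if (inet_pton(AF_INET, argv[1], &dst.sin_addr) != 1) {
        fprintf(stderr, "bad address: %s\n", argv[1]);
        return 1;
    }

    for (int seq = 0; seq < 10; seq++) {
        struct icmphdr icmp = { .type = ICMP_ECHO };
        icmp.un.echo.id = htons((unsigned short)getpid());
        icmp.un.echo.sequence = htons(seq);
        icmp.checksum = csum(&icmp, sizeof icmp);

        if (sendto(s, &icmp, sizeof icmp, 0,
                   (struct sockaddr *)&dst, sizeof dst) < 0)
            printf("seq %d: sendto: %s\n", seq, strerror(errno)); /* ENOBUFS on a bad node? */
        else
            printf("seq %d: sent\n", seq);

        char buf[1500];
        if (recv(s, buf, sizeof buf, MSG_DONTWAIT) < 0)
            printf("seq %d: recv: %s\n", seq, strerror(errno));   /* EAGAIN if no reply yet */

        sleep(1);
    }
    close(s);
    return 0;
}
```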
I've collected some additional output (before any modifications to the system were made) and posted it in this gist: https://gist.github.com/privatwolke/e7e2e7eb0272787765f5d3726f37107c
Analysis
We've tried shutting down everything we can think of on the server, with Elasticsearch being the first suspect. But shutting down the Elasticsearch container does not free up the used sockets. The same goes for all CoreOS-related processes (update-engine, locksmithd, ...), the whole Docker runtime, and the Azure-specific agents. Nothing seemed to help.
But now it gets even weirder: we attempted to run tcpdump on the machine to see what was going on. And behold: the problem resolved itself and connectivity was restored. Our theory was that tcpdump makes some syscall that resolves the situation. We ran tcpdump under gdb and set breakpoints on all syscalls. After stepping through loads of breakpoints, we finally found that the act of setting promiscuous mode on the capturing socket (specifically this line in libpcap) is what resets the "sockets used" counter and returns us to a normal state.
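For reference, here is a minimal standalone version of that call as I understand it: an AF_PACKET capture socket that joins a PACKET_MR_PROMISC membership via setsockopt(), which is the path tcpdump/libpcap takes. The interface name eth0 is an assumption, and it needs root/CAP_NET_RAW.

```c
/* Sketch of the call in question: libpcap enables promiscuous mode by adding
 * a PACKET_MR_PROMISC membership on its AF_PACKET capture socket rather than
 * by flipping the IFF_PROMISC interface flag. Interface name "eth0" is an
 * assumption; needs root/CAP_NET_RAW. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>        /* htons */
#include <net/if.h>           /* if_nametoindex */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct packet_mreq, PACKET_* */
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket(AF_PACKET)"); return 1; }

    struct packet_mreq mr;
    memset(&mr, 0, sizeof mr);
    mr.mr_ifindex = if_nametoindex("eth0");   /* assumed interface name */
    if (mr.mr_ifindex == 0) { perror("if_nametoindex"); return 1; }
    mr.mr_type = PACKET_MR_PROMISC;

    /* This is the setsockopt() that, stepped through under gdb, brought the
       machine back to a normal state. */
    if (setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof mr) < 0) {
        perror("setsockopt(PACKET_ADD_MEMBERSHIP)");
        close(fd);
        return 1;
    }

    puts("promiscuous membership added; check /proc/net/sockstat now");
    pause();   /* the membership is dropped again when the socket is closed */
    return 0;
}
```

The membership is removed again when the socket is closed, which is why the sketch pauses instead of exiting immediately.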
Additional Findings
- We have verified that running tcpdump with the -p/--no-promiscuous-mode flag does not clear the "sockets used" counter or return the machine to a usable state.
- Running ifconfig eth0 txqueuelen 1001 resets the "sockets used" counter, but connectivity is not restored.
- Setting promiscuous mode manually with ip link set eth0 promisc on also does not restore connectivity (see the sketch after this list for what that actually does).
- net.ipv4.xfrm4_gc_thresh is set to 32768 and increasing it slightly does not resolve the problem.
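For contrast with the libpcap path above: as far as I can tell, ip link set eth0 promisc on toggles the IFF_PROMISC flag on the device itself (iproute2 does this over netlink; the sketch below uses the equivalent, older SIOCSIFFLAGS ioctl). The interface name eth0 is again an assumption.

```c
/* Sketch of what "ip link set eth0 promisc on" amounts to: setting the
 * IFF_PROMISC flag on the interface. iproute2 uses netlink for this; the
 * classic SIOCGIFFLAGS/SIOCSIFFLAGS ioctls shown here have the same effect. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(void) {
    /* Any ordinary socket can be used as a handle for interface ioctls. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* assumed interface name */

    if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) { perror("SIOCGIFFLAGS"); return 1; }
    ifr.ifr_flags |= IFF_PROMISC;                  /* the flag "promisc on" sets */
    if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) { perror("SIOCSIFFLAGS"); return 1; }

    printf("IFF_PROMISC set on %s\n", ifr.ifr_name);
    close(fd);
    return 0;
}
```

In our case this interface-level flag did not restore connectivity, while the per-socket membership that libpcap adds did, which makes the difference between the two code paths interesting to me.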
We've been in contact with Azure support, who are as baffled by this as we are. I understand that this is likely not the problem itself but just a symptom. But it is the only tangible thing I have found so far. My hope is that by understanding the symptom I can get closer to the root cause. The network interfaces on Azure are run with this network driver.
Maybe CoreOS/Kernel is to blame?
From a timeline point of view, the problems started on 2019-03-11, which is the day that CoreOS auto-updated to the latest version. According to the release notes, this update contained a kernel update from 4.14.96 to 4.19.25. I'm still going through the changelogs to see if anything could be an issue there. So far I have only discovered that the Hyper-V network driver has received quite a few updates in recent months, not all of which seem to be part of 4.19.25. The patchset that CoreOS applied to 4.19.25 is not that impressive, but the patch that introduces a fake nf_conntrack_ipv4 module is new.
Help!
So far, the questions we have are the following:
1. What could cause this "sockets used" metric to skyrocket? I have read through the kernel sources for this metric and it seems to be just a counter with no reference to what kind of sockets those actually are or what created them.
2. Why does the number flatline at about 4.5k? Which limit would be causing this?
3. Did something significant change between kernel 4.14.96 and 4.19.25?
4. Why does the setsockopt() call in libpcap reset the state?
Related CoreOS bug: https://github.com/coreos/bugs/issues/2572