
I've got a mystery for you today. We run a small, three-node Elasticsearch cluster based on CoreOS (2023.5.0 / Linux 4.19.25-coreos) on Azure. Elasticsearch runs inside a Docker container in host network mode. After running almost completely maintenance-free for over a year, we've been seeing machines enter a very interesting state.

Update

This issue was solved by a fix to a driver in the Linux kernel. See answer below.

Symptoms

Basically, networking between the affected machine and the other two nodes dies. All are in the same virtual network and the same subnet and can usually communicate with each other. The affected node can still be reached from other subnets (I can ssh into it) and from a different, peered virtual network. The machine also has (very spotty) connectivity to the Internet, but most requests just time out.

We have observed that on an affected node, the number of "sockets used" reported by /proc/net/sockstat is very high (~4.5k instead of ~300 on a healthy node). Monitoring shows that this number rapidly rises from the moment that the node becomes unavailable.
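
For anyone who wants to reproduce that monitoring, here is a minimal C sketch that just polls the counter once per second; it assumes nothing beyond the /proc/net/sockstat format shown below (the one-second interval is arbitrary):

```c
/* sockwatch.c - poll the "sockets: used" counter from /proc/net/sockstat.
 * Minimal sketch; build with: cc -o sockwatch sockwatch.c */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        FILE *f = fopen("/proc/net/sockstat", "r");
        char line[256];
        long used = -1;

        if (!f) {
            perror("fopen /proc/net/sockstat");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            /* The first line looks like: "sockets: used 4566" */
            if (sscanf(line, "sockets: used %ld", &used) == 1)
                break;
        }
        fclose(f);

        printf("%ld sockets used: %ld\n", (long)time(NULL), used);
        fflush(stdout);
        sleep(1);
    }
}
```

On a healthy node this hovers around 300; on an affected one it climbs towards ~4.5k.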

The fun thing is that we cannot seem to identify the source of these used sockets:

# cat /proc/net/sockstat
sockets: used 4566
TCP: inuse 2 orphan 0 tw 2 alloc 98 mem 4
UDP: inuse 1 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

# cat /proc/net/sockstat6
TCP6: inuse 98
UDP6: inuse 1
UDPLITE6: inuse 0
RAW6: inuse 1
FRAG6: inuse 0 memory 0

Other than that the machine seems fine. There are no suspicious processes running, CPU usage is minimal and there is plenty of available memory.

Pinging an "unreachable" VM in the same subnet results in a few EAGAIN responses to recvmsg before sendmsg starts returning ENOBUFS (strace output of the ping is linked here).
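
The same condition can be checked without ping: assuming the failure mode is general buffer exhaustion on the interface rather than anything ICMP-specific, a plain UDP sendto() toward the unreachable peer should also come back with ENOBUFS ("No buffer space available"). A minimal sketch of that probe (the peer address 10.0.0.5 and port 9 are placeholders):

```c
/* enobufs_probe.c - send one UDP datagram and report the errno, to check for
 * the ENOBUFS behaviour seen in the strace of ping.  Sketch only; the peer
 * address and port are placeholders.
 * Build with: cc -o enobufs_probe enobufs_probe.c */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer;
    const char payload[] = "probe";

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9);                       /* discard port, arbitrary   */
    inet_pton(AF_INET, "10.0.0.5", &peer.sin_addr); /* placeholder: peer node IP */

    if (sendto(fd, payload, sizeof(payload), 0,
               (struct sockaddr *)&peer, sizeof(peer)) < 0)
        /* On an affected node the expectation would be ENOBUFS here. */
        printf("sendto failed: %s (errno %d)\n", strerror(errno), errno);
    else
        printf("sendto succeeded\n");

    close(fd);
    return 0;
}
```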

I've collected some additional output (before any modifications to the system were made) and posted it in this gist: https://gist.github.com/privatwolke/e7e2e7eb0272787765f5d3726f37107c

Analysis

We've tried shutting down everything we could think of on the server, with Elasticsearch being the first suspect. But shutting down the Elasticsearch container does not free up the used sockets. The same goes for all CoreOS-related processes (update-engine, locksmithd, ...), the whole Docker runtime, and the Azure-specific services. Nothing seemed to help.

But now it gets even weirder: we attempted to run tcpdump on the machine to see what was going on. And behold: the problem resolved itself and connectivity was restored. Our theory was that tcpdump makes some sort of syscall that resolves it. We ran tcpdump under gdb and set breakpoints on all syscalls. After stepping through loads of breakpoints, we finally found that the act of setting promiscuous mode on the capturing socket (specifically this line in libpcap) is the thing that resets the sockets used counter and returns us to a normal state.
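
For context, the libpcap line in question boils down to a PACKET_ADD_MEMBERSHIP setsockopt with PACKET_MR_PROMISC on its AF_PACKET capture socket. A stripped-down sketch of just that call (the interface name eth0 is an assumption, and it needs root/CAP_NET_RAW):

```c
/* promisc_member.c - replicate the libpcap step that appears to reset the
 * counter: enabling promiscuous mode via a packet-socket membership rather
 * than via the interface flags.  Sketch only; needs root/CAP_NET_RAW.
 * Build with: cc -o promisc_member promisc_member.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>          /* htons */
#include <linux/if_ether.h>     /* ETH_P_ALL */
#include <linux/if_packet.h>    /* struct packet_mreq, PACKET_MR_PROMISC */
#include <net/if.h>             /* if_nametoindex */

int main(void)
{
    const char *ifname = "eth0";    /* assumption: the affected interface */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    struct packet_mreq mr;

    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }

    memset(&mr, 0, sizeof(mr));
    mr.mr_ifindex = if_nametoindex(ifname);
    mr.mr_type = PACKET_MR_PROMISC;

    /* This is the setsockopt() that libpcap issues when -p is NOT given. */
    if (setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mr, sizeof(mr)) < 0) {
        perror("setsockopt(PACKET_ADD_MEMBERSHIP)");
        return 1;
    }

    printf("promiscuous membership enabled on %s; check /proc/net/sockstat now\n",
           ifname);
    pause();    /* keep the socket (and with it the membership) alive */
    return 0;
}
```

As far as I can tell this path ends up in dev_set_promiscuity() and the driver's rx-mode handling, so running just this on an affected node would tell whether the membership call alone is the trigger or whether tcpdump does something else on top.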

Additional Findings

  • We have verified that running tcpdump with the -p/--no-promiscuous-mode flag does not clear the sockets used counter or return the machine to a usable state.
  • Running ifconfig eth0 txqueuelen 1001 resets the sockets used counter, but connectivity is not restored.
  • Setting promisc mode manually with ip link set eth0 promisc on also does not restore connectivity (see the ioctl sketch after this list).
  • net.ipv4.xfrm4_gc_thresh is set to 32768 and increasing it slightly does not resolve the problem.
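
"Manually" above means flipping the interface flag: ip link does it over netlink, but the classic SIOCSIFFLAGS ioctl below sets the same IFF_PROMISC flag, which may be useful for poking at the difference between this path and the packet-socket membership above (again only a sketch; eth0 assumed, needs root):

```c
/* promisc_flag.c - enable promiscuous mode by setting the IFF_PROMISC
 * interface flag, i.e. the effect of `ip link set eth0 promisc on` (which
 * itself talks netlink).  Sketch only; needs root/CAP_NET_ADMIN.
 * Build with: cc -o promisc_flag promisc_flag.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>             /* struct ifreq, IFF_PROMISC */

int main(void)
{
    const char *ifname = "eth0";              /* assumption: affected NIC  */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket works for this */
    struct ifreq ifr;

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) {  /* read the current flags */
        perror("SIOCGIFFLAGS");
        return 1;
    }
    ifr.ifr_flags |= IFF_PROMISC;             /* add the promiscuous flag */
    if (ioctl(fd, SIOCSIFFLAGS, &ifr) < 0) {
        perror("SIOCSIFFLAGS");
        return 1;
    }

    printf("IFF_PROMISC set on %s\n", ifname);
    close(fd);
    return 0;
}
```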

[monitoring graph: "sockets used" over time]

We've been in contact with Azure, who are as baffled by this as we are. I understand that this is likely not the problem itself but just a symptom, but it is the only tangible thing that I have found so far. My hope is that by understanding the symptom I can get closer to the root cause. The network interfaces on Azure are run with the hv_netvsc network driver.

Maybe CoreOS/Kernel is to blame?

From a timeline point of view, the problems started on 2019-03-11, which is the day that CoreOS auto-updated to the latest version. According to the release notes, this update contained a kernel update from 4.14.96 to 4.19.25. I'm still going through the changelogs to see if anything could be an issue there. So far I have only discovered that the hyperv network driver has received quite a few updates in recent months, not all of which seem to be part of 4.19.25. The patchset that CoreOS applied to 4.19.25 is not that extensive, but the patch that introduces a fake nf_conntrack_ipv4 module is new.

Update: Possibly related incoming kernel patch?

Help!

So far, the questions we have are the following:

  • What could cause this "sockets used" metric to skyrocket? I have read through the kernel sources for this metric and it seems to be just a counter with no reference to what kind of sockets those actually are or what created them.

  • Why does the number flatline at about 4.5k? Which limit would be causing this?

  • Did something significant change between kernel 4.14.96 and 4.19.25?

  • Why does the setsockopt() call in libpcap reset the state?

Related CoreOS bug: https://github.com/coreos/bugs/issues/2572

Stephan Klein
  • The open sockets are a resulting problem, not the root problem IMHO. I had this behaviour on a Linux system with macvlan devices (with their own MAC addresses) on a bridge device. Setting the bridge to promisc made the macvlan devices work. I don't know CoreOS or Azure. The problem is that an underlying layer does not know about the MAC addresses on the upper levels. – AndreasM Mar 25 '19 at 13:05
  • Thank you for your comment! I realize that a high number of sockets used is not the root cause; I'm just clinging to the one tangible thing that I can identify as abnormal on the machine. – Stephan Klein Mar 25 '19 at 14:07
  • Hi Stephan. Any news? Please report: 1) is WOL enabled? 2) does sysctl -w net.ipv4.route.flush=1 resolve it? 3) what does the ARP cache look like in the non-working state? And in the working state? – Massimo May 06 '19 at 16:02

2 Answers


First of all, thank you for a very well written question!

As the level of detail you described is very high and you are already at the gdb level, I assume my answer won't be of much use to you. Anyway, here's a try:

  • Presumably you already tried something like ss -ae and lsof -n?
  • Does dmesg return anything interesting when this happens?
  • Do you use iptables on the server?
  • If you set the promiscuous mode using some other way than tcpdump (say, ip link set [interface] promisc on), does this also fix the problem?
  • Have you checked for any suspicious processes, files or other weird activity? Just thinking that maybe some uninvited nasty process is lurking in the shadows, hiding itself, and goes silent whenever promiscuous mode is set?
  • If you leave tcpdump running in the background, will this problem return?

I hope this helps.

Janne Pikkarainen
  • Thank you for your reply! I have indeed gathered the output of some of the commands that you reference. They are now also linked in the question (https://gist.github.com/privatwolke/e7e2e7eb0272787765f5d3726f37107c). The strange thing is that we get far fewer sockets reported by `ss`, `lsof` and `netstat` than by "sockets used" in `/proc/net/sockstat`. Only the total count (which seems to be read straight from that file) is the same. `iptables` runs but doesn't have special rules (see gist). I have not tried setting promiscuous mode myself or running tcpdump continuously; will do that next time. – Stephan Klein Mar 25 '19 at 14:22
  • I have added the output of `ss -aepi` to my output collection: https://gist.github.com/privatwolke/e7e2e7eb0272787765f5d3726f37107c#file-ss_aepi -- Sadly dmesg returns exactly nothing when this is happening. In fact, the latest entry before the incident is 5 days old. – Stephan Klein Mar 27 '19 at 22:12
  • Added [`dmesg / journalctl -k`](https://gist.github.com/privatwolke/e7e2e7eb0272787765f5d3726f37107c#file-journalctl_k) output. – Stephan Klein Mar 27 '19 at 22:57
  • I have verified that `ip link set eth0 promisc on` alone does not restore the machine to a usable state. – Stephan Klein Apr 02 '19 at 13:00
  • Hello, have you had a look at this other question on this site? https://serverfault.com/questions/614453/no-buffer-space-available-on-connect It seems to imply you might be depleting the xfrm4 destination cache. You can increase it with this kernel setting: `xfrm4_gc_thresh - INTEGER: The threshold at which we will start garbage collecting for IPv4 destination cache entries. At twice this value the system will refuse new allocations.` As far as I can tell it's related to IPsec, which you don't seem to be running here either, though. – Pedro Perez Apr 02 '19 at 13:20
  • `net.ipv4.xfrm4_gc_thresh` was set to 32768, which I increased to 34000. Sadly, it didn't do the trick :-( – Stephan Klein Apr 02 '19 at 13:29

This was caused by a bug in the hv_netvsc driver in the Linux kernel. We worked this out together with a Microsoft developer and managed to get the fix applied upstream.

I'll quote the commit message here as it sums the problem up quite well:

When the ring buffer is almost full due to RX completion messages, a TX packet may reach the "low watermark" and cause the queue stopped. If the TX completion arrives earlier than queue stopping, the wakeup may be missed.

This patch moves the check for the last pending packet to cover both EAGAIN and success cases, so the queue will be reliably waked up when necessary.

For future reference, the commit that fixes this is https://github.com/torvalds/linux/commit/6d9cfab853ca60b2f77b5e4c40443216988cba1f.

Stephan Klein