First question! Hi!

Running on Ubuntu 16.04.

Hardware info: lspci | awk '/[Nn]et/ {print $1}' | xargs -i% lspci -ks %

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
    Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
    Kernel driver in use: e1000e
    Kernel modules: e1000e
02:00.0 Network controller: Intel Corporation Device 093c (rev 3a)
    Subsystem: Intel Corporation Device 7001

I am seeing odd Ethernet stalls when running a P2P application, specifically https://github.com/prysmaticlabs/prysm. According to the application's logs, around 30 peers are connected to my machine. Bandwidth utilization is low (peaks of 6 Mbps), the machine is wired with Cat6 cable, I have a fiber uplink of around 120 Mbps, and the ports are correctly forwarded, as reported by canyouseeme.org. Other P2P apps, such as torrent clients, do not show this behavior.

As said, the symptoms are odd. While only the application is running, connectivity does not seem to be lost. But the moment another application needs the network (for example web browsing, chat, or file transfer), the interface stalls for seconds or even minutes. I noticed this because browsing would often time out.

When the stalls happen, the application keeps running normally, but all other apps lose internet connection. I monitor ICMP (ping) traffic:

  • From the host to the router
  • From another local host to the stalling host

On both devices, ping stops returning any response (the terminal output stops, with no replies and no timeouts). After the long stall, all packets are suddenly acknowledged at once. See this sample:

64 bytes from 192.168.1.1: icmp_seq=1122 ttl=64 time=0.304 ms
64 bytes from 192.168.1.1: icmp_seq=1123 ttl=64 time=0.303 ms
64 bytes from 192.168.1.1: icmp_seq=1124 ttl=64 time=0.313 ms
64 bytes from 192.168.1.1: icmp_seq=1125 ttl=64 time=0.263 ms
64 bytes from 192.168.1.1: icmp_seq=1126 ttl=64 time=0.266 ms
64 bytes from 192.168.1.1: icmp_seq=1127 ttl=64 time=0.273 ms
64 bytes from 192.168.1.1: icmp_seq=1128 ttl=64 time=0.289 ms
64 bytes from 192.168.1.1: icmp_seq=1129 ttl=64 time=0.276 ms
64 bytes from 192.168.1.1: icmp_seq=1130 ttl=64 time=0.280 ms
64 bytes from 192.168.1.1: icmp_seq=1131 ttl=64 time=0.635 ms
64 bytes from 192.168.1.1: icmp_seq=1132 ttl=64 time=0.292 ms
64 bytes from 192.168.1.1: icmp_seq=1133 ttl=64 time=0.537 ms
64 bytes from 192.168.1.1: icmp_seq=1134 ttl=64 time=0.299 ms
64 bytes from 192.168.1.1: icmp_seq=1135 ttl=64 time=0.272 ms
64 bytes from 192.168.1.1: icmp_seq=1136 ttl=64 time=27625 ms
64 bytes from 192.168.1.1: icmp_seq=1137 ttl=64 time=26635 ms
64 bytes from 192.168.1.1: icmp_seq=1138 ttl=64 time=25631 ms
64 bytes from 192.168.1.1: icmp_seq=1139 ttl=64 time=24640 ms
64 bytes from 192.168.1.1: icmp_seq=1140 ttl=64 time=23641 ms
64 bytes from 192.168.1.1: icmp_seq=1141 ttl=64 time=22671 ms
64 bytes from 192.168.1.1: icmp_seq=1142 ttl=64 time=21648 ms
64 bytes from 192.168.1.1: icmp_seq=1143 ttl=64 time=20652 ms
64 bytes from 192.168.1.1: icmp_seq=1144 ttl=64 time=19658 ms
64 bytes from 192.168.1.1: icmp_seq=1145 ttl=64 time=18655 ms
64 bytes from 192.168.1.1: icmp_seq=1146 ttl=64 time=17658 ms
64 bytes from 192.168.1.1: icmp_seq=1147 ttl=64 time=16659 ms
64 bytes from 192.168.1.1: icmp_seq=1148 ttl=64 time=15655 ms
64 bytes from 192.168.1.1: icmp_seq=1149 ttl=64 time=14632 ms
64 bytes from 192.168.1.1: icmp_seq=1150 ttl=64 time=13611 ms
64 bytes from 192.168.1.1: icmp_seq=1151 ttl=64 time=12588 ms
64 bytes from 192.168.1.1: icmp_seq=1152 ttl=64 time=11565 ms
64 bytes from 192.168.1.1: icmp_seq=1153 ttl=64 time=10542 ms
64 bytes from 192.168.1.1: icmp_seq=1154 ttl=64 time=9522 ms
64 bytes from 192.168.1.1: icmp_seq=1155 ttl=64 time=8501 ms
64 bytes from 192.168.1.1: icmp_seq=1156 ttl=64 time=7478 ms
64 bytes from 192.168.1.1: icmp_seq=1157 ttl=64 time=6459 ms
64 bytes from 192.168.1.1: icmp_seq=1158 ttl=64 time=5436 ms
64 bytes from 192.168.1.1: icmp_seq=1159 ttl=64 time=4415 ms
64 bytes from 192.168.1.1: icmp_seq=1160 ttl=64 time=3391 ms
64 bytes from 192.168.1.1: icmp_seq=1161 ttl=64 time=2370 ms
64 bytes from 192.168.1.1: icmp_seq=1162 ttl=64 time=1350 ms
64 bytes from 192.168.1.1: icmp_seq=1163 ttl=64 time=320 ms
64 bytes from 192.168.1.1: icmp_seq=1164 ttl=64 time=2.73 ms
64 bytes from 192.168.1.1: icmp_seq=1165 ttl=64 time=0.258 ms
64 bytes from 192.168.1.1: icmp_seq=1166 ttl=64 time=0.303 ms

Then the network returns to normal, for a while.
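
A note on reading the sample: ping sends one request per second by default, and each successive reply above has a round-trip time about 1 s lower than the previous one. That means all the delayed replies arrived at nearly the same moment, so the packets were held somewhere for roughly 27.6 s and then released together, rather than dropped. A minimal Python sketch for spotting such runs in a saved ping log follows. The function name `find_stalls` and the 1000 ms threshold are assumptions made for illustration, not part of any tool.

```python
import re

# Matches the sequence number and round-trip time in a Linux ping reply line.
REPLY = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")

def find_stalls(lines, threshold_ms=1000.0):
    """Return (first_seq, last_seq, max_rtt_ms) for each run of slow replies.

    threshold_ms is an arbitrary cut-off chosen for this sketch.
    """
    stalls = []
    run = []  # (seq, rtt) pairs in the current run of slow replies
    for line in lines:
        match = REPLY.search(line)
        if not match:
            continue  # skip anything that is not a reply line
        seq, rtt = int(match.group(1)), float(match.group(2))
        if rtt >= threshold_ms:
            run.append((seq, rtt))
        elif run:
            stalls.append((run[0][0], run[-1][0], max(r for _, r in run)))
            run = []
    if run:  # the log ended in the middle of a stall
        stalls.append((run[0][0], run[-1][0], max(r for _, r in run)))
    return stalls

# A few lines taken from the sample above.
sample = [
    "64 bytes from 192.168.1.1: icmp_seq=1135 ttl=64 time=0.272 ms",
    "64 bytes from 192.168.1.1: icmp_seq=1136 ttl=64 time=27625 ms",
    "64 bytes from 192.168.1.1: icmp_seq=1137 ttl=64 time=26635 ms",
    "64 bytes from 192.168.1.1: icmp_seq=1162 ttl=64 time=1350 ms",
    "64 bytes from 192.168.1.1: icmp_seq=1163 ttl=64 time=320 ms",
]
print(find_stalls(sample))
```

The largest round-trip time in a run approximates how long the stall lasted, which makes it easier to compare stalls across hosts.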

Things that I have tried:

  • Increasing MTU from 1500 to 9000 (no effect)
  • Increasing txqueuelen from 1000 to 11000 (no effect)
  • Limiting the number of peers that can connect (no effect)
  • Virtualization (no effect)
  • Removing port forwarding. This seems to work, although it defeats the purpose of the app and makes it considerably slower.

At this point I have two theories:

1) The gateway is acting up (I cannot check). I discard this because other devices on the network run fine, for both local and outside connections.

2) Some kind of memory buffer is choking, but I don't know which.

I'd appreciate some inspiration!
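
One way the second theory could be checked on the host is to compare the interface's error, drop and FIFO counters in /proc/net/dev before and after a stall. The sketch below is only an illustration: the helper names `parse_net_dev` and `counter_deltas` are made up here, the interface name `eth0` is an assumption, and the sample numbers are invented so the example can run anywhere (they are not measurements from this machine).

```python
def parse_net_dev(text):
    """Map interface name -> selected counters from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        f = [int(x) for x in data.split()]
        # Receive fields come first (8 columns), then transmit fields.
        stats[name.strip()] = {
            "rx_errs": f[2], "rx_drop": f[3], "rx_fifo": f[4],
            "tx_errs": f[10], "tx_drop": f[11], "tx_fifo": f[12],
        }
    return stats

def counter_deltas(before, after, iface):
    """Return how much each counter grew between two snapshots."""
    return {k: after[iface][k] - before[iface][k] for k in before[iface]}

HEADER = (
    "Inter-|   Receive                                                |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|"
    "bytes packets errs drop fifo colls carrier compressed\n"
)
# Invented snapshots, for illustration only.
BEFORE = HEADER + "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
AFTER = HEADER + "  eth0: 9000 90 0 5 0 0 0 0 8000 80 0 0 0 0 0 0\n"

print(counter_deltas(parse_net_dev(BEFORE), parse_net_dev(AFTER), "eth0"))
```

On the real machine, the two snapshots would come from reading /proc/net/dev once before and once after a stall. Counters that stay flat across a stall would suggest the buffering is happening somewhere other than the host's interface.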

  • Which network card do you have? Feel free to add it to the question so that everybody sees that right away. – Eduardo Trápani Jan 27 '20 at 13:38
  • This could be spanning tree convergence. Did you check all of the spanning tree settings on the relevant ports? Also, does the gateway inspect content? – Spencer Jan 27 '20 at 16:58
  • Hi @EduardoTrápani , I added the requested information to the post. – Rafael Rodríguez Jan 28 '20 at 06:16
  • Hi @Spencer , the hardware is residential-grade, so these settings are limited: there is no spanning tree. In any case, there are no redundant links in the network. The gateway does not inspect content, to my knowledge. – Rafael Rodríguez Jan 28 '20 at 06:21

2 Answers

For that card you could try booting with this kernel parameter. This explains how to do it:

pcie_aspm=off

Another way is using ethtool. For example:

sudo ethtool -G eth0 rx 256 tx 256

That comes from here.

Eduardo Trápani

After more debugging of all the elements in the network, I found that other devices are also affected by the traffic jam, although the effects there are much less noticeable. This leads me to think that the issue lies in the router/switch, which is probably struggling to keep up with the demands of the P2P application, perhaps because of NAT translations. I will try to get more capable hardware to solve this.