I've had this issue ever since I got this new router and flashed it with DD-WRT.
It's not really impactful (I'll describe the scenario) but I'm curious about it...

This is the network setup:

  • Manjaro Linux running on VMware Fusion (in a Mac/OSX host) connected through WiFi
  • 3 Raspberry Pis (running Raspbian) connected to switch 1 (which is connected to the router)
  • 1 NAS (WDCloud) connected to switch 1
  • 1 Raspberry Pi connected to switch 2 (which is connected to switch 1)

Given the setup, the issue:

  • Mac over WiFi, Manjaro VM in bridged mode
    • Pinging any of the 4 Pis shows packet loss in under 5 minutes: sometimes 20%, sometimes more
    • Pinging the NAS shows no packet loss at all
  • Mac over WiFi, Manjaro VM in NAT
    • no packet loss in any scenario
  • Mac over LAN, Manjaro VM in NAT or Bridged mode
    • no packet loss in any scenario

So my initial guess was that it was related to Fusion's bridged mode, because pinging directly from the Mac (host) never showed any loss, and neither did the VM with NAT.

  • Tried VirtualBox; the same happens (bridged shows packet loss, NAT does not).
  • Played a lot with the DD-WRT WiFi settings, but nothing seemed to make any difference.

Then I realized that pinging the NAS had no packet loss, so the problem seemed specific to the Bridged + WiFi + Raspberry Pi combination. I ran tcpdump icmp on one of the Raspberry Pis and started pinging it from the VM.

Ping output in the VM:

64 bytes from  (192.168.1.22): icmp_seq=13 ttl=64 time=2.40 ms
64 bytes from  (192.168.1.22): icmp_seq=14 ttl=64 time=2.50 ms
===> lost sequences 15 to 42 <===
64 bytes from  (192.168.1.22): icmp_seq=43 ttl=64 time=34.1 ms
64 bytes from  (192.168.1.22): icmp_seq=44 ttl=64 time=2.31 ms

tcpdump output in the Pi:

01:24:42.397835 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 13, length 64
01:24:42.397919 IP 192.168.1.22 > stretch: ICMP echo reply, id 436, seq 13, length 64
01:24:43.399899 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 14, length 64
01:24:43.399948 IP 192.168.1.22 > stretch: ICMP echo reply, id 436, seq 14, length 64
01:24:44.404887 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 15, length 64
01:24:45.422542 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 16, length 64
===> requests hit but no reply is sent... <===
01:25:12.044102 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 42, length 64
01:25:13.068516 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 43, length 64
01:25:13.099164 IP 192.168.1.22 > stretch: ICMP echo reply, id 436, seq 43, length 64
01:25:14.071065 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 44, length 64
01:25:14.071129 IP 192.168.1.22 > stretch: ICMP echo reply, id 436, seq 44, length 64

Conclusion (I think): the ping requests reach the Raspberry Pi, but no replies are sent for that period (about 30s).
I'm using ping because it's the easiest way to show and test packet loss, but this also affects TCP: SSH sessions hang now and then.
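
For a longer capture, pairing requests with replies by hand gets tedious. Here is a minimal Python sketch of that pairing, assuming the tcpdump text format shown above; the function name `unanswered_requests` and the `sample` list are hypothetical, and the sample lines are copied from the capture.

```python
import re

# Matches tcpdump text lines such as:
# 01:24:44.404887 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 15, length 64
LINE_RE = re.compile(
    r"^(?P<ts>\S+) IP (?P<src>\S+) > (?P<dst>[^:\s]+): "
    r"ICMP echo (?P<kind>request|reply), id (?P<id>\d+), seq (?P<seq>\d+)"
)


def unanswered_requests(lines):
    """Return (id, seq) pairs of echo requests that have no matching echo reply."""
    requests = {}   # (id, seq) -> timestamp of first request; keeps capture order
    replied = set() # (id, seq) pairs for which a reply was seen
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip annotations and non-ICMP-echo lines
        key = (int(m["id"]), int(m["seq"]))
        if m["kind"] == "request":
            requests.setdefault(key, m["ts"])
        else:
            replied.add(key)
    return [key for key in requests if key not in replied]


# Four lines taken from the capture above (seq 14 answered, 15 and 16 not).
sample = [
    "01:24:43.399899 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 14, length 64",
    "01:24:43.399948 IP 192.168.1.22 > stretch: ICMP echo reply, id 436, seq 14, length 64",
    "01:24:44.404887 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 15, length 64",
    "01:24:45.422542 IP stretch > 192.168.1.22: ICMP echo request, id 436, seq 16, length 64",
]
print(unanswered_requests(sample))  # [(436, 15), (436, 16)]
```

This only confirms that the Pi saw a request and emitted no reply; it says nothing about why the reply was not sent.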

Any hints on what to check in the Raspberry Pi configuration to understand why it's not sending the ICMP replies? This makes it look related to the Pi, but then why does it happen only in the Mac WiFi + VM bridged scenario and not in the others, given that the Pi remains constant?

Filipe Pina

1 Answer

I think it may be caused by ARP conflicts. You may want to check the MAC addresses of the 4 Pis, and also of the router, by running ifconfig. Cheap Pis may have the same MAC address.

You can also confirm this by running arp -a while ping is working and again while it is failing, to see how the ARP table differs.

Running tcpdump -i any arp may also help.
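
The good-versus-bad comparison can be sketched in Python. This is only an illustration, assuming the Linux net-tools `arp -a` line format (`? (IP) at MAC [ether] on IFACE`, with `<incomplete>` in place of the MAC for unresolved entries); the function names and all addresses in the sample are hypothetical.

```python
import re

# One entry per line, e.g. "? (192.168.1.50) at 00:0c:29:aa:bb:cc [ether] on eth0"
ENTRY_RE = re.compile(r"\((?P<ip>\d+\.\d+\.\d+\.\d+)\) at (?P<mac>\S+)")


def parse_arp_table(text):
    """Map IP -> lowercase MAC, or None when the entry is '<incomplete>'."""
    table = {}
    for m in ENTRY_RE.finditer(text):
        mac = m["mac"].lower()
        table[m["ip"]] = None if mac == "<incomplete>" else mac
    return table


def diff_arp_tables(good, bad):
    """Return {ip: (mac_when_good, mac_when_bad)} for entries that differ."""
    changed = {}
    for ip in sorted(set(good) | set(bad)):
        if good.get(ip) != bad.get(ip):
            changed[ip] = (good.get(ip), bad.get(ip))
    return changed


def duplicate_macs(table):
    """Return {mac: [ips]} for MACs claimed by more than one IP."""
    owners = {}
    for ip, mac in table.items():
        if mac is not None:
            owners.setdefault(mac, []).append(ip)
    return {mac: ips for mac, ips in owners.items() if len(ips) > 1}


# Hypothetical snapshots: one taken while ping works, one while it fails.
good_text = (
    "? (192.168.1.50) at 00:0c:29:aa:bb:cc [ether] on eth0\n"
    "? (192.168.1.1) at 00:11:22:33:44:55 [ether] on eth0\n"
)
bad_text = (
    "? (192.168.1.50) at <incomplete> on eth0\n"
    "? (192.168.1.1) at 00:11:22:33:44:55 [ether] on eth0\n"
)
print(diff_arp_tables(parse_arp_table(good_text), parse_arp_table(bad_text)))
# {'192.168.1.50': ('00:0c:29:aa:bb:cc', None)}
```

On a real machine the two snapshots could come from `subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout`, assuming the `arp` tool is installed. An entry that flips to `None` means the host stopped getting ARP replies for that IP; an entry that flips between two MACs would point to an address conflict.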

Ethan Xu
  • What do you mean by `cheap Pis`? But no: I ran `ip all` on all 5 Pis, parsed all the MACs in there, and found no duplicates. What would be the expected difference in `arp -a`? And should I run it (and tcpdump) in the VM doing the pinging, or on the Raspberry Pi? I just ran `arp -a` on the Raspberry Pi and, when ping is failing, it shows `incomplete` as the MAC for the VM IP. And `tcpdump -i any arp or icmp` actually showed that the ping replies are not sent due to `host unreachable`. – Filipe Pina May 23 '20 at 09:10
  • So I just tried it the other way around: pinging the VM from the Pi. The Pi reports host unreachable directly in the ping tool (instead of timing out like the VM). So I guess the VM does have some sort of connectivity issue, but it's not WiFi signal quality, as it doesn't fail with external IPs or when pinging the router IP directly. Ping from VM to Pi: 50% packet loss. Ping from Pi to VM: ping explicitly states host unreachable every now and then. As for the `tcpdump arp` output, I'm not sure what I should be looking for: there are a lot of requests and no replies. – Filipe Pina May 23 '20 at 09:14
  • So I noticed it's not just the Pis: it happens when pinging any internal device from a WiFi device. I just tried running `arp -s IP MAC` on one of the Pis (with the IP/MAC of the WiFi VM) and it resolved the packet loss! Thanks, now I know where to dig to find the issue: ARP. I'll google how to add a persistent ARP entry, if possible, and see if it stops happening. – Filipe Pina May 23 '20 at 12:35
  • Sorry, I didn't mean anything by it. Some Pi manufacturers reuse MAC addresses on their Pis, so you may have the same MAC address on different Pis. – Ethan Xu May 23 '20 at 23:20
  • If your Pis have different MAC addresses, I suggest checking all your network devices one by one to see whether any of them share the same MAC or IP address. A persistent ARP table is not a good idea because you're not actually fixing the problem. – Ethan Xu May 23 '20 at 23:22
  • Try using `tcpdump arp` and `arp -a` together and analyze how the ARP table changes. `tcpdump` is very useful, but you need to understand how ARP works and how the traffic is reflected in the ARP table. – Ethan Xu May 23 '20 at 23:26
  • Yes, I used the static entry only to confirm that it was indeed an ARP issue and that you pointed me in the right direction. Now I'll try to find out why it happens, but at least I have something more specific to look into :) Thanks. – Filipe Pina May 24 '20 at 09:56