Can't communicate between multiple interfaces on same subnet

Question

I have multiple Ethernet interfaces on one machine, all on the same subnet. Normally these are set up to run on separate VMs, and I understand the limitations imposed by Linux as described here, but I've been tasked with trying to make it work on one host. I've been able to configure them such that traffic in and out of the host is directed through the correct device. What I can't do is communicate from one device to another. Here is what I've done to configure the devices so far:

Set static IP addresses:

ip addr add 192.168.1.124 dev eth0
ip addr add 192.168.1.125 dev eth1
ip addr add 192.168.1.126 dev eth2
...

Enable arp filtering:

sysctl -w net.ipv4.conf.all.arp_filter=1

Implement source-based routing as follows:

Append the following to /etc/iproute2/rt_tables

1     eth0
2     eth1
3     eth2
...

Add default route to table

ip route add default via 192.168.1.11 table eth0
ip route add default via 192.168.1.11 table eth1
ip route add default via 192.168.1.11 table eth2
...

Add subnet route through specific device based on src IP

ip route add 192.168.1.0/24 dev eth0 src 192.168.1.124 table eth0
ip route add 192.168.1.0/24 dev eth0 src 192.168.1.124 table eth1
ip route add 192.168.1.0/24 dev eth0 src 192.168.1.124 table eth2
...

Add rule

ip rule add from 192.168.1.124 table eth0
ip rule add from 192.168.1.124 table eth1
ip rule add from 192.168.1.124 table eth2
...

The device hardware takes care of filtering ingress packets based on destination IP.

Like I said, at this point I can confirm with tcpdump that traffic in and out of the host is directed through the correct device. Egress multicast packets go to the correct device as long as the src IP is bound. Multicast packets sent from one device are received by all the others. What I can't do is ping from one device to another. Using tcpdump, I see the egress ARP requests on the sending device and the ingress ARP requests on the receiving device, but no response is made. If I add the ARP entry directly, I likewise see the ping request on both devices but no response is made.

UPDATE:

Data can be sent between IP addresses assigned to the interfaces, but the network stack isn't sending it out through the devices. ICMP and multicast packets ARE passing through the devices but no responses are sent back.

Is there a way to:

A) Force packets out the device even when sending to the same host?

B) Force the host to respond to ICMP requests from the same host?

score 0 · Answer 1 · answered Sep 04 '22 at 01:02

Besides various minor issues, the main problem is policy routing: one must distinguish the route for an egress packet from the route for an ingress packet with exactly the same content, because they use two different paths.

Let's rewrite it completely (using numbers for table values, so it's easily testable without changing any files).

Needed for correct ARP handling that will follow policy routing and tie ARP traffic to the correct interface:

sysctl -w net.ipv4.conf.eth0.arp_filter=1
sysctl -w net.ipv4.conf.eth1.arp_filter=1
sysctl -w net.ipv4.conf.eth2.arp_filter=1

Add addresses, following OP's way:

ip address add 192.168.1.124/32 dev eth0
ip address add 192.168.1.125/32 dev eth1
ip address add 192.168.1.126/32 dev eth2

Add LAN routes, a different route in each table (OP pasted the same route three times). There is no need to hint the source (which would be a different address for each): it's selected by the routing rules that will be added later.

ip route add 192.168.1.0/24 dev eth0 table 124
ip route add 192.168.1.0/24 dev eth1 table 125
ip route add 192.168.1.0/24 dev eth2 table 126

Add gateway routes (now they are reachable within their table):

ip route add default via 192.168.1.11 table 124
ip route add default via 192.168.1.11 table 125
ip route add default via 192.168.1.11 table 126

Policy routing: the two directions of a packet from 192.168.1.124 to 192.168.1.125 must be treated differently: once when it is emitted through eth0, and once when the very same packet is received through eth1. iif lo is the special syntax that restricts a policy routing rule to locally emitted packets (rather than applying it in every case, including packets received from an interface). Use specific preferences (they will be useful in the following step):

ip rule add pref 124 iif lo from 192.168.1.124 lookup 124
ip rule add pref 125 iif lo from 192.168.1.125 lookup 125
ip rule add pref 126 iif lo from 192.168.1.126 lookup 126

Alas, the lookup of the local table by the policy routing rule with preference 0 prevents these rules from applying. For example, here nothing changed:

$ ip route get from 192.168.1.124 to 192.168.1.125
local 192.168.1.125 from 192.168.1.124 dev lo table local uid 1000 
    cache <local>

Just move the local table lookup after these rules:

ip rule add pref 200 lookup local
ip rule delete pref 0 lookup local

This now gives two paths for the same packet: one for egress and one for ingress (the moved local table still resolves the ingress path):

$ ip route get from 192.168.1.124 to 192.168.1.125
192.168.1.125 from 192.168.1.124 dev eth0 table 124 uid 1000 
    cache 
$ ip route get from 192.168.1.124 iif eth1 to 192.168.1.125
local 192.168.1.125 from 192.168.1.124 dev lo table local 
    cache <local> iif eth1
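These two checks can be scripted. Below is a minimal sketch, not part of the original setup: route_dev and local_pref are hypothetical helper names, and local_pref assumes the usual `ip rule show` line format (`200:	from all lookup local`).

```shell
#!/bin/sh
# route_dev: read `ip route get` output on stdin and print the word after "dev".
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# local_pref: read `ip rule show` output on stdin and print the preference of
# the first rule that looks up the local table (assumed "PREF:<tab>..." format).
local_pref() {
    awk -F: '/lookup local/ { print $1; exit }'
}

# Fed with the egress result shown above, this prints the egress device.
printf '%s\n' '192.168.1.125 from 192.168.1.124 dev eth0 table 124 uid 1000' | route_dev
```

On a live system one would pipe the real commands instead, for example `ip route get from 192.168.1.124 to 192.168.1.125 | route_dev` and `ip rule show | local_pref`; the second should report 200 once the local rule has been moved.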

Note: arp_filter ensures the correct interface is used for the reply (i.e. in the example above it will always be eth1, never eth2), because only the correct interface replies to ARP queries.

This can optionally be further enforced with Strict Reverse Path Forwarding using rp_filter:

Before (using an unexpected path):

$ ip route get from 192.168.1.124 iif eth2 to 192.168.1.125
local 192.168.1.125 from 192.168.1.124 dev lo table local 
    cache <local> iif eth2 


sysctl -w net.ipv4.conf.eth0.rp_filter=1
sysctl -w net.ipv4.conf.eth1.rp_filter=1
sysctl -w net.ipv4.conf.eth2.rp_filter=1

After:

$ ip route get from 192.168.1.124 iif eth2 to 192.168.1.125
RTNETLINK answers: Invalid cross-device link
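One caveat worth checking, which is not covered above: per the kernel's ip-sysctl documentation, source validation on an interface uses the maximum of conf/all/rp_filter and conf/<interface>/rp_filter. If the distribution sets all to 2 (loose mode), setting an interface to 1 does not give strict mode. A small sketch of that rule, with a hypothetical function name and literal example values:

```shell
#!/bin/sh
# effective_rp_filter ALL IFACE: print the value the kernel is documented to
# use for an interface, i.e. the larger of the "all" and per-interface settings.
effective_rp_filter() {
    if [ "$1" -gt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Example with literal values: all=2 (loose), interface=1 (strict).
# The result is 2, so the interface would still be in loose mode.
effective_rp_filter 2 1
```

On a live system the two arguments would come from `sysctl -n net.ipv4.conf.all.rp_filter` and `sysctl -n net.ipv4.conf.eth1.rp_filter`.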

Pinging an IP address from that same address makes no sense over the wire, because the packet would have to use the same interface twice: a ping from 192.168.1.124 to 192.168.1.124 would leave through eth0, but a switch (or even a hub) by default never sends a frame back out the port it came in on (that would be hairpin mode). Even the ARP request to resolve the destination would therefore fail, whatever the configuration, before the IP packet could be sent or received back. This case should go back to the local routing table over the lo device, as it did initially. Either add 3 more policy rules, or add a "hole" in each routing table using throw, so that rule evaluation falls through to the following rules and thus reaches the local routing table:

Before:

$ ip route get from 192.168.1.124 to 192.168.1.124
192.168.1.124 from 192.168.1.124 dev eth0 table 124 uid 1000 
    cache 

ip route add throw 192.168.1.124/32 table 124
ip route add throw 192.168.1.125/32 table 125
ip route add throw 192.168.1.126/32 table 126

After:

$ ip route get from 192.168.1.124 to 192.168.1.124
local 192.168.1.124 from 192.168.1.124 dev lo table local uid 1000 
    cache <local>

Note that the system cannot emit a packet without first binding the source address, because the main table has no matching route; that follows from OP's choice of /32 addresses:

$ ip route get 192.168.1.127
RTNETLINK answers: Network is unreachable
$ ip route get from 192.168.1.126 to 192.168.1.127
192.168.1.127 from 192.168.1.126 dev eth2 table 126 uid 1000 
    cache

One can still add default choices for example in the main routing table if needed. For example if 192.168.1.125 is to be the default:

# ip route add 192.168.1.0/24 dev eth1
# ip route get 192.168.1.127
192.168.1.127 dev eth1 src 192.168.1.125 uid 0 
    cache

The actual default route must also be specified again in the main table (leaving the device implicit is fine: eth1 is resolved from the previously added entry):

ip route add default via 192.168.1.11

With these settings, the host can reach other systems anywhere from any of its interfaces, as long as the source is specified (otherwise it defaults to 192.168.1.125 on eth1). It can also, which was OP's goal, ping itself over the wire from one of its addresses to a different one. Simulated using network namespaces:

# ip neigh flush all
# ping -c3 -I 192.168.1.124 192.168.1.125
PING 192.168.1.125 (192.168.1.125) from 192.168.1.124 : 56(84) bytes of data.
64 bytes from 192.168.1.125: icmp_seq=1 ttl=64 time=0.087 ms
64 bytes from 192.168.1.125: icmp_seq=2 ttl=64 time=0.059 ms
64 bytes from 192.168.1.125: icmp_seq=3 ttl=64 time=0.063 ms

--- 192.168.1.125 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2027ms
rtt min/avg/max/mdev = 0.059/0.069/0.087/0.012 ms

The first ping took longer because an ARP request was needed first. tcpdump confirms this (captures on the two interfaces involved):

# tcpdump -ttttt -l -e -n -s0 -p -i eth0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
 00:00:00.000000 fe:c7:45:36:a1:84 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.125 tell 192.168.1.124, length 28
 00:00:00.000042 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype ARP (0x0806), length 42: Reply 192.168.1.125 is-at ce:59:ff:a1:96:ff, length 28
 00:00:00.000045 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype IPv4 (0x0800), length 98: 192.168.1.124 > 192.168.1.125: ICMP echo request, id 59125, seq 1, length 64
 00:00:00.000061 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype IPv4 (0x0800), length 98: 192.168.1.125 > 192.168.1.124: ICMP echo reply, id 59125, seq 1, length 64
 00:00:01.003101 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype IPv4 (0x0800), length 98: 192.168.1.124 > 192.168.1.125: ICMP echo request, id 59125, seq 2, length 64
 00:00:01.003135 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype IPv4 (0x0800), length 98: 192.168.1.125 > 192.168.1.124: ICMP echo reply, id 59125, seq 2, length 64
 00:00:02.027155 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype IPv4 (0x0800), length 98: 192.168.1.124 > 192.168.1.125: ICMP echo request, id 59125, seq 3, length 64
 00:00:02.027190 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype IPv4 (0x0800), length 98: 192.168.1.125 > 192.168.1.124: ICMP echo reply, id 59125, seq 3, length 64
 00:00:05.099215 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.124 tell 192.168.1.125, length 28
 00:00:05.099248 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype ARP (0x0806), length 42: Reply 192.168.1.124 is-at fe:c7:45:36:a1:84, length 28
^C
10 packets captured
10 packets received by filter
0 packets dropped by kernel

# tcpdump -ttttt -l -e -n -s0 -p -i eth1
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
 00:00:00.000000 fe:c7:45:36:a1:84 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.125 tell 192.168.1.124, length 28
 00:00:00.000022 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype ARP (0x0806), length 42: Reply 192.168.1.125 is-at ce:59:ff:a1:96:ff, length 28
 00:00:00.000031 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype IPv4 (0x0800), length 98: 192.168.1.124 > 192.168.1.125: ICMP echo request, id 59125, seq 1, length 64
 00:00:00.000041 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype IPv4 (0x0800), length 98: 192.168.1.125 > 192.168.1.124: ICMP echo reply, id 59125, seq 1, length 64
 00:00:01.003098 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype IPv4 (0x0800), length 98: 192.168.1.124 > 192.168.1.125: ICMP echo request, id 59125, seq 2, length 64
 00:00:01.003113 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype IPv4 (0x0800), length 98: 192.168.1.125 > 192.168.1.124: ICMP echo reply, id 59125, seq 2, length 64
 00:00:02.027154 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype IPv4 (0x0800), length 98: 192.168.1.124 > 192.168.1.125: ICMP echo request, id 59125, seq 3, length 64
 00:00:02.027170 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype IPv4 (0x0800), length 98: 192.168.1.125 > 192.168.1.124: ICMP echo reply, id 59125, seq 3, length 64
 00:00:05.099052 ce:59:ff:a1:96:ff > fe:c7:45:36:a1:84, ethertype ARP (0x0806), length 42: Request who-has 192.168.1.124 tell 192.168.1.125, length 28
 00:00:05.099241 fe:c7:45:36:a1:84 > ce:59:ff:a1:96:ff, ethertype ARP (0x0806), length 42: Reply 192.168.1.124 is-at fe:c7:45:36:a1:84, length 28
^C
10 packets captured
10 packets received by filter
0 packets dropped by kernel
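The namespace setup used for the simulation is not shown above. One possible way to build a comparable test bed, as an assumption rather than the exact setup used here, is a network namespace playing the host, with each of its interfaces connected through a veth pair to a bridge acting as the switch. The names testhost, br-test and sw-* are hypothetical, and the sketch only prints the commands:

```shell
#!/bin/sh
# Dry-run sketch: prints commands for a test bed, nothing is executed.
NS=testhost     # hypothetical namespace name, plays the multi-homed host
BR=br-test      # hypothetical bridge name, plays the switch

# emit_testbed IFACE...: one veth pair per interface, bridged together.
emit_testbed() {
    echo "ip netns add $NS"
    echo "ip -n $NS link set lo up"
    echo "ip link add $BR type bridge"
    echo "ip link set $BR up"
    for iface in "$@"; do
        # $iface ends up inside the namespace; its peer sw-$iface stays
        # outside and becomes a port of the bridge.
        echo "ip link add sw-$iface type veth peer name $iface netns $NS"
        echo "ip link set sw-$iface master $BR"
        echo "ip link set sw-$iface up"
        echo "ip -n $NS link set $iface up"
    done
}

emit_testbed eth0 eth1 eth2
```

The configuration commands from this answer would then be run inside the namespace, for example from a shell started with `ip netns exec testhost sh`.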
