
As you all probably know, the IPv4 route cache was removed in the Linux 3.6 kernel series, which has had a serious impact on multipath routing. The IPv4 routing code (unlike the IPv6 one) selects the next hop in a round-robin fashion, so packets from a given source IP to a given destination IP don't always go via the same next hop. Before 3.6 the routing cache corrected that: once a next hop was selected it stayed in the cache, and all further packets from the same source to the same destination went through that next hop. Now the next hop is re-selected for each packet, which leads to strange things: with 2 equal-cost default routes in the routing table, each pointing to one internet provider, I can't even establish a TCP connection, because the initial SYN and the final ACK go via different routes, and because of NAT on each path they arrive at the destination as packets from different source addresses.

Is there any relatively easy way to restore the normal behaviour of multipath routing, so that the next hop is selected per-flow rather than per-packet? Are there patches around to make IPv4 next-hop selection hash-based, like it is for IPv6? Or how do you all deal with this?

Eugene
  • Do you have a "split access" setup similar to this here: http://lartc.org/howto/lartc.rpdb.multiple-links.html ? If so, what does your ruleset and routes look like? – the-wabbit Jun 08 '15 at 11:11
  • try to use "ip route get 173.194.112.247" multiple times and post the output – c4f4t0r Jun 08 '15 at 11:15
  • Thanks for the tasty question. :) First of all, you didn't give us an example, so I suppose you have something like `ip ro add 8.8.8.8/32 nexthop via 1.2.3.4 nexthop via 1.2.3.5`. Is that a correct assumption? – poige Jun 08 '15 at 11:28
  • Yes, that's correct, but usually it's `ip route add 0.0.0.0/0` with multiple next hops. – Eugene Jun 08 '15 at 14:26
  • the-wabbit, yes, exactly like that. "provider1" and "provider2" in my case are border routers connected to my internal network and to the providers' networks, and they do source NAT. On my internal router I just have a default gateway with 2 next hops pointing to provider1 and provider2, and no other routes. Firewall rules just allow some services (like HTTP) for client machines and block everything else. – Eugene Jun 08 '15 at 14:45

2 Answers


If possible, upgrade to a Linux kernel >= 4.4 ....

Hash-based multipath routing has been introduced, which in many ways is better than the pre-3.6 behaviour. It is flow-based, taking a hash of the source and destination IPs (ports are ignored) to keep the path steady for individual connections. One downside is that I believe there were various algorithms/config modes available pre-3.6, but now you get what you're given! You can affect the choice of path by weight, though.

If you are in my situation then you actually want the 3.6 <= kernel < 4.4 behaviour, but it is no longer supported.

If you do upgrade to >= 4.4 then this should do the trick, without all the other commands:

ip route add default proto static scope global \
    nexthop via <gw_1> weight 1 \
    nexthop via <gw_2> weight 1

Alternatively by device:

ip route add default proto static scope global \
    nexthop dev <if_1> weight 1 \
    nexthop dev <if_2> weight 1
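
As an illustration of the weight option, here is a sketch of an unequal split that should send roughly two thirds of the flows via <gw_1>; and on newer kernels (4.12 or later, if I remember correctly) a sysctl lets the multipath hash take the layer 4 ports into account as well:

# ~2/3 of flows via gw_1, ~1/3 via gw_2
ip route replace default proto static scope global \
    nexthop via <gw_1> weight 2 \
    nexthop via <gw_2> weight 1

# Kernel 4.12+ only: include source/destination ports in the multipath hash
sysctl -w net.ipv4.fib_multipath_hash_policy=1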
bao7uo

"Relatively easy" is a difficult term, but you might

  1. set up routing tables for each of your links - one table per link, with a single default gateway
  2. use netfilter to stamp identical marks on all packets of a single stream
  3. use ip rules to route the packets via different routing tables depending on the mark
  4. use a multi-nexthop weighted route to balance the first-in-a-session packets over your gateways/links.

There has been a discussion on the netfilter mailing list on this topic, which is where I am stealing the listings from:

1. Routing rules (RPDB and FIB)

ip route add default via <gw_1> table link1
ip route add <net_gw1> dev <dev_gw1> table link1
ip route add default via <gw_2> table link2
ip route add <net_gw2> dev <dev_gw2> table link2

/sbin/ip route add default proto static scope global table lb \
    nexthop via <gw_1> weight 1 \
    nexthop via <gw_2> weight 1

ip rule add prio 10 table main
ip rule add prio 20 from <net_gw1> table link1
ip rule add prio 21 from <net_gw2> table link2
ip rule add prio 50 fwmark 0x301 table link1
ip rule add prio 51 fwmark 0x302 table link2
ip rule add prio 100 table lb

ip route del default
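
Note that the link1, link2 and lb table names have to be declared to iproute2 before the commands above will work; assuming the numeric IDs below are not already in use on your system, something like this does it:

# Map the custom table names to (arbitrary, otherwise unused) table IDs
cat >> /etc/iproute2/rt_tables <<EOF
100 lb
101 link1
102 link2
EOF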

2. Firewall rules (using ipset to force a "flow" LB mode)

ipset create lb_link1 hash:ip,port,ip timeout 1200
ipset create lb_link2 hash:ip,port,ip timeout 1200

# Set firewall marks and ipset hash
iptables -t mangle -N SETMARK
iptables -t mangle -A SETMARK -o <if_gw1> -j MARK --set-mark 0x301
iptables -t mangle -A SETMARK -m mark --mark 0x301 \
          -m set ! --match-set lb_link1 src,dstport,dst \
          -j SET --add-set lb_link1 src,dstport,dst
iptables -t mangle -A SETMARK -o <if_gw2> -j MARK --set-mark 0x302
iptables -t mangle -A SETMARK -m mark --mark 0x302 \
          -m set ! --match-set lb_link2 src,dstport,dst \
          -j SET --add-set lb_link2 src,dstport,dst

# Reload marks by ipset hash
iptables -t mangle -N GETMARK
iptables -t mangle -A GETMARK -m mark --mark 0x0 \
          -m set --match-set lb_link1 src,dstport,dst -j MARK --set-mark 0x301
iptables -t mangle -A GETMARK -m mark --mark 0x0 \
          -m set --match-set lb_link2 src,dstport,dst -j MARK --set-mark 0x302

# Define and save firewall marks
iptables -t mangle -N CNTRACK
iptables -t mangle -A CNTRACK -o <if_gw1> -m mark --mark 0x0 -j SETMARK
iptables -t mangle -A CNTRACK -o <if_gw2> -m mark --mark 0x0 -j SETMARK
iptables -t mangle -A CNTRACK -m mark ! --mark 0x0 -j CONNMARK --save-mark
iptables -t mangle -A POSTROUTING -j CNTRACK

# Reload all firewall marks
# Use OUTPUT chain for local access (Squid proxy, for example)
iptables -t mangle -A OUTPUT -m mark --mark 0x0 -j CONNMARK --restore-mark
iptables -t mangle -A OUTPUT -m mark --mark 0x0 -j GETMARK
iptables -t mangle -A PREROUTING -m mark --mark 0x0 -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -m mark --mark 0x0 -j GETMARK
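
To verify that flows are actually being pinned to a link, you can for example look at the ipset contents and at the packet counters of the chains above (these are just generic inspection commands, not part of the original listing):

# Flows show up here once they have been assigned to a link
ipset list lb_link1
ipset list lb_link2

# Per-rule packet/byte counters for the mark handling
iptables -t mangle -L CNTRACK -v -n
iptables -t mangle -L GETMARK -v -n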

You might want to follow the netfilter mailing list discussion for some variations of the above.

the-wabbit
  • Not sure, but it might be simpler to use `u32` to get the important parameters hashed and then a "label" assigned for the `ip rule`s – poige Jun 08 '15 at 13:31
  • Thank you, but that looks like a pretty complex solution. What I don't quite understand is which piece here is responsible for "stamp identical marks on all packets of a single stream"? How does that ipset magic work? I thought ipset was just a set of particular IPs that is hashed and can be matched in rules. – Eugene Jun 08 '15 at 14:36
  • You are right about `ipset` - it is just creating sets which are filled using `--add-set` and matched against using `--match-set` - but this is mostly for the connections in NEW state. For ESTABLISHED state connections the mark is stamped on the packets using the `--restore-mark` parameter of the `CONNMARK` target - this directive is copying the connection's mark to the packet. The connection's mark is previously set using `--save-mark` in the `POSTROUTING` chain (where packets belonging to NEW connections would pass through). The script seems overly convoluted to me, but it conveys the idea. – the-wabbit Jun 08 '15 at 16:24
  • Yes, now I get the idea, I think. One last question: do you understand why the kernel developers didn't introduce hash-based next hop selection for IPv4? Is there some reason for not implementing it along with the route cache removal? The similar solution for IPv6 works quite well. Isn't all that connmark magic overkill for such a simple task? – Eugene Jun 08 '15 at 23:31
  • @Eugene unfortunately, I am far from being close enough to IP stack development (or Linux kernel development in general) to authoritatively answer any of your questions, but I would speculate that multipathing using different providers with IPv4 was considered too much of a corner case to put any more work into. Using netfilter CONNMARKs obviously looks like a nasty kludge, but it might even have been considered a "usable workaround" in the decision to drop the route cache code. – the-wabbit Jun 12 '15 at 12:52
  • @Eugene Kernel developers have done this now! See my answer. – bao7uo Dec 12 '16 at 22:00