How exactly & specifically does layer 3 LACP destination address hashing work?

Question

Based on an earlier question from over a year ago (Multiplexed 1 Gbps Ethernet?), I went off and set up a new rack with a new ISP, with LACP links all over the place. We need this because we have individual servers (one application, one IP) serving thousands of client computers all over the Internet, in excess of 1 Gbps cumulative.

This LACP idea is supposed to let us break the 1 Gbps barrier without spending a fortune on 10GbE switches and NICs. Unfortunately, I've run into some problems with outbound traffic distribution. (This despite Kevin Kuphal's warning in the above linked question.)

The ISP's router is a Cisco of some sort. (I deduced that from the MAC address.) My switch is an HP ProCurve 2510G-24. And the servers are HP DL380 G5s running Debian Lenny. One server is a hot standby. Our application cannot be clustered. Here is a simplified network diagram that includes all relevant network nodes with IPs, MACs and interfaces.

alt text

While it has all the detail it is a bit hard to work with and describe my problem. So, for simplicity's sake, here is a network diagram reduced to the nodes and physical links.

alt text

So I went off and installed my kit at the new rack and connected my ISP's cabling from their router. Both servers have an LACP link to my switch, and the switch has an LACP link to the ISP router. Right from the start I realized that my LACP configuration was not correct: testing showed all traffic to and from each server was going over one physical GbE link exclusively, both server-to-switch and switch-to-router.

alt text

With some Google searches and lots of RTFM time regarding Linux NIC bonding, I discovered that I could control the NIC bonding by modifying /etc/modules:

# /etc/modules: kernel modules to load at boot time.
# mode=4 is for lacp
# xmit_hash_policy=1 means to use layer3+4(TCP/IP src/dst) & not default layer2 
bonding mode=4 miimon=100 max_bonds=2 xmit_hash_policy=1

loop

This got the traffic leaving my server over both NICs as expected. But the traffic was moving from the switch to router over only one physical link, still.

alt text

We need that traffic going over both physical links. After reading and rereading the 2510G-24's Management and Configuration Guide, I find:

[LACP uses] source-destination address pairs (SA/DA) for distributing outbound traffic over trunked links. SA/DA (source address/destination address) causes the switch to distribute outbound traffic to the links within the trunk group on the basis of source/ destination address pairs. That is, the switch sends traffic from the same source address to the same destination address through the same trunked link, and sends traffic from the same source address to a different destination address through a different link, depending on the rotation of path assignments among the links in the trunk.

It seems that a bonded link presents only one MAC address, and therefore my server-to-router path is always going to be over one path from switch-to-router because the switch sees but one MAC (and not two--one from each port) for both LACP'd links.

Got it. But this is what I want:

alt text

A more expensive HP ProCurve switch, the 2910al, uses layer 3 source & destination addresses in its hash. From the "Outbound Traffic Distribution Across Trunked Links" section of the ProCurve 2910al's Management and Configuration Guide:

The actual distribution of the traffic through a trunk depends on a calculation using bits from the Source Address and Destination address. When an IP address is available, the calculation includes the last five bits of the IP source address and IP destination address, otherwise the MAC addresses are used.
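To make the quoted behaviour concrete, here is a rough Python sketch of a hash built from the last five bits of each IP address. The guide only says the calculation "includes" those bits, so the XOR-then-modulo combination, the function names, and the documentation-range addresses below are all assumptions for illustration, not HP's actual algorithm.

```python
import ipaddress
from collections import Counter

def low5(ip: str) -> int:
    """Return the last five bits of an IPv4 address."""
    return int(ipaddress.IPv4Address(ip)) & 0b11111

def pick_link(src_ip: str, dst_ip: str, links: int) -> int:
    # ASSUMPTION: XOR followed by modulo is a guess at how the bits are
    # combined; the ProCurve guide does not spell this out.
    return (low5(src_ip) ^ low5(dst_ip)) % links

server = "192.0.2.10"                                       # hypothetical fixed source
clients = [f"198.51.100.{host}" for host in range(1, 255)]  # hypothetical clients

usage = Counter(pick_link(server, client, links=2) for client in clients)
print(dict(usage))
```

With one fixed source and 254 different destinations, this toy hash splits the flows evenly across two links, which is the behaviour I am hoping for if the destination in question is the client's IP.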

OK. So, for this to work the way I want it to, the destination address is the key since my source address is fixed. This leads on to my question:

How exactly & specifically does layer 3 LACP hashing work?

I need to know which destination address is used:

the client's IP, the end destination?
Or the router's IP, the next physical link transmission destination.

We've not gone off and bought a replacement switch yet. Please help me understand exactly if the layer 3 LACP destination address hashing is or is not what I need. Buying another useless switch is not an option.

Excellent, well researched question! Unfortunately, I don't know the answer... — Doug Luxem, Aug 19 '10 at 20:05
Can you look at the spanning tree cost of each bridge / trunk on the ProCurve? — dbasnett, Aug 19 '10 at 20:16
Also the state and priority? It seems that when HP <---> Cisco that the trunks may not have the same priority and end up blocked. An advertisement for not mixing vendors???? — dbasnett, Aug 19 '10 at 20:36
This is possibly the best formatted question I've seen on Server Fault — sclarson, Aug 19 '10 at 20:46
I hope someone can take the same amount of care over the answer as was lavished on the question. — Neil Trodden, Aug 19 '10 at 20:49

score 14 · Accepted Answer · edited Apr 13 '17 at 12:14

What you're looking for is commonly called a "transmit hash policy" or "transmit hash algorithm". It controls the selection of a port from a group of aggregate ports with which to transmit a frame.

Getting my hands on the 802.3ad standard has proven difficult because I'm not willing to spend money on it. Having said that, I've been able to glean some information from a semi-official source that sheds some light on what you're looking for. Per this presentation from the 2007 Ottawa, ON, CA IEEE High Speed Study Group meeting, the 802.3ad standard does not mandate particular algorithms for the "frame distributor":

This standard does not mandate any particular distribution algorithm(s); however, any distribution algorithm shall ensure that, when frames are received by a Frame Collector as specified in 43.2.3, the algorithm shall not cause a) Mis-ordering of frames that are part of any given conversation, or b) Duplication of frames. The above requirement to maintain frame ordering is met by ensuring that all frames that compose a given conversation are transmitted on a single link in the order that they are generated by the MAC Client; hence, this requirement does not involve the addition (or modification) of any information to the MAC frame, nor any buffering or processing on the part of the corresponding Frame Collector in order to re-order frames.

So, whatever algorithm a switch / NIC driver uses to distribute transmitted frames must adhere to the requirements as stated in that presentation (which, presumably, was quoting from the standard). There is no particular algorithm specified, only a compliant behavior defined.

Even though there's no algorithm specified, we can look at a particular implementation to get a feel for how such an algorithm might work. The Linux kernel "bonding" driver, for example, has an 802.3ad-compliant transmit hash policy that applies the following function (see bonding.txt in the Documentation/networking directory of the kernel source):

Destination Port = (((<source IP> XOR <dest IP>) AND 0xFFFF)
    XOR (<source MAC> XOR <destination MAC>)) MOD <ports in aggregate group>

This causes both the source and destination IP addresses, as well as the source and destination MAC addresses, to influence the port selection.
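As a rough illustration, the formula above can be transcribed almost literally into Python. This is a sketch of the documented formula, not the kernel's code: treating each MAC as a 48-bit integer is my assumption, and the addresses are made-up values from the documentation ranges.

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))

def mac_to_int(mac: str) -> int:
    # ASSUMPTION: the whole MAC is treated as one 48-bit integer.
    return int(mac.replace(":", ""), 16)

def layer2_3_port(src_ip, dst_ip, src_mac, dst_mac, ports):
    """Literal reading of the layer 2+3 formula quoted above."""
    ip_part = (ip_to_int(src_ip) ^ ip_to_int(dst_ip)) & 0xFFFF
    mac_part = mac_to_int(src_mac) ^ mac_to_int(dst_mac)
    return (ip_part ^ mac_part) % ports

SERVER_IP, SERVER_MAC = "192.0.2.10", "00:00:5e:00:53:02"  # hypothetical
ROUTER_MAC = "00:00:5e:00:53:01"                           # hypothetical next hop

# Same MAC pair for every frame; only the client IP changes.
port_a = layer2_3_port(SERVER_IP, "198.51.100.7", SERVER_MAC, ROUTER_MAC, 2)
port_b = layer2_3_port(SERVER_IP, "198.51.100.8", SERVER_MAC, ROUTER_MAC, 2)
print(port_a, port_b)
```

Even though the MAC pair is identical for both frames, the two client addresses land on different ports, because the IP term changes the result.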

The destination IP address used in this type of hashing would be the address that's present in the frame. Take a second to think about that. The router's IP address isn't encapsulated anywhere in an Ethernet frame headed from your server to the Internet. The router's MAC address is present in the header of such a frame, but the router's IP address isn't. The destination IP address encapsulated in the frame's payload will be the address of the Internet client making the request to your server.
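A small sketch may help here. It builds a bare Ethernet + IPv4 header pair for a frame leaving the server and then reads back the two "destinations" a hashing device could see. All addresses are hypothetical documentation-range values, and the IP checksum is left at zero because it doesn't matter for this illustration.

```python
import socket
import struct

ETH_HEADER_LEN = 14                    # dst MAC (6) + src MAC (6) + EtherType (2)
IP_DST_OFFSET = ETH_HEADER_LEN + 16    # dst IP sits 16 bytes into the IPv4 header

ROUTER_MAC = "00:00:5e:00:53:01"       # hypothetical next hop
SERVER_MAC = "00:00:5e:00:53:02"       # hypothetical server NIC
SERVER_IP = "192.0.2.10"               # hypothetical server
CLIENT_IP = "198.51.100.7"             # hypothetical Internet client

def mac_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))

def build_frame(dst_mac, src_mac, src_ip, dst_ip) -> bytes:
    eth = mac_bytes(dst_mac) + mac_bytes(src_mac) + struct.pack("!H", 0x0800)
    # Minimal 20-byte IPv4 header: version/IHL, TOS, length, id, flags/frag,
    # TTL, protocol (6 = TCP), checksum (not computed), src IP, dst IP.
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 6, 0,
                     socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    return eth + ip

def destinations(frame: bytes):
    dst_mac = ":".join(f"{b:02x}" for b in frame[0:6])
    dst_ip = socket.inet_ntoa(frame[IP_DST_OFFSET:IP_DST_OFFSET + 4])
    return dst_mac, dst_ip

frame = build_frame(ROUTER_MAC, SERVER_MAC, SERVER_IP, CLIENT_IP)
print(destinations(frame))  # ('00:00:5e:00:53:01', '198.51.100.7')
```

The layer 2 destination is the router, but the only destination IP anywhere in the frame is the client's.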

A transmit hash policy that takes into account both source and destination IP addresses, assuming you have a widely varied pool of clients, should do pretty well for you. In general, more widely varied source and/or destination IP addresses in the traffic flowing across such an aggregated infrastructure will result in more efficient aggregation when a layer 3-based transmit hash policy is used.

Your diagrams show requests coming directly to the servers from the Internet, but it's worth pointing out what a proxy might do to the situation. If you're proxying client requests to your servers then, as chris explains in his answer, you may cause bottlenecks. If that proxy is making the request from its own source IP address, instead of from the Internet client's IP address, you'll have fewer possible "flows" in a strictly layer 3-based transmit hash policy.

A transmit hash policy could also take layer 4 information (TCP / UDP port numbers) into account, so long as it kept with the requirements in the 802.3ad standard. Such an algorithm is in the Linux kernel, as you reference in your question. Beware that the documentation for that algorithm warns that, due to fragmentation, traffic may not necessarily flow along the same path and, as such, the algorithm isn't strictly 802.3ad-compliant.
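The fragmentation caveat is easier to see with a sketch. The function below follows my reading of the layer3+4 formula in the bonding documentation of that era: source port XOR destination port, XORed with the low 16 bits of the IP XOR, with the port terms left out for fragments. Treat it as an illustration under those assumptions rather than the kernel's implementation; the addresses and ports are invented.

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))

def layer3_4_port(src_ip, dst_ip, src_port, dst_port, ports, fragmented=False):
    ip_part = (ip_to_int(src_ip) ^ ip_to_int(dst_ip)) & 0xFFFF
    if fragmented:
        # Fragments do not carry a usable TCP/UDP header, so the port
        # numbers drop out of the hash.
        return ip_part % ports
    return ((src_port ^ dst_port) ^ ip_part) % ports

SERVER_IP, CLIENT_IP = "192.0.2.10", "198.51.100.7"   # hypothetical

whole = layer3_4_port(SERVER_IP, CLIENT_IP, 80, 50001, ports=2)
frag = layer3_4_port(SERVER_IP, CLIENT_IP, 80, 50001, ports=2, fragmented=True)
print(whole, frag)
```

For this particular conversation the unfragmented packets and the fragments hash to different ports, which is exactly the kind of potential mis-ordering the 802.3ad text forbids.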

Yes, I have sorted out the Linux server's *"transmit hash policy"*. (A very educational experience that made this question possible.) It is the darn switch that has me in a pickle. Thanks for the info on IP frames. I'm a bit weak on the lower levels of the network stack. In my mind the frame was addressed to the router, with the destination deeper in the payload. :P — Stu Thompson, Aug 20 '10 at 09:52

score 5 · Answer 2 · answered Jun 16 '11 at 12:30

Very surprisingly, a few days ago our testing showed that xmit_hash_policy=layer3+4 had no effect between two directly connected Linux servers: all traffic used one port. Both run Xen with one bridge that has the bonding device as a member. Most obviously, the bridge could be causing the problem; it just does not make sense AT ALL, considering that IP+port based hashing would be used.

I know some people actually manage to push 180MB+ over bonded links (e.g. Ceph users), so it does work in general. Possible things to look at:

We used old CentOS 5.4
The OP's example would mean the second LACP "unhashes" the connections. Does that make sense, ever?

What this thread and documentation reading etc etc has shown me:

Generally everyone knows a lot about this, is good at reciting theory from the bonding howto or even the IEEE standards, whereas practical experience is close to none.
The RHEL documentation is incomplete at best.
The bonding documentation is from 2001 and not current enough
layer2+3 mode is apparently not in CentOS (it doesn't show in modinfo, and in our test it dropped all traffic when enabled)
It does not help that SUSE (BONDING_MODULE_OPTS), Debian (-o bondXX) and RedHat (BONDING_OPTS) all have different ways to specify per-bond mode settings
The CentOS/RHEL5 kernel module is "SMP safe" but not "SMP capable" (see the Facebook high-performance talk): it does NOT scale above one CPU, so with bonding, a higher CPU clock beats many cores

If anyone ends up with a good high-performance bonding setup, or really knows what they're talking about, it would be awesome if they took half an hour to write a new small howto that documents ONE working example using LACP, with no odd stuff and bandwidth > one link.

It gets worse: Different versions of Debian have different methods for configuring bonding! I've actually documented how I setup my bonding in a blog post, which seems to get decent traffic. — Stu Thompson, Jun 16 '11 at 13:05

score 2 · Answer 3 · answered Aug 19 '10 at 19:53

If your switch sees the true L3 destination, it can hash on that. Basically if you've got 2 links, think link 1 is for odd numbered destinations, link 2 is for even numbered destinations. I don't think they ever use the next-hop IP unless configured to do so, but that's pretty much the same as using the MAC address of the target.

The problem you're going to run into is that, depending on your traffic, the destination will always be the single server's single IP address so you'll never use that other link. If the destination is the remote system on the internet, you'll get even distribution, but if it is something like a web server, where your system is the destination address, the switch will always send traffic over only one of the available links.

You'll be in even worse shape if there is a load balancer somewhere in there, because then the "remote" IP will always be either the load balancer's IP or the server. You could get around that a bit by using lots of IP addresses on the load balancer and the server, but that's a hack.

You may want to expand your horizon of vendors a bit. Other vendors, such as Extreme Networks, can hash on things like:

L3_L4 algorithm—Layer 3 and Layer 4, the combined source and destination IP addresses and source and destination TCP and UDP port numbers. Available on SummitStack and Summit X250e, X450a, X450e, and X650 series switches.

So basically as long as the client's source port (which typically changes a lot) changes, you'll evenly distribute the traffic. I'm sure other vendors have similar features.
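Here is a rough sketch of why the port numbers matter when the remote IP never changes (the load balancer or proxy case). The XOR-and-modulo hash is a stand-in I made up for illustration; it is not Extreme's or any other vendor's real algorithm, and the addresses are hypothetical.

```python
import ipaddress
from collections import Counter

def ip_xor(src_ip: str, dst_ip: str) -> int:
    return (int(ipaddress.IPv4Address(src_ip)) ^
            int(ipaddress.IPv4Address(dst_ip))) & 0xFFFF

def l3_link(src_ip, dst_ip, links):
    # Layer 3 only: the same IP pair always maps to the same link.
    return ip_xor(src_ip, dst_ip) % links

def l3_l4_link(src_ip, dst_ip, src_port, dst_port, links):
    # Layer 3 + 4: port numbers are mixed in (hypothetical combination).
    return (ip_xor(src_ip, dst_ip) ^ src_port ^ dst_port) % links

SERVER, PROXY = "192.0.2.10", "203.0.113.5"        # hypothetical addresses
client_ports = range(49152, 49152 + 1000)          # 1000 ephemeral ports

l3_usage = Counter(l3_link(SERVER, PROXY, 2) for _ in client_ports)
l34_usage = Counter(l3_l4_link(SERVER, PROXY, 80, port, 2) for port in client_ports)
print(dict(l3_usage), dict(l34_usage))
```

With a single remote IP, the layer 3 hash puts all 1000 flows on one link, while the layer 3+4 variant splits them evenly because the client's source port keeps changing.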

Even hashing on source and destination IP would be enough to avoid hot-spots, so long as you don't have a load balancer in the mix.

Thanks. No load balancing. And I'm not worried about inbound traffic--we have a >50:1 out:in traffic ratio. (It's a Web video application.) — Stu Thompson, Aug 20 '10 at 09:45
I think in your case the hash on destination won't get you anything since the switch will see the destination as your server. L2 traffic engineering just isn't very good. And 'hash' in this sort of application is going to be pretty primitive -- figure the best you can do is add up all the bits in whatever address(es) are in use and if the result is 0 go out one link or 1 go out the other. — chris, Aug 20 '10 at 13:17
As I understand it from my above ProCurve 2910al quote, the hash is on the last five bits of the source *and* destination. So, no matter if one (my server) is fixed, the other is going to vary for almost every client at Level 3. Level 2? That is my current problem--there is only one source and one destination address to hash against. — Stu Thompson, Aug 20 '10 at 14:01

score 0 · Answer 4 · answered Aug 19 '10 at 16:47

I will guess that it's off of the client IP, not the router. The real source and destination IPs will be at a fixed offset in the packet, and that's going to be fast to do hashing on. Hashing the router IP would require a lookup based on the MAC, right?

score 0 · Answer 5 · answered Mar 26 '16 at 18:24

Since I just ended up back here, a few things I learned by now: To avoid gray hair, you need a decent switch that supports a layer3+4 policy, and the same also in Linux.

In quite a few cases the standards-perverting blowtorch called ALB/SLB (mode6) might work better. Operationally it sucks though.

Myself I try to use 3+4 where possible, since I often want that bandwidth between two adjacent systems.

I've also tried with Open vSwitch and had one instance where that disrupted traffic flows (every first packet lost... I have no idea why).
