I have a bug on some servers where LACP (802.3ad) is not working.
All servers have a bonding device bond0 with two eth slaves. Each interface is plugged into a different switch, and both switches are configured with LACP.
Everything seems to be OK, but a network engineer noticed that some MLAGs (Arista LACP implementation) were not working even though the physical devices were up.
When I looked at /proc/net/bonding/bond0 on the affected servers, I found that each interface has a different Aggregator ID. On nominal servers the Aggregator ID is the same for both interfaces.
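To make that comparison repeatable across servers, here is a minimal sketch in shell. It assumes the usual layout of the bonding status file, where each slave section starts with a `Slave Interface:` line and contains an `Aggregator ID:` line. The function name `check_aggregators` is hypothetical, not an existing tool.

```shell
#!/bin/sh
# Print the Aggregator ID of each slave and report whether all slaves
# ended up in the same aggregator.
# Usage: check_aggregators [file]   (default: /proc/net/bonding/bond0)
check_aggregators() {
    awk '
        /^Slave Interface:/ { slave = $3 }
        # The "Active Aggregator Info" block also has an Aggregator ID line.
        # It comes before any slave section, so it is skipped while slave is empty.
        /Aggregator ID:/ && slave != "" {
            id = $NF
            print slave ": aggregator " id
            if (!(id in seen)) { seen[id] = 1; distinct++ }
            slave = ""
        }
        END {
            if (distinct == 0) { print "no slave sections found"; exit 2 }
            if (distinct > 1)  { print "MISMATCH: slaves are in different aggregators"; exit 1 }
            print "OK: all slaves share one aggregator"
        }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

Running `check_aggregators` with no argument reads bond0; the exit status is 0 when all slaves share one aggregator and 1 on a mismatch, so it can be used in a monitoring check.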
The issue can be reproduced by switching the port off and on again on the switch: even though the physical link comes back up, MLAG stays down. The bug is present on RHEL 6 and 7 (but not all servers are affected).
Configuration
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
MACADDR=14:02:ec:44:e9:80
IPADDR=xxx.xxx.xxx.xxx
NETMASK=xxx.xxx.xxx.xxx
BONDING_OPTS="mode=802.3ad lacp_rate=slow xmit_hash_policy=layer3+4"
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
NM_CONTROLLED=no
PEERDNS=no
# /etc/sysconfig/network-scripts/ifcfg-eno49 (same for other interface)
HWADDR=14:02:ec:44:e9:80
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
NM_CONTROLLED=no
PEERDNS=no
We have a workaround for now: bringing the eth interface down and up again on the server. This is not ideal.
To check the LACP protocol, I ran:
tcpdump -i eno49 -tt -vv -nnn ether host 01:80:c2:00:00:02
I can see a packet every 30 seconds on one interface, but on the other I see a packet every second, as if it were trying to establish an LACP session.
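To put a number on that difference instead of eyeballing the capture, a rough sketch: the hypothetical helper `lacp_mean_interval` below reads `tcpdump -tt` output on stdin and prints the mean gap between packets. It assumes each packet starts on a line beginning with an epoch timestamp, which is what `-tt` prints; continuation lines from `-vv` are ignored.

```shell
#!/bin/sh
# Read `tcpdump -tt` output on stdin and print the packet count and the
# mean interval between consecutive packets, in seconds.
lacp_mean_interval() {
    awk '
        # Only lines that start with an epoch timestamp begin a new packet.
        /^[0-9]+\.[0-9]+ / {
            if (n > 0) total += $1 - prev
            prev = $1
            n++
        }
        END {
            if (n < 2) { print "not enough packets"; exit 1 }
            printf "%d packets, mean interval %.1f s\n", n, total / (n - 1)
        }
    '
}
```

Example use on one slave, capturing for two minutes: `timeout 120 tcpdump -l -i eno49 -tt -nn ether host 01:80:c2:00:00:02 2>/dev/null | lacp_mean_interval`. The capture includes LACPDUs in both directions unless the filter is narrowed, so the figure is only a rough indicator: a mean near 1 second matches the fast rate seen on the failing interface, while a much longer mean matches the slow rate.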
Do you have a way to troubleshoot and fix that?
(Sorry if I did not use the right networking terms; I'm not really skilled in LACP.)
Thanks