Network traffic doesn't appear to leave the trunk

Question

I'm in the process of staging up some new virtualization servers, and part of that is to get some higher-bandwidth pipes into them. The ultimate goal is to bind 4 GigE ports into a single trunk carrying 802.1q tagged traffic. I can get that far; however, I've run into a strange problem. But first, a diagram.

----------       ----------  1GbE trunks 
|        | 10GbE |        | ------------- --------
|  SW1   |-------|   SW2  | ------------- | VM1  |
|        |       |        | ------------- --------
----------       ----------
     |                |  1GbE  -----------
     | 1GbE           |--------| client2 |
     |                         -----------
----------
|        | 1GbE -----------
|  SW3   |------| client1 |
|        |      -----------
----------

All the switches are HP ProCurve 2910al switches and are not stacked. Client2 in the above diagram is in the same VLAN as VM1 is. Client1 is in a different VLAN. For the VM machine (CentOS 6) both iptables and SELinux have been disabled.

My problem is that when trunking is involved, two-way network traffic is impossible when talking to either client machine. tcpdump on the clients shows that pings are received and ECHO REPLY packets are sent, but the VM host never sees the replies. At the same time, if I try to ping the VM host from a client machine, that doesn't work either. The fact that I can't ping client2, which is on the same subnet, suggests something is screwy in the network layer somewhere.
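For readers reproducing this check, a capture along the following lines can show whether echo replies ever reach the bonded interface. This is a minimal sketch, not the exact command used here: the helper name `icmp_capture_filter` is hypothetical, the interface name `bond0` and VLAN 70 are assumed from the rest of this thread, and depending on the kernel and NIC VLAN offload the `vlan` primitive may not match tagged frames.

```shell
#!/bin/sh
# Hypothetical helper: build a pcap filter that matches ICMP both untagged
# and inside 802.1Q frames carrying the given VLAN ID.
icmp_capture_filter() {
    vlan_id=$1
    # "vlan N" shifts later matches past the tag, so the untagged test comes first.
    printf 'icmp or (vlan %s and icmp)\n' "$vlan_id"
}

# Usage (needs root): -e prints link-layer headers including VLAN tags,
# -n skips name resolution.
# tcpdump -e -n -i bond0 "$(icmp_capture_filter 70)"
```

Running the same capture on each slave interface as well as on the bond may help narrow down where the replies are dropped.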

Strangely, from the VM host I can ping the gateway IPs on any of the switches. If I use a single interface, everything works fine both with and without VLAN tagging. If I bond just a single interface and turn on VLAN tagging on that bond, I can go anywhere. Build a trunk, and I'm limited to the switch fabric.

The type of trunk doesn't seem to matter. Right now they're configured as mode 0 trunks (balance-rr), though using LACP (802.3ad) behaves the same way.

vlan 70 
   name "Virtualization Subnet" 
   untagged 35,36,38,40 
   tagged Trk1-Trk2,Trk5,Trk8 
   no ip address 
   jumbo 
   exit

That's the VLAN config on SW2 up there. SW1's VLAN 70 definition has the "ip address" defined on it. The above snippet is in the fully-untrunked mode. When I'm trunked:

trunk 35-36,38,40 Trk16 trunk
vlan 70 
   name "Virtualization Subnet" 
   tagged Trk1-Trk2,Trk5,Trk8,Trk16
   no ip address 
   jumbo 
   exit

The LACP (802.3ad) version swaps the trunk definition for trunk 35-36,38,40 Trk16 lacp but, as I said, that doesn't change how the problem presents.

Client2 is actually connected to SW1, but putting it there in the chart would have made formatting trickier. In any case, the only thing in the Interface stanza is a name directive; it is listed as an untagged port in the vlan 70 stanza for SW1.

What am I missing?

Can you post the VLAN stanza's of your Procurve switches? And also what ports the hypervisor (aka VM)1, clients 1 & 2 are using? — jftuga, Jul 26 '11 at 17:29
For switches sw1,2,3 are all of the uplink trunk'd ports (to other switches) tagged in vlan 70? Also, what does tracert show you? — jftuga, Jul 26 '11 at 17:54
@jftuga Yes, all of the inter-switch links are trunked and tagged. SW3 does NOT have VLAN 70 on it. Traceroute shows little of interest, the trace dies at the hop when it would get to the VM host. Also, from within the switch itself I can't ping the VM host IP address when trunked. I'll see if I can get something in place to sniff that set of trunked ports. — sysadmin1138, Jul 26 '11 at 18:07
You say that this is a VM, as in Virtual Machine? Are you running this on ESX(i)? — pauska, Jul 26 '11 at 18:14
@Pauska Haven't gotten that far yet. This is bare-iron CentOS 6 right now. HP BL465G7. — sysadmin1138, Jul 26 '11 at 18:15
@sysadmin1138: Oh, ok. LACP is not supported on ESX(i), so that's why I asked. Perhaps you could join the comms room on the chat, so that we can discuss this without the "noise" it generates? — pauska, Jul 26 '11 at 18:19

score 7 · Accepted Answer · edited Apr 13 '17 at 12:14

After a long debate in chat involving MikeyB, Pauska, and ChrisS, the problem ended up being two-fold:

A possible bug in CentOS 6: service network restart was not changing the module options for the bonding module, so the bond wasn't tracking my changes between LACP mode (4) and round-robin mode (0).
Round-robin mode doesn't like to work with ProCurve switches once more than one switch port in the trunk is enabled.

Once I forced the bonded interface into LACP (802.3ad) mode with these commands:

ifconfig bond0 down
echo "4" > /sys/class/net/bond0/bonding/mode
ifconfig bond0 up

the server and the switch started talking. At that point, starting with only one interface enabled on the switch, traffic started working normally. Enabling the second, third, and finally the fourth interface kept traffic working.
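To confirm which mode the bond is actually running after a change like this, the sysfs file written above can be read back. The following is a minimal sketch: the function name `bond_mode_name` and its optional second argument (an alternate sysfs root, used only for testing) are hypothetical and not part of the original answer.

```shell
#!/bin/sh
# Print the running mode name of a bond, e.g. "802.3ad" or "balance-rr".
# $1 = bond name; $2 = optional sysfs root (hypothetical override for testing).
bond_mode_name() {
    mode_file="${2:-/sys/class/net}/$1/bonding/mode"
    [ -r "$mode_file" ] || return 1
    # The file holds the mode name followed by its number, e.g. "802.3ad 4".
    read -r mode_name mode_number < "$mode_file"
    printf '%s\n' "$mode_name"
}

# Usage:
# bond_mode_name bond0
# cat /proc/net/bonding/bond0   # per-slave state and LACP partner details
```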

Ultimately, LACP mode is what made things work. The clue was that round-robin mode worked when there was only one enabled switch port in the trunk. The server survives a reboot and comes up in the correct mode. However, a service network restart does not cause the MODE="4" part of the ifcfg-bond0 file in /etc/sysconfig/network-scripts/ to take effect. If that mode changes, the bond will stay in whatever mode was set at boot (or, more likely, at module-load time of the bonding module).
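Because a network restart can leave the bond in its old mode, it may be worth checking the running mode against the intended one after each restart. This is a sketch under assumptions: `bond_mode_is` is a hypothetical helper, the expected mode number is passed by hand rather than parsed from ifcfg-bond0, and the third argument exists only so the check can be exercised against a fake sysfs tree.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the running mode number equals the expected one.
# $1 = bond name; $2 = expected mode number; $3 = optional sysfs root (for testing).
bond_mode_is() {
    mode_file="${3:-/sys/class/net}/$1/bonding/mode"
    [ -r "$mode_file" ] || return 2
    # File format is "<name> <number>", e.g. "balance-rr 0".
    read -r _name running_number < "$mode_file"
    [ "$running_number" = "$2" ]
}

# Usage after "service network restart":
# bond_mode_is bond0 4 || echo "bond0 is not in mode 4; re-apply it via sysfs" >&2
```

If the check fails, the down / echo / up sequence from earlier in this answer is the way the mode was forced in this case.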

A very professional question and answer. Bound to help someone. — artifex, Jul 28 '11 at 05:51

score 0 · Answer 2 · answered Jul 26 '11 at 18:10

You have in your config:

trunk 35-36,38,40 Trk16 trunk
vlan 70 
   name "Virtualization Subnet" 
   tagged Trk1-Trk2,Trk5,Trk8,Trk16
   no ip address 
   jumbo 
   exit

Shouldn't that be:

   untagged Trk16
   tagged Trk1-Trk2,Trk5,Trk8

MikeyB

Well, there is an error in the original post, but not what you're suggesting. Under the untrunked config there should be a "untagged Trk16" on vlan 70. – pauska Jul 26 '11 at 18:15
I've tried that variant as well. Both variants behave the same way: neither works. Using `untagged 35-36,38,40` and `tagged 35-36,38,40...` both work so long as I don't try to aggregate interfaces on the Linux server. `untagged Trk16` and `tagged Trk16...` both don't work. – sysadmin1138 Jul 26 '11 at 18:17
Running Xen? Does Centos 6 still muck with the interface definitions? I recall a problem I had where the vlan interfaces were created off the incorrect interface (the phys instead of the bridge or vice-versa) and weird things happened. – MikeyB Jul 26 '11 at 18:22