
OK, here's my situation.

[Image: network diagram (not available)]

This is on the internet. The 6224 is the router in this picture and physically resides in Kanata.

Both VLAN 1697 and VLAN 3994 are provided by an internet service provider, over a single 1 Gb Ethernet link.

The Kanata hosts are directly attached to the 6224; the other two sites are remote.

VLAN 3994 is a single IP address space, so theoretically it shouldn't matter physically where the hosts on that subnet are.

Here's the problem.

I have a monitoring system which is connected further into the internet, so probes from the monitor would come in to this diagram on the 1697 VLAN.

When I ping hosts at Albert or Bells Corners from the internet, there is 0 loss. The connection looks perfect.

When I ping hosts at Kanata, I lose anywhere from 10 to 40% of the pings. The loss is not predictable, but when I do lose pings, I always lose at least 3 in a row, usually 4, rarely more.

I have attached a monitor directly to the 6224 in Kanata on 3994.

When the monitor pings the 6224 routing interface, I see exactly the same loss pattern -- but NOT at the same time as the loss from the remote system. Ping time is around 1ms.

When the monitor pings another system directly attached to the 6224, there is 0 loss. Ping time is about 0.1ms, one-tenth of the time to ping the router.

Anyone know what is going on here?

Update to make things less clear maybe

What seems to be going on is that traffic that comes in and goes out the ISP's connection is fine. Traffic that is going from the router brain to the switching brain (or back, maybe) is what is having the problem.

I can't blame the ISP because internet access to/from the two remote sites is solid. It is only hosts which are directly attached to the 6224 which are having issues.

Update 2

OK, after a lot of time staring at traces, I have a more specific symptom.

I did a tcpdump on vlan 3994 of the ISP uplink looking for my own address on the theory that all I should see is broadcast traffic going to the remote sites. Instead, I saw the packets that I would have expected to see on my system's interface going down the TLS on this VLAN.

So:

For some reason, the 6224 frequently thinks that my system is at the far end of the TLS.

When I inspect the switching-table when things are working, my entry looks like this:

3994     0007.E924.F714        2/g16      Dynamic

…which makes sense since it is plugged into port 16. However, when it is broken, it looks like this:

3994     0007.E924.F714        2/g22      Dynamic
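For anyone wanting to catch this kind of move automatically, here is a minimal sketch that compares two saved copies of the switching table and reports any MAC whose port changed. The row layout in the regex is an assumption based only on the two sample rows above, and `parse_table` and `moved_entries` are hypothetical helper names, not part of any switch tooling.

```python
import re

# Assumed row layout, inferred from the two sample rows above:
# VLAN, MAC in dotted-hex form, port, entry type.
ROW = re.compile(
    r"^\s*(\d+)\s+([0-9A-Fa-f]{4}(?:\.[0-9A-Fa-f]{4}){2})\s+(\S+)\s+(\S+)\s*$"
)


def parse_table(text):
    """Return {(vlan, mac): port} for every row matching the assumed layout."""
    table = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            vlan, mac, port, _kind = m.groups()
            table[(int(vlan), mac.upper())] = port
    return table


def moved_entries(before, after):
    """List (vlan, mac, old_port, new_port) for entries whose port changed."""
    return [
        (vlan, mac, before[(vlan, mac)], port)
        for (vlan, mac), port in sorted(after.items())
        if (vlan, mac) in before and before[(vlan, mac)] != port
    ]


# The two rows quoted above, used as the "working" and "broken" snapshots.
good = "3994     0007.E924.F714        2/g16      Dynamic"
bad = "3994     0007.E924.F714        2/g22      Dynamic"

moves = moved_entries(parse_table(good), parse_table(bad))
for vlan, mac, old, new in moves:
    print(f"VLAN {vlan}: {mac} moved {old} -> {new}")
```

Run against periodic dumps of the table, a move from an access port to the uplink port is the thing to look for.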

Streams of misdirected packets seem to be led by a broadcast from my system. However, I see one broadcast leave my system, and two on the 3994 VLAN to the TLS. Usually it is an IGMP V2 Membership Report / Join Group 224.0.0.251, but sometimes it is the management chip on my system ARPing for itself (it does this every 2 seconds or so for reasons which are stupid).

This implies that there is a system in Bells Corners or Albert which is hearing my broadcast and echoing it back for some reason. So the 6224 goes "ah, this MAC must really be down the TLS link" and adjusts its switching-table accordingly.

Does this description of the problem ring any bells?

David Mackintosh
  • Hmmh - just to be clear you are seeing packets drop when trying to ping one set of addresses in the .129/26 subnet (which you use for VLAN 3994 addresses) from some other address in that subnet while most other addresses in that subnet are fine? Is it at all possible that there is simply a bucket load of traffic keeping the Kanata host network interfaces really busy? Do you have Broadcast Storm Control enabled and is there any chance that the Kanata hosts might be doing something bad that is triggering that? – Helvick Dec 22 '10 at 23:55
  • tcpdumps from the monitoring host directly attached show no storming, practically no broadcasting, and very little in the way of direct traffic. I would have thought that if the /30 address on the 6224 was being attacked, it would affect IPs at the remote sites too. – David Mackintosh Dec 23 '10 at 03:39
  • Just to eliminate a physical problem affecting a couple of ports, can you swap the physical connections the Kanata systems use with ones that you know work fine to see if the problem persists? Some of the scenarios you describe are happening within a L2 domain, which strongly indicates either a misconfiguration or a fault on individual ports. – Helvick Dec 23 '10 at 10:15
  • Problems seem consistent no matter what physical ports are in use. – David Mackintosh Dec 23 '10 at 15:08
  • Perhaps you've created a rather long loop? How are you transporting the VLAN3994 (layer 2) to the remote sites? – SpacemanSpiff Dec 24 '10 at 18:48
  • Possible loop, yes, but since I'm only seeing some packets duplicated, I don't think so. The TLS VLAN 3994 is present at all three sites through an ISP; we've been told it's conceptually similar to an MPLS network delivering it. – David Mackintosh Dec 24 '10 at 19:38

1 Answer


OK, I figured this out and I'll write it out here. This particular solution is unlikely to help anyone because it is an edge case.

Back in the ancient history of the link with this provider, we added a second VLAN to the primary one. At the time, the provider connected this VLAN as both tagged and untagged on their side of the connection. Their switch treats the tagged and untagged as separate connections.

So what happens is my system connected to the Dell emits an ARP broadcast (the management interface on this computer emits ARP packets every half second for reasons which are stupid), which the Dell forwards down the link to the remote site. The switch at the provider hears the broadcast on the untagged interface -- and sends it back to me on the tagged interface. The Dell hears this and concludes that the MAC address originating the broadcast is really reachable via the provider's link. Follow-up packets therefore get misdirected.
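To make the sequence concrete, here is a toy model of a learning bridge in Python. It is only a sketch of generic MAC learning, not of the 6224's actual firmware: the `LearningSwitch` class, the third port `2/g1`, and the second MAC address are made up for illustration, and the tagged/untagged detail is reduced to "the same broadcast comes back in on the uplink port".

```python
BROADCAST = "FFFF.FFFF.FFFF"


class LearningSwitch:
    """Toy single-VLAN learning bridge; not a model of any specific product."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Learn src on in_port, then return the set of ports the frame goes out."""
        self.mac_table[src] = in_port
        out = self.mac_table.get(dst)
        if dst == BROADCAST or out is None:
            return self.ports - {in_port}  # flood everywhere except the source port
        return set() if out == in_port else {out}


HOST = "0007.E924.F714"   # MAC from the switching-table rows in the question
OTHER = "0000.0000.0001"  # hypothetical second device, on hypothetical port 2/g1
UPLINK = "2/g22"          # provider-facing port, per the "broken" table row

sw = LearningSwitch(["2/g1", "2/g16", UPLINK])

# 1. The host broadcasts; the switch learns HOST on 2/g16 and floods the frame.
flooded = sw.receive("2/g16", HOST, BROADCAST)

# 2. The provider hands the same broadcast back on the uplink,
#    so the switch relearns HOST on the uplink port.
sw.receive(UPLINK, HOST, BROADCAST)

# 3. Unicast traffic for HOST is now sent down the uplink instead of 2/g16.
misdirected = sw.receive("2/g1", OTHER, HOST)

# 4. The next frame from HOST itself moves the table entry back.
sw.receive("2/g16", HOST, OTHER)
recovered = sw.receive("2/g1", OTHER, HOST)

print(sorted(flooded), sorted(misdirected), sorted(recovered))
```

In this model the bad entry only lasts until the host sends its next frame, which would fit the short bursts of lost pings, though that link to the burst length is my reading rather than something confirmed on the switch.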

The solution was to have the provider change their configuration so that it agreed with that on the Dell. All general connection problems have ceased.

David Mackintosh