18

We have a system that is suffering from comms outages on a gigabit Ethernet network. The traffic load would only slightly stress a 100Mb network, but there are gigabit switches, NICs and cables throughout - or so I am told by the customer who built the network we are plugging into.

We plugged in a laptop running Wireshark via a 100baseT hub and found that it reported lots of "Ethernet II" packets where the raw data, when displayed as ASCII, basically looks like this:

PUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU

Naturally I immediately named this issue "Network PUU" and many giggles ensued. We're all in our forties or so, but I guess some of us never grow up (guilty!)

Anyway, more seriously, other perfectly valid packets were being corrupted by this data. IPv4 headers were getting bytes replaced with U bytes, and there was payload corruption that caused the software to reject the data even when the IP checksums still matched. We are pretty sure that this data spewing onto the network is causing the comms outages. What we don't know is where it might be coming from.
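For anyone who wants to poke at their own captures, the IPv4 header checksum is just a one's-complement sum over the header's 16-bit words, so it is easy to recompute. A minimal sketch in Python (the function name and the 20-byte header below are made up for illustration, not taken from our capture):

```python
import struct

def ipv4_header_checksum(header: bytes) -> int:
    """Recompute the IPv4 header checksum (one's-complement sum of 16-bit words)."""
    # Zero the checksum field (bytes 10-11) before summing, as the spec requires.
    data = header[:10] + b"\x00\x00" + header[12:]
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                      # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Hypothetical 20-byte header (192.168.0.1 -> 192.168.0.2, ICMP), not from the real capture.
header = bytes.fromhex("45000054a6f2400040010000c0a80001c0a80002")
print(f"expected checksum field: 0x{ipv4_header_checksum(header):04x}")
```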

Has anyone ever seen this happen before? Did you solve it? Did you figure out where it came from?

====EDITED====

Added mention of the hub to the original description since, judging from the comments below, it is the most likely source of the corruption! The tool we used to try and find the network issue appears to have added a new and worse network issue.

AlastairG
  • 348
  • 2
  • 13
  • Thank you for all your replies. I just found out that the hub we were using to sniff the traffic is 100Mb, and the problem only occurs when a specific type of data (40+ KB all sent at once over TCP) is being transmitted. – AlastairG Nov 10 '21 at 16:18
  • To Ron and Zac, you both talked about the source MAC address. The network packet raw data is literally just PUUUUU..., so the source address is 0x55555555 and the destination address is 0x50555555. Allegedly. I may have missed out a few 0x55 bytes. – AlastairG Nov 10 '21 at 16:20
  • Notice that ASCII `U` is also `0x55` – Bergi Nov 11 '21 at 01:29
  • 7
    @Bergi ...which (if it's not obvious) is `01010101` in binary, possibly indicating either a testing/debug value or the result of a physical layer malfunction. – NobodyNada Nov 11 '21 at 01:56
  • 2
    @AlastairG Do you really mean "hub" or just "unmanaged switch"? If it is really a hub, throw it out of a window immediately. Hubs have no place whatsoever in a modern (as in "this millennium") network. – Tonny Nov 11 '21 at 13:33
  • @Tonny: Hubs are much more useful than switches for packet capture, to the point where quality switches are capable of designating one port to act as a hub and mirror all traffic being forwarded through all other paths. – Ben Voigt Nov 11 '21 at 22:44
  • Hubs are useless in a modern network, because they stopped making them back in 100M times. A 100M hub can't keep up with a 1G network. Throw them out. Some smart switches support promiscuous mode where a monitoring port can receive all traffic. – user10489 Nov 12 '21 at 01:12
  • @BenVoigt Hubs don't have buffering (or just 1 packet at best) and are limited to 100 Mb/s. That's not good enough anymore. And that is not even counting the mess hubs create with collision handling. Collision logic for hubs is based on 50+ year old half-duplex logic from the coax days. This doesn't play nice with a modern ethernet where every link is a full-duplex link that normally only uses collisions as a hold/resume mechanism to deal with (hopefully rare) buffer-full conditions. The packet capture is an accidental benefit of a hub, but we have better tools (port-mirroring) for that now. – Tonny Nov 12 '21 at 11:01
  • As Ben Voigt says, we used a hub for network monitoring. I am not an IT guy, not really, even though I ended up doing IT for my little 10 man company. I didn't know you could set ports on switches to be promiscuous. I guess my knowledge of ethernet is a bit out-dated. – AlastairG Nov 15 '21 at 08:34

4 Answers

18

Anyway, more seriously, other perfectly valid packets were being corrupted by this data. IPv4 headers were getting bytes replaced with U bytes, and there was payload corruption that caused the software to reject the data even when the IP checksums still matched.

It's surprising that mere alternating bits (U is ASCII 0x55, i.e. 01010101b) actually make up valid Ethernet frames, let alone valid IP packets. If this corruption also creeps into otherwise intact frames/packets, it is most likely caused by a faulty switch (bad buffer memory) or a faulty host (bad NIC or RAM).
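Just to make that bit pattern visible, a quick sketch (the sample bytes are simply the ASCII "PUUU..." shown in the question):

```python
# Show the bit pattern behind the 'P'/'U' bytes from the capture.
sample = b"PUUUUUUU"   # first few bytes as Wireshark displayed them
for byte in sample:
    print(f"{chr(byte)!r}: 0x{byte:02X} = {byte:08b}")
# 'P': 0x50 = 01010000
# 'U': 0x55 = 01010101  <- pure alternating bits
```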

If frame data is corrupted in transit on the cable, the FCS will almost certainly fail to verify, making the very next switch drop that frame. However, if such a frame is transported through the network with a valid FCS, it must have been corrupted before that FCS was calculated, which points to a defective switch or host.
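If your capture happens to include the FCS (many NICs strip it before the frame reaches the capture tool, so often it won't), you can check it yourself. A rough sketch, assuming the usual pcap convention of the CRC-32 being appended in little-endian byte order:

```python
import struct
import zlib

def fcs_ok(frame: bytes) -> bool:
    """Check an Ethernet frame's trailing FCS, assuming the capture kept those 4 bytes."""
    body, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(body) == struct.unpack("<I", fcs)[0]
```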

You'll need to trace back that traffic. If the source MAC address isn't valid or can't be checked on intermediate (unmanaged) switches you'll need to trace your way back along the cables.

Zac67
  • 8,639
  • 2
  • 10
  • 28
  • 4
    I vote for a faulty memory block in a switch, too. Looks like not all memory is bad since it breaks packets only in bigger data bursts. – fraxinus Nov 11 '21 at 09:22
  • And it would not surprise me to find that the problem is greatly exacerbated by disabling autonegotiation on gigabit connections. [Autonegotiation is a requirement for using 1000BASE-T according to *Section 28D.5 Extensions required for Clause40 (1000BASE-T)*. At least the clock source has to be negotiated, as one endpoint must be master and the other endpoint must be slave.](https://en.wikipedia.org/wiki/Gigabit_Ethernet#1000BASE-T) I have no idea why supposedly-smart people disable autonegotiation on networks. You don't keep driving your car at 60mph/90kph with a flat "because it should" – Andrew Henle Nov 13 '21 at 13:51
  • 3
    @AndrewHenle Any device linking 1000BASE-T with disabled autonegotiation can be considered broken (and you'd need two of them). – Zac67 Nov 13 '21 at 13:55
12

Sounds like you have a bad NIC. If the source MAC address is valid, you can find it by checking the switch MAC tables. If it is corrupted, you'll just have to start unplugging devices to find it.

Ron Trunk
  • 2,149
  • 1
  • 10
  • 19
3

That sounds as if you have a device (probably a 100 Mb/s switch) somewhere that can't deal with the traffic flow and starts corrupting packets when its internal buffers overflow.
(Or it simply has bad RAM.)

It doesn't notice that the packets are corrupt and will happily retransmit them with freshly calculated checksums. So the bad packets are accepted by other switches (the checksum is good, and switches don't care that the content is nonsense) and get forwarded through the entire network.

It is actually worse than that:
Consider how switches learn which device (MAC address) is behind which port. Any packet destined for a MAC address the switch hasn't learned yet is flooded out of all switch ports (except the one it came in on). This effectively turns a packet for an unlearned MAC address into a temporary broadcast.
Because your switches will never learn these MAC addresses (after all, they are corruption, not real MAC addresses), they are ALL treated like broadcasts...
This essentially floods the whole network with undeliverable packets.
(And note that normal broadcast-storm mitigations don't work in this case. They only act on REAL broadcast packets, not on these learning floods.)
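To make the learning/flooding behaviour concrete, here is a toy model of what a transparent bridge's forwarding logic does (pure illustration, nothing vendor-specific):

```python
# Toy model of transparent-bridge forwarding: learn source MACs,
# flood frames whose destination MAC has not been learned yet.
mac_table = {}  # MAC address -> port it was last seen on

def forward(src_mac, dst_mac, in_port, all_ports):
    mac_table[src_mac] = in_port         # learn (a corrupted source just pollutes the table)
    if dst_mac in mac_table:
        return [mac_table[dst_mac]]      # known destination: exactly one port
    # Unknown destination (e.g. a corrupted MAC): flood everywhere except the ingress port.
    return [p for p in all_ports if p != in_port]

# A frame to a garbage destination goes out of every other port, just like a broadcast:
print(forward("55:55:55:55:55:55", "50:55:55:55:55:55", 1, [1, 2, 3, 4]))  # -> [2, 3, 4]
```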

The only way to troubleshoot this is to disable one switch at a time and see if that makes the problem go away. If you can narrow it down to one switch, the culprit is either that switch itself or a device connected behind it.

Tonny
  • 6,252
  • 1
  • 17
  • 31
-1

The difference between a hub and a switch is that when a switch gets a collision, it either throws out the second packet or stores it and forwards it once the first packet finishes. A hub, on the other hand, will merrily allow the collision to happen, replaces the contents of the packet with 10101... to indicate that there was a collision, and keeps sending that until both packets have finished.

The solution here is to get rid of hubs, as they are obsolete. They stopped making hubs before 1G was available, so a hub has to be 100M or slower. The 1G network standard does not support hubs.

For a little history, before there were hubs, there were repeaters. The difference is that a repeater receives the analog signal, cleans it up slightly back into a nice square wave, and retransmits it, whereas a hub actually looks at what is in the packet a little and tries to make sure it is well formed. However, neither of them does anything to fix collisions; they just let them happen.

Repeaters and hubs are from back when Ethernet was considered an unbuffered bus and only one device on the network could speak at a time. When Ethernet was a true bus (10base2 and 10base5), to start a packet you transmit start bits (10101...) until the first bit reaches the furthest ends of the network; if nobody else has interrupted you in the meantime, you continue your packet. If you get interrupted, you have a collision, and both parties back off and try again at a random time later. If one party doesn't abort, you have a late collision.

Your hub is turning your late collisions into all start bits. Possibly something in the path is not recognizing the packet as a late collision and, rather than dropping it, is reforming it into a valid packet. Or your promiscuous packet sniffer sees invalid packets as well as valid ones.
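For reference, the "back off and try again at a random time later" part is truncated binary exponential backoff; roughly like this (just a sketch of the classic CSMA/CD rule, not code from any real driver):

```python
import random

SLOT_TIME_BITS = 512  # one slot time = 512 bit times on 10/100 Mb/s Ethernet

def backoff_bit_times(collision_count: int) -> int:
    """Bit times to wait after the nth collision, per classic CSMA/CD."""
    if collision_count > 16:
        raise RuntimeError("excessive collisions - the frame is dropped")
    k = min(collision_count, 10)            # exponent is capped at 10
    slots = random.randint(0, 2**k - 1)     # uniform over [0, 2^k - 1] slot times
    return slots * SLOT_TIME_BITS
```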

Contrast this with a switch, which not only handles collisions (by buffering) but can support full duplex, where one packet is being transmitted while another is being received. The 100M standard allows devices that do or do not support full duplex, and this is negotiated when the cable is plugged in. The 1G Ethernet standard requires all devices to support full duplex, so a hub is not allowed on a 1G connection, and therefore 1G hubs do not exist.

user10489
  • 474
  • 1
  • 2
  • 12
  • 3
    _"a hub will merrily allow the collision to happen and just replaces the contents of the packet with 10101... to indicate that it was a collision and continues sending that until both packets have finished"_ - can you elaborate? I know of hubs as mere repeaters on layer 1 (i.e. physical, purely electrical connections), not modifying packets at all, at most (temporarily) blocking ports that prove to be troublesome. – CodeCaster Nov 12 '21 at 13:12
  • 3
    A jam signal only has 62 bits of alternating 0101. The pattern in the question looks much longer. Also, you don't explain how a jam signal makes it into a valid frame. Gigabit Ethernet did in fact define repeater hubs - see IEEE 802.3 Clause 41 - they just never materialized (fortunately). A hub is a *multiport repeater* - the difference is the number of ports. A hub does *not* inspect frames (FCS) in any way, it only reacts to jabber. *you transmit start bits (10101...) until the first bit reaches the furthest ends of the network* - definitely not. – Zac67 Nov 12 '21 at 18:50
  • The preamble plus SFD is exactly 48 bits. This is where I stopped reading... – Zac67 Nov 12 '21 at 18:50
  • You may be looking at a newer standard. What I'm looking at is 10base2 and 10base5. Also, there is a big difference between a normal collision and a late collision. The hubs only looked at the packets enough to figure out where they start. The repeater doesn't even do that, it just reforms the square wave. – user10489 Nov 12 '21 at 22:57
  • @Zac67: 48 bits at 10MHz if I do my math right is more than 200m which if I recall correctly was close to the approx max network length, so your numbers don't disagree, assuming my memory is covering 10base2. – user10489 Nov 12 '21 at 23:23
  • 48 bits / 100M × c × 0.64 for 100BASE-TX is roughly 92m, while the maximum length of the collision domain is 300m (two repeaters). With 100BASE-FX, the maximum length is 6000m. Your theory is wrong. – Zac67 Nov 13 '21 at 08:27
  • PS: Sorry, 100BASE-FX in HDX is limited to 400m, so the maximum collision domain length is 1200m. – Zac67 Nov 13 '21 at 09:01
  • The collision-by-preamble timing doesn't apply to 100base-TX; it only applies to 10base2 and 10base5, which if I recall correctly were 200m and 500m. The length difference would be accounted for because the velocity factor of RG58 is around 60% and 10base5 requires a much higher grade coax which probably has a velocity factor around 90-98% – user10489 Nov 13 '21 at 12:12
  • 10BASE5 with two repeaters (5-4-3 rule allows 3 mixing segments) has a maximum collision domain length of 1500m. RG8's velocity factor is .77 as per Clause 8. Do the math. 100BASE-TX uses the exact same preamble as all Ethernet variants. The preamble isn't used for collision detection anywhere (other than that there's a carrier present). You should really read IEEE 802.3. – Zac67 Nov 13 '21 at 13:47
  • It's not the preamble but the *minimum frame size* (including preamble) that needs to span the whole collision domain and back - the sender needs to still be transmitting when a collision is sensed/signalled or it won't retransmit. – Zac67 Nov 13 '21 at 14:03
  • 10BASE2 coax cables have a maximum length of 185 meters. 10base5 is 500m. I don't know where you get 1500m, maybe you are thinking 1600 ft. – user10489 Nov 13 '21 at 14:15
  • @Zac67: that sounds right. But also there's a mechanism there for early termination of transmission to prevent late collisions on large packets. – user10489 Nov 13 '21 at 14:18
  • A *late collision* is a collision beyond the minimum frame size. It can only happen if the collision domain is longer than permitted or if the late sender is malfunctioning (or with duplex mismatch). – Zac67 Nov 13 '21 at 14:27
  • Right. If everything is in spec, late collisions should not occur. – user10489 Nov 13 '21 at 14:34
  • This sounds like the most likely answer since it explains the alternating bits as indicative of a collision. The PUU occurs when one part of the system is trying to send something like 40kB in a single write to the socket. Naturally this is split into a number of maximum-size packets (1500 bytes IIRC). However it appears that experts who know far more than I do think that this cannot be the correct answer because the PUU is too long. Could multiple collisions cause the problem? Or having a 100BASE-T hub between two Gigabit ethernet devices with a different collision detection system? – AlastairG Nov 15 '21 at 08:48
  • I think it is very likely that one or both 1G switches are not handling the half duplex hub correctly and feeding it full duplex anyway, and the hub is turning something into one long late collision. Then, again, one or both 1G switches are not handling the collision correctly, because you don't get collisions in full duplex mode. – user10489 Nov 15 '21 at 12:20