I am right now getting additional grey hair fighting a phenomenon concerning packet loss between machines on the Internet.

Check the diagram below. Note that whenever I use "SSH" I could use "HTTPS"; the same phenomenon occurs for that protocol.

An SSH server running Fedora 22 is on "Site A" (wine red). I never had any connection problems until "recently".

SSH connections to "Site A" from Amazon EC2 machines running Fedora 22 or Fedora 23 work perfectly well (hosts shown in green inside the "Amazon EC2" box).

SSH connections to "Site A" from "Site B", which is on the same AS, do not work from any Fedora system I tested (orange boxes). However, they do work from a Windows 7 system using PuTTY. The same (dual-boot) hardware is involved in both cases. "Site B" also has a firewall, but that does not seem to play any role: I have tried to set up the connection directly from the FritzBox router and it still didn't work for Fedora but worked for Windows.

How does the problem manifest itself:

When you connect using SSH, there is an initial packet exchange going on (as shown by tcpdump). However, after 20 packets or so, the outgoing packets seem to not go anywhere anymore; no acknowledgements come back from Site A. You never get to the password prompt. A CTRL-C properly resets the connection, after which Linux still tries to send the packets that were never ACKed for a bit.

The setup

I suspect there is some problem at my ISP. In particular, I suspect that the ISP performs some dubious magic in order to implement the "fixed IP address" at Site B, which is the only thing that changed "recently".

However, I can't understand what would account for the fact that an SSH connection works from Windows but not from Linux under the same conditions, network-wise. What should I be looking for?

David Tonhofer
  • It could be so many things. You really haven't given enough information. I suspect it could be configuration issues in your consumer grade Fritz!box's. I'm surprised you're using them. – hookenz Feb 10 '16 at 22:51
  • Not quite convinced this could be involved (I don't think so many packets would get exchanged before the connection dies), but have you checked it's not a blocked path MTU discovery? This can happen if somewhere along the path there's a lower-than-usual MTU (e.g. PPPoE links and the like) and the firewall blocks ICMPs from going through. Check the MTU, and try ping with the don't-fragment bit set (-D on BSD/macOS, -M do on Linux) with different packet sizes. – jcaron Feb 10 '16 at 22:53
  • @jcaron I think you're correct: most of the packets before the stall are small, the packet that is not getting through has a nice suspicious-looking size right at the MTU for PPPoE, and the one larger packet that gets sent before is unreasonably large (1900 bytes), suggesting that TCP Segmentation Offload is in use and that it was probably broken into smaller sizes as a result (though I admit I would have expected one of those to be MTU-sized as well). – Jeremy Gibbons Feb 11 '16 at 04:52
  • @Matt The Fritzes are ok for now and generally work (just had one mystery failure in the last year). There are no serious server applications on the "inside". – David Tonhofer Feb 13 '16 at 23:07
  • Ahh, that's OK then. I've had some Fritzes crumble under load. MTU could indeed be an issue. – hookenz Feb 14 '16 at 18:21

2 Answers


Your packet trace shows:

22:29:22.180852 IP (tos 0x0, ttl 64, id 52989, offset 0, flags [DF], proto TCP (6), length 1900)   
SITE_B_LAN_ADDR.54358 > SITE_A.SSH_PORT: Flags [P.], cksum 0x05c4 (incorrect -> 0xadce), seq 22:1870, ack 22, win 229, options [nop,nop,TS val 4294917498 ecr 71539420], length 1848

Note that it has a length of 1900 bytes with the don't-fragment (DF) flag set on the packet. Typical MTUs tend to be between 1400 and 1500 bytes.

You're probably getting ICMP "packet too big" messages back, but you're dropping all inbound ICMP traffic at the Site A firewall.

To test for this you'd have to do the packet trace on your firewall for icmp and tcp 22.
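A capture filter for that trace could look roughly like the sketch below. It assumes plain IPv4, where "packet too big" is ICMP type 3 (destination unreachable) with code 4 (fragmentation needed). The helper name pmtud_filter and the IFACE variable are made up for illustration and are not part of tcpdump.

```sh
# Build a pcap filter that matches ICMP "fragmentation needed"
# (type 3, code 4) plus the TCP port under test.
# pmtud_filter is a hypothetical helper name; the port defaults to 22.
pmtud_filter() (
    port="${1:-22}"
    printf '(icmp[icmptype] == icmp-unreach and icmp[icmpcode] == 4) or tcp port %s\n' "$port"
)

# Usage on the firewall (IFACE is an assumption, e.g. the WAN interface):
#   tcpdump -n -i "$IFACE" "$(pmtud_filter 22)"
```

If the ICMP messages show up on the outside interface but never on the inside one, the firewall is the component dropping them.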

Make sure you permit ICMP packet too big messages inbound at site A.

Alternatively you could try setting the MTU on your Linux boxes at Site A to something under the size of your network MTU. I am hazarding a guess that on Fedora you have jumbo packets enabled but on Windows you do not.

Matthew Ife
  • Thanks Matthew. I have written up my work at the same time. Looking at Jumbo Frames, too, but I will open a ticket at the ISP. – David Tonhofer Feb 13 '16 at 23:05
  • It's not just your ISP with regard to jumbo frames: if you set don't fragment and any hop between you and site B doesn't accept such an MTU, the frame is dropped and the dropping hop sends an ICMP packet too big message back to you. There could be 10 other hops between you and your site despite what your autonomous system is set to do, so you'll need to resolve this at your end. – Matthew Ife Feb 13 '16 at 23:10

After the suggestions of the dear commenters, I have looked to see whether an MTU problem could be the cause.

The following was found when trying to connect from "Site B" to "Site A" from a Fedora system. On a Windows system everything is working perfectly fine -- wireshark indicates that the length of outgoing packets never exceeds 1158 bytes, so the problem is not triggered there.

In brief, if I read this correctly:

  1. There is an initial successful exchange of small packets.
  2. A packet with length 1900 is sent. I suppose the network card will break this up because the MTU for the local network is 1500.
  3. A router in the ISP network with address tells us to "please fragment the packet to MTU 1492".
  4. Wilco! A packet with length 1492 is sent.
  5. A router in the ISP network with address tells us to "please fragment the packet to MTU 1492".
  6. Things go downhill from here.

It looks like I will have to open a ticket with the ISP (which is POST Telecom Luxembourg btw, in case someone googles for similar problems).

This also suggests a remediation. Force the MTU on the route to SITE_A to 1000:

ip route add $SITE_A_IP via $GATEWAY_IP dev $ETHDEV mtu lock 1000

Indeed, this fixes the problem.

Reference info

Use ping to test MTU behaviour:

ping -c $COUNT -M $MTUDS -s $PPLSZ $HOST


  • COUNT=1: "One ping only"
  • MTUDS=do: MTU discovery strategy is "prohibit fragmentation, even local one" i.e. set the 'DF' (don't fragment) bit (why is this 'do'? dunno). USE THIS.
  • MTUDS=want: MTU discovery strategy is "do PMTU discovery, fragment locally when packet size is large" i.e. set the 'DF' bit and fragment locally
  • MTUDS=dont: MTU discovery strategy is "don't set the 'DF' bit", i.e. fragment as needed
  • PPLSZ=1464: ICMP ping packet payload size in bytes.
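The payload size given to -s excludes the headers, so it helps to derive it from the MTU you want to test. A minimal sketch, assuming IPv4 without options (20-byte IP header) and an 8-byte ICMP echo header; ping_payload_for_mtu is a name made up here:

```sh
# Convert an MTU (total IP packet size) into the matching ping -s value.
# 28 = 20-byte IPv4 header (no options) + 8-byte ICMP echo header.
ping_payload_for_mtu() (
    mtu="$1"
    echo $(( mtu - 28 ))
)

# Usage: probe exactly at MTU 1492 without allowing fragmentation:
#   ping -c 1 -M do -s "$(ping_payload_for_mtu 1492)" "$HOST"
```

For an MTU of 1492 this gives 1464, the PPLSZ value listed above.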

Use tcpdump to monitor all ICMP packets and packets from and to "Site A":

tcpdump -vvv -n -nn icmp or '(' host $SITE_A_IP ')'

This is a bit hard to read though.

Watch what the kernel thinks about the MTU to "Site A".

watch ip route get to $SITE_A_IP

Note that a lower MTU than the default will get cached with a TTL of 600 seconds after the first failed ping.
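To pull just the MTU figure out of that output, something like the following may do. It is a sketch that assumes the kernel prints the value as "mtu N" or "mtu lock N" somewhere in the route output; the exact layout can differ between iproute2 versions, and route_mtu is a hypothetical helper name.

```sh
# Read `ip route get` output on stdin and print the MTU value, if any.
# Handles both "mtu 1492" (learned via PMTU discovery) and
# "mtu lock 1000" (set by hand). Prints nothing when no MTU is shown.
route_mtu() {
    sed -n 's/.*[[:space:]]mtu \(lock \)\{0,1\}\([0-9][0-9]*\).*/\2/p'
}

# Usage:
#   ip route get "$SITE_A_IP" | route_mtu
```

No output means no path MTU is currently cached for that destination.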


Suppose the maximum IP packet size in bytes (i.e. the size of the Ethernet payload) is 1492 (this is the case on Amazon EC2). Then an interesting ping payload size would be 1465, because the 28 bytes used for the IP and ICMP header information would give 1493, one byte past the maximum.

Then ping -c 1 -M want -s 1465 $HOST_IP does the following:

On the first ping you get "Frag needed and DF set (mtu = 1492) 100% packet loss". tcpdump shows echo request part 1 (length 1493) going out and a router of the target network sending back an "ICMP unreachable" with the request to fragment down to MTU 1492. A cached entry with MTU=1492 appears in the kernel route cache.

On subsequent pings you get "1 packets transmitted, 1 received". tcpdump shows echo request part 1 (length 1492) and echo request part 2 (length 21, offset 1472) and the corresponding echo reply (length 1493).
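The fragment lengths and offsets seen in tcpdump follow from simple arithmetic, sketched below. It assumes IPv4 headers without options (20 bytes) and uses the rule that every fragment except the last carries a multiple of 8 data bytes. The function name icmp_fragments is made up for this example.

```sh
# Print the IP length and data offset of each fragment for an ICMP echo
# with the given payload size sent over a path with the given MTU.
icmp_fragments() (
    mtu="$1"; payload="$2"
    remaining=$(( payload + 8 ))        # ICMP header (8 bytes) + payload
    chunk=$(( (mtu - 20) / 8 * 8 ))     # data bytes per fragment, multiple of 8
    offset=0
    while [ "$remaining" -gt 0 ]; do
        data=$remaining
        [ "$data" -gt "$chunk" ] && data=$chunk
        echo "length $(( data + 20 )) offset $offset"   # + 20-byte IP header
        offset=$(( offset + data ))
        remaining=$(( remaining - data ))
    done
)

# Usage:
#   icmp_fragments 1492 1465
```

With MTU 1492 and payload 1465 this prints one fragment of length 1492 at offset 0 and one of length 21 at offset 1472, matching the trace described above.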

Or you can use traceroute

# traceroute --mtu SITE_A 1500

Packet size 1500. Traceroute tells us that route has MTU 1492

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1500 byte packets
 1  gateway (  0.550 ms  0.536 ms  0.393 ms
 2 (  1.458 ms  1.485 ms  1.344 ms
 3 (  4.889 ms F=1492  2.968 ms  4.854 ms
 4 (  4.955 ms !F-1492  3.559 ms !F-1492  5.022 ms !F-1492

Try with 1492: same problem!

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1492 byte packets
 1  gateway (  0.635 ms  0.554 ms  0.483 ms
 2 (  1.510 ms  1.504 ms  1.311 ms
 3 (  48.305 ms  17.436 ms  5.496 ms
 4 (  5.963 ms !F-1492  6.865 ms !F-1492  4.887 ms !F-1492

Try with 1491: same problem!

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1491 byte packets
 1  gateway (  0.594 ms  0.650 ms  0.492 ms
 2 (  1.716 ms  1.782 ms  1.580 ms
 3 (  7.327 ms  7.385 ms  4.775 ms
 4 (  5.210 ms !F-1492  5.624 ms !F-1492  4.841 ms !F-1492

Try with 1490: we get through. There is bound to be some off-by-one error in there.

traceroute to SITE_A (SITE_A_IP), 30 hops max, 1490 byte packets
 1  gateway (  0.616 ms  0.688 ms  0.484 ms
 2 (  1.712 ms  1.853 ms  1.611 ms
 3 (  6.248 ms  7.008 ms  4.995 ms
 4  SITE_A_IP.dyn.luxdsl.pt.lu (SITE_A_IP)  12.441 ms !X  9.641 ms !X  9.576 ms !X
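Rather than stepping down by hand from 1500 to 1490, the largest size that gets through can be found with a binary search. This is only a sketch: it assumes Linux iputils ping, assumes that any size below a passing size also passes, and uses helper names (probe_size, largest_passing_size) and a PROBE override variable that are invented for this example.

```sh
# Succeeds if an unfragmentable IPv4 packet of total size $2 reaches host $1.
# 28 = 20-byte IPv4 header + 8-byte ICMP echo header.
probe_size() {
    ping -c 1 -W 2 -M do -s $(( $2 - 28 )) "$1" >/dev/null 2>&1
}

# Binary search for the largest packet size in [lo, hi] that passes.
# Prints 0 if no size in the range passes.
# PROBE may name a replacement probe command (hypothetical hook).
largest_passing_size() (
    host="$1"; lo="$2"; hi="$3"; best=0
    while [ "$lo" -le "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if "${PROBE:-probe_size}" "$host" "$mid"; then
            best=$mid; lo=$(( mid + 1 ))
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$best"
)

# Usage:
#   largest_passing_size "$SITE_A_IP" 576 1500
```

On the path traced above this would be expected to settle on 1490 rather than the 1492 the router advertises, which is the discrepancy worth reporting to the ISP.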

Further info of interest:

David Tonhofer
  • Actually still not happy, might there be a reason for the recalcitrant – David Tonhofer Feb 13 '16 at 23:12
  • Definitely a problem at the "POST Luxembourg" ISP. If the Linux box is on the pool of IP addresses [87.240.253](https://db-ip.com/all/87.240.253) the problem occurs, but not if it is on other pools. – David Tonhofer Oct 04 '16 at 10:56