I have a very weird issue with certain packets not arriving on the destination host. It happens when we transmit a POST that is somewhat larger than the MTU. We can reproduce it with this script:

#!/usr/bin/python

import urllib2

magic_length = 2297
logurl = 'http://www.example.nl/'
data = (magic_length - len(logurl)) * 'X'
headers = {'content-type': 'application/x-www-form-urlencoded', 'User-Agent': 'Fake'}
request = urllib2.Request(logurl, data, headers)
handler = urllib2.build_opener(urllib2.HTTPHandler())
answer = handler.open(request, timeout=5)

The sender never receives ACKs and keeps retransmitting. The receiver never sees the packet.

It depends on where you run the script and where you POST to. My home connection is one that fails (and, incidentally, I've had problems with AJAX POSTs not getting through for a few months, ever since I got a new modem).

If I reduce the MTU of the sending machine by 100, it works again. But, if I reduce magic_length by 100 too, it fails again. A first theory was that a layer of my ADSL (like PPPoA) adds headers and causes packets to be split erroneously, but that doesn't seem to be it then.
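The MTU observation above can be illustrated with a minimal sketch of how a TCP stream is cut into packets. It assumes 20-byte IPv4 and TCP headers with no options (TCP timestamps, for example, would add 12 bytes), and the stream size of 2500 bytes is a hypothetical stand-in for the HTTP headers plus the POST body, not a measured value.

```python
def segment_lengths(stream_bytes, mtu, ip_header=20, tcp_header=20):
    """Return the IP packet sizes for a TCP stream cut into MSS-sized segments.

    Header sizes are assumptions: IPv4 and TCP without options.
    """
    mss = mtu - ip_header - tcp_header
    if mss <= 0:
        raise ValueError("MTU too small for the assumed headers")
    sizes = []
    remaining = stream_bytes
    while remaining > 0:
        payload = min(mss, remaining)
        sizes.append(payload + ip_header + tcp_header)
        remaining -= payload
    return sizes


stream = 2500  # hypothetical: HTTP headers + POST body

print(segment_lengths(stream, 1500))        # [1500, 1080]
print(segment_lengths(stream, 1400))        # [1400, 1180]
print(segment_lengths(stream - 100, 1400))  # [1400, 1080]
```

Under these assumptions, lowering the MTU by 100 makes the last packet 100 bytes larger, and shortening the data by 100 brings it back to its original size. That would fit a problem tied to the size of the second packet rather than to fragmentation.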

Perhaps something goes wrong with MTU discovery. Some hop down the line blocking all ICMP perhaps? This is the first part of a traceroute to google from my home:

traceroute to google.com (74.125.133.102), 30 hops max, 60 byte packets
 1  dsldevice.lan (192.168.2.254)  0.453 ms  0.547 ms  0.636 ms
 2  195.190.243.7 (195.190.243.7)  29.836 ms  29.947 ms  29.986 ms
 3  nl-zl-dc2-git-cr02.kpn.net (213.75.64.237)  37.004 ms  37.153 ms  37.204 ms
 4  nl-rt-dc2-ice-ir02.kpn.net (213.75.64.236)  37.261 ms  37.300 ms  37.339 ms
 5  72.14.198.161 (72.14.198.161)  38.351 ms  38.395 ms  38.405 ms
 6  209.85.254.92 (209.85.254.92)  37.976 ms  38.103 ms  37.972 ms
 7  209.85.253.247 (209.85.253.247)  38.612 ms 72.14.238.153 (72.14.238.153)  33.709 ms 209.85.253.249 (209.85.253.249)  36.890 ms
 8  209.85.240.158 (209.85.240.158)  41.052 ms  41.104 ms 209.85.244.102 (209.85.244.102)  41.164 ms
 9  209.85.249.12 (209.85.249.12)  38.392 ms 209.85.249.14 (209.85.249.14)  38.247 ms  38.851 ms^C

If I ping 213.75.64.237, I get (I've never actually seen 'packet filtered' as a response on STDOUT...):

PING 213.75.64.237 (213.75.64.237) 56(84) bytes of data.
From 213.75.64.237 icmp_seq=1 Packet filtered

The rest I can ping.

This answer seems similar. However, my script doesn't set the DF (don't fragment) flag (edit: correction, the tcpdump does show that the flag is set on the POST request), nor can I see ICMP requests coming back to me when I run the script on a host that does work. Plus, the packets are already split up by the sender, and sending the second packet fails.

How do I proceed? ISP NOCs are hard enough to reach as it is, so I need proof of what's going on. They're not going to help me figure it out...

Edit: to confirm or deny the ICMP type 3, code 4 (fragmentation required) hypothesis, I did this:

$ ping -c 1 -M do -s 1472 host
PING host (1.2.3.4) 1472(1500) bytes of data.
1480 bytes from host (1.2.3.4): icmp_req=1 ttl=50 time=33.8 ms

This works, but I'm a bit confused. Does the "(1500)" mean the total fragment size? I assume so, because 1480 bytes + 20 bytes IP header is 1500 bytes.
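The numbers ping prints can be checked with a small sketch, assuming an 8-byte ICMP echo header and a 20-byte IPv4 header without options. The function name is only illustrative.

```python
ICMP_HEADER = 8   # ICMP echo header
IPV4_HEADER = 20  # IPv4 header, assuming no IP options


def ping_sizes(payload):
    """Return (ICMP message bytes, total IP packet bytes) for a ping -s value."""
    icmp_bytes = payload + ICMP_HEADER
    ip_bytes = icmp_bytes + IPV4_HEADER
    return icmp_bytes, ip_bytes


print(ping_sizes(1472))  # (1480, 1500)
print(ping_sizes(1473))  # (1481, 1501)
```

So the value in parentheses is the full IP packet size, and the "1480 bytes from host" in the reply is the payload plus the ICMP header.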

If I increase the size of the ping by one:

$ ping -c 1 -M do -s 1473 host
PING host (1.2.3.4) 1473(1501) bytes of data.
From pannekoek.lan (192.168.2.5) icmp_seq=1 Frag needed and DF set (mtu = 1500)

So, this would mean the path between the two hosts does allow packets of 1500 bytes and no fragmentation issues occur. It seems I'm back to square one.

Edit again: I have found something significant. The problem is simply that packets of certain sizes don't arrive. It happens between my modem and the ISP's first gateway:

$ for i in `seq 1025 1030`; do ping -c 1 -M do -s $i 195.190.243.7; done
PING 195.190.243.7 (195.190.243.7) 1025(1053) bytes of data.
1033 bytes from 195.190.243.7: icmp_req=1 ttl=254 time=31.2 ms  <- works

--- 195.190.243.7 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 31.273/31.273/31.273/0.000 ms
==========================
PING 195.190.243.7 (195.190.243.7) 1026(1054) bytes of data.

--- 195.190.243.7 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms <- packet loss   
==========================
PING 195.190.243.7 (195.190.243.7) 1027(1055) bytes of data.

--- 195.190.243.7 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms <- packet loss
==========================
PING 195.190.243.7 (195.190.243.7) 1028(1056) bytes of data.

--- 195.190.243.7 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms <- packet loss
==========================
PING 195.190.243.7 (195.190.243.7) 1029(1057) bytes of data.

--- 195.190.243.7 ping statistics --- 
1 packets transmitted, 0 received, 100% packet loss, time 0ms <- packet loss
==========================
PING 195.190.243.7 (195.190.243.7) 1030(1058) bytes of data.
1038 bytes from 195.190.243.7: icmp_req=1 ttl=254 time=31.1 ms <- works

--- 195.190.243.7 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 31.177/31.177/31.177/0.000 ms

I guess I have to convince them it's their problem.
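To collect evidence over more than one run, the loop above could be repeated and tallied per payload size. This is a rough sketch, not a finished tool: it shells out to the same Linux `ping` with `-c 1 -M do -s`, adds `-W` as a reply timeout (not used in the original commands), and the function names are my own.

```python
import subprocess
from collections import Counter


def ping_once(host, size, timeout=2):
    """Send one echo request with DF set; return True if a reply arrived."""
    cmd = ["ping", "-c", "1", "-W", str(timeout),
           "-M", "do", "-s", str(size), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0  # ping exits 0 when a reply was received


def sweep(host, sizes, rounds, probe=ping_once):
    """Count lost probes per payload size over several rounds."""
    lost = Counter()
    for _ in range(rounds):
        for size in sizes:
            if not probe(host, size):
                lost[size] += 1
    return {size: lost[size] for size in sizes}


def consistently_lost(loss_by_size, rounds):
    """Return the payload sizes that were lost in every round."""
    return sorted(size for size, n in loss_by_size.items() if n == rounds)
```

For example, `sweep("195.190.243.7", range(1020, 1036), rounds=20)` followed by `consistently_lost(...)` would separate sizes that always fail from occasional drops by a busy router, which is the kind of pattern an ISP is harder to dismiss.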

Halfgaar
  • This type of problem usually is because of something blocking ICMP replies that result from Path MTU Discovery. No ICMP messages coming back is part of the problem -- if they weren't blocked, your computer would receive them and lower the path MTU as required. – Barmar Oct 28 '14 at 18:17
  • @Barmar that was my hypothesis indeed, but I think I've proven that wrong. See my latest edit. – Halfgaar Oct 29 '14 at 08:59
  • Weird. One note - make sure you have more than a single sequence of pings to show a pattern of packet sizes. A router's lowest priority is responding to pings, so if the router is busy, it will drop the requests rather than responding. You may have done this already, but your example only shows the one run. – Dan Pritts Nov 13 '14 at 18:06
  • @DanPritts I have been able to do this consistently ever since this post. The ISP says it's looking into it, but I wonder if they're really... – Halfgaar Nov 15 '14 at 16:07
  • One relatively simple test is to replace your modem. Firmware bug in that could do something like this. Without any visibility into the network beyond that, there's not much you can do. – Dan Pritts Nov 17 '14 at 19:15

1 Answer

Somewhere along the line from point A to point B, a router has been configured with a lower MTU and that is what is breaking things. Have you tried doing a trace to see where exactly the ICMP packets are getting lost?

  • I don't think that is/was it, because I don't get the fragmentation required reply. And, it happened when connecting to the first gateway behind the modem. – Halfgaar Jan 31 '15 at 11:38