New details added at the end of this question; it's possible that I'm zeroing in on the cause.

I have a UDP OpenVPN-based VPN set up in tap mode (I need tap because I need the VPN to pass multicast packets, which doesn't seem to be possible with tun networks) with a handful of clients across the Internet. I've been experiencing frequent TCP connection freezes over the VPN. That is, I will establish a TCP connection (e.g. an SSH connection, but other protocols have similar issues), and at some point during the session, it seems that traffic will cease being transmitted over that TCP session.

This seems to be related to points at which large data transfers occur, such as if I execute an ls command in an SSH session, or if I cat a long log file. Some Google searches turn up a number of answers like this previous one on Server Fault, indicating that the likely culprit is an MTU issue: that during periods of high traffic, the VPN is trying to send packets that get dropped somewhere in the pipes between the VPN endpoints. The above-linked answer suggests using the following OpenVPN configuration settings to mitigate the problem:

fragment 1400
mssfix

This should limit the MTU used on the VPN to 1400 bytes and fix the TCP maximum segment size to prevent the generation of any packets larger than that. This seems to mitigate the problem a bit, but I still frequently see the freezes. I've tried a number of sizes as arguments to the fragment directive: 1200, 1000, 576, all with similar results. I can't think of any strange network topology between the two ends that could trigger such a problem: the VPN server is running on a pfSense machine connected directly to the Internet, and my client is also connected directly to the Internet at another location.

One other strange piece of the puzzle: if I run the tracepath utility, then that seems to band-aid the problem. A sample run looks like:

[~]$ tracepath -n 192.168.100.91
 1:  192.168.100.90                                        0.039ms pmtu 1500
 1:  192.168.100.91                                       40.823ms reached
 1:  192.168.100.91                                       19.846ms reached
     Resume: pmtu 1500 hops 1 back 64 

The above run is between two clients on the VPN: I initiated the trace from 192.168.100.90 to the destination of 192.168.100.91. Both clients were configured with fragment 1200; mssfix; in an attempt to limit the MTU used on the link. The above results would seem to suggest that tracepath was able to detect a path MTU of 1500 bytes between the two clients. I would assume that it would be somewhat smaller due to the fragmentation settings specified in the OpenVPN configuration. I found that result somewhat strange.

Even stranger, however: if I have a TCP connection in the stalled state (e.g. an SSH session with a directory listing that froze in the middle), then executing the tracepath command shown above causes the connection to start up again! I can't figure out any reasonable explanation for why this would be the case, but I feel like this might be pointing toward a solution to ultimately eradicate the problem.

Does anyone have any recommendations for other things to try?

Edit: I've come back and looked at this a bit further, and have found only more confounding information:

  • I set the OpenVPN connection to fragment at 1400 bytes, as shown above. Then, I connected to the VPN from across the Internet and used Wireshark to look at the UDP packets that were sent to the VPN server while the stall occurred. None were greater than the specified 1400 byte count, so the fragmentation seems to be functioning properly.

  • To verify that even a 1400-byte MTU would be sufficient, I pinged the VPN server using the following (Linux) command:

    ping <host> -s 1450 -M do
    

    This (I believe) sends a 1450-byte ICMP payload (slightly more on the wire once the headers are added) with fragmentation prohibited (I at least verified that it didn't work if I set it to an obviously-too-large value like 1600 bytes). These pings work just fine; I get replies back from the host with no issue.
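The size arithmetic behind that ping test can be sketched as follows. This is a minimal sketch, assuming IPv4 without IP options (a 20-byte header) and the 8-byte ICMP echo header; the helper names `ping_packet_size` and `ping_payload_for` are hypothetical.

```shell
#!/bin/sh
# Assumptions: IPv4 header without options (20 bytes), ICMP echo header (8 bytes).
IP_HEADER=20
ICMP_HEADER=8

# Hypothetical helper: IPv4 packet size produced by a given `ping -s` payload.
ping_packet_size() {
    echo $(( $1 + IP_HEADER + ICMP_HEADER ))
}

# Hypothetical helper: `ping -s` payload that yields a packet of the given size.
ping_payload_for() {
    echo $(( $1 - IP_HEADER - ICMP_HEADER ))
}

ping_packet_size 1450   # prints 1478
ping_payload_for 1400   # prints 1372
```

Under these assumptions, `-s 1450` already produces a 1478-byte packet, so a reply with `-M do` set suggests that the tested path carries packets larger than 1400 bytes.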

So, maybe this isn't an MTU issue at all. I'm just confused as to what else it might be!

Edit 2: The rabbit hole just keeps getting deeper: I've now isolated the problem a bit more. It seems to be related to the exact OS that the VPN client uses. I have successfully duplicated the problem on at least three Ubuntu machines (versions 12.04 through 13.04). I can reliably duplicate an SSH connection freeze within a minute or so by just cat-ing a large log file.

However, if I do the same test using a CentOS 6 machine as a client, then I don't see the problem! I've tested using the exact same OpenVPN client version as I was using on the Ubuntu machines. I can cat log files for hours without seeing the connection freeze. This seems to provide some insight as to the ultimate cause, but I'm just not sure what that insight is.

I have examined the traffic over the VPN using Wireshark. I'm not a TCP expert, so I'm not sure what to make of the gory details, but the gist is that at some point, a UDP packet gets dropped due to the limited bandwidth of the Internet link, causing TCP retransmissions inside the VPN tunnel. On the CentOS client, these retransmissions occur properly and things move on happily. At some point with the Ubuntu clients, though, the remote end starts retransmitting the same TCP segment over and over (with the transmit delay increasing between each retransmission). The client sends what looks like a valid TCP ACK to each retransmission, but the remote end still continues to transmit the same TCP segment periodically. This extends ad infinitum and the connection stalls. My question here would be:

  • Does anyone have any recommendations for how to troubleshoot and/or determine the root cause of the TCP issue? It's as if the remote end isn't accepting the ACK messages sent by the VPN client.

One common difference between the CentOS node and the various Ubuntu releases is that Ubuntu has a much more recent Linux kernel version (from 3.2 in Ubuntu 12.04 to 3.8 in 13.04). A pointer to some new kernel bug maybe? I'm assuming that if that were so, then I wouldn't be the only one experiencing the problem; I don't think this seems like a particularly exotic setup.

Jason R
  • Routing multicast packets over a `tun` network should be possible by running a multicast routing daemon (such as [pimd](https://github.com/troglobit/pimd)) *and* having the OpenVPN server use the `--topology` option set to "subnet" -- see [the manual](https://community.openvpn.net/openvpn/wiki/Openvpn23ManPage) – kostix Mar 18 '13 at 16:01
  • Do the VPN client or server indicate anything in the logs at the time of these issues? – mgorven Mar 18 '13 at 17:51
  • @mgorven: Definitely not on the client. I'll have to do some work to get at the server logs. – Jason R Mar 18 '13 at 18:59
  • @mgorven: I've finally had a chance to come back to this. Nothing at all in the client or server logs when this happens. It's really baffling. – Jason R May 19 '13 at 13:15
  • Is there any possibility that the clients that freeze have local firewalls that are dropping ICMP fragmentation-needed packets, whereas those that don't freeze have no such firewalls and are therefore fragmenting correctly? – MadHatter Jun 19 '13 at 14:20
  • @MadHatter: Thanks for the suggestion. I don't think that's the case. An examination of `iptables` on one of the Ubuntu machines I was testing with indicates that there aren't any rules in place. I guess it is possible that there's something else blocking them. I do believe that the packet fragmentation performed by OpenVPN is working properly based upon watching the tunnel traffic using Wireshark, so it *shouldn't* be an MTU problem. – Jason R Jun 19 '13 at 14:23

4 Answers

This command solves it for me:

$ sudo ip link set dev tun0 mtu 1350 && echo ":)"

You can verify the tun0 settings with:

$ ip a s

Cheers!

Michael Hampton
  • On the client or the server side? – Matt Apr 09 '18 at 09:32
  • Thanks a lot! @Matt, it depends on where the problem is located. For us it was on the server, but it may be on the client side. The value can also differ; you can test with `ping -s 1350 -M do` to find the right one – Eino Gourdin Jun 01 '18 at 14:41
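The search for a working size suggested in the comment above could be automated along these lines. This is only a sketch: `find_max_payload`, `probe`, and the `TARGET` variable are hypothetical names, and it assumes the Linux iputils `ping` and that a probe which succeeds at one size also succeeds at every smaller size.

```shell
#!/bin/sh
# Hypothetical probe: a single ping with the "don't fragment" flag set.
# TARGET is an assumed variable holding the address of the peer to test.
probe() {
    ping -c 1 -W 2 -M do -s "$1" "$TARGET" >/dev/null 2>&1
}

# Binary search for the largest payload in [lo, hi] for which probe succeeds.
# Assumes probe succeeds at lo and that success is monotonic in the size.
find_max_payload() {
    lo=$1
    hi=$2
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

# Example (not run here):
#   TARGET=192.168.100.91
#   find_max_payload 500 1472
```

For IPv4 without options, the largest passing payload plus 28 bytes of IP and ICMP headers gives the largest packet the tested path carried; a single lost reply can skew the result, so repeating the search is prudent.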

Disable TCP window scaling with:

sysctl -w net.ipv4.tcp_window_scaling=0

After doing that, SSH sessions to Debian/Ubuntu systems over the VPN work fine for me.
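To confirm which state a machine is in before and after that change, a small check like the following may help. It is a sketch only: `window_scaling_state` is a hypothetical helper, and the default path is the standard Linux procfs location for this setting.

```shell
#!/bin/sh
# Hypothetical helper: report the TCP window scaling state.
# Reads the given file, or the standard procfs entry if no argument is passed.
window_scaling_state() {
    file=${1:-/proc/sys/net/ipv4/tcp_window_scaling}
    if [ "$(cat "$file")" = "1" ]; then
        echo enabled
    else
        echo disabled
    fi
}

# Example (not run here):
#   window_scaling_state
```

Note that `sysctl -w` changes only the running kernel; to keep the setting across reboots, the line `net.ipv4.tcp_window_scaling = 0` would typically go in `/etc/sysctl.conf`. Without window scaling the TCP receive window is capped at 65,535 bytes, which can limit throughput on high-latency links.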

Colt
Mifpi

On Windows using PuTTY, you have to change the MTU of the VPN's network interface: go to the local connection for the VPN -> Details on the network interface (TAP-Windows Adapter or something like that) -> Advanced -> Properties -> MTU, and change it to something lower than 1500. You may have to reconnect. It worked for me on Windows with PuTTY.

Nick_K

It looks like this is a buffering issue. I have the same problem, and I can avoid it by throttling the transfer speed. Not the best way, but it might help someone find a better solution for this.

See update 1 here: How to prevent SSH freezes over an openvpn client to client connection

Atomo