
I have an issue with packets dropping to a third-party data center in Florida, USA. The issue only occurs on Azure Virtual Machines, no matter which Azure data center the VM is in. I've run the same tests simultaneously from other, non-Azure networks, and there is no packet loss. The Azure Virtual Machines were "vanilla" / out of the box, with no software loaded or other customizations.

I've already spoken to the network admins at the data center, and the only packets they see are the ones that don't time out; the packets that time out never reach their firewall. So it sounds like something on the Azure side, especially since the packets consistently drop/time out from multiple Azure data centers/regions. Does anyone know how I might solve this?

The test I was running was a continuous TCP ping (using tcping.exe) to port 80 (since ICMP is blocked on Azure):

tcping -t 216.155.111.149 80
tcping -t 216.155.111.151 80
tcping -t 216.155.111.146 80
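
For reference, a single tcping-style probe can be approximated with a short Python sketch that times the TCP handshake. This is an illustration, not a replacement for tcping.exe; the host and port passed in are up to the caller.

```python
# Rough Python equivalent of one tcping probe: time the TCP handshake
# to a host/port and treat timeouts/refusals as a dropped probe.
import socket
import time
from typing import Optional

def tcp_ping(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the
    connection times out or is refused (i.e. a 'dropped' probe)."""
    start = time.monotonic()
    try:
        # create_connection completes the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None
```

Calling this once per second in a loop mimics `tcping -t` and makes the success/failure pattern easy to log.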

Other evidence that it's not the third-party data center: I can run the same continuous TCP ping from my home and work computers and drop no packets. I also set up a VPN tunnel from the Azure VM to a VM at a non-Azure data center, and no packets are dropped. The only time packets are dropped is when the traffic goes out to the internet/WAN directly via Azure.

I know the next step would be some traceroute tests, but since Azure blocks ICMP, I had to use nmap to run a TCP traceroute; below are the screenshots from those tests.

nmap -sS -p 80 -Pn --traceroute 216.155.111.149

(Screenshots of the four traceroute runs: test1–test4.)

Andrew Bucklin
  • I'm seeing this too. It's bizarre. Did you ever get this resolved? – Charles Offenbacher Oct 27 '15 at 00:49
  • @CharlesOffenbacher, No, I am still having this issue. As a workaround for the time being, I created a Linux VM with another cloud hosting provider, installed a VPN server role on it, connected the Azure Windows Server 2012 R2 guest VM to that VPN server, and created a static routing policy that routes only the traffic destined for that IP range via the VPN connection (all other traffic still flows out via the Azure WAN to the internet as normal). But this isn't a permanent solution. I'm still hopeful someone will respond and help get this fixed permanently. – Andrew Bucklin Oct 27 '15 at 15:29
  • Wow, that's tough! I'm not sure if I'm actually experiencing exactly the same issue, but possibly. What I'm seeing is that servers in Azure are randomly unable to establish a connection with non-Azure servers for what looks like packet loss. However, when I tcpdump, I see that my non-Azure server actually receives a packet but doesn't respond occasionally. I'm thinking my issue is related to the Azure NAT doing some weird things with timestamps. http://stackoverflow.com/questions/8893888/dropping-of-connections-with-tcp-tw-recycle . – Charles Offenbacher Oct 27 '15 at 17:28
  • http://serverfault.com/questions/235965/why-would-a-server-not-send-a-syn-ack-packet-in-response-to-a-syn-packet – Charles Offenbacher Oct 27 '15 at 17:29
  • TCP traceroute won't work either as you learn each hop when you receive an ICMP "TTL expired" back. Those ICMPs won't make it back to your VM. Anyway, I was thinking if you might be hitting the scenario described here (shameless plug): http://blogs.msdn.com/b/mast/archive/2015/07/14/azure-snat.aspx and in any case, a simultaneous network capture would greatly help troubleshooting. – Pedro Perez Oct 29 '15 at 17:53
  • @PedroPerez Thanks for the comment, but I think there's something else going on here. First of all, I only have 1 VM behind the Cloud Service. Second of all, I can run a continuous PING to any other site (such as Google.com) and have absolutely no packets lost. However, when running that same PING test to this particular IP range, only the first 10-11 PINGs are successful followed by 26-28 PINGs that fail. Then that same pattern repeats itself. Remember, this same issue occurs in multiple Azure data centers from multiple vanilla VMs. The issue does NOT occur with VMs from other IaaS providers. – Andrew Bucklin Nov 02 '15 at 22:39
  • @AndrewBucklin I've reproduced and found a workaround for your issue. You still need those simultaneous captures if you want to get to the bottom of this, though. – Pedro Perez Nov 03 '15 at 15:39

1 Answer


As I mentioned in my comment, you're effectively hitting a scenario similar to the one described in this article: http://blogs.msdn.com/b/mast/archive/2015/07/14/azure-snat.aspx

I could easily reproduce your behaviour:

(Screenshot: issue reproduced.)

And I could easily work around the issue by adding an Instance-Level Public IP to the VM:

(Screenshot: issue solved.)

It is difficult to say exactly what is going on, as we don't have simultaneous captures, but my understanding is that the edge device (possibly a firewall) at the remote site (www.oandp.com) keeps closed connections in its connection table for longer than Azure does. When Azure reuses one of the freed (i.e. previously used) ports while the remote side still considers that connection not fully closed, our SYN packets get dropped.

The ILPIP applies a static NAT (a "one-to-one NAT"), so there is no port translation or port reuse (unless your OS does it), which avoids the issue.
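
The port-reuse mechanics can be illustrated with a small, hypothetical Python sketch: it pins an outgoing connection to a fixed local source port, which is roughly what a SNAT device does when it hands out a recently freed port. This only demonstrates local port pinning, not Azure's SNAT itself.

```python
# Hypothetical illustration of the port-reuse problem: a client that pins
# its local source port. If a remote stateful edge device still tracks the
# old (src IP, src port, dst IP, dst port) tuple from an earlier connection,
# a new SYN from the reused port can be silently dropped.
import socket
from typing import Tuple

def connect_from_port(dst: Tuple[str, int], src_port: int) -> int:
    """Connect to dst using a fixed local source port; return the port used."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow rebinding a port that may still be in TIME_WAIT locally.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", src_port))
    s.connect(dst)
    used = s.getsockname()[1]
    s.close()
    return used
```

Connecting twice in quick succession from the same source port to the same destination reuses the full 4-tuple; a firewall that still holds the first connection in its table may drop the second SYN, which would match the "first ~10 pings succeed, then a burst fails" pattern reported in the comments.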

Pedro Perez
  • Would this affect the private IP of the VM itself and the site-to-site VPN that's already in place for the customer? – Andrew Bucklin Nov 03 '15 at 19:11
  • No, it is just a public IP address for this specific instance, so the private IP (DIP) will remain the same. The ILPIP will be your public source IP by default for traffic going to Internet. – Pedro Perez Nov 03 '15 at 19:23
  • Bingo! Much appreciated. The traffic destined for the internet immediately switched to route via the new PIP that I assigned to the VM. VPN connectivity didn't even hiccup. Thanks again. – Andrew Bucklin Nov 03 '15 at 23:50
  • The "ILPIP" or instance-level public IP is not supported in modern Azure, so there is no solution for getting around this issue: http://superuser.com/questions/1132967/how-to-assign-instance-level-public-ipilpip-to-azure-vm-in-armresouce-manager – joonas.fi Jan 23 '17 at 09:26
  • Hi joonas.fi - In ARM ("modern Azure") you can attach public IPs directly to your VM. It is in fact the default when you deploy a VM. The problem above might only appear if you purposely deploy VMs in an Availability Set behind a Load Balancer and remove the VMs' public IP addresses (now known as PIPs). Hope this helps! – Pedro Perez Jan 23 '17 at 16:39