
I have a strange intermittent connectivity problem happening about once every two weeks.

First, my configuration: I am running a Hyper-V failover cluster with two physical hosts (node01 and node02). Both hosts run Windows Server 2008 R2 Hyper-V Server (the free one) with SP1. On those hosts I am running two VMs, each running Windows Server 2008 R2 Web edition with SP1. My storage server is Windows Storage Server 2008, connected via iSCSI. Both hosts as well as the storage server are running the latest network drivers, downloaded directly from Intel's website.

Here's the problem: 99.99% of the time, everything works perfectly. But about once every two to three weeks, both VMs simultaneously lose network connectivity, both incoming and outgoing. When this happens,

  1. I cannot RDP into either VM.
  2. I can RDP into either host.
  3. I can connect to either VM from the Failover Cluster Manager by right-clicking on the node and selecting 'Connect to Virtual Machine'.
  4. Once I connect to the VM as described in #3 above, I cannot get to any websites or machines on the LAN. Disabling and re-enabling the virtual network connection inside the VM doesn't fix the problem.
  5. If I move the VM to a different node, that fixes the problem (for the next two weeks).
  6. If I reboot the host and move the VM back onto it, that fixes the problem (for the next two weeks).
  7. When this happens, the failover cluster does NOT automatically failover the VM.
  8. There are no unusual event log entries on any of the hosts or VMs.

This has happened about 5 times with the same symptoms as described above. I suspect a network driver or network hardware issue, but since I'm already running the latest drivers I'm not sure what to do about it.

This is a real head-scratcher ... any ideas?

Update

I found a very similar case here: Virtual Machine loses network connectivity on Hyper-V Cluster

Update 7/29/2011

After installing hotfixes and updating network drivers, I am still experiencing the same problem. In response to the comment asking for hardware details, the server is an Intel SR1670HV, which is a 1U chassis containing two independent S5500HV motherboards. Communication is via the motherboards' integrated NICs, which are Intel 82574L. The network driver is version 16.2.49.0.

Mike

8 Answers


We used to have a problem like this where I work. I don't remember the exact details, but the final solution had to do with a conflicting MAC address dynamically assigned to a virtual network adapter. Pinning those down so they weren't dynamic helped a lot. You normally don't want to do that, because it can make it harder to move a virtual machine to a different host, but it helped us in this instance.
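Pinning the MAC can be scripted; here is a minimal sketch using the Hyper-V PowerShell module (available on Windows Server 2012 and later; on 2008 R2 the same setting lives in Hyper-V Manager under the VM's network adapter settings). The VM name is a placeholder:

```powershell
# Read the currently assigned dynamic MAC, then lock it in as static
# so the address can no longer change (or collide) when the VM moves.
$vm = "web01"   # placeholder VM name

$mac = (Get-VMNetworkAdapter -VMName $vm).MacAddress
Get-VMNetworkAdapter -VMName $vm | Set-VMNetworkAdapter -StaticMacAddress $mac
```

Note the trade-off mentioned above: a static MAC must be unique across every host the VM might migrate to, so keep an inventory if you pin more than a few.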

The other part is that the physical NICs were made by Broadcom, and we also had a configuration error there: a previous admin had incorrectly tried to use the Broadcom utility to trunk the two NICs together on the host for improved bandwidth/throughput. We removed that setup and configured one of the NICs so it had no IP at all on the host machine, but could still be used for passthrough to virtual guests. Then we set each virtual machine to use only one NIC or the other, balancing the load based on historical traffic. Of course that means no failover if an adapter or connection goes down, and we haven't followed through well to see whether traffic has remained balanced over time, but it's been rock solid stable since then.

Joel Coel

I am aware that this is an old question, but I encountered the same issue and wasted so much time getting it resolved that I thought I would share the solution that worked for me. I found the solution to my problem here:

http://invendows.wordpress.com/2008/03/06/network-issue-with-hyper-v/

The solution in my situation was to disable TCP Offloading on the VMs. I will quote the relevant section from the link:

In order to disable TCP Offloading I had to create and set a new registry value in each VM connected to the Broadcom 5708 NetXtreme II NIC.

I used the following registry change to disable TCP Offloading:

Key: HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Value(DWORD): DisableTaskOffload = 1

After disabling TCP offload on each VM this way, all trouble was over and I was able to connect multiple VMs to one NIC port of the Broadcom 5708 NetXtreme II NIC.

My server has Broadcom NetXtreme NICs, so it seems the cause of this issue was definitely driver related in my case, but setting DisableTaskOffload = 1 resolved the issue completely for me. I hope this information saves someone else hours of searching!
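If you are applying the registry change quoted above to several guests, it can be scripted from an elevated PowerShell prompt inside each VM; a sketch (reboot the guest, or restart its NIC, for the setting to take effect):

```powershell
# Create the DisableTaskOffload DWORD described above to turn off
# TCP task offload inside the guest.
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" `
    -Name "DisableTaskOffload" -PropertyType DWord -Value 1 -Force
```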

BruceHill

I have run into something similar in a much simpler Hyper-V environment, and ran across this article at Microsoft. Seems to fit with your situation if the web servers are heavily used.

http://support.microsoft.com/kb/974909 - The network connection of a running Hyper-V virtual machine is lost under heavy outgoing network traffic on a Windows Server 2008 R2-based computer

Christopher
  • The KB article you reference was pre-SP1, but I did a similar post-SP1 one that looks promising: http://support.microsoft.com/kb/2263829 – Mike Jun 10 '11 at 05:09
  • 1
    I removed this as the answer because I installed the hotfix but the problem is still occurring. Therefore, this question remains unanswered... – Mike Jul 29 '11 at 21:30

On the network adapter properties for the VM guest, have you disabled Jumbo Packets and Large Send Offload? Based on my experience with these settings, I would definitely try it.
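On Server 2012 and later, both settings can also be toggled from PowerShell; a hedged sketch (the adapter name and the exact "Jumbo Packet" display strings vary by driver, so check `Get-NetAdapterAdvancedProperty` first; on 2008 R2 use Device Manager, NIC properties, Advanced tab):

```powershell
# Disable Large Send Offload (v1/v2) on the adapter.
Disable-NetAdapterLso -Name "Ethernet"

# Disable jumbo frames; the display name/value strings are driver-specific.
Set-NetAdapterAdvancedProperty -Name "Ethernet" `
    -DisplayName "Jumbo Packet" -DisplayValue "Disabled"
```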

Greg Askew

We had this same problem, though in our case it happened every 24-48 hours. I would double-check that your antivirus/firewall product specifically supports Server 2008 with Hyper-V; if not, try a different product (or temporarily remove it, if feasible) as a test to see if the issue goes away.

After a call to Microsoft and several dump/log file uploads, they determined that TrendMicro OfficeScan was the culprit in our case. We were using a version that turned out not to be explicitly supported on Hyper-V; once we upgraded to the latest release, the problem went away.

Jesse

This turned out to be a hardware issue -- I isolated the problem to a Netgear GSM7224v2 managed switch, replaced it with a D-Link DGS-1024D, and everything has been working fine ever since.

As a "lesson learned," in this case I probably spent 99% of my diagnostic effort troubleshooting software settings for what turned out to be a hardware issue. I even paid Microsoft Support $259 (and spent a lot of time on the phone with them) to help me figure it out by poking around at software settings. I guess the moral of the story is to suspect your hardware just as much as your software.

Mike

I also had this issue, with a Microsoft Windows Server 2012 virtual server on a Microsoft Windows Server 2016 virtual host. I did not have Broadcom NICs, so I ruled those out. I tried disabling VMQ and IPsec task offloading to no avail.

What ended up working for me was removing the virtual NIC and re-adding it. It may have had something to do with what others mentioned about a poorly assigned dynamic MAC address, but I'm not sure.
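The remove/re-add step can be done from the host with the Hyper-V PowerShell module; a sketch with placeholder VM and switch names (shut the guest down first, and note the guest will see a brand-new adapter with a fresh dynamic MAC):

```powershell
# Remove the existing virtual NIC and attach a new one, forcing a
# fresh MAC assignment. Names below are placeholders.
$vm = "guest01"
Remove-VMNetworkAdapter -VMName $vm -Name "Network Adapter"
Add-VMNetworkAdapter -VMName $vm -SwitchName "ExternalSwitch"
```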

Just thought I'd add what worked for me in this instance for the next weary traveler.

Matt Binford

https://support.microsoft.com/en-us/kb/2986895

It is a known issue with Broadcom 1-gigabit network adapters.