5

We're running a number of VMs on a 6-node failover cluster of blades using Hyper-V.

We have an intermittent issue (every few days at different times - not a fixed frequency) of VMs losing network connectivity. Console access to the VM suggests all is fine, and the underlying blade has normal connectivity. To resolve the problem we either have to restart the VM or, more usually, live migrate it to another blade, which restores connectivity, and then migrate it back to the original blade.

I've had 3 instances of this with a specific VM running on a particular blade; however, it has also happened once with a different VM running on a different blade. All VMs and blades have the same basic setup and are running Windows Server 2008 R2.

Any ideas on where I should be looking to diagnose the possible causes of this problem, given that the event logs provide no help?

Edit:

I've checked that each blade is running the latest NIC drivers and all seem to be fine.

Something that is confusing me - a failover or restart of the VM resolves the issue. Whilst I need to work out the underlying issue that is causing the NICs to hang, I'm also concerned that the VM didn't fail over to another node, which would have solved the outage for me. Is there a way to configure the cluster so that it can tell that the VM guest has lost connectivity and fail it over? As things stand, the cluster assumes that the VM is running happily, as I presume Hyper-V reports that everything is fine even though there is a problem.
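
To make the idea concrete, here is a minimal sketch of the kind of external watchdog I have in mind, assuming the cluster itself cannot see guest connectivity. It is not a cluster feature: `tcp_probe`, `ConnectivityWatchdog`, the port, and the failure threshold are all hypothetical names and values chosen for illustration, and the `on_failure` callback stands in for whatever would actually trigger the live migration.

```python
import socket
from typing import Callable, Dict


def tcp_probe(host: str, port: int = 3389, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 3389 (RDP) is only an assumed default; use any port the guest serves.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class ConnectivityWatchdog:
    """Calls on_failure(vm_name) after `threshold` consecutive failed probes."""

    def __init__(
        self,
        probe: Callable[[str], bool],
        on_failure: Callable[[str], None],
        threshold: int = 3,
    ) -> None:
        self.probe = probe
        self.on_failure = on_failure  # hypothetical hook, e.g. start a live migration
        self.threshold = threshold
        self._failures: Dict[str, int] = {}

    def check(self, vm_name: str, host: str) -> bool:
        """Probe one VM once; return True if on_failure was triggered."""
        if self.probe(host):
            # Any successful probe resets the consecutive-failure counter.
            self._failures[vm_name] = 0
            return False
        count = self._failures.get(vm_name, 0) + 1
        if count >= self.threshold:
            # Reset so the action is not repeated on every later check.
            self._failures[vm_name] = 0
            self.on_failure(vm_name)
            return True
        self._failures[vm_name] = count
        return False
```

Run from a machine outside the cluster, something like `ConnectivityWatchdog(tcp_probe, migrate_vm).check("vm1", "10.0.0.1")` on a timer would only act after several consecutive failures, which avoids reacting to a single dropped probe. `migrate_vm` is a placeholder that would still need to be written.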

Edit:

Thought I'd update this since the problem is still outstanding - it's less frequent, but which VM is affected still seems random. The latest checks confirmed that all VMs are running the same MPIO drivers and the same driver versions for the virtual NICs. Everything looks identical to some VMs that run on the same blade centre but outside this cluster, and those VMs have never experienced any problems.

Chris W
  • TCP offload turned off? Could be a bad driver - I had crap like that happening left and right. – TomTom Mar 24 '10 at 09:04
  • Might want to check/uncheck the network optimisations – commandbreak Mar 24 '10 at 09:58
  • Is that on the virtual NIC in the VM or the physical host? It's currently set to "Rx & Tx Enabled", which I take to be the default since the other VMs all have that setting too. – Chris W Mar 24 '10 at 10:00
  • What type of blades? We have HP c7000s and noticed this problem; we updated the drivers and so far the problem has gone! Also, there are a couple of critical hotfixes that should be applied, especially if you're running Nehalems! tr – tony roth Mar 24 '10 at 21:57
  • any update on this? Lots of chatter (deleted "answers" saying that they have the same problem) so if you do have an update that'd be great. – Jeff Atwood Apr 03 '11 at 02:20

3 Answers

3

Could this be the answer to your problem: http://support.microsoft.com/kb/974909

  • Interesting - I wouldn't have said heavy traffic but it could be related. I'll look in to that hot fix further. – Chris W Mar 31 '10 at 14:22
  • I spoke with Microsoft support today about this exact same problem (see http://serverfault.com/questions/278860/why-are-my-hyperv-vms-randomly-losing-connectivity) and was pointed to a similar KB: http://support.microsoft.com/kb/2263829. – Mike Jun 10 '11 at 05:14
  • As a followup to my previous comment regarding the hotfix suggested by Microsoft, I can now say that it did NOT fix the problem. I am still experiencing the random loss of network connectivity described by the original question. – Mike Jul 29 '11 at 21:33
0

Do you by chance have port security turned on for your switch ports? If so, make sure that a large enough number of MAC addresses is allowed per port. Also, what is your network configuration like on the parents? Are you teaming?

Tatas
0

Not the ideal answer that I'd hoped for, but in this case it worked for our set-up...

We took the affected VMs out of the cluster, removed the NICs and then re-created them. At the same time, each blade was pulled from the cluster and had all its drivers updated before being added back in.

The loss-of-connectivity problem stayed away for the 6 weeks that I monitored them afterwards - a job change after that means I'm not sure if the problem is still resolved ;)!

Chris W
  • When you say you "removed the NICs and then re-created them" can you be more specific about what you did? Do you mean you removed the virtual NICs from the VMs or something else? – Mike Jun 10 '11 at 05:16
  • If I remember rightly, I deleted the NICs from the VMs. In Device Manager inside the VMs, I think I also deleted all the NIC entries by showing hidden devices. Then I added fresh NICs to the VMs and let the OS auto-detect them. – Chris W Jun 15 '11 at 10:05