I'm having a very odd issue with a single ESXi host.

I have 2 identical hosts: Core i3, 6 NICs, 16 GB RAM. 4 of the NICs are used for management, vMotion, and the VM network, all on different VLANs. They all go to an HP ProCurve 24-port gigabit switch in a static trunk. The other two NICs are for iSCSI.

There are 2 standard vSwitches (VSS): one with the 4 NICs, and a second with just the 2 NICs carrying iSCSI traffic.

Configuration on both hosts is identical, hardware is identical. Both hosts are running at about 30% utilization for both CPU and memory. They are running ESXi 5.1.

What is happening is that all of a sudden host 2 will drop out of vCenter (vCenter is hosted on a physical machine). No error, it just loses connection.

If I try to ping the host from vCenter I cannot. If I try to ping from my workstation I can most of the time and I can SSH into it. If I "test management network" from the DCUI it can ping the gateway and the dns servers. If I restart the management network I still cannot get to it from vCenter.

If I do a services.sh restart it all completes with no error but doesn't help, host is still not able to register with vCenter nor be pinged by vCenter.

The only thing so far that remedies this is to completely restart the host. I did a log export but I'm not really even sure what to look for at this point. What logs should I be looking at? The only other piece of information I can add is that this seems to happen at the same time of the day, early in the morning. There is nothing running at this time, no backup jobs etc.

ewwhite
TheEditor
  • Can you describe the hardware? We need to know server make/model, storage configuration (local disk? RAID controller? USB/SD card boot?) – ewwhite Jan 30 '14 at 13:27
  • Sure. Intel DH61CR board, core i3, 16g ddr3 ram, single sata internal drive. 4 intel pro ports and 2 syba per host. All VM's are on a FreeNAS, 10 disk SAN, running through iSCSI. – TheEditor Jan 30 '14 at 13:45
  • Can you give me the specific model of the NIC and the BUILD VERSION/NUMBER of the ESXi install? – ewwhite Jan 30 '14 at 13:49
  • Intel PRO/1000 Pt Dual Port Server Adapter, Syba SD-PEX24009 PCI-Express x. VMware ESXi 5.1.0 build-1065491 VMware ESXi 5.1.0 Update 1 – TheEditor Jan 30 '14 at 13:54
  • What is the network config for the vCenter physical machine? Also when you say you have 4 nics for management, etc.. How are those nics configured? Are these in an Active-Active configuration or Active-Passive? – Mike Naylor Jan 30 '14 at 14:20
  • On the vCenter server there is a single NIC connected directly to the switch. The 4 NICs are all active using IP Hash and the 4 physical ports on the switch are in a static trunk group. There are 2 vSwitches. The first has 2 VMkernel ports, one for vMotion (VLAN 5) and one for management (VLAN 1); it also has 3 port groups: Outside (VLAN 2), DMZ (VLAN 3), and VM Network (VLAN 1). The second vSwitch has 2 VMkernel ports, both iSCSI. Multipath iSCSI as per VMware instructions. – TheEditor Jan 30 '14 at 14:25
  • Has this ever worked 100% of the time or is this a new machine - just thinking it might be worth checking your firewall as I've seen that happen on boxes that don't have VC-to-Host UDP 902 allowed – Chopper3 Jan 30 '14 at 14:38
  • I've had this setup running for almost 7 months now. I just verified that port was allowed. Granted, this is not only my home production but also my home lab, so it's in an ever-changing state, but it runs perfectly. This issue has only appeared over the last few weeks. Which log would point me to a firewall issue? I was attempting to get SMTP working by creating an smtp.xml file, but that was removed and the system is back to normal. Plus, all actions done on one host have been done on both, and host 1 is stable. – TheEditor Jan 30 '14 at 14:44
  • @TheEditor Please post the relevant hostd.log snippet from the affected host/time. – ewwhite Jan 30 '14 at 15:06

1 Answer

Whenever I see these issues on whitebox hardware, I check the drivers (and firmware) of the critical components involved (NIC, storage) and then suggest updating to the newest revision of the ESXi distribution using the VMware Patch Portal or Update Manager.

Lab or no lab, you're running an old build: ESXi build 1065491 versus the current ESXi build 1483097.

Go ahead and run the updates as a first step. See: Are VMware ESXi 5 patches cumulative?

Following that, I would dig into the actual hosts' logs to see what's happening near the vCenter disconnection time. Check /var/log/hostd.log and /var/log/vmkernel.log.
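Since the disconnects reportedly happen around the same time each morning, it can help to cut both logs down to a window around that time. Below is a minimal Python sketch of that idea, run against copies of the logs from the log export. It assumes each log entry starts with an ISO 8601 UTC timestamp such as `2014-01-30T05:02:00.123Z`; the helper names `parse_ts` and `lines_near` are made up for illustration, and the format string would need adjusting if your logs differ.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none.

    Assumption: entries begin with e.g. 2014-01-30T05:02:00.123Z (UTC).
    """
    token = line.split(" ", 1)[0]
    try:
        parsed = datetime.strptime(token, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


def lines_near(lines, when, minutes=10):
    """Keep lines whose timestamp is within +/- `minutes` of `when`.

    Lines without a timestamp (wrapped or continuation lines) follow the
    decision made for the most recent timestamped line.
    """
    window = timedelta(minutes=minutes)
    keep = False
    selected = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            keep = abs(ts - when) <= window
        if keep:
            selected.append(line)
    return selected
```

For example, `lines_near(open("hostd.log"), datetime(2014, 1, 30, 5, 0, tzinfo=timezone.utc))` would return the entries within ten minutes of 05:00 UTC on that day. The date and time here are placeholders, not values taken from the question. Running the same filter over `vmkernel.log` makes it easier to line up NIC or driver messages with the moment hostd loses contact.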

If you're certain that there aren't any firewalling, DNS or other networking issues, this is your best bet to understand what's happening.
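One rough way to gain that certainty is to test name resolution and TCP reachability from the vCenter machine while the host is in its disconnected state. The sketch below is only an illustration: the function names are hypothetical, and the default ports (TCP 443 and 902) are an assumption about what vCenter needs to reach on the host. It cannot test the UDP 902 heartbeat mentioned in the comments, since a plain TCP connect says nothing about UDP.

```python
import socket


def resolves(name):
    """Return True if `name` resolves to at least one address."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable names.
        return False


def check_host(host, ports=(443, 902)):
    """Collect DNS and per-port TCP results for one host.

    Assumption: 443 and 902 are the TCP ports of interest; adjust as needed.
    """
    report = {"dns": resolves(host)}
    for port in ports:
        report[f"tcp/{port}"] = tcp_open(host, port)
    return report
```

Calling `check_host("esx2.example.lan")` (a placeholder hostname) from the vCenter server and again from a workstation that can still reach the host gives two reports to compare. If they differ, the problem is more likely on the path between vCenter and the host than on the host itself.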

If all else fails, this is ESXi, and you have shared storage. Spending time troubleshooting a build like this isn't always useful, especially if the other host is performing well. Copy your settings off via PowerCLI, rebuild and restore the host.

ewwhite
  • I think I'm going to run all updates then check back and see if that fixes it. I see that PR1012837 seems to address something along these lines with networking disconnecting. I'll do that later today and if I still have issues pull logs and check back. Thanks @ewwhite – TheEditor Jan 30 '14 at 15:30
  • Just an update: That fixed the issue. The random network crashes have stopped. – TheEditor Feb 03 '14 at 18:41
  • @theeditor The patch fixed the issue? – ewwhite Feb 03 '14 at 19:09
  • Yep seemed to. Haven't had any issues since then. Rock solid. It was my own fault for not doing it sooner. – TheEditor Feb 03 '14 at 19:20