
I have a standalone ESXi 5.5.0 host, build 2143827, running on a Dell R710 with 144 GB of RAM. It has approximately 20 VMs on it.

Right now, I cannot get onto the console via the VMware vSphere Client or SSH; it acts as if the server does not exist. The host comes back at seemingly random times, and I can then get onto it via SSH and the vSphere Client, but it will just go off the network again at some undetermined time in the future. I can access it through the emergency console on the physical host itself (Alt+F1).

However, all the VMs are active and working. About 10 times a day, though, all the VMs drop off the network for between 15 seconds and 5 minutes, then they come back just fine and everything keeps on ticking.

I have done the following:

  • It was on a previous build; I updated it to b2143827. This made no difference.
  • Ran `/sbin/services.sh restart`; this does not help the situation (the exact console commands I used are sketched after this list).
  • Restarted the physical host. This made no difference.
  • From the physical console (Alt+F1) I have pinged another physical device on the network. It does not drop any packets at all.
  • From the physical console, I have pinged a virtual machine on the host. It suffers approximately 80% loss.
  • From a remote machine, I can ping the management IP address with 0% packet loss
  • From a remote machine, I can ping a VM on the host and can see the host clearly go off and back on the network occasionally
  • I watched `tail -f /var/log/hostd.log` for a while and saw nothing untoward happening there.
  • The system is installed on an SD card. I have shut the server down, `dd`'d the card to another card, then booted from the new card. Same issue.
  • Tried a different network switch
  • Ran the Dell Update Manager and updated every single firmware to the latest version.
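
For reference, this is roughly what those console checks looked like from the ESXi shell; the IPs are placeholders, not our real addresses:

```
# Restart the management agents (did not help)
/sbin/services.sh restart

# From the Alt+F1 console: ping another physical device on the LAN (0% loss)
vmkping 192.168.1.1

# From the same console: ping a guest VM running on this host (~80% loss)
vmkping 192.168.1.50

# Watch the host agent log for anything unusual
tail -f /var/log/hostd.log
```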

I'm at a loss as to where to go from here. This server has operated flawlessly for the past 2.5 years. ESXi used to be installed on a physical drive, but six months ago it was moved onto the SD card so we could reconfigure the physical drives.

Mark Henderson

2 Answers


I'd suggest updating the firmware of the Broadcom NICs on your Dell PowerEdge server. The fact that you're seeing external connectivity problems in addition to the VM-specific ping loss points to a NIC issue.

  • Can you try another NIC device? (this host has four)
  • How many uplinks do you have from the Standard vSwitch? (You should have multiple live uplinks; the esxcli sketch below shows how to list them.)
  • How reproducible is the issue?
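
If it helps, something along these lines from the ESXi shell will show the physical NICs, their driver/firmware level and the vSwitch uplink assignment. This is standard esxcli on 5.5; swap vmnic0 for your adapter:

```
# List all physical NICs with link state, speed and driver
esxcli network nic list

# Driver and firmware details for a single NIC
esxcli network nic get -n vmnic0

# Standard vSwitch configuration, including active uplinks
esxcli network vswitch standard list
```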

Regarding the SDHC boot, I really only advocate SD/USB boot on ESXi servers that are members of a vSphere cluster and have shared storage. Due to the failure mode of those cards under ESXi, there's no advantage to using them to boot standalone systems. See the differences between ESXi's installable and embedded modes.

ewwhite
  • I'll build a Dell update disc for the server and try it tonight. I've been given the OK to try a fresh ESXi install on it, which I am doing right now. vSwitch has 2 active adapters, not sure of hashing mode. Have tried changing the NICs around. I cannot do anything to repro it on demand, but it occurs regularly. Let's see if it happens after a rebuild. – Mark Henderson Nov 16 '14 at 21:13
  • If you rebuild, can you skip the SD boot? It impacts your logging options. – ewwhite Nov 16 '14 at 21:14
  • All the current drives are VMFS with VMs living on them, so I can't reuse them, and there are no physical devices left. I could get a good fast USB drive instead if you think that's a better option? – Mark Henderson Nov 16 '14 at 21:24
  • Wait, I just re-read your comment on SD/USB, and you think it's *not* a good option. – Mark Henderson Nov 16 '14 at 21:36
  • I'll only do it with clustered hosts. For standalone, [it's a little dangerous](http://serverfault.com/questions/643980/backup-entire-usb-drive-containing-bootable-partitions-in-debian). – ewwhite Nov 16 '14 at 21:38
  • Hrm. Well I wish I'd known this 6 months ago. It seemed like a good idea at the time to make the most of the physical media. The installation back to the SDHC card is almost finished, so if this gets me through this blip I will organise a time to move ESXi back to the physical drives. – Mark Henderson Nov 16 '14 at 21:48
  • With Dell (LSI) or HP RAID controllers, just make a small Logical Drive/Virtual Drive of ~16GB for ESXi. Then make another for your VMFS filesystems. – ewwhite Nov 16 '14 at 21:57
  • Re-installing ESXi has not helped. This all points to a hardware problem. I'll try updating the drivers tonight, and I am getting another server to swap the drives into, worst case. – Mark Henderson Nov 16 '14 at 23:16
  • Firmware, not drivers... :) – ewwhite Nov 16 '14 at 23:20
  • Yeah, that's what I meant :\ – Mark Henderson Nov 16 '14 at 23:27
  • Firmware was shamefully out of date, but alas, after update no success. I received a spare R710 today, so going to swap the drives into the new server. Gotta be faulty hardware. – Mark Henderson Nov 17 '14 at 09:25
  • I am so pissed off at what the actual problem was (see my answer in this question). But hey, at least now I have a spare on-site server and totally up to date firmware. – Mark Henderson Nov 19 '14 at 04:53

After 3 days of non-stop troubleshooting, I have eventually found that the problem is... wait for it... our Cisco ASA crapping itself and flooding the network with bogus traffic.

Because we were running pretty basic switching, and the server environment is 100% virtualised, we didn't notice anything inside the network stack.

The biggest red herring I suffered here was pinging the guest OS from its own host. I would have thought that traffic was completely self-contained and never touched the physical NICs, but apparently not.

I eventually found the problem by mirroring the management port on the switch, watching traffic to/from it with Wireshark, and seeing traffic leave the source port but never, ever arrive at the destination. Because I couldn't see this from inside the network itself, it then only took me another 4 hours to isolate the ASA as the source of the problem.
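
For anyone who wants to repeat this, the capture itself was nothing clever; roughly this on the machine plugged into the mirror port (the interface name and IP are placeholders, and the same capture filter works in the Wireshark GUI):

```
# Capture only ARP plus ICMP to/from the test host, so missing replies
# and bogus ARP traffic stand out from the noise
tshark -i eth0 -f "arp or (icmp and host 192.168.1.50)"
```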

Since removing the ASA from the network, everything has been smooth sailing.


Turns out the ASA had not crapped itself; someone had created a mangled NAT rule without no-proxy-arp, so the ASA started responding to ARP requests for the entire internal /24. Deleting that rule (and serving a firm boot up the arse of the person who added it) gave us our what, why and who.
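
For posterity, the ASA-side fix boils down to the no-proxy-arp keyword on the NAT statement. This is only a sketch; the object names are made up and the exact rule shape will differ, but it shows where the keyword goes on a manual NAT line:

```
! Manual (twice) NAT with proxy ARP disabled for the translated addresses
nat (inside,outside) source static obj-inside-net obj-inside-net destination static obj-remote-net obj-remote-net no-proxy-arp route-lookup
```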

This also explains why the host-to-guest pings weren't staying inside the host as expected: the ASA was answering the ARP requests, so the host had no reason to treat it as host-internal traffic.
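
If you ever suspect the same thing, an easy check from any machine on the subnet is to clear the ARP entry for a VM, ping it, and see whose MAC answers; if it's the firewall's MAC rather than the VM's, you've found your proxy-ARP culprit. Linux commands below, with a placeholder IP:

```
arp -d 192.168.1.50      # clear any cached entry (needs root)
ping -c 3 192.168.1.50   # force a fresh ARP resolution
arp -n 192.168.1.50      # check which MAC address answered
```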

Mark Henderson