
I have a host that is part of a 4-host HA cluster.

Sometime yesterday I noticed the host stopped responding. In the vSphere console it shows up greyed out as (not responding) and all VMs on it show up as (inaccessible). The VMs themselves are still running normally; I can remote desktop to them and everything is up. There are critical servers on this machine. I have tried right-clicking the host and selecting "Connect", but after a few hours it simply fails. I cannot move the VMs on it; all actions are greyed out.

On the host, pressing F2 gives me the login prompt, but after entering my credentials nothing happens. ALT+F1 doesn't let me do anything as the local shell is not enabled. SSH is not enabled either. With ALT+F11 I can see that hostd has crashed, which is probably the problem. I called VMware, as I have full support, but after a very short call the engineer said there's nothing to do but forcefully shut down the host.

I would rather not do that; I would like to restart hostd, but I can't seem to get any access. I tried PowerCLI, but the connection to the host times out. Connecting the vSphere Client directly to the host also times out. Pinging the host works, so there is network connectivity at least.
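For reference, if I could get to any shell on the host (the local DCUI shell or SSH), what I'd want to run is the standard ESXi 5.x management agent restart, roughly:

```
# Restart just hostd (the agent vCenter and the vSphere Client talk to)
/etc/init.d/hostd restart

# Restart vpxa too, since vCenter connects through it
/etc/init.d/vpxa restart

# Or restart all of the management agents in one go
services.sh restart
```

But with the local shell and SSH both disabled and hostd dead, I have no way to reach a prompt to run any of it.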

Anyone know any other way to get the shell?

Thanks.

More info: Running ESXi 5.5.0 build 1331820 on a Dell PowerEdge R720 with a Dell PERC H710 controller.

I checked the DRAC and the local volume is healthy. It's actually only a RAID 1; all VMs are on a SAN. The VMware ESXi welcome page works, but if I click on "Browse datastores in this host's inventory" it never loads. The MOB also seems to be working properly at "https://hostip/mob/?moid=ServiceInstance&doPath=content".
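For what it's worth, the MOB check is just from a browser; the equivalent from my workstation would be something like this (hostip stands in for the management IP):

```
# -k skips validation of the self-signed certificate, -u prompts for the root password
curl -k -u root "https://hostip/mob/?moid=ServiceInstance&doPath=content"
```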

On the ALT+F11 console: 2014-09-11T7:15:02.329Z cpu12:57750311)hostd detected to be non-responsive

The same line repeats 11 times, with different timestamps and CPU numbers.

Enriquev
  • Can you tell me what type of hardware this server is running on? Server make/model, disk controller and disk arrangement, please. – ewwhite Sep 11 '14 at 15:14
  • @ewwhite It is running on a Dell PowerEdge R720; the Dell PERC H710 is the disk controller, but it only hosts the local datastore. All VMs are running on a SAN. – Enriquev Sep 11 '14 at 15:20
  • Also, can you provide the build number of ESXi? – ewwhite Sep 11 '14 at 15:21
  • And can you ssh to the host? – ewwhite Sep 11 '14 at 15:21
  • Nope, like I said SSH is disabled. Running ESXi 5.5.0 1331820 – Enriquev Sep 11 '14 at 15:22
  • Can you plan a time to shut down the critical servers and bring them up on other hosts so you can hard-reset it? – user16081-JoeT Sep 11 '14 at 15:26
  • @user16081-JoeT Yes and no. According to my SLA I have to give clients one week's notice before any downtime, so yes, I guess I will be able to shut them down, but I still don't see how I could bring them up on another host since I can't migrate. Also, if I find a solution to get a console or something, I would like to start hostd, vMotion the VMs off, and then shut down that host. – Enriquev Sep 11 '14 at 15:32
  • @Enriquev That's not going to happen. You will not be able to manage this host without a reboot. – ewwhite Sep 11 '14 at 15:34
  • You say that this is an HA cluster. Were the VMs on this host migrated to another host? Were these VMs on local storage? If so, how are you doing HA? It sounds like this host lost its connection to storage. – joeqwerty Sep 11 '14 at 15:37
  • @joeqwerty It's vSphere HA, but a condition where the host's local storage fails is a loophole. The host can't say, "I'm not healthy", and the VMs are definitely running because they and the networking stack are in RAM. If the OP turns the host off, it'll initiate HA failover and the VMs will restart on other available cluster members. I'm suggesting the soft shutdown from within the OS because it will be cleaner. Either way, there will be downtime. – ewwhite Sep 11 '14 at 15:40
  • @joeqwerty It's like ewwhite says: in this state HA was unable to kick in in time and the machines did not migrate. – Enriquev Sep 11 '14 at 15:49
  • @ewwhite: I think I understand what you're saying. From the perspective of HA this node is still up and running, so no migration of VMs occurs. But why do the VMs show as unavailable if they're on a SAN? Additionally, if it were a local storage failure, why doesn't the host crash outright, assuming vSphere is installed on local storage? Just trying to better understand the problem. I appreciate your insight. – joeqwerty Sep 11 '14 at 15:51
  • @joeqwerty vCenter relies on the host to tell it about the VMs that are running; since it can't speak to the host, it marks them as "Unavailable." Remember that HA is separate from vCenter and does its own thing using heartbeats. – Dan Sep 11 '14 at 15:54
  • @joeqwerty It's like removing a disk from a running Linux system. Programs in memory will continue to run. The networking should still run... But I bet you wouldn't be able to log on. And there would be a gap in the logs... vCenter still needs to heartbeat the host. But everything is showing up grey because hostd is unavailable. – ewwhite Sep 11 '14 at 15:55
  • @ewwhite Do you have any information on the heartbeat? If I were somehow able to block it, would migration start? – Enriquev Sep 11 '14 at 16:02
  • @Enriquev You don't have any options right now. You can read about [VMware HA *in great detail* here](http://www.yellow-bricks.com/vmware-high-availability-deepdiv/), but I've been through this. You need to reboot. – ewwhite Sep 11 '14 at 16:04

1 Answer


This sounds like a local storage issue to me. I worked in an environment with hundreds of ESXi hosts that ran on local RAID storage. Unfortunately, the local storage controllers in that hardware were unstable... a toxic mix of bad LSI firmware revisions, defective backplanes and Supermicro hardware.

But the behavior you're describing is indicative of a local storage issue. Your running VMs are in RAM and the network stack is unaffected, but the ability to manage the host is compromised. Your login doesn't work because the host can't read from local disk. The same goes for any other command that requires disk access.

Your best option here is to schedule an orderly shutdown of the VMs (from within the guest operating systems). From there, manually fail the host (power off, reboot, etc.). Let it remain in maintenance mode or outside of the cluster selection. Power your VMs on and allow them to run elsewhere in the vSphere cluster.
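If you end up doing the post-reboot power-ons from a host shell rather than the vSphere Client, the vim-cmd sequence on a surviving host looks roughly like this (the VM ID below is just an example):

```
# List the VMs registered on this host and note the numeric Vmid of each one
vim-cmd vmsvc/getallvms

# Check the power state of a given VM, then power it on (example Vmid: 42)
vim-cmd vmsvc/power.getstate 42
vim-cmd vmsvc/power.on 42
```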

If you're interested in debugging the host's issues, check the Dell DRAC for information about storage array status. That will point you in the right direction.
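Once you can reach a shell on the rebooted host, the vmkernel log is the other place to look. Something like the grep below should surface any controller resets or aborts from around the time hostd died (the driver keywords are my guess for a PERC H710):

```
# Look for storage controller errors in the current vmkernel log
grep -iE "megaraid|mpt2sas|scsi|abort|reset" /var/log/vmkernel.log | tail -n 50
```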

ewwhite
  • Thanks for your answer, I did check the DRAC and the local volume is healthy. It's actually only a raid 1. Forgot to mention, the vmware esxi welcome page works, but if I click on "browse datastores in this host's inventory" it never shows up. The mob seems to be working properly also "https://hostip/mob/?moid=ServiceInstance&doPath=content" – Enriquev Sep 11 '14 at 15:40
  • Also, how can I manually fail the host? When I right-click it, most of the options are greyed out. – Enriquev Sep 11 '14 at 15:41
  • @Enriquev turn it off. – ewwhite Sep 11 '14 at 15:41
  • @Enriquev Just because the volume says it's "healthy" doesn't mean it is. I didn't say your disks failed, I'm suggesting a controller lockup. You won't be able to tell until you can get into the host to see if it's able to log. But again, your fix here is a reboot. – ewwhite Sep 11 '14 at 15:45
  • I understand, I also saw your other comment. Thanks – Enriquev Sep 11 '14 at 15:47
  • Anecdotally, ESXi has turned out to be rather resilient with regard to local storage issues. A host of mine (4.1 at that time) was once booted off a LUN provided by a temporary testing FC target. Over half a year after I removed said FC target (and thus the boot volume for the ESXi instance), it was still running happily and serving VMs. I shut it down only because I needed to power down the entire cabinet. – the-wabbit Sep 11 '14 at 15:50
  • @the-wabbit Yes, it works... I mean, the VMs aren't down. But I've also seen this at scale. Enough to make informed recommendations on when to use local storage, USB/SDHC or SAN-boot, since the [failure modes are all different](http://serverfault.com/q/549253/13325). Had this been a USB/SDHC card, the host would still be manageable. – ewwhite Sep 11 '14 at 15:52