Intermittent full system hang on VM server running CentOS 5.10

Question

CentOS 5.10 / VMware ESXi 5.1

I've got an older email server running CentOS 5.10 (with Sendmail) that is experiencing intermittent hangs in which the system becomes completely unresponsive. During these hangs I can't connect to it at all, and the virtual console is unresponsive as well.

The strange part is that our VMware admin group isn't seeing any obvious resource or load spikes that would indicate insufficient resources. Furthermore, when I examine the system logs (e.g. maillog, messages), there is no log activity at all during the time of the hang, which suggests that these outages are severe enough to prevent logging (or perhaps that there's a filesystem/disk issue).
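One way to pin down the exact hang windows is to scan a log for stretches with no entries at all. The following is only a minimal sketch, not a tested tool: it assumes the classic syslog timestamp format (`Mon DD HH:MM:SS` at the start of each line), takes the year as a parameter because that format omits it, and uses an arbitrary five-minute threshold. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year=2014):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a classic syslog line.

    The year is an assumption supplied by the caller; syslog does not log it.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, threshold=timedelta(minutes=5), year=2014):
    """Return (last_seen, next_seen) pairs where logging went silent
    for longer than the threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_syslog_time(line, year)
        except ValueError:
            # Skip lines that do not start with a syslog timestamp.
            continue
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Passing an open file object for maillog or messages as `lines` should list the silent periods, which can then be compared against the times the VM was reported unresponsive. A quiet log is not proof of a hang on its own, so the threshold needs tuning to the server's normal traffic.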

The one abnormality is that the sendmail log level on the box was set very high (98 instead of the usual 9). I'm going to set it back to normal shortly.

I'm stumped on where I can go for more info here. Is there a thread dump that would tell me what the OS was working on during the hang?

Additional information:

Kernel version is: 2.6.18-371.4.1.el5 #1 SMP Thu Jan 30 06:09:24 EST 2014 i686 i686 i386 GNU/Linux
The storage is handled on a shared SAN.
VMware Tools is not installed on the system, as per internal policy. However, we've been running without it for a long time, so we don't think its absence is necessarily the root cause.
Specific version of VMWare is: VMware ESXi 5.1.0 build-2000251
Hardware is IBM 3850 M2, Model 7233AC1

Please post the ESXi build number/version. Also, describe the storage configuration of the host. Is it a SAN? Is it local storage? Server make/model may also help. — ewwhite, Nov 10 '14 at 21:10
Also post your kernel version and whether you're using VMware tools. — ewwhite, Nov 10 '14 at 21:19
@ewwhite thanks. Added some additional information. Checking with the vmware team for the rest of it. — Mike B, Nov 10 '14 at 21:50

score 2 · Answer 1 · edited Apr 13 '17 at 12:14

So, 32-bit CentOS 5.10... That's not necessarily a problem...

But you should always have VMware Tools installed when running a guest operating system that VMware supports. It can be extremely helpful when vSphere/ESXi host memory gets constrained, because it adds the memory balloon driver. It also provides better NIC options (for your EL5 system) and power management.

In general, look at what the SAN is doing at the time these issues occur. Also, if you're not using VMware tools, there's a good chance that ESXi isn't on a stable revision level. Please report back on the ESXi build number. You'll see it at the top of the vSphere Client when connected to the host.

Edit:

Since this is a vSphere cluster, can you have the team check memory allocation? I've seen Linux VMs hang or lock up because of bad memory configuration. This can include a RAM limit set in the vSphere Client for the VM in question. It can also include situations where your cluster is too overcommitted on RAM and/or where the VMs have been allocated too much RAM.

See: vSphere education - What are the downsides of configuring VMs with *too* much RAM?

Any deeper analysis would require seeing some of the VMware cluster/resource status screens.

answered Nov 10 '14 at 21:59

ewwhite

Thanks. I'll request more information from the vmware team regarding build. Can you elaborate on what you mean by "look at what the SAN is doing at the time..."? Should the VMWare team be reviewing logs? Performance metrics? Etc? – Mike B Nov 10 '14 at 22:25
Also, apologies for clearly asking a question that has been asked before (as per your meta post). I did check beforehand to see if this was already asked but didn't seem to find it. :-( – Mike B Nov 10 '14 at 22:26
Oh, the meta post was more about people not upgrading VMware. It's almost dangerous not to these days. – ewwhite Nov 10 '14 at 22:33
Added additional information regarding VMWare build and VM host hardware. Is 2000251 a "stable" build? – Mike B Nov 11 '14 at 15:48
@MikeB That's [July 2014](http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2079117). Not too bad... I suspect RAM, plus the lack of the balloon driver provided by the VMware Tools. See my edit above. – ewwhite Nov 11 '14 at 16:31
Thanks - I'm reviewing the other documentation but I'm still a bit confused (thanks so much for your patience). If these types of hangs are the result of bad memory configuration (or SAN issues), is that something that would be fairly obvious on the vmware management side? Is there a log that would show "proof"? The prevailing point coming back my way is that this is only happening on a few servers. I'm reluctant to tell them "check to see if it's something on your end" if they're not going to be able to prove it one way or the other. – Mike B Nov 11 '14 at 16:58
No, it's not immediately obvious. The VMware team would need to check the resource shares of the VM cluster/host in question. There are a few places to look, but it's hard to do this blind. – ewwhite Nov 11 '14 at 17:00

score 1 · Accepted Answer · answered Dec 03 '14 at 00:35

I just wanted to close the loop on this one. The mysterious hangs stopped occurring after we scaled back SendMail logging from 99 down to 9 (default). Admittedly, that was a reaaaaaaly high log-level setting but I’ve never seen that completely grind a server to a halt. Also no idea how long it had been set that way.

My guess is that the intermittent nature of this stemmed from a combination of mediocre disk I/O speeds and occasional SMTP load spikes.
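To check a guess like this on a similar system, one option is to count how many lines sendmail writes per minute and see whether the busiest minutes line up with the hangs. This is a rough sketch under the same assumption of classic syslog timestamps at the start of each line; the helper names are made up for illustration.

```python
from collections import Counter


def lines_per_minute(lines):
    """Count log lines per 'Mon DD HH:MM' bucket."""
    counts = Counter()
    for line in lines:
        # Only count lines shaped like 'Mon DD HH:MM:SS ...'.
        if len(line) >= 15 and line[9] == ":" and line[12] == ":":
            counts[line[:12]] += 1
    return counts


def busiest_minutes(lines, top=5):
    """Return the `top` minutes with the most log lines, busiest first."""
    return lines_per_minute(lines).most_common(top)
```

Running `busiest_minutes` over maillog before and after lowering the log level would give a crude picture of how much write volume the extra logging was generating during SMTP spikes. It says nothing about disk latency itself, so it can only support the theory, not confirm it.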

Thanks everyone for your help.
