The mystery of the bad CentOS template - all VMware VMs based on this template crash sometimes

Question

I have a CentOS 7 (1602) template from which I deployed roughly 200 VMs before I noticed the issue, so it would be ideal to fix these VMs rather than start from scratch.

The VMs fail 'randomly', usually between 7PM and 11PM, sometimes two nights in a row, sometimes not for a week or two. When one VM fails, most of the others fail too. They seem to lose disk access. Rebooting the VM immediately solves the issue, and it does not reoccur for at least 24 hours. Even when we don't reboot them until the next day, they still fail during this time window.

Some of the VMs have nothing installed on them and still have this issue. The root and boot partitions are hardly used. Logs show no issues.

No VMs are affected other than those built from this particular CentOS template. We are using VMware 4 (I know, I know), but we have never had any issues other than this, and new images have no issue. I see no spikes in CPU or disk use in VMware around the failure.

Here is a screenshot as it fails:

OnFailure

Here is a screenshot when trying to access the VM after a number of minutes has elapsed:

AfterFailure

Example bootstrap script used on these servers: http://pastebin.com/gs3AzV5m

@AirCombat Is the OS up-to-date? Have you updated the kernel to the current release? – ewwhite, Jul 25 '16 at 12:08
@ewwhite The bootstrap script above installs yum-cron. We update daily and reboot at least once a month for new kernels. The issue has been ongoing for a number of months. – ZZ9, Jul 25 '16 at 13:00

score 1 · Answer 1 · edited Apr 13 '17 at 12:14

This is probably due to OS support or a resource issue. EL7 was not intended for use with vSphere 4. The VMware support matrix reinforces this.

I see you're using open-vm-tools, but it looks like you may have a deeper issue.

See: https://access.redhat.com/solutions/21849
and: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009996

On running RHEL as a Virtual Machine under VMWare, the "soft lockup" messages might indicate high levels of overcommitment (especially memory overcommitment) or other virtualization overheads.
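To check whether the guests actually log the "soft lockup" messages mentioned in that quote, something like the following could be run over saved kernel log lines. This is a minimal sketch: the function name `find_soft_lockups` and the sample lines are made up for illustration, and the regular expression assumes the usual kernel wording ("soft lockup - CPU#N stuck for Ns").

```python
import re

# Assumed message shape, e.g.:
#   "BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:123]"
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")


def find_soft_lockups(lines):
    """Return a (cpu, seconds) tuple for each soft-lockup message found."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, read them from a saved
    # copy of the guest's kernel log.
    sample = [
        "kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:123]",
        "kernel: EXT4-fs (sda1): mounted filesystem",
    ]
    for cpu, seconds in find_soft_lockups(sample):
        print(f"CPU {cpu} stuck for {seconds}s")
```

If hits cluster in the 7PM to 11PM window across many VMs at once, that would point at something shared (host or storage) rather than at the individual guests.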

200 VMs is a large number, and vSphere 4 is an old release. I couldn't imagine starting a new rollout on such an old release of vSphere, and I'm sure you're no longer under VMware support.

What does the infrastructure and cluster setup look like?
How many hosts?
What are the hosts' resources? RAM amount? CPU type/count?
What type of storage?
What is the vCPU and RAM profile of these VMs?

Are you heavily overcommitted to the point where your system is killing itself?
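As a rough way to answer the overcommitment question, allocated resources can be compared against what a host physically has. The sketch below uses hypothetical numbers (a 512GB, 16-core host running 40 VMs with 4 vCPUs and 8GB each); `overcommit_ratio` is an illustrative helper, not a VMware API.

```python
def overcommit_ratio(allocated, physical):
    """Ratio of allocated to physical resources; above 1.0 means overcommitted."""
    if physical <= 0:
        raise ValueError("physical capacity must be positive")
    return allocated / physical


# Hypothetical host and VM profile, for illustration only.
host_ram_gb = 512
host_cores = 16
vms = [{"vcpus": 4, "ram_gb": 8} for _ in range(40)]

ram_ratio = overcommit_ratio(sum(vm["ram_gb"] for vm in vms), host_ram_gb)
cpu_ratio = overcommit_ratio(sum(vm["vcpus"] for vm in vms), host_cores)

print(f"RAM allocated/physical:  {ram_ratio:.2f}")
print(f"vCPU allocated/physical: {cpu_ratio:.2f}")
```

With these made-up figures RAM is not overcommitted, but vCPUs are allocated at ten times the physical core count, which low average CPU utilization alone would not reveal.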

answered Jul 25 '16 at 10:52

ewwhite

Thanks for your answer. I have 1000+ CentOS 7 VMs running on VMware 4. This issue is specific to this template, not CentOS as a whole. We have 8 hosts per cluster and 3 clusters, with 500GB RAM and 4x 4-core CPUs per host. The VMs typically have about 8GB of RAM each and 4 vCPUs (the max in VMware 4, I think). Storage is iSCSI. We have plenty of spare resources: IOPS is low, CPU use is around 5% globally, and RAM use is fairly low. The 2TB datastores have about 600GB remaining. That is thin provisioned; thick, it would be about 12TB, but that's because the virtual disks are big. – ZZ9 Jul 25 '16 at 10:58
But what are the VM profiles? How much total RAM per host/cluster and how much RAM per VM? How overcommitted are you? – ewwhite Jul 25 '16 at 11:01
Sorry, I hit enter before I finished typing. – ZZ9 Jul 25 '16 at 11:01
It's not an issue with overallocation. FYI, we are in the process of rolling out VMware 6, but these VMs will be running for another 6-9 months. We do not have access to Red Hat's website for political reasons... – ZZ9 Jul 25 '16 at 11:21
Well, I'm pretty sure it's incompatible. Just tell management and show them that it is unsupported. Migrate them to VMware 6 too, if you can. Or at least deploy the current template on vSphere 6 to evaluate whether that is causing it. – aairey Jul 25 '16 at 11:43
@AirCombat This still sounds way overcommitted. Can you post screenshots of the cluster resource distribution chart or anything like that? Otherwise, I don't know why a template would cause this type of issue... but maybe a fresh build or kickstart approach makes more sense. – ewwhite Jul 25 '16 at 18:35
Storage IOPS is less than 1k-2k, which is nothing (it has spiked at 45k on other clusters), memory use is around 60% globally, and CPU varies between 2% and 6% on each host. This issue affects only this template; older and newer CentOS 7 templates work fine, and we have never had issues with them. There is something in the OS causing this. – ZZ9 Jul 27 '16 at 11:22
@AirCombat Then make a new template. _We_ can't help you with this. – ewwhite Jul 27 '16 at 11:30
