
We bought some software from a smallish company: a 32-bit Windows video content workflow manager, with some customisation done by them.

We've been running this code without problems for over a year in a VMware ESXi 4.1u2 VM on W2K3EE 32-bit (that's what they support running it on).

Then they updated their code a month or so back and we started seeing one of the vCPUs periodically pegging at 100% while the second vCPU stays fairly idle, say 5-7%. We assumed the code was badly threaded and contacted them about it.

They've now come back to us saying that their code doesn't work in a VM, that they've known about this requirement for 18 months or so, and that they want us to V2P it. They say they only see this problem when it is run inside VMs. I have a call with their senior programmer scheduled in a few hours to discuss.

Luckily we have a few physical servers we can do this on; it's a bit time-consuming but doable.

My question, however, is this: given that this VM doesn't touch any hardware directly, is on a very modern host and has very low requirements (2 x vCPU, 4GB, 20GB boot vdisk, 100GB data vdisk, single vNIC and nothing else), what could possibly be the issue with running it in a VM, if there is one?

Obviously I'm strongly pursuing this with them, but I wondered if anyone else has found a regular application that somehow misbehaves inside a VM but not on a physical machine.

Chopper3
  • Are both vCPUs pulling from the same CPU? Do you have it set up so that each real core maps directly to a vCPU? Are you doing anything funny like having hyper-threading enabled on your CPUs? These questions should help identify anything on your end that may be causing the slowdown. You will probably have a better idea after talking to the senior programmer: either how to address the issues that may be cropping up from running it in a VM, or whether they are just doing it wrong. It could just be that the code is written in Java. – Wilshire Mar 13 '12 at 16:16
  • I'm letting ESXi do its own thing in terms of process scheduling, and on >55xx-series Xeons hyperthreading isn't considered 'funny'; it works and is very useful. Oh, and the code's .NET 3.5 by the way. – Chopper3 Mar 14 '12 at 08:56
  • I know that MySQL Cluster apparently does not 'officially' work in a virtualized environment either. Reason? Dunno! :P – Ben Ashton Mar 16 '12 at 22:15

3 Answers


While I can't speak for this vendor or the software package, I have worked for a large (multinational) vendor, where one of the pieces of software they sold had very specific known issues when running on VMware.

In this case, one issue could cause the software to deadlock, and the other could cause data corruption. As such, customers were advised not to run the software in a virtual environment. Some still did, and in all the cases I was aware of, they ran into one or both of the problems.

So while it is rare, there can be cases where software does not perform as you would expect it to in VMware.

While I realise it doesn't directly help with your problem, it does show that VMware is not always the perfect system.

Footnote: in this case the vendor was able to work with VMware to find resolutions (some code fixes, some VMware config changes), and they now have some (very specific) guidance on how to run the software on VMware.

Sam
  • That's exactly the kind of thing I'm sad about but grateful to hear. As I mention to Janne in his response, we get so used to things working correctly in VMs that finding such an odd set of circumstances left me a little bewildered, to be honest, so hearing from you that I'm not alone in this is comforting at least. I've not heard anything positive from the software vendor yet, but I know they're looking into the problem; I can't imagine a fix for a month or so though, unfortunately. Thanks again. – Chopper3 Mar 19 '12 at 12:51

With ESX v5 and the Monster VM limit (32 vCPU, 1TB RAM), the number of applications having issues in a VM is shrinking. Most of the ones I've experienced either:

- rely on time being linear (realtime processes or apps that need linear time ... this can usually be tweaked), or
- cause lots of hardware interrupts or context switching.
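
One rough way to check whether a guest is seeing non-linear time is to read a clock in a tight loop and look for gaps between consecutive readings. The sketch below is only an illustration in Python, not anything from the vendor or VMware; the function names and the 5 ms threshold are assumptions picked for the example. Large gaps only suggest that the loop was paused (for instance, the vCPU being descheduled); they also show up on a busy physical box, so compare runs on both.

```python
import time


def find_stalls(timestamps, threshold_s=0.005):
    """Return (index, gap) pairs where two consecutive samples are
    further apart than threshold_s (an arbitrary, assumed threshold)."""
    stalls = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if gap > threshold_s:
            stalls.append((i, gap))
    return stalls


def sample_clock(duration_s=1.0):
    """Busy-loop for roughly duration_s seconds, recording
    time.perf_counter() on every pass."""
    samples = []
    start = time.perf_counter()
    now = start
    while now - start < duration_s:
        now = time.perf_counter()
        samples.append(now)
    return samples


if __name__ == "__main__":
    samples = sample_clock(2.0)
    stalls = find_stalls(samples)
    print(f"{len(samples)} samples, {len(stalls)} gaps over 5 ms")
    for index, gap in stalls[:10]:
        print(f"  sample {index}: {gap * 1000:.1f} ms")
```

Running the same script in the VM and on a physical machine with similar load gives a crude comparison; it says nothing about why a gap occurred.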

In most cases, you should be able to ask your VMware rep to talk to those guys. I believe VMware still has a team of people dedicated to making things work (they had a support lab just for this in the early days).

As for a solution, I had a similar issue with a VM showing high CPU usage while the host had plenty of CPU resources free. We fixed the issue by migrating to a server with a Nehalem CPU and changing the CPU compatibility level in EVC (if you have a cluster with DRS/HA).

hdex
  • Thank you for your response - very kind of you when this really isn't a black and white kind of question. Your examples are very useful; I'm going to go back and examine context switching in particular. Oh, and all our servers are on exactly the same CPU (X5690s) with EVC set uniformly, but thanks again. – Chopper3 Mar 19 '12 at 13:39

I have seen a similar problem with VMware ESX + Debian 6 + OpenLDAP 2.4.x (whatever exact version of OpenLDAP apt-get installs...).

Under day-to-day operations it works OK, but things like importing a largish LDIF file with 400 000 or so entries are very slow (50-100x slower than on physical servers). Also, with long-duration, high-volume benchmarking everything runs smoothly with response times of a couple of milliseconds, but occasionally there are strange peaks ranging from 500 to 25 000(!) milliseconds.

With physical servers I'm unable to reproduce these problems. And yes, I spent around three weeks trying to isolate the problem, tuning all kinds of parameters from operating system parameters to slapd values to BerkeleyDB values ... nothing helped.

Janne Pikkarainen
  • Thank you very much for sharing your experiences. I can't say I don't find this whole thing slightly odd - I'm an experienced virtualisation geek and I'm so used to things just working that finding an application that does this has shaken my beliefs in a way, so it's good to hear I'm not in an isolated position. Thank you. – Chopper3 Mar 19 '12 at 12:49
  • Another two examples: Atlassian says that both `Jira` and `Confluence` are not recommended to run in a VM(ware) environment. There must be a pattern for these exceptions, I just have not figured out yet what that might be. My OpenLDAP installation is not very I/O intensive (3 MB/s write and not too many IOPS in peaks during benchmark), it uses maybe 20-40% CPU, and around 150 MB RAM. Should not be too hard to handle. Perhaps it has something to do with threading, but vmstat reports context switches etc to be at normal level. – Janne Pikkarainen Mar 19 '12 at 12:52
  • My current theory is that this has something to do with OS timekeeping. VMware has had all kinds of strange clock issues in the past, and even now you sometimes have to pass `tsc=pit`-style parameters during boot, and at least OpenLDAP is VERY sensitive to system clock accuracy. Maybe I should strace all the problematic apps and see if they all heavily use `gettimeofday()` or so. – Janne Pikkarainen Mar 19 '12 at 12:54
  • Thank you again, you're right about time in-VM, it's inherently all over the place so I'd understand this but I can't help but think that even if that were an issue it'd be a very quick problem for our vendors to spot in their code, mind you it's not actually a time-sensitive application, it just grabs video content and processes it, hmmm. Thanks again. – Chopper3 Mar 19 '12 at 13:36
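
The timekeeping theory in the comments above can be probed from inside the guest by watching how the wall clock moves relative to a monotonic clock. The following is a minimal Python sketch under stated assumptions: the helper names, sample count and interval are made up for illustration, and it only shows whether the wall clock is being stepped or slewed, not whether a particular application is sensitive to that.

```python
import time


def drift_series(pairs):
    """Given (wall, monotonic) sample pairs, return how far the
    wall-minus-monotonic offset has moved since the first pair."""
    if not pairs:
        return []
    base = pairs[0][0] - pairs[0][1]
    return [(wall - mono) - base for wall, mono in pairs]


def collect_pairs(count=10, interval_s=1.0):
    """Sample time.time() and time.monotonic() together, count times,
    sleeping interval_s between samples (both values are assumptions)."""
    pairs = []
    for _ in range(count):
        pairs.append((time.time(), time.monotonic()))
        time.sleep(interval_s)
    return pairs


if __name__ == "__main__":
    for offset in drift_series(collect_pairs()):
        # Values near zero mean the wall clock kept pace with the
        # monotonic clock; jumps mean the wall clock was adjusted.
        print(f"{offset * 1000:+.3f} ms")
```

On a host with a stable clock the printed offsets should stay close to zero; sudden jumps would indicate the wall clock being corrected, for example by time synchronisation in the guest.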