
Often, an installation of our on-site, Debian stable-based application runs in a virtual machine, typically in VMware ESXi. In the general case we have neither visibility into nor influence over the customer's virtualization environment, and we do not have access to e.g. the VMware vCenter client or equivalent. I focus on VMware here because it is by far the most common platform we see.

We'd like to:

  • Tell a customer's VMware admin: You can run our application in e.g. your VMware ESX environment, as long as it meets performance criteria X, Y and Z.
  • Be able to determine if criteria X, Y and Z are in fact met continuously (e.g. also right now), even on a running system (we cannot stop our application and run benchmarks, and an initial benchmark won't suffice, since performance in virtual environments changes over time).
  • Have confidence that if criteria X, Y and Z are met, we will have adequate virtual HW resources to run our application with satisfactory performance.

Now what are X, Y and Z?

We have seen time and again that when there are performance problems, the problem isn't with our application but with the virtualization environment: e.g. another virtual machine uses tons of CPU or memory, or the SAN on which the disks are actually stored gets heavy use from something other than our application. We currently have no way to prove or disprove that.

Theoretically it could also be possible that sometimes our application is slow... ;-)

How does one determine the root cause of our performance problems: Virtual environment or our application?

There are typically 3 areas for performance problems: CPU, memory and disk I/O.

CPU

In e.g. VMware the administrator can specify Reservation and Limit, expressed in MHz, but is e.g. 512MHz on one ESX host exactly the same as 512MHz on another ESX host, possibly in a completely different ESX cluster?

And how does one measure whether we actually get that? While our application is running, we can perhaps see that we are at 212% CPU utilization on 4 CPUs. Is that because our application is doing a lot or because another VM on the same host is running a CPU intensive task and using all the CPU?
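One in-guest signal that bears on this question is "steal" time, which Linux exposes in `/proc/stat`. The following is a minimal sketch, not a definitive tool: it samples the aggregate `cpu` line twice and reports what share of the interval was counted as steal. The function names and the 5-second interval are illustrative assumptions, and whether the hypervisor reports steal to the guest at all depends on the hypervisor and kernel version, so a reading of 0% does not by itself prove there is no contention.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters (in clock ticks) as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Share of elapsed CPU time counted as 'steal' between two samples."""
    # guest and guest_nice are already included in user and nice,
    # so only the first eight fields are summed for the total.
    keys = FIELDS[:8]
    total = sum(after.get(k, 0) - before.get(k, 0) for k in keys)
    if total <= 0:
        return 0.0
    stolen = after.get("steal", 0) - before.get("steal", 0)
    return 100.0 * stolen / total


def read_proc_stat(path="/proc/stat"):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    first = parse_cpu_line(read_proc_stat())
    time.sleep(5)  # sampling interval; an arbitrary choice for illustration
    second = parse_cpu_line(read_proc_stat())
    print(f"steal: {steal_percent(first, second):.1f}%")
```

Logged over time next to the application's own CPU usage, a number like this can help separate "we are busy" from "we are waiting for a physical CPU", within the limits noted above.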

Memory (Ballooning?)

If we ask for e.g. 16GB RAM, that is often configured, but because of ballooning, we actually only get 4GB, and surprise, our application performs poorly.

One can ask the VMware tools about the current ballooning, but we've found that it often lies (or is inaccurate at least). We've seen examples where the OS thinks there is 16GB total RAM, the sum of the resident memory (RSS) of all processes is 4GB RAM, but there is only 2GB RAM free, even when VMware tools tells us there is 0 ballooning :-(

Also, just adding RSS together isn't valid, as there could easily be shared RAM, e.g. copy-on-write memory so 512MB + 512MB doesn't necessarily mean 1GB but could mean something less. So one can't simply subtract RSS from all processes to get a measure for how much RAM should be free and thereby detect ballooning reliably. One can detect some cases of ballooning, but there are other cases where ballooning is in effect, but not detectable by this method.
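To illustrate the RSS double-counting point, here is a rough sketch that sums PSS (proportional set size, where each shared page is divided among the processes mapping it) from `/proc/<pid>/smaps_rollup` and compares it with `/proc/meminfo`. The helper names and the "unaccounted" figure are assumptions made for this example: the gap also contains ordinary kernel memory and can double-count shared memory, so it is a coarse hint rather than a ballooning detector. Reading other processes' `smaps_rollup` typically requires root.

```python
import glob
import re

# Matches lines such as "MemTotal:  16000000 kB" or "Pss:  300000 kB".
KB_LINE = re.compile(r"([\w()]+):\s+(\d+)\s+kB")


def parse_kb_fields(text):
    """Parse 'Name: <n> kB' lines (meminfo or smaps_rollup) into {name: kB}."""
    fields = {}
    for line in text.splitlines():
        match = KB_LINE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def total_pss_kb(smaps_texts):
    """Sum the Pss value over the given smaps_rollup contents."""
    return sum(parse_kb_fields(t).get("Pss", 0) for t in smaps_texts)


def unaccounted_kb(meminfo, pss_kb):
    """Memory that is not free, not cache/slab, and not attributed to a process.

    Coarse heuristic only: kernel allocations land here too, and Cached
    overlaps with shared memory that PSS may already count.
    """
    accounted = sum(meminfo.get(k, 0)
                    for k in ("MemFree", "Buffers", "Cached", "Slab"))
    return meminfo["MemTotal"] - accounted - pss_kb


def read_all_smaps_rollup():
    texts = []
    for path in glob.glob("/proc/[0-9]*/smaps_rollup"):
        try:
            with open(path) as f:
                texts.append(f.read())
        except OSError:
            continue  # process exited, or we lack permission
    return texts


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        meminfo = parse_kb_fields(f.read())
    pss = total_pss_kb(read_all_smaps_rollup())
    print(f"total PSS: {pss} kB, unaccounted: {unaccounted_kb(meminfo, pss)} kB")
```

A large and growing unaccounted figure on an otherwise steady system is at best a reason to ask the VMware admin about ballooning; it does not replace the host-side view.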

Disk I/O

I guess we could graph over time the number of disk reads and writes, the number of bytes read and written, and the IO wait %. But will that give us an accurate picture of disk I/O? I imagine that if there is a bitcoin miner running in another VM using all the CPU, our IO wait % will go up, even if the underlying SAN gives exactly the same performance, simply because our CPU resources go down, and hence IO wait (which is measured in %) goes up.
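As a complement to IO wait %, the counters in `/proc/diskstats` allow computing an average time per completed I/O over an interval, similar in spirit to the "await" column of `iostat`. The sketch below is illustrative only: the function names, the device name `sda` and the 5-second interval are assumptions. A latency measured inside the guest can still be inflated when the vCPU is descheduled, so it is a less CPU-dependent signal than IO wait %, not a CPU-independent one.

```python
import time


def parse_diskstats(text):
    """Parse /proc/diskstats into {device: counters}.

    After major, minor and name, the fields used here are: reads completed
    (1st), ms spent reading (4th), writes completed (5th), ms spent
    writing (8th).
    """
    stats = {}
    for line in text.splitlines():
        p = line.split()
        if len(p) < 14:
            continue  # skip malformed or very old short-format lines
        stats[p[2]] = {
            "reads": int(p[3]), "read_ms": int(p[6]),
            "writes": int(p[7]), "write_ms": int(p[10]),
        }
    return stats


def avg_latency_ms(before, after):
    """Average milliseconds per completed I/O between two samples of one device."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    if ios <= 0:
        return None  # no I/O completed in the interval
    ms = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    return ms / ios


def read_diskstats(path="/proc/diskstats"):
    with open(path) as f:
        return parse_diskstats(f.read())


if __name__ == "__main__":
    device = "sda"  # hypothetical device name; adjust to the actual disk
    first = read_diskstats()
    time.sleep(5)  # sampling interval; an arbitrary choice for illustration
    second = read_diskstats()
    if device in first and device in second:
        print(f"{device}: {avg_latency_ms(first[device], second[device])} ms per I/O")
```

Graphing this per-I/O latency alongside IOPS and throughput gives numbers that can be compared with what a storage admin sees on their side.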

So in summary, what language can we use to describe to e.g. a VMware admin, what performance we need, in a portable and measurable way?

Peter V. Mørch
  • What are the actual requirements of your application? What you've described so far is not enough for me to accurately gauge the resource requirements in my environment, and I'm well-versed in VMware. Your target audience would have an even more difficult time. In practice, I end up disregarding vendor requirements and measuring/right-sizing VMs based on historical metrics and observation using vRealize Operations Manager. – ewwhite Jun 18 '17 at 18:17
  • @ewwhite: I'm not a hardware expert by any means. But let me be specific and say it runs fine on a [Core i7-5820K](http://ark.intel.com/products/82932/Intel-Core-i7-5820K-Processor-15M-Cache-up-to-3_60-GHz) with 8GB RAM. Magnetic disks ca. 2015 are fine, SSD is better (I can be more specific here, if need be). We need 80GB free disk space. – Peter V. Mørch Jun 18 '17 at 18:40
  • As an admin, I'd say, "how many cores do I need to allocate, what is the actual RAM requirement, what is the storage requirement from an IOPS and throughput perspective, what is the growth rate of the storage, am I okay with thin-provisioning, etc?" – ewwhite Jun 18 '17 at 18:47
  • What does your application require from a performance perspective? Do you have benchmarks for your application? Saying `"It runs fine with x, y, and z"` isn't precise enough. You need to be able to tell your customers precisely what your application requires. If they give you those resources and the application performs poorly then the question isn't `"What do we need from a resource perspective?"`, but `"Why is it performing poorly even though the proper resources have been allocated?"` – joeqwerty Jun 18 '17 at 19:13
  • @joeqwerty: "What does your application require from a performance perspective?" - I have to admit, I don't know what language (e.g. terms, units) to answer your question in, which I guess is the core of my question. Do you have an example answer to your question (e.g. from another application), that you think is of high quality that we could emulate, modified to suit the needs of our application? – Peter V. Mørch Jun 19 '17 at 00:14
  • Windows has an API called `GetThreadTimes`. It tells you how long a thread is spending in user- or kernel-mode, which is far more useful than how much wall-clock time has elapsed, since it can tell you if your application is the one hogging the CPU. I don't know if Linux has something similar, but if it does, maybe look into that. – user541686 Jun 19 '17 at 00:14
  • @PeterV.Mørch Was this resolved? – ewwhite Oct 13 '17 at 10:37
  • @ewwhite: "Resolved"? No. I still don't have the 25-word incantation I can give to a VMware admin, and then be able to test and know that we will get predictable performance, because, as you know, "it depends". But I've accepted your answer, because I now think such a precise and measurable requirement is not possible and your information goes a long way towards speaking the proper language. In the future, I'm going to recommend we go the "If you want us to troubleshoot performance, we'll need at least view access to your vCenter" route. – Peter V. Mørch Oct 22 '17 at 08:54

1 Answer

  • Seriously, most VMware administrators aren't good at this: poor understanding of resource management, often no Linux knowledge (it helps) and a lack of time. I find most in-house admins have a difficult time maintaining deep virtualization knowledge.

  • Luckily, there's a book you can read!

  • Most VMware environments aren't great: poor cluster design, bad resource planning, substandard storage (e.g. Synology NAS), misconfigured HA, no monitoring or patching.

  • VMware as an organization fails us: They are particularly bad at disseminating up-to-date information and promoting best practices. Basic searches for common questions generate results from 2009 and older revisions of VMware, despite the fact that processes and designs have changed over time.

All of these things will work against you.

You should determine the real requirements of your solution. Being able to state accurately that your appliance requires 2 vCPU, 8GB RAM and 500 IOPS of storage performance would go a long way with someone like me.

The other approach is to observe a healthy or ideal environment and extrapolate the metrics from there.

You've described problems with certain deployments. What were the issues and bottlenecks?


An example of a right-sized VM:

An Exchange server for a 300-user organization.

  • We have 6 weeks of workload/stress heatmaps versus time.
  • 6 vCPUs keeps us above the stress zone with buffer room for spikes.
  • 32GB RAM keeps us above the stress value, but isn't an unreasonable amount above what's really needed.


  • I could reclaim a few GB of RAM and a vCPU, but all in, this is an efficient VM.
  • It would be wise to get this type of monitoring of your application under ideal conditions.



Examples of VM resource monitoring.

Good-ish:

  • VM is right-sized.
  • CPU is overcommitted across the cluster, but we're not running into contention.


Bad-ish:

  • VM won't ever get all of the RAM it's configured with.
  • VM is already swapping RAM.
  • CPU is way over-configured.


ewwhite
  • Thanks, ewwhite, for your answer. For the sake of argument, let's say that at one customer, it runs great with: 2 vCPU, 8GB RAM and 500 IOPS storage performance (from your answer). At another customer site, we ask for the same thing and get that, according to the VMware admin. However, the 2 vCPUs are shared with 17 other CPU-hungry VMs and the 8GB RAM is also ballooned. I don't understand VM disks very well, so let's say we actually get that. Our app performs great in the first of these two ESXi environments, and horribly in the other. How do I measure the difference from inside the VMs? – Peter V. Mørch Jun 18 '17 at 18:46
  • You can monitor "[CPU Steal](http://www.stackdriver.com/understanding-cpu-steal-experiment/)" in top within your VM to see if CPU has been too heavily overcommitted. For RAM ballooning/swapping, it's tough to tell from inside the VM, except for the bad performance. You can ask for a view of the vCenter and resources for the VM, though. See above for examples. – ewwhite Jun 18 '17 at 18:51
  • I'll look into CPU Steal. We do sometimes end up with the VMware admin pointing fingers at our application and us pointing fingers at the slow VMware environment. However, we most often don't have even view access to vSphere and then it becomes tough to troubleshoot, when it works fine in other installations. I guess one approach could be: "If you want us to troubleshoot performance, we'll need at least view access to your vCenter" – Peter V. Mørch Jun 18 '17 at 19:01
  • Most VMware admins don't even know how to read these things. I spend a lot of time cleaning up after them. So as a vendor, it's tough to ask for access or insight into their setup. But I think it would be best to solidify your requirements, then enforce them. While I usually don't recommend setting reservations, if your application is critical it may make sense. Or at the very least, setting a "shares priority". What does the application do? – ewwhite Jun 18 '17 at 19:05
  • "What does the application do?" - It is a network monitoring application. Collect statistics, logs, service status of equipment in the customer's network, presented in a web UI. So of course load requirements depend highly on number of hosts we monitor, amount of logs they generate, users in the UI and many other factors. But we (some of our other guys) have a feel for which setups need what HW. Just not when it comes to VMware, because we don't know what language to frame such absolute / "independent-of-ESXi-cluster-specifics" requirements in. – Peter V. Mørch Jun 19 '17 at 00:25
  • Have you guys considered distributing this as a VMware appliance with a small, medium and large recommended config? – ewwhite Jun 19 '17 at 00:47