Some questions you should answer before deciding on a path forward:
CPU usage for the affected VMs
- From the guest operating system's view, is CPU usage often above 80%, and/or does it show plateaus rather than spikes? If so, it's likely your VM is CPU starved. Add more vCPUs (but think about possible licensing issues).
- Are some vCPUs in your servers significantly less loaded than others? You could have a scaling problem in your application, where simply throwing more vCPUs into a single VM (or into a physical machine) won't help matters.
- Do the CPU ready times indicate that the host has been overcommitted? A rule of thumb you sometimes see is that you want less than 5% average ready time, but my experience is that even that is way too much for a system you actually do work in. Note that if you use vCenter, the indicated ready time is in aggregate milliseconds since the last graph update. In "realtime" view, the graph updates every 20 seconds (= 20000 ms), so the average percentage per vCPU for a VM can be calculated using the formula (indicated_ready_time * 100 / 20000) / number_of_vcpu.
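As a quick sanity check of the ready-time arithmetic, here is a minimal sketch of that formula in Python. The function name and the `interval_ms` parameter are my own for illustration; the 20000 ms default only applies to the "realtime" view, and other views would need their own update interval plugged in.

```python
def cpu_ready_percent(indicated_ready_ms: float,
                      number_of_vcpu: int,
                      interval_ms: float = 20000) -> float:
    """Average CPU ready percentage per vCPU for one graph sample.

    indicated_ready_ms: the summed ready time shown in the graph (ms).
    number_of_vcpu:     vCPUs assigned to the VM.
    interval_ms:        graph update interval; 20000 ms is the "realtime"
                        view (assumption: adjust for other views).
    """
    if number_of_vcpu < 1:
        raise ValueError("number_of_vcpu must be at least 1")
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    # Share of the interval spent waiting, summed over all vCPUs ...
    total_percent = indicated_ready_ms * 100 / interval_ms
    # ... averaged over the vCPUs of the VM.
    return total_percent / number_of_vcpu


# Example: a 2 vCPU VM showing 2000 ms ready time in realtime view.
print(cpu_ready_percent(2000, 2))  # 5.0
```

So a 2 vCPU VM showing 2000 ms in the realtime graph sits right at the 5% rule of thumb, which by my experience above is already too much.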
RAM usage
(Should always be checked from within the guest operating system)
- Usually above 80%? Add memory.
- Signs of memory leaks? Fix the application or be prepared to restart/reboot more often.
- Signs of heavy swapping? Check for configuration issues. Add memory.
- Do you have key applications/processes that "inexplicably" never grow beyond 4 GB of memory, no matter the load? They may be 32-bit builds that need to be rebuilt or reconfigured to utilize 64-bit addressing.
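If the guests happen to run Linux (an assumption on my part, the same idea applies to other systems with their own tools), the "above 80%" check can be sketched by parsing the contents of /proc/meminfo. The function name is hypothetical, and counting `MemAvailable` as free is one reasonable way to discount page cache, not the only one.

```python
def memory_used_percent(meminfo_text: str) -> float:
    """Percentage of RAM in use, from /proc/meminfo-style text.

    Treats MemAvailable as memory the guest could still hand out,
    so page cache is not counted as "used".
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # First token is the value (in kB for the memory fields).
            fields[name.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) * 100 / total


sample = "MemTotal:        8000000 kB\nMemAvailable:    2000000 kB\n"
print(memory_used_percent(sample))  # 75.0
```

Inside a guest you would feed it `open("/proc/meminfo").read()` and sample it over time; a single reading says little about whether usage is "usually" high.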
Also check disk and network performance for latency issues.
Depending on how your application scales, it might be a better idea to add more web servers than to add compute power or memory to the existing ones.
Once you have an idea of where your bottlenecks are and how best to utilize your hardware, you can start making a business case for what to purchase.
The main case for virtual machines is that they are easier to manage, easier to back up, and easier to migrate in case of system failure. They allow for better utilization of your hardware, provided that they don't actually require all the resources you can throw at them. And if you use paravirtualized network interfaces, communication between machines on the same host is as fast as the CPU can manage rather than being limited to physical network interface speeds.
A system running directly on a physical machine will, of course, have no overhead due to resource sharing, but this is only a benefit if you can use the available power.