How much overhead does x86/x64 virtualization have?

Question

How much overhead does x86/x64 virtualization (I'll probably be using VirtualBox, possibly VMware, definitely not paravirtualization) have for each of the following operations on a Win64 host and Linux64 guest using Intel hardware virtualization?

Purely CPU-bound, user mode 64-bit code
Purely CPU-bound, user mode 32-bit code
File I/O to the hard drive (I care mostly about throughput, not latency)
Network I/O
Thread synchronization primitives (mutexes, semaphores, condition variables)
Thread context switches
Atomic operations (using the lock prefix, things like compare-and-swap)

I'm primarily interested in the hardware assisted x64 case (both Intel and AMD) but wouldn't mind hearing about the unassisted binary translation and x86 (i.e. 32-bit host and guest) cases, too. I'm not interested in paravirtualization.

(1) "x86" means 32-bit. You will not be able to run 64-bit code. AMD64 (also known as x64) virtualization has different limitations because it requires hardware extensions. (2) Do you mean x86 virtualization by binary translation (x86 only) or hardware assisted virtualization (VT)? — Skyhawk, Apr 20 '11 at 23:55

score 29 · Answer 1 · edited Oct 21 '17 at 19:16

I found that there is no simple, absolute answer to questions like yours. Each virtualization solution behaves differently on specific performance tests. Also, a test like disk I/O throughput can be split into many different tests (read, write, rewrite, ...), and the results vary from solution to solution and from scenario to scenario. This is why it is not trivial to name one solution as the fastest for disk I/O, and why there is no absolute answer for a label like "overhead for disk I/O".

It gets more complex when trying to relate different benchmark tests to each other. None of the solutions I tested performed well on micro-operation tests. For example, inside a VM a single call to "gettimeofday()" took, on average, 11.5 times more clock cycles to complete than on hardware. The hypervisors are optimized for real-world applications and do not perform well on micro-operations. This may not be a problem for your application, which may behave more like a real-world workload. By micro-operation I mean any application that takes fewer than 1,000 clock cycles to finish (on a 2.6 GHz CPU, 1,000 clock cycles take about 385 nanoseconds, or 3.85e-7 seconds).
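To make the cycle arithmetic concrete and to get a rough feel for micro-operation cost on your own host and guest, here is a minimal sketch. It is not rdtscbench: the helper names are made up for illustration, `time.time()` only stands in for a cheap clock call like gettimeofday(), and the loop measures wall-clock time including Python call overhead rather than raw clock cycles. Run the same script on hardware and inside the VM and compare the two results with each other, not with the figures above.

```python
import time


def cycles_to_ns(cycles: float, clock_hz: float) -> float:
    """Convert a cycle count to nanoseconds at a fixed clock rate."""
    return cycles / clock_hz * 1e9


def mean_call_ns(func, iterations: int = 100_000) -> float:
    """Average wall-clock cost of one call to func, in nanoseconds.

    The result includes interpreter overhead, so it is only useful
    for comparing the same script across environments.
    """
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    return (time.perf_counter_ns() - start) / iterations


# 1,000 cycles on a 2.6 GHz CPU
print(f"{cycles_to_ns(1_000, 2.6e9):.1f} ns")  # 384.6 ns

# time.time() is an assumed stand-in for a gettimeofday()-style call
print(f"{mean_call_ns(time.time):.1f} ns per call")
```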

I did extensive benchmark testing on the four main data center consolidation solutions for the x86 architecture. I ran almost 3000 tests comparing performance inside VMs with performance on hardware. I call 'overhead' the difference between the maximum performance measured inside the VM(s) and the maximum performance measured on hardware.
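As a sketch of how such an overhead figure can be computed, the snippet below expresses the difference as a percentage of the hardware result, and expresses timing results as a "times slower" ratio. Normalizing by the hardware result is my assumption about the definition, and the function names and sample values are hypothetical.

```python
def overhead_percent(native: float, virtual: float) -> float:
    """Performance lost inside the VM, as a percentage of the native result.

    Both arguments are 'higher is better' scores, e.g. throughput.
    """
    return (native - virtual) / native * 100.0


def slowdown_factor(native_time: float, virtual_time: float) -> float:
    """How many times longer an operation takes inside the VM."""
    return virtual_time / native_time


# Hypothetical measurements, chosen only to illustrate the arithmetic
print(f"{overhead_percent(native=100.0, virtual=85.0):.1f}% overhead")
print(f"{slowdown_factor(native_time=2.0, virtual_time=23.0):.1f} times slower")
```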

The solutions:

VMware ESXi 5
Microsoft Hyper-V Windows 2008 R2 SP1
Citrix XenServer 6
Red Hat Enterprise Virtualization 2.2

The guest OSs:

Microsoft Windows 2008 R2 64 bits
Red Hat Enterprise Linux 6.1 64 bits

Test Info:

Servers: 2X Sun Fire X4150 each with 8GB of RAM, 2X Intel Xeon E5440 CPU, and four gigabit Ethernet ports
Disks: 6X 136GB SAS disks over iSCSI over gigabit ethernet

Benchmark Software:

CPU and Memory: Linpack benchmark for both 32 and 64 bits. This is CPU and memory intensive.
Disk I/O and Latency: Bonnie++
Network I/O: Netperf: TCP_STREAM, TCP_RR, TCP_CRR, UDP_RR and UDP_STREAM
Micro-operations: rdtscbench: System calls, inter process pipe communication

The averages are calculated with the parameters:

CPU and Memory: AVERAGE(HPL32, HPL64)
Disk I/O: AVERAGE(put_block, rewrite, get_block)
Network I/O: AVERAGE(tcp_crr, tcp_rr, tcp_stream, udp_rr, udp_stream)
Micro-operations: AVERAGE(getpid(), sysconf(), gettimeofday(), malloc[1M], malloc[1G], 2pipes[], simplemath[])
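The category averages above are plain arithmetic means over the individual tests. A minimal sketch with hypothetical per-test overheads (these are not my measured values) shows both the averaging and how much spread a single category figure can hide:

```python
from statistics import mean

# Hypothetical per-test network overheads in percent, for illustration only
network_overheads = {
    "tcp_crr": 30.0,
    "tcp_rr": 28.0,
    "tcp_stream": 10.0,
    "udp_rr": 27.0,
    "udp_stream": 15.0,
}


def category_overhead(per_test: dict) -> float:
    """Arithmetic mean of the per-test overheads in one category."""
    return mean(per_test.values())


avg = category_overhead(network_overheads)
low = min(network_overheads.values())
high = max(network_overheads.values())
print(f"average {avg:.1f}%, range {low:.1f}% to {high:.1f}%")
```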

For my test scenario, using my metrics, the averages of the results of the four virtualization solutions are:

VM layer overhead, Linux guest:

CPU and Memory: 14.36%
Network I/O: 24.46%
Disk I/O: 8.84%
Disk latency for reading: 2.41 times slower
Micro-operations execution time: 10.84 times slower

VM layer overhead, Windows guest:

CPU and Memory average for both 32 and 64 bits: 13.06%
Network I/O: 35.27%
Disk I/O: 15.20%

Please note that these values are generic and do not reflect any specific scenario.

Please take a look at the full article: http://petersenna.com/en/projects/81-performance-overhead-and-comparative-performance-of-4-virtualization-solutions

`For a 2.6 GHz CPU, 1,000 clock cycles are spent in 23 milliseconds`, shouldn't that be a simple division of 1,000 by 2,600,000 to get the number of seconds 1,000 clock cycles take? (which is not 23 milliseconds) — dvdvorle, Mar 07 '13 at 11:49
@Mr. Happy, you are right. I got 385 nanoseconds by: 1000 / 2600000000 = 0.000000385 = 385 nanoseconds. Do you agree with this? Thanks for pointing this out. — Peter, Mar 07 '13 at 15:19
@dyasny, I'm looking for hardware to repeat the tests with updated versions. Any idea where I can find it? — Peter, Mar 07 '13 at 15:20
@PeterSenna Yes agreed! I was even wrong in my comment xD (2.6 GHz isn't the same as 2,600,000 Hz ofc) — dvdvorle, Mar 07 '13 at 18:43

score 4 · Answer 2 · answered Apr 21 '11 at 05:09

There are too many variables in your question, but I can try to narrow it down. Let's assume that you go with VMware ESX and do everything right: the latest CPU with virtualization support, VMware Tools with paravirtualized storage and network drivers, and plenty of memory. Now let's assume that you run a single virtual machine on this setup. From my experience, you should get ~90% of native CPU speed for a CPU-bound workload. I cannot tell you much about network speeds, since we use 1 Gbps links and I can saturate them without a problem; it may be different with a 10 Gbps link, but we do not have any of those. Storage throughput depends on the type of storage: I get around 80% of native throughput with local storage, but for 1 Gbps NFS it is close to 100%, since networking is the bottleneck there. I cannot speak to the other metrics; you will need to experiment with your own code.
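The NFS observation can be illustrated with a simple bottleneck model: whatever the hypervisor costs on the storage path, the result is capped by the slowest component, so a slow link can hide the virtualization overhead entirely. This is only a sketch under assumptions: the 400 MB/s device speed is a made-up number, the 0.8 efficiency is the rough local-storage figure from above, and protocol overhead on the link is ignored.

```python
from typing import Optional


def fraction_of_native(device_mb_s: float, vm_efficiency: float,
                       link_mb_s: Optional[float] = None) -> float:
    """Throughput seen in the VM as a fraction of the bare-metal throughput.

    vm_efficiency is the share of device throughput that survives the
    hypervisor; link_mb_s caps both cases when storage is on the network.
    """
    native = device_mb_s
    virtual = device_mb_s * vm_efficiency
    if link_mb_s is not None:
        native = min(native, link_mb_s)
        virtual = min(virtual, link_mb_s)
    return virtual / native


GIGABIT_MB_S = 1e9 / 8 / 1e6  # 1 Gbps is 125 MB/s before protocol overhead

print(f"local storage: {fraction_of_native(400.0, 0.8):.0%}")
print(f"1 Gbps NFS:    {fraction_of_native(400.0, 0.8, GIGABIT_MB_S):.0%}")
```

With these assumed numbers the local case shows 80% of native while the NFS case shows 100%, because both the bare-metal and the virtualized run hit the 125 MB/s link limit first.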

These numbers are very approximate and depend heavily on your load type, your hardware, and your networking. It gets even fuzzier when you run multiple workloads on the server. But what I'm trying to say is that under ideal conditions you should be able to get close to 90% of native performance.

Also, from my experience, the much bigger problem for high-performance applications is latency, and this is especially true for client-server applications. We have a computation engine that receives requests from 30+ clients, performs short computations, and returns results. On bare metal it usually pushes the CPU to 100%, but the same server on VMware can only load the CPU to 60-80%, primarily because of the latency in handling requests and replies.
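One way to see why added latency lowers CPU load is a closed-loop model in which every client waits for its reply before sending the next request. This is a simplified sketch, not a description of our actual engine: it assumes a single CPU, identical clients, and made-up timings (100 microseconds of computation per request, 2 ms versus 4 ms spent outside the computation).

```python
def cpu_utilization_bound(clients: int, compute_us: float,
                          other_latency_us: float) -> float:
    """Upper bound on CPU utilization for a closed-loop request/reply workload.

    Each client has at most one request in flight, so it can generate at most
    compute_us of CPU work per (compute_us + other_latency_us) of wall time.
    """
    per_client = compute_us / (compute_us + other_latency_us)
    return min(1.0, clients * per_client)


# Hypothetical timings: doubling the non-compute latency drops the bound
print(f"lower latency:  {cpu_utilization_bound(30, 100.0, 2000.0):.0%}")  # 100%
print(f"higher latency: {cpu_utilization_bound(30, 100.0, 4000.0):.0%}")  # 73%
```

In this model the server is not slower at computing; the clients simply cannot feed it requests fast enough once each round trip takes longer.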

I can speak from experience that saturating a 10GbE link with a single VM is very difficult. We've used VMware FT, which can easily saturate a 1Gbps link on its own, over 10GbE and it didn't come close to saturating it. — Mark Henderson, Apr 21 '11 at 05:31

score 1 · Answer 3 · edited May 29 '20 at 05:48

I haven't dug down to the performance of the basic primitives like context switching and atomic operations, but here are my results of a brute force test I carried out recently with different hypervisors. It should be indicative of what you might expect if you are mostly CPU and RAM bandwidth limited.

https://altechnative.net/virtual-performance-or-lack-thereof/

edited May 29 '20 at 05:48 by Gordan Bobić · answered Aug 07 '12 at 13:25 by Gordan

That's great that you've got some info for Xen and KVM... But what about the two most popular hypervisors?! They're completely missing. And you've included several Type 2 hypervisors; no sane SysAdmin would use those for production. – Chris S Aug 07 '12 at 13:32
Down voted. Link is dead. – Tim Duncklee Feb 13 '18 at 01:49
