Today we encountered a really strange behaviour on two identical KVM/QEMU hosts (Dell R910). Each host has 4 x 10 cores, so with Hyper-Threading the 40 physical cores show up as 80 logical CPUs in the operating system (Ubuntu Linux 10.04 64-bit, kernel 3.0).
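For reference, this topology can be verified directly on the host; a minimal check using standard util-linux and procfs tools:

lscpu | egrep 'Thread|Core|Socket|^CPU\(s\)'
grep -c '^processor' /proc/cpuinfo   # logical CPUs; 80 here with Hyper-Threading enabled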
We started a Windows 2003 32-bit VM (1 CPU, 1 GB RAM; we varied those values several times) on one of the nodes and noticed that it took 15 minutes until the boot process began. During those 15 minutes a black screen is shown and nothing happens. libvirt and the host system show the qemu-kvm process for the guest as almost idle. Running strace against this process shows only some futex calls, nothing special.
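For completeness, this is roughly how we attached to the guest process ("win2003" is a placeholder for the actual domain name):

pid=$(pgrep -f 'qemu-kvm.*win2003')   # find the qemu-kvm process of the guest
strace -f -p "$pid"                   # follow all of its threads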
After those 15 minutes, the Windows VM suddenly starts booting and the Windows logo appears. After a few seconds, the VM is ready to be used. Once up, the VM itself performs very well, so this is not a general performance issue.
We tried pinning the vCPUs to host cores with the virsh and taskset tools, but this only made things worse.
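A sketch of what we tried (placeholder domain name and host core; $pid as found above):

virsh vcpupin win2003 0 4   # pin vCPU 0 of the guest to host core 4 via libvirt
taskset -pc 4 "$pid"        # alternatively, pin the whole qemu-kvm process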
When we boot the Windows VM from a Linux live CD, there is also a black screen for several minutes, though not the full 15. Booting another VM on this host (Ubuntu 10.04) shows the same black-screen problem, again for only 2-3 minutes instead of 15.
So, summarizing this: every guest on both of those identical nodes idles for several minutes after being started. After a few minutes, the boot process suddenly begins. We have observed that the idle period starts right after the guest's BIOS has initialized.
One of our employees had the idea of limiting the number of CPUs with the kernel parameter maxcpus=40 (matching the 40 physical cores) in GRUB, and suddenly the black-screen idling behaviour disappeared.
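For anyone wanting to reproduce this on Ubuntu with GRUB 2, the parameter goes into /etc/default/grub (a sketch; adjust to your existing command line):

# in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash maxcpus=40"

sudo update-grub                     # regenerate grub.cfg, then reboot
cat /proc/cmdline                    # verify the parameter is active
grep -c '^processor' /proc/cpuinfo   # should now report 40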
Searching the KVM and QEMU mailing lists, forums, Server Fault and various other sites for known bugs turned up nothing useful. Asking in the developer IRC channels brought no new ideas either; the people there recommended CPU pinning, but as stated above it didn't help.
My question is now: is there some kind of CPU limit for a QEMU or KVM host system? Browsing the source code of both tools showed that KVM prints a warning if the host has more than 255 CPUs, but we are nowhere near that limit.
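The only related limit we could query directly is the guest-side vCPU maximum, which is not the host CPU count this question is about; if your virsh is new enough:

virsh maxvcpus kvm   # upper bound on vCPUs per guest for the kvm driver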
Some details about the host system:
Kernel: 3.0.0-20-server
kvm: 1:84+dfsg-0ubuntu16+0.14.0+noroms+0ubuntu4
kvm-pxe: 5.4.4-7ubuntu2
qemu-kvm: 0.14.0+noroms-0ubuntu4
qemu-common: 0.14.0+noroms-0ubuntu4
libvirt: 0.8.8-1ubuntu6
CPUs: 4 x Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz, 10 cores each
Edit: We also tried the 3.2 kernel (without the maxcpus parameter); unfortunately this made things worse. dstat shows a rising number of context switches (csw):
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 99 0 0 0|1164k 638k| 0 0 | 0 0 |4972 6319
0 1 99 0 0 0| 0 0 |3456B 4847B| 0 0 | 18k 33k
0 1 99 0 0 0| 0 0 |6126B 4550B| 0 0 | 17k 33k
0 1 99 0 0 0| 0 0 |1772B 4139B| 0 0 | 17k 33k
0 1 99 0 0 0| 0 0 |5507B 3674B| 0 0 | 17k 32k
Normal values for this system with one VM running would be around 7000.
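The numbers above come from a plain dstat run; to attribute the switches to a specific process, pidstat from the sysstat package can help (assuming a reasonably recent sysstat; $pid is the qemu-kvm process as above):

dstat 1                  # csw column = system-wide context switches per second
pidstat -w -p "$pid" 1   # voluntary/involuntary switches for the guest process only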
Edit: I started the host systems with maxcpus=40 as a boot parameter. virsh nodeinfo now shows 40 physical cores and no Hyper-Threading siblings.
When starting the virtual machine, there is still a "boot break" of around 30 seconds. During that time the number of context switches rises from about 300 per second to 600,000 per second. After 30 seconds of black screen the VM begins its normal boot process and the context switches drop back below 7,000 per second:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
1 2 97 0 0 0| 943k 0 | 26k 12k| 0 0 | 22k 40k
3 7 84 6 0 0| 26M 64k| 71k 18k| 0 0 | 10k 16k
1 1 97 1 0 0|5282k 2560B|9751B 15k| 0 0 | 13k 23k
1 4 95 0 0 0|1216k 0 | 14k 18k| 0 0 | 295k 592k
1 3 96 0 0 0| 0 52k|5518B 7299B| 0 0 | 228k 456k
1 3 96 0 0 0| 12k 24k|1228B 1514B| 0 0 | 258k 518k
1 4 96 0 0 0| 0 0 | 14k 32k| 0 0 | 280k 565k
1 3 96 0 0 0| 0 0 | 19k 38k| 0 0 | 284k 573k
1 3 96 0 0 0| 0 0 |6465B 7203B| 0 0 | 288k 581k
1 3 96 0 0 0| 0 172k| 26k 11k| 0 0 | 290k 584k
1 3 96 0 0 0| 0 0 | 23k 11k| 0 0 | 288k 580k
1 3 96 0 0 0| 0 12k|5678B 4556B| 0 0 | 289k 583k
1 3 96 0 0 0| 0 0 |1192B 2929B| 0 0 | 288k 580k
1 3 96 0 0 0| 0 0 |6304B 10k| 0 0 | 272k 547k
1 3 96 0 0 0|4096B 52k|8330B 14k| 0 0 | 300k 605k
1 3 96 0 0 0| 0 24k| 11k 20k| 0 0 | 293k 591k
1 3 96 0 0 0| 0 0 | 13k 28k| 0 0 | 291k 587k
1 3 96 0 0 0| 0 512B| 10k 18k| 0 0 | 291k 587k
2 3 95 0 0 0| 0 0 |6653B 10k| 0 0 | 167k 337k
3 0 97 0 0 0| 0 160k| 23k 5524B| 0 0 | 10k 19k
7 0 92 0 0 0| 0 36k| 22k 3335B| 0 0 | 949 924
10 0 90 0 0 0| 0 0 |5172B 3318B| 0 0 | 908 923
5 0 94 0 0 0| 0 0 |2234B 2825B| 0 0 | 846 875
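To see which qemu-kvm threads are actually busy during those 30 seconds, a per-thread view helps (again $pid is the guest process):

top -H -p "$pid"   # show individual threads of the qemu-kvm process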
Edit: As requested, here is an excerpt of the strace -f -p <pid> output:
25734 <... read resumed> "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128) = 128
25752 futex(0x927e60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
25734 rt_sigaction(SIGALRM, NULL, {0x4b2300, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7fe09ac108f0}, 8) = 0
25734 write(8, "\1\0\0\0\0\0\0\0", 8) = 8
25734 read(15, 0x7fffcea69f70, 128) = -1 EAGAIN (Resource temporarily unavailable)
25734 timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 0}}) = 0
25734 timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0
25734 timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 182592}}) = 0
25734 futex(0x927e60, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25752 <... futex resumed> ) = 0
25734 <... futex resumed> ) = 1
25752 futex(0x927e60, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25734 select(25, [7 10 14 15 16 17 18 24], [], [], {1, 0} <unfinished ...>
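The loop above looks like qemu's main I/O loop: a 250 µs one-shot timer (timer_settime), futex wake-ups between threads, then a select over the event file descriptors. To get a per-syscall summary instead of the raw stream, one option (assuming coreutils' timeout is available):

timeout -s INT 30 strace -c -f -p "$pid"   # detach after 30 s and print syscall counts/times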