6

Today we encountered a really strange behaviour on two identical kvm and qemu hosts (Dell R910). The host systems each have 4 x 10 Cores, which means that 40 physical cores are displayed as 80 within the operating system (Ubuntu Linux 10.04 64 Bit, Kernel 3.0).

We started a Windows 2003 32 Bit VM (1 CPU, 1 GB RAM, we changed those values multiple times) on one of the nodes and noticed that it took 15 minutes until the boot process began. During those 15 minutes, a black screen is shown and nothing happens. libvirt and the host system show that the qemu-kvm process for the guest is almost idling. stracing this process only shows some FUTEX entries, but nothing special.

After those 15 minutes, the Windows VM suddenly starts booting and the Windows logo occurs. After a few seconds, the VM is ready to be used. The VM itself is very performant, so this is no performance issue.

We tried to pin the CPUs with the virsh and taskset tools, but this only made things worse.

When we boot the Windows VM with a Linux Live CD there is also a black screen for several minutes, but not as long as 15. When booting another VM on this host (Ubuntu 10.04) it also has the black screen problem, and also here the black screen is only shown for 2-3 minutes (instead of 15).

So, summerinzing this: Each guest on each of those identical nodes suffers from idling a few minutes after being started. After a few minutes, the boot process suddenly starts. We have observed that the idling time happens right after the bios of the guest was initialized.

One of our employees had the idea to limit the amount of CPUs with maxcpus=40 (because of 40 physical cores existing) within Grub (kernel parameter) and suddenly the "black-screen-idling"-behaviour disappeared.

Searching the KVM and Qemu mailing lists, the internet, forums, serverfault and other various sites for known bugs etc. showed no useful results. Even asking in the dev IRC channels brought no new ideas. The people there recommend us to use CPU pinning, but as stated before it didn't help.

My question is now: Is there a sort of limit of CPUs for a qemu or kvm host system? Browsing the source code of those two tools showed that KVM would send a warning if your host has more than 255 CPUs. But we are not even scratching on that limit.

Some stuff about the host system:

3.0.0-20-server
kvm 1:84+dfsg-0ubuntu16+0.14.0+noroms+0ubuntu4
kvm-pxe 5.4.4-7ubuntu2
qemu-kvm 0.14.0+noroms-0ubuntu4
qemu-common 0.14.0+noroms-0ubuntu4
libvirt 0.8.8-1ubuntu6
4 x Intel(R) Xeon(R) CPU E7-4870  @ 2.40GHz, 10 Cores

Edit: Also tried the 3.2 kernel (with maxcpus parameter not being used) - unfortunately this made things worse. dstat shows a rising amount of context swiches:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  0   0  99   0   0   0|1164k  638k|   0     0 |   0     0 |4972  6319 
  0   1  99   0   0   0|   0     0 |3456B 4847B|   0     0 |  18k   33k
  0   1  99   0   0   0|   0     0 |6126B 4550B|   0     0 |  17k   33k
  0   1  99   0   0   0|   0     0 |1772B 4139B|   0     0 |  17k   33k
  0   1  99   0   0   0|   0     0 |5507B 3674B|   0     0 |  17k   32k

Normal values would be around 7000 for this system with one VM up.

Edit: I started the host systems with maxcpus=40 as a boot parameter. virsh nodeinfo shows 40 physical cores and no hyperthreaded ones.

When starting the virtual machine, it still has a "boot break" of around 30 seconds. During that thime, the amount of context switches rises from 300 (per second) to 600 000 (per second). After 30 seconds of black screen, the VM starts the normal boot process and the context switches go down to <7000 a second:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  1   2  97   0   0   0| 943k    0 |  26k   12k|   0     0 |  22k   40k
  3   7  84   6   0   0|  26M   64k|  71k   18k|   0     0 |  10k   16k
  1   1  97   1   0   0|5282k 2560B|9751B   15k|   0     0 |  13k   23k
  1   4  95   0   0   0|1216k    0 |  14k   18k|   0     0 | 295k  592k
  1   3  96   0   0   0|   0    52k|5518B 7299B|   0     0 | 228k  456k
  1   3  96   0   0   0|  12k   24k|1228B 1514B|   0     0 | 258k  518k
  1   4  96   0   0   0|   0     0 |  14k   32k|   0     0 | 280k  565k
  1   3  96   0   0   0|   0     0 |  19k   38k|   0     0 | 284k  573k
  1   3  96   0   0   0|   0     0 |6465B 7203B|   0     0 | 288k  581k
  1   3  96   0   0   0|   0   172k|  26k   11k|   0     0 | 290k  584k
  1   3  96   0   0   0|   0     0 |  23k   11k|   0     0 | 288k  580k
  1   3  96   0   0   0|   0    12k|5678B 4556B|   0     0 | 289k  583k
  1   3  96   0   0   0|   0     0 |1192B 2929B|   0     0 | 288k  580k
  1   3  96   0   0   0|   0     0 |6304B   10k|   0     0 | 272k  547k
  1   3  96   0   0   0|4096B   52k|8330B   14k|   0     0 | 300k  605k
  1   3  96   0   0   0|   0    24k|  11k   20k|   0     0 | 293k  591k
  1   3  96   0   0   0|   0     0 |  13k   28k|   0     0 | 291k  587k
  1   3  96   0   0   0|   0   512B|  10k   18k|   0     0 | 291k  587k
  2   3  95   0   0   0|   0     0 |6653B   10k|   0     0 | 167k  337k
  3   0  97   0   0   0|   0   160k|  23k 5524B|   0     0 |  10k   19k
  7   0  92   0   0   0|   0    36k|  22k 3335B|   0     0 | 949   924 
 10   0  90   0   0   0|   0     0 |5172B 3318B|   0     0 | 908   923 
  5   0  94   0   0   0|   0     0 |2234B 2825B|   0     0 | 846   875 

Edit: As requested, I'll add an excerpt of the strace -f -p :

25734 <... read resumed> "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 128) = 128
25752 futex(0x927e60, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
25734 rt_sigaction(SIGALRM, NULL, {0x4b2300, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7fe09ac108f0}, 8) = 0
25734 write(8, "\1\0\0\0\0\0\0\0", 8)   = 8
25734 read(15, 0x7fffcea69f70, 128)     = -1 EAGAIN (Resource temporarily unavailable)
25734 timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 0}}) = 0
25734 timer_settime(0x1, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0
25734 timer_gettime(0x1, {it_interval={0, 0}, it_value={0, 182592}}) = 0
25734 futex(0x927e60, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25752 <... futex resumed> )             = 0
25734 <... futex resumed> )             = 1
25752 futex(0x927e60, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
25734 select(25, [7 10 14 15 16 17 18 24], [], [], {1, 0} <unfinished ...>
PythonLearner
  • 1,022
  • 2
  • 12
  • 29
  • Have you tried another distro? Watching kvm and qemu questions on this site, I think I have only see one or two issues I could reproduce on RHEL or Fedora. The rest turned out to be poor packaging and testing on Ubuntu/Debian. – dyasny Jun 22 '12 at 12:06
  • Unfortunately switching to another distribution is not possible.. – PythonLearner Jun 22 '12 at 12:20
  • 3
    I have many problems with KVM on a Ubuntu 10.04 host. I think the KVM in the version is a little buggy. Is it maybe possible to upgrade to the latest LTS? – Christopher Perrin Jun 22 '12 at 14:06
  • Upgrading to Ubuntu 12.04 might be an option - I am currently discussing it with my team members. – PythonLearner Jun 22 '12 at 14:23
  • 2
    Has this behavior only occurred with Windows guests, or can you repeat it with others? – Brennan Jun 23 '12 at 19:12
  • This behaviour has also occured with Linux guests, but less dramatic. It also does not matter if I boot a guest from hard disk or a live CD, such as Knoppix. – PythonLearner Jun 23 '12 at 19:29
  • 1
    Also since 3.0.0-20-server is not the kernel shipped with 10.04 you should rephrase the question accordingly as this is not 10.04's KVM. – pfo Jun 24 '12 at 08:11
  • Well, it is the KVM version which comes via default with 10.04, isn't it? The kernel version 3.0 was installed later. But sure, I can update my question with this information. – PythonLearner Jun 24 '12 at 09:10
  • 1
    What do you mean with "less dramatic" with Linux guests? Is there a blackout, too, or can you see the boot-process that is busy with initializing all the CPUs? – Nils Jun 24 '12 at 20:33
  • "less dramatic" means that Linux guests show this effect, too, but not for 15 minutes. Instead, the "boot break" and black screen "only" takes a couple of minutes (maybe 3-4). – PythonLearner Jun 24 '12 at 21:11
  • 1
    Have you tried using `strace -p ` ? – Oneiroi Jun 24 '12 at 21:15
  • Yes, unfortunately I can't see much (besides MANY FUTEX entries). – PythonLearner Jun 25 '12 at 08:31
  • I just updated the question with the strace log excerpt. – PythonLearner Jun 25 '12 at 08:32

2 Answers2

5

As recommend in one of the comments (thanks cperrin88), Ubuntu 12.04 brought the solution. Some parameters:

  • Kernel 3.2
  • 80 cores (40 physical, 80 because of Intel HT)
  • kvm 1:84+dfsg-0ubuntu16+1.0+noroms+0ubuntu13
  • kvm-ipxe 1.0.0+git-3.55f6c88-0ubuntu1
  • qemu-kvm 1.0+noroms-0ubuntu13
  • libvirt 0.9.8-2ubuntu17.1

The Windows guest now shows a boot bar during the first 30 seconds of the boot and then just boots (normal behaviour).

The amount of context switches is now very low compared to the test scenarious I've had earlier (between 200 and 24k per second).

So, problem solved. I just need to find out what changed (I guess it was a bug in KVM).

Thanks to all the comments and your efforts!

PythonLearner
  • 1,022
  • 2
  • 12
  • 29
3

I have encountered quite a few bugs with KVM on Ubuntu 10.04. (which I still have to use) including growing caches that get swapped and serious performance issues.

I recommend upgrading to the latest LTS release in hope it will fix a few bugs.

Christopher Perrin
  • 4,741
  • 17
  • 32