Terrible multi-threaded performance in Kubuntu in VirtualBox in Debian machine

3

We have an unpleasant performance issue when running our multi-threaded rendering software inside a virtual machine.

We are running Kubuntu 12.04 in VirtualBox 4.0.10_Debianr72436, which runs headless on a Debian computing server (6.0.6, kernel 2.6.32-5-amd64). The server has two 6-core Intel Xeon X5660 processors with hyperthreading (24 logical cores) and around 64 GB of RAM. We connect to the VM via TigerVNC Viewer for X version 1.1.0. The virtual machine is currently set up to use all 24 cores, but the problems described below can also be observed when it is configured with fewer (e.g., 12).

The problem:

When we run our renderer with just one rendering thread, it runs at a speed comparable to what we get when running directly on the metal on other machines (Intel Core 2 Duo MacBooks). However, as we increase the number of worker threads, it speeds up only slightly (the run time stays far from the ideal 1/n), and at around 5 threads it actually starts to slow down. With 8 or more threads it is even slower than the single-threaded run. When the renderer runs directly on the metal on our MacBooks there are no such issues, no matter how many threads we specify. For instance, 16 threads on a dual-core CPU run as fast as the two-threaded instance.

We then tried running multiple single-threaded instances of our renderer in parallel, with a surprising result. When we run 4 instances, everything is OK: they each run at a similar speed to a single instance. But when we run 6 instances, all of them slow down by around 50%!

We also tried another renderer (pbrt v.2) to see whether other software scales any better. It scaled well up to 13 threads, but then it slowed down as well (though not as much as our software).

Our renderer is written in Objective-C combined with C and bits of assembly. We use XADD and CAS operations in our code for accessing shared data, and we strongly suspect that these two are the source of our problems. Any ideas on this?
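One way to test that suspicion in isolation is a small contention benchmark, sketched here under some assumptions: it uses the GCC/Clang `__sync_fetch_and_add` builtin (which on x86 compiles to a LOCK-prefixed add or XADD) rather than the renderer's own assembly, it assumes a 64-byte cache line, and the names `run_counters` and `total_count` are made up for this example. Every thread either hammers one shared counter or its own padded counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 64
#define CACHE_LINE  64   /* assumed cache-line size in bytes */

/* One counter per thread, padded so that each sits on its own cache line. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static volatile long shared_counter;
static struct padded_counter private_counters[MAX_THREADS]
    __attribute__((aligned(CACHE_LINE)));

struct worker_arg { int id; long iterations; int use_shared; };

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    volatile long *target = arg->use_shared
        ? &shared_counter
        : &private_counters[arg->id].value;

    for (long i = 0; i < arg->iterations; i++)
        __sync_fetch_and_add(target, 1);   /* atomic add with full barrier */
    return NULL;
}

/* Runs nthreads workers and returns elapsed wall-clock seconds,
 * or -1.0 on bad arguments or thread-creation failure. */
double run_counters(int nthreads, long iterations, int use_shared)
{
    pthread_t threads[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    struct timespec t0, t1;
    int started = 0, failed = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1.0;

    shared_counter = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        private_counters[i].value = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++) {
        args[i].id = i;
        args[i].iterations = iterations;
        args[i].use_shared = use_shared;
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (failed)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Sum of all counters after a run; should equal nthreads * iterations. */
long total_count(void)
{
    long sum = shared_counter;
    for (int i = 0; i < MAX_THREADS; i++)
        sum += private_counters[i].value;
    return sum;
}
```

Calling `run_counters(n, 10000000, 1)` and `run_counters(n, 10000000, 0)` for n from 1 to 24, both in the VM and on a MacBook, would show whether the shared-counter case degrades much faster inside the guest. If only the shared case collapses, cache-line contention on the atomics is a plausible culprit; if both collapse, the cause is more likely the VM's scheduling of virtual CPUs.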

BTW: We cannot install the Obj-C runtime and the other libraries we need to run our software directly on the metal, because of server policy.

VM config excerpt:

  • Memory size: 4000MB
  • Page Fusion: off
  • VRAM size: 12MB
  • HPET: off
  • Chipset: piix3
  • Firmware: BIOS
  • Number of CPUs: 24
  • Synthetic Cpu: off
  • CPUID overrides: None
  • ACPI: on
  • IOAPIC: on
  • PAE: off
  • Time offset: 0 ms
  • RTC: UTC
  • Hardw. virt.ext: on
  • Hardw. virt.ext exclusive: off
  • Nested Paging: on
  • Large Pages: on
  • VT-x VPID: on
  • 3D Acceleration: off
  • 2D Video Acceleration: off
  • Additions run level: 2
  • Configured memory balloon size: 0 MB

ivokabel

Posted 2013-01-09T22:30:54.960

Reputation: 131

Do you implement your own spin loops or locks at all? Because it is very important, for example, that you not loop on CAS operations. (Primarily because it requires the cache line be held exclusively by the core even if the compare fails and no write is done. Doing these things yourself is hard. See the comments on this question.)

– David Schwartz – 2013-01-17T22:53:25.810
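To illustrate the point in the comment above, here is a minimal sketch of a "test and test-and-set" spin lock: waiters spin on a plain read, which lets the cache line stay shared, and attempt the CAS only once the lock looks free. It is not the renderer's code; it assumes GCC/Clang `__sync` builtins, and `ttas_lock`, `ttas_acquire`, `ttas_release` and `ttas_demo` are hypothetical names.

```c
#include <pthread.h>

#define MAX_THREADS 64

typedef struct { volatile int locked; } ttas_lock;

/* Hint to the CPU that we are in a spin-wait loop (PAUSE on x86);
 * elsewhere this is only a compiler barrier. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        /* Spin on an ordinary load: no exclusive ownership is needed. */
        while (l->locked)
            cpu_relax();
        /* The lock looked free, so try the CAS exactly once. */
        if (__sync_bool_compare_and_swap(&l->locked, 0, 1))
            return;
    }
}

void ttas_release(ttas_lock *l)
{
    __sync_lock_release(&l->locked);   /* stores 0 with release semantics */
}

/* Small demo: several threads increment a plain counter under the lock. */
static ttas_lock demo_lock;
static long demo_counter;
static long demo_iterations;

static void *demo_worker(void *unused)
{
    (void)unused;
    for (long i = 0; i < demo_iterations; i++) {
        ttas_acquire(&demo_lock);
        demo_counter++;
        ttas_release(&demo_lock);
    }
    return NULL;
}

/* Returns the final counter value, or -1 on bad arguments or
 * thread-creation failure. */
long ttas_demo(int nthreads, long iterations)
{
    pthread_t threads[MAX_THREADS];
    int started = 0, failed = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    demo_lock.locked = 0;
    demo_counter = 0;
    demo_iterations = iterations;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&threads[i], NULL, demo_worker, NULL) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return failed ? -1 : demo_counter;
}
```

Any spin lock can behave badly in a VM when the virtual CPU holding the lock gets descheduled by the host while the others keep spinning, so a blocking primitive such as a pthread mutex may still be the safer choice for anything but very short critical sections.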

Answers

0

I am spitballing here, but... In the VirtualBox GUI, right-click the Kubuntu instance and choose Settings while it is not running, then check whether the CPU is limited there. You will probably want to see how the system responds to 20 or 22 CPUs as opposed to 24, to reduce resource competition between the guest and the host. Then try running a single instance with 20 threads. I would expect the load on those 20 cores to spike, and the remaining 4 to climb to 100% as well while the host tries to keep up. Do you have other applications running on this machine besides your VM?
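After changing the CPU count, it may be worth confirming what the guest actually sees. A minimal sketch, assuming a Linux/glibc guest (the `_SC_NPROCESSORS_*` names are a common extension rather than standard POSIX, and the function names are made up for this example):

```c
#include <unistd.h>

/* Number of processors currently online in this (guest) system. */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Number of processors configured, whether or not they are online. */
long configured_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}
```

If `online_cpus()` reports fewer CPUs than the VM is configured with, the renderer's thread count could be capped accordingly.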

Termanader

Posted 2013-01-09T22:30:54.960

Reputation: 61