
I've set up several KVM-based networks before and never encountered this issue. I can't for the life of me think what I'd have set up differently previously.

Setup

Basically, I've got an entirely Dell stack:

  • 2x Dell N2024's (stacked gigabit switches)
  • Several Dell R720's for KVM Hypervisors
  • 2x Dell R320's for gateway/firewalls

All machines run CentOS 6.5. The hypervisors are basically a standard install with a few sysctl tweaks.

At the moment, I've got a few test VMs set up, configured similarly to their masters (CentOS 6.x, base install with basic Puppet-driven configuration). All VMs are:

  • Bridged to one of two physically separate networks (i.e. each hypervisor has two Ethernet connections: one for a public/DMZ bridged LAN, the other for a private one)
  • Using virtio for network and block devices (basically the bog-standard result of running the virt-install command), e.g. (example libvirt config):

    <interface type='bridge'>
      <mac address='52:54:00:11:a7:f0'/>
      <source bridge='dmzbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    
  • Given between 2 and 8 vCPUs and between 8 and 64 GB of RAM, with their drives being LVM volumes on the host machine

Some simple file copies within the VM, and dd tests, yield perfectly acceptable results (300 MB/s to 800 MB/s in these small-scale synthetic tests).

Network Performance between Physical Machines

I've left jumbo frame/MTU configuration alone for now, and server-to-server transfers will quite happily max out the gigabit connection, or thereabouts (a flat 100 MB/s to 118 MB/s over several large file tests to/from each machine).

Network Performance between a Physical Machine and VM (and VM to VM)

Rsync/SSH transfer rates fluctuate constantly (unstable), but are always between 24 MB/s and a maximum of about 38 MB/s.

I've performed several other tests:

  • Between a physical machine's IP on one bridge and the VM (on another bridge)
  • Between a physical machine's IP on one bridge and the VM (on the same bridge)
  • Tried starting the VMs using e1000 device drivers instead of virtio

Nothing seems to have worked. Has anyone encountered this much performance degradation before? I've just checked my older network (hosted at another DC), and apart from the fact that it uses a different switch (a much cheaper, old PowerConnect 2824), the VM network performance there seems to be closer to 80-90% of raw network performance (not less than half).

If I can provide any setup/configs or extra information, I'm more than happy to!

Update (14/08/2014)

Tried a few things:

  • Enabled jumbo frames/MTU 9000 on the host bridge, the adapter and the VMs (marginal performance improvement, average above 30 MB/s)
  • Tested GSO, LRO and TSO off/on on the host (no noticeable effect)
  • Tested further sysctl optimisations (tweaking rmem/wmem, with a sustained 1-2% performance increase)
  • Tested the vhost_net driver (small increase in performance)
  • vhost_net driver enabled (as above) with the same sysctl optimisations (at least a 10-20% performance jump on previously)
  • Red Hat's performance optimisation guide mentions that enabling multiqueue could help, though I noticed no difference.
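When toggling virtio, vhost and multiqueue, it can help to confirm what the guest definition actually says. The following is a minimal sketch, assuming the guest XML has been saved from `virsh dumpxml`; the function name `describe_interfaces` and the wrapping `<domain>` element are illustrative, and the interface is the one shown earlier. In libvirt an explicit backend and queue count appear as a `<driver name='vhost' queues='N'/>` child of the interface; a missing `<driver>` element only means nothing was set explicitly, not that vhost is unused.

```python
import xml.etree.ElementTree as ET

# The interface from the question, wrapped in a minimal <domain> so it parses
# as one document. Real input would be the output of `virsh dumpxml <guest>`.
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='52:54:00:11:a7:f0'/>
      <source bridge='dmzbr0'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""


def describe_interfaces(domain_xml):
    """Return one dict per <interface>: bridge, NIC model, backend driver, queues."""
    rows = []
    for iface in ET.fromstring(domain_xml).iter("interface"):
        source = iface.find("source")
        model = iface.find("model")
        driver = iface.find("driver")  # absent unless configured explicitly
        rows.append({
            "bridge": source.get("bridge") if source is not None else None,
            "model": model.get("type") if model is not None else None,
            "driver": driver.get("name") if driver is not None else None,
            "queues": driver.get("queues") if driver is not None else None,
        })
    return rows


for row in describe_interfaces(DOMAIN_XML):
    print(row)
```

Running the same function before and after each change makes it easier to tell which combination a given benchmark result belongs to.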

The host seems to sit at 125% CPU (for the host process). Could this have something to do with assigning too many vCPUs to the guest, or with CPU/NUMA affinity?

However, after all that, I seem to have increased the average sustained rate from 25-30 MB/s to 40-45 MB/s. It's a decent improvement, but I'm sure I can get closer to bare-metal performance (it's still a fair way under half at the moment).
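To put the figures quoted in this post side by side, here is a small arithmetic sketch. The rates are the ones given above; the 125 MB/s line rate is simply 1 Gbit/s divided by 8 and ignores protocol overhead, and the helper name `percent_of` is illustrative.

```python
# Rates quoted in this post, in MB/s (low, high).
BARE_METAL = (100, 118)   # physical machine to physical machine
VM_BEFORE = (25, 30)      # VM transfers before tuning
VM_AFTER = (40, 45)       # VM transfers after tuning
LINE_RATE = 1_000_000_000 / 8 / 1e6   # 1 Gbit/s expressed in MB/s, no overhead


def percent_of(rate, reference):
    """Express `rate` as a percentage of `reference`, rounded to one decimal."""
    return round(100 * rate / reference, 1)


for label, (low, high) in (("before", VM_BEFORE), ("after", VM_AFTER)):
    # Worst case: slowest VM rate against fastest bare-metal rate, and vice versa.
    worst = percent_of(low, BARE_METAL[1])
    best = percent_of(high, BARE_METAL[0])
    print(f"{label}: {worst}% to {best}% of bare metal")
```

Even in the most favourable pairing (45 MB/s against 100 MB/s), the tuned VM rate stays under half of the bare-metal rate.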

Any other ideas?

kwiksand
  • you mentioned jumbo frames, are they set up on the entire network stack and in the VMs? – dyasny Aug 13 '14 at 13:15
  • I specifically haven't enabled them at all as yet, not on the switch or any of the machines – kwiksand Aug 13 '14 at 15:19
  • ok, I'd start with `ethtool -k` and start playing with disabling TSO, LRO, GSO – dyasny Aug 13 '14 at 17:26
  • I've set these to off in a test just now, but it hasn't helped as yet. Any further information on 'playing' with them? Do I alter the bridged ethernet port on the host? The ethernet port? Or just the eth0 adapter inside each guest? – kwiksand Aug 13 '14 at 23:31
  • the ethX adapter under the bridge on the host actually – dyasny Aug 14 '14 at 03:25
  • Yes, definitely tried that. Just out of interest, what kind of network performance have you seen between host and guest? – kwiksand Aug 14 '14 at 12:53
  • on a gigabit link, usually about 60-70 mbytes/sec – dyasny Aug 14 '14 at 13:33
  • That's what I'd have expected too. hmm :( – kwiksand Aug 15 '14 at 09:41
  • check the dell site, there might be newer firmware available for the broadcoms. My stack is almost entirely Dell as well (a bunch of R610 and R620) and I've no problems with performance without any tweaks – dyasny Aug 15 '14 at 14:26
  • Thanks again dyasny, I'll see if there's any updates to those cards – kwiksand Aug 15 '14 at 23:11
  • I'd love to see the tweaks you have applied to get to over 40% in a blogpost or a KB article somewhere, might be useful – dyasny Aug 16 '14 at 01:15
  • Usually, I'd write it up, but I'm still trying to find the issue, all well and good I've got some improvement, but it's still under performing by almost 50% of what I'd expect it to. – kwiksand Aug 17 '14 at 16:52

1 Answer


Your KVM instances should be able to saturate your host's network connection with no issues.

My first recommendation here is to upgrade both the host and guest's kernel. The stock CentOS 6.5 kernel does not have great performance for KVM. I'd suggest kernel-lt from ELRepo (or kernel-ml if you're feeling brave). This should give you a decent boost in performance right off the bat.

Next up, try testing with iperf3 (or even the older iperf). This will give you as close to a pure network connection as possible. Your rsync/ssh tests are not really valid, because they're definitely hitting the disk. rsync especially may not be doing sequential IO like your dd test (try using fio instead).
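If iperf3 is not installed yet, the idea of a memory-to-memory test can be roughed out in a few lines. This is only a sketch of the principle, not a replacement for iperf3: it pushes zero-filled buffers over a TCP socket so that neither disk nor encryption is involved. The function names are illustrative, and the built-in demo runs over loopback purely as a self-check.

```python
import socket
import threading
import time

CHUNK = 1 << 20  # 1 MiB send/receive buffer


def receive_all(server_sock, result):
    """Accept one connection, discard everything it sends, record the byte count."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # sender closed the connection
                break
            total += len(data)
    result["bytes"] = total


def send_for(host, port, seconds):
    """Send zero-filled buffers from memory for roughly `seconds`; return bytes sent."""
    payload = bytes(CHUNK)
    sent = 0
    deadline = time.monotonic() + seconds
    with socket.create_connection((host, port)) as sock:
        while time.monotonic() < deadline:
            sock.sendall(payload)
            sent += len(payload)
    return sent


def loopback_throughput(seconds=1.0):
    """Run receiver and sender in one process over 127.0.0.1; return MB/s received."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    receiver = threading.Thread(target=receive_all, args=(server, result))
    receiver.start()
    start = time.monotonic()
    send_for("127.0.0.1", port, seconds)
    receiver.join()
    elapsed = time.monotonic() - start
    server.close()
    return result["bytes"] / elapsed / 1e6


if __name__ == "__main__":
    print(f"{loopback_throughput():.0f} MB/s over loopback")
```

For a real measurement the receiving half would run inside the VM and `send_for` on the physical machine, pointed at the VM's address. If that figure is close to line rate while rsync over SSH is not, the network path itself is unlikely to be the bottleneck.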

The interesting thing here is that VM to VM traffic will not actually hit the network controller. This is going to be done purely on the host, so the rest of your network (and the various offload settings) don't really have any meaning here.

One other thing to check: has your server throttled down the CPUs? We've had a number of Dell machines think they were idle and start running the CPU significantly slower than they should have been. The power-saving features do not always recognize server workloads well.
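One way to spot throttling is to compare each core's current frequency with its maximum. The sketch below assumes the kernel exposes the Linux cpufreq files under `/sys/devices/system/cpu`; when power management is handled entirely by the BIOS, or inside a VM, those files may be missing and the function simply returns nothing. The function name `cpu_frequencies` is illustrative.

```python
from pathlib import Path


def cpu_frequencies(base="/sys/devices/system/cpu"):
    """Return {cpu: (governor, current_khz, max_khz)} for CPUs exposing cpufreq."""
    found = {}
    for cpufreq_dir in sorted(Path(base).glob("cpu[0-9]*/cpufreq")):
        try:
            governor = (cpufreq_dir / "scaling_governor").read_text().strip()
            current = int((cpufreq_dir / "scaling_cur_freq").read_text())
            maximum = int((cpufreq_dir / "cpuinfo_max_freq").read_text())
        except (OSError, ValueError):
            continue  # file missing or unreadable for this CPU: skip it
        found[cpufreq_dir.parent.name] = (governor, current, maximum)
    return found


if __name__ == "__main__":
    for cpu, (governor, current, maximum) in cpu_frequencies().items():
        print(f"{cpu}: {governor}, {current / 1000:.0f} of {maximum / 1000:.0f} MHz")
```

Sampling this while a transfer is running is more telling than sampling an idle host, since a core that stays well below its maximum under load points towards the power-saving settings.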

You definitely want virtio here, don't even waste your time testing any of the emulated options.

You didn't mention it, but if your server has the i350 based NICs, you can look into SR-IOV (assuming you only want <= 7 VMs per machine). This gives the VM direct access to the physical NIC (at the cost of loss of functionality, such as no nwfilter support), and will be more efficient. You do not need this to get full gigabit speeds though.

devicenull
  • Thanks devicenull, very helpful! Testing with iperf does seem to immediately yield gigabit speeds, so it may have been a bit stupid of me to think I'd get the same performance out of rsync/scp. – kwiksand Aug 24 '14 at 08:09
  • As another test, I tried changing the cipher of the ssh connection (arcfour) and the speed increased (up to about 65MB/s). In this case it's sounding a lot like CPU throttling, as you said, or at least that it's CPU bound. Weird that the older Dell servers don't seem to have the same issue, yet run the same software/kernel versions. Thanks again, will try your other suggestions now. – kwiksand Aug 24 '14 at 08:15