
I am testing the network performance of two workstations, each with a 2.4GHz quad-core Xeon processor and an NC550SFP PCIe Dual Port 10GbE Server Adapter, connected back to back.

I've checked the bandwidth of the RAM, which is ~12Gbps, so no bottleneck here. The PCIe bus speed is also ok.

I am testing maximum pps with minimum-size UDP packets, and the results are miserable compared to those in 2012-lpc-networking-qdisc-fastabend.pdf (sorry, I can only post one link; the full URL is in the comments). If I increase the packet size and the MTU, I can get near line speed (~9.9Gbps).

I'm using pktgen with the NST scripts and macvlan interfaces to get multiple threads, but I only get ~1Mpps with all four cores at 100%.
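For reference, here is a minimal sketch of what a single pktgen thread's configuration looks like via /proc/net/pktgen (the interface name, destination IP and MAC are placeholders; repeat the kpktgend_0 binding for each CPU/macvlan you want to drive):

# bind a device to pktgen's kernel thread for CPU0 (repeat for kpktgend_1..3)
modprobe pktgen
echo "rem_device_all"  > /proc/net/pktgen/kpktgend_0
echo "add_device eth3" > /proc/net/pktgen/kpktgend_0

# minimum-size UDP packets, run until stopped, reuse skbs to cut allocation cost
echo "count 0"           > /proc/net/pktgen/eth3
echo "pkt_size 60"       > /proc/net/pktgen/eth3
echo "delay 0"           > /proc/net/pktgen/eth3
echo "clone_skb 1000000" > /proc/net/pktgen/eth3
echo "dst 192.168.100.2" > /proc/net/pktgen/eth3
echo "dst_mac 00:00:00:00:00:01" > /proc/net/pktgen/eth3
# pick the TX queue based on the CPU the pktgen thread runs on
echo "flag QUEUE_MAP_CPU" > /proc/net/pktgen/eth3

echo "start" > /proc/net/pktgen/pgctrl

(QUEUE_MAP_CPU can only spread load across TX queues if the driver exposes more than one, which turns out to be the problem described below.)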

While trying to improve the TX performance of pktgen, I stumbled across this document: Scaling in the Linux Networking Stack

I have checked and yes, I have mq qdiscs, which should yield the highest performance:

# ip link list | grep eth3
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
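(In case it helps anyone reproducing this: tc shows the same thing from the qdisc side, since mq attaches one child qdisc per TX queue.)

# list the root mq qdisc and its per-TX-queue children
tc -s qdisc show dev eth3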

I think the problem lies in the fact that only one TX queue is used:

# dmesg | grep be2net
[    4.528058] be2net 0000:01:00.1: irq 47 for MSI/MSI-X
[    4.528066] be2net 0000:01:00.1: irq 48 for MSI/MSI-X
[    4.528073] be2net 0000:01:00.1: irq 49 for MSI/MSI-X
[    4.528079] be2net 0000:01:00.1: irq 50 for MSI/MSI-X
[    4.528104] be2net 0000:01:00.1: enabled 4 MSI-x vector(s)
[    4.696026] be2net 0000:01:00.1: created 4 RSS queue(s) and 1 default RX queue
[    4.761108] be2net 0000:01:00.1: created 1 TX queue(s)
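The same count is visible in sysfs, and a recent ethtool can query (and, if the driver supports the channels API, change) the hardware channel configuration:

# number of TX/RX queues the kernel has instantiated for the interface
ls -d /sys/class/net/eth3/queues/tx-* | wc -l
ls -d /sys/class/net/eth3/queues/rx-* | wc -l

# hardware channel configuration; the -L form only works if the driver
# implements set_channels
ethtool -l eth3
# ethtool -L eth3 combined 4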

I've gotten a hint on how to enable multiple TX queues from Scaling in the Linux Networking Stack:

The driver for a multi-queue capable NIC typically provides a kernel module parameter for specifying the number of hardware queues to configure. In the bnx2x driver, for instance, this parameter is called num_queues. A typical RSS configuration would be to have one receive queue for each CPU if the device supports enough queues, or otherwise at least one for each memory domain, where a memory domain is a set of CPUs that share a particular memory level (L1, L2, NUMA node, etc.).
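For what it's worth, the generic way to check whether a driver declares such a parameter is modinfo/sysfs (bnx2x shown only because the document names it):

# list whatever module parameters be2net actually declares
modinfo -p be2net
ls /sys/module/be2net/parameters/ 2>/dev/null

# bnx2x, for comparison, declares its num_queues parameter this way
modinfo -p bnx2x | grep -i queue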

I've looked all over the be2net driver documentation from Emulex, even sent them an email, with no luck. I've also skimmed the kernel source.

I've got the latest kernel version (3.10) on Ubuntu 12.04 with the latest firmware on the NICs.

Ideas anyone?

Thanks!

  • Put the link to the PDF in a comment or in the body of your question and someone will edit it in for you. – longneck Jul 04 '13 at 23:34
  • Here's the link: http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-networking-qdisc-fastabend.pdf – mrg2k8 Jul 05 '13 at 00:26

1 Answer


I had a similar (?) challenge on a Red Hat Enterprise Linux box. I read the same paper and concluded that my real problem was the default behavior of using every possible IRQ to get every CPU involved in network packet work. I focused the IRQ activity onto a subset of the available cores and then steered the work accordingly. Here's the rc.local file:

# Reserve CPU0 as the default IRQ handler.
# smp_affinity takes a hex CPU bitmask: 1 = CPU0, 2 = CPU1, 4 = CPU2, 8 = CPU3.
# Pin the IRQs of eth0-eth2 to CPU1.
for IRQ in `grep eth0 /proc/interrupts | cut -d ':' -f 1`; do echo 2 > /proc/irq/$IRQ/smp_affinity; done
for IRQ in `grep eth1 /proc/interrupts | cut -d ':' -f 1`; do echo 2 > /proc/irq/$IRQ/smp_affinity; done
for IRQ in `grep eth2 /proc/interrupts | cut -d ':' -f 1`; do echo 2 > /proc/irq/$IRQ/smp_affinity; done
# Alternate eth4's IRQs between CPU2 (mask 4) and CPU3 (mask 8),
# depending on whether the IRQ number is odd or even.
for IRQ in `grep eth4 /proc/interrupts | cut -d ':' -f 1`; do echo $(( (($IRQ & 1) + 1) << 2 )) > /proc/irq/$IRQ/smp_affinity; done
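A quick way to confirm the masks actually stuck (same eth4 naming as above):

# show which CPUs have been servicing the NIC interrupts, and the current masks
grep eth4 /proc/interrupts
for IRQ in `grep eth4 /proc/interrupts | cut -d ':' -f 1`; do
    echo -n "IRQ $IRQ -> "; cat /proc/irq/$IRQ/smp_affinity
done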

Here's the cgrules.conf entry that steers my Apache web server away from the cores doing the 10GbE work, so that serious network throughput can happen as it's supposed to:

apache      cpuset,cpu  apache/

And here's the cgconfig.conf file that actually separates the web server from the rest of the CPU activity:

mount {
    cpuset  = /cgroup/cpuset;
    cpu = /cgroup/cpu;
    cpuacct = /cgroup/cpuacct;
    memory  = /cgroup/memory;
    devices = /cgroup/devices;
    freezer = /cgroup/freezer;
    net_cls = /cgroup/net_cls;
    blkio   = /cgroup/blkio;
}

group apache {
    cpuset {
        cpuset.memory_spread_slab="0";
        cpuset.memory_spread_page="0";
        cpuset.memory_migrate="0";
        cpuset.sched_relax_domain_level="-1";
        cpuset.sched_load_balance="1";
        cpuset.mem_hardwall="0";
        cpuset.mem_exclusive="0";
        cpuset.cpu_exclusive="0";
        cpuset.mems="1";
        cpuset.cpus="4-7,12-15";
    }

    cpu {
        cpu.rt_period_us="1000000";
        cpu.rt_runtime_us="0";
        cpu.cfs_period_us="100000";
        cpu.cfs_quota_us="-1";
        cpu.shares="1024";
    }
}
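For completeness, loading a configuration like this with the stock libcgroup tools looks roughly like the following (RHEL 6 style service names; the httpd process name is an assumption, adjust for your setup):

# parse the config and start the rules daemon so new apache processes are
# classified automatically according to cgrules.conf
cgconfigparser -l /etc/cgconfig.conf
service cgred restart

# move an already-running apache into the group by hand
cgclassify -g cpuset,cpu:apache $(pidof httpd)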

With the default configuration (without the IRQ and cgroup hacks) I measured about 5Gb/s of network throughput. With the IRQs concentrated and random network I/O moved away, I measured near-wirespeed performance (9.5Gb/s) with netperf.

N.B.: jumbo frames made no difference to either the before or the after numbers.