Analyzing system throughput with Intel PMU


I trust this is an appropriate place for this question. It's not programming related, or I might have asked on Stack Overflow instead. Nevertheless, here's the question. I'm doing some benchmarking of network throughput. I have two 40GbE NICs currently connected back-to-back to verify bandwidth (for this, I'm using iperf3).

My test systems are dual Xeon E5-2667 machines (from /proc/cpuinfo: model name : Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz), exposing 24 logical processors. There are two NUMA nodes, with half the processors attached to each, and 32 GB of non-ECC DDR3 RAM. In each system, the 40GbE NIC sits in a PCIe Gen 3 x8 slot associated with NUMA node 0.
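(For what it's worth, the node-0 association can be confirmed with something like the following; <iface> is a placeholder for whatever the 40GbE port enumerates as on this box:)

$ numactl --hardware                              # shows which CPUs belong to node 0 and node 1
$ cat /sys/class/net/<iface>/device/numa_node     # reports 0 for the 40GbE port here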

When I run iperf3 with a minimum of options (essentially the defaults, which is sufficient for my purposes), I'm unable to see bandwidth for the TCP test rise above 21.x Gbps (OK, occasionally it edges above that, but it's usually 21.x). However, if I use iperf3's -A <n> option, which pins iperf3 to processor <n>, I see ~36 Gbps, which is much more like what I'd expect. I'd like to understand why.
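For reference, the two invocations I'm comparing look roughly like this (the server address and the choice of CPU 0 match the perf example further down):

$ iperf3 -s                         # on the receiving machine (192.168.0.244)
$ iperf3 -c 192.168.0.244           # defaults: tops out around 21.x Gbps
$ iperf3 -c 192.168.0.244 -A 0      # pinned with -A: ~36 Gbps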

I've also experimented with numactl to bind memory allocation, the processor node, and the physical processor for iperf3. Oddly, I cannot achieve the same throughput using numactl as I get using iperf3's -A option. Fiddling with various options (see the numactl man page for more), chiefly --physcpubind=<n>, --cpunodebind=<n>, and --membind=<n>, I cannot get more than ~31 Gbps in this test; examples follow below. I'd like to understand why that is, too.
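The sorts of bindings I've tried look roughly like this (node/CPU 0 chosen because that's where the NIC is attached; I varied the exact numbers while fiddling):

$ numactl --cpunodebind=0 --membind=0 iperf3 -c 192.168.0.244
$ numactl --physcpubind=0 --membind=0 iperf3 -c 192.168.0.244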

To that end, I've begun using the perf tool that Linux provides. However, I'm not finding much in the mountain of documentation available from either Linux or Intel about the PMU events themselves. It explains how to run the tool, but little is said about what the events actually mean. As an example, bus-cycles appears under both "Hardware Events" and "Kernel PMU Events." What's the difference? perf list enumerates the events that can be monitored, and it's lengthy. Documentation I've found from Intel for the Xeon E5-2667 (which I believe is actually a Sandy Bridge-EP part) shows that various NUMA-related counters are supported, and perf list shows uncore_imc_0/cas_count_read/ and uncore_qpi_0/drs_data/ (among many others), which should be relevant here. Yet when I run iperf3 under perf attempting to monitor these, I'm told they aren't supported. For example:

$ perf stat -e uncore_qpi_0/drs_data/ -- iperf3 -c 192.168.0.244 -A 0
.... program output ....
Performance counter stats for 'iperf3 -c 192.168.0.244 -A 0':

  <not supported>      uncore_qpi_0/drs_data/   

However, the docs show it should be supported. Is there some way to find out what my processor supports without running the program only to find out afterward that a counter isn't available? Would anyone have suggestions for metrics that are important in understanding the disparity? (Tagged with RHEL because that is the target platform for the solution.)
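(In case it's relevant: as far as I can tell, the "Kernel PMU Events" that perf list reports, including the uncore ones, come from sysfs, so something like the following does show the uncore PMUs present on this box, which only adds to my confusion about the <not supported> result:)

$ ls /sys/bus/event_source/devices/ | grep uncore
$ ls /sys/bus/event_source/devices/uncore_qpi_0/events/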

Andrew Falanga

Posted 2017-04-21T20:08:14.563

Reputation: 131

No answers