5

Here's what I have done so far:

Increasing the Rx/Tx buffers gives the biggest boost over the defaults. I set RSS Queues to 4 on each adapter, and set the starting RSS CPU on the second port to something other than 0 (it's 16 on the PC that I use, with 16 cores and 32 hyperthreads).
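
For reference, a quick read-only way to confirm what the driver actually applied (rather than what the GUI shows) is to dump the RSS keywords from the NDIS class key. A minimal Python sketch, assuming the Intel driver exposes the standardized `*NumRssQueues` / `*RssBaseProcNumber` / `*MaxRssProcessors` keywords (drivers are free to use other names, so treat missing values as "not exposed", not "disabled"); reading works best from an elevated prompt:

```python
# Read-only sketch: dump per-adapter RSS keywords from the NDIS class key.
import winreg

CLASS_KEY = r"SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}"
KEYWORDS = ("*RSS", "*NumRssQueues", "*RssBaseProcNumber", "*MaxRssProcessors")

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CLASS_KEY) as cls:
    i = 0
    while True:
        try:
            sub = winreg.EnumKey(cls, i)   # adapter instances: 0000, 0001, ...
        except OSError:
            break
        i += 1
        try:
            with winreg.OpenKey(cls, sub) as adapter:
                name, _ = winreg.QueryValueEx(adapter, "DriverDesc")
                values = {}
                for kw in KEYWORDS:
                    try:
                        values[kw], _ = winreg.QueryValueEx(adapter, kw)
                    except OSError:
                        pass               # keyword not exposed by this driver
                if values:
                    print(name, values)
        except OSError:
            continue                       # e.g. the protected "Properties" subkey
```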

From watching Process Explorer, I am limited by the CPU's ability to handle the large number of incoming interrupts, even with RSS enabled. I am using a PCIe x8 (electrical) slot in 2.x mode; each of the two adapters connects over a 5 GT/sec x8 bus.

OS responsiveness does not matter; I/O throughput does. I am limited by the clients' inability to process jumbo frames.

What settings should I try next?

Details: dual Xeon E5-2665, 32 GB RAM, eight SSDs in RAID 0 (a RAM drive is used for NIC perf validation), 1 TB of data to be moved via IIS/FTP from 400 clients, ASAP.

In response to comments:

Actual read throughput is 650 MB/sec over a teamed pair of 10 Gb/sec links, into a RAM drive.
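
For perspective, a rough calculation of how much of the teamed line rate that represents (ignoring TCP/IP framing overhead, so the real utilization is slightly higher):

```python
# Back-of-the-envelope check: how far is 650 MB/s from the teamed line rate?
# 10 Gb/s is roughly 1250 MB/s of raw line rate per link.
line_rate_mb_s = 2 * 10_000 / 8        # two 10 Gb/s links -> ~2500 MB/s
measured_mb_s = 650
print(f"utilization: {measured_mb_s / line_rate_mb_s:.0%}")   # ~26% of the team
```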

Antivirus and firewall are off, AFAICT. (I have fairly good control over what's installed on the PC, in this case. How can I be sure that no filters are reducing performance? I will have to follow up, good point.)

In Process Explorer, I see spells of time where the CPU keeps going (red, kernel time), but network and disk I/O are stopped.

Max RSS processors is at its default value, 16.

Message-signaled interrupts are supported on both instances of the X520-DA2 device, with MessageNumberLimit set to 18. Here's what I see on my lowly desktop card:

[Screenshot: a way to check MSI support in Device Manager]
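
The same information can also be read straight from the registry. A minimal read-only sketch that walks the PCI branch of the Enum tree (run it elevated if values come back missing; the key and value names follow the MSDN article on enabling message-signaled interrupts linked in the comments):

```python
# Sketch: report MSISupported / MessageNumberLimit for every PCI device
# instance that has MSI properties in the registry.
import winreg

ENUM_PCI = r"SYSTEM\CurrentControlSet\Enum\PCI"
MSI_SUBKEY = r"Device Parameters\Interrupt Management\MessageSignaledInterruptProperties"

def subkeys(key):
    """Yield the names of all subkeys of an open registry key."""
    i = 0
    while True:
        try:
            yield winreg.EnumKey(key, i)
        except OSError:
            return
        i += 1

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, ENUM_PCI) as pci:
    for dev in subkeys(pci):
        with winreg.OpenKey(pci, dev) as devkey:
            for inst in subkeys(devkey):
                try:
                    with winreg.OpenKey(devkey, inst) as instkey:
                        desc, _ = winreg.QueryValueEx(instkey, "DeviceDesc")
                        with winreg.OpenKey(instkey, MSI_SUBKEY) as msi:
                            supported, _ = winreg.QueryValueEx(msi, "MSISupported")
                            try:
                                limit, _ = winreg.QueryValueEx(msi, "MessageNumberLimit")
                            except OSError:
                                limit = None
                            print(f"{desc}: MSISupported={supported}, "
                                  f"MessageNumberLimit={limit}")
                except OSError:
                    continue  # no MSI properties (or no access) for this instance
```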

[Screenshot: Process Explorer summary]


GregC
  • This is a dual-CPU rack-mount Dell Precision Workstation R7610. The screen capture is there to accentuate the fact that most modern (even simple) PCIe devices support the message-signaled interrupt scheme. – GregC Sep 04 '13 at 21:02
  • How are you generating the traffic? Don't use SMB; it has extra processing overhead and doesn't saturate your link properly. Use something like iperf. – hookenz Sep 04 '13 at 21:47
  • Actually, one of the big issues with 10GBit ethernet and beyond is that the CPU and packet-processing overhead, and the large number of interrupts generated by the TCP stack, slow things down dramatically. I've found the same issues with Infiniband. The PC can't keep up. To get the best performance you actually need to start looking into protocols that utilize RDMA. I think MS may have made some recent additions to the SMB protocol to support that. RDMA gives the NIC the ability to move data from RAM to RAM across the network without involving the CPU. – hookenz Sep 04 '13 at 21:50
  • Here it is, SMB Direct. http://technet.microsoft.com/en-us/library/jj134210.aspx It might mean upgrading to Server 2012. I'm going to add this as an answer. – hookenz Sep 04 '13 at 21:51
  • By the way, have you tried them "un-bonded"? Do you get the same result? – hookenz Sep 04 '13 at 22:06
  • I have seen the cards hit a performance wall with a large number of clients, prompting me to use link aggregation. In retrospect, that was a questionable call. I need to try this again. Great suggestion, thanks. – GregC Sep 05 '13 at 14:39
  • In the same spirit, I plan to try two physical cards, one per CPU socket (or per NUMA node), and see if it makes any difference. – GregC Sep 05 '13 at 14:39
  • You are already at 650 MB/s. What is your memory-throughput limit if you just transfer to RAM-disk (without network and write-caching effects)? – Nils Sep 06 '13 at 13:42

3 Answers

4

One of the problems with high-performance NICs is that the modern PC architecture has a bit of trouble keeping up. But in your case, this isn't so much the problem. Let me explain.

The CPU has to do a lot of work processing TCP packets, and this affects throughput. What's limiting things in your case is not the network hardware, but the server's inability to keep the network links saturated.

In more recent times, we've seen processing such as checksum offload move from the CPU to the NIC. Intel have also added features to help reduce the load further. That's cool, and I'm sure all the optimizing features are turned on.

As you've alluded to, jumbo frames help throughput somewhat, but not as much as RDMA.

Most 10GBit ethernet hardware has a very nice, underutilized feature called RDMA, or remote direct memory access. It allows the NIC to do memory-to-memory copies over the network without the intervention of the CPU. Well, OK, the CPU tells the NIC what to do and then the NIC does the rest. The trouble is, it's not used much yet, but it's getting there. Apparently, the most recent version, Microsoft Windows Server 2012, has something called SMB Direct. It uses RDMA. So, if you want to increase throughput, you want to use that.

Are you able to put together some test hardware and install it on there to see how it performs?

By the way, I'm not sure if you will see it at 10Gbit so much, but fast RAM helps with RDMA especially with 56Gbit Infiniband. In general, it's best to use the fastest RAM your server supports.

Also note this comment on the SMB Direct link I put above:

You should not team RDMA-capable network adapters if you intend to use the RDMA capability of the network adapters. When teamed, the network adapters will not support RDMA.


Update: It looks like not ALL 10GBit NICs support RDMA for some reason, so check your model's features first.
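
On Server 2012 or later (the cmdlets below do not exist on 2008 R2), the OS can also be asked directly whether the driver exposes RDMA and whether SMB sees the interfaces as RDMA-capable. A trivial wrapper, assuming PowerShell is on the PATH:

```python
# Assumes Windows Server 2012+ where the NetAdapter/SMB cmdlets exist;
# simply shells out to PowerShell and prints whatever it reports.
import subprocess

for cmd in ("Get-NetAdapterRdma", "Get-SmbServerNetworkInterface"):
    print(f"--- {cmd} ---")
    subprocess.run(["powershell", "-NoProfile", "-Command", cmd], check=False)
```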

Another thought I had: the type of protocol being used for your testing may be affecting the results, i.e. protocol overhead on top of TCP overhead. I suggest you look into using something that can test without touching the hard drive, such as iperf. There is a Windows port of it somewhere.
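
If iperf isn't handy, a throw-away memory-to-memory test along the same lines is easy to put together. The sketch below is a rough stand-in, not a replacement for iperf; the port (5201) and the 1 MiB block size are arbitrary choices. Run it with `recv` on the receiving box and with the receiver's address on the sending box.

```python
# Minimal memory-to-memory throughput test so disks and IIS/FTP protocol
# overhead are out of the picture. One side receives, the other blasts zeros.
import socket, sys, time

PORT = 5201                 # arbitrary free port
BLOCK = 1024 * 1024         # 1 MiB per send/recv call

def receiver():
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, addr = srv.accept()
    total, start = 0, time.time()
    while True:
        data = conn.recv(BLOCK)
        if not data:
            break
        total += len(data)
    secs = time.time() - start
    print(f"received {total / 1e6:.0f} MB in {secs:.1f} s "
          f"= {total / secs / 1e6:.0f} MB/s from {addr[0]}")

def sender(host, seconds=10):
    payload = b"\0" * BLOCK
    s = socket.create_connection((host, PORT))
    end = time.time() + seconds
    while time.time() < end:
        s.sendall(payload)
    s.close()

if __name__ == "__main__":
    if sys.argv[1] == "recv":
        receiver()
    else:
        sender(sys.argv[2])
```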

hookenz
  • Why is it that Intel's older NICs support RDMA, whereas newer ones say N/A? http://www.intel.com/content/www/us/en/network-adapters/converged-network-adapters/server-selection-guide.html – GregC Sep 04 '13 at 22:23
  • I'm really not sure, Greg. Is the server cluster adapter line older? My guess is it's marketing and money oriented. – hookenz Sep 04 '13 at 22:47
  • Oh, it may be because RDMA is relatively new to the 10Gbit ethernet space. It's an idea that's been pulled from InfiniBand. Intel acquired QLogic's InfiniBand technology only late last year, which may explain why the RDMA feature is not in all its products yet. – hookenz Sep 04 '13 at 22:54
  • On the other hand, Mellanox (who are the main InfiniBand supplier) now have 10Gbit ethernet options for their ConnectX line. All of them support RDMA. Are your 10Gbit NICs onboard or separate cards? Would you consider swapping/trying a Mellanox card? – hookenz Sep 04 '13 at 22:57
  • We've been using Mellanox cards successfully. Thank you for the suggestion. – GregC Jul 01 '15 at 08:01
1

I think this question: Why does my gigabit bond not deliver at least 150 MB/s throughput? is related to your problem. I was talking about a Dell PowerEdge 6950 there. The answer was basically "use jumbo frames" to reduce the interrupts. I can imagine that tuning the offload engine of the network card might help in your case, but I do not know how to do that on W2K8R2.

Idea: raise the number of buffers on the network card and raise the interrupt trigger for packets in the buffer, so that each interrupt handles more packets (i.e. passes them to the OS IP stack).

See this link: Setting coalescence parameters with ethtool for 10 Gb. This is basically what I am referring to.
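
To put rough numbers on why batching packets per interrupt matters at 10 Gb/s (the ~38 bytes of preamble, CRC and inter-frame gap per frame are an assumption baked into the arithmetic, and 64 frames per interrupt is just an illustrative coalescing setting):

```python
# Rough numbers behind the "fewer, fatter interrupts" idea: interrupt rate
# at 10 Gb/s for standard vs jumbo frames, with and without coalescing.
line_rate_bps = 10e9
for frame_bytes in (1538, 9038):          # 1500/9000 MTU + Ethernet overhead
    frames_per_s = line_rate_bps / (frame_bytes * 8)
    for frames_per_irq in (1, 64):        # no coalescing vs. 64 frames/interrupt
        print(f"{frame_bytes} B frames, {frames_per_irq:>2} frames/IRQ: "
              f"{frames_per_s / frames_per_irq:,.0f} interrupts/s per port")
```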

Nils
  • Are you talking about increasing MessageNumberLimit value in the registry? http://msdn.microsoft.com/en-us/library/windows/hardware/ff544246%28v=vs.85%29.aspx – GregC Sep 04 '13 at 21:24
  • @GregC No. I am talking about the IP packets that can be buffered onboard the NIC and about when to trigger an interrupt so the NIC offloads the buffered packets to the operating system. – Nils Sep 05 '13 at 14:28
  • I have already mentioned that I maxed out the number of Rx/Tx buffers as a first step. The card severely underperforms if that's not done. – GregC Sep 05 '13 at 14:37
  • Of course, that will improve throughput but slightly increase latency. – hookenz Sep 05 '13 at 21:31
  • @GregC I updated my answer with a reference to **coalescence** parameters. I do not know how to tune this on W2K8R2. – Nils Sep 06 '13 at 13:40
0

Your CPU utilization screenshot shows two potential bottlenecks:

  1. 4 cores maxing out doing kernel work (i.e. probably interrupt handlers processing packets)
  2. 1 core maxing out, mainly in user mode

To address the former:

  • Try changing the interrupt moderation settings; depending on your driver it may be more than just on/off, and you may be able to choose a moderation strategy
  • Try disabling/enabling all the offload features. In your case disabling might be beneficial, as it moves a potential bottleneck from the (single-core) NIC that the work would be offloaded to back onto your (multi-core) processors
  • Try enabling "Receive Coalescing" (when receiving TCP), and the various "Large Receive ...", "Large Transmit ..." etc features your driver might provide
  • Can't you set your RSS queues to a value higher than 4? It seems only one of your two ports is being used. As you said you are aware of that, I assume you set your second port to at least 4 as well (or 8; not sure whether HTs have to be counted)
  • If possible, increase the number of different TCP/UDP ports used, or IP source/target addresses, because one address/port/protocol 5-tuple (or address/protocol 3-tuple for non-TCP/UDP traffic) will always go to the same core no matter what your RSS settings are (see the sketch after this list)
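
A toy illustration of that last point: the NIC hashes the 5-tuple with a fixed Toeplitz key and picks a queue from a few low-order bits of the result, so a single flow can never spread across cores. The key below is the well-known example key from Microsoft's RSS documentation; real hardware indexes an indirection table rather than taking a plain modulo, and the queue count of 4 is assumed to match the settings above.

```python
# Why one TCP flow always lands on the same RSS queue: the hash of
# (src ip, dst ip, src port, dst port) is fixed for the life of the flow.
import socket, struct

KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa")

def toeplitz(data: bytes, key: bytes = KEY) -> int:
    # For every set bit in the input, XOR in the 32 key bits that start
    # at that bit position (the standard Toeplitz formulation).
    key_bits = int.from_bytes(key, "big")
    key_len = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_len - 32 - (i * 8 + b)
                result ^= (key_bits >> shift) & 0xFFFFFFFF
    return result

def rss_queue(src_ip, dst_ip, src_port, dst_port, queues=4):
    data = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack(">HH", src_port, dst_port))
    return toeplitz(data) % queues     # real NICs use an indirection table

# Same 5-tuple -> same queue every time; only varying ports/IPs spreads load.
for port in (50000, 50001, 50002, 50003):
    print(port, "->", rss_queue("192.168.1.10", "192.168.1.20", port, 21))
```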

As to the latter (not knowing what app you are actually using):

If that one core maxing out in user mode indicates your single-threaded (or single-thread-bottlenecked) app, it should be

  • fixed, or
  • reconfigured (e.g. increase # worker threads if possible), or
  • redesigned

to use multiple cores, which might or might not be trivial.

Also, since your app (if it is indeed your app) apparently runs on NUMA node #1 while packet handling by the kernel is done on NUMA node #0:

  • try affinitizing the app to NUMA node #0

E.g. by right-clicking the process in Task Manager, which will give you the option to change that, at least in Win2012R2. I tried it, and for me it did not help, but it's worth a try, as it might improve the cache hit rate.
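
If clicking through Task Manager on every run gets old, the same affinity change can be scripted. A minimal ctypes sketch for the current process, assuming at most 64 logical processors (a single Windows processor group, which holds for the 32-thread box described in the question):

```python
# Sketch: pin the current process to the processors of NUMA node 0, the node
# that is handling the NIC interrupts in the screenshot.
import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.GetCurrentProcess.restype = wintypes.HANDLE
kernel32.GetNumaNodeProcessorMask.argtypes = [ctypes.c_ubyte,
                                              ctypes.POINTER(ctypes.c_ulonglong)]
kernel32.GetNumaNodeProcessorMask.restype = wintypes.BOOL
kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
kernel32.SetProcessAffinityMask.restype = wintypes.BOOL

NODE = 0
mask = ctypes.c_ulonglong()
if not kernel32.GetNumaNodeProcessorMask(NODE, ctypes.byref(mask)):
    raise ctypes.WinError(ctypes.get_last_error())
if not kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), mask.value):
    raise ctypes.WinError(ctypes.get_last_error())
print(f"pinned to NUMA node {NODE}, affinity mask {mask.value:#x}")
```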

Btw, is the machine in question sending? Receiving? Both? In terms of configuring your system for performance, sending and receiving are almost completely unrelated, although my suggestions above cover both.

Evgeniy Berezovsky