18

Does anyone have some data or basic calculations that can answer when frame coalescing (NAPI) is required and when a single interrupt per frame is sufficient?

My hardware: IBM BladeServer HS22, Broadcom 5709 Gigabit NIC hardware (MSI-X), with dual Xeon E5530 quad-core processors. Main purpose is Squid proxy server. Switch is a nice Cisco 6500 series.

Our basic problem is that during peak times (100 Mbps of traffic, only 10,000 pps), latency and packet loss increase. I have done a lot of tuning and upgraded the kernel to 2.6.38, which has improved the packet loss, but latency is still poor. Pings are sporadic, jumping to 200ms even on the local Gbps LAN. Squid's average response time jumps from 30ms to 500+ms even though CPU and memory load are fine.

Interrupts climb to about 15,000/second during the peak. Ksoftirqd isn't using much CPU; I have installed irqbalance to spread the IRQs (8 each for eth0 and eth1) across all the cores, but that hasn't helped much.

Intel NICs never seem to have these kinds of problems, but given the blade system and the fixed-configuration hardware, we are pretty much stuck with the Broadcoms.

Everything is pointing at the NIC as the main culprit. The best idea I have right now is to try to decrease the interrupt rate while keeping both latency low and throughput high.

The bnx2 driver unfortunately doesn't support adaptive-rx or adaptive-tx.

The NAPI vs Adaptive Interrupts thread answer provides a great overview of interrupt moderation, but no concrete information on how to calculate optimal ethtool coalesce settings for a given workload. Is there a better approach than just trial and error?
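
For reference, here is the sort of back-of-envelope reasoning I'd like someone to confirm or correct, plus the kind of command I've been experimenting with. The numbers are purely illustrative and I don't know which of them the bnx2 actually honours:

At 10,000 pps a packet arrives roughly every 100 µs on average, so one interrupt per packet is about 10,000 interrupts/second per port. To cap that at, say, 2,000/second, rx-usecs would need to be around 500 (1,000,000 µs / 2,000), at the price of up to ~0.5 ms of extra receive latency; rx-frames then just acts as a burst limit so the ring doesn't fill up before the timer fires.

ethtool -C eth0 rx-usecs 500 rx-frames 64
ethtool -c eth0    # check which values the driver actually accepted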

Does the above-mentioned workload and hardware configuration even need NAPI? Or should it be able to live with a single interrupt per packet?

Wim Kerkhoff
  • Must be a tough question... Thanks for the bounty, @Holocryptic! I have tried some "ethtool -c " settings for coalescing but no remarkable differences yet. – Wim Kerkhoff Apr 23 '11 at 22:18
  • No problem. I just saw it kinda lingering there for a couple days and it seemed like a good question. Hopefully someone has something for you. – Holocryptic Apr 26 '11 at 15:50
  • Another update... we have moved to IBM HS23 blades with Emulex 10 Gbps NICs. This week we hit over 800,000 packets/second, no drops. We had to do a lot of tuning (patching Linux kernel drivers) to get the IRQs load balanced but it's working fantastically now. – Wim Kerkhoff May 04 '13 at 02:12

5 Answers

6

Great question that had me doing some reading to try and figure it out. Wish I could say I have an answer... but maybe some hints.

I can at least answer your question, "should it be able to live with a single interrupt per packet". I think the answer is yes, based on a very busy firewall that I have access to:

Sar output:

03:04:53 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
03:04:54 PM        lo     93.00     93.00      6.12      6.12      0.00      0.00      0.00
03:04:54 PM      eth0 115263.00 134750.00  13280.63  41633.46      0.00      0.00      5.00
03:04:54 PM      eth8  70329.00  55480.00  20132.62   6314.51      0.00      0.00      0.00
03:04:54 PM      eth9  53907.00  66669.00   5820.42  21123.55      0.00      0.00      0.00
03:04:54 PM     eth10      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM     eth11      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM      eth2 146520.00 111904.00  45228.32  12251.48      0.00      0.00     10.00
03:04:54 PM      eth3    252.00  23446.00     21.34   4667.20      0.00      0.00      0.00
03:04:54 PM      eth4      8.00     10.00      0.68      0.76      0.00      0.00      0.00
03:04:54 PM      eth5      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:04:54 PM      eth6   3929.00   2088.00   1368.01    183.79      0.00      0.00      1.00
03:04:54 PM      eth7     13.00     17.00      1.42      1.19      0.00      0.00      0.00
03:04:54 PM     bond0 169170.00 201419.00  19101.04  62757.00      0.00      0.00      5.00
03:04:54 PM     bond1 216849.00 167384.00  65360.94  18565.99      0.00      0.00     10.00

As you can see, some very high packet per second counts, and no special ethtool tweaking was done on this machine. Oh... Intel chipset, though. :\
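
If you want to gather the same numbers on your Squid boxes for a like-for-like comparison, per-interface packet rates like the output above come from the sysstat package; something along the lines of:

sar -n DEV 1 10

gives ten one-second samples per interface.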

The only thing that was done was some manual irq balancing with /proc/irq/XXX/smp_affinity, on a per-interface basis. I'm not sure why they chose to go that way instead of with irqbalance, but it seems to work.
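
In case it helps, the manual pinning boils down to something like the following; the CPU mask is just an example, and the real IRQ numbers come from /proc/interrupts on your own box:

grep eth /proc/interrupts                 # find the IRQ number(s) behind each interface
echo 2 > /proc/irq/XXX/smp_affinity       # hex CPU bitmask: 1 = CPU0, 2 = CPU1, 4 = CPU2, ...

With MSI-X you should see several vectors per port, so each one can be pinned to its own core.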

I also thought about the math required to answer your question, but I think there are too many variables. So, to summarise: in my opinion, no, I don't think you can predict the outcome here, but with enough data capture you should be able to tweak things to a better level.

Having said all that, my gut feel is that you're somehow hardware-bound here... as in a firmware or interop bug of some kind.

DictatorBob
  • Some useful background here: http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux – DictatorBob Apr 27 '11 at 15:35
  • 1
    I agree with the basic statement "yep, shouldn't have problems", but seeing as how they do have problems it's likely a firmware or driver issue. I haven't "tuned" my workstation at all and it can pull 65kips without breaking a sweat; 15kips shouldn't be anything to a modern CPU. I use exclusively Broadcom NICs, the 5709 being the most common by far. This test was run on FreeBSD however, not Linux. – Chris S Apr 28 '11 at 23:43
  • Thanks for the ideas. I did try irqbalance but didn't notice any difference. I played with more coalesce settings (ethtool -c) but didn't notice any difference. One of the blades is actually the load balancer, pushing up to 120,000 packets/second. I noticed that if the NAT and conntrack iptables are loaded that ksoftirqd CPU usage goes to 100%. Unload those modules and load drops to 0. On the Squid servers (max 10,000 packets/sec), I flushed the 17,000 (!!!) iptables rules and immediately the latencies dropped down. I thought I had tried that before, but apparently not... – Wim Kerkhoff May 07 '11 at 07:04
3

Certainly, given the CPU, chipset and bus capabilities compared to the low amount of traffic you have, there's no reason whatsoever for you to NEED any form of interrupt management. We have multiple RHEL 5.3 64-bit machines with 10Gbps NICs and their interrupt load isn't bad at all, and your traffic is 100 times less than that.
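
As a rough sanity check (the per-interrupt cost below is a guess rather than a measurement, and modern hardware is usually cheaper than this):

15,000 interrupts/s × ~10 µs of handling each ≈ 0.15 s of CPU per second, i.e. about 15% of one core out of your 8 physical / 16 logical cores.

So the raw interrupt rate on its own is nowhere near enough to explain 200ms pings; something else is being triggered alongside it.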

Obviously you have a fixed configuration (I use HP's blades, which are pretty similar), so swapping out NICs for Intels is not an easy option, but what I would say is that I'm starting to spot a number of similar problems around this forum and elsewhere with that particular Broadcom NIC. Even the SE sites themselves had some problems with this kind of inconsistency, and swapping to Intel NICs absolutely helped.

What I'd recommend is picking a single blade and adding an Intel-based adapter to that one machine. You'll obviously have to add an interconnect (or whatever IBM calls them) to get the signal out, but try the same software setup with this other NIC (and probably disable the Broadcom if you can). Test this and see how you get on. I know what I've described needs a couple of bits of extra hardware, but I imagine your IBM rep will happily loan them to you; it's the only way to know for sure. Please let us know what you find out, as I'm genuinely interested to know if there's a problem with these NICs, even if it's an odd edge case. As an aside, I'm meeting with Intel and Broadcom next week to discuss something entirely unrelated, but I'll certainly raise this with them and let you know if I find anything of interest.

Chopper3
1

The question about interrupts is how they impact overall system performance. Interrupts can preempt user- and kernel-space processing, and while you may not see much CPU use, a lot of context switching is occurring, and that is a big performance hit. You can use vmstat and check the system columns, the in and cs headers, for the interrupts and context switches per second (interrupts include the clock, so you must weigh that in); it's worth a check too.
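
For example, a handful of one-second samples is enough to see the pattern; in vmstat's output the interesting numbers sit under the "system" heading, where in is interrupts per second and cs is context switches per second:

vmstat 1 5

Compare peak against off-peak; if cs jumps by an order of magnitude during the bad periods, interrupt-driven context switching is a likely suspect.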

coredump
1

The short direct answer:

If you enable polling you will reduce the context switches (normally due to interrupts) from whatever they are now (15kips in your case) to a predetermined number (usually 1k to 2k).

If your traffic currently generates more interrupts than that predetermined number, then you should see better response times by enabling polling. The converse is also true. I would not say this is "necessary" unless the context switches are impacting performance.
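
A simple way to see whether you are above or below that threshold, before and after any coalescing change, is to watch the raw counters; adjust the interface name to taste:

watch -d -n 1 "grep eth0 /proc/interrupts"

The -d flag highlights the digits that changed since the previous one-second sample, which makes it easy to see which IRQ vectors are firing and roughly how quickly the counters are climbing.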

Chris S
1

To follow up: with the NAT and conntrack modules unloaded plus a minimized iptables ruleset, we get terrific performance. The IPVS load balancer has done over 900 Mbps / 150 kpps, while still using the same Broadcom bnx2 chipsets.
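
For anyone hitting the same symptoms, the key change was keeping connection tracking out of the hot path. A sketch of what that can look like; the NOTRACK rules are just one way to do it, port 3128 is only an example (Squid's default), and the exact module names depend on your kernel:

iptables -t raw -A PREROUTING -p tcp --dport 3128 -j NOTRACK   # skip conntrack for inbound proxy traffic
iptables -t raw -A OUTPUT -p tcp --sport 3128 -j NOTRACK       # and for the replies the proxy sends back
# or, if nothing on the box needs NAT or conntrack at all:
modprobe -r iptable_nat nf_conntrack_ipv4 nf_conntrack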

So, to conclude: the interrupt handling seems fine, and the defaults for Debian with the 2.6.38/3.0.x kernels perform acceptably.

I would definitely prefer to use Intel NICs so that we could use standard Debian packages. Fighting the non-free bnx2 firmware has been a huge waste of time.

Wim Kerkhoff
  • Just another update. Recently the performance was degrading again for no apparent reason. We reviewed all the previous optimizations with no success. Intel NICs are still not an economical option ($30-$40,000 investment in new interconnects, 10Gb switches, etc). BUT, we located some slightly newer IBM HS22 blades that still use the crappy bnx2, but with newer firmware. Performance is much better - we broke the 150,000 packets/sec barrier. – Wim Kerkhoff Mar 20 '12 at 03:20