The main principle behind interrupt moderation is to generate fewer than one interrupt per received frame (or per transmit frame completion), which reduces the OS overhead of servicing interrupts. The BCM5709 controller supports a couple of hardware methods for coalescing interrupts, including:
- Generate an interrupt after receiving X frames (rx-frames in ethtool)
- Generate an interrupt when no more frames are received after X usecs (rx-usecs in ethtool)
The problem with these hardware methods is that you have to tune them for either throughput or latency; you can't have both. Generating one interrupt for each received frame (rx-frames = 1) minimizes latency, but at a high cost in interrupt service overhead. Setting a larger value (say rx-frames = 10) consumes fewer CPU cycles, because only one interrupt is generated for every ten frames received, but the first frames in each group of ten will see higher latency.
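To put rough numbers on that trade-off, here is a toy model in Python. It assumes evenly spaced frames and only the rx-frames counter (no rx-usecs timer to cap the wait); the function name and the 30 usec frame gap are made up for illustration and are not measurements from a BCM5709.

```python
def coalesce_cost(rx_frames, frame_gap_usecs, total_frames):
    """Toy model: evenly spaced frames, one interrupt per rx_frames frames."""
    interrupts = total_frames // rx_frames
    # The first frame of each group has to wait for the other
    # rx_frames - 1 frames to arrive before the interrupt fires.
    first_frame_wait_usecs = (rx_frames - 1) * frame_gap_usecs
    return interrupts, first_frame_wait_usecs


for n in (1, 10):
    interrupts, wait = coalesce_cost(n, frame_gap_usecs=30, total_frames=1000)
    print(f"rx-frames={n}: {interrupts} interrupts, "
          f"first frame waits {wait} usecs")
```

In this model, going from rx-frames = 1 to rx-frames = 10 cuts the interrupt count by a factor of ten, and the added wait for the first frame grows linearly with the group size.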
The NAPI implementation tries to exploit the fact that traffic arrives in bursts: an interrupt is generated immediately on the first frame received, and the driver then switches straight into polling mode (i.e. interrupts are disabled) because more traffic is likely to be close behind. After the driver has polled for some number of frames (16 or 64 in your question) or for some time interval, it re-enables interrupts and starts over again.
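The interrupt-then-poll cycle can be sketched as a small simulation. This is a simplified model, not the kernel's code: it assumes each poll pass takes a fixed time and that interrupts are re-enabled as soon as a pass finds fewer than `budget` frames waiting. The names `simulate_napi`, `budget` and `poll_cost_usecs` are hypothetical.

```python
def simulate_napi(arrivals, budget, poll_cost_usecs):
    """Count interrupts taken by a toy NAPI-style loop.

    arrivals: sorted frame arrival times in usecs.
    budget: maximum frames handled per poll pass.
    poll_cost_usecs: assumed fixed duration of one poll pass.
    """
    interrupts = 0
    i = 0    # index of the next unprocessed frame
    now = 0  # simulated clock in usecs
    while i < len(arrivals):
        # Interrupts are enabled: the next pending frame raises one.
        interrupts += 1
        now = max(now, arrivals[i])
        # Polling mode: interrupts stay off while frames keep coming.
        while True:
            ready = 0
            while (ready < budget and i + ready < len(arrivals)
                   and arrivals[i + ready] <= now):
                ready += 1
            i += ready
            now += poll_cost_usecs
            if ready < budget:
                break  # queue drained, so re-enable interrupts
    return interrupts


burst = list(range(100))                   # 100 frames, 1 usec apart
sparse = [k * 1000 for k in range(10)]     # 10 frames, 1 msec apart
print("burst interrupts:", simulate_napi(burst, 16, 20))
print("sparse interrupts:", simulate_napi(sparse, 16, 20))
```

With bursty input most frames are picked up by polling, so the interrupt count stays far below the frame count. With sparse input every frame still gets its own immediate interrupt, which is the low-latency behaviour you want at low load.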
If you have a predictable workload then fixed values can be selected for any of the above (NAPI, rx-frames, rx-usecs) that give you the right trade-off, but most workloads vary and you end up making some sacrifices. This is where adaptive-rx/adaptive-tx come into play. The idea there is that the driver constantly monitors the workload (frames received per second, frame size, etc.) and tunes the hardware interrupt coalescing scheme to optimize for latency in low traffic situations or optimize for throughput in high traffic situations. It's a cool theory but may be difficult to implement in practice. Only a few drivers implement it (see http://fxr.watson.org/fxr/search?v=linux-2.6&string=use_adaptive_rx_coalesce) and the bnx2/e1000 drivers aren't on that list.
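As a rough illustration of the adaptive idea (not how any particular driver does it), the sketch below maps an observed frame rate to an rx-usecs value: no delay at low rates, a capped delay at high rates, and linear interpolation in between. The function name, thresholds and limits are all hypothetical.

```python
def pick_rx_usecs(frames_per_sec, low_rate=5_000, high_rate=50_000,
                  min_usecs=0, max_usecs=100):
    """Choose an rx-usecs value from the observed receive rate (toy policy)."""
    if frames_per_sec <= low_rate:
        return min_usecs   # light traffic: favour latency
    if frames_per_sec >= high_rate:
        return max_usecs   # heavy traffic: favour throughput
    # In between, scale the delay linearly with the frame rate.
    span = high_rate - low_rate
    return min_usecs + (max_usecs - min_usecs) * (frames_per_sec - low_rate) // span


for rate in (1_000, 27_500, 100_000):
    print(f"{rate} frames/s -> rx-usecs {pick_rx_usecs(rate)}")
```

A real driver would also have to decide how often to sample the rate and how to avoid oscillating between settings, which is part of why this is harder in practice than it looks.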
For a good description of how each ethtool coalescing field is supposed to work, have a look at the definitions for the ethtool_coalesce structure at the following address:
http://fxr.watson.org/fxr/source/include/linux/ethtool.h?v=linux-2.6#L111
For your particular situation (~400Mb/s throughput) I'd suggest tuning the rx-frames and rx-usecs values until you find the best settings for your workload. Look at both the overhead of the ISR and the sensitivity of your application (httpd? etc.) to latency.
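If you want to step through a few combinations, a small helper like the one below can generate the ethtool commands for you. It only prints them; the interface name eth0 and the candidate values are placeholders, so substitute your own and measure after each change.

```python
def coalesce_command(interface, rx_frames, rx_usecs):
    """Build the argv for setting RX coalescing with `ethtool -C`."""
    return ["ethtool", "-C", interface,
            "rx-frames", str(rx_frames),
            "rx-usecs", str(rx_usecs)]


# Hypothetical candidates, ordered from latency-oriented to throughput-oriented.
candidates = [(1, 0), (10, 50), (30, 100)]

for rx_frames, rx_usecs in candidates:
    print(" ".join(coalesce_command("eth0", rx_frames, rx_usecs)))
```

Running `ethtool -c eth0` before and after shows the values the driver actually accepted, since it may clamp or reject some settings.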
Dave