How are packets scheduled from network interface queues to CPUs, then onwards to threads for processing? What needs to be considered when it comes to how packets are hashed across queues, hardware interrupts vs softirqs, CPU/memory/app/thread locality, and multithreading vs multi-process daemons, to avoid as much packet rescheduling/copying as possible?
I have a multithreaded network daemon (say, the Unbound resolver) running with 16 native threads on Debian amd64 with Linux 2.6.32 (yes, old), so application load is spread across 16 CPUs. The network card is a bnx2 (BCM5709S) with support for 8 MSI-X rx/tx queues. Each queue's IRQ is assigned to a separate CPU by statically mapping interrupt affinity in /proc/irq/n/smp_affinity (irqbalance never did a good job), and the queue hashing type (RSS type) is the default one (IP src+dst, TCP sport+dport), with the default hashing key.
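For reference, the pinning looks roughly like this (the IRQ numbers are examples from my box and will differ elsewhere; the rx-flow-hash query needs a reasonably recent ethtool and driver support):

    # The bnx2 MSI-X vectors show up in /proc/interrupts as eth0-0 .. eth0-7
    grep eth0 /proc/interrupts

    # smp_affinity takes a hex CPU bitmask: CPU0 = 1, CPU1 = 2, CPU2 = 4, ...
    echo 1 > /proc/irq/53/smp_affinity   # queue 0 -> CPU0
    echo 2 > /proc/irq/54/smp_affinity   # queue 1 -> CPU1
    echo 4 > /proc/irq/55/smp_affinity   # queue 2 -> CPU2
    # ... and so on for the remaining five queues

    # Check which header fields feed the RSS hash for UDP/IPv4, if supported:
    ethtool -n eth0 rx-flow-hash udp4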
All this helps spread the load, but not evenly: typically one application thread does twice the work (= requests per second) of the others, and one CPU (probably the one running that particular thread) shows a softirq rate twice that of the other CPUs.
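(The skew is easy to see with mpstat from the sysstat package, and in /proc/interrupts:)

    # %soft column shows per-CPU softirq time, refreshed every second
    mpstat -P ALL 1

    # Hardware interrupt counts per queue and per CPU
    grep eth0 /proc/interrupts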
The CPUs have hyper-threading enabled, but I have not yet done anything to spread load across 'real' cores (which I really should).
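Figuring out which logical CPUs are hyper-thread siblings on the same physical core is at least straightforward:

    # Siblings on one physical core report the same list, e.g. "0,8"
    for c in /sys/devices/system/cpu/cpu[0-9]*; do
        echo "$c: $(cat $c/topology/thread_siblings_list)"
    done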
Linux comes with a fairly comprehensive network scaling document (Documentation/networking/scaling.txt), but there are some blanks I can't fill in:
The doc says this about RSS configuration:
A typical RSS configuration would be to have one receive queue for each CPU if the device supports enough queues, or otherwise at least one for each memory domain, where a memory domain is a set of CPUs that share a particular memory level (L1, L2, NUMA node, etc.).
Q: How do I determine the CPU/cache/memory domain configuration for my server?
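(The closest I've gotten so far is poking at sysfs and the usual tools:)

    # Sockets, cores per socket, threads per core, NUMA nodes
    lscpu

    # NUMA nodes with their CPUs and memory sizes (numactl package)
    numactl --hardware

    # Cache levels and which CPUs share each cache
    grep . /sys/devices/system/cpu/cpu0/cache/index*/{level,type,size,shared_cpu_list}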
The information about receive flow steering (RFS) seems to answer some of my questions about getting the packet to the right CPU/thread:
The goal of RFS is to increase datacache hitrate by steering kernel processing of packets to the CPU where the application thread consuming the packet is running.
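(As I understand it, RPS/RFS only appeared in 2.6.35, so they're not an option on this kernel yet; on a newer one, scaling.txt suggests a configuration along these lines, where eth0 and the sizes are illustrative:)

    # Global table mapping flows to the CPU of the consuming thread
    echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

    # Per-queue flow counts; scaling.txt suggests rps_sock_flow_entries
    # divided by the number of rx queues (32768 / 8 here)
    for q in /sys/class/net/eth0/queues/rx-*; do
        echo 4096 > $q/rps_flow_cnt
    done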
Q: In the case of DNS resolving, there's typically one query packet and one answer packet. With a multithreaded daemon, would only a single thread run bind()+recvfrom(), and thus have to handle all new incoming packets anyway, before scheduling the work onto other threads? Would this particular use case benefit from forked operation (one process per CPU) instead?
Q: Would receive flow steering then typically apply best to a multithreaded TCP daemon?
Q: How would you determine whether to go for multithreaded or multi-process operation? Obviously there's shared memory and data structures, resource contention, etc., but I'm thinking with regard to packet flow and the application listener(s).
Q: Without receive flow steering, or with simple UDP services, can a packet arrive on the 'wrong' CPU, and will it then be rescheduled to the 'correct' CPU somehow? Will this trigger a NET_RX softirq?
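(Possibly related: plain RPS, also 2.6.35+, apparently does exactly this kind of rescheduling, raising a NET_RX softirq on the target CPU via an inter-processor interrupt; it's enabled per rx queue with a CPU bitmask, e.g.:)

    # Hex bitmask of CPUs allowed to process packets from rx queue 0;
    # the default of 0 leaves RPS disabled for the queue
    echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus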
Q: Is there a NET_RX softirq between the NIC queue and the CPU? Is there also one between the CPU and the listening thread/process? Could there be yet another if the receiving thread schedules the packet to a worker thread, if that's even a possibility?
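At least the per-CPU NET_RX counters are easy to watch while loading the box:

    # One column per CPU; the NET_RX row counts receive softirqs
    watch -n1 'grep -E "CPU|NET_RX" /proc/softirqs'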
Too bad there's no video or additional details from Ben Hutchings' netconf 2011 talk, where he covers most of these things. The slides are somewhat brief.
I'll be trying to upgrade to a more recent kernel with a usable perf version, and then inspect what the CPUs are doing, hopefully finding out why that one CPU is so much more loaded than the others.
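Something along these lines, I imagine (the CPU number is an example):

    # Sample what a specific CPU spends its cycles on
    perf top -C 4

    # Or record system-wide with call graphs for ten seconds, then report
    perf record -a -g sleep 10
    perf report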
Note: I'm not looking to solve a particular problem here, rather I'm trying to understand how these things work in the Linux kernel. I'm also aware of the various options for interrupt coalescing.