
I am trying to move all interrupts over to cores 0-3 to keep the rest of my cores free for high-speed, low-latency virtualization.

I wrote a quick script to set IRQ affinity to 0-3:

#!/bin/bash
# Point every IRQ at CPUs 0-3 by writing to each smp_affinity_list file.

while IFS= read -r LINE; do
    echo "0-3 -> \"$LINE\""
    sudo bash -c "echo 0-3 > \"$LINE\""
done <<< "$(find /proc/irq/ -name smp_affinity_list)"

This appears to work for USB devices and network devices, but not NVMe devices. They all produce this error:

bash: line 1: echo: write error: Input/output error

And they stubbornly continue to produce interrupts evenly across almost all my cores.

If I check the current affinities of those devices:

$ cat /proc/irq/81/smp_affinity_list 
0-1,16-17
$ cat /proc/irq/82/smp_affinity_list
2-3,18-19
$ cat /proc/irq/83/smp_affinity_list
4-5,20-21
$ cat /proc/irq/84/smp_affinity_list
6-7,22-23
...

It appears "something" is taking full control of spreading IRQs across cores and not letting me change it.

It is critical that I move these interrupts to other cores: I'm doing heavy I/O in virtual machines on those cores, and the NVMe drives are producing an enormous number of interrupts. This isn't Windows; I'm supposed to be able to decide what my machine does.

What is controlling IRQ affinity for these devices and how do I override it?


I am using a Ryzen 3950X CPU on a Gigabyte Aorus X570 Master motherboard with three NVMe drives connected to the M.2 ports on the motherboard.

(Update: I am now using a 5950X, still having the exact same issue)

Kernel: 5.12.2-arch1-1

Output of lspci -v related to the NVMe devices:

01:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Phison Electronics Corporation E12 NVMe Controller
    Flags: bus master, fast devsel, latency 0, IRQ 45, NUMA node 0, IOMMU group 14
    Memory at fc100000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
    Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
    Capabilities: [f8] Power Management version 3
    Capabilities: [100] Latency Tolerance Reporting
    Capabilities: [110] L1 PM Substates
    Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [200] Advanced Error Reporting
    Capabilities: [300] Secondary PCI Express
    Kernel driver in use: nvme

04:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Phison Electronics Corporation E12 NVMe Controller
    Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0, IOMMU group 25
    Memory at fbd00000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
    Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
    Capabilities: [f8] Power Management version 3
    Capabilities: [100] Latency Tolerance Reporting
    Capabilities: [110] L1 PM Substates
    Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [200] Advanced Error Reporting
    Capabilities: [300] Secondary PCI Express
    Kernel driver in use: nvme

05:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Phison Electronics Corporation E12 NVMe Controller
    Flags: bus master, fast devsel, latency 0, IRQ 40, NUMA node 0, IOMMU group 26
    Memory at fbc00000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Express Endpoint, MSI 00
    Capabilities: [d0] MSI-X: Enable+ Count=9 Masked-
    Capabilities: [e0] MSI: Enable- Count=1/8 Maskable- 64bit+
    Capabilities: [f8] Power Management version 3
    Capabilities: [100] Latency Tolerance Reporting
    Capabilities: [110] L1 PM Substates
    Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [200] Advanced Error Reporting
    Capabilities: [300] Secondary PCI Express
    Kernel driver in use: nvme

$ dmesg | grep -i nvme
[    2.042888] nvme nvme0: pci function 0000:01:00.0
[    2.042912] nvme nvme1: pci function 0000:04:00.0
[    2.042941] nvme nvme2: pci function 0000:05:00.0
[    2.048103] nvme nvme0: missing or invalid SUBNQN field.
[    2.048109] nvme nvme2: missing or invalid SUBNQN field.
[    2.048109] nvme nvme1: missing or invalid SUBNQN field.
[    2.048112] nvme nvme0: Shutdown timeout set to 10 seconds
[    2.048120] nvme nvme1: Shutdown timeout set to 10 seconds
[    2.048127] nvme nvme2: Shutdown timeout set to 10 seconds
[    2.049578] nvme nvme0: 8/0/0 default/read/poll queues
[    2.049668] nvme nvme1: 8/0/0 default/read/poll queues
[    2.049716] nvme nvme2: 8/0/0 default/read/poll queues
[    2.051211]  nvme1n1: p1
[    2.051260]  nvme2n1: p1
[    2.051577]  nvme0n1: p1 p2
Hubro
  • I am not sure if that is the answer, but if a useful answer ends up mentioning `isolcpus=` and/or `cset` - please try to get the explanation upstreamed, the whole *managed irq* idea is far from sufficiently documented. – anx Feb 05 '21 at 07:47
  • if you're using a systemd-based distro you might be able to change the units for mounting your NVMe drives to specific cgroups, limiting them to the specific cores you want to use. Additionally you might be able to use taskset somehow – Dennis Nolte May 10 '21 at 14:39
  • what's your kernel version? also, can you edit your question to include relevant `sudo lspci -v` output about your NVMe device? – mforsetti May 12 '21 at 05:20
  • @mforsetti Added those details to the question now – Hubro May 12 '21 at 05:36

3 Answers


The simplest solution to this problem is probably just to switch the NVMe driver from interrupt mode to polling mode.

Add this to /etc/modprobe.d/nvme.conf:

options nvme poll_queues=4

then run update-initramfs -u (or your distro's equivalent, e.g. mkinitcpio -P on Arch), reboot, and you should see a vast reduction in IRQs for NVMe devices. You can also play around with the poll queue count in sysfs and other NVMe driver tweakables (modinfo nvme should give you a list of the parameters you can adjust).
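
A quick way to verify the change actually took effect after rebooting (a sketch only; the paths assume the stock nvme driver and a namespace named nvme0n1, as in your dmesg output):

# poll_queues is a module parameter; if the nvme driver is built into the
# kernel (lsmod shows nothing), set it as nvme.poll_queues=4 on the kernel
# command line instead of modprobe.d, then check that it took effect:
cat /sys/module/nvme/parameters/poll_queues

# per-namespace polling switch; it can only be enabled once poll queues exist
cat /sys/block/nvme0n1/queue/io_poll

# the driver logs its queue split at probe time, e.g. "8/0/4 default/read/poll queues"
dmesg | grep -i 'poll queues'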

That said, this is all highly dependent on what kernel version you’re running…

Andrew H
  • I'm on kernel 5.12.2. This looks very promising, I'll give it a shot later today! Any performance impacts I should look out for when using polling mode? – Hubro May 12 '21 at 05:39
  • be warned that using `poll_queues` will increase your CPU load, because, well, it's `poll`. See [this Western Digital's presentation](https://events.static.linuxfound.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdf) for more info. – mforsetti May 12 '21 at 06:01
  • @mforsetti According to the presentation you linked, polling seems to make a lot of sense for very fast flash storage (like what I have) even aside from the advantage of reducing interrupts. – Hubro May 12 '21 at 06:07
  • it always depends on your workload. if your workload is CPU-bound, then the performance impact does outweigh any benefit from your `poll_queues`. if your workload is IO-bound, then *maybe* there will be a performance benefit. – mforsetti May 12 '21 at 06:13
  • I created the modprobe config file, ran `mkinitcpio -P` (since I'm on Arch) and rebooted, but it doesn't seem like polling mode has been enabled. `cat /sys/block/nvme*/queue/io_poll` produces a 0 for each of my 3 NVMe devices. Thousands of interrupts per second are still being produced by the NVMe devices when I run a disk benchmark. If I try to write "1" to "/sys/block/nvme0n1/queue/io_poll" I get "echo: write error: invalid argument". Am I missing something? – Hubro May 12 '21 at 06:35
  • Does it matter that I'm running md raid on top of my NVMe devices? It doesn't seem like the nvme module is even loaded, `lsmod | grep nvme` produces no output. – Hubro May 12 '21 at 06:57
  • try `dmesg | grep -i nvme` – mforsetti May 12 '21 at 07:27
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/124159/discussion-between-hubro-and-mforsetti). – Hubro May 12 '21 at 08:04

What is controlling IRQ affinity for these devices?

Since v4.8, the Linux kernel has used MSI/MSI-X interrupts for the NVMe driver whenever possible, and with the IRQD_AFFINITY_MANAGED flag it manages the affinity of those MSI/MSI-X interrupts in the kernel itself.

See these commits:

  1. 90c9712fbb388077b5e53069cae43f1acbb0102a - NVMe: Always use MSI/MSI-X interrupts
  2. 9c2555835bb3d34dfac52a0be943dcc4bedd650f - genirq: Introduce IRQD_AFFINITY_MANAGED flag

Given your kernel version and your devices' capabilities in the lspci -v output (MSI-X: Enable+), this appears to be the case here.
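
If your kernel happens to be built with CONFIG_GENERIC_IRQ_DEBUGFS (an assumption; many distribution kernels leave it off), you can confirm the flag directly for one of the NVMe IRQs, e.g. IRQ 81:

sudo mount -t debugfs none /sys/kernel/debug 2>/dev/null    # usually already mounted
sudo grep IRQD_AFFINITY_MANAGED /sys/kernel/debug/irq/irqs/81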

and how do I override it?

Besides disabling the flag and recompiling the kernel, you can probably disable MSI/MSI-X at the PCI bridge above the device (instead of at the device itself):

echo 0 > /sys/bus/pci/devices/$bridge/msi_bus

Note that disabling MSI/MSI-X will have a performance impact, and the setting only applies to drivers bound after the write. See this kernel documentation for more details.
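
A minimal sketch of the full sequence, assuming 0000:01:00.0 is the device you want to change and that nothing critical (like your root filesystem) is mounted from it, since the drive has to be unbound and rebound for the bridge setting to apply:

dev=0000:01:00.0
# the parent directory of the device's sysfs node is its upstream bridge
bridge=$(basename "$(dirname "$(readlink -f /sys/bus/pci/devices/$dev)")")

echo "$dev" | sudo tee /sys/bus/pci/drivers/nvme/unbind
echo 0 | sudo tee "/sys/bus/pci/devices/$bridge/msi_bus"   # 0 = disallow MSI/MSI-X below this bridge
echo "$dev" | sudo tee /sys/bus/pci/drivers/nvme/bind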

Instead of disabling MSI/MSI-X, a better approach would be to keep MSI-X but also enable polling mode in the NVMe driver. See Andrew H's answer.

mforsetti
  • Interesting! If I were to use polling mode rather than interrupt mode for my drives, would there be any point in disabling MSI/MSI-X? – Hubro May 12 '21 at 06:08
  • nope, disabling MSI/MSI-X will hurt performance in a few ways: 1. MSI-X supports up to 2048 interrupts, disabling it, especially in a very fast storage, will increase latency; and 2. MSI-X [prevents DMA-IRQ race latency](https://en.wikipedia.org/wiki/Message_Signaled_Interrupts#Advantages). – mforsetti May 12 '21 at 06:21

That is intentional.

NVMe devices are supposed to have multiple command queues with associated interrupts, so interrupts can be delivered to the CPU that requested the operation.

For an emulated virtual disk, this is the CPU running the I/O thread, which then decides if the VM CPU needs to be interrupted to deliver the emulated interrupt.

For a PCIe passthrough disk, this is the CPU running the VM: it leaves the VM and enters the host interrupt handler, which notices that the interrupt is destined for the virtual CPU and queues it so that it is delivered on the next VM entry after the handler returns. That way we still get only one interruption of the VM context.

This is pretty much as optimal as it gets. You can pessimize this by delivering the IRQ to another CPU that will then notice that the VM needs to be interrupted, and queue an inter-processor interrupt to direct it where it needs to go.

For I/O that does not belong to a VM, the interrupt should go to a CPU that is not associated with a VM.

For this to work properly, the CPU mapping for the VMs needs to be somewhat static.
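
Purely as an illustration (the domain name and CPU numbers below are placeholders, not taken from your setup), with libvirt that static mapping looks something like:

virsh vcpupin myvm 0 4            # vCPU 0 -> host CPU 4
virsh vcpupin myvm 1 5            # vCPU 1 -> host CPU 5
virsh emulatorpin myvm 0-3        # keep QEMU emulator/I/O threads on the housekeeping cores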

There is also the CPU isolation framework you could take a look at, but that is probably too heavy-handed.
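
If you do want to try it, a sketch of the relevant kernel command line (assuming kernel 5.6 or newer; the managed_irq flag is best-effort, and a managed IRQ can still land on an isolated core if its queue's mask contains only isolated CPUs):

isolcpus=managed_irq,domain,4-15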

Simon Richter
  • "For an emulated virtual disk, this is the CPU running the I/O thread" - This doesn't seem to be the case though, I have pinned the I/O threads to cores that are not being used by the VM, but those cores are still processing thousands of interrupts while my VM is running. "There is also the CPU isolation framework you could take a look at" - This unfortunately doesn't work for NVMe interrupts :( – Hubro May 12 '21 at 21:12