If you're already familiar with PCI behavior and Linux's handling of DMA buffers, skip to the third section for my actual question. Otherwise read on for a small summary of how PCI devices perform memory accesses, and how the kernel handles communicating with devices using DMA. I've included this here both in hopes of providing people asking the same question with useful information, and to give others the chance to correct me in case my understanding is off.
(My understanding of) PCI, IOMMU, and DMA
PCI and PCIe devices have in their configuration space a two-byte command register containing a bitmask that enables or disables several different hardware features. Bit 2 is the bus master enable bit which, when set, allows the device to initiate DMA requests. This bit, like the other bits in the command register, is set by software running in supervisor mode (typically by kernel drivers) and, despite being physically stored on the PCI device, cannot be changed by the device itself. (Actually, this may be wrong. Is it rather that a PCI bridge won't pass through DMA requests unless it has bus mastering enabled as well?) On hardware without an IOMMU, the device can request reads and writes to any legal memory address. This is often called a DMA attack or evil bus mastering, and it is an issue on any unprotected system with malicious PCI devices. The IOMMU is supposed to be the solution, improving both security and performance. For reference, I am specifically asking about Intel's implementation, VT-d (more precisely, the more modern VT-d2).
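As an illustration of the register layout, the bus master flag can be tested or set with plain bitmask arithmetic on the 16-bit command register value. This is only a sketch of the bit layout (the constant mirrors Linux's `PCI_COMMAND_MASTER`); a real driver would simply call `pci_set_master()` rather than manipulate the register by hand:

```c
#include <stdint.h>

/* Bit 2 of the PCI command register: bus master enable.
 * Same value as Linux's PCI_COMMAND_MASTER. */
#define PCI_COMMAND_MASTER 0x0004

/* Return nonzero if this command value allows the device
 * to initiate DMA requests. */
static inline int bus_master_enabled(uint16_t cmd)
{
    return (cmd & PCI_COMMAND_MASTER) != 0;
}

/* Return the command value with bus mastering turned on,
 * leaving all other feature bits untouched. */
static inline uint16_t enable_bus_master(uint16_t cmd)
{
    return cmd | PCI_COMMAND_MASTER;
}
```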
Most systems can be configured for DMA remapping, or DMAR. The ACPI tables included in the BIOS often contain a DMAR table, which lists the address ranges to which the memory accesses of various PCI groups will be routed. This is all described in section 2.5.1.1 of Intel's VT-d specification. A graphic from the document summarizes how this works:
The Linux kernel DMA API
The DMAR tables are hardcoded by the BIOS. A given PCI device (or rather, a given IOMMU group) is allowed to access only a pre-determined memory range. The kernel is told where that memory is and is instructed not to allocate any memory there which it does not want readable/writable over DMA. The remapping values are reported in the kernel log buffer:
```
DMAR: Setting identity map for device 0000:00:02.0 [0xad000000 - 0xaf1fffff]
DMAR: Setting identity map for device 0000:00:14.0 [0xa95dc000 - 0xa95e8fff]
DMAR: Setting identity map for device 0000:00:1a.0 [0xa95dc000 - 0xa95e8fff]
DMAR: Setting identity map for device 0000:00:1d.0 [0xa95dc000 - 0xa95e8fff]
DMAR: Prepare 0-16MiB unity mapping for LPC
DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
DMAR: Intel(R) Virtualization Technology for Directed I/O
iommu: Adding device 0000:00:00.0 to group 0
iommu: Adding device 0000:00:01.0 to group 1
iommu: Adding device 0000:00:02.0 to group 2
iommu: Adding device 0000:00:14.0 to group 3
iommu: Adding device 0000:00:16.0 to group 4
iommu: Adding device 0000:00:1a.0 to group 5
iommu: Adding device 0000:00:1b.0 to group 6
iommu: Adding device 0000:00:1c.0 to group 7
iommu: Adding device 0000:00:1c.2 to group 8
iommu: Adding device 0000:00:1c.3 to group 9
iommu: Adding device 0000:00:1c.4 to group 10
iommu: Adding device 0000:00:1d.0 to group 11
iommu: Adding device 0000:00:1f.0 to group 12
iommu: Adding device 0000:00:1f.2 to group 12
iommu: Adding device 0000:00:1f.3 to group 12
iommu: Adding device 0000:01:00.0 to group 1
iommu: Adding device 0000:03:00.0 to group 13
iommu: Adding device 0000:04:00.0 to group 14
iommu: Adding device 0000:05:00.0 to group 15
```
From these lines, we see that group 11 contains only device 0000:00:1d.0, which is allowed to freely access the 13 pages of memory in the range 0xa95dc000 - 0xa95e8fff. Devices in group 11 can read and write only within that window, preventing them from modifying the contents of other DMA buffers or unrelated OS code. This way, even if the device has its bus master bit set, it does not need to keep track of where it is writing, and it cannot (accidentally or maliciously) write anywhere it is not supposed to.
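As a sanity check on that page count, the size of an inclusive identity-mapped range divided by the 4 KiB page size gives the number of pages. A quick sketch, using range values taken from the log above:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Number of 4 KiB pages in an inclusive physical address range
 * [start, end], as printed in the DMAR log lines. */
static size_t pages_in_range(uint64_t start, uint64_t end)
{
    return (size_t)((end - start + 1) / PAGE_SIZE);
}
```

For the group 11 window, (0xa95e8fff - 0xa95dc000 + 1) / 4096 = 13 pages, matching the figure above; the 0-16 MiB LPC unity mapping works out to 4096 pages.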
When a kernel driver wants to interact with a device over DMA, it allocates memory specifically for this purpose using, for example, `void *addr = kmalloc(len, GFP_KERNEL | GFP_DMA)`. This returns, in `addr`, a virtual memory address pointing to a contiguous section of memory `len` bytes in size which is suitable for DMA use. This is all described in more detail in the Linux DMA API documentation. The driver is then free to communicate with the PCI device through this shared memory region. The series of events, simplified, may look something like this:
- OpenCL driver allocates memory, shared with the GPU PCI device, for DMA use.
- Driver writes some vector data to the DMA address, and goes off to do something else.
- GPU reads the data over the PCI bus, and begins the slow task of processing it.
- When finished, the GPU writes the finished data to the buffer and fires off an interrupt.
- Driver stops what it is doing due to the interrupt and reads the rendered graphic from memory.
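The handshake above can be sketched as a toy, single-threaded model: a shared struct with a status flag stands in for the DMA region, and a plain function call stands in for the device doing its work and raising an interrupt. All names here are made up for illustration; a real driver would allocate the region with `dma_alloc_coherent()` and sleep on a proper completion mechanism rather than calling the device inline:

```c
enum buf_status { BUF_IDLE, BUF_REQUEST, BUF_DONE };

/* Toy stand-in for a DMA-shared buffer. */
struct dma_buf {
    enum buf_status status;
    int data[4];
};

/* "Device side": process the request and signal completion.
 * In reality this runs concurrently and ends in an interrupt. */
static void device_process(struct dma_buf *buf)
{
    if (buf->status != BUF_REQUEST)
        return;
    for (int i = 0; i < 4; i++)
        buf->data[i] *= 2;          /* pretend computation */
    buf->status = BUF_DONE;         /* "interrupt fires" */
}

/* "Driver side": submit work, then collect the result. */
static int driver_roundtrip(struct dma_buf *buf, const int in[4], int out[4])
{
    for (int i = 0; i < 4; i++)
        buf->data[i] = in[i];
    buf->status = BUF_REQUEST;      /* hand the buffer to the device */

    device_process(buf);            /* driver would go do something else here */

    if (buf->status != BUF_DONE)
        return -1;                  /* device never completed the work */
    for (int i = 0; i < 4; i++)
        out[i] = buf->data[i];
    buf->status = BUF_IDLE;
    return 0;
}
```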
Does the kernel distrust DMA buffers and handle them securely?
Does the kernel implicitly trust these DMA buffers? Can a malicious or compromised PCI device, writing nowhere other than the designated buffers (the IOMMU prevents it from doing otherwise), compromise the kernel by exploiting the data structures they share? The obvious answer is possibly, because any sharing and parsing of complex data structures in memory-unsafe languages carries with it the risk of exploitation. But the kernel developers may assume that these buffers are trusted and put absolutely no effort into securing the kernel against malicious activity in them (unlike, say, the data shared between unprivileged userland and the kernel via `copy_from_user()` and similar functions). I am starting to think that the answer to whether a malicious PCI device can compromise the host despite the IOMMU's restrictions is probably.
Exploitation of such a vulnerability would work something like this, where `buf` is in the device-controlled, DMA-writable address space, and `dest` is elsewhere in kernel memory:
- Device writes data as `struct { size_t len; char data[32]; char foo[32]; } buf`.
- Driver is to copy `data` into `struct { char data[32]; bool summon_demons; } dest`.
- Device maliciously sets `buf.len = sizeof(buf.data) + 1` and `buf.foo[0] = 1`.
- Driver copies the data insecurely, using `memcpy(dest.data, buf.data, buf.len)`.
- PCI device gains control over the kernel and your immortal soul in a classic buffer overflow.
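The steps above can be demonstrated in a self-contained userspace C snippet. The struct layouts are taken from the list; `demonstrate_overflow()` is a made-up harness standing in for the buggy driver path. On typical ABIs `foo` sits directly after `data`, so the 33rd byte copied is `buf.foo[0]`, which lands in `dest.summon_demons`:

```c
#include <stdbool.h>
#include <string.h>
#include <stddef.h>

struct device_buf { size_t len; char data[32]; char foo[32]; };
struct kernel_dest { char data[32]; bool summon_demons; };

/* Returns the value of dest.summon_demons after the vulnerable copy.
 * A correct driver would clamp buf.len to sizeof(dest.data) first. */
static bool demonstrate_overflow(void)
{
    struct device_buf buf = {0};
    struct kernel_dest dest = {0};

    /* Device-controlled contents: one byte past the end of data. */
    buf.len = sizeof(buf.data) + 1;   /* 33 */
    buf.foo[0] = 1;                   /* the byte that will spill over */

    /* The insecure copy: trusts the device-supplied length. */
    memcpy(dest.data, buf.data, buf.len);

    return dest.summon_demons;
}
```

With a bounds check, e.g. copying at most `sizeof(dest.data)` bytes, the overflow disappears; whether drivers actually apply that kind of validation to device-written fields is exactly what the question asks.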
Obviously this is a contrived example, and while such an obvious bug would most likely not make its way into the kernel in the first place, it illustrates my point and brings me to my primary question:
Are there any examples of vulnerabilities from improper handling of data structures shared over DMA, or of any specific drivers treating the input from PCI devices as trusted?
Limitations of VT-d as an IOMMU
I am aware of VT-d's limitations and don't want an answer that tries to explain how a device could work around the IOMMU directly or use another loophole to gain control of the system. I know:
- It cannot adequately protect a system unless x2APIC and Interrupt Remapping are supported.
- Address Translation Services (ATS) can bypass the IOMMU.
- Modified PCI expansion ROMs can attack the system on reboot.
- All devices in a given IOMMU group have access to the same memory (assuming no ACS).
- Some BIOSes come with a broken DMAR table, resulting in the IOMMU being disabled.
- The CSME ("Intel ME") may be able to disable VT-d via PSF and PAVP.
- Yet unknown attacks may be capable of disabling or bypassing the IOMMU.
*DMA means Direct Memory Access. It is a hardware feature whereby certain hardware interfaces (like PCIe) are able to request direct access to system memory, without going through the CPU.