| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +=========================================== |
| 4 | +Shared Virtual Addressing (SVA) with ENQCMD |
| 5 | +=========================================== |
| 6 | + |
| 7 | +Background |
| 8 | +========== |
| 9 | + |
| 10 | +Shared Virtual Addressing (SVA) allows the processor and device to use the |
| 11 | +same virtual addresses, avoiding the need for software to translate virtual |
| 12 | +addresses to physical addresses. SVA is what PCIe calls Shared Virtual |
| 13 | +Memory (SVM). |
| 14 | + |
| 15 | +In addition to the convenience of letting the device use application |
| 16 | +virtual addresses, SVA also doesn't require pinning pages for DMA. |
| 17 | +PCIe Address Translation Services (ATS) along with Page Request Interface |
| 18 | +(PRI) allow devices to function much the same way as the CPU handling |
| 19 | +application page-faults. For more information please refer to the PCIe |
| 20 | +specification Chapter 10: ATS Specification. |
| 21 | + |
| 22 | +Use of SVA requires IOMMU support in the platform. IOMMU is also |
| 23 | +required to support the PCIe features ATS and PRI. ATS allows devices |
| 24 | +to cache translations for virtual addresses. The IOMMU driver uses the |
| 25 | +mmu_notifier() support to keep the device TLB cache and the CPU cache in |
| 26 | +sync. When an ATS lookup fails for a virtual address, the device should |
| 27 | +use the PRI in order to request the virtual address to be paged into the |
| 28 | +CPU page tables. The device must use ATS again in order to fetch the |
| 29 | +translation before use. |
| 30 | + |
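The ATS and PRI sequence described above can be sketched as a small C model. This is only an illustration under stated assumptions: `present[]`, `ats_translate()`, `pri_request()` and `device_access()` are hypothetical names invented for this sketch and do not correspond to any kernel or device interface.

```c
#include <stdbool.h>

#define NPAGES 4

/* Toy model: present[i] says whether page i is in the CPU page tables. */
static bool present[NPAGES];

/* Models an ATS translation request: succeeds only if the page is present. */
static bool ats_translate(unsigned int page)
{
	return page < NPAGES && present[page];
}

/* Models a PRI page request: the OS pages the address in if it is valid. */
static bool pri_request(unsigned int page)
{
	if (page >= NPAGES)
		return false;	/* invalid address: the request is rejected */
	present[page] = true;
	return true;
}

/*
 * Device-side access: try ATS, fall back to PRI on a miss, then repeat
 * ATS to fetch the translation before use. Returns the number of ATS
 * requests issued, or -1 if the address cannot be made present.
 */
static int device_access(unsigned int page)
{
	int ats_requests = 1;

	if (ats_translate(page))
		return ats_requests;
	if (!pri_request(page))
		return -1;
	ats_requests++;
	return ats_translate(page) ? ats_requests : -1;
}
```

With all pages initially absent, `device_access(2)` returns 2 (miss, page request, second ATS request). A repeated call returns 1 because the translation now succeeds on the first request.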
| 31 | +Shared Hardware Workqueues |
| 32 | +========================== |
| 33 | + |
| 34 | +Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits |
| 35 | +the use of Shared Work Queues (SWQ) by both applications and Virtual |
| 36 | +Machines (VMs). This allows better hardware utilization vs. hard |
| 37 | +partitioning resources that could result in underutilization. In order to |
| 38 | +allow the hardware to distinguish the context for which work is being |
| 39 | +executed in the hardware by the SWQ interface, SIOV uses Process Address Space |
| 40 | +ID (PASID), which is a 20-bit number defined by the PCIe SIG. |
| 41 | + |
| 42 | +The PASID value is encoded in all transactions from the device. This allows the |
| 43 | +IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe |
| 44 | +Requester ID (RID), which is the Bus/Device/Function. |
| 45 | + |
| 46 | + |
| 47 | +ENQCMD |
| 48 | +====== |
| 49 | + |
| 50 | +ENQCMD is a new instruction on Intel platforms that atomically submits a |
| 51 | +work descriptor to a device. The descriptor includes the operation to be |
| 52 | +performed, virtual addresses of all parameters, virtual address of a completion |
| 53 | +record, and the PASID (process address space ID) of the current process. |
| 54 | + |
| 55 | +ENQCMD works with non-posted semantics and carries a status back if the |
| 56 | +command was accepted by hardware. This allows the submitter to know whether |
| 57 | +the submission needs to be retried, or whether other device-specific |
| 58 | +mechanisms to implement fairness or ensure forward progress should be provided. |
| 59 | + |
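The retry decision described above can be sketched as a bounded loop. This is a minimal illustration, not the instruction's actual interface: the `submit_fn` callback stands in for ENQCMD, and `submit_with_retry()` and `toy_submit()` are hypothetical names introduced only for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical submit callback standing in for ENQCMD: returns true if
 * the device accepted the descriptor, false if it asked for a retry.
 */
typedef bool (*submit_fn)(void *portal, const void *desc);

/*
 * Try to submit a descriptor up to max_tries times. Returns the number
 * of attempts used on success, or 0 if the device never accepted it.
 */
static unsigned int submit_with_retry(submit_fn submit, void *portal,
				      const void *desc, unsigned int max_tries)
{
	for (unsigned int attempt = 1; attempt <= max_tries; attempt++) {
		if (submit(portal, desc))
			return attempt;
		/* A real caller might pause or back off here. */
	}
	return 0;
}

/*
 * Toy device for demonstration: the "portal" points to a count of
 * occupied slots, and one slot drains on every rejected attempt.
 */
static bool toy_submit(void *portal, const void *desc)
{
	int *busy = portal;

	(void)desc;
	if (*busy > 0) {
		(*busy)--;
		return false;
	}
	return true;
}
```

With two occupied slots and a budget of five tries, `submit_with_retry(toy_submit, &busy, NULL, 5)` succeeds on the third attempt. Bounding the loop is one simple way to avoid spinning forever when the queue stays full.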
| 60 | +ENQCMD is the glue that ensures applications can directly submit commands |
| 61 | +to the hardware and also permits hardware to be aware of application context |
| 62 | +to perform I/O operations via use of PASID. |
| 63 | + |
| 64 | +Process Address Space Tagging |
| 65 | +============================= |
| 66 | + |
| 67 | +A new thread-scoped MSR (IA32_PASID) provides the connection between |
| 68 | +user processes and the rest of the hardware. When an application first |
| 69 | +accesses an SVA-capable device, this MSR is initialized with a newly |
| 70 | +allocated PASID. The driver for the device calls an IOMMU-specific API |
| 71 | +that sets up the routing for DMA and page-requests. |
| 72 | + |
| 73 | +For example, the Intel Data Streaming Accelerator (DSA) uses |
| 74 | +iommu_sva_bind_device(), which will do the following: |
| 75 | + |
| 76 | +- Allocate the PASID, and program the process page-table (%cr3 register) in the |
| 77 | + PASID context entries. |
| 78 | +- Register for mmu_notifier() to track any page-table invalidations to keep |
| 79 | + the device TLB in sync. For example, when a page-table entry is invalidated, |
| 80 | + the IOMMU propagates the invalidation to the device TLB. This will force any |
| 81 | + future access by the device to this virtual address to participate in |
| 82 | + ATS. If the IOMMU responds with an indication that a page is not |
| 83 | + present, the device would request the page to be paged in via the PCIe PRI |
| 84 | + protocol before performing I/O. |
| 85 | + |
| 86 | +This MSR is managed with the XSAVE feature set as "supervisor state" to |
| 87 | +ensure the MSR is updated during context switch. |
| 88 | + |
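As an illustration of the tagging above, the sketch below packs a 20-bit PASID into a 64-bit MSR image and unpacks it again. The 20-bit width comes from this document. The use of bit 31 as a valid flag and all names (`PASID_MSR_VALID`, `pasid_msr_encode()`, `pasid_msr_decode()`) are assumptions made for the example, so check the ENQCMD reference listed under References for the actual register layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_BITS	20
#define PASID_MASK	((UINT64_C(1) << PASID_BITS) - 1)
/* Assumption for this sketch: bit 31 marks the MSR contents as valid. */
#define PASID_MSR_VALID	(UINT64_C(1) << 31)

/* Build an MSR image holding the given PASID with the valid flag set. */
static uint64_t pasid_msr_encode(uint32_t pasid)
{
	return ((uint64_t)pasid & PASID_MASK) | PASID_MSR_VALID;
}

/*
 * Extract the PASID from an MSR image. Returns false, leaving *pasid
 * untouched, when the valid flag is clear (no PASID programmed yet).
 */
static bool pasid_msr_decode(uint64_t msr, uint32_t *pasid)
{
	if (!(msr & PASID_MSR_VALID))
		return false;
	*pasid = (uint32_t)(msr & PASID_MASK);
	return true;
}
```

A cleared image (value 0) decodes as "no PASID", which matches the idea that a thread must not submit work before the MSR has been populated.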
| 89 | +PASID Management |
| 90 | +================ |
| 91 | + |
| 92 | +The kernel must allocate a PASID on behalf of each process which will use |
| 93 | +ENQCMD and program it into the new MSR to communicate the process identity to |
| 94 | +platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests |
| 95 | +from this process. When a user submits a work descriptor to a device using the |
| 96 | +ENQCMD instruction, the PASID field in the descriptor is auto-filled with the |
| 97 | +value from MSR_IA32_PASID. Requests for DMA from the device are also tagged |
| 98 | +with the same PASID. The platform IOMMU uses the PASID in the transaction to |
| 99 | +perform address translation. The IOMMU APIs set up the corresponding PASID |
| 100 | +entry in the IOMMU with the process page table used by the CPU (e.g. the %cr3 |
| 101 | +register on x86). |
| 102 | + |
| 103 | +The MSR must be configured on each logical CPU before any application |
| 104 | +thread can interact with a device. Threads that belong to the same |
| 105 | +process share the same page tables, thus the same MSR value. |
| 106 | + |
| 107 | +PASID is cleared when a process is created. The PASID allocation and MSR |
| 108 | +programming may occur long after a process and its threads have been created. |
| 109 | +One thread must call iommu_sva_bind_device() to allocate the PASID for the |
| 110 | +process. If a thread uses ENQCMD without the MSR first being populated, a #GP |
| 111 | +will be raised. The kernel will update the PASID MSR with the PASID for all |
| 112 | +threads in the process. A single process PASID can be used simultaneously |
| 113 | +with multiple devices since they all share the same address space. |
| 114 | + |
| 115 | +One thread can call iommu_sva_unbind_device() to free the allocated PASID. |
| 116 | +The kernel will clear the PASID MSR for all threads belonging to the process. |
| 117 | + |
| 118 | +New threads inherit the MSR value from the parent. |
| 119 | + |
| 120 | +Relationships |
| 121 | +============= |
| 122 | + |
| 123 | + * Each process has many threads, but only one PASID. |
| 124 | + * Devices have a limited number (tens to thousands) of hardware workqueues. |
| 125 | + The device driver manages allocating hardware workqueues. |
| 126 | + * A single mmap() maps a single hardware workqueue as a "portal" and |
| 127 | + each portal maps down to a single workqueue. |
| 128 | + * For each device with which a process interacts, there must be |
| 129 | + one or more mmap()'d portals. |
| 130 | + * Many threads within a process can share a single portal to access |
| 131 | + a single device. |
| 132 | + * Multiple processes can separately mmap() the same portal, in |
| 133 | + which case they still share one device hardware workqueue. |
| 134 | + * The single process-wide PASID is used by all threads to interact |
| 135 | + with all devices. There is not, for instance, a PASID for each |
| 136 | + thread or each thread<->device pair. |
| 137 | + |
| 138 | +FAQ |
| 139 | +=== |
| 140 | + |
| 141 | +* What is SVA/SVM? |
| 142 | + |
| 143 | +Shared Virtual Addressing (SVA) permits I/O hardware and the processor to |
| 144 | +work in the same address space, i.e., to share it. Some call it Shared |
| 145 | +Virtual Memory (SVM), but the Linux community wanted to avoid confusing it with |
| 146 | +POSIX Shared Memory and Secure Virtual Machines, which were terms already in |
| 147 | +circulation. |
| 148 | + |
| 149 | +* What is a PASID? |
| 150 | + |
| 151 | +A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet |
| 152 | +(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS. |
| 153 | +PASID is included in all transactions between the platform and the device. |
| 154 | + |
| 155 | +* How are shared workqueues different? |
| 156 | + |
| 157 | +Traditionally, in order for userspace applications to interact with hardware, |
| 158 | +there is a separate hardware instance required per process. For example, |
| 159 | +consider doorbells as a mechanism of informing hardware about work to process. |
| 160 | +Each doorbell is required to be spaced 4k (or page-size) apart for process |
| 161 | +isolation. This requires hardware to provision that space and reserve it in |
| 162 | +MMIO. This doesn't scale as the number of threads becomes quite large. The |
| 163 | +hardware also manages the queue depth for Shared Work Queues (SWQ), and |
| 164 | +consumers don't need to track queue depth. If there is no space to accept |
| 165 | +a command, the device will return an error indicating retry. |
| 166 | + |
| 167 | +A user should check the Deferrable Memory Write (DMWr) capability on the device |
| 168 | +and only submit ENQCMD when the device supports it. In the new DMWr PCIe |
| 169 | +terminology, devices need to support DMWr completer capability. In addition, |
| 170 | +all switch ports must support DMWr routing, and DMWr must be enabled by |
| 171 | +the PCIe subsystem, much like how PCIe atomic operations are managed, for |
| 172 | +instance. |
| 173 | + |
| 174 | +SWQ allows hardware to provision just a single address in the device. When |
| 175 | +used with ENQCMD to submit work, the device can distinguish the process |
| 176 | +submitting the work since it will include the PASID assigned to that |
| 177 | +process. This helps the device scale to a large number of processes. |
| 178 | + |
| 179 | +* Is this the same as a user space device driver? |
| 180 | + |
| 181 | +Communicating with the device via the shared workqueue is much simpler |
| 182 | +than a full-blown user space driver. The kernel driver does all the |
| 183 | +initialization of the hardware. User space only needs to worry about |
| 184 | +submitting work and processing completions. |
| 185 | + |
| 186 | +* Is this the same as SR-IOV? |
| 187 | + |
| 188 | +Single Root I/O Virtualization (SR-IOV) focuses on providing independent |
| 189 | +hardware interfaces for virtualizing hardware. Hence, each one is required to be an |
| 190 | +almost fully functional interface to software, supporting the traditional |
| 191 | +BARs, space for interrupts via MSI-X, and its own register layout. |
| 192 | +Virtual Functions (VFs) are assisted by the Physical Function (PF) |
| 193 | +driver. |
| 194 | + |
| 195 | +Scalable I/O Virtualization builds on the PASID concept to create device |
| 196 | +instances for virtualization. SIOV requires host software to assist in |
| 197 | +creating virtual devices; each virtual device is represented by a PASID |
| 198 | +along with the bus/device/function of the device. This allows device |
| 199 | +hardware to optimize device resource creation, and resources can grow dynamically on |
| 200 | +demand. By contrast, SR-IOV creation and management is very static in nature. Consult |
| 201 | +the references below for more details. |
| 202 | + |
| 203 | +* Why not just create a virtual function for each app? |
| 204 | + |
| 205 | +Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require |
| 206 | +duplicated hardware for PCI config space and interrupts such as MSI-X. |
| 207 | +Resources such as interrupts have to be hard partitioned between VFs at |
| 208 | +creation time, and cannot scale dynamically on demand. The VFs are not |
| 209 | +completely independent from the Physical Function (PF). Most VFs require |
| 210 | +some communication and assistance from the PF driver. SIOV, in contrast, |
| 211 | +creates a software-defined device where all the configuration and control |
| 212 | +aspects are mediated via the slow path. The work submission and completion |
| 213 | +happen without any mediation. |
| 214 | + |
| 215 | +* Does this support virtualization? |
| 216 | + |
| 217 | +ENQCMD can be used from within a guest VM. In these cases, the VMM helps |
| 218 | +with setting up a translation table to translate from Guest PASID to Host |
| 219 | +PASID. Please consult the ENQCMD instruction set reference for more |
| 220 | +details. |
| 221 | + |
| 222 | +* Does memory need to be pinned? |
| 223 | + |
| 224 | +When devices support SVA and platform hardware such as the IOMMU |
| 225 | +supports such devices, there is no need to pin memory for DMA purposes. |
| 226 | +Devices that support SVA also support other PCIe features that remove the |
| 227 | +pinning requirement for memory. |
| 228 | + |
| 229 | +Device TLB support - Device requests the IOMMU to look up an address before |
| 230 | +use via Address Translation Service (ATS) requests. If the mapping exists |
| 231 | +but there is no page allocated by the OS, IOMMU hardware returns that no |
| 232 | +mapping exists. |
| 233 | + |
| 234 | +The device requests the virtual address to be mapped via the Page Request |
| 235 | +Interface (PRI). Once the OS has successfully completed the mapping, it |
| 236 | +returns the response back to the device. The device then requests the |
| 237 | +translation again and continues. |
| 238 | + |
| 239 | +The IOMMU works with the OS in managing consistency of page-tables with the |
| 240 | +device. When removing pages, it interacts with the device to remove any |
| 241 | +device TLB entry that might have been cached before removing the mappings from |
| 242 | +the OS. |
| 243 | + |
| 244 | +References |
| 245 | +========== |
| 246 | + |
| 247 | +VT-D: |
| 248 | +https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d |
| 249 | + |
| 250 | +SIOV: |
| 251 | +https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux |
| 252 | + |
| 253 | +ENQCMD in ISE: |
| 254 | +https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf |
| 255 | + |
| 256 | +DSA spec: |
| 257 | +https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf |