Commit ac74075

Merge tag 'x86_pasid_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PASID updates from Borislav Petkov:

 "Initial support for sharing virtual addresses between the CPU and
  devices which doesn't need pinning of pages for DMA anymore.

  Add support for the command submission to devices using new x86
  instructions like ENQCMD{,S} and MOVDIR64B. In addition, add support
  for process address space identifiers (PASIDs) which are referenced by
  those command submission instructions along with the handling of the
  PASID state on context switch as another extended state.

  Work by Fenghua Yu, Ashok Raj, Yu-cheng Yu and Dave Jiang"

* tag 'x86_pasid_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction
  x86/asm: Carve out a generic movdir64b() helper for general usage
  x86/mmu: Allocate/free a PASID
  x86/cpufeatures: Mark ENQCMD as disabled when configured out
  mm: Add a pasid member to struct mm_struct
  x86/msr-index: Define an IA32_PASID MSR
  x86/fpu/xstate: Add supervisor PASID state for ENQCMD
  x86/cpufeatures: Enumerate ENQCMD and ENQCMDS instructions
  Documentation/x86: Add documentation for SVA (Shared Virtual Addressing)
  iommu/vt-d: Change flags type to unsigned int in binding mm
  drm, iommu: Change type of pasid to u32
2 parents 8b6591f + 7f5933f commit ac74075

52 files changed

Lines changed: 607 additions & 164 deletions

Documentation/x86/index.rst

Lines changed: 1 addition & 0 deletions
@@ -30,3 +30,4 @@ x86-specific Documentation
 usb-legacy-support
 i386/index
 x86_64/index
+sva

Documentation/x86/sva.rst

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+Shared Virtual Addressing (SVA) with ENQCMD
+===========================================
+
+Background
+==========
+
+Shared Virtual Addressing (SVA) allows the processor and device to use the
+same virtual addresses avoiding the need for software to translate virtual
+addresses to physical addresses. SVA is what PCIe calls Shared Virtual
+Memory (SVM).
+
+In addition to the convenience of using application virtual addresses
+by the device, it also doesn't require pinning pages for DMA.
+PCIe Address Translation Services (ATS) along with Page Request Interface
+(PRI) allow devices to function much the same way as the CPU handling
+application page-faults. For more information please refer to the PCIe
+specification Chapter 10: ATS Specification.
+
+Use of SVA requires IOMMU support in the platform. IOMMU is also
+required to support the PCIe features ATS and PRI. ATS allows devices
+to cache translations for virtual addresses. The IOMMU driver uses the
+mmu_notifier() support to keep the device TLB cache and the CPU cache in
+sync. When an ATS lookup fails for a virtual address, the device should
+use the PRI in order to request the virtual address to be paged into the
+CPU page tables. The device must use ATS again in order the fetch the
+translation before use.
+
+Shared Hardware Workqueues
+==========================
+
+Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
+the use of Shared Work Queues (SWQ) by both applications and Virtual
+Machines (VM's). This allows better hardware utilization vs. hard
+partitioning resources that could result in under utilization. In order to
+allow the hardware to distinguish the context for which work is being
+executed in the hardware by SWQ interface, SIOV uses Process Address Space
+ID (PASID), which is a 20-bit number defined by the PCIe SIG.
+
+PASID value is encoded in all transactions from the device. This allows the
+IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe
+Resource Identifier (RID) which is the Bus/Device/Function.
+
+
+ENQCMD
+======
+
+ENQCMD is a new instruction on Intel platforms that atomically submits a
+work descriptor to a device. The descriptor includes the operation to be
+performed, virtual addresses of all parameters, virtual address of a completion
+record, and the PASID (process address space ID) of the current process.
+
+ENQCMD works with non-posted semantics and carries a status back if the
+command was accepted by hardware. This allows the submitter to know if the
+submission needs to be retried or other device specific mechanisms to
+implement fairness or ensure forward progress should be provided.
+
+ENQCMD is the glue that ensures applications can directly submit commands
+to the hardware and also permits hardware to be aware of application context
+to perform I/O operations via use of PASID.
+
+Process Address Space Tagging
+=============================
+
+A new thread-scoped MSR (IA32_PASID) provides the connection between
+user processes and the rest of the hardware. When an application first
+accesses an SVA-capable device, this MSR is initialized with a newly
+allocated PASID. The driver for the device calls an IOMMU-specific API
+that sets up the routing for DMA and page-requests.
+
+For example, the Intel Data Streaming Accelerator (DSA) uses
+iommu_sva_bind_device(), which will do the following:
+
+- Allocate the PASID, and program the process page-table (%cr3 register) in the
+  PASID context entries.
+- Register for mmu_notifier() to track any page-table invalidations to keep
+  the device TLB in sync. For example, when a page-table entry is invalidated,
+  the IOMMU propagates the invalidation to the device TLB. This will force any
+  future access by the device to this virtual address to participate in
+  ATS. If the IOMMU responds with proper response that a page is not
+  present, the device would request the page to be paged in via the PCIe PRI
+  protocol before performing I/O.
+
+This MSR is managed with the XSAVE feature set as "supervisor state" to
+ensure the MSR is updated during context switch.
+
+PASID Management
+================
+
+The kernel must allocate a PASID on behalf of each process which will use
+ENQCMD and program it into the new MSR to communicate the process identity to
+platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests
+from this process. When a user submits a work descriptor to a device using the
+ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
+value from MSR_IA32_PASID. Requests for DMA from the device are also tagged
+with the same PASID. The platform IOMMU uses the PASID in the transaction to
+perform address translation. The IOMMU APIs setup the corresponding PASID
+entry in IOMMU with the process address used by the CPU (e.g. %cr3 register in
+x86).
+
+The MSR must be configured on each logical CPU before any application
+thread can interact with a device. Threads that belong to the same
+process share the same page tables, thus the same MSR value.
+
+PASID is cleared when a process is created. The PASID allocation and MSR
+programming may occur long after a process and its threads have been created.
+One thread must call iommu_sva_bind_device() to allocate the PASID for the
+process. If a thread uses ENQCMD without the MSR first being populated, a #GP
+will be raised. The kernel will update the PASID MSR with the PASID for all
+threads in the process. A single process PASID can be used simultaneously
+with multiple devices since they all share the same address space.
+
+One thread can call iommu_sva_unbind_device() to free the allocated PASID.
+The kernel will clear the PASID MSR for all threads belonging to the process.
+
+New threads inherit the MSR value from the parent.
+
+Relationships
+=============
+
+* Each process has many threads, but only one PASID.
+* Devices have a limited number (~10's to 1000's) of hardware workqueues.
+  The device driver manages allocating hardware workqueues.
+* A single mmap() maps a single hardware workqueue as a "portal" and
+  each portal maps down to a single workqueue.
+* For each device with which a process interacts, there must be
+  one or more mmap()'d portals.
+* Many threads within a process can share a single portal to access
+  a single device.
+* Multiple processes can separately mmap() the same portal, in
+  which case they still share one device hardware workqueue.
+* The single process-wide PASID is used by all threads to interact
+  with all devices. There is not, for instance, a PASID for each
+  thread or each thread<->device pair.
+
+FAQ
+===
+
+* What is SVA/SVM?
+
+Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
+work in the same address space, i.e., to share it. Some call it Shared
+Virtual Memory (SVM), but Linux community wanted to avoid confusing it with
+POSIX Shared Memory and Secure Virtual Machines which were terms already in
+circulation.
+
+* What is a PASID?
+
+A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
+(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
+PASID is included in all transactions between the platform and the device.
+
+* How are shared workqueues different?
+
+Traditionally, in order for userspace applications to interact with hardware,
+there is a separate hardware instance required per process. For example,
+consider doorbells as a mechanism of informing hardware about work to process.
+Each doorbell is required to be spaced 4k (or page-size) apart for process
+isolation. This requires hardware to provision that space and reserve it in
+MMIO. This doesn't scale as the number of threads becomes quite large. The
+hardware also manages the queue depth for Shared Work Queues (SWQ), and
+consumers don't need to track queue depth. If there is no space to accept
+a command, the device will return an error indicating retry.
+
+A user should check Deferrable Memory Write (DMWr) capability on the device
+and only submits ENQCMD when the device supports it. In the new DMWr PCIe
+terminology, devices need to support DMWr completer capability. In addition,
+it requires all switch ports to support DMWr routing and must be enabled by
+the PCIe subsystem, much like how PCIe atomic operations are managed for
+instance.
+
+SWQ allows hardware to provision just a single address in the device. When
+used with ENQCMD to submit work, the device can distinguish the process
+submitting the work since it will include the PASID assigned to that
+process. This helps the device scale to a large number of processes.
+
+* Is this the same as a user space device driver?
+
+Communicating with the device via the shared workqueue is much simpler
+than a full blown user space driver. The kernel driver does all the
+initialization of the hardware. User space only needs to worry about
+submitting work and processing completions.
+
+* Is this the same as SR-IOV?
+
+Single Root I/O Virtualization (SR-IOV) focuses on providing independent
+hardware interfaces for virtualizing hardware. Hence, it's required to be
+almost fully functional interface to software supporting the traditional
+BARs, space for interrupts via MSI-X, its own register layout.
+Virtual Functions (VFs) are assisted by the Physical Function (PF)
+driver.
+
+Scalable I/O Virtualization builds on the PASID concept to create device
+instances for virtualization. SIOV requires host software to assist in
+creating virtual devices; each virtual device is represented by a PASID
+along with the bus/device/function of the device. This allows device
+hardware to optimize device resource creation and can grow dynamically on
+demand. SR-IOV creation and management is very static in nature. Consult
+references below for more details.
+
+* Why not just create a virtual function for each app?
+
+Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
+duplicated hardware for PCI config space and interrupts such as MSI-X.
+Resources such as interrupts have to be hard partitioned between VFs at
+creation time, and cannot scale dynamically on demand. The VFs are not
+completely independent from the Physical Function (PF). Most VFs require
+some communication and assistance from the PF driver. SIOV, in contrast,
+creates a software-defined device where all the configuration and control
+aspects are mediated via the slow path. The work submission and completion
+happen without any mediation.
+
+* Does this support virtualization?
+
+ENQCMD can be used from within a guest VM. In these cases, the VMM helps
+with setting up a translation table to translate from Guest PASID to Host
+PASID. Please consult the ENQCMD instruction set reference for more
+details.
+
+* Does memory need to be pinned?
+
+When devices support SVA along with platform hardware such as IOMMU
+supporting such devices, there is no need to pin memory for DMA purposes.
+Devices that support SVA also support other PCIe features that remove the
+pinning requirement for memory.
+
+Device TLB support - Device requests the IOMMU to lookup an address before
+use via Address Translation Service (ATS) requests. If the mapping exists
+but there is no page allocated by the OS, IOMMU hardware returns that no
+mapping exists.
+
+Device requests the virtual address to be mapped via Page Request
+Interface (PRI). Once the OS has successfully completed the mapping, it
+returns the response back to the device. The device requests again for
+a translation and continues.
+
+IOMMU works with the OS in managing consistency of page-tables with the
+device. When removing pages, it interacts with the device to remove any
+device TLB entry that might have been cached before removing the mappings from
+the OS.
+
+References
+==========
+
+VT-D:
+https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d
+
+SIOV:
+https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux
+
+ENQCMD in ISE:
+https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
+
+DSA spec:
+https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
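
Note on the ENQCMD section of sva.rst: the text says the instruction reports whether the device accepted the descriptor, so that the submitter knows when to retry. A minimal user-space sketch of a bounded retry loop follows. It does not execute ENQCMD. The `submit_fn` callback, `submit_with_retry()`, and the mock queue are all hypothetical names introduced here, and the "0 means accepted" convention is an assumption for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one ENQCMD attempt: returns 0 when the device
 * accepted the descriptor, nonzero when the queue was full ("retry"). */
typedef int (*submit_fn)(void *portal, const void *desc);

/* Try to submit a descriptor up to max_tries times.
 * Returns the number of attempts used on success, or -1 if every
 * attempt was rejected and the caller should back off. */
static inline int submit_with_retry(submit_fn submit, void *portal,
                                    const void *desc, int max_tries)
{
    assert(submit != NULL && max_tries > 0);

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (submit(portal, desc) == 0)
            return attempt;
    }
    return -1;
}

/* Mock shared work queue: rejects submissions until `busy` reaches zero. */
struct mock_queue {
    int busy;
};

static inline int mock_submit(void *portal, const void *desc)
{
    struct mock_queue *q = portal;

    (void)desc;             /* the mock ignores the descriptor contents */
    if (q->busy > 0) {
        q->busy--;
        return 1;           /* queue full, caller should retry */
    }
    return 0;               /* accepted */
}
```

Bounding the retries keeps a saturated queue from spinning a thread forever; what to do after `-1` (sleep, yield, fall back to software) is a policy choice the document leaves to device-specific mechanisms.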

arch/x86/include/asm/cpufeatures.h

Lines changed: 1 addition & 0 deletions
@@ -353,6 +353,7 @@
 #define X86_FEATURE_CLDEMOTE (16*32+25) /* CLDEMOTE instruction */
 #define X86_FEATURE_MOVDIRI (16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B (16*32+28) /* MOVDIR64B instruction */
+#define X86_FEATURE_ENQCMD (16*32+29) /* ENQCMD and ENQCMDS instructions */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */
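
Note on the feature number: the `(16*32+29)` encoding packs a capability word index and a bit position into one integer. A small sketch of the decoding is below; `feature_word()` and `feature_bit()` are hypothetical helper names, not kernel functions. That word 16 mirrors CPUID leaf 7 (subleaf 0) ECX is stated by a comment elsewhere in the kernel header rather than in this hunk, so treat it as background, not as something this diff shows.

```c
#include <assert.h>

/* Copied from the hunk above: capability word 16, bit 29. */
#define X86_FEATURE_ENQCMD (16*32+29)

/* Index of the 32-bit capability word that holds the feature bit. */
static inline unsigned int feature_word(unsigned int feature)
{
    return feature / 32;
}

/* Bit position of the feature inside its capability word. */
static inline unsigned int feature_bit(unsigned int feature)
{
    return feature % 32;
}

/* The packed value is just word * 32 + bit. */
static_assert(X86_FEATURE_ENQCMD == 541, "16*32+29 == 541");
```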

arch/x86/include/asm/disabled-features.h

Lines changed: 8 additions & 1 deletion
@@ -56,6 +56,12 @@
 # define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
 #endif
 
+#ifdef CONFIG_IOMMU_SUPPORT
+# define DISABLE_ENQCMD 0
+#else
+# define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -75,7 +81,8 @@
 #define DISABLED_MASK13 0
 #define DISABLED_MASK14 0
 #define DISABLED_MASK15 0
-#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
+#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
+			 DISABLE_ENQCMD)
 #define DISABLED_MASK17 0
 #define DISABLED_MASK18 0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
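
Note on the mask arithmetic: with IOMMU support configured out, `DISABLE_ENQCMD` becomes a one-bit mask at position `X86_FEATURE_ENQCMD & 31`, which is then OR-ed into the disabled mask for word 16. The sketch below models that in user space. `HAVE_IOMMU_SUPPORT` is a hypothetical stand-in for `CONFIG_IOMMU_SUPPORT`, `feature_compiled_out()` is a simplified illustration rather than the kernel's actual feature-test macro, and `1u` is used instead of `1` only to keep the shift unsigned in this standalone example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_FEATURE_ENQCMD (16*32+29)

/* Mirrors the hunk: define HAVE_IOMMU_SUPPORT (hypothetical stand-in for
 * CONFIG_IOMMU_SUPPORT) to see the "feature stays enabled" case. */
#ifdef HAVE_IOMMU_SUPPORT
# define DISABLE_ENQCMD 0
#else
# define DISABLE_ENQCMD (1u << (X86_FEATURE_ENQCMD & 31))
#endif

static_assert((X86_FEATURE_ENQCMD & 31) == 29, "bit 29 within word 16");

/* Simplified model of a build-time check: a feature is compiled out
 * when its bit is set in the disabled mask of its capability word. */
static inline bool feature_compiled_out(uint32_t disabled_mask,
                                        unsigned int feature)
{
    return (disabled_mask >> (feature & 31)) & 1u;
}
```

Because the mask is a compile-time constant, a check against it can be resolved by the compiler, which is what lets ENQCMD-only code paths drop out of kernels built without IOMMU support.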

arch/x86/include/asm/fpu/api.h

Lines changed: 12 additions & 0 deletions
@@ -62,4 +62,16 @@ extern void switch_fpu_return(void);
  */
 extern int cpu_has_xfeatures(u64 xfeatures_mask, const char **feature_name);
 
+/*
+ * Tasks that are not using SVA have mm->pasid set to zero to note that they
+ * will not have the valid bit set in MSR_IA32_PASID while they are running.
+ */
+#define PASID_DISABLED 0
+
+#ifdef CONFIG_IOMMU_SUPPORT
+/* Update current's PASID MSR/state by mm's PASID. */
+void update_pasid(void);
+#else
+static inline void update_pasid(void) { }
+#endif
 #endif /* _ASM_X86_FPU_API_H */
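
Note on `PASID_DISABLED`: the comment ties a zero `mm->pasid` to a cleared valid bit in MSR_IA32_PASID. A minimal sketch of that encoding is below. The bit layout (PASID in bits 19:0, valid flag in bit 31) is an assumption taken from Intel's published MSR description and is not visible in this excerpt; `pasid_msr_value()`, `PASID_MASK`, and `PASID_VALID_BIT` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_DISABLED 0

/* Assumed layout, not shown in this excerpt: the PASID occupies
 * bits 19:0 and bit 31 is the valid bit of IA32_PASID. */
#define PASID_MASK      0xFFFFFu
#define PASID_VALID_BIT (UINT64_C(1) << 31)

/* Build the value a thread's IA32_PASID state would hold for mm_pasid. */
static inline uint64_t pasid_msr_value(uint32_t mm_pasid)
{
    assert(mm_pasid <= PASID_MASK);     /* PASIDs are 20-bit numbers */

    if (mm_pasid == PASID_DISABLED)
        return 0;                       /* valid bit stays clear */

    return (uint64_t)mm_pasid | PASID_VALID_BIT;
}
```

Reserving zero as "no PASID" means a freshly created process needs no extra flag: its MSR state is simply all zeroes until a bind allocates a real PASID.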

arch/x86/include/asm/fpu/internal.h

Lines changed: 7 additions & 0 deletions
@@ -583,6 +583,13 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 			pkru_val = pk->pkru;
 	}
 	__write_pkru(pkru_val);
+
+	/*
+	 * Expensive PASID MSR write will be avoided in update_pasid() because
+	 * TIF_NEED_FPU_LOAD was set. And the PASID state won't be updated
+	 * unless it's different from mm->pasid to reduce overhead.
+	 */
+	update_pasid();
 }
 
 /*
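
Note on the comment in switch_fpu_finish(): it describes two savings, skipping the MSR write while TIF_NEED_FPU_LOAD is set and touching the saved state only when the value changed. The body of update_pasid() is not part of this excerpt, so the following is only a user-space model of the behavior the comment describes. `struct task_model` and `update_pasid_model()` are hypothetical, and the counter exists purely to make the skipped write observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one task's PASID bookkeeping. */
struct task_model {
    bool need_fpu_load;       /* models TIF_NEED_FPU_LOAD */
    uint64_t saved_pasid;     /* PASID field in the saved xstate buffer */
    uint64_t msr;             /* models the live IA32_PASID register */
    unsigned int msr_writes;  /* counts the expensive register writes */
};

/* Bring the task's PASID state in line with `wanted`. */
static inline void update_pasid_model(struct task_model *t, uint64_t wanted)
{
    assert(t);

    if (t->need_fpu_load) {
        /* Registers get reloaded from the buffer before returning to
         * user space, so patching the buffer is enough. Skip the store
         * when the buffer already holds the wanted value. */
        if (t->saved_pasid != wanted)
            t->saved_pasid = wanted;
    } else {
        /* Live registers are current: write the MSR directly. */
        t->msr = wanted;
        t->msr_writes++;
    }
}
```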

arch/x86/include/asm/fpu/types.h

Lines changed: 10 additions & 1 deletion
@@ -114,7 +114,7 @@ enum xfeature {
 	XFEATURE_Hi16_ZMM,
 	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
 	XFEATURE_PKRU,
-	XFEATURE_RSRVD_COMP_10,
+	XFEATURE_PASID,
 	XFEATURE_RSRVD_COMP_11,
 	XFEATURE_RSRVD_COMP_12,
 	XFEATURE_RSRVD_COMP_13,
@@ -134,6 +134,7 @@ enum xfeature {
 #define XFEATURE_MASK_Hi16_ZMM (1 << XFEATURE_Hi16_ZMM)
 #define XFEATURE_MASK_PT (1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
+#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
 #define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
 
 #define XFEATURE_MASK_FPSSE (XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
@@ -256,6 +257,14 @@ struct arch_lbr_state {
 	struct lbr_entry entries[];
 } __packed;
 
+/*
+ * State component 10 is supervisor state used for context-switching the
+ * PASID state.
+ */
+struct ia32_pasid_state {
+	u64 pasid;
+} __packed;
+
 struct xstate_header {
 	u64 xfeatures;
 	u64 xcomp_bv;
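
Note on the new state component: replacing `XFEATURE_RSRVD_COMP_10` with `XFEATURE_PASID` gives the PASID state component number 10, so its mask is bit 10. The sketch below checks that arithmetic and the 8-byte size of the state structure in a standalone program. The enum is cut down to the two members needed here, `has_pasid_state()` is a hypothetical helper, and `__attribute__((packed))` is the GCC/Clang spelling assumed to correspond to the kernel's `__packed`.

```c
#include <assert.h>
#include <stdint.h>

/* Component numbers as implied by the hunk: PKRU is 9, PASID is 10. */
enum {
    XFEATURE_PKRU = 9,
    XFEATURE_PASID = 10,
};

#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)

/* Same shape as the structure added by the hunk. */
struct ia32_pasid_state {
    uint64_t pasid;
} __attribute__((packed));

static_assert(XFEATURE_MASK_PASID == 0x400, "component 10 is bit 10");
static_assert(sizeof(struct ia32_pasid_state) == 8, "one 64-bit field");

/* True when a saved-state bitmap (such as the xfeatures header field)
 * marks the PASID component as present. */
static inline int has_pasid_state(uint64_t xfeatures)
{
    return (xfeatures & XFEATURE_MASK_PASID) != 0;
}
```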

arch/x86/include/asm/fpu/xstate.h

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@
 	XFEATURE_MASK_BNDCSR)
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (0)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
 
 /*
  * A supervisor state component may not always contain valuable information,
