
Commit 87fdeaf

[Docs] Add code generation architecture documentation (#19396)
This PR adds an architecture document explaining how PrimFuncs are compiled to executable code. It covers the codegen's position in the `tvm.compile()` pipeline, the target dispatch mechanism (`target.build.<kind>` FFI lookup), and the two codegen families.
1 parent f9787d1 commit 87fdeaf

2 files changed

Lines changed: 309 additions & 0 deletions

File tree

docs/arch/codegen.rst

Lines changed: 306 additions & 0 deletions
@@ -0,0 +1,306 @@
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.

.. _codegen-arch:

Code Generation
===============

Code generation is the final stage of the TVM compilation pipeline: it translates TIR
``PrimFunc``\ s into executable code for a target device. This document explains how TIR
functions become native CPU instructions, GPU kernels, or source-code strings, covering the
target dispatch mechanism, the two codegen families (LLVM and Source), and the runtime module
system that wraps the generated code.

Where Codegen Fits
------------------

When a user calls ``tvm.compile()``, compilation proceeds in two phases:

1. **Relax phase**: the Relax pipeline optimizes and fuses the computational graph, then
   ``VMCodeGen`` translates Relax functions into VM bytecode (see :ref:`relax-vm-arch`).
2. **TIR phase**: TIR ``PrimFunc``\ s (the actual compute kernels) are compiled to native code.

The TIR phase is handled internally by ``tirx.build()`` (called from ``relax.build()``).
It performs these steps:

.. code-block:: text

   TIR PrimFuncs (in IRModule)
        │
        ▼ TIR pipeline               ← lowering passes (flatten buffers, lower intrinsics, etc.)
   TIR PrimFuncs (lowered)
        │
        ▼ split_host_device_mods()   ← separate host and device functions
   Host IRModule    +    Device IRModule(s)
        │                     │
        ▼                     ▼
   codegen_build()      codegen_build()       ← target-specific code generation
        │                     │
        ▼                     ▼
   Host Module          Device Module(s)
        │                     │
        ▼ import_module()     │
   Host Module ◄──────────────┘               ← device modules imported into host
        │
        ▼ (returned to relax.build for linking with VM bytecode)

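The flow above can be sketched in a few lines of Python. This is an illustrative stand-in only: plain dicts model ``IRModule`` and ``runtime.Module``, and the function names merely mirror the real ``tirx`` helpers.

```python
# Illustrative sketch of the tirx.build() flow (plain dicts stand in for
# IRModule and runtime.Module; these are hypothetical mocks, not the TVM API).

HOST_KINDS = ("llvm", "c")

def split_host_device_mods(funcs):
    """Partition PrimFuncs by their target kind (host = llvm/c)."""
    host = {n: t for n, t in funcs.items() if t in HOST_KINDS}
    device = {n: t for n, t in funcs.items() if t not in HOST_KINDS}
    return host, device

def codegen_build(funcs, kind):
    """Stand-in for codegen::Build(): 'compile' each function for one target."""
    return {"kind": kind, "funcs": list(funcs), "imports": []}

def build(funcs, device_kind="cuda"):
    host_funcs, dev_funcs = split_host_device_mods(funcs)
    host_mod = codegen_build(host_funcs, "llvm")
    if dev_funcs:
        dev_mod = codegen_build(dev_funcs, device_kind)
        host_mod["imports"].append(dev_mod)   # import_module(): device under host
    return host_mod

mod = build({"main": "llvm", "matmul_kernel": "cuda"})
print(mod["funcs"], [m["kind"] for m in mod["imports"]])   # ['main'] ['cuda']
```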
Target Dispatch
---------------

The core dispatch logic lives in ``codegen::Build()`` (``src/target/codegen.cc``), which is
called from the Python-side ``codegen_build()`` in ``tirx/build.py``. It selects the correct
backend based on the ``Target`` object:

.. code-block:: cpp

   ffi::Module Build(IRModule mod, Target target) {
     std::string build_f_name = "target.build." + target->kind->name;
     const auto bf = tvm::ffi::Function::GetGlobal(build_f_name);
     return (*bf)(mod, target).cast<ffi::Module>();
   }

Each backend registers its build function via FFI:

.. list-table::
   :header-rows: 1
   :widths: 25 30 45

   * - FFI Key
     - Backend
     - Codegen Class
   * - ``target.build.llvm``
     - CPU (x86, ARM, etc.)
     - ``CodeGenCPU`` (→ LLVM IR → machine code)
   * - ``target.build.cuda``
     - NVIDIA GPU
     - ``CodeGenCUDA`` (→ CUDA C → PTX/cubin)
   * - ``target.build.rocm``
     - AMD GPU
     - ``CodeGenAMDGPU`` (→ LLVM IR → AMDGPU ISA)
   * - ``target.build.nvptx``
     - NVIDIA PTX
     - ``CodeGenNVPTX`` (→ LLVM IR → PTX)
   * - ``target.build.metal``
     - Apple GPU
     - ``CodeGenMetal`` (→ Metal Shading Language)
   * - ``target.build.opencl``
     - OpenCL devices
     - ``CodeGenOpenCL`` (→ OpenCL C)
   * - ``target.build.vulkan``
     - Vulkan devices
     - ``CodeGenSPIRV`` (→ SPIR-V binary)
   * - ``target.build.webgpu``
     - WebGPU
     - ``CodeGenWebGPU`` (→ WGSL)
   * - ``target.build.c``
     - C host code
     - ``CodeGenCHost`` (→ C source)

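The string-keyed lookup that drives this dispatch can be mimicked in plain Python. The sketch below illustrates the pattern only; a dict stands in for the global FFI registry, and the names are hypothetical.

```python
# Sketch of the "target.build.<kind>" dispatch pattern: a plain dict stands in
# for the global FFI function registry.

REGISTRY = {}

def register(name):
    """Register a build function under an FFI-style string key."""
    def deco(fn):
        REGISTRY[name] = fn
        return fn
    return deco

@register("target.build.llvm")
def build_llvm(mod, target):
    return f"LLVMModule({mod})"

@register("target.build.cuda")
def build_cuda(mod, target):
    return f"CUDAModule({mod})"

def build(mod, target_kind):
    """Mirror of codegen::Build(): compose the key, look it up, call it."""
    key = "target.build." + target_kind
    if key not in REGISTRY:
        raise RuntimeError(f"no codegen registered for {key}")
    return REGISTRY[key](mod, target_kind)

print(build("my_mod", "cuda"))   # CUDAModule(my_mod)
```

Registering backends under string keys keeps the core dispatcher free of compile-time dependencies on any particular backend: a backend that is not compiled in simply never registers its key.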
Two Codegen Families
--------------------

TVM has two families of code generators, corresponding to two fundamentally different
strategies for producing executable code:

.. code-block:: text

   LLVM Family                       Source Family
   ───────────                       ─────────────
   TIR → LLVM IR → machine code      TIR → source string → external compiler
   (in-process, JIT or AOT)          (CUDA C, OpenCL C, Metal, WGSL)

LLVM family
~~~~~~~~~~~

``CodeGenLLVM`` (``src/target/llvm/codegen_llvm.h``) translates TIR directly to LLVM IR using
the LLVM C++ API. The generated ``llvm::Module`` is then compiled to native code by LLVM's
backends (x86, ARM, NVPTX, AMDGPU, etc.).

**Inheritance**:

.. code-block:: text

   CodeGenLLVM (base)
    ├── CodeGenCPU       ← x86, ARM (target.build.llvm)
    │    └── CodeGenHexagon
    ├── CodeGenNVPTX     ← NVIDIA PTX via LLVM (target.build.nvptx)
    └── CodeGenAMDGPU    ← AMD GPU via LLVM (target.build.rocm)

``CodeGenLLVM`` inherits from both ``ExprFunctor<llvm::Value*(const PrimExpr&)>`` and
``StmtFunctor<void(const Stmt&)>``. Each TIR node type has a corresponding visitor:

- **Expressions** (``VisitExpr_``) convert TIR expressions to LLVM ``Value``\ s:
  arithmetic ops → LLVM binary instructions, ``BufferLoad`` → load with pointer arithmetic,
  ``Cast`` → LLVM type conversions, ``Call`` → intrinsic or extern function calls.
- **Statements** (``VisitStmt_``) emit LLVM IR side effects:
  ``BufferStore`` → store instructions, ``For`` → loop basic blocks with branches,
  ``IfThenElse`` → conditional branches, ``AllocBuffer`` → stack or heap allocation.

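The visitor dispatch described above can be illustrated with a toy expression lowerer. This is a Python stand-in (the real ``CodeGenLLVM`` dispatches on C++ node types and returns ``llvm::Value*``); the node and method names here are invented for the example.

```python
# Toy ExprFunctor-style visitor: dispatch on node type, emit pseudo-IR
# instructions, and return the "value" (a virtual register name).
from dataclasses import dataclass

@dataclass
class IntImm:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Add:
    a: object
    b: object

@dataclass
class Mul:
    a: object
    b: object

class ToyCodeGen:
    def __init__(self):
        self.instrs = []   # emitted pseudo-IR
        self.tmp = 0

    def fresh(self):
        self.tmp += 1
        return f"%{self.tmp}"

    def visit(self, e):
        # Like VisitExpr_: one handler per node type, chosen by type name.
        return getattr(self, f"visit_{type(e).__name__}")(e)

    def visit_IntImm(self, e):
        return str(e.value)

    def visit_Var(self, e):
        return f"%{e.name}"

    def _binop(self, op, e):
        a, b = self.visit(e.a), self.visit(e.b)
        r = self.fresh()
        self.instrs.append(f"{r} = {op} {a}, {b}")
        return r

    def visit_Add(self, e):
        return self._binop("add", e)

    def visit_Mul(self, e):
        return self._binop("mul", e)

cg = ToyCodeGen()
cg.visit(Add(Mul(Var("x"), IntImm(2)), IntImm(3)))
print(cg.instrs)   # ['%1 = mul %x, 2', '%2 = add %1, 3']
```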
The key methods on ``CodeGenLLVM`` are:

- ``Create(LLVMTarget*)`` — factory that returns a target-specific subclass.
- ``Init(...)`` — set up the LLVM context, module, and builder.
- ``DeclareFunction(gvar, f)`` / ``AddFunction(gvar, f)`` — forward-declare, then compile a
  ``PrimFunc`` to LLVM IR.
- ``Finish()`` — return the completed ``llvm::Module``.

Source family
~~~~~~~~~~~~~

``CodeGenC`` (``src/target/source/codegen_c.h``) generates C-like source code as text. Each
target subclass overrides methods to emit target-specific syntax.

**Inheritance**:

.. code-block:: text

   CodeGenC (base)
    ├── CodeGenCUDA      ← CUDA C (target.build.cuda)
    ├── CodeGenOpenCL    ← OpenCL C (target.build.opencl)
    ├── CodeGenMetal     ← Metal Shading Language (target.build.metal)
    ├── CodeGenWebGPU    ← WGSL (target.build.webgpu)
    └── CodeGenCHost     ← C host code (target.build.c)

``CodeGenC`` also uses the visitor pattern (``ExprFunctor`` and ``StmtFunctor``), but writes to
a ``std::ostream`` instead of constructing LLVM IR. Subclasses override target-specific methods:

- ``PrintStorageScope(scope, os)`` — emit memory qualifiers (e.g., ``__shared__`` for CUDA,
  ``__local`` for OpenCL).
- ``BindThreadIndex(iv)`` — emit thread index bindings (e.g., ``threadIdx.x``, ``blockIdx.y``).
- ``PrintType(dtype, os)`` — emit target-specific type names (e.g., ``half`` for float16).
- ``PrintVecBinaryOp(...)`` — emit vectorized operations in target syntax.

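The override pattern can be sketched as follows. This is an illustrative Python mock (the real classes are C++ and print into a ``std::ostream``); the class and method names below are invented for the example.

```python
# Sketch of the CodeGenC override pattern: a base "source codegen" exposes
# target hooks, and subclasses emit different qualifiers and type names.

class BaseSourceCodeGen:
    def storage_scope(self, scope):        # cf. PrintStorageScope
        return ""
    def type_name(self, dtype):            # cf. PrintType
        return {"float32": "float"}[dtype]
    def emit_decl(self, scope, dtype, name):
        qual = self.storage_scope(scope)
        prefix = qual + " " if qual else ""
        return f"{prefix}{self.type_name(dtype)} {name};"

class CudaCodeGen(BaseSourceCodeGen):
    def storage_scope(self, scope):
        return "__shared__" if scope == "shared" else ""
    def type_name(self, dtype):
        return {"float32": "float", "float16": "half"}[dtype]

class OpenCLCodeGen(BaseSourceCodeGen):
    def storage_scope(self, scope):
        return "__local" if scope == "shared" else ""

print(CudaCodeGen().emit_decl("shared", "float16", "buf"))    # __shared__ half buf;
print(OpenCLCodeGen().emit_decl("shared", "float32", "buf"))  # __local float buf;
```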
For CUDA, the build flow (``BuildCUDA`` in ``src/target/opt/build_cuda_on.cc``) is:

1. ``CodeGenCUDA`` generates CUDA C source.
2. An optional post-processing callback (``tvm_callback_cuda_postproc``) transforms the source.
3. A Python callback (``tvm_callback_cuda_compile``) compiles the source to PTX or cubin via
   NVRTC or NVCC.
4. The result is wrapped in a ``CUDAModule``.

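The callback-driven flow above can be sketched with plain functions. This is a hypothetical stand-in: the real hooks are FFI-registered functions named ``tvm_callback_cuda_postproc`` and ``tvm_callback_cuda_compile``, and the compile step here merely tags the source instead of invoking NVRTC/NVCC.

```python
# Sketch of the BuildCUDA flow: codegen -> optional postproc -> compile -> wrap.
# The compile step is a stand-in for NVRTC/NVCC; it just tags the source.

def codegen_cuda(mod):
    return f"__global__ void {mod}_kernel() {{ }}"

def build_cuda(mod, postproc=None, compile_fn=None):
    src = codegen_cuda(mod)                               # 1. generate CUDA C
    if postproc:                                          # 2. optional source rewrite
        src = postproc(src)
    binary = (compile_fn or (lambda s: "PTX:" + s))(src)  # 3. "compile" to PTX/cubin
    return {"type": "CUDAModule", "data": binary}         # 4. wrap in a module

m = build_cuda("gemm", postproc=lambda s: "// tuned\n" + s)
print(m["data"].startswith("PTX:// tuned"))   # True
```

Because steps 2 and 3 are pluggable callbacks, users can inject hand-tuned source rewrites or swap the compiler without touching the codegen itself.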
Design choice
~~~~~~~~~~~~~

Why two families?

- **LLVM family** produces higher-quality code — LLVM applies its own optimization passes
  (instruction selection, register allocation, vectorization). Best for CPU targets, where TVM
  has full control over compilation.
- **Source family** is more portable — it generates human-readable source that can be compiled
  by vendor toolchains (NVCC, the Metal compiler, etc.). This is necessary for GPU targets,
  where the vendor compiler handles device-specific optimizations, and it fits the runtime
  compilation model (e.g., NVRTC for CUDA, runtime shader compilation for Metal/OpenCL).


Host/Device Split
-----------------

When compiling for GPU targets, TIR functions are split into two categories:

- **Host functions** — run on the CPU. They set up kernel launch parameters (grid/block
  dimensions), allocate memory, and invoke device kernels. Compiled with ``target.build.llvm``
  or ``target.build.c``.
- **Device functions** — the actual compute kernels that run on the GPU. Compiled with the
  target-specific codegen (``target.build.cuda``, etc.).

``split_host_device_mods()`` (``python/tvm/tirx/build.py``) separates functions by their
``target`` attribute: functions whose target kind is ``"llvm"`` or ``"c"`` go to the host
module; all others go to device modules grouped by target.

After compilation, device modules are imported into the host module via ``import_module()``,
forming a module tree. At runtime, the host module dispatches to the imported device module
when a device kernel is called.


Runtime Modules
---------------

Each codegen produces a ``runtime.Module`` — the container that holds the generated code and
exposes it as callable ``PackedFunc``\ s.

.. list-table::
   :header-rows: 1
   :widths: 20 35 45

   * - Module Type
     - How Code Is Stored
     - How Code Is Executed
   * - ``LLVMModule``
     - LLVM IR (in-memory ``llvm::Module``)
     - JIT-compiled on first call (MCJIT or ORC). Function pointers are cached for subsequent
       calls.
   * - ``CUDAModule``
     - PTX or cubin binary
     - Loaded via the CUDA driver API (``cuModuleLoad``). Kernels launched via
       ``cuLaunchKernel``.
   * - ``CSourceModule``
     - C source string
     - Not directly executable. Used as a build artifact for AOT compilation.
   * - ``DeviceSourceModule``
     - Device source string (OpenCL C, Metal, WGSL)
     - Compiled at runtime by the device driver (e.g., ``clCreateProgramWithSource``).

All module types implement the same interface: ``GetFunction(name)`` returns a ``PackedFunc``
that can be called from Python or C++. The VM and other runtime components use this interface
to invoke compiled kernels without knowing which backend produced them.

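The uniform interface and module tree can be sketched as follows. This is an illustrative Python mock (real modules return ``PackedFunc`` objects, and the method here is named ``get_function`` rather than the C++ ``GetFunction``).

```python
# Sketch of the runtime.Module interface: a function lookup that searches the
# module itself, then recursively searches imported (device) modules.

class Module:
    def __init__(self, funcs, imports=()):
        self.funcs = dict(funcs)          # name -> callable
        self.imports = list(imports)      # imported device modules

    def get_function(self, name):
        if name in self.funcs:
            return self.funcs[name]
        for sub in self.imports:          # walk the module tree
            fn = sub.get_function(name)
            if fn is not None:
                return fn
        return None

device = Module({"matmul_kernel": lambda: "ran on device"})
host = Module({"main": lambda: "ran on host"}, imports=[device])

print(host.get_function("matmul_kernel")())   # ran on device
```

The recursive lookup is what lets callers treat the whole tree as one module: a kernel compiled by any backend is reachable through the host module's single entry point.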
The module tree is serializable via ``export_library()``, which packs the host module and all
imported device modules into a single shared library (``.so`` / ``.dll`` / ``.dylib``) or
a tar archive for deployment.


Source Code Map
---------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Path
     - Contents
   * - ``python/tvm/tirx/build.py``
     - ``tirx.build()``: TIR compilation entry, host/device split, module linking
   * - ``src/target/codegen.cc``
     - ``codegen::Build()``: target dispatch via ``"target.build.<kind>"``
   * - ``src/target/llvm/codegen_llvm.h``
     - ``CodeGenLLVM``: TIR → LLVM IR base class
   * - ``src/target/llvm/codegen_cpu.h``
     - ``CodeGenCPU``: CPU-specific LLVM codegen (x86, ARM)
   * - ``src/target/llvm/codegen_nvptx.cc``
     - ``CodeGenNVPTX``: NVIDIA PTX via LLVM
   * - ``src/target/llvm/codegen_amdgpu.cc``
     - ``CodeGenAMDGPU``: AMD GPU via LLVM
   * - ``src/target/llvm/llvm_module.cc``
     - ``LLVMModuleNode``: runtime module with JIT compilation
   * - ``src/target/source/codegen_c.h``
     - ``CodeGenC``: TIR → C-like source base class
   * - ``src/target/source/codegen_cuda.h``
     - ``CodeGenCUDA``: TIR → CUDA C
   * - ``src/target/source/codegen_opencl.h``
     - ``CodeGenOpenCL``: TIR → OpenCL C
   * - ``src/target/source/codegen_metal.h``
     - ``CodeGenMetal``: TIR → Metal Shading Language
   * - ``src/target/source/codegen_c_host.h``
     - ``CodeGenCHost``: TIR → C host code
   * - ``src/target/opt/build_cuda_on.cc``
     - ``BuildCUDA``: CUDA build flow (codegen → compile → module)
   * - ``src/target/spirv/codegen_spirv.h``
     - ``CodeGenSPIRV``: TIR → SPIR-V for Vulkan
   * - ``src/target/source/codegen_webgpu.h``
     - ``CodeGenWebGPU``: TIR → WGSL

docs/arch/index.rst

Lines changed: 3 additions & 0 deletions
@@ -122,6 +122,8 @@ The target translation phase transforms an IRModule to the corresponding target
 For backends such as x86 and ARM, we use the LLVM IRBuilder to build in-memory LLVM IR.
 We can also generate source-level languages such as CUDA C and OpenCL.
 Finally, we support direct translations of a Relax function (sub-graph) to specific targets via external code generators.
+See :ref:`codegen-arch` for how TIR functions are compiled to native code through the LLVM and
+Source codegen families.
 See :ref:`external-library-dispatch` for the full BYOC (Bring Your Own Codegen) pipeline that
 offloads operator subgraphs to vendor libraries like cuBLAS, CUTLASS, and cuDNN.
 It is important that the final code generation phase is as lightweight as possible. Vast majority of transformations
@@ -130,6 +132,7 @@ and lowering should be performed before the target translation phase.
 .. toctree::
    :maxdepth: 1

+   codegen
    external_library_dispatch

 We also provide a Target structure to specify the compilation target.
