Background
Currently, each algorithm has a `cccl_device_<algo>_build` function that compiles a kernel via NVRTC and returns a build result struct holding the loaded `CUlibrary` and related state.
To support ahead-of-time (AoT) compilation — pre-compiling kernels and saving them to disk for use in a different process or on a different machine — the Python layer needs to serialize and deserialize these build result structs. This requires the C layer to expose enough metadata in the structs to make that possible.
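As a rough illustration of the round trip the Python layer would perform, here is a minimal sketch that packs and unpacks the proposed metadata (compute capability, cubin, runtime policy blob, and lowered kernel name). The wire format and function names here are assumptions for illustration, not the actual CCCL API.

```python
import struct

# Hypothetical serializer: the header layout ("<iQQQ") and field order
# are assumptions; only the set of fields mirrors the proposal below.
def serialize_build(cc, cubin, runtime_policy, kernel_name):
    name = kernel_name.encode()
    header = struct.pack("<iQQQ", cc, len(cubin), len(runtime_policy), len(name))
    return header + name + cubin + runtime_policy

def deserialize_build(blob):
    cc, cubin_size, policy_size, name_len = struct.unpack_from("<iQQQ", blob, 0)
    off = struct.calcsize("<iQQQ")
    name = blob[off:off + name_len].decode()
    off += name_len
    cubin = blob[off:off + cubin_size]
    off += cubin_size
    policy = blob[off:off + policy_size]
    return cc, cubin, policy, name
```

The point of the sketch is that everything needed to rebuild the result in another process is plain data — which is exactly why the C structs must expose the sizes and names explicitly.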
Required changes
- New fields in all `*_build_result_t` structs:
  - `int cc` — the compute capability the kernel was compiled for (encoded as major * 10 + minor)
  - `size_t runtime_policy_size` — size of the opaque `runtime_policy` blob, so it can be round-tripped through serialization
  - Per-kernel `char` lowered-name fields — the mangled CUDA kernel names produced by NVRTC, needed to resolve kernels from a cubin via `cuLibraryGetKernel` during deserialization
- Cross-CC build support — when a kernel is compiled for a target CC that doesn't match the current device (e.g. compiling for SM 9.0 on an SM 8.6 machine), `cuLibraryLoadData` returns `CUDA_ERROR_NO_BINARY_FOR_GPU`. Currently this is a fatal error. The build functions should be updated to treat this case as success — returning the cubin and lowered names without a loaded `CUlibrary` — so that the result can be serialized and shipped to a matching device.
Motivation
These changes are purely additive to the C structs and transparent to existing callers. They unblock the Python layer to implement `save()` / `load_algorithm()` for pre-compiled kernel distribution (e.g. shipping pre-compiled kernels in a Python wheel that works across a range of GPU architectures).