QDP: add an AMD GPU (ROCm/HIP) build for the native encoder engine#1399
QDP: add an AMD GPU (ROCm/HIP) build for the native encoder engine#1399jeffdaily wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
@jeffdaily thanks for the patch!!! Welcome to Mahout
some comments left:
Nice port — the gpu_rt indirection + scoped hip_compat shims + the wave64 >>5→/warpSize fix are clean.
I ran the cuda path on a 2080 Ti: build + lib/fidelity/amplitude/angle all green, so default path's healthy. No AMD HW here so the HIP notes are read-only — flagging for your call. 4 inline comments; (Drop frees without binding the device) is the only one I'd want fixed before merge. Rest are minor / sanity-checks.
(Tiny nit: "CUDA byte-for-byte unchanged" isn't quite true — metrics.rs swapped driver cuMemcpyDtoH_v2→runtime cudaMemcpy, and the kernel >>5→/warpSize changes SASS. Both verified behavior-identical, so no action.)
| fn drop(&mut self) { | ||
| if self.ptr != 0 { | ||
| unsafe { | ||
| let _ = hipFree(self.raw_ptr()); |
There was a problem hiding this comment.
hipFree hits the calling thread's current device, but Drop never re-binds (unlike alloc, which does self.bind()?). Multi-GPU: dropping while another device is current frees the wrong device — silent leak or wrong-pointer free. cudarc binds in Drop. A best-effort self._device.bind() before hipFree fixes it.
There was a problem hiding this comment.
Fixed in 0b5042e -- Drop now does a best-effort self._device.bind() before hipFree (result swallowed, since Drop can't fail), matching the alloc path and cudarc.
|
|
||
| #[allow(dead_code)] | ||
| pub(crate) fn cudaPointerGetAttributes( | ||
| pub(crate) unsafe fn cudaPointerGetAttributes( |
There was a problem hiding this comment.
Heads up — this is live (lib.rs:82), not dead code. The cast assumes hipMemoryType enum values == CUDA's (DEVICE=2), which has differed across ROCm versions. Fine on 7.2.1, but a version landmine — safer to read the real hipPointerAttribute_t and map the enum explicitly.
There was a problem hiding this comment.
Fixed in 0b5042e -- reads hipPointerAttribute_t and compares the type field against the named hipMemoryTypeDevice/hipMemoryTypeManaged constants, instead of assuming the enum value matches CUDA's 2.
| if cfg!(feature = "hip") { | ||
| return true; | ||
| } | ||
| env::var("QDP_USE_HIP") |
There was a problem hiding this comment.
QDP_USE_HIP flips the kernel build to hipcc even with the cargo hip feature off → a default cargo build with this env set = AMD kernels + cudarc host = mismatch. Maybe error when QDP_USE_HIP and CARGO_FEATURE_HIP disagree, so it fails loud.
There was a problem hiding this comment.
Fixed in 0b5042e -- added a check that panics with a clear message when QDP_USE_HIP and CARGO_FEATURE_HIP disagree in either direction, so the kernel and host builds can't silently mismatch.
| self.bind()?; | ||
| let mut stream: *mut c_void = std::ptr::null_mut(); | ||
| unsafe { | ||
| check(hipStreamCreate(&mut stream))?; |
There was a problem hiding this comment.
hipStreamCreate is the blocking stream ctor; cudarc forks a non-blocking one (hipStreamCreateWithFlags(.., hipStreamNonBlocking)). Could quietly serialize the H2D/compute overlap. Your overlap test passes, so just worth a confirm.
There was a problem hiding this comment.
Fixed in 0b5042e -- now hipStreamCreateWithFlags(.., hipStreamNonBlocking) to match cudarc. Making it non-blocking exposed a latent race in the batch-f32 amplitude readback (default-stream copy without syncing the caller's stream, previously masked by the blocking stream), fixed in the same commit. Overlap tests still pass.
|
btw, please check the precommit errors |
|
I'm also curious about complex optimizations can or cannot be converted this easy. I think recently someone open PRs that have some real profound ones. #1390 |
|
Thanks for the review, and for running the CUDA path on the 2080 Ti. All four addressed in 0b5042e:
On #4: making the stream non-blocking surfaced a pre-existing latent race in the batch-f32 amplitude path -- it read the norm back on the default stream without syncing the caller's stream, masked until now by the blocking stream. Fixed with the same stream-sync the other batch paths use; it's shared (non-HIP) code, so it applies to the CUDA path too. On byte-for-byte: that's overstated, now corrected in the PR description. |
|
@ryankert01 This PR was the easy end: hand-written kernels plus the CUDA runtime/driver API, which hipify mostly mechanically -- the one real subtlety was wave size (the #1390's implicit-Hadamard Ozaki engine is the hard end. It uses So: runtime/memory/plain-kernel code ports about as easily as this PR did; anything built on tensor cores, |
There was a problem hiding this comment.
I checked out this head and ran the complete Rust suite on NVIDIA hardware (RTX 3090 Ti, CUDA toolkit): 316 passed / 0 failed, including all GPU suites and the dual-stream pipeline tests with QDP_ENABLE_OVERLAP_TRACKING=1; cargo clippy --all-targets is clean. So the refactored default CUDA path is verified on the vendor side the author could not test on.
Inline comments below cover the issues and risks found in review. The only one I'd hold merge for is the ASF header/NOTICE question; the rest are small follow-ups that could land here or later.
Please also fix the pre-commit hook!
| // See the License for the specific language governing permissions and | ||
| // limitations under the License. | ||
| // | ||
| // Copyright (c) 2026 Advanced Micro Devices, Inc. |
There was a problem hiding this comment.
ASF source-header policy (applies to all 10 files carrying this header): the PR adds Portions Copyright (c) 2026 Advanced Micro Devices, Inc. and Author: lines beneath the ASF header here and in gpu_rt.rs, cuda_ffi.rs, qdp-kernels/build.rs, qdp-python/build.rs, amplitude.cu, kernel_compat.h, and the three hip_compat/ headers.
Per https://www.apache.org/legal/src-headers.html, contributions submitted with copyright notices should have those notices removed or relocated to the project NOTICE file, and author tags in source files are discouraged across ASF projects. There is no precedent for either in this repo, and NOTICE currently has no AMD entry.
This needs a deliberate call before merge — likely either move the copyright line to NOTICE or drop it, and drop the Author: lines (git history preserves attribution).
There was a problem hiding this comment.
Done in 088041c -- per ASF policy, dropped the per-file AMD copyright and Author: lines from all 10 files (the ASF Apache-2.0 header is untouched) and added the AMD attribution to NOTICE. Git history preserves authorship.
| // this explicitly). So we read the real hipPointerAttribute_t and compare its | ||
| // `type` against the named hipMemoryType* constants rather than a magic | ||
| // number, then translate to the CUDA convention the caller expects. | ||
| const HIP_MEMORY_TYPE_DEVICE: i32 = 2; // hipMemoryTypeDevice |
There was a problem hiding this comment.
These constants pin the ROCm 6+ hipMemoryType convention — as the comment above notes, ROCm 5.x used Device=1 — and this is live code: validate_cuda_input_ptr calls cudaPointerGetAttributes on every encode*_from_gpu_ptr entry point (the 64 gpu_ptr_encoding tests exercise it). On ROCm 5.x, valid device pointers would be rejected as "not device memory".
So the effective floor is ROCm >= 6.0. Suggest stating that in the DEVELOPMENT.md prerequisites (testing was on 7.2.1).
There was a problem hiding this comment.
Documented in dbee1e1 -- added a ROCm >= 6.0 prerequisite to DEVELOPMENT.md noting the hipMemoryType device-pointer check uses the ROCm 6+ convention (ROCm 5.x would reject valid device pointers). Tested on 7.2.1.
| use std::ffi::c_void; | ||
|
|
||
| pub mod device; | ||
| use device::{DeviceRepr, ValidAsZeroBits}; |
There was a problem hiding this comment.
With neither cuda nor hip enabled, this import is unresolved: cargo check -p qdp-kernels --no-default-features fails with a raw E0432 (reproduced locally). Before this PR that configuration built, since cudarc was unconditional.
Consider a guard in device.rs for a clean diagnostic:
#[cfg(not(any(feature = "cuda", feature = "hip")))]
compile_error!("qdp-kernels requires exactly one of the `cuda` or `hip` features");Non-blocking — that config was only incidentally buildable before.
There was a problem hiding this comment.
Added in dbee1e1 -- a #[cfg(not(any(feature = "cuda", feature = "hip")))] compile_error!(...) in device.rs, so that config now gives a clean diagnostic instead of the raw E0432.
| default = ["cuda"] | ||
| cuda = ["dep:cudarc"] | ||
| # AMD build: hipcc-compiled kernels + the in-crate HIP device traits; no cudarc. | ||
| hip = [] |
There was a problem hiding this comment.
cuda and hip aren't additive: with both enabled (e.g. --features hip without --no-default-features, or via workspace feature unification), hip silently wins everywhere (device.rs, cuda_ffi.rs, the kernel build) while cudarc still compiles, unused. The QDP_USE_HIP consistency panic in build.rs keeps the outcome coherent, which is good — but a one-line note here (and in qdp-core's [features]) that the two are mutually exclusive with hip taking precedence would prevent surprises.
There was a problem hiding this comment.
Added in dbee1e1 -- a note in both qdp-kernels and qdp-core [features] that cuda and hip are mutually exclusive and hip takes precedence (kernels build for HIP; cudarc is compiled but unused).
| The first command is what `maturin develop --release` runs on CI; the | ||
| second verifies tests type-check in the CUDA build. | ||
|
|
||
| ### AMD GPU build (ROCm / HIP) |
There was a problem hiding this comment.
Maintenance risk worth acknowledging: CI has no AMD runners, so after merge the HIP path is never exercised and can silently rot as the CUDA path evolves. The gpu_rt seam minimizes divergence pressure, but regressions will only surface when someone rebuilds on AMD hardware. A compile-only hipcc job (ROCm apt packages on ubuntu runners) could be a cheap follow-up.
There was a problem hiding this comment.
I think we should add a build github action workflow for hip!
There was a problem hiding this comment.
Agreed -- the gpu_rt seam limits divergence, but without AMD runners the HIP path can still bit-rot. I'll put up a compile-only hipcc job (ROCm apt packages on an ubuntu runner) as a separate follow-up PR so the build is at least guarded.
Adds an AMD GPU build of the QDP (Quantum Data Plane) native engine under qdp/, behind a Cargo `hip` feature (and QDP_USE_HIP=1). The default `cuda` feature is unchanged, so the NVIDIA build is byte-for-byte identical; nothing on the AMD path is reachable without opting in. The separate Triton AMD backend is orthogonal and untouched -- this gives AMD parity on the native engine the project is built around (pinned-buffer pool, dual-stream overlap, in-Rust DLPack ownership). Authored with the assistance of Claude (Anthropic). Review it in two layers. Kernels (qdp-kernels): build.rs gains a HIP branch that compiles the same six .cu with hipcc, taking --offload-arch from QDP_HIP_ARCH_LIST (default gfx90a only when unset, never a literal that overrides the env, so other AMD targets build the same source by setting QDP_HIP_ARCH_LIST alone) and linking the AMD HIP runtime; the CUDA branch (nvcc) is untouched. hipcc ships no <cuda_runtime.h> / <cuComplex.h> / <vector_types.h>, so qdp-kernels/hip_compat/ holds forwarding shim headers of those exact names, added to the include path ONLY on the HIP build, that map the small cuda* runtime + cuComplex surface the kernels use onto HIP. A CUDA build never sees that directory and pulls the real toolkit headers, so the .cu keep their CUDA spellings unchanged. amplitude.cu is the only kernel needing source fixes, both arch-unified: the __shfl_down_sync mask becomes 64-bit on HIP (ROCm static_asserts sizeof(mask)==8; the 32-bit literal fails to compile) while staying 0xffffffffu on CUDA, and the warp-id `threadIdx.x >> 5` becomes `threadIdx.x / warpSize`. The latter is a genuine wave64 correctness fix: `>> 5` assumes 32-lane warps, so on a 64-lane (CDNA) warp the per-warp L2-norm partial landed in the wrong shared slot; `/ warpSize` is identical to `>> 5` on 32-lane hardware and correct on 64-lane hardware. Host (qdp-core): the cudarc crate is CUDA-only with no ROCm backend, so on the HIP build it is displaced by a thin HIP-runtime shim with the SAME type names and method signatures (qdp-kernels/src/device.rs: CudaDevice, CudaSlice, CudaStream, DevicePtr/DevicePtrMut/DeviceSlice, DeviceRepr/ValidAsZeroBits, backed by the AMD HIP runtime). qdp-core/src/gpu_rt.rs re-exports it as the single import point; every `use cudarc::driver` became `use crate::gpu_rt`, so the call sites compile unchanged on either vendor. The runtime FFI (gpu/cuda_ffi.rs) keeps its cuda* names and binds the matching hip* entry points under the hip feature. The async H2D path maps cudaMemcpyAsync to hipMemcpyAsync (its exact enqueue-and-return 1:1), NOT hipMemcpyWithStream, which synchronizes the stream and would block the host and silently serialize the dual-stream overlap pipeline; the copy-done event plus stream-wait still order copy before compute, so correctness is unchanged. The shim's htod_sync_copy_into asserts dst.len() == src.len() to match cudarc's contract. DLPack tags exported tensors kDLROCM on the HIP build (a ROCm PyTorch's from_dlpack rejects a CUDA tag); the two device-type tests are made arch-aware. The GPU stack is selected by a build-script `qdp_gpu_platform` cfg (Linux always; Windows when the hip feature is on) rather than a target_os == "linux" proxy, which enables the same code on Windows ROCm; on Linux this cfg is always set, so the Linux output is identical. qdp/DEVELOPMENT.md documents the ROCm/HIP build next to the existing CUDA flow, and qdp/qdp-python/README.md notes the native engine's AMD build. New and substantially-extended files for the AMD build carry an AMD copyright line below the Apache header and name the author. Test Plan gfx90a (MI250X, ROCm 7.2.1), gfx1100 (Radeon Pro W7800), and on Windows ROCm gfx1201 (RX 9070 XT) and gfx1151 (Radeon 8060S): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm cd qdp cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` All Rust tests RUN (they previously skipped on AMD, since cudarc found no device) and PASS: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64, gpu_validation 8; plus the non-GPU regression suites (arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). 0 failures. The dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1, confirming the non-blocking hipMemcpyAsync H2D path. ``` maturin build --features hip --profile dev --compatibility linux \ --manifest-path qdp/qdp-python/Cargo.toml --out dist/ pip install --no-deps dist/qumat_qdp-*.whl # extension only; keeps the venv's ROCm torch python -m pytest testing/qdp testing/qdp_python qdp/qdp-python/tests -q ``` Python parity against the torch reference, with a ROCm PyTorch: 301 passed, 12 skipped (multi-GPU / tensorflow-absent / the torch reference path's CUDA-centric arch-capability check / the separate Triton backend not built here), 0 failed. Encoder outputs match the reference within tolerance and DLPack round-trips to torch as ROCm tensors. The dev profile is used for the Python cdylib because the release `lto = "fat"` profile yields a bitcode-only shared object under the HIP toolchain.
Addresses four points from review on the AMD/HIP build. All are HIP-path fixes except the stream-sync, which corrects a latent cross-stream ordering hazard shared with the CUDA path. Authored with the assistance of Claude (Anthropic). device.rs CudaSlice::drop now re-binds the owning device before hipFree. hipFree releases on the calling thread's current device, so on multi-GPU a drop while a different device is current would free against the wrong device (the alloc path already binds; cudarc binds in Drop). The bind is best-effort since Drop cannot return an error. cuda_ffi.rs cudaPointerGetAttributes (HIP path) no longer assumes the hipMemoryType enum value equals CUDA's. hipMemoryType values are not guaranteed stable across ROCm releases, so it reads a real hipPointerAttribute_t and compares the type against the named hipMemoryTypeDevice / hipMemoryTypeManaged constants, translating to the CUDA convention the caller checks rather than reinterpreting a magic number. qdp-kernels/build.rs now fails loudly when QDP_USE_HIP and the `hip` Cargo feature disagree. QDP_USE_HIP=1 flips the kernel build to hipcc while the host backend is chosen by the feature, so a mismatch would silently link AMD kernels against the cudarc host (or vice versa). The build now panics with a clear message instead. device.rs fork_default_stream creates a non-blocking stream (hipStreamCreateWithFlags with hipStreamNonBlocking) to match cudarc; a blocking stream serializes copy/compute against the default stream. This exposed a pre-existing ordering bug in the batch-f32 stream path (encode_batch_from_gpu_ptr_f32_with_stream): the norm kernel runs on the caller's stream but the norm readback copies on the default stream without first synchronizing the caller's stream, so on a non-blocking stream the readback raced the zero-initialized buffer and reported a zero norm. Added the sync_cuda_stream the single-sample path already uses; correct on wave32 and wave64 and on the CUDA path. Test Plan gfx90a (MI250X, ROCm 7.2.1): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm HIP_VISIBLE_DEVICES=3 cd qdp cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` All Rust tests pass: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64, gpu_validation 8; non-GPU suites (arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). 0 failures. The dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1, confirming the non-blocking stream preserves copy/compute overlap. With QDP_USE_HIP=1 and the `hip` feature off, the build now aborts with the mismatch error as intended.
Run rustfmt over the HIP build additions (import ordering and line wrapping) to satisfy the pre-commit fmt hook. Formatting only, no behavior change. Authored with assistance from Claude (Anthropic).
The qdp-kernels build script rejected a benign configuration and could
not run in a toolchain-less environment, which broke the Pre-commit
clippy job (it runs `cargo clippy --all-targets --all-features`, turning
the `hip` Cargo feature on without setting QDP_USE_HIP, on a runner with
no ROCm toolkit).
Two fixes. First, check_hip_consistency() dropped the panic for "hip
feature on but QDP_USE_HIP unset": that is not a host/kernel mismatch,
because hip_requested() already builds the kernels for HIP whenever the
feature is on, so host and kernels agree. The real mismatch (QDP_USE_HIP
set with the `hip` feature off, which would build hipcc kernels against
the cudarc host) still panics, so the reviewer's guard is preserved.
Second, build_hip() now probes for a runnable hipcc the same way the
CUDA path probes for nvcc. When hipcc is absent it emits the qdp_no_cuda
stub cfg plus an explanatory warning and returns instead of failing, so
`cargo check`/`clippy` (including `--all-features`) link against the host
stubs and succeed without the GPU toolchain. When hipcc is present the
behavior is unchanged: real kernels are compiled.
This change was authored with the assistance of Claude.
Test Plan:
Real HIP build and GPU tests on gfx90a (hipcc present, graceful path
NOT taken):
```
export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm
cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
HIP_VISIBLE_DEVICES=0 cargo test -p qdp-core --no-default-features \
--features hip --test gpu_ptr_encoding --test gpu_fidelity
# gpu_ptr_encoding: 64 passed; gpu_fidelity: 17 passed
```
Toolchain-less CI scenario (hip feature on, QDP_USE_HIP unset, hipcc
absent) no longer panics and clippy succeeds via the stub path:
```
QDP_HIPCC=/nonexistent-hipcc cargo clippy --manifest-path qdp/Cargo.toml \
-p qdp-kernels -p qdp-core --all-targets --features hip
cargo fmt --manifest-path qdp/Cargo.toml --all -- --check
```
Reviewer guard still enforced (QDP_USE_HIP=1 with the `hip` feature off
panics):
```
QDP_USE_HIP=1 cargo build -p qdp-kernels
```
Addresses three review points on the AMD HIP backend. DEVELOPMENT.md now states ROCm >= 6.0 as a prerequisite under the AMD GPU build section: the device-pointer check in cuda_ffi.rs relies on the ROCm 6+ hipMemoryType device convention (ROCm 5.x numbered Device=1 and would reject valid device pointers); tested on ROCm 7.2.1. qdp-kernels/src/device.rs adds a module-scope compile_error! for the configuration with neither the cuda nor hip feature enabled, so that `cargo check -p qdp-kernels --no-default-features` reports a clear requirement instead of a raw unresolved-import E0432 from cudarc. Both qdp-kernels/Cargo.toml and qdp-core/Cargo.toml gain a comment in their [features] sections noting that cuda and hip are mutually exclusive and that hip takes precedence if both end up enabled (e.g. via workspace feature unification): kernels build for HIP and cudarc is compiled but unused. Comment only, no behavior change. This change was authored with the assistance of Claude. Test Plan: ```bash cd qdp # Clean diagnostic with neither vendor feature (compile_error!, not E0432) cargo check -p qdp-kernels --no-default-features # Default CUDA path still type-checks QDP_NO_CUDA=1 cargo check -p qdp-core -p qdp-kernels # Real HIP build for gfx90a QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip # Formatting cargo fmt --manifest-path Cargo.toml --all -- --check ```
Per the ASF source-header policy (apache.org/legal/src-headers.html), source files should carry only the standard Apache-2.0 license header, with third-party and contributor attribution recorded in the project NOTICE file rather than in per-file copyright/author comment lines. This drops the per-file "Copyright (c) 2026 Advanced Micro Devices, Inc." and "Author: Jeff Daily <jeff.daily@amd.com>" comment lines that the QDP AMD/HIP GPU build added to ten files, restoring the canonical ASF-only header on each. The AMD attribution is moved to the repo-root NOTICE file as a parallel entry in the existing ASF house style. No code changes; the edits are comment-only. This addresses the source-header review feedback on the QDP AMD/HIP build contribution. The work was authored with the assistance of Claude, an AI assistant. Test Plan: ``` cargo fmt --manifest-path qdp/Cargo.toml --all -- --check QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip QDP_NO_CUDA=1 cargo check -p qdp-core -p qdp-kernels ``` All three pass (exit 0) on a gfx90a host with ROCm 7.2.1. The header edits remove comment lines only, so device code is unchanged.
|
@ryankert01 Thanks for running the full suite on the 3090 Ti -- good to have the default CUDA path vendor-confirmed. |
The multi-GPU DLPack device-type assertion hardcoded kDLCUDA=2, which fails on a ROCm build where the engine reports kDLROCM=10. The device-0 version of the same test (test_dlpack_device) was already made arch-aware in the original port commit but the non-zero-device variant was missed because the gfx90a validator ran with HIP_VISIBLE_DEVICES=N (single GPU visible), skipping the multi-GPU test. Caught on gfx1100 where 4 GPUs are visible. Fix matches the device-0 pattern exactly: use getattr(torch.version, "hip", None) to select kDLROCM or kDLCUDA. Test Plan gfx1100 (Radeon Pro W7800, ROCm 7.2.1), all 4 GPUs visible: python -m pytest testing/qdp testing/qdp_python qdp/qdp-python/tests -q 305 passed, 10 skipped, 0 failed. test_dlpack_device_id_non_zero passes. Authored with the assistance of Claude (Anthropic).
Two follow-ups to the recent ASF-header cleanup so the pre-commit suite (run in CI as `pre-commit run --all-files`) is green. First, qdp/qdp-python/build.rs lost the leading blank comment line during the header rewrite, so its header no longer matched the canonical Apache-2.0 header in testing/utils/.license-header.txt and the insert-license hook kept re-modifying it. The header is now byte-identical to the other build.rs files. Second, clippy with -D warnings flagged the `stream.stream as *mut c_void` casts at kernel-launch call sites as unnecessary_cast. The cast is a real conversion on the CUDA backend, where cudarc's CUstream is a distinct pointer type, but a no-op on HIP, where the stream field is already *mut c_void. Removing it would break the CUDA build, so the cast stays and a HIP-scoped `#![cfg_attr(feature = "hip", allow(clippy::unnecessary_cast))]` silences the redundancy only where it is redundant. This mirrors the existing crate-level `unused_unsafe` allow, which has the same dual-backend rationale. The allow is applied at each affected crate root (qdp-core lib and the two integration test crates that contain these call sites). This work was authored with the assistance of Claude, an AI assistant. Test Plan: ``` # License header (leaves files unmodified): pre-commit run insert-license --all-files # Clippy clean on every backend selection (-D warnings): QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo clippy -p qdp-core -p qdp-kernels --all-targets \ --no-default-features --features hip -- -D warnings QDP_NO_CUDA=1 cargo clippy -p qdp-core -p qdp-kernels --all-targets \ --no-default-features --features cuda -- -D warnings cargo clippy -p qdp-core -p qdp-kernels --all-targets \ --no-default-features --features "cuda hip" -- -D warnings # gfx90a HIP build: QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip ```
Two Windows-specific corrections needed after the review-hardening commit (206aa0c) landed. 1. encode_from_gpu_ptr_f32 and encode_from_gpu_ptr_f32_with_stream in amplitude.rs, angle.rs, basis.rs, and the QuantumEncoder trait default in mod.rs were gated #[cfg(target_os = "linux")] in the review commit. Their callers in lib.rs use #[cfg(qdp_gpu_platform)], which is true on Windows+hip, so on Windows the functions were compiled out while the call sites compiled in, producing an unresolved-function link error. Renamed all five occurrences to #[cfg(qdp_gpu_platform)] to match the rest of the codebase convention set by the Windows port commit. 2. fork_default_stream in device.rs creates a HIP_STREAM_NON_BLOCKING stream on Linux (matching cudarc, preserving dual-stream pipeline overlap). On Windows with TheRock ROCm 7.14, non-blocking stream writes are not visible via hipMemcpy after hipStreamSynchronize or hipDeviceSynchronize, due to a cache coherency gap in the Windows HIP runtime. The fix uses hipStreamCreate (a blocking stream) on Windows to restore correct D2H readback semantics; the pipeline overlap optimization is not available on this platform and runtime combination. All 21 amplitude encode and 10 angle encode kernel tests, including the three stream-path tests that regressed (test_l2_norm_batch_kernel_stream, test_encode_from_gpu_ptr_f32_with_stream_non_default_success, test_encode_batch_from_gpu_ptr_f32_with_stream_success), pass with this change. Authored with the assistance of Claude (Anthropic). Test Plan gfx1201 (RX 9070 XT, Windows 11, TheRock ROCm 7.14, HIP_VISIBLE_DEVICES=0): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx1201 HIP_VISIBLE_DEVICES=0 cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` 330 passed, 0 failed, 4 ignored. All GPU and non-GPU suites pass.
Restore the non-blocking forked stream on all platforms and fix the actual root cause of the Windows gfx1201 stream-test regression: a missing ordering between default-stream buffer setup and a kernel launched on a non-blocking stream that consumes it. Background: fork_default_stream creates a hipStreamNonBlocking stream to match cudarc and preserve the dual-stream copy/compute overlap. The encoders set up their input (htod) and output (alloc_zeros) buffers with the blocking shim copies, which run on the NULL/default stream, then launch the norm/encode kernels on the caller's forked stream. CUDA's legacy default stream is synchronizing, so on NVIDIA that setup is implicitly ordered before the forked-stream kernel reads the buffer. HIP's default stream is NOT synchronizing relative to a non-blocking stream, so the kernel raced the setup and read stale/zero data; the norm came back as the zero-initialized value and the result was wrong. A previous change worked around this by creating a blocking stream on Windows, which masked every such site at once but sacrificed the pipeline overlap and was attributed to a nonexistent "cache coherency gap" in the Windows runtime. A minimal HIP reproducer shows the runtime is correct: a kernel on a non-blocking stream that is properly ordered against the default stream (stream sync, device sync, or an event wait) reads back correct data on both the Adrenalin and ROCm runtimes; only the unordered case fails. The fix synchronizes the default stream at the end of the blocking alloc_zeros / htod copies (sync_default_stream), restoring the CUDA-equivalent ordering on every platform without touching the async pipeline (which uses async copies on explicit streams, not these blocking paths). The forked stream is non-blocking again everywhere. The symmetric readback hazard (a NULL-stream dtoh after a forked-stream kernel) is closed where it was still missing a sync: the f64/f32 batch norm-validation copies in amplitude.rs and the phase batch finiteness-probe copy in phase.rs now synchronize the caller's stream before reading back, matching the idiom the other encoders already use. This is arch-unified: correct on wave32 (gfx1100/gfx1151/gfx1201) and wave64 (gfx90a). On Linux the default stream is already synchronizing, so the added sync is a harmless no-op-cost ordering point. Authored with the assistance of Claude (Anthropic). Test Plan gfx1201 (RX 9070 XT, Windows 11, TheRock ROCm 7.14, HIP_VISIBLE_DEVICES=0): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx1201 HIP_VISIBLE_DEVICES=0 cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` All suites pass, 0 failures. The three previously-regressed stream tests pass with the non-blocking stream restored: test_l2_norm_batch_kernel_stream, test_encode_from_gpu_ptr_f32_with_stream_non_default_success, test_encode_batch_from_gpu_ptr_f32_with_stream_success. qdp-core lib 77, gpu_ptr_encoding 68 (all 10 _with_stream variants pass), amplitude 21, angle 10, and all other GPU and non-GPU suites green.
|
Wanted to apologize for the recent churn here. Normally I would have done this BEFORE opening the PR, but I didn't fully understand the problem. I would make a fix on Linux-gfx90a, then validated it on Win-gfx1201, then Win-gfx1201 would make a change and send it back to Linux-gfx90a, and so on. However, I finally root-caused the blocking vs non-blocking stream issues. This commit d97db73 should take care of it. See the commit message there for details. |
Summary
Adds an AMD GPU build of the QDP (Quantum Data Plane) native engine under
qdp/, behind a Cargohipfeature (andQDP_USE_HIP=1). The defaultcudafeature is unchanged, so the NVIDIA build is behavior-preserving (no functional change) and nothing on the AMD path is reachable without opting in. The separate Triton AMD backend is orthogonal and untouched. This gives AMD parity on the native engine the project is built around (pinned-buffer pool, dual-stream overlap, in-Rust DLPack ownership).This change was authored with the assistance of Claude (Anthropic) and validated on real AMD GPU hardware (see Test Plan).
Review it in two layers.
Kernels (
qdp-kernels)build.rsgains a HIP branch that compiles the same six.cuwith hipcc, taking--offload-archfromQDP_HIP_ARCH_LIST(defaultgfx90aonly when unset, never a literal that overrides the env, so other AMD targets build the same source by settingQDP_HIP_ARCH_LISTalone) and linking the AMD HIP runtime; the CUDA branch (nvcc) is untouched.hipcc ships no
<cuda_runtime.h>/<cuComplex.h>/<vector_types.h>, soqdp-kernels/hip_compat/holds forwarding shim headers of those exact names, added to the include path ONLY on the HIP build, that map the smallcuda*runtime + cuComplex surface the kernels use onto HIP. A CUDA build never sees that directory and pulls the real toolkit headers, so the.cukeep their CUDA spellings unchanged.amplitude.cuis the only kernel needing source fixes, both arch-unified: the__shfl_down_syncmask becomes 64-bit on HIP (ROCm static_assertssizeof(mask)==8; the 32-bit literal fails to compile) while staying0xffffffffuon CUDA, and the warp-idthreadIdx.x >> 5becomesthreadIdx.x / warpSize. The latter is a genuine wave64 correctness fix:>> 5assumes 32-lane warps, so on a 64-lane (CDNA) warp the per-warp L2-norm partial landed in the wrong shared slot;/ warpSizeis identical to>> 5on 32-lane hardware and correct on 64-lane hardware.Host (
qdp-core)The cudarc crate is CUDA-only with no ROCm backend, so on the HIP build it is displaced by a thin HIP-runtime shim with the SAME type names and method signatures (
qdp-kernels/src/device.rs:CudaDevice,CudaSlice,CudaStream,DevicePtr/DevicePtrMut/DeviceSlice,DeviceRepr/ValidAsZeroBits, backed by the AMD HIP runtime).qdp-core/src/gpu_rt.rsre-exports it as the single import point; everyuse cudarc::driverbecameuse crate::gpu_rt, so the call sites compile unchanged on either vendor.The runtime FFI (
gpu/cuda_ffi.rs) keeps itscuda*names and binds the matchinghip*entry points under thehipfeature. The async H2D path mapscudaMemcpyAsynctohipMemcpyAsync(its exact enqueue-and-return 1:1), NOThipMemcpyWithStream, which synchronizes the stream and would block the host and silently serialize the dual-stream overlap pipeline; the copy-done event plus stream-wait still order copy before compute, so correctness is unchanged.DLPack tags exported tensors
kDLROCMon the HIP build (a ROCm PyTorch'sfrom_dlpackrejects a CUDA tag); the two device-type tests are made arch-aware. The GPU stack is selected by a build-scriptqdp_gpu_platformcfg (Linux always; Windows when thehipfeature is on) rather than atarget_os == "linux"proxy, which enables the same code on Windows ROCm; on Linux this cfg is always set, so the Linux output is identical.Docs and attribution
qdp/DEVELOPMENT.mddocuments the ROCm/HIP build next to the existing CUDA flow, andqdp/qdp-python/README.mdnotes the native engine's AMD build. New and substantially-extended files for the AMD build carry an AMD copyright line below the Apache header and name the author.Test Plan
Built and tested on gfx90a (MI250X, ROCm 7.2.1), gfx1100 (Radeon Pro W7800), and on Windows ROCm gfx1201 (RX 9070 XT) and gfx1151 (Radeon 8060S):
All Rust tests RUN (they previously skipped on AMD, since cudarc found no device) and PASS, 0 failures: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64, gpu_validation 8; plus the non-GPU regression suites (arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). The dual-stream async-pipeline tests pass with
QDP_ENABLE_OVERLAP_TRACKING=1, confirming the non-blockinghipMemcpyAsyncH2D path.The default
cudafeature build is unchanged and remains the default; the AMD path is reachable only via--features hip/QDP_USE_HIP=1.