Skip to content

QDP: add an AMD GPU (ROCm/HIP) build for the native encoder engine#1399

Open
jeffdaily wants to merge 10 commits into
apache:mainfrom
jeffdaily:moat-port
Open

QDP: add an AMD GPU (ROCm/HIP) build for the native encoder engine#1399
jeffdaily wants to merge 10 commits into
apache:mainfrom
jeffdaily:moat-port

Conversation

@jeffdaily

@jeffdaily jeffdaily commented Jun 11, 2026

Copy link
Copy Markdown

Summary

Adds an AMD GPU build of the QDP (Quantum Data Plane) native engine under qdp/, behind a Cargo hip feature (and QDP_USE_HIP=1). The default cuda feature is unchanged, so the NVIDIA build is behavior-preserving (no functional change) and nothing on the AMD path is reachable without opting in. The separate Triton AMD backend is orthogonal and untouched. This gives AMD parity on the native engine the project is built around (pinned-buffer pool, dual-stream overlap, in-Rust DLPack ownership).

This change was authored with the assistance of Claude (Anthropic) and validated on real AMD GPU hardware (see Test Plan).

Review it in two layers.

Kernels (qdp-kernels)

build.rs gains a HIP branch that compiles the same six .cu with hipcc, taking --offload-arch from QDP_HIP_ARCH_LIST (default gfx90a only when unset, never a literal that overrides the env, so other AMD targets build the same source by setting QDP_HIP_ARCH_LIST alone) and linking the AMD HIP runtime; the CUDA branch (nvcc) is untouched.

hipcc ships no <cuda_runtime.h> / <cuComplex.h> / <vector_types.h>, so qdp-kernels/hip_compat/ holds forwarding shim headers of those exact names, added to the include path ONLY on the HIP build, that map the small cuda* runtime + cuComplex surface the kernels use onto HIP. A CUDA build never sees that directory and pulls the real toolkit headers, so the .cu keep their CUDA spellings unchanged.

amplitude.cu is the only kernel needing source fixes, both arch-unified: the __shfl_down_sync mask becomes 64-bit on HIP (ROCm static_asserts sizeof(mask)==8; the 32-bit literal fails to compile) while staying 0xffffffffu on CUDA, and the warp-id threadIdx.x >> 5 becomes threadIdx.x / warpSize. The latter is a genuine wave64 correctness fix: >> 5 assumes 32-lane warps, so on a 64-lane (CDNA) warp the per-warp L2-norm partial landed in the wrong shared slot; / warpSize is identical to >> 5 on 32-lane hardware and correct on 64-lane hardware.

Host (qdp-core)

The cudarc crate is CUDA-only with no ROCm backend, so on the HIP build it is displaced by a thin HIP-runtime shim with the SAME type names and method signatures (qdp-kernels/src/device.rs: CudaDevice, CudaSlice, CudaStream, DevicePtr/DevicePtrMut/DeviceSlice, DeviceRepr/ValidAsZeroBits, backed by the AMD HIP runtime). qdp-core/src/gpu_rt.rs re-exports it as the single import point; every use cudarc::driver became use crate::gpu_rt, so the call sites compile unchanged on either vendor.

The runtime FFI (gpu/cuda_ffi.rs) keeps its cuda* names and binds the matching hip* entry points under the hip feature. The async H2D path maps cudaMemcpyAsync to hipMemcpyAsync (its exact enqueue-and-return 1:1), NOT hipMemcpyWithStream, which synchronizes the stream and would block the host and silently serialize the dual-stream overlap pipeline; the copy-done event plus stream-wait still order copy before compute, so correctness is unchanged.

DLPack tags exported tensors kDLROCM on the HIP build (a ROCm PyTorch's from_dlpack rejects a CUDA tag); the two device-type tests are made arch-aware. The GPU stack is selected by a build-script qdp_gpu_platform cfg (Linux always; Windows when the hip feature is on) rather than a target_os == "linux" proxy, which enables the same code on Windows ROCm; on Linux this cfg is always set, so the Linux output is identical.

Docs and attribution

qdp/DEVELOPMENT.md documents the ROCm/HIP build next to the existing CUDA flow, and qdp/qdp-python/README.md notes the native engine's AMD build. New and substantially-extended files for the AMD build carry an AMD copyright line below the Apache header and name the author.

Test Plan

Built and tested on gfx90a (MI250X, ROCm 7.2.1), gfx1100 (Radeon Pro W7800), and on Windows ROCm gfx1201 (RX 9070 XT) and gfx1151 (Radeon 8060S):

export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm
cd qdp
cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
cargo test  -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1

All Rust tests RUN (they previously skipped on AMD, since cudarc found no device) and PASS, 0 failures: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64, gpu_validation 8; plus the non-GPU regression suites (arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). The dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1, confirming the non-blocking hipMemcpyAsync H2D path.

The default cuda feature build is unchanged and remains the default; the AMD path is reachable only via --features hip / QDP_USE_HIP=1.

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026

@rich7420 rich7420 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffdaily thanks for the patch!!! Welcome to Mahout
some comments left:

Nice port — the gpu_rt indirection + scoped hip_compat shims + the wave64 >>5/warpSize fix are clean.

I ran the cuda path on a 2080 Ti: build + lib/fidelity/amplitude/angle all green, so default path's healthy. No AMD HW here so the HIP notes are read-only — flagging for your call. 4 inline comments; (Drop frees without binding the device) is the only one I'd want fixed before merge. Rest are minor / sanity-checks.

(Tiny nit: "CUDA byte-for-byte unchanged" isn't quite true — metrics.rs swapped driver cuMemcpyDtoH_v2→runtime cudaMemcpy, and the kernel >>5/warpSize changes SASS. Both verified behavior-identical, so no action.)

fn drop(&mut self) {
if self.ptr != 0 {
unsafe {
let _ = hipFree(self.raw_ptr());

@rich7420 rich7420 Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hipFree hits the calling thread's current device, but Drop never re-binds (unlike alloc, which does self.bind()?). Multi-GPU: dropping while another device is current frees the wrong device — silent leak or wrong-pointer free. cudarc binds in Drop. A best-effort self._device.bind() before hipFree fixes it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0b5042e -- Drop now does a best-effort self._device.bind() before hipFree (result swallowed, since Drop can't fail), matching the alloc path and cudarc.


#[allow(dead_code)]
pub(crate) fn cudaPointerGetAttributes(
pub(crate) unsafe fn cudaPointerGetAttributes(

@rich7420 rich7420 Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up — this is live (lib.rs:82), not dead code. The cast assumes hipMemoryType enum values == CUDA's (DEVICE=2), which has differed across ROCm versions. Fine on 7.2.1, but a version landmine — safer to read the real hipPointerAttribute_t and map the enum explicitly.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0b5042e -- reads hipPointerAttribute_t and compares the type field against the named hipMemoryTypeDevice/hipMemoryTypeManaged constants, instead of assuming the enum value matches CUDA's 2.

Comment thread qdp/qdp-kernels/build.rs
if cfg!(feature = "hip") {
return true;
}
env::var("QDP_USE_HIP")

@rich7420 rich7420 Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QDP_USE_HIP flips the kernel build to hipcc even with the cargo hip feature off → a default cargo build with this env set = AMD kernels + cudarc host = mismatch. Maybe error when QDP_USE_HIP and CARGO_FEATURE_HIP disagree, so it fails loud.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0b5042e -- added a check that panics with a clear message when QDP_USE_HIP and CARGO_FEATURE_HIP disagree in either direction, so the kernel and host builds can't silently mismatch.

Comment thread qdp/qdp-kernels/src/device.rs Outdated
self.bind()?;
let mut stream: *mut c_void = std::ptr::null_mut();
unsafe {
check(hipStreamCreate(&mut stream))?;

@rich7420 rich7420 Jun 11, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hipStreamCreate is the blocking stream ctor; cudarc forks a non-blocking one (hipStreamCreateWithFlags(.., hipStreamNonBlocking)). Could quietly serialize the H2D/compute overlap. Your overlap test passes, so just worth a confirm.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0b5042e -- now hipStreamCreateWithFlags(.., hipStreamNonBlocking) to match cudarc. Making it non-blocking exposed a latent race in the batch-f32 amplitude readback (default-stream copy without syncing the caller's stream, previously masked by the blocking stream), fixed in the same commit. Overlap tests still pass.

@rich7420

Copy link
Copy Markdown
Contributor

btw, please check the precommit errors

@ryankert01

ryankert01 commented Jun 11, 2026

Copy link
Copy Markdown
Member

I'm also curious about complex optimizations can or cannot be converted this easy. I think recently someone open PRs that have some real profound ones. #1390

@jeffdaily

Copy link
Copy Markdown
Author

Thanks for the review, and for running the CUDA path on the 2080 Ti. All four addressed in 0b5042e:

  • Drop now binds the owning device before hipFree (the merge-blocker).
  • The hipMemoryType check reads hipPointerAttribute_t and compares against hipMemoryTypeDevice instead of the hardcoded 2.
  • build.rs fails loudly when QDP_USE_HIP and the hip feature disagree.
  • The stream is now non-blocking (hipStreamCreateWithFlags(.., hipStreamNonBlocking)) to match cudarc.

On #4: making the stream non-blocking surfaced a pre-existing latent race in the batch-f32 amplitude path -- it read the norm back on the default stream without syncing the caller's stream, masked until now by the blocking stream. Fixed with the same stream-sync the other batch paths use; it's shared (non-HIP) code, so it applies to the CUDA path too.

On byte-for-byte: that's overstated, now corrected in the PR description. metrics.rs swapped the driver cuMemcpyDtoH_v2 for the runtime cudaMemcpy, the kernel >>5 became /warpSize, and the #4 sync adds one more -- all SASS-changing but behavior-identical. The CUDA path is behavior-preserving, not literally byte/SASS-identical.

@jeffdaily

Copy link
Copy Markdown
Author

@rich7420 Fixed in 6d2de29 -- ran cargo fmt over the HIP-path files (import order + line wrapping). Clippy is clean on the Rust side; the one remaining compiler warning (iqp.cu unused parameter) is pre-existing upstream, untouched by this PR.

@jeffdaily

Copy link
Copy Markdown
Author

@ryankert01 This PR was the easy end: hand-written kernels plus the CUDA runtime/driver API, which hipify mostly mechanically -- the one real subtlety was wave size (the >>5 -> /warpSize fix, since CDNA warps are 64-wide).

#1390's implicit-Hadamard Ozaki engine is the hard end. It uses nvcuda::wmma and raw inline PTX (mma.sync.aligned.m16n8k32...s8.s8.s32) -- int8 tensor-core MMA -- to do an Ozaki accurate-FWT. None of that hipifies: inline PTX doesn't compile under hipcc, and it has to be rewritten against AMD matrix cores (MFMA on CDNA / rocWMMA), whose fragment shapes differ (gfx90a MFMA isn't m16n8k32), so the Ozaki tiling and the pre/post layout shuffles have to be re-derived for the AMD fragment geometry. The Ozaki scheme as an algorithm is portable; its whole payoff is fast low-precision matrix-core matmul, so it's a real reimplementation of that path, not a translation.

So: runtime/memory/plain-kernel code ports about as easily as this PR did; anything built on tensor cores, mma/PTX, or CUTLASS/CuTe needs a from-scratch AMD matrix-core path. Convertible, but not "this easy."

@ryankert01 ryankert01 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked out this head and ran the complete Rust suite on NVIDIA hardware (RTX 3090 Ti, CUDA toolkit): 316 passed / 0 failed, including all GPU suites and the dual-stream pipeline tests with QDP_ENABLE_OVERLAP_TRACKING=1; cargo clippy --all-targets is clean. So the refactored default CUDA path is verified on the vendor side the author could not test on.

Inline comments below cover the issues and risks found in review. The only one I'd hold merge for is the ASF header/NOTICE question; the rest are small follow-ups that could land here or later.

Please also fix the pre-commit hook!

Comment thread qdp/qdp-kernels/src/device.rs Outdated
// See the License for the specific language governing permissions and
// limitations under the License.
//
// Copyright (c) 2026 Advanced Micro Devices, Inc.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASF source-header policy (applies to all 10 files carrying this header): the PR adds Portions Copyright (c) 2026 Advanced Micro Devices, Inc. and Author: lines beneath the ASF header here and in gpu_rt.rs, cuda_ffi.rs, qdp-kernels/build.rs, qdp-python/build.rs, amplitude.cu, kernel_compat.h, and the three hip_compat/ headers.

Per https://www.apache.org/legal/src-headers.html, contributions submitted with copyright notices should have those notices removed or relocated to the project NOTICE file, and author tags in source files are discouraged across ASF projects. There is no precedent for either in this repo, and NOTICE currently has no AMD entry.

This needs a deliberate call before merge — likely either move the copyright line to NOTICE or drop it, and drop the Author: lines (git history preserves attribution).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 088041c -- per ASF policy, dropped the per-file AMD copyright and Author: lines from all 10 files (the ASF Apache-2.0 header is untouched) and added the AMD attribution to NOTICE. Git history preserves authorship.

// this explicitly). So we read the real hipPointerAttribute_t and compare its
// `type` against the named hipMemoryType* constants rather than a magic
// number, then translate to the CUDA convention the caller expects.
const HIP_MEMORY_TYPE_DEVICE: i32 = 2; // hipMemoryTypeDevice

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These constants pin the ROCm 6+ hipMemoryType convention — as the comment above notes, ROCm 5.x used Device=1 — and this is live code: validate_cuda_input_ptr calls cudaPointerGetAttributes on every encode*_from_gpu_ptr entry point (the 64 gpu_ptr_encoding tests exercise it). On ROCm 5.x, valid device pointers would be rejected as "not device memory".

So the effective floor is ROCm >= 6.0. Suggest stating that in the DEVELOPMENT.md prerequisites (testing was on 7.2.1).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in dbee1e1 -- added a ROCm >= 6.0 prerequisite to DEVELOPMENT.md noting the hipMemoryType device-pointer check uses the ROCm 6+ convention (ROCm 5.x would reject valid device pointers). Tested on 7.2.1.

use std::ffi::c_void;

pub mod device;
use device::{DeviceRepr, ValidAsZeroBits};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With neither cuda nor hip enabled, this import is unresolved: cargo check -p qdp-kernels --no-default-features fails with a raw E0432 (reproduced locally). Before this PR that configuration built, since cudarc was unconditional.

Consider a guard in device.rs for a clean diagnostic:

#[cfg(not(any(feature = "cuda", feature = "hip")))]
compile_error!("qdp-kernels requires exactly one of the `cuda` or `hip` features");

Non-blocking — that config was only incidentally buildable before.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in dbee1e1 -- a #[cfg(not(any(feature = "cuda", feature = "hip")))] compile_error!(...) in device.rs, so that config now gives a clean diagnostic instead of the raw E0432.

default = ["cuda"]
cuda = ["dep:cudarc"]
# AMD build: hipcc-compiled kernels + the in-crate HIP device traits; no cudarc.
hip = []

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda and hip aren't additive: with both enabled (e.g. --features hip without --no-default-features, or via workspace feature unification), hip silently wins everywhere (device.rs, cuda_ffi.rs, the kernel build) while cudarc still compiles, unused. The QDP_USE_HIP consistency panic in build.rs keeps the outcome coherent, which is good — but a one-line note here (and in qdp-core's [features]) that the two are mutually exclusive with hip taking precedence would prevent surprises.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in dbee1e1 -- a note in both qdp-kernels and qdp-core [features] that cuda and hip are mutually exclusive and hip takes precedence (kernels build for HIP; cudarc is compiled but unused).

Comment thread qdp/DEVELOPMENT.md
The first command is what `maturin develop --release` runs on CI; the
second verifies tests type-check in the CUDA build.

### AMD GPU build (ROCm / HIP)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintenance risk worth acknowledging: CI has no AMD runners, so after merge the HIP path is never exercised and can silently rot as the CUDA path evolves. The gpu_rt seam minimizes divergence pressure, but regressions will only surface when someone rebuilds on AMD hardware. A compile-only hipcc job (ROCm apt packages on ubuntu runners) could be a cheap follow-up.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a build github action workflow for hip!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed -- the gpu_rt seam limits divergence, but without AMD runners the HIP path can still bit-rot. I'll put up a compile-only hipcc job (ROCm apt packages on an ubuntu runner) as a separate follow-up PR so the build is at least guarded.

Adds an AMD GPU build of the QDP (Quantum Data Plane) native engine under qdp/,
behind a Cargo `hip` feature (and QDP_USE_HIP=1). The default `cuda` feature is
unchanged, so the NVIDIA build is byte-for-byte identical; nothing on the AMD
path is reachable without opting in. The separate Triton AMD backend is
orthogonal and untouched -- this gives AMD parity on the native engine the
project is built around (pinned-buffer pool, dual-stream overlap, in-Rust DLPack
ownership). Authored with the assistance of Claude (Anthropic).

Review it in two layers.

Kernels (qdp-kernels): build.rs gains a HIP branch that compiles the same six
.cu with hipcc, taking --offload-arch from QDP_HIP_ARCH_LIST (default gfx90a
only when unset, never a literal that overrides the env, so other AMD targets
build the same source by setting QDP_HIP_ARCH_LIST alone) and linking the AMD
HIP runtime; the CUDA branch (nvcc) is untouched. hipcc ships no
<cuda_runtime.h> / <cuComplex.h> / <vector_types.h>, so qdp-kernels/hip_compat/
holds forwarding shim headers of those exact names, added to the include path
ONLY on the HIP build, that map the small cuda* runtime + cuComplex surface the
kernels use onto HIP. A CUDA build never sees that directory and pulls the real
toolkit headers, so the .cu keep their CUDA spellings unchanged. amplitude.cu is
the only kernel needing source fixes, both arch-unified: the __shfl_down_sync
mask becomes 64-bit on HIP (ROCm static_asserts sizeof(mask)==8; the 32-bit
literal fails to compile) while staying 0xffffffffu on CUDA, and the warp-id
`threadIdx.x >> 5` becomes `threadIdx.x / warpSize`. The latter is a genuine
wave64 correctness fix: `>> 5` assumes 32-lane warps, so on a 64-lane (CDNA)
warp the per-warp L2-norm partial landed in the wrong shared slot; `/ warpSize`
is identical to `>> 5` on 32-lane hardware and correct on 64-lane hardware.

Host (qdp-core): the cudarc crate is CUDA-only with no ROCm backend, so on the
HIP build it is displaced by a thin HIP-runtime shim with the SAME type names
and method signatures (qdp-kernels/src/device.rs: CudaDevice, CudaSlice,
CudaStream, DevicePtr/DevicePtrMut/DeviceSlice, DeviceRepr/ValidAsZeroBits,
backed by the AMD HIP runtime). qdp-core/src/gpu_rt.rs re-exports it as the
single import point; every `use cudarc::driver` became `use crate::gpu_rt`, so
the call sites compile unchanged on either vendor. The runtime FFI
(gpu/cuda_ffi.rs) keeps its cuda* names and binds the matching hip* entry points
under the hip feature. The async H2D path maps cudaMemcpyAsync to hipMemcpyAsync
(its exact enqueue-and-return 1:1), NOT hipMemcpyWithStream, which synchronizes
the stream and would block the host and silently serialize the dual-stream
overlap pipeline; the copy-done event plus stream-wait still order copy before
compute, so correctness is unchanged. The shim's htod_sync_copy_into asserts
dst.len() == src.len() to match cudarc's contract. DLPack tags exported tensors
kDLROCM on the HIP build (a ROCm PyTorch's from_dlpack rejects a CUDA tag); the
two device-type tests are made arch-aware. The GPU stack is selected by a
build-script `qdp_gpu_platform` cfg (Linux always; Windows when the hip feature
is on) rather than a target_os == "linux" proxy, which enables the same code on
Windows ROCm; on Linux this cfg is always set, so the Linux output is identical.

qdp/DEVELOPMENT.md documents the ROCm/HIP build next to the existing CUDA flow,
and qdp/qdp-python/README.md notes the native engine's AMD build. New and
substantially-extended files for the AMD build carry an AMD copyright line below
the Apache header and name the author.

Test Plan

gfx90a (MI250X, ROCm 7.2.1), gfx1100 (Radeon Pro W7800), and on Windows ROCm
gfx1201 (RX 9070 XT) and gfx1151 (Radeon 8060S):

```
export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm
cd qdp
cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
cargo test  -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1
```

All Rust tests RUN (they previously skipped on AMD, since cudarc found no device)
and PASS: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites
gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17,
gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64,
gpu_validation 8; plus the non-GPU regression suites
(arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). 0 failures. The
dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1,
confirming the non-blocking hipMemcpyAsync H2D path.

```
maturin build --features hip --profile dev --compatibility linux \
  --manifest-path qdp/qdp-python/Cargo.toml --out dist/
pip install --no-deps dist/qumat_qdp-*.whl   # extension only; keeps the venv's ROCm torch
python -m pytest testing/qdp testing/qdp_python qdp/qdp-python/tests -q
```

Python parity against the torch reference, with a ROCm PyTorch: 301 passed, 12
skipped (multi-GPU / tensorflow-absent / the torch reference path's CUDA-centric
arch-capability check / the separate Triton backend not built here), 0 failed.
Encoder outputs match the reference within tolerance and DLPack round-trips to
torch as ROCm tensors. The dev profile is used for the Python cdylib because the
release `lto = "fat"` profile yields a bitcode-only shared object under the HIP
toolchain.
Addresses four points from review on the AMD/HIP build. All are HIP-path
fixes except the stream-sync, which corrects a latent cross-stream ordering
hazard shared with the CUDA path. Authored with the assistance of Claude
(Anthropic).

device.rs CudaSlice::drop now re-binds the owning device before hipFree.
hipFree releases on the calling thread's current device, so on multi-GPU a
drop while a different device is current would free against the wrong device
(the alloc path already binds; cudarc binds in Drop). The bind is best-effort
since Drop cannot return an error.

cuda_ffi.rs cudaPointerGetAttributes (HIP path) no longer assumes the
hipMemoryType enum value equals CUDA's. hipMemoryType values are not
guaranteed stable across ROCm releases, so it reads a real
hipPointerAttribute_t and compares the type against the named
hipMemoryTypeDevice / hipMemoryTypeManaged constants, translating to the CUDA
convention the caller checks rather than reinterpreting a magic number.

qdp-kernels/build.rs now fails loudly when QDP_USE_HIP and the `hip` Cargo
feature disagree. QDP_USE_HIP=1 flips the kernel build to hipcc while the host
backend is chosen by the feature, so a mismatch would silently link AMD
kernels against the cudarc host (or vice versa). The build now panics with a
clear message instead.

device.rs fork_default_stream creates a non-blocking stream
(hipStreamCreateWithFlags with hipStreamNonBlocking) to match cudarc; a
blocking stream serializes copy/compute against the default stream. This
exposed a pre-existing ordering bug in the batch-f32 stream path
(encode_batch_from_gpu_ptr_f32_with_stream): the norm kernel runs on the
caller's stream but the norm readback copies on the default stream without
first synchronizing the caller's stream, so on a non-blocking stream the
readback raced the zero-initialized buffer and reported a zero norm. Added the
sync_cuda_stream the single-sample path already uses; correct on wave32 and
wave64 and on the CUDA path.

Test Plan

gfx90a (MI250X, ROCm 7.2.1):

```
export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm HIP_VISIBLE_DEVICES=3
cd qdp
cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
cargo test  -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1
```

All Rust tests pass: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77;
GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9,
gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2,
gpu_ptr_encoding 64, gpu_validation 8; non-GPU suites
(arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). 0 failures.
The dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1,
confirming the non-blocking stream preserves copy/compute overlap. With
QDP_USE_HIP=1 and the `hip` feature off, the build now aborts with the
mismatch error as intended.
Run rustfmt over the HIP build additions (import ordering and line wrapping)
to satisfy the pre-commit fmt hook. Formatting only, no behavior change.

Authored with assistance from Claude (Anthropic).
The qdp-kernels build script rejected a benign configuration and could
not run in a toolchain-less environment, which broke the Pre-commit
clippy job (it runs `cargo clippy --all-targets --all-features`, turning
the `hip` Cargo feature on without setting QDP_USE_HIP, on a runner with
no ROCm toolkit).

Two fixes. First, check_hip_consistency() dropped the panic for "hip
feature on but QDP_USE_HIP unset": that is not a host/kernel mismatch,
because hip_requested() already builds the kernels for HIP whenever the
feature is on, so host and kernels agree. The real mismatch (QDP_USE_HIP
set with the `hip` feature off, which would build hipcc kernels against
the cudarc host) still panics, so the reviewer's guard is preserved.

Second, build_hip() now probes for a runnable hipcc the same way the
CUDA path probes for nvcc. When hipcc is absent it emits the qdp_no_cuda
stub cfg plus an explanatory warning and returns instead of failing, so
`cargo check`/`clippy` (including `--all-features`) link against the host
stubs and succeed without the GPU toolchain. When hipcc is present the
behavior is unchanged: real kernels are compiled.

This change was authored with the assistance of Claude.

Test Plan:

Real HIP build and GPU tests on gfx90a (hipcc present, graceful path
NOT taken):

```
export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm
cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
HIP_VISIBLE_DEVICES=0 cargo test -p qdp-core --no-default-features \
    --features hip --test gpu_ptr_encoding --test gpu_fidelity
# gpu_ptr_encoding: 64 passed; gpu_fidelity: 17 passed
```

Toolchain-less CI scenario (hip feature on, QDP_USE_HIP unset, hipcc
absent) no longer panics and clippy succeeds via the stub path:

```
QDP_HIPCC=/nonexistent-hipcc cargo clippy --manifest-path qdp/Cargo.toml \
    -p qdp-kernels -p qdp-core --all-targets --features hip
cargo fmt --manifest-path qdp/Cargo.toml --all -- --check
```

Reviewer guard still enforced (QDP_USE_HIP=1 with the `hip` feature off
panics):

```
QDP_USE_HIP=1 cargo build -p qdp-kernels
```
Addresses three review points on the AMD HIP backend.

DEVELOPMENT.md now states ROCm >= 6.0 as a prerequisite under the AMD
GPU build section: the device-pointer check in cuda_ffi.rs relies on the
ROCm 6+ hipMemoryType device convention (ROCm 5.x numbered Device=1 and
would reject valid device pointers); tested on ROCm 7.2.1.

qdp-kernels/src/device.rs adds a module-scope compile_error! for the
configuration with neither the cuda nor hip feature enabled, so that
`cargo check -p qdp-kernels --no-default-features` reports a clear
requirement instead of a raw unresolved-import E0432 from cudarc.

Both qdp-kernels/Cargo.toml and qdp-core/Cargo.toml gain a comment in
their [features] sections noting that cuda and hip are mutually
exclusive and that hip takes precedence if both end up enabled (e.g.
via workspace feature unification): kernels build for HIP and cudarc is
compiled but unused. Comment only, no behavior change.

This change was authored with the assistance of Claude.

Test Plan:

```bash
cd qdp
# Clean diagnostic with neither vendor feature (compile_error!, not E0432)
cargo check -p qdp-kernels --no-default-features
# Default CUDA path still type-checks
QDP_NO_CUDA=1 cargo check -p qdp-core -p qdp-kernels
# Real HIP build for gfx90a
QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \
  cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
# Formatting
cargo fmt --manifest-path Cargo.toml --all -- --check
```
Per the ASF source-header policy (apache.org/legal/src-headers.html),
source files should carry only the standard Apache-2.0 license header,
with third-party and contributor attribution recorded in the project
NOTICE file rather than in per-file copyright/author comment lines.

This drops the per-file "Copyright (c) 2026 Advanced Micro Devices,
Inc." and "Author: Jeff Daily <jeff.daily@amd.com>" comment lines that
the QDP AMD/HIP GPU build added to ten files, restoring the canonical
ASF-only header on each. The AMD attribution is moved to the repo-root
NOTICE file as a parallel entry in the existing ASF house style. No
code changes; the edits are comment-only.

This addresses the source-header review feedback on the QDP AMD/HIP
build contribution. The work was authored with the assistance of
Claude, an AI assistant.

Test Plan:
```
cargo fmt --manifest-path qdp/Cargo.toml --all -- --check
QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \
  cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
QDP_NO_CUDA=1 cargo check -p qdp-core -p qdp-kernels
```
All three pass (exit 0) on a gfx90a host with ROCm 7.2.1. The header
edits remove comment lines only, so device code is unchanged.
@jeffdaily

Copy link
Copy Markdown
Author

@ryankert01 Thanks for running the full suite on the 3090 Ti -- good to have the default CUDA path vendor-confirmed.

The multi-GPU DLPack device-type assertion hardcoded kDLCUDA=2, which
fails on a ROCm build where the engine reports kDLROCM=10. The device-0
version of the same test (test_dlpack_device) was already made arch-aware
in the original port commit but the non-zero-device variant was missed
because the gfx90a validator ran with HIP_VISIBLE_DEVICES=N (single GPU
visible), skipping the multi-GPU test. Caught on gfx1100 where 4 GPUs
are visible. Fix matches the device-0 pattern exactly: use
getattr(torch.version, "hip", None) to select kDLROCM or kDLCUDA.

Test Plan

gfx1100 (Radeon Pro W7800, ROCm 7.2.1), all 4 GPUs visible:

  python -m pytest testing/qdp testing/qdp_python qdp/qdp-python/tests -q

305 passed, 10 skipped, 0 failed. test_dlpack_device_id_non_zero passes.

Authored with the assistance of Claude (Anthropic).
Two follow-ups to the recent ASF-header cleanup so the pre-commit suite
(run in CI as `pre-commit run --all-files`) is green.

First, qdp/qdp-python/build.rs lost the leading blank comment line during
the header rewrite, so its header no longer matched the canonical Apache-2.0
header in testing/utils/.license-header.txt and the insert-license hook kept
re-modifying it. The header is now byte-identical to the other build.rs files.

Second, clippy with -D warnings flagged the `stream.stream as *mut c_void`
casts at kernel-launch call sites as unnecessary_cast. The cast is a real
conversion on the CUDA backend, where cudarc's CUstream is a distinct pointer
type, but a no-op on HIP, where the stream field is already *mut c_void.
Removing it would break the CUDA build, so the cast stays and a HIP-scoped
`#![cfg_attr(feature = "hip", allow(clippy::unnecessary_cast))]` silences the
redundancy only where it is redundant. This mirrors the existing crate-level
`unused_unsafe` allow, which has the same dual-backend rationale. The allow is
applied at each affected crate root (qdp-core lib and the two integration test
crates that contain these call sites).

This work was authored with the assistance of Claude, an AI assistant.

Test Plan:
```
# License header (leaves files unmodified):
pre-commit run insert-license --all-files

# Clippy clean on every backend selection (-D warnings):
QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \
  cargo clippy -p qdp-core -p qdp-kernels --all-targets \
  --no-default-features --features hip -- -D warnings
QDP_NO_CUDA=1 cargo clippy -p qdp-core -p qdp-kernels --all-targets \
  --no-default-features --features cuda -- -D warnings
cargo clippy -p qdp-core -p qdp-kernels --all-targets \
  --no-default-features --features "cuda hip" -- -D warnings

# gfx90a HIP build:
QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \
  cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
```
Two Windows-specific corrections needed after the review-hardening commit
(206aa0c) landed.

1. encode_from_gpu_ptr_f32 and encode_from_gpu_ptr_f32_with_stream in
   amplitude.rs, angle.rs, basis.rs, and the QuantumEncoder trait default
   in mod.rs were gated #[cfg(target_os = "linux")] in the review commit.
   Their callers in lib.rs use #[cfg(qdp_gpu_platform)], which is true on
   Windows+hip, so on Windows the functions were compiled out while the
   call sites compiled in, producing an unresolved-function link error.
   Renamed all five occurrences to #[cfg(qdp_gpu_platform)] to match the
   rest of the codebase convention set by the Windows port commit.

2. fork_default_stream in device.rs creates a HIP_STREAM_NON_BLOCKING
   stream on Linux (matching cudarc, preserving dual-stream pipeline
   overlap). On Windows with TheRock ROCm 7.14, non-blocking stream writes
   are not visible via hipMemcpy after hipStreamSynchronize or
   hipDeviceSynchronize, due to a cache coherency gap in the Windows HIP
   runtime. The fix uses hipStreamCreate (a blocking stream) on Windows to
   restore correct D2H readback semantics; the pipeline overlap optimization
   is not available on this platform and runtime combination. All 21
   amplitude encode and 10 angle encode kernel tests, including the three
   stream-path tests that regressed (test_l2_norm_batch_kernel_stream,
   test_encode_from_gpu_ptr_f32_with_stream_non_default_success,
   test_encode_batch_from_gpu_ptr_f32_with_stream_success), pass with this
   change. Authored with the assistance of Claude (Anthropic).

Test Plan

gfx1201 (RX 9070 XT, Windows 11, TheRock ROCm 7.14, HIP_VISIBLE_DEVICES=0):

```
export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx1201 HIP_VISIBLE_DEVICES=0
cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1
```

330 passed, 0 failed, 4 ignored. All GPU and non-GPU suites pass.
Restore the non-blocking forked stream on all platforms and fix the
actual root cause of the Windows gfx1201 stream-test regression: a
missing ordering between default-stream buffer setup and a kernel
launched on a non-blocking stream that consumes it.

Background: fork_default_stream creates a hipStreamNonBlocking stream to
match cudarc and preserve the dual-stream copy/compute overlap. The
encoders set up their input (htod) and output (alloc_zeros) buffers with
the blocking shim copies, which run on the NULL/default stream, then
launch the norm/encode kernels on the caller's forked stream.

CUDA's legacy default stream is synchronizing, so on NVIDIA that setup is
implicitly ordered before the forked-stream kernel reads the buffer. HIP's
default stream is NOT synchronizing relative to a non-blocking stream, so
the kernel raced the setup and read stale/zero data; the norm came back as
the zero-initialized value and the result was wrong. A previous change
worked around this by creating a blocking stream on Windows, which masked
every such site at once but sacrificed the pipeline overlap and was
attributed to a nonexistent "cache coherency gap" in the Windows runtime.
A minimal HIP reproducer shows the runtime is correct: a kernel on a
non-blocking stream that is properly ordered against the default stream
(stream sync, device sync, or an event wait) reads back correct data on
both the Adrenalin and ROCm runtimes; only the unordered case fails.

The fix synchronizes the default stream at the end of the blocking
alloc_zeros / htod copies (sync_default_stream), restoring the
CUDA-equivalent ordering on every platform without touching the async
pipeline (which uses async copies on explicit streams, not these blocking
paths). The forked stream is non-blocking again everywhere. The symmetric
readback hazard (a NULL-stream dtoh after a forked-stream kernel) is closed
where it was still missing a sync: the f64/f32 batch norm-validation copies
in amplitude.rs and the phase batch finiteness-probe copy in phase.rs now
synchronize the caller's stream before reading back, matching the idiom the
other encoders already use.

This is arch-unified: correct on wave32 (gfx1100/gfx1151/gfx1201) and
wave64 (gfx90a). On Linux the default stream is already synchronizing, so
the added sync is a harmless no-op-cost ordering point.

Authored with the assistance of Claude (Anthropic).

Test Plan

gfx1201 (RX 9070 XT, Windows 11, TheRock ROCm 7.14, HIP_VISIBLE_DEVICES=0):

```
export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx1201 HIP_VISIBLE_DEVICES=0
cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
cargo test  -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1
```

All suites pass, 0 failures. The three previously-regressed stream tests
pass with the non-blocking stream restored:
test_l2_norm_batch_kernel_stream,
test_encode_from_gpu_ptr_f32_with_stream_non_default_success,
test_encode_batch_from_gpu_ptr_f32_with_stream_success. qdp-core lib 77,
gpu_ptr_encoding 68 (all 10 _with_stream variants pass), amplitude 21,
angle 10, and all other GPU and non-GPU suites green.
@jeffdaily

jeffdaily commented Jun 11, 2026

Copy link
Copy Markdown
Author

Wanted to apologize for the recent churn here. Normally I would have done this BEFORE opening the PR, but I didn't fully understand the problem. I would make a fix on Linux-gfx90a, then validated it on Win-gfx1201, then Win-gfx1201 would make a change and send it back to Linux-gfx90a, and so on. However, I finally root-caused the blocking vs non-blocking stream issues. This commit d97db73 should take care of it. See the commit message there for details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants