QDP: add an AMD GPU (ROCm/HIP) build for the native encoder engine by jeffdaily · Pull Request #1399 · apache/mahout

jeffdaily · 2026-06-11T06:53:45Z

Summary

Adds an AMD GPU build of the QDP (Quantum Data Plane) native engine under qdp/, behind a Cargo hip feature (and QDP_USE_HIP=1). The default cuda feature is unchanged, so the NVIDIA build is behavior-preserving (no functional change) and nothing on the AMD path is reachable without opting in. The separate Triton AMD backend is orthogonal and untouched. This gives AMD parity on the native engine the project is built around (pinned-buffer pool, dual-stream overlap, in-Rust DLPack ownership).

This change was authored with the assistance of Claude (Anthropic) and validated on real AMD GPU hardware (see Test Plan).

Review it in two layers.

Kernels (`qdp-kernels`)

build.rs gains a HIP branch that compiles the same six .cu with hipcc, taking --offload-arch from QDP_HIP_ARCH_LIST (default gfx90a only when unset, never a literal that overrides the env, so other AMD targets build the same source by setting QDP_HIP_ARCH_LIST alone) and linking the AMD HIP runtime; the CUDA branch (nvcc) is untouched.

hipcc ships no <cuda_runtime.h> / <cuComplex.h> / <vector_types.h>, so qdp-kernels/hip_compat/ holds forwarding shim headers of those exact names, added to the include path ONLY on the HIP build, that map the small cuda* runtime + cuComplex surface the kernels use onto HIP. A CUDA build never sees that directory and pulls the real toolkit headers, so the .cu keep their CUDA spellings unchanged.

amplitude.cu is the only kernel needing source fixes, both arch-unified: the __shfl_down_sync mask becomes 64-bit on HIP (ROCm static_asserts sizeof(mask)==8; the 32-bit literal fails to compile) while staying 0xffffffffu on CUDA, and the warp-id threadIdx.x >> 5 becomes threadIdx.x / warpSize. The latter is a genuine wave64 correctness fix: >> 5 assumes 32-lane warps, so on a 64-lane (CDNA) warp the per-warp L2-norm partial landed in the wrong shared slot; / warpSize is identical to >> 5 on 32-lane hardware and correct on 64-lane hardware.

Host (`qdp-core`)

The cudarc crate is CUDA-only with no ROCm backend, so on the HIP build it is displaced by a thin HIP-runtime shim with the SAME type names and method signatures (qdp-kernels/src/device.rs: CudaDevice, CudaSlice, CudaStream, DevicePtr/DevicePtrMut/DeviceSlice, DeviceRepr/ValidAsZeroBits, backed by the AMD HIP runtime). qdp-core/src/gpu_rt.rs re-exports it as the single import point; every use cudarc::driver became use crate::gpu_rt, so the call sites compile unchanged on either vendor.

The runtime FFI (gpu/cuda_ffi.rs) keeps its cuda* names and binds the matching hip* entry points under the hip feature. The async H2D path maps cudaMemcpyAsync to hipMemcpyAsync (its exact enqueue-and-return 1:1), NOT hipMemcpyWithStream, which synchronizes the stream and would block the host and silently serialize the dual-stream overlap pipeline; the copy-done event plus stream-wait still order copy before compute, so correctness is unchanged.

DLPack tags exported tensors kDLROCM on the HIP build (a ROCm PyTorch's from_dlpack rejects a CUDA tag); the two device-type tests are made arch-aware. The GPU stack is selected by a build-script qdp_gpu_platform cfg (Linux always; Windows when the hip feature is on) rather than a target_os == "linux" proxy, which enables the same code on Windows ROCm; on Linux this cfg is always set, so the Linux output is identical.

Docs and attribution

qdp/DEVELOPMENT.md documents the ROCm/HIP build next to the existing CUDA flow, and qdp/qdp-python/README.md notes the native engine's AMD build. New and substantially-extended files for the AMD build carry an AMD copyright line below the Apache header and name the author.

Test Plan

Built and tested on gfx90a (MI250X, ROCm 7.2.1), gfx1100 (Radeon Pro W7800), and on Windows ROCm gfx1201 (RX 9070 XT) and gfx1151 (Radeon 8060S):

export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm
cd qdp
cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip
cargo test  -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1

All Rust tests RUN (they previously skipped on AMD, since cudarc found no device) and PASS, 0 failures: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64, gpu_validation 8; plus the non-GPU regression suites (arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). The dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1, confirming the non-blocking hipMemcpyAsync H2D path.

The default cuda feature build is unchanged and remains the default; the AMD path is reachable only via --features hip / QDP_USE_HIP=1.

rich7420

@jeffdaily thanks for the patch!!! Welcome to Mahout
some comments left:

Nice port — the gpu_rt indirection + scoped hip_compat shims + the wave64 >>5→/warpSize fix are clean.

I ran the cuda path on a 2080 Ti: build + lib/fidelity/amplitude/angle all green, so default path's healthy. No AMD HW here so the HIP notes are read-only — flagging for your call. 4 inline comments; (Drop frees without binding the device) is the only one I'd want fixed before merge. Rest are minor / sanity-checks.

(Tiny nit: "CUDA byte-for-byte unchanged" isn't quite true — metrics.rs swapped driver cuMemcpyDtoH_v2→runtime cudaMemcpy, and the kernel >>5→/warpSize changes SASS. Both verified behavior-identical, so no action.)

rich7420 · 2026-06-11T08:33:56Z

+        fn drop(&mut self) {
+            if self.ptr != 0 {
+                unsafe {
+                    let _ = hipFree(self.raw_ptr());


hipFree hits the calling thread's current device, but Drop never re-binds (unlike alloc, which does self.bind()?). Multi-GPU: dropping while another device is current frees the wrong device — silent leak or wrong-pointer free. cudarc binds in Drop. A best-effort self._device.bind() before hipFree fixes it.

Fixed in 0b5042e -- Drop now does a best-effort self._device.bind() before hipFree (result swallowed, since Drop can't fail), matching the alloc path and cudarc.

rich7420 · 2026-06-11T08:33:56Z


    #[allow(dead_code)]
-    pub(crate) fn cudaPointerGetAttributes(
+    pub(crate) unsafe fn cudaPointerGetAttributes(


Heads up — this is live (lib.rs:82), not dead code. The cast assumes hipMemoryType enum values == CUDA's (DEVICE=2), which has differed across ROCm versions. Fine on 7.2.1, but a version landmine — safer to read the real hipPointerAttribute_t and map the enum explicitly.

Fixed in 0b5042e -- reads hipPointerAttribute_t and compares the type field against the named hipMemoryTypeDevice/hipMemoryTypeManaged constants, instead of assuming the enum value matches CUDA's 2.

rich7420 · 2026-06-11T08:33:56Z

+    if cfg!(feature = "hip") {
+        return true;
+    }
+    env::var("QDP_USE_HIP")


QDP_USE_HIP flips the kernel build to hipcc even with the cargo hip feature off → a default cargo build with this env set = AMD kernels + cudarc host = mismatch. Maybe error when QDP_USE_HIP and CARGO_FEATURE_HIP disagree, so it fails loud.

Fixed in 0b5042e -- added a check that panics with a clear message when QDP_USE_HIP and CARGO_FEATURE_HIP disagree in either direction, so the kernel and host builds can't silently mismatch.

rich7420 · 2026-06-11T08:33:56Z

+            self.bind()?;
+            let mut stream: *mut c_void = std::ptr::null_mut();
+            unsafe {
+                check(hipStreamCreate(&mut stream))?;


hipStreamCreate is the blocking stream ctor; cudarc forks a non-blocking one (hipStreamCreateWithFlags(.., hipStreamNonBlocking)). Could quietly serialize the H2D/compute overlap. Your overlap test passes, so just worth a confirm.

Fixed in 0b5042e -- now hipStreamCreateWithFlags(.., hipStreamNonBlocking) to match cudarc. Making it non-blocking exposed a latent race in the batch-f32 amplitude readback (default-stream copy without syncing the caller's stream, previously masked by the blocking stream), fixed in the same commit. Overlap tests still pass.

rich7420 · 2026-06-11T08:40:01Z

btw, please check the precommit errors

ryankert01 · 2026-06-11T15:30:22Z

I'm also curious about complex optimizations can or cannot be converted this easy. I think recently someone open PRs that have some real profound ones. #1390

jeffdaily · 2026-06-11T15:42:07Z

Thanks for the review, and for running the CUDA path on the 2080 Ti. All four addressed in 0b5042e:

Drop now binds the owning device before hipFree (the merge-blocker).
The hipMemoryType check reads hipPointerAttribute_t and compares against hipMemoryTypeDevice instead of the hardcoded 2.
build.rs fails loudly when QDP_USE_HIP and the hip feature disagree.
The stream is now non-blocking (hipStreamCreateWithFlags(.., hipStreamNonBlocking)) to match cudarc.

On #4: making the stream non-blocking surfaced a pre-existing latent race in the batch-f32 amplitude path -- it read the norm back on the default stream without syncing the caller's stream, masked until now by the blocking stream. Fixed with the same stream-sync the other batch paths use; it's shared (non-HIP) code, so it applies to the CUDA path too.

On byte-for-byte: that's overstated, now corrected in the PR description. metrics.rs swapped the driver cuMemcpyDtoH_v2 for the runtime cudaMemcpy, the kernel >>5 became /warpSize, and the #4 sync adds one more -- all SASS-changing but behavior-identical. The CUDA path is behavior-preserving, not literally byte/SASS-identical.

jeffdaily · 2026-06-11T16:00:23Z

@rich7420 Fixed in 6d2de29 -- ran cargo fmt over the HIP-path files (import order + line wrapping). Clippy is clean on the Rust side; the one remaining compiler warning (iqp.cu unused parameter) is pre-existing upstream, untouched by this PR.

jeffdaily · 2026-06-11T16:00:25Z

@ryankert01 This PR was the easy end: hand-written kernels plus the CUDA runtime/driver API, which hipify mostly mechanically -- the one real subtlety was wave size (the >>5 -> /warpSize fix, since CDNA warps are 64-wide).

#1390's implicit-Hadamard Ozaki engine is the hard end. It uses nvcuda::wmma and raw inline PTX (mma.sync.aligned.m16n8k32...s8.s8.s32) -- int8 tensor-core MMA -- to do an Ozaki accurate-FWT. None of that hipifies: inline PTX doesn't compile under hipcc, and it has to be rewritten against AMD matrix cores (MFMA on CDNA / rocWMMA), whose fragment shapes differ (gfx90a MFMA isn't m16n8k32), so the Ozaki tiling and the pre/post layout shuffles have to be re-derived for the AMD fragment geometry. The Ozaki scheme as an algorithm is portable; its whole payoff is fast low-precision matrix-core matmul, so it's a real reimplementation of that path, not a translation.

So: runtime/memory/plain-kernel code ports about as easily as this PR did; anything built on tensor cores, mma/PTX, or CUTLASS/CuTe needs a from-scratch AMD matrix-core path. Convertible, but not "this easy."

ryankert01

I checked out this head and ran the complete Rust suite on NVIDIA hardware (RTX 3090 Ti, CUDA toolkit): 316 passed / 0 failed, including all GPU suites and the dual-stream pipeline tests with QDP_ENABLE_OVERLAP_TRACKING=1; cargo clippy --all-targets is clean. So the refactored default CUDA path is verified on the vendor side the author could not test on.

Inline comments below cover the issues and risks found in review. The only one I'd hold merge for is the ASF header/NOTICE question; the rest are small follow-ups that could land here or later.

Please also fix the pre-commit hook!

ryankert01 · 2026-06-11T16:29:27Z

+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+// Copyright (c) 2026 Advanced Micro Devices, Inc.


ASF source-header policy (applies to all 10 files carrying this header): the PR adds Portions Copyright (c) 2026 Advanced Micro Devices, Inc. and Author: lines beneath the ASF header here and in gpu_rt.rs, cuda_ffi.rs, qdp-kernels/build.rs, qdp-python/build.rs, amplitude.cu, kernel_compat.h, and the three hip_compat/ headers.

Per https://www.apache.org/legal/src-headers.html, contributions submitted with copyright notices should have those notices removed or relocated to the project NOTICE file, and author tags in source files are discouraged across ASF projects. There is no precedent for either in this repo, and NOTICE currently has no AMD entry.

This needs a deliberate call before merge — likely either move the copyright line to NOTICE or drop it, and drop the Author: lines (git history preserves attribution).

Done in 088041c -- per ASF policy, dropped the per-file AMD copyright and Author: lines from all 10 files (the ASF Apache-2.0 header is untouched) and added the AMD attribution to NOTICE. Git history preserves authorship.

ryankert01 · 2026-06-11T16:29:28Z

+    // this explicitly). So we read the real hipPointerAttribute_t and compare its
+    // `type` against the named hipMemoryType* constants rather than a magic
+    // number, then translate to the CUDA convention the caller expects.
+    const HIP_MEMORY_TYPE_DEVICE: i32 = 2; // hipMemoryTypeDevice


These constants pin the ROCm 6+ hipMemoryType convention — as the comment above notes, ROCm 5.x used Device=1 — and this is live code: validate_cuda_input_ptr calls cudaPointerGetAttributes on every encode*_from_gpu_ptr entry point (the 64 gpu_ptr_encoding tests exercise it). On ROCm 5.x, valid device pointers would be rejected as "not device memory".

So the effective floor is ROCm >= 6.0. Suggest stating that in the DEVELOPMENT.md prerequisites (testing was on 7.2.1).

Documented in dbee1e1 -- added a ROCm >= 6.0 prerequisite to DEVELOPMENT.md noting the hipMemoryType device-pointer check uses the ROCm 6+ convention (ROCm 5.x would reject valid device pointers). Tested on 7.2.1.

ryankert01 · 2026-06-11T16:29:28Z

 use std::ffi::c_void;

+pub mod device;
+use device::{DeviceRepr, ValidAsZeroBits};


With neither cuda nor hip enabled, this import is unresolved: cargo check -p qdp-kernels --no-default-features fails with a raw E0432 (reproduced locally). Before this PR that configuration built, since cudarc was unconditional.

Consider a guard in device.rs for a clean diagnostic:

#[cfg(not(any(feature = "cuda", feature = "hip")))] compile_error!("qdp-kernels requires exactly one of the `cuda` or `hip` features");

Non-blocking — that config was only incidentally buildable before.

Added in dbee1e1 -- a #[cfg(not(any(feature = "cuda", feature = "hip")))] compile_error!(...) in device.rs, so that config now gives a clean diagnostic instead of the raw E0432.

ryankert01 · 2026-06-11T16:29:28Z

+default = ["cuda"]
+cuda = ["dep:cudarc"]
+# AMD build: hipcc-compiled kernels + the in-crate HIP device traits; no cudarc.
+hip = []


cuda and hip aren't additive: with both enabled (e.g. --features hip without --no-default-features, or via workspace feature unification), hip silently wins everywhere (device.rs, cuda_ffi.rs, the kernel build) while cudarc still compiles, unused. The QDP_USE_HIP consistency panic in build.rs keeps the outcome coherent, which is good — but a one-line note here (and in qdp-core's [features]) that the two are mutually exclusive with hip taking precedence would prevent surprises.

Added in dbee1e1 -- a note in both qdp-kernels and qdp-core [features] that cuda and hip are mutually exclusive and hip takes precedence (kernels build for HIP; cudarc is compiled but unused).

ryankert01 · 2026-06-11T16:29:28Z

 The first command is what `maturin develop --release` runs on CI; the
 second verifies tests type-check in the CUDA build.

+### AMD GPU build (ROCm / HIP)


Maintenance risk worth acknowledging: CI has no AMD runners, so after merge the HIP path is never exercised and can silently rot as the CUDA path evolves. The gpu_rt seam minimizes divergence pressure, but regressions will only surface when someone rebuilds on AMD hardware. A compile-only hipcc job (ROCm apt packages on ubuntu runners) could be a cheap follow-up.

I think we should add a build github action workflow for hip!

Agreed -- the gpu_rt seam limits divergence, but without AMD runners the HIP path can still bit-rot. I'll put up a compile-only hipcc job (ROCm apt packages on an ubuntu runner) as a separate follow-up PR so the build is at least guarded.

Adds an AMD GPU build of the QDP (Quantum Data Plane) native engine under qdp/, behind a Cargo `hip` feature (and QDP_USE_HIP=1). The default `cuda` feature is unchanged, so the NVIDIA build is byte-for-byte identical; nothing on the AMD path is reachable without opting in. The separate Triton AMD backend is orthogonal and untouched -- this gives AMD parity on the native engine the project is built around (pinned-buffer pool, dual-stream overlap, in-Rust DLPack ownership). Authored with the assistance of Claude (Anthropic). Review it in two layers. Kernels (qdp-kernels): build.rs gains a HIP branch that compiles the same six .cu with hipcc, taking --offload-arch from QDP_HIP_ARCH_LIST (default gfx90a only when unset, never a literal that overrides the env, so other AMD targets build the same source by setting QDP_HIP_ARCH_LIST alone) and linking the AMD HIP runtime; the CUDA branch (nvcc) is untouched. hipcc ships no <cuda_runtime.h> / <cuComplex.h> / <vector_types.h>, so qdp-kernels/hip_compat/ holds forwarding shim headers of those exact names, added to the include path ONLY on the HIP build, that map the small cuda* runtime + cuComplex surface the kernels use onto HIP. A CUDA build never sees that directory and pulls the real toolkit headers, so the .cu keep their CUDA spellings unchanged. amplitude.cu is the only kernel needing source fixes, both arch-unified: the __shfl_down_sync mask becomes 64-bit on HIP (ROCm static_asserts sizeof(mask)==8; the 32-bit literal fails to compile) while staying 0xffffffffu on CUDA, and the warp-id `threadIdx.x >> 5` becomes `threadIdx.x / warpSize`. The latter is a genuine wave64 correctness fix: `>> 5` assumes 32-lane warps, so on a 64-lane (CDNA) warp the per-warp L2-norm partial landed in the wrong shared slot; `/ warpSize` is identical to `>> 5` on 32-lane hardware and correct on 64-lane hardware. Host (qdp-core): the cudarc crate is CUDA-only with no ROCm backend, so on the HIP build it is displaced by a thin HIP-runtime shim with the SAME type names and method signatures (qdp-kernels/src/device.rs: CudaDevice, CudaSlice, CudaStream, DevicePtr/DevicePtrMut/DeviceSlice, DeviceRepr/ValidAsZeroBits, backed by the AMD HIP runtime). qdp-core/src/gpu_rt.rs re-exports it as the single import point; every `use cudarc::driver` became `use crate::gpu_rt`, so the call sites compile unchanged on either vendor. The runtime FFI (gpu/cuda_ffi.rs) keeps its cuda* names and binds the matching hip* entry points under the hip feature. The async H2D path maps cudaMemcpyAsync to hipMemcpyAsync (its exact enqueue-and-return 1:1), NOT hipMemcpyWithStream, which synchronizes the stream and would block the host and silently serialize the dual-stream overlap pipeline; the copy-done event plus stream-wait still order copy before compute, so correctness is unchanged. The shim's htod_sync_copy_into asserts dst.len() == src.len() to match cudarc's contract. DLPack tags exported tensors kDLROCM on the HIP build (a ROCm PyTorch's from_dlpack rejects a CUDA tag); the two device-type tests are made arch-aware. The GPU stack is selected by a build-script `qdp_gpu_platform` cfg (Linux always; Windows when the hip feature is on) rather than a target_os == "linux" proxy, which enables the same code on Windows ROCm; on Linux this cfg is always set, so the Linux output is identical. qdp/DEVELOPMENT.md documents the ROCm/HIP build next to the existing CUDA flow, and qdp/qdp-python/README.md notes the native engine's AMD build. New and substantially-extended files for the AMD build carry an AMD copyright line below the Apache header and name the author. Test Plan gfx90a (MI250X, ROCm 7.2.1), gfx1100 (Radeon Pro W7800), and on Windows ROCm gfx1201 (RX 9070 XT) and gfx1151 (Radeon 8060S): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm cd qdp cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` All Rust tests RUN (they previously skipped on AMD, since cudarc found no device) and PASS: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64, gpu_validation 8; plus the non-GPU regression suites (arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). 0 failures. The dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1, confirming the non-blocking hipMemcpyAsync H2D path. ``` maturin build --features hip --profile dev --compatibility linux \ --manifest-path qdp/qdp-python/Cargo.toml --out dist/ pip install --no-deps dist/qumat_qdp-*.whl # extension only; keeps the venv's ROCm torch python -m pytest testing/qdp testing/qdp_python qdp/qdp-python/tests -q ``` Python parity against the torch reference, with a ROCm PyTorch: 301 passed, 12 skipped (multi-GPU / tensorflow-absent / the torch reference path's CUDA-centric arch-capability check / the separate Triton backend not built here), 0 failed. Encoder outputs match the reference within tolerance and DLPack round-trips to torch as ROCm tensors. The dev profile is used for the Python cdylib because the release `lto = "fat"` profile yields a bitcode-only shared object under the HIP toolchain.

Addresses four points from review on the AMD/HIP build. All are HIP-path fixes except the stream-sync, which corrects a latent cross-stream ordering hazard shared with the CUDA path. Authored with the assistance of Claude (Anthropic). device.rs CudaSlice::drop now re-binds the owning device before hipFree. hipFree releases on the calling thread's current device, so on multi-GPU a drop while a different device is current would free against the wrong device (the alloc path already binds; cudarc binds in Drop). The bind is best-effort since Drop cannot return an error. cuda_ffi.rs cudaPointerGetAttributes (HIP path) no longer assumes the hipMemoryType enum value equals CUDA's. hipMemoryType values are not guaranteed stable across ROCm releases, so it reads a real hipPointerAttribute_t and compares the type against the named hipMemoryTypeDevice / hipMemoryTypeManaged constants, translating to the CUDA convention the caller checks rather than reinterpreting a magic number. qdp-kernels/build.rs now fails loudly when QDP_USE_HIP and the `hip` Cargo feature disagree. QDP_USE_HIP=1 flips the kernel build to hipcc while the host backend is chosen by the feature, so a mismatch would silently link AMD kernels against the cudarc host (or vice versa). The build now panics with a clear message instead. device.rs fork_default_stream creates a non-blocking stream (hipStreamCreateWithFlags with hipStreamNonBlocking) to match cudarc; a blocking stream serializes copy/compute against the default stream. This exposed a pre-existing ordering bug in the batch-f32 stream path (encode_batch_from_gpu_ptr_f32_with_stream): the norm kernel runs on the caller's stream but the norm readback copies on the default stream without first synchronizing the caller's stream, so on a non-blocking stream the readback raced the zero-initialized buffer and reported a zero norm. Added the sync_cuda_stream the single-sample path already uses; correct on wave32 and wave64 and on the CUDA path. Test Plan gfx90a (MI250X, ROCm 7.2.1): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm HIP_VISIBLE_DEVICES=3 cd qdp cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` All Rust tests pass: qdp-kernels 31 (amplitude 21, angle 10); qdp-core lib 77; GPU suites gpu_angle 12, gpu_api_workflow 8, gpu_basis 7, gpu_dlpack 9, gpu_fidelity 17, gpu_iqp 22, gpu_memory_safety 4, gpu_norm_f32 2, gpu_ptr_encoding 64, gpu_validation 8; non-GPU suites (arrow/null/numpy/parquet/preprocessing/tensorflow/torch/types). 0 failures. The dual-stream async-pipeline tests pass with QDP_ENABLE_OVERLAP_TRACKING=1, confirming the non-blocking stream preserves copy/compute overlap. With QDP_USE_HIP=1 and the `hip` feature off, the build now aborts with the mismatch error as intended.

Run rustfmt over the HIP build additions (import ordering and line wrapping) to satisfy the pre-commit fmt hook. Formatting only, no behavior change. Authored with assistance from Claude (Anthropic).

The qdp-kernels build script rejected a benign configuration and could not run in a toolchain-less environment, which broke the Pre-commit clippy job (it runs `cargo clippy --all-targets --all-features`, turning the `hip` Cargo feature on without setting QDP_USE_HIP, on a runner with no ROCm toolkit). Two fixes. First, check_hip_consistency() dropped the panic for "hip feature on but QDP_USE_HIP unset": that is not a host/kernel mismatch, because hip_requested() already builds the kernels for HIP whenever the feature is on, so host and kernels agree. The real mismatch (QDP_USE_HIP set with the `hip` feature off, which would build hipcc kernels against the cudarc host) still panics, so the reviewer's guard is preserved. Second, build_hip() now probes for a runnable hipcc the same way the CUDA path probes for nvcc. When hipcc is absent it emits the qdp_no_cuda stub cfg plus an explanatory warning and returns instead of failing, so `cargo check`/`clippy` (including `--all-features`) link against the host stubs and succeed without the GPU toolchain. When hipcc is present the behavior is unchanged: real kernels are compiled. This change was authored with the assistance of Claude. Test Plan: Real HIP build and GPU tests on gfx90a (hipcc present, graceful path NOT taken): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip HIP_VISIBLE_DEVICES=0 cargo test -p qdp-core --no-default-features \ --features hip --test gpu_ptr_encoding --test gpu_fidelity # gpu_ptr_encoding: 64 passed; gpu_fidelity: 17 passed ``` Toolchain-less CI scenario (hip feature on, QDP_USE_HIP unset, hipcc absent) no longer panics and clippy succeeds via the stub path: ``` QDP_HIPCC=/nonexistent-hipcc cargo clippy --manifest-path qdp/Cargo.toml \ -p qdp-kernels -p qdp-core --all-targets --features hip cargo fmt --manifest-path qdp/Cargo.toml --all -- --check ``` Reviewer guard still enforced (QDP_USE_HIP=1 with the `hip` feature off panics): ``` QDP_USE_HIP=1 cargo build -p qdp-kernels ```

Addresses three review points on the AMD HIP backend. DEVELOPMENT.md now states ROCm >= 6.0 as a prerequisite under the AMD GPU build section: the device-pointer check in cuda_ffi.rs relies on the ROCm 6+ hipMemoryType device convention (ROCm 5.x numbered Device=1 and would reject valid device pointers); tested on ROCm 7.2.1. qdp-kernels/src/device.rs adds a module-scope compile_error! for the configuration with neither the cuda nor hip feature enabled, so that `cargo check -p qdp-kernels --no-default-features` reports a clear requirement instead of a raw unresolved-import E0432 from cudarc. Both qdp-kernels/Cargo.toml and qdp-core/Cargo.toml gain a comment in their [features] sections noting that cuda and hip are mutually exclusive and that hip takes precedence if both end up enabled (e.g. via workspace feature unification): kernels build for HIP and cudarc is compiled but unused. Comment only, no behavior change. This change was authored with the assistance of Claude. Test Plan: ```bash cd qdp # Clean diagnostic with neither vendor feature (compile_error!, not E0432) cargo check -p qdp-kernels --no-default-features # Default CUDA path still type-checks QDP_NO_CUDA=1 cargo check -p qdp-core -p qdp-kernels # Real HIP build for gfx90a QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip # Formatting cargo fmt --manifest-path Cargo.toml --all -- --check ```

Per the ASF source-header policy (apache.org/legal/src-headers.html), source files should carry only the standard Apache-2.0 license header, with third-party and contributor attribution recorded in the project NOTICE file rather than in per-file copyright/author comment lines. This drops the per-file "Copyright (c) 2026 Advanced Micro Devices, Inc." and "Author: Jeff Daily <jeff.daily@amd.com>" comment lines that the QDP AMD/HIP GPU build added to ten files, restoring the canonical ASF-only header on each. The AMD attribution is moved to the repo-root NOTICE file as a parallel entry in the existing ASF house style. No code changes; the edits are comment-only. This addresses the source-header review feedback on the QDP AMD/HIP build contribution. The work was authored with the assistance of Claude, an AI assistant. Test Plan: ``` cargo fmt --manifest-path qdp/Cargo.toml --all -- --check QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip QDP_NO_CUDA=1 cargo check -p qdp-core -p qdp-kernels ``` All three pass (exit 0) on a gfx90a host with ROCm 7.2.1. The header edits remove comment lines only, so device code is unchanged.

jeffdaily · 2026-06-11T17:04:11Z

@ryankert01 Thanks for running the full suite on the 3090 Ti -- good to have the default CUDA path vendor-confirmed.

The multi-GPU DLPack device-type assertion hardcoded kDLCUDA=2, which fails on a ROCm build where the engine reports kDLROCM=10. The device-0 version of the same test (test_dlpack_device) was already made arch-aware in the original port commit but the non-zero-device variant was missed because the gfx90a validator ran with HIP_VISIBLE_DEVICES=N (single GPU visible), skipping the multi-GPU test. Caught on gfx1100 where 4 GPUs are visible. Fix matches the device-0 pattern exactly: use getattr(torch.version, "hip", None) to select kDLROCM or kDLCUDA. Test Plan gfx1100 (Radeon Pro W7800, ROCm 7.2.1), all 4 GPUs visible: python -m pytest testing/qdp testing/qdp_python qdp/qdp-python/tests -q 305 passed, 10 skipped, 0 failed. test_dlpack_device_id_non_zero passes. Authored with the assistance of Claude (Anthropic).

Two follow-ups to the recent ASF-header cleanup so the pre-commit suite (run in CI as `pre-commit run --all-files`) is green. First, qdp/qdp-python/build.rs lost the leading blank comment line during the header rewrite, so its header no longer matched the canonical Apache-2.0 header in testing/utils/.license-header.txt and the insert-license hook kept re-modifying it. The header is now byte-identical to the other build.rs files. Second, clippy with -D warnings flagged the `stream.stream as *mut c_void` casts at kernel-launch call sites as unnecessary_cast. The cast is a real conversion on the CUDA backend, where cudarc's CUstream is a distinct pointer type, but a no-op on HIP, where the stream field is already *mut c_void. Removing it would break the CUDA build, so the cast stays and a HIP-scoped `#![cfg_attr(feature = "hip", allow(clippy::unnecessary_cast))]` silences the redundancy only where it is redundant. This mirrors the existing crate-level `unused_unsafe` allow, which has the same dual-backend rationale. The allow is applied at each affected crate root (qdp-core lib and the two integration test crates that contain these call sites). This work was authored with the assistance of Claude, an AI assistant. Test Plan: ``` # License header (leaves files unmodified): pre-commit run insert-license --all-files # Clippy clean on every backend selection (-D warnings): QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo clippy -p qdp-core -p qdp-kernels --all-targets \ --no-default-features --features hip -- -D warnings QDP_NO_CUDA=1 cargo clippy -p qdp-core -p qdp-kernels --all-targets \ --no-default-features --features cuda -- -D warnings cargo clippy -p qdp-core -p qdp-kernels --all-targets \ --no-default-features --features "cuda hip" -- -D warnings # gfx90a HIP build: QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx90a ROCM_PATH=/opt/rocm \ cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip ```

Two Windows-specific corrections needed after the review-hardening commit (206aa0c) landed. 1. encode_from_gpu_ptr_f32 and encode_from_gpu_ptr_f32_with_stream in amplitude.rs, angle.rs, basis.rs, and the QuantumEncoder trait default in mod.rs were gated #[cfg(target_os = "linux")] in the review commit. Their callers in lib.rs use #[cfg(qdp_gpu_platform)], which is true on Windows+hip, so on Windows the functions were compiled out while the call sites compiled in, producing an unresolved-function link error. Renamed all five occurrences to #[cfg(qdp_gpu_platform)] to match the rest of the codebase convention set by the Windows port commit. 2. fork_default_stream in device.rs creates a HIP_STREAM_NON_BLOCKING stream on Linux (matching cudarc, preserving dual-stream pipeline overlap). On Windows with TheRock ROCm 7.14, non-blocking stream writes are not visible via hipMemcpy after hipStreamSynchronize or hipDeviceSynchronize, due to a cache coherency gap in the Windows HIP runtime. The fix uses hipStreamCreate (a blocking stream) on Windows to restore correct D2H readback semantics; the pipeline overlap optimization is not available on this platform and runtime combination. All 21 amplitude encode and 10 angle encode kernel tests, including the three stream-path tests that regressed (test_l2_norm_batch_kernel_stream, test_encode_from_gpu_ptr_f32_with_stream_non_default_success, test_encode_batch_from_gpu_ptr_f32_with_stream_success), pass with this change. Authored with the assistance of Claude (Anthropic). Test Plan gfx1201 (RX 9070 XT, Windows 11, TheRock ROCm 7.14, HIP_VISIBLE_DEVICES=0): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx1201 HIP_VISIBLE_DEVICES=0 cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` 330 passed, 0 failed, 4 ignored. All GPU and non-GPU suites pass.

Restore the non-blocking forked stream on all platforms and fix the actual root cause of the Windows gfx1201 stream-test regression: a missing ordering between default-stream buffer setup and a kernel launched on a non-blocking stream that consumes it. Background: fork_default_stream creates a hipStreamNonBlocking stream to match cudarc and preserve the dual-stream copy/compute overlap. The encoders set up their input (htod) and output (alloc_zeros) buffers with the blocking shim copies, which run on the NULL/default stream, then launch the norm/encode kernels on the caller's forked stream. CUDA's legacy default stream is synchronizing, so on NVIDIA that setup is implicitly ordered before the forked-stream kernel reads the buffer. HIP's default stream is NOT synchronizing relative to a non-blocking stream, so the kernel raced the setup and read stale/zero data; the norm came back as the zero-initialized value and the result was wrong. A previous change worked around this by creating a blocking stream on Windows, which masked every such site at once but sacrificed the pipeline overlap and was attributed to a nonexistent "cache coherency gap" in the Windows runtime. A minimal HIP reproducer shows the runtime is correct: a kernel on a non-blocking stream that is properly ordered against the default stream (stream sync, device sync, or an event wait) reads back correct data on both the Adrenalin and ROCm runtimes; only the unordered case fails. The fix synchronizes the default stream at the end of the blocking alloc_zeros / htod copies (sync_default_stream), restoring the CUDA-equivalent ordering on every platform without touching the async pipeline (which uses async copies on explicit streams, not these blocking paths). The forked stream is non-blocking again everywhere. The symmetric readback hazard (a NULL-stream dtoh after a forked-stream kernel) is closed where it was still missing a sync: the f64/f32 batch norm-validation copies in amplitude.rs and the phase batch finiteness-probe copy in phase.rs now synchronize the caller's stream before reading back, matching the idiom the other encoders already use. This is arch-unified: correct on wave32 (gfx1100/gfx1151/gfx1201) and wave64 (gfx90a). On Linux the default stream is already synchronizing, so the added sync is a harmless no-op-cost ordering point. Authored with the assistance of Claude (Anthropic). Test Plan gfx1201 (RX 9070 XT, Windows 11, TheRock ROCm 7.14, HIP_VISIBLE_DEVICES=0): ``` export QDP_USE_HIP=1 QDP_HIP_ARCH_LIST=gfx1201 HIP_VISIBLE_DEVICES=0 cargo build -p qdp-core -p qdp-kernels --no-default-features --features hip cargo test -p qdp-core -p qdp-kernels --no-default-features --features hip -- --test-threads=1 ``` All suites pass, 0 failures. The three previously-regressed stream tests pass with the non-blocking stream restored: test_l2_norm_batch_kernel_stream, test_encode_from_gpu_ptr_f32_with_stream_non_default_success, test_encode_batch_from_gpu_ptr_f32_with_stream_success. qdp-core lib 77, gpu_ptr_encoding 68 (all 10 _with_stream variants pass), amplitude 21, angle 10, and all other GPU and non-GPU suites green.

jeffdaily · 2026-06-11T19:34:14Z

Wanted to apologize for the recent churn here. Normally I would have done this BEFORE opening the PR, but I didn't fully understand the problem. I would make a fix on Linux-gfx90a, then validated it on Win-gfx1201, then Win-gfx1201 would make a change and send it back to Linux-gfx90a, and so on. However, I finally root-caused the blocking vs non-blocking stream issues. This commit d97db73 should take care of it. See the commit message there for details.

jeffdaily requested review from 400Ping, guan404ming and ryankert01 as code owners June 11, 2026 06:53

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026

mahout: lead -> pr-open (apache/mahout#1399)

871959c

rich7420 reviewed Jun 11, 2026

View reviewed changes

ryankert01 reviewed Jun 11, 2026

View reviewed changes

jeffdaily added 4 commits June 12, 2026 00:35

[ROCm] QDP: apply cargo fmt to the HIP-path additions

bc6206b

Run rustfmt over the HIP build additions (import ordering and line wrapping) to satisfy the pre-commit fmt hook. Formatting only, no behavior change. Authored with assistance from Claude (Anthropic).

ryankert01 force-pushed the moat-port branch from cc3859b to 4d54e9c Compare June 11, 2026 16:35

jeffdaily added 2 commits June 11, 2026 16:43

jeffdaily added 4 commits June 11, 2026 17:29

Conversation

jeffdaily commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Kernels (qdp-kernels)

Host (qdp-core)

Docs and attribution

Test Plan

Uh oh!

rich7420 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rich7420 Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rich7420 Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rich7420 Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rich7420 Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rich7420 commented Jun 11, 2026

Uh oh!

ryankert01 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffdaily commented Jun 11, 2026

Uh oh!

jeffdaily commented Jun 11, 2026

Uh oh!

jeffdaily commented Jun 11, 2026

Uh oh!

ryankert01 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeffdaily commented Jun 11, 2026

Uh oh!

jeffdaily commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

jeffdaily commented Jun 11, 2026 •

edited

Loading

Kernels (`qdp-kernels`)

Host (`qdp-core`)

rich7420 left a comment •

edited

Loading

rich7420 Jun 11, 2026 •

edited

Loading

rich7420 Jun 11, 2026 •

edited

Loading

rich7420 Jun 11, 2026 •

edited

Loading

rich7420 Jun 11, 2026 •

edited

Loading

ryankert01 commented Jun 11, 2026 •

edited

Loading

ryankert01 left a comment •

edited

Loading

jeffdaily commented Jun 11, 2026 •

edited

Loading