[QDP] [feature] Pr5 implicit hadamard engine by aloha1357 · Pull Request #1390 · apache/mahout

aloha1357 · 2026-06-07T19:34:03Z

Related Issues

related #1385

Changes

Why

As established in the previous Kronecker Decomposition PR, memory is a significant bottleneck when processing high-qubit circuits ($N \ge 14$). An explicit dense Hadamard matrix has $2^N \times 2^N = 4^N$ entries, so its $O(4^N)$ footprint quickly exhausts GPU VRAM and causes Out-Of-Memory errors.

Even with the Kronecker Decomposition splitting the matrix into smaller blocks, generating and storing the explicit dense $H$ matrices in memory before applying Tensor Core operations is highly inefficient.

We need a way to perform Dense Matrix Multiplications (GEMM) on the Tensor Cores without ever storing the Hadamard Matrix in Global Memory.
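To make the $O(4^N)$ growth concrete, here is a minimal sketch of the storage an explicit matrix would need. The helper name `dense_hadamard_bytes` and the 4-byte (FP32) element size are assumptions for illustration, not values taken from the PR.

```python
def dense_hadamard_bytes(n_qubits: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to store one explicit 2^N x 2^N Hadamard matrix.

    bytes_per_element=4 assumes FP32 storage (an illustrative assumption).
    """
    dim = 1 << n_qubits                    # state dimension 2^N
    return dim * dim * bytes_per_element   # 4^N entries in total


if __name__ == "__main__":
    for n in (12, 14, 16):
        gib = dense_hadamard_bytes(n) / 2**30
        print(f"N={n}: {gib:.4g} GiB for a single dense matrix")
```

Under these assumptions a single matrix takes 1 GiB at $N = 14$ and 16 GiB at $N = 16$, before any batch buffers or workspace are counted.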

How

This PR introduces the Matrix-Free Implicit Hadamard Ozaki Engine.

Implicit Matrix Generation: The `ImplicitHadamardOzakiEngine` exploits the closed form of the Hadamard entries, $h_{i,j} = (-1)^{\operatorname{popc}(i \mathbin{\&} j)}$ (where popc is the population count of the bitwise AND of the row and column indices), to compute matrix elements on the fly directly in Shared Memory.
Ozaki Multi-pass Tensor Core Execution: Using the Ozaki INT8 scheme, the GEMM runs natively on the Tensor Cores through the `.m16n8k32.s8` instructions. Because the Hadamard values are always $\pm 1$, they are exactly representable in INT8, so the Hadamard operand introduces no quantization error despite the INT8 pipeline.
Removed Fallback: Replaced the `naive_implicit_hadamard_gemm_kernel` placeholder from PR 4 with actual calls to `engine.execute_implicit_hadamard`.
Build System Fix: Updated `build.rs` to drop the `sm_75` (Turing) target fallback, which is unsupported because this Tensor Core instruction requires `sm_80` (Ampere) or higher.
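The popcount formula above can be checked on the CPU with a short sketch. This is plain Python rather than the CUDA kernel in this PR, and all function names are hypothetical. It regenerates each entry from its indices, compares against an explicitly built Sylvester matrix, and applies the transform without ever storing the matrix.

```python
from typing import List


def popcount(x: int) -> int:
    """Number of set bits in x (the role popc plays in the formula)."""
    return bin(x).count("1")


def hadamard_entry(i: int, j: int) -> int:
    """Unnormalized Hadamard entry (-1)^popc(i & j), computed from indices alone."""
    return -1 if popcount(i & j) & 1 else 1


def sylvester_hadamard(n_qubits: int) -> List[List[int]]:
    """Explicit reference matrix via H_2m = [[H, H], [H, -H]]; used only for checking."""
    h = [[1]]
    for _ in range(n_qubits):
        top = [row + row for row in h]
        bottom = [row + [-v for v in row] for row in h]
        h = top + bottom
    return h


def implicit_hadamard_apply(x: List[int]) -> List[int]:
    """Compute y = H x while regenerating every entry on demand (no stored matrix)."""
    dim = len(x)
    return [sum(hadamard_entry(i, j) * x[j] for j in range(dim)) for i in range(dim)]


if __name__ == "__main__":
    n = 3
    dim = 1 << n
    reference = sylvester_hadamard(n)
    # Every implicit entry matches the explicit matrix and fits in INT8.
    assert all(hadamard_entry(i, j) == reference[i][j]
               for i in range(dim) for j in range(dim))
    assert all(-128 <= hadamard_entry(i, j) <= 127
               for i in range(dim) for j in range(dim))
    x = [3, -1, 4, 1, -5, 9, 2, -6]
    # H * H = dim * I, so applying the transform twice scales x by dim.
    assert implicit_hadamard_apply(implicit_hadamard_apply(x)) == [dim * v for v in x]
    print("implicit entries match the explicit Sylvester matrix")
```

Since every entry is $+1$ or $-1$, integer inputs give exact integer outputs here, which is the property the INT8 path relies on for the Hadamard operand.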

Benchmark Results

Environment: Dev Machine (NVIDIA GeForce RTX 4060 Laptop GPU)
Configuration: Batch size 64, 30 iterations, GPU-vs-GPU only (no PyTorch reference)
Script: qdp/qdp-python/benchmark/benchmark_pr5.py
Measured: 2026-06-10
Branch: pr5-implicit-hadamard-engine (ba0376a4f)

Encoding path options (`--path`)

| Value | Description |
|---|---|
| `fwt` | `engine.encode(..., "iqp")`: standard FWT dispatch |
| `tc` | `engine.encode_batch_tc(...)`: Ozaki Kronecker Tensor Core path |
| `both` | Run both paths and print FWT/TC speedup |
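A minimal sketch of how a `--path` option with these three values could be parsed. This is not the contents of `benchmark_pr5.py`; the helper names and the `both` default are assumptions for illustration.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the three documented --path values."""
    parser = argparse.ArgumentParser(description="FWT vs TC encoding benchmark (sketch)")
    parser.add_argument(
        "--path",
        choices=("fwt", "tc", "both"),
        default="both",  # assumed default; the real script may differ
        help="which encoding path to time",
    )
    return parser


def paths_to_run(path: str) -> List[str]:
    """Expand 'both' into the two concrete paths; pass single paths through."""
    return ["fwt", "tc"] if path == "both" else [path]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("running:", ", ".join(paths_to_run(args.path)))
```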

FWT vs Ozaki TC (after PR5)

| Qubits | Dim | FWT (ms) | TC (ms) | Speedup (FWT/TC) | Notes |
|---|---|---|---|---|---|
| 8 | 256 | 0.188 | 0.142 | 1.33x | Fused shared-memory TC path |
| 10 | 1024 | 0.312 | 0.339 | 0.92x | Within noise |
| 12 | 4096 | 0.821 | 0.833 | 0.99x | Parity (fused path) |
| 14 | 16384 | 3.060 | 15.870 | 0.19x | Kronecker TC-GEMM scaffold |
| 16 | 65536 | 21.925 | 70.723 | 0.31x | Kronecker TC-GEMM scaffold |
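The Speedup column is the ratio of FWT time to TC time, so values above 1 mean the TC path is faster and values below 1 mean FWT is faster. The sketch below recomputes the ratio from the rounded timings in the table, so the last digit can differ from the reported column. The helper name `speedup` is hypothetical.

```python
def speedup(fwt_ms: float, tc_ms: float) -> float:
    """FWT/TC ratio: above 1 means the TC path is faster."""
    return fwt_ms / tc_ms


# (qubits, FWT ms, TC ms) copied from the table above
ROWS = [
    (8, 0.188, 0.142),
    (10, 0.312, 0.339),
    (12, 0.821, 0.833),
    (14, 3.060, 15.870),
    (16, 21.925, 70.723),
]

if __name__ == "__main__":
    for qubits, fwt_ms, tc_ms in ROWS:
        ratio = speedup(fwt_ms, tc_ms)
        faster = "TC" if ratio > 1.0 else "FWT"
        print(f"{qubits:>2} qubits: {ratio:.2f}x ({faster} faster)")
```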

Checklist

Added or updated unit tests for all changes (Verified passing against existing CI test suite)
Added or updated documentation for all changes (Added explanatory inline comments for PR)

…tests
- Expose encode_batch_tc through Rust core, PyO3, and Python backend
- Fix IQP TC kernel: correct batch stride for ZZ params and raise FWT_SHARED_MEM_THRESHOLD to 12 for fused shared-memory path at N<=12
- Align ImplicitHadamardOzaki.cu with PR6 ldmatrix/alignment fixes
- Add benchmark_pr5.py with --path fwt|tc|both (GPU-vs-GPU, no PyTorch)
- Add test_iqp_tc_path.py smoke and normalization tests


aloha1357 requested review from 400Ping, guan404ming and ryankert01 as code owners June 7, 2026 19:34

aloha1357 added 8 commits June 10, 2026 22:49

feat(qdp): introduce batch throughput optimization scaffolding for TC

719bafb

feat(qdp): introduce batch throughput optimization scaffolding for TC

c3d0ed8

feat(qdp): introduce shared memory fused FWT for small qubit counts

cfc8493

feat(qdp): restructure FWT into Kronecker decomposition blocked architecture

b1a32e7

chore: remove PR1 agent comments, trim kernel docs, add PR4 benchmark

62c249b

feat(qdp): implement Matrix-Free Implicit Hadamard Tensor Core engine

507661c

chore(qdp): clean up agent comments and add independent PR5 benchmark script

c249cc0

aloha1357 force-pushed the pr5-implicit-hadamard-engine branch from ba0376a to 806a419 Compare June 10, 2026 21:06

chore(qdp): remove dev-only PR micro-benchmarks and reports from code branch

e25ce59

ryankert01 mentioned this pull request Jun 11, 2026

QDP: add an AMD GPU (ROCm/HIP) build for the native encoder engine #1399

Open

aloha1357 wants to merge 9 commits into apache:main from aloha1357:pr5-implicit-hadamard-engine
