
[QDP] [feature] Pr5 implicit hadamard engine#1390

Open
aloha1357 wants to merge 9 commits into
apache:mainfrom
aloha1357:pr5-implicit-hadamard-engine


@aloha1357 aloha1357 commented Jun 7, 2026

Contributor

Related Issues

related #1385

Changes

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Why

As established in the previous Kronecker Decomposition PR, memory is a significant bottleneck when processing high-qubit circuits ($N \ge 14$). The dense Hadamard transform on $N$ qubits is a $2^N \times 2^N$ matrix, so an explicit representation needs $O(4^N)$ entries, which exhausts GPU VRAM and causes out-of-memory errors.

Even with the Kronecker Decomposition splitting the matrix into smaller blocks, generating and storing the explicit dense $H$ blocks in memory before applying the Tensor Core operations is inefficient.

We need a way to run dense matrix multiplications (GEMM) on the Tensor Cores without ever storing the Hadamard matrix in global memory.
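To put the $O(4^N)$ growth in numbers, here is a small back-of-the-envelope sketch. The helper name `dense_hadamard_bytes` and the element widths are assumptions for illustration; the actual dtype used by the dense path may differ.

```python
def dense_hadamard_bytes(num_qubits: int, bytes_per_entry: int) -> int:
    """Bytes needed to store an explicit 2^N x 2^N matrix (hypothetical helper)."""
    dim = 2 ** num_qubits
    return dim * dim * bytes_per_entry


if __name__ == "__main__":
    # Element widths are illustrative assumptions, not taken from the PR.
    widths = (("int8", 1), ("float32", 4), ("float64", 8))
    for n in (12, 14, 16):
        for label, width in widths:
            gib = dense_hadamard_bytes(n, width) / 2**30
            print(f"N={n:2d}  {label:8s} {gib:10.3f} GiB")
```

Each extra pair of qubits multiplies the footprint by 16, independent of the element width.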

How

This PR introduces the Matrix-Free Implicit Hadamard Ozaki Engine.

  • Implicit Matrix Generation: The ImplicitHadamardOzakiEngine uses the structure of the Hadamard matrix ($h_{i,j} = (-1)^{\operatorname{popc}(i \,\&\, j)}$, where popc counts the set bits of the bitwise AND) to compute matrix elements on the fly, directly in shared memory.
  • Ozaki Multi-pass Tensor Core Execution: Using the Ozaki INT8 scheme, we use the .m16n8k32.s8 Tensor Core instructions to run the GEMM in hardware. Because the Hadamard entries are always $\pm 1$, they are exactly representable in INT8, so the Hadamard operand introduces no quantization error despite using the INT8 pipeline.
  • Removed Fallback: Replaced the naive_implicit_hadamard_gemm_kernel placeholder from PR 4 with the actual calls to engine.execute_implicit_hadamard.
  • Build System Fix: Updated build.rs to drop the unsupported sm_75 (Turing) target fallback, as this specific Tensor Core instruction explicitly requires sm_80 (Ampere) or higher.
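The idea behind the first two bullets can be sketched on the CPU in plain Python. This is only an illustration, not the CUDA kernel in this PR: the function names are hypothetical, the input is assumed to be already converted to fixed-point integers, and the slicing uses balanced base-256 digits as a simplified stand-in for the Ozaki splitting of floating-point mantissas.

```python
def hadamard_entry(i: int, j: int) -> int:
    """Unnormalized Hadamard entry (-1)^popcount(i & j), computed on the fly."""
    return -1 if bin(i & j).count("1") & 1 else 1


def split_int8_slices(value: int, num_slices: int) -> list:
    """Split an integer into balanced base-256 digits, each in [-128, 127]."""
    digits = []
    for _ in range(num_slices):
        d = ((value + 128) % 256) - 128  # digit that fits in a signed 8-bit lane
        digits.append(d)
        value = (value - d) // 256
    if value != 0:
        raise ValueError("value does not fit in the requested number of slices")
    return digits


def implicit_hadamard_multipass(x_fixed: list, num_slices: int = 3) -> list:
    """Compute y = H x with one integer pass per INT8 slice; H is never stored."""
    dim = len(x_fixed)
    slices = [split_int8_slices(v, num_slices) for v in x_fixed]
    y = [0] * dim
    for s in range(num_slices):  # one pass per slice
        weight = 256 ** s
        for i in range(dim):
            acc = 0  # plays the role of the integer accumulator
            for j in range(dim):
                # The +/-1 entry is regenerated here instead of being loaded.
                acc += hadamard_entry(i, j) * slices[j][s]
            y[i] += weight * acc
    return y
```

Because every partial product is a slice value times $\pm 1$, each pass is exact integer arithmetic, and the weighted sum of the passes reproduces the full-precision fixed-point result. The number of slices (3 here) is an arbitrary choice for the sketch.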

Benchmark Results

Environment: Dev Machine (NVIDIA GeForce RTX 4060 Laptop GPU)
Configuration: Batch size 64, 30 iterations, GPU-vs-GPU only (no PyTorch reference)
Script: qdp/qdp-python/benchmark/benchmark_pr5.py
Measured: 2026-06-10
Branch: pr5-implicit-hadamard-engine (ba0376a4f)

Encoding path options (--path)

| Value | Description |
| --- | --- |
| fwt | `engine.encode(..., "iqp")`: standard FWT dispatch |
| tc | `engine.encode_batch_tc(...)`: Ozaki Kronecker Tensor Core path |
| both | Run both paths and print FWT/TC speedup |

FWT vs Ozaki TC (after PR5)

| Qubits | Dim | FWT (ms) | TC (ms) | Speedup (FWT/TC) | Notes |
| --- | --- | --- | --- | --- | --- |
| 8 | 256 | 0.188 | 0.142 | 1.33x | Fused shared-memory TC path |
| 10 | 1024 | 0.312 | 0.339 | 0.92x | Within noise |
| 12 | 4096 | 0.821 | 0.833 | 0.99x | Parity (fused path) |
| 14 | 16384 | 3.060 | 15.870 | 0.19x | Kronecker TC-GEMM scaffold |
| 16 | 65536 | 21.925 | 70.723 | 0.31x | Kronecker TC-GEMM scaffold |
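For reading the speedup column: it is the FWT time divided by the TC time, so values below 1 mean the TC path is slower. A minimal sketch that recomputes it from the rounded timings above (the helper name `speedup` is hypothetical, and because the inputs are already rounded to three decimals the last digit can differ from the table):

```python
def speedup(fwt_ms: float, tc_ms: float) -> float:
    """FWT/TC ratio; > 1 means the Tensor Core path is faster."""
    return fwt_ms / tc_ms


# (qubits, FWT ms, TC ms) copied from the table above.
rows = [
    (8, 0.188, 0.142),
    (10, 0.312, 0.339),
    (12, 0.821, 0.833),
    (14, 3.060, 15.870),
    (16, 21.925, 70.723),
]

if __name__ == "__main__":
    for qubits, fwt_ms, tc_ms in rows:
        print(f"N={qubits:2d}  FWT/TC = {speedup(fwt_ms, tc_ms):.2f}x")
```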

Checklist

  • Added or updated unit tests for all changes (Verified passing against existing CI test suite)
  • Added or updated documentation for all changes (Added explanatory inline comments for PR)

…tests

- Expose encode_batch_tc through Rust core, PyO3, and Python backend
- Fix IQP TC kernel: correct batch stride for ZZ params and raise
  FWT_SHARED_MEM_THRESHOLD to 12 for fused shared-memory path at N<=12
- Align ImplicitHadamardOzaki.cu with PR6 ldmatrix/alignment fixes
- Add benchmark_pr5.py with --path fwt|tc|both (GPU-vs-GPU, no PyTorch)
- Add test_iqp_tc_path.py smoke and normalization tests
@aloha1357 aloha1357 force-pushed the pr5-implicit-hadamard-engine branch from ba0376a to 806a419 Compare June 10, 2026 21:06
