Update code to Optimise-away small GPU allocations for projector & mu…#783
Open
thedaemon-wizard wants to merge 1 commit into
Open
Update code to Optimise-away small GPU allocations for projector & mu…#783thedaemon-wizard wants to merge 1 commit into
thedaemon-wizard wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Profile and optimise-away small GPU allocations
Closes #749
Summary
The single-GPU backend copied the qubit-index list from host to device
(
cudaMalloc+cudaMemcpyAsync+cudaFree) on every call of severalmulti-qubit operations, via the
getDevInts()helper. For small Quregs thisfixed allocation latency dominates the actual kernel runtime — exactly the
overhead issue #749 asks us to profile and remove.
This PR eliminates that per-call copy for the two operations agreed in scope,
following the register-resident-bitmask pattern already used by
thrust_statevec_calcExpecAnyTargZ_sub:applyMultiQubitProjectorthrust_{statevec,densmatr}_multiQubitProjector_subcalcProbOfMultiQubitOutcomethrust_{statevec,densmatr}_calcProbOfMultiQubitOutcome_subAll changes are confined to a single file:
quest/src/gpu/gpu_thrust.cuh(52 insertions, 52 deletions). No public API, dispatch, or
thrust_*signatureschange — only the internals.
Design
Projectors — bitmask reformulation (removes the copy for all sizes)
A projector keeps an amplitude iff its target qubits match the requested
outcomes. The original device-array test
getValueOfBits(n, targetsPtr, numBits) == retainValueis mathematically identical to the register-only primitive
where
qubitMask = util_getBitMask(qubits)flags the target positions andvalueMask = util_getBitMask(qubits, outcomes)holds the desired outcome bits.Both masks are plain
qindexscalars passed as kernel arguments, so no devicearray is allocated at all. For the density-matrix functor the same masks are
applied to both the row and column substates:
Outcome probability — pass the list by value (removes the copy for all sizes)
calcProbOfMultiQubitOutcomeusesfunctor_insertBits, whoseinsertBitsWithMaskedValuesis a scatter that needs the actual sorted qubitpositions (a bitmask is insufficient). Instead of allocating a
device_vector, the functor now stores the positions in aList64by value— a trivially-copyable, fixed-size (
int[64]), CUDA-kernel-compatible structthat already exists in the codebase precisely for this purpose
(
quest/src/core/lists.hpp). The list rides along as a kernel argument, socudaMalloc/cudaMemcpydisappear for every qubit count (the previous code onlyavoided the copy never — it always called
getDevInts).getDevInts()itself is untouched and still used by ~20 other operations ingpu_subroutines.cppthat are out of scope for this issue.Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)
Nsight Systems — CUDA API call counts
Trace of N = 8…12, 200 reps each, both operations, captured with
Nsight Systems 2025.3.2 (
nsys profile+ per-call CUDA runtime API counts):cudaMalloccudaFreecudaMemcpyAsynccudaLaunchKernelThe ~2000 eliminated allocations are exactly the two per-call qubit-list copies
(one in the projector, one in the probability path). The residual ~1000
cudaMallocin the optimised build isthrust::reduce's own internal temporaryin
calcProb— inherent to Thrust and out of scope.cudaLaunchKernelisunchanged, confirming the kernels themselves are untouched.
Wall-clock per call (microseconds),
numTargs = 3, 2000 repsIn the small-Qureg regime the projector is consistently ~1.8× faster and the
probability calc ~1.4× faster, purely from removing the allocation. (The
large baseline jump at N≥17 is the allocation interacting with the now-larger
state kernels; removing it also removes that cliff.)
Correctness
Built with
-D QUEST_ENABLE_CUDA=ON -D QUEST_BUILD_TESTS=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Releaseagainst CUDA 13.0.The unit tests for the affected operations pass across all four deployments
(CPU, CPU+OpenMP, GPU, GPU+OpenMP):
(A correct sm_120 build is also implicitly validated, since a wrong architecture
silently corrupts GPU results.)
Re-verified end-to-end on a freshly re-cloned
devel(HEADb9830592) with thesingle-file change re-applied: configure, build, and the test command above all
pass unchanged.
How to reproduce the measurements
The numbers above come from a small standalone driver (kept out of this PR to
preserve the single-file diff) that, against both a clean
origin/develbuild andthis branch:
-D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release;N, allocates a Qureg, then times a loop ofapplyMultiQubitProjectorand
calcProbOfMultiQubitOutcome(numTargs = 3) with a CUDA-synchronisedwall clock over many reps (per-call µs in the table);
nsys profileand tallies CUDA runtime API calls(
cudaMalloc/cudaFree/cudaMemcpyAsync/cudaLaunchKernel) for the table above.Happy to share the driver/scripts separately if useful for CI; they are not part of
this change.
Notes for reviewers
devel(the active unitaryHACK branch and the only one thatbuilds on CUDA 13 —
main/v4.2 still usesthrust::binary_function, removedin CUDA 13's libcu++).
cuQuantum is not enabled). cuStateVec is a separate backend, not what Profile and optimise-away small GPU allocations #749
concerns; it was left disabled for these measurements. (For the record, as of
2026 cuStateVec does support CUDA 13 and Blackwell, so this is a scope choice,
not a compatibility workaround.)
getDevIntsingpu_thrust.cuh.AI usage disclosure
Per unitaryHACK's AI guide ("human-in-the-loop";
honesty required for bounty eligibility): an AI coding assistant (Anthropic Claude,
via Claude Code) was used as a co-pilot for parts of this work — to help survey the
gpu_thrust.cuhcode paths, brainstorm the bitmask/List64-by-value reformulation,and draft this PR description and the profiling methodology. It
was not the author of record: every change was reviewed, compiled, and tested by
me on real hardware (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120). The diff is a single
file (+52/−52), the algebraic equivalence of the bitmask reformulation was checked by
hand, and correctness was confirmed by the upstream unit tests passing across all four
deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP). No unverified or copy-pasted AI
output is included.
unitaryHACK 2026 checklist
Closes #749).