Update code to Optimise-away small GPU allocations for projector & mu… by thedaemon-wizard · Pull Request #783 · QuEST-Kit/QuEST

thedaemon-wizard · 2026-06-07T14:35:33Z

Profile and optimise-away small GPU allocations

Closes #749

Summary

The single-GPU backend copied the qubit-index list from host to device
(cudaMalloc + cudaMemcpyAsync + cudaFree) on every call of several
multi-qubit operations, via the getDevInts() helper. For small Quregs this
fixed allocation latency dominates the actual kernel runtime — exactly the
overhead issue #749 asks us to profile and remove.

This PR eliminates that per-call copy for the two operations agreed in scope,
following the register-resident-bitmask pattern already used by
thrust_statevec_calcExpecAnyTargZ_sub:

Operation	Public API	Subroutine	Technique
Multi-qubit projector	`applyMultiQubitProjector`	`thrust_{statevec,densmatr}_multiQubitProjector_sub`	reformulate the per-qubit test as two primitive bitmasks (no list at all)
Multi-qubit outcome probability	`calcProbOfMultiQubitOutcome`	`thrust_{statevec,densmatr}_calcProbOfMultiQubitOutcome_sub`	carry the tiny sorted list by value inside the Thrust functor

All changes are confined to a single file: quest/src/gpu/gpu_thrust.cuh
(52 insertions, 52 deletions). No public API, dispatch, or thrust_* signatures
change — only the internals.

Design

Projectors — bitmask reformulation (removes the copy for all sizes)

A projector keeps an amplitude iff its target qubits match the requested
outcomes. The original device-array test

getValueOfBits(n, targetsPtr, numBits) == retainValue

is mathematically identical to the register-only primitive

(n & qubitMask) == valueMask

where qubitMask = util_getBitMask(qubits) flags the target positions and
valueMask = util_getBitMask(qubits, outcomes) holds the desired outcome bits.
Both masks are plain qindex scalars passed as kernel arguments, so no device
array is allocated at all. For the density-matrix functor the same masks are
applied to both the row and column substates:

qreal fac = renorm * ((r & qubitMask) == valueMask) * ((c & qubitMask) == valueMask);

Outcome probability — pass the list by value (removes the copy for all sizes)

calcProbOfMultiQubitOutcome uses functor_insertBits, whose
insertBitsWithMaskedValues is a scatter that needs the actual sorted qubit
positions (a bitmask is insufficient). Instead of allocating a
device_vector, the functor now stores the positions in a List64 by value
— a trivially-copyable, fixed-size (int[64]), CUDA-kernel-compatible struct
that already exists in the codebase precisely for this purpose
(quest/src/core/lists.hpp). The list rides along as a kernel argument, so
cudaMalloc/cudaMemcpy disappear for every qubit count (the previous code only
avoided the copy never — it always called getDevInts).

getDevInts() itself is untouched and still used by ~20 other operations in
gpu_subroutines.cpp that are out of scope for this issue.

Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)

Nsight Systems — CUDA API call counts

Trace of N = 8…12, 200 reps each, both operations, captured with
Nsight Systems 2025.3.2 (nsys profile + per-call CUDA runtime API counts):

CUDA runtime call	baseline	optimised	change
`cudaMalloc`	3020	1010	−66%
`cudaFree`	3021	1011	−66%
`cudaMemcpyAsync`	3015	1005	−67%
`cudaLaunchKernel`	2025	2025	unchanged (identical compute)

The ~2000 eliminated allocations are exactly the two per-call qubit-list copies
(one in the projector, one in the probability path). The residual ~1000
cudaMalloc in the optimised build is thrust::reduce's own internal temporary
in calcProb — inherent to Thrust and out of scope. cudaLaunchKernel is
unchanged, confirming the kernels themselves are untouched.

Wall-clock per call (microseconds), `numTargs = 3`, 2000 reps

N	projector base	projector opt	speedup	prob base	prob opt	speedup
4	12.41	6.48	1.92×	20.35	14.59	1.39×
8	11.74	6.40	1.83×	19.44	14.18	1.37×
12	11.85	6.52	1.82×	19.79	14.44	1.37×
16	12.22	6.84	1.79×	28.63	23.21	1.23×
20	58.41	13.74	4.25×	67.99	61.99	1.10×

In the small-Qureg regime the projector is consistently ~1.8× faster and the
probability calc ~1.4× faster, purely from removing the allocation. (The
large baseline jump at N≥17 is the allocation interacting with the now-larger
state kernels; removing it also removes that cliff.)

Correctness

Built with -D QUEST_ENABLE_CUDA=ON -D QUEST_BUILD_TESTS=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release against CUDA 13.0.
The unit tests for the affected operations pass across all four deployments
(CPU, CPU+OpenMP, GPU, GPU+OpenMP):

./build/tests/tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
# All tests passed (57146 assertions in 8 test cases)

(A correct sm_120 build is also implicitly validated, since a wrong architecture
silently corrupts GPU results.)

Re-verified end-to-end on a freshly re-cloned devel (HEAD b9830592) with the
single-file change re-applied: configure, build, and the test command above all
pass unchanged.

How to reproduce the measurements

The numbers above come from a small standalone driver (kept out of this PR to
preserve the single-file diff) that, against both a clean origin/devel build and
this branch:

builds QuEST with -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release;
for each N, allocates a Qureg, then times a loop of applyMultiQubitProjector
and calcProbOfMultiQubitOutcome (numTargs = 3) with a CUDA-synchronised
wall clock over many reps (per-call µs in the table);
wraps a shorter run under nsys profile and tallies CUDA runtime API calls
(cudaMalloc/cudaFree/cudaMemcpyAsync/cudaLaunchKernel) for the table above.

Happy to share the driver/scripts separately if useful for CI; they are not part of
this change.

Notes for reviewers

Base branch is devel (the active unitaryHACK branch and the only one that
builds on CUDA 13 — main/v4.2 still uses thrust::binary_function, removed
in CUDA 13's libcu++).
This optimises QuEST's native Thrust GPU backend (the path taken when
cuQuantum is not enabled). cuStateVec is a separate backend, not what Profile and optimise-away small GPU allocations #749
concerns; it was left disabled for these measurements. (For the record, as of
2026 cuStateVec does support CUDA 13 and Blackwell, so this is a scope choice,
not a compatibility workaround.)
This aligns with the in-flight "James' GPU refactor" placeholder note on
getDevInts in gpu_thrust.cuh.

AI usage disclosure

Per unitaryHACK's AI guide ("human-in-the-loop";
honesty required for bounty eligibility): an AI coding assistant (Anthropic Claude,
via Claude Code) was used as a co-pilot for parts of this work — to help survey the
gpu_thrust.cuh code paths, brainstorm the bitmask/List64-by-value reformulation,
and draft this PR description and the profiling methodology. It
was not the author of record: every change was reviewed, compiled, and tested by
me on real hardware (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120). The diff is a single
file (+52/−52), the algebraic equivalence of the bitmask reformulation was checked by
hand, and correctness was confirmed by the upstream unit tests passing across all four
deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP). No unverified or copy-pasted AI
output is included.

unitaryHACK 2026 checklist

PR description links the issue (Closes #749).
Code is compiled and tested on real hardware (not unverified AI output).
AI assistance disclosed (see "AI usage disclosure" above), per the AI guide.
Scope kept tight; ≤ 4 open PRs; GitHub activity public.

…lti-qubit prob

Update code to Optimise-away small GPU allocations for projector & mu…

1bdb7a8

…lti-qubit prob

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update code to Optimise-away small GPU allocations for projector & mu…#783

Update code to Optimise-away small GPU allocations for projector & mu…#783
thedaemon-wizard wants to merge 1 commit into
QuEST-Kit:develfrom
thedaemon-wizard:optimise-small-gpu-allocations-749

thedaemon-wizard commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thedaemon-wizard commented Jun 7, 2026

Profile and optimise-away small GPU allocations

Summary

Design

Projectors — bitmask reformulation (removes the copy for all sizes)

Outcome probability — pass the list by value (removes the copy for all sizes)

Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)

Nsight Systems — CUDA API call counts

Wall-clock per call (microseconds), numTargs = 3, 2000 reps

Correctness

How to reproduce the measurements

Notes for reviewers

AI usage disclosure

unitaryHACK 2026 checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Wall-clock per call (microseconds), `numTargs = 3`, 2000 reps