Skip to content

Update code to Optimise-away small GPU allocations for projector & mu…#783

Open
thedaemon-wizard wants to merge 1 commit into
QuEST-Kit:develfrom
thedaemon-wizard:optimise-small-gpu-allocations-749
Open

Update code to Optimise-away small GPU allocations for projector & mu…#783
thedaemon-wizard wants to merge 1 commit into
QuEST-Kit:develfrom
thedaemon-wizard:optimise-small-gpu-allocations-749

Conversation

@thedaemon-wizard
Copy link
Copy Markdown

Profile and optimise-away small GPU allocations

Closes #749

Summary

The single-GPU backend copied the qubit-index list from host to device
(cudaMalloc + cudaMemcpyAsync + cudaFree) on every call of several
multi-qubit operations, via the getDevInts() helper. For small Quregs this
fixed allocation latency dominates the actual kernel runtime — exactly the
overhead issue #749 asks us to profile and remove.

This PR eliminates that per-call copy for the two operations agreed in scope,
following the register-resident-bitmask pattern already used by
thrust_statevec_calcExpecAnyTargZ_sub:

Operation Public API Subroutine Technique
Multi-qubit projector applyMultiQubitProjector thrust_{statevec,densmatr}_multiQubitProjector_sub reformulate the per-qubit test as two primitive bitmasks (no list at all)
Multi-qubit outcome probability calcProbOfMultiQubitOutcome thrust_{statevec,densmatr}_calcProbOfMultiQubitOutcome_sub carry the tiny sorted list by value inside the Thrust functor

All changes are confined to a single file: quest/src/gpu/gpu_thrust.cuh
(52 insertions, 52 deletions). No public API, dispatch, or thrust_* signatures
change — only the internals.

Design

Projectors — bitmask reformulation (removes the copy for all sizes)

A projector keeps an amplitude iff its target qubits match the requested
outcomes. The original device-array test

getValueOfBits(n, targetsPtr, numBits) == retainValue

is mathematically identical to the register-only primitive

(n & qubitMask) == valueMask

where qubitMask = util_getBitMask(qubits) flags the target positions and
valueMask = util_getBitMask(qubits, outcomes) holds the desired outcome bits.
Both masks are plain qindex scalars passed as kernel arguments, so no device
array is allocated at all
. For the density-matrix functor the same masks are
applied to both the row and column substates:

qreal fac = renorm * ((r & qubitMask) == valueMask) * ((c & qubitMask) == valueMask);

Outcome probability — pass the list by value (removes the copy for all sizes)

calcProbOfMultiQubitOutcome uses functor_insertBits, whose
insertBitsWithMaskedValues is a scatter that needs the actual sorted qubit
positions (a bitmask is insufficient). Instead of allocating a
device_vector, the functor now stores the positions in a List64 by value
— a trivially-copyable, fixed-size (int[64]), CUDA-kernel-compatible struct
that already exists in the codebase precisely for this purpose
(quest/src/core/lists.hpp). The list rides along as a kernel argument, so
cudaMalloc/cudaMemcpy disappear for every qubit count (the previous code only
avoided the copy never — it always called getDevInts).

getDevInts() itself is untouched and still used by ~20 other operations in
gpu_subroutines.cpp that are out of scope for this issue.

Results (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120)

Nsight Systems — CUDA API call counts

Trace of N = 8…12, 200 reps each, both operations, captured with
Nsight Systems 2025.3.2 (nsys profile + per-call CUDA runtime API counts):

CUDA runtime call baseline optimised change
cudaMalloc 3020 1010 −66%
cudaFree 3021 1011 −66%
cudaMemcpyAsync 3015 1005 −67%
cudaLaunchKernel 2025 2025 unchanged (identical compute)

The ~2000 eliminated allocations are exactly the two per-call qubit-list copies
(one in the projector, one in the probability path). The residual ~1000
cudaMalloc in the optimised build is thrust::reduce's own internal temporary
in calcProb — inherent to Thrust and out of scope. cudaLaunchKernel is
unchanged, confirming the kernels themselves are untouched.

Wall-clock per call (microseconds), numTargs = 3, 2000 reps

N projector base projector opt speedup prob base prob opt speedup
4 12.41 6.48 1.92× 20.35 14.59 1.39×
8 11.74 6.40 1.83× 19.44 14.18 1.37×
12 11.85 6.52 1.82× 19.79 14.44 1.37×
16 12.22 6.84 1.79× 28.63 23.21 1.23×
20 58.41 13.74 4.25× 67.99 61.99 1.10×

In the small-Qureg regime the projector is consistently ~1.8× faster and the
probability calc ~1.4× faster, purely from removing the allocation. (The
large baseline jump at N≥17 is the allocation interacting with the now-larger
state kernels; removing it also removes that cliff.)

Correctness

Built with -D QUEST_ENABLE_CUDA=ON -D QUEST_BUILD_TESTS=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release against CUDA 13.0.
The unit tests for the affected operations pass across all four deployments
(CPU, CPU+OpenMP, GPU, GPU+OpenMP):

./build/tests/tests "*QubitProjector*,*calcProbOfMultiQubitOutcome*,*calcProbOfQubitOutcome*"
# All tests passed (57146 assertions in 8 test cases)

(A correct sm_120 build is also implicitly validated, since a wrong architecture
silently corrupts GPU results.)

Re-verified end-to-end on a freshly re-cloned devel (HEAD b9830592) with the
single-file change re-applied: configure, build, and the test command above all
pass unchanged.

How to reproduce the measurements

The numbers above come from a small standalone driver (kept out of this PR to
preserve the single-file diff) that, against both a clean origin/devel build and
this branch:

  1. builds QuEST with -D QUEST_ENABLE_CUDA=ON -D CMAKE_CUDA_ARCHITECTURES=120 -D CMAKE_BUILD_TYPE=Release;
  2. for each N, allocates a Qureg, then times a loop of applyMultiQubitProjector
    and calcProbOfMultiQubitOutcome (numTargs = 3) with a CUDA-synchronised
    wall clock over many reps (per-call µs in the table);
  3. wraps a shorter run under nsys profile and tallies CUDA runtime API calls
    (cudaMalloc/cudaFree/cudaMemcpyAsync/cudaLaunchKernel) for the table above.

Happy to share the driver/scripts separately if useful for CI; they are not part of
this change.

Notes for reviewers

  • Base branch is devel (the active unitaryHACK branch and the only one that
    builds on CUDA 13 — main/v4.2 still uses thrust::binary_function, removed
    in CUDA 13's libcu++).
  • This optimises QuEST's native Thrust GPU backend (the path taken when
    cuQuantum is not enabled). cuStateVec is a separate backend, not what Profile and optimise-away small GPU allocations #749
    concerns; it was left disabled for these measurements. (For the record, as of
    2026 cuStateVec does support CUDA 13 and Blackwell, so this is a scope choice,
    not a compatibility workaround.)
  • This aligns with the in-flight "James' GPU refactor" placeholder note on
    getDevInts in gpu_thrust.cuh.

AI usage disclosure

Per unitaryHACK's AI guide ("human-in-the-loop";
honesty required for bounty eligibility): an AI coding assistant (Anthropic Claude,
via Claude Code) was used as a co-pilot for parts of this work — to help survey the
gpu_thrust.cuh code paths, brainstorm the bitmask/List64-by-value reformulation,
and draft this PR description and the profiling methodology. It
was not the author of record: every change was reviewed, compiled, and tested by
me on real hardware (RTX PRO 6000 Blackwell, CUDA 13.0, sm_120). The diff is a single
file (+52/−52), the algebraic equivalence of the bitmask reformulation was checked by
hand, and correctness was confirmed by the upstream unit tests passing across all four
deployments (CPU / CPU+OpenMP / GPU / GPU+OpenMP). No unverified or copy-pasted AI
output is included.

unitaryHACK 2026 checklist

  • PR description links the issue (Closes #749).
  • Code is compiled and tested on real hardware (not unverified AI output).
  • AI assistance disclosed (see "AI usage disclosure" above), per the AI guide.
  • Scope kept tight; ≤ 4 open PRs; GitHub activity public.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant