Fuse distributed prefix-suffix multi-SWAP (closes #595) by zkasuran · Pull Request #785 · QuEST-Kit/QuEST

zkasuran · 2026-06-08T03:11:39Z

Summary

Closes #595. Fuses the distributed prefix<->suffix multi-SWAP so each amplitude crosses the network at most once.

When a multi-qubit gate targets qubits that live in the prefix (the index bits that select which node holds an amplitude), QuEST first swaps those qubits down into the suffix. The localiser did this one SWAP at a time, so an amplitude moved by the first SWAP was often moved again by the next, crossing the network several times. This change works out each amplitude's final node up front and sends it there directly.

The SWAPs in such a group act on disjoint qubit pairs, so they commute and compose into a single permutation of the index bits, which is what makes the direct routing well defined. For the uncontrolled case (every internal caller: applyCompMatr, applyCompMatr2, the partial-trace path) the routine enumerates up to 2^eta - 1 destination nodes, one per non-empty subset of the eta prefix targets. An amplitude is bound for the node whose subset is exactly the set of targets where the partnered suffix bit disagrees with this node's rank bit. For each destination the routine packs, exchanges and unpacks only the amplitudes bound there. The move is an involution between paired nodes, so packed and unpacked amplitudes sit in the same local slots.
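The routing described above can be sketched in a few lines. This is a minimal illustration and not the code in this PR: the names SwapPair, routeAmp and destinationRanks are hypothetical, and the sketch assumes a global amplitude index laid out as the rank bits sitting directly above numSuffixBits local bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = std::uint64_t;

// One prefix<->suffix SWAP (hypothetical type): `prefix` indexes a rank bit
// (>= numSuffixBits), `suffix` indexes a local bit (< numSuffixBits).
struct SwapPair { int prefix; int suffix; };

// Where an amplitude ends up: its final node and its slot on that node.
struct Route { Index rank; Index local; };

// Swap bits a and b of a global amplitude index.
Index swapBits(Index g, int a, int b) {
    Index bitA = (g >> a) & 1, bitB = (g >> b) & 1;
    if (bitA != bitB) g ^= (Index{1} << a) | (Index{1} << b);
    return g;
}

// Compose all disjoint SWAPs into one bit permutation and read off the
// final node directly, instead of relaying the amplitude SWAP by SWAP.
Route routeAmp(Index rank, Index local, int numSuffixBits,
               const std::vector<SwapPair>& swaps) {
    Index g = (rank << numSuffixBits) | local;
    for (const SwapPair& s : swaps) {
        assert(s.prefix >= numSuffixBits && s.suffix < numSuffixBits);
        g = swapBits(g, s.prefix, s.suffix);
    }
    return { g >> numSuffixBits, g & ((Index{1} << numSuffixBits) - 1) };
}

// The up to 2^eta - 1 partner nodes of `rank`: one per non-empty subset of
// the prefix targets, obtained by flipping that subset of rank bits.
std::vector<Index> destinationRanks(Index rank, int numSuffixBits,
                                    const std::vector<SwapPair>& swaps) {
    std::vector<Index> out;
    Index numSubsets = Index{1} << swaps.size();
    for (Index subset = 1; subset < numSubsets; ++subset) {
        Index dest = rank;
        for (std::size_t i = 0; i < swaps.size(); ++i)
            if ((subset >> i) & 1)
                dest ^= Index{1} << (swaps[i].prefix - numSuffixBits);
        out.push_back(dest);
    }
    return out;
}
```

For example, with 3 suffix bits and the SWAPs (3,0) and (4,1), the amplitude at local index 2 on rank 1 routes to local index 1 on rank 2, and routing it again from there returns it to where it started, which is the involution property the paired exchange relies on.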

Design notes

This follows the two constraints from the issue thread:

The amplitudes are physically moved to their final node. There is no virtual/physical wire-ordering layer.
Each amplitude crosses the network at most once. Overlapping the existing per-SWAP exchanges asynchronously would not satisfy this.

The new CPU kernel cpu_statevec_unpackAmpsFromBuffer is the inverse of the existing cpu_statevec_packAmpsIntoBuffer. It is an OpenMP scatter that writes the contiguous received sub-buffer back into the strided local amplitudes selected by several constrained qubits, via insertBitsWithMaskedValues. It therefore loops over O(amplitudes moved) and never over O(2^N).
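A pack/unpack pair of this shape might look like the sketch below. It is not the kernel added by this PR: insertBits is a simplified, hypothetical stand-in for QuEST's bit-insertion helper, and packAmps and unpackAmps are illustrative names with assumed signatures. The point it shows is that both directions compute the same strided local index from the buffer index k, so the loop length is the number of amplitudes moved.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = std::uint64_t;
using Amp = std::complex<double>;

// Hypothetical helper: widen k by inserting one bit at each position in
// sortedQubits (ascending); bit i of `values` is placed at sortedQubits[i].
Index insertBits(Index k, const std::vector<int>& sortedQubits, Index values) {
    for (std::size_t i = 0; i < sortedQubits.size(); ++i) {
        int q = sortedQubits[i];
        Index low = k & ((Index{1} << q) - 1);   // bits below position q
        Index high = k >> q;                     // bits at or above q
        Index bit = (values >> i) & 1;
        k = (high << (q + 1)) | (bit << q) | low;
    }
    return k;
}

// Gather: copy the strided local amplitudes whose constrained qubits hold
// `values` into a contiguous send buffer.
void packAmps(const std::vector<Amp>& local, std::vector<Amp>& buffer,
              const std::vector<int>& sortedQubits, Index values) {
    assert(buffer.size() == (local.size() >> sortedQubits.size()));
    const long long n = static_cast<long long>(buffer.size());
    #pragma omp parallel for
    for (long long k = 0; k < n; ++k)
        buffer[k] = local[insertBits(static_cast<Index>(k), sortedQubits, values)];
}

// Scatter: the inverse, writing a contiguous received buffer back into the
// same strided slots. Distinct k map to distinct slots, so writes never race.
void unpackAmps(std::vector<Amp>& local, const std::vector<Amp>& buffer,
                const std::vector<int>& sortedQubits, Index values) {
    assert(buffer.size() == (local.size() >> sortedQubits.size()));
    const long long n = static_cast<long long>(buffer.size());
    #pragma omp parallel for
    for (long long k = 0; k < n; ++k)
        local[insertBits(static_cast<Index>(k), sortedQubits, values)] = buffer[k];
}
```

With 8 local amplitudes, constrained qubits {0, 2} and values 1 (qubit 0 set, qubit 2 clear), only local indices 1 and 3 are touched, so the buffer holds 2 amplitudes and the other 6 are never visited.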

Scope

CPU/OpenMP, which the issue notes is sufficient. GPU quregs and controlled multi-SWAPs keep the existing per-SWAP path, so the GPU build and its numerics are untouched. A GPU mirror of the kernel is written but has not been compiled. It is held back as a follow-up because I have no CUDA hardware to build it on and did not want this change to risk the GPU build.

Files: core/localiser.cpp, core/accelerator.cpp, core/accelerator.hpp, cpu/cpu_subroutines.cpp, cpu/cpu_subroutines.hpp.

Correctness

The fused routine must give bit-identical results to the per-SWAP path. The existing suites compare against an independent reference state and pass at 1, 2, 4 and 8 ranks:

np=1  All tests passed (21017 assertions in 4 test cases)
np=2  All tests passed (15265 assertions in 4 test cases)
np=4  All tests passed (6637 assertions in 4 test cases)
np=8  All tests passed (2329 assertions in 4 test cases)

The four test cases are applySwap, applyCompMatr, applyCompMatr2 and calcPartialTrace.

Benchmark

Communication volume (exact, hardware independent). Measured by tallying amplitudes pushed through the sub-buffer exchange:

ranks	eta	fused / baseline	reduction
4	2	0.750	25.0%
8	3	0.583	41.7%

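These ratios follow from counting the fraction of a node's partition that is sent. The per-SWAP baseline sends half the partition in each of its eta exchanges, eta/2 in total. The fused group sends everything except the 1/2^eta of amplitudes already on their final node, a fraction 1 - 1/2^eta, and sends it once. The ratio is therefore 2(1 - 1/2^eta)/eta, which gives 0.750 at eta=2 and 0.583 at eta=3.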
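The tabulated ratios can be reproduced with a short calculation. The sketch below rests on a modelling assumption that is consistent with the measurements above but is not taken from the PR's code: each per-SWAP exchange sends half of a node's partition, and the fused group sends every amplitude except the 1/2^eta that already sit on their final node. The function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Fraction of a node's partition sent by the per-SWAP baseline:
// half the partition for each of the eta SWAPs (assumed model).
double baselineVolume(int eta) {
    return 0.5 * eta;
}

// Fraction sent by the fused group: everything except the 1/2^eta of
// amplitudes whose suffix bits already agree with the rank bits.
double fusedVolume(int eta) {
    return 1.0 - std::ldexp(1.0, -eta);  // 1 - 2^(-eta)
}

// Ratio reported in the "fused / baseline" column.
double fusedOverBaseline(int eta) {
    assert(eta >= 1);
    return fusedVolume(eta) / baselineVolume(eta);
}
```

Under this model fusedOverBaseline(2) is 0.75 and fusedOverBaseline(3) is about 0.583, matching the 25.0% and 41.7% reductions in the table.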

Wall clock. I do not have a cluster, so I cannot measure the real multi-node speedup directly. On a single box, intra-node MPI goes through shared memory and is not bandwidth limited, so the volume saved buys nothing and the fused routine's extra rounds make it slower there. To measure the bandwidth-limited regime that distribution actually runs in, I forced MPICH off its shared-memory shortcut onto the TCP transport (MPIR_CVAR_NOLOCAL=1 FI_PROVIDER=tcp), which gives a genuinely bandwidth-limited path between ranks on one machine. This is an emulation, not a real cluster, and I flag it as such.

With a single thread per node (mt=0, the case the issue asks to verify) and eta=3, the fused routine is roughly level with the baseline at n=27 and n=28 and consistently faster at n=29, as the saved volume starts to outweigh the fused routine's extra rounds:

state n	total state	baseline	fused	speedup	fused faster in
27	2 GB	4.262 s	4.181 s	+1.9%	4/6 trials
28	4 GB	7.656 s	7.682 s	tie	4/6 trials
29	8 GB	17.157 s	16.499 s	+3.8%	5/5 trials

At the largest state tested, single threaded, fused is faster on every trial. With OpenMP on (the realistic deployment, where the extra packing is parallelised away) the win is larger and cleaner: n=28, 8 ranks, fused 4.831 s vs baseline 5.425 s, +11%. For eta=2 the 25% volume cut is too small to beat the extra-round overhead single threaded and the result is a tie.

So the extra packing does not outweigh the comm saving: single threaded it is a wash at small state and a win at large state. With threads on, it wins clearly in the configuration measured (n=28, 8 ranks). I expect the margin to grow on a real interconnect, which is slower than loopback, but I have not measured that. Happy to have this confirmed on a real cluster via CI.

AI disclosure

This change was implemented with substantial help from Claude (Anthropic), which drafted the fused routing in the localiser, the unpack kernel and the benchmark and proposed the subset-enumeration design. I reviewed the approach against the issue thread and the QuEST distributed paper (arXiv:2311.01512). I ran the tests at 1/2/4/8 ranks and the benchmark and own the change. Verified locally before submitting: the CPU/OpenMP suites green at 1/2/4/8 ranks, the comm-volume reduction measured directly and the wall-clock benchmark run under the emulated transport above. The GPU path was not compiled (no CUDA hardware) and is deliberately left out.

The localiser performed each prefix<->suffix SWAP in turn, so an amplitude moved by one SWAP was often moved again by the next, crossing the network several times. This fuses the group of disjoint SWAPs into one operation that computes each amplitude's final node and sends it there directly, so every amplitude crosses the network at most once. The disjoint SWAPs commute and compose into a single bit permutation. For the uncontrolled case (every internal caller) the routine enumerates the up to 2^eta - 1 destination nodes and packs, exchanges and unpacks only the amplitudes bound to each. A new cpu_statevec_unpackAmpsFromBuffer scatters the received sub-buffer back into the strided local amplitudes, the inverse of the existing packer, looping over moved amplitudes rather than the whole state. Scope is CPU/OpenMP. GPU quregs and controlled multi-SWAPs keep the existing per-SWAP path, so the GPU build is unchanged. Comm volume drops 25% at eta=2 and 42% at eta=3, matching the expected fused/baseline ratio of 2(1 - 1/2^eta)/eta. Existing applySwap, applyCompMatr, applyCompMatr2 and calcPartialTrace suites pass at 1, 2, 4 and 8 ranks.

zkasuran wants to merge 1 commit into QuEST-Kit:devel from zkasuran:swap-fusion
