Fuse distributed prefix-suffix multi-SWAP (closes #595) #785
Open
zkasuran wants to merge 1 commit into
The localiser performed each prefix<->suffix SWAP in turn, so an amplitude moved by one SWAP was often moved again by the next, crossing the network several times. This fuses the group of disjoint SWAPs into one operation that computes each amplitude's final node and sends it there directly, so every amplitude crosses the network at most once. The disjoint SWAPs commute and compose into a single bit permutation. For the uncontrolled case (every internal caller) the routine enumerates the up to 2^eta - 1 destination nodes and packs, exchanges and unpacks only the amplitudes bound for each. A new cpu_statevec_unpackAmpsFromBuffer scatters the received sub-buffer back into the strided local amplitudes; it is the inverse of the existing packer and loops over the moved amplitudes, not the whole state. Scope is CPU/OpenMP. GPU quregs and controlled multi-SWAPs keep the existing per-SWAP path, so the GPU build is unchanged. Comm volume drops 25% at eta=2 and 42% at eta=3, matching theory: the fused path sends a fraction 1 - 1/2^eta of each node's amplitudes once, where the per-SWAP path sends half of them in each of its eta exchanges. Existing applySwap, applyCompMatr, applyCompMatr2 and calcPartialTrace suites pass at 1, 2, 4 and 8 ranks.
Summary
Closes #595. Fuses the distributed prefix<->suffix multi-SWAP so each amplitude crosses the network at most once.
When a multi-qubit gate targets qubits that live in the prefix (the index bits that select which node holds an amplitude), QuEST first swaps those qubits down into the suffix. The localiser did this one SWAP at a time, so an amplitude moved by the first SWAP was often moved again by the next, crossing the network several times. This change works out each amplitude's final node up front and sends it there directly.
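The routing described above can be illustrated with a small sketch. It assumes a global amplitude index whose low `numSuffix` bits address the local amplitude and whose remaining high bits are the node rank. The names `SwapPair`, `swapBits`, `fusedIndex`, `nodeOf` and `destinations` are hypothetical and chosen for illustration; they are not QuEST's API, and the brute-force loop over local amplitudes is only there to make the result easy to check.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using Index = std::uint64_t;

// One prefix<->suffix SWAP: exchange index bit `prefixQubit` with bit `suffixQubit`.
// (Hypothetical type, for illustration only.)
struct SwapPair { int prefixQubit; int suffixQubit; };

// Swap two distinct bits of a global amplitude index.
Index swapBits(Index i, int a, int b) {
    assert(a != b);
    Index differ = ((i >> a) ^ (i >> b)) & Index{1};  // 1 iff the two bits differ
    return i ^ ((differ << a) | (differ << b));       // flip both bits when they differ
}

// Because the pairs are disjoint, applying them in any order gives the same
// single bit permutation, so the final index can be computed in one pass.
Index fusedIndex(Index i, const std::vector<SwapPair>& swaps) {
    for (const SwapPair& s : swaps) i = swapBits(i, s.prefixQubit, s.suffixQubit);
    return i;
}

// Node holding a global index when the low `numSuffix` bits are local.
Index nodeOf(Index globalIndex, int numSuffix) { return globalIndex >> numSuffix; }

// Distinct remote nodes that `rank` sends amplitudes to under the fused routing.
// Brute force over the local amplitudes, so only suitable for tiny examples.
std::set<Index> destinations(Index rank, int numSuffix, const std::vector<SwapPair>& swaps) {
    std::set<Index> dests;
    const Index numLocal = Index{1} << numSuffix;
    for (Index local = 0; local < numLocal; ++local) {
        Index global = (rank << numSuffix) | local;
        Index dest = nodeOf(fusedIndex(global, swaps), numSuffix);
        if (dest != rank) dests.insert(dest);  // amplitudes that stay put are not sent
    }
    return dests;
}
```

With 3 suffix qubits, 2 prefix qubits and the two SWAPs `{3, 0}` and `{4, 1}` (eta = 2), `destinations(0, 3, swaps)` yields nodes 1, 2 and 3, which is 2^eta - 1 destinations, each reached in a single hop.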
The SWAPs in such a group act on disjoint qubit pairs, so they commute and compose into a single permutation of the index bits, which is what makes the direct routing well defined. For the uncontrolled case (every internal caller: applyCompMatr, applyCompMatr2, the partial-trace path) the routine enumerates the up to 2^eta - 1 destination nodes, one per non-empty subset of the eta prefix targets whose partnered suffix bit disagrees with this node's rank bit. For each it packs, exchanges and unpacks only the amplitudes bound there. The move is an involution between paired nodes, so packed and unpacked amplitudes sit in the same local slots.
Design notes
This follows the two constraints from the issue thread:
New CPU kernel
cpu_statevec_unpackAmpsFromBuffer is the inverse of the existing cpu_statevec_packAmpsIntoBuffer: an OpenMP scatter that writes the contiguous received sub-buffer back into the strided local amplitudes selected by several constrained qubits, via insertBitsWithMaskedValues, so it loops over O(amplitudes moved) and never over O(2^N).
Scope
CPU/OpenMP, which the issue notes is sufficient. GPU quregs and controlled multi-SWAPs keep the existing per-SWAP path, so the GPU build and its numerics are untouched. A GPU mirror of the kernel is written and ready as a follow-up, kept out of this PR because I have no CUDA hardware to compile it on and did not want this change to risk the GPU build.
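For the unpack kernel described under New CPU kernel, a minimal sketch of a matched pack/unpack pair may help reviewers see why the loop is proportional to the amplitudes moved. This is not the code in the PR: `expandIndex`, `packAmps` and `unpackAmps` are hypothetical names, `expandIndex` is a stand-in written for this example rather than QuEST's insertBitsWithMaskedValues, and the signatures are assumptions.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

using Amp = std::complex<double>;
using Index = std::uint64_t;

// Hypothetical helper: spread the bits of `compact` over the positions NOT
// listed in `qubits` (which must be sorted ascending), then force the listed
// qubits to the bit values given in `valueMask`.
Index expandIndex(Index compact, const std::vector<int>& qubits, Index valueMask) {
    Index full = compact;
    for (int q : qubits) {
        Index low = full & ((Index{1} << q) - 1);  // bits below q stay in place
        Index high = (full >> q) << (q + 1);       // bits at or above q shift up by one
        full = high | low;                         // leaves a zero at position q
    }
    return full | valueMask;
}

// Gather the strided amplitudes whose constrained qubits match `valueMask`
// into a contiguous buffer.
void packAmps(const std::vector<Amp>& amps, std::vector<Amp>& buffer,
              const std::vector<int>& qubits, Index valueMask) {
    const std::int64_t numMoved =
        static_cast<std::int64_t>(amps.size() >> qubits.size());
    buffer.resize(static_cast<std::size_t>(numMoved));
    #pragma omp parallel for
    for (std::int64_t n = 0; n < numMoved; ++n)
        buffer[n] = amps[expandIndex(static_cast<Index>(n), qubits, valueMask)];
}

// Inverse of packAmps: scatter a contiguous buffer back into the same strided
// slots. The loop runs over buffer.size() entries, never over the whole state.
void unpackAmps(std::vector<Amp>& amps, const std::vector<Amp>& buffer,
                const std::vector<int>& qubits, Index valueMask) {
    assert(buffer.size() == (amps.size() >> qubits.size()));
    const std::int64_t numMoved = static_cast<std::int64_t>(buffer.size());
    #pragma omp parallel for
    for (std::int64_t n = 0; n < numMoved; ++n)
        amps[expandIndex(static_cast<Index>(n), qubits, valueMask)] = buffer[n];
}
```

Both functions derive their slots from the same index expansion, so a buffer packed with one `(qubits, valueMask)` selection unpacks into exactly the slots it was read from. Without OpenMP enabled the pragmas are ignored and the loops run serially.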
Files: core/localiser.cpp, core/accelerator.cpp, core/accelerator.hpp, cpu/cpu_subroutines.cpp, cpu/cpu_subroutines.hpp.
Correctness
The fused routine must give bit-identical results to the per-SWAP path. The existing suites compare against an independent reference state and pass at 1, 2, 4 and 8 ranks:
(applySwap, applyCompMatr, applyCompMatr2, calcPartialTrace.)
Benchmark
Communication volume (exact, hardware independent). Measured by tallying amplitudes pushed through the sub-buffer exchange:
The fused volume is exactly a fraction 1 - 1/2^eta of each node's partition, the fused group moving the partition once instead of relaying it across eta exchanges.
Wall clock. I do not have a cluster, so I cannot measure the real multi-node speedup directly. On a single box, intra-node MPI goes through shared memory, which is effectively not bandwidth limited, so the saved volume buys nothing and the routine is slower there. To measure the bandwidth-limited regime that distribution actually runs in, I forced MPICH off its shared-memory shortcut onto the TCP transport (MPIR_CVAR_NOLOCAL=1 FI_PROVIDER=tcp), which gives a genuine bandwidth-limited path between ranks on one machine. This is an emulation, not a real cluster, and I flag it as such.
Single thread per node (mt=0, the case the issue asks to verify), eta=3, the speedup grows with state size as the saved volume starts to outweigh the fused routine's extra rounds:
At the largest state tested, single threaded, fused is faster on every trial. With OpenMP on (the realistic deployment, where the extra packing is parallelised away) the win is larger and cleaner: n=28, 8 ranks, fused 4.831 s vs baseline 5.425 s, about 11% less wall-clock time. For eta=2 the 25% volume cut is too small to beat the extra-round overhead single threaded and the result is a tie.
So the extra packing does not outweigh the comm saving: single threaded it is a wash at small state and a win at large state. Once threads or a real (slower than loopback) interconnect enter, it wins across the range. Happy to have this confirmed on a real cluster via CI.
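The 25% and 42% volume reductions quoted in this description can be reproduced with a brute-force tally. The sketch below rests on one assumption that should be checked against the real exchange code: a per-SWAP exchange sends exactly those amplitudes whose prefix bit and partnered suffix bit differ. `Volumes`, `tallyVolumes` and `reduction` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Amplitude transfers between nodes, summed over the whole state.
struct Volumes { std::uint64_t perSwap; std::uint64_t fused; };

// Each swap is a (prefixQubit, suffixQubit) pair; the pairs must be disjoint.
Volumes tallyVolumes(int numQubits, const std::vector<std::pair<int, int>>& swaps) {
    Volumes v{0, 0};
    const std::uint64_t numAmps = std::uint64_t{1} << numQubits;
    for (std::uint64_t i = 0; i < numAmps; ++i) {
        bool movesAtAll = false;
        for (const auto& sw : swaps) {
            // Disjoint pairs: earlier swaps never alter these two bits, so the
            // original index decides whether this swap moves the amplitude.
            bool differs = (((i >> sw.first) ^ (i >> sw.second)) & 1u) != 0;
            if (differs) { ++v.perSwap; movesAtAll = true; }  // one hop per such swap
        }
        if (movesAtAll) ++v.fused;  // fused path: at most one hop per amplitude
    }
    return v;
}

// Fraction of communication volume saved by fusing.
double reduction(const Volumes& v) {
    assert(v.perSwap > 0);
    return 1.0 - static_cast<double>(v.fused) / static_cast<double>(v.perSwap);
}
```

For a 6-qubit state, two swaps give 48 fused transfers against 64 per-SWAP transfers (a 25% cut) and three swaps give 56 against 96 (about 41.7%). In closed form the fused path moves a fraction 1 - 1/2^eta of the state and the per-SWAP path moves eta/2, so the saving is 1 - 2(1 - 1/2^eta)/eta.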
AI disclosure
This change was implemented with substantial help from Claude (Anthropic), which drafted the fused routing in the localiser, the unpack kernel and the benchmark and proposed the subset-enumeration design. I reviewed the approach against the issue thread and the QuEST distributed paper (arXiv:2311.01512). I ran the tests at 1/2/4/8 ranks and the benchmark and own the change. Verified locally before submitting: the CPU/OpenMP suites green at 1/2/4/8 ranks, the comm-volume reduction measured directly and the wall-clock benchmark run under the emulated transport above. The GPU path was not compiled (no CUDA hardware) and is deliberately left out.