Add explicit multi-buffer annotation and manual buffer select support#685
Add explicit multi-buffer annotation and manual buffer select support#685zhangstevenunity wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an explicit multi-buffer expression and automatic synchronization scheme, adding the !pto.multi_tile_buf type and associated pto.alloc_multi_tile and pto.multi_tile_get operations. The implementation includes a new PTOResolveBufferSelect pass to lower slot selections into arith.select chains or single-address casts, and updates synchronization solvers to support dynamic event ID derivation and disjoint slot optimization via affine analysis. Review feedback suggests optimizing the arith.select lowering for high slot counts, refining the constant-peeling heuristic in affine analysis, and adding tracking for the current lack of support for low-precision types in multi-buffer allocations.
| for (uint32_t i = 1; i < n; ++i) { | ||
| Value iIdx = rewriter.create<arith::ConstantIndexOp>(loc, i); | ||
| Value isThis = rewriter.create<arith::CmpIOp>( | ||
| loc, arith::CmpIPredicate::eq, slot, iIdx); | ||
| selected = rewriter.create<arith::SelectOp>(loc, isThis, slotMems[i], | ||
| selected); | ||
| } |
There was a problem hiding this comment.
The N-way arith.select chain is O(N) in terms of generated operations. While N is currently capped at 16, for larger N this could lead to significant IR bloat and potentially inefficient hardware execution. Consider if a more efficient lowering (like a binary search tree of selects or a jump table if supported by the target) would be beneficial if the slot count limit is increased in the future.
| // Peel at most one add/sub of a constant. | ||
| Value rem = inner; | ||
| int peeled = 0; | ||
| while (peeled++ < 4) { |
There was a problem hiding this comment.
| if (isPTOLowPrecisionType(elemTy)) | ||
| return emitOpError() << "slot dtype " << elemTy | ||
| << " is not supported by pto.alloc_multi_tile yet"; |
Codex Review该评论由 review 机器人自动更新。
SummaryPR #685 有 3 个需要优先处理的问题:level3 模式下 multi-buffer 会被静默降成 single-buffer,函数边界上的 Findings
当前主流程只在非
|
Remove the `allComparableSlotPairsEqual` early-bail in `getMultiBufferEventIdInfo`. When producer and consumer access a multi-buffer alloc through the same slot SSA, GSS previously fell back to a single static event id; InsertSync kept allocating N dyn event ids so consecutive iterations touching different physical slots could overlap. The two solvers now emit the same pre-loop primes, loop-body dyn `set_flag_dyn` / `wait_flag_dyn`, and post-loop drains for this case. * `multi_tile_prefetch_gss_event_id.pto`: CHECKs updated to expect 2-slot dyn flag pipeline (matching InsertSync). * Design doc: status table marks alignment landed; same-SSA asymmetry removed from limitations / follow-ups. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolves conflicts in: - lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp: Keep both slot-affine forward-dep filter (this branch) and CanPrunePipeVBarrier (main's #674). They're orthogonal: the affine filter prunes depVec for same-iter deps; the pipe-v pruner short- circuits the whole sync if all remaining deps are exact-same-access on PIPE_V. - lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp: Keep the arith.select alias handler (multi-buffer dynamic-slot path) and the multi-address PointerCast slot-offset population. Main only had a codecheck reformat of the single-address-only legacy path, which is now the `op.getAddrs().size() == 1` branch. - lib/PTO/Transforms/InsertSync/SyncCodegen.cpp: Keep this branch's full rewrite of CreateSetWaitOpForMultiBuffer that emits pto.set_flag_dyn / pto.wait_flag_dyn via an N-way select chain over eventIds[slot % N]. Main had a stale GetBufferSelected call whose return was already `(void)`-discarded; the new dyn-flag path supersedes it. - tools/ptoas/ptoas.cpp: Keep the createPTOResolveBufferSelectPass() registration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
In getBeforeAfterSyncMaps the multi-buffer back-edge dyn event id was keyed off `findSlotSSAExprForRWOp(op)`, which returns the op's *first* slot_marker memref (read operands scanned before writes) -- unrelated to the buffer that actually drove the N-event-id allocation. When an op touches more than one multi-buffer at different slots/phases (e.g. a software-pipeline stage that reads one rotating buffer and writes another), `set_flag_dyn` / `wait_flag_dyn` indexed the wrong event lane, so the real cross-iteration RAW/WAR/WAW on the hazard buffer could be let through -- a data race. Capture the conflicting buffer's slot per side directly in getMultiBufferEventIdInfo, where the hazard MemInfo pair is in hand, and thread it through EventIdInfo (slotExprOp1/slotExprOp2, keyed to op1/op2 == rwOp1/rwOp2). At the backfill, map the slots onto the set/wait sides by op identity, since getSetWaitOcc may order the producer onto either op. If the slot is absent or ambiguous (more than one distinct slot participates in the pair's conflicts) it is left null, so codegen degrades to the safe N-static set_flag/wait_flag fanout instead of selecting a wrong lane. This mirrors the InsertSync path, which already binds the slot to the dep pair's baseBuffer. Add a regression test (multi_tile_two_buf_slot_binding.pto) where the producing tadd reads buffer A at phase (i+1)%2 and writes the hazard buffer B at phase i%2; the test pins B's WAR wait to B's slot and fails on the previous behaviour. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
An op can have multiple tile inputs/outputs, so the producer and consumer slots are resolved independently and one side can come back ambiguous (touching several multi-buffers at different slots) while the other stays a single clean slot. That previously yielded a mismatched static-set / dyn-wait pair, which does not rendezvous on a consistent event lane. Gate symmetrically: a dyn set_flag/wait_flag is emitted only when BOTH sides have an unambiguous slot for the shared hazard buffer; otherwise both fall back to the N-static fanout (safe over-synchronization). Common single-multi-buffer-per-op prefetch is unaffected (both sides carry the buffer's slot). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No description provided.