Skip to content

Add explicit multi-buffer annotation and manual buffer select support#685

Open
zhangstevenunity wants to merge 9 commits into
mainfrom
claude/serene-bhaskara-f0221e
Open

Add explicit multi-buffer annotation and manual buffer select support#685
zhangstevenunity wants to merge 9 commits into
mainfrom
claude/serene-bhaskara-f0221e

Conversation

@zhangstevenunity

Copy link
Copy Markdown
Collaborator

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an explicit multi-buffer expression and automatic synchronization scheme, adding the !pto.multi_tile_buf type and associated pto.alloc_multi_tile and pto.multi_tile_get operations. The implementation includes a new PTOResolveBufferSelect pass to lower slot selections into arith.select chains or single-address casts, and updates synchronization solvers to support dynamic event ID derivation and disjoint slot optimization via affine analysis. Review feedback suggests optimizing the arith.select lowering for high slot counts, refining the constant-peeling heuristic in affine analysis, and adding tracking for the current lack of support for low-precision types in multi-buffer allocations.

Comment on lines +181 to +187
for (uint32_t i = 1; i < n; ++i) {
Value iIdx = rewriter.create<arith::ConstantIndexOp>(loc, i);
Value isThis = rewriter.create<arith::CmpIOp>(
loc, arith::CmpIPredicate::eq, slot, iIdx);
selected = rewriter.create<arith::SelectOp>(loc, isThis, slotMems[i],
selected);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The N-way arith.select chain is O(N) in terms of generated operations. While N is currently capped at 16, for larger N this could lead to significant IR bloat and potentially inefficient hardware execution. Consider if a more efficient lowering (like a binary search tree of selects or a jump table if supported by the target) would be beneficial if the slot count limit is increased in the future.

// Peel at most one add/sub of a constant.
Value rem = inner;
int peeled = 0;
while (peeled++ < 4) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The heuristic of peeling up to 4 constant additions/subtractions is simple and likely sufficient for common prefetch patterns. However, it is a hardcoded limit. If more complex index expressions are expected, a more robust affine expression walker might be needed.

Comment thread lib/PTO/IR/PTO.cpp
Comment on lines +2188 to +2190
if (isPTOLowPrecisionType(elemTy))
return emitOpError() << "slot dtype " << elemTy
<< " is not supported by pto.alloc_multi_tile yet";

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Low precision types are explicitly blocked for multi-buffer allocations. If this is a temporary hardware or compiler limitation, it should be tracked with a TODO or a reference to a tracking issue to ensure it's addressed when support is added.

@reedhecre

reedhecre commented May 19, 2026

Copy link
Copy Markdown

Codex Review

该评论由 review 机器人自动更新。

Summary

PR #685 有 3 个需要优先处理的问题:level3 模式下 multi-buffer 会被静默降成 single-buffer,函数边界上的 multi_tile_buf 会丢失 slot 语义,另外 alloc_multi_tile 还引入了与 alloc_tile 不一致的低精度类型限制。

Findings

  1. P1 `--pto-level=level3` 下 multi-buffer 会被静默降成单 buffer tools/ptoas/ptoas.cpp:1875

当前主流程只在非 level3 模式下运行 PlanMemory(这里从而生成 multi-buffer 需要的多地址 pto.pointer_cast),但 alloc_multi_tile / multi_tile_get 没有任何 level3 禁止逻辑。这样在 --pto-level=level3 下,PTOResolveBufferSelect 会走到 !root 回退分支,把每个 pto.slot_marker 直接替换回原始 memref,结果 %mb[%0]%mb[%1] 等所有 slot 都落到同一块物理存储上,multi-buffer 语义被静默降成 single-buffer,属于错误结果。这里应该在 level3 入口直接拒绝 multi-buffer IR。

  1. P2 函数边界上的 `multi_tile_buf` 不会被拒绝,但 slot 语义会丢失 lib/PTO/Transforms/PTOViewToMemref.cpp:1406

PTOViewToMemref 会无条件把函数签名里的 !pto.multi_tile_buf<...> 改写成单个 slot 的 memref 类型,但没有任何 verifier / lowering 检查去拒绝这种用法。这样跨函数边界后的值既没有 pto.multi_buffer 属性,也没有多地址 pto.pointer_cast 作为锚点,后续 PTOResolveBufferSelect 只能把 pto.multi_tile_get 退化成对同一 memref 的 identity view。也就是说,函数参数/返回值上的不同 slot 会丢失物理区分,和设计文档里“初版不支持出现在函数参数/返回值”这一约束不一致,而且会产生错误别名语义。

  1. P2 `alloc_multi_tile` 额外禁止了现有可用的低精度 tile 类型 lib/PTO/IR/PTO.cpp:2495

AllocMultiTileOp::verify() 新增了 isPTOLowPrecisionType(elemTy) 的硬性拒绝,而普通 pto.alloc_tile 并没有这条限制。这样 hif8 / FP4 等已经能在单 buffer tile 上工作的类型,到了 alloc_multi_tile 就会直接报错,和本 PR 文档里“slotType 约束与单个 tile_buf 相同”的约定不一致。这会让一批现有低精度 tile kernel 无法使用新引入的 multi-buffer 标注,属于明确的兼容性缺口。

@zhangstevenunity zhangstevenunity changed the title Add explicit multi-buffer support Add explicit multi-buffer and manual buffer select support May 19, 2026
Remove the `allComparableSlotPairsEqual` early-bail in
`getMultiBufferEventIdInfo`. When producer and consumer access a
multi-buffer alloc through the same slot SSA, GSS previously fell
back to a single static event id; InsertSync kept allocating N dyn
event ids so consecutive iterations touching different physical
slots could overlap. The two solvers now emit the same pre-loop
primes, loop-body dyn `set_flag_dyn` / `wait_flag_dyn`, and
post-loop drains for this case.

* `multi_tile_prefetch_gss_event_id.pto`: CHECKs updated to expect
  2-slot dyn flag pipeline (matching InsertSync).
* Design doc: status table marks alignment landed; same-SSA
  asymmetry removed from limitations / follow-ups.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@zhangstevenunity zhangstevenunity changed the title Add explicit multi-buffer and manual buffer select support Add explicit multi-buffer annotation and manual buffer select support May 20, 2026
Resolves conflicts in:
- lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp:
  Keep both slot-affine forward-dep filter (this branch) and
  CanPrunePipeVBarrier (main's #674). They're orthogonal: the affine
  filter prunes depVec for same-iter deps; the pipe-v pruner short-
  circuits the whole sync if all remaining deps are exact-same-access
  on PIPE_V.
- lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp:
  Keep the arith.select alias handler (multi-buffer dynamic-slot path)
  and the multi-address PointerCast slot-offset population. Main only
  had a codecheck reformat of the single-address-only legacy path,
  which is now the `op.getAddrs().size() == 1` branch.
- lib/PTO/Transforms/InsertSync/SyncCodegen.cpp:
  Keep this branch's full rewrite of CreateSetWaitOpForMultiBuffer
  that emits pto.set_flag_dyn / pto.wait_flag_dyn via an N-way
  select chain over eventIds[slot % N]. Main had a stale
  GetBufferSelected call whose return was already `(void)`-discarded;
  the new dyn-flag path supersedes it.
- tools/ptoas/ptoas.cpp:
  Keep the createPTOResolveBufferSelectPass() registration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@zhangstevenunity zhangstevenunity marked this pull request as ready for review May 21, 2026 06:24
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

zhangstevenunity and others added 5 commits June 1, 2026 18:32
In getBeforeAfterSyncMaps the multi-buffer back-edge dyn event id was
keyed off `findSlotSSAExprForRWOp(op)`, which returns the op's *first*
slot_marker memref (read operands scanned before writes) -- unrelated to
the buffer that actually drove the N-event-id allocation. When an op
touches more than one multi-buffer at different slots/phases (e.g. a
software-pipeline stage that reads one rotating buffer and writes
another), `set_flag_dyn` / `wait_flag_dyn` indexed the wrong event lane,
so the real cross-iteration RAW/WAR/WAW on the hazard buffer could be let
through -- a data race.

Capture the conflicting buffer's slot per side directly in
getMultiBufferEventIdInfo, where the hazard MemInfo pair is in hand, and
thread it through EventIdInfo (slotExprOp1/slotExprOp2, keyed to
op1/op2 == rwOp1/rwOp2). At the backfill, map the slots onto the
set/wait sides by op identity, since getSetWaitOcc may order the producer
onto either op. If the slot is absent or ambiguous (more than one
distinct slot participates in the pair's conflicts) it is left null, so
codegen degrades to the safe N-static set_flag/wait_flag fanout instead
of selecting a wrong lane. This mirrors the InsertSync path, which
already binds the slot to the dep pair's baseBuffer.

Add a regression test (multi_tile_two_buf_slot_binding.pto) where the
producing tadd reads buffer A at phase (i+1)%2 and writes the hazard
buffer B at phase i%2; the test pins B's WAR wait to B's slot and fails
on the previous behaviour.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
An op can have multiple tile inputs/outputs, so the producer and consumer
slots are resolved independently and one side can come back ambiguous
(touching several multi-buffers at different slots) while the other stays
a single clean slot. That previously yielded a mismatched static-set /
dyn-wait pair, which does not rendezvous on a consistent event lane.

Gate symmetrically: a dyn set_flag/wait_flag is emitted only when BOTH
sides have an unambiguous slot for the shared hazard buffer; otherwise
both fall back to the N-static fanout (safe over-synchronization). Common
single-multi-buffer-per-op prefetch is unaffected (both sides carry the
buffer's slot).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants