Add explicit multi-buffer annotation and manual buffer select support by zhangstevenunity · Pull Request #685 · hw-native-sys/PTOAS

zhangstevenunity · 2026-05-19T07:58:28Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces an explicit multi-buffer expression and automatic synchronization scheme, adding the !pto.multi_tile_buf type and associated pto.alloc_multi_tile and pto.multi_tile_get operations. The implementation includes a new PTOResolveBufferSelect pass to lower slot selections into arith.select chains or single-address casts, and updates synchronization solvers to support dynamic event ID derivation and disjoint slot optimization via affine analysis. Review feedback suggests optimizing the arith.select lowering for high slot counts, refining the constant-peeling heuristic in affine analysis, and adding tracking for the current lack of support for low-precision types in multi-buffer allocations.

gemini-code-assist · 2026-05-19T08:04:19Z

+      for (uint32_t i = 1; i < n; ++i) {
+        Value iIdx = rewriter.create<arith::ConstantIndexOp>(loc, i);
+        Value isThis = rewriter.create<arith::CmpIOp>(
+            loc, arith::CmpIPredicate::eq, slot, iIdx);
+        selected = rewriter.create<arith::SelectOp>(loc, isThis, slotMems[i],
+                                                    selected);
+      }


The N-way arith.select chain is O(N) in terms of generated operations. While N is currently capped at 16, for larger N this could lead to significant IR bloat and potentially inefficient hardware execution. Consider if a more efficient lowering (like a binary search tree of selects or a jump table if supported by the target) would be beneficial if the slot count limit is increased in the future.

gemini-code-assist · 2026-05-19T08:04:19Z

+    // Peel at most one add/sub of a constant.
+    Value rem = inner;
+    int peeled = 0;
+    while (peeled++ < 4) {


The heuristic of peeling up to 4 constant additions/subtractions is simple and likely sufficient for common prefetch patterns. However, it is a hardcoded limit. If more complex index expressions are expected, a more robust affine expression walker might be needed.

gemini-code-assist · 2026-05-19T08:04:19Z

+  if (isPTOLowPrecisionType(elemTy))
+    return emitOpError() << "slot dtype " << elemTy
+                         << " is not supported by pto.alloc_multi_tile yet";


Low precision types are explicitly blocked for multi-buffer allocations. If this is a temporary hardware or compiler limitation, it should be tracked with a TODO or a reference to a tracking issue to ensure it's addressed when support is added.

reedhecre · 2026-05-19T08:10:27Z

Codex Review

该评论由 review 机器人自动更新。

PR: Add explicit multi-buffer annotation and manual buffer select support #685 Add explicit multi-buffer annotation and manual buffer select support
Author: zhangstevenunity
Base/Head: main / claude/serene-bhaskara-f0221e
Head SHA: e0e56ee26aa4
Trigger: PR 有新提交
Generated At: 2026-06-02T01:53:03Z
Previous Head SHA: f3729c0658f1
Status: completed

Summary

PR #685 有 3 个需要优先处理的问题：level3 模式下 multi-buffer 会被静默降成 single-buffer，函数边界上的 multi_tile_buf 会丢失 slot 语义，另外 alloc_multi_tile 还引入了与 alloc_tile 不一致的低精度类型限制。

Findings

P1 `--pto-level=level3` 下 multi-buffer 会被静默降成单 buffer tools/ptoas/ptoas.cpp:1875

当前主流程只在非 level3 模式下运行 PlanMemory（这里从而生成 multi-buffer 需要的多地址 pto.pointer_cast），但 alloc_multi_tile / multi_tile_get 没有任何 level3 禁止逻辑。这样在 --pto-level=level3 下，PTOResolveBufferSelect 会走到 !root 回退分支，把每个 pto.slot_marker 直接替换回原始 memref，结果 %mb[%0]、%mb[%1] 等所有 slot 都落到同一块物理存储上，multi-buffer 语义被静默降成 single-buffer，属于错误结果。这里应该在 level3 入口直接拒绝 multi-buffer IR。

P2 函数边界上的 `multi_tile_buf` 不会被拒绝，但 slot 语义会丢失 lib/PTO/Transforms/PTOViewToMemref.cpp:1406

PTOViewToMemref 会无条件把函数签名里的 !pto.multi_tile_buf<...> 改写成单个 slot 的 memref 类型，但没有任何 verifier / lowering 检查去拒绝这种用法。这样跨函数边界后的值既没有 pto.multi_buffer 属性，也没有多地址 pto.pointer_cast 作为锚点，后续 PTOResolveBufferSelect 只能把 pto.multi_tile_get 退化成对同一 memref 的 identity view。也就是说，函数参数/返回值上的不同 slot 会丢失物理区分，和设计文档里“初版不支持出现在函数参数/返回值”这一约束不一致，而且会产生错误别名语义。

P2 `alloc_multi_tile` 额外禁止了现有可用的低精度 tile 类型 lib/PTO/IR/PTO.cpp:2495

AllocMultiTileOp::verify() 新增了 isPTOLowPrecisionType(elemTy) 的硬性拒绝，而普通 pto.alloc_tile 并没有这条限制。这样 hif8 / FP4 等已经能在单 buffer tile 上工作的类型，到了 alloc_multi_tile 就会直接报错，和本 PR 文档里“slotType 约束与单个 tile_buf 相同”的约定不一致。这会让一批现有低精度 tile kernel 无法使用新引入的 multi-buffer 标注，属于明确的兼容性缺口。

Remove the `allComparableSlotPairsEqual` early-bail in `getMultiBufferEventIdInfo`. When producer and consumer access a multi-buffer alloc through the same slot SSA, GSS previously fell back to a single static event id; InsertSync kept allocating N dyn event ids so consecutive iterations touching different physical slots could overlap. The two solvers now emit the same pre-loop primes, loop-body dyn `set_flag_dyn` / `wait_flag_dyn`, and post-loop drains for this case. * `multi_tile_prefetch_gss_event_id.pto`: CHECKs updated to expect 2-slot dyn flag pipeline (matching InsertSync). * Design doc: status table marks alignment landed; same-SSA asymmetry removed from limitations / follow-ups. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Resolves conflicts in: - lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp: Keep both slot-affine forward-dep filter (this branch) and CanPrunePipeVBarrier (main's #674). They're orthogonal: the affine filter prunes depVec for same-iter deps; the pipe-v pruner short- circuits the whole sync if all remaining deps are exact-same-access on PIPE_V. - lib/PTO/Transforms/InsertSync/PTOIRTranslator.cpp: Keep the arith.select alias handler (multi-buffer dynamic-slot path) and the multi-address PointerCast slot-offset population. Main only had a codecheck reformat of the single-address-only legacy path, which is now the `op.getAddrs().size() == 1` branch. - lib/PTO/Transforms/InsertSync/SyncCodegen.cpp: Keep this branch's full rewrite of CreateSetWaitOpForMultiBuffer that emits pto.set_flag_dyn / pto.wait_flag_dyn via an N-way select chain over eventIds[slot % N]. Main had a stale GetBufferSelected call whose return was already `(void)`-discarded; the new dyn-flag path supersedes it. - tools/ptoas/ptoas.cpp: Keep the createPTOResolveBufferSelectPass() registration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist · 2026-05-21T06:24:27Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

In getBeforeAfterSyncMaps the multi-buffer back-edge dyn event id was keyed off `findSlotSSAExprForRWOp(op)`, which returns the op's *first* slot_marker memref (read operands scanned before writes) -- unrelated to the buffer that actually drove the N-event-id allocation. When an op touches more than one multi-buffer at different slots/phases (e.g. a software-pipeline stage that reads one rotating buffer and writes another), `set_flag_dyn` / `wait_flag_dyn` indexed the wrong event lane, so the real cross-iteration RAW/WAR/WAW on the hazard buffer could be let through -- a data race. Capture the conflicting buffer's slot per side directly in getMultiBufferEventIdInfo, where the hazard MemInfo pair is in hand, and thread it through EventIdInfo (slotExprOp1/slotExprOp2, keyed to op1/op2 == rwOp1/rwOp2). At the backfill, map the slots onto the set/wait sides by op identity, since getSetWaitOcc may order the producer onto either op. If the slot is absent or ambiguous (more than one distinct slot participates in the pair's conflicts) it is left null, so codegen degrades to the safe N-static set_flag/wait_flag fanout instead of selecting a wrong lane. This mirrors the InsertSync path, which already binds the slot to the dep pair's baseBuffer. Add a regression test (multi_tile_two_buf_slot_binding.pto) where the producing tadd reads buffer A at phase (i+1)%2 and writes the hazard buffer B at phase i%2; the test pins B's WAR wait to B's slot and fails on the previous behaviour. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

An op can have multiple tile inputs/outputs, so the producer and consumer slots are resolved independently and one side can come back ambiguous (touching several multi-buffers at different slots) while the other stays a single clean slot. That previously yielded a mismatched static-set / dyn-wait pair, which does not rendezvous on a consistent event lane. Gate symmetrically: a dyn set_flag/wait_flag is emitted only when BOTH sides have an unambiguous slot for the shared hazard buffer; otherwise both fall back to the N-static fanout (safe over-synchronization). Common single-multi-buffer-per-op prefetch is unaffected (both sides carry the buffer's slot). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add explicit multi-buffer support

f215ce1

gemini-code-assist Bot reviewed May 19, 2026

View reviewed changes

Refine multi-tile slot sync handling

5b6f6cb

zhangstevenunity changed the title ~~Add explicit multi-buffer support~~ Add explicit multi-buffer and manual buffer select support May 19, 2026

zhangstevenunity changed the title ~~Add explicit multi-buffer and manual buffer select support~~ Add explicit multi-buffer annotation and manual buffer select support May 20, 2026

zhangstevenunity marked this pull request as ready for review May 21, 2026 06:24

zhangstevenunity and others added 5 commits June 1, 2026 18:32

Merge branch 'main' into claude/serene-bhaskara-f0221e

0962124

fix

3b3bdd2

fix ci build bug

7ff27b9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add explicit multi-buffer annotation and manual buffer select support#685

Add explicit multi-buffer annotation and manual buffer select support#685
zhangstevenunity wants to merge 9 commits into
mainfrom
claude/serene-bhaskara-f0221e

zhangstevenunity commented May 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

reedhecre commented May 19, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zhangstevenunity commented May 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

reedhecre commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Review

Summary

Findings

Uh oh!

gemini-code-assist Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

reedhecre commented May 19, 2026 •

edited

Loading