
[Perf] SM103 tcgen05.ld.red for fused TMEM load + row-max in softmax#2449

Open
LopezCastroRoberto wants to merge 2 commits into Dao-AILab:main from LopezCastroRoberto:perf/ld.red-upstream

Conversation


@LopezCastroRoberto LopezCastroRoberto commented Apr 9, 2026

Summary

Uses the SM103-only tcgen05.ld.red instruction to fuse the TMEM load with a hardware max reduction in FA4's softmax step, eliminating fmax ALU ops per tile. The max is computed in the TMEM controller at zero ALU cost.
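To make the semantics concrete, here is a minimal NumPy sketch (not the kernel code) of what the fusion buys: a plain load leaves the row-max reduction to per-lane `fmax` ops, while the `.red` variant hands back the tile and its row-max together. The function names are illustrative, not real APIs.

```python
import numpy as np

def load_tile(tmem):
    """Plain tcgen05.ld analogue: tile is loaded, row-max reduced afterwards."""
    tile = np.asarray(tmem, dtype=np.float32)
    row_max = tile.max(axis=1)  # this reduction is the per-tile fmax ALU cost
    return tile, row_max

def load_tile_red(tmem):
    """tcgen05.ld.red analogue: TMEM controller returns data and row-max together."""
    tile = np.asarray(tmem, dtype=np.float32)
    return tile, tile.max(axis=1)  # modeled as free: no fmax ops in the kernel

rng = np.random.default_rng(0)
tmem = rng.standard_normal((32, 64)).astype(np.float32)
t1, m1 = load_tile(tmem)
t2, m2 = load_tile_red(tmem)
assert np.array_equal(t1, t2) and np.array_equal(m1, m2)  # same data, same max
```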

Benchmark (B300, seqlen=8192, upstream bench_sm90.py config with do_bench_cudagraph):

| hdim | causal | Speedup | Baseline TFLOPS | ld.red TFLOPS |
|------|--------|---------|-----------------|---------------|
| 64 | non-causal | +2.9% | 1110 | 1142 |
| 64 | causal | +6.5% | 1024 | 1091 |
| 96 | non-causal | +5.4% | 1344 | 1416 |
| 96 | causal | +4.5% | 1222 | 1277 |
| 128 | non-causal | +3.1% | 1524 | 1571 |
| 128 | causal | +1.0% | 1353 | 1366 |

Zero regressions across 2,000+ configurations in an exhaustive sweep (3 hdims × 7 head configs × 2 causal settings × 7 batch sizes × 7 sequence lengths).
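The Speedup column can be re-derived from the two TFLOPS columns in the table above:

```python
# Sanity-check the reported speedups: speedup = (ld.red / baseline - 1) * 100.
rows = [  # (baseline TFLOPS, ld.red TFLOPS, reported speedup %)
    (1110, 1142, 2.9), (1024, 1091, 6.5), (1344, 1416, 5.4),
    (1222, 1277, 4.5), (1524, 1571, 3.1), (1353, 1366, 1.0),
]
for base, ldred, reported in rows:
    speedup = round((ldred / base - 1) * 100, 1)
    assert speedup == reported, (base, ldred, speedup, reported)
```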

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft April 9, 2026 20:00
@tridao
Member

tridao commented Apr 9, 2026

Thanks! I think cute-dsl has a copy atom that does this, instead of having to call PTX directly?

@LopezCastroRoberto
Author

@tridao yeah, that was actually my first option, but I couldn't make the cute-dsl copy atom work.

I verified that Ld and LdRed produce identical data, identical layouts, and correct hardware max values. But cute.copy(LdRed) produces wrong results on subsequent tiles.

CUTLASS issue #3090 is very likely the reason. CuTe's LdRed is lowered to llvm.inline_asm without has_side_effects=True, causing LLVM's CSE pass to merge multiple reads from the same TMEM address even though the MMA warp writes new data between reads.

The raw PTX workaround in this PR sets has_side_effects=True explicitly, preventing CSE. Once #3090 is fixed, the raw PTX can be replaced with the copy atom for cleaner code and CuTe's native instruction scheduling.

@tridao
Member

tridao commented Apr 14, 2026

@LopezCastroRoberto
Author

Yes, the CUTLASS MLA example works, I verified it on B300. I tried integrating LdRed32x32bOp into FA4 following the exact same pattern. The correct PTX gets emitted, but the kernel produces wrong results. After some debugging, I still think the issue is the same as described in NVIDIA/cutlass#3090

I think CuTeDSL lowers tcgen05.ld and tcgen05.ld.red differently?

This would mean LLVM's CSE pass is free to merge multiple ld.red calls that share the same TMEM source address, even when the TMEM contents change between reads due to intervening MMA writes.

My guess here is that, in FA4, since stage is a compile-time constant, every call to softmax_step produces the same TMEM address. LLVM sees identical ld.red calls across loop iterations and merges them. I confirmed this theory with PTX instruction counts (same observation as NVIDIA/cutlass#3090):

```
Raw PTX (has_side_effects=true): 8 tcgen05.ld.red instructions (4 tiles × 2 unrolled iterations)
CuTe LdRed32x32bOp:              4 tcgen05.ld.red instructions (half eliminated by CSE)
```

The CUTLASS example doesn't hit this since MLA indexes the TMEM tensor with a runtime-varying pipeline state:

```python
# mla_decode_fp16.py, softmax():
tStS = tStS_staged[None, None, None, mma_s_consumer_state.index]  # runtime Int32
tAcc = tStS[(None, None), 0, 0]
```

mma_s_consumer_state.index alternates at runtime (MLA has mma_s_stage=2). This produces add-based addressing in PTX where the TMEM address depends on a runtime value:

```
; MLA: runtime stage index → address varies per iteration
shl.b32  %r2174, %r4157, 6        ; runtime offset from pipeline state
add.s32  %r2164, %r426, %r2174    ; addr = base + runtime_offset
tcgen05.ld.red.sync.aligned.32x32b.x64.max.f32 {...}, %r2163, [%r2164];
```

So my theory is: LLVM can't prove that two iterations produce the same address, so it keeps both reads, which is why this approach works.

FA4 uses stage=0 (compile-time constant for all iterations within one softmax_loop call), producing a fixed TMEM address that LLVM can merge across iterations.
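The addressing difference can be modeled in a few lines. The `<< 6` stage stride comes from the PTX above; the base address and helper are hypothetical, just to illustrate why CSE fires in one case and not the other:

```python
BASE = 0x1000  # hypothetical TMEM base address

def tmem_addr(stage_index: int) -> int:
    # Mirrors the PTX above: addr = base + (stage_index << 6)
    return BASE + (stage_index << 6)

# MLA: runtime pipeline state alternates between 2 stages → addresses differ,
# so LLVM cannot prove two ld.red calls read the same TMEM location.
mla_addrs = [tmem_addr(i % 2) for i in range(4)]
assert len(set(mla_addrs)) == 2  # two distinct addresses: CSE cannot merge

# FA4: stage is the compile-time constant 0 → every iteration reads the same
# address, and identical side-effect-free asm calls get merged by CSE.
fa4_addrs = [tmem_addr(0) for _ in range(4)]
assert len(set(fa4_addrs)) == 1  # one address: CSE merges the ld.red calls
```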

NVIDIA/cutlass#3090 seems to have new activity since yesterday.
