
feat: setup_context for FlashAttnFunc (torch.func.grad)#2405

Open
NJX-njx wants to merge 2 commits into Dao-AILab:main from NJX-njx:pr/functorch-setup-context

Conversation


@NJX-njx NJX-njx commented Mar 28, 2026

Summary

Implements setup_context on FlashAttnFunc so torch.func.grad / functorch transforms work (#2071), per maintainer guidance.

  • When gradients are enabled, forward returns extra tensors for context; non-differentiable outputs are marked accordingly.
  • The user-visible output is cloned when saving for backward so it is not a view of out_padded (which is marked non-differentiable).
  • flash_attn_func unpacks the 5-tuple training path so the public API is unchanged.
  • Adds tests/test_flash_attn_functorch.py (skipped if flash_attn_2_cuda is not built).
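For readers unfamiliar with the separate-form autograd API this PR adopts, here is a minimal sketch on a toy function (not FlashAttnFunc itself): forward takes only the inputs, and all ctx writes move into setup_context, which is what lets torch.func trace the forward pass.

```python
import torch

class Square(torch.autograd.Function):
    """Toy Function using the separate-form API that torch.func requires."""

    @staticmethod
    def forward(x):
        # Separate form: forward receives only the inputs, no ctx.
        return x ** 2

    @staticmethod
    def setup_context(ctx, inputs, output):
        # All ctx writes (save_for_backward, mark_non_differentiable, ...)
        # live here, never in forward.
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

# torch.func.grad can now differentiate through the custom Function.
g = torch.func.grad(lambda x: Square.apply(x).sum())(torch.tensor([3.0]))
print(g)  # tensor([6.])
```

With the legacy combined `forward(ctx, ...)` form, the same `torch.func.grad` call raises instead of returning a gradient.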

Testing

  • pytest tests/test_flash_attn_functorch.py (skipped here without CUDA extension build).
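The test file itself is not reproduced in this view; as a hedged sketch of the kind of parity check such a test typically performs, one can compare torch.func.grad against the classic backward pass. A plain-PyTorch attention is used here as a stand-in for flash_attn_func, since the flash_attn_2_cuda extension may not be built.

```python
import torch

def attn(q, k, v):
    # Plain scaled-dot-product attention, standing in for flash_attn_func.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q = torch.randn(2, 4, 8)
k = torch.randn(2, 4, 8)
v = torch.randn(2, 4, 8)

# Gradient via torch.func (the path this PR enables for FlashAttnFunc).
g_func = torch.func.grad(lambda q: attn(q, k, v).sum())(q)

# Reference gradient via the classic autograd backward pass.
q_ref = q.clone().requires_grad_(True)
attn(q_ref, k, v).sum().backward()

assert torch.allclose(g_func, q_ref.grad, atol=1e-5)
```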

Fixes #2071


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e84a94d6b


Comment on lines +875 to +876
def setup_context(ctx, inputs, output):
if not isinstance(output, tuple) or len(output) != 5:

P1: Switch FlashAttnFunc to the separate forward/setup_context API

setup_context was added, but FlashAttnFunc.forward is still defined in the combined form (forward(ctx, ...)). In PyTorch’s torch.func integration, custom autograd Functions must use the separate form (forward(*args) + setup_context) to participate in transforms; with this mixed definition, torch.func.grad/vjp on flash_attn_func will still fail instead of using the new context path. Please convert forward to the separate signature and keep all ctx writes in setup_context.



NJX-njx commented Mar 28, 2026

Thanks for the review. I've removed `ctx` from `FlashAttnFunc.forward` so it uses the separate `forward(*args)` + `setup_context` form required for `torch.func` transforms (commit c1ff5c5 on this branch).

NJX-njx added a commit to NJX-njx/flash-attention that referenced this pull request Mar 28, 2026
PyTorch functorch requires forward(*args) + setup_context without the legacy
forward(ctx, ...) combined form. Addresses Codex review on PR Dao-AILab#2405.

Made-with: Cursor
NJX-njx added 2 commits March 29, 2026 15:34
- Move save_for_backward to setup_context for functorch/torch.func.grad compatibility.
- When grad is enabled, return auxiliary outputs and mark non-differentiable tensors;
  clone trimmed output so it is not a view of out_padded.
- Unpack flash_attn_func return when the 5-tuple grad path is used.
- Add tests/test_flash_attn_functorch.py (skipped if CUDA ext missing).

Made-with: Cursor
PyTorch functorch requires forward(*args) + setup_context without the legacy
forward(ctx, ...) combined form. Addresses Codex review on PR Dao-AILab#2405.

Made-with: Cursor
NJX-njx force-pushed the pr/functorch-setup-context branch from c1ff5c5 to e12235f on March 29, 2026 at 07:35

NJX-njx commented Mar 29, 2026

Rebased onto main and removed unrelated files (Chinese documentation and hopper/flash.h comments) that were accidentally included. The Codex review about the separate forward/setup_context API was already addressed in the latest commit.


Development

Successfully merging this pull request may close these issues.

setup_context method required for FA autograd Function for working with torch.func.grad
