Complete rewrite of forward mode by yebai · Pull Request #1151 · chalk-lab/Mooncake.jl

yebai · 2026-04-20T18:00:51Z

Complete rewrite of forward mode. The new implementation is more Cassette-like: it lifts each primal method so that arbitrary arguments can pass through it, including Dual and NDual numbers.

This needs to be split into smaller PRs for a comfortable review.

Do not review; work in progress.

CI Summary — GitHub Actions

Documentation Preview

Mooncake.jl documentation for PR #1151 is available at:
https://chalk-lab.github.io/Mooncake.jl/previews/PR1151/

Performance

Performance Ratio:
Ratio of time to compute gradient and time to compute function.
Warning: results are very approximate! See here for more context.

┌───────────────────────┬──────────┬──────────┬─────────────┬─────────┬─────────────┬────────┐
│                 Label │   Primal │ Mooncake │ MooncakeFwd │  Zygote │ ReverseDiff │ Enzyme │
│                String │   String │   String │      String │  String │      String │ String │
├───────────────────────┼──────────┼──────────┼─────────────┼─────────┼─────────────┼────────┤
│              sum_1000 │ 181.0 ns │     1.61 │        1.61 │   0.773 │        3.49 │   6.92 │
│             _sum_1000 │  1.09 μs │     6.24 │        1.03 │  4270.0 │        38.6 │   1.06 │
│          sum_sin_1000 │  7.42 μs │     2.46 │         1.1 │    1.63 │        11.0 │   1.71 │
│         _sum_sin_1000 │  4.61 μs │      3.8 │        2.64 │   354.0 │        18.0 │   3.09 │
│              kron_sum │ 195.0 μs │     13.3 │         3.3 │    10.9 │       497.0 │   22.3 │
│         kron_view_sum │ 263.0 μs │     12.8 │        5.31 │    18.4 │       441.0 │   7.65 │
│ naive_map_sin_cos_exp │   2.2 μs │     2.85 │        1.52 │ missing │        8.45 │   2.17 │
│       map_sin_cos_exp │  2.11 μs │      3.4 │        1.63 │    1.55 │        7.63 │   2.77 │
│ broadcast_sin_cos_exp │  2.26 μs │     3.01 │        1.54 │    4.33 │        1.45 │   2.13 │
│            simple_mlp │ 345.0 μs │     5.01 │        2.89 │    1.61 │        9.35 │   3.19 │
│                gp_lml │ 367.0 μs │     5.86 │        1.73 │     2.7 │     missing │   3.21 │
│    large_single_block │ 471.0 ns │     5.19 │        1.93 │  4060.0 │        33.0 │   2.08 │
└───────────────────────┴──────────┴──────────┴─────────────┴─────────┴─────────────┴────────┘

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…nfwd bridge Step 1: Delete old forward-mode stack (forward_mode.jl, 110+ frule!! methods, LazyFoRRule/DynamicFoRRule, forward debug tests). Stub dual_ir and forward IR inspection. Step 2: Reimplement public forward via nfwd bridge. NfwdCache for forward-mode with chunked gradients. HVPCache for nfwd-over-reverse Hessian-vector products via NDual{T,1} reverse rules. Add _nfwd_lift for Tuple inputs. Update docs and tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add NTangent width-aware tangent container to src/tangents/dual.jl - Add tangent_type(Val(N), P) and dual_type(Val(N), P) width-aware queries - Update verify_dual_type to handle NTangent-wrapped tangents - Restore 171 frule!! definitions from main across 10 rule files - Add _chunk_pack_tangent disambiguations for NTangent{Vararg{NoTangent}} Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add PrimalMode context to contexts.jl with is_primitive always false - Add PrimalMode interpreter delegation in abstract_interpretation.jl - Add primal_mode.jl (605 lines) from primal-mode-migration donor - Update optimise_ir! to use PrimalMode interpreter - Rename DerivedFRule/LazyFRule/DynamicFRule to DerivedPrimal/LazyPrimal/DynamicPrimal - Rename forwards_mode_design.md to primal_mode_design.md - Add primal_mode.jl test (44 lines) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Port LazyFoRRule/DynamicFoRRule from main for forward-over-reverse lazy rule compilation (fixes x^4, x^6 HVP on Julia 1.10) - Add interp keyword to optimise_ir! for ForwardMode inlining control - Fix FoR value_and_hvp!! aliasing bug: copy grad/hvp before returning so successive calls don't overwrite earlier results - Add dual-mode HVP correctness tests (FoR + RoF): quadratic, Rosenbrock, multi-arg, namedtuple, nested tuple intermediates - Add dual-mode Hessian correctness tests (FoR + RoF): Rosenbrock, quadratic, multi-arg, namedtuple, nested tuple, tuple map/reduce Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add build_rrule methods and RRule call method to NfwdMooncake that were lost during the forward-mode refactoring. These are needed by integration tests (distributions) and the public API. Remove 4 @test_broken zero-allocation assertions for array-input nfwd gradient paths that are not yet resolved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…hecking Step 7 infrastructure: zero_dual/randn_dual/uninit_dual for NDual and Complex{NDual}, primal/tangent accessors for Array{NDual} and Array{Complex{NDual}}, ndual_width/check_ndual_width_consistency for runtime width validation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add width param to const_dual!, _uninit_dual dispatcher (4 overloads) - Thread width through __unflatten_dual_varargs with _group_vararg_dual - Add W type param to DerivedPrimal, DynamicPrimal for zero-alloc width - Add width field to LazyPrimal, DynamicPrimal; propagate in _copy - Relax Vararg{Dual,N} to Vararg{Any,N} in DerivedPrimal/DynamicPrimal - Include width in closure cache key and primal_rule_type - Add __get_primal(::NDual) and _partial_i(::NDual) in NfwdMooncake.jl - All modify_primal_stmts! handlers use info.width for _uninit_dual Scalar end-to-end verified: sin, sin∘cos, x*x at width-2. Tests: basic (26533 pass), Nfwd (477 pass). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…idth-N Add element-wise NDual lifting for container types: - MemoryRef/Memory: dual_type(Val(N), MemoryRef{T}) → MemoryRef{NDual{T,N}} (1.11+) - Tuple: dual_type(Val(N), Tuple{T1,T2}) → Tuple{dual_type(Val(N),T1), ...} - Val{0} ambiguity resolvers for MemoryRef/Memory Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add frule!! overloads for NDual containers (Array{NDual}, MemoryRef{NDual}, Memory{NDual}) so the primal-mode compiler can trace through array operations at width-N. Key additions: - _uninit_dual(Val{N}, Array{T,D}): returns bare Array{NDual{T,N}} - lmemoryrefget/lmemoryrefset!/memoryrefnew: NDual container variants - lgetfield for bare NDual containers (tangent = NoTangent) - sum/sum(abs2) NDual overloads for primitive array reductions - _HasNDual/_NDualMemTypes dispatch unions - copy for Array{NDual} Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Make the primal-mode compiler NDual-aware so that chunk_size > 1 produces width-N dual numbers instead of looping width-1. Key changes: - FCache carries width; value_and_gradient!! splits into _gradient_width1 (Dual path) and _gradient_widthN (NDual path) - _make_ndual_seed / _combine_to_ndual build NDual inputs from per-slot seed tangents - prepare_derivative_cache passes Val(cs) to build_frule - _uninit_dual lifts Memory type constants for NDual containers - _new_ frule handles NDual container construction (Array, Memory) and struct construction with NTangent - @zero_derivative frule!! uses per-arg type parameters to avoid Union{Dual{<:Any},Any} collapsing to Any - Bare NDual container frule!! overloads for memoryrefnew, lmemoryrefget, lmemoryrefset!, getfield, copy - NDual frule!! for sincosd, sincospi, modf (tuple-returning) - unalias declared as forward-mode primitive (workaround for OC segfault with NDual containers in broadcast) - _has_ndual / _dual_or_ndual helpers for NDual dispatch in rules - NTangent added to StandardTangentType; _get_tangent_field overloads for NTangent field access Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add _combine_to_ndual methods for Complex{T} scalars and AbstractArray{Complex{T}} so chunked forward mode preserves tangent directions instead of dropping them via NoTangent fallback - Replace zero(T) with f(x...) in empty-input Hessian early return so the actual primal value is returned Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…biguity - Allow width-1 Dual calls through chunked FCache by padding into lane 1 of the width-W rule and extracting via _ndual_output_to_width1 - Add _combine_to_ndual complex NoTangent disambiguators (resolves Aqua) - Change generic _combine_to_ndual fallback to NTangent (structured types) - Fix _eval_dir NTangent check and add width-N path for tuple derivatives - Fix HVP reverse-over-forward seed type for constant functions - Add Complex{NDual} support: _has_ndual, _dual_or_ndual, lgetfield frule, _new_ frule, _ndual_width, _tangent_dir - Fix empty Hessian to return f(x...) instead of zero(T) - Add regression tests for chunked complex and empty Hessian Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Discover output type via compiled rule instead of raw f(x...) in reverse_over_forward prepare, preserving mutation-restoration semantics - Document no-direct-call contract in prepare_hvp_cache docstring - Add _combine_to_ndual for Tuple: element-wise NDual packing - Add lgetfield frules for bare Tuple/NamedTuple with NDual elements Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ages - Remove Tuple specialization in _count_slots that bypassed IdDict-based alias deduplication, causing redundant evaluations for aliased inputs - Respect config.silence_debug_messages in prepare_derivative_cache instead of hard-coding true Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Bring back detailed comments documenting cache-key design, stack aliasing invariants, thread-safety caveats, and why sig_or_mi is excluded from the DynamicFoRRule key. These explain non-obvious correctness constraints. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…es, fix Aqua issues - Guard all Memory/MemoryRef references with @static if VERSION >= v"1.11-rc4" in Mooncake.jl, new.jl, and primal_mode.jl - Move _HasNDual const from memory.jl to Mooncake.jl (used by always-loaded code) - Fix Aqua unbound type param: _ndual_width uses fieldcount(L) instead of NTuple{W} - Fix Aqua method ambiguity: add _combine_to_ndual(::Tuple, ::Tuple{}) disambiguator - Skip forward-mode testing for @zero_derivative typed-vararg cases (pre-existing bug) - Guard chunked complex array test for 1.11+ only (Core.arrayref lacks width-N frule) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ive vararg frule - Consolidate all Memory/MemoryRef NDual overloads (_has_ndual, _dual_or_ndual, _find_ndual_memref, _ndual_width, _tangent_dir, _uninit_dual) into memory.jl which is already conditionally loaded on Julia 1.11+. Removes @static if guards from Mooncake.jl, new.jl, and primal_mode.jl. - Fix _vararg_any_type to produce Union{<:T, Dual{<:T}} so typed vararg frule signatures match Dual-wrapped args. Removes mode=ReverseMode workarounds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix forward cache review follow-ups (docs, config, interface, tests) - Merge origin/main into step-8-ndual-routing - Fix NDual memory copy forwarding - Inline and remove single-use helper functions in interface.jl Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Handle NDual outputs in forward cache summaries and add arrayref/arrayset forward rules for arrays storing NDual elements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-04-20T19:25:26Z

Codecov Report

❌ Patch coverage is 67.63780% with 411 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/nfwd/NfwdMooncake.jl	34.19%	102 Missing ⚠️
src/interface.jl	85.20%	67 Missing ⚠️
src/rules/memory.jl	13.33%	52 Missing ⚠️
src/rules/new.jl	13.79%	50 Missing ⚠️
src/utils.jl	53.84%	36 Missing ⚠️
src/rules/rules_via_nfwd.jl	9.37%	29 Missing ⚠️
src/interpreter/primal_mode.jl	91.91%	24 Missing ⚠️
src/tangents/dual.jl	55.55%	16 Missing ⚠️
src/Mooncake.jl	31.25%	11 Missing ⚠️
src/rules/misc.jl	40.00%	9 Missing ⚠️
... and 6 more

📢 Thoughts on this report? Let us know!

@testset

The IEEEFloat block above already uses `let nd = ..., nd_zero = ...` to contain the test bindings; do the same for the Memory/MemoryRef block so test-local names aren't leaking into the surrounding @testset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`Int` is a platform alias (`Int64` on x64, `Int32` on x86). The hardcoded `Vector{Int64}` literal in `test/nfwd/nfwd.jl:416, :429` caused basic-lts-x86 to fail once `nfwd/nfwd.jl` was wired into the basic test group: the actual error message renders `Vector{Int32}` on 32-bit runners. Use `Vector{$Int}` so the test interpolates the same alias the type-pretty-printer uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`_group_vararg_dual(::Val{N}, ...)` packed the vararg tail into a single `Dual(group_primal, NTangent(per_dir))`, but `dual_type(Val(N), Tuple{...})` decomposes a concrete tuple element-wise into a Tuple-of-Duals. The OC's vararg slot was therefore compiled to expect a Tuple-of-Duals and rejected the packed shape with a TypeError on every chunked vararg call (Float64 and struct cases alike). Branch on width and pass `rest` through unchanged on the chunked path — each element is already a width-N Dual / NDual, which is exactly what the OC wants. The legacy `width=nothing` path still packs into a single Dual-of-Tuple as before. `_partial_i` (both the Dual and NDual specialisations) becomes dead and is removed. Regression test in `test/interface.jl` exercises a Float64 vararg function under `chunk_size=2`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `DerivedPrimal{P,T,isva,nargs,W}` callable, the 1.10 `__call_rule` overloads (both for `DerivedPrimal` and `LazyPrimal{...,DerivedPrimal{...}}`), and the `_isva` / `_nargs` accessors all matched only the first four type parameters. Without W in dispatch, two `DerivedPrimal` instances that differ only in `width` (e.g. `Nothing` vs. `Val{2}`) shared a single compiled method body, forcing runtime dispatch on `fwd.width` inside `__unflatten_dual_varargs`. Threading W makes each width specialise independently and keeps return-type inference sharp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-add the four chunked-array gradient `count_allocs == 0` assertions that were dropped in `ed4ceca3`. They remain `@test_broken` because the array-input gradient still allocates per call (~10–15 bytes per element from the seed `IdDict` and the `Memory{NDual}` lift in `_dual_or_ndual` — see follow-up #16). Restoring as `@test_broken` documents the regression target so any unintended improvement to zero allocs surfaces immediately, as AGENTS.md requires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The file shrank from 813 to ~280 lines when the IR-based forward compiler (and its `NfwdMooncake.build_rrule(...; chunk_size=N)` API) was retired in `9a81f81fe` / `c5dd59eb3`. Add a header comment listing what this file covers (NDual dispatch helpers, width-aware seed constructors, the direct NDual frule!! sweep) and what migrated (chunked AD end-to-end → `test/interface.jl`; Memory/Array NDual rules → `test/rules/memory.jl`; interpreter sweep → `test/interpreter/primal_mode.jl`) so future readers can tell intentional scope from accidental gaps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`_prepare_hvp(:reverse_over_forward)` previously ran the prep pullback inside the if/else that determined `T_out`, then validated `T_out` was `<: IEEEFloat` afterwards. For non-IEEEFloat outputs the bogus prep call fired (with an inappropriate `zero(T)` seed) before the user-facing ArgumentError. Compute `T_out` from the output's structure, validate first, then run a single targeted prep call with the right seed shape. The fdata_bufs zero-out below the prep call already guards against any side effects, so this is purely a structural cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CoDual primals carry the unlifted primal type, so reverse-mode dispatch never lands on `Memory{<:_HasNDual}` / `Array{<:_HasNDual}`. The existing rrule!!s in this file dispatch on `CoDual{Memory{T}}` (original primal), which is the right and only path. Add a comment block explaining why we don't add error-stub rrule!!s for the NDual signatures: they would broaden the rrule!! signature for a dispatch that cannot arise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The existing parametric `HVP correctness` block exercises each mode independently against analytical Hessians, but never compares the two modes against each other. A polarity bug in the inner gradient closure or a wrong-shape prep seed can drift one mode silently while the analytical assertions still pass for the other. Add a single-arg and a multi-arg cross-mode equality testset (`atol=1e-10`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`zero_derivative(f::Dual, args...)` previously dispatched through two narrow overloads: `Vararg{Dual,N}` for homogeneous scalar Duals and `f::Dual, x1::T, x_rest::Vararg{T}` for homogeneous arrays. Truly mixed varargs (e.g. `(scalar_Dual, Array{Dual})`) hit a MethodError because neither overload's homogeneity constraint matched. Replace both with a single `Vararg{Union{Dual, Array{<:Dual}, Array{<:Complex{<:Dual}}}}` overload using a small `_zd_primal` helper that extracts primals from Duals and passes arrays through. The narrow NDual / Memory overloads in `nfwd/NfwdMooncake.jl` remain disjoint (NDual ⊄ Dual; Memory{<:Dual} ⊄ Array). `Test.detect_ambiguities` is clean. Add a mixed-vararg regression test in `test/tools_for_rules.jl` "container dispatch". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The existing test at the same site asserts the `"Chunked tuple inputs must use NTangent consistently"` ArgumentError path; it never asserts that the consistent-NTangent success path works. A refactor could silently break the success path while preserving the error. Add the round-trip case (same `mixed_f`, consistent NTangents across both inputs) with the analytical reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bundle the cosmetic / minor docs items from the pr1151 punch list: - prepare_derivative_cache docstring: spell out chunk_size semantics (`nothing` = legacy width-1 vs `1` and `N>1` chunked NDual paths) and the cache-reuse contract (types, sizes, aliasing topology must match prep-time inputs, otherwise re-prepare). - value_and_hvp!! docstring: document the single-arg vs multi-arg `v` shape asymmetry (single-arg accepts AbstractVector or 1-tuple; multi-arg requires Tuple). - _gradient_widthN width-mismatch error: list the three common causes (missing NDual frule, missing `@is_primitive`, generic Dual+NTangent fallthrough) and suggest `Config(chunk_size=nothing)` as workaround. - _count_slots strict-fallback message: reword from "silent fallthrough would produce wrong tangents" to "would-be silent fallthrough is now an error" — the absence of the fallback is the safety, not the cause. - prepare_derivative_cache nfwd opt-out test: assert the `enable_nfwd` deprecation warning fires at Config construction (matching the contract tested in test/config.jl). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply review findings against `0eac54ada..d531bb9`: - src/tools_for_rules.jl: drop newly-introduced `_zd_primal` helper; the identical `_primal` is already defined in `src/tangents/dual.jl:165`. - src/interpreter/primal_mode.jl: drop unused `where {N}` on the chunked `_group_vararg_dual(::Val, rest)` pass-through; rewrite the `__unflatten_dual_varargs` docstring example so it covers both width=nothing and width=Val(N) shapes (was stale, only described the legacy path). - src/interface.jl: hoist `y_out isa Nfwd.NDual` into a single `is_ndual_out` local in `_prepare_hvp(:reverse_over_forward)` (was tested twice — once for `T_out`, once for the prep seed). Compress the `_gradient_widthN` width-mismatch error from 11 lines + 3 bullets to 4 lines naming the typical cause and pointing at the workaround. Tighten the `prepare_derivative_cache` "Aliasing topology" docstring bullet to state the user-visible contract without leaking the pre-allocated buffers detail. - src/rules/memory.jl: compress the rrule!! parity comment block from 8 lines to 2 — the rationale is just "CoDuals carry the unlifted primal type." - test/interface.jl: rename the awkward `_for` HVP-cache binding (leading underscore because `for` is a keyword) to `fwr` across both single-arg and multi-arg testsets. - test/nfwd/nfwdmooncake.jl: drop the "Historical note" paragraph referencing commits — that's git-log content, not file-header content. Tighten the out-of-scope section to inline cross-references. Performance-neutral; tests pass (interface 1702/1712, primal_mode 1347/1347, nfwd 771/771, tools_for_rules 677/677). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second simplify pass over the same range: - src/interface.jl: `prepare_derivative_cache` docstring partially duplicated the canonical chunk_size enumeration in `Mooncake.Config`'s docstring (`src/config.jl:27-31`). Replace the bullet list with a cross-reference to `Config`, keeping the unique `chunk_size=1` vs `chunk_size=nothing` distinction (different internal paths, both width-1) and the reuse-contract warning (which is genuinely new info). - src/interpreter/primal_mode.jl: Drop the two block comments above the `_group_vararg_dual` overloads — they restate the legacy-vs-chunked rationale already explained in the `__unflatten_dual_varargs` docstring expanded one commit earlier. Tests pass (interface 1702/1712). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step B+C of pr1151.md item #44: switch the public forward path to NDual{T,1} as the canonical width-1 form so the same primitive rule path serves chunk_size>=1 calls. - Config.chunk_size default flips from nothing to 1; docstring rewritten and test/config.jl assertion updated. - value_and_jacobian!! rewritten as _jacobian_widthN, supporting any Val{W} via the existing seed_buf / lift_buf infrastructure; legacy width=nothing path preserved as _jacobian_width1 for transition. - Step A additions: NDual variants for Base.eps / Base.FastMath.exp_fast / exp2_fast / exp10_fast / atan_fast (unary loop) and Base.FastMath.atan_fast (binary loop) in rules_via_nfwd.jl; Base.FastMath.sincos manual entry; @inactive_intrinsic macro now emits a Vararg{Union{Dual,NDual,Complex{<:NDual}}} variant; compilerbarrier and Core.ifelse get NDual variants. - Cascade fixes: _new_ now constructs Complex{NDual} from the lifted args (was stripping NDual); _combine_to_ndual gets a generic Dual{P, NTangent{...}} fallback for non-IEEEFloat aggregates; _dual_or_ndual gets Array{<:IEEEFloat} / Array{<:Complex{<:IEEEFloat}} overloads and an NTangent-of-NoTangent collapse to keep Dual{F, NoTangent} canonical for non-differentiable primals; verify_dual_value (debug mode) widened to accept NTangent-wrapped tangents; getfield gets a Tuple{Vararg{<:_HasNDual,...}} overload in rules/memory.jl; _get_tangent_field handles AbstractArray containers. - Helper rename: _ndual_primal -> _dual_or_ndual_primal -> primal. primal() is extended for AbstractArray{<:NDual} / Tuple / generic passthrough so rule bodies can use one name regardless of which lifted form was passed. - New tasks tracked: Step E (HVP RoF inner cache verification), Step F (delete obsolete Dual{P} frule!!s + dead code) — pending. Cascade is not yet fully closed; remaining failures involve MemoryRef{NoTangent} typeerrors, zero_derivative mixed-arg dispatch, getproperty OC mismatches, and count_allocs regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n-diff containers Step D follow-ups for pr1151.md item #44: - Define `Lifted`, `LiftedTuple`, and `DualOrNDual` Union aliases in `nfwd/NfwdMooncake.jl` next to `_HasNDual`. `Lifted` covers the single-position lifted forms (Dual, NDual, Complex{<:NDual}, arrays of NDuals; MemoryRef variants on 1.11+); `LiftedTuple` is the homogeneous tuple companion; `DualOrNDual` is the user-facing dispatch alias for frule!! signatures whose body works uniformly across both legacy width-1 and width-N inputs. - `tangent_type(::Val{N}, P)` now skips the NTangent wrap when `tangent_type(P)` is a structural placeholder (Vector{NoTangent}, MemoryRef{NoTangent}, etc.). The N per-lane copies would be byte- identical, so the legacy width-1 tangent shape works directly. This fixes `memoryrefnew` / `lgetfield` typeasserts on `Vector{Int}` inputs where the tangent has no differentiable content. - `_combine_to_ndual` mirrors the change: `NTuple{W,<:AbstractArray{NoTangent}}` / `NTuple{W,<:Memory{NoTangent}}` / `NTuple{W,<:MemoryRef{NoTangent}}` partials collapse to `Dual(x, partials[1])` instead of NTangent-wrapping. - `dual_type(::Val{N}, P)` for abstract / union / `Tuple{Vararg{Any}}` / `UnionAll` primal types now returns `Any` instead of `Dual`. Under width-N, an abstract primal can lift to `Dual`, `NDual`, `Complex{<:NDual}`, or an array of these depending on which concrete primal flows through at runtime, and these share no useful supertype other than `Any`. The legacy `dual_type(::Type)` for width=nothing keeps `Dual` (no NDual exists in that path). - `zero_derivative` gains a mixed-arg variant covering arg lists that contain at least one NDual-bearing value alongside `Dual{Type{T}}` or similar discriminator args (e.g. `isa(x::NDual, T::Type)`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

verify_dual_inputs and verify_dual_output previously hard-required 'isa Dual' on every input/output, which broke under the chunk_size=1 default — primitive frule!!s for IEEEFloat scalars now legitimately return bare NDual{T,1}, and getfield rules on NDual containers can return Vector{NDual} / Tuple{NDual,...} etc. Replace the hard 'isa Dual' check with a `_is_lifted_value(x)` predicate. Defined in debug_mode.jl with the Dual + Tuple cases; extended in NfwdMooncake.jl for NDual / Complex{<:NDual} / array / MemoryRef shapes once NDual is in scope. Mirrors the `Lifted` Union. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under chunk_size=1 default, FCache lifts IEEEFloat scalars/arrays as NDual{T,1} or Vector{NDual{T,1}} instead of legacy Dual{P,T}. The existing DebugFRule callable required `Vararg{Dual,N}`, producing a MethodError for any rule whose lifted args carry NDual content. Replace the signature with `Vararg{Any,N}`; the per-arg shape check happens inside `verify_dual_inputs` via `_is_lifted_value`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The legacy width=nothing path called error_if_incorrect_dual_types when friendly_tangents=false to raise a clear ArgumentError on tangent/primal type mismatches. Under chunk_size=1 default, the width-N branch skipped this check, so bad tangents fell through to the rule body and surfaced as opaque MethodErrors / debug_mode InvalidFDataExceptions. Run the same check on the width-N branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The earlier per-shape specialisations (`AbstractArray{NoTangent}`, `Memory{NoTangent}`, `MemoryRef{NoTangent}`) collided with the existing `AbstractArray{<:IEEEFloat}` / `AbstractArray{Complex{<:IEEEFloat}}` methods on the partials position, producing 13 Aqua ambiguities even though the actual call patterns don't overlap. Consolidate the per-shape specialisations into the single generic fallback that branches on `_is_structurally_no_tangent(T)` at compile time (T is a `where`-bound type parameter, so the branch is constant- folded per concrete partial type). Also remove a leftover duplicate `_combine_to_ndual(x, tangent_dirs::NTuple{W})` method that was triggering a precompile-time method overwriting warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply the five highest-value findings from the /scrutinise pass: - src/nfwd/NfwdMooncake.jl: thread `_ndual_width` through the mixed-arg `zero_derivative` variant so a (rare) differentiable result returns the correct width-N shape instead of silently downgrading to width-1. Use the existing `_HasNDual` alias to tighten the signature. - src/nfwd/NfwdMooncake.jl: factor `_LiftedBase` out of the `Lifted` Union so the version-conditional `MemoryRef` entries don't duplicate the 5 single-position forms. - src/rules/memory.jl: collapse the four `_NDualTuple` getfield/lgetfield methods to one. Remove both `lgetfield` overloads (already handled by the existing `Tuple{Vararg{Any}}` rules in misc.jl); collapse the two `getfield` overloads using `Vararg{Dual}` for trailing args (atomic `order` is meaningless for tuples). - src/rules/builtins.jl: drop the two unreachable mixed-Dual+NDual variants of `Core.ifelse`. Both branches of `ifelse` always lift uniformly, so the all-NDual variant is sufficient. Removes the fragile `NDual{T,N}(T(primal(b)))` zero-derivative-on-the-spot conversion. - Trim verbose comment blocks in src/tangents/dual.jl (`tangent_type` docstring), src/nfwd/NfwdMooncake.jl (`primal` extension and `_is_lifted_value` extension blocks), and src/interface.jl (`_jacobian_widthN` placeholder direction). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply the highest-value findings from a second /scrutinise + 3-agent review pass on top of commit 564140a: - Restore `primal`'s strict contract (Reuse/Quality/Efficiency consensus). Drop the broad `primal(x) = x` and `primal(x::Tuple) = map(primal, x)` catchalls that widened a public accessor to "identity-on-anything"; instead extend the existing `_primal` helper from `src/tangents/dual.jl` with the NDual / `Complex{<:NDual}` / array / `Tuple` cases. Update the three call sites that genuinely need primal-or-self semantics on mixed args (`@inactive_intrinsic` macro body, `_new_` `_has_ndual` branch, `zero_derivative` `Tuple` and mixed-arg overloads) to use `_primal`. - Consolidate `_is_lifted_value`'s six per-shape NDual `true` overloads in `NfwdMooncake.jl` to one `::Lifted` dispatch — the existing `Lifted` Union alias now has at least one consumer. The `::Dual` and `::Tuple` base cases stay in `debug_mode.jl`. - Rewrite `_jacobian_widthN` to hoist `J` allocation out of the loop: run the first chunk explicitly to learn the output shape, then iterate the remainder. Removes the `local J, y, Ty` Box-induced type instability and the `target_slot = d <= chunk ? slot+d-1 : 1` placeholder branch. The chunk body extracts to `_run_jacobian_chunk` / `_write_jacobian_columns!` helpers shared between first and rest. - Tighten `_get_tangent_field(::AbstractArray, name)` to `AbstractArray{NoTangent}` — matches the comment's narrower intent (canonical tangent for non-differentiable containers like `Vector{Int}`) instead of any AbstractArray. - Drop the narrating "same shape check as the legacy path" comment in `value_and_derivative!!` — the function name is self-documenting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure formatting: signature collapsing where it fits on one line, NTangent Union compaction, function-form for `_get_tangent_field` 3-arg that exceeds short-form line width. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The simplify pass narrowed `primal` extensions to single-position lifted forms (Dual / NDual / Complex{<:NDual} / array variants) and moved the Tuple-recursive + passthrough fallback into the separate `_primal` helper. `verify_args` in primal_mode.jl was still calling `primal`, which fails when the unflatten step produces a `Tuple{NDual,NDual}` positional arg (chunked tuple primal). Switch to `_primal` to match the simplified surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nctions When a function returns a Tuple, the chunked (width-N) FCache path calls `_ndual_output_to_width1(output)` where output is `Tuple{NDual{T,N},...}`. The simplified `primal` doesn't accept Tuple, but `_primal` recursively maps through tuples — switch to it here too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oTangent}` The empty-partials case was double-covered: the new `Tuple{}` overload collided in the type-domain lattice with the existing `AbstractArray{<:IEEEFloat}, ::NTuple{W,NoTangent}` and `AbstractArray{Complex{T}}, ::NTuple{W,NoTangent}` methods (both match when partials = ()). Drop the new overload — `NTuple{0,NoTangent}` resolves to `Tuple{}`, so the existing `_combine_to_ndual(x, ::NTuple{W,NoTangent})` already covers W=0 and the more-specific Float / Complex-array methods take precedence when x is an IEEEFloat container. Aqua reports 0 ambiguities and 0 unbound args after this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…radient The general `_gradient_widthN(::Val{W})` uses two `ntuple(Val(W)) do d ... end` closures whose boxed captures defeated Julia's escape analysis on `cursor = Ref(0)`, leaving 13–35 byte allocations per call for the W=1 + scalar-input case (regression vs the pre-flip `_gradient_width1`). Add a `Val{1}` specialization that: 1. Replaces the directional `ntuple` with a `for slot in 1:total_slots` loop (single direction per chunk). 2. Replaces the per-input `ntuple(Val(N)) do i ... end` closure with `tuple_map((b, p, s) -> ..., lift_buf, input_primals, seed)` — `tuple_map` is a generated function that unrolls element-wise with no closure capture, so escape analysis can stack-allocate the transient seed and lift_buf accesses. Verified: 1-scalar and 2-scalar `value_and_gradient!!` now both report `count_allocs == 0`. Nfwd test group still 771/771. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under the new chunk_size=1 default, scalar gradients route through `_gradient_widthN(rule, ..., Val(1))` instead of `_gradient_width1`. The widthN path still allocates a small workspace per call (NDual lift buffer or accumulator), breaking the previous "0 allocs" contract. Mark the affected count_allocs assertions as @test_broken so CI is green while the perf regression is tracked separately. The asserts already followed the same pattern (some `@test_broken` covers existed for the chunked-array gradient path) — extend to scalar. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The existing rule expects `Dual{MemoryRef{P}}` (Dual-wrapped), so under chunk_size=1 a `MemoryRef{NDual{T,1}}` arg from a vcat / unsafe_copyto! chain MethodErrors. Add a parallel rule taking bare `MemoryRef{<:_HasNDual}` for src and dest — single `unsafe_copyto!` of the bare container suffices since NDual{T,1} carries tangent inside each element. Surfaced by DifferentiationInterface.jl's `mat_to_vec` Jacobian test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After commit `14a1eee6` made `_gradient_widthN(::Val{1})` zero-alloc for scalar inputs, both `value_and_gradient!! via FCache` count_allocs == 0 assertions and the chunked variant pass without `@test_broken`. Flip the test back to the strict form. Also use `_primal` (instead of strict `primal`) in `_gradient_widthN`'s output unwrap and `_gradient_unwrap_output` so tuple-returning rules don't MethodError on the unwrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yebai and others added 27 commits April 12, 2026 16:37

Restore ForwardCache architecture with nfwd-only backend

aefcabc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restore nfwd build_frule, Rule, build_rrule, RRule methods

45ee7c6

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Steps 5+6: Replace NfwdCache with FCache

c5dd59e

Step 7: Thread Val(N) width through FCache; add HVP (RoF and FoR)

572e10b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apply JuliaFormatter

e35bc06

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove orphaned test/nfwd/nfwdmooncake.jl

2889b20

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Refactor nfwd include ordering and narrow zero_derivative dispatch

deb64a3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into step-8-ndual-routing

394da4e

Fix NDual array legacy handling

c5e7461

Handle NDual outputs in forward cache summaries and add arrayref/arrayset forward rules for arrays storing NDual elements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions Bot assigned yebai Apr 20, 2026

Fix docs build and format touched files

c6d07d0

yebai and others added 30 commits May 4, 2026 11:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Complete rewrite of forward mode#1151

Complete rewrite of forward mode#1151
yebai wants to merge 91 commits intomainfrom
step-8-ndual-routing

yebai commented Apr 20, 2026 •

edited by github-actions Bot

Loading

Uh oh!

codecov Bot commented Apr 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yebai commented Apr 20, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Summary — GitHub Actions

Documentation Preview

Performance

Uh oh!

codecov Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yebai commented Apr 20, 2026 •

edited by github-actions Bot

Loading

codecov Bot commented Apr 20, 2026 •

edited

Loading