Skip to content

fix[next]: extend fingerprinting to cover all relevant cache state via structural hashing#2648

Open
egparedes wants to merge 56 commits into
GridTools:mainfrom
egparedes:refactor/fingerprinting-tree-cata
Open

fix[next]: extend fingerprinting to cover all relevant cache state via structural hashing#2648
egparedes wants to merge 56 commits into
GridTools:mainfrom
egparedes:refactor/fingerprinting-tree-cata

Conversation

@egparedes

@egparedes egparedes commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Generalizes and unifies the fingerprinting / cache-key infrastructure used by the gt4py.next workflow caches, so cache keys are deterministic across interpreter runs and get invalidated when any relevant state changes — the cached step's own configuration and the gt4py version, not just the step input.

Supersedes #2609, keeping its caching policy (cache_key = H(BUILD_CACHE_VERSION_ID, step, key_function(input)), default-include with explicit field opt-out) while replacing the pickle-based mechanism with structural Merkle-style hashing, addressing the issues raised in review (RecursionError on realistic IR depths, local-enum regression, non-orderable dict keys, OrderedDict false cache hits, identity-sensitivity, conflated pickling contracts). It incorporates @havogt's proposal egparedes#8 by merge, preserving authorship.

The design and rationale are documented in the new ADR docs/development/ADRs/next/0023-Fingerprinting.md.

Key changes

  • Structural fingerprinter as a layered catamorphism (gt4py.next.utils), separating the traversal scheme from the reduction logic:
    • Deconstruction: per-type deconstructors peel one level into EmptyDeconstruction / Deconstruction, dispatched on the MRO (eve.utils.singledispatcher); defaults cover class-tagged primitives, ordered tuples/lists, digest-sorted dicts/sets (unsortable keys just work), enums by member content, types/functions/modules by identity-verified qualified name, dataclasses/datamodels by fields honoring gt4py_metadata(fingerprint=False), and a __reduce_ex__ fallback that honors copyreg.dispatch_table reducers (e.g. NumPy ufuncs).
    • Traversal: catabolize, a generic iterative post-order fold with identity-based memoization, cycle back-references (allow_cycles) and canonical ordering of order-insensitive pieces — no recursion limit (a depth-5000 ITIR chain fingerprints in ~120 ms; the pickle-based mechanism failed at depth ~74), and fingerprints are a pure function of value, not object identity.
    • Reduction: a single fingerprint collapser (the catamorphism's algebra) hashing deconstructions uniformly with xxh64 and domain separation; fingerprinters are catabolize partially applied to a deconstructor and this collapser.
  • Cache keys: CachedStep renames hash_functionkey_function, excludes key_function/cache from its own fingerprint, and keys on stable_fingerprinter((BUILD_CACHE_VERSION_ID, self, key_function(inp))). New config.BUILD_CACHE_VERSION_ID (defaults to the gt4py version, env-overridable) salts the cache. Runners (gtfn, dace, ...) adopt stable_fingerprinter as their key_function.
  • Fingerprinting no longer interferes with real pickling: the MetadataBasedPicklingMixin / custom-pickler machinery is removed; classes keep standard serialization. The ffront/ITIR semantic_fingerprinters are deconstructor compositions (DSL definition functions fingerprinted by source + closure variables).
  • Docs / tests: ADR 0023 (pickle- and optree-based designs recorded under alternatives considered); WorkflowPatterns.md updated; unit tests for the catabolize driver, the fingerprint deconstructors, and CachedStep.

Testing

  • tests/next_tests/unit_tests + tests/eve_tests: 1879 + 362 passed (only pre-existing GPU/CuPy environment parametrizations fail on a machine without CuPy).
  • tests/next_tests/integration_tests -k roundtrip: green, including test_constant_closure_vars_with_enums (local enums in closure variables, which fails with the pickle-based mechanism) and the cbrt/test_roots ufunc cases.
  • The depth-80 ITIR RecursionError repro from the fix[next]: extend fingerprinting mechanism to consider all relevant state in caches #2609 review passes; cross-process determinism verified.
  • pre-commit (ruff, mypy, tach) green on all changed files.

egparedes and others added 30 commits May 27, 2026 15:25
…g in fingerprinters

- Add `fingerprint` (alias for `sorting_sets_fingerprinter`) and
  `versioned_fingerprint` (includes BUILD_CACHE_VERSION_ID) to `utils.py`
- Fix `skipping_fields_node_fingerprinter` to pass the reducer dict as a
  positional arg (not keyword) to `CustomPicklingFingerprinter.from_reducers`
- Fix `custom_overriden_pickler` in `eve/utils.py` to use `pickle._Pickler`
  (pure Python) instead of the C-extension `pickle.Pickler` so that
  `reducer_override` is called for built-in types like `dict` and `set`
  (the C-extension fast path bypasses `reducer_override` for built-in types)
- Update `test_cached_with_hashing` to use a module-level function instead
  of a lambda (lambdas can't be pickled, now that `cache_key` fingerprints `self`)
- Add tests for `fingerprint`, `versioned_fingerprint`, and `cache_key` behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on stage classes

Re-add the `fingerprinter` module-level alias (= `semantic_fingerprinter`) and
the `fingerprint` computed property on all four stage dataclasses
(`DSLFieldOperatorDef`, `FOASTOperatorDef`, `DSLProgramDef`, `PASTProgramDef`).
These were removed when the old `FingerprintedABC`/`FingerprintedMixin` system
was dropped in the 'More refactoring' commit but are still needed by existing
tests and callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-add the `fingerprint()` method to the `ir.Node` base class that was
removed along with the `FingerprintedABC`/`FingerprintedMixin` system.

The method is needed by:
- `ffront.lowering_utils` (uses `itir.Expr.fingerprint()` to generate
  unique variable names)
- `ffront.foast_to_gtir` (uses `itir.Expr.fingerprint()` to generate
  unique SymRef names for conditionals)
- `iterator_tests/test_ir.py` (tests that fingerprinting ignores location/type)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NumPy ufuncs (e.g. np.cbrt in DSL closure variables) only pickle through
their copyreg.dispatch_table reducer; calling __reduce_ex__ directly
raises. Consult the dispatch table first, like pickle.Pickler does.
egparedes added 15 commits June 10, 2026 19:15
Propagate the rename through docstrings, the tree_cata doctest, tests
and ADR 0023; cache _class_tag with functools.cache.
atom_reducer/composite_reducer/cycle_reducer (formerly leaf_alg/node_alg/
cycle_alg), with matching parameter and prose updates in tests and ADR.
Extractor (formerly ObjectDecomposer), CompositeContent (formerly
ObjectDecomposition), AtomicContent (formerly DecompositionAtom),
AtomicReducer (formerly AtomReducer), with matching extract/atomic_reducer
parameter names and prose updates in docstrings, tests and ADR 0023.
make_fingerprinter() now takes an Extractor instead of a mapping of
per-type overrides; the new make_extractor(overrides) builds extractors
from the default rules, and the module-level 'extract' is the default
Extractor. skipping_fields_node_fingerprinter returns extractor
overrides (return_extractors=) for composition via make_extractor.
Deconstructor (formerly Extractor), Deconstruction (formerly
CompositeContent), EmptyDeconstruction (formerly AtomicContent),
Collapser (formerly AtomicReducer), CompositeCollapser (formerly
CompositeReducer), with matching deconstruct/collapser parameter and
factory names (make_deconstructor, return_deconstructors, ...).

Also reconcile with the reshaped driver API: cycles are now collapsed
by the regular collapser as back-reference EmptyDeconstructions when
allow_cycles=True (the CycleCollapser alias and the fingerprint cycle
collapser are gone); fingerprint digests are unchanged.
Deconstruction becomes the base Collection of deconstructed pieces with
EmptyDeconstruction (terminal, no pieces) and
OrderInsensitiveDeconstruction (pieces collapse in canonical order)
subclasses, replacing the 'ordered' flag; builders (from_pieces,
from_typed_value, from_reference) are the construction API and all call
sites, tests, docstrings and ADR 0023 adopt the pieces vocabulary.
Fingerprint digests are unchanged.
- reduce_object takes a single 'collapser' (renamed param 'deconstructor'):
  composites are collapsed by re-wrapping the already-collapsed piece
  results via dataclasses.replace, and the driver canonicalizes the order
  of OrderInsensitiveDeconstruction results itself.
- The fingerprint digest scheme is unified accordingly
  (xxh64('node' + state + 'pieces' + digests) for empty and composite
  alike), intentionally changing all fingerprint values (no migration
  needed, stale cache keys are simply never hit again).
- make_fingerprinter is replaced by functools.partial composition;
  make_deconstructor gains a 'fallback' parameter and fingerprinters use
  fingerprint_fallback, which layers the gt4py_metadata(fingerprint=False)
  opt-out over the default dataclass/datamodel field deconstruction.
  Dispatching dataclasses/datamodels through virtual ABC registry keys
  (xtyping.DataclassABC / new datamodels.DataModelABC) was attempted but
  breaks stdlib singledispatch MRO composition for eve's generic
  datamodel classes (RuntimeError: Inconsistent hierarchy), so they are
  handled in the fallbacks instead.
- New eve datamodels.DataModelABC (virtual ABC via subclass hook) with
  unit test.
Also fix the import-order NameError reintroduced by moving the public
deconstructor builders above their implementation dependencies.

@havogt havogt left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated code review (Claude, high effort: 7 finder angles, per-finding adversarial verification with reproduction scripts). 10 findings, ranked most severe first as inline comments.

The two most consequential findings pull in opposite directions and suggest the location-handling changed by accident, not design: the ffront-stage cache keys became location-insensitive (cross-file cache aliasing) while the ITIR-level persistent cache keys became location-sensitive (permanent cache misses on unrelated edits). upstream/main had exactly the opposite arrangement in both places.

🤖 Generated with Claude Code

lambda o: workflow.CachedStep(
o.bare_translation,
hash_function=stages.fingerprint_compilable_program,
key_function=utils.stable_fingerprinter,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1/10] Persistent translation cache key now includes location and type node fields

This CachedStep (backed by a persistent FileCache) switched from the location/type-skipping fingerprint_compilable_program to plain utils.stable_fingerprinter, which hashes every ITIR node's SourceLocation (absolute file path, line, column) and lazily-inferred type field. Same issue in dace/workflow/factory.py:50.

Verified empirically: stable_fingerprinter(SymRef(id='x', location=SourceLocation(1, 1, '/path/a.py'))) differs from the same node with a different path, while the old key was equal.

Impact: with GT4PY_BUILD_CACHE_LIFETIME=persistent (e.g. icon4py), inserting a blank line above a stencil, moving the checkout, or sharing the cache dir across paths changes every key → guaranteed miss, full C++/SDFG recompilation, unbounded cache growth. The skipping_fields_node_fingerprinter('location', 'type') deconstructors already exist (via return_deconstructors=True) and could be composed into this key function.

#: Fingerprinter for the frontend stages: skips source locations on AST nodes
#: and fingerprints DSL definition functions by their source code and closure
#: variables (instead of by qualified name).
semantic_fingerprinter: utils.Fingerprinter = functools.partial(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[2/10] FOAST/PAST cache keys became location-insensitive → cached lowering shared across files with wrong SourceLocations

semantic_fingerprinter skips location on all FOAST/PAST nodes. This inverts the invariant the deleted test_stages.py tests asserted (test_fingerprint_stage_foast_op_def / test_fingerprint_stage_past_def: same code at different locations must fingerprint differently).

Verified empirically: two field operators with byte-identical source in mod_one.py and mod_two.py produce equal FOAST-stage fingerprints. The process-wide foast_to_gtir/past_to_itir CachedSteps then return the first operator's cached lowering — with mod_one.py SourceLocations baked in by PreserveLocationVisitor — for the second operator, so its error messages, warnings, and dace debug info point at the wrong file/line.

Comment thread src/gt4py/next/utils.py
return target is obj


def _reference_by_fully_qualified_name(obj: Any) -> str:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[3/10] Regression: locally-defined FrozenNamespace subclass instances in closure vars now crash at decoration

The old fingerprint path (add_content_to_fingerprint) had a never-failing str(obj) fallback. Now _reference_by_fully_qualified_name raises TypeError for locally-defined classes reachable via __reduce_ex__.

Verified empirically against both venvs: a field operator whose closure vars hold a locally-defined eve.utils.FrozenNamespace subclass instance — a frontend-valid ConstantPythonNamespaceObject (type_translation.py:262) and the icon4py constants idiom — crashes at @field_operator decoration with TypeError: Objects which are not importable under their qualified name ('__main__.....<locals>.Consts') ... cannot be safely referenced, while upstream/main decorates and compiles fine. Module-level namespaces, enums, and scalars are unaffected — narrow, but a hard crash before compilation.

)
device_type = core_defs.DeviceType.CPU
hash_function = stages.compilation_hash
key_function = utils.stable_fingerprinter

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[4/10] Executor cache key content-hashes full neighbor tables (D2H copy for CuPy)

The replaced stages.compilation_hash hashed offset_provider by identity via common.hash_offset_provider_items_by_id (~1.2 µs, O(1) in mesh size). stable_fingerprinter has no ndarray deconstructor, so connectivity arrays inside CompileTimeArgs.offset_provider fall through to __reduce_ex__(2) = full data bytes: measured ~2.9 ms per 1M-entry int32 table (~2400× slower). For CuPy arrays __reduce__ calls .get(), i.e. a synchronous device-to-host copy. Same in dace/workflow/backend.py:52.

Impact: once per program variant on the main CompiledProgramsPool path (compile/startup cost), but per-invocation on the iterator/runtime.py:92 fendef dispatcher and FieldOperatorFromFoast.__call__ (decorator.py:715) paths — every call pays a multi-MB mesh re-hash plus a D2H transfer on GPU.

Comment thread src/gt4py/next/config.py Outdated
#: developers or advanced users when making changes requiring forced
#: compatibility with previously cached builds.
BUILD_CACHE_VERSION_ID: str = (
os.environ.get("BUILD_CACHE_VERSION_ID") or _get_build_cache_version_id()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[5/10] Env var missing the GT4PY_ prefix

Every other env var in this file uses the GT4PY_ prefix (GT4PY_BUILD_CACHE_DIR, GT4PY_BUILD_CACHE_LIFETIME, GT4PY_BUILD_JOBS, ...); this one reads bare BUILD_CACHE_VERSION_ID.

A user following the convention sets GT4PY_BUILD_CACHE_VERSION_ID and is silently ignored; conversely an unrelated CI/HPC tool exporting the generic name BUILD_CACHE_VERSION_ID silently pins or perturbs every gt4py cache key — including reusing stale binaries across gt4py upgrades, exactly what this salt exists to prevent. (The ADR documents the unprefixed name too and would need updating.)

__datamodel_params__: ClassVar[utils.FrozenNamespace[Any]]

@classmethod
def __subclasshook__(cls, subclass: type) -> bool:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[6/10] __subclasshook__ returns False instead of NotImplemented, breaking the ABC hook contract

The hook ignores cls and never returns NotImplemented. Verified empirically:

  • for class MyBase(DataModelABC), issubclass(AnyUnrelatedDatamodel, MyBase) returns True (hook inherited, cls ignored);
  • a genuine direct subclass class X(DataModelABC) fails issubclass(X, DataModelABC) because returning False suppresses the normal MRO fallback that NotImplemented would allow.

Correct pattern:

if cls is DataModelABC:
    return is_datamodel(subclass)
return NotImplemented

(Note: DataModelABC currently has no production callers — see the dead-infrastructure comment on eve/utils.py.)

Comment thread src/gt4py/next/utils.py
bytearray: lambda obj: EmptyDeconstruction.from_typed_value(type(obj), bytes(obj)),
tuple: lambda obj: Deconstruction.from_pieces(*obj, state=_class_obj_tag(obj)),
list: lambda obj: Deconstruction.from_pieces(*obj, state=_class_obj_tag(obj)),
dict: lambda obj: Deconstruction.from_pieces(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[7/10] dict/list subclass instance state silently dropped by MRO dispatch

singledispatch walks the MRO, so this dict entry also captures all dict subclasses, hashing only the class tag + items() and bypassing the __reduce_ex__ fallback that would capture extra instance state.

Verified empirically: two instances of a dict subclass with equal items but different extra-attribute values produce identical fingerprints (same for list subclasses). Reachable via closure_vars: dict[str, Any] in ffront stages or a user-supplied offset_provider mapping → false cache hit returns an artifact compiled for the other object.

step: Workflow[StartT, EndT]
hash_function: Callable[[StartT], HashT] = dataclasses.field(default=hash) # type: ignore[assignment]
cache: OpaqueMutableMapping[HashT, EndT] = dataclasses.field(repr=False, default_factory=dict)
key_function: Callable[[StartT], HashT] = dataclasses.field(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[8/10] Breaking CachedStep API change with no deprecation path

hash_function (default=hash) was renamed to key_function and is now required, and cache_key fingerprints self, so steps containing lambdas/closures raise at first call (a test removed by this PR explicitly used step=lambda inp: [*inp, 1]).

Verified empirically:

  • CachedStep(step=tuple)TypeError: missing ... 'key_function'
  • CachedStep(step=tuple, hash_function=str) → unexpected keyword
  • CachedStep(step=lambda inp: [*inp, 1], key_function=str)([1])TypeError: not importable under their qualified name

The new docstring and WorkflowPatterns.md document the restriction, so this is clearly intentional — but downstream code (icon4py-style custom workflows, the pre-PR WorkflowPatterns.md recipe) breaks at construction or first call with no deprecation period.

Comment thread src/gt4py/eve/utils.py
return cast(xtyping.SingleDispatchCallable[P, T], result)


def merge_dispatchers(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[9/10] Dead-on-arrival eve infrastructure (3 items)

Verified by grep — each is referenced only by its own new unit tests, with zero production callers:

  • merge_dispatchers (here, ~30 lines + 3 tests) — ffront/stages.py:59 composes deconstructors via plain dict merging into make_deconstructor({**foast.semantic_fingerprint_deconstructors, ...}) instead;
  • DataModelABC (datamodels/core.py:104) — next/utils.py deliberately avoids it with an explicit comment that virtual-ABC dispatch breaks singledispatch, and calls datamodels.is_datamodel directly;
  • @runtime_checkable on SingleDispatchCallable (extended_typing.py:228) — no isinstance/issubclass check against the protocol exists anywhere; the pre-existing is_single_dispatch_callable typeguard uses getattr duck-typing.

Same theme, smaller: the empty BaseStage marker class (ffront/stages.py:37) and the fingerprinter = semantic_fingerprinter alias (stages.py:72) are referenced nowhere / tests-only.

Comment thread src/gt4py/next/utils.py Outdated
_DECONSTRUCT_PICKLE_REDUCE_PROTOCOL: Final[int] = 2


def _dataclass_fields(cls: type) -> Optional[tuple[Any, ...]]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[10/10] Avoidable per-node costs on the fingerprinting path

Measured 8.2 ms per stable_fingerprinter call on a 2404-node ITIR tree (~3.4 µs/node), re-paid several times across the chained CachedSteps per program even on warm-disk-cache startup — noticeable for sessions compiling many programs (icon4py). Profile attribution:

  • _dataclass_fields recomputes dataclasses.fields() / datamodels.get_fields() per object instead of caching per type, and fingerprint_fallback re-filters field metadata per instance (~27% — the old code cached this with @functools.cache on _get_metadata_based_state_getstate);
  • digests round-trip through hexdigest() strings re-encoded with .encode('ascii') per edge (~13%; raw digest() bytes sort fine for the order-insensitive case, only the root needs hex);
  • catabolize uses dataclasses.replace per internal node (~8%; direct construction skips the field-introspection machinery);
  • CachedStep.cache_key re-fingerprints the constant frozen self on every call instead of once per instance (catabolize's memo dict is call-local).

Address review findings on the structural-fingerprinting refactor:

- Restore dedicated cache key functions (compilation_hash,
  fingerprint_compilable_program) for the gtfn/dace executor and
  persistent translation caches, reimplemented on the new machinery:
  location/type-agnostic program fingerprint, order-sensitive by-id
  offset providers for the in-memory executor (fixes silent wrong
  results from reordered offset_provider dicts and avoids content-hashing
  connectivity tables on every lookup), and location-stable persistent
  keys.
- Fingerprint DSL definition functions by source code only (drop
  filename/line/column) so textually identical operators match.
- Fix DataModelABC.__subclasshook__ to defer non-base checks via
  NotImplemented.
- Rename env var to GT4PY_BUILD_CACHE_VERSION_ID.
- Cache CachedStep self-fingerprint instead of re-walking the step graph
  per lookup.
- Unify the three drifted fields-deconstruction sites.
- Fix WorkflowPatterns.md cached-step example.
…/subclass fingerprinting, perf

Round 2, addressing the automated PR review findings:

- [2/10] Restore location-SENSITIVE frontend-stage cache keys. The PR made
  FOAST/PAST stage keys and DSL-definition keys location-insensitive, so two
  textually identical operators in different files aliased to one cached
  lowering, baking the first's SourceLocations into the second (wrong error
  locations). FOAST/PAST node fingerprints now include 'location'; the DSL
  definition function is fingerprinted by its full SourceDefinition again.
  Location-agnostic fingerprinting remains for the ITIR persistent cache and
  var-name generation. Adds a location-sensitivity regression test.
- [3/10] Fix hard crash at @field_operator decoration when closure vars hold a
  locally-defined FrozenNamespace subclass (icon4py constants idiom): register
  a content-based deconstructor for eve.utils.Namespace instead of falling
  through to __reduce_ex__, which referenced the non-importable local class.
- [7/10] Capture builtin-container *subclass* instance __dict__ state so a dict/
  list/set subclass with extra attributes no longer collides with a plain
  instance (false cache hit). Exact builtins are unaffected.
- [10/10] Cheap fingerprinting-hot-path wins: cache _dataclass_fields per type;
  construct the recombined Deconstruction directly instead of dataclasses.replace.

[1/10],[4/10],[5/10],[6/10] were already addressed in the previous commit.
@egparedes

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough automated pass — it caught a real location-handling inversion I'd half-mirrored. Addressed in two commits (48027da and c7f9b1d). Disposition below.

Fixed

  • [1/10] persistent ITIR key includes location/type — the gtfn/dace translation FileCache now keys on a restored stages.fingerprint_compilable_program built on itir.semantic_fingerprinter (skips location/type), so unrelated edits / checkout moves no longer bust the persistent cache.
  • [2/10] ffront keys location-insensitive → cross-file aliasing — confirmed against main's deleted test_stages.py (samecode_fieldop != fieldop). FOAST/PAST node fingerprints now include location again, and the DSL definition is fingerprinted by its full SourceDefinition. Location-agnostic fingerprinting stays where it belongs (ITIR persistent cache + the __val_/__cond_ var-name generation, which use itir.semantic_fingerprinter). Added a test_fingerprinter_is_location_sensitive regression test.
  • [3/10] local FrozenNamespace subclass crashes at decoration — registered a content-based deconstructor for eve.utils.Namespace, so constant namespaces in closure vars (the icon4py idiom) are fingerprinted by their items instead of falling through to __reduce_ex__ and referencing the non-importable local class.
  • [4/10] executor content-hashes neighbor tables — restored stages.compilation_hash for the in-memory executor key: location-agnostic program fingerprint + offset providers by identity/order via hash_offset_provider_items_by_id (O(1), no D2H copy).
  • [5/10] env var prefix — now GT4PY_BUILD_CACHE_VERSION_ID.
  • [6/10] __subclasshook__ — now if cls is DataModelABC and is_datamodel(subclass): return True; return NotImplemented (genuine subclasses and derived-ABC checks defer to the MRO).
  • [7/10] container-subclass instance state droppeddict/list/set/… subclasses now contribute their instance __dict__ to the fingerprint; exact builtins are unchanged.
  • [10/10] per-node overhead — applied the cheap, safe wins: _dataclass_fields is now cached per type, and catabolize's COMBINE constructs the recombined Deconstruction directly instead of via dataclasses.replace. (Left the hex/encode('ascii') digest round-trip alone for now — it changes the collapser format and the order-insensitive sort, so it felt out of scope for a fix-up.)

Each fix has a unit test; all affected suites are green, mypy/pre-commit clean, and an end-to-end run_gtfn_cached compile + cache-hit checks out.

Left as-is (flagging for your call)

  • [8/10] breaking CachedStep API — this is the intent of the PR (fingerprinting the step is the new "step state in caches" behavior), so I didn't add a deprecation shim. I did fix the doc recipe and cache the per-instance step fingerprint so it isn't re-walked on every call. Happy to add a transitional hash_function alias if you want a deprecation window.
  • [9/10] dead infrastructuremerge_dispatchers was deliberately restored earlier in the branch, and DataModelABC is new public API (now with a correct hook), so removing them felt like your decision rather than a review fix-up. Easy to drop them + @runtime_checkable/BaseStage/the fingerprinter alias if you'd prefer.

🤖 Generated with Claude Code

…fn cache-key crash)

The frontend-stage cache keys fingerprint program-likes (FieldOperator,
Program, ...) when they appear in another program's closure variables. The
PR removed the dedicated FieldOperator/Program fingerprint registrations,
so the whole 'backend' object graph is now traversed. That graph can hold
non-importable objects — e.g. the 'unittest.mock.Mock' backend that
test_compiled_program swaps in, whose dynamically-created 'Mock' subclass
is rejected by the by-qualified-name reference deconstructor — crashing
fingerprinting with TypeError (test_compiled_program gtfn variants, internal
nomesh CI jobs).

Exclude 'backend' from the fingerprint via gt4py_metadata(fingerprint=False):
it does not affect the lowering (which is what these stage caches key), the
backend-specific compilation is keyed separately in the backend's own caches,
and traversing the whole backend graph per cache lookup was wasteful and
fragile. Adds a regression test.
@egparedes egparedes requested review from edopao and havogt June 11, 2026 21:51
Comment on lines +125 to +129
def __getstate__(self) -> dict[str, Any]:
# Serialize only the dataclass fields, excluding cached properties
# stored in `__dict__` (which may not be picklable).
return {f.name: getattr(self, f.name) for f in dataclasses.fields(self)}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was introduced as a fix by claude. Is it needed @egparedes

Comment thread src/gt4py/next/utils.py

from __future__ import annotations

import collections

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the fingerprinting related utils deserve their own module?

lambda o: workflow.CachedStep(
o.bare_translation,
hash_function=stages.fingerprint_compilable_program,
key_function=stages.fingerprint_compilable_program,

@edopao edopao Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current FileCache key does not include the stage fingerprint. I think it should be added, question is whether to do it now or in a separate PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment about this aspect:

[1/10] persistent ITIR key includes location/type — the gtfn/dace translation FileCache now keys on a restored stages.fingerprint_compilable_program built on itir.semantic_fingerprinter (skips location/type), so unrelated edits / checkout moves no longer bust the persistent cache.

In my opinion, the FileCache key should include the type: the translated program should not clash in disk cache for different types.

Comment on lines +32 to +34
- The OTF build cache (`CachedStep`) keys compiled artifacts by a fingerprint
of the workflow step **and** its input, so a cached compiled program can be
reused only when both are unchanged.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text does not match the key of the disk cache in translation stage (see my other comments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants