feat(regression): CI coupling gate for test case baselines #727

Open
lewisjared wants to merge 3 commits into feat/regression-cli from feat/regression-coupling-gate

lewisjared commented Jun 15, 2026

Description

Adds the CI coupling gate for RFC 0005 regression baselines (PR-3), stacked on #724.

For each test case the gate decides — purely from what changed relative to the base branch — how CI should verify the baseline: skip, replay (cheap, anonymous, against cached native blobs), execute (full re-run when test_case_version is bumped), or fail (an unauthorised or unverifiable change). decide_coupling is a pure function; ref test-cases ci-gate maps a PR's changed-file list onto its inputs.
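
The four outcomes can be illustrated with a small pure function. This is a minimal sketch, not the signature of `decide_coupling`: the `CaseChange` fields and the `decide` name are hypothetical, and the precedence between checks (fail first, then execute, then replay) is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SKIP = "skip"
    REPLAY = "replay"
    EXECUTE = "execute"
    FAIL = "fail"


@dataclass(frozen=True)
class CaseChange:
    """Hypothetical summary of what changed for one test case."""

    version_bumped: bool = False   # test_case_version differs from the base branch
    committed_valid: bool = True   # committed bundle and inputs still verify
    relevant_change: bool = False  # native blobs or extraction code changed
    head_has_native: bool = False  # head manifest lists native blobs


def decide(change: CaseChange) -> Action:
    """Pure decision with no I/O, so every branch can be tested offline."""
    if not change.committed_valid:
        return Action.FAIL      # assumed to take precedence over everything else
    if change.version_bumped:
        return Action.EXECUTE   # full re-run
    if change.relevant_change and change.head_has_native:
        return Action.REPLAY    # only possible when there are blobs to replay
    return Action.SKIP
```

Because the function takes plain values and returns an enum member, the whole decision matrix can be enumerated in unit tests without a git checkout or network access.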

This branch also folds in the review-driven hardening:

  • Native is opt-in. REPLAY is selected only when the head manifest actually has native blobs (fork contributors can't mint, so an empty native set is a permanent valid state). A de-mint (native removed, committed unchanged) warns and SKIPs rather than failing.
  • Baselines are coupled to their inputs. New optional Manifest.catalog_hash; a catalog.yaml change without regenerating the baseline now FAILs the gate instead of silently passing. The .catalog_hash sidecar is retired in favour of the manifest as the single coupling record.
  • Fail-closed on the committed dimension: managed-manifest deletion, committed-bundle drift, and input-catalog drift all FAIL.
  • Comparator fixes: no longer conflates bool with int/float; non-zero default atol so values at zero aren't held to bit-exactness.
  • Robustness/cleanup: broaden extraction-change detection to the core surfaces behind build_execution_result; reject unknown native-store URL schemes; reject a bare . in safe_path; mark ci-gate read-only; cache the pooch manager; memoise per-provider source-root resolution.
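
The comparator fixes in the list above can be sketched as follows. This is an illustration under assumptions, not the code in `compare.py`: the `values_match` name, the `rtol` default, and the `1e-12` placeholder tolerance are all hypothetical.

```python
import math
from numbers import Real

DEFAULT_ATOL = 1e-12  # assumed placeholder; the real default may differ


def values_match(expected, actual, *, rtol=1e-9, atol=DEFAULT_ATOL):
    """Compare two scalar values without treating bools as numbers."""
    # bool is a subclass of int in Python, so it must be handled before
    # the numeric branch or True would compare equal to 1.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return type(expected) is type(actual) and expected == actual
    if isinstance(expected, Real) and isinstance(actual, Real):
        # With abs_tol=0 a purely relative check never passes near zero
        # unless the two values are bit-identical.
        return math.isclose(expected, actual, rel_tol=rtol, abs_tol=atol)
    return expected == actual
```

With a zero absolute tolerance, `0.0` and `1e-15` would be reported as different because a relative tolerance scales with the magnitude of the values; the non-zero `atol` is what lets them match.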

A background doc (docs/background/regression-baselines.md) explains the two-layer baseline model, the lifecycle verbs, and the gate, with lifecycle and gate-decision mermaid diagrams.

Note: the actual CI workflow YAML, the R2 write backend, and the <RECIPE_RUN> sanitisation for real ESMValTool are deferred to later PRs by design.

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

Add `ref test-cases ci-gate`, which decides how CI should verify each
regression test case against the base branch: replay the cached native
baseline, execute a full re-run when `test_case_version` is bumped, skip
unchanged cases, or fail an unauthorised baseline change.

The decision lives in the pure `climate_ref_core.regression.gate`
(`decide_coupling`, `Action`, `GateDecision`, `paths_under`) so the full
matrix is unit-testable offline; the CLI maps the git diff and on-disk
state onto its arguments. Extract `Manifest.loads` from `Manifest.load`
so the base-branch manifest parses from `git show` output.
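
The `load`/`loads` split might look roughly like the sketch below. The JSON serialisation, the `native` field name, and the `ManifestSketch` and `load_from_base` names are assumptions made for illustration; only `test_case_version` and the use of `git show` come from this PR.

```python
import json
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ManifestSketch:
    test_case_version: int
    native: dict = field(default_factory=dict)  # assumed shape: blob name -> digest

    @classmethod
    def loads(cls, text: str) -> "ManifestSketch":
        """Parse a manifest from text, wherever that text came from."""
        data = json.loads(text)  # assumes a JSON manifest
        return cls(
            test_case_version=data["test_case_version"],
            native=data.get("native", {}),
        )

    @classmethod
    def load(cls, path: Path) -> "ManifestSketch":
        """Read a manifest from disk by delegating to loads."""
        return cls.loads(Path(path).read_text(encoding="utf-8"))


def load_from_base(base_ref: str, repo_path: str) -> Optional[ManifestSketch]:
    """Parse the manifest as it exists on the base branch, if it exists there."""
    result = subprocess.run(
        ["git", "show", f"{base_ref}:{repo_path}"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None  # the file is absent on the base branch
    return ManifestSketch.loads(result.stdout)
```

Keeping the parsing in `loads` means the base-branch version never has to be written to a temporary file before it can be compared with the head manifest.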

The gate fails closed: deleting a managed manifest, drifting the committed
bundle from its manifest digests, or re-minting native blobs without a
version bump are all caught rather than silently skipped.
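
As an illustration of the committed-bundle drift check, the sketch below recomputes file digests and compares them with the values recorded in a manifest. The function names, the choice of SHA-256, and the `relative path -> digest` mapping are assumptions, not the project's actual layout.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def committed_bundle_ok(bundle_dir: Path, digests: dict) -> bool:
    """Return False if any recorded file is missing or its content has drifted."""
    for rel_path, expected in digests.items():
        candidate = bundle_dir / rel_path
        # A missing file counts as drift: the check fails closed.
        if not candidate.is_file() or sha256_of(candidate) != expected:
            return False
    return True
```

A missing file is treated the same as a changed one, which matches the fail-closed behaviour described above: anything the gate cannot verify is a failure rather than a skip.
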

Address review findings on the CI coupling gate, resolving five design
decisions reached during review:

- Gate emits REPLAY only when the head manifest has native blobs to replay
  (seeding / native-changed / extraction-changed). An empty native set is a
  permanent valid state, so a de-mint (native removed, committed unchanged)
  warns and SKIPs rather than failing. The native axis is documented as not
  fail-closed.
- Couple each baseline to its inputs: new optional `Manifest.catalog_hash`,
  checked by the gate via `catalog_integrity_ok`. A `catalog.yaml` change
  without regenerating the baseline now FAILs instead of silently skipping.
  The `.catalog_hash` sidecar is retired in favour of the manifest, which is
  the single coupling record; run/mint populate it via a shared helper.
- Comparator no longer conflates `bool` with int/float, and the default
  absolute tolerance is a small non-zero placeholder so values at zero are not
  held to bit-exactness.
- Broaden extraction-change detection to the core surfaces behind
  build_execution_result (pycmec, output_files, diagnostics), reject unknown
  native-store URL schemes instead of coercing them to local paths, reject a
  bare '.' in safe_path, add ci-gate to the read-only command set, cache the
  pooch manager, and memoise per-provider source-root resolution in the gate.
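
The two input-validation changes in the last item can be sketched as below. These helpers are hypothetical stand-ins, not the project's `safe_path` or store code, and the scheme allow-list is an assumption chosen for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed allow-list; "" covers plain local paths with no scheme.
ALLOWED_SCHEMES = {"", "file", "https", "s3"}


def check_store_url(url: str) -> str:
    """Reject unknown schemes instead of silently treating them as local paths."""
    scheme = urlparse(url).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported native-store URL scheme: {scheme!r}")
    return url


def check_relative_path(raw: str) -> PurePosixPath:
    """Accept only a relative path that names something inside the root."""
    path = PurePosixPath(raw)
    # str(path) is "." for "", "." and "./", none of which name a file.
    if str(path) == "." or path.is_absolute() or ".." in path.parts:
        raise ValueError(f"unsafe path: {raw!r}")
    return path
```

Rejecting a bare `.` matters because it resolves to the root directory itself, so it passes a naive "stays inside the root" check while not identifying any file.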

Adds a background doc (regression baselines + CI coupling gate) with lifecycle
and gate-decision mermaid diagrams, and unit coverage for every new branch.

codecov Bot commented Jun 15, 2026

Codecov Report

❌ Patch coverage is 94.49541% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ages/climate-ref/src/climate_ref/cli/test_cases.py | 90.16% | 9 Missing and 3 partials ⚠️ |

| Flag | Coverage Δ | |
|---|---|---|
| core | 92.57% <94.49%> (+0.08%) | ⬆️ |
| providers | 91.80% <ø> (ø) | |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...ges/climate-ref-core/src/climate_ref_core/paths.py | 100.00% <100.00%> (ø) | |
| ...f-core/src/climate_ref_core/regression/__init__.py | 100.00% <100.00%> (ø) | |
| ...ef-core/src/climate_ref_core/regression/compare.py | 97.36% <100.00%> (+0.03%) | ⬆️ |
| ...e-ref-core/src/climate_ref_core/regression/gate.py | 100.00% <100.00%> (ø) | |
| ...f-core/src/climate_ref_core/regression/manifest.py | 100.00% <100.00%> (ø) | |
| ...-ref-core/src/climate_ref_core/regression/store.py | 97.56% <100.00%> (+0.15%) | ⬆️ |
| ...s/climate-ref-core/src/climate_ref_core/testing.py | 89.83% <100.00%> (+0.36%) | ⬆️ |
| ...ckages/climate-ref/src/climate_ref/cli/__init__.py | 96.96% <ø> (ø) | |
| ...ages/climate-ref/src/climate_ref/cli/test_cases.py | 82.58% <90.16%> (+2.12%) | ⬆️ |