From 27cd155a5e49594a3837abf01f1bc818ab879aae Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Thu, 4 Jun 2026 15:54:01 +1000 Subject: [PATCH 1/8] rfc: regression baselines (golden-in-git + native fixtures) --- text/0000-regression-baselines.md | 246 ++++++++++++++++++++++++++++++ 1 file changed, 246 insertions(+) create mode 100644 text/0000-regression-baselines.md diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md new file mode 100644 index 0000000..5168bf4 --- /dev/null +++ b/text/0000-regression-baselines.md @@ -0,0 +1,246 @@ +- Feature Name: `regression_baselines` +- Start Date: 2026-06-04 +- RFC PR: [CMIP-REF/rfcs#0000](https://github.com/CMIP-REF/rfcs/pull/0000) + +# Summary +[summary]: #summary + +Manage diagnostic regression baselines as **two layers**: + +1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles) committed + directly to the source repo — this is the test *gate* and the in-PR *diff signal*. +2. The large, binary **native diagnostic outputs** stored outside git in an object store and + referenced from a single git-committed **`native_manifest.json`** (sha256 digests of the golden, + the native files, and the extraction inputs). + +The REF CLI (`ref test-cases`) fetches, runs, and mints baselines. The native outputs double as the +**input fixtures that test the native→bundle extraction step**, so most PRs are verified fast on a +stock runner without re-running the (expensive) diagnostic. + +The **object-store backend is left as an explicit open decision** (see Unresolved questions) — this +RFC specifies the *layering, manifest, CLI, and CI*, which are backend-agnostic behind a small +`NativeStore` interface. + +# Motivation +[motivation]: #motivation + +Diagnostic regression testing today is flaky, opaque, and bloats the repo: + +- **Bloat.** `tests/.../regression` is ~1.0 GB on disk. Of that, **696 MB is input `climate_data` + that the capture step wrongly copies in**; the real curated output is ~316 MB (esmvaltool 5 MB, + ilamb 61 MB, pmp 250 MB). Some native is already hand-`.gitignore`d, so it is *missing* — no + coverage, no parity. The bloat is a capture bug plus committing binaries, not an inherent need. +- **No clear "this-changed" signal.** Failures are a mix; you dig CI logs across many jobs. The + REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR and is + exactly the content that feeds the REF API — it is what reviewers most need to see. +- **The extraction regression never runs on a PR.** It runs only post-merge/nightly today, so a PR + that breaks extraction goes green, then red after merge. +- **Most diagnostic authors are external scientists.** Minimal setup, no new credentials. A + contributor should need only the REF. +- **Security.** The integration runner shares a writable cache; untrusted fork code must never reach + it or any write credential. + +A key reframe: **the native outputs are not review-only artifacts — they are the input fixtures for +the extraction step.** `Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) then +`build_execution_result()` (reads native → CMEC bundle + series). `RegressionValidator` already +replays *committed native* through `build_execution_result` without re-running the diagnostic. That +makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +## Two layers, one manifest + +For each `(provider, diagnostic, test-case)`: + +``` +packages//tests/test-data/// + catalog.yaml # committed: dataset metadata (exists today, content-hashed) + catalog.paths.yaml # gitignored: per-user local paths (exists today) + regression/ + series.json # committed GOLDEN (the gate + diff signal) + diagnostic.json # committed GOLDEN (CMEC metric bundle) + output.json # committed GOLDEN (CMEC output bundle) + native_manifest.json # committed: source of truth binding golden <-> native +``` + +The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git. + +`native_manifest.json`: + +```json +{ "schema": 1, + "golden": { "series.json": "", "diagnostic.json": "", "output.json": "" }, + "native": { "": { "sha256": "...", "size": ... } }, + "extraction_inputs": { "files_series_digest": "", "input_selectors_digest": "" } } +``` + +- `golden` binds the committed golden bytes to the manifest: CI recomputes the golden digests and + asserts they equal `manifest.golden`. A commit that updates `series.json` but not the manifest (or + vice-versa) **fails loudly** — closing the one consistency leg content-addressing does not cover. +- `native` lists the curated native blobs by content digest (immutable, dedups, survives history). +- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog + `input_selectors` — the real extraction inputs the current catalog hash ignores — so a change to + extraction behaviour bumps the manifest *at PR time*. + +Two manifest sections keep the diff readable: a changed `native` block means native changed; a +changed `golden` block means extraction output changed. + +## Curated capture (kills the bloat at the source) + +The diagnostic already declares which native files matter via `files` + `series` glob patterns; +production publishing copies only those (the CMEC `output.json` subset). The regression capture must +use the **same curation** instead of copying the whole working directory. This alone prunes the +696 MB `climate_data` input and the verbose run logs for free. Diagnostic authors control what is +stored/pruned through those existing patterns — no new mechanism. + +## `NativeStore` interface (backend-agnostic) + +The REF talks to the store through a thin Protocol so the backend is swappable and unit-testable, +and so this RFC does not have to settle the backend to be implementable: + +```python +class NativeStore(Protocol): + def has(self, digest: str) -> bool: ... + def fetch(self, digest: str, dest: Path) -> None: ... # by sha256; public read, no creds + def put(self, path: Path) -> str: ... # returns sha256; needs write creds +``` + +Required properties of whatever backend is chosen: + +- **Anonymous public read** so fork CI and contributors fetch without credentials. +- **Writes gated to trusted contexts only** (CI on `main`, or maintainers). +- Content-addressed (sha256) storage; native addressed **per-test-case** so a fetch is one + diagnostic's output (KB–32 MB). +- Retention that **never expires a digest still referenced by a reachable manifest**. + +Backend selection is deferred to a maintainer decision (Unresolved questions). + +## CLI (extends the existing `ref test-cases` app) + +| Subcommand | Purpose | Creds | +|------------|---------|-------| +| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write golden + manifest locally | none (local) | +| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) | +| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to golden | none (public read) | +| `mint --provider P` | `run` + `put` native to the store + update manifest | write creds | + +`mint` is the only credentialed verb and runs only on trusted contexts. + +## Tiered CI + +- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.** `ref test-cases sync` (public read) + → `replay` → diff golden → post a PR comment. Figures surfaced but **non-blocking**. Routed by + changed-files so the check is *honest*: + - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the + replay; the check is named **`extraction-replay`** and its green means what it says. + - PR touches `execute()` only → replay is a tautology → emit a **neutral + "execute-changed: needs runner verification"** check (not green), linking the bootstrap path. + - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip** (today, + missing fixtures silently skip → false green). +- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.** Run the slow + `execute()` to confirm the diagnostic still reproduces its native, catching breakage pre-merge. + Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`. +- **Merge to `main` — self-hosted runner, the only job holding write creds.** `ref test-cases mint` + → `execute()` + `put` native + update manifest. +- **Nightly — self-hosted runner.** Full sweep; drift detection backstop; opens an auto-PR with the + new manifest + a human-readable summary if outputs drift. +- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change): a documented + `workflow_dispatch` a maintainer triggers to mint on the trusted runner and push the golden + + manifest back to the PR. Forks never run `execute()` on the self-hosted runner (untrusted code, + shared cache). + +## Self-hosted runner security + +The integration runner uses ephemeral pods, but mounts a **shared, writable cache** (datasets + +provider software). Untrusted fork code in such a job could poison that cache for the next trusted +run; ephemeral pods do not close this. Therefore: never run untrusted `execute()` on the shared +runner; keep write credentials on the `main` (and explicitly-triggered) path only; gate the +same-repo PR job by `head.repo.full_name == github.repository`. (A separate hardened runner with a +read-only cache could later make fork `execute()` safe; out of scope here.) + +## Determinism (prerequisite for comparing bundles) + +CMEC bundle content comparison (currently only `series.json` is compared) requires removing +nondeterminism first: + +- **Execution-dir timestamp.** ESMValTool writes a `recipe__` dir that gets baked + into `output.json`. Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison. + Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713). +- **Floats.** Replay path (stored native → extraction) is deterministic → exact compare. Execute + path (re-run) has float jitter → compare structure/dimensions only, leaving numeric values to the + replay path; any value that must be compared on the execute path uses a relative tolerance, never + exact equality. + +# Drawbacks +[drawbacks]: #drawbacks + +- The REF CLI owns fetch/mint/credential plumbing — more surface to maintain and test than + off-loading to an existing data-management tool. +- Native is a cache of a regenerable artifact: losing a digest still referenced by a reachable + manifest breaks historical replay (mitigated by reachability-aware retention). +- Golden minting is effectively a maintainer/CI action for conda providers (esmvaltool's env is + ~1373 packages); external authors usually cannot run `execute()` locally. +- Fork `execute()` changes have no automated PR signal until bootstrapped — the fast path tests + extraction, not the diagnostic algorithm. +- Historical bloat remains: existing committed binaries are not rewritten out of history + (deliberate — the repo is already used by externals). + +# Rationale and alternatives +[rationale-and-alternatives]: #rationale-and-alternatives + +**Why two layers.** Keeping the small REF-shaped golden in git puts the review signal where +reviewers already work (the PR diff) and keeps the gate human-readable; moving the large binaries +out of git removes the bloat while still making them fetchable for the extraction test. The split +maps cleanly onto what the REF already does for catalogs (`catalog.yaml` committed, +`catalog.paths.yaml` gitignored). + +**Why a manifest + CLI rather than committing the bytes.** Measured native is too large to commit +(pmp 250 MB, ilamb 61 MB). A content-addressed manifest gives dedup, per-commit pinning, and a clean +in-PR "native changed" diff without binaries in git. + +**Alternatives considered:** + +- **Commit curated native to git.** Rejected by measurement (too large for pmp/ilamb). Viable only + for esmvaltool (~5 MB) — not worth a per-provider split of the storage model. +- **Git LFS.** Pointer churn remains; GitHub LFS egress costs; no anonymous public read. +- **Regenerate-on-demand, store nothing.** The fast fork-PR extraction path runs on a stock runner + with no conda/data and cannot regenerate; it must fetch. Per-provider, pmp's 250 MB *could* be + regenerated on the self-hosted runner rather than stored if its fetch rate is low — the + `NativeStore` interface allows mixing. + +**Impact of not doing this.** Contributors keep hand-rolling baselines, PRs keep shipping output +changes without a visible diff, and silent upstream regressions stay invisible until a manual check. + +# Prior art +[prior-art]: #prior-art + +- The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git" + pattern, here extended to outputs. +- **pytest-regressions** (`--force-regen`) — the regenerate-the-golden workflow this mirrors. +- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; this RFC + keeps the client surface to just the REF CLI rather than adding a separate toolchain on + contributors. + +# Unresolved questions +[unresolved-questions]: #unresolved-questions + +- **Object-store backend choice** — the main open decision, deferred deliberately. Candidate + backends behind the `NativeStore` interface include an S3-compatible bucket or an OCI registry + (e.g. GitHub Packages); selection should weigh anonymous-read, trusted-write identity, egress + cost, content-addressing/GC, and any storage the org has already provisioned. **No backend is + assumed by this RFC.** +- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB). +- Retention policy for non-current native versions. +- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, or defer to a + follow-up. +- Exact form of the fork-PR **bootstrap `workflow_dispatch`**. + +# Future possibilities +[future-possibilities]: #future-possibilities + +- Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs). +- Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git. +- A `ref test-cases preview` that serves locally what the API would render for a PR, straight from a + synced native fixture (already the `replay` path). +- Signed manifests for end-to-end baseline provenance. From b556db0be28f998d2dde1c34ae5673f5c8a96f5d Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Thu, 4 Jun 2026 16:00:40 +1000 Subject: [PATCH 2/8] chore: format --- text/0000-regression-baselines.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md index 5168bf4..ab98e59 100644 --- a/text/0000-regression-baselines.md +++ b/text/0000-regression-baselines.md @@ -3,7 +3,6 @@ - RFC PR: [CMIP-REF/rfcs#0000](https://github.com/CMIP-REF/rfcs/pull/0000) # Summary -[summary]: #summary Manage diagnostic regression baselines as **two layers**: @@ -22,7 +21,6 @@ RFC specifies the *layering, manifest, CLI, and CI*, which are backend-agnostic `NativeStore` interface. # Motivation -[motivation]: #motivation Diagnostic regression testing today is flaky, opaque, and bloats the repo: @@ -47,7 +45,6 @@ replays *committed native* through `build_execution_result` without re-running t makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. # Reference-level explanation -[reference-level-explanation]: #reference-level-explanation ## Two layers, one manifest @@ -173,7 +170,6 @@ nondeterminism first: exact equality. # Drawbacks -[drawbacks]: #drawbacks - The REF CLI owns fetch/mint/credential plumbing — more surface to maintain and test than off-loading to an existing data-management tool. @@ -187,7 +183,6 @@ nondeterminism first: (deliberate — the repo is already used by externals). # Rationale and alternatives -[rationale-and-alternatives]: #rationale-and-alternatives **Why two layers.** Keeping the small REF-shaped golden in git puts the review signal where reviewers already work (the PR diff) and keeps the gate human-readable; moving the large binaries @@ -213,7 +208,6 @@ in-PR "native changed" diff without binaries in git. changes without a visible diff, and silent upstream regressions stay invisible until a manual check. # Prior art -[prior-art]: #prior-art - The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs. @@ -223,7 +217,6 @@ changes without a visible diff, and silent upstream regressions stay invisible u contributors. # Unresolved questions -[unresolved-questions]: #unresolved-questions - **Object-store backend choice** — the main open decision, deferred deliberately. Candidate backends behind the `NativeStore` interface include an S3-compatible bucket or an OCI registry @@ -237,7 +230,6 @@ changes without a visible diff, and silent upstream regressions stay invisible u - Exact form of the fork-PR **bootstrap `workflow_dispatch`**. # Future possibilities -[future-possibilities]: #future-possibilities - Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs). - Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git. From ef1a4fa792cb5a3208e53fdada04382d919e6925 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Thu, 4 Jun 2026 16:08:00 +1000 Subject: [PATCH 3/8] docs: rewrite regression-baselines RFC with semantic line breaks --- text/0000-regression-baselines.md | 269 +++++++++++++++++------------- 1 file changed, 151 insertions(+), 118 deletions(-) diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md index ab98e59..f24f293 100644 --- a/text/0000-regression-baselines.md +++ b/text/0000-regression-baselines.md @@ -6,43 +6,52 @@ Manage diagnostic regression baselines as **two layers**: -1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles) committed - directly to the source repo — this is the test *gate* and the in-PR *diff signal*. -2. The large, binary **native diagnostic outputs** stored outside git in an object store and - referenced from a single git-committed **`native_manifest.json`** (sha256 digests of the golden, - the native files, and the extraction inputs). +1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles) + committed directly to the source repo — + this is the test *gate* and the in-PR *diff signal*. +2. The large, binary **native diagnostic outputs** stored outside git in an object store + and referenced from a single git-committed **`native_manifest.json`** + (sha256 digests of the golden, the native files, and the extraction inputs). -The REF CLI (`ref test-cases`) fetches, runs, and mints baselines. The native outputs double as the -**input fixtures that test the native→bundle extraction step**, so most PRs are verified fast on a -stock runner without re-running the (expensive) diagnostic. +The REF CLI (`ref test-cases`) fetches, runs, and mints baselines. +The native outputs double as the **input fixtures that test the native→bundle extraction step**, +so most PRs are verified fast on a stock runner without re-running the (expensive) diagnostic. -The **object-store backend is left as an explicit open decision** (see Unresolved questions) — this -RFC specifies the *layering, manifest, CLI, and CI*, which are backend-agnostic behind a small -`NativeStore` interface. +The **object-store backend is left as an explicit open decision** (see Unresolved questions) — +this RFC specifies the *layering, manifest, CLI, and CI*, +which are backend-agnostic behind a small `NativeStore` interface. # Motivation Diagnostic regression testing today is flaky, opaque, and bloats the repo: -- **Bloat.** `tests/.../regression` is ~1.0 GB on disk. Of that, **696 MB is input `climate_data` - that the capture step wrongly copies in**; the real curated output is ~316 MB (esmvaltool 5 MB, - ilamb 61 MB, pmp 250 MB). Some native is already hand-`.gitignore`d, so it is *missing* — no - coverage, no parity. The bloat is a capture bug plus committing binaries, not an inherent need. -- **No clear "this-changed" signal.** Failures are a mix; you dig CI logs across many jobs. The - REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR and is - exactly the content that feeds the REF API — it is what reviewers most need to see. -- **The extraction regression never runs on a PR.** It runs only post-merge/nightly today, so a PR - that breaks extraction goes green, then red after merge. -- **Most diagnostic authors are external scientists.** Minimal setup, no new credentials. A - contributor should need only the REF. -- **Security.** The integration runner shares a writable cache; untrusted fork code must never reach - it or any write credential. - -A key reframe: **the native outputs are not review-only artifacts — they are the input fixtures for -the extraction step.** `Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) then -`build_execution_result()` (reads native → CMEC bundle + series). `RegressionValidator` already -replays *committed native* through `build_execution_result` without re-running the diagnostic. That -makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. +- **Bloat.** `tests/.../regression` is ~1.0 GB on disk. + Of that, **696 MB is input `climate_data` that the capture step wrongly copies in**; + the real curated output is ~316 MB (esmvaltool 5 MB, ilamb 61 MB, pmp 250 MB). + Some native is already hand-`.gitignore`d, so it is *missing* — no coverage, no parity. + The bloat is a capture bug plus committing binaries, not an inherent need. +- **No clear "this-changed" signal.** + Failures are a mix; you dig CI logs across many jobs. + The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR + and is exactly the content that feeds the REF API — + it is what reviewers most need to see. +- **The extraction regression never runs on a PR.** + It runs only post-merge/nightly today, + so a PR that breaks extraction goes green, then red after merge. +- **Most diagnostic authors are external scientists.** + Minimal setup, no new credentials. + A contributor should need only the REF. +- **Security.** + The integration runner shares a writable cache; + untrusted fork code must never reach it or any write credential. + +A key reframe: **the native outputs are not review-only artifacts — +they are the input fixtures for the extraction step.** +`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) +then `build_execution_result()` (reads native → CMEC bundle + series). +`RegressionValidator` already replays *committed native* through `build_execution_result` +without re-running the diagnostic. +That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. # Reference-level explanation @@ -72,28 +81,31 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor "extraction_inputs": { "files_series_digest": "", "input_selectors_digest": "" } } ``` -- `golden` binds the committed golden bytes to the manifest: CI recomputes the golden digests and - asserts they equal `manifest.golden`. A commit that updates `series.json` but not the manifest (or - vice-versa) **fails loudly** — closing the one consistency leg content-addressing does not cover. +- `golden` binds the committed golden bytes to the manifest: + CI recomputes the golden digests and asserts they equal `manifest.golden`. + A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** — + closing the one consistency leg content-addressing does not cover. - `native` lists the curated native blobs by content digest (immutable, dedups, survives history). -- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog - `input_selectors` — the real extraction inputs the current catalog hash ignores — so a change to - extraction behaviour bumps the manifest *at PR time*. +- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` — + the real extraction inputs the current catalog hash ignores — + so a change to extraction behaviour bumps the manifest *at PR time*. -Two manifest sections keep the diff readable: a changed `native` block means native changed; a -changed `golden` block means extraction output changed. +Two manifest sections keep the diff readable: +a changed `native` block means native changed; +a changed `golden` block means extraction output changed. ## Curated capture (kills the bloat at the source) The diagnostic already declares which native files matter via `files` + `series` glob patterns; -production publishing copies only those (the CMEC `output.json` subset). The regression capture must -use the **same curation** instead of copying the whole working directory. This alone prunes the -696 MB `climate_data` input and the verbose run logs for free. Diagnostic authors control what is -stored/pruned through those existing patterns — no new mechanism. +production publishing copies only those (the CMEC `output.json` subset). +The regression capture must use the **same curation** instead of copying the whole working directory. +This alone prunes the 696 MB `climate_data` input and the verbose run logs for free. +Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism. ## `NativeStore` interface (backend-agnostic) -The REF talks to the store through a thin Protocol so the backend is swappable and unit-testable, +The REF talks to the store through a thin Protocol +so the backend is swappable and unit-testable, and so this RFC does not have to settle the backend to be implementable: ```python @@ -107,8 +119,8 @@ Required properties of whatever backend is chosen: - **Anonymous public read** so fork CI and contributors fetch without credentials. - **Writes gated to trusted contexts only** (CI on `main`, or maintainers). -- Content-addressed (sha256) storage; native addressed **per-test-case** so a fetch is one - diagnostic's output (KB–32 MB). +- Content-addressed (sha256) storage; + native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB). - Retention that **never expires a digest still referenced by a reachable manifest**. Backend selection is deferred to a maintainer decision (Unresolved questions). @@ -126,113 +138,134 @@ Backend selection is deferred to a maintainer decision (Unresolved questions). ## Tiered CI -- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.** `ref test-cases sync` (public read) - → `replay` → diff golden → post a PR comment. Figures surfaced but **non-blocking**. Routed by - changed-files so the check is *honest*: - - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the - replay; the check is named **`extraction-replay`** and its green means what it says. - - PR touches `execute()` only → replay is a tautology → emit a **neutral - "execute-changed: needs runner verification"** check (not green), linking the bootstrap path. - - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip** (today, - missing fixtures silently skip → false green). -- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.** Run the slow - `execute()` to confirm the diagnostic still reproduces its native, catching breakage pre-merge. +- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.** + `ref test-cases sync` (public read) → `replay` → diff golden → post a PR comment. + Figures surfaced but **non-blocking**. + Routed by changed-files so the check is *honest*: + - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the replay; + the check is named **`extraction-replay`** and its green means what it says. + - PR touches `execute()` only → replay is a tautology → + emit a **neutral "execute-changed: needs runner verification"** check (not green), + linking the bootstrap path. + - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip** + (today, missing fixtures silently skip → false green). +- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.** + Run the slow `execute()` to confirm the diagnostic still reproduces its native, + catching breakage pre-merge. Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`. -- **Merge to `main` — self-hosted runner, the only job holding write creds.** `ref test-cases mint` - → `execute()` + `put` native + update manifest. -- **Nightly — self-hosted runner.** Full sweep; drift detection backstop; opens an auto-PR with the - new manifest + a human-readable summary if outputs drift. -- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change): a documented - `workflow_dispatch` a maintainer triggers to mint on the trusted runner and push the golden + - manifest back to the PR. Forks never run `execute()` on the self-hosted runner (untrusted code, - shared cache). +- **Merge to `main` — self-hosted runner, the only job holding write creds.** + `ref test-cases mint` → `execute()` + `put` native + update manifest. +- **Nightly — self-hosted runner.** + Full sweep; drift detection backstop; + opens an auto-PR with the new manifest + a human-readable summary if outputs drift. +- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change): + a documented `workflow_dispatch` a maintainer triggers + to mint on the trusted runner and push the golden + manifest back to the PR. + Forks never run `execute()` on the self-hosted runner (untrusted code, shared cache). ## Self-hosted runner security -The integration runner uses ephemeral pods, but mounts a **shared, writable cache** (datasets + -provider software). Untrusted fork code in such a job could poison that cache for the next trusted -run; ephemeral pods do not close this. Therefore: never run untrusted `execute()` on the shared -runner; keep write credentials on the `main` (and explicitly-triggered) path only; gate the -same-repo PR job by `head.repo.full_name == github.repository`. (A separate hardened runner with a -read-only cache could later make fork `execute()` safe; out of scope here.) +The integration runner uses ephemeral pods, +but mounts a **shared, writable cache** (datasets + provider software). +Untrusted fork code in such a job could poison that cache for the next trusted run; +ephemeral pods do not close this. +Therefore: never run untrusted `execute()` on the shared runner; +keep write credentials on the `main` (and explicitly-triggered) path only; +gate the same-repo PR job by `head.repo.full_name == github.repository`. +(A separate hardened runner with a read-only cache could later make fork `execute()` safe; +out of scope here.) ## Determinism (prerequisite for comparing bundles) -CMEC bundle content comparison (currently only `series.json` is compared) requires removing -nondeterminism first: +CMEC bundle content comparison (currently only `series.json` is compared) +requires removing nondeterminism first: -- **Execution-dir timestamp.** ESMValTool writes a `recipe__` dir that gets baked - into `output.json`. Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison. +- **Execution-dir timestamp.** + ESMValTool writes a `recipe__` dir that gets baked into `output.json`. + Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison. Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713). -- **Floats.** Replay path (stored native → extraction) is deterministic → exact compare. Execute - path (re-run) has float jitter → compare structure/dimensions only, leaving numeric values to the - replay path; any value that must be compared on the execute path uses a relative tolerance, never - exact equality. +- **Floats.** + Replay path (stored native → extraction) is deterministic → exact compare. + Execute path (re-run) has float jitter → compare structure/dimensions only, + leaving numeric values to the replay path; + any value that must be compared on the execute path uses a relative tolerance, never exact equality. # Drawbacks -- The REF CLI owns fetch/mint/credential plumbing — more surface to maintain and test than - off-loading to an existing data-management tool. -- Native is a cache of a regenerable artifact: losing a digest still referenced by a reachable - manifest breaks historical replay (mitigated by reachability-aware retention). -- Golden minting is effectively a maintainer/CI action for conda providers (esmvaltool's env is - ~1373 packages); external authors usually cannot run `execute()` locally. -- Fork `execute()` changes have no automated PR signal until bootstrapped — the fast path tests - extraction, not the diagnostic algorithm. +- The REF CLI owns fetch/mint/credential plumbing — + more surface to maintain and test than off-loading to an existing data-management tool. +- Native is a cache of a regenerable artifact: + losing a digest still referenced by a reachable manifest breaks historical replay + (mitigated by reachability-aware retention). +- Golden minting is effectively a maintainer/CI action for conda providers + (esmvaltool's env is ~1373 packages); + external authors usually cannot run `execute()` locally. +- Fork `execute()` changes have no automated PR signal until bootstrapped — + the fast path tests extraction, not the diagnostic algorithm. - Historical bloat remains: existing committed binaries are not rewritten out of history (deliberate — the repo is already used by externals). # Rationale and alternatives -**Why two layers.** Keeping the small REF-shaped golden in git puts the review signal where -reviewers already work (the PR diff) and keeps the gate human-readable; moving the large binaries -out of git removes the bloat while still making them fetchable for the extraction test. The split -maps cleanly onto what the REF already does for catalogs (`catalog.yaml` committed, -`catalog.paths.yaml` gitignored). +**Why two layers.** +Keeping the small REF-shaped golden in git +puts the review signal where reviewers already work (the PR diff) +and keeps the gate human-readable; +moving the large binaries out of git removes the bloat +while still making them fetchable for the extraction test. +The split maps cleanly onto what the REF already does for catalogs +(`catalog.yaml` committed, `catalog.paths.yaml` gitignored). -**Why a manifest + CLI rather than committing the bytes.** Measured native is too large to commit -(pmp 250 MB, ilamb 61 MB). A content-addressed manifest gives dedup, per-commit pinning, and a clean -in-PR "native changed" diff without binaries in git. +**Why a manifest + CLI rather than committing the bytes.** +Measured native is too large to commit (pmp 250 MB, ilamb 61 MB). +A content-addressed manifest gives dedup, per-commit pinning, +and a clean in-PR "native changed" diff without binaries in git. **Alternatives considered:** -- **Commit curated native to git.** Rejected by measurement (too large for pmp/ilamb). Viable only - for esmvaltool (~5 MB) — not worth a per-provider split of the storage model. -- **Git LFS.** Pointer churn remains; GitHub LFS egress costs; no anonymous public read. -- **Regenerate-on-demand, store nothing.** The fast fork-PR extraction path runs on a stock runner - with no conda/data and cannot regenerate; it must fetch. Per-provider, pmp's 250 MB *could* be - regenerated on the self-hosted runner rather than stored if its fetch rate is low — the - `NativeStore` interface allows mixing. - -**Impact of not doing this.** Contributors keep hand-rolling baselines, PRs keep shipping output -changes without a visible diff, and silent upstream regressions stay invisible until a manual check. +- **Commit curated native to git.** + Rejected by measurement (too large for pmp/ilamb). + Viable only for esmvaltool (~5 MB) — not worth a per-provider split of the storage model. +- **Git LFS.** + Pointer churn remains; GitHub LFS egress costs; no anonymous public read. +- **Regenerate-on-demand, store nothing.** + The fast fork-PR extraction path runs on a stock runner with no conda/data and cannot regenerate; + it must fetch. + Per-provider, pmp's 250 MB *could* be regenerated on the self-hosted runner rather than stored + if its fetch rate is low — the `NativeStore` interface allows mixing. + +**Impact of not doing this.** +Contributors keep hand-rolling baselines, +PRs keep shipping output changes without a visible diff, +and silent upstream regressions stay invisible until a manual check. # Prior art -- The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git" - pattern, here extended to outputs. +- The REF's existing **catalog split** — + the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs. - **pytest-regressions** (`--force-regen`) — the regenerate-the-golden workflow this mirrors. -- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; this RFC - keeps the client surface to just the REF CLI rather than adding a separate toolchain on - contributors. +- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; + this RFC keeps the client surface to just the REF CLI + rather than adding a separate toolchain on contributors. # Unresolved questions -- **Object-store backend choice** — the main open decision, deferred deliberately. Candidate - backends behind the `NativeStore` interface include an S3-compatible bucket or an OCI registry - (e.g. GitHub Packages); selection should weigh anonymous-read, trusted-write identity, egress - cost, content-addressing/GC, and any storage the org has already provisioned. **No backend is - assumed by this RFC.** +- **Object-store backend choice** — the main open decision, deferred deliberately. + Candidate backends behind the `NativeStore` interface include + an S3-compatible bucket or an OCI registry (e.g. GitHub Packages); + selection should weigh anonymous-read, trusted-write identity, egress cost, + content-addressing/GC, and any storage the org has already provisioned. + **No backend is assumed by this RFC.** - Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB). - Retention policy for non-current native versions. -- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, or defer to a - follow-up. +- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, + or defer to a follow-up. - Exact form of the fork-PR **bootstrap `workflow_dispatch`**. # Future possibilities - Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs). - Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git. -- A `ref test-cases preview` that serves locally what the API would render for a PR, straight from a - synced native fixture (already the `replay` path). +- A `ref test-cases preview` that serves locally what the API would render for a PR, + straight from a synced native fixture (already the `replay` path). - Signed manifests for end-to-end baseline provenance. From 871e74c2422fceec2bac1c5f6b7ee885ce47f1a3 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 06:12:32 +1000 Subject: [PATCH 4/8] chore: rename golden to committed --- text/0000-regression-baselines.md | 85 ++++++++++++++++--------------- 1 file changed, 43 insertions(+), 42 deletions(-) diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md index f24f293..1266954 100644 --- a/text/0000-regression-baselines.md +++ b/text/0000-regression-baselines.md @@ -1,20 +1,22 @@ - Feature Name: `regression_baselines` - Start Date: 2026-06-04 -- RFC PR: [CMIP-REF/rfcs#0000](https://github.com/CMIP-REF/rfcs/pull/0000) +- RFC PR: [CMIP-REF/rfcs#5](https://github.com/CMIP-REF/rfcs/pull/0005) # Summary -Manage diagnostic regression baselines as **two layers**: +Manage diagnostic regression **baselines**. +Each baseline (one test-case run) has **two bundles**: -1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles) +1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles) committed directly to the source repo — this is the test *gate* and the in-PR *diff signal*. -2. The large, binary **native diagnostic outputs** stored outside git in an object store - and referenced from a single git-committed **`native_manifest.json`** - (sha256 digests of the golden, the native files, and the extraction inputs). +2. A large, binary **native bundle** (the diagnostic's native outputs) + stored outside git in an object store + and referenced from a single git-committed **`manifest.json`** + (sha256 digests of the committed bundle, the native files, and the extraction inputs). The REF CLI (`ref test-cases`) fetches, runs, and mints baselines. -The native outputs double as the **input fixtures that test the native→bundle extraction step**, +The native bundle doubles as the **input fixtures that test the native->bundle extraction step**, so most PRs are verified fast on a stock runner without re-running the (expensive) diagnostic. The **object-store backend is left as an explicit open decision** (see Unresolved questions) — @@ -45,17 +47,16 @@ Diagnostic regression testing today is flaky, opaque, and bloats the repo: The integration runner shares a writable cache; untrusted fork code must never reach it or any write credential. -A key reframe: **the native outputs are not review-only artifacts — -they are the input fixtures for the extraction step.** +A key reframe: **the native bundle is not a review-only artifact — +it is the input fixtures for the extraction step.** `Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) -then `build_execution_result()` (reads native → CMEC bundle + series). -`RegressionValidator` already replays *committed native* through `build_execution_result` -without re-running the diagnostic. +then `build_execution_result()` (reads native -> CMEC bundle + series). +`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic. That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. # Reference-level explanation -## Two layers, one manifest +## Two bundles, one manifest For each `(provider, diagnostic, test-case)`: @@ -64,25 +65,25 @@ packages//tests/test-data/// catalog.yaml # committed: dataset metadata (exists today, content-hashed) catalog.paths.yaml # gitignored: per-user local paths (exists today) regression/ - series.json # committed GOLDEN (the gate + diff signal) - diagnostic.json # committed GOLDEN (CMEC metric bundle) - output.json # committed GOLDEN (CMEC output bundle) - native_manifest.json # committed: source of truth binding golden <-> native + series.json # committed bundle (the gate + diff signal) + diagnostic.json # committed bundle (CMEC metric bundle) + output.json # committed bundle (CMEC output bundle) + manifest.json # committed: source of truth binding committed bundle <-> native bundle ``` The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git. -`native_manifest.json`: +`manifest.json`: ```json { "schema": 1, - "golden": { "series.json": "", "diagnostic.json": "", "output.json": "" }, + "committed": { "series.json": "", "diagnostic.json": "", "output.json": "" }, "native": { "": { "sha256": "...", "size": ... } }, "extraction_inputs": { "files_series_digest": "", "input_selectors_digest": "" } } ``` -- `golden` binds the committed golden bytes to the manifest: - CI recomputes the golden digests and asserts they equal `manifest.golden`. +- `committed` binds the committed bundle bytes to the manifest: + CI recomputes the committed digests and asserts they equal `manifest.committed`. A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** — closing the one consistency leg content-addressing does not cover. - `native` lists the curated native blobs by content digest (immutable, dedups, survives history). @@ -92,7 +93,7 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor Two manifest sections keep the diff readable: a changed `native` block means native changed; -a changed `golden` block means extraction output changed. +a changed `committed` block means extraction output changed. ## Curated capture (kills the bloat at the source) @@ -127,40 +128,40 @@ Backend selection is deferred to a maintainer decision (Unresolved questions). ## CLI (extends the existing `ref test-cases` app) -| Subcommand | Purpose | Creds | -|------------|---------|-------| -| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write golden + manifest locally | none (local) | -| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) | -| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to golden | none (public read) | -| `mint --provider P` | `run` + `put` native to the store + update manifest | write creds | +| Subcommand | Purpose | Creds | +| ------------------------------------ | ----------------------------------------------------------------------------------- | ------------------ | +| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write committed bundle + manifest locally | none (local) | +| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) | +| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to committed bundle | none (public read) | +| `mint --provider P` | `run` + `put` native to the store + update manifest | write creds | `mint` is the only credentialed verb and runs only on trusted contexts. ## Tiered CI - **PR — any, incl forks — stock `ubuntu-latest`, no secrets.** - `ref test-cases sync` (public read) → `replay` → diff golden → post a PR comment. + `ref test-cases sync` (public read) -> `replay` -> diff committed bundle -> post a PR comment. Figures surfaced but **non-blocking**. Routed by changed-files so the check is *honest*: - - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the replay; + - PR touches extraction (`build_execution_result`, `files`/`series`) or committed/manifest -> run the replay; the check is named **`extraction-replay`** and its green means what it says. - - PR touches `execute()` only → replay is a tautology → + - PR touches `execute()` only -> replay is a tautology -> emit a **neutral "execute-changed: needs runner verification"** check (not green), linking the bootstrap path. - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip** - (today, missing fixtures silently skip → false green). -- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.** + (today, missing fixtures silently skip -> false green). +- **Same-repo PR — self-hosted runner, no upload — when a committed bundle file changes.** Run the slow `execute()` to confirm the diagnostic still reproduces its native, catching breakage pre-merge. Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`. - **Merge to `main` — self-hosted runner, the only job holding write creds.** - `ref test-cases mint` → `execute()` + `put` native + update manifest. + `ref test-cases mint` -> `execute()` + `put` native + update manifest. - **Nightly — self-hosted runner.** Full sweep; drift detection backstop; opens an auto-PR with the new manifest + a human-readable summary if outputs drift. - **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change): a documented `workflow_dispatch` a maintainer triggers - to mint on the trusted runner and push the golden + manifest back to the PR. + to mint on the trusted runner and push the committed bundle + manifest back to the PR. Forks never run `execute()` on the self-hosted runner (untrusted code, shared cache). ## Self-hosted runner security @@ -182,11 +183,11 @@ requires removing nondeterminism first: - **Execution-dir timestamp.** ESMValTool writes a `recipe__` dir that gets baked into `output.json`. - Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison. + Sanitise `recipe_\d{8}_\d{6}` -> a placeholder in both capture and comparison. Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713). - **Floats.** - Replay path (stored native → extraction) is deterministic → exact compare. - Execute path (re-run) has float jitter → compare structure/dimensions only, + Replay path (stored native -> extraction) is deterministic -> exact compare. + Execute path (re-run) has float jitter -> compare structure/dimensions only, leaving numeric values to the replay path; any value that must be compared on the execute path uses a relative tolerance, never exact equality. @@ -197,7 +198,7 @@ requires removing nondeterminism first: - Native is a cache of a regenerable artifact: losing a digest still referenced by a reachable manifest breaks historical replay (mitigated by reachability-aware retention). -- Golden minting is effectively a maintainer/CI action for conda providers +- Committed-bundle minting is effectively a maintainer/CI action for conda providers (esmvaltool's env is ~1373 packages); external authors usually cannot run `execute()` locally. - Fork `execute()` changes have no automated PR signal until bootstrapped — @@ -207,8 +208,8 @@ requires removing nondeterminism first: # Rationale and alternatives -**Why two layers.** -Keeping the small REF-shaped golden in git +**Why two bundles.** +Keeping the small REF-shaped committed bundle in git puts the review signal where reviewers already work (the PR diff) and keeps the gate human-readable; moving the large binaries out of git removes the bloat @@ -243,7 +244,7 @@ and silent upstream regressions stay invisible until a manual check. - The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs. -- **pytest-regressions** (`--force-regen`) — the regenerate-the-golden workflow this mirrors. +- **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors. - Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; this RFC keeps the client surface to just the REF CLI rather than adding a separate toolchain on contributors. From ddc49bb8d724279f194bd9d21178af7415f8748f Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 10:48:27 +1000 Subject: [PATCH 5/8] chore: rename file --- text/0000-regression-baselines.md | 272 ------------------- text/0005-regression-baselines.md | 433 ++++++++++++++++++++++++++++++ 2 files changed, 433 insertions(+), 272 deletions(-) delete mode 100644 text/0000-regression-baselines.md create mode 100644 text/0005-regression-baselines.md diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md deleted file mode 100644 index 1266954..0000000 --- a/text/0000-regression-baselines.md +++ /dev/null @@ -1,272 +0,0 @@ -- Feature Name: `regression_baselines` -- Start Date: 2026-06-04 -- RFC PR: [CMIP-REF/rfcs#5](https://github.com/CMIP-REF/rfcs/pull/0005) - -# Summary - -Manage diagnostic regression **baselines**. -Each baseline (one test-case run) has **two bundles**: - -1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles) - committed directly to the source repo — - this is the test *gate* and the in-PR *diff signal*. -2. A large, binary **native bundle** (the diagnostic's native outputs) - stored outside git in an object store - and referenced from a single git-committed **`manifest.json`** - (sha256 digests of the committed bundle, the native files, and the extraction inputs). - -The REF CLI (`ref test-cases`) fetches, runs, and mints baselines. -The native bundle doubles as the **input fixtures that test the native->bundle extraction step**, -so most PRs are verified fast on a stock runner without re-running the (expensive) diagnostic. - -The **object-store backend is left as an explicit open decision** (see Unresolved questions) — -this RFC specifies the *layering, manifest, CLI, and CI*, -which are backend-agnostic behind a small `NativeStore` interface. - -# Motivation - -Diagnostic regression testing today is flaky, opaque, and bloats the repo: - -- **Bloat.** `tests/.../regression` is ~1.0 GB on disk. - Of that, **696 MB is input `climate_data` that the capture step wrongly copies in**; - the real curated output is ~316 MB (esmvaltool 5 MB, ilamb 61 MB, pmp 250 MB). - Some native is already hand-`.gitignore`d, so it is *missing* — no coverage, no parity. - The bloat is a capture bug plus committing binaries, not an inherent need. -- **No clear "this-changed" signal.** - Failures are a mix; you dig CI logs across many jobs. - The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR - and is exactly the content that feeds the REF API — - it is what reviewers most need to see. -- **The extraction regression never runs on a PR.** - It runs only post-merge/nightly today, - so a PR that breaks extraction goes green, then red after merge. -- **Most diagnostic authors are external scientists.** - Minimal setup, no new credentials. - A contributor should need only the REF. -- **Security.** - The integration runner shares a writable cache; - untrusted fork code must never reach it or any write credential. - -A key reframe: **the native bundle is not a review-only artifact — -it is the input fixtures for the extraction step.** -`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) -then `build_execution_result()` (reads native -> CMEC bundle + series). -`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic. -That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. - -# Reference-level explanation - -## Two bundles, one manifest - -For each `(provider, diagnostic, test-case)`: - -``` -packages//tests/test-data/// - catalog.yaml # committed: dataset metadata (exists today, content-hashed) - catalog.paths.yaml # gitignored: per-user local paths (exists today) - regression/ - series.json # committed bundle (the gate + diff signal) - diagnostic.json # committed bundle (CMEC metric bundle) - output.json # committed bundle (CMEC output bundle) - manifest.json # committed: source of truth binding committed bundle <-> native bundle -``` - -The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git. - -`manifest.json`: - -```json -{ "schema": 1, - "committed": { "series.json": "", "diagnostic.json": "", "output.json": "" }, - "native": { "": { "sha256": "...", "size": ... } }, - "extraction_inputs": { "files_series_digest": "", "input_selectors_digest": "" } } -``` - -- `committed` binds the committed bundle bytes to the manifest: - CI recomputes the committed digests and asserts they equal `manifest.committed`. - A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** — - closing the one consistency leg content-addressing does not cover. -- `native` lists the curated native blobs by content digest (immutable, dedups, survives history). -- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` — - the real extraction inputs the current catalog hash ignores — - so a change to extraction behaviour bumps the manifest *at PR time*. - -Two manifest sections keep the diff readable: -a changed `native` block means native changed; -a changed `committed` block means extraction output changed. - -## Curated capture (kills the bloat at the source) - -The diagnostic already declares which native files matter via `files` + `series` glob patterns; -production publishing copies only those (the CMEC `output.json` subset). -The regression capture must use the **same curation** instead of copying the whole working directory. -This alone prunes the 696 MB `climate_data` input and the verbose run logs for free. -Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism. - -## `NativeStore` interface (backend-agnostic) - -The REF talks to the store through a thin Protocol -so the backend is swappable and unit-testable, -and so this RFC does not have to settle the backend to be implementable: - -```python -class NativeStore(Protocol): - def has(self, digest: str) -> bool: ... - def fetch(self, digest: str, dest: Path) -> None: ... # by sha256; public read, no creds - def put(self, path: Path) -> str: ... # returns sha256; needs write creds -``` - -Required properties of whatever backend is chosen: - -- **Anonymous public read** so fork CI and contributors fetch without credentials. -- **Writes gated to trusted contexts only** (CI on `main`, or maintainers). -- Content-addressed (sha256) storage; - native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB). -- Retention that **never expires a digest still referenced by a reachable manifest**. - -Backend selection is deferred to a maintainer decision (Unresolved questions). - -## CLI (extends the existing `ref test-cases` app) - -| Subcommand | Purpose | Creds | -| ------------------------------------ | ----------------------------------------------------------------------------------- | ------------------ | -| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write committed bundle + manifest locally | none (local) | -| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) | -| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to committed bundle | none (public read) | -| `mint --provider P` | `run` + `put` native to the store + update manifest | write creds | - -`mint` is the only credentialed verb and runs only on trusted contexts. - -## Tiered CI - -- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.** - `ref test-cases sync` (public read) -> `replay` -> diff committed bundle -> post a PR comment. - Figures surfaced but **non-blocking**. - Routed by changed-files so the check is *honest*: - - PR touches extraction (`build_execution_result`, `files`/`series`) or committed/manifest -> run the replay; - the check is named **`extraction-replay`** and its green means what it says. - - PR touches `execute()` only -> replay is a tautology -> - emit a **neutral "execute-changed: needs runner verification"** check (not green), - linking the bootstrap path. - - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip** - (today, missing fixtures silently skip -> false green). -- **Same-repo PR — self-hosted runner, no upload — when a committed bundle file changes.** - Run the slow `execute()` to confirm the diagnostic still reproduces its native, - catching breakage pre-merge. - Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`. -- **Merge to `main` — self-hosted runner, the only job holding write creds.** - `ref test-cases mint` -> `execute()` + `put` native + update manifest. -- **Nightly — self-hosted runner.** - Full sweep; drift detection backstop; - opens an auto-PR with the new manifest + a human-readable summary if outputs drift. -- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change): - a documented `workflow_dispatch` a maintainer triggers - to mint on the trusted runner and push the committed bundle + manifest back to the PR. - Forks never run `execute()` on the self-hosted runner (untrusted code, shared cache). - -## Self-hosted runner security - -The integration runner uses ephemeral pods, -but mounts a **shared, writable cache** (datasets + provider software). -Untrusted fork code in such a job could poison that cache for the next trusted run; -ephemeral pods do not close this. -Therefore: never run untrusted `execute()` on the shared runner; -keep write credentials on the `main` (and explicitly-triggered) path only; -gate the same-repo PR job by `head.repo.full_name == github.repository`. -(A separate hardened runner with a read-only cache could later make fork `execute()` safe; -out of scope here.) - -## Determinism (prerequisite for comparing bundles) - -CMEC bundle content comparison (currently only `series.json` is compared) -requires removing nondeterminism first: - -- **Execution-dir timestamp.** - ESMValTool writes a `recipe__` dir that gets baked into `output.json`. - Sanitise `recipe_\d{8}_\d{6}` -> a placeholder in both capture and comparison. - Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713). -- **Floats.** - Replay path (stored native -> extraction) is deterministic -> exact compare. - Execute path (re-run) has float jitter -> compare structure/dimensions only, - leaving numeric values to the replay path; - any value that must be compared on the execute path uses a relative tolerance, never exact equality. - -# Drawbacks - -- The REF CLI owns fetch/mint/credential plumbing — - more surface to maintain and test than off-loading to an existing data-management tool. -- Native is a cache of a regenerable artifact: - losing a digest still referenced by a reachable manifest breaks historical replay - (mitigated by reachability-aware retention). -- Committed-bundle minting is effectively a maintainer/CI action for conda providers - (esmvaltool's env is ~1373 packages); - external authors usually cannot run `execute()` locally. -- Fork `execute()` changes have no automated PR signal until bootstrapped — - the fast path tests extraction, not the diagnostic algorithm. -- Historical bloat remains: existing committed binaries are not rewritten out of history - (deliberate — the repo is already used by externals). - -# Rationale and alternatives - -**Why two bundles.** -Keeping the small REF-shaped committed bundle in git -puts the review signal where reviewers already work (the PR diff) -and keeps the gate human-readable; -moving the large binaries out of git removes the bloat -while still making them fetchable for the extraction test. -The split maps cleanly onto what the REF already does for catalogs -(`catalog.yaml` committed, `catalog.paths.yaml` gitignored). - -**Why a manifest + CLI rather than committing the bytes.** -Measured native is too large to commit (pmp 250 MB, ilamb 61 MB). -A content-addressed manifest gives dedup, per-commit pinning, -and a clean in-PR "native changed" diff without binaries in git. - -**Alternatives considered:** - -- **Commit curated native to git.** - Rejected by measurement (too large for pmp/ilamb). - Viable only for esmvaltool (~5 MB) — not worth a per-provider split of the storage model. -- **Git LFS.** - Pointer churn remains; GitHub LFS egress costs; no anonymous public read. -- **Regenerate-on-demand, store nothing.** - The fast fork-PR extraction path runs on a stock runner with no conda/data and cannot regenerate; - it must fetch. - Per-provider, pmp's 250 MB *could* be regenerated on the self-hosted runner rather than stored - if its fetch rate is low — the `NativeStore` interface allows mixing. - -**Impact of not doing this.** -Contributors keep hand-rolling baselines, -PRs keep shipping output changes without a visible diff, -and silent upstream regressions stay invisible until a manual check. - -# Prior art - -- The REF's existing **catalog split** — - the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs. -- **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors. -- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; - this RFC keeps the client surface to just the REF CLI - rather than adding a separate toolchain on contributors. - -# Unresolved questions - -- **Object-store backend choice** — the main open decision, deferred deliberately. - Candidate backends behind the `NativeStore` interface include - an S3-compatible bucket or an OCI registry (e.g. GitHub Packages); - selection should weigh anonymous-read, trusted-write identity, egress cost, - content-addressing/GC, and any storage the org has already provisioned. - **No backend is assumed by this RFC.** -- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB). -- Retention policy for non-current native versions. -- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, - or defer to a follow-up. -- Exact form of the fork-PR **bootstrap `workflow_dispatch`**. - -# Future possibilities - -- Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs). -- Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git. -- A `ref test-cases preview` that serves locally what the API would render for a PR, - straight from a synced native fixture (already the `replay` path). -- Signed manifests for end-to-end baseline provenance. diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md new file mode 100644 index 0000000..7598116 --- /dev/null +++ b/text/0005-regression-baselines.md @@ -0,0 +1,433 @@ +- Feature Name: `regression_baselines` +- Start Date: 2026-06-04 +- RFC PR: [CMIP-REF/rfcs#5](https://github.com/CMIP-REF/rfcs/pull/0005) + +# Summary + +Effectively diagnostic regression **baselines** for use by the testing suite. +Having baseline output for a diagnostic is an important contract between the development team and the diagnostic implementor. + +Each baseline (one test-case run) has **two bundles**: + +1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles) + committed directly to the source repo. + This should be stable and able to be reproduced by the CI. +2. A large, binary **native bundle** (the diagnostic's native outputs) + stored outside git in an object store + and referenced from a single git-committed **`manifest.json`** + (sha256 digests of the committed bundle, the native files, and the extraction inputs). + +The REF CLI (`ref test-cases`) fetches, runs, and mints baselines +with the goal of making it simple for diagnostic developers to contribute test output without requiring custom pytest implementation. + +The **object-store backend is left as an explicit open decision** (see Unresolved questions) — +this RFC specifies the *layering, manifest, CLI, and CI*, +which are backend-agnostic behind a small `NativeStore` interface. + +# Motivation + +Diagnostic regression testing today is flaky, opaque, and bloats the repo: + +- **Bloat.** `tests/.../regression` is ~1 GB on disk. + The git repo size is growing in a way that isn't sustainable. + We want to minimise the volume of data needed to setup the repo. +- **No clear "this-changed" signal.** + It's hard to see the failures for new diagnostics as we aren't following a consitent process. + The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR + and is what reviewers most need to see. +- **The extraction regression never runs on a PR.** + It runs only post-merge/nightly today, + so a PR that breaks extraction goes green, then red after merge. + The integration tests are currently broken and require a refresh of the regression data. +- **Most diagnostic authors are external scientists.** + Require minimal setup, no new credentials. + A contributor should need only the REF. +- **Security.** + The integration runner shares a writable cache so we must be careful that fork code must never reach it or any write credential. + +The native bundle is not a review-only artifact as it is the input fixtures for integration tests. +`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) +then `build_execution_result()` (reads native -> CMEC bundle + series). +`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic. +That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. + +# Reference-level explanation + +## Two bundles, one manifest + +For each `(provider, diagnostic, test-case)`: + +``` +packages//tests/test-data/// + catalog.yaml # committed: dataset metadata (exists today, content-hashed) + catalog.paths.yaml # gitignored: per-user local paths (exists today) + regression/ + series.json # committed bundle (the gate + diff signal) + diagnostic.json # committed bundle (CMEC metric bundle) + output.json # committed bundle (CMEC output bundle) + manifest.json # committed: source of truth binding committed bundle <-> native bundle +``` + +The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git. + +`manifest.json`: + +```json +{ "schema": 1, + "committed": { "series.json": "", "diagnostic.json": "", "output.json": "" }, + "native": { "": { "sha256": "...", "size": ... } }, + "extraction_inputs": { "files_series_digest": "", "input_selectors_digest": "" } } +``` + +- `committed` binds the committed bundle bytes to the manifest **for integrity only**: + CI recomputes the committed digests and asserts they equal `manifest.committed`. + A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** — + closing the one consistency leg content-addressing does not cover. + This is a tamper/desync check on the in-repo files, **not** the regression comparator + (a re-run is compared by *content with tolerance*, not by digest — see Comparison). +- `native` lists the curated native blobs by content digest (dedups, survives history). + These digests are **authored only by `mint`**: the value is the sha256 of the exact bytes + `mint` uploaded to the store, so it is a *fetch address + integrity check*, never a reproducibility gate. +- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` — + the real extraction inputs the current catalog hash ignores — + so a change to extraction behaviour bumps the manifest *at PR time*. + +**Native digests do not gate reproducibility.** +Native bytes are nondeterministic (`.nc`/`.png` embed timestamps, library versions, and provenance, floats jitter across platforms), +so re-running `execute()` on a different machine yields different bytes resulting in a different digest for an unchanged diagnostic. +The CI therefore never recomputes a native digest and compares it to the manifest. +A mismatch is expected, and not a failure. + +The deterministic gate is the **committed bundle**, compared by content with tolerance (see Comparison). +The `native` block **churns on every real mint** (timestamps and maybe SHAs shift), +so it is not a clean "native meaningfully changed" signal. +Only the `committed` block is a semantic diff signal +Remint when the committed bundle changes or a new test-case lands, not on native drift. + +The consistency actually enforced: + +| pair | enforced | how | +| ---------------------------------------------------- | --------------------- | ----------------------------------------------- | +| in-repo committed bundle ↔ `manifest.committed` | yes — integrity gate | CI recompute equals (tamper/desync check) | +| replayed committed bundle ↔ in-repo committed bundle | yes — regression gate | content compare with tolerance (see Comparison) | +| stored native bytes ↔ `manifest.native` digest | by construction | `mint` hashes what it uploads | +| re-executed native ↔ `manifest.native` digest | **no — by design** | nondeterministic; gate is the committed bundle | + +## Curated capture (kills the bloat at the source) + +The diagnostic already declares which native files matter via `files` + `series` glob patterns +with production publishing copies only those (the CMEC `output.json` subset). +The regression capture must use the **same curation** instead of copying the whole working directory. +Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism. + +## `NativeStore` interface + +The REF talks to the store through a thin Protocol +so the backend is swappable and unit-testable, +and so this RFC does not have to settle the backend to be implementable: + +```python +class NativeStore(Protocol): + def has(self, digest: str) -> bool: ... + def fetch(self, digest: str, dest: Path) -> None: ... # by sha256; public read, no creds + def put(self, path: Path) -> str: ... # returns sha256; needs write creds +``` + +Required properties of whatever backend is chosen: + +- **Anonymous public read** so fork CI and contributors fetch without credentials. +- **Writes gated to trusted contexts only** (CI on `main`, or maintainers). +- Content-addressed (sha256) storage; + native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB). +- Retention that **never expires a digest still referenced by a reachable manifest**. + +The backend will be prototyped using an R2 bucket on Cloudflare, +with the CI having write credentials to mint new baselines. + +## CLI (extends the existing `ref test-cases` app) + +| Subcommand | Purpose | Creds | +| ------------------------------------ | ------------------------------------------------------------------------------------- | ------------------ | +| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write committed bundle + a *local* manifest | none (local) | +| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) | +| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to committed bundle | none (public read) | +| `mint --provider P` | `run` + `put` native to the store + author the canonical `manifest.native` | write creds | + +`mint` is the only credentialed verb, and the only writer of canonical native digests, +so anonymous users can fetch the native baseline. +A dev `run` produces native and a local manifest for previewing the committed-bundle diff; +its native digests are advisory and never committed. +A PR commits only the committed bundle + `manifest.{committed,extraction_inputs}` — +the `native` block is authored exclusively by the gated post-merge `mint`, +so native digests exist iff `mint` wrote them (no advisory-then-wrong window on `main`). + +## Tiered CI + +The PR tier runs **credential-free** so it can safely execute fork code: +the danger was never `execute()` itself, +it was `execute()` holding write creds on a shared writable cache. +Strip both and a fork's diagnostic runs harmlessly on a throwaway public runner. +Minting (the only `put`) stays post-merge behind a manual gate. + +- **PR — any, incl forks — public `ubuntu-latest`, no secrets, no creds.** + Routed by changed-files so the check is *honest*: + - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result` + on the public runner, then compare the freshly built committed bundle (tolerant). + No stored baseline is fetched, so a brand-new test-case validates on its own PR — no follow-up needed. + The check is named **`pr-execute`**. + - **Extraction changed (`build_execution_result`, `files`/`series`, committed/manifest) and native already exists** -> + cheap path: `ref test-cases sync` (public read) -> `replay` -> compare committed bundle. + The check is named **`extraction-replay`**. + - A fetch *miss* for an **existing** manifest entry is a **failure, not a skip** + (today, missing fixtures silently skip -> false green); + a *new* entry has no stored native by definition and takes the execute path instead. + - **Coverage:** the public runner handles ~90% of test-cases. + Some heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and will fail there — + those wait for the future private runner (see below); they are not a silent skip. +- **Merge to `main` — gated mint, the only context holding write creds.** + A job bound to a GitHub **Environment (`mint`) with required reviewers**: + the run pauses until a maintainer approves, releasing the write creds only onto **already-merged (trusted) code**. + On approve: `ref test-cases mint` -> `execute()` + `put` native + author `manifest.native` (bot commit back to `main`). + **Mint set** = test-cases whose committed bundle or `extraction_inputs` changed in the merge diff; + a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or store-loss recovery. +- **Nightly — gated runner.** + Full sweep; drift detection backstop; + opens an auto-PR with the new manifest + a human-readable summary if outputs drift. +- **Future — private read-only runner.** + For the heavy diagnostics the public runner cannot host. + Safe to run fork code only if each job is **ephemeral / container-per-job**, mounts conda + data **read-only**, + carries **no write creds**, and has **controlled egress** (no shared writable cache — the L46 hazard). + Deferred: start public-only. + +## PR workflow + +Two phases: the diagnostic **author** first builds and runs the test-case **locally**, +then opens a PR that the **reviewer** drives through CI to merge. + +### 1. Author a new diagnostic (local, no creds) + +```mermaid +sequenceDiagram + actor Author as Diagnostic author + participant Repo as Local checkout + participant CLI as ref test-cases + participant Store as NativeStore + + Author->>Repo: add diagnostic (new class + test cases) + Author->>CLI: ref test-cases run --provider P --diagnostic D + CLI->>CLI: execute() then build_execution_result + CLI-->>Repo: write committed bundle + local manifest + + opt native already minted (changed/existing diagnostic) + Author->>CLI: ref test-cases replay --provider P --diagnostic D + CLI->>Store: sync native (public read) + CLI-->>Author: tolerant diff vs committed bundle + end + + Author->>Author: preview committed-bundle diff (small JSON) + Author->>Repo: commit committed bundle + manifest.committed/extraction_inputs +``` + +A local `run` produces native + advisory digests for preview only — +the author commits no `native` block. +Canonical `manifest.native` is authored later by the gated post-merge `mint`. + +### 2. Make a PR (review + CI) + +```mermaid +sequenceDiagram + actor Author as Diagnostic author + participant PR as PR / GitHub + participant CI as PR CI (public ubuntu-latest, no creds) + actor Reviewer as Reviewer / maintainer + participant Mint as Mint job (gated env, write creds) + participant Store as NativeStore + + Author->>PR: open PR (diagnostic + committed bundle + manifest.committed/extraction_inputs) + PR->>CI: trigger checks (routed by changed files) + + alt new diagnostic or execute() changed + CI->>CI: run execute() + build_execution_result (no creds) + CI->>CI: compare freshly built committed bundle (tolerant) + CI-->>PR: pr-execute check + bundle diff comment + else extraction changed, native already exists + CI->>Store: ref test-cases sync (public read) + alt native fetch hit + CI->>CI: replay then compare committed bundle (tolerant) + CI-->>PR: extraction-replay check + bundle diff comment + else native missing for an existing entry + CI-->>PR: FAIL (no silent skip) + end + end + + Note over CI: heavy ESMValTool ocean diagnostics may exceed public limits (future private runner) + + Reviewer->>PR: review committed-bundle diff (small JSON, human-readable) + Reviewer->>PR: approve + merge + + PR->>Mint: merge to main queues mint + Mint->>Reviewer: pause for required-reviewer approval + Reviewer->>Mint: approve (releases write creds onto trusted code) + Mint->>Mint: execute() + build_execution_result + Mint->>Store: put native (write creds) + Mint->>PR: bot commit manifest.native to main +``` + +### 3. CI routing (decision view) + +How the PR check is chosen from the changed files, and where each path lands: + +```mermaid +flowchart TD + A[PR opened] --> B{What changed?} + + B -->|"new diagnostic or execute() changed"| X["run execute() + build_execution_result on public runner, no creds"] + B -->|"extraction only, native exists"| C["sync native (public read)"] + + C --> D{Native fetch hit?} + D -->|yes| E["replay then rebuild committed bundle"] + D -->|"no (existing entry)"| F["FAIL: no silent skip"] + + X --> G{Diff within tolerance?} + E --> G + G -->|yes| H["check green (pr-execute / extraction-replay)"] + G -->|no| I["red: bundle diff comment, author fixes"] + I --> A + F --> A + + H --> M{Reviewer approves + merges?} + M -->|no| A + M -->|yes| P["merge to main queues gated mint"] + P --> Q{Maintainer approves mint?} + Q -->|yes| O["put native + author manifest.native (bot commit)"] + Q -->|no| P +``` + +The PR runner is public and **credential-free**, so a fork's `execute()` never touches write creds or a writable cache — +the L46 hazard is removed, not gated around. +Minting is the only `put`, and it runs **post-merge behind a required-reviewer environment**, +so write creds reach only already-merged (trusted) code. +The public runner covers ~90% of test-cases; +heavy ESMValTool ocean diagnostics that exceed its limits wait for the future private read-only runner. + +## Comparison + +The regression gate compares a freshly replayed committed bundle to the in-repo reference +**by content, not by bytes**. +Byte/digest equality is too strict: it surfaces least-significant-bit float flips +and other cross-platform noise (a fork replays on stock `ubuntu-latest`, a mint ran elsewhere), +turning harmless jitter into red CI. +Content comparison needs awareness of each artifact's shape: + +| artifact | comparison | +| -------- | ---------- | +| `series.json` (numeric) | structural keys exact; values by relative tolerance (`rtol`/`atol`) | +| CMEC `diagnostic.json` / `output.json` | structural JSON compare; numerics with tolerance; volatile fields (timestamps, abs paths, library versions, provenance) normalised out via an ignore-list | + +- Tolerances are **declared per diagnostic/test-case** — a global default with per-case overrides, + since some metrics are inherently noisier than others. + (pytest-regressions precedent: `num_regression` tolerances, `data_regression` structural compare.) +- **Fast path:** if bytes are already equal, pass without the content compare; only mismatches fall through to the tolerant path. +- The `manifest.committed` sha256 is **not** this comparator — it is the integrity/desync check above. + +## Determinism + +Content comparison still requires removing avoidable nondeterminism at capture time: + +- **Execution-dir timestamp.** + ESMValTool writes a `recipe__` dir that gets baked into `output.json`. + Sanitise `recipe_\d{8}_\d{6}` -> a placeholder in both capture and comparison. + Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713). +- **Floats.** + Any float values should be compared using a relative tolerance to the baseline. + Resolving these kinds of CI-dev machine differences is time consuming. +- **Binary files.** + We don't compare SHAs of generated files as they will differ depending on who generated the files (especially with NetCDF files). + We shouldn't try and chase perfect hash consistency. + The SHA in the manifest is for validating a fetched blob against its store address, not for gating reproducibility. + +# Drawbacks + +- The REF CLI owns fetch/mint/credential plumbing — + more surface to maintain and test than off-loading to an existing data-management tool. +- Native is a cache of a regenerable artifact: + losing a digest still referenced by a reachable manifest breaks historical replay + (mitigated by reachability-aware retention). +- Committed-bundle minting is effectively a maintainer/CI action for conda providers. + external authors usually cannot run `execute()` locally. +- The public PR runner covers ~90% of test-cases but not all: + heavy ESMValTool ocean diagnostics exceed its compute/memory/disk + and have no automated PR signal until the future private runner lands. +- Historical bloat remains: existing committed binaries are not rewritten out of history + (deliberate — the repo is already used by externals). + +# Rationale and alternatives + +**Why two bundles.** +Keeping the small REF-shaped committed bundle in git +puts the review signal where reviewers already work (the PR diff) +and keeps the gate human-readable; +moving the large binaries out of git removes the bloat +while still making them fetchable for the extraction test. +The split maps cleanly onto what the REF already does for catalogs +(`catalog.yaml` committed, `catalog.paths.yaml` gitignored). + +**Why a manifest + CLI rather than committing the bytes.** +Measured native is too large to commit (pmp 250 MB, ilamb 61 MB). +A content-addressed manifest gives dedup, per-commit pinning of the exact native bytes, +and a fetchable fixture without binaries in git. +The clean in-PR semantic diff comes from the committed bundle, not the native digests +(which churn on every mint — see the manifest section). + +**Alternatives considered:** + +- **Commit curated native to git.** + Rejected by measurement (too large for pmp/ilamb). + Viable only for esmvaltool — not worth a per-provider split of the storage model. +- **Git LFS.** + Pointer churn remains; GitHub LFS egress costs; no anonymous public read. +- **Regenerate-on-demand, store nothing.** + The cheap `extraction-replay` path skips conda + `execute()` by fetching stored native; + dropping the store forces every PR onto the slow execute path. + Storing native keeps the common extraction-only PR fast. + Per-provider the `NativeStore` interface still allows mixing + (a provider with a low fetch rate could regenerate instead of store). + +**Impact of not doing this.** +Contributors keep hand-rolling baselines, +PRs keep shipping output changes without a visible diff, +and silent upstream regressions stay invisible until a manual check. + +# Prior art + +- The REF's existing **catalog split** — + the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs. +- **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors. +- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; + this RFC keeps the client surface to just the REF CLI + rather than adding a separate toolchain on contributors. + +# Unresolved questions + +- **Object-store backend choice** — the main open decision, deferred deliberately. + Candidate backends behind the `NativeStore` interface include + an S3-compatible bucket or an OCI registry (e.g. GitHub Packages); + selection should weigh anonymous-read, trusted-write identity, egress cost, + content-addressing/GC, and any storage the org has already provisioned. + **No backend is assumed by this RFC.** +- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB). +- Retention policy for non-current native versions. +- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, + or defer to a follow-up. +- **Private read-only runner** for the heavy diagnostics the public runner cannot host — + isolation model (ephemeral container, read-only mounts, no creds, controlled egress) and when to build it. +- Scope of the manual `mint` **`workflow_dispatch`** (targeting, backfills, store-loss recovery) + beyond the automatic merge-diff mint set. + +# Future possibilities + +- Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs). +- Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git. +- A `ref test-cases preview` that serves locally what the API would render for a PR, + straight from a synced native fixture (already the `replay` path). +- Signed manifests for end-to-end baseline provenance. From afd074fd1d3a1411480c24a27ce4f8db7f38b596 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 11:30:05 +1000 Subject: [PATCH 6/8] chore: reread --- text/0005-regression-baselines.md | 175 ++++++++++++------------------ 1 file changed, 71 insertions(+), 104 deletions(-) diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md index 7598116..277cce1 100644 --- a/text/0005-regression-baselines.md +++ b/text/0005-regression-baselines.md @@ -4,21 +4,20 @@ # Summary -Effectively diagnostic regression **baselines** for use by the testing suite. Having baseline output for a diagnostic is an important contract between the development team and the diagnostic implementor. +The baseline includes information about what the REF should extract from a diagnostic. Each baseline (one test-case run) has **two bundles**: -1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles) +1. A small, human-readable **committed bundle** (the files ingested by the REF, `series.json` + CMEC bundles) committed directly to the source repo. This should be stable and able to be reproduced by the CI. 2. A large, binary **native bundle** (the diagnostic's native outputs) - stored outside git in an object store - and referenced from a single git-committed **`manifest.json`** + stored outside git in an object store and referenced from a single git-committed **`manifest.json`** (sha256 digests of the committed bundle, the native files, and the extraction inputs). The REF CLI (`ref test-cases`) fetches, runs, and mints baselines -with the goal of making it simple for diagnostic developers to contribute test output without requiring custom pytest implementation. +with the goal of making it simple for diagnostic developers to contribute test output. The **object-store backend is left as an explicit open decision** (see Unresolved questions) — this RFC specifies the *layering, manifest, CLI, and CI*, @@ -32,7 +31,8 @@ Diagnostic regression testing today is flaky, opaque, and bloats the repo: The git repo size is growing in a way that isn't sustainable. We want to minimise the volume of data needed to setup the repo. - **No clear "this-changed" signal.** - It's hard to see the failures for new diagnostics as we aren't following a consitent process. + It's hard to see the failures for diagnostics. + There are lots of CI output making it difficult to see what has broken. The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR and is what reviewers most need to see. - **The extraction regression never runs on a PR.** @@ -41,15 +41,9 @@ Diagnostic regression testing today is flaky, opaque, and bloats the repo: The integration tests are currently broken and require a refresh of the regression data. - **Most diagnostic authors are external scientists.** Require minimal setup, no new credentials. - A contributor should need only the REF. + A contributor should need only the REF and shouldn't need to touch more than a few files to add a diagnostic. - **Security.** - The integration runner shares a writable cache so we must be careful that fork code must never reach it or any write credential. - -The native bundle is not a review-only artifact as it is the input fixtures for integration tests. -`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) -then `build_execution_result()` (reads native -> CMEC bundle + series). -`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic. -That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable. + The integration runner shares a writable cache so we must be careful that fork code must never reach it or any write credentials. # Reference-level explanation @@ -57,7 +51,7 @@ That makes a fast, fork-safe extraction test possible on a stock runner — *if* For each `(provider, diagnostic, test-case)`: -``` +```text packages//tests/test-data/// catalog.yaml # committed: dataset metadata (exists today, content-hashed) catalog.paths.yaml # gitignored: per-user local paths (exists today) @@ -81,44 +75,48 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor - `committed` binds the committed bundle bytes to the manifest **for integrity only**: CI recomputes the committed digests and asserts they equal `manifest.committed`. - A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** — - closing the one consistency leg content-addressing does not cover. - This is a tamper/desync check on the in-repo files, **not** the regression comparator - (a re-run is compared by *content with tolerance*, not by digest — see Comparison). -- `native` lists the curated native blobs by content digest (dedups, survives history). - These digests are **authored only by `mint`**: the value is the sha256 of the exact bytes - `mint` uploaded to the store, so it is a *fetch address + integrity check*, never a reproducibility gate. + A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly**. +- `native` lists the curated native blobs by content digest. + These digests are **authored only by the `mint` action**: the value is the sha256 of the exact bytes `mint` uploaded to the store. - `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` — the real extraction inputs the current catalog hash ignores — so a change to extraction behaviour bumps the manifest *at PR time*. -**Native digests do not gate reproducibility.** +**Digests do not gate reproducibility.** Native bytes are nondeterministic (`.nc`/`.png` embed timestamps, library versions, and provenance, floats jitter across platforms), so re-running `execute()` on a different machine yields different bytes resulting in a different digest for an unchanged diagnostic. The CI therefore never recomputes a native digest and compares it to the manifest. A mismatch is expected, and not a failure. -The deterministic gate is the **committed bundle**, compared by content with tolerance (see Comparison). +The gate for being able to reproduce values deterministically is the **committed bundle**, +compared by content with tolerance (see Comparison). The `native` block **churns on every real mint** (timestamps and maybe SHAs shift), so it is not a clean "native meaningfully changed" signal. Only the `committed` block is a semantic diff signal Remint when the committed bundle changes or a new test-case lands, not on native drift. -The consistency actually enforced: +Some of the native content may be useful to compare visually for example figures. +How that is surface is not yet resolved. + +The consistency enforced: -| pair | enforced | how | -| ---------------------------------------------------- | --------------------- | ----------------------------------------------- | -| in-repo committed bundle ↔ `manifest.committed` | yes — integrity gate | CI recompute equals (tamper/desync check) | -| replayed committed bundle ↔ in-repo committed bundle | yes — regression gate | content compare with tolerance (see Comparison) | -| stored native bytes ↔ `manifest.native` digest | by construction | `mint` hashes what it uploads | -| re-executed native ↔ `manifest.native` digest | **no — by design** | nondeterministic; gate is the committed bundle | +| pair | enforced | how | +| ----------------------------------------------------- | --------------------- | ----------------------------------------------- | +| in-repo committed bundle <> `manifest.committed` | yes — integrity gate | CI recompute equals (tamper/desync check) | +| replayed committed bundle <> in-repo committed bundle | yes — regression gate | content compare with tolerance (see Comparison) | +| stored native bytes <> `manifest.native` digest | by construction | `mint` hashes what it uploads | +| re-executed native <> `manifest.native` digest | **no — by design** | nondeterministic; gate is the committed bundle | -## Curated capture (kills the bloat at the source) +## Curated capture + +There will be additional files in the native bundle that aren't needed, +including the climate data that ESMValTool includes in the output directory. The diagnostic already declares which native files matter via `files` + `series` glob patterns with production publishing copies only those (the CMEC `output.json` subset). The regression capture must use the **same curation** instead of copying the whole working directory. -Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism. +Diagnostic authors control what is stored/pruned through those existing patterns, +and any errors in this process will be surfaced as differences in the committed bundle. ## `NativeStore` interface @@ -137,14 +135,13 @@ Required properties of whatever backend is chosen: - **Anonymous public read** so fork CI and contributors fetch without credentials. - **Writes gated to trusted contexts only** (CI on `main`, or maintainers). -- Content-addressed (sha256) storage; - native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB). - Retention that **never expires a digest still referenced by a reachable manifest**. +- Potentially using content-addressed (sha256) storage, but this complicates the expiration. The backend will be prototyped using an R2 bucket on Cloudflare, with the CI having write credentials to mint new baselines. -## CLI (extends the existing `ref test-cases` app) +## CLI (extends the existing `ref test-cases` cli) | Subcommand | Purpose | Creds | | ------------------------------------ | ------------------------------------------------------------------------------------- | ------------------ | @@ -155,48 +152,45 @@ with the CI having write credentials to mint new baselines. `mint` is the only credentialed verb, and the only writer of canonical native digests, so anonymous users can fetch the native baseline. -A dev `run` produces native and a local manifest for previewing the committed-bundle diff; -its native digests are advisory and never committed. +A dev `run` produces native and a local manifest for previewing the committed-bundle diff +with the native digests being advisory and never committed. A PR commits only the committed bundle + `manifest.{committed,extraction_inputs}` — -the `native` block is authored exclusively by the gated post-merge `mint`, -so native digests exist iff `mint` wrote them (no advisory-then-wrong window on `main`). +the `native` block is authored exclusively by the post-merge `mint` action so native digests exist iff `mint` wrote them. ## Tiered CI -The PR tier runs **credential-free** so it can safely execute fork code: -the danger was never `execute()` itself, -it was `execute()` holding write creds on a shared writable cache. -Strip both and a fork's diagnostic runs harmlessly on a throwaway public runner. -Minting (the only `put`) stays post-merge behind a manual gate. +The PR tier runs **credential-free** so it can safely execute fork code. +Self-hosted runners should never run `execute()` holding write creds on a shared writable cache. + +Minting stays post-merge behind a manual gate as it requires credentials. - **PR — any, incl forks — public `ubuntu-latest`, no secrets, no creds.** Routed by changed-files so the check is *honest*: - - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result` - on the public runner, then compare the freshly built committed bundle (tolerant). - No stored baseline is fetched, so a brand-new test-case validates on its own PR — no follow-up needed. - The check is named **`pr-execute`**. + - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result` on the public runner, + then compare the freshly built committed bundle against what is in git. + At this point, any existing native content may be invalidated. - **Extraction changed (`build_execution_result`, `files`/`series`, committed/manifest) and native already exists** -> - cheap path: `ref test-cases sync` (public read) -> `replay` -> compare committed bundle. - The check is named **`extraction-replay`**. + cheap path: `ref test-cases sync` (read native content) -> `replay` -> compare committed bundle. - A fetch *miss* for an **existing** manifest entry is a **failure, not a skip** (today, missing fixtures silently skip -> false green); a *new* entry has no stored native by definition and takes the execute path instead. - **Coverage:** the public runner handles ~90% of test-cases. Some heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and will fail there — those wait for the future private runner (see below); they are not a silent skip. -- **Merge to `main` — gated mint, the only context holding write creds.** - A job bound to a GitHub **Environment (`mint`) with required reviewers**: +- **Merge to `main`** + A job bound to a GitHub **Environment (`mint`)**: the run pauses until a maintainer approves, releasing the write creds only onto **already-merged (trusted) code**. On approve: `ref test-cases mint` -> `execute()` + `put` native + author `manifest.native` (bot commit back to `main`). **Mint set** = test-cases whose committed bundle or `extraction_inputs` changed in the merge diff; - a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or store-loss recovery. + a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or manual fixes. - **Nightly — gated runner.** - Full sweep; drift detection backstop; - opens an auto-PR with the new manifest + a human-readable summary if outputs drift. + Full sweep to check for any drift. + This will `execute()`, `build_execution_result` and compare. + This job should output a human-readable summary as part of the workflow. - **Future — private read-only runner.** For the heavy diagnostics the public runner cannot host. Safe to run fork code only if each job is **ephemeral / container-per-job**, mounts conda + data **read-only**, - carries **no write creds**, and has **controlled egress** (no shared writable cache — the L46 hazard). + carries **no write creds**, and has **controlled egress** (no shared writable cache). Deferred: start public-only. ## PR workflow @@ -217,19 +211,11 @@ sequenceDiagram Author->>CLI: ref test-cases run --provider P --diagnostic D CLI->>CLI: execute() then build_execution_result CLI-->>Repo: write committed bundle + local manifest - - opt native already minted (changed/existing diagnostic) - Author->>CLI: ref test-cases replay --provider P --diagnostic D - CLI->>Store: sync native (public read) - CLI-->>Author: tolerant diff vs committed bundle - end - Author->>Author: preview committed-bundle diff (small JSON) Author->>Repo: commit committed bundle + manifest.committed/extraction_inputs ``` -A local `run` produces native + advisory digests for preview only — -the author commits no `native` block. +A local `run` produces native + advisory digests for preview only — the author commits no `native` block. Canonical `manifest.native` is authored later by the gated post-merge `mint`. ### 2. Make a PR (review + CI) @@ -303,32 +289,17 @@ flowchart TD Q -->|no| P ``` -The PR runner is public and **credential-free**, so a fork's `execute()` never touches write creds or a writable cache — -the L46 hazard is removed, not gated around. -Minting is the only `put`, and it runs **post-merge behind a required-reviewer environment**, -so write creds reach only already-merged (trusted) code. -The public runner covers ~90% of test-cases; -heavy ESMValTool ocean diagnostics that exceed its limits wait for the future private read-only runner. - ## Comparison -The regression gate compares a freshly replayed committed bundle to the in-repo reference -**by content, not by bytes**. -Byte/digest equality is too strict: it surfaces least-significant-bit float flips -and other cross-platform noise (a fork replays on stock `ubuntu-latest`, a mint ran elsewhere), -turning harmless jitter into red CI. -Content comparison needs awareness of each artifact's shape: - -| artifact | comparison | -| -------- | ---------- | -| `series.json` (numeric) | structural keys exact; values by relative tolerance (`rtol`/`atol`) | -| CMEC `diagnostic.json` / `output.json` | structural JSON compare; numerics with tolerance; volatile fields (timestamps, abs paths, library versions, provenance) normalised out via an ignore-list | - -- Tolerances are **declared per diagnostic/test-case** — a global default with per-case overrides, - since some metrics are inherently noisier than others. - (pytest-regressions precedent: `num_regression` tolerances, `data_regression` structural compare.) +The regression gate compares a freshly replayed committed bundle to the in-repo reference **by content, not by bytes**. +Byte/digest equality is too strict as the artifacts may include floats which will lead to flaky test results. + - **Fast path:** if bytes are already equal, pass without the content compare; only mismatches fall through to the tolerant path. -- The `manifest.committed` sha256 is **not** this comparator — it is the integrity/desync check above. +- Tolerances are **declared per diagnostic/test-case** — a global default with per-case overrides. + +We are comparing the content of the files which are ingested by the REF. +Changes to other native files aren't covered by this approach other than the existance of them. +This could be extended to cover these files, but we don't control their shape. ## Determinism @@ -349,17 +320,15 @@ Content comparison still requires removing avoidable nondeterminism at capture t # Drawbacks - The REF CLI owns fetch/mint/credential plumbing — - more surface to maintain and test than off-loading to an existing data-management tool. + more surface to maintain and test, but that is easier than a sprawl of user tests. - Native is a cache of a regenerable artifact: - losing a digest still referenced by a reachable manifest breaks historical replay - (mitigated by reachability-aware retention). -- Committed-bundle minting is effectively a maintainer/CI action for conda providers. - external authors usually cannot run `execute()` locally. + losing a digest still referenced by a reachable manifest breaks historical replay. + They can always be replaced. - The public PR runner covers ~90% of test-cases but not all: heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and have no automated PR signal until the future private runner lands. - Historical bloat remains: existing committed binaries are not rewritten out of history - (deliberate — the repo is already used by externals). + (the repo is already used by externals so rewriting history is hard). # Rationale and alternatives @@ -382,10 +351,9 @@ The clean in-PR semantic diff comes from the committed bundle, not the native di **Alternatives considered:** - **Commit curated native to git.** - Rejected by measurement (too large for pmp/ilamb). - Viable only for esmvaltool — not worth a per-provider split of the storage model. + Curating causes the inability to regenerate the output.json file. - **Git LFS.** - Pointer churn remains; GitHub LFS egress costs; no anonymous public read. + Pointer churn remains; GitHub LFS egress costs are very low and it requires additional steps when checking out. - **Regenerate-on-demand, store nothing.** The cheap `extraction-replay` path skips conda + `execute()` by fetching stored native; dropping the store forces every PR onto the slow execute path. @@ -393,6 +361,8 @@ The clean in-PR semantic diff comes from the committed bundle, not the native di Per-provider the `NativeStore` interface still allows mixing (a provider with a low fetch rate could regenerate instead of store). + Generally it would take significant CI time to regenerate on every commit. + **Impact of not doing this.** Contributors keep hand-rolling baselines, PRs keep shipping output changes without a visible diff, @@ -403,19 +373,16 @@ and silent upstream regressions stay invisible until a manual check. - The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs. - **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors. +- `ref-sample-data` used CI owned data, - Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; this RFC keeps the client surface to just the REF CLI rather than adding a separate toolchain on contributors. # Unresolved questions -- **Object-store backend choice** — the main open decision, deferred deliberately. - Candidate backends behind the `NativeStore` interface include - an S3-compatible bucket or an OCI registry (e.g. GitHub Packages); - selection should weigh anonymous-read, trusted-write identity, egress cost, - content-addressing/GC, and any storage the org has already provisioned. +- **Object-store backend choice** — the main open decision. + The most obvious choice is another R2 bucket similar to the datasets. **No backend is assumed by this RFC.** -- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB). - Retention policy for non-current native versions. - Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, or defer to a follow-up. From 4309918475a91ddb4822d556d89fc510e720bb9e Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 12:18:34 +1000 Subject: [PATCH 7/8] chore: add a bump field --- text/0005-regression-baselines.md | 81 ++++++++++++++++++------------- 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md index 277cce1..8baf823 100644 --- a/text/0005-regression-baselines.md +++ b/text/0005-regression-baselines.md @@ -68,19 +68,28 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor ```json { "schema": 1, + "test_case_version": 1, "committed": { "series.json": "", "diagnostic.json": "", "output.json": "" }, - "native": { "": { "sha256": "...", "size": ... } }, - "extraction_inputs": { "files_series_digest": "", "input_selectors_digest": "" } } + "native": { "": { "sha256": "...", "size": ... } } } ``` +- `test_case_version` is a **monotonic integer** the author/maintainer bumps to declare a new baseline — + the deliberate "re-run and re-mint" trigger. + Instead of relying on hashes or code, diagnostic developers control when a test case is minted. + CI **hard-fails** a PR whose committed bundle changed without a `test_case_version` bump + (a forced bump with identical output is harmless — it just re-mints native). + Existing test-cases seed at `1`. - `committed` binds the committed bundle bytes to the manifest **for integrity only**: CI recomputes the committed digests and asserts they equal `manifest.committed`. A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly**. - `native` lists the curated native blobs by content digest. These digests are **authored only by the `mint` action**: the value is the sha256 of the exact bytes `mint` uploaded to the store. -- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` — - the real extraction inputs the current catalog hash ignores — - so a change to extraction behaviour bumps the manifest *at PR time*. + +The provenance of *what produced* a baseline (diagnostic version, provider version, datasets, selectors) +is **not** recomputed here. +The addition of a manifest file `execution.json` [Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3) could provide this information. +This manifest file records the run's **identity** (`diagnostic_slug`, `diagnostic_version`, `provider_slug`, `provider_version`, `dataset_hash`, `datasets` by `instance_id`, `selectors`). +This file should be included in the committed set when implemented. **Digests do not gate reproducibility.** Native bytes are nondeterministic (`.nc`/`.png` embed timestamps, library versions, and provenance, floats jitter across platforms), @@ -93,7 +102,7 @@ compared by content with tolerance (see Comparison). The `native` block **churns on every real mint** (timestamps and maybe SHAs shift), so it is not a clean "native meaningfully changed" signal. Only the `committed` block is a semantic diff signal -Remint when the committed bundle changes or a new test-case lands, not on native drift. +Remint on a `test_case_version` bump (a new baseline, or a new test-case), not on native drift. Some of the native content may be useful to compare visually for example figures. How that is surface is not yet resolved. @@ -154,7 +163,7 @@ with the CI having write credentials to mint new baselines. so anonymous users can fetch the native baseline. A dev `run` produces native and a local manifest for previewing the committed-bundle diff with the native digests being advisory and never committed. -A PR commits only the committed bundle + `manifest.{committed,extraction_inputs}` — +A PR commits only the committed bundle + `manifest.{test_case_version,committed}` — the `native` block is authored exclusively by the post-merge `mint` action so native digests exist iff `mint` wrote them. ## Tiered CI @@ -165,15 +174,19 @@ Self-hosted runners should never run `execute()` holding write creds on a shared Minting stays post-merge behind a manual gate as it requires credentials. - **PR — any, incl forks — public `ubuntu-latest`, no secrets, no creds.** - Routed by changed-files so the check is *honest*: - - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result` on the public runner, + Routed by the `test_case_version` bump so the check is *honest*: + - **`test_case_version` bumped, or a new test-case** -> `pr-test-case-execute`: + run `execute()` + `build_execution_result` on the public runner, then compare the freshly built committed bundle against what is in git. - At this point, any existing native content may be invalidated. - - **Extraction changed (`build_execution_result`, `files`/`series`, committed/manifest) and native already exists** -> - cheap path: `ref test-cases sync` (read native content) -> `replay` -> compare committed bundle. - - A fetch *miss* for an **existing** manifest entry is a **failure, not a skip** - (today, missing fixtures silently skip -> false green); - a *new* entry has no stored native by definition and takes the execute path instead. + A bumped/new version has no stored native by definition, so nothing is fetched — + a brand-new test-case validates on its own PR, no follow-up needed. + - **No bump, but extraction code changed (`build_execution_result`, `files`/`series`)** -> `pr-test-case-replay`: + cheap path — `ref test-cases sync` (read native) -> `replay` against the *existing* baseline -> compare. + Catches unintended extraction drift; a real mismatch here means "you changed output but didn't bump the version." + - **Coupling gate.** A committed bundle that changed without a `test_case_version` bump is a **hard failure** + ("output changed but baseline version not bumped — intended?"). + A fetch *miss* for an **existing** entry is likewise a failure, not a silent skip. + We may need a field on the test-case to continue on failure. - **Coverage:** the public runner handles ~90% of test-cases. Some heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and will fail there — those wait for the future private runner (see below); they are not a silent skip. @@ -181,7 +194,7 @@ Minting stays post-merge behind a manual gate as it requires credentials. A job bound to a GitHub **Environment (`mint`)**: the run pauses until a maintainer approves, releasing the write creds only onto **already-merged (trusted) code**. On approve: `ref test-cases mint` -> `execute()` + `put` native + author `manifest.native` (bot commit back to `main`). - **Mint set** = test-cases whose committed bundle or `extraction_inputs` changed in the merge diff; + **Mint set** = test-cases whose `test_case_version` was bumped in the merge diff (plus any brand-new test-case); a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or manual fixes. - **Nightly — gated runner.** Full sweep to check for any drift. @@ -212,7 +225,7 @@ sequenceDiagram CLI->>CLI: execute() then build_execution_result CLI-->>Repo: write committed bundle + local manifest Author->>Author: preview committed-bundle diff (small JSON) - Author->>Repo: commit committed bundle + manifest.committed/extraction_inputs + Author->>Repo: commit committed bundle + manifest.committed + test_case_version ``` A local `run` produces native + advisory digests for preview only — the author commits no `native` block. @@ -229,21 +242,19 @@ sequenceDiagram participant Mint as Mint job (gated env, write creds) participant Store as NativeStore - Author->>PR: open PR (diagnostic + committed bundle + manifest.committed/extraction_inputs) + Author->>PR: open PR (diagnostic + committed bundle + manifest.committed + test_case_version) PR->>CI: trigger checks (routed by changed files) - alt new diagnostic or execute() changed + alt test_case_version bumped, or new test-case CI->>CI: run execute() + build_execution_result (no creds) CI->>CI: compare freshly built committed bundle (tolerant) CI-->>PR: pr-execute check + bundle diff comment - else extraction changed, native already exists + else no bump, extraction code changed CI->>Store: ref test-cases sync (public read) - alt native fetch hit - CI->>CI: replay then compare committed bundle (tolerant) - CI-->>PR: extraction-replay check + bundle diff comment - else native missing for an existing entry - CI-->>PR: FAIL (no silent skip) - end + CI->>CI: replay against existing baseline then compare (tolerant) + CI-->>PR: extraction-replay check + bundle diff comment + else committed bundle changed without a bump + CI-->>PR: FAIL (bump test_case_version to re-baseline) end Note over CI: heavy ESMValTool ocean diagnostics may exceed public limits (future private runner) @@ -265,18 +276,16 @@ How the PR check is chosen from the changed files, and where each path lands: ```mermaid flowchart TD - A[PR opened] --> B{What changed?} - - B -->|"new diagnostic or execute() changed"| X["run execute() + build_execution_result on public runner, no creds"] - B -->|"extraction only, native exists"| C["sync native (public read)"] + A[PR opened] --> B{test_case_version bumped?} - C --> D{Native fetch hit?} - D -->|yes| E["replay then rebuild committed bundle"] - D -->|"no (existing entry)"| F["FAIL: no silent skip"] + B -->|"yes (or new test-case)"| X["pr-execute: run execute() + build_execution_result, no creds"] + B -->|no| V{committed bundle changed?} + V -->|yes| F["FAIL: bump test_case_version to re-baseline"] + V -->|"no (extraction code changed)"| C["sync native then replay existing baseline"] X --> G{Diff within tolerance?} - E --> G - G -->|yes| H["check green (pr-execute / extraction-replay)"] + C --> G + G -->|yes| H["check green"] G -->|no| I["red: bundle diff comment, author fixes"] I --> A F --> A @@ -377,6 +386,8 @@ and silent upstream regressions stay invisible until a manual check. - Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; this RFC keeps the client surface to just the REF CLI rather than adding a separate toolchain on contributors. +- **Execution lifecycle ([Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3))** — + the `execution.json` self-describing manifest this RFC consumes for baseline provenance information. # Unresolved questions From 0efcdd05ff08abc69551e2578e96fb4ad77bf8b8 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 12:33:23 +1000 Subject: [PATCH 8/8] chore: update flow chart --- text/0005-regression-baselines.md | 45 +++++++++++++++++-------------- 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md index 8baf823..13bd5ee 100644 --- a/text/0005-regression-baselines.md +++ b/text/0005-regression-baselines.md @@ -248,11 +248,11 @@ sequenceDiagram alt test_case_version bumped, or new test-case CI->>CI: run execute() + build_execution_result (no creds) CI->>CI: compare freshly built committed bundle (tolerant) - CI-->>PR: pr-execute check + bundle diff comment + CI-->>PR: pr-test-case-execute check + bundle diff comment else no bump, extraction code changed CI->>Store: ref test-cases sync (public read) CI->>CI: replay against existing baseline then compare (tolerant) - CI-->>PR: extraction-replay check + bundle diff comment + CI-->>PR: pr-test-case-replay check + bundle diff comment else committed bundle changed without a bump CI-->>PR: FAIL (bump test_case_version to re-baseline) end @@ -276,26 +276,31 @@ How the PR check is chosen from the changed files, and where each path lands: ```mermaid flowchart TD - A[PR opened] --> B{test_case_version bumped?} + Start(["Push to PR"]) --> B{"test_case_version bumped?"} - B -->|"yes (or new test-case)"| X["pr-execute: run execute() + build_execution_result, no creds"] - B -->|no| V{committed bundle changed?} - V -->|yes| F["FAIL: bump test_case_version to re-baseline"] - V -->|"no (extraction code changed)"| C["sync native then replay existing baseline"] + B -->|"yes / new test-case"| X["pr-test-case-execute: run execute() + build_execution_result (no creds)"] + B -->|"no"| V{"committed bundle changed?"} + V -->|"yes"| F["FAIL: bump test_case_version to re-baseline"] + V -->|"no — extraction code changed"| C["pr-test-case-replay: sync native, replay existing baseline"] - X --> G{Diff within tolerance?} + X --> G{"Diff within tolerance?"} C --> G - G -->|yes| H["check green"] - G -->|no| I["red: bundle diff comment, author fixes"] - I --> A - F --> A - - H --> M{Reviewer approves + merges?} - M -->|no| A - M -->|yes| P["merge to main queues gated mint"] - P --> Q{Maintainer approves mint?} - Q -->|yes| O["put native + author manifest.native (bot commit)"] - Q -->|no| P + G -->|"yes"| M{"Reviewer approves + merges?"} + G -->|"no"| F + + F --> Revise(["Author revises, re-pushes"]) + M -->|"no"| Revise + M -->|"yes"| P["merge to main queues gated mint"] + P --> Q{"Maintainer approves mint?"} + Q -->|"no (waits)"| P + Q -->|"yes"| Done(["Merged + manifest.native minted by CI"]) + + classDef entry fill:#dff5e1,stroke:#1a7f37,stroke-width:2px; + classDef good fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px; + classDef warn fill:#fde7c7,stroke:#b45309,stroke-width:2px; + class Start entry; + class Done good; + class Revise warn; ``` ## Comparison @@ -364,7 +369,7 @@ The clean in-PR semantic diff comes from the committed bundle, not the native di - **Git LFS.** Pointer churn remains; GitHub LFS egress costs are very low and it requires additional steps when checking out. - **Regenerate-on-demand, store nothing.** - The cheap `extraction-replay` path skips conda + `execute()` by fetching stored native; + The cheap `pr-test-case-replay` path skips conda + `execute()` by fetching stored native; dropping the store forces every PR onto the slow execute path. Storing native keeps the common extraction-only PR fast. Per-provider the `NativeStore` interface still allows mixing