From 27cd155a5e49594a3837abf01f1bc818ab879aae Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Thu, 4 Jun 2026 15:54:01 +1000
Subject: [PATCH 1/8] rfc: regression baselines (golden-in-git + native
 fixtures)

---
 text/0000-regression-baselines.md | 246 ++++++++++++++++++++++++++++++
 1 file changed, 246 insertions(+)
 create mode 100644 text/0000-regression-baselines.md
diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md
new file mode 100644
index 0000000..5168bf4
--- /dev/null
+++ b/text/0000-regression-baselines.md
@@ -0,0 +1,246 @@
+- Feature Name: `regression_baselines`
+- Start Date: 2026-06-04
+- RFC PR: [CMIP-REF/rfcs#0000](https://github.com/CMIP-REF/rfcs/pull/0000)
+
+# Summary
+[summary]: #summary
+
+Manage diagnostic regression baselines as **two layers**:
+
+1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles) committed
+   directly to the source repo — this is the test *gate* and the in-PR *diff signal*.
+2. The large, binary **native diagnostic outputs** stored outside git in an object store and
+   referenced from a single git-committed **`native_manifest.json`** (sha256 digests of the golden,
+   the native files, and the extraction inputs).
+
+The REF CLI (`ref test-cases`) fetches, runs, and mints baselines. The native outputs double as the
+**input fixtures that test the native→bundle extraction step**, so most PRs are verified fast on a
+stock runner without re-running the (expensive) diagnostic.
+
+The **object-store backend is left as an explicit open decision** (see Unresolved questions) — this
+RFC specifies the *layering, manifest, CLI, and CI*, which are backend-agnostic behind a small
+`NativeStore` interface.
+
+# Motivation
+[motivation]: #motivation
+
+Diagnostic regression testing today is flaky, opaque, and bloats the repo:
+
+- **Bloat.** `tests/.../regression` is ~1.0 GB on disk. Of that, **696 MB is input `climate_data`
+  that the capture step wrongly copies in**; the real curated output is ~316 MB (esmvaltool 5 MB,
+  ilamb 61 MB, pmp 250 MB). Some native is already hand-`.gitignore`d, so it is *missing* — no
+  coverage, no parity. The bloat is a capture bug plus committing binaries, not an inherent need.
+- **No clear "this-changed" signal.** Failures are a mix; you dig CI logs across many jobs. The
+  REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR and is
+  exactly the content that feeds the REF API — it is what reviewers most need to see.
+- **The extraction regression never runs on a PR.** It runs only post-merge/nightly today, so a PR
+  that breaks extraction goes green, then red after merge.
+- **Most diagnostic authors are external scientists.** Minimal setup, no new credentials. A
+  contributor should need only the REF.
+- **Security.** The integration runner shares a writable cache; untrusted fork code must never reach
+  it or any write credential.
+
+A key reframe: **the native outputs are not review-only artifacts — they are the input fixtures for
+the extraction step.** `Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) then
+`build_execution_result()` (reads native → CMEC bundle + series). `RegressionValidator` already
+replays *committed native* through `build_execution_result` without re-running the diagnostic. That
+makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Two layers, one manifest
+
+For each `(provider, diagnostic, test-case)`:
+
+```
+packages/<provider>/tests/test-data/<diagnostic>/<test-case>/
+  catalog.yaml            # committed: dataset metadata (exists today, content-hashed)
+  catalog.paths.yaml      # gitignored: per-user local paths (exists today)
+  regression/
+    series.json           # committed GOLDEN (the gate + diff signal)
+    diagnostic.json        # committed GOLDEN (CMEC metric bundle)
+    output.json           # committed GOLDEN (CMEC output bundle)
+  native_manifest.json    # committed: source of truth binding golden <-> native
+```
+
+The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git.
+
+`native_manifest.json`:
+
+```json
+{ "schema": 1,
+  "golden":  { "series.json": "<sha256>", "diagnostic.json": "<sha256>", "output.json": "<sha256>" },
+  "native":  { "<relpath>": { "sha256": "...", "size": ... } },
+  "extraction_inputs": { "files_series_digest": "<sha256>", "input_selectors_digest": "<sha256>" } }
+```
+
+- `golden` binds the committed golden bytes to the manifest: CI recomputes the golden digests and
+  asserts they equal `manifest.golden`. A commit that updates `series.json` but not the manifest (or
+  vice-versa) **fails loudly** — closing the one consistency leg content-addressing does not cover.
+- `native` lists the curated native blobs by content digest (immutable, dedups, survives history).
+- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog
+  `input_selectors` — the real extraction inputs the current catalog hash ignores — so a change to
+  extraction behaviour bumps the manifest *at PR time*.
+
+Two manifest sections keep the diff readable: a changed `native` block means native changed; a
+changed `golden` block means extraction output changed.
+
+## Curated capture (kills the bloat at the source)
+
+The diagnostic already declares which native files matter via `files` + `series` glob patterns;
+production publishing copies only those (the CMEC `output.json` subset). The regression capture must
+use the **same curation** instead of copying the whole working directory. This alone prunes the
+696 MB `climate_data` input and the verbose run logs for free. Diagnostic authors control what is
+stored/pruned through those existing patterns — no new mechanism.
+
+## `NativeStore` interface (backend-agnostic)
+
+The REF talks to the store through a thin Protocol so the backend is swappable and unit-testable,
+and so this RFC does not have to settle the backend to be implementable:
+
+```python
+class NativeStore(Protocol):
+    def has(self, digest: str) -> bool: ...
+    def fetch(self, digest: str, dest: Path) -> None: ...   # by sha256; public read, no creds
+    def put(self, path: Path) -> str: ...                    # returns sha256; needs write creds
+```
+
+Required properties of whatever backend is chosen:
+
+- **Anonymous public read** so fork CI and contributors fetch without credentials.
+- **Writes gated to trusted contexts only** (CI on `main`, or maintainers).
+- Content-addressed (sha256) storage; native addressed **per-test-case** so a fetch is one
+  diagnostic's output (KB–32 MB).
+- Retention that **never expires a digest still referenced by a reachable manifest**.
+
+Backend selection is deferred to a maintainer decision (Unresolved questions).
+
+## CLI (extends the existing `ref test-cases` app)
+
+| Subcommand | Purpose | Creds |
+|------------|---------|-------|
+| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write golden + manifest locally | none (local) |
+| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) |
+| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to golden | none (public read) |
+| `mint --provider P` | `run` + `put` native to the store + update manifest | write creds |
+
+`mint` is the only credentialed verb and runs only on trusted contexts.
+
+## Tiered CI
+
+- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.** `ref test-cases sync` (public read)
+  → `replay` → diff golden → post a PR comment. Figures surfaced but **non-blocking**. Routed by
+  changed-files so the check is *honest*:
+  - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the
+    replay; the check is named **`extraction-replay`** and its green means what it says.
+  - PR touches `execute()` only → replay is a tautology → emit a **neutral
+    "execute-changed: needs runner verification"** check (not green), linking the bootstrap path.
+  - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip** (today,
+    missing fixtures silently skip → false green).
+- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.** Run the slow
+  `execute()` to confirm the diagnostic still reproduces its native, catching breakage pre-merge.
+  Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`.
+- **Merge to `main` — self-hosted runner, the only job holding write creds.** `ref test-cases mint`
+  → `execute()` + `put` native + update manifest.
+- **Nightly — self-hosted runner.** Full sweep; drift detection backstop; opens an auto-PR with the
+  new manifest + a human-readable summary if outputs drift.
+- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change): a documented
+  `workflow_dispatch` a maintainer triggers to mint on the trusted runner and push the golden +
+  manifest back to the PR. Forks never run `execute()` on the self-hosted runner (untrusted code,
+  shared cache).
+
+## Self-hosted runner security
+
+The integration runner uses ephemeral pods, but mounts a **shared, writable cache** (datasets +
+provider software). Untrusted fork code in such a job could poison that cache for the next trusted
+run; ephemeral pods do not close this. Therefore: never run untrusted `execute()` on the shared
+runner; keep write credentials on the `main` (and explicitly-triggered) path only; gate the
+same-repo PR job by `head.repo.full_name == github.repository`. (A separate hardened runner with a
+read-only cache could later make fork `execute()` safe; out of scope here.)
+
+## Determinism (prerequisite for comparing bundles)
+
+CMEC bundle content comparison (currently only `series.json` is compared) requires removing
+nondeterminism first:
+
+- **Execution-dir timestamp.** ESMValTool writes a `recipe_<YYYYMMDD>_<HHMMSS>` dir that gets baked
+  into `output.json`. Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison.
+  Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713).
+- **Floats.** Replay path (stored native → extraction) is deterministic → exact compare. Execute
+  path (re-run) has float jitter → compare structure/dimensions only, leaving numeric values to the
+  replay path; any value that must be compared on the execute path uses a relative tolerance, never
+  exact equality.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+- The REF CLI owns fetch/mint/credential plumbing — more surface to maintain and test than
+  off-loading to an existing data-management tool.
+- Native is a cache of a regenerable artifact: losing a digest still referenced by a reachable
+  manifest breaks historical replay (mitigated by reachability-aware retention).
+- Golden minting is effectively a maintainer/CI action for conda providers (esmvaltool's env is
+  ~1373 packages); external authors usually cannot run `execute()` locally.
+- Fork `execute()` changes have no automated PR signal until bootstrapped — the fast path tests
+  extraction, not the diagnostic algorithm.
+- Historical bloat remains: existing committed binaries are not rewritten out of history
+  (deliberate — the repo is already used by externals).
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+**Why two layers.** Keeping the small REF-shaped golden in git puts the review signal where
+reviewers already work (the PR diff) and keeps the gate human-readable; moving the large binaries
+out of git removes the bloat while still making them fetchable for the extraction test. The split
+maps cleanly onto what the REF already does for catalogs (`catalog.yaml` committed,
+`catalog.paths.yaml` gitignored).
+
+**Why a manifest + CLI rather than committing the bytes.** Measured native is too large to commit
+(pmp 250 MB, ilamb 61 MB). A content-addressed manifest gives dedup, per-commit pinning, and a clean
+in-PR "native changed" diff without binaries in git.
+
+**Alternatives considered:**
+
+- **Commit curated native to git.** Rejected by measurement (too large for pmp/ilamb). Viable only
+  for esmvaltool (~5 MB) — not worth a per-provider split of the storage model.
+- **Git LFS.** Pointer churn remains; GitHub LFS egress costs; no anonymous public read.
+- **Regenerate-on-demand, store nothing.** The fast fork-PR extraction path runs on a stock runner
+  with no conda/data and cannot regenerate; it must fetch. Per-provider, pmp's 250 MB *could* be
+  regenerated on the self-hosted runner rather than stored if its fetch rate is low — the
+  `NativeStore` interface allows mixing.
+
+**Impact of not doing this.** Contributors keep hand-rolling baselines, PRs keep shipping output
+changes without a visible diff, and silent upstream regressions stay invisible until a manual check.
+
+# Prior art
+[prior-art]: #prior-art
+
+- The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git"
+  pattern, here extended to outputs.
+- **pytest-regressions** (`--force-regen`) — the regenerate-the-golden workflow this mirrors.
+- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; this RFC
+  keeps the client surface to just the REF CLI rather than adding a separate toolchain on
+  contributors.
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+- **Object-store backend choice** — the main open decision, deferred deliberately. Candidate
+  backends behind the `NativeStore` interface include an S3-compatible bucket or an OCI registry
+  (e.g. GitHub Packages); selection should weigh anonymous-read, trusted-write identity, egress
+  cost, content-addressing/GC, and any storage the org has already provisioned. **No backend is
+  assumed by this RFC.**
+- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB).
+- Retention policy for non-current native versions.
+- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, or defer to a
+  follow-up.
+- Exact form of the fork-PR **bootstrap `workflow_dispatch`**.
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
+- Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs).
+- Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git.
+- A `ref test-cases preview` that serves locally what the API would render for a PR, straight from a
+  synced native fixture (already the `replay` path).
+- Signed manifests for end-to-end baseline provenance.

From b556db0be28f998d2dde1c34ae5673f5c8a96f5d Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Thu, 4 Jun 2026 16:00:40 +1000
Subject: [PATCH 2/8] chore: format

---
 text/0000-regression-baselines.md | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md
index 5168bf4..ab98e59 100644
--- a/text/0000-regression-baselines.md
+++ b/text/0000-regression-baselines.md
@@ -3,7 +3,6 @@
 - RFC PR: [CMIP-REF/rfcs#0000](https://github.com/CMIP-REF/rfcs/pull/0000)
 
 # Summary
-[summary]: #summary
 
 Manage diagnostic regression baselines as **two layers**:
 
@@ -22,7 +21,6 @@ RFC specifies the *layering, manifest, CLI, and CI*, which are backend-agnostic
 `NativeStore` interface.
 
 # Motivation
-[motivation]: #motivation
 
 Diagnostic regression testing today is flaky, opaque, and bloats the repo:
 
@@ -47,7 +45,6 @@ replays *committed native* through `build_execution_result` without re-running t
 makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
 
 # Reference-level explanation
-[reference-level-explanation]: #reference-level-explanation
 
 ## Two layers, one manifest
 
@@ -173,7 +170,6 @@ nondeterminism first:
   exact equality.
 
 # Drawbacks
-[drawbacks]: #drawbacks
 
 - The REF CLI owns fetch/mint/credential plumbing — more surface to maintain and test than
   off-loading to an existing data-management tool.
@@ -187,7 +183,6 @@ nondeterminism first:
   (deliberate — the repo is already used by externals).
 
 # Rationale and alternatives
-[rationale-and-alternatives]: #rationale-and-alternatives
 
 **Why two layers.** Keeping the small REF-shaped golden in git puts the review signal where
 reviewers already work (the PR diff) and keeps the gate human-readable; moving the large binaries
@@ -213,7 +208,6 @@ in-PR "native changed" diff without binaries in git.
 changes without a visible diff, and silent upstream regressions stay invisible until a manual check.
 
 # Prior art
-[prior-art]: #prior-art
 
 - The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git"
   pattern, here extended to outputs.
@@ -223,7 +217,6 @@ changes without a visible diff, and silent upstream regressions stay invisible u
   contributors.
 
 # Unresolved questions
-[unresolved-questions]: #unresolved-questions
 
 - **Object-store backend choice** — the main open decision, deferred deliberately. Candidate
   backends behind the `NativeStore` interface include an S3-compatible bucket or an OCI registry
@@ -237,7 +230,6 @@ changes without a visible diff, and silent upstream regressions stay invisible u
 - Exact form of the fork-PR **bootstrap `workflow_dispatch`**.
 
 # Future possibilities
-[future-possibilities]: #future-possibilities
 
 - Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs).
 - Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git.

From ef1a4fa792cb5a3208e53fdada04382d919e6925 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Thu, 4 Jun 2026 16:08:00 +1000
Subject: [PATCH 3/8] docs: rewrite regression-baselines RFC with semantic line
 breaks

---
 text/0000-regression-baselines.md | 269 +++++++++++++++++-------------
 1 file changed, 151 insertions(+), 118 deletions(-)

diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md
index ab98e59..f24f293 100644
--- a/text/0000-regression-baselines.md
+++ b/text/0000-regression-baselines.md
@@ -6,43 +6,52 @@
 
 Manage diagnostic regression baselines as **two layers**:
 
-1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles) committed
-   directly to the source repo — this is the test *gate* and the in-PR *diff signal*.
-2. The large, binary **native diagnostic outputs** stored outside git in an object store and
-   referenced from a single git-committed **`native_manifest.json`** (sha256 digests of the golden,
-   the native files, and the extraction inputs).
+1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles)
+   committed directly to the source repo —
+   this is the test *gate* and the in-PR *diff signal*.
+2. The large, binary **native diagnostic outputs** stored outside git in an object store
+   and referenced from a single git-committed **`native_manifest.json`**
+   (sha256 digests of the golden, the native files, and the extraction inputs).
 
-The REF CLI (`ref test-cases`) fetches, runs, and mints baselines. The native outputs double as the
-**input fixtures that test the native→bundle extraction step**, so most PRs are verified fast on a
-stock runner without re-running the (expensive) diagnostic.
+The REF CLI (`ref test-cases`) fetches, runs, and mints baselines.
+The native outputs double as the **input fixtures that test the native→bundle extraction step**,
+so most PRs are verified fast on a stock runner without re-running the (expensive) diagnostic.
 
-The **object-store backend is left as an explicit open decision** (see Unresolved questions) — this
-RFC specifies the *layering, manifest, CLI, and CI*, which are backend-agnostic behind a small
-`NativeStore` interface.
+The **object-store backend is left as an explicit open decision** (see Unresolved questions) —
+this RFC specifies the *layering, manifest, CLI, and CI*,
+which are backend-agnostic behind a small `NativeStore` interface.
 
 # Motivation
 
 Diagnostic regression testing today is flaky, opaque, and bloats the repo:
 
-- **Bloat.** `tests/.../regression` is ~1.0 GB on disk. Of that, **696 MB is input `climate_data`
-  that the capture step wrongly copies in**; the real curated output is ~316 MB (esmvaltool 5 MB,
-  ilamb 61 MB, pmp 250 MB). Some native is already hand-`.gitignore`d, so it is *missing* — no
-  coverage, no parity. The bloat is a capture bug plus committing binaries, not an inherent need.
-- **No clear "this-changed" signal.** Failures are a mix; you dig CI logs across many jobs. The
-  REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR and is
-  exactly the content that feeds the REF API — it is what reviewers most need to see.
-- **The extraction regression never runs on a PR.** It runs only post-merge/nightly today, so a PR
-  that breaks extraction goes green, then red after merge.
-- **Most diagnostic authors are external scientists.** Minimal setup, no new credentials. A
-  contributor should need only the REF.
-- **Security.** The integration runner shares a writable cache; untrusted fork code must never reach
-  it or any write credential.
-
-A key reframe: **the native outputs are not review-only artifacts — they are the input fixtures for
-the extraction step.** `Diagnostic.run()` is `execute()` (runs the diagnostic, writes native) then
-`build_execution_result()` (reads native → CMEC bundle + series). `RegressionValidator` already
-replays *committed native* through `build_execution_result` without re-running the diagnostic. That
-makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
+- **Bloat.** `tests/.../regression` is ~1.0 GB on disk.
+  Of that, **696 MB is input `climate_data` that the capture step wrongly copies in**;
+  the real curated output is ~316 MB (esmvaltool 5 MB, ilamb 61 MB, pmp 250 MB).
+  Some native is already hand-`.gitignore`d, so it is *missing* — no coverage, no parity.
+  The bloat is a capture bug plus committing binaries, not an inherent need.
+- **No clear "this-changed" signal.**
+  Failures are a mix; you dig CI logs across many jobs.
+  The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR
+  and is exactly the content that feeds the REF API —
+  it is what reviewers most need to see.
+- **The extraction regression never runs on a PR.**
+  It runs only post-merge/nightly today,
+  so a PR that breaks extraction goes green, then red after merge.
+- **Most diagnostic authors are external scientists.**
+  Minimal setup, no new credentials.
+  A contributor should need only the REF.
+- **Security.**
+  The integration runner shares a writable cache;
+  untrusted fork code must never reach it or any write credential.
+
+A key reframe: **the native outputs are not review-only artifacts —
+they are the input fixtures for the extraction step.**
+`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native)
+then `build_execution_result()` (reads native → CMEC bundle + series).
+`RegressionValidator` already replays *committed native* through `build_execution_result`
+without re-running the diagnostic.
+That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
 
 # Reference-level explanation
 
@@ -72,28 +81,31 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor
   "extraction_inputs": { "files_series_digest": "<sha256>", "input_selectors_digest": "<sha256>" } }
 ```
 
-- `golden` binds the committed golden bytes to the manifest: CI recomputes the golden digests and
-  asserts they equal `manifest.golden`. A commit that updates `series.json` but not the manifest (or
-  vice-versa) **fails loudly** — closing the one consistency leg content-addressing does not cover.
+- `golden` binds the committed golden bytes to the manifest:
+  CI recomputes the golden digests and asserts they equal `manifest.golden`.
+  A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** —
+  closing the one consistency leg content-addressing does not cover.
 - `native` lists the curated native blobs by content digest (immutable, dedups, survives history).
-- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog
-  `input_selectors` — the real extraction inputs the current catalog hash ignores — so a change to
-  extraction behaviour bumps the manifest *at PR time*.
+- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` —
+  the real extraction inputs the current catalog hash ignores —
+  so a change to extraction behaviour bumps the manifest *at PR time*.
 
-Two manifest sections keep the diff readable: a changed `native` block means native changed; a
-changed `golden` block means extraction output changed.
+Two manifest sections keep the diff readable:
+a changed `native` block means native changed;
+a changed `golden` block means extraction output changed.
 
 ## Curated capture (kills the bloat at the source)
 
 The diagnostic already declares which native files matter via `files` + `series` glob patterns;
-production publishing copies only those (the CMEC `output.json` subset). The regression capture must
-use the **same curation** instead of copying the whole working directory. This alone prunes the
-696 MB `climate_data` input and the verbose run logs for free. Diagnostic authors control what is
-stored/pruned through those existing patterns — no new mechanism.
+production publishing copies only those (the CMEC `output.json` subset).
+The regression capture must use the **same curation** instead of copying the whole working directory.
+This alone prunes the 696 MB `climate_data` input and the verbose run logs for free.
+Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism.
 
 ## `NativeStore` interface (backend-agnostic)
 
-The REF talks to the store through a thin Protocol so the backend is swappable and unit-testable,
+The REF talks to the store through a thin Protocol
+so the backend is swappable and unit-testable,
 and so this RFC does not have to settle the backend to be implementable:
 
 ```python
@@ -107,8 +119,8 @@ Required properties of whatever backend is chosen:
 
 - **Anonymous public read** so fork CI and contributors fetch without credentials.
 - **Writes gated to trusted contexts only** (CI on `main`, or maintainers).
-- Content-addressed (sha256) storage; native addressed **per-test-case** so a fetch is one
-  diagnostic's output (KB–32 MB).
+- Content-addressed (sha256) storage;
+  native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB).
 - Retention that **never expires a digest still referenced by a reachable manifest**.
 
 Backend selection is deferred to a maintainer decision (Unresolved questions).
@@ -126,113 +138,134 @@ Backend selection is deferred to a maintainer decision (Unresolved questions).
 
 ## Tiered CI
 
-- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.** `ref test-cases sync` (public read)
-  → `replay` → diff golden → post a PR comment. Figures surfaced but **non-blocking**. Routed by
-  changed-files so the check is *honest*:
-  - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the
-    replay; the check is named **`extraction-replay`** and its green means what it says.
-  - PR touches `execute()` only → replay is a tautology → emit a **neutral
-    "execute-changed: needs runner verification"** check (not green), linking the bootstrap path.
-  - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip** (today,
-    missing fixtures silently skip → false green).
-- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.** Run the slow
-  `execute()` to confirm the diagnostic still reproduces its native, catching breakage pre-merge.
+- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.**
+  `ref test-cases sync` (public read) → `replay` → diff golden → post a PR comment.
+  Figures surfaced but **non-blocking**.
+  Routed by changed-files so the check is *honest*:
+  - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the replay;
+    the check is named **`extraction-replay`** and its green means what it says.
+  - PR touches `execute()` only → replay is a tautology →
+    emit a **neutral "execute-changed: needs runner verification"** check (not green),
+    linking the bootstrap path.
+  - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip**
+    (today, missing fixtures silently skip → false green).
+- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.**
+  Run the slow `execute()` to confirm the diagnostic still reproduces its native,
+  catching breakage pre-merge.
   Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`.
-- **Merge to `main` — self-hosted runner, the only job holding write creds.** `ref test-cases mint`
-  → `execute()` + `put` native + update manifest.
-- **Nightly — self-hosted runner.** Full sweep; drift detection backstop; opens an auto-PR with the
-  new manifest + a human-readable summary if outputs drift.
-- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change): a documented
-  `workflow_dispatch` a maintainer triggers to mint on the trusted runner and push the golden +
-  manifest back to the PR. Forks never run `execute()` on the self-hosted runner (untrusted code,
-  shared cache).
+- **Merge to `main` — self-hosted runner, the only job holding write creds.**
+  `ref test-cases mint` → `execute()` + `put` native + update manifest.
+- **Nightly — self-hosted runner.**
+  Full sweep; drift detection backstop;
+  opens an auto-PR with the new manifest + a human-readable summary if outputs drift.
+- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change):
+  a documented `workflow_dispatch` a maintainer triggers
+  to mint on the trusted runner and push the golden + manifest back to the PR.
+  Forks never run `execute()` on the self-hosted runner (untrusted code, shared cache).
 
 ## Self-hosted runner security
 
-The integration runner uses ephemeral pods, but mounts a **shared, writable cache** (datasets +
-provider software). Untrusted fork code in such a job could poison that cache for the next trusted
-run; ephemeral pods do not close this. Therefore: never run untrusted `execute()` on the shared
-runner; keep write credentials on the `main` (and explicitly-triggered) path only; gate the
-same-repo PR job by `head.repo.full_name == github.repository`. (A separate hardened runner with a
-read-only cache could later make fork `execute()` safe; out of scope here.)
+The integration runner uses ephemeral pods,
+but mounts a **shared, writable cache** (datasets + provider software).
+Untrusted fork code in such a job could poison that cache for the next trusted run;
+ephemeral pods do not close this.
+Therefore: never run untrusted `execute()` on the shared runner;
+keep write credentials on the `main` (and explicitly-triggered) path only;
+gate the same-repo PR job by `head.repo.full_name == github.repository`.
+(A separate hardened runner with a read-only cache could later make fork `execute()` safe;
+out of scope here.)
 
 ## Determinism (prerequisite for comparing bundles)
 
-CMEC bundle content comparison (currently only `series.json` is compared) requires removing
-nondeterminism first:
+CMEC bundle content comparison (currently only `series.json` is compared)
+requires removing nondeterminism first:
 
-- **Execution-dir timestamp.** ESMValTool writes a `recipe_<YYYYMMDD>_<HHMMSS>` dir that gets baked
-  into `output.json`. Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison.
+- **Execution-dir timestamp.**
+  ESMValTool writes a `recipe_<YYYYMMDD>_<HHMMSS>` dir that gets baked into `output.json`.
+  Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison.
   Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713).
-- **Floats.** Replay path (stored native → extraction) is deterministic → exact compare. Execute
-  path (re-run) has float jitter → compare structure/dimensions only, leaving numeric values to the
-  replay path; any value that must be compared on the execute path uses a relative tolerance, never
-  exact equality.
+- **Floats.**
+  Replay path (stored native → extraction) is deterministic → exact compare.
+  Execute path (re-run) has float jitter → compare structure/dimensions only,
+  leaving numeric values to the replay path;
+  any value that must be compared on the execute path uses a relative tolerance, never exact equality.
 
 # Drawbacks
 
-- The REF CLI owns fetch/mint/credential plumbing — more surface to maintain and test than
-  off-loading to an existing data-management tool.
-- Native is a cache of a regenerable artifact: losing a digest still referenced by a reachable
-  manifest breaks historical replay (mitigated by reachability-aware retention).
-- Golden minting is effectively a maintainer/CI action for conda providers (esmvaltool's env is
-  ~1373 packages); external authors usually cannot run `execute()` locally.
-- Fork `execute()` changes have no automated PR signal until bootstrapped — the fast path tests
-  extraction, not the diagnostic algorithm.
+- The REF CLI owns fetch/mint/credential plumbing —
+  more surface to maintain and test than off-loading to an existing data-management tool.
+- Native is a cache of a regenerable artifact:
+  losing a digest still referenced by a reachable manifest breaks historical replay
+  (mitigated by reachability-aware retention).
+- Golden minting is effectively a maintainer/CI action for conda providers
+  (esmvaltool's env is ~1373 packages);
+  external authors usually cannot run `execute()` locally.
+- Fork `execute()` changes have no automated PR signal until bootstrapped —
+  the fast path tests extraction, not the diagnostic algorithm.
 - Historical bloat remains: existing committed binaries are not rewritten out of history
   (deliberate — the repo is already used by externals).
 
 # Rationale and alternatives
 
-**Why two layers.** Keeping the small REF-shaped golden in git puts the review signal where
-reviewers already work (the PR diff) and keeps the gate human-readable; moving the large binaries
-out of git removes the bloat while still making them fetchable for the extraction test. The split
-maps cleanly onto what the REF already does for catalogs (`catalog.yaml` committed,
-`catalog.paths.yaml` gitignored).
+**Why two layers.**
+Keeping the small REF-shaped golden in git
+puts the review signal where reviewers already work (the PR diff)
+and keeps the gate human-readable;
+moving the large binaries out of git removes the bloat
+while still making them fetchable for the extraction test.
+The split maps cleanly onto what the REF already does for catalogs
+(`catalog.yaml` committed, `catalog.paths.yaml` gitignored).
 
-**Why a manifest + CLI rather than committing the bytes.** Measured native is too large to commit
-(pmp 250 MB, ilamb 61 MB). A content-addressed manifest gives dedup, per-commit pinning, and a clean
-in-PR "native changed" diff without binaries in git.
+**Why a manifest + CLI rather than committing the bytes.**
+Measured native is too large to commit (pmp 250 MB, ilamb 61 MB).
+A content-addressed manifest gives dedup, per-commit pinning,
+and a clean in-PR "native changed" diff without binaries in git.
 
 **Alternatives considered:**
 
-- **Commit curated native to git.** Rejected by measurement (too large for pmp/ilamb). Viable only
-  for esmvaltool (~5 MB) — not worth a per-provider split of the storage model.
-- **Git LFS.** Pointer churn remains; GitHub LFS egress costs; no anonymous public read.
-- **Regenerate-on-demand, store nothing.** The fast fork-PR extraction path runs on a stock runner
-  with no conda/data and cannot regenerate; it must fetch. Per-provider, pmp's 250 MB *could* be
-  regenerated on the self-hosted runner rather than stored if its fetch rate is low — the
-  `NativeStore` interface allows mixing.
-
-**Impact of not doing this.** Contributors keep hand-rolling baselines, PRs keep shipping output
-changes without a visible diff, and silent upstream regressions stay invisible until a manual check.
+- **Commit curated native to git.**
+  Rejected by measurement (too large for pmp/ilamb).
+  Viable only for esmvaltool (~5 MB) — not worth a per-provider split of the storage model.
+- **Git LFS.**
+  Pointer churn remains; GitHub LFS egress costs; no anonymous public read.
+- **Regenerate-on-demand, store nothing.**
+  The fast fork-PR extraction path runs on a stock runner with no conda/data and cannot regenerate;
+  it must fetch.
+  Per-provider, pmp's 250 MB *could* be regenerated on the self-hosted runner rather than stored
+  if its fetch rate is low — the `NativeStore` interface allows mixing.
+
+**Impact of not doing this.**
+Contributors keep hand-rolling baselines,
+PRs keep shipping output changes without a visible diff,
+and silent upstream regressions stay invisible until a manual check.
 
 # Prior art
 
-- The REF's existing **catalog split** — the exact "commit small metadata, keep bytes out of git"
-  pattern, here extended to outputs.
+- The REF's existing **catalog split** —
+  the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs.
 - **pytest-regressions** (`--force-regen`) — the regenerate-the-golden workflow this mirrors.
-- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; this RFC
-  keeps the client surface to just the REF CLI rather than adding a separate toolchain on
-  contributors.
+- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea;
+  this RFC keeps the client surface to just the REF CLI
+  rather than adding a separate toolchain on contributors.
 
 # Unresolved questions
 
-- **Object-store backend choice** — the main open decision, deferred deliberately. Candidate
-  backends behind the `NativeStore` interface include an S3-compatible bucket or an OCI registry
-  (e.g. GitHub Packages); selection should weigh anonymous-read, trusted-write identity, egress
-  cost, content-addressing/GC, and any storage the org has already provisioned. **No backend is
-  assumed by this RFC.**
+- **Object-store backend choice** — the main open decision, deferred deliberately.
+  Candidate backends behind the `NativeStore` interface include
+  an S3-compatible bucket or an OCI registry (e.g. GitHub Packages);
+  selection should weigh anonymous-read, trusted-write identity, egress cost,
+  content-addressing/GC, and any storage the org has already provisioned.
+  **No backend is assumed by this RFC.**
 - Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB).
 - Retention policy for non-current native versions.
-- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, or defer to a
-  follow-up.
+- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**,
+  or defer to a follow-up.
 - Exact form of the fork-PR **bootstrap `workflow_dispatch`**.
 
 # Future possibilities
 
 - Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs).
 - Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git.
-- A `ref test-cases preview` that serves locally what the API would render for a PR, straight from a
-  synced native fixture (already the `replay` path).
+- A `ref test-cases preview` that serves locally what the API would render for a PR,
+  straight from a synced native fixture (already the `replay` path).
 - Signed manifests for end-to-end baseline provenance.

From 871e74c2422fceec2bac1c5f6b7ee885ce47f1a3 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 06:12:32 +1000
Subject: [PATCH 4/8] chore: rename golden to committed

---
 text/0000-regression-baselines.md | 85 ++++++++++++++++---------------
 1 file changed, 43 insertions(+), 42 deletions(-)

diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md
index f24f293..1266954 100644
--- a/text/0000-regression-baselines.md
+++ b/text/0000-regression-baselines.md
@@ -1,20 +1,22 @@
 - Feature Name: `regression_baselines`
 - Start Date: 2026-06-04
-- RFC PR: [CMIP-REF/rfcs#0000](https://github.com/CMIP-REF/rfcs/pull/0000)
+- RFC PR: [CMIP-REF/rfcs#5](https://github.com/CMIP-REF/rfcs/pull/0005)
 
 # Summary
 
-Manage diagnostic regression baselines as **two layers**:
+Manage diagnostic regression **baselines**.
+Each baseline (one test-case run) has **two bundles**:
 
-1. A small, human-readable **golden** (the REF-shaped `series.json` + CMEC bundles)
+1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles)
    committed directly to the source repo —
    this is the test *gate* and the in-PR *diff signal*.
-2. The large, binary **native diagnostic outputs** stored outside git in an object store
-   and referenced from a single git-committed **`native_manifest.json`**
-   (sha256 digests of the golden, the native files, and the extraction inputs).
+2. A large, binary **native bundle** (the diagnostic's native outputs)
+   stored outside git in an object store
+   and referenced from a single git-committed **`manifest.json`**
+   (sha256 digests of the committed bundle, the native files, and the extraction inputs).
 
 The REF CLI (`ref test-cases`) fetches, runs, and mints baselines.
-The native outputs double as the **input fixtures that test the native→bundle extraction step**,
+The native bundle doubles as the **input fixtures that test the native->bundle extraction step**,
 so most PRs are verified fast on a stock runner without re-running the (expensive) diagnostic.
 
 The **object-store backend is left as an explicit open decision** (see Unresolved questions) —
@@ -45,17 +47,16 @@ Diagnostic regression testing today is flaky, opaque, and bloats the repo:
   The integration runner shares a writable cache;
   untrusted fork code must never reach it or any write credential.
 
-A key reframe: **the native outputs are not review-only artifacts —
-they are the input fixtures for the extraction step.**
+A key reframe: **the native bundle is not a review-only artifact —
+it is the input fixtures for the extraction step.**
 `Diagnostic.run()` is `execute()` (runs the diagnostic, writes native)
-then `build_execution_result()` (reads native → CMEC bundle + series).
-`RegressionValidator` already replays *committed native* through `build_execution_result`
-without re-running the diagnostic.
+then `build_execution_result()` (reads native -> CMEC bundle + series).
+`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic.
 That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
 
 # Reference-level explanation
 
-## Two layers, one manifest
+## Two bundles, one manifest
 
 For each `(provider, diagnostic, test-case)`:
 
@@ -64,25 +65,25 @@ packages/<provider>/tests/test-data/<diagnostic>/<test-case>/
   catalog.yaml            # committed: dataset metadata (exists today, content-hashed)
   catalog.paths.yaml      # gitignored: per-user local paths (exists today)
   regression/
-    series.json           # committed GOLDEN (the gate + diff signal)
-    diagnostic.json        # committed GOLDEN (CMEC metric bundle)
-    output.json           # committed GOLDEN (CMEC output bundle)
-  native_manifest.json    # committed: source of truth binding golden <-> native
+    series.json           # committed bundle (the gate + diff signal)
+    diagnostic.json        # committed bundle (CMEC metric bundle)
+    output.json           # committed bundle (CMEC output bundle)
+  manifest.json    # committed: source of truth binding committed bundle <-> native bundle
 ```
 
 The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git.
 
-`native_manifest.json`:
+`manifest.json`:
 
 ```json
 { "schema": 1,
-  "golden":  { "series.json": "<sha256>", "diagnostic.json": "<sha256>", "output.json": "<sha256>" },
+  "committed":  { "series.json": "<sha256>", "diagnostic.json": "<sha256>", "output.json": "<sha256>" },
   "native":  { "<relpath>": { "sha256": "...", "size": ... } },
   "extraction_inputs": { "files_series_digest": "<sha256>", "input_selectors_digest": "<sha256>" } }
 ```
 
-- `golden` binds the committed golden bytes to the manifest:
-  CI recomputes the golden digests and asserts they equal `manifest.golden`.
+- `committed` binds the committed bundle bytes to the manifest:
+  CI recomputes the committed digests and asserts they equal `manifest.committed`.
   A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** —
   closing the one consistency leg content-addressing does not cover.
 - `native` lists the curated native blobs by content digest (immutable, dedups, survives history).
@@ -92,7 +93,7 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor
 
 Two manifest sections keep the diff readable:
 a changed `native` block means native changed;
-a changed `golden` block means extraction output changed.
+a changed `committed` block means extraction output changed.
 
 ## Curated capture (kills the bloat at the source)
 
@@ -127,40 +128,40 @@ Backend selection is deferred to a maintainer decision (Unresolved questions).
 
 ## CLI (extends the existing `ref test-cases` app)
 
-| Subcommand | Purpose | Creds |
-|------------|---------|-------|
-| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write golden + manifest locally | none (local) |
-| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) |
-| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to golden | none (public read) |
-| `mint --provider P` | `run` + `put` native to the store + update manifest | write creds |
+| Subcommand                           | Purpose                                                                              | Creds              |
+| ------------------------------------ | ----------------------------------------------------------------------------------- | ------------------ |
+| `run --provider P --diagnostic D`    | run `execute()` + `build_execution_result`, write committed bundle + manifest locally | none (local)       |
+| `sync [--provider P]`                | fetch native blobs named by the committed manifest(s)                                | none (public read) |
+| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to committed bundle               | none (public read) |
+| `mint --provider P`                  | `run` + `put` native to the store + update manifest                                  | write creds        |
 
 `mint` is the only credentialed verb and runs only on trusted contexts.
 
 ## Tiered CI
 
 - **PR — any, incl forks — stock `ubuntu-latest`, no secrets.**
-  `ref test-cases sync` (public read) → `replay` → diff golden → post a PR comment.
+  `ref test-cases sync` (public read) -> `replay` -> diff committed bundle -> post a PR comment.
   Figures surfaced but **non-blocking**.
   Routed by changed-files so the check is *honest*:
-  - PR touches extraction (`build_execution_result`, `files`/`series`) or golden/manifest → run the replay;
+  - PR touches extraction (`build_execution_result`, `files`/`series`) or committed/manifest -> run the replay;
     the check is named **`extraction-replay`** and its green means what it says.
-  - PR touches `execute()` only → replay is a tautology →
+  - PR touches `execute()` only -> replay is a tautology ->
     emit a **neutral "execute-changed: needs runner verification"** check (not green),
     linking the bootstrap path.
   - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip**
-    (today, missing fixtures silently skip → false green).
-- **Same-repo PR — self-hosted runner, no upload — when a golden file changes.**
+    (today, missing fixtures silently skip -> false green).
+- **Same-repo PR — self-hosted runner, no upload — when a committed bundle file changes.**
   Run the slow `execute()` to confirm the diagnostic still reproduces its native,
   catching breakage pre-merge.
   Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`.
 - **Merge to `main` — self-hosted runner, the only job holding write creds.**
-  `ref test-cases mint` → `execute()` + `put` native + update manifest.
+  `ref test-cases mint` -> `execute()` + `put` native + update manifest.
 - **Nightly — self-hosted runner.**
   Full sweep; drift detection backstop;
   opens an auto-PR with the new manifest + a human-readable summary if outputs drift.
 - **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change):
   a documented `workflow_dispatch` a maintainer triggers
-  to mint on the trusted runner and push the golden + manifest back to the PR.
+  to mint on the trusted runner and push the committed bundle + manifest back to the PR.
   Forks never run `execute()` on the self-hosted runner (untrusted code, shared cache).
 
 ## Self-hosted runner security
@@ -182,11 +183,11 @@ requires removing nondeterminism first:
 
 - **Execution-dir timestamp.**
   ESMValTool writes a `recipe_<YYYYMMDD>_<HHMMSS>` dir that gets baked into `output.json`.
-  Sanitise `recipe_\d{8}_\d{6}` → a placeholder in both capture and comparison.
+  Sanitise `recipe_\d{8}_\d{6}` -> a placeholder in both capture and comparison.
   Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713).
 - **Floats.**
-  Replay path (stored native → extraction) is deterministic → exact compare.
-  Execute path (re-run) has float jitter → compare structure/dimensions only,
+  Replay path (stored native -> extraction) is deterministic -> exact compare.
+  Execute path (re-run) has float jitter -> compare structure/dimensions only,
   leaving numeric values to the replay path;
   any value that must be compared on the execute path uses a relative tolerance, never exact equality.
 
@@ -197,7 +198,7 @@ requires removing nondeterminism first:
 - Native is a cache of a regenerable artifact:
   losing a digest still referenced by a reachable manifest breaks historical replay
   (mitigated by reachability-aware retention).
-- Golden minting is effectively a maintainer/CI action for conda providers
+- Committed-bundle minting is effectively a maintainer/CI action for conda providers
   (esmvaltool's env is ~1373 packages);
   external authors usually cannot run `execute()` locally.
 - Fork `execute()` changes have no automated PR signal until bootstrapped —
@@ -207,8 +208,8 @@ requires removing nondeterminism first:
 
 # Rationale and alternatives
 
-**Why two layers.**
-Keeping the small REF-shaped golden in git
+**Why two bundles.**
+Keeping the small REF-shaped committed bundle in git
 puts the review signal where reviewers already work (the PR diff)
 and keeps the gate human-readable;
 moving the large binaries out of git removes the bloat
@@ -243,7 +244,7 @@ and silent upstream regressions stay invisible until a manual check.
 
 - The REF's existing **catalog split** —
   the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs.
-- **pytest-regressions** (`--force-regen`) — the regenerate-the-golden workflow this mirrors.
+- **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors.
 - Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea;
   this RFC keeps the client surface to just the REF CLI
   rather than adding a separate toolchain on contributors.

From ddc49bb8d724279f194bd9d21178af7415f8748f Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 10:48:27 +1000
Subject: [PATCH 5/8] chore: rename file

---
 text/0000-regression-baselines.md | 272 -------------------
 text/0005-regression-baselines.md | 433 ++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+), 272 deletions(-)
 delete mode 100644 text/0000-regression-baselines.md
 create mode 100644 text/0005-regression-baselines.md

diff --git a/text/0000-regression-baselines.md b/text/0000-regression-baselines.md
deleted file mode 100644
index 1266954..0000000
--- a/text/0000-regression-baselines.md
+++ /dev/null
@@ -1,272 +0,0 @@
-- Feature Name: `regression_baselines`
-- Start Date: 2026-06-04
-- RFC PR: [CMIP-REF/rfcs#5](https://github.com/CMIP-REF/rfcs/pull/0005)
-
-# Summary
-
-Manage diagnostic regression **baselines**.
-Each baseline (one test-case run) has **two bundles**:
-
-1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles)
-   committed directly to the source repo —
-   this is the test *gate* and the in-PR *diff signal*.
-2. A large, binary **native bundle** (the diagnostic's native outputs)
-   stored outside git in an object store
-   and referenced from a single git-committed **`manifest.json`**
-   (sha256 digests of the committed bundle, the native files, and the extraction inputs).
-
-The REF CLI (`ref test-cases`) fetches, runs, and mints baselines.
-The native bundle doubles as the **input fixtures that test the native->bundle extraction step**,
-so most PRs are verified fast on a stock runner without re-running the (expensive) diagnostic.
-
-The **object-store backend is left as an explicit open decision** (see Unresolved questions) —
-this RFC specifies the *layering, manifest, CLI, and CI*,
-which are backend-agnostic behind a small `NativeStore` interface.
-
-# Motivation
-
-Diagnostic regression testing today is flaky, opaque, and bloats the repo:
-
-- **Bloat.** `tests/.../regression` is ~1.0 GB on disk.
-  Of that, **696 MB is input `climate_data` that the capture step wrongly copies in**;
-  the real curated output is ~316 MB (esmvaltool 5 MB, ilamb 61 MB, pmp 250 MB).
-  Some native is already hand-`.gitignore`d, so it is *missing* — no coverage, no parity.
-  The bloat is a capture bug plus committing binaries, not an inherent need.
-- **No clear "this-changed" signal.**
-  Failures are a mix; you dig CI logs across many jobs.
-  The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR
-  and is exactly the content that feeds the REF API —
-  it is what reviewers most need to see.
-- **The extraction regression never runs on a PR.**
-  It runs only post-merge/nightly today,
-  so a PR that breaks extraction goes green, then red after merge.
-- **Most diagnostic authors are external scientists.**
-  Minimal setup, no new credentials.
-  A contributor should need only the REF.
-- **Security.**
-  The integration runner shares a writable cache;
-  untrusted fork code must never reach it or any write credential.
-
-A key reframe: **the native bundle is not a review-only artifact —
-it is the input fixtures for the extraction step.**
-`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native)
-then `build_execution_result()` (reads native -> CMEC bundle + series).
-`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic.
-That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
-
-# Reference-level explanation
-
-## Two bundles, one manifest
-
-For each `(provider, diagnostic, test-case)`:
-
-```
-packages/<provider>/tests/test-data/<diagnostic>/<test-case>/
-  catalog.yaml            # committed: dataset metadata (exists today, content-hashed)
-  catalog.paths.yaml      # gitignored: per-user local paths (exists today)
-  regression/
-    series.json           # committed bundle (the gate + diff signal)
-    diagnostic.json        # committed bundle (CMEC metric bundle)
-    output.json           # committed bundle (CMEC output bundle)
-  manifest.json    # committed: source of truth binding committed bundle <-> native bundle
-```
-
-The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git.
-
-`manifest.json`:
-
-```json
-{ "schema": 1,
-  "committed":  { "series.json": "<sha256>", "diagnostic.json": "<sha256>", "output.json": "<sha256>" },
-  "native":  { "<relpath>": { "sha256": "...", "size": ... } },
-  "extraction_inputs": { "files_series_digest": "<sha256>", "input_selectors_digest": "<sha256>" } }
-```
-
-- `committed` binds the committed bundle bytes to the manifest:
-  CI recomputes the committed digests and asserts they equal `manifest.committed`.
-  A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** —
-  closing the one consistency leg content-addressing does not cover.
-- `native` lists the curated native blobs by content digest (immutable, dedups, survives history).
-- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` —
-  the real extraction inputs the current catalog hash ignores —
-  so a change to extraction behaviour bumps the manifest *at PR time*.
-
-Two manifest sections keep the diff readable:
-a changed `native` block means native changed;
-a changed `committed` block means extraction output changed.
-
-## Curated capture (kills the bloat at the source)
-
-The diagnostic already declares which native files matter via `files` + `series` glob patterns;
-production publishing copies only those (the CMEC `output.json` subset).
-The regression capture must use the **same curation** instead of copying the whole working directory.
-This alone prunes the 696 MB `climate_data` input and the verbose run logs for free.
-Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism.
-
-## `NativeStore` interface (backend-agnostic)
-
-The REF talks to the store through a thin Protocol
-so the backend is swappable and unit-testable,
-and so this RFC does not have to settle the backend to be implementable:
-
-```python
-class NativeStore(Protocol):
-    def has(self, digest: str) -> bool: ...
-    def fetch(self, digest: str, dest: Path) -> None: ...   # by sha256; public read, no creds
-    def put(self, path: Path) -> str: ...                    # returns sha256; needs write creds
-```
-
-Required properties of whatever backend is chosen:
-
-- **Anonymous public read** so fork CI and contributors fetch without credentials.
-- **Writes gated to trusted contexts only** (CI on `main`, or maintainers).
-- Content-addressed (sha256) storage;
-  native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB).
-- Retention that **never expires a digest still referenced by a reachable manifest**.
-
-Backend selection is deferred to a maintainer decision (Unresolved questions).
-
-## CLI (extends the existing `ref test-cases` app)
-
-| Subcommand                           | Purpose                                                                              | Creds              |
-| ------------------------------------ | ----------------------------------------------------------------------------------- | ------------------ |
-| `run --provider P --diagnostic D`    | run `execute()` + `build_execution_result`, write committed bundle + manifest locally | none (local)       |
-| `sync [--provider P]`                | fetch native blobs named by the committed manifest(s)                                | none (public read) |
-| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to committed bundle               | none (public read) |
-| `mint --provider P`                  | `run` + `put` native to the store + update manifest                                  | write creds        |
-
-`mint` is the only credentialed verb and runs only on trusted contexts.
-
-## Tiered CI
-
-- **PR — any, incl forks — stock `ubuntu-latest`, no secrets.**
-  `ref test-cases sync` (public read) -> `replay` -> diff committed bundle -> post a PR comment.
-  Figures surfaced but **non-blocking**.
-  Routed by changed-files so the check is *honest*:
-  - PR touches extraction (`build_execution_result`, `files`/`series`) or committed/manifest -> run the replay;
-    the check is named **`extraction-replay`** and its green means what it says.
-  - PR touches `execute()` only -> replay is a tautology ->
-    emit a **neutral "execute-changed: needs runner verification"** check (not green),
-    linking the bootstrap path.
-  - A fetch *miss* for a fixture the manifest says should exist is a **failure, not a skip**
-    (today, missing fixtures silently skip -> false green).
-- **Same-repo PR — self-hosted runner, no upload — when a committed bundle file changes.**
-  Run the slow `execute()` to confirm the diagnostic still reproduces its native,
-  catching breakage pre-merge.
-  Gated with no labels: `if: github.event.pull_request.head.repo.full_name == github.repository`.
-- **Merge to `main` — self-hosted runner, the only job holding write creds.**
-  `ref test-cases mint` -> `execute()` + `put` native + update manifest.
-- **Nightly — self-hosted runner.**
-  Full sweep; drift detection backstop;
-  opens an auto-PR with the new manifest + a human-readable summary if outputs drift.
-- **Fork PR needing new/changed native** (a new diagnostic, or an `execute()` change):
-  a documented `workflow_dispatch` a maintainer triggers
-  to mint on the trusted runner and push the committed bundle + manifest back to the PR.
-  Forks never run `execute()` on the self-hosted runner (untrusted code, shared cache).
-
-## Self-hosted runner security
-
-The integration runner uses ephemeral pods,
-but mounts a **shared, writable cache** (datasets + provider software).
-Untrusted fork code in such a job could poison that cache for the next trusted run;
-ephemeral pods do not close this.
-Therefore: never run untrusted `execute()` on the shared runner;
-keep write credentials on the `main` (and explicitly-triggered) path only;
-gate the same-repo PR job by `head.repo.full_name == github.repository`.
-(A separate hardened runner with a read-only cache could later make fork `execute()` safe;
-out of scope here.)
-
-## Determinism (prerequisite for comparing bundles)
-
-CMEC bundle content comparison (currently only `series.json` is compared)
-requires removing nondeterminism first:
-
-- **Execution-dir timestamp.**
-  ESMValTool writes a `recipe_<YYYYMMDD>_<HHMMSS>` dir that gets baked into `output.json`.
-  Sanitise `recipe_\d{8}_\d{6}` -> a placeholder in both capture and comparison.
-  Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713).
-- **Floats.**
-  Replay path (stored native -> extraction) is deterministic -> exact compare.
-  Execute path (re-run) has float jitter -> compare structure/dimensions only,
-  leaving numeric values to the replay path;
-  any value that must be compared on the execute path uses a relative tolerance, never exact equality.
-
-# Drawbacks
-
-- The REF CLI owns fetch/mint/credential plumbing —
-  more surface to maintain and test than off-loading to an existing data-management tool.
-- Native is a cache of a regenerable artifact:
-  losing a digest still referenced by a reachable manifest breaks historical replay
-  (mitigated by reachability-aware retention).
-- Committed-bundle minting is effectively a maintainer/CI action for conda providers
-  (esmvaltool's env is ~1373 packages);
-  external authors usually cannot run `execute()` locally.
-- Fork `execute()` changes have no automated PR signal until bootstrapped —
-  the fast path tests extraction, not the diagnostic algorithm.
-- Historical bloat remains: existing committed binaries are not rewritten out of history
-  (deliberate — the repo is already used by externals).
-
-# Rationale and alternatives
-
-**Why two bundles.**
-Keeping the small REF-shaped committed bundle in git
-puts the review signal where reviewers already work (the PR diff)
-and keeps the gate human-readable;
-moving the large binaries out of git removes the bloat
-while still making them fetchable for the extraction test.
-The split maps cleanly onto what the REF already does for catalogs
-(`catalog.yaml` committed, `catalog.paths.yaml` gitignored).
-
-**Why a manifest + CLI rather than committing the bytes.**
-Measured native is too large to commit (pmp 250 MB, ilamb 61 MB).
-A content-addressed manifest gives dedup, per-commit pinning,
-and a clean in-PR "native changed" diff without binaries in git.
-
-**Alternatives considered:**
-
-- **Commit curated native to git.**
-  Rejected by measurement (too large for pmp/ilamb).
-  Viable only for esmvaltool (~5 MB) — not worth a per-provider split of the storage model.
-- **Git LFS.**
-  Pointer churn remains; GitHub LFS egress costs; no anonymous public read.
-- **Regenerate-on-demand, store nothing.**
-  The fast fork-PR extraction path runs on a stock runner with no conda/data and cannot regenerate;
-  it must fetch.
-  Per-provider, pmp's 250 MB *could* be regenerated on the self-hosted runner rather than stored
-  if its fetch rate is low — the `NativeStore` interface allows mixing.
-
-**Impact of not doing this.**
-Contributors keep hand-rolling baselines,
-PRs keep shipping output changes without a visible diff,
-and silent upstream regressions stay invisible until a manual check.
-
-# Prior art
-
-- The REF's existing **catalog split** —
-  the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs.
-- **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors.
-- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea;
-  this RFC keeps the client surface to just the REF CLI
-  rather than adding a separate toolchain on contributors.
-
-# Unresolved questions
-
-- **Object-store backend choice** — the main open decision, deferred deliberately.
-  Candidate backends behind the `NativeStore` interface include
-  an S3-compatible bucket or an OCI registry (e.g. GitHub Packages);
-  selection should weigh anonymous-read, trusted-write identity, egress cost,
-  content-addressing/GC, and any storage the org has already provisioned.
-  **No backend is assumed by this RFC.**
-- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB).
-- Retention policy for non-current native versions.
-- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**,
-  or defer to a follow-up.
-- Exact form of the fork-PR **bootstrap `workflow_dispatch`**.
-
-# Future possibilities
-
-- Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs).
-- Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git.
-- A `ref test-cases preview` that serves locally what the API would render for a PR,
-  straight from a synced native fixture (already the `replay` path).
-- Signed manifests for end-to-end baseline provenance.
diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md
new file mode 100644
index 0000000..7598116
--- /dev/null
+++ b/text/0005-regression-baselines.md
@@ -0,0 +1,433 @@
+- Feature Name: `regression_baselines`
+- Start Date: 2026-06-04
+- RFC PR: [CMIP-REF/rfcs#5](https://github.com/CMIP-REF/rfcs/pull/0005)
+
+# Summary
+
+Effectively diagnostic regression **baselines** for use by the testing suite.
+Having baseline output for a diagnostic is an important contract between the development team and the diagnostic implementor.
+
+Each baseline (one test-case run) has **two bundles**:
+
+1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles)
+   committed directly to the source repo.
+   This should be stable and able to be reproduced by the CI.
+2. A large, binary **native bundle** (the diagnostic's native outputs)
+   stored outside git in an object store
+   and referenced from a single git-committed **`manifest.json`**
+   (sha256 digests of the committed bundle, the native files, and the extraction inputs).
+
+The REF CLI (`ref test-cases`) fetches, runs, and mints baselines
+with the goal of making it simple for diagnostic developers to contribute test output without requiring custom pytest implementation.
+
+The **object-store backend is left as an explicit open decision** (see Unresolved questions) —
+this RFC specifies the *layering, manifest, CLI, and CI*,
+which are backend-agnostic behind a small `NativeStore` interface.
+
+# Motivation
+
+Diagnostic regression testing today is flaky, opaque, and bloats the repo:
+
+- **Bloat.** `tests/.../regression` is ~1 GB on disk.
+  The git repo size is growing in a way that isn't sustainable.
+  We want to minimise the volume of data needed to setup the repo.
+- **No clear "this-changed" signal.**
+  It's hard to see the failures for new diagnostics as we aren't following a consitent process.
+  The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR
+  and is what reviewers most need to see.
+- **The extraction regression never runs on a PR.**
+  It runs only post-merge/nightly today,
+  so a PR that breaks extraction goes green, then red after merge.
+  The integration tests are currently broken and require a refresh of the regression data.
+- **Most diagnostic authors are external scientists.**
+  Require minimal setup, no new credentials.
+  A contributor should need only the REF.
+- **Security.**
+  The integration runner shares a writable cache so we must be careful that fork code must never reach it or any write credential.
+
+The native bundle is not a review-only artifact as it is the input fixtures for integration tests.
+`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native)
+then `build_execution_result()` (reads native -> CMEC bundle + series).
+`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic.
+That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
+
+# Reference-level explanation
+
+## Two bundles, one manifest
+
+For each `(provider, diagnostic, test-case)`:
+
+```
+packages/<provider>/tests/test-data/<diagnostic>/<test-case>/
+  catalog.yaml            # committed: dataset metadata (exists today, content-hashed)
+  catalog.paths.yaml      # gitignored: per-user local paths (exists today)
+  regression/
+    series.json           # committed bundle (the gate + diff signal)
+    diagnostic.json        # committed bundle (CMEC metric bundle)
+    output.json           # committed bundle (CMEC output bundle)
+  manifest.json           # committed: source of truth binding committed bundle <-> native bundle
+```
+
+The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git.
+
+`manifest.json`:
+
+```json
+{ "schema": 1,
+  "committed":  { "series.json": "<sha256>", "diagnostic.json": "<sha256>", "output.json": "<sha256>" },
+  "native":  { "<relpath>": { "sha256": "...", "size": ... } },
+  "extraction_inputs": { "files_series_digest": "<sha256>", "input_selectors_digest": "<sha256>" } }
+```
+
+- `committed` binds the committed bundle bytes to the manifest **for integrity only**:
+  CI recomputes the committed digests and asserts they equal `manifest.committed`.
+  A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** —
+  closing the one consistency leg content-addressing does not cover.
+  This is a tamper/desync check on the in-repo files, **not** the regression comparator
+  (a re-run is compared by *content with tolerance*, not by digest — see Comparison).
+- `native` lists the curated native blobs by content digest (dedups, survives history).
+  These digests are **authored only by `mint`**: the value is the sha256 of the exact bytes
+  `mint` uploaded to the store, so it is a *fetch address + integrity check*, never a reproducibility gate.
+- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` —
+  the real extraction inputs the current catalog hash ignores —
+  so a change to extraction behaviour bumps the manifest *at PR time*.
+
+**Native digests do not gate reproducibility.**
+Native bytes are nondeterministic (`.nc`/`.png` embed timestamps, library versions, and provenance, floats jitter across platforms),
+so re-running `execute()` on a different machine yields different bytes resulting in a different digest for an unchanged diagnostic.
+The CI therefore never recomputes a native digest and compares it to the manifest.
+A mismatch is expected, and not a failure.
+
+The deterministic gate is the **committed bundle**, compared by content with tolerance (see Comparison).
+The `native` block **churns on every real mint** (timestamps and maybe SHAs shift),
+so it is not a clean "native meaningfully changed" signal.
+Only the `committed` block is a semantic diff signal
+Remint when the committed bundle changes or a new test-case lands, not on native drift.
+
+The consistency actually enforced:
+
+| pair                                                 | enforced              | how                                             |
+| ---------------------------------------------------- | --------------------- | ----------------------------------------------- |
+| in-repo committed bundle ↔ `manifest.committed`      | yes — integrity gate  | CI recompute equals (tamper/desync check)       |
+| replayed committed bundle ↔ in-repo committed bundle | yes — regression gate | content compare with tolerance (see Comparison) |
+| stored native bytes ↔ `manifest.native` digest       | by construction       | `mint` hashes what it uploads                   |
+| re-executed native ↔ `manifest.native` digest        | **no — by design**    | nondeterministic; gate is the committed bundle  |
+
+## Curated capture (kills the bloat at the source)
+
+The diagnostic already declares which native files matter via `files` + `series` glob patterns
+with production publishing copies only those (the CMEC `output.json` subset).
+The regression capture must use the **same curation** instead of copying the whole working directory.
+Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism.
+
+## `NativeStore` interface
+
+The REF talks to the store through a thin Protocol
+so the backend is swappable and unit-testable,
+and so this RFC does not have to settle the backend to be implementable:
+
+```python
+class NativeStore(Protocol):
+    def has(self, digest: str) -> bool: ...
+    def fetch(self, digest: str, dest: Path) -> None: ...   # by sha256; public read, no creds
+    def put(self, path: Path) -> str: ...                    # returns sha256; needs write creds
+```
+
+Required properties of whatever backend is chosen:
+
+- **Anonymous public read** so fork CI and contributors fetch without credentials.
+- **Writes gated to trusted contexts only** (CI on `main`, or maintainers).
+- Content-addressed (sha256) storage;
+  native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB).
+- Retention that **never expires a digest still referenced by a reachable manifest**.
+
+The backend will be prototyped using an R2 bucket on Cloudflare,
+with the CI having write credentials to mint new baselines.
+
+## CLI (extends the existing `ref test-cases` app)
+
+| Subcommand                           | Purpose                                                                               | Creds              |
+| ------------------------------------ | ------------------------------------------------------------------------------------- | ------------------ |
+| `run --provider P --diagnostic D`    | run `execute()` + `build_execution_result`, write committed bundle + a *local* manifest | none (local)       |
+| `sync [--provider P]`                | fetch native blobs named by the committed manifest(s)                                 | none (public read) |
+| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to committed bundle               | none (public read) |
+| `mint --provider P`                  | `run` + `put` native to the store + author the canonical `manifest.native`            | write creds        |
+
+`mint` is the only credentialed verb, and the only writer of canonical native digests,
+so anonymous users can fetch the native baseline.
+A dev `run` produces native and a local manifest for previewing the committed-bundle diff;
+its native digests are advisory and never committed.
+A PR commits only the committed bundle + `manifest.{committed,extraction_inputs}` —
+the `native` block is authored exclusively by the gated post-merge `mint`,
+so native digests exist iff `mint` wrote them (no advisory-then-wrong window on `main`).
+
+## Tiered CI
+
+The PR tier runs **credential-free** so it can safely execute fork code:
+the danger was never `execute()` itself,
+it was `execute()` holding write creds on a shared writable cache.
+Strip both and a fork's diagnostic runs harmlessly on a throwaway public runner.
+Minting (the only `put`) stays post-merge behind a manual gate.
+
+- **PR — any, incl forks — public `ubuntu-latest`, no secrets, no creds.**
+  Routed by changed-files so the check is *honest*:
+  - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result`
+    on the public runner, then compare the freshly built committed bundle (tolerant).
+    No stored baseline is fetched, so a brand-new test-case validates on its own PR — no follow-up needed.
+    The check is named **`pr-execute`**.
+  - **Extraction changed (`build_execution_result`, `files`/`series`, committed/manifest) and native already exists** ->
+    cheap path: `ref test-cases sync` (public read) -> `replay` -> compare committed bundle.
+    The check is named **`extraction-replay`**.
+  - A fetch *miss* for an **existing** manifest entry is a **failure, not a skip**
+    (today, missing fixtures silently skip -> false green);
+    a *new* entry has no stored native by definition and takes the execute path instead.
+  - **Coverage:** the public runner handles ~90% of test-cases.
+    Some heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and will fail there —
+    those wait for the future private runner (see below); they are not a silent skip.
+- **Merge to `main` — gated mint, the only context holding write creds.**
+  A job bound to a GitHub **Environment (`mint`) with required reviewers**:
+  the run pauses until a maintainer approves, releasing the write creds only onto **already-merged (trusted) code**.
+  On approve: `ref test-cases mint` -> `execute()` + `put` native + author `manifest.native` (bot commit back to `main`).
+  **Mint set** = test-cases whose committed bundle or `extraction_inputs` changed in the merge diff;
+  a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or store-loss recovery.
+- **Nightly — gated runner.**
+  Full sweep; drift detection backstop;
+  opens an auto-PR with the new manifest + a human-readable summary if outputs drift.
+- **Future — private read-only runner.**
+  For the heavy diagnostics the public runner cannot host.
+  Safe to run fork code only if each job is **ephemeral / container-per-job**, mounts conda + data **read-only**,
+  carries **no write creds**, and has **controlled egress** (no shared writable cache — the L46 hazard).
+  Deferred: start public-only.
+
+## PR workflow
+
+Two phases: the diagnostic **author** first builds and runs the test-case **locally**,
+then opens a PR that the **reviewer** drives through CI to merge.
+
+### 1. Author a new diagnostic (local, no creds)
+
+```mermaid
+sequenceDiagram
+    actor Author as Diagnostic author
+    participant Repo as Local checkout
+    participant CLI as ref test-cases
+    participant Store as NativeStore
+
+    Author->>Repo: add diagnostic (new class + test cases)
+    Author->>CLI: ref test-cases run --provider P --diagnostic D
+    CLI->>CLI: execute() then build_execution_result
+    CLI-->>Repo: write committed bundle + local manifest
+
+    opt native already minted (changed/existing diagnostic)
+        Author->>CLI: ref test-cases replay --provider P --diagnostic D
+        CLI->>Store: sync native (public read)
+        CLI-->>Author: tolerant diff vs committed bundle
+    end
+
+    Author->>Author: preview committed-bundle diff (small JSON)
+    Author->>Repo: commit committed bundle + manifest.committed/extraction_inputs
+```
+
+A local `run` produces native + advisory digests for preview only —
+the author commits no `native` block.
+Canonical `manifest.native` is authored later by the gated post-merge `mint`.
+
+### 2. Make a PR (review + CI)
+
+```mermaid
+sequenceDiagram
+    actor Author as Diagnostic author
+    participant PR as PR / GitHub
+    participant CI as PR CI (public ubuntu-latest, no creds)
+    actor Reviewer as Reviewer / maintainer
+    participant Mint as Mint job (gated env, write creds)
+    participant Store as NativeStore
+
+    Author->>PR: open PR (diagnostic + committed bundle + manifest.committed/extraction_inputs)
+    PR->>CI: trigger checks (routed by changed files)
+
+    alt new diagnostic or execute() changed
+        CI->>CI: run execute() + build_execution_result (no creds)
+        CI->>CI: compare freshly built committed bundle (tolerant)
+        CI-->>PR: pr-execute check + bundle diff comment
+    else extraction changed, native already exists
+        CI->>Store: ref test-cases sync (public read)
+        alt native fetch hit
+            CI->>CI: replay then compare committed bundle (tolerant)
+            CI-->>PR: extraction-replay check + bundle diff comment
+        else native missing for an existing entry
+            CI-->>PR: FAIL (no silent skip)
+        end
+    end
+
+    Note over CI: heavy ESMValTool ocean diagnostics may exceed public limits (future private runner)
+
+    Reviewer->>PR: review committed-bundle diff (small JSON, human-readable)
+    Reviewer->>PR: approve + merge
+
+    PR->>Mint: merge to main queues mint
+    Mint->>Reviewer: pause for required-reviewer approval
+    Reviewer->>Mint: approve (releases write creds onto trusted code)
+    Mint->>Mint: execute() + build_execution_result
+    Mint->>Store: put native (write creds)
+    Mint->>PR: bot commit manifest.native to main
+```
+
+### 3. CI routing (decision view)
+
+How the PR check is chosen from the changed files, and where each path lands:
+
+```mermaid
+flowchart TD
+    A[PR opened] --> B{What changed?}
+
+    B -->|"new diagnostic or execute() changed"| X["run execute() + build_execution_result on public runner, no creds"]
+    B -->|"extraction only, native exists"| C["sync native (public read)"]
+
+    C --> D{Native fetch hit?}
+    D -->|yes| E["replay then rebuild committed bundle"]
+    D -->|"no (existing entry)"| F["FAIL: no silent skip"]
+
+    X --> G{Diff within tolerance?}
+    E --> G
+    G -->|yes| H["check green (pr-execute / extraction-replay)"]
+    G -->|no| I["red: bundle diff comment, author fixes"]
+    I --> A
+    F --> A
+
+    H --> M{Reviewer approves + merges?}
+    M -->|no| A
+    M -->|yes| P["merge to main queues gated mint"]
+    P --> Q{Maintainer approves mint?}
+    Q -->|yes| O["put native + author manifest.native (bot commit)"]
+    Q -->|no| P
+```
+
+The PR runner is public and **credential-free**, so a fork's `execute()` never touches write creds or a writable cache —
+the L46 hazard is removed, not gated around.
+Minting is the only `put`, and it runs **post-merge behind a required-reviewer environment**,
+so write creds reach only already-merged (trusted) code.
+The public runner covers ~90% of test-cases;
+heavy ESMValTool ocean diagnostics that exceed its limits wait for the future private read-only runner.
+
+## Comparison
+
+The regression gate compares a freshly replayed committed bundle to the in-repo reference
+**by content, not by bytes**.
+Byte/digest equality is too strict: it surfaces least-significant-bit float flips
+and other cross-platform noise (a fork replays on stock `ubuntu-latest`, a mint ran elsewhere),
+turning harmless jitter into red CI.
+Content comparison needs awareness of each artifact's shape:
+
+| artifact | comparison |
+| -------- | ---------- |
+| `series.json` (numeric) | structural keys exact; values by relative tolerance (`rtol`/`atol`) |
+| CMEC `diagnostic.json` / `output.json` | structural JSON compare; numerics with tolerance; volatile fields (timestamps, abs paths, library versions, provenance) normalised out via an ignore-list |
+
+- Tolerances are **declared per diagnostic/test-case** — a global default with per-case overrides,
+  since some metrics are inherently noisier than others.
+  (pytest-regressions precedent: `num_regression` tolerances, `data_regression` structural compare.)
+- **Fast path:** if bytes are already equal, pass without the content compare; only mismatches fall through to the tolerant path.
+- The `manifest.committed` sha256 is **not** this comparator — it is the integrity/desync check above.
+
+## Determinism
+
+Content comparison still requires removing avoidable nondeterminism at capture time:
+
+- **Execution-dir timestamp.**
+  ESMValTool writes a `recipe_<YYYYMMDD>_<HHMMSS>` dir that gets baked into `output.json`.
+  Sanitise `recipe_\d{8}_\d{6}` -> a placeholder in both capture and comparison.
+  Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713).
+- **Floats.**
+  Any float values should be compared using a relative tolerance to the baseline.
+  Resolving these kinds of CI-dev machine differences is time consuming.
+- **Binary files.**
+  We don't compare SHAs of generated files as they will differ depending on who generated the files (especially with NetCDF files).
+  We shouldn't try and chase perfect hash consistency.
+  The SHA in the manifest is for validating a fetched blob against its store address, not for gating reproducibility.
+
+# Drawbacks
+
+- The REF CLI owns fetch/mint/credential plumbing —
+  more surface to maintain and test than off-loading to an existing data-management tool.
+- Native is a cache of a regenerable artifact:
+  losing a digest still referenced by a reachable manifest breaks historical replay
+  (mitigated by reachability-aware retention).
+- Committed-bundle minting is effectively a maintainer/CI action for conda providers.
+  external authors usually cannot run `execute()` locally.
+- The public PR runner covers ~90% of test-cases but not all:
+  heavy ESMValTool ocean diagnostics exceed its compute/memory/disk
+  and have no automated PR signal until the future private runner lands.
+- Historical bloat remains: existing committed binaries are not rewritten out of history
+  (deliberate — the repo is already used by externals).
+
+# Rationale and alternatives
+
+**Why two bundles.**
+Keeping the small REF-shaped committed bundle in git
+puts the review signal where reviewers already work (the PR diff)
+and keeps the gate human-readable;
+moving the large binaries out of git removes the bloat
+while still making them fetchable for the extraction test.
+The split maps cleanly onto what the REF already does for catalogs
+(`catalog.yaml` committed, `catalog.paths.yaml` gitignored).
+
+**Why a manifest + CLI rather than committing the bytes.**
+Measured native is too large to commit (pmp 250 MB, ilamb 61 MB).
+A content-addressed manifest gives dedup, per-commit pinning of the exact native bytes,
+and a fetchable fixture without binaries in git.
+The clean in-PR semantic diff comes from the committed bundle, not the native digests
+(which churn on every mint — see the manifest section).
+
+**Alternatives considered:**
+
+- **Commit curated native to git.**
+  Rejected by measurement (too large for pmp/ilamb).
+  Viable only for esmvaltool — not worth a per-provider split of the storage model.
+- **Git LFS.**
+  Pointer churn remains; GitHub LFS egress costs; no anonymous public read.
+- **Regenerate-on-demand, store nothing.**
+  The cheap `extraction-replay` path skips conda + `execute()` by fetching stored native;
+  dropping the store forces every PR onto the slow execute path.
+  Storing native keeps the common extraction-only PR fast.
+  Per-provider the `NativeStore` interface still allows mixing
+  (a provider with a low fetch rate could regenerate instead of store).
+
+**Impact of not doing this.**
+Contributors keep hand-rolling baselines,
+PRs keep shipping output changes without a visible diff,
+and silent upstream regressions stay invisible until a manual check.
+
+# Prior art
+
+- The REF's existing **catalog split** —
+  the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs.
+- **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors.
+- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea;
+  this RFC keeps the client surface to just the REF CLI
+  rather than adding a separate toolchain on contributors.
+
+# Unresolved questions
+
+- **Object-store backend choice** — the main open decision, deferred deliberately.
+  Candidate backends behind the `NativeStore` interface include
+  an S3-compatible bucket or an OCI registry (e.g. GitHub Packages);
+  selection should weigh anonymous-read, trusted-write identity, egress cost,
+  content-addressing/GC, and any storage the org has already provisioned.
+  **No backend is assumed by this RFC.**
+- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB).
+- Retention policy for non-current native versions.
+- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**,
+  or defer to a follow-up.
+- **Private read-only runner** for the heavy diagnostics the public runner cannot host —
+  isolation model (ephemeral container, read-only mounts, no creds, controlled egress) and when to build it.
+- Scope of the manual `mint` **`workflow_dispatch`** (targeting, backfills, store-loss recovery)
+  beyond the automatic merge-diff mint set.
+
+# Future possibilities
+
+- Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs).
+- Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git.
+- A `ref test-cases preview` that serves locally what the API would render for a PR,
+  straight from a synced native fixture (already the `replay` path).
+- Signed manifests for end-to-end baseline provenance.

From afd074fd1d3a1411480c24a27ce4f8db7f38b596 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 11:30:05 +1000
Subject: [PATCH 6/8] chore: reread

---
 text/0005-regression-baselines.md | 175 ++++++++++++------------------
 1 file changed, 71 insertions(+), 104 deletions(-)

diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md
index 7598116..277cce1 100644
--- a/text/0005-regression-baselines.md
+++ b/text/0005-regression-baselines.md
@@ -4,21 +4,20 @@
 
 # Summary
 
-Effectively diagnostic regression **baselines** for use by the testing suite.
 Having baseline output for a diagnostic is an important contract between the development team and the diagnostic implementor.
+The baseline includes information about what the REF should extract from a diagnostic.
 
 Each baseline (one test-case run) has **two bundles**:
 
-1. A small, human-readable **committed bundle** (the REF-shaped `series.json` + CMEC bundles)
+1. A small, human-readable **committed bundle** (the files ingested by the REF, `series.json` + CMEC bundles)
    committed directly to the source repo.
    This should be stable and able to be reproduced by the CI.
 2. A large, binary **native bundle** (the diagnostic's native outputs)
-   stored outside git in an object store
-   and referenced from a single git-committed **`manifest.json`**
+   stored outside git in an object store and referenced from a single git-committed **`manifest.json`**
    (sha256 digests of the committed bundle, the native files, and the extraction inputs).
 
 The REF CLI (`ref test-cases`) fetches, runs, and mints baselines
-with the goal of making it simple for diagnostic developers to contribute test output without requiring custom pytest implementation.
+with the goal of making it simple for diagnostic developers to contribute test output.
 
 The **object-store backend is left as an explicit open decision** (see Unresolved questions) —
 this RFC specifies the *layering, manifest, CLI, and CI*,
@@ -32,7 +31,8 @@ Diagnostic regression testing today is flaky, opaque, and bloats the repo:
   The git repo size is growing in a way that isn't sustainable.
   We want to minimise the volume of data needed to setup the repo.
 - **No clear "this-changed" signal.**
-  It's hard to see the failures for new diagnostics as we aren't following a consitent process.
+  It's hard to see the failures for diagnostics.
+  There are lots of CI output making it difficult to see what has broken.
   The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR
   and is what reviewers most need to see.
 - **The extraction regression never runs on a PR.**
@@ -41,15 +41,9 @@ Diagnostic regression testing today is flaky, opaque, and bloats the repo:
   The integration tests are currently broken and require a refresh of the regression data.
 - **Most diagnostic authors are external scientists.**
   Require minimal setup, no new credentials.
-  A contributor should need only the REF.
+  A contributor should need only the REF and shouldn't need to touch more than a few files to add a diagnostic.
 - **Security.**
-  The integration runner shares a writable cache so we must be careful that fork code must never reach it or any write credential.
-
-The native bundle is not a review-only artifact as it is the input fixtures for integration tests.
-`Diagnostic.run()` is `execute()` (runs the diagnostic, writes native)
-then `build_execution_result()` (reads native -> CMEC bundle + series).
-`RegressionValidator` already replays *committed native* through `build_execution_result` without re-running the diagnostic.
-That makes a fast, fork-safe extraction test possible on a stock runner — *if* the native is fetchable.
+  The integration runner shares a writable cache so we must be careful that fork code must never reach it or any write credentials.
 
 # Reference-level explanation
 
@@ -57,7 +51,7 @@ That makes a fast, fork-safe extraction test possible on a stock runner — *if*
 
 For each `(provider, diagnostic, test-case)`:
 
-```
+```text
 packages/<provider>/tests/test-data/<diagnostic>/<test-case>/
   catalog.yaml            # committed: dataset metadata (exists today, content-hashed)
   catalog.paths.yaml      # gitignored: per-user local paths (exists today)
@@ -81,44 +75,48 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor
 
 - `committed` binds the committed bundle bytes to the manifest **for integrity only**:
   CI recomputes the committed digests and asserts they equal `manifest.committed`.
-  A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly** —
-  closing the one consistency leg content-addressing does not cover.
-  This is a tamper/desync check on the in-repo files, **not** the regression comparator
-  (a re-run is compared by *content with tolerance*, not by digest — see Comparison).
-- `native` lists the curated native blobs by content digest (dedups, survives history).
-  These digests are **authored only by `mint`**: the value is the sha256 of the exact bytes
-  `mint` uploaded to the store, so it is a *fetch address + integrity check*, never a reproducibility gate.
+  A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly**.
+- `native` lists the curated native blobs by content digest.
+  These digests are **authored only by the `mint` action**: the value is the sha256 of the exact bytes `mint` uploaded to the store.
 - `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` —
   the real extraction inputs the current catalog hash ignores —
   so a change to extraction behaviour bumps the manifest *at PR time*.
 
-**Native digests do not gate reproducibility.**
+**Digests do not gate reproducibility.**
 Native bytes are nondeterministic (`.nc`/`.png` embed timestamps, library versions, and provenance, floats jitter across platforms),
 so re-running `execute()` on a different machine yields different bytes resulting in a different digest for an unchanged diagnostic.
 The CI therefore never recomputes a native digest and compares it to the manifest.
 A mismatch is expected, and not a failure.
 
-The deterministic gate is the **committed bundle**, compared by content with tolerance (see Comparison).
+The gate for being able to reproduce values deterministically is the **committed bundle**,
+compared by content with tolerance (see Comparison).
 The `native` block **churns on every real mint** (timestamps and maybe SHAs shift),
 so it is not a clean "native meaningfully changed" signal.
 Only the `committed` block is a semantic diff signal
 Remint when the committed bundle changes or a new test-case lands, not on native drift.
 
-The consistency actually enforced:
+Some of the native content may be useful to compare visually for example figures.
+How that is surface is not yet resolved.
+
+The consistency enforced:
 
-| pair                                                 | enforced              | how                                             |
-| ---------------------------------------------------- | --------------------- | ----------------------------------------------- |
-| in-repo committed bundle ↔ `manifest.committed`      | yes — integrity gate  | CI recompute equals (tamper/desync check)       |
-| replayed committed bundle ↔ in-repo committed bundle | yes — regression gate | content compare with tolerance (see Comparison) |
-| stored native bytes ↔ `manifest.native` digest       | by construction       | `mint` hashes what it uploads                   |
-| re-executed native ↔ `manifest.native` digest        | **no — by design**    | nondeterministic; gate is the committed bundle  |
+| pair                                                  | enforced              | how                                             |
+| ----------------------------------------------------- | --------------------- | ----------------------------------------------- |
+| in-repo committed bundle <> `manifest.committed`      | yes — integrity gate  | CI recompute equals (tamper/desync check)       |
+| replayed committed bundle <> in-repo committed bundle | yes — regression gate | content compare with tolerance (see Comparison) |
+| stored native bytes <> `manifest.native` digest       | by construction       | `mint` hashes what it uploads                   |
+| re-executed native <> `manifest.native` digest        | **no — by design**    | nondeterministic; gate is the committed bundle  |
 
-## Curated capture (kills the bloat at the source)
+## Curated capture
+
+There will be additional files in the native bundle that aren't needed,
+including the climate data that ESMValTool includes in the output directory.
 
 The diagnostic already declares which native files matter via `files` + `series` glob patterns
 with production publishing copies only those (the CMEC `output.json` subset).
 The regression capture must use the **same curation** instead of copying the whole working directory.
-Diagnostic authors control what is stored/pruned through those existing patterns — no new mechanism.
+Diagnostic authors control what is stored/pruned through those existing patterns,
+and any errors in this process will be surfaced as differences in the committed bundle.
 
 ## `NativeStore` interface
 
@@ -137,14 +135,13 @@ Required properties of whatever backend is chosen:
 
 - **Anonymous public read** so fork CI and contributors fetch without credentials.
 - **Writes gated to trusted contexts only** (CI on `main`, or maintainers).
-- Content-addressed (sha256) storage;
-  native addressed **per-test-case** so a fetch is one diagnostic's output (KB–32 MB).
 - Retention that **never expires a digest still referenced by a reachable manifest**.
+- Potentially using content-addressed (sha256) storage, but this complicates the expiration.
 
 The backend will be prototyped using an R2 bucket on Cloudflare,
 with the CI having write credentials to mint new baselines.
 
-## CLI (extends the existing `ref test-cases` app)
+## CLI (extends the existing `ref test-cases` cli)
 
 | Subcommand                           | Purpose                                                                               | Creds              |
 | ------------------------------------ | ------------------------------------------------------------------------------------- | ------------------ |
@@ -155,48 +152,45 @@ with the CI having write credentials to mint new baselines.
 
 `mint` is the only credentialed verb, and the only writer of canonical native digests,
 so anonymous users can fetch the native baseline.
-A dev `run` produces native and a local manifest for previewing the committed-bundle diff;
-its native digests are advisory and never committed.
+A dev `run` produces native and a local manifest for previewing the committed-bundle diff
+with the native digests being advisory and never committed.
 A PR commits only the committed bundle + `manifest.{committed,extraction_inputs}` —
-the `native` block is authored exclusively by the gated post-merge `mint`,
-so native digests exist iff `mint` wrote them (no advisory-then-wrong window on `main`).
+the `native` block is authored exclusively by the post-merge `mint` action so native digests exist iff `mint` wrote them.
 
 ## Tiered CI
 
-The PR tier runs **credential-free** so it can safely execute fork code:
-the danger was never `execute()` itself,
-it was `execute()` holding write creds on a shared writable cache.
-Strip both and a fork's diagnostic runs harmlessly on a throwaway public runner.
-Minting (the only `put`) stays post-merge behind a manual gate.
+The PR tier runs **credential-free** so it can safely execute fork code.
+Self-hosted runners should never run `execute()` holding write creds on a shared writable cache.
+
+Minting stays post-merge behind a manual gate as it requires credentials.
 
 - **PR — any, incl forks — public `ubuntu-latest`, no secrets, no creds.**
   Routed by changed-files so the check is *honest*:
-  - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result`
-    on the public runner, then compare the freshly built committed bundle (tolerant).
-    No stored baseline is fetched, so a brand-new test-case validates on its own PR — no follow-up needed.
-    The check is named **`pr-execute`**.
+  - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result` on the public runner,
+    then compare the freshly built committed bundle against what is in git.
+    At this point, any existing native content may be invalidated.
   - **Extraction changed (`build_execution_result`, `files`/`series`, committed/manifest) and native already exists** ->
-    cheap path: `ref test-cases sync` (public read) -> `replay` -> compare committed bundle.
-    The check is named **`extraction-replay`**.
+    cheap path: `ref test-cases sync` (read native content) -> `replay` -> compare committed bundle.
   - A fetch *miss* for an **existing** manifest entry is a **failure, not a skip**
     (today, missing fixtures silently skip -> false green);
     a *new* entry has no stored native by definition and takes the execute path instead.
   - **Coverage:** the public runner handles ~90% of test-cases.
     Some heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and will fail there —
     those wait for the future private runner (see below); they are not a silent skip.
-- **Merge to `main` — gated mint, the only context holding write creds.**
-  A job bound to a GitHub **Environment (`mint`) with required reviewers**:
+- **Merge to `main`**
+  A job bound to a GitHub **Environment (`mint`)**:
   the run pauses until a maintainer approves, releasing the write creds only onto **already-merged (trusted) code**.
   On approve: `ref test-cases mint` -> `execute()` + `put` native + author `manifest.native` (bot commit back to `main`).
   **Mint set** = test-cases whose committed bundle or `extraction_inputs` changed in the merge diff;
-  a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or store-loss recovery.
+  a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or manual fixes.
 - **Nightly — gated runner.**
-  Full sweep; drift detection backstop;
-  opens an auto-PR with the new manifest + a human-readable summary if outputs drift.
+  Full sweep to check for any drift.
+  This will `execute()`, `build_execution_result` and compare.
+  This job should output a human-readable summary as part of the workflow.
 - **Future — private read-only runner.**
   For the heavy diagnostics the public runner cannot host.
   Safe to run fork code only if each job is **ephemeral / container-per-job**, mounts conda + data **read-only**,
-  carries **no write creds**, and has **controlled egress** (no shared writable cache — the L46 hazard).
+  carries **no write creds**, and has **controlled egress** (no shared writable cache).
   Deferred: start public-only.
 
 ## PR workflow
@@ -217,19 +211,11 @@ sequenceDiagram
     Author->>CLI: ref test-cases run --provider P --diagnostic D
     CLI->>CLI: execute() then build_execution_result
     CLI-->>Repo: write committed bundle + local manifest
-
-    opt native already minted (changed/existing diagnostic)
-        Author->>CLI: ref test-cases replay --provider P --diagnostic D
-        CLI->>Store: sync native (public read)
-        CLI-->>Author: tolerant diff vs committed bundle
-    end
-
     Author->>Author: preview committed-bundle diff (small JSON)
     Author->>Repo: commit committed bundle + manifest.committed/extraction_inputs
 ```
 
-A local `run` produces native + advisory digests for preview only —
-the author commits no `native` block.
+A local `run` produces native + advisory digests for preview only — the author commits no `native` block.
 Canonical `manifest.native` is authored later by the gated post-merge `mint`.
 
 ### 2. Make a PR (review + CI)
@@ -303,32 +289,17 @@ flowchart TD
     Q -->|no| P
 ```
 
-The PR runner is public and **credential-free**, so a fork's `execute()` never touches write creds or a writable cache —
-the L46 hazard is removed, not gated around.
-Minting is the only `put`, and it runs **post-merge behind a required-reviewer environment**,
-so write creds reach only already-merged (trusted) code.
-The public runner covers ~90% of test-cases;
-heavy ESMValTool ocean diagnostics that exceed its limits wait for the future private read-only runner.
-
 ## Comparison
 
-The regression gate compares a freshly replayed committed bundle to the in-repo reference
-**by content, not by bytes**.
-Byte/digest equality is too strict: it surfaces least-significant-bit float flips
-and other cross-platform noise (a fork replays on stock `ubuntu-latest`, a mint ran elsewhere),
-turning harmless jitter into red CI.
-Content comparison needs awareness of each artifact's shape:
-
-| artifact | comparison |
-| -------- | ---------- |
-| `series.json` (numeric) | structural keys exact; values by relative tolerance (`rtol`/`atol`) |
-| CMEC `diagnostic.json` / `output.json` | structural JSON compare; numerics with tolerance; volatile fields (timestamps, abs paths, library versions, provenance) normalised out via an ignore-list |
-
-- Tolerances are **declared per diagnostic/test-case** — a global default with per-case overrides,
-  since some metrics are inherently noisier than others.
-  (pytest-regressions precedent: `num_regression` tolerances, `data_regression` structural compare.)
+The regression gate compares a freshly replayed committed bundle to the in-repo reference **by content, not by bytes**.
+Byte/digest equality is too strict as the artifacts may include floats which will lead to flaky test results.
+
 - **Fast path:** if bytes are already equal, pass without the content compare; only mismatches fall through to the tolerant path.
-- The `manifest.committed` sha256 is **not** this comparator — it is the integrity/desync check above.
+- Tolerances are **declared per diagnostic/test-case** — a global default with per-case overrides.
+
+We are comparing the content of the files which are ingested by the REF.
+Changes to other native files aren't covered by this approach other than the existance of them.
+This could be extended to cover these files, but we don't control their shape.
 
 ## Determinism
 
@@ -349,17 +320,15 @@ Content comparison still requires removing avoidable nondeterminism at capture t
 # Drawbacks
 
 - The REF CLI owns fetch/mint/credential plumbing —
-  more surface to maintain and test than off-loading to an existing data-management tool.
+  more surface to maintain and test, but that is easier than a sprawl of user tests.
 - Native is a cache of a regenerable artifact:
-  losing a digest still referenced by a reachable manifest breaks historical replay
-  (mitigated by reachability-aware retention).
-- Committed-bundle minting is effectively a maintainer/CI action for conda providers.
-  external authors usually cannot run `execute()` locally.
+  losing a digest still referenced by a reachable manifest breaks historical replay.
+  They can always be replaced.
 - The public PR runner covers ~90% of test-cases but not all:
   heavy ESMValTool ocean diagnostics exceed its compute/memory/disk
   and have no automated PR signal until the future private runner lands.
 - Historical bloat remains: existing committed binaries are not rewritten out of history
-  (deliberate — the repo is already used by externals).
+  (the repo is already used by externals so rewriting history is hard).
 
 # Rationale and alternatives
 
@@ -382,10 +351,9 @@ The clean in-PR semantic diff comes from the committed bundle, not the native di
 **Alternatives considered:**
 
 - **Commit curated native to git.**
-  Rejected by measurement (too large for pmp/ilamb).
-  Viable only for esmvaltool — not worth a per-provider split of the storage model.
+  Curating causes the inability to regenerate the output.json file.
 - **Git LFS.**
-  Pointer churn remains; GitHub LFS egress costs; no anonymous public read.
+  Pointer churn remains; GitHub LFS egress costs are very low and it requires additional steps when checking out.
 - **Regenerate-on-demand, store nothing.**
   The cheap `extraction-replay` path skips conda + `execute()` by fetching stored native;
   dropping the store forces every PR onto the slow execute path.
@@ -393,6 +361,8 @@ The clean in-PR semantic diff comes from the committed bundle, not the native di
   Per-provider the `NativeStore` interface still allows mixing
   (a provider with a low fetch rate could regenerate instead of store).
 
+  Generally it would take significant CI time to regenerate on every commit.
+
 **Impact of not doing this.**
 Contributors keep hand-rolling baselines,
 PRs keep shipping output changes without a visible diff,
@@ -403,19 +373,16 @@ and silent upstream regressions stay invisible until a manual check.
 - The REF's existing **catalog split** —
   the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs.
 - **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors.
+- `ref-sample-data` used CI owned data,
 - Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea;
   this RFC keeps the client surface to just the REF CLI
   rather than adding a separate toolchain on contributors.
 
 # Unresolved questions
 
-- **Object-store backend choice** — the main open decision, deferred deliberately.
-  Candidate backends behind the `NativeStore` interface include
-  an S3-compatible bucket or an OCI registry (e.g. GitHub Packages);
-  selection should weigh anonymous-read, trusted-write identity, egress cost,
-  content-addressing/GC, and any storage the org has already provisioned.
+- **Object-store backend choice** — the main open decision.
+  The most obvious choice is another R2 bucket similar to the datasets.
   **No backend is assumed by this RFC.**
-- Per-provider **store-vs-regenerate** for the large providers (pmp 250 MB).
 - Retention policy for non-current native versions.
 - Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**,
   or defer to a follow-up.

From 4309918475a91ddb4822d556d89fc510e720bb9e Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 12:18:34 +1000
Subject: [PATCH 7/8] chore: add a bump field

---
 text/0005-regression-baselines.md | 81 ++++++++++++++++++-------------
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md
index 277cce1..8baf823 100644
--- a/text/0005-regression-baselines.md
+++ b/text/0005-regression-baselines.md
@@ -68,19 +68,28 @@ The native bytes (provenance + matched `*.nc` / `*.png`) live in the object stor
 
 ```json
 { "schema": 1,
+  "test_case_version": 1,
   "committed":  { "series.json": "<sha256>", "diagnostic.json": "<sha256>", "output.json": "<sha256>" },
-  "native":  { "<relpath>": { "sha256": "...", "size": ... } },
-  "extraction_inputs": { "files_series_digest": "<sha256>", "input_selectors_digest": "<sha256>" } }
+  "native":  { "<relpath>": { "sha256": "...", "size": ... } } }
 ```
 
+- `test_case_version` is a **monotonic integer** the author/maintainer bumps to declare a new baseline —
+  the deliberate "re-run and re-mint" trigger.
+  Instead of relying on hashes or code, diagnostic developers control when a test case is minted.
+  CI **hard-fails** a PR whose committed bundle changed without a `test_case_version` bump
+  (a forced bump with identical output is harmless — it just re-mints native).
+  Existing test-cases seed at `1`.
 - `committed` binds the committed bundle bytes to the manifest **for integrity only**:
   CI recomputes the committed digests and asserts they equal `manifest.committed`.
   A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly**.
 - `native` lists the curated native blobs by content digest.
   These digests are **authored only by the `mint` action**: the value is the sha256 of the exact bytes `mint` uploaded to the store.
-- `extraction_inputs` hashes the diagnostic's `files`/`series` patterns and the catalog `input_selectors` —
-  the real extraction inputs the current catalog hash ignores —
-  so a change to extraction behaviour bumps the manifest *at PR time*.
+
+The provenance of *what produced* a baseline (diagnostic version, provider version, datasets, selectors)
+is **not** recomputed here.
+The addition of a manifest file `execution.json` [Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3) could provide this information.
+This manifest file records the run's **identity** (`diagnostic_slug`, `diagnostic_version`, `provider_slug`, `provider_version`, `dataset_hash`, `datasets` by `instance_id`, `selectors`).
+This file should be included in the committed set when implemented.
 
 **Digests do not gate reproducibility.**
 Native bytes are nondeterministic (`.nc`/`.png` embed timestamps, library versions, and provenance, floats jitter across platforms),
@@ -93,7 +102,7 @@ compared by content with tolerance (see Comparison).
 The `native` block **churns on every real mint** (timestamps and maybe SHAs shift),
 so it is not a clean "native meaningfully changed" signal.
 Only the `committed` block is a semantic diff signal
-Remint when the committed bundle changes or a new test-case lands, not on native drift.
+Remint on a `test_case_version` bump (a new baseline, or a new test-case), not on native drift.
 
 Some of the native content may be useful to compare visually for example figures.
 How that is surface is not yet resolved.
@@ -154,7 +163,7 @@ with the CI having write credentials to mint new baselines.
 so anonymous users can fetch the native baseline.
 A dev `run` produces native and a local manifest for previewing the committed-bundle diff
 with the native digests being advisory and never committed.
-A PR commits only the committed bundle + `manifest.{committed,extraction_inputs}` —
+A PR commits only the committed bundle + `manifest.{test_case_version,committed}` —
 the `native` block is authored exclusively by the post-merge `mint` action so native digests exist iff `mint` wrote them.
 
 ## Tiered CI
@@ -165,15 +174,19 @@ Self-hosted runners should never run `execute()` holding write creds on a shared
 Minting stays post-merge behind a manual gate as it requires credentials.
 
 - **PR — any, incl forks — public `ubuntu-latest`, no secrets, no creds.**
-  Routed by changed-files so the check is *honest*:
-  - **New diagnostic / new test-case, or `execute()` changed** -> run `execute()` + `build_execution_result` on the public runner,
+  Routed by the `test_case_version` bump so the check is *honest*:
+  - **`test_case_version` bumped, or a new test-case** -> `pr-test-case-execute`:
+    run `execute()` + `build_execution_result` on the public runner,
     then compare the freshly built committed bundle against what is in git.
-    At this point, any existing native content may be invalidated.
-  - **Extraction changed (`build_execution_result`, `files`/`series`, committed/manifest) and native already exists** ->
-    cheap path: `ref test-cases sync` (read native content) -> `replay` -> compare committed bundle.
-  - A fetch *miss* for an **existing** manifest entry is a **failure, not a skip**
-    (today, missing fixtures silently skip -> false green);
-    a *new* entry has no stored native by definition and takes the execute path instead.
+    A bumped/new version has no stored native by definition, so nothing is fetched —
+    a brand-new test-case validates on its own PR, no follow-up needed.
+  - **No bump, but extraction code changed (`build_execution_result`, `files`/`series`)** -> `pr-test-case-replay`:
+    cheap path — `ref test-cases sync` (read native) -> `replay` against the *existing* baseline -> compare.
+    Catches unintended extraction drift; a real mismatch here means "you changed output but didn't bump the version."
+  - **Coupling gate.** A committed bundle that changed without a `test_case_version` bump is a **hard failure**
+    ("output changed but baseline version not bumped — intended?").
+    A fetch *miss* for an **existing** entry is likewise a failure, not a silent skip.
+    We may need a field on the test-case to continue on failure.
   - **Coverage:** the public runner handles ~90% of test-cases.
     Some heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and will fail there —
     those wait for the future private runner (see below); they are not a silent skip.
@@ -181,7 +194,7 @@ Minting stays post-merge behind a manual gate as it requires credentials.
   A job bound to a GitHub **Environment (`mint`)**:
   the run pauses until a maintainer approves, releasing the write creds only onto **already-merged (trusted) code**.
   On approve: `ref test-cases mint` -> `execute()` + `put` native + author `manifest.native` (bot commit back to `main`).
-  **Mint set** = test-cases whose committed bundle or `extraction_inputs` changed in the merge diff;
+  **Mint set** = test-cases whose `test_case_version` was bumped in the merge diff (plus any brand-new test-case);
   a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or manual fixes.
 - **Nightly — gated runner.**
   Full sweep to check for any drift.
@@ -212,7 +225,7 @@ sequenceDiagram
     CLI->>CLI: execute() then build_execution_result
     CLI-->>Repo: write committed bundle + local manifest
     Author->>Author: preview committed-bundle diff (small JSON)
-    Author->>Repo: commit committed bundle + manifest.committed/extraction_inputs
+    Author->>Repo: commit committed bundle + manifest.committed + test_case_version
 ```
 
 A local `run` produces native + advisory digests for preview only — the author commits no `native` block.
@@ -229,21 +242,19 @@ sequenceDiagram
     participant Mint as Mint job (gated env, write creds)
     participant Store as NativeStore
 
-    Author->>PR: open PR (diagnostic + committed bundle + manifest.committed/extraction_inputs)
+    Author->>PR: open PR (diagnostic + committed bundle + manifest.committed + test_case_version)
     PR->>CI: trigger checks (routed by changed files)
 
-    alt new diagnostic or execute() changed
+    alt test_case_version bumped, or new test-case
         CI->>CI: run execute() + build_execution_result (no creds)
         CI->>CI: compare freshly built committed bundle (tolerant)
         CI-->>PR: pr-execute check + bundle diff comment
-    else extraction changed, native already exists
+    else no bump, extraction code changed
         CI->>Store: ref test-cases sync (public read)
-        alt native fetch hit
-            CI->>CI: replay then compare committed bundle (tolerant)
-            CI-->>PR: extraction-replay check + bundle diff comment
-        else native missing for an existing entry
-            CI-->>PR: FAIL (no silent skip)
-        end
+        CI->>CI: replay against existing baseline then compare (tolerant)
+        CI-->>PR: extraction-replay check + bundle diff comment
+    else committed bundle changed without a bump
+        CI-->>PR: FAIL (bump test_case_version to re-baseline)
     end
 
     Note over CI: heavy ESMValTool ocean diagnostics may exceed public limits (future private runner)
@@ -265,18 +276,16 @@ How the PR check is chosen from the changed files, and where each path lands:
 
 ```mermaid
 flowchart TD
-    A[PR opened] --> B{What changed?}
-
-    B -->|"new diagnostic or execute() changed"| X["run execute() + build_execution_result on public runner, no creds"]
-    B -->|"extraction only, native exists"| C["sync native (public read)"]
+    A[PR opened] --> B{test_case_version bumped?}
 
-    C --> D{Native fetch hit?}
-    D -->|yes| E["replay then rebuild committed bundle"]
-    D -->|"no (existing entry)"| F["FAIL: no silent skip"]
+    B -->|"yes (or new test-case)"| X["pr-execute: run execute() + build_execution_result, no creds"]
+    B -->|no| V{committed bundle changed?}
+    V -->|yes| F["FAIL: bump test_case_version to re-baseline"]
+    V -->|"no (extraction code changed)"| C["sync native then replay existing baseline"]
 
     X --> G{Diff within tolerance?}
-    E --> G
-    G -->|yes| H["check green (pr-execute / extraction-replay)"]
+    C --> G
+    G -->|yes| H["check green"]
     G -->|no| I["red: bundle diff comment, author fixes"]
     I --> A
     F --> A
@@ -377,6 +386,8 @@ and silent upstream regressions stay invisible until a manual check.
 - Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea;
   this RFC keeps the client surface to just the REF CLI
   rather than adding a separate toolchain on contributors.
+- **Execution lifecycle ([Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3))** —
+  the `execution.json` self-describing manifest this RFC consumes for baseline provenance information.
 
 # Unresolved questions
 

From 0efcdd05ff08abc69551e2578e96fb4ad77bf8b8 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 12:33:23 +1000
Subject: [PATCH 8/8] chore: update flow chart

---
 text/0005-regression-baselines.md | 45 +++++++++++++++++--------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md
index 8baf823..13bd5ee 100644
--- a/text/0005-regression-baselines.md
+++ b/text/0005-regression-baselines.md
@@ -248,11 +248,11 @@ sequenceDiagram
     alt test_case_version bumped, or new test-case
         CI->>CI: run execute() + build_execution_result (no creds)
         CI->>CI: compare freshly built committed bundle (tolerant)
-        CI-->>PR: pr-execute check + bundle diff comment
+        CI-->>PR: pr-test-case-execute check + bundle diff comment
     else no bump, extraction code changed
         CI->>Store: ref test-cases sync (public read)
         CI->>CI: replay against existing baseline then compare (tolerant)
-        CI-->>PR: extraction-replay check + bundle diff comment
+        CI-->>PR: pr-test-case-replay check + bundle diff comment
     else committed bundle changed without a bump
         CI-->>PR: FAIL (bump test_case_version to re-baseline)
     end
@@ -276,26 +276,31 @@ How the PR check is chosen from the changed files, and where each path lands:
 
 ```mermaid
 flowchart TD
-    A[PR opened] --> B{test_case_version bumped?}
+    Start(["Push to PR"]) --> B{"test_case_version bumped?"}
 
-    B -->|"yes (or new test-case)"| X["pr-execute: run execute() + build_execution_result, no creds"]
-    B -->|no| V{committed bundle changed?}
-    V -->|yes| F["FAIL: bump test_case_version to re-baseline"]
-    V -->|"no (extraction code changed)"| C["sync native then replay existing baseline"]
+    B -->|"yes / new test-case"| X["pr-test-case-execute: run execute() + build_execution_result (no creds)"]
+    B -->|"no"| V{"committed bundle changed?"}
+    V -->|"yes"| F["FAIL: bump test_case_version to re-baseline"]
+    V -->|"no — extraction code changed"| C["pr-test-case-replay: sync native, replay existing baseline"]
 
-    X --> G{Diff within tolerance?}
+    X --> G{"Diff within tolerance?"}
     C --> G
-    G -->|yes| H["check green"]
-    G -->|no| I["red: bundle diff comment, author fixes"]
-    I --> A
-    F --> A
-
-    H --> M{Reviewer approves + merges?}
-    M -->|no| A
-    M -->|yes| P["merge to main queues gated mint"]
-    P --> Q{Maintainer approves mint?}
-    Q -->|yes| O["put native + author manifest.native (bot commit)"]
-    Q -->|no| P
+    G -->|"yes"| M{"Reviewer approves + merges?"}
+    G -->|"no"| F
+
+    F --> Revise(["Author revises, re-pushes"])
+    M -->|"no"| Revise
+    M -->|"yes"| P["merge to main queues gated mint"]
+    P --> Q{"Maintainer approves mint?"}
+    Q -->|"no (waits)"| P
+    Q -->|"yes"| Done(["Merged + manifest.native minted by CI"])
+
+    classDef entry fill:#dff5e1,stroke:#1a7f37,stroke-width:2px;
+    classDef good fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px;
+    classDef warn fill:#fde7c7,stroke:#b45309,stroke-width:2px;
+    class Start entry;
+    class Done good;
+    class Revise warn;
 ```
 
 ## Comparison
@@ -364,7 +369,7 @@ The clean in-PR semantic diff comes from the committed bundle, not the native di
 - **Git LFS.**
   Pointer churn remains; GitHub LFS egress costs are very low and it requires additional steps when checking out.
 - **Regenerate-on-demand, store nothing.**
-  The cheap `extraction-replay` path skips conda + `execute()` by fetching stored native;
+  The cheap `pr-test-case-replay` path skips conda + `execute()` by fetching stored native;
   dropping the store forces every PR onto the slow execute path.
   Storing native keeps the common extraction-only PR fast.
   Per-provider the `NativeStore` interface still allows mixing