diff --git a/text/0005-regression-baselines.md b/text/0005-regression-baselines.md new file mode 100644 index 0000000..13bd5ee --- /dev/null +++ b/text/0005-regression-baselines.md @@ -0,0 +1,416 @@ +- Feature Name: `regression_baselines` +- Start Date: 2026-06-04 +- RFC PR: [CMIP-REF/rfcs#5](https://github.com/CMIP-REF/rfcs/pull/0005) + +# Summary + +Having baseline output for a diagnostic is an important contract between the development team and the diagnostic implementor. +The baseline includes information about what the REF should extract from a diagnostic. + +Each baseline (one test-case run) has **two bundles**: + +1. A small, human-readable **committed bundle** (the files ingested by the REF, `series.json` + CMEC bundles) + committed directly to the source repo. + This should be stable and able to be reproduced by the CI. +2. A large, binary **native bundle** (the diagnostic's native outputs) + stored outside git in an object store and referenced from a single git-committed **`manifest.json`** + (sha256 digests of the committed bundle, the native files, and the extraction inputs). + +The REF CLI (`ref test-cases`) fetches, runs, and mints baselines +with the goal of making it simple for diagnostic developers to contribute test output. + +The **object-store backend is left as an explicit open decision** (see Unresolved questions) — +this RFC specifies the *layering, manifest, CLI, and CI*, +which are backend-agnostic behind a small `NativeStore` interface. + +# Motivation + +Diagnostic regression testing today is flaky, opaque, and bloats the repo: + +- **Bloat.** `tests/.../regression` is ~1 GB on disk. + The git repo size is growing in a way that isn't sustainable. + We want to minimise the volume of data needed to setup the repo. +- **No clear "this-changed" signal.** + It's hard to see the failures for diagnostics. + There are lots of CI output making it difficult to see what has broken. + The REF-shaped output (`series.json`, CMEC bundles) is small JSON that diffs cleanly in the PR + and is what reviewers most need to see. +- **The extraction regression never runs on a PR.** + It runs only post-merge/nightly today, + so a PR that breaks extraction goes green, then red after merge. + The integration tests are currently broken and require a refresh of the regression data. +- **Most diagnostic authors are external scientists.** + Require minimal setup, no new credentials. + A contributor should need only the REF and shouldn't need to touch more than a few files to add a diagnostic. +- **Security.** + The integration runner shares a writable cache so we must be careful that fork code must never reach it or any write credentials. + +# Reference-level explanation + +## Two bundles, one manifest + +For each `(provider, diagnostic, test-case)`: + +```text +packages//tests/test-data/// + catalog.yaml # committed: dataset metadata (exists today, content-hashed) + catalog.paths.yaml # gitignored: per-user local paths (exists today) + regression/ + series.json # committed bundle (the gate + diff signal) + diagnostic.json # committed bundle (CMEC metric bundle) + output.json # committed bundle (CMEC output bundle) + manifest.json # committed: source of truth binding committed bundle <-> native bundle +``` + +The native bytes (provenance + matched `*.nc` / `*.png`) live in the object store, **not** in git. + +`manifest.json`: + +```json +{ "schema": 1, + "test_case_version": 1, + "committed": { "series.json": "", "diagnostic.json": "", "output.json": "" }, + "native": { "": { "sha256": "...", "size": ... } } } +``` + +- `test_case_version` is a **monotonic integer** the author/maintainer bumps to declare a new baseline — + the deliberate "re-run and re-mint" trigger. + Instead of relying on hashes or code, diagnostic developers control when a test case is minted. + CI **hard-fails** a PR whose committed bundle changed without a `test_case_version` bump + (a forced bump with identical output is harmless — it just re-mints native). + Existing test-cases seed at `1`. +- `committed` binds the committed bundle bytes to the manifest **for integrity only**: + CI recomputes the committed digests and asserts they equal `manifest.committed`. + A commit that updates `series.json` but not the manifest (or vice-versa) **fails loudly**. +- `native` lists the curated native blobs by content digest. + These digests are **authored only by the `mint` action**: the value is the sha256 of the exact bytes `mint` uploaded to the store. + +The provenance of *what produced* a baseline (diagnostic version, provider version, datasets, selectors) +is **not** recomputed here. +The addition of a manifest file `execution.json` [Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3) could provide this information. +This manifest file records the run's **identity** (`diagnostic_slug`, `diagnostic_version`, `provider_slug`, `provider_version`, `dataset_hash`, `datasets` by `instance_id`, `selectors`). +This file should be included in the committed set when implemented. + +**Digests do not gate reproducibility.** +Native bytes are nondeterministic (`.nc`/`.png` embed timestamps, library versions, and provenance, floats jitter across platforms), +so re-running `execute()` on a different machine yields different bytes resulting in a different digest for an unchanged diagnostic. +The CI therefore never recomputes a native digest and compares it to the manifest. +A mismatch is expected, and not a failure. + +The gate for being able to reproduce values deterministically is the **committed bundle**, +compared by content with tolerance (see Comparison). +The `native` block **churns on every real mint** (timestamps and maybe SHAs shift), +so it is not a clean "native meaningfully changed" signal. +Only the `committed` block is a semantic diff signal +Remint on a `test_case_version` bump (a new baseline, or a new test-case), not on native drift. + +Some of the native content may be useful to compare visually for example figures. +How that is surface is not yet resolved. + +The consistency enforced: + +| pair | enforced | how | +| ----------------------------------------------------- | --------------------- | ----------------------------------------------- | +| in-repo committed bundle <> `manifest.committed` | yes — integrity gate | CI recompute equals (tamper/desync check) | +| replayed committed bundle <> in-repo committed bundle | yes — regression gate | content compare with tolerance (see Comparison) | +| stored native bytes <> `manifest.native` digest | by construction | `mint` hashes what it uploads | +| re-executed native <> `manifest.native` digest | **no — by design** | nondeterministic; gate is the committed bundle | + +## Curated capture + +There will be additional files in the native bundle that aren't needed, +including the climate data that ESMValTool includes in the output directory. + +The diagnostic already declares which native files matter via `files` + `series` glob patterns +with production publishing copies only those (the CMEC `output.json` subset). +The regression capture must use the **same curation** instead of copying the whole working directory. +Diagnostic authors control what is stored/pruned through those existing patterns, +and any errors in this process will be surfaced as differences in the committed bundle. + +## `NativeStore` interface + +The REF talks to the store through a thin Protocol +so the backend is swappable and unit-testable, +and so this RFC does not have to settle the backend to be implementable: + +```python +class NativeStore(Protocol): + def has(self, digest: str) -> bool: ... + def fetch(self, digest: str, dest: Path) -> None: ... # by sha256; public read, no creds + def put(self, path: Path) -> str: ... # returns sha256; needs write creds +``` + +Required properties of whatever backend is chosen: + +- **Anonymous public read** so fork CI and contributors fetch without credentials. +- **Writes gated to trusted contexts only** (CI on `main`, or maintainers). +- Retention that **never expires a digest still referenced by a reachable manifest**. +- Potentially using content-addressed (sha256) storage, but this complicates the expiration. + +The backend will be prototyped using an R2 bucket on Cloudflare, +with the CI having write credentials to mint new baselines. + +## CLI (extends the existing `ref test-cases` cli) + +| Subcommand | Purpose | Creds | +| ------------------------------------ | ------------------------------------------------------------------------------------- | ------------------ | +| `run --provider P --diagnostic D` | run `execute()` + `build_execution_result`, write committed bundle + a *local* manifest | none (local) | +| `sync [--provider P]` | fetch native blobs named by the committed manifest(s) | none (public read) | +| `replay --provider P --diagnostic D` | fetch native, run `build_execution_result`, compare to committed bundle | none (public read) | +| `mint --provider P` | `run` + `put` native to the store + author the canonical `manifest.native` | write creds | + +`mint` is the only credentialed verb, and the only writer of canonical native digests, +so anonymous users can fetch the native baseline. +A dev `run` produces native and a local manifest for previewing the committed-bundle diff +with the native digests being advisory and never committed. +A PR commits only the committed bundle + `manifest.{test_case_version,committed}` — +the `native` block is authored exclusively by the post-merge `mint` action so native digests exist iff `mint` wrote them. + +## Tiered CI + +The PR tier runs **credential-free** so it can safely execute fork code. +Self-hosted runners should never run `execute()` holding write creds on a shared writable cache. + +Minting stays post-merge behind a manual gate as it requires credentials. + +- **PR — any, incl forks — public `ubuntu-latest`, no secrets, no creds.** + Routed by the `test_case_version` bump so the check is *honest*: + - **`test_case_version` bumped, or a new test-case** -> `pr-test-case-execute`: + run `execute()` + `build_execution_result` on the public runner, + then compare the freshly built committed bundle against what is in git. + A bumped/new version has no stored native by definition, so nothing is fetched — + a brand-new test-case validates on its own PR, no follow-up needed. + - **No bump, but extraction code changed (`build_execution_result`, `files`/`series`)** -> `pr-test-case-replay`: + cheap path — `ref test-cases sync` (read native) -> `replay` against the *existing* baseline -> compare. + Catches unintended extraction drift; a real mismatch here means "you changed output but didn't bump the version." + - **Coupling gate.** A committed bundle that changed without a `test_case_version` bump is a **hard failure** + ("output changed but baseline version not bumped — intended?"). + A fetch *miss* for an **existing** entry is likewise a failure, not a silent skip. + We may need a field on the test-case to continue on failure. + - **Coverage:** the public runner handles ~90% of test-cases. + Some heavy ESMValTool ocean diagnostics exceed its compute/memory/disk and will fail there — + those wait for the future private runner (see below); they are not a silent skip. +- **Merge to `main`** + A job bound to a GitHub **Environment (`mint`)**: + the run pauses until a maintainer approves, releasing the write creds only onto **already-merged (trusted) code**. + On approve: `ref test-cases mint` -> `execute()` + `put` native + author `manifest.native` (bot commit back to `main`). + **Mint set** = test-cases whose `test_case_version` was bumped in the merge diff (plus any brand-new test-case); + a manual `workflow_dispatch` can target a specific `--provider`/`--diagnostic` for backfills or manual fixes. +- **Nightly — gated runner.** + Full sweep to check for any drift. + This will `execute()`, `build_execution_result` and compare. + This job should output a human-readable summary as part of the workflow. +- **Future — private read-only runner.** + For the heavy diagnostics the public runner cannot host. + Safe to run fork code only if each job is **ephemeral / container-per-job**, mounts conda + data **read-only**, + carries **no write creds**, and has **controlled egress** (no shared writable cache). + Deferred: start public-only. + +## PR workflow + +Two phases: the diagnostic **author** first builds and runs the test-case **locally**, +then opens a PR that the **reviewer** drives through CI to merge. + +### 1. Author a new diagnostic (local, no creds) + +```mermaid +sequenceDiagram + actor Author as Diagnostic author + participant Repo as Local checkout + participant CLI as ref test-cases + participant Store as NativeStore + + Author->>Repo: add diagnostic (new class + test cases) + Author->>CLI: ref test-cases run --provider P --diagnostic D + CLI->>CLI: execute() then build_execution_result + CLI-->>Repo: write committed bundle + local manifest + Author->>Author: preview committed-bundle diff (small JSON) + Author->>Repo: commit committed bundle + manifest.committed + test_case_version +``` + +A local `run` produces native + advisory digests for preview only — the author commits no `native` block. +Canonical `manifest.native` is authored later by the gated post-merge `mint`. + +### 2. Make a PR (review + CI) + +```mermaid +sequenceDiagram + actor Author as Diagnostic author + participant PR as PR / GitHub + participant CI as PR CI (public ubuntu-latest, no creds) + actor Reviewer as Reviewer / maintainer + participant Mint as Mint job (gated env, write creds) + participant Store as NativeStore + + Author->>PR: open PR (diagnostic + committed bundle + manifest.committed + test_case_version) + PR->>CI: trigger checks (routed by changed files) + + alt test_case_version bumped, or new test-case + CI->>CI: run execute() + build_execution_result (no creds) + CI->>CI: compare freshly built committed bundle (tolerant) + CI-->>PR: pr-test-case-execute check + bundle diff comment + else no bump, extraction code changed + CI->>Store: ref test-cases sync (public read) + CI->>CI: replay against existing baseline then compare (tolerant) + CI-->>PR: pr-test-case-replay check + bundle diff comment + else committed bundle changed without a bump + CI-->>PR: FAIL (bump test_case_version to re-baseline) + end + + Note over CI: heavy ESMValTool ocean diagnostics may exceed public limits (future private runner) + + Reviewer->>PR: review committed-bundle diff (small JSON, human-readable) + Reviewer->>PR: approve + merge + + PR->>Mint: merge to main queues mint + Mint->>Reviewer: pause for required-reviewer approval + Reviewer->>Mint: approve (releases write creds onto trusted code) + Mint->>Mint: execute() + build_execution_result + Mint->>Store: put native (write creds) + Mint->>PR: bot commit manifest.native to main +``` + +### 3. CI routing (decision view) + +How the PR check is chosen from the changed files, and where each path lands: + +```mermaid +flowchart TD + Start(["Push to PR"]) --> B{"test_case_version bumped?"} + + B -->|"yes / new test-case"| X["pr-test-case-execute: run execute() + build_execution_result (no creds)"] + B -->|"no"| V{"committed bundle changed?"} + V -->|"yes"| F["FAIL: bump test_case_version to re-baseline"] + V -->|"no — extraction code changed"| C["pr-test-case-replay: sync native, replay existing baseline"] + + X --> G{"Diff within tolerance?"} + C --> G + G -->|"yes"| M{"Reviewer approves + merges?"} + G -->|"no"| F + + F --> Revise(["Author revises, re-pushes"]) + M -->|"no"| Revise + M -->|"yes"| P["merge to main queues gated mint"] + P --> Q{"Maintainer approves mint?"} + Q -->|"no (waits)"| P + Q -->|"yes"| Done(["Merged + manifest.native minted by CI"]) + + classDef entry fill:#dff5e1,stroke:#1a7f37,stroke-width:2px; + classDef good fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px; + classDef warn fill:#fde7c7,stroke:#b45309,stroke-width:2px; + class Start entry; + class Done good; + class Revise warn; +``` + +## Comparison + +The regression gate compares a freshly replayed committed bundle to the in-repo reference **by content, not by bytes**. +Byte/digest equality is too strict as the artifacts may include floats which will lead to flaky test results. + +- **Fast path:** if bytes are already equal, pass without the content compare; only mismatches fall through to the tolerant path. +- Tolerances are **declared per diagnostic/test-case** — a global default with per-case overrides. + +We are comparing the content of the files which are ingested by the REF. +Changes to other native files aren't covered by this approach other than the existance of them. +This could be extended to cover these files, but we don't control their shape. + +## Determinism + +Content comparison still requires removing avoidable nondeterminism at capture time: + +- **Execution-dir timestamp.** + ESMValTool writes a `recipe__` dir that gets baked into `output.json`. + Sanitise `recipe_\d{8}_\d{6}` -> a placeholder in both capture and comparison. + Tracked as [Climate-REF/climate-ref#713](https://github.com/Climate-REF/climate-ref/issues/713). +- **Floats.** + Any float values should be compared using a relative tolerance to the baseline. + Resolving these kinds of CI-dev machine differences is time consuming. +- **Binary files.** + We don't compare SHAs of generated files as they will differ depending on who generated the files (especially with NetCDF files). + We shouldn't try and chase perfect hash consistency. + The SHA in the manifest is for validating a fetched blob against its store address, not for gating reproducibility. + +# Drawbacks + +- The REF CLI owns fetch/mint/credential plumbing — + more surface to maintain and test, but that is easier than a sprawl of user tests. +- Native is a cache of a regenerable artifact: + losing a digest still referenced by a reachable manifest breaks historical replay. + They can always be replaced. +- The public PR runner covers ~90% of test-cases but not all: + heavy ESMValTool ocean diagnostics exceed its compute/memory/disk + and have no automated PR signal until the future private runner lands. +- Historical bloat remains: existing committed binaries are not rewritten out of history + (the repo is already used by externals so rewriting history is hard). + +# Rationale and alternatives + +**Why two bundles.** +Keeping the small REF-shaped committed bundle in git +puts the review signal where reviewers already work (the PR diff) +and keeps the gate human-readable; +moving the large binaries out of git removes the bloat +while still making them fetchable for the extraction test. +The split maps cleanly onto what the REF already does for catalogs +(`catalog.yaml` committed, `catalog.paths.yaml` gitignored). + +**Why a manifest + CLI rather than committing the bytes.** +Measured native is too large to commit (pmp 250 MB, ilamb 61 MB). +A content-addressed manifest gives dedup, per-commit pinning of the exact native bytes, +and a fetchable fixture without binaries in git. +The clean in-PR semantic diff comes from the committed bundle, not the native digests +(which churn on every mint — see the manifest section). + +**Alternatives considered:** + +- **Commit curated native to git.** + Curating causes the inability to regenerate the output.json file. +- **Git LFS.** + Pointer churn remains; GitHub LFS egress costs are very low and it requires additional steps when checking out. +- **Regenerate-on-demand, store nothing.** + The cheap `pr-test-case-replay` path skips conda + `execute()` by fetching stored native; + dropping the store forces every PR onto the slow execute path. + Storing native keeps the common extraction-only PR fast. + Per-provider the `NativeStore` interface still allows mixing + (a provider with a low fetch rate could regenerate instead of store). + + Generally it would take significant CI time to regenerate on every commit. + +**Impact of not doing this.** +Contributors keep hand-rolling baselines, +PRs keep shipping output changes without a visible diff, +and silent upstream regressions stay invisible until a manual check. + +# Prior art + +- The REF's existing **catalog split** — + the exact "commit small metadata, keep bytes out of git" pattern, here extended to outputs. +- **pytest-regressions** (`--force-regen`) — the regenerate-the-committed-bundle workflow this mirrors. +- `ref-sample-data` used CI owned data, +- Content-addressed artifact stores with a small in-git lock (DVC, Snakemake) — same idea; + this RFC keeps the client surface to just the REF CLI + rather than adding a separate toolchain on contributors. +- **Execution lifecycle ([Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3))** — + the `execution.json` self-describing manifest this RFC consumes for baseline provenance information. + +# Unresolved questions + +- **Object-store backend choice** — the main open decision. + The most obvious choice is another R2 bucket similar to the datasets. + **No backend is assumed by this RFC.** +- Retention policy for non-current native versions. +- Whether the PR comment should render a **figure diff** (PNG) / **NetCDF stat diff**, + or defer to a follow-up. +- **Private read-only runner** for the heavy diagnostics the public runner cannot host — + isolation model (ephemeral container, read-only mounts, no creds, controlled egress) and when to build it. +- Scope of the manual `mint` **`workflow_dispatch`** (targeting, backfills, store-loss recovery) + beyond the automatic merge-diff mint set. + +# Future possibilities + +- Rich diff renderers in the PR comment (PNG diffs, NetCDF summary-stat diffs). +- Reuse the same `NativeStore` + manifest for large reference/observational inputs too big for git. +- A `ref test-cases preview` that serves locally what the API would render for a PR, + straight from a synced native fixture (already the `replay` path). +- Signed manifests for end-to-end baseline provenance.