From f9d00349ad0445dd8838b344e5857eccb857970b Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Tue, 12 May 2026 17:34:36 +1000
Subject: [PATCH 01/12] rfc: execution lifecycle consolidation

Consolidate diagnostic execution lifecycle (allocation, dispatch, run,
classify, publish, ingest, finalise) into one deep module behind a single
Transport port. Surface ResourceHint on Diagnostic so providers can declare
memory/CPU/wall-clock once. Capture per-execution Telemetry to enable future
adaptive scheduling without a schema change.
---
 text/0000-execution-lifecycle.md | 822 +++++++++++++++++++++++++++++++
 1 file changed, 822 insertions(+)
 create mode 100644 text/0000-execution-lifecycle.md

diff --git a/text/0000-execution-lifecycle.md b/text/0000-execution-lifecycle.md
new file mode 100644
index 0000000..18d0a69
--- /dev/null
+++ b/text/0000-execution-lifecycle.md
@@ -0,0 +1,822 @@
+- Feature Name: `execution_lifecycle`
+- Start Date: 2026-05-12
+- RFC PR: [Climate-REF/rfcs#0000](https://github.com/Climate-REF/rfcs/pull/0000)
+
+# Summary
+[summary]: #summary
+
+Consolidate the lifecycle of a single diagnostic execution
+— allocation, dispatch, run, classify, publish, ingest, finalise —
+into one deep module (`ExecutionLifecycle`) sitting behind a single `Transport` port,
+and surface a declarative `ResourceHint` on each `Diagnostic` so providers
+can express memory / CPU / wall-clock expectations in one place
+without reaching into the executor or scheduler layer.
+
+# Motivation
+[motivation]: #motivation
+
+## Problem
+
+The lifecycle of one diagnostic execution today is fragmented across roughly
+eight files in two packages.
+A single execution flows through, in order:
+
+1. `solver.py` creates an `Execution` row with a `PLACEHOLDER_FRAGMENT`,
+   calls `assign_execution_fragment` to flush, compute a group-short string,
+   and rewrite the output fragment.
+2. `solver.py` rebuilds the `ExecutionDefinition` via `attrs.evolve`,
+   calls `execution.register_datasets`, expunges the row, commits.
+3. `executor.run(definition, execution)` is called outside the transaction.
+   Three implementations exist:
+   `SynchronousExecutor` runs in-process,
+   `LocalExecutor` submits to a `ProcessPoolExecutor`,
+   `CeleryExecutor` sends to a broker with linked result handlers.
+4. `execute_locally` (in `climate-ref-core`) wraps `diagnostic.run`,
+   classifies exceptions via the private `_is_system_error`,
+   special-cases `CondaCommandError`,
+   and produces an `ExecutionResult` — whose `build_from_output_bundle`
+   factory writes three JSON files to disk despite being a frozen attrs class.
+5. `process_result` and `handle_execution_result` copy log, metric bundle,
+   output bundle, series file, and every bundle-referenced plot/data/html
+   file from scratch to results,
+   then ingest scalar values, series values, and outputs in a nested
+   transaction.
+6. `ExecutionGroup.dirty` is toggled in three different branches across
+   `handle_execution_result` and `mark_execution_failed`.
+
+Concrete problems this fragmentation causes today:
+
+- **Provider authors have no obvious place to express parallelisation hints.**
+  ESMValTool diagnostics that need 16 GB of memory or eight hours of
+  wall-clock have nowhere to declare it.
+  The `LocalExecutor` enforces a 6 hour per-task timeout;
+  the `CeleryExecutor` does not enforce a per-task timeout at all;
+  neither passes resource requirements through to anything that could honour
+  them.
+  When the project eventually adds SLURM or PBS adapters,
+  the natural way to wire memory / CPU / wall-clock into `sbatch` or `qsub`
+  directives does not exist.
+- **Retry classification is split across the file tree.**
+  `_is_system_error` (in `climate-ref-core/executor.py`),
+  `CondaCommandError` handling (in the same file),
+  missing-log detection (in `result_handling.py`),
+  per-task timeout (in `LocalExecutor`),
+  and pool-shutdown abandonment (also in `LocalExecutor`)
+  each set `retryable` independently.
+  There is no single classifier function and no contract between
+  `ExecutionResult.build_from_failure` and the executor about which
+  exception types are retryable.
+- **The `dirty` flag is decided in three places.**
+  Success path sets it `False`;
+  non-retryable failure sets it `False`;
+  retryable failure leaves it `True`;
+  missing log leaves it `True`.
+  These branches are spread across `handle_execution_result` and the
+  Local/Celery executors with no unifying decision function.
+- **Tests have to mock subprocess, filesystem, DB, and Celery.**
+  `tests/unit/test_providers.py` patches `subprocess` directly for the
+  conda provider; `tests/unit/test_executor.py` imports the private
+  `_is_system_error`; integration tests are the only thing that exercises
+  the full happy path.
+- **`ExecutionResult` is not picklable cleanly.**
+  Its frozen factory writes three JSON files;
+  re-construction in a test requires a real tmp_path filesystem.
+- **The `Executor` Protocol is shallow.**
+  Its `run` / `join` interface forces every adapter
+  (Synchronous, Local, Celery, future SLURM/PBS/K8s)
+  to reinvent the same dance:
+  reattach the detached `Execution` row,
+  open a fresh transaction,
+  copy scratch to results,
+  load the CV,
+  ingest scalars and series,
+  apply the dirty rule,
+  mark the row.
+
+## Why now
+
+The CMIP REF is approaching the point where serious deployment targets are
+plausible:
+SLURM and PBS adapters,
+distributed Celery deployments behind a managed broker,
+possibly Kubernetes Jobs.
+Each new transport currently means a new ~250-LOC adapter that re-implements
+the lifecycle dance.
+The same fragmentation also blocks anything resembling
+**informed scheduling** — even simple things like "ESMValTool needs more memory
+than ILAMB" or "the ENSO diagnostic should land on the `bigmem` queue"
+have no representation in the codebase.
+
+This RFC is **not** about replacing SLURM, PBS, or Celery as schedulers.
+It is about defining the seam between *climate-ref* and a scheduler so the
+scheduler has enough information to do its job,
+and so the lifecycle around the scheduler is one robust, well-tested module
+instead of eight thin ones.
+
+## Expected outcomes
+
+1. The solver dispatch loop shrinks from ~50 lines of bookkeeping to ~3.
+2. A new transport adapter (SLURM, PBS, K8s) is a single class
+   implementing `Transport`, around 80–120 LOC,
+   with no copy-paste of fragment allocation, ingestion, CV loading,
+   dirty-flag logic, or `Execution` reattachment.
+3. Provider authors declare `resources: ResourceHint(...)` once on a
+   diagnostic class and the rest of the system honours it.
+4. Per-execution telemetry (peak memory, duration, host)
+   is captured on every outcome and persisted on the `Execution` row,
+   making future adaptive scheduling
+   a feature addition rather than a schema change.
+5. Unit tests run with zero filesystem and zero subprocess
+   via an `InMemoryTransport` and a separated bundle writer.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## Module boundary
+
+```
+┌──────────────────────────────────────────────────────────────────────────────┐
+│ ExecutionLifecycle (climate_ref.lifecycle)                                   │
+│                                                                              │
+│   __init__(config, db, transport, *,                                         │
+│            retry=DefaultRetryPolicy(),                                       │
+│            cv=None,           # default: CV.load_from_file(config.paths…)    │
+│            clock=datetime.utcnow)                                            │
+│                                                                              │
+│   submit(execution, definition) -> None                                      │
+│   drain(timeout=None) -> None                                                │
+│                                                                              │
+│   replay_abandoned() -> list[int]    # find executions stranded across boot  │
+│   dry_run(group, datasets) -> ExecutionDefinition                            │
+│                                                                              │
+│   OWNED (private):                                                           │
+│     _FragmentAllocator   _Classifier   _Promoter      _Ingestor              │
+│     _DirtyRule           _BundleWriter (worker-side)                         │
+└──────────────────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼ Port — the ONLY thing that varies per backend
+┌──────────────────────────────────────────────────────────────────────────────┐
+│ Transport (Protocol, climate_ref_core.lifecycle.ports)                       │
+│                                                                              │
+│   name: ClassVar[str]                                                        │
+│   dispatch(envelope: ExecutionEnvelope) -> None                              │
+│   poll(block: bool, timeout: float | None)                                   │
+│       -> Iterator[ExecutionOutcome]                                          │
+│   shutdown(timeout: float | None) -> None                                    │
+│                                                                              │
+│ Production adapters:                                                         │
+│   InMemoryTransport      (in climate-ref, used by tests & SynchronousExec)   │
+│   ProcessPoolTransport   (in climate-ref, replaces LocalExecutor body)       │
+│   CeleryTransport        (in climate-ref-celery)                             │
+│                                                                              │
+│ Future adapters (NOT in this RFC, but the seam supports them):               │
+│   SlurmTransport         consumes ResourceHint as --mem/--cpus/--time/-p     │
+│   PbsTransport           consumes ResourceHint as -l mem,ncpus,walltime,q    │
+│   K8sTransport           consumes ResourceHint as pod resources + deadline   │
+└──────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Wire types
+
+Three frozen, picklable value types cross the transport boundary
+(process pool, broker, future network hop).
+None of them carry DB sessions, `Config`, or controlled vocabulary.
+
+```python
+@attrs.frozen
+class ResourceHint:
+    """Declarative resource expectation, lives on the diagnostic class.
+
+    Per-execution overrides are produced by Diagnostic.resources_for().
+    """
+    memory_mb: int = 4096
+    cpu: int = 1
+    wall_clock: timedelta = timedelta(hours=2)
+    queue: str | None = None   # transport-specific routing tag
+
+@attrs.frozen
+class ExecutionEnvelope:
+    """Wire format sent to a worker. Picklable."""
+    execution_id: int
+    definition: ExecutionDefinition
+    resources: ResourceHint
+    deadline: datetime         # clock() + resources.wall_clock at submit time
+
+@attrs.frozen
+class Telemetry:
+    """Captured by the worker, persisted on the Execution row by the coordinator."""
+    duration: timedelta
+    peak_rss_mb: int | None
+    host: str
+    exit_code: int | None
+    transport_meta: Mapping[str, str]   # slurm jobid / k8s pod / celery task_id …
+
+@attrs.frozen
+class ExecutionOutcome:
+    """What poll() yields back to the lifecycle."""
+    execution_id: int
+    result: ExecutionResult | None       # None ⇒ transport-side abandonment
+    failure: ExecutionFailure | None     # timeout | broker_lost | pool_shutdown
+    telemetry: Telemetry
+```
+
+`ExecutionResult` becomes pure data.
+The current `build_from_output_bundle` factory
+(which writes three JSON files inside a frozen-attrs constructor)
+is split into `ExecutionResult.from_bundle(definition, bundle)` —
+which constructs the value object without I/O —
+and a separate `_BundleWriter.write(definition, bundle)`
+that is invoked worker-side only.
+
+## Diagnostic-side declaration
+
+```python
+class Diagnostic(AbstractDiagnostic):
+    ...
+    resources: ResourceHint = ResourceHint()
+
+    def resources_for(self, definition: ExecutionDefinition) -> ResourceHint:
+        """Optional per-execution sizing.
+
+        Default returns ``self.resources`` unchanged.
+        Diagnostics whose memory scales with input size override this.
+        """
+        return self.resources
+```
+
+Concrete provider declarations under the new system:
+
+```python
+# packages/climate-ref-esmvaltool/src/climate_ref_esmvaltool/diagnostics/base.py
+class ESMValToolDiagnostic(CommandLineDiagnostic):
+    resources = ResourceHint(
+        memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6),
+    )
+
+# packages/climate-ref-pmp/src/climate_ref_pmp/diagnostics/enso.py
+class EnsoDiagnostic(Diagnostic):
+    resources = ResourceHint(
+        memory_mb=24000, cpu=8, wall_clock=timedelta(hours=8),
+        queue="bigmem",
+    )
+
+# packages/climate-ref-ilamb/src/climate_ref_ilamb/standard.py
+class IlambDiagnostic(Diagnostic):
+    resources = ResourceHint(
+        memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2),
+    )
+    def resources_for(self, defn):
+        n = len(defn.datasets.get_cmip6())
+        return attrs.evolve(self.resources, memory_mb=8000 + 500 * n)
+```
+
+The base `Diagnostic` default of `(4 GB, 1 CPU, 2 h)` is the project-wide
+sensible starting point.
+Providers that need more either override the class attribute
+or override `resources_for` for input-size scaling.
+
+## Solver dispatch — before and after
+
+**Before** (`solver.py:709-757`, condensed):
+
+```python
+execution = Execution(
+    execution_group=execution_group,
+    dataset_hash=definition.datasets.hash,
+    output_fragment=PLACEHOLDER_FRAGMENT,
+    provider_version=potential_execution.provider.version,
+)
+db.session.add(execution)
+fragment = assign_execution_fragment(
+    db.session, execution,
+    provider_slug=provider_slug,
+    diagnostic_slug=potential_execution.diagnostic.slug,
+    selectors=potential_execution.selectors,
+    group_id=execution_group.id,
+)
+definition = attrs.evolve(
+    definition,
+    output_directory=config.paths.scratch.resolve() / pathlib.Path(fragment),
+)
+execution.register_datasets(db, definition.datasets)
+if execute:
+    db.session.expunge(execution)
+    pending = (definition, execution)
+# … commit happens at top of loop …
+if pending is not None:
+    executor.run(definition=pending[0], execution=pending[1])
+…
+executor.join(timeout=timeout)
+```
+
+**After**:
+
+```python
+lifecycle = ExecutionLifecycle(config, db, transport)
+
+for group, datasets, definition in planned_executions:
+    execution = Execution(execution_group=group,
+                         dataset_hash=datasets.hash,
+                         provider_version=definition.diagnostic.provider.version)
+    lifecycle.submit(execution, definition)
+
+lifecycle.drain(timeout=timeout)
+```
+
+`submit` privately handles fragment allocation,
+`register_datasets`,
+the expunge/commit dance,
+and `transport.dispatch`.
+`drain` privately handles polling,
+reattachment,
+artifact promotion,
+ingestion,
+the dirty rule,
+and telemetry persistence.
+
+## End-to-end sequence
+
+```
+Solver        Diagnostic       Lifecycle           Transport          Worker             DB
+  │ submit(e,d)   │                │                    │                  │                 │
+  ├──────────────►│ resources_for(d)│                   │                  │                 │
+  │ ◄──── ResourceHint(...)        │                    │                  │                 │
+  │              │                 │ allocate fragment ─┼──────────────────┼────────────────►│
+  │              │                 │ register_datasets ─┼──────────────────┼────────────────►│
+  │              │                 │ session.expunge                                          │
+  │              │                 │ commit ──────────────────────────────────────────────── │
+  │              │                 │ deadline = now + r.wall_clock                            │
+  │              │                 │ dispatch(envelope)─►│  (slurm sbatch / celery send /    │
+  │              │                 │                    │   processpool.submit / inline)     │
+  │              │                 │                    ├─►│ execute_locally(envelope)      │
+  │              │                 │                    │  │   diagnostic.run               │
+  │              │                 │                    │  │   _BundleWriter.write          │
+  │              │                 │                    │  │   CV.validate (hard-fail)      │
+  │              │                 │                    │  │   capture Telemetry            │
+  │              │                 │                    │  │   return ExecutionOutcome      │
+  │              │                 │                    │◄─│                                │
+  │ drain()      │                 │ poll() ◄───────────│                                    │
+  │ ────────────►│                 │ _finalize:                                              │
+  │              │                 │   session.merge(eid) ───────────────────────────────────►│
+  │              │                 │   _Promoter.copy(scratch→results, log+bundles+refs)     │
+  │              │                 │   _Classifier(failure or result) → Success|Retry|GiveUp │
+  │              │                 │   _Ingestor.upsert(outputs, scalars, series) ──────────►│
+  │              │                 │   _DirtyRule.apply(decision) ──────────────────────────►│
+  │              │                 │   execution.telemetry = outcome.telemetry ─────────────►│
+  │              │                 │   execution.mark_{successful,failed} ─────────────────►│
+  │ done         │                 │                    │                                    │
+```
+
+## Retry classification
+
+A single `RetryPolicy.classify(outcome) -> RetryDecision` replaces:
+`_is_system_error`,
+the `CondaCommandError` branch in `execute_locally`,
+the missing-log branch in `handle_execution_result`,
+the per-task timeout handling in `LocalExecutor.join`,
+the pool-shutdown abandonment in `LocalExecutor._fail_outstanding`.
+
+```python
+class RetryDecision(enum.Enum):
+    SUCCESS = "success"
+    RETRY   = "retry"   # leaves dirty=True
+    GIVE_UP = "give_up" # sets dirty=False, marks failed
+
+class DefaultRetryPolicy:
+    SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)
+    NON_RETRYABLE = (CondaCommandError,)
+
+    def classify(self, outcome: ExecutionOutcome) -> RetryDecision:
+        if outcome.failure is not None:
+            # transport-side: timeout, broker_lost, pool_shutdown → retry
+            return RetryDecision.RETRY
+        result = outcome.result
+        if result is None:
+            return RetryDecision.RETRY
+        if result.successful:
+            return RetryDecision.SUCCESS
+        return (RetryDecision.RETRY if result.retryable
+                else RetryDecision.GIVE_UP)
+```
+
+The dirty rule is derived from the decision in one place:
+
+```python
+class DefaultDirtyRule:
+    def apply(self, execution: Execution, decision: RetryDecision) -> None:
+        if decision in (RetryDecision.SUCCESS, RetryDecision.GIVE_UP):
+            execution.execution_group.dirty = False
+        # RETRY: leave dirty=True so the next solve picks it up
+```
+
+## Idempotent re-ingest
+
+Re-running `_finalize` on the same `execution_id`
+(e.g. a join-time crash followed by `replay_abandoned`)
+must not double-insert.
+The ingester moves to upserts keyed on natural identifiers:
+
+- `ExecutionOutput`: `(execution_id, output_type, short_name)`
+- `ScalarMetricValue`: `(execution_id, dimensions_hash)`
+- `SeriesMetricValue`: `(execution_id, dimensions_hash, index_name)`
+
+Where SQLite is in use the implementation uses
+`INSERT … ON CONFLICT DO NOTHING`;
+on PostgreSQL the same construct works natively.
+The scratch-to-results copy uses `shutil.copy` with `exist_ok=True`
+on the destination directory.
+
+## Telemetry
+
+Three new columns are added to `Execution`:
+
+```sql
+ALTER TABLE execution ADD COLUMN duration_seconds   FLOAT  NULL;
+ALTER TABLE execution ADD COLUMN peak_rss_mb        INTEGER NULL;
+ALTER TABLE execution ADD COLUMN telemetry_meta     JSON   NULL;
+```
+
+`telemetry_meta` is a small JSON blob with hostname, exit code,
+and transport-specific identifiers (Celery task id, SLURM job id, …).
+None of these are read by the system at solve time;
+they exist so that a future adaptive resource provider
+can ask "what was the rolling p95 memory for diagnostic X
+on dataset shape Y?" without a schema migration.
+
+The shape of that future provider is **not** in this RFC.
+A natural seam is:
+
+```python
+class ResourceProvider(Protocol):
+    def hint_for(self, definition: ExecutionDefinition) -> ResourceHint: ...
+```
+
+with a `StaticResourceProvider` (uses `diagnostic.resources_for`)
+and a future `AdaptiveResourceProvider` (reads telemetry, applies multipliers).
+`ExecutionLifecycle` would gain a `resources:` constructor kwarg defaulting
+to `StaticResourceProvider`.
+This is intentionally deferred — the column existing today
+is the only commitment.
+
+## Module / file layout
+
+```
+packages/climate-ref-core/src/climate_ref_core/
+├── lifecycle/
+│   ├── __init__.py            ← public re-exports
+│   ├── ports.py               ← Transport, ResourceHint, Envelope, Outcome, Telemetry
+│   ├── results.py             ← ExecutionResult (pure data) + from_bundle factory
+│   └── worker.py              ← execute_worker(envelope) — picklable entry point
+│
+packages/climate-ref/src/climate_ref/
+├── lifecycle.py               ← ExecutionLifecycle class
+├── transports/
+│   ├── inmemory.py            ← InMemoryTransport
+│   └── processpool.py         ← ProcessPoolTransport (was LocalExecutor)
+└── executor/                  ← deleted; the Executor Protocol goes away
+│
+packages/climate-ref-celery/src/climate_ref_celery/
+└── transport.py               ← CeleryTransport (was CeleryExecutor)
+```
+
+The existing `Executor` Protocol in `climate-ref-core/executor.py`
+and the three concrete executors are deleted.
+`execute_locally` becomes a thin wrapper around `execute_worker`
+kept for backwards compatibility for one release cycle, then removed.
+
+## Test impact
+
+**Deleted tests** (boundary tests will replace them):
+
+- `tests/unit/test_executor.py::test_is_system_error_*`
+  — private-import tests on `_is_system_error`
+- `tests/unit/test_providers.py::test_conda_*` subprocess patch chains
+  for the conda provider's `run()` are replaced by transport-level boundary tests
+- `tests/unit/executor/test_synchronous.py::*` reattachment tests
+- `tests/unit/executor/test_local.py::*` future-handling tests
+  beyond what `ProcessPoolTransport`'s own tests cover
+- Per-executor `mark_execution_failed` mock-chain tests
+
+**New boundary tests** (against `ExecutionLifecycle` with `InMemoryTransport`):
+
+- `submit` allocates a unique fragment per execution
+- `submit` persists `register_datasets` even when the transport dispatch fails
+- `drain` ingests one successful outcome end-to-end and sets `dirty=False`
+- `drain` re-dispatches a retryable failure on the next solve
+  by leaving `dirty=True`
+- `drain` marks a non-retryable failure with `dirty=False` and `successful=False`
+- `drain` is idempotent: re-draining the same outcome does not double-insert
+  scalar or series rows
+- `drain` enforces `resources.wall_clock` uniformly — a worker that exceeds
+  the deadline produces a retryable failure regardless of transport
+- CV validation fails hard with a `ResultValidationError` when metrics
+  reference unknown dimensions (replaces today's silent `logger.warning`)
+- `replay_abandoned` returns the IDs of executions whose rows exist
+  but never received an outcome
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+## Migration cost
+
+This is not a small refactor.
+The dispatch loop, three executor classes, result handling,
+fragment allocation, and ingestion all move at once.
+A staged migration is possible — `InMemoryTransport` and `ProcessPoolTransport`
+can ship before `CeleryTransport` — but the lifecycle module itself has to
+land in one piece because every transport depends on it.
+
+## CV validation becomes a hard fail
+
+Today, when scalar or series metric values do not conform to the controlled
+vocabulary, `result_handling.py` logs a warning with a `TODO`.
+This RFC promotes that to a hard `ResultValidationError`.
+Any in-flight diagnostic with a CV mismatch will start failing.
+This is the *intended* behaviour — silent CV drift is a real source of data
+quality issues — but it is a behaviour change that needs deliberate handling
+during rollout (e.g. a one-release-cycle deprecation period where the
+violation is logged with `ERROR` instead of `WARNING` but does not yet raise).
+
+## A single fat class owns more responsibility
+
+`ExecutionLifecycle` ends up around 400–500 LOC.
+That is intentional: the things it owns are already coupled across eight
+files today, and module *depth* is the design goal.
+But it does mean the file is large; reviewers should expect that.
+
+## The Celery boundary still has one quirk
+
+In the current code, Celery's `link=handle_result.s(...)` chains a result
+handler that runs on the broker side of the boundary.
+In the new design, `CeleryTransport.poll` is what feeds outcomes back into
+`drain`.
+This is cleaner, but it means a coordinator-side crash mid-drain can leave
+results "in flight" in the broker that the coordinator has not yet observed.
+The mitigation is `replay_abandoned()`, which on next startup queries
+unfinished executions and re-polls the broker for them.
+This is a robustness improvement over today
+(where a coordinator crash silently strands rows in `successful=NULL`),
+but it is not free —
+it requires `CeleryTransport` to persist task IDs alongside execution IDs
+in a small table or in `telemetry_meta` so the broker can be queried later.
+
+## Resource hints can be wrong
+
+If a provider declares 4 GB for a diagnostic that actually needs 16 GB,
+the SLURM adapter will OOM-kill the job.
+Today's `LocalExecutor` runs in the parent address space and "just works"
+up to the host's memory.
+With `ResourceHint` in place, getting the number wrong is a new failure mode.
+Mitigations:
+
+- The default `ResourceHint(memory_mb=4096, cpu=1, wall_clock=2h)`
+  is generous enough that most diagnostics run unchanged.
+- `ProcessPoolTransport` deliberately ignores everything except
+  `wall_clock` — running on a developer laptop does not get harder.
+- Telemetry capture means the first failed run produces the data needed
+  to adjust the hint upward.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+Three designs were considered in detail.
+Sketches and trade-offs follow; the chosen design is a deliberate hybrid.
+
+## A — Minimal interface (rejected as a pure form)
+
+Two public methods (`submit`, `drain`), one Protocol (`Transport`).
+Everything else collapses behind the facade and cannot be customised
+without editing the module.
+
+```
+Public surface:   ████░░░░░░░░░░░░░░░░  (smallest)
+Defaults baked:   ████████░░░░░░░░░░░░
+Future bend:      ██████████░░░░░░░░░░  (one port + edit-in-place)
+Migration churn:  ████████████████░░░░  (deletes Executor Protocol)
+```
+
+Strength: minimal cognitive load. Weakness: no place to declare
+resource hints, no place for per-provider retry policies; both end up
+requiring constructor kwarg additions on first concrete demand.
+
+## B — Maximally flexible (rejected as a pure form)
+
+Five ports (`Transport`, `ArtifactStore`, `RetryPolicy`, `IngestSink`,
+`FragmentAllocator`), seven lifecycle hooks, plugin discovery via entry
+points, schema-versioned wire types.
+
+```
+Public surface:   ████████████████████  (5 ports × N methods + 7 hooks)
+Defaults baked:   ████░░░░░░░░░░░░░░░░  (everything injectable)
+Future bend:      ████████████████████  (port-shaped for K8s/S3/Prom)
+Migration churn:  ████████████████████  (new wire format + plugin registry)
+Speculation tax:  ██████████░░░░░░░░░░  (hooks + FragmentAllocator-as-port)
+```
+
+Strength: every anticipated future requirement has a seam already.
+Weakness: heavy speculation tax — five ports without five concrete
+implementations driving them is exactly the shape that calcifies into
+"the way things have always been done".
+
+## C — Common-case optimised (rejected as a pure form)
+
+One class with `dispatch`, `drain`, plus `replay_abandoned`, `dry_run`,
+`ingest`, `with_transport`. Sensible defaults sourced from `Config`.
+
+```
+Public surface:   ████████░░░░░░░░░░░░  (1 hot method + 4 cold methods)
+Defaults baked:   ████████████████████  (highest)
+Future bend:      ████████░░░░░░░░░░░░  (single port; widen kwargs)
+Migration churn:  ██████░░░░░░░░░░░░░░  (smallest; adapters survive)
+Hidden leak:      ████░░░░░░░░░░░░░░░░  (Celery ignores on_done callback)
+```
+
+Strength: the solver call site collapses to one line per iteration.
+Weakness: a `ResultSink` callback that Celery silently ignores is an
+asymmetry that will trip someone, and the design has no place for
+resource hints.
+
+## The chosen hybrid
+
+- **C's façade** — `ExecutionLifecycle` is one class with `submit` and
+  `drain` as the hot path and a small set of cold-path entry points.
+- **A's transport contract** — `Transport.dispatch(envelope)` plus
+  `poll()` returning `Iterator[ExecutionOutcome]`. No `ResultSink`
+  callback. The lifecycle owns the result loop.
+- **A's bundle-writer separation** — `ExecutionResult` is pure data;
+  bundle JSON write moves into a worker-side `_BundleWriter`.
+- **B's `ExecutionEnvelope` and `Telemetry` wire types** — earned by
+  the actual provider-side pain (no place for resource hints) and by the
+  need to capture per-execution actuals for future adaptive sizing.
+- **Defer the rest of B** — no `ArtifactStore` port (S3 mirror is
+  hypothetical), no `IngestSink` priority chain (DB-only today),
+  no `LifecycleHooks` (today's `logger.*` calls suffice),
+  no `FragmentAllocator` port (one implementation foreseeable),
+  no entry-point plugin registry.
+- **Delete the `Executor` Protocol.** It is a shallow interface
+  dressed up as flexibility; `SynchronousExecutor` becomes
+  `ExecutionLifecycle(transport=InMemoryTransport())`.
+
+## Impact of not doing this
+
+Each future transport (SLURM, PBS, K8s) re-implements ~250 LOC of
+lifecycle wiring per adapter.
+Resource hints have to be retrofitted later through a new wire format,
+which is a strictly larger change than introducing them at the same time
+as the lifecycle consolidation.
+Per-task timeout, CV validation, dirty-flag rules, and retry
+classification stay scattered, and integration tests remain the only
+honest signal that the system works end-to-end.
+
+# Prior art
+[prior-art]: #prior-art
+
+## Dask's `distributed`
+
+Dask's scheduler accepts `resources=` annotations on submitted tasks
+(e.g. `client.submit(func, resources={"GPU": 1})`).
+The pattern of declaring resource needs on the unit-of-work and letting
+the scheduler honour them is the direct inspiration for `ResourceHint`.
+Dask does not concern itself with what is "above" the scheduler
+(catalog, ingest, controlled vocabulary),
+which is the layer this RFC is consolidating.
+
+## Snakemake / Nextflow
+
+Both pipeline frameworks have first-class resource directives
+(`resources: mem_mb=…, runtime=…`) on rules / processes,
+which translate transparently into SLURM / PBS / K8s directives.
+This RFC reuses that mental model at the diagnostic level.
+
+## Airflow operators
+
+Airflow's executor/operator split is the closest analogue:
+operators carry the work definition, executors decide how to run it.
+Airflow learned the hard way that an executor that owns result
+handling is too narrow an interface; their `BaseExecutor.execute_async`
+plus `sync` (poll) split is essentially what `Transport.dispatch` /
+`Transport.poll` is here.
+
+## Celery's `task_time_limit` and queue routing
+
+Celery already supports per-queue routing and per-task time limits.
+This RFC's `ResourceHint.queue` maps directly onto Celery's queue
+routing; `wall_clock` maps onto `task_time_limit` plus
+`task_soft_time_limit`.
+The `CeleryExecutor` in `climate-ref-celery` does not currently use
+either; the new `CeleryTransport` does.
+
+## The Rust RFC process itself
+
+The shape of this document (Summary, Motivation, Reference-level,
+Drawbacks, Rationale, Prior art, Unresolved, Future) is inherited from
+[rust-lang/rfcs](https://github.com/rust-lang/rfcs)
+via this repository's template.
+
+# Unresolved questions
+[unresolved-questions]: #unresolved-questions
+
+To resolve through the RFC discussion:
+
+- **Should `ResourceHint` include I/O class (e.g. `io_intensive: bool`)
+  or GPU (`gpu: int`)?**
+  Neither has a concrete consumer today.
+  The recommended default is to add fields when a concrete adapter
+  needs them rather than speculating now.
+- **Where does the controlled vocabulary live in test contexts?**
+  Today `CV.load_from_file(config.paths.dimensions_cv)` is invoked
+  per-result inside `handle_execution_result`.
+  The RFC moves it to lifecycle construction, which means tests need
+  a deterministic CV.
+  A `PermissiveCV()` test fixture is one option;
+  baking the project's CV into a constant for test use is another.
+- **Should `replay_abandoned` be called automatically on
+  `ExecutionLifecycle.__init__`?**
+  Doing so makes recovery transparent;
+  not doing so makes startup behaviour explicit.
+  The current draft assumes it is explicit and called from the CLI.
+
+To resolve through implementation:
+
+- **Exact unique-constraint shape for upserts** —
+  `(execution_id, output_type, short_name)` for `ExecutionOutput`
+  is the obvious key, but the existing schema does not enforce it.
+  An Alembic migration is needed.
+- **`CeleryTransport.poll` semantics** —
+  whether to use `AsyncResult.get(timeout=…)` per task
+  or `app.control.inspect` for batch status.
+  Implementation choice; does not affect the public interface.
+- **`Telemetry.peak_rss_mb` capture on macOS** —
+  `resource.getrusage` returns bytes on macOS and KB on Linux.
+  A small platform shim is required worker-side.
+
+Out of scope:
+
+- **Adaptive resource provider** (reading rolling telemetry to suggest
+  hints). Telemetry columns land here; the provider is a follow-up RFC.
+- **GPU scheduling**.
+  None of the current diagnostics use GPUs;
+  when one does, `ResourceHint` gains a `gpu: int` field.
+- **Multi-tenant queues**.
+  `ResourceHint.queue` is enough for single-deployment routing;
+  multi-tenant work (per-org queues, fair-share) is out of scope.
+- **Replacing SLURM / PBS / Celery as schedulers**.
+  Explicitly *not* in scope: the lifecycle owns the seam *above* the
+  scheduler, not the scheduler itself.
+
+# Future possibilities
+[future-possibilities]: #future-possibilities
+
+The deepening proposed here unlocks several follow-ups, each a
+self-contained piece of work:
+
+- **SLURM and PBS transports.**
+  Each is a ~100 LOC adapter implementing `Transport`.
+  `dispatch` builds a job script from `envelope.resources`,
+  shells out to `sbatch` / `qsub`, records the job ID in `telemetry_meta`;
+  `poll` queries `squeue` / `qstat` and surfaces completed jobs back as
+  `ExecutionOutcome`s.
+  No changes to the lifecycle or to provider code are required.
+
+- **Adaptive `ResourceProvider`.**
+  Reads `Execution.peak_rss_mb` over a rolling window per
+  `(diagnostic_slug, dataset_shape_signature)`,
+  suggests a memory hint at the rolling p95 × 1.2,
+  applies a per-CLI flag to opt in.
+  Schema is already in place from this RFC; the provider is the only
+  new code.
+
+- **Per-provider retry policies.**
+  Currently `RetryPolicy` is a single injected object.
+  ESMValTool's memory-exhaustion failures are retryable with a larger
+  memory hint; a `Mapping[str, RetryPolicy]` keyed on provider slug
+  becomes the natural extension when concrete demand arrives.
+
+- **Streaming partial results.**
+  `Transport.poll` already yields outcomes incrementally;
+  a `PartialOutcome` event could be added to the iterator for
+  progress reporting if a UI consumer appears.
+
+- **S3 / HTTP artifact store.**
+  If results need to land somewhere other than the local filesystem,
+  an `ArtifactStore` port can be carved out of `_Promoter` later.
+  The wire format does not change (scratch path remains relative);
+  only the lifecycle's promote step is rewritten.
+
+- **Pluggable ingest sinks.**
+  Prometheus metrics, audit logs, S3 mirrors — each is a side-effect
+  observer over `ExecutionOutcome`.
+  A small `IngestSink` interface with ordered observers is a small
+  follow-up if and when a second sink (beyond the DB) materialises.
+
+- **Recovery on coordinator restart.**
+  `replay_abandoned()` is in this RFC.
+  Extending it to interrogate transport-side state (SLURM job that
+  finished while the coordinator was down) is a natural follow-up that
+  only needs `Transport.lookup(transport_meta) -> ExecutionOutcome | None`
+  on each adapter.
+
+None of the above are reasons to accept this RFC on their own;
+they are listed to show that the seam introduced here is the right shape
+for the directions the project is plausibly heading,
+without baking any of them in prematurely.

From 5034a53eab50cb13b3506ba35f039f220fad1a11 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Tue, 12 May 2026 17:35:57 +1000
Subject: [PATCH 02/12] rfc(execution-lifecycle): assign PR #3 number

---
 ...{0000-execution-lifecycle.md => 0003-execution-lifecycle.md} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 rename text/{0000-execution-lifecycle.md => 0003-execution-lifecycle.md} (99%)

diff --git a/text/0000-execution-lifecycle.md b/text/0003-execution-lifecycle.md
similarity index 99%
rename from text/0000-execution-lifecycle.md
rename to text/0003-execution-lifecycle.md
index 18d0a69..40418be 100644
--- a/text/0000-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -1,6 +1,6 @@
 - Feature Name: `execution_lifecycle`
 - Start Date: 2026-05-12
-- RFC PR: [Climate-REF/rfcs#0000](https://github.com/Climate-REF/rfcs/pull/0000)
+- RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3)
 
 # Summary
 [summary]: #summary

From aa5d896ced84d04f5745af85e6a4157bd0153675 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Tue, 12 May 2026 20:24:11 +1000
Subject: [PATCH 03/12] rfc(execution-lifecycle): convert diagrams to mermaid

---
 text/0003-execution-lifecycle.md | 193 ++++++++++++++++---------------
 1 file changed, 101 insertions(+), 92 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index 40418be..e0461d0 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -134,46 +134,40 @@ instead of eight thin ones.
 
 ## Module boundary
 
-```
-┌──────────────────────────────────────────────────────────────────────────────┐
-│ ExecutionLifecycle (climate_ref.lifecycle)                                   │
-│                                                                              │
-│   __init__(config, db, transport, *,                                         │
-│            retry=DefaultRetryPolicy(),                                       │
-│            cv=None,           # default: CV.load_from_file(config.paths…)    │
-│            clock=datetime.utcnow)                                            │
-│                                                                              │
-│   submit(execution, definition) -> None                                      │
-│   drain(timeout=None) -> None                                                │
-│                                                                              │
-│   replay_abandoned() -> list[int]    # find executions stranded across boot  │
-│   dry_run(group, datasets) -> ExecutionDefinition                            │
-│                                                                              │
-│   OWNED (private):                                                           │
-│     _FragmentAllocator   _Classifier   _Promoter      _Ingestor              │
-│     _DirtyRule           _BundleWriter (worker-side)                         │
-└──────────────────────────────────────────────────────────────────────────────┘
-                              │
-                              ▼ Port — the ONLY thing that varies per backend
-┌──────────────────────────────────────────────────────────────────────────────┐
-│ Transport (Protocol, climate_ref_core.lifecycle.ports)                       │
-│                                                                              │
-│   name: ClassVar[str]                                                        │
-│   dispatch(envelope: ExecutionEnvelope) -> None                              │
-│   poll(block: bool, timeout: float | None)                                   │
-│       -> Iterator[ExecutionOutcome]                                          │
-│   shutdown(timeout: float | None) -> None                                    │
-│                                                                              │
-│ Production adapters:                                                         │
-│   InMemoryTransport      (in climate-ref, used by tests & SynchronousExec)   │
-│   ProcessPoolTransport   (in climate-ref, replaces LocalExecutor body)       │
-│   CeleryTransport        (in climate-ref-celery)                             │
-│                                                                              │
-│ Future adapters (NOT in this RFC, but the seam supports them):               │
-│   SlurmTransport         consumes ResourceHint as --mem/--cpus/--time/-p     │
-│   PbsTransport           consumes ResourceHint as -l mem,ncpus,walltime,q    │
-│   K8sTransport           consumes ResourceHint as pod resources + deadline   │
-└──────────────────────────────────────────────────────────────────────────────┘
+```mermaid
+flowchart TB
+    subgraph Lifecycle["<b>ExecutionLifecycle</b> &nbsp; <i>climate_ref.lifecycle</i>"]
+        direction TB
+        Public["<b>Public</b><br/>__init__(config, db, transport, *, retry, cv, clock)<br/>submit(execution, definition)<br/>drain(timeout=None)<br/>replay_abandoned() · dry_run(group, datasets)"]
+        Private["<b>Owned (private)</b><br/>_FragmentAllocator · _Classifier · _Promoter<br/>_Ingestor · _DirtyRule · _BundleWriter"]
+        Public --- Private
+    end
+
+    Lifecycle -- dispatch / poll --> Port
+
+    subgraph Port["<b>Transport (Protocol)</b> &nbsp; <i>climate_ref_core.lifecycle.ports</i>"]
+        TP["name: ClassVar[str]<br/>dispatch(envelope: ExecutionEnvelope) -> None<br/>poll(block, timeout) -> Iterator[ExecutionOutcome]<br/>shutdown(timeout) -> None"]
+    end
+
+    Port --> Today
+    Port -.future.-> Future
+
+    subgraph Today["<b>Adapters in this RFC</b>"]
+        direction LR
+        InMem["InMemoryTransport<br/><i>tests + SynchronousExecutor replacement</i>"]
+        PP["ProcessPoolTransport<br/><i>replaces LocalExecutor body</i>"]
+        CT["CeleryTransport<br/><i>in climate-ref-celery</i>"]
+    end
+
+    subgraph Future["<b>Future adapters &mdash; seam supports them</b>"]
+        direction LR
+        SLURM["SlurmTransport<br/><i>sbatch --mem --cpus --time --partition</i>"]
+        PBS["PbsTransport<br/><i>qsub -l mem,ncpus,walltime,queue</i>"]
+        K8s["K8sTransport<br/><i>pod resources + activeDeadlineSeconds</i>"]
+    end
+
+    classDef future stroke-dasharray: 5 5
+    class Future,SLURM,PBS,K8s future
 ```
 
 ## Wire types
@@ -336,35 +330,47 @@ and telemetry persistence.
 
 ## End-to-end sequence
 
-```
-Solver        Diagnostic       Lifecycle           Transport          Worker             DB
-  │ submit(e,d)   │                │                    │                  │                 │
-  ├──────────────►│ resources_for(d)│                   │                  │                 │
-  │ ◄──── ResourceHint(...)        │                    │                  │                 │
-  │              │                 │ allocate fragment ─┼──────────────────┼────────────────►│
-  │              │                 │ register_datasets ─┼──────────────────┼────────────────►│
-  │              │                 │ session.expunge                                          │
-  │              │                 │ commit ──────────────────────────────────────────────── │
-  │              │                 │ deadline = now + r.wall_clock                            │
-  │              │                 │ dispatch(envelope)─►│  (slurm sbatch / celery send /    │
-  │              │                 │                    │   processpool.submit / inline)     │
-  │              │                 │                    ├─►│ execute_locally(envelope)      │
-  │              │                 │                    │  │   diagnostic.run               │
-  │              │                 │                    │  │   _BundleWriter.write          │
-  │              │                 │                    │  │   CV.validate (hard-fail)      │
-  │              │                 │                    │  │   capture Telemetry            │
-  │              │                 │                    │  │   return ExecutionOutcome      │
-  │              │                 │                    │◄─│                                │
-  │ drain()      │                 │ poll() ◄───────────│                                    │
-  │ ────────────►│                 │ _finalize:                                              │
-  │              │                 │   session.merge(eid) ───────────────────────────────────►│
-  │              │                 │   _Promoter.copy(scratch→results, log+bundles+refs)     │
-  │              │                 │   _Classifier(failure or result) → Success|Retry|GiveUp │
-  │              │                 │   _Ingestor.upsert(outputs, scalars, series) ──────────►│
-  │              │                 │   _DirtyRule.apply(decision) ──────────────────────────►│
-  │              │                 │   execution.telemetry = outcome.telemetry ─────────────►│
-  │              │                 │   execution.mark_{successful,failed} ─────────────────►│
-  │ done         │                 │                    │                                    │
+```mermaid
+sequenceDiagram
+    autonumber
+    participant S as Solver
+    participant D as Diagnostic
+    participant L as ExecutionLifecycle
+    participant T as Transport
+    participant W as Worker
+    participant DB as DB
+
+    S->>L: submit(execution, definition)
+    L->>D: resources_for(definition)
+    D-->>L: ResourceHint(memory_mb, cpu, wall_clock, queue?)
+    L->>DB: allocate fragment + register_datasets
+    L->>DB: session.expunge + commit
+    Note over L: deadline = clock() + resources.wall_clock
+    L->>T: dispatch(ExecutionEnvelope)
+
+    Note over T,W: sbatch · qsub · pool.submit · celery send · inline
+    T->>W: hand off envelope
+    activate W
+    W->>W: diagnostic.run(definition)
+    W->>W: _BundleWriter.write (output.json, diagnostic.json, series.json)
+    W->>W: CV.validate (hard-fail)
+    W->>W: capture Telemetry (duration, peak_rss_mb, host)
+    W-->>T: ExecutionOutcome
+    deactivate W
+
+    S->>L: drain(timeout)
+    loop until no in-flight executions
+        L->>T: poll(block, timeout)
+        T-->>L: ExecutionOutcome
+        Note over L: _finalize(outcome)
+        L->>DB: session.merge(execution_id)
+        L->>DB: _Promoter.copy(scratch → results)
+        Note over L: _Classifier → SUCCESS | RETRY | GIVE_UP
+        L->>DB: _Ingestor.upsert(outputs, scalars, series)
+        L->>DB: _DirtyRule.apply(decision)
+        L->>DB: execution.telemetry = outcome.telemetry
+        L->>DB: mark_successful / mark_failed
+    end
 ```
 
 ## Retry classification
@@ -582,19 +588,38 @@ Mitigations:
 Three designs were considered in detail.
 Sketches and trade-offs follow; the chosen design is a deliberate hybrid.
 
+```mermaid
+quadrantChart
+    title Design trade-off space
+    x-axis "Surface area (concepts)" --> "Larger"
+    y-axis "Defaults baked in" --> "More"
+    quadrant-1 "Heavy & opinionated"
+    quadrant-2 "Lean & opinionated"
+    quadrant-3 "Lean & open"
+    quadrant-4 "Heavy & open"
+    "A - Minimal": [0.18, 0.55]
+    "B - Maximally flexible": [0.92, 0.18]
+    "C - Common-case optimised": [0.38, 0.92]
+    "Hybrid (chosen)": [0.42, 0.7]
+```
+
+Comparison at a glance (relative scale, 0–5):
+
+| Dimension              | A — Minimal | B — Maximal | C — Common-case | Hybrid (chosen) |
+|------------------------|:-:|:-:|:-:|:-:|
+| Public surface         | 1 | 5 | 2 | 2 |
+| Defaults baked in      | 3 | 1 | 5 | 4 |
+| Bend without editing   | 3 | 5 | 2 | 3 |
+| Migration churn        | 4 | 5 | 2 | 3 |
+| Resource-hint support  | 0 | 5 | 0 | 5 |
+| Speculation tax        | 0 | 3 | 0 | 1 |
+
 ## A — Minimal interface (rejected as a pure form)
 
 Two public methods (`submit`, `drain`), one Protocol (`Transport`).
 Everything else collapses behind the facade and cannot be customised
 without editing the module.
 
-```
-Public surface:   ████░░░░░░░░░░░░░░░░  (smallest)
-Defaults baked:   ████████░░░░░░░░░░░░
-Future bend:      ██████████░░░░░░░░░░  (one port + edit-in-place)
-Migration churn:  ████████████████░░░░  (deletes Executor Protocol)
-```
-
 Strength: minimal cognitive load. Weakness: no place to declare
 resource hints, no place for per-provider retry policies; both end up
 requiring constructor kwarg additions on first concrete demand.
@@ -605,14 +630,6 @@ Five ports (`Transport`, `ArtifactStore`, `RetryPolicy`, `IngestSink`,
 `FragmentAllocator`), seven lifecycle hooks, plugin discovery via entry
 points, schema-versioned wire types.
 
-```
-Public surface:   ████████████████████  (5 ports × N methods + 7 hooks)
-Defaults baked:   ████░░░░░░░░░░░░░░░░  (everything injectable)
-Future bend:      ████████████████████  (port-shaped for K8s/S3/Prom)
-Migration churn:  ████████████████████  (new wire format + plugin registry)
-Speculation tax:  ██████████░░░░░░░░░░  (hooks + FragmentAllocator-as-port)
-```
-
 Strength: every anticipated future requirement has a seam already.
 Weakness: heavy speculation tax — five ports without five concrete
 implementations driving them is exactly the shape that calcifies into
@@ -623,14 +640,6 @@ implementations driving them is exactly the shape that calcifies into
 One class with `dispatch`, `drain`, plus `replay_abandoned`, `dry_run`,
 `ingest`, `with_transport`. Sensible defaults sourced from `Config`.
 
-```
-Public surface:   ████████░░░░░░░░░░░░  (1 hot method + 4 cold methods)
-Defaults baked:   ████████████████████  (highest)
-Future bend:      ████████░░░░░░░░░░░░  (single port; widen kwargs)
-Migration churn:  ██████░░░░░░░░░░░░░░  (smallest; adapters survive)
-Hidden leak:      ████░░░░░░░░░░░░░░░░  (Celery ignores on_done callback)
-```
-
 Strength: the solver call site collapses to one line per iteration.
 Weakness: a `ResultSink` callback that Celery silently ignores is an
 asymmetry that will trip someone, and the design has no place for

From 67f5fd2e4a0d5ee62ad1dd2f16a370e3e4538dac Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Tue, 12 May 2026 20:29:02 +1000
Subject: [PATCH 04/12] rfc(execution-lifecycle): swap module boundary to
 classDiagram

Mermaid classDiagram is idiomatic for the Protocol + adapters pattern.
LR layout fits PR width; method signatures stay legible without HTML hacks.
---
 text/0003-execution-lifecycle.md | 87 ++++++++++++++++++++------------
 1 file changed, 54 insertions(+), 33 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index e0461d0..dafd7f6 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -135,39 +135,60 @@ instead of eight thin ones.
 ## Module boundary
 
 ```mermaid
-flowchart TB
-    subgraph Lifecycle["<b>ExecutionLifecycle</b> &nbsp; <i>climate_ref.lifecycle</i>"]
-        direction TB
-        Public["<b>Public</b><br/>__init__(config, db, transport, *, retry, cv, clock)<br/>submit(execution, definition)<br/>drain(timeout=None)<br/>replay_abandoned() · dry_run(group, datasets)"]
-        Private["<b>Owned (private)</b><br/>_FragmentAllocator · _Classifier · _Promoter<br/>_Ingestor · _DirtyRule · _BundleWriter"]
-        Public --- Private
-    end
-
-    Lifecycle -- dispatch / poll --> Port
-
-    subgraph Port["<b>Transport (Protocol)</b> &nbsp; <i>climate_ref_core.lifecycle.ports</i>"]
-        TP["name: ClassVar[str]<br/>dispatch(envelope: ExecutionEnvelope) -> None<br/>poll(block, timeout) -> Iterator[ExecutionOutcome]<br/>shutdown(timeout) -> None"]
-    end
-
-    Port --> Today
-    Port -.future.-> Future
-
-    subgraph Today["<b>Adapters in this RFC</b>"]
-        direction LR
-        InMem["InMemoryTransport<br/><i>tests + SynchronousExecutor replacement</i>"]
-        PP["ProcessPoolTransport<br/><i>replaces LocalExecutor body</i>"]
-        CT["CeleryTransport<br/><i>in climate-ref-celery</i>"]
-    end
-
-    subgraph Future["<b>Future adapters &mdash; seam supports them</b>"]
-        direction LR
-        SLURM["SlurmTransport<br/><i>sbatch --mem --cpus --time --partition</i>"]
-        PBS["PbsTransport<br/><i>qsub -l mem,ncpus,walltime,queue</i>"]
-        K8s["K8sTransport<br/><i>pod resources + activeDeadlineSeconds</i>"]
-    end
-
-    classDef future stroke-dasharray: 5 5
-    class Future,SLURM,PBS,K8s future
+classDiagram
+    direction LR
+
+    class ExecutionLifecycle {
+        <<climate_ref.lifecycle>>
+        +submit(execution, definition)
+        +drain(timeout) None
+        +replay_abandoned() list~int~
+        +dry_run(group, datasets) ExecutionDefinition
+        -_FragmentAllocator
+        -_Classifier
+        -_Promoter
+        -_Ingestor
+        -_DirtyRule
+        -_BundleWriter
+    }
+
+    class Transport {
+        <<Protocol>>
+        +name: ClassVar~str~
+        +dispatch(envelope: ExecutionEnvelope) None
+        +poll(block, timeout) Iterator~ExecutionOutcome~
+        +shutdown(timeout) None
+    }
+
+    class InMemoryTransport {
+        tests + SynchronousExecutor
+    }
+    class ProcessPoolTransport {
+        replaces LocalExecutor
+    }
+    class CeleryTransport {
+        climate-ref-celery
+    }
+    class SlurmTransport {
+        future
+        sbatch --mem --cpus --time --partition
+    }
+    class PbsTransport {
+        future
+        qsub -l mem,ncpus,walltime,queue
+    }
+    class K8sTransport {
+        future
+        pod resources + activeDeadlineSeconds
+    }
+
+    ExecutionLifecycle ..> Transport : dispatch / poll
+    InMemoryTransport ..|> Transport
+    ProcessPoolTransport ..|> Transport
+    CeleryTransport ..|> Transport
+    SlurmTransport ..|> Transport
+    PbsTransport ..|> Transport
+    K8sTransport ..|> Transport
 ```
 
 ## Wire types

From 8f8263b33119daf38103a4442bedc58adaace7ec Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Tue, 12 May 2026 20:32:40 +1000
Subject: [PATCH 05/12] rfc(execution-lifecycle): simplify; trim doc by ~half

Compress prose, drop per-design subsections in Rationale (table tells the
story), tighten Drawbacks/Prior art/Unresolved/Future to bullets. All three
diagrams kept; technical substance preserved.
---
 text/0003-execution-lifecycle.md | 839 ++++++++-----------------------
 1 file changed, 204 insertions(+), 635 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index dafd7f6..edc0a65 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -5,129 +5,55 @@
 # Summary
 [summary]: #summary
 
-Consolidate the lifecycle of a single diagnostic execution
+Consolidate the lifecycle of one diagnostic execution
 — allocation, dispatch, run, classify, publish, ingest, finalise —
-into one deep module (`ExecutionLifecycle`) sitting behind a single `Transport` port,
-and surface a declarative `ResourceHint` on each `Diagnostic` so providers
-can express memory / CPU / wall-clock expectations in one place
-without reaching into the executor or scheduler layer.
+into one deep module (`ExecutionLifecycle`) behind one `Transport` port.
+Add a declarative `ResourceHint` on every `Diagnostic`
+and capture per-execution `Telemetry`,
+so providers can express memory / CPU / wall-clock once
+and future schedulers (SLURM, PBS, K8s)
+plug in as ~100 LOC adapters.
 
 # Motivation
 [motivation]: #motivation
 
-## Problem
-
-The lifecycle of one diagnostic execution today is fragmented across roughly
-eight files in two packages.
-A single execution flows through, in order:
-
-1. `solver.py` creates an `Execution` row with a `PLACEHOLDER_FRAGMENT`,
-   calls `assign_execution_fragment` to flush, compute a group-short string,
-   and rewrite the output fragment.
-2. `solver.py` rebuilds the `ExecutionDefinition` via `attrs.evolve`,
-   calls `execution.register_datasets`, expunges the row, commits.
-3. `executor.run(definition, execution)` is called outside the transaction.
-   Three implementations exist:
-   `SynchronousExecutor` runs in-process,
-   `LocalExecutor` submits to a `ProcessPoolExecutor`,
-   `CeleryExecutor` sends to a broker with linked result handlers.
-4. `execute_locally` (in `climate-ref-core`) wraps `diagnostic.run`,
-   classifies exceptions via the private `_is_system_error`,
-   special-cases `CondaCommandError`,
-   and produces an `ExecutionResult` — whose `build_from_output_bundle`
-   factory writes three JSON files to disk despite being a frozen attrs class.
-5. `process_result` and `handle_execution_result` copy log, metric bundle,
-   output bundle, series file, and every bundle-referenced plot/data/html
-   file from scratch to results,
-   then ingest scalar values, series values, and outputs in a nested
-   transaction.
-6. `ExecutionGroup.dirty` is toggled in three different branches across
-   `handle_execution_result` and `mark_execution_failed`.
-
-Concrete problems this fragmentation causes today:
-
-- **Provider authors have no obvious place to express parallelisation hints.**
-  ESMValTool diagnostics that need 16 GB of memory or eight hours of
-  wall-clock have nowhere to declare it.
-  The `LocalExecutor` enforces a 6 hour per-task timeout;
-  the `CeleryExecutor` does not enforce a per-task timeout at all;
-  neither passes resource requirements through to anything that could honour
-  them.
-  When the project eventually adds SLURM or PBS adapters,
-  the natural way to wire memory / CPU / wall-clock into `sbatch` or `qsub`
-  directives does not exist.
-- **Retry classification is split across the file tree.**
-  `_is_system_error` (in `climate-ref-core/executor.py`),
-  `CondaCommandError` handling (in the same file),
-  missing-log detection (in `result_handling.py`),
-  per-task timeout (in `LocalExecutor`),
-  and pool-shutdown abandonment (also in `LocalExecutor`)
-  each set `retryable` independently.
-  There is no single classifier function and no contract between
-  `ExecutionResult.build_from_failure` and the executor about which
-  exception types are retryable.
-- **The `dirty` flag is decided in three places.**
-  Success path sets it `False`;
-  non-retryable failure sets it `False`;
-  retryable failure leaves it `True`;
-  missing log leaves it `True`.
-  These branches are spread across `handle_execution_result` and the
-  Local/Celery executors with no unifying decision function.
-- **Tests have to mock subprocess, filesystem, DB, and Celery.**
-  `tests/unit/test_providers.py` patches `subprocess` directly for the
-  conda provider; `tests/unit/test_executor.py` imports the private
-  `_is_system_error`; integration tests are the only thing that exercises
-  the full happy path.
-- **`ExecutionResult` is not picklable cleanly.**
-  Its frozen factory writes three JSON files;
-  re-construction in a test requires a real tmp_path filesystem.
-- **The `Executor` Protocol is shallow.**
-  Its `run` / `join` interface forces every adapter
-  (Synchronous, Local, Celery, future SLURM/PBS/K8s)
-  to reinvent the same dance:
-  reattach the detached `Execution` row,
-  open a fresh transaction,
-  copy scratch to results,
-  load the CV,
-  ingest scalars and series,
-  apply the dirty rule,
-  mark the row.
-
-## Why now
-
-The CMIP REF is approaching the point where serious deployment targets are
-plausible:
-SLURM and PBS adapters,
-distributed Celery deployments behind a managed broker,
-possibly Kubernetes Jobs.
-Each new transport currently means a new ~250-LOC adapter that re-implements
-the lifecycle dance.
-The same fragmentation also blocks anything resembling
-**informed scheduling** — even simple things like "ESMValTool needs more memory
-than ILAMB" or "the ENSO diagnostic should land on the `bigmem` queue"
-have no representation in the codebase.
-
-This RFC is **not** about replacing SLURM, PBS, or Celery as schedulers.
-It is about defining the seam between *climate-ref* and a scheduler so the
-scheduler has enough information to do its job,
-and so the lifecycle around the scheduler is one robust, well-tested module
-instead of eight thin ones.
-
-## Expected outcomes
-
-1. The solver dispatch loop shrinks from ~50 lines of bookkeeping to ~3.
-2. A new transport adapter (SLURM, PBS, K8s) is a single class
-   implementing `Transport`, around 80–120 LOC,
-   with no copy-paste of fragment allocation, ingestion, CV loading,
-   dirty-flag logic, or `Execution` reattachment.
-3. Provider authors declare `resources: ResourceHint(...)` once on a
-   diagnostic class and the rest of the system honours it.
-4. Per-execution telemetry (peak memory, duration, host)
-   is captured on every outcome and persisted on the `Execution` row,
-   making future adaptive scheduling
-   a feature addition rather than a schema change.
-5. Unit tests run with zero filesystem and zero subprocess
-   via an `InMemoryTransport` and a separated bundle writer.
+The lifecycle of one execution is currently fragmented across ~8 files
+in 2 packages.
+A single happy-path run touches
+`solver.py` (allocates row + fragment, register_datasets, expunge, commit),
+`climate_ref_core/executor.py` (`execute_locally`, `_is_system_error`,
+`CondaCommandError` handling),
+`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle`
+which secretly writes three JSON files inside a frozen-attrs factory),
+`climate_ref/executor/result_handling.py` (scratch→results copy ×4,
+ingestion in nested-tx, dirty-flag toggled in 3 branches),
+`climate_ref/executor/fragment.py` (PLACEHOLDER_FRAGMENT, group_short),
+and one of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py`
+(each reimplementing the same reattach / commit / mark dance).
+
+Concrete consequences today:
+
+- **Providers have no place to declare parallelisation hints.**
+  ESMValTool diagnostics that need 16 GB or 8 h have nowhere to say so;
+  `LocalExecutor` hardcodes a 6 h per-task timeout,
+  `CeleryExecutor` enforces no per-task timeout at all,
+  and a future SLURM/PBS adapter has nothing to translate into `sbatch`/`qsub`.
+- **Retry classification is scattered** across `_is_system_error`,
+  the `CondaCommandError` branch, missing-log handling, `LocalExecutor`'s
+  per-task timeout, and pool-shutdown abandonment.
+- **The `dirty` flag is decided in three places**
+  (success path, non-retryable failure, retryable failure / missing log).
+- **CV validation is silenced** (`logger.warning` with TODO instead of raising).
+- **Tests mock subprocess + filesystem + DB + Celery** and import private
+  helpers (`_is_system_error`) to compensate for the missing seam.
+- **`ExecutionResult` is not cleanly picklable** because its frozen factory
+  performs disk I/O.
+
+The CMIP REF is approaching deployment targets (SLURM / PBS / K8s) where
+each new adapter would otherwise mean ~250 LOC of duplicated lifecycle
+wiring.
+This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers.
+It is about defining a single robust seam *above* them.
 
 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation
@@ -160,27 +86,12 @@ classDiagram
         +shutdown(timeout) None
     }
 
-    class InMemoryTransport {
-        tests + SynchronousExecutor
-    }
-    class ProcessPoolTransport {
-        replaces LocalExecutor
-    }
-    class CeleryTransport {
-        climate-ref-celery
-    }
-    class SlurmTransport {
-        future
-        sbatch --mem --cpus --time --partition
-    }
-    class PbsTransport {
-        future
-        qsub -l mem,ncpus,walltime,queue
-    }
-    class K8sTransport {
-        future
-        pod resources + activeDeadlineSeconds
-    }
+    class InMemoryTransport { tests + SynchronousExecutor }
+    class ProcessPoolTransport { replaces LocalExecutor }
+    class CeleryTransport { climate-ref-celery }
+    class SlurmTransport { future · sbatch --mem --cpus --time }
+    class PbsTransport { future · qsub -l mem,ncpus,walltime }
+    class K8sTransport { future · pod resources + deadline }
 
     ExecutionLifecycle ..> Transport : dispatch / poll
     InMemoryTransport ..|> Transport
@@ -193,42 +104,33 @@ classDiagram
 
 ## Wire types
 
-Three frozen, picklable value types cross the transport boundary
-(process pool, broker, future network hop).
-None of them carry DB sessions, `Config`, or controlled vocabulary.
+Picklable value objects only. No DB sessions, `Config`, or CV cross the boundary.
 
 ```python
 @attrs.frozen
 class ResourceHint:
-    """Declarative resource expectation, lives on the diagnostic class.
-
-    Per-execution overrides are produced by Diagnostic.resources_for().
-    """
     memory_mb: int = 4096
     cpu: int = 1
     wall_clock: timedelta = timedelta(hours=2)
-    queue: str | None = None   # transport-specific routing tag
+    queue: str | None = None    # transport-specific routing tag
 
 @attrs.frozen
 class ExecutionEnvelope:
-    """Wire format sent to a worker. Picklable."""
     execution_id: int
     definition: ExecutionDefinition
     resources: ResourceHint
-    deadline: datetime         # clock() + resources.wall_clock at submit time
+    deadline: datetime          # clock() + resources.wall_clock at submit time
 
 @attrs.frozen
 class Telemetry:
-    """Captured by the worker, persisted on the Execution row by the coordinator."""
     duration: timedelta
     peak_rss_mb: int | None
     host: str
     exit_code: int | None
-    transport_meta: Mapping[str, str]   # slurm jobid / k8s pod / celery task_id …
+    transport_meta: Mapping[str, str]   # slurm jobid / k8s pod / celery task_id
 
 @attrs.frozen
 class ExecutionOutcome:
-    """What poll() yields back to the lifecycle."""
     execution_id: int
     result: ExecutionResult | None       # None ⇒ transport-side abandonment
     failure: ExecutionFailure | None     # timeout | broker_lost | pool_shutdown
@@ -236,119 +138,54 @@ class ExecutionOutcome:
 ```
 
 `ExecutionResult` becomes pure data.
-The current `build_from_output_bundle` factory
-(which writes three JSON files inside a frozen-attrs constructor)
-is split into `ExecutionResult.from_bundle(definition, bundle)` —
-which constructs the value object without I/O —
-and a separate `_BundleWriter.write(definition, bundle)`
-that is invoked worker-side only.
+The current `build_from_output_bundle` factory is split into a pure
+`ExecutionResult.from_bundle(definition, bundle)` and a worker-side
+`_BundleWriter.write(definition, bundle)` that owns the JSON I/O.
 
 ## Diagnostic-side declaration
 
 ```python
 class Diagnostic(AbstractDiagnostic):
-    ...
-    resources: ResourceHint = ResourceHint()
+    resources: ResourceHint = ResourceHint()        # project-wide default
 
     def resources_for(self, definition: ExecutionDefinition) -> ResourceHint:
-        """Optional per-execution sizing.
-
-        Default returns ``self.resources`` unchanged.
-        Diagnostics whose memory scales with input size override this.
-        """
+        """Optional per-execution sizing. Default returns self.resources."""
         return self.resources
-```
 
-Concrete provider declarations under the new system:
 
-```python
-# packages/climate-ref-esmvaltool/src/climate_ref_esmvaltool/diagnostics/base.py
+# Examples
 class ESMValToolDiagnostic(CommandLineDiagnostic):
-    resources = ResourceHint(
-        memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6),
-    )
+    resources = ResourceHint(memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6))
 
-# packages/climate-ref-pmp/src/climate_ref_pmp/diagnostics/enso.py
 class EnsoDiagnostic(Diagnostic):
-    resources = ResourceHint(
-        memory_mb=24000, cpu=8, wall_clock=timedelta(hours=8),
-        queue="bigmem",
-    )
+    resources = ResourceHint(memory_mb=24000, cpu=8,
+                             wall_clock=timedelta(hours=8), queue="bigmem")
 
-# packages/climate-ref-ilamb/src/climate_ref_ilamb/standard.py
 class IlambDiagnostic(Diagnostic):
-    resources = ResourceHint(
-        memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2),
-    )
+    resources = ResourceHint(memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2))
     def resources_for(self, defn):
         n = len(defn.datasets.get_cmip6())
         return attrs.evolve(self.resources, memory_mb=8000 + 500 * n)
 ```
 
-The base `Diagnostic` default of `(4 GB, 1 CPU, 2 h)` is the project-wide
-sensible starting point.
-Providers that need more either override the class attribute
-or override `resources_for` for input-size scaling.
-
 ## Solver dispatch — before and after
 
-**Before** (`solver.py:709-757`, condensed):
-
 ```python
-execution = Execution(
-    execution_group=execution_group,
-    dataset_hash=definition.datasets.hash,
-    output_fragment=PLACEHOLDER_FRAGMENT,
-    provider_version=potential_execution.provider.version,
-)
-db.session.add(execution)
-fragment = assign_execution_fragment(
-    db.session, execution,
-    provider_slug=provider_slug,
-    diagnostic_slug=potential_execution.diagnostic.slug,
-    selectors=potential_execution.selectors,
-    group_id=execution_group.id,
-)
-definition = attrs.evolve(
-    definition,
-    output_directory=config.paths.scratch.resolve() / pathlib.Path(fragment),
-)
-execution.register_datasets(db, definition.datasets)
-if execute:
-    db.session.expunge(execution)
-    pending = (definition, execution)
-# … commit happens at top of loop …
-if pending is not None:
-    executor.run(definition=pending[0], execution=pending[1])
-…
-executor.join(timeout=timeout)
-```
-
-**After**:
+# Before: ~50 lines of bookkeeping in solver.py:709-757
+#   PLACEHOLDER_FRAGMENT, assign_execution_fragment, attrs.evolve,
+#   register_datasets, expunge, commit, executor.run, … executor.join
 
-```python
+# After:
 lifecycle = ExecutionLifecycle(config, db, transport)
 
 for group, datasets, definition in planned_executions:
-    execution = Execution(execution_group=group,
-                         dataset_hash=datasets.hash,
-                         provider_version=definition.diagnostic.provider.version)
+    execution = Execution(execution_group=group, dataset_hash=datasets.hash,
+                          provider_version=definition.diagnostic.provider.version)
     lifecycle.submit(execution, definition)
 
 lifecycle.drain(timeout=timeout)
 ```
 
-`submit` privately handles fragment allocation,
-`register_datasets`,
-the expunge/commit dance,
-and `transport.dispatch`.
-`drain` privately handles polling,
-reattachment,
-artifact promotion,
-ingestion,
-the dirty rule,
-and telemetry persistence.
-
 ## End-to-end sequence
 
 ```mermaid
@@ -359,23 +196,19 @@ sequenceDiagram
     participant L as ExecutionLifecycle
     participant T as Transport
     participant W as Worker
-    participant DB as DB
+    participant DB
 
     S->>L: submit(execution, definition)
     L->>D: resources_for(definition)
-    D-->>L: ResourceHint(memory_mb, cpu, wall_clock, queue?)
-    L->>DB: allocate fragment + register_datasets
-    L->>DB: session.expunge + commit
+    D-->>L: ResourceHint(...)
+    L->>DB: allocate fragment + register_datasets + expunge + commit
     Note over L: deadline = clock() + resources.wall_clock
     L->>T: dispatch(ExecutionEnvelope)
 
     Note over T,W: sbatch · qsub · pool.submit · celery send · inline
     T->>W: hand off envelope
     activate W
-    W->>W: diagnostic.run(definition)
-    W->>W: _BundleWriter.write (output.json, diagnostic.json, series.json)
-    W->>W: CV.validate (hard-fail)
-    W->>W: capture Telemetry (duration, peak_rss_mb, host)
+    W->>W: diagnostic.run · _BundleWriter.write · CV.validate · Telemetry
     W-->>T: ExecutionOutcome
     deactivate W
 
@@ -383,231 +216,92 @@ sequenceDiagram
     loop until no in-flight executions
         L->>T: poll(block, timeout)
         T-->>L: ExecutionOutcome
-        Note over L: _finalize(outcome)
-        L->>DB: session.merge(execution_id)
-        L->>DB: _Promoter.copy(scratch → results)
         Note over L: _Classifier → SUCCESS | RETRY | GIVE_UP
-        L->>DB: _Ingestor.upsert(outputs, scalars, series)
-        L->>DB: _DirtyRule.apply(decision)
-        L->>DB: execution.telemetry = outcome.telemetry
-        L->>DB: mark_successful / mark_failed
+        L->>DB: merge · promote artifacts · upsert outputs/scalars/series
+        L->>DB: _DirtyRule.apply · save telemetry · mark_*
     end
 ```
 
-## Retry classification
-
-A single `RetryPolicy.classify(outcome) -> RetryDecision` replaces:
-`_is_system_error`,
-the `CondaCommandError` branch in `execute_locally`,
-the missing-log branch in `handle_execution_result`,
-the per-task timeout handling in `LocalExecutor.join`,
-the pool-shutdown abandonment in `LocalExecutor._fail_outstanding`.
+## Retry + dirty rule (one source of truth)
 
 ```python
 class RetryDecision(enum.Enum):
     SUCCESS = "success"
-    RETRY   = "retry"   # leaves dirty=True
-    GIVE_UP = "give_up" # sets dirty=False, marks failed
+    RETRY   = "retry"     # leaves dirty=True
+    GIVE_UP = "give_up"   # sets dirty=False, marks failed
 
 class DefaultRetryPolicy:
     SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)
     NON_RETRYABLE = (CondaCommandError,)
 
     def classify(self, outcome: ExecutionOutcome) -> RetryDecision:
-        if outcome.failure is not None:
-            # transport-side: timeout, broker_lost, pool_shutdown → retry
-            return RetryDecision.RETRY
-        result = outcome.result
-        if result is None:
+        if outcome.failure is not None:  # timeout | broker_lost | pool_shutdown
             return RetryDecision.RETRY
-        if result.successful:
-            return RetryDecision.SUCCESS
-        return (RetryDecision.RETRY if result.retryable
-                else RetryDecision.GIVE_UP)
-```
-
-The dirty rule is derived from the decision in one place:
-
-```python
-class DefaultDirtyRule:
-    def apply(self, execution: Execution, decision: RetryDecision) -> None:
-        if decision in (RetryDecision.SUCCESS, RetryDecision.GIVE_UP):
-            execution.execution_group.dirty = False
-        # RETRY: leave dirty=True so the next solve picks it up
-```
-
-## Idempotent re-ingest
-
-Re-running `_finalize` on the same `execution_id`
-(e.g. a join-time crash followed by `replay_abandoned`)
-must not double-insert.
-The ingester moves to upserts keyed on natural identifiers:
-
-- `ExecutionOutput`: `(execution_id, output_type, short_name)`
-- `ScalarMetricValue`: `(execution_id, dimensions_hash)`
-- `SeriesMetricValue`: `(execution_id, dimensions_hash, index_name)`
-
-Where SQLite is in use the implementation uses
-`INSERT … ON CONFLICT DO NOTHING`;
-on PostgreSQL the same construct works natively.
-The scratch-to-results copy uses `shutil.copy` with `exist_ok=True`
-on the destination directory.
-
-## Telemetry
-
-Three new columns are added to `Execution`:
-
-```sql
-ALTER TABLE execution ADD COLUMN duration_seconds   FLOAT  NULL;
-ALTER TABLE execution ADD COLUMN peak_rss_mb        INTEGER NULL;
-ALTER TABLE execution ADD COLUMN telemetry_meta     JSON   NULL;
+        r = outcome.result
+        if r is None:                    return RetryDecision.RETRY
+        if r.successful:                 return RetryDecision.SUCCESS
+        return RetryDecision.RETRY if r.retryable else RetryDecision.GIVE_UP
 ```
 
-`telemetry_meta` is a small JSON blob with hostname, exit code,
-and transport-specific identifiers (Celery task id, SLURM job id, …).
-None of these are read by the system at solve time;
-they exist so that a future adaptive resource provider
-can ask "what was the rolling p95 memory for diagnostic X
-on dataset shape Y?" without a schema migration.
+Replaces `_is_system_error`, the `CondaCommandError` branch,
+missing-log handling, the per-task timeout path,
+and pool-shutdown abandonment — five sites collapse to one.
 
-The shape of that future provider is **not** in this RFC.
-A natural seam is:
+## Idempotent ingest, telemetry
 
-```python
-class ResourceProvider(Protocol):
-    def hint_for(self, definition: ExecutionDefinition) -> ResourceHint: ...
-```
-
-with a `StaticResourceProvider` (uses `diagnostic.resources_for`)
-and a future `AdaptiveResourceProvider` (reads telemetry, applies multipliers).
-`ExecutionLifecycle` would gain a `resources:` constructor kwarg defaulting
-to `StaticResourceProvider`.
-This is intentionally deferred — the column existing today
-is the only commitment.
-
-## Module / file layout
-
-```
-packages/climate-ref-core/src/climate_ref_core/
-├── lifecycle/
-│   ├── __init__.py            ← public re-exports
-│   ├── ports.py               ← Transport, ResourceHint, Envelope, Outcome, Telemetry
-│   ├── results.py             ← ExecutionResult (pure data) + from_bundle factory
-│   └── worker.py              ← execute_worker(envelope) — picklable entry point
-│
-packages/climate-ref/src/climate_ref/
-├── lifecycle.py               ← ExecutionLifecycle class
-├── transports/
-│   ├── inmemory.py            ← InMemoryTransport
-│   └── processpool.py         ← ProcessPoolTransport (was LocalExecutor)
-└── executor/                  ← deleted; the Executor Protocol goes away
-│
-packages/climate-ref-celery/src/climate_ref_celery/
-└── transport.py               ← CeleryTransport (was CeleryExecutor)
-```
+Ingestion uses `INSERT … ON CONFLICT DO NOTHING` on natural keys
+(`(execution_id, output_type, short_name)` for outputs,
+`(execution_id, dimensions_hash[, index_name])` for metric/series values),
+so `replay_abandoned()` is safe.
+Scratch-to-results copy uses `exist_ok=True`.
 
-The existing `Executor` Protocol in `climate-ref-core/executor.py`
-and the three concrete executors are deleted.
-`execute_locally` becomes a thin wrapper around `execute_worker`
-kept for backwards compatibility for one release cycle, then removed.
+New `Execution` columns: `duration_seconds`, `peak_rss_mb`, `telemetry_meta JSON`.
+No solver code reads these today; they exist so a future adaptive
+`ResourceProvider` is a feature addition rather than a schema migration.
 
 ## Test impact
 
-**Deleted tests** (boundary tests will replace them):
-
-- `tests/unit/test_executor.py::test_is_system_error_*`
-  — private-import tests on `_is_system_error`
-- `tests/unit/test_providers.py::test_conda_*` subprocess patch chains
-  for the conda provider's `run()` are replaced by transport-level boundary tests
-- `tests/unit/executor/test_synchronous.py::*` reattachment tests
-- `tests/unit/executor/test_local.py::*` future-handling tests
-  beyond what `ProcessPoolTransport`'s own tests cover
-- Per-executor `mark_execution_failed` mock-chain tests
-
-**New boundary tests** (against `ExecutionLifecycle` with `InMemoryTransport`):
-
-- `submit` allocates a unique fragment per execution
-- `submit` persists `register_datasets` even when the transport dispatch fails
-- `drain` ingests one successful outcome end-to-end and sets `dirty=False`
-- `drain` re-dispatches a retryable failure on the next solve
-  by leaving `dirty=True`
-- `drain` marks a non-retryable failure with `dirty=False` and `successful=False`
-- `drain` is idempotent: re-draining the same outcome does not double-insert
-  scalar or series rows
-- `drain` enforces `resources.wall_clock` uniformly — a worker that exceeds
-  the deadline produces a retryable failure regardless of transport
-- CV validation fails hard with a `ResultValidationError` when metrics
-  reference unknown dimensions (replaces today's silent `logger.warning`)
-- `replay_abandoned` returns the IDs of executions whose rows exist
-  but never received an outcome
+Delete: `_is_system_error` private-import tests,
+subprocess patch chains in `test_providers.py`,
+per-executor reattach tests,
+`mark_execution_failed` mock chains.
+
+Add boundary tests against `ExecutionLifecycle` + `InMemoryTransport`:
+unique fragment per submit;
+end-to-end success → `dirty=False`;
+retryable failure leaves `dirty=True`;
+non-retryable failure → `dirty=False`, `successful=False`;
+re-drain idempotent (no double-insert);
+`wall_clock` enforced uniformly across transports;
+CV mismatch raises `ResultValidationError`;
+`replay_abandoned` returns stranded IDs.
 
 # Drawbacks
 [drawbacks]: #drawbacks
 
-## Migration cost
-
-This is not a small refactor.
-The dispatch loop, three executor classes, result handling,
-fragment allocation, and ingestion all move at once.
-A staged migration is possible — `InMemoryTransport` and `ProcessPoolTransport`
-can ship before `CeleryTransport` — but the lifecycle module itself has to
-land in one piece because every transport depends on it.
-
-## CV validation becomes a hard fail
-
-Today, when scalar or series metric values do not conform to the controlled
-vocabulary, `result_handling.py` logs a warning with a `TODO`.
-This RFC promotes that to a hard `ResultValidationError`.
-Any in-flight diagnostic with a CV mismatch will start failing.
-This is the *intended* behaviour — silent CV drift is a real source of data
-quality issues — but it is a behaviour change that needs deliberate handling
-during rollout (e.g. a one-release-cycle deprecation period where the
-violation is logged with `ERROR` instead of `WARNING` but does not yet raise).
-
-## A single fat class owns more responsibility
-
-`ExecutionLifecycle` ends up around 400–500 LOC.
-That is intentional: the things it owns are already coupled across eight
-files today, and module *depth* is the design goal.
-But it does mean the file is large; reviewers should expect that.
-
-## The Celery boundary still has one quirk
-
-In the current code, Celery's `link=handle_result.s(...)` chains a result
-handler that runs on the broker side of the boundary.
-In the new design, `CeleryTransport.poll` is what feeds outcomes back into
-`drain`.
-This is cleaner, but it means a coordinator-side crash mid-drain can leave
-results "in flight" in the broker that the coordinator has not yet observed.
-The mitigation is `replay_abandoned()`, which on next startup queries
-unfinished executions and re-polls the broker for them.
-This is a robustness improvement over today
-(where a coordinator crash silently strands rows in `successful=NULL`),
-but it is not free —
-it requires `CeleryTransport` to persist task IDs alongside execution IDs
-in a small table or in `telemetry_meta` so the broker can be queried later.
-
-## Resource hints can be wrong
-
-If a provider declares 4 GB for a diagnostic that actually needs 16 GB,
-the SLURM adapter will OOM-kill the job.
-Today's `LocalExecutor` runs in the parent address space and "just works"
-up to the host's memory.
-With `ResourceHint` in place, getting the number wrong is a new failure mode.
-Mitigations:
-
-- The default `ResourceHint(memory_mb=4096, cpu=1, wall_clock=2h)`
-  is generous enough that most diagnostics run unchanged.
-- `ProcessPoolTransport` deliberately ignores everything except
-  `wall_clock` — running on a developer laptop does not get harder.
-- Telemetry capture means the first failed run produces the data needed
-  to adjust the hint upward.
+- **Migration is wide.** The lifecycle lands in one piece because every
+  transport depends on it. Transports can ship staggered after that.
+- **CV becomes hard-fail.** Today's silent `logger.warning` becomes a
+  raise. Intentional, but needs a one-cycle deprecation window where
+  the violation is `ERROR` but not raised.
+- **One fat class (~400–500 LOC).** Intentional depth, but reviewers
+  should expect a large file.
+- **Celery has no broker-side result handler anymore.**
+  `CeleryTransport.poll` feeds outcomes back into `drain`.
+  A coordinator crash mid-drain is recovered by `replay_abandoned`,
+  which requires `CeleryTransport` to persist task IDs alongside
+  execution IDs.
+- **Resource hints can be wrong.** SLURM will OOM-kill a job whose
+  declared memory is too low. Mitigated by a generous default
+  (4 GB / 1 CPU / 2 h), by `ProcessPoolTransport` ignoring everything
+  except `wall_clock`, and by telemetry capture making the first
+  failed run actionable.
 
 # Rationale and alternatives
 [rationale-and-alternatives]: #rationale-and-alternatives
 
-Three designs were considered in detail.
-Sketches and trade-offs follow; the chosen design is a deliberate hybrid.
+Three designs were considered.
+The chosen interface is a deliberate hybrid.
 
 ```mermaid
 quadrantChart
@@ -624,9 +318,7 @@ quadrantChart
     "Hybrid (chosen)": [0.42, 0.7]
 ```
 
-Comparison at a glance (relative scale, 0–5):
-
-| Dimension              | A — Minimal | B — Maximal | C — Common-case | Hybrid (chosen) |
+| Dimension              | A — Minimal | B — Maximal | C — Common-case | **Hybrid** |
 |------------------------|:-:|:-:|:-:|:-:|
 | Public surface         | 1 | 5 | 2 | 2 |
 | Defaults baked in      | 3 | 1 | 5 | 4 |
@@ -635,218 +327,95 @@ Comparison at a glance (relative scale, 0–5):
 | Resource-hint support  | 0 | 5 | 0 | 5 |
 | Speculation tax        | 0 | 3 | 0 | 1 |
 
-## A — Minimal interface (rejected as a pure form)
-
-Two public methods (`submit`, `drain`), one Protocol (`Transport`).
-Everything else collapses behind the facade and cannot be customised
-without editing the module.
-
-Strength: minimal cognitive load. Weakness: no place to declare
-resource hints, no place for per-provider retry policies; both end up
-requiring constructor kwarg additions on first concrete demand.
-
-## B — Maximally flexible (rejected as a pure form)
-
-Five ports (`Transport`, `ArtifactStore`, `RetryPolicy`, `IngestSink`,
-`FragmentAllocator`), seven lifecycle hooks, plugin discovery via entry
-points, schema-versioned wire types.
-
-Strength: every anticipated future requirement has a seam already.
-Weakness: heavy speculation tax — five ports without five concrete
-implementations driving them is exactly the shape that calcifies into
-"the way things have always been done".
-
-## C — Common-case optimised (rejected as a pure form)
-
-One class with `dispatch`, `drain`, plus `replay_abandoned`, `dry_run`,
-`ingest`, `with_transport`. Sensible defaults sourced from `Config`.
-
-Strength: the solver call site collapses to one line per iteration.
-Weakness: a `ResultSink` callback that Celery silently ignores is an
-asymmetry that will trip someone, and the design has no place for
-resource hints.
-
-## The chosen hybrid
-
-- **C's façade** — `ExecutionLifecycle` is one class with `submit` and
-  `drain` as the hot path and a small set of cold-path entry points.
-- **A's transport contract** — `Transport.dispatch(envelope)` plus
-  `poll()` returning `Iterator[ExecutionOutcome]`. No `ResultSink`
-  callback. The lifecycle owns the result loop.
-- **A's bundle-writer separation** — `ExecutionResult` is pure data;
-  bundle JSON write moves into a worker-side `_BundleWriter`.
-- **B's `ExecutionEnvelope` and `Telemetry` wire types** — earned by
-  the actual provider-side pain (no place for resource hints) and by the
-  need to capture per-execution actuals for future adaptive sizing.
-- **Defer the rest of B** — no `ArtifactStore` port (S3 mirror is
-  hypothetical), no `IngestSink` priority chain (DB-only today),
-  no `LifecycleHooks` (today's `logger.*` calls suffice),
-  no `FragmentAllocator` port (one implementation foreseeable),
-  no entry-point plugin registry.
-- **Delete the `Executor` Protocol.** It is a shallow interface
-  dressed up as flexibility; `SynchronousExecutor` becomes
-  `ExecutionLifecycle(transport=InMemoryTransport())`.
-
-## Impact of not doing this
-
-Each future transport (SLURM, PBS, K8s) re-implements ~250 LOC of
-lifecycle wiring per adapter.
-Resource hints have to be retrofitted later through a new wire format,
-which is a strictly larger change than introducing them at the same time
-as the lifecycle consolidation.
-Per-task timeout, CV validation, dirty-flag rules, and retry
-classification stay scattered, and integration tests remain the only
-honest signal that the system works end-to-end.
+- **A — Minimal**: 2 methods, 1 port, everything else hidden.
+  No place for resource hints or per-provider retry without later kwarg growth.
+- **B — Maximal**: 5 ports (Transport, ArtifactStore, RetryPolicy,
+  IngestSink, FragmentAllocator) + 7 hooks + entry-point plugin registry.
+  Earned the wire-type split and `ResourceHint`; everything else is
+  speculation.
+- **C — Common-case**: one class, defaults sourced from `Config`,
+  solver call site collapses to one line.
+  A `ResultSink` callback that Celery silently ignores is an asymmetry
+  that will trip someone, and there is still no place for resource hints.
+
+**Chosen hybrid**: C's façade (one class, hot/cold method split) +
+A's transport contract (`dispatch(envelope)` + `poll() -> Iterator[Outcome]`,
+no result callback) + A's `BundleWriter` separation +
+B's `ExecutionEnvelope`/`Telemetry` wire types.
+Deferred: `ArtifactStore`, `IngestSink`, `LifecycleHooks`,
+`FragmentAllocator` as a port, plugin registry.
+The shallow `Executor` Protocol and the three concrete executors
+are deleted.
+
+**Impact of not doing this**: each new transport reimplements ~250 LOC
+of lifecycle wiring; resource hints retrofit later through a new wire
+format (strictly larger change); per-task timeout, CV validation,
+dirty-flag, and retry classification stay scattered.
 
 # Prior art
 [prior-art]: #prior-art
 
-## Dask's `distributed`
-
-Dask's scheduler accepts `resources=` annotations on submitted tasks
-(e.g. `client.submit(func, resources={"GPU": 1})`).
-The pattern of declaring resource needs on the unit-of-work and letting
-the scheduler honour them is the direct inspiration for `ResourceHint`.
-Dask does not concern itself with what is "above" the scheduler
-(catalog, ingest, controlled vocabulary),
-which is the layer this RFC is consolidating.
-
-## Snakemake / Nextflow
-
-Both pipeline frameworks have first-class resource directives
-(`resources: mem_mb=…, runtime=…`) on rules / processes,
-which translate transparently into SLURM / PBS / K8s directives.
-This RFC reuses that mental model at the diagnostic level.
-
-## Airflow operators
-
-Airflow's executor/operator split is the closest analogue:
-operators carry the work definition, executors decide how to run it.
-Airflow learned the hard way that an executor that owns result
-handling is too narrow an interface; their `BaseExecutor.execute_async`
-plus `sync` (poll) split is essentially what `Transport.dispatch` /
-`Transport.poll` is here.
-
-## Celery's `task_time_limit` and queue routing
-
-Celery already supports per-queue routing and per-task time limits.
-This RFC's `ResourceHint.queue` maps directly onto Celery's queue
-routing; `wall_clock` maps onto `task_time_limit` plus
-`task_soft_time_limit`.
-The `CeleryExecutor` in `climate-ref-celery` does not currently use
-either; the new `CeleryTransport` does.
-
-## The Rust RFC process itself
-
-The shape of this document (Summary, Motivation, Reference-level,
-Drawbacks, Rationale, Prior art, Unresolved, Future) is inherited from
-[rust-lang/rfcs](https://github.com/rust-lang/rfcs)
-via this repository's template.
+- **Dask `distributed`** — `resources=` annotations on submitted tasks
+  inspire `ResourceHint`.
+- **Snakemake / Nextflow** — first-class `resources:` directives
+  translate transparently into SLURM / PBS / K8s. Same mental model at
+  the diagnostic level.
+- **Airflow** — executor / operator split; `BaseExecutor.execute_async`
+  + `sync` is essentially `Transport.dispatch` + `Transport.poll`.
+- **Celery** — `task_time_limit` + queue routing. `ResourceHint.queue`
+  maps onto Celery queues; `wall_clock` onto `task_time_limit` /
+  `task_soft_time_limit`. Today's `CeleryExecutor` uses neither.
+- **Rust RFC process** — document shape inherited via this repo's
+  template.
 
 # Unresolved questions
 [unresolved-questions]: #unresolved-questions
 
-To resolve through the RFC discussion:
-
-- **Should `ResourceHint` include I/O class (e.g. `io_intensive: bool`)
-  or GPU (`gpu: int`)?**
-  Neither has a concrete consumer today.
-  The recommended default is to add fields when a concrete adapter
-  needs them rather than speculating now.
-- **Where does the controlled vocabulary live in test contexts?**
-  Today `CV.load_from_file(config.paths.dimensions_cv)` is invoked
-  per-result inside `handle_execution_result`.
-  The RFC moves it to lifecycle construction, which means tests need
-  a deterministic CV.
-  A `PermissiveCV()` test fixture is one option;
-  baking the project's CV into a constant for test use is another.
-- **Should `replay_abandoned` be called automatically on
-  `ExecutionLifecycle.__init__`?**
-  Doing so makes recovery transparent;
-  not doing so makes startup behaviour explicit.
-  The current draft assumes it is explicit and called from the CLI.
+To resolve through this RFC:
+
+- Should `ResourceHint` include `gpu: int` / `io_intensive: bool`?
+  Recommended default: add when a concrete adapter needs them.
+- Where does the CV come from in tests? `PermissiveCV()` fixture
+  vs. project CV baked into a constant.
+- Is `replay_abandoned` automatic on `__init__` or explicit from the CLI?
+  Draft assumes explicit.
 
 To resolve through implementation:
 
-- **Exact unique-constraint shape for upserts** —
-  `(execution_id, output_type, short_name)` for `ExecutionOutput`
-  is the obvious key, but the existing schema does not enforce it.
-  An Alembic migration is needed.
-- **`CeleryTransport.poll` semantics** —
-  whether to use `AsyncResult.get(timeout=…)` per task
-  or `app.control.inspect` for batch status.
-  Implementation choice; does not affect the public interface.
-- **`Telemetry.peak_rss_mb` capture on macOS** —
-  `resource.getrusage` returns bytes on macOS and KB on Linux.
-  A small platform shim is required worker-side.
+- Exact upsert unique constraints + Alembic migration.
+- `CeleryTransport.poll` semantics (per-task `AsyncResult.get` vs batch
+  inspection).
+- `peak_rss_mb` capture across macOS / Linux (`getrusage` unit difference).
 
 Out of scope:
 
-- **Adaptive resource provider** (reading rolling telemetry to suggest
-  hints). Telemetry columns land here; the provider is a follow-up RFC.
-- **GPU scheduling**.
-  None of the current diagnostics use GPUs;
-  when one does, `ResourceHint` gains a `gpu: int` field.
-- **Multi-tenant queues**.
-  `ResourceHint.queue` is enough for single-deployment routing;
-  multi-tenant work (per-org queues, fair-share) is out of scope.
-- **Replacing SLURM / PBS / Celery as schedulers**.
-  Explicitly *not* in scope: the lifecycle owns the seam *above* the
-  scheduler, not the scheduler itself.
+- Adaptive resource provider (telemetry columns land here; the
+  provider is a follow-up RFC).
+- GPU scheduling, multi-tenant queues.
+- Replacing SLURM / PBS / Celery as schedulers.
 
 # Future possibilities
 [future-possibilities]: #future-possibilities
 
-The deepening proposed here unlocks several follow-ups, each a
-self-contained piece of work:
-
-- **SLURM and PBS transports.**
-  Each is a ~100 LOC adapter implementing `Transport`.
-  `dispatch` builds a job script from `envelope.resources`,
-  shells out to `sbatch` / `qsub`, records the job ID in `telemetry_meta`;
-  `poll` queries `squeue` / `qstat` and surfaces completed jobs back as
-  `ExecutionOutcome`s.
-  No changes to the lifecycle or to provider code are required.
-
-- **Adaptive `ResourceProvider`.**
-  Reads `Execution.peak_rss_mb` over a rolling window per
-  `(diagnostic_slug, dataset_shape_signature)`,
-  suggests a memory hint at the rolling p95 × 1.2,
-  applies a per-CLI flag to opt in.
-  Schema is already in place from this RFC; the provider is the only
-  new code.
-
-- **Per-provider retry policies.**
-  Currently `RetryPolicy` is a single injected object.
-  ESMValTool's memory-exhaustion failures are retryable with a larger
-  memory hint; a `Mapping[str, RetryPolicy]` keyed on provider slug
-  becomes the natural extension when concrete demand arrives.
-
-- **Streaming partial results.**
-  `Transport.poll` already yields outcomes incrementally;
-  a `PartialOutcome` event could be added to the iterator for
-  progress reporting if a UI consumer appears.
-
-- **S3 / HTTP artifact store.**
-  If results need to land somewhere other than the local filesystem,
-  an `ArtifactStore` port can be carved out of `_Promoter` later.
-  The wire format does not change (scratch path remains relative);
-  only the lifecycle's promote step is rewritten.
-
-- **Pluggable ingest sinks.**
-  Prometheus metrics, audit logs, S3 mirrors — each is a side-effect
-  observer over `ExecutionOutcome`.
-  A small `IngestSink` interface with ordered observers is a small
-  follow-up if and when a second sink (beyond the DB) materialises.
-
-- **Recovery on coordinator restart.**
-  `replay_abandoned()` is in this RFC.
-  Extending it to interrogate transport-side state (SLURM job that
-  finished while the coordinator was down) is a natural follow-up that
-  only needs `Transport.lookup(transport_meta) -> ExecutionOutcome | None`
-  on each adapter.
-
-None of the above are reasons to accept this RFC on their own;
-they are listed to show that the seam introduced here is the right shape
-for the directions the project is plausibly heading,
-without baking any of them in prematurely.
+Each item below is a self-contained follow-up enabled by this RFC:
+
+- **SLURM / PBS transports** — ~100 LOC adapters; `dispatch` builds a
+  job script from `envelope.resources`, `poll` queries `squeue` / `qstat`.
+- **Adaptive `ResourceProvider`** — reads `Execution.peak_rss_mb` over
+  a rolling window, suggests memory hint at p95 × 1.2.
+- **Per-provider retry policies** — single `RetryPolicy` becomes
+  `Mapping[str, RetryPolicy]` keyed on provider slug when concrete
+  demand arrives.
+- **Streaming partial outcomes** — `Transport.poll` already incremental;
+  add a `PartialOutcome` event when a UI consumer appears.
+- **S3 / HTTP artifact store** — extract `_Promoter` into an
+  `ArtifactStore` port when a non-local target lands.
+- **Pluggable ingest sinks** — Prometheus, audit log, S3 mirror as
+  ordered `IngestSink` observers when a second sink materialises.
+- **Cross-restart recovery** — extend `replay_abandoned` with
+  `Transport.lookup(transport_meta)` so a SLURM job finished while the
+  coordinator was down is picked up.
+
+None of these are reasons to accept this RFC on their own;
+they show the seam is the right shape for the directions the project is
+plausibly heading, without pre-baking any of them.

From 034a91af23eec2620d71606400a67e3d432878ec Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Thu, 4 Jun 2026 12:09:42 +1000
Subject: [PATCH 06/12] rfc(execution-lifecycle): address adversarial review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Correct factual and overstated claims, and surface design risks the
draft glossed:

- Drop the false "not picklable" claim — ExecutionResult already pickles
  across the ProcessPoolExecutor today; reframe as construction entangled
  with disk I/O (a testability cost, not a correctness one).
- Split retry classification into worker-side (exception -> outcome,
  consolidating _is_system_error + CondaCommandError) and coordinator-side
  (outcome -> decision); drop the misleading "five sites collapse to one"
  and the unused policy constants.
- Default wall_clock 2h -> 6h to preserve current LocalExecutor budget;
  fix the mitigation text. Avoids a silent kill-at-2h regression.
- Move deadline computation to job start (transport-side); anchoring at
  submit time would expire queued SLURM/PBS jobs before they run.
- Expand the Celery push->pull trade-off (loss of fire-and-forget
  ingestion) from one bullet to an explicit, weighted drawback.
- Add a staged migration plan and a deprecation cycle for the public
  Executor protocol / import_executor_cls config key.
- Note ResourceHint must live in climate_ref_core (layering).
- Clarify ON CONFLICT DO NOTHING is safe because each retry mints a new
  execution_id.
- Remove dry_run from the lifecycle surface (planning, not dispatch).
- Soften unverifiable LOC estimates; scope dirty-flag rule to the result
  path (CLI resets stay separate).
---
 text/0003-execution-lifecycle.md | 173 +++++++++++++++++++++++--------
 1 file changed, 132 insertions(+), 41 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index edc0a65..0c40a56 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -12,7 +12,8 @@ Add a declarative `ResourceHint` on every `Diagnostic`
 and capture per-execution `Telemetry`,
 so providers can express memory / CPU / wall-clock once
 and future schedulers (SLURM, PBS, K8s)
-plug in as ~100 LOC adapters.
+plug in as thin adapters that only translate the envelope
+and poll job state.
 
 # Motivation
 [motivation]: #motivation
@@ -23,10 +24,11 @@ A single happy-path run touches
 `solver.py` (allocates row + fragment, register_datasets, expunge, commit),
 `climate_ref_core/executor.py` (`execute_locally`, `_is_system_error`,
 `CondaCommandError` handling),
-`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle`
-which secretly writes three JSON files inside a frozen-attrs factory),
+`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle`,
+a static factory that writes three JSON files to disk as a side effect of
+constructing the result),
 `climate_ref/executor/result_handling.py` (scratch→results copy ×4,
-ingestion in nested-tx, dirty-flag toggled in 3 branches),
+ingestion in nested-tx, dirty-flag toggled in 3 branches of the result path),
 `climate_ref/executor/fragment.py` (PLACEHOLDER_FRAGMENT, group_short),
 and one of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py`
 (each reimplementing the same reattach / commit / mark dance).
@@ -35,23 +37,34 @@ Concrete consequences today:
 
 - **Providers have no place to declare parallelisation hints.**
   ESMValTool diagnostics that need 16 GB or 8 h have nowhere to say so;
-  `LocalExecutor` hardcodes a 6 h per-task timeout,
+  `LocalExecutor` applies one 6 h per-task timeout to every diagnostic
+  (a single constructor default, not per-diagnostic),
   `CeleryExecutor` enforces no per-task timeout at all,
   and a future SLURM/PBS adapter has nothing to translate into `sbatch`/`qsub`.
 - **Retry classification is scattered** across `_is_system_error`,
   the `CondaCommandError` branch, missing-log handling, `LocalExecutor`'s
   per-task timeout, and pool-shutdown abandonment.
-- **The `dirty` flag is decided in three places**
+- **The `dirty` flag is decided in three branches of the result path**
   (success path, non-retryable failure, retryable failure / missing log).
+  User-initiated `dirty=True` resets in `cli/executions.py` (rerun / reset)
+  are deliberately separate and stay outside the consolidated rule.
 - **CV validation is silenced** (`logger.warning` with TODO instead of raising).
 - **Tests mock subprocess + filesystem + DB + Celery** and import private
   helpers (`_is_system_error`) to compensate for the missing seam.
-- **`ExecutionResult` is not cleanly picklable** because its frozen factory
-  performs disk I/O.
-
-The CMIP REF is approaching deployment targets (SLURM / PBS / K8s) where
-each new adapter would otherwise mean ~250 LOC of duplicated lifecycle
-wiring.
+- **`ExecutionResult` construction is entangled with disk I/O.**
+  It already pickles across the `ProcessPoolExecutor` boundary today
+  (`_process_run` returns it to the parent), so picklability is not the
+  problem — the problem is that the factory cannot be exercised without a
+  writable output directory, which forces filesystem fixtures into every
+  unit test that builds a result.
+
+The CMIP REF is approaching deployment targets (SLURM / PBS / K8s).
+The recurring lifecycle logic — fragment allocation, the reattach / commit /
+mark dance, scratch→results promotion, retry classification — is already
+shared in `result_handling.py`; what each new transport reimplements is the
+dispatch / poll / timeout wiring around it.
+This RFC pulls that shared logic behind one seam so a transport contributes
+only its dispatch and poll, not a copy of the lifecycle.
 This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers.
 It is about defining a single robust seam *above* them.
 
@@ -69,7 +82,6 @@ classDiagram
         +submit(execution, definition)
         +drain(timeout) None
         +replay_abandoned() list~int~
-        +dry_run(group, datasets) ExecutionDefinition
         -_FragmentAllocator
         -_Classifier
         -_Promoter
@@ -106,12 +118,23 @@ classDiagram
 
 Picklable value objects only. No DB sessions, `Config`, or CV cross the boundary.
 
+`ResourceHint` lives in `climate_ref_core` (the `Diagnostic` base class
+declares it, and core cannot import the application package).
+`ExecutionEnvelope`, `Telemetry`, `ExecutionOutcome`, and the `Transport`
+protocol live in `climate_ref.lifecycle` alongside `ExecutionLifecycle`.
+
+The default `wall_clock` is **6 h**, matching today's `LocalExecutor`
+per-task timeout, so diagnostics that run for hours without declaring
+`resources` keep their current budget (Celery enforces no limit today, so
+nothing regresses there either). Tightening the default is a separate,
+explicit decision.
+
 ```python
 @attrs.frozen
 class ResourceHint:
     memory_mb: int = 4096
     cpu: int = 1
-    wall_clock: timedelta = timedelta(hours=2)
+    wall_clock: timedelta = timedelta(hours=6)   # matches current LocalExecutor budget
     queue: str | None = None    # transport-specific routing tag
 
 @attrs.frozen
@@ -119,7 +142,10 @@ class ExecutionEnvelope:
     execution_id: int
     definition: ExecutionDefinition
     resources: ResourceHint
-    deadline: datetime          # clock() + resources.wall_clock at submit time
+    # wall_clock travels in `resources`; the *deadline* is computed by the
+    # transport when the job starts running, not here — a queued SLURM/PBS
+    # job may wait hours before it begins, so anchoring the deadline at
+    # submit time would expire jobs before they start.
 
 @attrs.frozen
 class Telemetry:
@@ -202,10 +228,10 @@ sequenceDiagram
     L->>D: resources_for(definition)
     D-->>L: ResourceHint(...)
     L->>DB: allocate fragment + register_datasets + expunge + commit
-    Note over L: deadline = clock() + resources.wall_clock
     L->>T: dispatch(ExecutionEnvelope)
 
     Note over T,W: sbatch · qsub · pool.submit · celery send · inline
+    Note over T: deadline = job_start + resources.wall_clock (transport-side)
     T->>W: hand off envelope
     activate W
     W->>W: diagnostic.run · _BundleWriter.write · CV.validate · Telemetry
@@ -224,6 +250,28 @@ sequenceDiagram
 
 ## Retry + dirty rule (one source of truth)
 
+Classification happens in two clearly separated places.
+
+**Worker side** — exception → outcome. The worker runs the diagnostic and
+maps the raised exception (or clean return) onto the booleans carried by
+`ExecutionResult` / `ExecutionFailure`. This is where the exception-type
+knowledge lives, because the exception is only ever raised on the worker:
+
+```python
+# worker-side, inside the run path
+SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)  # → retryable
+NON_RETRYABLE = (CondaCommandError,)                                   # → give up
+```
+
+This consolidates the exception-classification logic that is **today**
+spread across `_is_system_error` and the separate `CondaCommandError`
+branch in `execute_locally` into one worker-side function.
+
+**Coordinator side** — outcome → decision. The policy never inspects
+exception types; it maps the already-classified outcome onto a decision,
+so the same rule applies identically to every transport (including remote
+ones where the exception object never comes back):
+
 ```python
 class RetryDecision(enum.Enum):
     SUCCESS = "success"
@@ -231,9 +279,6 @@ class RetryDecision(enum.Enum):
     GIVE_UP = "give_up"   # sets dirty=False, marks failed
 
 class DefaultRetryPolicy:
-    SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)
-    NON_RETRYABLE = (CondaCommandError,)
-
     def classify(self, outcome: ExecutionOutcome) -> RetryDecision:
         if outcome.failure is not None:  # timeout | broker_lost | pool_shutdown
             return RetryDecision.RETRY
@@ -243,9 +288,10 @@ class DefaultRetryPolicy:
         return RetryDecision.RETRY if r.retryable else RetryDecision.GIVE_UP
 ```
 
-Replaces `_is_system_error`, the `CondaCommandError` branch,
-missing-log handling, the per-task timeout path,
-and pool-shutdown abandonment — five sites collapse to one.
+Net effect: exception classification goes from two scattered sites to one
+worker-side function, and the *transport-level* outcomes that today live in
+the per-executor `join` loops (missing log, per-task timeout, pool-shutdown
+abandonment) collapse into the single coordinator policy above.
 
 ## Idempotent ingest, telemetry
 
@@ -255,6 +301,13 @@ Ingestion uses `INSERT … ON CONFLICT DO NOTHING` on natural keys
 so `replay_abandoned()` is safe.
 Scratch-to-results copy uses `exist_ok=True`.
 
+`DO NOTHING` (rather than `DO UPDATE`) is correct because every key is
+scoped to `execution_id`, and each solve mints a **new** `Execution` row per
+attempt: a retry produces a fresh `execution_id`, so its values never
+collide with the abandoned attempt's. The conflict clause therefore only
+guards re-ingestion of the *same* `execution_id` during replay — it never
+silently keeps stale values from a previous attempt.
+
 New `Execution` columns: `duration_seconds`, `peak_rss_mb`, `telemetry_meta JSON`.
 No solver code reads these today; they exist so a future adaptive
 `ResourceProvider` is a feature addition rather than a schema migration.
@@ -279,22 +332,47 @@ CV mismatch raises `ResultValidationError`;
 # Drawbacks
 [drawbacks]: #drawbacks
 
-- **Migration is wide.** The lifecycle lands in one piece because every
-  transport depends on it. Transports can ship staggered after that.
+- **Celery loses fire-and-forget ingestion — the biggest trade-off.**
+  Today `CeleryExecutor` attaches `link` / `link_error` callbacks
+  (`handle_result` / `handle_failure`) so a worker ingests its own result
+  with no live coordinator; the submitting process can exit immediately.
+  The pull model (`Transport.poll` feeding `drain`) couples ingestion to a
+  coordinator that stays alive for the whole batch. This is a real
+  regression for the distributed case and is accepted deliberately: it buys
+  one ingestion path and uniform retry/dirty handling across transports,
+  and `replay_abandoned` (backed by `CeleryTransport` persisting task IDs
+  alongside execution IDs) recovers a coordinator crash mid-drain. If
+  detached submission turns out to be a hard requirement, a worker-side
+  `IngestSink` callback can be added later without changing the seam — but
+  the draft does **not** preserve it, and reviewers should weigh that.
+- **Migration is wide and the seam swap is atomic.** Staggering applies to
+  *adding* transports later, not to the cutover: `climate-ref-core` (wire
+  types, `ResourceHint` on `Diagnostic`), `climate-ref`
+  (`ExecutionLifecycle`, the transports), `climate-ref-celery`, and an
+  Alembic migration all land together, because the seam replaces the
+  `Executor` protocol the solver calls. Proposed landing order to bound
+  risk: (1) add `ResourceHint` + telemetry columns (additive, no behaviour
+  change); (2) introduce `ExecutionLifecycle` + `InMemoryTransport` +
+  `ProcessPoolTransport` behind the existing solver entry point with the old
+  executors still present; (3) port Celery; (4) delete the old executors.
+- **Deleting the `Executor` protocol is a breaking public change.**
+  `import_executor_cls` resolves an executor from a dotted path in `Config`,
+  so the executor class is a documented extension point and any downstream
+  custom executor implements it. The cutover must ship a deprecation cycle:
+  keep `import_executor_cls` resolving known names to the new transports,
+  warn on custom FQNs, and provide a config-migration note. This is not yet
+  spelled out and is a precondition for merge.
 - **CV becomes hard-fail.** Today's silent `logger.warning` becomes a
   raise. Intentional, but needs a one-cycle deprecation window where
-  the violation is `ERROR` but not raised.
+  the violation is `ERROR` but not raised — and it ships in the same wide
+  migration as the seam swap, so it must be feature-flagged to keep the two
+  behaviour changes independently bisectable.
 - **One fat class (~400–500 LOC).** Intentional depth, but reviewers
-  should expect a large file.
-- **Celery has no broker-side result handler anymore.**
-  `CeleryTransport.poll` feeds outcomes back into `drain`.
-  A coordinator crash mid-drain is recovered by `replay_abandoned`,
-  which requires `CeleryTransport` to persist task IDs alongside
-  execution IDs.
+  should expect a large file. (LOC is an estimate, not a target.)
 - **Resource hints can be wrong.** SLURM will OOM-kill a job whose
-  declared memory is too low. Mitigated by a generous default
-  (4 GB / 1 CPU / 2 h), by `ProcessPoolTransport` ignoring everything
-  except `wall_clock`, and by telemetry capture making the first
+  declared memory is too low. Mitigated by a default that preserves current
+  behaviour (4 GB / 1 CPU / 6 h), by `ProcessPoolTransport` ignoring
+  everything except `wall_clock`, and by telemetry capture making the first
   failed run actionable.
 
 # Rationale and alternatives
@@ -342,15 +420,22 @@ quadrantChart
 A's transport contract (`dispatch(envelope)` + `poll() -> Iterator[Outcome]`,
 no result callback) + A's `BundleWriter` separation +
 B's `ExecutionEnvelope`/`Telemetry` wire types.
+Dropping the result callback is what costs Celery its fire-and-forget
+ingestion (see Drawbacks); it is chosen for one uniform pull path, and a
+worker-side `IngestSink` remains a non-breaking future addition.
 Deferred: `ArtifactStore`, `IngestSink`, `LifecycleHooks`,
 `FragmentAllocator` as a port, plugin registry.
 The shallow `Executor` Protocol and the three concrete executors
-are deleted.
+are deleted (with a deprecation cycle for `import_executor_cls`; see
+Drawbacks).
 
-**Impact of not doing this**: each new transport reimplements ~250 LOC
-of lifecycle wiring; resource hints retrofit later through a new wire
-format (strictly larger change); per-task timeout, CV validation,
-dirty-flag, and retry classification stay scattered.
+**Impact of not doing this**: each new transport reimplements its own
+dispatch / poll / timeout wiring and re-derives retry and dirty handling
+inline (the shared promotion/ingest helpers in `result_handling.py` already
+exist, but nothing forces a new transport to route through them);
+resource hints retrofit later through a new wire format (strictly larger
+change); per-task timeout, CV validation, dirty-flag, and exception
+classification stay scattered.
 
 # Prior art
 [prior-art]: #prior-art
@@ -379,10 +464,15 @@ To resolve through this RFC:
   vs. project CV baked into a constant.
 - Is `replay_abandoned` automatic on `__init__` or explicit from the CLI?
   Draft assumes explicit.
+- Is detached (coordinator-free) submission a hard requirement for Celery
+  deployments? If yes, a worker-side `IngestSink` ships with the cutover
+  rather than as a deferred addition.
 
 To resolve through implementation:
 
 - Exact upsert unique constraints + Alembic migration.
+- Deprecation path for `import_executor_cls` / the executor FQN config key
+  (name → transport mapping, warning on custom executors, migration note).
 - `CeleryTransport.poll` semantics (per-task `AsyncResult.get` vs batch
   inspection).
 - `peak_rss_mb` capture across macOS / Linux (`getrusage` unit difference).
@@ -399,8 +489,9 @@ Out of scope:
 
 Each item below is a self-contained follow-up enabled by this RFC:
 
-- **SLURM / PBS transports** — ~100 LOC adapters; `dispatch` builds a
-  job script from `envelope.resources`, `poll` queries `squeue` / `qstat`.
+- **SLURM / PBS transports** — thin adapters: `dispatch` builds a job
+  script from `envelope.resources` and submits it, `poll` queries
+  `squeue` / `qstat` and computes the deadline from job start time.
 - **Adaptive `ResourceProvider`** — reads `Execution.peak_rss_mb` over
   a rolling window, suggests memory hint at p95 × 1.2.
 - **Per-provider retry policies** — single `RetryPolicy` becomes

From 36dd7a298b8eb3b83454a5dec0c629206a6894ec Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Thu, 4 Jun 2026 15:41:01 +1000
Subject: [PATCH 07/12] chore: formatting

---
 text/0003-execution-lifecycle.md | 26 +++++++++-----------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index 0c40a56..4773bd8 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -3,7 +3,6 @@
 - RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3)
 
 # Summary
-[summary]: #summary
 
 Consolidate the lifecycle of one diagnostic execution
 — allocation, dispatch, run, classify, publish, ingest, finalise —
@@ -16,7 +15,6 @@ plug in as thin adapters that only translate the envelope
 and poll job state.
 
 # Motivation
-[motivation]: #motivation
 
 The lifecycle of one execution is currently fragmented across ~8 files
 in 2 packages.
@@ -69,7 +67,6 @@ This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers.
 It is about defining a single robust seam *above* them.
 
 # Reference-level explanation
-[reference-level-explanation]: #reference-level-explanation
 
 ## Module boundary
 
@@ -330,7 +327,6 @@ CV mismatch raises `ResultValidationError`;
 `replay_abandoned` returns stranded IDs.
 
 # Drawbacks
-[drawbacks]: #drawbacks
 
 - **Celery loses fire-and-forget ingestion — the biggest trade-off.**
   Today `CeleryExecutor` attaches `link` / `link_error` callbacks
@@ -376,7 +372,6 @@ CV mismatch raises `ResultValidationError`;
   failed run actionable.
 
 # Rationale and alternatives
-[rationale-and-alternatives]: #rationale-and-alternatives
 
 Three designs were considered.
 The chosen interface is a deliberate hybrid.
@@ -396,14 +391,14 @@ quadrantChart
     "Hybrid (chosen)": [0.42, 0.7]
 ```
 
-| Dimension              | A — Minimal | B — Maximal | C — Common-case | **Hybrid** |
-|------------------------|:-:|:-:|:-:|:-:|
-| Public surface         | 1 | 5 | 2 | 2 |
-| Defaults baked in      | 3 | 1 | 5 | 4 |
-| Bend without editing   | 3 | 5 | 2 | 3 |
-| Migration churn        | 4 | 5 | 2 | 3 |
-| Resource-hint support  | 0 | 5 | 0 | 5 |
-| Speculation tax        | 0 | 3 | 0 | 1 |
+| Dimension             | A — Minimal | B — Maximal | C — Common-case | **Hybrid** |
+| --------------------- | :---------: | :---------: | :-------------: | :--------: |
+| Public surface        |      1      |      5      |        2        |     2      |
+| Defaults baked in     |      3      |      1      |        5        |     4      |
+| Bend without editing  |      3      |      5      |        2        |     3      |
+| Migration churn       |      4      |      5      |        2        |     3      |
+| Resource-hint support |      0      |      5      |        0        |     5      |
+| Speculation tax       |      0      |      3      |        0        |     1      |
 
 - **A — Minimal**: 2 methods, 1 port, everything else hidden.
   No place for resource hints or per-provider retry without later kwarg growth.
@@ -438,7 +433,6 @@ change); per-task timeout, CV validation, dirty-flag, and exception
 classification stay scattered.
 
 # Prior art
-[prior-art]: #prior-art
 
 - **Dask `distributed`** — `resources=` annotations on submitted tasks
   inspire `ResourceHint`.
@@ -446,7 +440,7 @@ classification stay scattered.
   translate transparently into SLURM / PBS / K8s. Same mental model at
   the diagnostic level.
 - **Airflow** — executor / operator split; `BaseExecutor.execute_async`
-  + `sync` is essentially `Transport.dispatch` + `Transport.poll`.
+  - `sync` is essentially `Transport.dispatch` + `Transport.poll`.
 - **Celery** — `task_time_limit` + queue routing. `ResourceHint.queue`
   maps onto Celery queues; `wall_clock` onto `task_time_limit` /
   `task_soft_time_limit`. Today's `CeleryExecutor` uses neither.
@@ -454,7 +448,6 @@ classification stay scattered.
   template.
 
 # Unresolved questions
-[unresolved-questions]: #unresolved-questions
 
 To resolve through this RFC:
 
@@ -485,7 +478,6 @@ Out of scope:
 - Replacing SLURM / PBS / Celery as schedulers.
 
 # Future possibilities
-[future-possibilities]: #future-possibilities
 
 Each item below is a self-contained follow-up enabled by this RFC:
 

From dde7dbb8098c066c6288fa4dc277b1c68f08849a Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Thu, 4 Jun 2026 16:02:57 +1000
Subject: [PATCH 08/12] rfc(execution-lifecycle): rewrite around on-disk
 manifest + liveness transport
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Full rewrite after design review. Pivots the proposal from a result-carrying
transport to a disk-backed lifecycle:

- Source of truth is an on-disk, two-phase execution manifest (execution.json),
  REF-owned and typed, embedding CMEC provenance. Phase 1 (identity) written
  before the run; phase 2 (outcome) written last, even on recoverable failure.
  Absent outcome = incomplete (in-flight or hard-killed) — crashes are now
  diagnosable from durable state.
- Transport narrows to launch + liveness (RUNNING|EXITED|GONE); results never
  cross the boundary. Manifest decides outcome, transport decides liveness, a
  present manifest outcome wins over liveness.
- Split into two verbs: run (definition -> on-disk bundle) and ingest
  (dir -> DB, transport-agnostic, idempotent, CV-validated, replayable).
- All executors ported (InMemory, ProcessPool, Celery, Slurm, Pbs); only K8s
  deferred. Celery link/link_error demotes to an optional eager-ingest hook.
- Primary driver reframed as regression output + crash-robustness; HPC and
  cross-deployment portability are downstream beneficiaries.
- Adds execution state diagram, staged migration plan, and CMEC-aligned terms.
- Defers: K8s, external/null execution, portability tier 3 (remote-file
  datasets, own RFC), external-version enforcement (ingest-as-is), telemetry/
  adaptive provider, CV hard-fail.

Markdown uses semantic line breaks targeting ~110 columns.
---
 text/0003-execution-lifecycle.md | 862 +++++++++++++++----------------
 1 file changed, 431 insertions(+), 431 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index 4773bd8..4c25eab 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -3,502 +3,502 @@
 - RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3)
 
 # Summary
+[summary]: #summary
 
-Consolidate the lifecycle of one diagnostic execution
-— allocation, dispatch, run, classify, publish, ingest, finalise —
-into one deep module (`ExecutionLifecycle`) behind one `Transport` port.
-Add a declarative `ResourceHint` on every `Diagnostic`
-and capture per-execution `Telemetry`,
-so providers can express memory / CPU / wall-clock once
-and future schedulers (SLURM, PBS, K8s)
-plug in as thin adapters that only translate the envelope
-and poll job state.
+Replace the shallow `Executor` protocol and its three divergent implementations
+(`SynchronousExecutor`, `LocalExecutor`, `CeleryExecutor`)
+with one execution lifecycle built on two ideas.
 
-# Motivation
+1. A self-describing, on-disk **execution manifest** (`execution.json`)
+   that records — in CMEC-compatible terms — what an execution *is* and how it *ended*.
+2. A thin **`Transport`** port that only *launches* work and reports *liveness*
+   (`RUNNING | EXITED | GONE`).
+
+Results never travel back through the transport.
+Every execution method — in-process, process pool, Celery, SLURM, PBS —
+completes by writing its CMEC bundles to a directory on disk,
+and a single, transport-agnostic *ingest* step loads the manifest and bundles into the database.
+
+This makes executions crash-robust and replayable,
+makes diagnostic output self-describing so it can be compared across runs and versions (regression),
+and reduces every backend to a launcher plus a status query.
 
-The lifecycle of one execution is currently fragmented across ~8 files
-in 2 packages.
-A single happy-path run touches
-`solver.py` (allocates row + fragment, register_datasets, expunge, commit),
-`climate_ref_core/executor.py` (`execute_locally`, `_is_system_error`,
-`CondaCommandError` handling),
-`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle`,
-a static factory that writes three JSON files to disk as a side effect of
-constructing the result),
-`climate_ref/executor/result_handling.py` (scratch→results copy ×4,
-ingestion in nested-tx, dirty-flag toggled in 3 branches of the result path),
-`climate_ref/executor/fragment.py` (PLACEHOLDER_FRAGMENT, group_short),
-and one of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py`
-(each reimplementing the same reattach / commit / mark dance).
-
-Concrete consequences today:
-
-- **Providers have no place to declare parallelisation hints.**
-  ESMValTool diagnostics that need 16 GB or 8 h have nowhere to say so;
-  `LocalExecutor` applies one 6 h per-task timeout to every diagnostic
-  (a single constructor default, not per-diagnostic),
-  `CeleryExecutor` enforces no per-task timeout at all,
-  and a future SLURM/PBS adapter has nothing to translate into `sbatch`/`qsub`.
-- **Retry classification is scattered** across `_is_system_error`,
-  the `CondaCommandError` branch, missing-log handling, `LocalExecutor`'s
-  per-task timeout, and pool-shutdown abandonment.
-- **The `dirty` flag is decided in three branches of the result path**
-  (success path, non-retryable failure, retryable failure / missing log).
-  User-initiated `dirty=True` resets in `cli/executions.py` (rerun / reset)
-  are deliberately separate and stay outside the consolidated rule.
-- **CV validation is silenced** (`logger.warning` with TODO instead of raising).
-- **Tests mock subprocess + filesystem + DB + Celery** and import private
-  helpers (`_is_system_error`) to compensate for the missing seam.
-- **`ExecutionResult` construction is entangled with disk I/O.**
-  It already pickles across the `ProcessPoolExecutor` boundary today
-  (`_process_run` returns it to the parent), so picklability is not the
-  problem — the problem is that the factory cannot be exercised without a
-  writable output directory, which forces filesystem fixtures into every
-  unit test that builds a result.
-
-The CMIP REF is approaching deployment targets (SLURM / PBS / K8s).
-The recurring lifecycle logic — fragment allocation, the reattach / commit /
-mark dance, scratch→results promotion, retry classification — is already
-shared in `result_handling.py`; what each new transport reimplements is the
-dispatch / poll / timeout wiring around it.
-This RFC pulls that shared logic behind one seam so a transport contributes
-only its dispatch and poll, not a copy of the lifecycle.
-This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers.
-It is about defining a single robust seam *above* them.
+# Motivation
+[motivation]: #motivation
+
+The REF needs **one robust way to run diagnostics** that works across a range of deployments —
+a laptop, a Celery cluster, and an HPC batch scheduler at a modelling centre.
+"Robust" has a concrete meaning here:
+an execution survives coordinator and worker crashes cleanly,
+and an operator can always see *what a run did and why it failed* from durable state, not from a live process.
+
+Today none of that holds, for four reasons.
+
+## The lifecycle is fragmented across divergent executors
+
+A single happy-path run touches eight files in two packages:
+
+- `solver.py` allocates the row and fragment,
+  registers datasets, expunges, and commits before handing off to an executor.
+- `climate_ref_core/executor.py` holds `execute_locally`, `_is_system_error`,
+  and the `CondaCommandError` branch.
+- `climate_ref_core/diagnostics.py` holds `ExecutionResult.build_from_output_bundle`,
+  a static factory that writes three JSON files to disk as a side effect of constructing the result.
+- `climate_ref/executor/result_handling.py` copies scratch→results, ingests inside a nested transaction,
+  and toggles the `dirty` flag in three branches.
+- One of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py`
+  re-implements the reattach / commit / mark dance.
+
+The three executors diverge in ways that matter:
+`LocalExecutor` applies a single 6-hour per-task timeout to every diagnostic;
+`CeleryExecutor` enforces no per-task timeout at all;
+`SynchronousExecutor` runs inline.
+Retry classification is scattered across `_is_system_error`, the `CondaCommandError` branch,
+missing-log handling, the per-task timeout path, and pool-shutdown abandonment.
+
+## Executions are not robust to crashes
+
+Failure state is decided from a *live* exception in `execute_locally` and recorded only in the database.
+If a worker is OOM-killed, the coordinator dies mid-drain, or a Celery broker is lost,
+there is no durable record on disk of what the execution was or how far it got.
+The `LocalExecutor` join loop marks abandoned futures retryable from memory;
+once that process is gone, the knowledge is gone.
+An operator inspecting a stuck `Execution` row (`successful IS NULL`)
+cannot tell whether it is still running or died hours ago.
+
+## Results are not self-describing
+
+An execution directory today contains the CMEC output bundle (`output.json`),
+the CMEC diagnostic bundle (`diagnostic.json`), the series values (`series.json`), and a log (`out.log`).
+**Nothing on disk says which diagnostic, provider version, or datasets produced them** —
+that identity lives only in the database, and `reingest` rebuilds it *from* the database.
+This blocks the primary motivation for this RFC: **regression output**.
+To compare a diagnostic's output across versions or runs,
+or to re-ingest results after changing extraction logic,
+the result directory must carry its own identity and outcome.
+It also blocks any future where a directory is copied between deployments.
+
+## There is no seam for HPC
+
+Modelling centres run on SLURM and PBS.
+There is nowhere for a diagnostic to declare it needs 16 GB or 8 hours,
+and nothing a batch adapter could translate into `sbatch --mem --time` or `qsub -l`.
+Each new backend would otherwise re-implement the whole lifecycle inline.
+
+**Primary driver: regression output and crash-robustness.**
+HPC reach and cross-deployment portability are downstream beneficiaries of the same on-disk format,
+not the justification on their own.
 
 # Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
 
-## Module boundary
+## Two verbs: `run` and `ingest`
 
-```mermaid
-classDiagram
-    direction LR
-
-    class ExecutionLifecycle {
-        <<climate_ref.lifecycle>>
-        +submit(execution, definition)
-        +drain(timeout) None
-        +replay_abandoned() list~int~
-        -_FragmentAllocator
-        -_Classifier
-        -_Promoter
-        -_Ingestor
-        -_DirtyRule
-        -_BundleWriter
-    }
-
-    class Transport {
-        <<Protocol>>
-        +name: ClassVar~str~
-        +dispatch(envelope: ExecutionEnvelope) None
-        +poll(block, timeout) Iterator~ExecutionOutcome~
-        +shutdown(timeout) None
-    }
-
-    class InMemoryTransport { tests + SynchronousExecutor }
-    class ProcessPoolTransport { replaces LocalExecutor }
-    class CeleryTransport { climate-ref-celery }
-    class SlurmTransport { future · sbatch --mem --cpus --time }
-    class PbsTransport { future · qsub -l mem,ncpus,walltime }
-    class K8sTransport { future · pod resources + deadline }
-
-    ExecutionLifecycle ..> Transport : dispatch / poll
-    InMemoryTransport ..|> Transport
-    ProcessPoolTransport ..|> Transport
-    CeleryTransport ..|> Transport
-    SlurmTransport ..|> Transport
-    PbsTransport ..|> Transport
-    K8sTransport ..|> Transport
-```
+The lifecycle splits cleanly into two operations that today are tangled together.
 
-## Wire types
+- **`run`**: turn an `ExecutionDefinition` into an output directory on disk —
+  a manifest plus CMEC bundles. A `Transport` drives this (or, in future, an external system does).
+- **`ingest`**: load one output directory into the database. Transport-agnostic, idempotent, replayable.
 
-Picklable value objects only. No DB sessions, `Config`, or CV cross the boundary.
+Every execution method funnels through the same `ingest`.
+Live completion, re-ingestion after a logic change, and (future) cross-deployment import
+are the *same code path* reading the *same on-disk format*.
 
-`ResourceHint` lives in `climate_ref_core` (the `Diagnostic` base class
-declares it, and core cannot import the application package).
-`ExecutionEnvelope`, `Telemetry`, `ExecutionOutcome`, and the `Transport`
-protocol live in `climate_ref.lifecycle` alongside `ExecutionLifecycle`.
+## The execution manifest (`execution.json`)
 
-The default `wall_clock` is **6 h**, matching today's `LocalExecutor`
-per-task timeout, so diagnostics that run for hours without declaring
-`resources` keep their current budget (Celery enforces no limit today, so
-nothing regresses there either). Tightening the default is a separate,
-explicit decision.
+A REF-owned, typed, versioned file written to the execution directory.
+It is the durable record of an execution's identity and outcome.
+CMEC provides building blocks for *provenance* but has no vocabulary for *outcome*,
+so the manifest is a REF schema that embeds CMEC provenance rather than living inside the CMEC output bundle.
 
 ```python
-@attrs.frozen
-class ResourceHint:
-    memory_mb: int = 4096
-    cpu: int = 1
-    wall_clock: timedelta = timedelta(hours=6)   # matches current LocalExecutor budget
-    queue: str | None = None    # transport-specific routing tag
+class ManifestStatus(enum.Enum):
+    SUCCESS     = "success"      # ran; produced a valid, CV-checkable bundle
+    FAILED      = "failed"       # ran; diagnostic logic error — not retryable (give up)
+    RECOVERABLE = "recoverable"  # ran; system/transient error — retryable
 
 @attrs.frozen
-class ExecutionEnvelope:
-    execution_id: int
-    definition: ExecutionDefinition
-    resources: ResourceHint
-    # wall_clock travels in `resources`; the *deadline* is computed by the
-    # transport when the job starts running, not here — a queued SLURM/PBS
-    # job may wait hours before it begins, so anchoring the deadline at
-    # submit time would expire jobs before they start.
+class DatasetRef:
+    instance_id: str             # Dataset.slug — the CMIP6 instance_id; assumed globally unique
+    slug_column: str
+    source_type: str             # SourceDatasetType: cmip6, obs4mips, ...
 
 @attrs.frozen
-class Telemetry:
-    duration: timedelta
-    peak_rss_mb: int | None
-    host: str
+class Outcome:                   # phase 2 — written when the diagnostic ends
+    status: ManifestStatus
     exit_code: int | None
-    transport_meta: Mapping[str, str]   # slurm jobid / k8s pod / celery task_id
+    started_at: datetime
+    finished_at: datetime
+    output_bundle: str | None    # output.json     — CMEC output bundle (CMECOutput)
+    metric_bundle: str | None    # diagnostic.json  — CMEC diagnostic bundle (CMECMetric)
+    series: str | None           # series.json
+    log: str = "out.log"
+    provenance: OutputProvenance | None = None   # CMEC environment / modeldata / obsdata / log
 
 @attrs.frozen
-class ExecutionOutcome:
-    execution_id: int
-    result: ExecutionResult | None       # None ⇒ transport-side abandonment
-    failure: ExecutionFailure | None     # timeout | broker_lost | pool_shutdown
-    telemetry: Telemetry
+class ExecutionManifest:
+    schema_version: int
+    # --- identity: phase 1, written before the diagnostic runs ---
+    diagnostic_slug: str
+    diagnostic_version: int
+    provider_slug: str
+    provider_version: str
+    dataset_hash: str
+    datasets: list[DatasetRef]
+    selectors: dict[str, str]
+    # --- outcome: phase 2 ---
+    outcome: Outcome | None = None   # None ⇒ incomplete (in-flight, or hard-killed before writing)
 ```
 
-`ExecutionResult` becomes pure data.
-The current `build_from_output_bundle` factory is split into a pure
-`ExecutionResult.from_bundle(definition, bundle)` and a worker-side
-`_BundleWriter.write(definition, bundle)` that owns the JSON I/O.
+**Two-phase write.**
+The identity half is the *first* file written to the directory, before the diagnostic runs,
+with `outcome = None`.
+The outcome half is the *last* thing the worker writes, **even on a recoverable failure**.
+Therefore a manifest with `outcome is None` means *incomplete*:
+either still in flight, or hard-killed (OOM, `SIGKILL`, node loss) before it could write its epitaph.
+This single invariant is what makes crashes diagnosable from disk alone.
 
-## Diagnostic-side declaration
+**CMEC alignment.**
+`provenance` reuses the CMEC `OutputProvenance` shape (`environment`, `modeldata`, `obsdata`, `log`),
+so the manifest captures inputs and environment in standard terms.
+`datasets` references inputs by `instance_id` (the CMEC/ESGF dataset identity),
+which is what makes a directory portable: it names its inputs without carrying them.
 
-```python
-class Diagnostic(AbstractDiagnostic):
-    resources: ResourceHint = ResourceHint()        # project-wide default
+## The `Transport` port — liveness, not results
 
-    def resources_for(self, definition: ExecutionDefinition) -> ResourceHint:
-        """Optional per-execution sizing. Default returns self.resources."""
-        return self.resources
+A transport launches an execution and answers one question: *is it alive?*
+It never carries results.
 
+```python
+class Status(enum.Enum):
+    RUNNING = "running"   # still executing
+    EXITED  = "exited"    # ended with a clean exit code
+    GONE    = "gone"      # vanished without a clean exit (OOM, SIGKILL, node loss, timeout-kill)
+
+class Transport(Protocol):
+    name: ClassVar[str]
+    def dispatch(self, envelope: ExecutionEnvelope) -> str: ...        # launch; return durable handle
+    def status(self, handles: Mapping[int, str]) -> Mapping[int, Status]: ...
+    def cancel(self, handle: str) -> None: ...                         # deadline / drain kill
+    def shutdown(self, timeout: float) -> None: ...
+```
 
-# Examples
-class ESMValToolDiagnostic(CommandLineDiagnostic):
-    resources = ResourceHint(memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6))
+`dispatch` returns a durable handle (pid, Celery task id, SLURM job id)
+that the lifecycle persists on the `Execution` row in the same commit,
+so a freshly restarted coordinator can still ask about the job.
 
-class EnsoDiagnostic(Diagnostic):
-    resources = ResourceHint(memory_mb=24000, cpu=8,
-                             wall_clock=timedelta(hours=8), queue="bigmem")
+```python
+@attrs.frozen
+class ResourceHint:
+    memory_mb: int = 4096
+    cpu: int = 1
+    wall_clock: timedelta = timedelta(hours=6)   # matches today's LocalExecutor budget
+    queue: str | None = None                     # transport-specific routing tag
 
-class IlambDiagnostic(Diagnostic):
-    resources = ResourceHint(memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2))
-    def resources_for(self, defn):
-        n = len(defn.datasets.get_cmip6())
-        return attrs.evolve(self.resources, memory_mb=8000 + 500 * n)
+@attrs.frozen
+class ExecutionEnvelope:
+    execution_id: int
+    definition: ExecutionDefinition
+    resources: ResourceHint
+    # wall_clock travels inside `resources`; the deadline is computed by the transport
+    # (or the scheduler) at job *start*, never at submit time — a queued SLURM/PBS job
+    # may wait hours before it begins.
 ```
 
-## Solver dispatch — before and after
+The default `wall_clock` is 6 h to match today's `LocalExecutor` budget,
+so diagnostics that run for hours without declaring `resources` keep their current allowance.
 
-```python
-# Before: ~50 lines of bookkeeping in solver.py:709-757
-#   PLACEHOLDER_FRAGMENT, assign_execution_fragment, attrs.evolve,
-#   register_datasets, expunge, commit, executor.run, … executor.join
+## The lifecycle and the drain loop
 
-# After:
-lifecycle = ExecutionLifecycle(config, db, transport)
+`ExecutionLifecycle` is one deep module that owns submission, the drain loop, retry/dirty, and ingest.
+It is stateless with respect to in-flight work:
+the set of executions to wait on comes from the database (`successful IS NULL`), not from memory.
+This is what makes it re-runnable and crash-safe — `drain` after a restart simply resumes.
 
+```python
 for group, datasets, definition in planned_executions:
     execution = Execution(execution_group=group, dataset_hash=datasets.hash,
                           provider_version=definition.diagnostic.provider.version)
-    lifecycle.submit(execution, definition)
+    lifecycle.submit(execution, definition)   # writes phase-1 manifest, dispatch(), persists handle
 
 lifecycle.drain(timeout=timeout)
 ```
 
-## End-to-end sequence
+`drain` loops over the in-flight set.
+**The transport decides liveness; the manifest decides outcome;
+a present manifest outcome wins over liveness.**
 
-```mermaid
-sequenceDiagram
-    autonumber
-    participant S as Solver
-    participant D as Diagnostic
-    participant L as ExecutionLifecycle
-    participant T as Transport
-    participant W as Worker
-    participant DB
-
-    S->>L: submit(execution, definition)
-    L->>D: resources_for(definition)
-    D-->>L: ResourceHint(...)
-    L->>DB: allocate fragment + register_datasets + expunge + commit
-    L->>T: dispatch(ExecutionEnvelope)
-
-    Note over T,W: sbatch · qsub · pool.submit · celery send · inline
-    Note over T: deadline = job_start + resources.wall_clock (transport-side)
-    T->>W: hand off envelope
-    activate W
-    W->>W: diagnostic.run · _BundleWriter.write · CV.validate · Telemetry
-    W-->>T: ExecutionOutcome
-    deactivate W
-
-    S->>L: drain(timeout)
-    loop until no in-flight executions
-        L->>T: poll(block, timeout)
-        T-->>L: ExecutionOutcome
-        Note over L: _Classifier → SUCCESS | RETRY | GIVE_UP
-        L->>DB: merge · promote artifacts · upsert outputs/scalars/series
-        L->>DB: _DirtyRule.apply · save telemetry · mark_*
-    end
+```python
+for ex in db.in_flight(submitted):                  # successful IS NULL
+    match transport.status({ex.id: ex.handle})[ex.id]:
+        case Status.RUNNING:
+            if past_deadline(ex):
+                transport.cancel(ex.handle)         # ProcessPool only; schedulers self-kill
+            # else: keep waiting
+
+        case Status.EXITED:                          # clean exit — ask the disk what it meant
+            outcome = read_manifest(ex.dir).outcome
+            if outcome is None:                      # exited 0 but wrote no outcome → incomplete
+                mark_retryable(ex)
+            elif outcome.status is SUCCESS:
+                ingest_from_disk(ex.dir)             # CV-validated; same path for every transport
+            elif outcome.status is FAILED:
+                mark_failed(ex)                      # diagnostic error, give up
+            else:                                    # RECOVERABLE
+                mark_retryable(ex)
+
+        case Status.GONE:                            # no clean exit
+            outcome = read_manifest(ex.dir).outcome  # worker may have logged a recoverable fail first
+            if outcome is None:
+                mark_retryable(ex)                   # hard crash: OOM / SIGKILL / node loss
+            else:
+                apply(outcome)                        # manifest wins (e.g. SUCCESS written then node reaped)
 ```
 
-## Retry + dirty rule (one source of truth)
-
-Classification happens in two clearly separated places.
+The decisive robustness rule is the last branch:
+if the worker wrote `SUCCESS` and the node then died before the scheduler reaped it,
+the transport reports `GONE` but the manifest reports success — and the manifest wins.
+Transport liveness only adjudicates the *no-manifest-outcome* case.
 
-**Worker side** — exception → outcome. The worker runs the diagnostic and
-maps the raised exception (or clean return) onto the booleans carried by
-`ExecutionResult` / `ExecutionFailure`. This is where the exception-type
-knowledge lives, because the exception is only ever raised on the worker:
+## Execution states
 
-```python
-# worker-side, inside the run path
-SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)  # → retryable
-NON_RETRYABLE = (CondaCommandError,)                                   # → give up
+```mermaid
+stateDiagram-v2
+    [*] --> Planned: solver creates Execution (successful=NULL)
+    Planned --> Dispatched: transport.dispatch()<br/>phase-1 manifest written
+    Dispatched --> Running: worker starts
+
+    Running --> Running: status=RUNNING, within deadline
+    Running --> Gone: status=RUNNING, past deadline<br/>transport.cancel()
+
+    state poll <<choice>>
+    Running --> poll: status=EXITED
+    Running --> Gone: status=GONE
+    Gone --> poll: read manifest
+
+    poll --> Ingesting: manifest.outcome = SUCCESS<br/>(wins even if GONE)
+    poll --> Failed: manifest.outcome = FAILED<br/>(diagnostic error)
+    poll --> Retryable: manifest.outcome = RECOVERABLE
+    poll --> Retryable: manifest.outcome absent<br/>(incomplete / OOM / SIGKILL)
+
+    Ingesting --> Successful: bundles→DB, CV-validated<br/>successful=True, group.dirty=False
+    Ingesting --> Retryable: ingest error
+
+    Successful --> [*]
+    Failed --> [*]: successful=False, group.dirty=False
+    Retryable --> [*]: successful=False, group.dirty=True<br/>→ new Execution next solve
+
+    note right of Dispatched
+        External/Null transport (future):
+        no status channel — status is
+        derived from manifest presence
+    end note
 ```
 
-This consolidates the exception-classification logic that is **today**
-spread across `_is_system_error` and the separate `CondaCommandError`
-branch in `execute_locally` into one worker-side function.
+A row has three terminal outcomes:
+**Successful** (clean, group marked clean);
+**Failed** (diagnostic error, not retryable, group marked clean);
+**Retryable** (system/transient error or hard crash — the group stays dirty, and the next solve mints a *new*
+`Execution`).
+A retry never re-runs the same row.
+
+## Retry and dirty: two clearly separated decisions
 
-**Coordinator side** — outcome → decision. The policy never inspects
-exception types; it maps the already-classified outcome onto a decision,
-so the same rule applies identically to every transport (including remote
-ones where the exception object never comes back):
+Classification happens in two places, and only two.
+
+**Worker side — exception → outcome.**
+The worker maps the raised exception (or a clean return) onto `ManifestStatus` and writes it to the manifest.
+This is the one place that knows exception *types*, because the exception is only ever raised on the worker.
+It consolidates the logic spread today across `_is_system_error` and the `CondaCommandError` branch:
 
 ```python
-class RetryDecision(enum.Enum):
-    SUCCESS = "success"
-    RETRY   = "retry"     # leaves dirty=True
-    GIVE_UP = "give_up"   # sets dirty=False, marks failed
-
-class DefaultRetryPolicy:
-    def classify(self, outcome: ExecutionOutcome) -> RetryDecision:
-        if outcome.failure is not None:  # timeout | broker_lost | pool_shutdown
-            return RetryDecision.RETRY
-        r = outcome.result
-        if r is None:                    return RetryDecision.RETRY
-        if r.successful:                 return RetryDecision.SUCCESS
-        return RetryDecision.RETRY if r.retryable else RetryDecision.GIVE_UP
+SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)  # → RECOVERABLE
+NON_RETRYABLE = (CondaCommandError,)                                   # → FAILED
 ```
 
-Net effect: exception classification goes from two scattered sites to one
-worker-side function, and the *transport-level* outcomes that today live in
-the per-executor `join` loops (missing log, per-task timeout, pool-shutdown
-abandonment) collapse into the single coordinator policy above.
-
-## Idempotent ingest, telemetry
-
-Ingestion uses `INSERT … ON CONFLICT DO NOTHING` on natural keys
-(`(execution_id, output_type, short_name)` for outputs,
-`(execution_id, dimensions_hash[, index_name])` for metric/series values),
-so `replay_abandoned()` is safe.
-Scratch-to-results copy uses `exist_ok=True`.
-
-`DO NOTHING` (rather than `DO UPDATE`) is correct because every key is
-scoped to `execution_id`, and each solve mints a **new** `Execution` row per
-attempt: a retry produces a fresh `execution_id`, so its values never
-collide with the abandoned attempt's. The conflict clause therefore only
-guards re-ingestion of the *same* `execution_id` during replay — it never
-silently keeps stale values from a previous attempt.
-
-New `Execution` columns: `duration_seconds`, `peak_rss_mb`, `telemetry_meta JSON`.
-No solver code reads these today; they exist so a future adaptive
-`ResourceProvider` is a feature addition rather than a schema migration.
-
-## Test impact
-
-Delete: `_is_system_error` private-import tests,
-subprocess patch chains in `test_providers.py`,
-per-executor reattach tests,
-`mark_execution_failed` mock chains.
-
-Add boundary tests against `ExecutionLifecycle` + `InMemoryTransport`:
-unique fragment per submit;
-end-to-end success → `dirty=False`;
-retryable failure leaves `dirty=True`;
-non-retryable failure → `dirty=False`, `successful=False`;
-re-drain idempotent (no double-insert);
-`wall_clock` enforced uniformly across transports;
-CV mismatch raises `ResultValidationError`;
-`replay_abandoned` returns stranded IDs.
+**Coordinator side — outcome → decision.**
+The drain loop maps the already-classified outcome (plus transport liveness) onto `SUCCESS | retry | give up`.
+It never inspects exception types, so it behaves identically for every transport,
+including remote ones where the exception object never comes back.
+
+The `dirty` flag follows directly:
+`SUCCESS` and `FAILED` clear it; `RECOVERABLE` and incomplete leave it set so the next solve retries.
+User-initiated `dirty = True` resets in `cli/executions.py` (rerun / reset) stay separate by design.
+
+## Ingest from disk
+
+`ingest_from_disk(directory)` reads `execution.json` and the bundles it names, then writes to the database.
+It is the only path that mutates result tables, and it is:
+
+- **Idempotent.** Inserts use `ON CONFLICT DO NOTHING` on natural keys
+  (`(execution_id, output_type, short_name)` for outputs;
+  `(execution_id, dimensions_hash[, index_name])` for diagnostic and series values).
+  Each solve mints a new `execution_id` per attempt, so retries never collide;
+  the conflict clause only guards re-ingestion of the *same* execution during a replay or eager/poll race.
+- **CV-validated.** The CMEC diagnostic and series bundles are validated against the controlled vocabulary
+  at ingest, exactly as today. (This RFC keeps the current `logger.warning` behaviour;
+  making CV a hard failure is explicitly out of scope — see below.)
+- **Portable.** Result directories hold results only, with relative internal paths.
+  Inputs are referenced by `instance_id` and resolved against the importing deployment's dataset index;
+  a directory never carries or hardcodes input data.
+
+This is also what enables the existing `reingest` use case to work *without* the database it was produced from
+(within limits — see portability tiers under Unresolved questions).
+
+## Transports
+
+All current executors are replaced, and the two HPC backends are added, as part of this work.
+Each is a launcher plus a status query — typically under a few hundred lines.
+
+| Transport             | Replaces / adds        | `dispatch`            | `status` source         | deadline enforced by      |
+|-----------------------|------------------------|-----------------------|-------------------------|---------------------------|
+| `InMemoryTransport`   | `SynchronousExecutor`  | run inline            | always `EXITED`         | n/a (inline)              |
+| `ProcessPoolTransport`| `LocalExecutor`        | `pool.submit`         | `future` state          | coordinator (`cancel`)    |
+| `CeleryTransport`     | `CeleryExecutor`       | `app.send_task`       | `AsyncResult`           | broker (`task_time_limit`)|
+| `SlurmTransport`      | new                    | `sbatch --mem --time` | `sacct` / `squeue`      | scheduler (`--time`)      |
+| `PbsTransport`        | new                    | `qsub -l`             | `qstat`                 | scheduler (walltime)      |
+
+`ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side
+(via `cancel` → `future.cancel`), because there is no external scheduler to do it;
+this preserves today's `LocalExecutor` per-task timeout.
+
+**Celery callback demotes to a latency optimisation.**
+Today `CeleryExecutor` attaches `link` / `link_error` callbacks (`handle_result` / `handle_failure`)
+that ingest worker-side.
+Under this design the worker writes its bundle and manifest to disk like every other transport,
+and the coordinator (or a resident orchestrator running `drain`) ingests from disk.
+The `link` callback may be kept as an *eager-ingest hook* — "ingest this directory now" rather than at the next
+poll — but it is no longer load-bearing: if the callback is lost, the disk scan recovers the result.
+This dissolves the previous push-vs-pull asymmetry; both modes run the same `ingest_from_disk`.
+
+## Migration plan
+
+The cutover replaces the seam the solver calls, so it lands in stages, each independently shippable:
+
+1. **Manifest, additively.** Current executors start writing `execution.json` (both phases).
+   No behaviour change; result directories become self-describing immediately —
+   this alone unlocks regression comparison.
+2. **Seam + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`,
+   and `InMemoryTransport` + `ProcessPoolTransport`; route the solver through the lifecycle.
+3. **Celery.** Port to `CeleryTransport`; convert `link`/`link_error` to the optional eager-ingest hook.
+4. **HPC.** Add `SlurmTransport` and `PbsTransport`.
+5. **Cleanup.** Delete the `Executor` protocol and the three old executors once nothing imports them.
+
+`import_executor_cls` (which resolves an executor from a dotted path in `Config`) gains a deprecation cycle:
+known names map to the new transports, custom dotted paths warn, and a config-migration note ships with step 5.
 
 # Drawbacks
-
-- **Celery loses fire-and-forget ingestion — the biggest trade-off.**
-  Today `CeleryExecutor` attaches `link` / `link_error` callbacks
-  (`handle_result` / `handle_failure`) so a worker ingests its own result
-  with no live coordinator; the submitting process can exit immediately.
-  The pull model (`Transport.poll` feeding `drain`) couples ingestion to a
-  coordinator that stays alive for the whole batch. This is a real
-  regression for the distributed case and is accepted deliberately: it buys
-  one ingestion path and uniform retry/dirty handling across transports,
-  and `replay_abandoned` (backed by `CeleryTransport` persisting task IDs
-  alongside execution IDs) recovers a coordinator crash mid-drain. If
-  detached submission turns out to be a hard requirement, a worker-side
-  `IngestSink` callback can be added later without changing the seam — but
-  the draft does **not** preserve it, and reviewers should weigh that.
-- **Migration is wide and the seam swap is atomic.** Staggering applies to
-  *adding* transports later, not to the cutover: `climate-ref-core` (wire
-  types, `ResourceHint` on `Diagnostic`), `climate-ref`
-  (`ExecutionLifecycle`, the transports), `climate-ref-celery`, and an
-  Alembic migration all land together, because the seam replaces the
-  `Executor` protocol the solver calls. Proposed landing order to bound
-  risk: (1) add `ResourceHint` + telemetry columns (additive, no behaviour
-  change); (2) introduce `ExecutionLifecycle` + `InMemoryTransport` +
-  `ProcessPoolTransport` behind the existing solver entry point with the old
-  executors still present; (3) port Celery; (4) delete the old executors.
-- **Deleting the `Executor` protocol is a breaking public change.**
-  `import_executor_cls` resolves an executor from a dotted path in `Config`,
-  so the executor class is a documented extension point and any downstream
-  custom executor implements it. The cutover must ship a deprecation cycle:
-  keep `import_executor_cls` resolving known names to the new transports,
-  warn on custom FQNs, and provide a config-migration note. This is not yet
-  spelled out and is a precondition for merge.
-- **CV becomes hard-fail.** Today's silent `logger.warning` becomes a
-  raise. Intentional, but needs a one-cycle deprecation window where
-  the violation is `ERROR` but not raised — and it ships in the same wide
-  migration as the seam swap, so it must be feature-flagged to keep the two
-  behaviour changes independently bisectable.
-- **One fat class (~400–500 LOC).** Intentional depth, but reviewers
-  should expect a large file. (LOC is an estimate, not a target.)
-- **Resource hints can be wrong.** SLURM will OOM-kill a job whose
-  declared memory is too low. Mitigated by a default that preserves current
-  behaviour (4 GB / 1 CPU / 6 h), by `ProcessPoolTransport` ignoring
-  everything except `wall_clock`, and by telemetry capture making the first
-  failed run actionable.
+[drawbacks]: #drawbacks
+
+- **The migration is wide.** The seam the solver calls is replaced, touching `climate-ref-core`,
+  `climate-ref`, and `climate-ref-celery`. The staging above bounds the risk but does not remove it.
+- **One deep module.** `ExecutionLifecycle` is intentionally a large, central class.
+  Reviewers should expect depth concentrated in one file rather than spread thinly.
+- **A resident coordinator is needed for steady-state ingest.**
+  With the callback demoted, ingest normally runs in the process that calls `drain`
+  (the CLI, or a resident orchestrator). The eager-ingest hook keeps Celery's low latency,
+  but the fire-and-forget-with-no-coordinator mode is gone unless the hook is retained.
+- **Manifest is a new on-disk contract.** Once written, `execution.json` must be versioned and migrated
+  carefully; a schema change is a data-format change.
+- **Resource hints can be wrong.** A SLURM job whose declared memory is too low is OOM-killed.
+  Mitigated by a default that preserves current behaviour (4 GB / 1 CPU / 6 h),
+  by `ProcessPoolTransport` honouring only `wall_clock`,
+  and by the manifest making the first failed run diagnosable.
 
 # Rationale and alternatives
-
-Three designs were considered.
-The chosen interface is a deliberate hybrid.
-
-```mermaid
-quadrantChart
-    title Design trade-off space
-    x-axis "Surface area (concepts)" --> "Larger"
-    y-axis "Defaults baked in" --> "More"
-    quadrant-1 "Heavy & opinionated"
-    quadrant-2 "Lean & opinionated"
-    quadrant-3 "Lean & open"
-    quadrant-4 "Heavy & open"
-    "A - Minimal": [0.18, 0.55]
-    "B - Maximally flexible": [0.92, 0.18]
-    "C - Common-case optimised": [0.38, 0.92]
-    "Hybrid (chosen)": [0.42, 0.7]
-```
-
-| Dimension             | A — Minimal | B — Maximal | C — Common-case | **Hybrid** |
-| --------------------- | :---------: | :---------: | :-------------: | :--------: |
-| Public surface        |      1      |      5      |        2        |     2      |
-| Defaults baked in     |      3      |      1      |        5        |     4      |
-| Bend without editing  |      3      |      5      |        2        |     3      |
-| Migration churn       |      4      |      5      |        2        |     3      |
-| Resource-hint support |      0      |      5      |        0        |     5      |
-| Speculation tax       |      0      |      3      |        0        |     1      |
-
-- **A — Minimal**: 2 methods, 1 port, everything else hidden.
-  No place for resource hints or per-provider retry without later kwarg growth.
-- **B — Maximal**: 5 ports (Transport, ArtifactStore, RetryPolicy,
-  IngestSink, FragmentAllocator) + 7 hooks + entry-point plugin registry.
-  Earned the wire-type split and `ResourceHint`; everything else is
-  speculation.
-- **C — Common-case**: one class, defaults sourced from `Config`,
-  solver call site collapses to one line.
-  A `ResultSink` callback that Celery silently ignores is an asymmetry
-  that will trip someone, and there is still no place for resource hints.
-
-**Chosen hybrid**: C's façade (one class, hot/cold method split) +
-A's transport contract (`dispatch(envelope)` + `poll() -> Iterator[Outcome]`,
-no result callback) + A's `BundleWriter` separation +
-B's `ExecutionEnvelope`/`Telemetry` wire types.
-Dropping the result callback is what costs Celery its fire-and-forget
-ingestion (see Drawbacks); it is chosen for one uniform pull path, and a
-worker-side `IngestSink` remains a non-breaking future addition.
-Deferred: `ArtifactStore`, `IngestSink`, `LifecycleHooks`,
-`FragmentAllocator` as a port, plugin registry.
-The shallow `Executor` Protocol and the three concrete executors
-are deleted (with a deprecation cycle for `import_executor_cls`; see
-Drawbacks).
-
-**Impact of not doing this**: each new transport reimplements its own
-dispatch / poll / timeout wiring and re-derives retry and dirty handling
-inline (the shared promotion/ingest helpers in `result_handling.py` already
-exist, but nothing forces a new transport to route through them);
-resource hints retrofit later through a new wire format (strictly larger
-change); per-task timeout, CV validation, dirty-flag, and exception
-classification stay scattered.
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+The central choice is **where the source of truth lives**.
+This design puts it on disk (the manifest), with the transport reduced to a liveness oracle.
+The main alternatives:
+
+- **Transport returns results (`poll() -> Iterator[Outcome]`).**
+  The earlier draft of this RFC.
+  It forces every transport to materialise a full result object, which is natural for a process pool
+  and a Celery result backend but *false* for SLURM, where the job writes to disk and exits with a code.
+  It also keeps the lifecycle tail running in two places (worker-side for Celery callbacks,
+  coordinator-side for the pool). Routing all completion through an on-disk manifest removes both problems.
+
+- **Keep per-executor lifecycles, share helpers only.**
+  This is roughly today's state: `result_handling.py` already shares promotion and ingest.
+  But nothing forces a new backend through the shared path, and crash-robustness is impossible
+  without a durable on-disk record. A new SLURM executor would re-derive retry and dirty handling inline.
+
+- **Database-only completion (no manifest).**
+  Record outcome only in the database, as today.
+  This cannot survive a coordinator crash mid-run, cannot make a result directory self-describing,
+  and cannot support regression comparison or replay — the primary drivers.
+
+**Impact of not doing this.**
+Each new backend re-implements lifecycle wiring and re-derives retry/dirty inline;
+results stay anonymous on disk, so regression comparison and replay stay impossible;
+crashes remain undiagnosable from durable state.
 
 # Prior art
-
-- **Dask `distributed`** — `resources=` annotations on submitted tasks
-  inspire `ResourceHint`.
-- **Snakemake / Nextflow** — first-class `resources:` directives
-  translate transparently into SLURM / PBS / K8s. Same mental model at
-  the diagnostic level.
-- **Airflow** — executor / operator split; `BaseExecutor.execute_async`
-  - `sync` is essentially `Transport.dispatch` + `Transport.poll`.
-- **Celery** — `task_time_limit` + queue routing. `ResourceHint.queue`
-  maps onto Celery queues; `wall_clock` onto `task_time_limit` /
-  `task_soft_time_limit`. Today's `CeleryExecutor` uses neither.
-- **Rust RFC process** — document shape inherited via this repo's
-  template.
+[prior-art]: #prior-art
+
+- **CMEC / EMDS.** The output and diagnostic bundles, provenance, and dimensions are CMEC concepts;
+  the manifest extends CMEC provenance rather than inventing a parallel vocabulary.
+- **Snakemake / Nextflow.** First-class `resources:` directives that translate into SLURM / PBS / cloud.
+  Same mental model at the diagnostic level (`ResourceHint`).
+  Both also treat the working directory as the durable record of a job — the manifest follows this.
+- **Dask `distributed`.** `resources=` annotations on submitted tasks inspire `ResourceHint`.
+- **Airflow.** Executor / scheduler split; `BaseExecutor` exposes async submit plus a sync/heartbeat that
+  reconciles task state from the database — essentially `dispatch` plus `status`-driven `drain`.
+- **Celery.** `task_time_limit` and queue routing; `ResourceHint.queue` maps onto Celery queues and
+  `wall_clock` onto `task_time_limit`. Today's `CeleryExecutor` uses neither.
+- **Make / build systems.** A target is rebuilt unless its output exists and is current — the same logic as
+  reading a manifest outcome from disk before deciding to re-run.
 
 # Unresolved questions
+[unresolved-questions]: #unresolved-questions
 
-To resolve through this RFC:
+To resolve through the RFC process:
 
-- Should `ResourceHint` include `gpu: int` / `io_intensive: bool`?
-  Recommended default: add when a concrete adapter needs them.
-- Where does the CV come from in tests? `PermissiveCV()` fixture
-  vs. project CV baked into a constant.
-- Is `replay_abandoned` automatic on `__init__` or explicit from the CLI?
-  Draft assumes explicit.
-- Is detached (coordinator-free) submission a hard requirement for Celery
-  deployments? If yes, a worker-side `IngestSink` ships with the cutover
-  rather than as a deferred addition.
+- The exact `ExecutionManifest` field set and `schema_version` stability guarantees.
+- Whether the Celery eager-ingest hook (`link`) is retained or Celery polls purely via `status`.
+- Where the controlled vocabulary comes from in tests (a permissive fixture vs. the project CV).
 
 To resolve through implementation:
 
-- Exact upsert unique constraints + Alembic migration.
-- Deprecation path for `import_executor_cls` / the executor FQN config key
-  (name → transport mapping, warning on custom executors, migration note).
-- `CeleryTransport.poll` semantics (per-task `AsyncResult.get` vs batch
-  inspection).
-- `peak_rss_mb` capture across macOS / Linux (`getrusage` unit difference).
-
-Out of scope:
-
-- Adaptive resource provider (telemetry columns land here; the
-  provider is a follow-up RFC).
-- GPU scheduling, multi-tenant queues.
-- Replacing SLURM / PBS / Celery as schedulers.
+- Exact upsert unique constraints and any Alembic migration
+  (`output_fragment`, `provider_version`, and `diagnostic_version` already exist).
+- `sacct` / `squeue` / `qstat` polling cadence and output parsing for the HPC transports.
+- The deprecation mechanics for `import_executor_cls` and the executor dotted-path config key.
+
+Out of scope for this RFC (addressed independently later):
+
+- **K8s transport.**
+- **External / "null" execution** — a directory produced entirely outside the REF, then ingested.
+  The format already supports it; the operational contract does not yet.
+- **Cross-deployment portability beyond shared datasets.**
+  Tier 2 (the importing deployment already indexes the same ESGF datasets, resolved by `instance_id`)
+  is enabled by the manifest's `datasets` field.
+  Tier 3 (datasets *not* present locally) needs dataset support for remote / not-yet-local files
+  and is a separate RFC. The manifest carries the cross-link information (`instance_id`) but not the files.
+- **External-execution version policy.** The default is ingest-as-is:
+  a result lands under its declared `diagnostic_version`, surfaces only if it equals the local
+  `promoted_version`, is CV-validated, and is never rejected on version. Hardening is deferred.
+- **Adaptive resource provider and per-execution telemetry columns.**
+- **Making CV validation a hard failure.**
 
 # Future possibilities
-
-Each item below is a self-contained follow-up enabled by this RFC:
-
-- **SLURM / PBS transports** — thin adapters: `dispatch` builds a job
-  script from `envelope.resources` and submits it, `poll` queries
-  `squeue` / `qstat` and computes the deadline from job start time.
-- **Adaptive `ResourceProvider`** — reads `Execution.peak_rss_mb` over
-  a rolling window, suggests memory hint at p95 × 1.2.
-- **Per-provider retry policies** — single `RetryPolicy` becomes
-  `Mapping[str, RetryPolicy]` keyed on provider slug when concrete
-  demand arrives.
-- **Streaming partial outcomes** — `Transport.poll` already incremental;
-  add a `PartialOutcome` event when a UI consumer appears.
-- **S3 / HTTP artifact store** — extract `_Promoter` into an
-  `ArtifactStore` port when a non-local target lands.
-- **Pluggable ingest sinks** — Prometheus, audit log, S3 mirror as
-  ordered `IngestSink` observers when a second sink materialises.
-- **Cross-restart recovery** — extend `replay_abandoned` with
-  `Transport.lookup(transport_meta)` so a SLURM job finished while the
-  coordinator was down is picked up.
-
-None of these are reasons to accept this RFC on their own;
-they show the seam is the right shape for the directions the project is
-plausibly heading, without pre-baking any of them.
+[future-possibilities]: #future-possibilities
+
+- **K8s transport** — a pod per execution with `resources` and an active deadline; `status` from the pod phase.
+- **External / null transport** — the lowest-friction modelling-centre adoption path:
+  a centre runs the diagnostic in its own pipeline, writes `execution.json` + bundles to the expected layout,
+  and `ref ingest <dir>` loads it. The manifest becomes a public, versioned contract.
+- **Remote-file datasets (portability tier 3)** — resolve `instance_id` references whose files are not local,
+  populated from ESGF queries or from imported result bundles, so a fresh database can be bootstrapped.
+- **Adaptive `ResourceProvider`** — capture per-execution telemetry (peak RSS, duration) into the manifest
+  and the database, then suggest `ResourceHint` memory at, say, p95 × 1.2 over a rolling window.
+- **Cross-restart recovery** — already largely free: because `drain` reconstructs the in-flight set from the
+  database and `status` queries durable scheduler state, a coordinator that was down while a SLURM job finished
+  picks it up on the next `drain`.
+- **Per-provider retry policies** — the single worker-side classifier becomes a per-provider mapping
+  when concrete demand appears.
+
+None of these justify the RFC on their own;
+they show the manifest-plus-liveness seam is shaped for where the project is heading
+without pre-building any of them.

From 11f951060f14dd1b046be87bef76bba820c52d15 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 10:20:49 +1000
Subject: [PATCH 09/12] rfc(execution-lifecycle): add drop-dead deadline to the
 manifest

The manifest now declares when an execution must complete by.

- Move started_at into phase-1 (identity) and add an absolute deadline
  (= started_at + wall_clock), stamped by the worker at job start.
- Deadline is anchored at start, never submit, so a job that waited in a
  SLURM queue is judged from when it actually began.
- Drain reads the deadline from the manifest; operators (or a reconciling
  coordinator with a weak transport) can flag an overdue execution
  (now > deadline and outcome absent) without querying the scheduler.
- Thread the field through the two-phase prose, envelope comment, drain
  loop, and state diagram.
---
 text/0003-execution-lifecycle.md | 103 +++++++++++++++----------------
 1 file changed, 51 insertions(+), 52 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index 4c25eab..a832dcb 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -1,16 +1,14 @@
 - Feature Name: `execution_lifecycle`
 - Start Date: 2026-05-12
-- RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3)
+- RFC PR: [Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3)
 
 # Summary
-[summary]: #summary
 
 Replace the shallow `Executor` protocol and its three divergent implementations
 (`SynchronousExecutor`, `LocalExecutor`, `CeleryExecutor`)
 with one execution lifecycle built on two ideas.
 
-1. A self-describing, on-disk **execution manifest** (`execution.json`)
-   that records — in CMEC-compatible terms — what an execution *is* and how it *ended*.
+1. A self-describing, on-disk **execution manifest** (`execution.json`) that records what an execution *is* and how it *ended*.
 2. A thin **`Transport`** port that only *launches* work and reports *liveness*
    (`RUNNING | EXITED | GONE`).
 
@@ -24,12 +22,10 @@ makes diagnostic output self-describing so it can be compared across runs and ve
 and reduces every backend to a launcher plus a status query.
 
 # Motivation
-[motivation]: #motivation
 
-The REF needs **one robust way to run diagnostics** that works across a range of deployments —
+The REF needs **one robust way to run diagnostics** that works across a range of deployments -
 a laptop, a Celery cluster, and an HPC batch scheduler at a modelling centre.
-"Robust" has a concrete meaning here:
-an execution survives coordinator and worker crashes cleanly,
+"Robust" has a concrete meaning here: an execution survives coordinator and worker crashes cleanly,
 and an operator can always see *what a run did and why it failed* from durable state, not from a live process.
 
 Today none of that holds, for four reasons.
@@ -44,7 +40,7 @@ A single happy-path run touches eight files in two packages:
   and the `CondaCommandError` branch.
 - `climate_ref_core/diagnostics.py` holds `ExecutionResult.build_from_output_bundle`,
   a static factory that writes three JSON files to disk as a side effect of constructing the result.
-- `climate_ref/executor/result_handling.py` copies scratch→results, ingests inside a nested transaction,
+- `climate_ref/executor/result_handling.py` copies scratch->results, ingests inside a nested transaction,
   and toggles the `dirty` flag in three branches.
 - One of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py`
   re-implements the reattach / commit / mark dance.
@@ -80,17 +76,11 @@ It also blocks any future where a directory is copied between deployments.
 
 ## There is no seam for HPC
 
-Modelling centres run on SLURM and PBS.
+Modelling centres often use SLURM and PBS for task scheduling.
 There is nowhere for a diagnostic to declare it needs 16 GB or 8 hours,
 and nothing a batch adapter could translate into `sbatch --mem --time` or `qsub -l`.
-Each new backend would otherwise re-implement the whole lifecycle inline.
-
-**Primary driver: regression output and crash-robustness.**
-HPC reach and cross-deployment portability are downstream beneficiaries of the same on-disk format,
-not the justification on their own.
 
 # Reference-level explanation
-[reference-level-explanation]: #reference-level-explanation
 
 ## Two verbs: `run` and `ingest`
 
@@ -104,6 +94,9 @@ Every execution method funnels through the same `ingest`.
 Live completion, re-ingestion after a logic change, and (future) cross-deployment import
 are the *same code path* reading the *same on-disk format*.
 
+The concept of `ingest` isn't new in the REF.
+Datasets are currently ingested to extract information from the dataset and put it in the database.
+
 ## The execution manifest (`execution.json`)
 
 A REF-owned, typed, versioned file written to the execution directory.
@@ -127,7 +120,6 @@ class DatasetRef:
 class Outcome:                   # phase 2 — written when the diagnostic ends
     status: ManifestStatus
     exit_code: int | None
-    started_at: datetime
     finished_at: datetime
     output_bundle: str | None    # output.json     — CMEC output bundle (CMECOutput)
     metric_bundle: str | None    # diagnostic.json  — CMEC diagnostic bundle (CMECMetric)
@@ -138,7 +130,7 @@ class Outcome:                   # phase 2 — written when the diagnostic ends
 @attrs.frozen
 class ExecutionManifest:
     schema_version: int
-    # --- identity: phase 1, written before the diagnostic runs ---
+    # --- identity: phase 1, written when the run starts ---
     diagnostic_slug: str
     diagnostic_version: int
     provider_slug: str
@@ -146,28 +138,38 @@ class ExecutionManifest:
     dataset_hash: str
     datasets: list[DatasetRef]
     selectors: dict[str, str]
+    started_at: datetime             # when the worker began running the diagnostic
+    deadline: datetime               # drop-dead: complete-by = started_at + resources.wall_clock
     # --- outcome: phase 2 ---
     outcome: Outcome | None = None   # None ⇒ incomplete (in-flight, or hard-killed before writing)
 ```
 
 **Two-phase write.**
-The identity half is the *first* file written to the directory, before the diagnostic runs,
-with `outcome = None`.
+The identity half is the *first* file written to the directory, the moment the worker starts —
+before the diagnostic runs, with `outcome = None`.
+It stamps `started_at` and `deadline` (the drop-dead time, `started_at + wall_clock`),
+so the file itself declares when the execution must be complete by.
 The outcome half is the *last* thing the worker writes, **even on a recoverable failure**.
 Therefore a manifest with `outcome is None` means *incomplete*:
 either still in flight, or hard-killed (OOM, `SIGKILL`, node loss) before it could write its epitaph.
 This single invariant is what makes crashes diagnosable from disk alone.
 
+**Drop-dead deadline.**
+Because `deadline` is recorded on disk, an operator — or a reconciling coordinator with a weak transport —
+can flag an overdue execution (`now > deadline` and `outcome is None`) without querying the scheduler.
+It is anchored at `started_at`, never at submit time, so a job that waited hours in a SLURM queue
+is judged from when it actually began.
+
 **CMEC alignment.**
 `provenance` reuses the CMEC `OutputProvenance` shape (`environment`, `modeldata`, `obsdata`, `log`),
 so the manifest captures inputs and environment in standard terms.
 `datasets` references inputs by `instance_id` (the CMEC/ESGF dataset identity),
 which is what makes a directory portable: it names its inputs without carrying them.
 
-## The `Transport` port — liveness, not results
+## The `Transport` port
 
 A transport launches an execution and answers one question: *is it alive?*
-It never carries results.
+It never carries results as that greatly simplifies the required functionality.
 
 ```python
 class Status(enum.Enum):
@@ -185,7 +187,7 @@ class Transport(Protocol):
 
 `dispatch` returns a durable handle (pid, Celery task id, SLURM job id)
 that the lifecycle persists on the `Execution` row in the same commit,
-so a freshly restarted coordinator can still ask about the job.
+so a freshly restarted coordinator can still ask about the job that may be in flight.
 
 ```python
 @attrs.frozen
@@ -200,9 +202,9 @@ class ExecutionEnvelope:
     execution_id: int
     definition: ExecutionDefinition
     resources: ResourceHint
-    # wall_clock travels inside `resources`; the deadline is computed by the transport
-    # (or the scheduler) at job *start*, never at submit time — a queued SLURM/PBS job
-    # may wait hours before it begins.
+    # wall_clock travels inside `resources`; the worker stamps the absolute deadline into the
+    # manifest at job *start* (started_at + wall_clock), never at submit — a queued SLURM/PBS
+    # job may wait hours before it begins.
 ```
 
 The default `wall_clock` is 6 h to match today's `LocalExecutor` budget,
@@ -210,7 +212,7 @@ so diagnostics that run for hours without declaring `resources` keep their curre
 
 ## The lifecycle and the drain loop
 
-`ExecutionLifecycle` is one deep module that owns submission, the drain loop, retry/dirty, and ingest.
+`ExecutionLifecycle` becomes a deep module that owns submission, the drain loop, retry/dirty, and ingest.
 It is stateless with respect to in-flight work:
 the set of executions to wait on comes from the database (`successful IS NULL`), not from memory.
 This is what makes it re-runnable and crash-safe — `drain` after a restart simply resumes.
@@ -219,7 +221,7 @@ This is what makes it re-runnable and crash-safe — `drain` after a restart sim
 for group, datasets, definition in planned_executions:
     execution = Execution(execution_group=group, dataset_hash=datasets.hash,
                           provider_version=definition.diagnostic.provider.version)
-    lifecycle.submit(execution, definition)   # writes phase-1 manifest, dispatch(), persists handle
+    lifecycle.submit(execution, definition)   # dispatch(), persist handle; worker writes phase-1 at start
 
 lifecycle.drain(timeout=timeout)
 ```
@@ -232,13 +234,13 @@ a present manifest outcome wins over liveness.**
 for ex in db.in_flight(submitted):                  # successful IS NULL
     match transport.status({ex.id: ex.handle})[ex.id]:
         case Status.RUNNING:
-            if past_deadline(ex):
-                transport.cancel(ex.handle)         # ProcessPool only; schedulers self-kill
+            if now > read_manifest(ex.dir).deadline:   # drop-dead from phase-1 manifest
+                transport.cancel(ex.handle)            # ProcessPool only; schedulers self-kill
             # else: keep waiting
 
         case Status.EXITED:                          # clean exit — ask the disk what it meant
             outcome = read_manifest(ex.dir).outcome
-            if outcome is None:                      # exited 0 but wrote no outcome → incomplete
+            if outcome is None:                      # exited 0 but wrote no outcome -> incomplete
                 mark_retryable(ex)
             elif outcome.status is SUCCESS:
                 ingest_from_disk(ex.dir)             # CV-validated; same path for every transport
@@ -265,8 +267,8 @@ Transport liveness only adjudicates the *no-manifest-outcome* case.
 ```mermaid
 stateDiagram-v2
     [*] --> Planned: solver creates Execution (successful=NULL)
-    Planned --> Dispatched: transport.dispatch()<br/>phase-1 manifest written
-    Dispatched --> Running: worker starts
+    Planned --> Dispatched: transport.dispatch(), persist handle
+    Dispatched --> Running: worker starts, writes phase-1<br/>(started_at, deadline)
 
     Running --> Running: status=RUNNING, within deadline
     Running --> Gone: status=RUNNING, past deadline<br/>transport.cancel()
@@ -281,12 +283,12 @@ stateDiagram-v2
     poll --> Retryable: manifest.outcome = RECOVERABLE
     poll --> Retryable: manifest.outcome absent<br/>(incomplete / OOM / SIGKILL)
 
-    Ingesting --> Successful: bundles→DB, CV-validated<br/>successful=True, group.dirty=False
+    Ingesting --> Successful: bundles->DB, CV-validated<br/>successful=True, group.dirty=False
     Ingesting --> Retryable: ingest error
 
     Successful --> [*]
     Failed --> [*]: successful=False, group.dirty=False
-    Retryable --> [*]: successful=False, group.dirty=True<br/>→ new Execution next solve
+    Retryable --> [*]: successful=False, group.dirty=True<br/>-> new Execution next solve
 
     note right of Dispatched
         External/Null transport (future):
@@ -302,27 +304,30 @@ A row has three terminal outcomes:
 `Execution`).
 A retry never re-runs the same row.
 
-## Retry and dirty: two clearly separated decisions
+## Execution Status
 
-Classification happens in two places, and only two.
+Classification of the status of an execution happens in two places.
 
-**Worker side — exception → outcome.**
+**Worker side — exception -> outcome.**
 The worker maps the raised exception (or a clean return) onto `ManifestStatus` and writes it to the manifest.
 This is the one place that knows exception *types*, because the exception is only ever raised on the worker.
 It consolidates the logic spread today across `_is_system_error` and the `CondaCommandError` branch:
 
 ```python
-SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)  # → RECOVERABLE
-NON_RETRYABLE = (CondaCommandError,)                                   # → FAILED
+SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt)  # -> RECOVERABLE
+NON_RETRYABLE = (CondaCommandError,)                                   # -> FAILED
 ```
 
-**Coordinator side — outcome → decision.**
-The drain loop maps the already-classified outcome (plus transport liveness) onto `SUCCESS | retry | give up`.
+**Coordinator side — outcome -> decision.**
+The drain loop maps the outcome in the manifest (plus transport liveness) onto `SUCCESS | retry | give up`.
 It never inspects exception types, so it behaves identically for every transport,
 including remote ones where the exception object never comes back.
 
-The `dirty` flag follows directly:
-`SUCCESS` and `FAILED` clear it; `RECOVERABLE` and incomplete leave it set so the next solve retries.
+The `dirty` flag for an execution group is then updated:
+
+- `SUCCESS` and `FAILED` clear the dirty state.
+- `RECOVERABLE` and incomplete leave it set so the next solve retries the execution.
+
 User-initiated `dirty = True` resets in `cli/executions.py` (rerun / reset) stay separate by design.
 
 ## Ingest from disk
@@ -336,8 +341,7 @@ It is the only path that mutates result tables, and it is:
   Each solve mints a new `execution_id` per attempt, so retries never collide;
   the conflict clause only guards re-ingestion of the *same* execution during a replay or eager/poll race.
 - **CV-validated.** The CMEC diagnostic and series bundles are validated against the controlled vocabulary
-  at ingest, exactly as today. (This RFC keeps the current `logger.warning` behaviour;
-  making CV a hard failure is explicitly out of scope — see below.)
+  at ingest, exactly as today. Validations errors result in warnings rather than hard failures for now.
 - **Portable.** Result directories hold results only, with relative internal paths.
   Inputs are referenced by `instance_id` and resolved against the importing deployment's dataset index;
   a directory never carries or hardcodes input data.
@@ -359,7 +363,7 @@ Each is a launcher plus a status query — typically under a few hundred lines.
 | `PbsTransport`        | new                    | `qsub -l`             | `qstat`                 | scheduler (walltime)      |
 
 `ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side
-(via `cancel` → `future.cancel`), because there is no external scheduler to do it;
+(via `cancel` -> `future.cancel`), because there is no external scheduler to do it;
 this preserves today's `LocalExecutor` per-task timeout.
 
 **Celery callback demotes to a latency optimisation.**
@@ -388,7 +392,6 @@ The cutover replaces the seam the solver calls, so it lands in stages, each inde
 known names map to the new transports, custom dotted paths warn, and a config-migration note ships with step 5.
 
 # Drawbacks
-[drawbacks]: #drawbacks
 
 - **The migration is wide.** The seam the solver calls is replaced, touching `climate-ref-core`,
   `climate-ref`, and `climate-ref-celery`. The staging above bounds the risk but does not remove it.
@@ -406,7 +409,6 @@ known names map to the new transports, custom dotted paths warn, and a config-mi
   and by the manifest making the first failed run diagnosable.
 
 # Rationale and alternatives
-[rationale-and-alternatives]: #rationale-and-alternatives
 
 The central choice is **where the source of truth lives**.
 This design puts it on disk (the manifest), with the transport reduced to a liveness oracle.
@@ -435,7 +437,6 @@ results stay anonymous on disk, so regression comparison and replay stay impossi
 crashes remain undiagnosable from durable state.
 
 # Prior art
-[prior-art]: #prior-art
 
 - **CMEC / EMDS.** The output and diagnostic bundles, provenance, and dimensions are CMEC concepts;
   the manifest extends CMEC provenance rather than inventing a parallel vocabulary.
@@ -451,7 +452,6 @@ crashes remain undiagnosable from durable state.
   reading a manifest outcome from disk before deciding to re-run.
 
 # Unresolved questions
-[unresolved-questions]: #unresolved-questions
 
 To resolve through the RFC process:
 
@@ -483,7 +483,6 @@ Out of scope for this RFC (addressed independently later):
 - **Making CV validation a hard failure.**
 
 # Future possibilities
-[future-possibilities]: #future-possibilities
 
 - **K8s transport** — a pod per execution with `resources` and an active deadline; `status` from the pod phase.
 - **External / null transport** — the lowest-friction modelling-centre adoption path:

From d2336a4567086dfc4352ca3788f7d3a1c68904c7 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 10:37:01 +1000
Subject: [PATCH 10/12] rfc(execution-lifecycle): one HpcTransport for Slurm
 and PBS

The existing HPCExecutor already handles both schedulers via parsl
(SlurmProvider / SmartPBSProvider behind one HighThroughputExecutor), so
the seam needs one HpcTransport, not separate Slurm/Pbs transports.

- Collapse the two HPC table rows into one HpcTransport (parsl-backed,
  scheduler by config); fix the dispatch/status columns (parsl, not raw
  sbatch/qsub/sacct/qstat).
- Note the parsl pilot model: scheduler walltime bounds the block, not the
  individual execution; per-execution wall_clock is enforced in the pilot.
- Migration step 4 becomes 'port HPCExecutor', not 'add new transports'.
- Unresolved question reframed to parsl future/pilot state mapping.
---
 text/0003-execution-lifecycle.md | 56 +++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 23 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index a832dcb..5fd1146 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -212,33 +212,37 @@ so diagnostics that run for hours without declaring `resources` keep their curre
 
 ## The lifecycle and the drain loop
 
-`ExecutionLifecycle` becomes a deep module that owns submission, the drain loop, retry/dirty, and ingest.
+`ExecutionLifecycle` owns the submission of new executions, the drain loop, retry/dirty, and ingest.
 It is stateless with respect to in-flight work:
 the set of executions to wait on comes from the database (`successful IS NULL`), not from memory.
-This is what makes it re-runnable and crash-safe — `drain` after a restart simply resumes.
+This is what makes it re-runnable and crash-safe. Running `drain` after a restart simply resumes.
 
 ```python
 for group, datasets, definition in planned_executions:
-    execution = Execution(execution_group=group, dataset_hash=datasets.hash,
-                          provider_version=definition.diagnostic.provider.version)
-    lifecycle.submit(execution, definition)   # dispatch(), persist handle; worker writes phase-1 at start
+    execution = Execution(
+        execution_group=group, 
+        dataset_hash=datasets.hash,
+        provider_version=definition.diagnostic.provider.version
+    )
+    # dispatch(), persist handle in db, worker writes phase-1 at start of execution
+    lifecycle.submit(execution, definition)   
 
 lifecycle.drain(timeout=timeout)
 ```
 
-`drain` loops over the in-flight set.
-**The transport decides liveness; the manifest decides outcome;
-a present manifest outcome wins over liveness.**
+`drain` loops over the in-flight executions as queried from the database.
+The transport decides liveness of the exeuction and the manifest determines the outcome.
+If a outcome is present in a manifest, it wins over the liveness check.
 
 ```python
 for ex in db.in_flight(submitted):                  # successful IS NULL
     match transport.status({ex.id: ex.handle})[ex.id]:
         case Status.RUNNING:
-            if now > read_manifest(ex.dir).deadline:   # drop-dead from phase-1 manifest
-                transport.cancel(ex.handle)            # ProcessPool only; schedulers self-kill
+            if now > read_manifest(ex.dir).deadline:   # deadline check
+                transport.cancel(ex.handle)            # ProcessPool only, other schedulers use native deadlines
             # else: keep waiting
 
-        case Status.EXITED:                          # clean exit — ask the disk what it meant
+        case Status.EXITED:                          # clean exit - check the outcome from disk
             outcome = read_manifest(ex.dir).outcome
             if outcome is None:                      # exited 0 but wrote no outcome -> incomplete
                 mark_retryable(ex)
@@ -252,7 +256,7 @@ for ex in db.in_flight(submitted):                  # successful IS NULL
         case Status.GONE:                            # no clean exit
             outcome = read_manifest(ex.dir).outcome  # worker may have logged a recoverable fail first
             if outcome is None:
-                mark_retryable(ex)                   # hard crash: OOM / SIGKILL / node loss
+                mark_retryable(ex)                   # hard crash: OOM / SIGKILL / node loss / queue loss
             else:
                 apply(outcome)                        # manifest wins (e.g. SUCCESS written then node reaped)
 ```
@@ -351,21 +355,26 @@ This is also what enables the existing `reingest` use case to work *without* the
 
 ## Transports
 
-All current executors are replaced, and the two HPC backends are added, as part of this work.
-Each is a launcher plus a status query — typically under a few hundred lines.
+All current executors are replaced, including the parsl-based HPC backend —
+one transport covers both Slurm and PBS, selected by config, as `HPCExecutor` does today.
+Each is a launcher plus a status query.
 
-| Transport             | Replaces / adds        | `dispatch`            | `status` source         | deadline enforced by      |
-|-----------------------|------------------------|-----------------------|-------------------------|---------------------------|
-| `InMemoryTransport`   | `SynchronousExecutor`  | run inline            | always `EXITED`         | n/a (inline)              |
-| `ProcessPoolTransport`| `LocalExecutor`        | `pool.submit`         | `future` state          | coordinator (`cancel`)    |
-| `CeleryTransport`     | `CeleryExecutor`       | `app.send_task`       | `AsyncResult`           | broker (`task_time_limit`)|
-| `SlurmTransport`      | new                    | `sbatch --mem --time` | `sacct` / `squeue`      | scheduler (`--time`)      |
-| `PbsTransport`        | new                    | `qsub -l`             | `qstat`                 | scheduler (walltime)      |
+| Transport             | Replaces               | `dispatch`                        | `status` source     | deadline enforced by      |
+|-----------------------|------------------------|-----------------------------------|---------------------|---------------------------|
+| `InMemoryTransport`   | `SynchronousExecutor`  | run inline                        | always `EXITED`     | n/a (inline)              |
+| `ProcessPoolTransport`| `LocalExecutor`        | `pool.submit`                     | `future` state      | coordinator (`cancel`)    |
+| `CeleryTransport`     | `CeleryExecutor`       | `app.send_task`                   | `AsyncResult`       | broker (`task_time_limit`)|
+| `HpcTransport`        | `HPCExecutor` (parsl)  | parsl provider submit (Slurm/PBS) | parsl future state  | block walltime + pilot    |
 
 `ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side
 (via `cancel` -> `future.cancel`), because there is no external scheduler to do it;
 this preserves today's `LocalExecutor` per-task timeout.
 
+`HpcTransport` wraps parsl rather than calling `sbatch` / `qsub` directly.
+parsl requests scheduler *blocks* (pilot jobs) and runs many executions inside each one,
+so the scheduler walltime bounds the allocation, not the individual execution;
+per-execution `wall_clock` is enforced inside the pilot, as `HPCExecutor.join` does today.
+
 **Celery callback demotes to a latency optimisation.**
 Today `CeleryExecutor` attaches `link` / `link_error` callbacks (`handle_result` / `handle_failure`)
 that ingest worker-side.
@@ -385,7 +394,7 @@ The cutover replaces the seam the solver calls, so it lands in stages, each inde
 2. **Seam + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`,
    and `InMemoryTransport` + `ProcessPoolTransport`; route the solver through the lifecycle.
 3. **Celery.** Port to `CeleryTransport`; convert `link`/`link_error` to the optional eager-ingest hook.
-4. **HPC.** Add `SlurmTransport` and `PbsTransport`.
+4. **HPC.** Port `HPCExecutor` to `HpcTransport` (Slurm + PBS via parsl).
 5. **Cleanup.** Delete the `Executor` protocol and the three old executors once nothing imports them.
 
 `import_executor_cls` (which resolves an executor from a dotted path in `Config`) gains a deprecation cycle:
@@ -463,7 +472,8 @@ To resolve through implementation:
 
 - Exact upsert unique constraints and any Alembic migration
   (`output_fragment`, `provider_version`, and `diagnostic_version` already exist).
-- `sacct` / `squeue` / `qstat` polling cadence and output parsing for the HPC transports.
+- Mapping parsl future / pilot-block state to `RUNNING | EXITED | GONE`, and enforcing the
+  per-execution deadline inside the pilot.
 - The deprecation mechanics for `import_executor_cls` and the executor dotted-path config key.
 
 Out of scope for this RFC (addressed independently later):

From 6a23e118ce1ed54b90037850962328b6f764a8f7 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 10:49:12 +1000
Subject: [PATCH 11/12] rfc(execution-lifecycle): collapse EXITED/GONE drain
 branches
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

EXITED and GONE already took identical action — a present manifest outcome
is applied either way, an absent one is retryable either way. Transport
status only separates RUNNING from done; the manifest decides the rest.

- Merge the two branches into 'case Status.EXITED | Status.GONE'; GONE
  keeps a distinct warning log (transport lost the job) but no distinct
  control flow.
- Add the atomic outcome-write invariant (temp + fsync + rename) so a
  GONE job can never observe a half-written outcome — this is what lets a
  lost job be trusted exactly as much as a clean exit.
- Rewrite the robustness paragraph to match.
---
 text/0003-execution-lifecycle.md | 93 +++++++++++++-------------------
 1 file changed, 36 insertions(+), 57 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index 5fd1146..fca583d 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -154,6 +154,11 @@ Therefore a manifest with `outcome is None` means *incomplete*:
 either still in flight, or hard-killed (OOM, `SIGKILL`, node loss) before it could write its epitaph.
 This single invariant is what makes crashes diagnosable from disk alone.
 
+The outcome must be written **atomically** (write a temp file, `fsync`, then `rename`),
+so a worker killed mid-write can never leave a half-written outcome that reads as present-but-corrupt.
+A present outcome is therefore always complete, which is what lets a lost (`GONE`) job be trusted
+exactly as much as a cleanly `EXITED` one.
+
 **Drop-dead deadline.**
 Because `deadline` is recorded on disk, an operator — or a reconciling coordinator with a weak transport —
 can flag an overdue execution (`now > deadline` and `outcome is None`) without querying the scheduler.
@@ -242,9 +247,11 @@ for ex in db.in_flight(submitted):                  # successful IS NULL
                 transport.cancel(ex.handle)            # ProcessPool only, other schedulers use native deadlines
             # else: keep waiting
 
-        case Status.EXITED:                          # clean exit - check the outcome from disk
+        case Status.EXITED | Status.GONE as st:      # done - the manifest, not the transport, decides
+            if st is Status.GONE:                    # transport lost the job; the manifest still rules
+                logger.warning(f"{ex} vanished without a clean exit; trusting the on-disk manifest")
             outcome = read_manifest(ex.dir).outcome
-            if outcome is None:                      # exited 0 but wrote no outcome -> incomplete
+            if outcome is None:                      # clean exit w/o epitaph, or hard crash (OOM / SIGKILL)
                 mark_retryable(ex)
             elif outcome.status is SUCCESS:
                 ingest_from_disk(ex.dir)             # CV-validated; same path for every transport
@@ -252,19 +259,14 @@ for ex in db.in_flight(submitted):                  # successful IS NULL
                 mark_failed(ex)                      # diagnostic error, give up
             else:                                    # RECOVERABLE
                 mark_retryable(ex)
-
-        case Status.GONE:                            # no clean exit
-            outcome = read_manifest(ex.dir).outcome  # worker may have logged a recoverable fail first
-            if outcome is None:
-                mark_retryable(ex)                   # hard crash: OOM / SIGKILL / node loss / queue loss
-            else:
-                apply(outcome)                        # manifest wins (e.g. SUCCESS written then node reaped)
 ```
 
-The decisive robustness rule is the last branch:
-if the worker wrote `SUCCESS` and the node then died before the scheduler reaped it,
-the transport reports `GONE` but the manifest reports success — and the manifest wins.
-Transport liveness only adjudicates the *no-manifest-outcome* case.
+`EXITED` and `GONE` therefore take the *same* action — a present manifest outcome is applied either way,
+and an absent one is retryable either way.
+Transport status only separates `RUNNING` from done; the manifest decides the rest.
+`GONE` differs only in that it emits a warning (the transport lost track of the job),
+which matters for the case where the worker wrote `SUCCESS` and the node then died before being reaped:
+the transport reports `GONE`, but the manifest still wins and the result is ingested.
 
 ## Execution states
 
@@ -364,34 +366,34 @@ Each is a launcher plus a status query.
 | `InMemoryTransport`   | `SynchronousExecutor`  | run inline                        | always `EXITED`     | n/a (inline)              |
 | `ProcessPoolTransport`| `LocalExecutor`        | `pool.submit`                     | `future` state      | coordinator (`cancel`)    |
 | `CeleryTransport`     | `CeleryExecutor`       | `app.send_task`                   | `AsyncResult`       | broker (`task_time_limit`)|
-| `HpcTransport`        | `HPCExecutor` (parsl)  | parsl provider submit (Slurm/PBS) | parsl future state  | block walltime + pilot    |
+| `HPCTransport`        | `HPCExecutor` (parsl)  | parsl provider submit (Slurm/PBS) | parsl future state  | block walltime + pilot    |
 
 `ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side
 (via `cancel` -> `future.cancel`), because there is no external scheduler to do it;
 this preserves today's `LocalExecutor` per-task timeout.
 
-`HpcTransport` wraps parsl rather than calling `sbatch` / `qsub` directly.
+`HPCTransport` wraps parsl rather than calling `sbatch` / `qsub` directly.
 parsl requests scheduler *blocks* (pilot jobs) and runs many executions inside each one,
-so the scheduler walltime bounds the allocation, not the individual execution;
+so the scheduler walltime bounds the allocation, not the individual execution (to confirm);
 per-execution `wall_clock` is enforced inside the pilot, as `HPCExecutor.join` does today.
 
-**Celery callback demotes to a latency optimisation.**
+**Celery callback.**
 Today `CeleryExecutor` attaches `link` / `link_error` callbacks (`handle_result` / `handle_failure`)
 that ingest worker-side.
+In production, this requires an additional `orchastrator` worker.
 Under this design the worker writes its bundle and manifest to disk like every other transport,
 and the coordinator (or a resident orchestrator running `drain`) ingests from disk.
-The `link` callback may be kept as an *eager-ingest hook* — "ingest this directory now" rather than at the next
-poll — but it is no longer load-bearing: if the callback is lost, the disk scan recovers the result.
-This dissolves the previous push-vs-pull asymmetry; both modes run the same `ingest_from_disk`.
+The `link` callback may be kept as an *eager-ingest hook* — "ingest this directory now" rather than at the next poll.
+It's no longer load-bearing as if the callback is lost, the disk scan recovers the result.
+This dissolves the previous push-vs-pull asymmetry and both modes run the same `ingest_from_disk`.
 
 ## Migration plan
 
 The cutover replaces the seam the solver calls, so it lands in stages, each independently shippable:
 
-1. **Manifest, additively.** Current executors start writing `execution.json` (both phases).
-   No behaviour change; result directories become self-describing immediately —
-   this alone unlocks regression comparison.
-2. **Seam + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`,
+1. **ExecutionManifest, additively.** Current executors start writing `execution.json` (both phases).
+   No behaviour change; result directories become self-describing immediately — this alone unlocks regression comparison.
+2. **New classes + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`,
    and `InMemoryTransport` + `ProcessPoolTransport`; route the solver through the lifecycle.
 3. **Celery.** Port to `CeleryTransport`; convert `link`/`link_error` to the optional eager-ingest hook.
 4. **HPC.** Port `HPCExecutor` to `HpcTransport` (Slurm + PBS via parsl).
@@ -457,44 +459,26 @@ crashes remain undiagnosable from durable state.
   reconciles task state from the database — essentially `dispatch` plus `status`-driven `drain`.
 - **Celery.** `task_time_limit` and queue routing; `ResourceHint.queue` maps onto Celery queues and
   `wall_clock` onto `task_time_limit`. Today's `CeleryExecutor` uses neither.
-- **Make / build systems.** A target is rebuilt unless its output exists and is current — the same logic as
-  reading a manifest outcome from disk before deciding to re-run.
+- **Make / build systems.** A target is rebuilt unless its output exists and is current.
+  The same logic as reading a manifest outcome from disk before deciding to re-run.
 
 # Unresolved questions
 
-To resolve through the RFC process:
+Outstanding questions that are not yet resolved:
 
 - The exact `ExecutionManifest` field set and `schema_version` stability guarantees.
 - Whether the Celery eager-ingest hook (`link`) is retained or Celery polls purely via `status`.
-- Where the controlled vocabulary comes from in tests (a permissive fixture vs. the project CV).
-
-To resolve through implementation:
-
 - Exact upsert unique constraints and any Alembic migration
   (`output_fragment`, `provider_version`, and `diagnostic_version` already exist).
-- Mapping parsl future / pilot-block state to `RUNNING | EXITED | GONE`, and enforcing the
-  per-execution deadline inside the pilot.
+- Mapping parsl future state to `RUNNING | EXITED | GONE`,
+  and enforcing the per-execution deadline.
 - The deprecation mechanics for `import_executor_cls` and the executor dotted-path config key.
 
-Out of scope for this RFC (addressed independently later):
-
-- **K8s transport.**
-- **External / "null" execution** — a directory produced entirely outside the REF, then ingested.
-  The format already supports it; the operational contract does not yet.
-- **Cross-deployment portability beyond shared datasets.**
-  Tier 2 (the importing deployment already indexes the same ESGF datasets, resolved by `instance_id`)
-  is enabled by the manifest's `datasets` field.
-  Tier 3 (datasets *not* present locally) needs dataset support for remote / not-yet-local files
-  and is a separate RFC. The manifest carries the cross-link information (`instance_id`) but not the files.
-- **External-execution version policy.** The default is ingest-as-is:
-  a result lands under its declared `diagnostic_version`, surfaces only if it equals the local
-  `promoted_version`, is CV-validated, and is never rejected on version. Hardening is deferred.
-- **Adaptive resource provider and per-execution telemetry columns.**
-- **Making CV validation a hard failure.**
-
 # Future possibilities
 
 - **K8s transport** — a pod per execution with `resources` and an active deadline; `status` from the pod phase.
+  The jobs are long enough running that the start up cost of a container may be negligible.
+  This may even deprecate Celery.
 - **External / null transport** — the lowest-friction modelling-centre adoption path:
   a centre runs the diagnostic in its own pipeline, writes `execution.json` + bundles to the expected layout,
   and `ref ingest <dir>` loads it. The manifest becomes a public, versioned contract.
@@ -503,11 +487,6 @@ Out of scope for this RFC (addressed independently later):
 - **Adaptive `ResourceProvider`** — capture per-execution telemetry (peak RSS, duration) into the manifest
   and the database, then suggest `ResourceHint` memory at, say, p95 × 1.2 over a rolling window.
 - **Cross-restart recovery** — already largely free: because `drain` reconstructs the in-flight set from the
-  database and `status` queries durable scheduler state, a coordinator that was down while a SLURM job finished
-  picks it up on the next `drain`.
-- **Per-provider retry policies** — the single worker-side classifier becomes a per-provider mapping
-  when concrete demand appears.
-
-None of these justify the RFC on their own;
-they show the manifest-plus-liveness seam is shaped for where the project is heading
-without pre-building any of them.
+  database and `status` queries durable scheduler state,
+  a coordinator that was down while a SLURM job finished picks it up on the next `drain`.
+- **Per-provider retry policies** — the single worker-side classifier becomes a per-provider mapping when concrete demand appears.

From 5b7e53ecd5d943cdc754907893b90ab3cdc8cc34 Mon Sep 17 00:00:00 2001
From: Jared Lewis <jared@jared.kiwi.nz>
Date: Fri, 5 Jun 2026 11:41:50 +1000
Subject: [PATCH 12/12] rfc(execution-lifecycle): align coordinator
 outcome->decision with collapsed drain

The paragraph claimed the decision used '(plus transport liveness)' and
listed 'SUCCESS | retry | give up', which contradicted the collapsed drain
loop and the robustness line (liveness only separates RUNNING from done).

- Spell out the full mapping: SUCCESS -> ingest, RECOVERABLE -> retry,
  FAILED -> give up, finished-with-no-outcome -> retry.
- State that liveness is used only to know the execution finished, not to
  classify it.
---
 text/0003-execution-lifecycle.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md
index fca583d..1721e01 100644
--- a/text/0003-execution-lifecycle.md
+++ b/text/0003-execution-lifecycle.md
@@ -325,8 +325,11 @@ NON_RETRYABLE = (CondaCommandError,)                                   # -> FAIL
 ```
 
 **Coordinator side — outcome -> decision.**
-The drain loop maps the outcome in the manifest (plus transport liveness) onto `SUCCESS | retry | give up`.
-It never inspects exception types, so it behaves identically for every transport,
+The drain loop maps the manifest outcome onto ingest / retry / give up:
+`SUCCESS` -> ingest, `RECOVERABLE` -> retry, `FAILED` -> give up,
+and a finished execution with no outcome -> retry.
+It never inspects exception types, and it relies on transport liveness only to know the execution has
+finished — not to classify it — so it behaves identically for every transport,
 including remote ones where the exception object never comes back.
 
 The `dirty` flag for an execution group is then updated: