From f9d00349ad0445dd8838b344e5857eccb857970b Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Tue, 12 May 2026 17:34:36 +1000 Subject: [PATCH 01/12] rfc: execution lifecycle consolidation Consolidate diagnostic execution lifecycle (allocation, dispatch, run, classify, publish, ingest, finalise) into one deep module behind a single Transport port. Surface ResourceHint on Diagnostic so providers can declare memory/CPU/wall-clock once. Capture per-execution Telemetry to enable future adaptive scheduling without a schema change. --- text/0000-execution-lifecycle.md | 822 +++++++++++++++++++++++++++++++ 1 file changed, 822 insertions(+) create mode 100644 text/0000-execution-lifecycle.md diff --git a/text/0000-execution-lifecycle.md b/text/0000-execution-lifecycle.md new file mode 100644 index 0000000..18d0a69 --- /dev/null +++ b/text/0000-execution-lifecycle.md @@ -0,0 +1,822 @@ +- Feature Name: `execution_lifecycle` +- Start Date: 2026-05-12 +- RFC PR: [Climate-REF/rfcs#0000](https://github.com/Climate-REF/rfcs/pull/0000) + +# Summary +[summary]: #summary + +Consolidate the lifecycle of a single diagnostic execution +— allocation, dispatch, run, classify, publish, ingest, finalise — +into one deep module (`ExecutionLifecycle`) sitting behind a single `Transport` port, +and surface a declarative `ResourceHint` on each `Diagnostic` so providers +can express memory / CPU / wall-clock expectations in one place +without reaching into the executor or scheduler layer. + +# Motivation +[motivation]: #motivation + +## Problem + +The lifecycle of one diagnostic execution today is fragmented across roughly +eight files in two packages. +A single execution flows through, in order: + +1. `solver.py` creates an `Execution` row with a `PLACEHOLDER_FRAGMENT`, + calls `assign_execution_fragment` to flush, compute a group-short string, + and rewrite the output fragment. +2. `solver.py` rebuilds the `ExecutionDefinition` via `attrs.evolve`, + calls `execution.register_datasets`, expunges the row, commits. +3. `executor.run(definition, execution)` is called outside the transaction. + Three implementations exist: + `SynchronousExecutor` runs in-process, + `LocalExecutor` submits to a `ProcessPoolExecutor`, + `CeleryExecutor` sends to a broker with linked result handlers. +4. `execute_locally` (in `climate-ref-core`) wraps `diagnostic.run`, + classifies exceptions via the private `_is_system_error`, + special-cases `CondaCommandError`, + and produces an `ExecutionResult` — whose `build_from_output_bundle` + factory writes three JSON files to disk despite being a frozen attrs class. +5. `process_result` and `handle_execution_result` copy log, metric bundle, + output bundle, series file, and every bundle-referenced plot/data/html + file from scratch to results, + then ingest scalar values, series values, and outputs in a nested + transaction. +6. `ExecutionGroup.dirty` is toggled in three different branches across + `handle_execution_result` and `mark_execution_failed`. + +Concrete problems this fragmentation causes today: + +- **Provider authors have no obvious place to express parallelisation hints.** + ESMValTool diagnostics that need 16 GB of memory or eight hours of + wall-clock have nowhere to declare it. + The `LocalExecutor` enforces a 6 hour per-task timeout; + the `CeleryExecutor` does not enforce a per-task timeout at all; + neither passes resource requirements through to anything that could honour + them. + When the project eventually adds SLURM or PBS adapters, + the natural way to wire memory / CPU / wall-clock into `sbatch` or `qsub` + directives does not exist. +- **Retry classification is split across the file tree.** + `_is_system_error` (in `climate-ref-core/executor.py`), + `CondaCommandError` handling (in the same file), + missing-log detection (in `result_handling.py`), + per-task timeout (in `LocalExecutor`), + and pool-shutdown abandonment (also in `LocalExecutor`) + each set `retryable` independently. + There is no single classifier function and no contract between + `ExecutionResult.build_from_failure` and the executor about which + exception types are retryable. +- **The `dirty` flag is decided in three places.** + Success path sets it `False`; + non-retryable failure sets it `False`; + retryable failure leaves it `True`; + missing log leaves it `True`. + These branches are spread across `handle_execution_result` and the + Local/Celery executors with no unifying decision function. +- **Tests have to mock subprocess, filesystem, DB, and Celery.** + `tests/unit/test_providers.py` patches `subprocess` directly for the + conda provider; `tests/unit/test_executor.py` imports the private + `_is_system_error`; integration tests are the only thing that exercises + the full happy path. +- **`ExecutionResult` is not picklable cleanly.** + Its frozen factory writes three JSON files; + re-construction in a test requires a real tmp_path filesystem. +- **The `Executor` Protocol is shallow.** + Its `run` / `join` interface forces every adapter + (Synchronous, Local, Celery, future SLURM/PBS/K8s) + to reinvent the same dance: + reattach the detached `Execution` row, + open a fresh transaction, + copy scratch to results, + load the CV, + ingest scalars and series, + apply the dirty rule, + mark the row. + +## Why now + +The CMIP REF is approaching the point where serious deployment targets are +plausible: +SLURM and PBS adapters, +distributed Celery deployments behind a managed broker, +possibly Kubernetes Jobs. +Each new transport currently means a new ~250-LOC adapter that re-implements +the lifecycle dance. +The same fragmentation also blocks anything resembling +**informed scheduling** — even simple things like "ESMValTool needs more memory +than ILAMB" or "the ENSO diagnostic should land on the `bigmem` queue" +have no representation in the codebase. + +This RFC is **not** about replacing SLURM, PBS, or Celery as schedulers. +It is about defining the seam between *climate-ref* and a scheduler so the +scheduler has enough information to do its job, +and so the lifecycle around the scheduler is one robust, well-tested module +instead of eight thin ones. + +## Expected outcomes + +1. The solver dispatch loop shrinks from ~50 lines of bookkeeping to ~3. +2. A new transport adapter (SLURM, PBS, K8s) is a single class + implementing `Transport`, around 80–120 LOC, + with no copy-paste of fragment allocation, ingestion, CV loading, + dirty-flag logic, or `Execution` reattachment. +3. Provider authors declare `resources: ResourceHint(...)` once on a + diagnostic class and the rest of the system honours it. +4. Per-execution telemetry (peak memory, duration, host) + is captured on every outcome and persisted on the `Execution` row, + making future adaptive scheduling + a feature addition rather than a schema change. +5. Unit tests run with zero filesystem and zero subprocess + via an `InMemoryTransport` and a separated bundle writer. + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +## Module boundary + +``` +┌──────────────────────────────────────────────────────────────────────────────┐ +│ ExecutionLifecycle (climate_ref.lifecycle) │ +│ │ +│ __init__(config, db, transport, *, │ +│ retry=DefaultRetryPolicy(), │ +│ cv=None, # default: CV.load_from_file(config.paths…) │ +│ clock=datetime.utcnow) │ +│ │ +│ submit(execution, definition) -> None │ +│ drain(timeout=None) -> None │ +│ │ +│ replay_abandoned() -> list[int] # find executions stranded across boot │ +│ dry_run(group, datasets) -> ExecutionDefinition │ +│ │ +│ OWNED (private): │ +│ _FragmentAllocator _Classifier _Promoter _Ingestor │ +│ _DirtyRule _BundleWriter (worker-side) │ +└──────────────────────────────────────────────────────────────────────────────┘ + │ + ▼ Port — the ONLY thing that varies per backend +┌──────────────────────────────────────────────────────────────────────────────┐ +│ Transport (Protocol, climate_ref_core.lifecycle.ports) │ +│ │ +│ name: ClassVar[str] │ +│ dispatch(envelope: ExecutionEnvelope) -> None │ +│ poll(block: bool, timeout: float | None) │ +│ -> Iterator[ExecutionOutcome] │ +│ shutdown(timeout: float | None) -> None │ +│ │ +│ Production adapters: │ +│ InMemoryTransport (in climate-ref, used by tests & SynchronousExec) │ +│ ProcessPoolTransport (in climate-ref, replaces LocalExecutor body) │ +│ CeleryTransport (in climate-ref-celery) │ +│ │ +│ Future adapters (NOT in this RFC, but the seam supports them): │ +│ SlurmTransport consumes ResourceHint as --mem/--cpus/--time/-p │ +│ PbsTransport consumes ResourceHint as -l mem,ncpus,walltime,q │ +│ K8sTransport consumes ResourceHint as pod resources + deadline │ +└──────────────────────────────────────────────────────────────────────────────┘ +``` + +## Wire types + +Three frozen, picklable value types cross the transport boundary +(process pool, broker, future network hop). +None of them carry DB sessions, `Config`, or controlled vocabulary. + +```python +@attrs.frozen +class ResourceHint: + """Declarative resource expectation, lives on the diagnostic class. + + Per-execution overrides are produced by Diagnostic.resources_for(). + """ + memory_mb: int = 4096 + cpu: int = 1 + wall_clock: timedelta = timedelta(hours=2) + queue: str | None = None # transport-specific routing tag + +@attrs.frozen +class ExecutionEnvelope: + """Wire format sent to a worker. Picklable.""" + execution_id: int + definition: ExecutionDefinition + resources: ResourceHint + deadline: datetime # clock() + resources.wall_clock at submit time + +@attrs.frozen +class Telemetry: + """Captured by the worker, persisted on the Execution row by the coordinator.""" + duration: timedelta + peak_rss_mb: int | None + host: str + exit_code: int | None + transport_meta: Mapping[str, str] # slurm jobid / k8s pod / celery task_id … + +@attrs.frozen +class ExecutionOutcome: + """What poll() yields back to the lifecycle.""" + execution_id: int + result: ExecutionResult | None # None ⇒ transport-side abandonment + failure: ExecutionFailure | None # timeout | broker_lost | pool_shutdown + telemetry: Telemetry +``` + +`ExecutionResult` becomes pure data. +The current `build_from_output_bundle` factory +(which writes three JSON files inside a frozen-attrs constructor) +is split into `ExecutionResult.from_bundle(definition, bundle)` — +which constructs the value object without I/O — +and a separate `_BundleWriter.write(definition, bundle)` +that is invoked worker-side only. + +## Diagnostic-side declaration + +```python +class Diagnostic(AbstractDiagnostic): + ... + resources: ResourceHint = ResourceHint() + + def resources_for(self, definition: ExecutionDefinition) -> ResourceHint: + """Optional per-execution sizing. + + Default returns ``self.resources`` unchanged. + Diagnostics whose memory scales with input size override this. + """ + return self.resources +``` + +Concrete provider declarations under the new system: + +```python +# packages/climate-ref-esmvaltool/src/climate_ref_esmvaltool/diagnostics/base.py +class ESMValToolDiagnostic(CommandLineDiagnostic): + resources = ResourceHint( + memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6), + ) + +# packages/climate-ref-pmp/src/climate_ref_pmp/diagnostics/enso.py +class EnsoDiagnostic(Diagnostic): + resources = ResourceHint( + memory_mb=24000, cpu=8, wall_clock=timedelta(hours=8), + queue="bigmem", + ) + +# packages/climate-ref-ilamb/src/climate_ref_ilamb/standard.py +class IlambDiagnostic(Diagnostic): + resources = ResourceHint( + memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2), + ) + def resources_for(self, defn): + n = len(defn.datasets.get_cmip6()) + return attrs.evolve(self.resources, memory_mb=8000 + 500 * n) +``` + +The base `Diagnostic` default of `(4 GB, 1 CPU, 2 h)` is the project-wide +sensible starting point. +Providers that need more either override the class attribute +or override `resources_for` for input-size scaling. + +## Solver dispatch — before and after + +**Before** (`solver.py:709-757`, condensed): + +```python +execution = Execution( + execution_group=execution_group, + dataset_hash=definition.datasets.hash, + output_fragment=PLACEHOLDER_FRAGMENT, + provider_version=potential_execution.provider.version, +) +db.session.add(execution) +fragment = assign_execution_fragment( + db.session, execution, + provider_slug=provider_slug, + diagnostic_slug=potential_execution.diagnostic.slug, + selectors=potential_execution.selectors, + group_id=execution_group.id, +) +definition = attrs.evolve( + definition, + output_directory=config.paths.scratch.resolve() / pathlib.Path(fragment), +) +execution.register_datasets(db, definition.datasets) +if execute: + db.session.expunge(execution) + pending = (definition, execution) +# … commit happens at top of loop … +if pending is not None: + executor.run(definition=pending[0], execution=pending[1]) +… +executor.join(timeout=timeout) +``` + +**After**: + +```python +lifecycle = ExecutionLifecycle(config, db, transport) + +for group, datasets, definition in planned_executions: + execution = Execution(execution_group=group, + dataset_hash=datasets.hash, + provider_version=definition.diagnostic.provider.version) + lifecycle.submit(execution, definition) + +lifecycle.drain(timeout=timeout) +``` + +`submit` privately handles fragment allocation, +`register_datasets`, +the expunge/commit dance, +and `transport.dispatch`. +`drain` privately handles polling, +reattachment, +artifact promotion, +ingestion, +the dirty rule, +and telemetry persistence. + +## End-to-end sequence + +``` +Solver Diagnostic Lifecycle Transport Worker DB + │ submit(e,d) │ │ │ │ │ + ├──────────────►│ resources_for(d)│ │ │ │ + │ ◄──── ResourceHint(...) │ │ │ │ + │ │ │ allocate fragment ─┼──────────────────┼────────────────►│ + │ │ │ register_datasets ─┼──────────────────┼────────────────►│ + │ │ │ session.expunge │ + │ │ │ commit ──────────────────────────────────────────────── │ + │ │ │ deadline = now + r.wall_clock │ + │ │ │ dispatch(envelope)─►│ (slurm sbatch / celery send / │ + │ │ │ │ processpool.submit / inline) │ + │ │ │ ├─►│ execute_locally(envelope) │ + │ │ │ │ │ diagnostic.run │ + │ │ │ │ │ _BundleWriter.write │ + │ │ │ │ │ CV.validate (hard-fail) │ + │ │ │ │ │ capture Telemetry │ + │ │ │ │ │ return ExecutionOutcome │ + │ │ │ │◄─│ │ + │ drain() │ │ poll() ◄───────────│ │ + │ ────────────►│ │ _finalize: │ + │ │ │ session.merge(eid) ───────────────────────────────────►│ + │ │ │ _Promoter.copy(scratch→results, log+bundles+refs) │ + │ │ │ _Classifier(failure or result) → Success|Retry|GiveUp │ + │ │ │ _Ingestor.upsert(outputs, scalars, series) ──────────►│ + │ │ │ _DirtyRule.apply(decision) ──────────────────────────►│ + │ │ │ execution.telemetry = outcome.telemetry ─────────────►│ + │ │ │ execution.mark_{successful,failed} ─────────────────►│ + │ done │ │ │ │ +``` + +## Retry classification + +A single `RetryPolicy.classify(outcome) -> RetryDecision` replaces: +`_is_system_error`, +the `CondaCommandError` branch in `execute_locally`, +the missing-log branch in `handle_execution_result`, +the per-task timeout handling in `LocalExecutor.join`, +the pool-shutdown abandonment in `LocalExecutor._fail_outstanding`. + +```python +class RetryDecision(enum.Enum): + SUCCESS = "success" + RETRY = "retry" # leaves dirty=True + GIVE_UP = "give_up" # sets dirty=False, marks failed + +class DefaultRetryPolicy: + SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) + NON_RETRYABLE = (CondaCommandError,) + + def classify(self, outcome: ExecutionOutcome) -> RetryDecision: + if outcome.failure is not None: + # transport-side: timeout, broker_lost, pool_shutdown → retry + return RetryDecision.RETRY + result = outcome.result + if result is None: + return RetryDecision.RETRY + if result.successful: + return RetryDecision.SUCCESS + return (RetryDecision.RETRY if result.retryable + else RetryDecision.GIVE_UP) +``` + +The dirty rule is derived from the decision in one place: + +```python +class DefaultDirtyRule: + def apply(self, execution: Execution, decision: RetryDecision) -> None: + if decision in (RetryDecision.SUCCESS, RetryDecision.GIVE_UP): + execution.execution_group.dirty = False + # RETRY: leave dirty=True so the next solve picks it up +``` + +## Idempotent re-ingest + +Re-running `_finalize` on the same `execution_id` +(e.g. a join-time crash followed by `replay_abandoned`) +must not double-insert. +The ingester moves to upserts keyed on natural identifiers: + +- `ExecutionOutput`: `(execution_id, output_type, short_name)` +- `ScalarMetricValue`: `(execution_id, dimensions_hash)` +- `SeriesMetricValue`: `(execution_id, dimensions_hash, index_name)` + +Where SQLite is in use the implementation uses +`INSERT … ON CONFLICT DO NOTHING`; +on PostgreSQL the same construct works natively. +The scratch-to-results copy uses `shutil.copy` with `exist_ok=True` +on the destination directory. + +## Telemetry + +Three new columns are added to `Execution`: + +```sql +ALTER TABLE execution ADD COLUMN duration_seconds FLOAT NULL; +ALTER TABLE execution ADD COLUMN peak_rss_mb INTEGER NULL; +ALTER TABLE execution ADD COLUMN telemetry_meta JSON NULL; +``` + +`telemetry_meta` is a small JSON blob with hostname, exit code, +and transport-specific identifiers (Celery task id, SLURM job id, …). +None of these are read by the system at solve time; +they exist so that a future adaptive resource provider +can ask "what was the rolling p95 memory for diagnostic X +on dataset shape Y?" without a schema migration. + +The shape of that future provider is **not** in this RFC. +A natural seam is: + +```python +class ResourceProvider(Protocol): + def hint_for(self, definition: ExecutionDefinition) -> ResourceHint: ... +``` + +with a `StaticResourceProvider` (uses `diagnostic.resources_for`) +and a future `AdaptiveResourceProvider` (reads telemetry, applies multipliers). +`ExecutionLifecycle` would gain a `resources:` constructor kwarg defaulting +to `StaticResourceProvider`. +This is intentionally deferred — the column existing today +is the only commitment. + +## Module / file layout + +``` +packages/climate-ref-core/src/climate_ref_core/ +├── lifecycle/ +│ ├── __init__.py ← public re-exports +│ ├── ports.py ← Transport, ResourceHint, Envelope, Outcome, Telemetry +│ ├── results.py ← ExecutionResult (pure data) + from_bundle factory +│ └── worker.py ← execute_worker(envelope) — picklable entry point +│ +packages/climate-ref/src/climate_ref/ +├── lifecycle.py ← ExecutionLifecycle class +├── transports/ +│ ├── inmemory.py ← InMemoryTransport +│ └── processpool.py ← ProcessPoolTransport (was LocalExecutor) +└── executor/ ← deleted; the Executor Protocol goes away +│ +packages/climate-ref-celery/src/climate_ref_celery/ +└── transport.py ← CeleryTransport (was CeleryExecutor) +``` + +The existing `Executor` Protocol in `climate-ref-core/executor.py` +and the three concrete executors are deleted. +`execute_locally` becomes a thin wrapper around `execute_worker` +kept for backwards compatibility for one release cycle, then removed. + +## Test impact + +**Deleted tests** (boundary tests will replace them): + +- `tests/unit/test_executor.py::test_is_system_error_*` + — private-import tests on `_is_system_error` +- `tests/unit/test_providers.py::test_conda_*` subprocess patch chains + for the conda provider's `run()` are replaced by transport-level boundary tests +- `tests/unit/executor/test_synchronous.py::*` reattachment tests +- `tests/unit/executor/test_local.py::*` future-handling tests + beyond what `ProcessPoolTransport`'s own tests cover +- Per-executor `mark_execution_failed` mock-chain tests + +**New boundary tests** (against `ExecutionLifecycle` with `InMemoryTransport`): + +- `submit` allocates a unique fragment per execution +- `submit` persists `register_datasets` even when the transport dispatch fails +- `drain` ingests one successful outcome end-to-end and sets `dirty=False` +- `drain` re-dispatches a retryable failure on the next solve + by leaving `dirty=True` +- `drain` marks a non-retryable failure with `dirty=False` and `successful=False` +- `drain` is idempotent: re-draining the same outcome does not double-insert + scalar or series rows +- `drain` enforces `resources.wall_clock` uniformly — a worker that exceeds + the deadline produces a retryable failure regardless of transport +- CV validation fails hard with a `ResultValidationError` when metrics + reference unknown dimensions (replaces today's silent `logger.warning`) +- `replay_abandoned` returns the IDs of executions whose rows exist + but never received an outcome + +# Drawbacks +[drawbacks]: #drawbacks + +## Migration cost + +This is not a small refactor. +The dispatch loop, three executor classes, result handling, +fragment allocation, and ingestion all move at once. +A staged migration is possible — `InMemoryTransport` and `ProcessPoolTransport` +can ship before `CeleryTransport` — but the lifecycle module itself has to +land in one piece because every transport depends on it. + +## CV validation becomes a hard fail + +Today, when scalar or series metric values do not conform to the controlled +vocabulary, `result_handling.py` logs a warning with a `TODO`. +This RFC promotes that to a hard `ResultValidationError`. +Any in-flight diagnostic with a CV mismatch will start failing. +This is the *intended* behaviour — silent CV drift is a real source of data +quality issues — but it is a behaviour change that needs deliberate handling +during rollout (e.g. a one-release-cycle deprecation period where the +violation is logged with `ERROR` instead of `WARNING` but does not yet raise). + +## A single fat class owns more responsibility + +`ExecutionLifecycle` ends up around 400–500 LOC. +That is intentional: the things it owns are already coupled across eight +files today, and module *depth* is the design goal. +But it does mean the file is large; reviewers should expect that. + +## The Celery boundary still has one quirk + +In the current code, Celery's `link=handle_result.s(...)` chains a result +handler that runs on the broker side of the boundary. +In the new design, `CeleryTransport.poll` is what feeds outcomes back into +`drain`. +This is cleaner, but it means a coordinator-side crash mid-drain can leave +results "in flight" in the broker that the coordinator has not yet observed. +The mitigation is `replay_abandoned()`, which on next startup queries +unfinished executions and re-polls the broker for them. +This is a robustness improvement over today +(where a coordinator crash silently strands rows in `successful=NULL`), +but it is not free — +it requires `CeleryTransport` to persist task IDs alongside execution IDs +in a small table or in `telemetry_meta` so the broker can be queried later. + +## Resource hints can be wrong + +If a provider declares 4 GB for a diagnostic that actually needs 16 GB, +the SLURM adapter will OOM-kill the job. +Today's `LocalExecutor` runs in the parent address space and "just works" +up to the host's memory. +With `ResourceHint` in place, getting the number wrong is a new failure mode. +Mitigations: + +- The default `ResourceHint(memory_mb=4096, cpu=1, wall_clock=2h)` + is generous enough that most diagnostics run unchanged. +- `ProcessPoolTransport` deliberately ignores everything except + `wall_clock` — running on a developer laptop does not get harder. +- Telemetry capture means the first failed run produces the data needed + to adjust the hint upward. + +# Rationale and alternatives +[rationale-and-alternatives]: #rationale-and-alternatives + +Three designs were considered in detail. +Sketches and trade-offs follow; the chosen design is a deliberate hybrid. + +## A — Minimal interface (rejected as a pure form) + +Two public methods (`submit`, `drain`), one Protocol (`Transport`). +Everything else collapses behind the facade and cannot be customised +without editing the module. + +``` +Public surface: ████░░░░░░░░░░░░░░░░ (smallest) +Defaults baked: ████████░░░░░░░░░░░░ +Future bend: ██████████░░░░░░░░░░ (one port + edit-in-place) +Migration churn: ████████████████░░░░ (deletes Executor Protocol) +``` + +Strength: minimal cognitive load. Weakness: no place to declare +resource hints, no place for per-provider retry policies; both end up +requiring constructor kwarg additions on first concrete demand. + +## B — Maximally flexible (rejected as a pure form) + +Five ports (`Transport`, `ArtifactStore`, `RetryPolicy`, `IngestSink`, +`FragmentAllocator`), seven lifecycle hooks, plugin discovery via entry +points, schema-versioned wire types. + +``` +Public surface: ████████████████████ (5 ports × N methods + 7 hooks) +Defaults baked: ████░░░░░░░░░░░░░░░░ (everything injectable) +Future bend: ████████████████████ (port-shaped for K8s/S3/Prom) +Migration churn: ████████████████████ (new wire format + plugin registry) +Speculation tax: ██████████░░░░░░░░░░ (hooks + FragmentAllocator-as-port) +``` + +Strength: every anticipated future requirement has a seam already. +Weakness: heavy speculation tax — five ports without five concrete +implementations driving them is exactly the shape that calcifies into +"the way things have always been done". + +## C — Common-case optimised (rejected as a pure form) + +One class with `dispatch`, `drain`, plus `replay_abandoned`, `dry_run`, +`ingest`, `with_transport`. Sensible defaults sourced from `Config`. + +``` +Public surface: ████████░░░░░░░░░░░░ (1 hot method + 4 cold methods) +Defaults baked: ████████████████████ (highest) +Future bend: ████████░░░░░░░░░░░░ (single port; widen kwargs) +Migration churn: ██████░░░░░░░░░░░░░░ (smallest; adapters survive) +Hidden leak: ████░░░░░░░░░░░░░░░░ (Celery ignores on_done callback) +``` + +Strength: the solver call site collapses to one line per iteration. +Weakness: a `ResultSink` callback that Celery silently ignores is an +asymmetry that will trip someone, and the design has no place for +resource hints. + +## The chosen hybrid + +- **C's façade** — `ExecutionLifecycle` is one class with `submit` and + `drain` as the hot path and a small set of cold-path entry points. +- **A's transport contract** — `Transport.dispatch(envelope)` plus + `poll()` returning `Iterator[ExecutionOutcome]`. No `ResultSink` + callback. The lifecycle owns the result loop. +- **A's bundle-writer separation** — `ExecutionResult` is pure data; + bundle JSON write moves into a worker-side `_BundleWriter`. +- **B's `ExecutionEnvelope` and `Telemetry` wire types** — earned by + the actual provider-side pain (no place for resource hints) and by the + need to capture per-execution actuals for future adaptive sizing. +- **Defer the rest of B** — no `ArtifactStore` port (S3 mirror is + hypothetical), no `IngestSink` priority chain (DB-only today), + no `LifecycleHooks` (today's `logger.*` calls suffice), + no `FragmentAllocator` port (one implementation foreseeable), + no entry-point plugin registry. +- **Delete the `Executor` Protocol.** It is a shallow interface + dressed up as flexibility; `SynchronousExecutor` becomes + `ExecutionLifecycle(transport=InMemoryTransport())`. + +## Impact of not doing this + +Each future transport (SLURM, PBS, K8s) re-implements ~250 LOC of +lifecycle wiring per adapter. +Resource hints have to be retrofitted later through a new wire format, +which is a strictly larger change than introducing them at the same time +as the lifecycle consolidation. +Per-task timeout, CV validation, dirty-flag rules, and retry +classification stay scattered, and integration tests remain the only +honest signal that the system works end-to-end. + +# Prior art +[prior-art]: #prior-art + +## Dask's `distributed` + +Dask's scheduler accepts `resources=` annotations on submitted tasks +(e.g. `client.submit(func, resources={"GPU": 1})`). +The pattern of declaring resource needs on the unit-of-work and letting +the scheduler honour them is the direct inspiration for `ResourceHint`. +Dask does not concern itself with what is "above" the scheduler +(catalog, ingest, controlled vocabulary), +which is the layer this RFC is consolidating. + +## Snakemake / Nextflow + +Both pipeline frameworks have first-class resource directives +(`resources: mem_mb=…, runtime=…`) on rules / processes, +which translate transparently into SLURM / PBS / K8s directives. +This RFC reuses that mental model at the diagnostic level. + +## Airflow operators + +Airflow's executor/operator split is the closest analogue: +operators carry the work definition, executors decide how to run it. +Airflow learned the hard way that an executor that owns result +handling is too narrow an interface; their `BaseExecutor.execute_async` +plus `sync` (poll) split is essentially what `Transport.dispatch` / +`Transport.poll` is here. + +## Celery's `task_time_limit` and queue routing + +Celery already supports per-queue routing and per-task time limits. +This RFC's `ResourceHint.queue` maps directly onto Celery's queue +routing; `wall_clock` maps onto `task_time_limit` plus +`task_soft_time_limit`. +The `CeleryExecutor` in `climate-ref-celery` does not currently use +either; the new `CeleryTransport` does. + +## The Rust RFC process itself + +The shape of this document (Summary, Motivation, Reference-level, +Drawbacks, Rationale, Prior art, Unresolved, Future) is inherited from +[rust-lang/rfcs](https://github.com/rust-lang/rfcs) +via this repository's template. + +# Unresolved questions +[unresolved-questions]: #unresolved-questions + +To resolve through the RFC discussion: + +- **Should `ResourceHint` include I/O class (e.g. `io_intensive: bool`) + or GPU (`gpu: int`)?** + Neither has a concrete consumer today. + The recommended default is to add fields when a concrete adapter + needs them rather than speculating now. +- **Where does the controlled vocabulary live in test contexts?** + Today `CV.load_from_file(config.paths.dimensions_cv)` is invoked + per-result inside `handle_execution_result`. + The RFC moves it to lifecycle construction, which means tests need + a deterministic CV. + A `PermissiveCV()` test fixture is one option; + baking the project's CV into a constant for test use is another. +- **Should `replay_abandoned` be called automatically on + `ExecutionLifecycle.__init__`?** + Doing so makes recovery transparent; + not doing so makes startup behaviour explicit. + The current draft assumes it is explicit and called from the CLI. + +To resolve through implementation: + +- **Exact unique-constraint shape for upserts** — + `(execution_id, output_type, short_name)` for `ExecutionOutput` + is the obvious key, but the existing schema does not enforce it. + An Alembic migration is needed. +- **`CeleryTransport.poll` semantics** — + whether to use `AsyncResult.get(timeout=…)` per task + or `app.control.inspect` for batch status. + Implementation choice; does not affect the public interface. +- **`Telemetry.peak_rss_mb` capture on macOS** — + `resource.getrusage` returns bytes on macOS and KB on Linux. + A small platform shim is required worker-side. + +Out of scope: + +- **Adaptive resource provider** (reading rolling telemetry to suggest + hints). Telemetry columns land here; the provider is a follow-up RFC. +- **GPU scheduling**. + None of the current diagnostics use GPUs; + when one does, `ResourceHint` gains a `gpu: int` field. +- **Multi-tenant queues**. + `ResourceHint.queue` is enough for single-deployment routing; + multi-tenant work (per-org queues, fair-share) is out of scope. +- **Replacing SLURM / PBS / Celery as schedulers**. + Explicitly *not* in scope: the lifecycle owns the seam *above* the + scheduler, not the scheduler itself. + +# Future possibilities +[future-possibilities]: #future-possibilities + +The deepening proposed here unlocks several follow-ups, each a +self-contained piece of work: + +- **SLURM and PBS transports.** + Each is a ~100 LOC adapter implementing `Transport`. + `dispatch` builds a job script from `envelope.resources`, + shells out to `sbatch` / `qsub`, records the job ID in `telemetry_meta`; + `poll` queries `squeue` / `qstat` and surfaces completed jobs back as + `ExecutionOutcome`s. + No changes to the lifecycle or to provider code are required. + +- **Adaptive `ResourceProvider`.** + Reads `Execution.peak_rss_mb` over a rolling window per + `(diagnostic_slug, dataset_shape_signature)`, + suggests a memory hint at the rolling p95 × 1.2, + applies a per-CLI flag to opt in. + Schema is already in place from this RFC; the provider is the only + new code. + +- **Per-provider retry policies.** + Currently `RetryPolicy` is a single injected object. + ESMValTool's memory-exhaustion failures are retryable with a larger + memory hint; a `Mapping[str, RetryPolicy]` keyed on provider slug + becomes the natural extension when concrete demand arrives. + +- **Streaming partial results.** + `Transport.poll` already yields outcomes incrementally; + a `PartialOutcome` event could be added to the iterator for + progress reporting if a UI consumer appears. + +- **S3 / HTTP artifact store.** + If results need to land somewhere other than the local filesystem, + an `ArtifactStore` port can be carved out of `_Promoter` later. + The wire format does not change (scratch path remains relative); + only the lifecycle's promote step is rewritten. + +- **Pluggable ingest sinks.** + Prometheus metrics, audit logs, S3 mirrors — each is a side-effect + observer over `ExecutionOutcome`. + A small `IngestSink` interface with ordered observers is a small + follow-up if and when a second sink (beyond the DB) materialises. + +- **Recovery on coordinator restart.** + `replay_abandoned()` is in this RFC. + Extending it to interrogate transport-side state (SLURM job that + finished while the coordinator was down) is a natural follow-up that + only needs `Transport.lookup(transport_meta) -> ExecutionOutcome | None` + on each adapter. + +None of the above are reasons to accept this RFC on their own; +they are listed to show that the seam introduced here is the right shape +for the directions the project is plausibly heading, +without baking any of them in prematurely. From 5034a53eab50cb13b3506ba35f039f220fad1a11 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Tue, 12 May 2026 17:35:57 +1000 Subject: [PATCH 02/12] rfc(execution-lifecycle): assign PR #3 number --- ...{0000-execution-lifecycle.md => 0003-execution-lifecycle.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename text/{0000-execution-lifecycle.md => 0003-execution-lifecycle.md} (99%) diff --git a/text/0000-execution-lifecycle.md b/text/0003-execution-lifecycle.md similarity index 99% rename from text/0000-execution-lifecycle.md rename to text/0003-execution-lifecycle.md index 18d0a69..40418be 100644 --- a/text/0000-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -1,6 +1,6 @@ - Feature Name: `execution_lifecycle` - Start Date: 2026-05-12 -- RFC PR: [Climate-REF/rfcs#0000](https://github.com/Climate-REF/rfcs/pull/0000) +- RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3) # Summary [summary]: #summary From aa5d896ced84d04f5745af85e6a4157bd0153675 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Tue, 12 May 2026 20:24:11 +1000 Subject: [PATCH 03/12] rfc(execution-lifecycle): convert diagrams to mermaid --- text/0003-execution-lifecycle.md | 193 ++++++++++++++++--------------- 1 file changed, 101 insertions(+), 92 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index 40418be..e0461d0 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -134,46 +134,40 @@ instead of eight thin ones. ## Module boundary -``` -┌──────────────────────────────────────────────────────────────────────────────┐ -│ ExecutionLifecycle (climate_ref.lifecycle) │ -│ │ -│ __init__(config, db, transport, *, │ -│ retry=DefaultRetryPolicy(), │ -│ cv=None, # default: CV.load_from_file(config.paths…) │ -│ clock=datetime.utcnow) │ -│ │ -│ submit(execution, definition) -> None │ -│ drain(timeout=None) -> None │ -│ │ -│ replay_abandoned() -> list[int] # find executions stranded across boot │ -│ dry_run(group, datasets) -> ExecutionDefinition │ -│ │ -│ OWNED (private): │ -│ _FragmentAllocator _Classifier _Promoter _Ingestor │ -│ _DirtyRule _BundleWriter (worker-side) │ -└──────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ Port — the ONLY thing that varies per backend -┌──────────────────────────────────────────────────────────────────────────────┐ -│ Transport (Protocol, climate_ref_core.lifecycle.ports) │ -│ │ -│ name: ClassVar[str] │ -│ dispatch(envelope: ExecutionEnvelope) -> None │ -│ poll(block: bool, timeout: float | None) │ -│ -> Iterator[ExecutionOutcome] │ -│ shutdown(timeout: float | None) -> None │ -│ │ -│ Production adapters: │ -│ InMemoryTransport (in climate-ref, used by tests & SynchronousExec) │ -│ ProcessPoolTransport (in climate-ref, replaces LocalExecutor body) │ -│ CeleryTransport (in climate-ref-celery) │ -│ │ -│ Future adapters (NOT in this RFC, but the seam supports them): │ -│ SlurmTransport consumes ResourceHint as --mem/--cpus/--time/-p │ -│ PbsTransport consumes ResourceHint as -l mem,ncpus,walltime,q │ -│ K8sTransport consumes ResourceHint as pod resources + deadline │ -└──────────────────────────────────────────────────────────────────────────────┘ +```mermaid +flowchart TB + subgraph Lifecycle["ExecutionLifecycle   climate_ref.lifecycle"] + direction TB + Public["Public
__init__(config, db, transport, *, retry, cv, clock)
submit(execution, definition)
drain(timeout=None)
replay_abandoned() · dry_run(group, datasets)"] + Private["Owned (private)
_FragmentAllocator · _Classifier · _Promoter
_Ingestor · _DirtyRule · _BundleWriter"] + Public --- Private + end + + Lifecycle -- dispatch / poll --> Port + + subgraph Port["Transport (Protocol)   climate_ref_core.lifecycle.ports"] + TP["name: ClassVar[str]
dispatch(envelope: ExecutionEnvelope) -> None
poll(block, timeout) -> Iterator[ExecutionOutcome]
shutdown(timeout) -> None"] + end + + Port --> Today + Port -.future.-> Future + + subgraph Today["Adapters in this RFC"] + direction LR + InMem["InMemoryTransport
tests + SynchronousExecutor replacement"] + PP["ProcessPoolTransport
replaces LocalExecutor body"] + CT["CeleryTransport
in climate-ref-celery"] + end + + subgraph Future["Future adapters — seam supports them"] + direction LR + SLURM["SlurmTransport
sbatch --mem --cpus --time --partition"] + PBS["PbsTransport
qsub -l mem,ncpus,walltime,queue"] + K8s["K8sTransport
pod resources + activeDeadlineSeconds"] + end + + classDef future stroke-dasharray: 5 5 + class Future,SLURM,PBS,K8s future ``` ## Wire types @@ -336,35 +330,47 @@ and telemetry persistence. ## End-to-end sequence -``` -Solver Diagnostic Lifecycle Transport Worker DB - │ submit(e,d) │ │ │ │ │ - ├──────────────►│ resources_for(d)│ │ │ │ - │ ◄──── ResourceHint(...) │ │ │ │ - │ │ │ allocate fragment ─┼──────────────────┼────────────────►│ - │ │ │ register_datasets ─┼──────────────────┼────────────────►│ - │ │ │ session.expunge │ - │ │ │ commit ──────────────────────────────────────────────── │ - │ │ │ deadline = now + r.wall_clock │ - │ │ │ dispatch(envelope)─►│ (slurm sbatch / celery send / │ - │ │ │ │ processpool.submit / inline) │ - │ │ │ ├─►│ execute_locally(envelope) │ - │ │ │ │ │ diagnostic.run │ - │ │ │ │ │ _BundleWriter.write │ - │ │ │ │ │ CV.validate (hard-fail) │ - │ │ │ │ │ capture Telemetry │ - │ │ │ │ │ return ExecutionOutcome │ - │ │ │ │◄─│ │ - │ drain() │ │ poll() ◄───────────│ │ - │ ────────────►│ │ _finalize: │ - │ │ │ session.merge(eid) ───────────────────────────────────►│ - │ │ │ _Promoter.copy(scratch→results, log+bundles+refs) │ - │ │ │ _Classifier(failure or result) → Success|Retry|GiveUp │ - │ │ │ _Ingestor.upsert(outputs, scalars, series) ──────────►│ - │ │ │ _DirtyRule.apply(decision) ──────────────────────────►│ - │ │ │ execution.telemetry = outcome.telemetry ─────────────►│ - │ │ │ execution.mark_{successful,failed} ─────────────────►│ - │ done │ │ │ │ +```mermaid +sequenceDiagram + autonumber + participant S as Solver + participant D as Diagnostic + participant L as ExecutionLifecycle + participant T as Transport + participant W as Worker + participant DB as DB + + S->>L: submit(execution, definition) + L->>D: resources_for(definition) + D-->>L: ResourceHint(memory_mb, cpu, wall_clock, queue?) + L->>DB: allocate fragment + register_datasets + L->>DB: session.expunge + commit + Note over L: deadline = clock() + resources.wall_clock + L->>T: dispatch(ExecutionEnvelope) + + Note over T,W: sbatch · qsub · pool.submit · celery send · inline + T->>W: hand off envelope + activate W + W->>W: diagnostic.run(definition) + W->>W: _BundleWriter.write (output.json, diagnostic.json, series.json) + W->>W: CV.validate (hard-fail) + W->>W: capture Telemetry (duration, peak_rss_mb, host) + W-->>T: ExecutionOutcome + deactivate W + + S->>L: drain(timeout) + loop until no in-flight executions + L->>T: poll(block, timeout) + T-->>L: ExecutionOutcome + Note over L: _finalize(outcome) + L->>DB: session.merge(execution_id) + L->>DB: _Promoter.copy(scratch → results) + Note over L: _Classifier → SUCCESS | RETRY | GIVE_UP + L->>DB: _Ingestor.upsert(outputs, scalars, series) + L->>DB: _DirtyRule.apply(decision) + L->>DB: execution.telemetry = outcome.telemetry + L->>DB: mark_successful / mark_failed + end ``` ## Retry classification @@ -582,19 +588,38 @@ Mitigations: Three designs were considered in detail. Sketches and trade-offs follow; the chosen design is a deliberate hybrid. +```mermaid +quadrantChart + title Design trade-off space + x-axis "Surface area (concepts)" --> "Larger" + y-axis "Defaults baked in" --> "More" + quadrant-1 "Heavy & opinionated" + quadrant-2 "Lean & opinionated" + quadrant-3 "Lean & open" + quadrant-4 "Heavy & open" + "A - Minimal": [0.18, 0.55] + "B - Maximally flexible": [0.92, 0.18] + "C - Common-case optimised": [0.38, 0.92] + "Hybrid (chosen)": [0.42, 0.7] +``` + +Comparison at a glance (relative scale, 0–5): + +| Dimension | A — Minimal | B — Maximal | C — Common-case | Hybrid (chosen) | +|------------------------|:-:|:-:|:-:|:-:| +| Public surface | 1 | 5 | 2 | 2 | +| Defaults baked in | 3 | 1 | 5 | 4 | +| Bend without editing | 3 | 5 | 2 | 3 | +| Migration churn | 4 | 5 | 2 | 3 | +| Resource-hint support | 0 | 5 | 0 | 5 | +| Speculation tax | 0 | 3 | 0 | 1 | + ## A — Minimal interface (rejected as a pure form) Two public methods (`submit`, `drain`), one Protocol (`Transport`). Everything else collapses behind the facade and cannot be customised without editing the module. -``` -Public surface: ████░░░░░░░░░░░░░░░░ (smallest) -Defaults baked: ████████░░░░░░░░░░░░ -Future bend: ██████████░░░░░░░░░░ (one port + edit-in-place) -Migration churn: ████████████████░░░░ (deletes Executor Protocol) -``` - Strength: minimal cognitive load. Weakness: no place to declare resource hints, no place for per-provider retry policies; both end up requiring constructor kwarg additions on first concrete demand. @@ -605,14 +630,6 @@ Five ports (`Transport`, `ArtifactStore`, `RetryPolicy`, `IngestSink`, `FragmentAllocator`), seven lifecycle hooks, plugin discovery via entry points, schema-versioned wire types. -``` -Public surface: ████████████████████ (5 ports × N methods + 7 hooks) -Defaults baked: ████░░░░░░░░░░░░░░░░ (everything injectable) -Future bend: ████████████████████ (port-shaped for K8s/S3/Prom) -Migration churn: ████████████████████ (new wire format + plugin registry) -Speculation tax: ██████████░░░░░░░░░░ (hooks + FragmentAllocator-as-port) -``` - Strength: every anticipated future requirement has a seam already. Weakness: heavy speculation tax — five ports without five concrete implementations driving them is exactly the shape that calcifies into @@ -623,14 +640,6 @@ implementations driving them is exactly the shape that calcifies into One class with `dispatch`, `drain`, plus `replay_abandoned`, `dry_run`, `ingest`, `with_transport`. Sensible defaults sourced from `Config`. -``` -Public surface: ████████░░░░░░░░░░░░ (1 hot method + 4 cold methods) -Defaults baked: ████████████████████ (highest) -Future bend: ████████░░░░░░░░░░░░ (single port; widen kwargs) -Migration churn: ██████░░░░░░░░░░░░░░ (smallest; adapters survive) -Hidden leak: ████░░░░░░░░░░░░░░░░ (Celery ignores on_done callback) -``` - Strength: the solver call site collapses to one line per iteration. Weakness: a `ResultSink` callback that Celery silently ignores is an asymmetry that will trip someone, and the design has no place for From 67f5fd2e4a0d5ee62ad1dd2f16a370e3e4538dac Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Tue, 12 May 2026 20:29:02 +1000 Subject: [PATCH 04/12] rfc(execution-lifecycle): swap module boundary to classDiagram Mermaid classDiagram is idiomatic for the Protocol + adapters pattern. LR layout fits PR width; method signatures stay legible without HTML hacks. --- text/0003-execution-lifecycle.md | 87 ++++++++++++++++++++------------ 1 file changed, 54 insertions(+), 33 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index e0461d0..dafd7f6 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -135,39 +135,60 @@ instead of eight thin ones. ## Module boundary ```mermaid -flowchart TB - subgraph Lifecycle["ExecutionLifecycle   climate_ref.lifecycle"] - direction TB - Public["Public
__init__(config, db, transport, *, retry, cv, clock)
submit(execution, definition)
drain(timeout=None)
replay_abandoned() · dry_run(group, datasets)"] - Private["Owned (private)
_FragmentAllocator · _Classifier · _Promoter
_Ingestor · _DirtyRule · _BundleWriter"] - Public --- Private - end - - Lifecycle -- dispatch / poll --> Port - - subgraph Port["Transport (Protocol)   climate_ref_core.lifecycle.ports"] - TP["name: ClassVar[str]
dispatch(envelope: ExecutionEnvelope) -> None
poll(block, timeout) -> Iterator[ExecutionOutcome]
shutdown(timeout) -> None"] - end - - Port --> Today - Port -.future.-> Future - - subgraph Today["Adapters in this RFC"] - direction LR - InMem["InMemoryTransport
tests + SynchronousExecutor replacement"] - PP["ProcessPoolTransport
replaces LocalExecutor body"] - CT["CeleryTransport
in climate-ref-celery"] - end - - subgraph Future["Future adapters — seam supports them"] - direction LR - SLURM["SlurmTransport
sbatch --mem --cpus --time --partition"] - PBS["PbsTransport
qsub -l mem,ncpus,walltime,queue"] - K8s["K8sTransport
pod resources + activeDeadlineSeconds"] - end - - classDef future stroke-dasharray: 5 5 - class Future,SLURM,PBS,K8s future +classDiagram + direction LR + + class ExecutionLifecycle { + <> + +submit(execution, definition) + +drain(timeout) None + +replay_abandoned() list~int~ + +dry_run(group, datasets) ExecutionDefinition + -_FragmentAllocator + -_Classifier + -_Promoter + -_Ingestor + -_DirtyRule + -_BundleWriter + } + + class Transport { + <> + +name: ClassVar~str~ + +dispatch(envelope: ExecutionEnvelope) None + +poll(block, timeout) Iterator~ExecutionOutcome~ + +shutdown(timeout) None + } + + class InMemoryTransport { + tests + SynchronousExecutor + } + class ProcessPoolTransport { + replaces LocalExecutor + } + class CeleryTransport { + climate-ref-celery + } + class SlurmTransport { + future + sbatch --mem --cpus --time --partition + } + class PbsTransport { + future + qsub -l mem,ncpus,walltime,queue + } + class K8sTransport { + future + pod resources + activeDeadlineSeconds + } + + ExecutionLifecycle ..> Transport : dispatch / poll + InMemoryTransport ..|> Transport + ProcessPoolTransport ..|> Transport + CeleryTransport ..|> Transport + SlurmTransport ..|> Transport + PbsTransport ..|> Transport + K8sTransport ..|> Transport ``` ## Wire types From 8f8263b33119daf38103a4442bedc58adaace7ec Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Tue, 12 May 2026 20:32:40 +1000 Subject: [PATCH 05/12] rfc(execution-lifecycle): simplify; trim doc by ~half Compress prose, drop per-design subsections in Rationale (table tells the story), tighten Drawbacks/Prior art/Unresolved/Future to bullets. All three diagrams kept; technical substance preserved. --- text/0003-execution-lifecycle.md | 839 ++++++++----------------------- 1 file changed, 204 insertions(+), 635 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index dafd7f6..edc0a65 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -5,129 +5,55 @@ # Summary [summary]: #summary -Consolidate the lifecycle of a single diagnostic execution +Consolidate the lifecycle of one diagnostic execution — allocation, dispatch, run, classify, publish, ingest, finalise — -into one deep module (`ExecutionLifecycle`) sitting behind a single `Transport` port, -and surface a declarative `ResourceHint` on each `Diagnostic` so providers -can express memory / CPU / wall-clock expectations in one place -without reaching into the executor or scheduler layer. +into one deep module (`ExecutionLifecycle`) behind one `Transport` port. +Add a declarative `ResourceHint` on every `Diagnostic` +and capture per-execution `Telemetry`, +so providers can express memory / CPU / wall-clock once +and future schedulers (SLURM, PBS, K8s) +plug in as ~100 LOC adapters. # Motivation [motivation]: #motivation -## Problem - -The lifecycle of one diagnostic execution today is fragmented across roughly -eight files in two packages. -A single execution flows through, in order: - -1. `solver.py` creates an `Execution` row with a `PLACEHOLDER_FRAGMENT`, - calls `assign_execution_fragment` to flush, compute a group-short string, - and rewrite the output fragment. -2. `solver.py` rebuilds the `ExecutionDefinition` via `attrs.evolve`, - calls `execution.register_datasets`, expunges the row, commits. -3. `executor.run(definition, execution)` is called outside the transaction. - Three implementations exist: - `SynchronousExecutor` runs in-process, - `LocalExecutor` submits to a `ProcessPoolExecutor`, - `CeleryExecutor` sends to a broker with linked result handlers. -4. `execute_locally` (in `climate-ref-core`) wraps `diagnostic.run`, - classifies exceptions via the private `_is_system_error`, - special-cases `CondaCommandError`, - and produces an `ExecutionResult` — whose `build_from_output_bundle` - factory writes three JSON files to disk despite being a frozen attrs class. -5. `process_result` and `handle_execution_result` copy log, metric bundle, - output bundle, series file, and every bundle-referenced plot/data/html - file from scratch to results, - then ingest scalar values, series values, and outputs in a nested - transaction. -6. `ExecutionGroup.dirty` is toggled in three different branches across - `handle_execution_result` and `mark_execution_failed`. - -Concrete problems this fragmentation causes today: - -- **Provider authors have no obvious place to express parallelisation hints.** - ESMValTool diagnostics that need 16 GB of memory or eight hours of - wall-clock have nowhere to declare it. - The `LocalExecutor` enforces a 6 hour per-task timeout; - the `CeleryExecutor` does not enforce a per-task timeout at all; - neither passes resource requirements through to anything that could honour - them. - When the project eventually adds SLURM or PBS adapters, - the natural way to wire memory / CPU / wall-clock into `sbatch` or `qsub` - directives does not exist. -- **Retry classification is split across the file tree.** - `_is_system_error` (in `climate-ref-core/executor.py`), - `CondaCommandError` handling (in the same file), - missing-log detection (in `result_handling.py`), - per-task timeout (in `LocalExecutor`), - and pool-shutdown abandonment (also in `LocalExecutor`) - each set `retryable` independently. - There is no single classifier function and no contract between - `ExecutionResult.build_from_failure` and the executor about which - exception types are retryable. -- **The `dirty` flag is decided in three places.** - Success path sets it `False`; - non-retryable failure sets it `False`; - retryable failure leaves it `True`; - missing log leaves it `True`. - These branches are spread across `handle_execution_result` and the - Local/Celery executors with no unifying decision function. -- **Tests have to mock subprocess, filesystem, DB, and Celery.** - `tests/unit/test_providers.py` patches `subprocess` directly for the - conda provider; `tests/unit/test_executor.py` imports the private - `_is_system_error`; integration tests are the only thing that exercises - the full happy path. -- **`ExecutionResult` is not picklable cleanly.** - Its frozen factory writes three JSON files; - re-construction in a test requires a real tmp_path filesystem. -- **The `Executor` Protocol is shallow.** - Its `run` / `join` interface forces every adapter - (Synchronous, Local, Celery, future SLURM/PBS/K8s) - to reinvent the same dance: - reattach the detached `Execution` row, - open a fresh transaction, - copy scratch to results, - load the CV, - ingest scalars and series, - apply the dirty rule, - mark the row. - -## Why now - -The CMIP REF is approaching the point where serious deployment targets are -plausible: -SLURM and PBS adapters, -distributed Celery deployments behind a managed broker, -possibly Kubernetes Jobs. -Each new transport currently means a new ~250-LOC adapter that re-implements -the lifecycle dance. -The same fragmentation also blocks anything resembling -**informed scheduling** — even simple things like "ESMValTool needs more memory -than ILAMB" or "the ENSO diagnostic should land on the `bigmem` queue" -have no representation in the codebase. - -This RFC is **not** about replacing SLURM, PBS, or Celery as schedulers. -It is about defining the seam between *climate-ref* and a scheduler so the -scheduler has enough information to do its job, -and so the lifecycle around the scheduler is one robust, well-tested module -instead of eight thin ones. - -## Expected outcomes - -1. The solver dispatch loop shrinks from ~50 lines of bookkeeping to ~3. -2. A new transport adapter (SLURM, PBS, K8s) is a single class - implementing `Transport`, around 80–120 LOC, - with no copy-paste of fragment allocation, ingestion, CV loading, - dirty-flag logic, or `Execution` reattachment. -3. Provider authors declare `resources: ResourceHint(...)` once on a - diagnostic class and the rest of the system honours it. -4. Per-execution telemetry (peak memory, duration, host) - is captured on every outcome and persisted on the `Execution` row, - making future adaptive scheduling - a feature addition rather than a schema change. -5. Unit tests run with zero filesystem and zero subprocess - via an `InMemoryTransport` and a separated bundle writer. +The lifecycle of one execution is currently fragmented across ~8 files +in 2 packages. +A single happy-path run touches +`solver.py` (allocates row + fragment, register_datasets, expunge, commit), +`climate_ref_core/executor.py` (`execute_locally`, `_is_system_error`, +`CondaCommandError` handling), +`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle` +which secretly writes three JSON files inside a frozen-attrs factory), +`climate_ref/executor/result_handling.py` (scratch→results copy ×4, +ingestion in nested-tx, dirty-flag toggled in 3 branches), +`climate_ref/executor/fragment.py` (PLACEHOLDER_FRAGMENT, group_short), +and one of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py` +(each reimplementing the same reattach / commit / mark dance). + +Concrete consequences today: + +- **Providers have no place to declare parallelisation hints.** + ESMValTool diagnostics that need 16 GB or 8 h have nowhere to say so; + `LocalExecutor` hardcodes a 6 h per-task timeout, + `CeleryExecutor` enforces no per-task timeout at all, + and a future SLURM/PBS adapter has nothing to translate into `sbatch`/`qsub`. +- **Retry classification is scattered** across `_is_system_error`, + the `CondaCommandError` branch, missing-log handling, `LocalExecutor`'s + per-task timeout, and pool-shutdown abandonment. +- **The `dirty` flag is decided in three places** + (success path, non-retryable failure, retryable failure / missing log). +- **CV validation is silenced** (`logger.warning` with TODO instead of raising). +- **Tests mock subprocess + filesystem + DB + Celery** and import private + helpers (`_is_system_error`) to compensate for the missing seam. +- **`ExecutionResult` is not cleanly picklable** because its frozen factory + performs disk I/O. + +The CMIP REF is approaching deployment targets (SLURM / PBS / K8s) where +each new adapter would otherwise mean ~250 LOC of duplicated lifecycle +wiring. +This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers. +It is about defining a single robust seam *above* them. # Reference-level explanation [reference-level-explanation]: #reference-level-explanation @@ -160,27 +86,12 @@ classDiagram +shutdown(timeout) None } - class InMemoryTransport { - tests + SynchronousExecutor - } - class ProcessPoolTransport { - replaces LocalExecutor - } - class CeleryTransport { - climate-ref-celery - } - class SlurmTransport { - future - sbatch --mem --cpus --time --partition - } - class PbsTransport { - future - qsub -l mem,ncpus,walltime,queue - } - class K8sTransport { - future - pod resources + activeDeadlineSeconds - } + class InMemoryTransport { tests + SynchronousExecutor } + class ProcessPoolTransport { replaces LocalExecutor } + class CeleryTransport { climate-ref-celery } + class SlurmTransport { future · sbatch --mem --cpus --time } + class PbsTransport { future · qsub -l mem,ncpus,walltime } + class K8sTransport { future · pod resources + deadline } ExecutionLifecycle ..> Transport : dispatch / poll InMemoryTransport ..|> Transport @@ -193,42 +104,33 @@ classDiagram ## Wire types -Three frozen, picklable value types cross the transport boundary -(process pool, broker, future network hop). -None of them carry DB sessions, `Config`, or controlled vocabulary. +Picklable value objects only. No DB sessions, `Config`, or CV cross the boundary. ```python @attrs.frozen class ResourceHint: - """Declarative resource expectation, lives on the diagnostic class. - - Per-execution overrides are produced by Diagnostic.resources_for(). - """ memory_mb: int = 4096 cpu: int = 1 wall_clock: timedelta = timedelta(hours=2) - queue: str | None = None # transport-specific routing tag + queue: str | None = None # transport-specific routing tag @attrs.frozen class ExecutionEnvelope: - """Wire format sent to a worker. Picklable.""" execution_id: int definition: ExecutionDefinition resources: ResourceHint - deadline: datetime # clock() + resources.wall_clock at submit time + deadline: datetime # clock() + resources.wall_clock at submit time @attrs.frozen class Telemetry: - """Captured by the worker, persisted on the Execution row by the coordinator.""" duration: timedelta peak_rss_mb: int | None host: str exit_code: int | None - transport_meta: Mapping[str, str] # slurm jobid / k8s pod / celery task_id … + transport_meta: Mapping[str, str] # slurm jobid / k8s pod / celery task_id @attrs.frozen class ExecutionOutcome: - """What poll() yields back to the lifecycle.""" execution_id: int result: ExecutionResult | None # None ⇒ transport-side abandonment failure: ExecutionFailure | None # timeout | broker_lost | pool_shutdown @@ -236,119 +138,54 @@ class ExecutionOutcome: ``` `ExecutionResult` becomes pure data. -The current `build_from_output_bundle` factory -(which writes three JSON files inside a frozen-attrs constructor) -is split into `ExecutionResult.from_bundle(definition, bundle)` — -which constructs the value object without I/O — -and a separate `_BundleWriter.write(definition, bundle)` -that is invoked worker-side only. +The current `build_from_output_bundle` factory is split into a pure +`ExecutionResult.from_bundle(definition, bundle)` and a worker-side +`_BundleWriter.write(definition, bundle)` that owns the JSON I/O. ## Diagnostic-side declaration ```python class Diagnostic(AbstractDiagnostic): - ... - resources: ResourceHint = ResourceHint() + resources: ResourceHint = ResourceHint() # project-wide default def resources_for(self, definition: ExecutionDefinition) -> ResourceHint: - """Optional per-execution sizing. - - Default returns ``self.resources`` unchanged. - Diagnostics whose memory scales with input size override this. - """ + """Optional per-execution sizing. Default returns self.resources.""" return self.resources -``` -Concrete provider declarations under the new system: -```python -# packages/climate-ref-esmvaltool/src/climate_ref_esmvaltool/diagnostics/base.py +# Examples class ESMValToolDiagnostic(CommandLineDiagnostic): - resources = ResourceHint( - memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6), - ) + resources = ResourceHint(memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6)) -# packages/climate-ref-pmp/src/climate_ref_pmp/diagnostics/enso.py class EnsoDiagnostic(Diagnostic): - resources = ResourceHint( - memory_mb=24000, cpu=8, wall_clock=timedelta(hours=8), - queue="bigmem", - ) + resources = ResourceHint(memory_mb=24000, cpu=8, + wall_clock=timedelta(hours=8), queue="bigmem") -# packages/climate-ref-ilamb/src/climate_ref_ilamb/standard.py class IlambDiagnostic(Diagnostic): - resources = ResourceHint( - memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2), - ) + resources = ResourceHint(memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2)) def resources_for(self, defn): n = len(defn.datasets.get_cmip6()) return attrs.evolve(self.resources, memory_mb=8000 + 500 * n) ``` -The base `Diagnostic` default of `(4 GB, 1 CPU, 2 h)` is the project-wide -sensible starting point. -Providers that need more either override the class attribute -or override `resources_for` for input-size scaling. - ## Solver dispatch — before and after -**Before** (`solver.py:709-757`, condensed): - ```python -execution = Execution( - execution_group=execution_group, - dataset_hash=definition.datasets.hash, - output_fragment=PLACEHOLDER_FRAGMENT, - provider_version=potential_execution.provider.version, -) -db.session.add(execution) -fragment = assign_execution_fragment( - db.session, execution, - provider_slug=provider_slug, - diagnostic_slug=potential_execution.diagnostic.slug, - selectors=potential_execution.selectors, - group_id=execution_group.id, -) -definition = attrs.evolve( - definition, - output_directory=config.paths.scratch.resolve() / pathlib.Path(fragment), -) -execution.register_datasets(db, definition.datasets) -if execute: - db.session.expunge(execution) - pending = (definition, execution) -# … commit happens at top of loop … -if pending is not None: - executor.run(definition=pending[0], execution=pending[1]) -… -executor.join(timeout=timeout) -``` - -**After**: +# Before: ~50 lines of bookkeeping in solver.py:709-757 +# PLACEHOLDER_FRAGMENT, assign_execution_fragment, attrs.evolve, +# register_datasets, expunge, commit, executor.run, … executor.join -```python +# After: lifecycle = ExecutionLifecycle(config, db, transport) for group, datasets, definition in planned_executions: - execution = Execution(execution_group=group, - dataset_hash=datasets.hash, - provider_version=definition.diagnostic.provider.version) + execution = Execution(execution_group=group, dataset_hash=datasets.hash, + provider_version=definition.diagnostic.provider.version) lifecycle.submit(execution, definition) lifecycle.drain(timeout=timeout) ``` -`submit` privately handles fragment allocation, -`register_datasets`, -the expunge/commit dance, -and `transport.dispatch`. -`drain` privately handles polling, -reattachment, -artifact promotion, -ingestion, -the dirty rule, -and telemetry persistence. - ## End-to-end sequence ```mermaid @@ -359,23 +196,19 @@ sequenceDiagram participant L as ExecutionLifecycle participant T as Transport participant W as Worker - participant DB as DB + participant DB S->>L: submit(execution, definition) L->>D: resources_for(definition) - D-->>L: ResourceHint(memory_mb, cpu, wall_clock, queue?) - L->>DB: allocate fragment + register_datasets - L->>DB: session.expunge + commit + D-->>L: ResourceHint(...) + L->>DB: allocate fragment + register_datasets + expunge + commit Note over L: deadline = clock() + resources.wall_clock L->>T: dispatch(ExecutionEnvelope) Note over T,W: sbatch · qsub · pool.submit · celery send · inline T->>W: hand off envelope activate W - W->>W: diagnostic.run(definition) - W->>W: _BundleWriter.write (output.json, diagnostic.json, series.json) - W->>W: CV.validate (hard-fail) - W->>W: capture Telemetry (duration, peak_rss_mb, host) + W->>W: diagnostic.run · _BundleWriter.write · CV.validate · Telemetry W-->>T: ExecutionOutcome deactivate W @@ -383,231 +216,92 @@ sequenceDiagram loop until no in-flight executions L->>T: poll(block, timeout) T-->>L: ExecutionOutcome - Note over L: _finalize(outcome) - L->>DB: session.merge(execution_id) - L->>DB: _Promoter.copy(scratch → results) Note over L: _Classifier → SUCCESS | RETRY | GIVE_UP - L->>DB: _Ingestor.upsert(outputs, scalars, series) - L->>DB: _DirtyRule.apply(decision) - L->>DB: execution.telemetry = outcome.telemetry - L->>DB: mark_successful / mark_failed + L->>DB: merge · promote artifacts · upsert outputs/scalars/series + L->>DB: _DirtyRule.apply · save telemetry · mark_* end ``` -## Retry classification - -A single `RetryPolicy.classify(outcome) -> RetryDecision` replaces: -`_is_system_error`, -the `CondaCommandError` branch in `execute_locally`, -the missing-log branch in `handle_execution_result`, -the per-task timeout handling in `LocalExecutor.join`, -the pool-shutdown abandonment in `LocalExecutor._fail_outstanding`. +## Retry + dirty rule (one source of truth) ```python class RetryDecision(enum.Enum): SUCCESS = "success" - RETRY = "retry" # leaves dirty=True - GIVE_UP = "give_up" # sets dirty=False, marks failed + RETRY = "retry" # leaves dirty=True + GIVE_UP = "give_up" # sets dirty=False, marks failed class DefaultRetryPolicy: SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) NON_RETRYABLE = (CondaCommandError,) def classify(self, outcome: ExecutionOutcome) -> RetryDecision: - if outcome.failure is not None: - # transport-side: timeout, broker_lost, pool_shutdown → retry - return RetryDecision.RETRY - result = outcome.result - if result is None: + if outcome.failure is not None: # timeout | broker_lost | pool_shutdown return RetryDecision.RETRY - if result.successful: - return RetryDecision.SUCCESS - return (RetryDecision.RETRY if result.retryable - else RetryDecision.GIVE_UP) -``` - -The dirty rule is derived from the decision in one place: - -```python -class DefaultDirtyRule: - def apply(self, execution: Execution, decision: RetryDecision) -> None: - if decision in (RetryDecision.SUCCESS, RetryDecision.GIVE_UP): - execution.execution_group.dirty = False - # RETRY: leave dirty=True so the next solve picks it up -``` - -## Idempotent re-ingest - -Re-running `_finalize` on the same `execution_id` -(e.g. a join-time crash followed by `replay_abandoned`) -must not double-insert. -The ingester moves to upserts keyed on natural identifiers: - -- `ExecutionOutput`: `(execution_id, output_type, short_name)` -- `ScalarMetricValue`: `(execution_id, dimensions_hash)` -- `SeriesMetricValue`: `(execution_id, dimensions_hash, index_name)` - -Where SQLite is in use the implementation uses -`INSERT … ON CONFLICT DO NOTHING`; -on PostgreSQL the same construct works natively. -The scratch-to-results copy uses `shutil.copy` with `exist_ok=True` -on the destination directory. - -## Telemetry - -Three new columns are added to `Execution`: - -```sql -ALTER TABLE execution ADD COLUMN duration_seconds FLOAT NULL; -ALTER TABLE execution ADD COLUMN peak_rss_mb INTEGER NULL; -ALTER TABLE execution ADD COLUMN telemetry_meta JSON NULL; + r = outcome.result + if r is None: return RetryDecision.RETRY + if r.successful: return RetryDecision.SUCCESS + return RetryDecision.RETRY if r.retryable else RetryDecision.GIVE_UP ``` -`telemetry_meta` is a small JSON blob with hostname, exit code, -and transport-specific identifiers (Celery task id, SLURM job id, …). -None of these are read by the system at solve time; -they exist so that a future adaptive resource provider -can ask "what was the rolling p95 memory for diagnostic X -on dataset shape Y?" without a schema migration. +Replaces `_is_system_error`, the `CondaCommandError` branch, +missing-log handling, the per-task timeout path, +and pool-shutdown abandonment — five sites collapse to one. -The shape of that future provider is **not** in this RFC. -A natural seam is: +## Idempotent ingest, telemetry -```python -class ResourceProvider(Protocol): - def hint_for(self, definition: ExecutionDefinition) -> ResourceHint: ... -``` - -with a `StaticResourceProvider` (uses `diagnostic.resources_for`) -and a future `AdaptiveResourceProvider` (reads telemetry, applies multipliers). -`ExecutionLifecycle` would gain a `resources:` constructor kwarg defaulting -to `StaticResourceProvider`. -This is intentionally deferred — the column existing today -is the only commitment. - -## Module / file layout - -``` -packages/climate-ref-core/src/climate_ref_core/ -├── lifecycle/ -│ ├── __init__.py ← public re-exports -│ ├── ports.py ← Transport, ResourceHint, Envelope, Outcome, Telemetry -│ ├── results.py ← ExecutionResult (pure data) + from_bundle factory -│ └── worker.py ← execute_worker(envelope) — picklable entry point -│ -packages/climate-ref/src/climate_ref/ -├── lifecycle.py ← ExecutionLifecycle class -├── transports/ -│ ├── inmemory.py ← InMemoryTransport -│ └── processpool.py ← ProcessPoolTransport (was LocalExecutor) -└── executor/ ← deleted; the Executor Protocol goes away -│ -packages/climate-ref-celery/src/climate_ref_celery/ -└── transport.py ← CeleryTransport (was CeleryExecutor) -``` +Ingestion uses `INSERT … ON CONFLICT DO NOTHING` on natural keys +(`(execution_id, output_type, short_name)` for outputs, +`(execution_id, dimensions_hash[, index_name])` for metric/series values), +so `replay_abandoned()` is safe. +Scratch-to-results copy uses `exist_ok=True`. -The existing `Executor` Protocol in `climate-ref-core/executor.py` -and the three concrete executors are deleted. -`execute_locally` becomes a thin wrapper around `execute_worker` -kept for backwards compatibility for one release cycle, then removed. +New `Execution` columns: `duration_seconds`, `peak_rss_mb`, `telemetry_meta JSON`. +No solver code reads these today; they exist so a future adaptive +`ResourceProvider` is a feature addition rather than a schema migration. ## Test impact -**Deleted tests** (boundary tests will replace them): - -- `tests/unit/test_executor.py::test_is_system_error_*` - — private-import tests on `_is_system_error` -- `tests/unit/test_providers.py::test_conda_*` subprocess patch chains - for the conda provider's `run()` are replaced by transport-level boundary tests -- `tests/unit/executor/test_synchronous.py::*` reattachment tests -- `tests/unit/executor/test_local.py::*` future-handling tests - beyond what `ProcessPoolTransport`'s own tests cover -- Per-executor `mark_execution_failed` mock-chain tests - -**New boundary tests** (against `ExecutionLifecycle` with `InMemoryTransport`): - -- `submit` allocates a unique fragment per execution -- `submit` persists `register_datasets` even when the transport dispatch fails -- `drain` ingests one successful outcome end-to-end and sets `dirty=False` -- `drain` re-dispatches a retryable failure on the next solve - by leaving `dirty=True` -- `drain` marks a non-retryable failure with `dirty=False` and `successful=False` -- `drain` is idempotent: re-draining the same outcome does not double-insert - scalar or series rows -- `drain` enforces `resources.wall_clock` uniformly — a worker that exceeds - the deadline produces a retryable failure regardless of transport -- CV validation fails hard with a `ResultValidationError` when metrics - reference unknown dimensions (replaces today's silent `logger.warning`) -- `replay_abandoned` returns the IDs of executions whose rows exist - but never received an outcome +Delete: `_is_system_error` private-import tests, +subprocess patch chains in `test_providers.py`, +per-executor reattach tests, +`mark_execution_failed` mock chains. + +Add boundary tests against `ExecutionLifecycle` + `InMemoryTransport`: +unique fragment per submit; +end-to-end success → `dirty=False`; +retryable failure leaves `dirty=True`; +non-retryable failure → `dirty=False`, `successful=False`; +re-drain idempotent (no double-insert); +`wall_clock` enforced uniformly across transports; +CV mismatch raises `ResultValidationError`; +`replay_abandoned` returns stranded IDs. # Drawbacks [drawbacks]: #drawbacks -## Migration cost - -This is not a small refactor. -The dispatch loop, three executor classes, result handling, -fragment allocation, and ingestion all move at once. -A staged migration is possible — `InMemoryTransport` and `ProcessPoolTransport` -can ship before `CeleryTransport` — but the lifecycle module itself has to -land in one piece because every transport depends on it. - -## CV validation becomes a hard fail - -Today, when scalar or series metric values do not conform to the controlled -vocabulary, `result_handling.py` logs a warning with a `TODO`. -This RFC promotes that to a hard `ResultValidationError`. -Any in-flight diagnostic with a CV mismatch will start failing. -This is the *intended* behaviour — silent CV drift is a real source of data -quality issues — but it is a behaviour change that needs deliberate handling -during rollout (e.g. a one-release-cycle deprecation period where the -violation is logged with `ERROR` instead of `WARNING` but does not yet raise). - -## A single fat class owns more responsibility - -`ExecutionLifecycle` ends up around 400–500 LOC. -That is intentional: the things it owns are already coupled across eight -files today, and module *depth* is the design goal. -But it does mean the file is large; reviewers should expect that. - -## The Celery boundary still has one quirk - -In the current code, Celery's `link=handle_result.s(...)` chains a result -handler that runs on the broker side of the boundary. -In the new design, `CeleryTransport.poll` is what feeds outcomes back into -`drain`. -This is cleaner, but it means a coordinator-side crash mid-drain can leave -results "in flight" in the broker that the coordinator has not yet observed. -The mitigation is `replay_abandoned()`, which on next startup queries -unfinished executions and re-polls the broker for them. -This is a robustness improvement over today -(where a coordinator crash silently strands rows in `successful=NULL`), -but it is not free — -it requires `CeleryTransport` to persist task IDs alongside execution IDs -in a small table or in `telemetry_meta` so the broker can be queried later. - -## Resource hints can be wrong - -If a provider declares 4 GB for a diagnostic that actually needs 16 GB, -the SLURM adapter will OOM-kill the job. -Today's `LocalExecutor` runs in the parent address space and "just works" -up to the host's memory. -With `ResourceHint` in place, getting the number wrong is a new failure mode. -Mitigations: - -- The default `ResourceHint(memory_mb=4096, cpu=1, wall_clock=2h)` - is generous enough that most diagnostics run unchanged. -- `ProcessPoolTransport` deliberately ignores everything except - `wall_clock` — running on a developer laptop does not get harder. -- Telemetry capture means the first failed run produces the data needed - to adjust the hint upward. +- **Migration is wide.** The lifecycle lands in one piece because every + transport depends on it. Transports can ship staggered after that. +- **CV becomes hard-fail.** Today's silent `logger.warning` becomes a + raise. Intentional, but needs a one-cycle deprecation window where + the violation is `ERROR` but not raised. +- **One fat class (~400–500 LOC).** Intentional depth, but reviewers + should expect a large file. +- **Celery has no broker-side result handler anymore.** + `CeleryTransport.poll` feeds outcomes back into `drain`. + A coordinator crash mid-drain is recovered by `replay_abandoned`, + which requires `CeleryTransport` to persist task IDs alongside + execution IDs. +- **Resource hints can be wrong.** SLURM will OOM-kill a job whose + declared memory is too low. Mitigated by a generous default + (4 GB / 1 CPU / 2 h), by `ProcessPoolTransport` ignoring everything + except `wall_clock`, and by telemetry capture making the first + failed run actionable. # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives -Three designs were considered in detail. -Sketches and trade-offs follow; the chosen design is a deliberate hybrid. +Three designs were considered. +The chosen interface is a deliberate hybrid. ```mermaid quadrantChart @@ -624,9 +318,7 @@ quadrantChart "Hybrid (chosen)": [0.42, 0.7] ``` -Comparison at a glance (relative scale, 0–5): - -| Dimension | A — Minimal | B — Maximal | C — Common-case | Hybrid (chosen) | +| Dimension | A — Minimal | B — Maximal | C — Common-case | **Hybrid** | |------------------------|:-:|:-:|:-:|:-:| | Public surface | 1 | 5 | 2 | 2 | | Defaults baked in | 3 | 1 | 5 | 4 | @@ -635,218 +327,95 @@ Comparison at a glance (relative scale, 0–5): | Resource-hint support | 0 | 5 | 0 | 5 | | Speculation tax | 0 | 3 | 0 | 1 | -## A — Minimal interface (rejected as a pure form) - -Two public methods (`submit`, `drain`), one Protocol (`Transport`). -Everything else collapses behind the facade and cannot be customised -without editing the module. - -Strength: minimal cognitive load. Weakness: no place to declare -resource hints, no place for per-provider retry policies; both end up -requiring constructor kwarg additions on first concrete demand. - -## B — Maximally flexible (rejected as a pure form) - -Five ports (`Transport`, `ArtifactStore`, `RetryPolicy`, `IngestSink`, -`FragmentAllocator`), seven lifecycle hooks, plugin discovery via entry -points, schema-versioned wire types. - -Strength: every anticipated future requirement has a seam already. -Weakness: heavy speculation tax — five ports without five concrete -implementations driving them is exactly the shape that calcifies into -"the way things have always been done". - -## C — Common-case optimised (rejected as a pure form) - -One class with `dispatch`, `drain`, plus `replay_abandoned`, `dry_run`, -`ingest`, `with_transport`. Sensible defaults sourced from `Config`. - -Strength: the solver call site collapses to one line per iteration. -Weakness: a `ResultSink` callback that Celery silently ignores is an -asymmetry that will trip someone, and the design has no place for -resource hints. - -## The chosen hybrid - -- **C's façade** — `ExecutionLifecycle` is one class with `submit` and - `drain` as the hot path and a small set of cold-path entry points. -- **A's transport contract** — `Transport.dispatch(envelope)` plus - `poll()` returning `Iterator[ExecutionOutcome]`. No `ResultSink` - callback. The lifecycle owns the result loop. -- **A's bundle-writer separation** — `ExecutionResult` is pure data; - bundle JSON write moves into a worker-side `_BundleWriter`. -- **B's `ExecutionEnvelope` and `Telemetry` wire types** — earned by - the actual provider-side pain (no place for resource hints) and by the - need to capture per-execution actuals for future adaptive sizing. -- **Defer the rest of B** — no `ArtifactStore` port (S3 mirror is - hypothetical), no `IngestSink` priority chain (DB-only today), - no `LifecycleHooks` (today's `logger.*` calls suffice), - no `FragmentAllocator` port (one implementation foreseeable), - no entry-point plugin registry. -- **Delete the `Executor` Protocol.** It is a shallow interface - dressed up as flexibility; `SynchronousExecutor` becomes - `ExecutionLifecycle(transport=InMemoryTransport())`. - -## Impact of not doing this - -Each future transport (SLURM, PBS, K8s) re-implements ~250 LOC of -lifecycle wiring per adapter. -Resource hints have to be retrofitted later through a new wire format, -which is a strictly larger change than introducing them at the same time -as the lifecycle consolidation. -Per-task timeout, CV validation, dirty-flag rules, and retry -classification stay scattered, and integration tests remain the only -honest signal that the system works end-to-end. +- **A — Minimal**: 2 methods, 1 port, everything else hidden. + No place for resource hints or per-provider retry without later kwarg growth. +- **B — Maximal**: 5 ports (Transport, ArtifactStore, RetryPolicy, + IngestSink, FragmentAllocator) + 7 hooks + entry-point plugin registry. + Earned the wire-type split and `ResourceHint`; everything else is + speculation. +- **C — Common-case**: one class, defaults sourced from `Config`, + solver call site collapses to one line. + A `ResultSink` callback that Celery silently ignores is an asymmetry + that will trip someone, and there is still no place for resource hints. + +**Chosen hybrid**: C's façade (one class, hot/cold method split) + +A's transport contract (`dispatch(envelope)` + `poll() -> Iterator[Outcome]`, +no result callback) + A's `BundleWriter` separation + +B's `ExecutionEnvelope`/`Telemetry` wire types. +Deferred: `ArtifactStore`, `IngestSink`, `LifecycleHooks`, +`FragmentAllocator` as a port, plugin registry. +The shallow `Executor` Protocol and the three concrete executors +are deleted. + +**Impact of not doing this**: each new transport reimplements ~250 LOC +of lifecycle wiring; resource hints retrofit later through a new wire +format (strictly larger change); per-task timeout, CV validation, +dirty-flag, and retry classification stay scattered. # Prior art [prior-art]: #prior-art -## Dask's `distributed` - -Dask's scheduler accepts `resources=` annotations on submitted tasks -(e.g. `client.submit(func, resources={"GPU": 1})`). -The pattern of declaring resource needs on the unit-of-work and letting -the scheduler honour them is the direct inspiration for `ResourceHint`. -Dask does not concern itself with what is "above" the scheduler -(catalog, ingest, controlled vocabulary), -which is the layer this RFC is consolidating. - -## Snakemake / Nextflow - -Both pipeline frameworks have first-class resource directives -(`resources: mem_mb=…, runtime=…`) on rules / processes, -which translate transparently into SLURM / PBS / K8s directives. -This RFC reuses that mental model at the diagnostic level. - -## Airflow operators - -Airflow's executor/operator split is the closest analogue: -operators carry the work definition, executors decide how to run it. -Airflow learned the hard way that an executor that owns result -handling is too narrow an interface; their `BaseExecutor.execute_async` -plus `sync` (poll) split is essentially what `Transport.dispatch` / -`Transport.poll` is here. - -## Celery's `task_time_limit` and queue routing - -Celery already supports per-queue routing and per-task time limits. -This RFC's `ResourceHint.queue` maps directly onto Celery's queue -routing; `wall_clock` maps onto `task_time_limit` plus -`task_soft_time_limit`. -The `CeleryExecutor` in `climate-ref-celery` does not currently use -either; the new `CeleryTransport` does. - -## The Rust RFC process itself - -The shape of this document (Summary, Motivation, Reference-level, -Drawbacks, Rationale, Prior art, Unresolved, Future) is inherited from -[rust-lang/rfcs](https://github.com/rust-lang/rfcs) -via this repository's template. +- **Dask `distributed`** — `resources=` annotations on submitted tasks + inspire `ResourceHint`. +- **Snakemake / Nextflow** — first-class `resources:` directives + translate transparently into SLURM / PBS / K8s. Same mental model at + the diagnostic level. +- **Airflow** — executor / operator split; `BaseExecutor.execute_async` + + `sync` is essentially `Transport.dispatch` + `Transport.poll`. +- **Celery** — `task_time_limit` + queue routing. `ResourceHint.queue` + maps onto Celery queues; `wall_clock` onto `task_time_limit` / + `task_soft_time_limit`. Today's `CeleryExecutor` uses neither. +- **Rust RFC process** — document shape inherited via this repo's + template. # Unresolved questions [unresolved-questions]: #unresolved-questions -To resolve through the RFC discussion: - -- **Should `ResourceHint` include I/O class (e.g. `io_intensive: bool`) - or GPU (`gpu: int`)?** - Neither has a concrete consumer today. - The recommended default is to add fields when a concrete adapter - needs them rather than speculating now. -- **Where does the controlled vocabulary live in test contexts?** - Today `CV.load_from_file(config.paths.dimensions_cv)` is invoked - per-result inside `handle_execution_result`. - The RFC moves it to lifecycle construction, which means tests need - a deterministic CV. - A `PermissiveCV()` test fixture is one option; - baking the project's CV into a constant for test use is another. -- **Should `replay_abandoned` be called automatically on - `ExecutionLifecycle.__init__`?** - Doing so makes recovery transparent; - not doing so makes startup behaviour explicit. - The current draft assumes it is explicit and called from the CLI. +To resolve through this RFC: + +- Should `ResourceHint` include `gpu: int` / `io_intensive: bool`? + Recommended default: add when a concrete adapter needs them. +- Where does the CV come from in tests? `PermissiveCV()` fixture + vs. project CV baked into a constant. +- Is `replay_abandoned` automatic on `__init__` or explicit from the CLI? + Draft assumes explicit. To resolve through implementation: -- **Exact unique-constraint shape for upserts** — - `(execution_id, output_type, short_name)` for `ExecutionOutput` - is the obvious key, but the existing schema does not enforce it. - An Alembic migration is needed. -- **`CeleryTransport.poll` semantics** — - whether to use `AsyncResult.get(timeout=…)` per task - or `app.control.inspect` for batch status. - Implementation choice; does not affect the public interface. -- **`Telemetry.peak_rss_mb` capture on macOS** — - `resource.getrusage` returns bytes on macOS and KB on Linux. - A small platform shim is required worker-side. +- Exact upsert unique constraints + Alembic migration. +- `CeleryTransport.poll` semantics (per-task `AsyncResult.get` vs batch + inspection). +- `peak_rss_mb` capture across macOS / Linux (`getrusage` unit difference). Out of scope: -- **Adaptive resource provider** (reading rolling telemetry to suggest - hints). Telemetry columns land here; the provider is a follow-up RFC. -- **GPU scheduling**. - None of the current diagnostics use GPUs; - when one does, `ResourceHint` gains a `gpu: int` field. -- **Multi-tenant queues**. - `ResourceHint.queue` is enough for single-deployment routing; - multi-tenant work (per-org queues, fair-share) is out of scope. -- **Replacing SLURM / PBS / Celery as schedulers**. - Explicitly *not* in scope: the lifecycle owns the seam *above* the - scheduler, not the scheduler itself. +- Adaptive resource provider (telemetry columns land here; the + provider is a follow-up RFC). +- GPU scheduling, multi-tenant queues. +- Replacing SLURM / PBS / Celery as schedulers. # Future possibilities [future-possibilities]: #future-possibilities -The deepening proposed here unlocks several follow-ups, each a -self-contained piece of work: - -- **SLURM and PBS transports.** - Each is a ~100 LOC adapter implementing `Transport`. - `dispatch` builds a job script from `envelope.resources`, - shells out to `sbatch` / `qsub`, records the job ID in `telemetry_meta`; - `poll` queries `squeue` / `qstat` and surfaces completed jobs back as - `ExecutionOutcome`s. - No changes to the lifecycle or to provider code are required. - -- **Adaptive `ResourceProvider`.** - Reads `Execution.peak_rss_mb` over a rolling window per - `(diagnostic_slug, dataset_shape_signature)`, - suggests a memory hint at the rolling p95 × 1.2, - applies a per-CLI flag to opt in. - Schema is already in place from this RFC; the provider is the only - new code. - -- **Per-provider retry policies.** - Currently `RetryPolicy` is a single injected object. - ESMValTool's memory-exhaustion failures are retryable with a larger - memory hint; a `Mapping[str, RetryPolicy]` keyed on provider slug - becomes the natural extension when concrete demand arrives. - -- **Streaming partial results.** - `Transport.poll` already yields outcomes incrementally; - a `PartialOutcome` event could be added to the iterator for - progress reporting if a UI consumer appears. - -- **S3 / HTTP artifact store.** - If results need to land somewhere other than the local filesystem, - an `ArtifactStore` port can be carved out of `_Promoter` later. - The wire format does not change (scratch path remains relative); - only the lifecycle's promote step is rewritten. - -- **Pluggable ingest sinks.** - Prometheus metrics, audit logs, S3 mirrors — each is a side-effect - observer over `ExecutionOutcome`. - A small `IngestSink` interface with ordered observers is a small - follow-up if and when a second sink (beyond the DB) materialises. - -- **Recovery on coordinator restart.** - `replay_abandoned()` is in this RFC. - Extending it to interrogate transport-side state (SLURM job that - finished while the coordinator was down) is a natural follow-up that - only needs `Transport.lookup(transport_meta) -> ExecutionOutcome | None` - on each adapter. - -None of the above are reasons to accept this RFC on their own; -they are listed to show that the seam introduced here is the right shape -for the directions the project is plausibly heading, -without baking any of them in prematurely. +Each item below is a self-contained follow-up enabled by this RFC: + +- **SLURM / PBS transports** — ~100 LOC adapters; `dispatch` builds a + job script from `envelope.resources`, `poll` queries `squeue` / `qstat`. +- **Adaptive `ResourceProvider`** — reads `Execution.peak_rss_mb` over + a rolling window, suggests memory hint at p95 × 1.2. +- **Per-provider retry policies** — single `RetryPolicy` becomes + `Mapping[str, RetryPolicy]` keyed on provider slug when concrete + demand arrives. +- **Streaming partial outcomes** — `Transport.poll` already incremental; + add a `PartialOutcome` event when a UI consumer appears. +- **S3 / HTTP artifact store** — extract `_Promoter` into an + `ArtifactStore` port when a non-local target lands. +- **Pluggable ingest sinks** — Prometheus, audit log, S3 mirror as + ordered `IngestSink` observers when a second sink materialises. +- **Cross-restart recovery** — extend `replay_abandoned` with + `Transport.lookup(transport_meta)` so a SLURM job finished while the + coordinator was down is picked up. + +None of these are reasons to accept this RFC on their own; +they show the seam is the right shape for the directions the project is +plausibly heading, without pre-baking any of them. From 034a91af23eec2620d71606400a67e3d432878ec Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Thu, 4 Jun 2026 12:09:42 +1000 Subject: [PATCH 06/12] rfc(execution-lifecycle): address adversarial review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Correct factual and overstated claims, and surface design risks the draft glossed: - Drop the false "not picklable" claim — ExecutionResult already pickles across the ProcessPoolExecutor today; reframe as construction entangled with disk I/O (a testability cost, not a correctness one). - Split retry classification into worker-side (exception -> outcome, consolidating _is_system_error + CondaCommandError) and coordinator-side (outcome -> decision); drop the misleading "five sites collapse to one" and the unused policy constants. - Default wall_clock 2h -> 6h to preserve current LocalExecutor budget; fix the mitigation text. Avoids a silent kill-at-2h regression. - Move deadline computation to job start (transport-side); anchoring at submit time would expire queued SLURM/PBS jobs before they run. - Expand the Celery push->pull trade-off (loss of fire-and-forget ingestion) from one bullet to an explicit, weighted drawback. - Add a staged migration plan and a deprecation cycle for the public Executor protocol / import_executor_cls config key. - Note ResourceHint must live in climate_ref_core (layering). - Clarify ON CONFLICT DO NOTHING is safe because each retry mints a new execution_id. - Remove dry_run from the lifecycle surface (planning, not dispatch). - Soften unverifiable LOC estimates; scope dirty-flag rule to the result path (CLI resets stay separate). --- text/0003-execution-lifecycle.md | 173 +++++++++++++++++++++++-------- 1 file changed, 132 insertions(+), 41 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index edc0a65..0c40a56 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -12,7 +12,8 @@ Add a declarative `ResourceHint` on every `Diagnostic` and capture per-execution `Telemetry`, so providers can express memory / CPU / wall-clock once and future schedulers (SLURM, PBS, K8s) -plug in as ~100 LOC adapters. +plug in as thin adapters that only translate the envelope +and poll job state. # Motivation [motivation]: #motivation @@ -23,10 +24,11 @@ A single happy-path run touches `solver.py` (allocates row + fragment, register_datasets, expunge, commit), `climate_ref_core/executor.py` (`execute_locally`, `_is_system_error`, `CondaCommandError` handling), -`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle` -which secretly writes three JSON files inside a frozen-attrs factory), +`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle`, +a static factory that writes three JSON files to disk as a side effect of +constructing the result), `climate_ref/executor/result_handling.py` (scratch→results copy ×4, -ingestion in nested-tx, dirty-flag toggled in 3 branches), +ingestion in nested-tx, dirty-flag toggled in 3 branches of the result path), `climate_ref/executor/fragment.py` (PLACEHOLDER_FRAGMENT, group_short), and one of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py` (each reimplementing the same reattach / commit / mark dance). @@ -35,23 +37,34 @@ Concrete consequences today: - **Providers have no place to declare parallelisation hints.** ESMValTool diagnostics that need 16 GB or 8 h have nowhere to say so; - `LocalExecutor` hardcodes a 6 h per-task timeout, + `LocalExecutor` applies one 6 h per-task timeout to every diagnostic + (a single constructor default, not per-diagnostic), `CeleryExecutor` enforces no per-task timeout at all, and a future SLURM/PBS adapter has nothing to translate into `sbatch`/`qsub`. - **Retry classification is scattered** across `_is_system_error`, the `CondaCommandError` branch, missing-log handling, `LocalExecutor`'s per-task timeout, and pool-shutdown abandonment. -- **The `dirty` flag is decided in three places** +- **The `dirty` flag is decided in three branches of the result path** (success path, non-retryable failure, retryable failure / missing log). + User-initiated `dirty=True` resets in `cli/executions.py` (rerun / reset) + are deliberately separate and stay outside the consolidated rule. - **CV validation is silenced** (`logger.warning` with TODO instead of raising). - **Tests mock subprocess + filesystem + DB + Celery** and import private helpers (`_is_system_error`) to compensate for the missing seam. -- **`ExecutionResult` is not cleanly picklable** because its frozen factory - performs disk I/O. - -The CMIP REF is approaching deployment targets (SLURM / PBS / K8s) where -each new adapter would otherwise mean ~250 LOC of duplicated lifecycle -wiring. +- **`ExecutionResult` construction is entangled with disk I/O.** + It already pickles across the `ProcessPoolExecutor` boundary today + (`_process_run` returns it to the parent), so picklability is not the + problem — the problem is that the factory cannot be exercised without a + writable output directory, which forces filesystem fixtures into every + unit test that builds a result. + +The CMIP REF is approaching deployment targets (SLURM / PBS / K8s). +The recurring lifecycle logic — fragment allocation, the reattach / commit / +mark dance, scratch→results promotion, retry classification — is already +shared in `result_handling.py`; what each new transport reimplements is the +dispatch / poll / timeout wiring around it. +This RFC pulls that shared logic behind one seam so a transport contributes +only its dispatch and poll, not a copy of the lifecycle. This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers. It is about defining a single robust seam *above* them. @@ -69,7 +82,6 @@ classDiagram +submit(execution, definition) +drain(timeout) None +replay_abandoned() list~int~ - +dry_run(group, datasets) ExecutionDefinition -_FragmentAllocator -_Classifier -_Promoter @@ -106,12 +118,23 @@ classDiagram Picklable value objects only. No DB sessions, `Config`, or CV cross the boundary. +`ResourceHint` lives in `climate_ref_core` (the `Diagnostic` base class +declares it, and core cannot import the application package). +`ExecutionEnvelope`, `Telemetry`, `ExecutionOutcome`, and the `Transport` +protocol live in `climate_ref.lifecycle` alongside `ExecutionLifecycle`. + +The default `wall_clock` is **6 h**, matching today's `LocalExecutor` +per-task timeout, so diagnostics that run for hours without declaring +`resources` keep their current budget (Celery enforces no limit today, so +nothing regresses there either). Tightening the default is a separate, +explicit decision. + ```python @attrs.frozen class ResourceHint: memory_mb: int = 4096 cpu: int = 1 - wall_clock: timedelta = timedelta(hours=2) + wall_clock: timedelta = timedelta(hours=6) # matches current LocalExecutor budget queue: str | None = None # transport-specific routing tag @attrs.frozen @@ -119,7 +142,10 @@ class ExecutionEnvelope: execution_id: int definition: ExecutionDefinition resources: ResourceHint - deadline: datetime # clock() + resources.wall_clock at submit time + # wall_clock travels in `resources`; the *deadline* is computed by the + # transport when the job starts running, not here — a queued SLURM/PBS + # job may wait hours before it begins, so anchoring the deadline at + # submit time would expire jobs before they start. @attrs.frozen class Telemetry: @@ -202,10 +228,10 @@ sequenceDiagram L->>D: resources_for(definition) D-->>L: ResourceHint(...) L->>DB: allocate fragment + register_datasets + expunge + commit - Note over L: deadline = clock() + resources.wall_clock L->>T: dispatch(ExecutionEnvelope) Note over T,W: sbatch · qsub · pool.submit · celery send · inline + Note over T: deadline = job_start + resources.wall_clock (transport-side) T->>W: hand off envelope activate W W->>W: diagnostic.run · _BundleWriter.write · CV.validate · Telemetry @@ -224,6 +250,28 @@ sequenceDiagram ## Retry + dirty rule (one source of truth) +Classification happens in two clearly separated places. + +**Worker side** — exception → outcome. The worker runs the diagnostic and +maps the raised exception (or clean return) onto the booleans carried by +`ExecutionResult` / `ExecutionFailure`. This is where the exception-type +knowledge lives, because the exception is only ever raised on the worker: + +```python +# worker-side, inside the run path +SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) # → retryable +NON_RETRYABLE = (CondaCommandError,) # → give up +``` + +This consolidates the exception-classification logic that is **today** +spread across `_is_system_error` and the separate `CondaCommandError` +branch in `execute_locally` into one worker-side function. + +**Coordinator side** — outcome → decision. The policy never inspects +exception types; it maps the already-classified outcome onto a decision, +so the same rule applies identically to every transport (including remote +ones where the exception object never comes back): + ```python class RetryDecision(enum.Enum): SUCCESS = "success" @@ -231,9 +279,6 @@ class RetryDecision(enum.Enum): GIVE_UP = "give_up" # sets dirty=False, marks failed class DefaultRetryPolicy: - SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) - NON_RETRYABLE = (CondaCommandError,) - def classify(self, outcome: ExecutionOutcome) -> RetryDecision: if outcome.failure is not None: # timeout | broker_lost | pool_shutdown return RetryDecision.RETRY @@ -243,9 +288,10 @@ class DefaultRetryPolicy: return RetryDecision.RETRY if r.retryable else RetryDecision.GIVE_UP ``` -Replaces `_is_system_error`, the `CondaCommandError` branch, -missing-log handling, the per-task timeout path, -and pool-shutdown abandonment — five sites collapse to one. +Net effect: exception classification goes from two scattered sites to one +worker-side function, and the *transport-level* outcomes that today live in +the per-executor `join` loops (missing log, per-task timeout, pool-shutdown +abandonment) collapse into the single coordinator policy above. ## Idempotent ingest, telemetry @@ -255,6 +301,13 @@ Ingestion uses `INSERT … ON CONFLICT DO NOTHING` on natural keys so `replay_abandoned()` is safe. Scratch-to-results copy uses `exist_ok=True`. +`DO NOTHING` (rather than `DO UPDATE`) is correct because every key is +scoped to `execution_id`, and each solve mints a **new** `Execution` row per +attempt: a retry produces a fresh `execution_id`, so its values never +collide with the abandoned attempt's. The conflict clause therefore only +guards re-ingestion of the *same* `execution_id` during replay — it never +silently keeps stale values from a previous attempt. + New `Execution` columns: `duration_seconds`, `peak_rss_mb`, `telemetry_meta JSON`. No solver code reads these today; they exist so a future adaptive `ResourceProvider` is a feature addition rather than a schema migration. @@ -279,22 +332,47 @@ CV mismatch raises `ResultValidationError`; # Drawbacks [drawbacks]: #drawbacks -- **Migration is wide.** The lifecycle lands in one piece because every - transport depends on it. Transports can ship staggered after that. +- **Celery loses fire-and-forget ingestion — the biggest trade-off.** + Today `CeleryExecutor` attaches `link` / `link_error` callbacks + (`handle_result` / `handle_failure`) so a worker ingests its own result + with no live coordinator; the submitting process can exit immediately. + The pull model (`Transport.poll` feeding `drain`) couples ingestion to a + coordinator that stays alive for the whole batch. This is a real + regression for the distributed case and is accepted deliberately: it buys + one ingestion path and uniform retry/dirty handling across transports, + and `replay_abandoned` (backed by `CeleryTransport` persisting task IDs + alongside execution IDs) recovers a coordinator crash mid-drain. If + detached submission turns out to be a hard requirement, a worker-side + `IngestSink` callback can be added later without changing the seam — but + the draft does **not** preserve it, and reviewers should weigh that. +- **Migration is wide and the seam swap is atomic.** Staggering applies to + *adding* transports later, not to the cutover: `climate-ref-core` (wire + types, `ResourceHint` on `Diagnostic`), `climate-ref` + (`ExecutionLifecycle`, the transports), `climate-ref-celery`, and an + Alembic migration all land together, because the seam replaces the + `Executor` protocol the solver calls. Proposed landing order to bound + risk: (1) add `ResourceHint` + telemetry columns (additive, no behaviour + change); (2) introduce `ExecutionLifecycle` + `InMemoryTransport` + + `ProcessPoolTransport` behind the existing solver entry point with the old + executors still present; (3) port Celery; (4) delete the old executors. +- **Deleting the `Executor` protocol is a breaking public change.** + `import_executor_cls` resolves an executor from a dotted path in `Config`, + so the executor class is a documented extension point and any downstream + custom executor implements it. The cutover must ship a deprecation cycle: + keep `import_executor_cls` resolving known names to the new transports, + warn on custom FQNs, and provide a config-migration note. This is not yet + spelled out and is a precondition for merge. - **CV becomes hard-fail.** Today's silent `logger.warning` becomes a raise. Intentional, but needs a one-cycle deprecation window where - the violation is `ERROR` but not raised. + the violation is `ERROR` but not raised — and it ships in the same wide + migration as the seam swap, so it must be feature-flagged to keep the two + behaviour changes independently bisectable. - **One fat class (~400–500 LOC).** Intentional depth, but reviewers - should expect a large file. -- **Celery has no broker-side result handler anymore.** - `CeleryTransport.poll` feeds outcomes back into `drain`. - A coordinator crash mid-drain is recovered by `replay_abandoned`, - which requires `CeleryTransport` to persist task IDs alongside - execution IDs. + should expect a large file. (LOC is an estimate, not a target.) - **Resource hints can be wrong.** SLURM will OOM-kill a job whose - declared memory is too low. Mitigated by a generous default - (4 GB / 1 CPU / 2 h), by `ProcessPoolTransport` ignoring everything - except `wall_clock`, and by telemetry capture making the first + declared memory is too low. Mitigated by a default that preserves current + behaviour (4 GB / 1 CPU / 6 h), by `ProcessPoolTransport` ignoring + everything except `wall_clock`, and by telemetry capture making the first failed run actionable. # Rationale and alternatives @@ -342,15 +420,22 @@ quadrantChart A's transport contract (`dispatch(envelope)` + `poll() -> Iterator[Outcome]`, no result callback) + A's `BundleWriter` separation + B's `ExecutionEnvelope`/`Telemetry` wire types. +Dropping the result callback is what costs Celery its fire-and-forget +ingestion (see Drawbacks); it is chosen for one uniform pull path, and a +worker-side `IngestSink` remains a non-breaking future addition. Deferred: `ArtifactStore`, `IngestSink`, `LifecycleHooks`, `FragmentAllocator` as a port, plugin registry. The shallow `Executor` Protocol and the three concrete executors -are deleted. +are deleted (with a deprecation cycle for `import_executor_cls`; see +Drawbacks). -**Impact of not doing this**: each new transport reimplements ~250 LOC -of lifecycle wiring; resource hints retrofit later through a new wire -format (strictly larger change); per-task timeout, CV validation, -dirty-flag, and retry classification stay scattered. +**Impact of not doing this**: each new transport reimplements its own +dispatch / poll / timeout wiring and re-derives retry and dirty handling +inline (the shared promotion/ingest helpers in `result_handling.py` already +exist, but nothing forces a new transport to route through them); +resource hints retrofit later through a new wire format (strictly larger +change); per-task timeout, CV validation, dirty-flag, and exception +classification stay scattered. # Prior art [prior-art]: #prior-art @@ -379,10 +464,15 @@ To resolve through this RFC: vs. project CV baked into a constant. - Is `replay_abandoned` automatic on `__init__` or explicit from the CLI? Draft assumes explicit. +- Is detached (coordinator-free) submission a hard requirement for Celery + deployments? If yes, a worker-side `IngestSink` ships with the cutover + rather than as a deferred addition. To resolve through implementation: - Exact upsert unique constraints + Alembic migration. +- Deprecation path for `import_executor_cls` / the executor FQN config key + (name → transport mapping, warning on custom executors, migration note). - `CeleryTransport.poll` semantics (per-task `AsyncResult.get` vs batch inspection). - `peak_rss_mb` capture across macOS / Linux (`getrusage` unit difference). @@ -399,8 +489,9 @@ Out of scope: Each item below is a self-contained follow-up enabled by this RFC: -- **SLURM / PBS transports** — ~100 LOC adapters; `dispatch` builds a - job script from `envelope.resources`, `poll` queries `squeue` / `qstat`. +- **SLURM / PBS transports** — thin adapters: `dispatch` builds a job + script from `envelope.resources` and submits it, `poll` queries + `squeue` / `qstat` and computes the deadline from job start time. - **Adaptive `ResourceProvider`** — reads `Execution.peak_rss_mb` over a rolling window, suggests memory hint at p95 × 1.2. - **Per-provider retry policies** — single `RetryPolicy` becomes From 36dd7a298b8eb3b83454a5dec0c629206a6894ec Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Thu, 4 Jun 2026 15:41:01 +1000 Subject: [PATCH 07/12] chore: formatting --- text/0003-execution-lifecycle.md | 26 +++++++++----------------- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index 0c40a56..4773bd8 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -3,7 +3,6 @@ - RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3) # Summary -[summary]: #summary Consolidate the lifecycle of one diagnostic execution — allocation, dispatch, run, classify, publish, ingest, finalise — @@ -16,7 +15,6 @@ plug in as thin adapters that only translate the envelope and poll job state. # Motivation -[motivation]: #motivation The lifecycle of one execution is currently fragmented across ~8 files in 2 packages. @@ -69,7 +67,6 @@ This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers. It is about defining a single robust seam *above* them. # Reference-level explanation -[reference-level-explanation]: #reference-level-explanation ## Module boundary @@ -330,7 +327,6 @@ CV mismatch raises `ResultValidationError`; `replay_abandoned` returns stranded IDs. # Drawbacks -[drawbacks]: #drawbacks - **Celery loses fire-and-forget ingestion — the biggest trade-off.** Today `CeleryExecutor` attaches `link` / `link_error` callbacks @@ -376,7 +372,6 @@ CV mismatch raises `ResultValidationError`; failed run actionable. # Rationale and alternatives -[rationale-and-alternatives]: #rationale-and-alternatives Three designs were considered. The chosen interface is a deliberate hybrid. @@ -396,14 +391,14 @@ quadrantChart "Hybrid (chosen)": [0.42, 0.7] ``` -| Dimension | A — Minimal | B — Maximal | C — Common-case | **Hybrid** | -|------------------------|:-:|:-:|:-:|:-:| -| Public surface | 1 | 5 | 2 | 2 | -| Defaults baked in | 3 | 1 | 5 | 4 | -| Bend without editing | 3 | 5 | 2 | 3 | -| Migration churn | 4 | 5 | 2 | 3 | -| Resource-hint support | 0 | 5 | 0 | 5 | -| Speculation tax | 0 | 3 | 0 | 1 | +| Dimension | A — Minimal | B — Maximal | C — Common-case | **Hybrid** | +| --------------------- | :---------: | :---------: | :-------------: | :--------: | +| Public surface | 1 | 5 | 2 | 2 | +| Defaults baked in | 3 | 1 | 5 | 4 | +| Bend without editing | 3 | 5 | 2 | 3 | +| Migration churn | 4 | 5 | 2 | 3 | +| Resource-hint support | 0 | 5 | 0 | 5 | +| Speculation tax | 0 | 3 | 0 | 1 | - **A — Minimal**: 2 methods, 1 port, everything else hidden. No place for resource hints or per-provider retry without later kwarg growth. @@ -438,7 +433,6 @@ change); per-task timeout, CV validation, dirty-flag, and exception classification stay scattered. # Prior art -[prior-art]: #prior-art - **Dask `distributed`** — `resources=` annotations on submitted tasks inspire `ResourceHint`. @@ -446,7 +440,7 @@ classification stay scattered. translate transparently into SLURM / PBS / K8s. Same mental model at the diagnostic level. - **Airflow** — executor / operator split; `BaseExecutor.execute_async` - + `sync` is essentially `Transport.dispatch` + `Transport.poll`. + - `sync` is essentially `Transport.dispatch` + `Transport.poll`. - **Celery** — `task_time_limit` + queue routing. `ResourceHint.queue` maps onto Celery queues; `wall_clock` onto `task_time_limit` / `task_soft_time_limit`. Today's `CeleryExecutor` uses neither. @@ -454,7 +448,6 @@ classification stay scattered. template. # Unresolved questions -[unresolved-questions]: #unresolved-questions To resolve through this RFC: @@ -485,7 +478,6 @@ Out of scope: - Replacing SLURM / PBS / Celery as schedulers. # Future possibilities -[future-possibilities]: #future-possibilities Each item below is a self-contained follow-up enabled by this RFC: From dde7dbb8098c066c6288fa4dc277b1c68f08849a Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Thu, 4 Jun 2026 16:02:57 +1000 Subject: [PATCH 08/12] rfc(execution-lifecycle): rewrite around on-disk manifest + liveness transport MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Full rewrite after design review. Pivots the proposal from a result-carrying transport to a disk-backed lifecycle: - Source of truth is an on-disk, two-phase execution manifest (execution.json), REF-owned and typed, embedding CMEC provenance. Phase 1 (identity) written before the run; phase 2 (outcome) written last, even on recoverable failure. Absent outcome = incomplete (in-flight or hard-killed) — crashes are now diagnosable from durable state. - Transport narrows to launch + liveness (RUNNING|EXITED|GONE); results never cross the boundary. Manifest decides outcome, transport decides liveness, a present manifest outcome wins over liveness. - Split into two verbs: run (definition -> on-disk bundle) and ingest (dir -> DB, transport-agnostic, idempotent, CV-validated, replayable). - All executors ported (InMemory, ProcessPool, Celery, Slurm, Pbs); only K8s deferred. Celery link/link_error demotes to an optional eager-ingest hook. - Primary driver reframed as regression output + crash-robustness; HPC and cross-deployment portability are downstream beneficiaries. - Adds execution state diagram, staged migration plan, and CMEC-aligned terms. - Defers: K8s, external/null execution, portability tier 3 (remote-file datasets, own RFC), external-version enforcement (ingest-as-is), telemetry/ adaptive provider, CV hard-fail. Markdown uses semantic line breaks targeting ~110 columns. --- text/0003-execution-lifecycle.md | 862 +++++++++++++++---------------- 1 file changed, 431 insertions(+), 431 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index 4773bd8..4c25eab 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -3,502 +3,502 @@ - RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3) # Summary +[summary]: #summary -Consolidate the lifecycle of one diagnostic execution -— allocation, dispatch, run, classify, publish, ingest, finalise — -into one deep module (`ExecutionLifecycle`) behind one `Transport` port. -Add a declarative `ResourceHint` on every `Diagnostic` -and capture per-execution `Telemetry`, -so providers can express memory / CPU / wall-clock once -and future schedulers (SLURM, PBS, K8s) -plug in as thin adapters that only translate the envelope -and poll job state. +Replace the shallow `Executor` protocol and its three divergent implementations +(`SynchronousExecutor`, `LocalExecutor`, `CeleryExecutor`) +with one execution lifecycle built on two ideas. -# Motivation +1. A self-describing, on-disk **execution manifest** (`execution.json`) + that records — in CMEC-compatible terms — what an execution *is* and how it *ended*. +2. A thin **`Transport`** port that only *launches* work and reports *liveness* + (`RUNNING | EXITED | GONE`). + +Results never travel back through the transport. +Every execution method — in-process, process pool, Celery, SLURM, PBS — +completes by writing its CMEC bundles to a directory on disk, +and a single, transport-agnostic *ingest* step loads the manifest and bundles into the database. + +This makes executions crash-robust and replayable, +makes diagnostic output self-describing so it can be compared across runs and versions (regression), +and reduces every backend to a launcher plus a status query. -The lifecycle of one execution is currently fragmented across ~8 files -in 2 packages. -A single happy-path run touches -`solver.py` (allocates row + fragment, register_datasets, expunge, commit), -`climate_ref_core/executor.py` (`execute_locally`, `_is_system_error`, -`CondaCommandError` handling), -`climate_ref_core/diagnostics.py` (`ExecutionResult.build_from_output_bundle`, -a static factory that writes three JSON files to disk as a side effect of -constructing the result), -`climate_ref/executor/result_handling.py` (scratch→results copy ×4, -ingestion in nested-tx, dirty-flag toggled in 3 branches of the result path), -`climate_ref/executor/fragment.py` (PLACEHOLDER_FRAGMENT, group_short), -and one of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py` -(each reimplementing the same reattach / commit / mark dance). - -Concrete consequences today: - -- **Providers have no place to declare parallelisation hints.** - ESMValTool diagnostics that need 16 GB or 8 h have nowhere to say so; - `LocalExecutor` applies one 6 h per-task timeout to every diagnostic - (a single constructor default, not per-diagnostic), - `CeleryExecutor` enforces no per-task timeout at all, - and a future SLURM/PBS adapter has nothing to translate into `sbatch`/`qsub`. -- **Retry classification is scattered** across `_is_system_error`, - the `CondaCommandError` branch, missing-log handling, `LocalExecutor`'s - per-task timeout, and pool-shutdown abandonment. -- **The `dirty` flag is decided in three branches of the result path** - (success path, non-retryable failure, retryable failure / missing log). - User-initiated `dirty=True` resets in `cli/executions.py` (rerun / reset) - are deliberately separate and stay outside the consolidated rule. -- **CV validation is silenced** (`logger.warning` with TODO instead of raising). -- **Tests mock subprocess + filesystem + DB + Celery** and import private - helpers (`_is_system_error`) to compensate for the missing seam. -- **`ExecutionResult` construction is entangled with disk I/O.** - It already pickles across the `ProcessPoolExecutor` boundary today - (`_process_run` returns it to the parent), so picklability is not the - problem — the problem is that the factory cannot be exercised without a - writable output directory, which forces filesystem fixtures into every - unit test that builds a result. - -The CMIP REF is approaching deployment targets (SLURM / PBS / K8s). -The recurring lifecycle logic — fragment allocation, the reattach / commit / -mark dance, scratch→results promotion, retry classification — is already -shared in `result_handling.py`; what each new transport reimplements is the -dispatch / poll / timeout wiring around it. -This RFC pulls that shared logic behind one seam so a transport contributes -only its dispatch and poll, not a copy of the lifecycle. -This RFC is **not** about replacing SLURM, PBS, K8s, or Celery as schedulers. -It is about defining a single robust seam *above* them. +# Motivation +[motivation]: #motivation + +The REF needs **one robust way to run diagnostics** that works across a range of deployments — +a laptop, a Celery cluster, and an HPC batch scheduler at a modelling centre. +"Robust" has a concrete meaning here: +an execution survives coordinator and worker crashes cleanly, +and an operator can always see *what a run did and why it failed* from durable state, not from a live process. + +Today none of that holds, for four reasons. + +## The lifecycle is fragmented across divergent executors + +A single happy-path run touches eight files in two packages: + +- `solver.py` allocates the row and fragment, + registers datasets, expunges, and commits before handing off to an executor. +- `climate_ref_core/executor.py` holds `execute_locally`, `_is_system_error`, + and the `CondaCommandError` branch. +- `climate_ref_core/diagnostics.py` holds `ExecutionResult.build_from_output_bundle`, + a static factory that writes three JSON files to disk as a side effect of constructing the result. +- `climate_ref/executor/result_handling.py` copies scratch→results, ingests inside a nested transaction, + and toggles the `dirty` flag in three branches. +- One of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py` + re-implements the reattach / commit / mark dance. + +The three executors diverge in ways that matter: +`LocalExecutor` applies a single 6-hour per-task timeout to every diagnostic; +`CeleryExecutor` enforces no per-task timeout at all; +`SynchronousExecutor` runs inline. +Retry classification is scattered across `_is_system_error`, the `CondaCommandError` branch, +missing-log handling, the per-task timeout path, and pool-shutdown abandonment. + +## Executions are not robust to crashes + +Failure state is decided from a *live* exception in `execute_locally` and recorded only in the database. +If a worker is OOM-killed, the coordinator dies mid-drain, or a Celery broker is lost, +there is no durable record on disk of what the execution was or how far it got. +The `LocalExecutor` join loop marks abandoned futures retryable from memory; +once that process is gone, the knowledge is gone. +An operator inspecting a stuck `Execution` row (`successful IS NULL`) +cannot tell whether it is still running or died hours ago. + +## Results are not self-describing + +An execution directory today contains the CMEC output bundle (`output.json`), +the CMEC diagnostic bundle (`diagnostic.json`), the series values (`series.json`), and a log (`out.log`). +**Nothing on disk says which diagnostic, provider version, or datasets produced them** — +that identity lives only in the database, and `reingest` rebuilds it *from* the database. +This blocks the primary motivation for this RFC: **regression output**. +To compare a diagnostic's output across versions or runs, +or to re-ingest results after changing extraction logic, +the result directory must carry its own identity and outcome. +It also blocks any future where a directory is copied between deployments. + +## There is no seam for HPC + +Modelling centres run on SLURM and PBS. +There is nowhere for a diagnostic to declare it needs 16 GB or 8 hours, +and nothing a batch adapter could translate into `sbatch --mem --time` or `qsub -l`. +Each new backend would otherwise re-implement the whole lifecycle inline. + +**Primary driver: regression output and crash-robustness.** +HPC reach and cross-deployment portability are downstream beneficiaries of the same on-disk format, +not the justification on their own. # Reference-level explanation +[reference-level-explanation]: #reference-level-explanation -## Module boundary +## Two verbs: `run` and `ingest` -```mermaid -classDiagram - direction LR - - class ExecutionLifecycle { - <> - +submit(execution, definition) - +drain(timeout) None - +replay_abandoned() list~int~ - -_FragmentAllocator - -_Classifier - -_Promoter - -_Ingestor - -_DirtyRule - -_BundleWriter - } - - class Transport { - <> - +name: ClassVar~str~ - +dispatch(envelope: ExecutionEnvelope) None - +poll(block, timeout) Iterator~ExecutionOutcome~ - +shutdown(timeout) None - } - - class InMemoryTransport { tests + SynchronousExecutor } - class ProcessPoolTransport { replaces LocalExecutor } - class CeleryTransport { climate-ref-celery } - class SlurmTransport { future · sbatch --mem --cpus --time } - class PbsTransport { future · qsub -l mem,ncpus,walltime } - class K8sTransport { future · pod resources + deadline } - - ExecutionLifecycle ..> Transport : dispatch / poll - InMemoryTransport ..|> Transport - ProcessPoolTransport ..|> Transport - CeleryTransport ..|> Transport - SlurmTransport ..|> Transport - PbsTransport ..|> Transport - K8sTransport ..|> Transport -``` +The lifecycle splits cleanly into two operations that today are tangled together. -## Wire types +- **`run`**: turn an `ExecutionDefinition` into an output directory on disk — + a manifest plus CMEC bundles. A `Transport` drives this (or, in future, an external system does). +- **`ingest`**: load one output directory into the database. Transport-agnostic, idempotent, replayable. -Picklable value objects only. No DB sessions, `Config`, or CV cross the boundary. +Every execution method funnels through the same `ingest`. +Live completion, re-ingestion after a logic change, and (future) cross-deployment import +are the *same code path* reading the *same on-disk format*. -`ResourceHint` lives in `climate_ref_core` (the `Diagnostic` base class -declares it, and core cannot import the application package). -`ExecutionEnvelope`, `Telemetry`, `ExecutionOutcome`, and the `Transport` -protocol live in `climate_ref.lifecycle` alongside `ExecutionLifecycle`. +## The execution manifest (`execution.json`) -The default `wall_clock` is **6 h**, matching today's `LocalExecutor` -per-task timeout, so diagnostics that run for hours without declaring -`resources` keep their current budget (Celery enforces no limit today, so -nothing regresses there either). Tightening the default is a separate, -explicit decision. +A REF-owned, typed, versioned file written to the execution directory. +It is the durable record of an execution's identity and outcome. +CMEC provides building blocks for *provenance* but has no vocabulary for *outcome*, +so the manifest is a REF schema that embeds CMEC provenance rather than living inside the CMEC output bundle. ```python -@attrs.frozen -class ResourceHint: - memory_mb: int = 4096 - cpu: int = 1 - wall_clock: timedelta = timedelta(hours=6) # matches current LocalExecutor budget - queue: str | None = None # transport-specific routing tag +class ManifestStatus(enum.Enum): + SUCCESS = "success" # ran; produced a valid, CV-checkable bundle + FAILED = "failed" # ran; diagnostic logic error — not retryable (give up) + RECOVERABLE = "recoverable" # ran; system/transient error — retryable @attrs.frozen -class ExecutionEnvelope: - execution_id: int - definition: ExecutionDefinition - resources: ResourceHint - # wall_clock travels in `resources`; the *deadline* is computed by the - # transport when the job starts running, not here — a queued SLURM/PBS - # job may wait hours before it begins, so anchoring the deadline at - # submit time would expire jobs before they start. +class DatasetRef: + instance_id: str # Dataset.slug — the CMIP6 instance_id; assumed globally unique + slug_column: str + source_type: str # SourceDatasetType: cmip6, obs4mips, ... @attrs.frozen -class Telemetry: - duration: timedelta - peak_rss_mb: int | None - host: str +class Outcome: # phase 2 — written when the diagnostic ends + status: ManifestStatus exit_code: int | None - transport_meta: Mapping[str, str] # slurm jobid / k8s pod / celery task_id + started_at: datetime + finished_at: datetime + output_bundle: str | None # output.json — CMEC output bundle (CMECOutput) + metric_bundle: str | None # diagnostic.json — CMEC diagnostic bundle (CMECMetric) + series: str | None # series.json + log: str = "out.log" + provenance: OutputProvenance | None = None # CMEC environment / modeldata / obsdata / log @attrs.frozen -class ExecutionOutcome: - execution_id: int - result: ExecutionResult | None # None ⇒ transport-side abandonment - failure: ExecutionFailure | None # timeout | broker_lost | pool_shutdown - telemetry: Telemetry +class ExecutionManifest: + schema_version: int + # --- identity: phase 1, written before the diagnostic runs --- + diagnostic_slug: str + diagnostic_version: int + provider_slug: str + provider_version: str + dataset_hash: str + datasets: list[DatasetRef] + selectors: dict[str, str] + # --- outcome: phase 2 --- + outcome: Outcome | None = None # None ⇒ incomplete (in-flight, or hard-killed before writing) ``` -`ExecutionResult` becomes pure data. -The current `build_from_output_bundle` factory is split into a pure -`ExecutionResult.from_bundle(definition, bundle)` and a worker-side -`_BundleWriter.write(definition, bundle)` that owns the JSON I/O. +**Two-phase write.** +The identity half is the *first* file written to the directory, before the diagnostic runs, +with `outcome = None`. +The outcome half is the *last* thing the worker writes, **even on a recoverable failure**. +Therefore a manifest with `outcome is None` means *incomplete*: +either still in flight, or hard-killed (OOM, `SIGKILL`, node loss) before it could write its epitaph. +This single invariant is what makes crashes diagnosable from disk alone. -## Diagnostic-side declaration +**CMEC alignment.** +`provenance` reuses the CMEC `OutputProvenance` shape (`environment`, `modeldata`, `obsdata`, `log`), +so the manifest captures inputs and environment in standard terms. +`datasets` references inputs by `instance_id` (the CMEC/ESGF dataset identity), +which is what makes a directory portable: it names its inputs without carrying them. -```python -class Diagnostic(AbstractDiagnostic): - resources: ResourceHint = ResourceHint() # project-wide default +## The `Transport` port — liveness, not results - def resources_for(self, definition: ExecutionDefinition) -> ResourceHint: - """Optional per-execution sizing. Default returns self.resources.""" - return self.resources +A transport launches an execution and answers one question: *is it alive?* +It never carries results. +```python +class Status(enum.Enum): + RUNNING = "running" # still executing + EXITED = "exited" # ended with a clean exit code + GONE = "gone" # vanished without a clean exit (OOM, SIGKILL, node loss, timeout-kill) + +class Transport(Protocol): + name: ClassVar[str] + def dispatch(self, envelope: ExecutionEnvelope) -> str: ... # launch; return durable handle + def status(self, handles: Mapping[int, str]) -> Mapping[int, Status]: ... + def cancel(self, handle: str) -> None: ... # deadline / drain kill + def shutdown(self, timeout: float) -> None: ... +``` -# Examples -class ESMValToolDiagnostic(CommandLineDiagnostic): - resources = ResourceHint(memory_mb=16000, cpu=4, wall_clock=timedelta(hours=6)) +`dispatch` returns a durable handle (pid, Celery task id, SLURM job id) +that the lifecycle persists on the `Execution` row in the same commit, +so a freshly restarted coordinator can still ask about the job. -class EnsoDiagnostic(Diagnostic): - resources = ResourceHint(memory_mb=24000, cpu=8, - wall_clock=timedelta(hours=8), queue="bigmem") +```python +@attrs.frozen +class ResourceHint: + memory_mb: int = 4096 + cpu: int = 1 + wall_clock: timedelta = timedelta(hours=6) # matches today's LocalExecutor budget + queue: str | None = None # transport-specific routing tag -class IlambDiagnostic(Diagnostic): - resources = ResourceHint(memory_mb=8000, cpu=2, wall_clock=timedelta(hours=2)) - def resources_for(self, defn): - n = len(defn.datasets.get_cmip6()) - return attrs.evolve(self.resources, memory_mb=8000 + 500 * n) +@attrs.frozen +class ExecutionEnvelope: + execution_id: int + definition: ExecutionDefinition + resources: ResourceHint + # wall_clock travels inside `resources`; the deadline is computed by the transport + # (or the scheduler) at job *start*, never at submit time — a queued SLURM/PBS job + # may wait hours before it begins. ``` -## Solver dispatch — before and after +The default `wall_clock` is 6 h to match today's `LocalExecutor` budget, +so diagnostics that run for hours without declaring `resources` keep their current allowance. -```python -# Before: ~50 lines of bookkeeping in solver.py:709-757 -# PLACEHOLDER_FRAGMENT, assign_execution_fragment, attrs.evolve, -# register_datasets, expunge, commit, executor.run, … executor.join +## The lifecycle and the drain loop -# After: -lifecycle = ExecutionLifecycle(config, db, transport) +`ExecutionLifecycle` is one deep module that owns submission, the drain loop, retry/dirty, and ingest. +It is stateless with respect to in-flight work: +the set of executions to wait on comes from the database (`successful IS NULL`), not from memory. +This is what makes it re-runnable and crash-safe — `drain` after a restart simply resumes. +```python for group, datasets, definition in planned_executions: execution = Execution(execution_group=group, dataset_hash=datasets.hash, provider_version=definition.diagnostic.provider.version) - lifecycle.submit(execution, definition) + lifecycle.submit(execution, definition) # writes phase-1 manifest, dispatch(), persists handle lifecycle.drain(timeout=timeout) ``` -## End-to-end sequence +`drain` loops over the in-flight set. +**The transport decides liveness; the manifest decides outcome; +a present manifest outcome wins over liveness.** -```mermaid -sequenceDiagram - autonumber - participant S as Solver - participant D as Diagnostic - participant L as ExecutionLifecycle - participant T as Transport - participant W as Worker - participant DB - - S->>L: submit(execution, definition) - L->>D: resources_for(definition) - D-->>L: ResourceHint(...) - L->>DB: allocate fragment + register_datasets + expunge + commit - L->>T: dispatch(ExecutionEnvelope) - - Note over T,W: sbatch · qsub · pool.submit · celery send · inline - Note over T: deadline = job_start + resources.wall_clock (transport-side) - T->>W: hand off envelope - activate W - W->>W: diagnostic.run · _BundleWriter.write · CV.validate · Telemetry - W-->>T: ExecutionOutcome - deactivate W - - S->>L: drain(timeout) - loop until no in-flight executions - L->>T: poll(block, timeout) - T-->>L: ExecutionOutcome - Note over L: _Classifier → SUCCESS | RETRY | GIVE_UP - L->>DB: merge · promote artifacts · upsert outputs/scalars/series - L->>DB: _DirtyRule.apply · save telemetry · mark_* - end +```python +for ex in db.in_flight(submitted): # successful IS NULL + match transport.status({ex.id: ex.handle})[ex.id]: + case Status.RUNNING: + if past_deadline(ex): + transport.cancel(ex.handle) # ProcessPool only; schedulers self-kill + # else: keep waiting + + case Status.EXITED: # clean exit — ask the disk what it meant + outcome = read_manifest(ex.dir).outcome + if outcome is None: # exited 0 but wrote no outcome → incomplete + mark_retryable(ex) + elif outcome.status is SUCCESS: + ingest_from_disk(ex.dir) # CV-validated; same path for every transport + elif outcome.status is FAILED: + mark_failed(ex) # diagnostic error, give up + else: # RECOVERABLE + mark_retryable(ex) + + case Status.GONE: # no clean exit + outcome = read_manifest(ex.dir).outcome # worker may have logged a recoverable fail first + if outcome is None: + mark_retryable(ex) # hard crash: OOM / SIGKILL / node loss + else: + apply(outcome) # manifest wins (e.g. SUCCESS written then node reaped) ``` -## Retry + dirty rule (one source of truth) - -Classification happens in two clearly separated places. +The decisive robustness rule is the last branch: +if the worker wrote `SUCCESS` and the node then died before the scheduler reaped it, +the transport reports `GONE` but the manifest reports success — and the manifest wins. +Transport liveness only adjudicates the *no-manifest-outcome* case. -**Worker side** — exception → outcome. The worker runs the diagnostic and -maps the raised exception (or clean return) onto the booleans carried by -`ExecutionResult` / `ExecutionFailure`. This is where the exception-type -knowledge lives, because the exception is only ever raised on the worker: +## Execution states -```python -# worker-side, inside the run path -SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) # → retryable -NON_RETRYABLE = (CondaCommandError,) # → give up +```mermaid +stateDiagram-v2 + [*] --> Planned: solver creates Execution (successful=NULL) + Planned --> Dispatched: transport.dispatch()
phase-1 manifest written + Dispatched --> Running: worker starts + + Running --> Running: status=RUNNING, within deadline + Running --> Gone: status=RUNNING, past deadline
transport.cancel() + + state poll <> + Running --> poll: status=EXITED + Running --> Gone: status=GONE + Gone --> poll: read manifest + + poll --> Ingesting: manifest.outcome = SUCCESS
(wins even if GONE) + poll --> Failed: manifest.outcome = FAILED
(diagnostic error) + poll --> Retryable: manifest.outcome = RECOVERABLE + poll --> Retryable: manifest.outcome absent
(incomplete / OOM / SIGKILL) + + Ingesting --> Successful: bundles→DB, CV-validated
successful=True, group.dirty=False + Ingesting --> Retryable: ingest error + + Successful --> [*] + Failed --> [*]: successful=False, group.dirty=False + Retryable --> [*]: successful=False, group.dirty=True
→ new Execution next solve + + note right of Dispatched + External/Null transport (future): + no status channel — status is + derived from manifest presence + end note ``` -This consolidates the exception-classification logic that is **today** -spread across `_is_system_error` and the separate `CondaCommandError` -branch in `execute_locally` into one worker-side function. +A row has three terminal outcomes: +**Successful** (clean, group marked clean); +**Failed** (diagnostic error, not retryable, group marked clean); +**Retryable** (system/transient error or hard crash — the group stays dirty, and the next solve mints a *new* +`Execution`). +A retry never re-runs the same row. + +## Retry and dirty: two clearly separated decisions -**Coordinator side** — outcome → decision. The policy never inspects -exception types; it maps the already-classified outcome onto a decision, -so the same rule applies identically to every transport (including remote -ones where the exception object never comes back): +Classification happens in two places, and only two. + +**Worker side — exception → outcome.** +The worker maps the raised exception (or a clean return) onto `ManifestStatus` and writes it to the manifest. +This is the one place that knows exception *types*, because the exception is only ever raised on the worker. +It consolidates the logic spread today across `_is_system_error` and the `CondaCommandError` branch: ```python -class RetryDecision(enum.Enum): - SUCCESS = "success" - RETRY = "retry" # leaves dirty=True - GIVE_UP = "give_up" # sets dirty=False, marks failed - -class DefaultRetryPolicy: - def classify(self, outcome: ExecutionOutcome) -> RetryDecision: - if outcome.failure is not None: # timeout | broker_lost | pool_shutdown - return RetryDecision.RETRY - r = outcome.result - if r is None: return RetryDecision.RETRY - if r.successful: return RetryDecision.SUCCESS - return RetryDecision.RETRY if r.retryable else RetryDecision.GIVE_UP +SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) # → RECOVERABLE +NON_RETRYABLE = (CondaCommandError,) # → FAILED ``` -Net effect: exception classification goes from two scattered sites to one -worker-side function, and the *transport-level* outcomes that today live in -the per-executor `join` loops (missing log, per-task timeout, pool-shutdown -abandonment) collapse into the single coordinator policy above. - -## Idempotent ingest, telemetry - -Ingestion uses `INSERT … ON CONFLICT DO NOTHING` on natural keys -(`(execution_id, output_type, short_name)` for outputs, -`(execution_id, dimensions_hash[, index_name])` for metric/series values), -so `replay_abandoned()` is safe. -Scratch-to-results copy uses `exist_ok=True`. - -`DO NOTHING` (rather than `DO UPDATE`) is correct because every key is -scoped to `execution_id`, and each solve mints a **new** `Execution` row per -attempt: a retry produces a fresh `execution_id`, so its values never -collide with the abandoned attempt's. The conflict clause therefore only -guards re-ingestion of the *same* `execution_id` during replay — it never -silently keeps stale values from a previous attempt. - -New `Execution` columns: `duration_seconds`, `peak_rss_mb`, `telemetry_meta JSON`. -No solver code reads these today; they exist so a future adaptive -`ResourceProvider` is a feature addition rather than a schema migration. - -## Test impact - -Delete: `_is_system_error` private-import tests, -subprocess patch chains in `test_providers.py`, -per-executor reattach tests, -`mark_execution_failed` mock chains. - -Add boundary tests against `ExecutionLifecycle` + `InMemoryTransport`: -unique fragment per submit; -end-to-end success → `dirty=False`; -retryable failure leaves `dirty=True`; -non-retryable failure → `dirty=False`, `successful=False`; -re-drain idempotent (no double-insert); -`wall_clock` enforced uniformly across transports; -CV mismatch raises `ResultValidationError`; -`replay_abandoned` returns stranded IDs. +**Coordinator side — outcome → decision.** +The drain loop maps the already-classified outcome (plus transport liveness) onto `SUCCESS | retry | give up`. +It never inspects exception types, so it behaves identically for every transport, +including remote ones where the exception object never comes back. + +The `dirty` flag follows directly: +`SUCCESS` and `FAILED` clear it; `RECOVERABLE` and incomplete leave it set so the next solve retries. +User-initiated `dirty = True` resets in `cli/executions.py` (rerun / reset) stay separate by design. + +## Ingest from disk + +`ingest_from_disk(directory)` reads `execution.json` and the bundles it names, then writes to the database. +It is the only path that mutates result tables, and it is: + +- **Idempotent.** Inserts use `ON CONFLICT DO NOTHING` on natural keys + (`(execution_id, output_type, short_name)` for outputs; + `(execution_id, dimensions_hash[, index_name])` for diagnostic and series values). + Each solve mints a new `execution_id` per attempt, so retries never collide; + the conflict clause only guards re-ingestion of the *same* execution during a replay or eager/poll race. +- **CV-validated.** The CMEC diagnostic and series bundles are validated against the controlled vocabulary + at ingest, exactly as today. (This RFC keeps the current `logger.warning` behaviour; + making CV a hard failure is explicitly out of scope — see below.) +- **Portable.** Result directories hold results only, with relative internal paths. + Inputs are referenced by `instance_id` and resolved against the importing deployment's dataset index; + a directory never carries or hardcodes input data. + +This is also what enables the existing `reingest` use case to work *without* the database it was produced from +(within limits — see portability tiers under Unresolved questions). + +## Transports + +All current executors are replaced, and the two HPC backends are added, as part of this work. +Each is a launcher plus a status query — typically under a few hundred lines. + +| Transport | Replaces / adds | `dispatch` | `status` source | deadline enforced by | +|-----------------------|------------------------|-----------------------|-------------------------|---------------------------| +| `InMemoryTransport` | `SynchronousExecutor` | run inline | always `EXITED` | n/a (inline) | +| `ProcessPoolTransport`| `LocalExecutor` | `pool.submit` | `future` state | coordinator (`cancel`) | +| `CeleryTransport` | `CeleryExecutor` | `app.send_task` | `AsyncResult` | broker (`task_time_limit`)| +| `SlurmTransport` | new | `sbatch --mem --time` | `sacct` / `squeue` | scheduler (`--time`) | +| `PbsTransport` | new | `qsub -l` | `qstat` | scheduler (walltime) | + +`ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side +(via `cancel` → `future.cancel`), because there is no external scheduler to do it; +this preserves today's `LocalExecutor` per-task timeout. + +**Celery callback demotes to a latency optimisation.** +Today `CeleryExecutor` attaches `link` / `link_error` callbacks (`handle_result` / `handle_failure`) +that ingest worker-side. +Under this design the worker writes its bundle and manifest to disk like every other transport, +and the coordinator (or a resident orchestrator running `drain`) ingests from disk. +The `link` callback may be kept as an *eager-ingest hook* — "ingest this directory now" rather than at the next +poll — but it is no longer load-bearing: if the callback is lost, the disk scan recovers the result. +This dissolves the previous push-vs-pull asymmetry; both modes run the same `ingest_from_disk`. + +## Migration plan + +The cutover replaces the seam the solver calls, so it lands in stages, each independently shippable: + +1. **Manifest, additively.** Current executors start writing `execution.json` (both phases). + No behaviour change; result directories become self-describing immediately — + this alone unlocks regression comparison. +2. **Seam + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`, + and `InMemoryTransport` + `ProcessPoolTransport`; route the solver through the lifecycle. +3. **Celery.** Port to `CeleryTransport`; convert `link`/`link_error` to the optional eager-ingest hook. +4. **HPC.** Add `SlurmTransport` and `PbsTransport`. +5. **Cleanup.** Delete the `Executor` protocol and the three old executors once nothing imports them. + +`import_executor_cls` (which resolves an executor from a dotted path in `Config`) gains a deprecation cycle: +known names map to the new transports, custom dotted paths warn, and a config-migration note ships with step 5. # Drawbacks - -- **Celery loses fire-and-forget ingestion — the biggest trade-off.** - Today `CeleryExecutor` attaches `link` / `link_error` callbacks - (`handle_result` / `handle_failure`) so a worker ingests its own result - with no live coordinator; the submitting process can exit immediately. - The pull model (`Transport.poll` feeding `drain`) couples ingestion to a - coordinator that stays alive for the whole batch. This is a real - regression for the distributed case and is accepted deliberately: it buys - one ingestion path and uniform retry/dirty handling across transports, - and `replay_abandoned` (backed by `CeleryTransport` persisting task IDs - alongside execution IDs) recovers a coordinator crash mid-drain. If - detached submission turns out to be a hard requirement, a worker-side - `IngestSink` callback can be added later without changing the seam — but - the draft does **not** preserve it, and reviewers should weigh that. -- **Migration is wide and the seam swap is atomic.** Staggering applies to - *adding* transports later, not to the cutover: `climate-ref-core` (wire - types, `ResourceHint` on `Diagnostic`), `climate-ref` - (`ExecutionLifecycle`, the transports), `climate-ref-celery`, and an - Alembic migration all land together, because the seam replaces the - `Executor` protocol the solver calls. Proposed landing order to bound - risk: (1) add `ResourceHint` + telemetry columns (additive, no behaviour - change); (2) introduce `ExecutionLifecycle` + `InMemoryTransport` + - `ProcessPoolTransport` behind the existing solver entry point with the old - executors still present; (3) port Celery; (4) delete the old executors. -- **Deleting the `Executor` protocol is a breaking public change.** - `import_executor_cls` resolves an executor from a dotted path in `Config`, - so the executor class is a documented extension point and any downstream - custom executor implements it. The cutover must ship a deprecation cycle: - keep `import_executor_cls` resolving known names to the new transports, - warn on custom FQNs, and provide a config-migration note. This is not yet - spelled out and is a precondition for merge. -- **CV becomes hard-fail.** Today's silent `logger.warning` becomes a - raise. Intentional, but needs a one-cycle deprecation window where - the violation is `ERROR` but not raised — and it ships in the same wide - migration as the seam swap, so it must be feature-flagged to keep the two - behaviour changes independently bisectable. -- **One fat class (~400–500 LOC).** Intentional depth, but reviewers - should expect a large file. (LOC is an estimate, not a target.) -- **Resource hints can be wrong.** SLURM will OOM-kill a job whose - declared memory is too low. Mitigated by a default that preserves current - behaviour (4 GB / 1 CPU / 6 h), by `ProcessPoolTransport` ignoring - everything except `wall_clock`, and by telemetry capture making the first - failed run actionable. +[drawbacks]: #drawbacks + +- **The migration is wide.** The seam the solver calls is replaced, touching `climate-ref-core`, + `climate-ref`, and `climate-ref-celery`. The staging above bounds the risk but does not remove it. +- **One deep module.** `ExecutionLifecycle` is intentionally a large, central class. + Reviewers should expect depth concentrated in one file rather than spread thinly. +- **A resident coordinator is needed for steady-state ingest.** + With the callback demoted, ingest normally runs in the process that calls `drain` + (the CLI, or a resident orchestrator). The eager-ingest hook keeps Celery's low latency, + but the fire-and-forget-with-no-coordinator mode is gone unless the hook is retained. +- **Manifest is a new on-disk contract.** Once written, `execution.json` must be versioned and migrated + carefully; a schema change is a data-format change. +- **Resource hints can be wrong.** A SLURM job whose declared memory is too low is OOM-killed. + Mitigated by a default that preserves current behaviour (4 GB / 1 CPU / 6 h), + by `ProcessPoolTransport` honouring only `wall_clock`, + and by the manifest making the first failed run diagnosable. # Rationale and alternatives - -Three designs were considered. -The chosen interface is a deliberate hybrid. - -```mermaid -quadrantChart - title Design trade-off space - x-axis "Surface area (concepts)" --> "Larger" - y-axis "Defaults baked in" --> "More" - quadrant-1 "Heavy & opinionated" - quadrant-2 "Lean & opinionated" - quadrant-3 "Lean & open" - quadrant-4 "Heavy & open" - "A - Minimal": [0.18, 0.55] - "B - Maximally flexible": [0.92, 0.18] - "C - Common-case optimised": [0.38, 0.92] - "Hybrid (chosen)": [0.42, 0.7] -``` - -| Dimension | A — Minimal | B — Maximal | C — Common-case | **Hybrid** | -| --------------------- | :---------: | :---------: | :-------------: | :--------: | -| Public surface | 1 | 5 | 2 | 2 | -| Defaults baked in | 3 | 1 | 5 | 4 | -| Bend without editing | 3 | 5 | 2 | 3 | -| Migration churn | 4 | 5 | 2 | 3 | -| Resource-hint support | 0 | 5 | 0 | 5 | -| Speculation tax | 0 | 3 | 0 | 1 | - -- **A — Minimal**: 2 methods, 1 port, everything else hidden. - No place for resource hints or per-provider retry without later kwarg growth. -- **B — Maximal**: 5 ports (Transport, ArtifactStore, RetryPolicy, - IngestSink, FragmentAllocator) + 7 hooks + entry-point plugin registry. - Earned the wire-type split and `ResourceHint`; everything else is - speculation. -- **C — Common-case**: one class, defaults sourced from `Config`, - solver call site collapses to one line. - A `ResultSink` callback that Celery silently ignores is an asymmetry - that will trip someone, and there is still no place for resource hints. - -**Chosen hybrid**: C's façade (one class, hot/cold method split) + -A's transport contract (`dispatch(envelope)` + `poll() -> Iterator[Outcome]`, -no result callback) + A's `BundleWriter` separation + -B's `ExecutionEnvelope`/`Telemetry` wire types. -Dropping the result callback is what costs Celery its fire-and-forget -ingestion (see Drawbacks); it is chosen for one uniform pull path, and a -worker-side `IngestSink` remains a non-breaking future addition. -Deferred: `ArtifactStore`, `IngestSink`, `LifecycleHooks`, -`FragmentAllocator` as a port, plugin registry. -The shallow `Executor` Protocol and the three concrete executors -are deleted (with a deprecation cycle for `import_executor_cls`; see -Drawbacks). - -**Impact of not doing this**: each new transport reimplements its own -dispatch / poll / timeout wiring and re-derives retry and dirty handling -inline (the shared promotion/ingest helpers in `result_handling.py` already -exist, but nothing forces a new transport to route through them); -resource hints retrofit later through a new wire format (strictly larger -change); per-task timeout, CV validation, dirty-flag, and exception -classification stay scattered. +[rationale-and-alternatives]: #rationale-and-alternatives + +The central choice is **where the source of truth lives**. +This design puts it on disk (the manifest), with the transport reduced to a liveness oracle. +The main alternatives: + +- **Transport returns results (`poll() -> Iterator[Outcome]`).** + The earlier draft of this RFC. + It forces every transport to materialise a full result object, which is natural for a process pool + and a Celery result backend but *false* for SLURM, where the job writes to disk and exits with a code. + It also keeps the lifecycle tail running in two places (worker-side for Celery callbacks, + coordinator-side for the pool). Routing all completion through an on-disk manifest removes both problems. + +- **Keep per-executor lifecycles, share helpers only.** + This is roughly today's state: `result_handling.py` already shares promotion and ingest. + But nothing forces a new backend through the shared path, and crash-robustness is impossible + without a durable on-disk record. A new SLURM executor would re-derive retry and dirty handling inline. + +- **Database-only completion (no manifest).** + Record outcome only in the database, as today. + This cannot survive a coordinator crash mid-run, cannot make a result directory self-describing, + and cannot support regression comparison or replay — the primary drivers. + +**Impact of not doing this.** +Each new backend re-implements lifecycle wiring and re-derives retry/dirty inline; +results stay anonymous on disk, so regression comparison and replay stay impossible; +crashes remain undiagnosable from durable state. # Prior art - -- **Dask `distributed`** — `resources=` annotations on submitted tasks - inspire `ResourceHint`. -- **Snakemake / Nextflow** — first-class `resources:` directives - translate transparently into SLURM / PBS / K8s. Same mental model at - the diagnostic level. -- **Airflow** — executor / operator split; `BaseExecutor.execute_async` - - `sync` is essentially `Transport.dispatch` + `Transport.poll`. -- **Celery** — `task_time_limit` + queue routing. `ResourceHint.queue` - maps onto Celery queues; `wall_clock` onto `task_time_limit` / - `task_soft_time_limit`. Today's `CeleryExecutor` uses neither. -- **Rust RFC process** — document shape inherited via this repo's - template. +[prior-art]: #prior-art + +- **CMEC / EMDS.** The output and diagnostic bundles, provenance, and dimensions are CMEC concepts; + the manifest extends CMEC provenance rather than inventing a parallel vocabulary. +- **Snakemake / Nextflow.** First-class `resources:` directives that translate into SLURM / PBS / cloud. + Same mental model at the diagnostic level (`ResourceHint`). + Both also treat the working directory as the durable record of a job — the manifest follows this. +- **Dask `distributed`.** `resources=` annotations on submitted tasks inspire `ResourceHint`. +- **Airflow.** Executor / scheduler split; `BaseExecutor` exposes async submit plus a sync/heartbeat that + reconciles task state from the database — essentially `dispatch` plus `status`-driven `drain`. +- **Celery.** `task_time_limit` and queue routing; `ResourceHint.queue` maps onto Celery queues and + `wall_clock` onto `task_time_limit`. Today's `CeleryExecutor` uses neither. +- **Make / build systems.** A target is rebuilt unless its output exists and is current — the same logic as + reading a manifest outcome from disk before deciding to re-run. # Unresolved questions +[unresolved-questions]: #unresolved-questions -To resolve through this RFC: +To resolve through the RFC process: -- Should `ResourceHint` include `gpu: int` / `io_intensive: bool`? - Recommended default: add when a concrete adapter needs them. -- Where does the CV come from in tests? `PermissiveCV()` fixture - vs. project CV baked into a constant. -- Is `replay_abandoned` automatic on `__init__` or explicit from the CLI? - Draft assumes explicit. -- Is detached (coordinator-free) submission a hard requirement for Celery - deployments? If yes, a worker-side `IngestSink` ships with the cutover - rather than as a deferred addition. +- The exact `ExecutionManifest` field set and `schema_version` stability guarantees. +- Whether the Celery eager-ingest hook (`link`) is retained or Celery polls purely via `status`. +- Where the controlled vocabulary comes from in tests (a permissive fixture vs. the project CV). To resolve through implementation: -- Exact upsert unique constraints + Alembic migration. -- Deprecation path for `import_executor_cls` / the executor FQN config key - (name → transport mapping, warning on custom executors, migration note). -- `CeleryTransport.poll` semantics (per-task `AsyncResult.get` vs batch - inspection). -- `peak_rss_mb` capture across macOS / Linux (`getrusage` unit difference). - -Out of scope: - -- Adaptive resource provider (telemetry columns land here; the - provider is a follow-up RFC). -- GPU scheduling, multi-tenant queues. -- Replacing SLURM / PBS / Celery as schedulers. +- Exact upsert unique constraints and any Alembic migration + (`output_fragment`, `provider_version`, and `diagnostic_version` already exist). +- `sacct` / `squeue` / `qstat` polling cadence and output parsing for the HPC transports. +- The deprecation mechanics for `import_executor_cls` and the executor dotted-path config key. + +Out of scope for this RFC (addressed independently later): + +- **K8s transport.** +- **External / "null" execution** — a directory produced entirely outside the REF, then ingested. + The format already supports it; the operational contract does not yet. +- **Cross-deployment portability beyond shared datasets.** + Tier 2 (the importing deployment already indexes the same ESGF datasets, resolved by `instance_id`) + is enabled by the manifest's `datasets` field. + Tier 3 (datasets *not* present locally) needs dataset support for remote / not-yet-local files + and is a separate RFC. The manifest carries the cross-link information (`instance_id`) but not the files. +- **External-execution version policy.** The default is ingest-as-is: + a result lands under its declared `diagnostic_version`, surfaces only if it equals the local + `promoted_version`, is CV-validated, and is never rejected on version. Hardening is deferred. +- **Adaptive resource provider and per-execution telemetry columns.** +- **Making CV validation a hard failure.** # Future possibilities - -Each item below is a self-contained follow-up enabled by this RFC: - -- **SLURM / PBS transports** — thin adapters: `dispatch` builds a job - script from `envelope.resources` and submits it, `poll` queries - `squeue` / `qstat` and computes the deadline from job start time. -- **Adaptive `ResourceProvider`** — reads `Execution.peak_rss_mb` over - a rolling window, suggests memory hint at p95 × 1.2. -- **Per-provider retry policies** — single `RetryPolicy` becomes - `Mapping[str, RetryPolicy]` keyed on provider slug when concrete - demand arrives. -- **Streaming partial outcomes** — `Transport.poll` already incremental; - add a `PartialOutcome` event when a UI consumer appears. -- **S3 / HTTP artifact store** — extract `_Promoter` into an - `ArtifactStore` port when a non-local target lands. -- **Pluggable ingest sinks** — Prometheus, audit log, S3 mirror as - ordered `IngestSink` observers when a second sink materialises. -- **Cross-restart recovery** — extend `replay_abandoned` with - `Transport.lookup(transport_meta)` so a SLURM job finished while the - coordinator was down is picked up. - -None of these are reasons to accept this RFC on their own; -they show the seam is the right shape for the directions the project is -plausibly heading, without pre-baking any of them. +[future-possibilities]: #future-possibilities + +- **K8s transport** — a pod per execution with `resources` and an active deadline; `status` from the pod phase. +- **External / null transport** — the lowest-friction modelling-centre adoption path: + a centre runs the diagnostic in its own pipeline, writes `execution.json` + bundles to the expected layout, + and `ref ingest ` loads it. The manifest becomes a public, versioned contract. +- **Remote-file datasets (portability tier 3)** — resolve `instance_id` references whose files are not local, + populated from ESGF queries or from imported result bundles, so a fresh database can be bootstrapped. +- **Adaptive `ResourceProvider`** — capture per-execution telemetry (peak RSS, duration) into the manifest + and the database, then suggest `ResourceHint` memory at, say, p95 × 1.2 over a rolling window. +- **Cross-restart recovery** — already largely free: because `drain` reconstructs the in-flight set from the + database and `status` queries durable scheduler state, a coordinator that was down while a SLURM job finished + picks it up on the next `drain`. +- **Per-provider retry policies** — the single worker-side classifier becomes a per-provider mapping + when concrete demand appears. + +None of these justify the RFC on their own; +they show the manifest-plus-liveness seam is shaped for where the project is heading +without pre-building any of them. From 11f951060f14dd1b046be87bef76bba820c52d15 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 10:20:49 +1000 Subject: [PATCH 09/12] rfc(execution-lifecycle): add drop-dead deadline to the manifest The manifest now declares when an execution must complete by. - Move started_at into phase-1 (identity) and add an absolute deadline (= started_at + wall_clock), stamped by the worker at job start. - Deadline is anchored at start, never submit, so a job that waited in a SLURM queue is judged from when it actually began. - Drain reads the deadline from the manifest; operators (or a reconciling coordinator with a weak transport) can flag an overdue execution (now > deadline and outcome absent) without querying the scheduler. - Thread the field through the two-phase prose, envelope comment, drain loop, and state diagram. --- text/0003-execution-lifecycle.md | 103 +++++++++++++++---------------- 1 file changed, 51 insertions(+), 52 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index 4c25eab..a832dcb 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -1,16 +1,14 @@ - Feature Name: `execution_lifecycle` - Start Date: 2026-05-12 -- RFC PR: [Climate-REF/rfcs#0003](https://github.com/Climate-REF/rfcs/pull/3) +- RFC PR: [Climate-REF/rfcs#3](https://github.com/Climate-REF/rfcs/pull/3) # Summary -[summary]: #summary Replace the shallow `Executor` protocol and its three divergent implementations (`SynchronousExecutor`, `LocalExecutor`, `CeleryExecutor`) with one execution lifecycle built on two ideas. -1. A self-describing, on-disk **execution manifest** (`execution.json`) - that records — in CMEC-compatible terms — what an execution *is* and how it *ended*. +1. A self-describing, on-disk **execution manifest** (`execution.json`) that records what an execution *is* and how it *ended*. 2. A thin **`Transport`** port that only *launches* work and reports *liveness* (`RUNNING | EXITED | GONE`). @@ -24,12 +22,10 @@ makes diagnostic output self-describing so it can be compared across runs and ve and reduces every backend to a launcher plus a status query. # Motivation -[motivation]: #motivation -The REF needs **one robust way to run diagnostics** that works across a range of deployments — +The REF needs **one robust way to run diagnostics** that works across a range of deployments - a laptop, a Celery cluster, and an HPC batch scheduler at a modelling centre. -"Robust" has a concrete meaning here: -an execution survives coordinator and worker crashes cleanly, +"Robust" has a concrete meaning here: an execution survives coordinator and worker crashes cleanly, and an operator can always see *what a run did and why it failed* from durable state, not from a live process. Today none of that holds, for four reasons. @@ -44,7 +40,7 @@ A single happy-path run touches eight files in two packages: and the `CondaCommandError` branch. - `climate_ref_core/diagnostics.py` holds `ExecutionResult.build_from_output_bundle`, a static factory that writes three JSON files to disk as a side effect of constructing the result. -- `climate_ref/executor/result_handling.py` copies scratch→results, ingests inside a nested transaction, +- `climate_ref/executor/result_handling.py` copies scratch->results, ingests inside a nested transaction, and toggles the `dirty` flag in three branches. - One of `synchronous.py` / `local.py` / `climate_ref_celery/executor.py` re-implements the reattach / commit / mark dance. @@ -80,17 +76,11 @@ It also blocks any future where a directory is copied between deployments. ## There is no seam for HPC -Modelling centres run on SLURM and PBS. +Modelling centres often use SLURM and PBS for task scheduling. There is nowhere for a diagnostic to declare it needs 16 GB or 8 hours, and nothing a batch adapter could translate into `sbatch --mem --time` or `qsub -l`. -Each new backend would otherwise re-implement the whole lifecycle inline. - -**Primary driver: regression output and crash-robustness.** -HPC reach and cross-deployment portability are downstream beneficiaries of the same on-disk format, -not the justification on their own. # Reference-level explanation -[reference-level-explanation]: #reference-level-explanation ## Two verbs: `run` and `ingest` @@ -104,6 +94,9 @@ Every execution method funnels through the same `ingest`. Live completion, re-ingestion after a logic change, and (future) cross-deployment import are the *same code path* reading the *same on-disk format*. +The concept of `ingest` isn't new in the REF. +Datasets are currently ingested to extract information from the dataset and put it in the database. + ## The execution manifest (`execution.json`) A REF-owned, typed, versioned file written to the execution directory. @@ -127,7 +120,6 @@ class DatasetRef: class Outcome: # phase 2 — written when the diagnostic ends status: ManifestStatus exit_code: int | None - started_at: datetime finished_at: datetime output_bundle: str | None # output.json — CMEC output bundle (CMECOutput) metric_bundle: str | None # diagnostic.json — CMEC diagnostic bundle (CMECMetric) @@ -138,7 +130,7 @@ class Outcome: # phase 2 — written when the diagnostic ends @attrs.frozen class ExecutionManifest: schema_version: int - # --- identity: phase 1, written before the diagnostic runs --- + # --- identity: phase 1, written when the run starts --- diagnostic_slug: str diagnostic_version: int provider_slug: str @@ -146,28 +138,38 @@ class ExecutionManifest: dataset_hash: str datasets: list[DatasetRef] selectors: dict[str, str] + started_at: datetime # when the worker began running the diagnostic + deadline: datetime # drop-dead: complete-by = started_at + resources.wall_clock # --- outcome: phase 2 --- outcome: Outcome | None = None # None ⇒ incomplete (in-flight, or hard-killed before writing) ``` **Two-phase write.** -The identity half is the *first* file written to the directory, before the diagnostic runs, -with `outcome = None`. +The identity half is the *first* file written to the directory, the moment the worker starts — +before the diagnostic runs, with `outcome = None`. +It stamps `started_at` and `deadline` (the drop-dead time, `started_at + wall_clock`), +so the file itself declares when the execution must be complete by. The outcome half is the *last* thing the worker writes, **even on a recoverable failure**. Therefore a manifest with `outcome is None` means *incomplete*: either still in flight, or hard-killed (OOM, `SIGKILL`, node loss) before it could write its epitaph. This single invariant is what makes crashes diagnosable from disk alone. +**Drop-dead deadline.** +Because `deadline` is recorded on disk, an operator — or a reconciling coordinator with a weak transport — +can flag an overdue execution (`now > deadline` and `outcome is None`) without querying the scheduler. +It is anchored at `started_at`, never at submit time, so a job that waited hours in a SLURM queue +is judged from when it actually began. + **CMEC alignment.** `provenance` reuses the CMEC `OutputProvenance` shape (`environment`, `modeldata`, `obsdata`, `log`), so the manifest captures inputs and environment in standard terms. `datasets` references inputs by `instance_id` (the CMEC/ESGF dataset identity), which is what makes a directory portable: it names its inputs without carrying them. -## The `Transport` port — liveness, not results +## The `Transport` port A transport launches an execution and answers one question: *is it alive?* -It never carries results. +It never carries results as that greatly simplifies the required functionality. ```python class Status(enum.Enum): @@ -185,7 +187,7 @@ class Transport(Protocol): `dispatch` returns a durable handle (pid, Celery task id, SLURM job id) that the lifecycle persists on the `Execution` row in the same commit, -so a freshly restarted coordinator can still ask about the job. +so a freshly restarted coordinator can still ask about the job that may be in flight. ```python @attrs.frozen @@ -200,9 +202,9 @@ class ExecutionEnvelope: execution_id: int definition: ExecutionDefinition resources: ResourceHint - # wall_clock travels inside `resources`; the deadline is computed by the transport - # (or the scheduler) at job *start*, never at submit time — a queued SLURM/PBS job - # may wait hours before it begins. + # wall_clock travels inside `resources`; the worker stamps the absolute deadline into the + # manifest at job *start* (started_at + wall_clock), never at submit — a queued SLURM/PBS + # job may wait hours before it begins. ``` The default `wall_clock` is 6 h to match today's `LocalExecutor` budget, @@ -210,7 +212,7 @@ so diagnostics that run for hours without declaring `resources` keep their curre ## The lifecycle and the drain loop -`ExecutionLifecycle` is one deep module that owns submission, the drain loop, retry/dirty, and ingest. +`ExecutionLifecycle` becomes a deep module that owns submission, the drain loop, retry/dirty, and ingest. It is stateless with respect to in-flight work: the set of executions to wait on comes from the database (`successful IS NULL`), not from memory. This is what makes it re-runnable and crash-safe — `drain` after a restart simply resumes. @@ -219,7 +221,7 @@ This is what makes it re-runnable and crash-safe — `drain` after a restart sim for group, datasets, definition in planned_executions: execution = Execution(execution_group=group, dataset_hash=datasets.hash, provider_version=definition.diagnostic.provider.version) - lifecycle.submit(execution, definition) # writes phase-1 manifest, dispatch(), persists handle + lifecycle.submit(execution, definition) # dispatch(), persist handle; worker writes phase-1 at start lifecycle.drain(timeout=timeout) ``` @@ -232,13 +234,13 @@ a present manifest outcome wins over liveness.** for ex in db.in_flight(submitted): # successful IS NULL match transport.status({ex.id: ex.handle})[ex.id]: case Status.RUNNING: - if past_deadline(ex): - transport.cancel(ex.handle) # ProcessPool only; schedulers self-kill + if now > read_manifest(ex.dir).deadline: # drop-dead from phase-1 manifest + transport.cancel(ex.handle) # ProcessPool only; schedulers self-kill # else: keep waiting case Status.EXITED: # clean exit — ask the disk what it meant outcome = read_manifest(ex.dir).outcome - if outcome is None: # exited 0 but wrote no outcome → incomplete + if outcome is None: # exited 0 but wrote no outcome -> incomplete mark_retryable(ex) elif outcome.status is SUCCESS: ingest_from_disk(ex.dir) # CV-validated; same path for every transport @@ -265,8 +267,8 @@ Transport liveness only adjudicates the *no-manifest-outcome* case. ```mermaid stateDiagram-v2 [*] --> Planned: solver creates Execution (successful=NULL) - Planned --> Dispatched: transport.dispatch()
phase-1 manifest written - Dispatched --> Running: worker starts + Planned --> Dispatched: transport.dispatch(), persist handle + Dispatched --> Running: worker starts, writes phase-1
(started_at, deadline) Running --> Running: status=RUNNING, within deadline Running --> Gone: status=RUNNING, past deadline
transport.cancel() @@ -281,12 +283,12 @@ stateDiagram-v2 poll --> Retryable: manifest.outcome = RECOVERABLE poll --> Retryable: manifest.outcome absent
(incomplete / OOM / SIGKILL) - Ingesting --> Successful: bundles→DB, CV-validated
successful=True, group.dirty=False + Ingesting --> Successful: bundles->DB, CV-validated
successful=True, group.dirty=False Ingesting --> Retryable: ingest error Successful --> [*] Failed --> [*]: successful=False, group.dirty=False - Retryable --> [*]: successful=False, group.dirty=True
→ new Execution next solve + Retryable --> [*]: successful=False, group.dirty=True
-> new Execution next solve note right of Dispatched External/Null transport (future): @@ -302,27 +304,30 @@ A row has three terminal outcomes: `Execution`). A retry never re-runs the same row. -## Retry and dirty: two clearly separated decisions +## Execution Status -Classification happens in two places, and only two. +Classification of the status of an execution happens in two places. -**Worker side — exception → outcome.** +**Worker side — exception -> outcome.** The worker maps the raised exception (or a clean return) onto `ManifestStatus` and writes it to the manifest. This is the one place that knows exception *types*, because the exception is only ever raised on the worker. It consolidates the logic spread today across `_is_system_error` and the `CondaCommandError` branch: ```python -SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) # → RECOVERABLE -NON_RETRYABLE = (CondaCommandError,) # → FAILED +SYSTEM_ERRORS = (OSError, MemoryError, SystemExit, KeyboardInterrupt) # -> RECOVERABLE +NON_RETRYABLE = (CondaCommandError,) # -> FAILED ``` -**Coordinator side — outcome → decision.** -The drain loop maps the already-classified outcome (plus transport liveness) onto `SUCCESS | retry | give up`. +**Coordinator side — outcome -> decision.** +The drain loop maps the outcome in the manifest (plus transport liveness) onto `SUCCESS | retry | give up`. It never inspects exception types, so it behaves identically for every transport, including remote ones where the exception object never comes back. -The `dirty` flag follows directly: -`SUCCESS` and `FAILED` clear it; `RECOVERABLE` and incomplete leave it set so the next solve retries. +The `dirty` flag for an execution group is then updated: + +- `SUCCESS` and `FAILED` clear the dirty state. +- `RECOVERABLE` and incomplete leave it set so the next solve retries the execution. + User-initiated `dirty = True` resets in `cli/executions.py` (rerun / reset) stay separate by design. ## Ingest from disk @@ -336,8 +341,7 @@ It is the only path that mutates result tables, and it is: Each solve mints a new `execution_id` per attempt, so retries never collide; the conflict clause only guards re-ingestion of the *same* execution during a replay or eager/poll race. - **CV-validated.** The CMEC diagnostic and series bundles are validated against the controlled vocabulary - at ingest, exactly as today. (This RFC keeps the current `logger.warning` behaviour; - making CV a hard failure is explicitly out of scope — see below.) + at ingest, exactly as today. Validations errors result in warnings rather than hard failures for now. - **Portable.** Result directories hold results only, with relative internal paths. Inputs are referenced by `instance_id` and resolved against the importing deployment's dataset index; a directory never carries or hardcodes input data. @@ -359,7 +363,7 @@ Each is a launcher plus a status query — typically under a few hundred lines. | `PbsTransport` | new | `qsub -l` | `qstat` | scheduler (walltime) | `ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side -(via `cancel` → `future.cancel`), because there is no external scheduler to do it; +(via `cancel` -> `future.cancel`), because there is no external scheduler to do it; this preserves today's `LocalExecutor` per-task timeout. **Celery callback demotes to a latency optimisation.** @@ -388,7 +392,6 @@ The cutover replaces the seam the solver calls, so it lands in stages, each inde known names map to the new transports, custom dotted paths warn, and a config-migration note ships with step 5. # Drawbacks -[drawbacks]: #drawbacks - **The migration is wide.** The seam the solver calls is replaced, touching `climate-ref-core`, `climate-ref`, and `climate-ref-celery`. The staging above bounds the risk but does not remove it. @@ -406,7 +409,6 @@ known names map to the new transports, custom dotted paths warn, and a config-mi and by the manifest making the first failed run diagnosable. # Rationale and alternatives -[rationale-and-alternatives]: #rationale-and-alternatives The central choice is **where the source of truth lives**. This design puts it on disk (the manifest), with the transport reduced to a liveness oracle. @@ -435,7 +437,6 @@ results stay anonymous on disk, so regression comparison and replay stay impossi crashes remain undiagnosable from durable state. # Prior art -[prior-art]: #prior-art - **CMEC / EMDS.** The output and diagnostic bundles, provenance, and dimensions are CMEC concepts; the manifest extends CMEC provenance rather than inventing a parallel vocabulary. @@ -451,7 +452,6 @@ crashes remain undiagnosable from durable state. reading a manifest outcome from disk before deciding to re-run. # Unresolved questions -[unresolved-questions]: #unresolved-questions To resolve through the RFC process: @@ -483,7 +483,6 @@ Out of scope for this RFC (addressed independently later): - **Making CV validation a hard failure.** # Future possibilities -[future-possibilities]: #future-possibilities - **K8s transport** — a pod per execution with `resources` and an active deadline; `status` from the pod phase. - **External / null transport** — the lowest-friction modelling-centre adoption path: From d2336a4567086dfc4352ca3788f7d3a1c68904c7 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 10:37:01 +1000 Subject: [PATCH 10/12] rfc(execution-lifecycle): one HpcTransport for Slurm and PBS The existing HPCExecutor already handles both schedulers via parsl (SlurmProvider / SmartPBSProvider behind one HighThroughputExecutor), so the seam needs one HpcTransport, not separate Slurm/Pbs transports. - Collapse the two HPC table rows into one HpcTransport (parsl-backed, scheduler by config); fix the dispatch/status columns (parsl, not raw sbatch/qsub/sacct/qstat). - Note the parsl pilot model: scheduler walltime bounds the block, not the individual execution; per-execution wall_clock is enforced in the pilot. - Migration step 4 becomes 'port HPCExecutor', not 'add new transports'. - Unresolved question reframed to parsl future/pilot state mapping. --- text/0003-execution-lifecycle.md | 56 +++++++++++++++++++------------- 1 file changed, 33 insertions(+), 23 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index a832dcb..5fd1146 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -212,33 +212,37 @@ so diagnostics that run for hours without declaring `resources` keep their curre ## The lifecycle and the drain loop -`ExecutionLifecycle` becomes a deep module that owns submission, the drain loop, retry/dirty, and ingest. +`ExecutionLifecycle` owns the submission of new executions, the drain loop, retry/dirty, and ingest. It is stateless with respect to in-flight work: the set of executions to wait on comes from the database (`successful IS NULL`), not from memory. -This is what makes it re-runnable and crash-safe — `drain` after a restart simply resumes. +This is what makes it re-runnable and crash-safe. Running `drain` after a restart simply resumes. ```python for group, datasets, definition in planned_executions: - execution = Execution(execution_group=group, dataset_hash=datasets.hash, - provider_version=definition.diagnostic.provider.version) - lifecycle.submit(execution, definition) # dispatch(), persist handle; worker writes phase-1 at start + execution = Execution( + execution_group=group, + dataset_hash=datasets.hash, + provider_version=definition.diagnostic.provider.version + ) + # dispatch(), persist handle in db, worker writes phase-1 at start of execution + lifecycle.submit(execution, definition) lifecycle.drain(timeout=timeout) ``` -`drain` loops over the in-flight set. -**The transport decides liveness; the manifest decides outcome; -a present manifest outcome wins over liveness.** +`drain` loops over the in-flight executions as queried from the database. +The transport decides liveness of the exeuction and the manifest determines the outcome. +If a outcome is present in a manifest, it wins over the liveness check. ```python for ex in db.in_flight(submitted): # successful IS NULL match transport.status({ex.id: ex.handle})[ex.id]: case Status.RUNNING: - if now > read_manifest(ex.dir).deadline: # drop-dead from phase-1 manifest - transport.cancel(ex.handle) # ProcessPool only; schedulers self-kill + if now > read_manifest(ex.dir).deadline: # deadline check + transport.cancel(ex.handle) # ProcessPool only, other schedulers use native deadlines # else: keep waiting - case Status.EXITED: # clean exit — ask the disk what it meant + case Status.EXITED: # clean exit - check the outcome from disk outcome = read_manifest(ex.dir).outcome if outcome is None: # exited 0 but wrote no outcome -> incomplete mark_retryable(ex) @@ -252,7 +256,7 @@ for ex in db.in_flight(submitted): # successful IS NULL case Status.GONE: # no clean exit outcome = read_manifest(ex.dir).outcome # worker may have logged a recoverable fail first if outcome is None: - mark_retryable(ex) # hard crash: OOM / SIGKILL / node loss + mark_retryable(ex) # hard crash: OOM / SIGKILL / node loss / queue loss else: apply(outcome) # manifest wins (e.g. SUCCESS written then node reaped) ``` @@ -351,21 +355,26 @@ This is also what enables the existing `reingest` use case to work *without* the ## Transports -All current executors are replaced, and the two HPC backends are added, as part of this work. -Each is a launcher plus a status query — typically under a few hundred lines. +All current executors are replaced, including the parsl-based HPC backend — +one transport covers both Slurm and PBS, selected by config, as `HPCExecutor` does today. +Each is a launcher plus a status query. -| Transport | Replaces / adds | `dispatch` | `status` source | deadline enforced by | -|-----------------------|------------------------|-----------------------|-------------------------|---------------------------| -| `InMemoryTransport` | `SynchronousExecutor` | run inline | always `EXITED` | n/a (inline) | -| `ProcessPoolTransport`| `LocalExecutor` | `pool.submit` | `future` state | coordinator (`cancel`) | -| `CeleryTransport` | `CeleryExecutor` | `app.send_task` | `AsyncResult` | broker (`task_time_limit`)| -| `SlurmTransport` | new | `sbatch --mem --time` | `sacct` / `squeue` | scheduler (`--time`) | -| `PbsTransport` | new | `qsub -l` | `qstat` | scheduler (walltime) | +| Transport | Replaces | `dispatch` | `status` source | deadline enforced by | +|-----------------------|------------------------|-----------------------------------|---------------------|---------------------------| +| `InMemoryTransport` | `SynchronousExecutor` | run inline | always `EXITED` | n/a (inline) | +| `ProcessPoolTransport`| `LocalExecutor` | `pool.submit` | `future` state | coordinator (`cancel`) | +| `CeleryTransport` | `CeleryExecutor` | `app.send_task` | `AsyncResult` | broker (`task_time_limit`)| +| `HpcTransport` | `HPCExecutor` (parsl) | parsl provider submit (Slurm/PBS) | parsl future state | block walltime + pilot | `ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side (via `cancel` -> `future.cancel`), because there is no external scheduler to do it; this preserves today's `LocalExecutor` per-task timeout. +`HpcTransport` wraps parsl rather than calling `sbatch` / `qsub` directly. +parsl requests scheduler *blocks* (pilot jobs) and runs many executions inside each one, +so the scheduler walltime bounds the allocation, not the individual execution; +per-execution `wall_clock` is enforced inside the pilot, as `HPCExecutor.join` does today. + **Celery callback demotes to a latency optimisation.** Today `CeleryExecutor` attaches `link` / `link_error` callbacks (`handle_result` / `handle_failure`) that ingest worker-side. @@ -385,7 +394,7 @@ The cutover replaces the seam the solver calls, so it lands in stages, each inde 2. **Seam + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`, and `InMemoryTransport` + `ProcessPoolTransport`; route the solver through the lifecycle. 3. **Celery.** Port to `CeleryTransport`; convert `link`/`link_error` to the optional eager-ingest hook. -4. **HPC.** Add `SlurmTransport` and `PbsTransport`. +4. **HPC.** Port `HPCExecutor` to `HpcTransport` (Slurm + PBS via parsl). 5. **Cleanup.** Delete the `Executor` protocol and the three old executors once nothing imports them. `import_executor_cls` (which resolves an executor from a dotted path in `Config`) gains a deprecation cycle: @@ -463,7 +472,8 @@ To resolve through implementation: - Exact upsert unique constraints and any Alembic migration (`output_fragment`, `provider_version`, and `diagnostic_version` already exist). -- `sacct` / `squeue` / `qstat` polling cadence and output parsing for the HPC transports. +- Mapping parsl future / pilot-block state to `RUNNING | EXITED | GONE`, and enforcing the + per-execution deadline inside the pilot. - The deprecation mechanics for `import_executor_cls` and the executor dotted-path config key. Out of scope for this RFC (addressed independently later): From 6a23e118ce1ed54b90037850962328b6f764a8f7 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 10:49:12 +1000 Subject: [PATCH 11/12] rfc(execution-lifecycle): collapse EXITED/GONE drain branches MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit EXITED and GONE already took identical action — a present manifest outcome is applied either way, an absent one is retryable either way. Transport status only separates RUNNING from done; the manifest decides the rest. - Merge the two branches into 'case Status.EXITED | Status.GONE'; GONE keeps a distinct warning log (transport lost the job) but no distinct control flow. - Add the atomic outcome-write invariant (temp + fsync + rename) so a GONE job can never observe a half-written outcome — this is what lets a lost job be trusted exactly as much as a clean exit. - Rewrite the robustness paragraph to match. --- text/0003-execution-lifecycle.md | 93 +++++++++++++------------------- 1 file changed, 36 insertions(+), 57 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index 5fd1146..fca583d 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -154,6 +154,11 @@ Therefore a manifest with `outcome is None` means *incomplete*: either still in flight, or hard-killed (OOM, `SIGKILL`, node loss) before it could write its epitaph. This single invariant is what makes crashes diagnosable from disk alone. +The outcome must be written **atomically** (write a temp file, `fsync`, then `rename`), +so a worker killed mid-write can never leave a half-written outcome that reads as present-but-corrupt. +A present outcome is therefore always complete, which is what lets a lost (`GONE`) job be trusted +exactly as much as a cleanly `EXITED` one. + **Drop-dead deadline.** Because `deadline` is recorded on disk, an operator — or a reconciling coordinator with a weak transport — can flag an overdue execution (`now > deadline` and `outcome is None`) without querying the scheduler. @@ -242,9 +247,11 @@ for ex in db.in_flight(submitted): # successful IS NULL transport.cancel(ex.handle) # ProcessPool only, other schedulers use native deadlines # else: keep waiting - case Status.EXITED: # clean exit - check the outcome from disk + case Status.EXITED | Status.GONE as st: # done - the manifest, not the transport, decides + if st is Status.GONE: # transport lost the job; the manifest still rules + logger.warning(f"{ex} vanished without a clean exit; trusting the on-disk manifest") outcome = read_manifest(ex.dir).outcome - if outcome is None: # exited 0 but wrote no outcome -> incomplete + if outcome is None: # clean exit w/o epitaph, or hard crash (OOM / SIGKILL) mark_retryable(ex) elif outcome.status is SUCCESS: ingest_from_disk(ex.dir) # CV-validated; same path for every transport @@ -252,19 +259,14 @@ for ex in db.in_flight(submitted): # successful IS NULL mark_failed(ex) # diagnostic error, give up else: # RECOVERABLE mark_retryable(ex) - - case Status.GONE: # no clean exit - outcome = read_manifest(ex.dir).outcome # worker may have logged a recoverable fail first - if outcome is None: - mark_retryable(ex) # hard crash: OOM / SIGKILL / node loss / queue loss - else: - apply(outcome) # manifest wins (e.g. SUCCESS written then node reaped) ``` -The decisive robustness rule is the last branch: -if the worker wrote `SUCCESS` and the node then died before the scheduler reaped it, -the transport reports `GONE` but the manifest reports success — and the manifest wins. -Transport liveness only adjudicates the *no-manifest-outcome* case. +`EXITED` and `GONE` therefore take the *same* action — a present manifest outcome is applied either way, +and an absent one is retryable either way. +Transport status only separates `RUNNING` from done; the manifest decides the rest. +`GONE` differs only in that it emits a warning (the transport lost track of the job), +which matters for the case where the worker wrote `SUCCESS` and the node then died before being reaped: +the transport reports `GONE`, but the manifest still wins and the result is ingested. ## Execution states @@ -364,34 +366,34 @@ Each is a launcher plus a status query. | `InMemoryTransport` | `SynchronousExecutor` | run inline | always `EXITED` | n/a (inline) | | `ProcessPoolTransport`| `LocalExecutor` | `pool.submit` | `future` state | coordinator (`cancel`) | | `CeleryTransport` | `CeleryExecutor` | `app.send_task` | `AsyncResult` | broker (`task_time_limit`)| -| `HpcTransport` | `HPCExecutor` (parsl) | parsl provider submit (Slurm/PBS) | parsl future state | block walltime + pilot | +| `HPCTransport` | `HPCExecutor` (parsl) | parsl provider submit (Slurm/PBS) | parsl future state | block walltime + pilot | `ProcessPoolTransport` is the only transport that enforces the deadline coordinator-side (via `cancel` -> `future.cancel`), because there is no external scheduler to do it; this preserves today's `LocalExecutor` per-task timeout. -`HpcTransport` wraps parsl rather than calling `sbatch` / `qsub` directly. +`HPCTransport` wraps parsl rather than calling `sbatch` / `qsub` directly. parsl requests scheduler *blocks* (pilot jobs) and runs many executions inside each one, -so the scheduler walltime bounds the allocation, not the individual execution; +so the scheduler walltime bounds the allocation, not the individual execution (to confirm); per-execution `wall_clock` is enforced inside the pilot, as `HPCExecutor.join` does today. -**Celery callback demotes to a latency optimisation.** +**Celery callback.** Today `CeleryExecutor` attaches `link` / `link_error` callbacks (`handle_result` / `handle_failure`) that ingest worker-side. +In production, this requires an additional `orchastrator` worker. Under this design the worker writes its bundle and manifest to disk like every other transport, and the coordinator (or a resident orchestrator running `drain`) ingests from disk. -The `link` callback may be kept as an *eager-ingest hook* — "ingest this directory now" rather than at the next -poll — but it is no longer load-bearing: if the callback is lost, the disk scan recovers the result. -This dissolves the previous push-vs-pull asymmetry; both modes run the same `ingest_from_disk`. +The `link` callback may be kept as an *eager-ingest hook* — "ingest this directory now" rather than at the next poll. +It's no longer load-bearing as if the callback is lost, the disk scan recovers the result. +This dissolves the previous push-vs-pull asymmetry and both modes run the same `ingest_from_disk`. ## Migration plan The cutover replaces the seam the solver calls, so it lands in stages, each independently shippable: -1. **Manifest, additively.** Current executors start writing `execution.json` (both phases). - No behaviour change; result directories become self-describing immediately — - this alone unlocks regression comparison. -2. **Seam + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`, +1. **ExecutionManifest, additively.** Current executors start writing `execution.json` (both phases). + No behaviour change; result directories become self-describing immediately — this alone unlocks regression comparison. +2. **New classes + local transports.** Introduce `ExecutionLifecycle`, the `Transport` port, `ingest_from_disk`, and `InMemoryTransport` + `ProcessPoolTransport`; route the solver through the lifecycle. 3. **Celery.** Port to `CeleryTransport`; convert `link`/`link_error` to the optional eager-ingest hook. 4. **HPC.** Port `HPCExecutor` to `HpcTransport` (Slurm + PBS via parsl). @@ -457,44 +459,26 @@ crashes remain undiagnosable from durable state. reconciles task state from the database — essentially `dispatch` plus `status`-driven `drain`. - **Celery.** `task_time_limit` and queue routing; `ResourceHint.queue` maps onto Celery queues and `wall_clock` onto `task_time_limit`. Today's `CeleryExecutor` uses neither. -- **Make / build systems.** A target is rebuilt unless its output exists and is current — the same logic as - reading a manifest outcome from disk before deciding to re-run. +- **Make / build systems.** A target is rebuilt unless its output exists and is current. + The same logic as reading a manifest outcome from disk before deciding to re-run. # Unresolved questions -To resolve through the RFC process: +Outstanding questions that are not yet resolved: - The exact `ExecutionManifest` field set and `schema_version` stability guarantees. - Whether the Celery eager-ingest hook (`link`) is retained or Celery polls purely via `status`. -- Where the controlled vocabulary comes from in tests (a permissive fixture vs. the project CV). - -To resolve through implementation: - - Exact upsert unique constraints and any Alembic migration (`output_fragment`, `provider_version`, and `diagnostic_version` already exist). -- Mapping parsl future / pilot-block state to `RUNNING | EXITED | GONE`, and enforcing the - per-execution deadline inside the pilot. +- Mapping parsl future state to `RUNNING | EXITED | GONE`, + and enforcing the per-execution deadline. - The deprecation mechanics for `import_executor_cls` and the executor dotted-path config key. -Out of scope for this RFC (addressed independently later): - -- **K8s transport.** -- **External / "null" execution** — a directory produced entirely outside the REF, then ingested. - The format already supports it; the operational contract does not yet. -- **Cross-deployment portability beyond shared datasets.** - Tier 2 (the importing deployment already indexes the same ESGF datasets, resolved by `instance_id`) - is enabled by the manifest's `datasets` field. - Tier 3 (datasets *not* present locally) needs dataset support for remote / not-yet-local files - and is a separate RFC. The manifest carries the cross-link information (`instance_id`) but not the files. -- **External-execution version policy.** The default is ingest-as-is: - a result lands under its declared `diagnostic_version`, surfaces only if it equals the local - `promoted_version`, is CV-validated, and is never rejected on version. Hardening is deferred. -- **Adaptive resource provider and per-execution telemetry columns.** -- **Making CV validation a hard failure.** - # Future possibilities - **K8s transport** — a pod per execution with `resources` and an active deadline; `status` from the pod phase. + The jobs are long enough running that the start up cost of a container may be negligible. + This may even deprecate Celery. - **External / null transport** — the lowest-friction modelling-centre adoption path: a centre runs the diagnostic in its own pipeline, writes `execution.json` + bundles to the expected layout, and `ref ingest ` loads it. The manifest becomes a public, versioned contract. @@ -503,11 +487,6 @@ Out of scope for this RFC (addressed independently later): - **Adaptive `ResourceProvider`** — capture per-execution telemetry (peak RSS, duration) into the manifest and the database, then suggest `ResourceHint` memory at, say, p95 × 1.2 over a rolling window. - **Cross-restart recovery** — already largely free: because `drain` reconstructs the in-flight set from the - database and `status` queries durable scheduler state, a coordinator that was down while a SLURM job finished - picks it up on the next `drain`. -- **Per-provider retry policies** — the single worker-side classifier becomes a per-provider mapping - when concrete demand appears. - -None of these justify the RFC on their own; -they show the manifest-plus-liveness seam is shaped for where the project is heading -without pre-building any of them. + database and `status` queries durable scheduler state, + a coordinator that was down while a SLURM job finished picks it up on the next `drain`. +- **Per-provider retry policies** — the single worker-side classifier becomes a per-provider mapping when concrete demand appears. From 5b7e53ecd5d943cdc754907893b90ab3cdc8cc34 Mon Sep 17 00:00:00 2001 From: Jared Lewis Date: Fri, 5 Jun 2026 11:41:50 +1000 Subject: [PATCH 12/12] rfc(execution-lifecycle): align coordinator outcome->decision with collapsed drain The paragraph claimed the decision used '(plus transport liveness)' and listed 'SUCCESS | retry | give up', which contradicted the collapsed drain loop and the robustness line (liveness only separates RUNNING from done). - Spell out the full mapping: SUCCESS -> ingest, RECOVERABLE -> retry, FAILED -> give up, finished-with-no-outcome -> retry. - State that liveness is used only to know the execution finished, not to classify it. --- text/0003-execution-lifecycle.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/text/0003-execution-lifecycle.md b/text/0003-execution-lifecycle.md index fca583d..1721e01 100644 --- a/text/0003-execution-lifecycle.md +++ b/text/0003-execution-lifecycle.md @@ -325,8 +325,11 @@ NON_RETRYABLE = (CondaCommandError,) # -> FAIL ``` **Coordinator side — outcome -> decision.** -The drain loop maps the outcome in the manifest (plus transport liveness) onto `SUCCESS | retry | give up`. -It never inspects exception types, so it behaves identically for every transport, +The drain loop maps the manifest outcome onto ingest / retry / give up: +`SUCCESS` -> ingest, `RECOVERABLE` -> retry, `FAILED` -> give up, +and a finished execution with no outcome -> retry. +It never inspects exception types, and it relies on transport liveness only to know the execution has +finished — not to classify it — so it behaves identically for every transport, including remote ones where the exception object never comes back. The `dirty` flag for an execution group is then updated: