rfc: execution lifecycle consolidation by lewisjared · Pull Request #3 · Climate-REF/rfcs

lewisjared · 2026-05-12T07:35:34Z

Summary

Replace the shallow Executor protocol and its executors (SynchronousExecutor, LocalExecutor, CeleryExecutor, and the parsl-based HPCExecutor) with one execution lifecycle built on two ideas:

a self-describing, on-disk execution manifest (execution.json) that records what an execution is and how it ended
a thin Transport port that only launches work and reports liveness (RUNNING | EXITED | GONE).
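
To make the second idea concrete, here is a minimal sketch of what a launch-plus-liveness port could look like. Only the three status names come from this proposal; the method names (`launch`, `status`), their signatures, and the toy `ThreadTransport` are assumptions for illustration, not the RFC's actual API.

```python
from __future__ import annotations

import enum
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Protocol


class Status(enum.Enum):
    RUNNING = "running"  # the transport still sees the job
    EXITED = "exited"    # the job has finished, cleanly or not
    GONE = "gone"        # the transport has no record of the job


class Transport(Protocol):
    """Hypothetical port: launch work and report liveness, nothing else."""

    def launch(self, execution_id: str, work: Callable[[], None]) -> None: ...

    def status(self, execution_id: str) -> Status: ...


class ThreadTransport:
    """Toy transport backed by a thread pool (illustration only)."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._futures: dict[str, Future[None]] = {}

    def launch(self, execution_id: str, work: Callable[[], None]) -> None:
        # No result is returned to the caller: outcomes live on disk.
        self._futures[execution_id] = self._pool.submit(work)

    def status(self, execution_id: str) -> Status:
        future = self._futures.get(execution_id)
        if future is None:
            return Status.GONE
        return Status.EXITED if future.done() else Status.RUNNING
```

Note that `status` never says whether the work succeeded. In this sketch that question can only be answered by reading whatever the work wrote to disk.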

A key tenet of the existing architecture is that the workers don't control the database. They are only responsible for running executions.
Currently the executors inform the orchestrator of the result of an execution.
This is stateful, which makes it difficult to run asynchronously and requires extra machinery to work with Celery (an extra worker and linked tasks).

Every backend completes by writing the REF content to a directory on disk, and a single, transport-agnostic ingest step loads the manifest and bundles the results into the database. This makes executions crash-robust and replayable, makes diagnostic output self-describing so it can be compared across runs and versions (regression), and reduces every backend to a launcher plus a status query.

Another benefit is that users could (in theory) rsync the content and then re-ingest it into their own copy of the REF.
This isn't an explicit target here, but I think the on-disk format goes some way towards it.

Current Design

moves the source of truth to an on-disk manifest that describes the execution (identity written at the start of the execution, outcome written on success or failure);
narrows the transport to launch + liveness — the manifest decides the outcome, the transport decides liveness, and a present manifest outcome wins over liveness;
splits the lifecycle into two verbs — run and ingest — with ingest transport-agnostic, idempotent, and replayable;
ports all executors: InMemoryTransport, ProcessPoolTransport, CeleryTransport, and one HPCTransport covering both Slurm and PBS via parsl (as HPCExecutor does today). Only K8s is deferred;
reframes the primary driver as regression output + crash-robustness, with HPC reach and cross-deployment portability as downstream beneficiaries of the same on-disk format.
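
As a rough illustration of the manifest idea in the list above, the sketch below writes an identity record before the work runs and fills in the outcome afterwards. The field names (`execution_id`, `diagnostic`, `outcome`) and the helper names are assumptions; the real `execution.json` schema is defined by the RFC, not here.

```python
from __future__ import annotations

import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

MANIFEST_NAME = "execution.json"


def write_identity(directory: Path, execution_id: str, diagnostic: str) -> None:
    """Record what the execution is, before any work runs."""
    manifest = {
        "execution_id": execution_id,
        "diagnostic": diagnostic,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "outcome": None,  # absent outcome means the execution is incomplete
    }
    (directory / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2))


def write_outcome(directory: Path, outcome: str) -> None:
    """Record how the execution ended, as the last step of the run.

    This sketch rewrites the file in place; a crash-safe writer would need
    an atomic replace instead.
    """
    path = directory / MANIFEST_NAME
    manifest = json.loads(path.read_text())
    manifest["outcome"] = outcome
    path.write_text(json.dumps(manifest, indent=2))


def read_outcome(directory: Path) -> str | None:
    """Return the recorded outcome, or None if the run never finished."""
    return json.loads((directory / MANIFEST_NAME).read_text())["outcome"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        write_identity(out_dir, "exec-1", "example-diagnostic")
        print(read_outcome(out_dir))  # still incomplete at this point
        write_outcome(out_dir, "SUCCESS")
        print(read_outcome(out_dir))
```

Because the directory describes itself, an ingest step only needs the path; it does not need to know which backend produced it.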

Deferred (named, not designed here)

potential K8s and external / "null" transports
cross-deployment portability (remote-file datasets; its own RFC)
external-execution version policy (ingest-as-is for now)
diagnostic-specific resourcing
per-execution telemetry columns.

Consolidate diagnostic execution lifecycle (allocation, dispatch, run, classify, publish, ingest, finalise) into one deep module behind a single Transport port. Surface ResourceHint on Diagnostic so providers can declare memory/CPU/wall-clock once. Capture per-execution Telemetry to enable future adaptive scheduling without a schema change.

Mermaid classDiagram is idiomatic for the Protocol + adapters pattern. LR layout fits PR width; method signatures stay legible without HTML hacks.

Compress prose, drop per-design subsections in Rationale (table tells the story), tighten Drawbacks/Prior art/Unresolved/Future to bullets. All three diagrams kept; technical substance preserved.

Correct factual and overstated claims, and surface design risks the draft glossed:
- Drop the false "not picklable" claim: ExecutionResult already pickles across the ProcessPoolExecutor today; reframe as construction entangled with disk I/O (a testability cost, not a correctness one).
- Split retry classification into worker-side (exception -> outcome, consolidating _is_system_error + CondaCommandError) and coordinator-side (outcome -> decision); drop the misleading "five sites collapse to one" and the unused policy constants.
- Default wall_clock 2h -> 6h to preserve current LocalExecutor budget; fix the mitigation text. Avoids a silent kill-at-2h regression.
- Move deadline computation to job start (transport-side); anchoring at submit time would expire queued SLURM/PBS jobs before they run.
- Expand the Celery push->pull trade-off (loss of fire-and-forget ingestion) from one bullet to an explicit, weighted drawback.
- Add a staged migration plan and a deprecation cycle for the public Executor protocol / import_executor_cls config key.
- Note ResourceHint must live in climate_ref_core (layering).
- Clarify ON CONFLICT DO NOTHING is safe because each retry mints a new execution_id.
- Remove dry_run from the lifecycle surface (planning, not dispatch).
- Soften unverifiable LOC estimates; scope dirty-flag rule to the result path (CLI resets stay separate).
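
The ON CONFLICT DO NOTHING point can be illustrated with a small SQLite sketch. The table layout and helper names are hypothetical and stand in for the real REF schema; the only behaviour taken from the commit message is that a replayed ingest of the same execution_id is a no-op, while a retry arrives under a freshly minted id.

```python
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a stand-in execution table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE execution ("
        "execution_id TEXT PRIMARY KEY, outcome TEXT NOT NULL)"
    )
    return conn


def new_execution_id() -> str:
    """Each attempt, including a retry, gets its own identifier."""
    return str(uuid.uuid4())


def ingest(conn: sqlite3.Connection, execution_id: str, outcome: str) -> None:
    """Insert the row unless this execution was already ingested."""
    conn.execute(
        "INSERT INTO execution (execution_id, outcome) VALUES (?, ?) "
        "ON CONFLICT(execution_id) DO NOTHING",
        (execution_id, outcome),
    )
    conn.commit()


def count(conn: sqlite3.Connection) -> int:
    """Number of ingested executions."""
    return conn.execute("SELECT COUNT(*) FROM execution").fetchone()[0]


if __name__ == "__main__":
    conn = make_db()
    first = new_execution_id()
    ingest(conn, first, "RECOVERABLE")
    ingest(conn, first, "RECOVERABLE")  # replay of the same directory: ignored
    ingest(conn, new_execution_id(), "SUCCESS")  # retry: new id, new row
    print(count(conn))
```

Silently ignoring the conflict is only safe here because two different attempts can never share a key; if retries reused the id, DO NOTHING would hide the newer outcome.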

…transport

Full rewrite after design review. Pivots the proposal from a result-carrying transport to a disk-backed lifecycle:
- Source of truth is an on-disk, two-phase execution manifest (execution.json), REF-owned and typed, embedding CMEC provenance. Phase 1 (identity) written before the run; phase 2 (outcome) written last, even on recoverable failure. Absent outcome = incomplete (in-flight or hard-killed); crashes are now diagnosable from durable state.
- Transport narrows to launch + liveness (RUNNING|EXITED|GONE); results never cross the boundary. Manifest decides outcome, transport decides liveness, a present manifest outcome wins over liveness.
- Split into two verbs: run (definition -> on-disk bundle) and ingest (dir -> DB, transport-agnostic, idempotent, CV-validated, replayable).
- All executors ported (InMemory, ProcessPool, Celery, Slurm, Pbs); only K8s deferred. Celery link/link_error demotes to an optional eager-ingest hook.
- Primary driver reframed as regression output + crash-robustness; HPC and cross-deployment portability are downstream beneficiaries.
- Adds execution state diagram, staged migration plan, and CMEC-aligned terms.
- Defers: K8s, external/null execution, portability tier 3 (remote-file datasets, own RFC), external-version enforcement (ingest-as-is), telemetry/adaptive provider, CV hard-fail.

Markdown uses semantic line breaks targeting ~110 columns.

The manifest now declares when an execution must complete by.
- Move started_at into phase-1 (identity) and add an absolute deadline (= started_at + wall_clock), stamped by the worker at job start.
- Deadline is anchored at start, never submit, so a job that waited in a SLURM queue is judged from when it actually began.
- Drain reads the deadline from the manifest; operators (or a reconciling coordinator with a weak transport) can flag an overdue execution (now > deadline and outcome absent) without querying the scheduler.
- Thread the field through the two-phase prose, envelope comment, drain loop, and state diagram.
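
The deadline rule above is simple enough to sketch directly. The `started_at` and `deadline` names and the `now > deadline and outcome absent` test come from the commit message; the dictionary layout, the `outcome` key, and the function names are assumptions.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def stamp_identity(started_at: datetime, wall_clock: timedelta) -> dict:
    """Build the timing fields the worker would stamp at job start.

    The deadline is anchored at the start time, never at submit time, so
    time spent waiting in a scheduler queue does not count against the job.
    """
    return {
        "started_at": started_at.isoformat(),
        "deadline": (started_at + wall_clock).isoformat(),
        "outcome": None,
    }


def is_overdue(manifest: dict, now: datetime) -> bool:
    """True when the deadline has passed and no outcome was recorded."""
    if manifest.get("outcome") is not None:
        return False  # a recorded outcome means the execution finished
    return now > datetime.fromisoformat(manifest["deadline"])


if __name__ == "__main__":
    start = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    manifest = stamp_identity(start, wall_clock=timedelta(hours=6))
    print(is_overdue(manifest, start + timedelta(hours=5)))
    print(is_overdue(manifest, start + timedelta(hours=7)))
```

Since both inputs are read from the manifest and the clock, this check works without asking the scheduler anything, which is the point of stamping an absolute deadline.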

The existing HPCExecutor already handles both schedulers via parsl (SlurmProvider / SmartPBSProvider behind one HighThroughputExecutor), so the seam needs one HpcTransport, not separate Slurm/Pbs transports.
- Collapse the two HPC table rows into one HpcTransport (parsl-backed, scheduler by config); fix the dispatch/status columns (parsl, not raw sbatch/qsub/sacct/qstat).
- Note the parsl pilot model: scheduler walltime bounds the block, not the individual execution; per-execution wall_clock is enforced in the pilot.
- Migration step 4 becomes 'port HPCExecutor', not 'add new transports'.
- Unresolved question reframed to parsl future/pilot state mapping.

EXITED and GONE already took identical action: a present manifest outcome is applied either way, an absent one is retryable either way. Transport status only separates RUNNING from done; the manifest decides the rest.
- Merge the two branches into 'case Status.EXITED | Status.GONE'; GONE keeps a distinct warning log (transport lost the job) but no distinct control flow.
- Add the atomic outcome-write invariant (temp + fsync + rename) so a GONE job can never observe a half-written outcome; this is what lets a lost job be trusted exactly as much as a clean exit.
- Rewrite the robustness paragraph to match.
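
The temp + fsync + rename invariant can be sketched with the standard library alone. The function name and the JSON payload are assumptions; the write pattern itself is the one named in the commit message.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Replace `path` so readers see the old file or the new one, never a mix."""
    # The temporary file must live in the same directory as the target:
    # os.replace is only atomic within a single filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_name, path)  # atomic swap into place
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "execution.json"
        write_json_atomically(target, {"outcome": None})
        write_json_atomically(target, {"outcome": "SUCCESS"})
        print(target.read_text())
```

A stricter variant would also fsync the containing directory on POSIX so the rename itself survives a power loss; that step is left out here to keep the sketch portable.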

…llapsed drain

The paragraph claimed the decision used '(plus transport liveness)' and listed 'SUCCESS | retry | give up', which contradicted the collapsed drain loop and the robustness line (liveness only separates RUNNING from done).
- Spell out the full mapping: SUCCESS -> ingest, RECOVERABLE -> retry, FAILED -> give up, finished-with-no-outcome -> retry.
- State that liveness is used only to know the execution finished, not to classify it.

lewisjared · 2026-06-05T02:09:24Z

@mikapfl @fuchsi-huber Another proposal for a larger architectural cleanup. We could take this in pieces. In general a manifest would be a useful addition to the existing output files, which are only written at the end of an execution. Some additional thought might be needed to either integrate with the CMEC outputs or think through what should go in it.

lewisjared added 12 commits May 12, 2026 17:34

rfc(execution-lifecycle): assign PR #3 number

5034a53

rfc(execution-lifecycle): convert diagrams to mermaid

aa5d896

rfc(execution-lifecycle): swap module boundary to classDiagram

67f5fd2

rfc(execution-lifecycle): simplify; trim doc by ~half

8f8263b

chore: formatting

36dd7a2

This was referenced Jun 5, 2026

RFC: CLI-managed regression baselines (committed + native bundles) #5

Open

Refactor RFC: extract a deep materialize module from handle_execution_result (toward rfcs#3 run/ingest split) Climate-REF/climate-ref#717

Open

lewisjared wants to merge 12 commits into main from feat/execution-lifecycle
