Skip to content

rfc: execution lifecycle consolidation#3

Open
lewisjared wants to merge 12 commits into
mainfrom
feat/execution-lifecycle
Open

rfc: execution lifecycle consolidation#3
lewisjared wants to merge 12 commits into
mainfrom
feat/execution-lifecycle

Conversation

@lewisjared

@lewisjared lewisjared commented May 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace the shallow Executor protocol and its executors (SynchronousExecutor, LocalExecutor, CeleryExecutor, and the parsl-based HPCExecutor) with one execution lifecycle built on two ideas:

  1. a self-describing, on-disk execution manifest (execution.json) that records what an execution is and how it ended
  2. a thin Transport port that only launches work and reports liveness (RUNNING | EXITED | GONE).

A key tenant of the existing architecture is that the workers don't control the database. They are only responsible for running executions.
Currently the executors inform the orchastrator the result of an execution.
This is stateful making it difficult to run async and requires a lot of extra hoops to work with celery (an extra worker and linked tasks).

Every backend completes by writing the REF content to a directory on disk, and a single, transport-agnostic ingest step loads the manifest and bundles the results into the database. This makes executions crash-robust and replayable, makes diagnostic output self-describing so it can be compared across runs and versions (regression), and reduces every backend to a launcher plus a status query.

The other pro here is that users could (in theory) rsync the content and then reingest it to their own copy of the REF.
This isn't an explicit target here, but I think this goes a way towards this.

Current Design

  • moves the source of truth to an on-disk, manifest which describes the execution (written at start of execution and then outcome written on success/failure);
  • narrows the transport to launch + liveness — the manifest decides the outcome, the transport decides liveness, and a present manifest outcome wins over liveness;
  • splits the lifecycle into two verbs — run and ingest — with ingest transport-agnostic, idempotent, and replayable;
  • ports all executors: InMemoryTransport, ProcessPoolTransport, CeleryTransport, and one HPCTransport covering both Slurm and PBS via parsl (as HPCExecutor does today). Only K8s is deferred;
  • reframes the primary driver as regression output + crash-robustness, with HPC reach and cross-deployment portability as downstream beneficiaries of the same on-disk format.

Deferred (named, not designed here)

  • potential K8s and external / "null" transports
  • cross-deployment portability(remote-file datasets — its own RFC)
  • external-execution version policy (ingest-as-is for now)
  • diagnostic-specific resourcing
  • per-execution telemetry columns.

lewisjared added 12 commits May 12, 2026 17:34
Consolidate diagnostic execution lifecycle (allocation, dispatch, run,
classify, publish, ingest, finalise) into one deep module behind a single
Transport port. Surface ResourceHint on Diagnostic so providers can declare
memory/CPU/wall-clock once. Capture per-execution Telemetry to enable future
adaptive scheduling without a schema change.
Mermaid classDiagram is idiomatic for the Protocol + adapters pattern.
LR layout fits PR width; method signatures stay legible without HTML hacks.
Compress prose, drop per-design subsections in Rationale (table tells the
story), tighten Drawbacks/Prior art/Unresolved/Future to bullets. All three
diagrams kept; technical substance preserved.
Correct factual and overstated claims, and surface design risks the
draft glossed:

- Drop the false "not picklable" claim — ExecutionResult already pickles
  across the ProcessPoolExecutor today; reframe as construction entangled
  with disk I/O (a testability cost, not a correctness one).
- Split retry classification into worker-side (exception -> outcome,
  consolidating _is_system_error + CondaCommandError) and coordinator-side
  (outcome -> decision); drop the misleading "five sites collapse to one"
  and the unused policy constants.
- Default wall_clock 2h -> 6h to preserve current LocalExecutor budget;
  fix the mitigation text. Avoids a silent kill-at-2h regression.
- Move deadline computation to job start (transport-side); anchoring at
  submit time would expire queued SLURM/PBS jobs before they run.
- Expand the Celery push->pull trade-off (loss of fire-and-forget
  ingestion) from one bullet to an explicit, weighted drawback.
- Add a staged migration plan and a deprecation cycle for the public
  Executor protocol / import_executor_cls config key.
- Note ResourceHint must live in climate_ref_core (layering).
- Clarify ON CONFLICT DO NOTHING is safe because each retry mints a new
  execution_id.
- Remove dry_run from the lifecycle surface (planning, not dispatch).
- Soften unverifiable LOC estimates; scope dirty-flag rule to the result
  path (CLI resets stay separate).
…transport

Full rewrite after design review. Pivots the proposal from a result-carrying
transport to a disk-backed lifecycle:

- Source of truth is an on-disk, two-phase execution manifest (execution.json),
  REF-owned and typed, embedding CMEC provenance. Phase 1 (identity) written
  before the run; phase 2 (outcome) written last, even on recoverable failure.
  Absent outcome = incomplete (in-flight or hard-killed) — crashes are now
  diagnosable from durable state.
- Transport narrows to launch + liveness (RUNNING|EXITED|GONE); results never
  cross the boundary. Manifest decides outcome, transport decides liveness, a
  present manifest outcome wins over liveness.
- Split into two verbs: run (definition -> on-disk bundle) and ingest
  (dir -> DB, transport-agnostic, idempotent, CV-validated, replayable).
- All executors ported (InMemory, ProcessPool, Celery, Slurm, Pbs); only K8s
  deferred. Celery link/link_error demotes to an optional eager-ingest hook.
- Primary driver reframed as regression output + crash-robustness; HPC and
  cross-deployment portability are downstream beneficiaries.
- Adds execution state diagram, staged migration plan, and CMEC-aligned terms.
- Defers: K8s, external/null execution, portability tier 3 (remote-file
  datasets, own RFC), external-version enforcement (ingest-as-is), telemetry/
  adaptive provider, CV hard-fail.

Markdown uses semantic line breaks targeting ~110 columns.
The manifest now declares when an execution must complete by.

- Move started_at into phase-1 (identity) and add an absolute deadline
  (= started_at + wall_clock), stamped by the worker at job start.
- Deadline is anchored at start, never submit, so a job that waited in a
  SLURM queue is judged from when it actually began.
- Drain reads the deadline from the manifest; operators (or a reconciling
  coordinator with a weak transport) can flag an overdue execution
  (now > deadline and outcome absent) without querying the scheduler.
- Thread the field through the two-phase prose, envelope comment, drain
  loop, and state diagram.
The existing HPCExecutor already handles both schedulers via parsl
(SlurmProvider / SmartPBSProvider behind one HighThroughputExecutor), so
the seam needs one HpcTransport, not separate Slurm/Pbs transports.

- Collapse the two HPC table rows into one HpcTransport (parsl-backed,
  scheduler by config); fix the dispatch/status columns (parsl, not raw
  sbatch/qsub/sacct/qstat).
- Note the parsl pilot model: scheduler walltime bounds the block, not the
  individual execution; per-execution wall_clock is enforced in the pilot.
- Migration step 4 becomes 'port HPCExecutor', not 'add new transports'.
- Unresolved question reframed to parsl future/pilot state mapping.
EXITED and GONE already took identical action — a present manifest outcome
is applied either way, an absent one is retryable either way. Transport
status only separates RUNNING from done; the manifest decides the rest.

- Merge the two branches into 'case Status.EXITED | Status.GONE'; GONE
  keeps a distinct warning log (transport lost the job) but no distinct
  control flow.
- Add the atomic outcome-write invariant (temp + fsync + rename) so a
  GONE job can never observe a half-written outcome — this is what lets a
  lost job be trusted exactly as much as a clean exit.
- Rewrite the robustness paragraph to match.
…llapsed drain

The paragraph claimed the decision used '(plus transport liveness)' and
listed 'SUCCESS | retry | give up', which contradicted the collapsed drain
loop and the robustness line (liveness only separates RUNNING from done).

- Spell out the full mapping: SUCCESS -> ingest, RECOVERABLE -> retry,
  FAILED -> give up, finished-with-no-outcome -> retry.
- State that liveness is used only to know the execution finished, not to
  classify it.
@lewisjared

Copy link
Copy Markdown
Contributor Author

@mikapfl @fuchsi-huber Another proposal for a larger architectural cleanup. We could take this in pieces. In general a manifest would be a useful addition to the existing output files that are only written at the end of a execution. Some additional thought might be needed to either integrate with the CMEC outputs or think through what should go in it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant