rfc: execution lifecycle consolidation#3
Open
lewisjared wants to merge 12 commits into
Open
Conversation
Consolidate diagnostic execution lifecycle (allocation, dispatch, run, classify, publish, ingest, finalise) into one deep module behind a single Transport port. Surface ResourceHint on Diagnostic so providers can declare memory/CPU/wall-clock once. Capture per-execution Telemetry to enable future adaptive scheduling without a schema change.
Mermaid classDiagram is idiomatic for the Protocol + adapters pattern. LR layout fits PR width; method signatures stay legible without HTML hacks.
Compress prose, drop per-design subsections in Rationale (table tells the story), tighten Drawbacks/Prior art/Unresolved/Future to bullets. All three diagrams kept; technical substance preserved.
Correct factual and overstated claims, and surface design risks the draft glossed: - Drop the false "not picklable" claim — ExecutionResult already pickles across the ProcessPoolExecutor today; reframe as construction entangled with disk I/O (a testability cost, not a correctness one). - Split retry classification into worker-side (exception -> outcome, consolidating _is_system_error + CondaCommandError) and coordinator-side (outcome -> decision); drop the misleading "five sites collapse to one" and the unused policy constants. - Default wall_clock 2h -> 6h to preserve current LocalExecutor budget; fix the mitigation text. Avoids a silent kill-at-2h regression. - Move deadline computation to job start (transport-side); anchoring at submit time would expire queued SLURM/PBS jobs before they run. - Expand the Celery push->pull trade-off (loss of fire-and-forget ingestion) from one bullet to an explicit, weighted drawback. - Add a staged migration plan and a deprecation cycle for the public Executor protocol / import_executor_cls config key. - Note ResourceHint must live in climate_ref_core (layering). - Clarify ON CONFLICT DO NOTHING is safe because each retry mints a new execution_id. - Remove dry_run from the lifecycle surface (planning, not dispatch). - Soften unverifiable LOC estimates; scope dirty-flag rule to the result path (CLI resets stay separate).
…transport Full rewrite after design review. Pivots the proposal from a result-carrying transport to a disk-backed lifecycle: - Source of truth is an on-disk, two-phase execution manifest (execution.json), REF-owned and typed, embedding CMEC provenance. Phase 1 (identity) written before the run; phase 2 (outcome) written last, even on recoverable failure. Absent outcome = incomplete (in-flight or hard-killed) — crashes are now diagnosable from durable state. - Transport narrows to launch + liveness (RUNNING|EXITED|GONE); results never cross the boundary. Manifest decides outcome, transport decides liveness, a present manifest outcome wins over liveness. - Split into two verbs: run (definition -> on-disk bundle) and ingest (dir -> DB, transport-agnostic, idempotent, CV-validated, replayable). - All executors ported (InMemory, ProcessPool, Celery, Slurm, Pbs); only K8s deferred. Celery link/link_error demotes to an optional eager-ingest hook. - Primary driver reframed as regression output + crash-robustness; HPC and cross-deployment portability are downstream beneficiaries. - Adds execution state diagram, staged migration plan, and CMEC-aligned terms. - Defers: K8s, external/null execution, portability tier 3 (remote-file datasets, own RFC), external-version enforcement (ingest-as-is), telemetry/ adaptive provider, CV hard-fail. Markdown uses semantic line breaks targeting ~110 columns.
The manifest now declares when an execution must complete by. - Move started_at into phase-1 (identity) and add an absolute deadline (= started_at + wall_clock), stamped by the worker at job start. - Deadline is anchored at start, never submit, so a job that waited in a SLURM queue is judged from when it actually began. - Drain reads the deadline from the manifest; operators (or a reconciling coordinator with a weak transport) can flag an overdue execution (now > deadline and outcome absent) without querying the scheduler. - Thread the field through the two-phase prose, envelope comment, drain loop, and state diagram.
The existing HPCExecutor already handles both schedulers via parsl (SlurmProvider / SmartPBSProvider behind one HighThroughputExecutor), so the seam needs one HpcTransport, not separate Slurm/Pbs transports. - Collapse the two HPC table rows into one HpcTransport (parsl-backed, scheduler by config); fix the dispatch/status columns (parsl, not raw sbatch/qsub/sacct/qstat). - Note the parsl pilot model: scheduler walltime bounds the block, not the individual execution; per-execution wall_clock is enforced in the pilot. - Migration step 4 becomes 'port HPCExecutor', not 'add new transports'. - Unresolved question reframed to parsl future/pilot state mapping.
EXITED and GONE already took identical action — a present manifest outcome is applied either way, an absent one is retryable either way. Transport status only separates RUNNING from done; the manifest decides the rest. - Merge the two branches into 'case Status.EXITED | Status.GONE'; GONE keeps a distinct warning log (transport lost the job) but no distinct control flow. - Add the atomic outcome-write invariant (temp + fsync + rename) so a GONE job can never observe a half-written outcome — this is what lets a lost job be trusted exactly as much as a clean exit. - Rewrite the robustness paragraph to match.
…llapsed drain The paragraph claimed the decision used '(plus transport liveness)' and listed 'SUCCESS | retry | give up', which contradicted the collapsed drain loop and the robustness line (liveness only separates RUNNING from done). - Spell out the full mapping: SUCCESS -> ingest, RECOVERABLE -> retry, FAILED -> give up, finished-with-no-outcome -> retry. - State that liveness is used only to know the execution finished, not to classify it.
Contributor
Author
|
@mikapfl @fuchsi-huber Another proposal for a larger architectural cleanup. We could take this in pieces. In general a manifest would be a useful addition to the existing output files that are only written at the end of a execution. Some additional thought might be needed to either integrate with the CMEC outputs or think through what should go in it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the shallow
Executorprotocol and its executors (SynchronousExecutor,LocalExecutor,CeleryExecutor, and the parsl-basedHPCExecutor) with one execution lifecycle built on two ideas:execution.json) that records what an execution is and how it endedTransportport that only launches work and reports liveness (RUNNING | EXITED | GONE).A key tenant of the existing architecture is that the workers don't control the database. They are only responsible for running executions.
Currently the executors inform the orchastrator the result of an execution.
This is stateful making it difficult to run async and requires a lot of extra hoops to work with celery (an extra worker and linked tasks).
Every backend completes by writing the REF content to a directory on disk, and a single, transport-agnostic ingest step loads the manifest and bundles the results into the database. This makes executions crash-robust and replayable, makes diagnostic output self-describing so it can be compared across runs and versions (regression), and reduces every backend to a launcher plus a status query.
The other pro here is that users could (in theory) rsync the content and then reingest it to their own copy of the REF.
This isn't an explicit target here, but I think this goes a way towards this.
Current Design
runandingest— with ingest transport-agnostic, idempotent, and replayable;InMemoryTransport,ProcessPoolTransport,CeleryTransport, and oneHPCTransportcovering both Slurm and PBS via parsl (asHPCExecutordoes today). Only K8s is deferred;Deferred (named, not designed here)