feat(cli): add --recover (rerun) and --recover-from (run) [gated] #1237
Draft
kumare3 wants to merge 11 commits into
Pin the exact RunSpec / CreateRunRequest that with_runcontext(...) builds in remote mode: every field _run_remote serializes (env_vars, labels, annotations, queue→cluster, interruptible, overwrite_cache, cache_lookup_scope, service_account, notifications, max_action_concurrency), plus the ConnectError mapping, dry-run path, and per-mode dispatch. This is the byte-for-byte oracle for the upcoming run/rerun/recover/debug unification: the extraction of _build_task_spec_from_template / _submit_remote / _apply_overrides must reproduce these unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Pull the local-TaskTemplate branch of _run_remote (image build, code-bundle cascade, serialization + task-spec translation) into a reusable _Runner._build_task_spec_from_template returning (task_spec, code_bundle, version). image_cache is folded into task_spec via the serialization context, so it is not returned. Heavy imports travel into the helper to keep `import flyte` cheap. This is the shared task-spec builder rerun-with-substitute-code will call, removing the future duplication of _replay._build_task_spec. Characterization tests (test_run_runspec_chars + test_union_run_basic) pass unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Split _run_remote's back half into three single-responsibility helpers on _Runner:
- _build_env_dict: runtime env assembly (user env_vars + injected LOG/debug/sys-path keys), shared by fresh and inherited paths; returns a fresh dict (no longer mutates self._env_vars).
- _apply_overrides(base, *, task): the single place runner config maps onto a RunSpec. base=None builds a fresh spec (run/recover); base set deep-copies a prior run's spec and merges overrides by key (the rerun seam: env merge + explicit field overrides). Includes a gated recover block (raises until flyteidl2 RunSpec.recover ships).
- _submit_remote: the single network call site (upload_inputs + create_run + the ConnectError mapping). Consumes an already-built run_spec.
- _resolve_run_target: RunIdentifier vs ProjectIdentifier resolution.
- _to_cache_lookup_scope lifted to module scope.
Heavy imports travel into the helpers (import flyte stays cheap; verified via -X importtime). Characterization snapshot reproduced byte-for-byte; added unit tests for the inherited-merge path and recover gating. 141 run-path tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
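The base=None versus inherited-base behaviour described for _apply_overrides can be illustrated with a minimal sketch. RunSpecSketch and this apply_overrides function are hypothetical stand-ins written for illustration; they are not the flyte implementation, and the real RunSpec carries many more fields.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RunSpecSketch:
    """Hypothetical stand-in for RunSpec (assumption: only two fields shown)."""
    env_vars: Dict[str, str] = field(default_factory=dict)
    interruptible: Optional[bool] = None


def apply_overrides(base, *, env_vars=None, interruptible=None):
    # base=None builds a fresh spec; otherwise deep-copy so the prior spec is untouched.
    spec = RunSpecSketch() if base is None else copy.deepcopy(base)
    # Env merge by key: inherited entries survive unless the runner overrides them.
    spec.env_vars.update(env_vars or {})
    # Explicit field override: applied only when the caller supplied a value.
    if interruptible is not None:
        spec.interruptible = interruptible
    return spec


prior = RunSpecSketch(env_vars={"A": "1", "B": "2"}, interruptible=True)
rerun_spec = apply_overrides(prior, env_vars={"B": "3"})   # inherited path (rerun)
fresh_spec = apply_overrides(None, env_vars={"C": "4"})    # fresh path (run)
```

Deep-copying the base keeps the fetched spec reusable for comparison in tests, which is presumably what makes a byte-for-byte characterization snapshot practical.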
Add `recover: str | None` to _Runner / with_runcontext (keyword-only, default None → fully backwards-compatible) and a hidden `--recover-from` flag on `flyte run`. The value flows into _apply_overrides, which sets RunSpec.recover once flyteidl2 ships the field and otherwise raises a clear NotImplementedError (the field is absent today). recover composes with run/rerun since it lives in the shared override seam.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
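One way to picture the gating is a check for the field before assigning it. The sketch below uses two hypothetical dataclasses in place of the real flyteidl2 message (one without a recover field, one with it); the actual check in _apply_overrides may look different.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunSpecToday:
    """Hypothetical stand-in: a RunSpec that has no recover field yet."""
    overwrite_cache: bool = False


@dataclass
class RunSpecFuture:
    """Hypothetical stand-in: a RunSpec after the recover field ships."""
    overwrite_cache: bool = False
    recover: Optional[str] = None


def set_recover(spec, recover_ref: Optional[str]):
    # None means "no recover": leave the spec untouched.
    if recover_ref is None:
        return spec
    # Gate: refuse loudly while the message type lacks the field.
    if "recover" not in {f.name for f in fields(spec)}:
        raise NotImplementedError("recover is not supported yet: RunSpec has no 'recover' field")
    spec.recover = recover_ref
    return spec
```

Because the default is None, callers that never pass recover take the first branch and never reach the gate, which matches the backwards-compatibility claim above.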
_Runner.rerun(run_name, action_name, task_template, inputs) re-runs a prior run on
the shared foundation: fetch RunDetails → inherit its RunSpec (via _apply_overrides),
source the task from action_details.pb2.task (or a substitute template), and either
reuse the prior raw proto inputs (dataproxy.get_action_data) or convert new native
kwargs against the fetched interface (guess_interface), then _submit_remote.
Public surface: flyte.rerun("r1") same inputs; flyte.rerun("r1", x=2) changed inputs;
flyte.rerun("r1", task_template=fixed) substitute code. flyte.replay kept as a
deprecated thin alias (inputs=None). recover/debug compose via with_runcontext.
Remote-only for now. Exported from flyte/__init__.
Tests cover same-inputs inheritance+reuse, changed-inputs conversion against a real
interface proto, the non-remote guard, and the replay alias. 147 run-path tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
- recover: bool|str = False. True recovers from the run being rerun (rerun-only); a run-name string recovers from that named run (the only form valid on run() / flyte run --recover-from). _resolve_recover_ref maps True->rerun target and rejects True on a plain run(). _apply_overrides takes the resolved recover_ref.
- recover is remote-only: a truthy recover in local/hybrid mode raises ValueError up front in run() instead of being silently ignored.
- Delete flyte.replay (no alias); flyte.rerun is the verb.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
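The resolution rules in this commit can be sketched as a small pure function. This is an illustration only: the signature, the mode parameter, and the use of ValueError for rejecting True on a plain run() are assumptions, not the actual _resolve_recover_ref.

```python
from typing import Optional, Union


def resolve_recover_ref(
    recover: Union[bool, str, None],
    *,
    rerun_target: Optional[str],
    mode: str = "remote",
) -> Optional[str]:
    # False / None / empty string all mean "no recover".
    if not recover:
        return None
    # recover is remote-only: fail up front instead of silently ignoring it.
    if mode != "remote":
        raise ValueError(f"recover is only supported in remote mode, got mode={mode!r}")
    if recover is True:
        # True means "the run being rerun", so a plain run() has nothing to point at.
        if rerun_target is None:
            raise ValueError("recover=True is only valid on rerun; pass a run name instead")
        return rerun_target
    # A string names the prior run to recover from.
    return recover
```

For example, resolve_recover_ref(True, rerun_target="r1") yields "r1", while resolve_recover_ref("r0", rerun_target=None) yields "r0".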
Re-run an existing run with its own fetched code + exact inputs (no local code). Complements `flyte run --recover-from` (which supplies new local code): rerun takes the run name as a positional and exposes context options (--project/--domain/--name/--env/--label/--follow) plus a hidden `--recover` (reuse succeeded actions, coming soon). v1 reuses the prior inputs; changing inputs from the CLI is a follow-up (flyte.rerun(run, x=2) covers it programmatically). Registered in cli/main.py under "Run and stop tasks".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Re-run an existing run with THIS local code, reusing the prior run's inputs: the CLI equivalent of flyte.rerun(run, task_template=local_task). Routes the file-loaded TaskTemplate through _Runner.rerun(run, task_template=...) and suppresses the dynamic per-input options (inputs come from the prior run; required inputs aren't demanded). Orthogonal to the gated --recover-from; --rerun-from is live and remote-only (errors with --local). `flyte rerun <run>` stays the no-local-code (fetched) path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
- Widen recover to bool|str|None so the CLI's str|None --recover-from type-checks (None already means "no recover" everywhere).
- "re-uses" -> "reuses" (codespell).
- ruff format.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
`hidden=True` still surfaces options in `flyte gen docs`, so the gated recover flags leaked into the CLI reference. Remove `flyte rerun --recover` and `flyte run --recover-from` entirely, leaving TODOs to re-add them once flyteidl2 RunSpec.recover + backend support land. `--rerun-from` (live) and the Python `with_runcontext(recover=...)` field are unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
Expose the recover knob on the CLI: `flyte rerun <run> --recover` recovers from the run being rerun; `flyte run <file> <task> --recover-from <run>` recovers a fresh run (new local code) from a named prior run. Both map to with_runcontext(recover=...) and reuse a prior run's succeeded actions, re-running only what failed or changed. Gated until the flyteidl2 RunSpec.recover field + actions-service support ship (raises a clear NotImplementedError until then), hence this PR is a draft, to land once the backend is ready. Stacks on #1236 (the rerun foundation that owns the recover field).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ketan Umare <kumare3@users.noreply.github.com>
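How the two flags might translate into the single recover value can be sketched with argparse as a stand-in. The real flyte CLI is not built this way here; the parser layout, the prog name, and recover_kwarg are all hypothetical and only mirror the flag names and argument order quoted in the commit message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the flyte CLI: two subcommands, one recover flag each.
    parser = argparse.ArgumentParser(prog="flyte-sketch")
    sub = parser.add_subparsers(dest="command", required=True)

    rerun = sub.add_parser("rerun")
    rerun.add_argument("run")
    rerun.add_argument("--recover", action="store_true")  # bool: recover from the run being rerun

    run = sub.add_parser("run")
    run.add_argument("file")
    run.add_argument("task")
    run.add_argument("--recover-from", default=None, metavar="RUN")  # str: named prior run
    return parser


def recover_kwarg(args: argparse.Namespace):
    # Both flags collapse into the one value that would be passed as recover=...
    if args.command == "rerun":
        return args.recover
    return args.recover_from


parser = build_parser()
rerun_value = recover_kwarg(parser.parse_args(["rerun", "r1", "--recover"]))
run_value = recover_kwarg(parser.parse_args(["run", "wf.py", "main", "--recover-from", "r0"]))
plain_value = recover_kwarg(parser.parse_args(["run", "wf.py", "main"]))
```

Keeping the bool form on rerun and the string form on run mirrors the rule that True is only meaningful when there is a run being rerun.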
Adds the recover knob to the CLI, split out of #1236 (which deliberately omits these flags so they don't surface in the CLI docs while the backend is unfinished).
Draft: gated until the flyteidl2 `RunSpec.recover` field + actions-service support ship. The flags are wired but raise a clear `NotImplementedError` at submit until then. Stacks on #1236 (the rerun foundation that owns the `recover` field); merge after #1236 and once the backend lands.
Usage
Both map to `with_runcontext(recover=...)`: `--recover` (bool) recovers from the run being rerun; `--recover-from <run>` (string) recovers a fresh `run()` from a named prior run. recover reuses the prior run's succeeded actions and re-runs only what failed or changed (remote-only).
What's here
- `flyte rerun --recover` (bool) → `with_runcontext(recover=True).rerun(run)`.
- `flyte run --recover-from <run>` (string) → `with_runcontext(recover="<run>").run(task)`.
The SDK `with_runcontext(recover=...)` field and gating already land in #1236; this PR only adds the CLI surface.
🤖 Generated with Claude Code