
Remove hard agent-turn timeout from cron/task schedulers #214

Open
LIU9293 wants to merge 1 commit into main from remove-cron-task-agent-timeout


LIU9293 (Contributor) commented Jun 16, 2026

Why

The cron scheduler (packages/core/cron/scheduler.ts) and one-time task scheduler (packages/core/tasks/scheduler.ts) both wrapped agent.sendMessage in a 2h hard timeout via withTimeout (configurable via ODE_CRON_AGENT_TIMEOUT_MS / ODE_TASK_AGENT_TIMEOUT_MS).

This bound was tripping on legitimate long-running agent workflows — Sentry triage, Linear board grooming, multi-PR sweeps, scheduled audits — and surfacing in Slack as:

Cron job failed: <title>
Request timed out

even though the underlying agent was making progress. Verified on this machine: recent runs of `Sentry检查` ("Sentry check") and `PM推进` ("PM follow-up") both failed with `CronStepTimeoutError: Cron agent turn timed out after 7200000ms`, which is exactly the 2h default.
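
To make the failure mode concrete, here is a minimal sketch of how a wrapper of this kind typically behaves. Only the names `withTimeout` and `CronStepTimeoutError` and the error message come from the description above; the signature, the constructor arguments, and `TWO_HOURS_MS` are assumptions for illustration, and the real helpers in `packages/core` may differ.

```typescript
// Hypothetical reconstruction, not the project's actual implementation.
class CronStepTimeoutError extends Error {
  constructor(step: string, timeoutMs: number) {
    super(`${step} timed out after ${timeoutMs}ms`);
    this.name = "CronStepTimeoutError";
  }
}

function withTimeout<T>(work: Promise<T>, timeoutMs: number, step: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new CronStepTimeoutError(step, timeoutMs)), timeoutMs);
  });
  // Whichever promise settles first decides the outcome.
  // The losing promise is NOT cancelled: `work` keeps running in the background.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// The 2h default mentioned above, expressed in milliseconds.
const TWO_HOURS_MS = 2 * 60 * 60 * 1000; // 7200000
```

Because `Promise.race` only decides which result the caller sees, a wrapper like this rejects at the deadline without stopping the agent turn itself. That would be consistent with a job being reported as failed while the agent was still making progress.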

What

  • Drop the `withTimeout` wrap around `agent.sendMessage` in both schedulers.
  • Remove the `CRON_AGENT_TIMEOUT_MS` / `TASK_AGENT_TIMEOUT_MS` constants and the corresponding `ODE_CRON_AGENT_TIMEOUT_MS` / `ODE_TASK_AGENT_TIMEOUT_MS` env vars.
  • Leave a comment explaining the decision and the remaining safety nets.
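
As a rough illustration of the resulting call shape, the sketch below awaits the agent turn directly. The `Agent` interface, the `sendMessage` parameters, and `runAgentTurn` are hypothetical stand-ins; the actual scheduler code and its arguments are not shown in this description.

```typescript
// Hypothetical types: the real agent adapter interface may look different.
interface Agent {
  sendMessage(sessionId: string, prompt: string): Promise<string>;
}

// Before (assumed shape): the turn was raced against a hard cap, e.g.
//   await withTimeout(agent.sendMessage(sessionId, prompt), CRON_AGENT_TIMEOUT_MS, ...)
//
// After: the turn is awaited directly.
async function runAgentTurn(agent: Agent, sessionId: string, prompt: string): Promise<string> {
  // No hard turn timeout here: long-running workflows are legitimate.
  // Remaining safety nets: the prepare-step timeout, the restart reconcile
  // path, and the agent adapter's own per-request error handling.
  return agent.sendMessage(sessionId, prompt);
}
```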

What we keep

  • `CRON_PREPARE_TIMEOUT_MS` / `TASK_PREPARE_TIMEOUT_MS` (2 min): still wraps session creation + worktree setup so a wedged `getOrCreateSession` / `git worktree add` can't hold the in-process `runningJobIds` / `runningTaskIds` lock indefinitely (which would otherwise make manual "Run now" return 409 forever).
  • `reconcileInterruptedCronJobs` / `reconcileInterruptedTasks`: still recovers stuck runs on daemon restart.
  • The agent adapter's own per-request error handling.
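
The lock behaviour behind the first bullet can be sketched as follows. Only `runningJobIds` and the 409 response come from the description; `runNow`, `RunNowResult`, and the 202 status are assumptions made for illustration.

```typescript
// In-process lock: a job id stays in the set while its run is in flight.
const runningJobIds = new Set<string>();

// Hypothetical result type; the 202 status is an assumption.
type RunNowResult = { status: 202 } | { status: 409; reason: string };

function runNow(jobId: string, start: (jobId: string) => Promise<void>): RunNowResult {
  if (runningJobIds.has(jobId)) {
    // A run that never settles would keep returning this forever.
    return { status: 409, reason: "job is already running" };
  }
  runningJobIds.add(jobId);
  void start(jobId)
    .catch(() => {
      // Failure reporting elided in this sketch.
    })
    .finally(() => {
      // Release the lock however the run ends.
      runningJobIds.delete(jobId);
    });
  return { status: 202 };
}
```

In this sketch the lock is only released once `start` settles. A setup step that hangs forever never reaches `finally`, which is why bounding session creation and worktree setup with a timeout still matters even without a cap on the agent turn.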

Verification

  • `bun run typecheck` clean.
  • `bun test packages/core/test` → 78 pass / 1 skip / 0 fail.

Cron jobs and one-time tasks both wrapped `agent.sendMessage` in a 2h
hard timeout (`ODE_CRON_AGENT_TIMEOUT_MS` / `ODE_TASK_AGENT_TIMEOUT_MS`).
Long-running agent workflows (Sentry triage, board grooming, multi-PR
sweeps, scheduled audits) routinely tripped this cap and surfaced as
"Request timed out" failures even though the underlying agent was
making progress.

Drop the wrap on both sides. We still keep the prepare-step timeout
(session creation + worktree setup) so a wedged setup can't hold the
in-process `runningJobIds`/`runningTaskIds` lock; truly hung turns are
still recoverable via the daemon restart reconcile path.
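
For readers unfamiliar with a restart reconcile path, the idea can be sketched as below. This is not the code of `reconcileInterruptedCronJobs` / `reconcileInterruptedTasks`; the record shape, the status values, and the error text are hypothetical, and the real functions presumably work against persistent storage.

```typescript
// Hypothetical run record; field names and statuses are assumptions.
type RunStatus = "running" | "succeeded" | "failed";

interface RunRecord {
  id: string;
  status: RunStatus;
  error?: string;
}

// On daemon start, no run can genuinely be in flight yet, so any record
// still marked "running" must have been interrupted by the previous shutdown.
function reconcileInterruptedRuns(runs: RunRecord[]): RunRecord[] {
  return runs.map((run): RunRecord =>
    run.status === "running"
      ? { ...run, status: "failed", error: "interrupted by daemon restart" }
      : run
  );
}
```

With this approach a hung turn does not need its own timer: restarting the daemon is enough to clear the stale "running" state.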