Skip to content

fix(cron): auto-disable cron jobs whose target channel is unreachable#211

Open
LIU9293 wants to merge 4 commits into
mainfrom
fix/cron-auto-disable-channel-not-found-04526a90
Open

fix(cron): auto-disable cron jobs whose target channel is unreachable#211
LIU9293 wants to merge 4 commits into
mainfrom
fix/cron-auto-disable-channel-not-found-04526a90

Conversation

@LIU9293

@LIU9293 LIU9293 commented May 31, 2026

Copy link
Copy Markdown
Contributor

What

When a cron job's send hits a permanent channel-access error (channel_not_found, not_in_channel, is_archived, token_revoked, Discord Missing Access/Unknown Channel, Lark chat_not_found, …), the scheduler now disables the job and records the reason in last_error. The cron row stops firing until a human re-enables it.

Why

Sentry ODE-DEAMON-7 — "IM slack send failed: channel_not_found" — captured 5 events over ~5 weeks, all from a single cron job whose bot had been removed from channel C0ATGCJ0YK0. Without intervention the job would keep firing every cron tick forever, each failure capturing a fresh Sentry event with the same fingerprint. This converts that infinite drip into a one-shot auto-disable + log.

This is a behaviour fix, not a Sentry-noise suppression. The Sentry event already correctly indicates "the bot may have been removed" (see existing isBenignDeliveryFailure test); the right product response is to stop firing the job, which is what this PR does.

Design notes

  • Cross-platform detection lives in packages/shared/delivery/permanent-error.ts alongside the existing isRateLimitError. It is intentionally narrow: transient failures (5xx, ECONNRESET, 429, socket hangups, Slack message_not_found) explicitly stay retryable.
  • Two call sites in runCronJob:
    1. Success-path send: agent turn completed, but final delivery failed → if permanent, disable.
    2. Failure-notification send: agent turn already failed → if the notify error is permanent, disable. Critically the check is on the notify error, not the original turn error, so a transient agent timeout doesn't disable a job whose channel is fine.
  • Disable bookkeeping (patchCronJob({ enabled: false }) + markCronJobFailed + agent_result failAgentResult) is wrapped in best-effort try/catch so a SQLite hiccup never shadows the original delivery error.

Tests

  • 9 new unit tests in permanent-error.test.ts cover Slack message strings, Slack SDK data.error shape, Discord numeric codes (50001, 10003), Lark messages, and explicit negatives (rate limits, ECONNRESET, 5xx, message_not_found).
  • Full suite still green: bun test → 414 pass, 1 skip, 0 fail.

Out of scope

  • Auto re-enable when access is restored. The right re-enable signal is a human /invite event, which is a separate observability problem; for now ode cron enable <id> is the recovery path.
  • Auto-disabling tasks (ode task) — tasks are single-shot so the noise pattern is bounded; cron is the recurring case that needs this guard.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 566300571b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/core/cron/scheduler.ts Outdated
}
}
try {
patchCronJob(job.id, { enabled: false });

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disable stale-channel cron rows without revalidating config

When a cron row points at a channel that has since been removed from the Ode config, this patchCronJob call re-enters updateCronJob(), which calls getChannelSnapshot() and throws before enabled is flipped. In that stale-config/unreachable-channel case the new auto-disable path only logs the failure, leaves the row enabled, and the scheduler continues firing and producing the same permanent channel_not_found/missing-access error on every tick; the same pattern is duplicated in the failure-notification path below.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 46fe172. Added disableCronJob(id) direct-SQL helper that flips enabled = 0 without re-validating channel config via getChannelSnapshot, so the stale-channel case (which is exactly the scenario this PR targets) now actually disables the row instead of silently swallowing the error. Both call sites switched. New test asserts patchCronJob throws while disableCronJob succeeds when the channel was removed from ode.json.

LIU9293 pushed a commit that referenced this pull request Jun 1, 2026
Addresses Codex P2 review on PR #211. The previous patch routed
auto-disable through patchCronJob -> updateCronJob -> getChannelSnapshot,
which throws "Channel not found in configured workspaces" when the cron
row's destination channel has since been removed from ode.json. In that
exact case (stale local config, bot kicked, workspace re-onboarded) the
auto-disable was silently swallowed and the row stayed enabled — the
scheduler would keep firing and capturing the same channel_not_found
Sentry event every tick, defeating the point of the PR.

Add a direct-SQL disableCronJob(id) helper that flips enabled=0 without
re-validating channel config, and use it from both call sites
(success-path send and failure-notification send).

Tests: new disableCronJob suite covers
- happy path (enabled true -> false, returns true)
- idempotent re-disable (returns false)
- missing row (returns false)
- stale-channel case where patchCronJob throws but disableCronJob still
  succeeds (the regression Codex flagged)

bun test: 418 pass, 1 skip, 0 fail.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46fe1725ca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +79 to +81
const code = (err as { code?: unknown }).code;
if (typeof code === "number" && PERMANENT_DISCORD_CODES.has(code)) {
return true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle wrapped Discord channel failures

For Discord cron sends, this numeric-code check usually never sees the DiscordAPIError: sendChannelMessage calls resolveTextChannel, which catches channels.fetch errors and then rethrows a generic Error("Discord channel ... is not text-based or inaccessible") (packages/ims/discord/client.ts:143-157). In the deleted/private-channel cases this change is meant to auto-disable, the error arriving here has no code, so the job stays enabled and keeps retrying instead of being disabled.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in caf32bf. resolveTextChannel now captures err.code from every client.channels.fetch() attempt and forwards a representative permanent code (10003 / 50001 / 50013 / 50007 preferred) on the thrown wrapper Error, so isPermanentChannelError can classify it via the existing numeric-code shape. The wrapper message is unchanged, and a new test in permanent-error.test.ts asserts the wrapper-with-code path is treated as permanent.

LIU9293 pushed a commit that referenced this pull request Jun 2, 2026
… wrapper

Addresses Codex P2 review on PR #211. resolveTextChannel() catches each
client.channels.fetch() failure and then rethrows a generic
Error("Discord channel ... is not text-based or inaccessible") with no
`code` attached. That generic error reaches isPermanentChannelError(),
which checks for numeric DiscordAPIErrorcode (10003 Unknown Channel /
50001 Missing Access / etc.) and falls through to false — so a Discord
cron job whose bot has been removed from the target channel stays
enabled and keeps firing every tick, exactly the noise pattern this PR
exists to fix.

Capture every Discord API code from each fetch attempt, pick a permanent
one (10003 / 50001 / 50013 / 50007) when available, and attach it to the
thrown wrapper as `code` (with the full list on `discordErrorCodes`
for diagnostics). The wrapper's message is unchanged so callers that
match on the string still work.

Add a permanent-error.test.ts case asserting the wrapper Error with a
forwarded code is classified as permanent.

bun test: 419 pass, 1 skip, 0 fail.
Kai Liu added 3 commits June 15, 2026 03:16
When a cron job's send hits a permanent channel-access error
(channel_not_found, not_in_channel, is_archived, token_revoked, etc.),
the scheduler now disables the job and records the reason. Previously
the same broken job would re-fire on every cron tick and capture an
identical Sentry event for the lifetime of the daemon — see Sentry
ODE-DEAMON-7 (5 events over ~5 weeks from one stuck job).

- Add @/shared/delivery/permanent-error with cross-platform detection
  (Slack error strings, Discord numeric codes, Lark chat_not_found)
  and intentionally narrow classification: transient failures stay
  retryable.
- Wire it into both the success-path send and the failure-notification
  send in runCronJob. The check on the notify path uses the *notify*
  error, not the original turn error, so a transient agent timeout
  doesn't disable a job when the channel is fine.
- Add unit tests covering Slack/Discord/Lark error shapes and ensuring
  rate-limits / 5xx / ECONNRESET / message_not_found stay retryable.
Addresses Codex P2 review on PR #211. The previous patch routed
auto-disable through patchCronJob -> updateCronJob -> getChannelSnapshot,
which throws "Channel not found in configured workspaces" when the cron
row's destination channel has since been removed from ode.json. In that
exact case (stale local config, bot kicked, workspace re-onboarded) the
auto-disable was silently swallowed and the row stayed enabled — the
scheduler would keep firing and capturing the same channel_not_found
Sentry event every tick, defeating the point of the PR.

Add a direct-SQL disableCronJob(id) helper that flips enabled=0 without
re-validating channel config, and use it from both call sites
(success-path send and failure-notification send).

Tests: new disableCronJob suite covers
- happy path (enabled true -> false, returns true)
- idempotent re-disable (returns false)
- missing row (returns false)
- stale-channel case where patchCronJob throws but disableCronJob still
  succeeds (the regression Codex flagged)

bun test: 418 pass, 1 skip, 0 fail.
… wrapper

Addresses Codex P2 review on PR #211. resolveTextChannel() catches each
client.channels.fetch() failure and then rethrows a generic
Error("Discord channel ... is not text-based or inaccessible") with no
`code` attached. That generic error reaches isPermanentChannelError(),
which checks for numeric DiscordAPIErrorcode (10003 Unknown Channel /
50001 Missing Access / etc.) and falls through to false — so a Discord
cron job whose bot has been removed from the target channel stays
enabled and keeps firing every tick, exactly the noise pattern this PR
exists to fix.

Capture every Discord API code from each fetch attempt, pick a permanent
one (10003 / 50001 / 50013 / 50007) when available, and attach it to the
thrown wrapper as `code` (with the full list on `discordErrorCodes`
for diagnostics). The wrapper's message is unchanged so callers that
match on the string still work.

Add a permanent-error.test.ts case asserting the wrapper Error with a
forwarded code is classified as permanent.

bun test: 419 pass, 1 skip, 0 fail.
@LIU9293 LIU9293 force-pushed the fix/cron-auto-disable-channel-not-found-04526a90 branch from caf32bf to abc106f Compare June 14, 2026 19:17
@LIU9293

LIU9293 commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

Rebased onto latest main (66925fd). One small conflict in packages/core/cron/scheduler.ts — kept sendSlackChannelMessage imported from @/core/runtime/slack-senders (the new module on main) and added isPermanentChannelError from @/shared/delivery/permanent-error next to it.

Full bun test green locally: 434 pass, 1 skip, 0 fail.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: abc106fe48

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/ims/discord/client.ts Outdated
Comment on lines +181 to +183
const permanent = discordErrorCodes.find((c) => PERMANENT_PRIORITY.has(c));
(wrapper as Error & { code?: number; discordErrorCodes?: number[] }).code =
permanent ?? discordErrorCodes[0];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid disabling Discord jobs on mixed fetch failures

When multiple Discord bots are configured, resolveTextChannel collects errors from every client; if the owning bot's channels.fetch() fails transiently but another workspace's bot returns Missing Access/Unknown Channel, this prioritizes the unrelated permanent code and the cron scheduler will auto-disable the job even though a retry with the owning bot could succeed. Cron top-level sends call sendDiscordChannelMessage(job.channelId, ...) without a processor/workspace id, so this mixed-attempt path is reachable in multi-workspace daemons.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e1a33ea. resolveTextChannel now tracks the pinned bot's error code separately from the multi-bot fan-out, with this rule for the wrapper's forwarded code:

  1. Pinned bot was attempted → forward its code (whatever it was). The pinned bot owns the channel, so its outcome is authoritative; unrelated bots' permanent codes for the same channel are noise.
  2. No pinned bot (e.g. cron top-level send without a processorId) → only forward a permanent code if every captured code is permanent. Mixed transient+permanent means at least one bot might succeed on retry, so the wrapper deliberately does NOT carry a permanent code.

New test ignores Discord resolveTextChannel wrapper without a forwarded code confirms that a wrapper with no code (or a transient code like 500) is not classified as permanent, so the cron scheduler will not auto-disable on a mixed-bot fan-out. Full suite green (435 pass, 1 skip).

"missing_scope",
// Lark
"chat_not_found",
"chat not exist",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match Lark's deleted-chat message

When the Lark send API returns a payload whose msg is just chat does not exist, the Lark client throws that message directly from larkApi, but this token does not match because it omits does. The new test only passes because it includes chat_not_found; deleted/stale Lark chat cron jobs that report the documented-style text therefore take the generic failure path and remain enabled instead of being auto-disabled.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e1a33ea. Added "chat does not exist" as an explicit token in PERMANENT_MESSAGE_TOKENS — the shorter "chat not exist" does not match this string because of the "does" infix, so deleted/stale Lark chats whose larkApi payload surfaces as just "chat does not exist" now classify as permanent and the cron row auto-disables. New test matches Lark deleted-chat message text alone covers the bare-string case; full suite still green (435 pass, 1 skip).

…t failures

Addresses two unaddressed Codex review comments on PR #211:

1. Lark deleted/stale chats: the Lark client re-throws the raw 'msg' field
   for some payloads, which surfaces as 'chat does not exist'. The shorter
   'chat not exist' token does not match (the 'does' infix breaks the
   substring), so cron jobs targeting deleted Lark chats stayed enabled.
   Add 'chat does not exist' as an explicit token.

2. Discord multi-bot fetch races: when no processorId pins a bot,
   resolveTextChannel iterates every Discord client; mixing one bot's
   transient error with another (unrelated) bot's 'Missing Access' would
   forward the permanent code on the wrapper and auto-disable the cron
   job even though a retry could succeed. Track the pinned bot's code
   separately and only treat the multi-bot wrapper as permanent when
   either (a) the pinned bot returned a code, or (b) every captured code
   is permanent.

Tests:
- Added 'matches Lark deleted-chat message text alone' covering the bare
  'chat does not exist' string.
- Added 'ignores Discord resolveTextChannel wrapper without a forwarded
  code' covering the mixed-bot case (no code or transient code on the
  wrapper).
- Full suite green: 435 pass, 1 skip, 0 fail.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1a33ea782

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +204 to +206
const allPermanent = discordErrorCodes.every((c) => PERMANENT_PRIORITY.has(c));
if (allPermanent) {
forwardedCode = discordErrorCodes.find((c) => PERMANENT_PRIORITY.has(c));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require every Discord fetch attempt to be permanent

This still auto-disables a Discord cron job when a code-less transient fetch failure is mixed with an unrelated bot's permanent error: allPermanent only checks discordErrorCodes, so an ECONNRESET/network error from the bot that could own the channel is ignored while another workspace bot's Missing Access/Unknown Channel code makes the wrapper look permanent. Cron top-level sends call this without a processor id, so in multi-workspace daemons this can disable a healthy job after a single transient fetch failure. Fresh evidence after the prior mixed-bot comment is that the current guard counts captured numeric codes rather than all failed attempts.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant