Skip to content

fix(jobscheduler): cancel periodic re-arm goroutines on shutdown#311

Merged
worstell merged 1 commit into
mainfrom
fix/scheduler-periodic-rearm-cancel
May 15, 2026
Merged

fix(jobscheduler): cancel periodic re-arm goroutines on shutdown#311
worstell merged 1 commit into
mainfrom
fix/scheduler-periodic-rearm-cancel

Conversation

@worstell
Copy link
Copy Markdown
Contributor

The graceful-shutdown change in #307 exposed a latent bug: after SIGTERM the scheduler sets q.done=true and workers exit, but the in-process re-arm goroutine spawned by SubmitPeriodicJob is a bare go func() { time.Sleep(interval); ... }() with no awareness of ctx. It dies with the pod, and the next pod's warmExistingRepos re-registers each periodic job — but periodicDelay reads the bbolt lastRun and sleeps the remaining interval before the next firing. Net effect: each periodic job skips up to one full interval per pod restart, dropping background snapshot/repack/fetch throughput sharply during and after rolling deploys.

  • Store the cancellable ctx on RootScheduler.
  • Replace time.Sleep in the re-arm and initial-delay goroutines with a select on q.ctx.Done() and a timer (new sleepThenSubmit helper).
  • Drop Submit() calls once q.done is set so re-arm goroutines that win the race against ctx cancellation can't enqueue to a dead scheduler.

Adds regression tests for both behaviours.

The graceful-shutdown change in #307 exposed a latent bug: after
SIGTERM the scheduler sets q.done=true and workers exit, but the
in-process re-arm goroutine spawned by SubmitPeriodicJob is a bare
'go func() { time.Sleep(interval); ... }()' with no awareness of
ctx. It dies with the pod, and the next pod's warmExistingRepos
re-registers each periodic job — but periodicDelay reads the bbolt
lastRun and sleeps the remaining interval before the next firing.
Net effect: each periodic job skips up to one full interval per pod
restart, dropping background snapshot/repack/fetch throughput by
~90% during and after a rolling deploy.

Fix:
- Store the cancellable ctx on RootScheduler.
- Replace time.Sleep in the re-arm and initial-delay goroutines with
  a select on q.ctx.Done() and a timer.
- Drop Submit() calls once q.done is set so re-arm goroutines that
  win the race against ctx cancellation can't enqueue to a dead
  scheduler.

Adds regression tests for both behaviours.

Amp-Thread-ID: https://ampcode.com/threads/T-019e2cde-c5c3-729a-9364-202a128cf43e
Co-authored-by: Amp <amp@ampcode.com>
@worstell worstell marked this pull request as ready for review May 15, 2026 18:50
@worstell worstell requested a review from a team as a code owner May 15, 2026 18:50
@worstell worstell requested review from jrobotham-square and removed request for a team May 15, 2026 18:50
@worstell worstell merged commit 26f531a into main May 15, 2026
8 checks passed
@worstell worstell deleted the fix/scheduler-periodic-rearm-cancel branch May 15, 2026 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants