Try to start the queue maintainer multiple times with backoff (#1184)
This one's aimed at addressing #1161. `HookPeriodicJobsStart.Start` may
return an error that causes the queue maintainer not to start, and a few
other intermittent errors (say, a transient DB problem) can have the
same effect. When this occurs, the client's current course of action is
to just emit an error to the logs and attempt no further remediation,
which could leave the queue maintainer offline for extended periods.
Here, try to address this broadly by giving the queue maintainer a few
attempts at starting, using our standard exponential backoff (1s, 2s,
4s, 8s, etc.). If the queue maintainer still fails to start completely,
the client requests resignation and hands leadership off to another
client to see whether it can start successfully.
I think this is an okay compromise because in the case of a
non-transient, fundamental error (say, `HookPeriodicJobsStart.Start`
always returns an error), we don't go into a hot loop that starts
hammering things. Instead, we get a reasonably responsible slow backoff
that gives things a chance to recover and should be very visible in logs.
Fixes #1161.
CHANGELOG.md (4 additions & 0 deletions)
```diff
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Fixed
+
+- Upon a client gaining leadership, its queue maintainer is given more than one opportunity to start. [PR #1184](https://github.com/riverqueue/river/pull/1184).
+
```
```diff
@@ -606,11 +607,20 @@ type Client[TTx any] struct {
 	pilot                riverpilot.Pilot
 	producersByQueueName map[string]*producer
 	queueMaintainer      *maintenance.QueueMaintainer
-	queues               *QueueBundle
-	services             []startstop.Service
-	stopped              <-chan struct{}
-	subscriptionManager  *subscriptionManager
-	testSignals          clientTestSignals
+
+	// queueMaintainerEpoch is incremented each time leadership is gained,
+	// giving each tryStartQueueMaintainer goroutine a term number.
+	// queueMaintainerMu serializes epoch checks with Stop calls so that a
+	// stale goroutine from an older term cannot tear down a maintainer
+	// started by a newer term.
+	queueMaintainerEpoch int64
+	queueMaintainerMu    sync.Mutex
+
+	queues              *QueueBundle
+	services            []startstop.Service
+	stopped             <-chan struct{}
+	subscriptionManager *subscriptionManager
+	testSignals         clientTestSignals
 
 	// workCancel cancels the context used for all work goroutines. Normal Stop
 	// does not cancel that context.
```
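The epoch/mutex comment in the hunk above describes a term-based guard: each leadership term bumps an epoch counter, and a goroutine may only act if the epoch it captured is still current. A minimal standalone sketch of that pattern, with hypothetical names rather than River's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// epochGuard sketches the term-check pattern: beginTerm starts a new
// leadership term, and runIfCurrent lets a goroutine act only if the
// term number it captured is still the latest. All names here are
// illustrative, not part of River's API.
type epochGuard struct {
	mu    sync.Mutex
	epoch int64
}

// beginTerm starts a new leadership term and returns its term number.
func (g *epochGuard) beginTerm() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.epoch++
	return g.epoch
}

// runIfCurrent executes fn only if term is still the latest, holding
// the mutex so the epoch check and the action are atomic with respect
// to concurrent term changes (the role queueMaintainerMu plays above).
func (g *epochGuard) runIfCurrent(term int64, fn func()) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if term != g.epoch {
		return false // stale goroutine from an older term: do nothing
	}
	fn()
	return true
}

func main() {
	var g epochGuard
	stale := g.beginTerm() // term 1
	_ = g.beginTerm()      // term 2 supersedes it
	// The goroutine holding term 1 is now stale and must not tear
	// down state owned by term 2.
	fmt.Println("stale goroutine allowed to act:", g.runIfCurrent(stale, func() {}))
}
```

Holding the mutex across both the check and the action is the important part; checking the epoch and then acting without the lock would reintroduce the race.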
```diff
@@ -619,7 +629,9 @@ type Client[TTx any] struct {
 
 // Test-only signals.
 type clientTestSignals struct {
-	electedLeader testsignal.TestSignal[struct{}] // notifies when elected leader
+	electedLeader                        testsignal.TestSignal[struct{}] // notifies when elected leader
+	queueMaintainerStartError            testsignal.TestSignal[error]    // notifies on each failed queue maintainer start attempt
+	queueMaintainerStartRetriesExhausted testsignal.TestSignal[struct{}] // notifies when leader resignation is requested after all queue maintainer start retries have been exhausted
 
 	jobCleaner *maintenance.JobCleanerTestSignals
 	jobRescuer *maintenance.JobRescuerTestSignals
@@ -631,6 +643,8 @@ type clientTestSignals struct {
```