CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop#350
CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop#350mansikhara wants to merge 6 commits into
Conversation
…loop and a slow slice-discovery loop
…loop and a slow slice-discovery loop
5631573 to
f565864
Compare
…scoverer # Conflicts: # CHANGES.txt
…loop and a slow slice-discovery loop
…loop and a slow slice-discovery loop
yifan-c
left a comment
There was a problem hiding this comment.
High-level design suggestion
The slow discovery loop is the correctness guarantee — it must always run on schedule. The fast status-check loop is a best-effort latency optimization — it is safe to skip any individual tick. All shared mutable state (JobIdsByDay) is intentionally not self-synchronized; a single external mutual-exclusion gate is the sole source of thread safety. The changes below enforce this by replacing two independent CAS flags (which gave both loops equal priority and allowed concurrent access to shared state) with a ReentrantLock whose call sites encode the asymmetry directly: lock() for the slow loop (blocks briefly, never skips), tryLock() for everything else (opportunistic).
Concrete changes
-
Replace
AtomicBoolean isExecutingwithReentrantLock executionLock. Slow loop useslock()(blocks until free, never skips); fast loop andprocessJobNowusetryLock()(skip if busy). This directly encodes the priority asymmetry: a shared CAS treats all callers equally, so a fast loop holding it when the slow loop's 5–10 min timer fires causes the slow loop to miss its entire cycle. -
Remove
checkerExecutingfromStatusCheckTask.ReentrantLockis now the sole gate; a second independent flag re-introduces the two-gate concurrency problem. -
Remove
synchronizedfrom allJobIdsByDaymethods. They were only added because the two independent gates allowed concurrent access. With a single lock the original invariant — exactly one thread touchesJobIdsByDayat a time — is fully restored. -
Restore
volatile int inflightJobsCount.scheduleDecision()is called by the timer without holding the lock. ReadinginflightJobIds()there would be an unsynchronized read ofJobIdsByDayoncesynchronizedis removed. The original counter is written under the lock (visible via the lock's happens-before edge) and read as a plain volatile fromscheduleDecision(). Also fixes the O(jobs)-allocation-per-tick regression forhasInflightJobs()and the gauge update.
| * well within the slow-loop interval. The slow loop is configured to a much larger interval so | ||
| * a positive observation can only be the work of the fast status-check loop. | ||
| */ | ||
| class RestoreJobStatusCheckerIntTest extends SharedClusterIntegrationTestBase |
There was a problem hiding this comment.
nit: rename to RestoreJobDiscovererStatusCheckIntTest so that Intellij can correct associate the test with RestoreJobDiscoverer and vice versa.
| private final AtomicBoolean isExecuting = new AtomicBoolean(false); | ||
| private String localDatacenter = null; | ||
| private int inflightJobsCount = 0; | ||
| final StatusCheckTask statusCheckTask = new StatusCheckTask(); |
There was a problem hiding this comment.
I think the reason of package-private statusCheckTask is solely for testing purpose. Can we instead have a getter with @VisibleForTesting and make this field private final?
| */ | ||
| void handleStatusTransition(RestoreJob currentJob) | ||
| { | ||
| initLocalDatacenterMaybe(); |
There was a problem hiding this comment.
This method is called for every RestoreJob unnecessarily.
| } | ||
|
|
||
| @JsonProperty(value = "job_discovery_status_check_interval") | ||
| public void setJobDiscoveryStatusCheckInterval(MillisecondBoundConfiguration jobDiscoveryStatusCheckInterval) |
There was a problem hiding this comment.
Nit: This method is unused. Seems like we are following the pattern of having set methods, other set methods in class are deprecated.
No description provided.