Skip to content

CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop#350

Open
mansikhara wants to merge 6 commits into
apache:trunkfrom
mansikhara:CASSSIDECAR-462
Open

CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop#350
mansikhara wants to merge 6 commits into
apache:trunkfrom
mansikhara:CASSSIDECAR-462

Conversation

@mansikhara

Copy link
Copy Markdown
Contributor

No description provided.

@mansikhara mansikhara changed the title CASSSIDECAR-462 Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop May 15, 2026

@yifan-c yifan-c left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level design suggestion

The slow discovery loop is the correctness guarantee — it must always run on schedule. The fast status-check loop is a best-effort latency optimization — it is safe to skip any individual tick. All shared mutable state (JobIdsByDay) is intentionally not self-synchronized; a single external mutual-exclusion gate is the sole source of thread safety. The changes below enforce this by replacing two independent CAS flags (which gave both loops equal priority and allowed concurrent access to shared state) with a ReentrantLock whose call sites encode the asymmetry directly: lock() for the slow loop (blocks briefly, never skips), tryLock() for everything else (opportunistic).

Concrete changes

  1. Replace AtomicBoolean isExecuting with ReentrantLock executionLock. Slow loop uses lock() (blocks until free, never skips); fast loop and processJobNow use tryLock() (skip if busy). This directly encodes the priority asymmetry: a shared CAS treats all callers equally, so a fast loop holding it when the slow loop's 5–10 min timer fires causes the slow loop to miss its entire cycle.

  2. Remove checkerExecuting from StatusCheckTask. ReentrantLock is now the sole gate; a second independent flag re-introduces the two-gate concurrency problem.

  3. Remove synchronized from all JobIdsByDay methods. They were only added because the two independent gates allowed concurrent access. With a single lock the original invariant — exactly one thread touches JobIdsByDay at a time — is fully restored.

  4. Restore volatile int inflightJobsCount. scheduleDecision() is called by the timer without holding the lock. Reading inflightJobIds() there would be an unsynchronized read of JobIdsByDay once synchronized is removed. The original counter is written under the lock (visible via the lock's happens-before edge) and read as a plain volatile from scheduleDecision(). Also fixes the O(jobs)-allocation-per-tick regression for hasInflightJobs() and the gauge update.

* well within the slow-loop interval. The slow loop is configured to a much larger interval so
* a positive observation can only be the work of the fast status-check loop.
*/
class RestoreJobStatusCheckerIntTest extends SharedClusterIntegrationTestBase

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename to RestoreJobDiscovererStatusCheckIntTest so that Intellij can correct associate the test with RestoreJobDiscoverer and vice versa.

private final AtomicBoolean isExecuting = new AtomicBoolean(false);
private String localDatacenter = null;
private int inflightJobsCount = 0;
final StatusCheckTask statusCheckTask = new StatusCheckTask();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason of package-private statusCheckTask is solely for testing purpose. Can we instead have a getter with @VisibleForTesting and make this field private final?

*/
void handleStatusTransition(RestoreJob currentJob)
{
initLocalDatacenterMaybe();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is called for every RestoreJob unnecessarily.

}

@JsonProperty(value = "job_discovery_status_check_interval")
public void setJobDiscoveryStatusCheckInterval(MillisecondBoundConfiguration jobDiscoveryStatusCheckInterval)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This method is unused. Seems like we are following the pattern of having set methods, other set methods in class are deprecated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants