Skip to content

CASSSIDECAR-377: Implement job coordination for cluster-wide operations#360

Open
andresbeckruiz wants to merge 2 commits into
apache:trunkfrom
andresbeckruiz:CASSSIDECAR-377
Open

CASSSIDECAR-377: Implement job coordination for cluster-wide operations#360
andresbeckruiz wants to merge 2 commits into
apache:trunkfrom
andresbeckruiz:CASSSIDECAR-377

Conversation

@andresbeckruiz

Copy link
Copy Markdown
Contributor

CASSSIDECAR-377

Original PR made against CASSSIDECAR-373 branch with review comments: andresbeckruiz#2.

Changes

  • OperationalJobCoordinator interface
  • StorageOperationalJobCoordinator: Implementation that delegates to StorageProvider's compare and set based methods to acquire a lock for an active operation
  • OperationalJob.operationType(): New abstract method returning OperationType enum, implemented by all concrete jobs
  • OperationalJobManager integration: If a job requires coordination, the manager acquires the active lock via the coordinator before execution; throws OperationalJobConflictException if rejected
  • Unit tests for StorageOperationalJobCoordinator and coordinator integration in OperationalJobManagerTest

This ticket only covers the job activation path when a job is submitted to be executed immediately. Clearing locks after completion and status update activation will be implemented in later tickets (CASSSIDECAR-378, CASSSIDECAR-379).

Future Work

  • CASSSIDECAR-378: Add support for creation and coordination of local Sidecar Jobs
  • CASSSIDECAR-379: Add PATCH operational-jobs/{id} API to update status of existing operational jobs


// New job is submitted for all cases when we do not have a corresponding downstream job
jobTracker.computeIfAbsent(job.jobId(), jobId -> {
OperationalJob tracked = jobTracker.computeIfAbsent(job.jobId(), jobId -> job);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to remove from jobTracker if there's a coordination conflict ? Or perhaps should we just add to jobTracker in case the tryCoordination was successful, so tryCoordination should come before jobtracker.computeIfAbsent ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leaves stale job entry in CREATED state if the coordination fails.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to mark the job as FAILED if coordination fails? And set the failureReason to indicate that the job was not able to start due to coordination failure?

{
if (job.requiresCoordination())
{
Preconditions.checkState(coordinator != null,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preconditions throws illegal state exception, which is not handled in the caller

};

manager.trySubmitJob(job, onComplete, executorPool.service(), SecondBoundConfiguration.parse("5s"));
assertThat(latch.await(10, TimeUnit.SECONDS)).isTrue();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also verify that tracker doesn't have entry for this job after conflict detected.

{
Preconditions.checkState(coordinator != null,
"Job requires coordination but no OperationalJobCoordinator is configured");
boolean activated = coordinator.trySetActive(job.operationType(), job.jobId());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid to merge this PR without calling clearActive. If we delay adding that in a subsequent PR and if someone enables requiresCoordination() for any job meanwhile, then no more further operations will be allowed. I would recommend implementing calling clearActive in the same PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants