CASSSIDECAR-374: Implement durable operational job tracker by andresbeckruiz · Pull Request #359 · apache/cassandra-sidecar

andresbeckruiz · 2026-06-10T20:34:25Z

CASSSIDECAR-374

Original PR made against CASSSIDECAR-373 branch with review comments: andresbeckruiz#3.

Changes

DurableOperationalJobTracker: Implementation of OperationalJobTracker that persists job state via StorageProvider. Live local jobs are cached in a local ConcurrentHashMap, completed jobs are served from storage
OperationalJobManager: Separate job tracking from execution — the job is registered in the tracker (and persisted) before execution begins, ensuring durable trackers always have a record before the job runs
Unit tests for DurableOperationalJobTracker and persist-before-execute verification in OperationalJobManagerTest
Integration test for DurableOperationalJobTracker with Cassandra-backed StorageProvider

Future Work

CASSSIDECAR-378: Add support for creation and coordination of local Sidecar Jobs

pauloricardomg

I noticed the durable job tracker is never initialized, do we plan to do this in a follow-up PR ? Is the idea that the operator will choose whether to use the durable or the in memory job tracker ?

pauloricardomg · 2026-06-12T16:10:34Z

+    @Nullable
+    public UUID nodeId()
+    {
+        return null;


Why did we not include this field in the cluster ops schema?

Also why is this field needed, I don't see it being used anywhere.

pauloricardomg · 2026-06-12T16:11:15Z

+    @NotNull
+    public List<UUID> nodesExecuting()
+    {
+        return Collections.emptyList();


Is there any reason why we're not populating these nodes* fields from the cluster_ops_node_state table ?

pauloricardomg · 2026-06-12T19:03:15Z

+     * @param job the operational job to convert
+     * @return a new record capturing the job's current state
+     */
+    public static OperationalJobRecord fromOperationalJob(OperationalJob job)


When a job completes, the DurableOperationalJobTracker evicts it from the live map. After that, get(jobId) returns
an OperationalJobRecord from storage. Let's trace what data is available at each stage:

While the job is live (e.g., NodeDecommissionJob): nodeId() → UUID of the node being decommissioned startTime() → when execution began nodesPending() → [nodeId] initially, then [] after execution starts nodesExecuting()→ [nodeId] while running nodesSucceeded()→ [nodeId] after success nodesFailed() → [nodeId] after failure

The factory that persists the job only captures 3 fields:

// OperationalJobRecord.fromOperationalJob(job) return new OperationalJobRecord(job.jobId(), job.operationType(), job.status()); // ^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^ // that's it — nodeId, startTime, node lists are all discarded

After completion, get() returns the record with stubbed implementations:

nodeId() → null (always) startTime() → null (constructor sets it to null) nodesPending() → emptyList (always) nodesExecuting() → emptyList (always) nodesSucceeded() → emptyList (always) nodesFailed() → emptyList (always)

So an API client observing a decommission job would see:

DURING execution: AFTER completion (from storage): ───────────────────────────────── ───────────────────────────────── { { "jobId": "abc-123", "jobId": "abc-123", "operation": "decommission", "operation": "DECOMMISSION", "jobStatus": "RUNNING", "jobStatus": "SUCCEEDED", "startTime": "2026-06-12T...", "startTime": null, "nodesExecuting": ["node-uuid"], "nodesExecuting": [], "nodesPending": [], "nodesPending": [], "nodesSucceeded": [], "nodesSucceeded": [], "nodesFailed": [] "nodesFailed": [] } }

All the provenance — which node was decommissioned, when it started, whether it succeeded per-node — is lost once the job leaves the
local map. The record in Cassandra only has (jobId, operationType, status).

pauloricardomg · 2026-06-12T19:20:13Z

+            storageProvider.persistJob(OperationalJobRecord.fromOperationalJob(job));
+
+            job.asyncResult().onComplete(ar -> {
+                liveJobs.remove(job.jobId());


liveJobs removal should come after updateTerminalStatus to avoid RUNNING → CREATED → SUCCEEDED. when polling status between removal and update

pauloricardomg · 2026-06-12T19:23:20Z

+        return liveJobs.computeIfAbsent(jobId, id -> {
+            OperationalJob job = mappingFunction.apply(id);
+
+            storageProvider.persistJob(OperationalJobRecord.fromOperationalJob(job));


what happens if !storageProvider.isAvailable ?

pauloricardomg · 2026-06-12T19:23:52Z

+                                        TaskExecutorPool executor)
+    {
+        this.liveJobs = new ConcurrentHashMap<>(serviceConfiguration.operationalJobTrackerSize());
+        this.storageProvider = storageProvider;


who is expected to initialize the storage provider ? can this class assume it's already initialized or do we want to initialize here?

pauloricardomg · 2026-06-12T19:25:51Z

+                                        StorageProvider storageProvider,
+                                        TaskExecutorPool executor)
+    {
+        this.liveJobs = new ConcurrentHashMap<>(serviceConfiguration.operationalJobTrackerSize());


Unlike InMemoryOperationalJobTracker which has removeEldestEntry, the durable tracker has no eviction. If
asyncResult() never completes (e.g., executeBlocking fails to submit due to full pool), the entry stays in the map forever. Is this expected?

skoppu22 · 2026-06-18T14:16:10Z

+        {
+            LOGGER.warn("Failed to update terminal status for job {} (attempt {}/{}). error={}",
+                        job.jobId(), attempt, MAX_STATUS_UPDATE_ATTEMPTS, e.getMessage());
+            if (attempt < MAX_STATUS_UPDATE_ATTEMPTS)


This method is called from above method with attempt=1, if 1st attempt fails, we are doing only one more retry. Should we increase value of MAX_STATUS_UPDATE_ATTEMPTS ?

Also, if retries also failed to update, we will have stale entries in CREATED stated. We need a periodic thread running once in a while and removing stale entries (this can be deferred to later as not a blocker).

CASSSIDECAR-374: Implement durable operational job tracker

a2707c0

andresbeckruiz force-pushed the CASSSIDECAR-374 branch from 2f8f987 to b43cc8a Compare June 10, 2026 20:59

Address PR review comments

346a446

andresbeckruiz force-pushed the CASSSIDECAR-374 branch from b43cc8a to 346a446 Compare June 12, 2026 19:14

pauloricardomg reviewed Jun 12, 2026

View reviewed changes

skoppu22 reviewed Jun 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CASSSIDECAR-374: Implement durable operational job tracker#359

CASSSIDECAR-374: Implement durable operational job tracker#359
andresbeckruiz wants to merge 2 commits into
apache:trunkfrom
andresbeckruiz:CASSSIDECAR-374

andresbeckruiz commented Jun 10, 2026 •

edited

Loading

Uh oh!

pauloricardomg left a comment

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

pauloricardomg Jun 12, 2026

Uh oh!

skoppu22 Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

andresbeckruiz commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Future Work

Uh oh!

pauloricardomg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andresbeckruiz commented Jun 10, 2026 •

edited

Loading