refactor(central): decouple compliance persistence from watcher by guzalv · Pull Request #21225 · stackrox/stackrox

guzalv · 2026-06-17T11:55:07Z

Description

The compliance scan and result pipelines are coupled to the report manager's watcher system in a way that makes data persistence conditional on watcher success. Both pipelines call HandleScan/HandleResult before UpsertScan/UpsertResult. If the watcher returns any error, the pipeline short-circuits and the data is never persisted.

The watcher exists solely for report generation tracking, but it acts as a gatekeeper for whether compliance data exists in the database at all.

Shortcomings of this coupling

Data loss from watcher lifecycle errors. Errors like "scan not handled by any known scan configuration", "the watcher is stopped", and ErrScanAlreadyHandled are watcher-internal concerns unrelated to data validity. Non-transient errors cause the event to be permanently dropped by the worker queue.
Symptom-level fixes trace back here. Every fix for ROX-34779 (PRs ROX-34779: skip DeleteOldResults on scan watcher timeout #20752, ROX-34779: fix scan watcher race on sensor reconnection #20972, fix(central): prevent cross-cluster compliance data deletion on sensor reconnect #21144) patched symptoms rather than the root cause. For example, PR ROX-34779: skip DeleteOldResults on scan watcher timeout #20752 had to gate DeleteOldResults on the watcher's error state, but this introduced a trade-off: on timeout, old-cycle results are never cleaned up and accumulate forever.
Inconsistency with REMOVE_RESOURCE. The REMOVE case already follows the correct pattern (watcher errors non-fatal, delete always runs). Only CREATE/UPDATE gates persistence on watcher success.

What this PR changes

Swaps the order: persist first, then notify the watcher. Watcher error semantics are preserved (still propagated to the caller). This is the minimal structural change that breaks the coupling.

Order swap is safe because:

Neither HandleScan nor HandleResult mutate the scan/result objects
ComplianceCheckResult CRs are updated in-place across scan cycles (same K8s UID). The DedupingQueue replaces events with the same UID in-place, ensuring the latest version is processed last within FIFO ordering
DeleteOldResults provides timestamp-based cleanup of stale data on each successful scan cycle
HandleScan does not depend on the scan being in the scan datastore. HandleResult depends on the SCAN (not the result) being in the scan datastore, which is unaffected by the result pipeline's order change.

Follow-up PRs enabled by this change

Make watcher errors non-fatal. Log HandleScan/HandleResult errors instead of returning them. Events succeed regardless of watcher state. This completes the decoupling.
Fix timeout deletion (undo the PR ROX-34779: skip DeleteOldResults on scan watcher timeout #20752 trade-off). With persistence decoupled, call DeleteOldResults on timeout with includeCurrentResults=false (strictly < comparison) instead of skipping it entirely. Deletes truly old results while preserving current ones. Any aggressively-deleted results would be re-persisted on the next sensor sync since persistence is now unconditional.

User-facing documentation

CHANGELOG.md is updated OR update is not needed
documentation PR is created and is linked above OR is not needed

Testing and quality

the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
CI results are inspected

Automated testing

How I validated my change

All pipeline unit tests pass (scan and result pipelines)
Report manager unit tests pass (no regressions)
New tests verify ordering with gomock.InOrder (UpsertScan/UpsertResult before HandleScan/HandleResult)
New tests verify error scenarios: watcher error after successful persistence, and upsert error short-circuiting before watcher notification
Pre-commit lint/style checks pass

Cluster verification

Deployed the patched central to an OCP 4.22 cluster (ga-ocp4-cron) with sensor on a second cluster (ga-ocp4-cron-2) running Compliance Operator v1.9.0.

After triggering CIS compliance scans, 240 check results were persisted to the database despite the watcher returning "this scan is not handled by any known scan configuration" errors for all events. Central logs confirm the new behavior:

pipeline.go:91: Error: unable to handle the check result in the report manager: this scan is not handled by any known scan configuration
worker_queue.go:96: Error: Unretryable error found while handling sensor message *central.SensorEvent_ComplianceOperatorResultV2 permanently

Before this change, those 240 results would NOT have been persisted — the watcher error would have short-circuited before UpsertResult was called. Now UpsertResult runs first (data persisted), then HandleResult fails (logged, error propagated to worker queue).

The compliance scan and result pipelines called HandleScan/HandleResult before UpsertScan/UpsertResult, making data persistence conditional on watcher success. The watcher exists for report generation tracking but acted as a gatekeeper for whether compliance data exists at all. Swap the order: persist first, then notify. Watcher errors are still propagated (error semantics unchanged). This ensures valid compliance data from clusters is always written to the database, regardless of watcher-internal state (e.g. no matching scan config yet, watcher already stopped, scan already handled). Follow-up PRs will make watcher errors non-fatal (completing the decoupling) and fix the timeout deletion trade-off from PR #20752. Partially-generated with AI assistance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-06-17T11:55:11Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-17T11:55:22Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: e164433b-7a2d-4175-bbd6-fb37a0a404f0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/decouple-compliance-persistence-from-watcher

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-06-17T12:05:06Z

🚀 Build Images Ready

Images are ready for commit f4f23fb. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-208-gf4f23fb3da

guzalv added the ai-assisted label Jun 17, 2026

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 17, 2026

github-actions Bot added the area/central label Jun 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(central): decouple compliance persistence from watcher#21225

refactor(central): decouple compliance persistence from watcher#21225
guzalv wants to merge 1 commit into
masterfrom
refactor/decouple-compliance-persistence-from-watcher

guzalv commented Jun 17, 2026 •

edited

Loading

Uh oh!

openshift-ci Bot commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026

Review skipped

Uh oh!

github-actions Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

guzalv commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Shortcomings of this coupling

What this PR changes

Follow-up PRs enabled by this change

User-facing documentation

Testing and quality

Automated testing

How I validated my change

Cluster verification

Uh oh!

openshift-ci Bot commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026

Review skipped

Uh oh!

github-actions Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Build Images Ready

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

guzalv commented Jun 17, 2026 •

edited

Loading

github-actions Bot commented Jun 17, 2026 •

edited

Loading