refactor(central): decouple compliance persistence from watcher#21225
refactor(central): decouple compliance persistence from watcher#21225guzalv wants to merge 1 commit into
Conversation
The compliance scan and result pipelines called HandleScan/HandleResult before UpsertScan/UpsertResult, making data persistence conditional on watcher success. The watcher exists for report generation tracking but acted as a gatekeeper for whether compliance data exists at all. Swap the order: persist first, then notify. Watcher errors are still propagated (error semantics unchanged). This ensures valid compliance data from clusters is always written to the database, regardless of watcher-internal state (e.g. no matching scan config yet, watcher already stopped, scan already handled). Follow-up PRs will make watcher errors non-fatal (completing the decoupling) and fix the timeout deletion trade-off from PR #20752. Partially-generated with AI assistance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Skipping CI for Draft Pull Request. |
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🚀 Build Images ReadyImages are ready for commit f4f23fb. To use with deploy scripts: export MAIN_IMAGE_TAG=4.12.x-208-gf4f23fb3da |
Description
The compliance scan and result pipelines are coupled to the report manager's watcher system in a way that makes data persistence conditional on watcher success. Both pipelines call
HandleScan/HandleResultbeforeUpsertScan/UpsertResult. If the watcher returns any error, the pipeline short-circuits and the data is never persisted.The watcher exists solely for report generation tracking, but it acts as a gatekeeper for whether compliance data exists in the database at all.
Shortcomings of this coupling
ErrScanAlreadyHandledare watcher-internal concerns unrelated to data validity. Non-transient errors cause the event to be permanently dropped by the worker queue.DeleteOldResultson the watcher's error state, but this introduced a trade-off: on timeout, old-cycle results are never cleaned up and accumulate forever.What this PR changes
Swaps the order: persist first, then notify the watcher. Watcher error semantics are preserved (still propagated to the caller). This is the minimal structural change that breaks the coupling.
Order swap is safe because:
HandleScannorHandleResultmutate the scan/result objectsDeleteOldResultsprovides timestamp-based cleanup of stale data on each successful scan cycleHandleScandoes not depend on the scan being in the scan datastore.HandleResultdepends on the SCAN (not the result) being in the scan datastore, which is unaffected by the result pipeline's order change.Follow-up PRs enabled by this change
HandleScan/HandleResulterrors instead of returning them. Events succeed regardless of watcher state. This completes the decoupling.DeleteOldResultson timeout withincludeCurrentResults=false(strictly<comparison) instead of skipping it entirely. Deletes truly old results while preserving current ones. Any aggressively-deleted results would be re-persisted on the next sensor sync since persistence is now unconditional.User-facing documentation
Testing and quality
Automated testing
How I validated my change
gomock.InOrder(UpsertScan/UpsertResult before HandleScan/HandleResult)Cluster verification
Deployed the patched central to an OCP 4.22 cluster (ga-ocp4-cron) with sensor on a second cluster (ga-ocp4-cron-2) running Compliance Operator v1.9.0.
After triggering CIS compliance scans, 240 check results were persisted to the database despite the watcher returning
"this scan is not handled by any known scan configuration"errors for all events. Central logs confirm the new behavior:Before this change, those 240 results would NOT have been persisted — the watcher error would have short-circuited before
UpsertResultwas called. NowUpsertResultruns first (data persisted), thenHandleResultfails (logged, error propagated to worker queue).