Newbranchnotify9 #111

Open
canicefavour wants to merge 3 commits into Core-Foundry:main from canicefavour:Newbranchnotify9
canicefavour commented Jun 20, 2026

Implemented a telemetry and metrics reliability fix that eliminates duplicate counting of successful jobs that are retried one or more times before completing.

Overview

This change addresses an issue where retryable jobs were inflating dashboard metrics by emitting duplicate success events during the retry lifecycle. As a result, jobs that failed multiple times before eventually succeeding could be counted as multiple successful executions, leading to inaccurate reporting and distorted system health analytics.

The implementation refactors retry tracking and metric emission logic to ensure that retries and final outcomes are recorded independently and accurately.

Improvements Implemented
Retry Lifecycle Investigation
Audited the complete retry execution flow and event emission pipeline.
Identified locations where metrics were being emitted multiple times during retry state transitions.
Analyzed how retry attempts, failures, and final success events interacted with telemetry collection.
Metrics Deduplication
Refactored metric emission logic to ensure successful jobs are counted exactly once regardless of the number of retry attempts.
Separated retry-attempt tracking from final execution outcome tracking.
Prevented duplicate success events from being generated during state transitions.
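
The deduplication described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the `JobMetrics` class, its method names, and the metric names `job.retry` and `job.success` are all hypothetical, and the counters are kept in memory purely for illustration.

```python
from collections import Counter


class JobMetrics:
    """Hypothetical in-memory metrics recorder (illustrative only)."""

    def __init__(self):
        self.counts = Counter()
        self._succeeded = set()  # job ids already counted as successful

    def record_retry(self, job_id):
        # Retry attempts go to their own counter, separate from outcomes.
        self.counts["job.retry"] += 1

    def record_success(self, job_id):
        # Count each job's success once; repeat calls for the same id are ignored.
        if job_id in self._succeeded:
            return False
        self._succeeded.add(job_id)
        self.counts["job.success"] += 1
        return True


metrics = JobMetrics()
metrics.record_retry("job-1")
metrics.record_retry("job-1")
metrics.record_success("job-1")
metrics.record_success("job-1")  # duplicate emission, ignored by the guard
```

Keying the guard on the job id means a second success emission during a state transition is dropped instead of counted, while retry counts are unaffected.
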
Telemetry & Dashboard Alignment
Verified emitted metrics conform to the expected schema consumed by monitoring and dashboard systems.
Ensured success, failure, and retry metrics are reported consistently across all execution paths.
Improved accuracy of aggregate counts, success rates, failure rates, and operational reporting.
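
To illustrate why duplicate success events distort success rates, here is a small worked example. The batch of three jobs and the `success_rate` helper are hypothetical and chosen only to show the arithmetic; they are not taken from the project.

```python
def success_rate(successes, failures):
    """Share of terminal outcomes that were successes."""
    total = successes + failures
    return successes / total if total else 0.0


# Hypothetical batch: job A succeeds on the first try, job B succeeds
# after two retries, job C fails permanently.
correct = success_rate(successes=2, failures=1)

# If job B's success were counted three times (once per attempt),
# four successes would be reported instead of two.
inflated = success_rate(successes=4, failures=1)

print(f"correct={correct:.2f} inflated={inflated:.2f}")
```
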
Event Handling Improvements
Updated retry handling workflows to emit retry metrics independently of completion metrics.
Ensured final success metrics are only emitted when a job reaches its terminal successful state.
Preserved visibility into retry behavior without inflating execution totals.
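
One way to keep retry metrics independent of completion metrics is to emit the retry event per attempt and the outcome event only at the terminal state. The sketch below assumes a synchronous runner; `run_with_retries`, the `emit` callback, and the metric names are hypothetical stand-ins, not the project's API.

```python
from typing import Callable


def run_with_retries(job: Callable[[], None], max_retries: int,
                     emit: Callable[[str], None]) -> bool:
    """Run `job`, retrying on failure, and emit exactly one terminal outcome."""
    for attempt in range(max_retries + 1):
        if attempt > 0:
            emit("job.retry")  # one event per retry attempt
        try:
            job()
        except Exception:
            continue  # not terminal yet, unless retries are exhausted
        emit("job.success")  # only on reaching the terminal successful state
        return True
    emit("job.failure")  # retries exhausted: terminal failure
    return False


# Usage: a job that fails twice, then succeeds.
outcomes = iter([RuntimeError("boom"), RuntimeError("boom"), None])


def flaky():
    error = next(outcomes)
    if error is not None:
        raise error


events = []
run_with_retries(flaky, max_retries=3, emit=events.append)
print(events)  # ['job.retry', 'job.retry', 'job.success']
```
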
Regression Protection
Added comprehensive test coverage for retry scenarios.
Implemented tests simulating multiple failures followed by eventual success.
Verified expected metric outputs such as:
2 retry attempts
1 successful execution
0 duplicate success events
Added safeguards to detect future regressions in metric emission behavior.
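
A regression check along these lines could summarize a captured event log and assert on the counts listed above. This is a sketch in pytest style under stated assumptions: the `summarize` helper, the `(job_id, event_name)` log format, and the metric names are hypothetical, and the event logs are hand-written rather than produced by the real job runner.

```python
from collections import Counter


def summarize(events):
    """Summarize (job_id, event_name) pairs from a captured metrics log."""
    retries = sum(1 for _, name in events if name == "job.retry")
    success_per_job = Counter(
        job_id for job_id, name in events if name == "job.success"
    )
    return {
        "retries": retries,
        "successes": len(success_per_job),
        "duplicate_successes": sum(n - 1 for n in success_per_job.values()),
    }


def test_two_failures_then_success():
    # Expected log for a job that fails twice and then succeeds.
    events = [("job-1", "job.retry"), ("job-1", "job.retry"),
              ("job-1", "job.success")]
    assert summarize(events) == {
        "retries": 2, "successes": 1, "duplicate_successes": 0,
    }


def test_duplicate_success_is_detected():
    # A regressed log in which the same job reports success twice.
    events = [("job-1", "job.retry"), ("job-1", "job.success"),
              ("job-1", "job.success")]
    assert summarize(events)["duplicate_successes"] == 1
```
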
Result

System telemetry now accurately reflects actual job execution outcomes. Retried jobs that eventually succeed are counted as a single successful execution while still maintaining visibility into retry activity. Dashboard metrics, reporting, and operational analytics now provide a more reliable representation of system performance and health.

Closes #101

@Abd-Standard

Collaborator

Please adjust this PR description: it must say "closes", not "close", or else you won't get points even if I merge it.

Development

Successfully merging this pull request may close these issues.

[Bug] Fix Incorrect Success Count After Batch Retry

2 participants