Newbranchnotify9 #111
Open
canicefavour wants to merge 3 commits into
Collaborator
Please adjust this PR description: it must say "Closes", not "Close", otherwise you won't get points even if I merge it.
Fixes duplicate counting in telemetry and metrics for successful jobs that go through one or more retry attempts before completing.
Overview
This change addresses an issue where retryable jobs were inflating dashboard metrics by emitting duplicate success events during the retry lifecycle. As a result, jobs that failed multiple times before eventually succeeding could be counted as multiple successful executions, leading to inaccurate reporting and distorted system health analytics.
The implementation refactors retry tracking and metric emission logic to ensure that retries and final outcomes are recorded independently and accurately.
Improvements Implemented
Retry Lifecycle Investigation
Audited the complete retry execution flow and event emission pipeline.
Identified locations where metrics were being emitted multiple times during retry state transitions.
Analyzed how retry attempts, failures, and final success events interacted with telemetry collection.
Metrics Deduplication
Refactored metric emission logic to ensure successful jobs are counted exactly once regardless of the number of retry attempts.
Separated retry-attempt tracking from final execution outcome tracking.
Prevented duplicate success events from being generated during state transitions.
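The deduplication idea can be illustrated with a minimal sketch. This is not the code from this PR: `MetricsRecorder`, the `job.retry` and `job.success` metric names, and the job ID format are all hypothetical, and the sketch assumes each job has a stable ID that can be used as the deduplication key.

```python
from collections import Counter


class MetricsRecorder:
    """Hypothetical in-memory recorder; not the project's real API."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self._succeeded: set = set()

    def record_retry(self, job_id: str) -> None:
        # Retry attempts are tracked under their own metric name.
        self.counts["job.retry"] += 1

    def record_success(self, job_id: str) -> bool:
        # Idempotent: only the first success for a given job is counted.
        if job_id in self._succeeded:
            return False
        self._succeeded.add(job_id)
        self.counts["job.success"] += 1
        return True


recorder = MetricsRecorder()
recorder.record_retry("job-1")
recorder.record_retry("job-1")
first = recorder.record_success("job-1")
second = recorder.record_success("job-1")  # duplicate emission is ignored
```

Keeping the retry counter separate from the success counter is what lets the dashboard show retry activity without it leaking into execution totals.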
Telemetry & Dashboard Alignment
Verified emitted metrics conform to the expected schema consumed by monitoring and dashboard systems.
Ensured success, failure, and retry metrics are reported consistently across all execution paths.
Improved accuracy of aggregate counts, success rates, failure rates, and operational reporting.
Event Handling Improvements
Updated retry handling workflows to emit retry metrics independently of completion metrics.
Ensured final success metrics are only emitted when a job reaches its terminal successful state.
Preserved visibility into retry behavior without inflating execution totals.
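As a rough sketch of the event handling described above, a retry loop can emit a retry metric for each failed attempt that will be retried, and exactly one terminal metric when the job finishes. The function `run_with_retries`, the helper `make_flaky_job`, and the metric names are assumptions made for illustration, not identifiers from this PR.

```python
from typing import Callable, List


def run_with_retries(
    job: Callable[[], None],
    max_attempts: int,
    emit: Callable[[str], None],
) -> bool:
    """Run job up to max_attempts times; emit at most one terminal metric."""
    for attempt in range(1, max_attempts + 1):
        try:
            job()
        except Exception:  # broad catch is for illustration only
            if attempt < max_attempts:
                emit("job.retry")    # another attempt will follow
            else:
                emit("job.failure")  # terminal failure, attempts exhausted
        else:
            emit("job.success")      # terminal success, emitted once
            return True
    return False


def make_flaky_job(failures: int) -> Callable[[], None]:
    """Hypothetical job that raises `failures` times, then succeeds."""
    remaining = [failures]

    def job() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient failure")

    return job


events: List[str] = []
succeeded = run_with_retries(make_flaky_job(2), max_attempts=3, emit=events.append)
```

Because the success metric is emitted only in the `else` branch of the final attempt that runs, state transitions between attempts cannot produce an extra success event.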
Regression Protection
Added comprehensive test coverage for retry scenarios.
Implemented tests simulating multiple failures followed by eventual success.
Verified expected metric outputs, for example for a job that fails twice before succeeding:
2 retry attempts
1 successful execution
0 duplicate success events
Added safeguards to detect future regressions in metric emission behavior.
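A regression check along these lines could summarize a captured event stream and assert on it. The sketch below is an assumption about how such a check might look, not the tests added in this PR: the `(job_id, metric_name)` event shape, the `summarize` helper, and the metric names are hypothetical, and in a real test the events would be captured from the job runner instead of written by hand.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str]  # (job_id, metric_name), a hypothetical event shape


def summarize(events: Iterable[Event]) -> Dict[str, int]:
    """Count retries, successful jobs, and duplicate success events."""
    retries = 0
    successes_per_job: Counter = Counter()
    for job_id, name in events:
        if name == "job.retry":
            retries += 1
        elif name == "job.success":
            successes_per_job[job_id] += 1
    return {
        "retries": retries,
        # Number of distinct jobs that reported success at least once.
        "successes": len(successes_per_job),
        # Every success event beyond the first per job is a duplicate.
        "duplicate_successes": sum(n - 1 for n in successes_per_job.values()),
    }


# Stand-in for events captured from a job that fails twice, then succeeds.
captured = [
    ("job-1", "job.retry"),
    ("job-1", "job.retry"),
    ("job-1", "job.success"),
]
summary = summarize(captured)
assert summary == {"retries": 2, "successes": 1, "duplicate_successes": 0}
```

If a future change reintroduces a second success emission for the same job, `duplicate_successes` becomes nonzero and the assertion fails.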
Result
System telemetry now accurately reflects actual job execution outcomes. Retried jobs that eventually succeed are counted as a single successful execution while still maintaining visibility into retry activity. Dashboard metrics, reporting, and operational analytics now provide a more reliable representation of system performance and health.
Closes #101