
feat(orchestrator): Add dropped task metric for capped groups#5865

Merged
pfreixes merged 1 commit intomasterfrom
pfreixes/new-metrics
Apr 17, 2026

Conversation

@pfreixes (Contributor)

Adds a new metric that tells us when we are silently dropping tasks. This can be useful during incidents.

@pfreixes pfreixes marked this pull request as draft April 15, 2026 10:35
@pfreixes pfreixes marked this pull request as ready for review April 15, 2026 10:36
@pfreixes pfreixes changed the title Add dropped task metric for capped groups feat(orchestrator): Add dropped task metric for capped groups Apr 15, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 624cf4ead4


    }
    const droppedCountPerPrimitive = new Map<string, number>();
    for (const [groupKey, droppedCount] of cappedGroupCounts.entries()) {
        const primitive = groupKey.split(':')[0] || 'unknown';

P2: Avoid high-cardinality tags in dropped-task metric

This derives the primitive tag from groupKey.split(':')[0], which means any group key without a colon is used verbatim as a metric tag value. In this codebase, group.key is accepted as an arbitrary string (packages/orchestrator/lib/routes/v1/postImmediate.ts, z.string().min(1)), so capped tasks for user-specific/random keys (e.g., UUID/nanoid-style keys) will emit one Datadog series per key and can quickly create unbounded cardinality. Please normalize to a bounded set (or fall back to unknown when the expected prefix format is absent) before incrementing ORCH_TASKS_DROPPED.

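The normalization Codex suggests could look like the following sketch: derive the primitive tag from the group-key prefix, but only accept values from a known, bounded set so arbitrary keys (UUID/nanoid-style) cannot create unbounded metric cardinality. The set contents and function name here are illustrative, not taken from the actual codebase.

```typescript
// Hypothetical allow-list of primitive names; the real set would come from
// the orchestrator's known task types, not from this hard-coded example.
const KNOWN_PRIMITIVES = new Set(['action', 'sync', 'webhook', 'on-event']);

function primitiveTag(groupKey: string): string {
    // Take the prefix before the first colon, then collapse anything
    // outside the bounded set to a single 'unknown' tag value.
    const prefix = groupKey.split(':')[0];
    return prefix !== undefined && KNOWN_PRIMITIVES.has(prefix) ? prefix : 'unknown';
}

// A key with a known primitive prefix keeps its prefix...
console.log(primitiveTag('sync:customer-123')); // sync
// ...while a random key collapses to one tag value instead of one series per key.
console.log(primitiveTag('6f1c2d3e-4a5b')); // unknown
```

With this shape, a flood of capped tasks under user-specific keys emits at most one extra Datadog series (`unknown`) rather than one per key.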

@pfreixes (Contributor, Author)

The number of primitives is low cardinality, but thanks for the heads-up.

    });
} else {
    cappedGroupKeys.add(props.groupKey);
    cappedGroupCounts.set(props.groupKey, (cappedGroupCounts.get(props.groupKey) ?? 0) + 1);
Collaborator

Not sure of the value of this metric, to be honest. Dropping tasks is usually the result of some other problem upstream that we already have plenty of signal for.
If you really want it though, just increment here; no need to aggregate, the DD lib will do it for you.
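The "just increment, don't aggregate" suggestion can be illustrated with a minimal in-memory stand-in for a StatsD-style client (Datadog clients and the agent handle counter aggregation themselves; this `FakeStatsD` class and the metric name are illustrative, not the repo's actual metrics API):

```typescript
// Minimal stand-in demonstrating that per-event increments at the drop site
// produce the same totals as manually aggregating counts into a Map first.
class FakeStatsD {
    counts = new Map<string, number>();
    increment(stat: string, value = 1): void {
        this.counts.set(stat, (this.counts.get(stat) ?? 0) + value);
    }
}

const client = new FakeStatsD();
// One increment per dropped task, no intermediate cappedGroupCounts Map.
for (const groupKey of ['sync:a', 'sync:a', 'action:b']) {
    const primitive = groupKey.split(':')[0] || 'unknown';
    client.increment(`orch.tasks.dropped.${primitive}`);
}
console.log(client.counts.get('orch.tasks.dropped.sync')); // 2
console.log(client.counts.get('orch.tasks.dropped.action')); // 1
```

The design point is that the client (or the Datadog agent behind it) is the aggregation layer, so the drop site stays a one-liner.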

@pfreixes (Contributor, Author)

I could be wrong here, but at least I had no visibility into this during the latest incident; we could suspect it from some proxy metrics, but there was no explicit datapoint for it. (As said, I could be wrong, so happy to learn what other signals we have for this.)

Also, it's not only about incidents. Similar to the max-backpressure signal we added to know when customers were reaching their max concurrency threshold, with this new metric we can pinpoint specific time windows to know whether one or multiple customers were impacted (we will still need the logging part to frame specific issues for specific customers).

Note, just for clarification: upstream -> dependencies of our services, downstream -> users of our services? Or the other way around? Just asking because I've seen them used in both directions, and it's worth getting all of us aligned.

@TBonnin (Collaborator), Apr 15, 2026

Dropped tasks are the result of tasks being created faster than they are executed, which we already have several metrics for. Knowing and understanding why tasks are created too fast and/or not executed fast enough is the valuable piece of information. Dropped tasks is basically the same information reported as another metric.

@pfreixes (Contributor, Author)

> Dropped tasks are the result of tasks being created faster than being executed, which we already have several metrics for.

Sorry, but I kind of disagree here: dropped tasks is the canonical source of truth, while the other is a proxy metric from which you could infer that dropping is happening, but without explicit datapoints (the ones provided by the dropped metric) it's just an educated guess.

> Knowing and understanding why tasks are created too fast and/or not executed fast enough is the valuable piece of information.

Related to the previous point: you know that a customer is sending data too fast, or that the system is going too slow, when data is being discarded; with the creation rate or execution rate you only have proxy metrics for that.

Yes, knowing what is causing it is fundamental, but IMO that is a secondary step; first you need the data to know that it is happening.

@pfreixes (Contributor, Author)

Also, tasks.dropped gives us an explicit measure of discarded work volume. Creation and execution rates are only proxy metrics for pressure. With a direct dropped-tasks metric, we could even monitor absolute drops or dropped percentage relative to created tasks.

@pfreixes (Contributor, Author)

To have the %, we would need a total counter. IMO this would be ideal for paths like scheduling, where we can emit a total tried-to-schedule count vs. dropped; this can help quantify the impact of an incident.

Happy to also add this counter in this PR.
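The percentage described above is a simple ratio of two counters over the same window. A hedged sketch (the function name is hypothetical; in practice this would be a Datadog dashboard formula over the two emitted series rather than application code):

```typescript
// Drop rate as a percentage of scheduling attempts in a given time window.
// Guards against idle windows where nothing was scheduled.
function droppedPercentage(dropped: number, totalScheduled: number): number {
    if (totalScheduled <= 0) return 0; // no attempts -> nothing was dropped
    return (dropped / totalScheduled) * 100;
}

console.log(droppedPercentage(25, 1000)); // 2.5
console.log(droppedPercentage(0, 500)); // 0
```

This is what makes the total counter worth emitting alongside `tasks.dropped`: the absolute drop count alone cannot say whether 25 drops were 2.5% or 100% of the attempted work.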

Collaborator

Feel free to add the metric. Just let ddtrace do the aggregation.

I disagree that dropped tasks is the canonical source of truth. With it alone you cannot even infer whether this is a "too many creations" or a "not enough executions" problem.

@pfreixes pfreixes added this pull request to the merge queue Apr 17, 2026
Merged via the queue into master with commit 2724cf5 Apr 17, 2026
23 of 24 checks passed
@pfreixes pfreixes deleted the pfreixes/new-metrics branch April 17, 2026 09:05