feat(orchestrator): Add dropped task metric for capped groups #5865
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 624cf4ead4
```ts
}
const droppedCountPerPrimitive = new Map<string, number>();
for (const [groupKey, droppedCount] of cappedGroupCounts.entries()) {
  const primitive = groupKey.split(':')[0] || 'unknown';
```
Avoid high-cardinality tags in dropped-task metric
This derives the primitive tag from groupKey.split(':')[0], which means any group key without a colon is used verbatim as a metric tag value. In this codebase, group.key is accepted as an arbitrary string (packages/orchestrator/lib/routes/v1/postImmediate.ts, z.string().min(1)), so capped tasks for user-specific/random keys (e.g., UUID/nanoid-style keys) will emit one Datadog series per key and can quickly create unbounded cardinality. Please normalize to a bounded set (or fall back to unknown when the expected prefix format is absent) before incrementing ORCH_TASKS_DROPPED.
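A minimal sketch of the normalization Codex is suggesting, assuming a small allow-list of primitives (the `KNOWN_PRIMITIVES` values and the `<primitive>:<rest>` key format are illustrative, not the actual codebase constants):

```typescript
// Hypothetical sketch: map an arbitrary group key to a bounded tag set before
// using it as a Datadog tag. KNOWN_PRIMITIVES is an assumed allow-list.
const KNOWN_PRIMITIVES = new Set(['sync', 'action', 'webhook', 'on-event']);

function primitiveTag(groupKey: string): string {
  const prefix = groupKey.split(':')[0];
  // Fall back to 'unknown' when the key does not match the expected
  // '<primitive>:<rest>' format, so arbitrary keys cannot explode cardinality.
  return prefix !== undefined && KNOWN_PRIMITIVES.has(prefix) ? prefix : 'unknown';
}
```

With this, a UUID-style key with no recognized prefix collapses into a single `unknown` series instead of one series per key.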
The number of primitives is low cardinality, but thanks for the heads-up.
```ts
  });
} else {
  cappedGroupKeys.add(props.groupKey);
  cappedGroupCounts.set(props.groupKey, (cappedGroupCounts.get(props.groupKey) ?? 0) + 1);
```
Not sure of the value of this metric, to be honest. Dropping tasks is usually the result of some other problem upstream that we already have plenty of signal for.
If you really want it, though, just increment here; no need to aggregate, the DD lib will do it for you.
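A minimal sketch of what "just increment here" could look like, assuming a StatsD-style client with an `increment` method (the `MetricsClient` interface and the metric/tag names are illustrative, not the actual codebase API):

```typescript
// Hypothetical interface standing in for a dd-trace / StatsD-style client.
interface MetricsClient {
  increment(metric: string, value: number, tags?: string[]): void;
}

// Increment at the drop site; the client batches/aggregates before flushing
// to Datadog, so no local Map<string, number> bookkeeping is needed.
function onTaskDropped(metrics: MetricsClient, groupKey: string): void {
  const primitive = groupKey.split(':')[0] || 'unknown';
  metrics.increment('ORCH_TASKS_DROPPED', 1, [`primitive:${primitive}`]);
}
```

One call per dropped task replaces the `cappedGroupCounts` map and the later flush loop.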
I could be wrong here, but I had no visibility on this during the latest incident: from the proxy metrics we could only suspect drops were happening, because there was no explicit datapoint for them (as said, I could be wrong, so happy to learn what other signals we have for this).
Also, it's not only about incidents. This is similar to the max backpressure signal we added to know when customers were reaching their max concurrency threshold. With this new metric we can pinpoint specific time windows to know whether one or multiple customers were impacted (we will still need the logging part to frame specific issues for specific customers).
Note: just for clarification, upstream -> dependencies of our services, downstream -> users of our services? Or the other way around? Just asking because I've seen them used in both directions, and it's worth getting all of us aligned.
Dropped tasks are the result of tasks being created faster than they are executed, which we already have several metrics for. Knowing and understanding why tasks are created too fast and/or not executed fast enough is the valuable piece of information. Dropped tasks is basically the same information reported as another metric.
> Dropped tasks are the result of tasks being created faster than they are executed, which we already have several metrics for.

Sorry, but I kind of disagree here. Dropped tasks is the canonical source of truth; the others are proxy metrics from which you can infer that dropping is happening, but without explicit datapoints (the ones provided by the dropped metric) it's just an educated guess.

> Knowing and understanding why tasks are created too fast and/or not executed fast enough is the valuable piece of information.

Related to the previous point: you know a customer is sending data too fast, or the system is going too slow, precisely when data is being discarded; the creation rate and execution rate are only proxy metrics for that.
Yes, knowing the cause is fundamental, but IMO that's a secondary step; first you need the data to know that this is happening at all.
Also, tasks.dropped gives us an explicit measure of discarded work volume. Creation and execution rates are only proxy metrics for pressure. With a direct dropped-tasks metric, we could even monitor absolute drops or dropped percentage relative to created tasks.
To have the %, we would need a total counter. IMO this would be ideal for paths like scheduling, where we can emit total attempted-to-schedule vs. dropped; that would help quantify the impact of an incident.
Happy to also add this counter in this PR.
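A minimal sketch of the attempted-vs-dropped idea, assuming two counters on the scheduling path (the class and metric names here are illustrative, not the actual codebase ones):

```typescript
// Hypothetical sketch: track both total scheduling attempts and drops so a
// dropped percentage can be derived (in code or in a Datadog formula query).
class SchedulingStats {
  attempted = 0;
  dropped = 0;

  recordAttempt(wasDropped: boolean): void {
    this.attempted += 1;
    if (wasDropped) this.dropped += 1;
  }

  // Dropped percentage relative to everything we tried to schedule.
  droppedPct(): number {
    return this.attempted === 0 ? 0 : (this.dropped / this.attempted) * 100;
  }
}
```

In practice both counts would be emitted as separate metrics and the ratio computed in the dashboard, but the arithmetic is the same.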
Feel free to add the metric. Just let ddtrace do the aggregation.
I disagree that dropped tasks is the canonical source of truth. With it alone you cannot even infer whether this is a "too many creations" or a "not enough executions" problem.
Adds a new metric so we know when we are silently dropping tasks. This can be useful during incidents.