feat(moderation): give the input-moderation gate conversational context #39
Merged
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rompt
Its only realistic hit was an identical-request replay within 10 min; adding conversation context to any key would push that to near-zero. The moderator is cheap and fast, so a duplicate call on a retry is negligible.
…ention
# Conflicts:
#   api/src/moderation/operations.ts
…te prompt
The post-merge prompt repeated 'technical/detailed/sub-agent task' in both the mission line and a closing 'Do not block merely because…' sentence, and metadata in three places. Fold the sub-agent-task carve-out into the mission line and drop the closing sentence; its carve-outs are all still covered (technical/detailed/data-queries by the mission, metadata by its dedicated paragraph and bullet).
The input-moderation classifier previously saw only the last user message in isolation, so short elliptical follow-ups ("yes", "and for 2024?", "make it bigger") had no referent and were false-flagged as off-scope or nonsensical. This change gives the gate a bounded window of recent conversation.
What changed:
`<conversation_context>` block (per-turn cap 500 chars, total cap 1500 chars), with the judged message isolated in `<message_to_moderate>`. Pure helpers `buildModerationContext` / `formatModerationInput` live in `operations.ts`; the zero-context path is byte-identical to before.

Why: testing showed too many false positives from short follow-up messages because the moderation model lacked conversational context.
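A minimal sketch of how the two pure helpers could fit together, assuming the caps and tag names stated above. The `Turn` shape, the newest-first selection, the `role: text` line format, and the choice to count only message content against the total cap are all assumptions for illustration; the real signatures in `operations.ts` are not shown here. It also assumes the pre-change input was the bare message, which is what "byte-identical" is taken to mean.

```typescript
// Hypothetical turn shape; the PR's actual type is not shown.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

const PER_TURN_CAP = 500; // chars per prior turn (from the PR description)
const TOTAL_CAP = 1500;   // chars across all included prior turns

// Build a bounded window of prior turns, newest first, then emit them in
// chronological order. Each turn is truncated to PER_TURN_CAP; selection
// stops once adding another turn would exceed TOTAL_CAP.
// Assumption: only message content counts against the caps, not role labels.
function buildModerationContext(priorTurns: Turn[]): string {
  const kept: string[] = [];
  let used = 0;
  for (const turn of [...priorTurns].reverse()) {
    const text = turn.content.slice(0, PER_TURN_CAP);
    if (used + text.length > TOTAL_CAP) break;
    used += text.length;
    kept.unshift(`${turn.role}: ${text}`);
  }
  return kept.join("\n");
}

// Wrap the context and the judged message in separate tags so the
// moderator can tell reference material from the unit being judged.
function formatModerationInput(message: string, context: string): string {
  // Zero-context path: return the message untouched (assumed prior behavior).
  if (context === "") return message;
  return [
    "<conversation_context>",
    context,
    "</conversation_context>",
    "<message_to_moderate>",
    message,
    "</message_to_moderate>",
  ].join("\n");
}
```

With an empty history, `formatModerationInput("yes", buildModerationContext([]))` returns `"yes"` unchanged, which is the property the zero-context guarantee relies on in this sketch.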
Heads-up:
`<hidden-context>` wrapper) and is the only thing excerpted onto events/traces; the security posture is unchanged. Prior turns are reference-only and never the judged unit. For direct API callers the history is attacker-controllable, so the real moderator model must honor "judge only `<message_to_moderate>`"; the accepted trade-off (per the design doc) is that the worst an attacker gains is laundering an ambiguous message into an allow, which the "when in doubt, allow" rule already permits.

The `cached` field was dropped from `ModerationEvent` and the stats latency filter.