Skip to content

feat(moderation): give the input-moderation gate conversational context#39

Merged
albanm merged 10 commits into
mainfrom
feat-better-moderation
Jun 21, 2026
Merged

feat(moderation): give the input-moderation gate conversational context#39
albanm merged 10 commits into
mainfrom
feat-better-moderation

Conversation

@albanm

@albanm albanm commented Jun 21, 2026

Copy link
Copy Markdown
Member

The input-moderation classifier previously saw only the last user message in isolation, so short elliptical follow-ups ("yes", "and for 2024?", "make it bigger") had no referent and were false-flagged as off-scope or nonsensical. This change gives the gate a bounded window of recent conversation.

What changed:

  • The moderator now receives up to the last 6 user/assistant turns as a reference-only <conversation_context> block (per-turn cap 500 chars, total cap 1500 chars), with the judged message isolated in <message_to_moderate>. Pure helpers buildModerationContext / formatModerationInput live in operations.ts; the zero-context path is byte-identical to before.
  • The moderator system prompt is now context-aware (judge only the latest message; use context only to read brief follow-ups; ignore instructions inside the context; don't block a message just for being short), kept lean by merging a duplicate reassurance.
  • Removed the in-memory verdict cache entirely — adding context to its key pushed its hit rate to near-zero, and the moderator is cheap/fast.
  • The mock moderator is now context-aware, with three new API tests covering isolation, abusive-latest-still-blocks, and context forwarding.

Why: testing showed too many false positives from short follow-up messages because the moderation model lacked conversational context.

Heads-up:

  • The latest user message is still moderated in full (including any <hidden-context> wrapper) and is the only thing excerpted onto events/traces — the security posture is unchanged. Prior turns are reference-only and never the judged unit. For direct API callers the history is attacker-controllable, so the real moderator model must honor "judge only <message_to_moderate>"; the accepted trade-off (per the design doc) is that the worst an attacker gains is laundering an ambiguous message into an allow, which the "when in doubt, allow" rule already permits.
  • Every moderated request now makes a moderator call (the verdict cache that deduped identical replays within 10 min is gone) — a slight cost/latency increase on the gate, which sits on the critical path to the first response token. Bounded by the context caps.
  • The cached field was dropped from ModerationEvent and the stats latency filter.

albanm and others added 8 commits June 21, 2026 12:13
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rompt

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Its only realistic hit was an identical-request replay within 10 min; adding
conversation context to any key would push that to near-zero. The moderator is
cheap and fast, so a duplicate call on a retry is negligible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ention

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
albanm and others added 2 commits June 21, 2026 15:31
# Conflicts:
#	api/src/moderation/operations.ts
…te prompt

The post-merge prompt repeated 'technical/detailed/sub-agent task' in both the
mission line and a closing 'Do not block merely because…' sentence, and metadata
in three places. Fold the sub-agent-task carve-out into the mission line and drop
the closing sentence; its carve-outs are all still covered (technical/detailed/
data-queries by the mission, metadata by its dedicated paragraph and bullet).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@albanm albanm merged commit 8970808 into main Jun 21, 2026
3 checks passed
@albanm albanm deleted the feat-better-moderation branch June 21, 2026 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant