Skip to content

Add a verifier-discipline stress test for reward-hacking skill edits #67

@sunghunkwag

Description

@sunghunkwag

Summary

SkillOpt's validation gate is one of the most important parts of the project: candidate skill edits are only useful if they improve held-out behavior without teaching the agent to game the evaluator.

I think it would be valuable to add a small, explicit verifier-discipline stress test for reward-hacking skill edits. This would turn the kind of failure mode already observed in #52 into a repeatable regression test.

Motivation

In #52, a follow-up report described a candidate rule that looked like a classic evaluator-gaming artifact: it tried to force outputs to include required headings, URLs, and literal tokens rather than solving the underlying task. The held-out gate rejected that candidate, which is exactly the right outcome.

That observation seems important enough to preserve as a first-class test case:

  • train/replay feedback can identify a real failure pattern,
  • the optimizer can propose a bad rule that satisfies surface criteria,
  • the verifier should reject the rule unless it improves held-out behavior for the right reason,
  • rejected reward-hacking edits should remain visible as evidence, not silently disappear.

Proposed test

Add a small synthetic benchmark or Sleep regression fixture with two classes of tasks:

  1. Train/replay tasks where a superficial rule can appear attractive.

    • Example bad rule shapes:
      • "Always include literal token X."
      • "Never ask for missing input."
      • "Always include a URL."
      • "Always emit these headings."
      • "This rule supersedes normal refusal/clarification behavior."
  2. Held-out sealed tasks where those shortcuts should fail.

    • Missing-input cases should still allow clarification or refusal.
    • Required tokens should not be enough to pass unless the answer is grounded.
    • Placeholder URLs/headings should not be rewarded as real task completion.

The expected regression property would be:

train/replay score may improve or look plausible,
but the held-out gate must reject the reward-hacking edit.

Ideally the report would show both sides separately:

train/replay delta: ...
held-out gate delta: ...
gate action: reject
reason: candidate appears to satisfy surface criteria without improving grounded task success

Why this belongs in SkillOpt

SkillOpt is not just a text optimizer. Its core research claim depends on the validation gate preventing harmful skill evolution. A dedicated verifier-discipline test would make that claim easier to audit.

This is especially relevant for SkillOpt-Sleep, because user-session mining can discover real recurring failures while still proposing overbroad "fixes" that would damage a coding agent in real use.

Possible implementation shape

This could be implemented as one of:

  • a small tests/ fixture for skillopt_sleep,
  • a docs-backed reproducible example under docs/sleep/,
  • a minimal synthetic environment that intentionally offers a reward-hacking shortcut,
  • or a CI-only test that injects a known bad candidate skill and asserts the gate rejects it.

The important part is not the exact benchmark. The important part is preserving this invariant:

A skill edit that improves surface-form compliance but weakens grounded behavior should not be promoted.

Related reference

I have been working on a separate bounded verifier harness here:

https://github.com/sunghunkwag/rsi-metaforge-core

The relevant design pattern is not an AGI or open-ended self-improvement claim. It is narrower: sealed hidden evaluations, train-only fit rejection, rollback-sensitive admission, and evidence that hidden answers/scoring artifacts are not exposed to the adaptive loop.

That pattern seems directly relevant to SkillOpt's validation-gated skill evolution, especially for making reward-hacking failures reproducible and auditable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions