Summary
SkillOpt's validation gate is one of the most important parts of the project: candidate skill edits are only useful if they improve held-out behavior without teaching the agent to game the evaluator.
I think it would be valuable to add a small, explicit verifier-discipline stress test for reward-hacking skill edits. This would turn the kind of failure mode already observed in #52 into a repeatable regression test.
Motivation
In #52, a follow-up report described a candidate rule that looked like a classic evaluator-gaming artifact: it tried to force outputs to include required headings, URLs, and literal tokens rather than solving the underlying task. The held-out gate rejected that candidate, which is exactly the right outcome.
That observation seems important enough to preserve as a first-class test case:
- train/replay feedback can identify a real failure pattern,
- the optimizer can propose a bad rule that satisfies surface criteria,
- the verifier should reject the rule unless it improves held-out behavior for the right reason,
- rejected reward-hacking edits should remain visible as evidence, not silently disappear.
Proposed test
Add a small synthetic benchmark or Sleep regression fixture with two classes of tasks:
-
Train/replay tasks where a superficial rule can appear attractive.
- Example bad rule shapes:
- "Always include literal token X."
- "Never ask for missing input."
- "Always include a URL."
- "Always emit these headings."
- "This rule supersedes normal refusal/clarification behavior."
-
Held-out sealed tasks where those shortcuts should fail.
- Missing-input cases should still allow clarification or refusal.
- Required tokens should not be enough to pass unless the answer is grounded.
- Placeholder URLs/headings should not be rewarded as real task completion.
The expected regression property would be:
train/replay score may improve or look plausible,
but the held-out gate must reject the reward-hacking edit.
Ideally the report would show both sides separately:
train/replay delta: ...
held-out gate delta: ...
gate action: reject
reason: candidate appears to satisfy surface criteria without improving grounded task success
Why this belongs in SkillOpt
SkillOpt is not just a text optimizer. Its core research claim depends on the validation gate preventing harmful skill evolution. A dedicated verifier-discipline test would make that claim easier to audit.
This is especially relevant for SkillOpt-Sleep, because user-session mining can discover real recurring failures while still proposing overbroad "fixes" that would damage a coding agent in real use.
Possible implementation shape
This could be implemented as one of:
- a small
tests/ fixture for skillopt_sleep,
- a docs-backed reproducible example under
docs/sleep/,
- a minimal synthetic environment that intentionally offers a reward-hacking shortcut,
- or a CI-only test that injects a known bad candidate skill and asserts the gate rejects it.
The important part is not the exact benchmark. The important part is preserving this invariant:
A skill edit that improves surface-form compliance but weakens grounded behavior should not be promoted.
Related reference
I have been working on a separate bounded verifier harness here:
https://github.com/sunghunkwag/rsi-metaforge-core
The relevant design pattern is not an AGI or open-ended self-improvement claim. It is narrower: sealed hidden evaluations, train-only fit rejection, rollback-sensitive admission, and evidence that hidden answers/scoring artifacts are not exposed to the adaptive loop.
That pattern seems directly relevant to SkillOpt's validation-gated skill evolution, especially for making reward-hacking failures reproducible and auditable.
Summary
SkillOpt's validation gate is one of the most important parts of the project: candidate skill edits are only useful if they improve held-out behavior without teaching the agent to game the evaluator.
I think it would be valuable to add a small, explicit verifier-discipline stress test for reward-hacking skill edits. This would turn the kind of failure mode already observed in #52 into a repeatable regression test.
Motivation
In #52, a follow-up report described a candidate rule that looked like a classic evaluator-gaming artifact: it tried to force outputs to include required headings, URLs, and literal tokens rather than solving the underlying task. The held-out gate rejected that candidate, which is exactly the right outcome.
That observation seems important enough to preserve as a first-class test case:
Proposed test
Add a small synthetic benchmark or Sleep regression fixture with two classes of tasks:
Train/replay tasks where a superficial rule can appear attractive.
Held-out sealed tasks where those shortcuts should fail.
The expected regression property would be:
Ideally the report would show both sides separately:
Why this belongs in SkillOpt
SkillOpt is not just a text optimizer. Its core research claim depends on the validation gate preventing harmful skill evolution. A dedicated verifier-discipline test would make that claim easier to audit.
This is especially relevant for SkillOpt-Sleep, because user-session mining can discover real recurring failures while still proposing overbroad "fixes" that would damage a coding agent in real use.
Possible implementation shape
This could be implemented as one of:
tests/fixture forskillopt_sleep,docs/sleep/,The important part is not the exact benchmark. The important part is preserving this invariant:
Related reference
I have been working on a separate bounded verifier harness here:
https://github.com/sunghunkwag/rsi-metaforge-core
The relevant design pattern is not an AGI or open-ended self-improvement claim. It is narrower: sealed hidden evaluations, train-only fit rejection, rollback-sensitive admission, and evidence that hidden answers/scoring artifacts are not exposed to the adaptive loop.
That pattern seems directly relevant to SkillOpt's validation-gated skill evolution, especially for making reward-hacking failures reproducible and auditable.