Dual-host Codex and Claude Code plugin for evidence-gated reinforcement-learning experiment work. It gives LLM agents the workflow and deterministic helper tools they need to audit an RL repository, create a task/metric/scope/budget contract, initialize repo-local .rlxp/ state, score results, and propose bounded next experiments without treating training reward as the only objective.
This release turns RLXP from a small set of broad planning skills into a task-scoped skill library that can operate practical experiment loops:
- task/study-scoped .rlxp state so multiple RL objectives do not share metrics, candidates, ledgers, or stop rules by accident
- dedicated skill entrypoints for task audit, metric definition, baseline planning, run analysis, candidate proposal, candidate validation, reward work, curriculum, DR, monitoring, reporting, and autonomous loop control
- deterministic helper scripts for init, active state resolution, budget gates, candidate validation, loop-step planning, reward reflection, DR probing, queues, monitoring, result audits, ledger append, and report rendering
- one bundled launch-capable path, rlxp_autoloop_runner.py, which is the audited transition from ready/validated candidate state to launched process state
- canonical JSON schemas for candidates, contracts, metric specs, state, reward reflections, DR probes, queue manifests, result audits, and curriculum/environment candidates
- example adapter hints for Holosoma, generic repositories, and MJLab-style uv run train / uv run play workflows
- distributable manifests bumped to 0.2.0 for both Codex and Claude Code
- package validation that checks required files, skill frontmatter, scoped helper behavior, offline scenario behavior, and no-training safety gates
Setup, audit, planning, package validation, and offline validation do not run GPU jobs, simulators, W&B/network calls, or training entry points. Training remains blocked unless the active study contract, normally .rlxp/tasks/<task_id>/studies/<study_id>/contract.yaml, records explicit approval for:
- task definition
- primary metric and guardrails
- tuning scope
- wall-clock/GPU budget
- hardware target
- baseline command
- evaluation protocol
- an approved experiment inside the confirmed scope and budget
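The approval list above amounts to an all-or-nothing gate. A minimal sketch of that idea follows; the key names and the flat dictionary shape are assumptions made for illustration and are not taken from the real contract.yaml schema.

```python
# Hypothetical approval keys; the real contract.yaml layout may differ.
REQUIRED_APPROVALS = (
    "task_definition",
    "primary_metric_and_guardrails",
    "tuning_scope",
    "budget",
    "hardware_target",
    "baseline_command",
    "evaluation_protocol",
    "approved_experiment",
)


def missing_approvals(approvals: dict) -> list:
    """Return the required keys that are not explicitly recorded as True."""
    return [key for key in REQUIRED_APPROVALS if approvals.get(key) is not True]


def training_allowed(approvals: dict) -> bool:
    """Training stays blocked unless every required approval is present."""
    return not missing_approvals(approvals)
```

Requiring the literal value True, rather than any truthy value, keeps a stray string such as "pending" from being read as approval.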
Generated .rlxp/ state belongs in the target RL repository, not in this plugin package; the only exception is temporary validation fixtures.
The only bundled script allowed to launch an experiment command is
plugins/rl-experiment-assistant/scripts/rlxp_autoloop_runner.py. It re-reads
the active contract and candidate, checks launch gates, requires
--execute --confirm RUN_CANDIDATE, records lifecycle state, and writes runner
ledger events. All other helper scripts, including loop-step, queue, probe,
monitoring, scoring, audit, and validation helpers, remain dry-run or
artifact-only helpers; they must not be treated as alternate launch paths.
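The launch gate described above can be pictured as a function that collects every reason to stay blocked. This is an illustrative sketch only, not the code in rlxp_autoloop_runner.py: the function name, its parameters, and the set of launchable statuses are assumptions, and only the --execute flag and the RUN_CANDIDATE token come from the text.

```python
CONFIRM_TOKEN = "RUN_CANDIDATE"  # confirmation token named in the text above
LAUNCHABLE_STATUSES = {"validated", "approved"}  # assumption for this sketch


def launch_blockers(contract_approved, candidate_status, within_budget,
                    execute, confirm):
    """Return the reasons a launch must stay blocked; an empty list means go."""
    blockers = []
    if not contract_approved:
        blockers.append("contract not approved")
    if candidate_status not in LAUNCHABLE_STATUSES:
        blockers.append("candidate status is " + repr(candidate_status))
    if not within_budget:
        blockers.append("budget exhausted")
    if not execute:
        blockers.append("missing --execute")
    if confirm != CONFIRM_TOKEN:
        blockers.append("missing --confirm RUN_CANDIDATE")
    return blockers
```

Returning the full list of blockers, instead of stopping at the first failure, lets the caller record every unmet gate in one ledger event.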
.
├── .agents/plugins/marketplace.json
├── .claude-plugin/marketplace.json
└── plugins/rl-experiment-assistant/
    ├── .codex-plugin/plugin.json
    ├── .claude-plugin/plugin.json
    ├── skills/
    ├── agents/
    ├── docs/
    ├── scripts/
    ├── templates/
    └── examples/
From this package root:
codex plugin marketplace add .
Then open /plugins, install or update local-rl-experiment-assistant / rl-experiment-assistant, and start a fresh Codex session. The distributable manifests advertise version 0.2.0, so reload or reinstall the local plugin after pulling this commit.
Typical request from a target RL repository:
Use RL Experiment Assistant to audit this RL codebase. Prepare the task card, metric candidates, tuning scope, budget questions, and .rlxp report skeleton. Do not launch training.
After a contract explicitly approves autonomous operation, use the autoloop entry point instead of manually invoking a separate skill after every run:
Use $rlxp-autoloop. Resume from the current .rlxp state and continue until the approved budget or stopping rule is reached.
To initialize state through the plugin, ask the agent directly:
Use RL Experiment Assistant to initialize .rlxp for this repository, then audit the RL task and prepare the first experiment plan. Use the current working directory as the target repository. Do not launch training.
From this package root:
claude plugin marketplace add .
Then in Claude Code:
/plugin install rl-experiment-assistant@local-rl-experiment-assistant
/reload-plugins
Development fallback without marketplace installation:
claude --plugin-dir ./plugins/rl-experiment-assistant
Claude-specific agents are additive wrappers around the same contract-gated rlxp-* skills:
- experiment-director: coordinates task audit, study contract, autoloop, candidate lifecycle, and launch gates.
- metrics-analyst: validates run evidence, monitoring output, reward reflection, audits, held-out metrics, and guardrails.
- reward-reviewer: reviews reward/curriculum/domain-randomization candidates before validation and deployment.
Normal setup is agent-driven. The user should not need to run the bundled Python helpers by hand. Ask Codex or Claude Code to initialize the target repository, and the plugin skill should resolve the installed plugin root, run the helper internally, and write .rlxp/ into the target RL repository.
Good prompt:
Use RL Experiment Assistant to initialize .rlxp for /path/to/target-rl-repo with project name <project-name>. Audit the codebase and leave training blocked until the contract is confirmed.
Expected target-repository state:
.rlxp/
├── registry.yaml
├── shared/
│   ├── schemas/
│   ├── metric_extractors/
│   └── templates/
└── tasks/
    └── <task_id>/
        ├── task_card.md
        ├── adapter.yaml
        └── studies/
            └── <study_id>/
                ├── contract.yaml
                ├── state.json
                ├── report.md
                ├── experiments.yaml
                ├── ledger.jsonl
                ├── candidates/
                ├── runs/
                └── audits/
registry.yaml selects the active task_id and study_id. Each task has its
own task card and adapter; each study has its own contract, state, ledger,
report, candidates, runs, and audits. This is the primary layout and prevents
one task's metric, baseline, candidate IDs, or stopping rule from interfering
with another task. Legacy flat .rlxp/ files are read only as a single implicit
task/study for compatibility.
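To make the scoping concrete, the sketch below builds the per-task and per-study paths from the layout shown above. The function name study_paths is hypothetical, and the task and study identifiers are passed in directly instead of being read from registry.yaml, so this is not the plugin's own state resolver.

```python
from pathlib import Path


def study_paths(root, task_id: str, study_id: str) -> dict:
    """Map logical names to files for one task/study pair under .rlxp/."""
    task_dir = Path(root) / ".rlxp" / "tasks" / task_id
    study_dir = task_dir / "studies" / study_id
    return {
        # Task-level files are shared by every study of the task.
        "task_card": task_dir / "task_card.md",
        "adapter": task_dir / "adapter.yaml",
        # Study-level files are private to one study.
        "contract": study_dir / "contract.yaml",
        "state": study_dir / "state.json",
        "report": study_dir / "report.md",
        "ledger": study_dir / "ledger.jsonl",
    }
```

Two studies of the same task resolve to the same adapter.yaml but to different ledgers and contracts, which is the isolation the layout is meant to provide.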
state.json, ledger.jsonl, candidates/, runs/, and audits/ are the
self-resume surface for autonomous operation. A new agent session should be able
to infer the next safe action from those files without relying on chat history.
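As a rough illustration of resuming from files instead of chat history, the sketch below reads a JSON Lines ledger and returns its most recent event. The helper names are hypothetical and the event fields are not specified here; the real ledger schema belongs to the plugin.

```python
import json
from pathlib import Path


def read_ledger(path) -> list:
    """Parse a JSON Lines ledger into a list of event dictionaries."""
    ledger = Path(path)
    if not ledger.exists():
        return []  # a fresh study has no history yet
    events = []
    with ledger.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                events.append(json.loads(line))
    return events


def last_event(path):
    """Return the newest ledger event, or None when the ledger is empty."""
    events = read_ledger(path)
    return events[-1] if events else None
```

An append-only ledger read this way gives a new session the last recorded transition without depending on anything outside the repository.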
The plugin is a skill library for RL experiment operation. Use these entrypoints directly in Codex or Claude Code:
- rlxp-audit-task: infer task, commands, rewards, terminations, curriculum, DR, metrics.
- rlxp-make-contract: create a task/study contract with metric, scope, budget, hardware, automation, and stop rules.
- rlxp-define-metrics: define primary metrics, component metrics, guardrails, diagnostics, and scorecards.
- rlxp-plan-baseline: prepare a baseline and smoke-test plan.
- rlxp-analyze-run: extract metrics, compare to incumbent, diagnose failures, and classify result.
- rlxp-propose-candidates: create bounded hypothesis-driven next experiments.
- rlxp-validate-candidate: check candidate schema, scope, task/study identity, budget, files, and invariants before launch.
- rlxp-reward-tuning: tune existing reward weights/scales/schedules using reward-component evidence and reflection.
- rlxp-reward-reflection: summarize task metrics and reward-component trends after reward-related runs.
- rlxp-reward-engineering: design minimal new reward logic only when approved, reflected, and justified.
- rlxp-reward-review: review reward changes for hacking, scale, and signal validity.
- rlxp-curriculum-design: tune sampling, difficulty schedules, or environment-generation curricula.
- rlxp-dr-probe: plan and summarize one-parameter-at-a-time robustness probes before DR range changes.
- rlxp-dr-design: tune domain-randomization ranges or schedules from robustness evidence and DR probes.
- rlxp-plan-parallel-batch: plan approved parallel candidate batches with worktree/run isolation.
- rlxp-queue-batch: manage resumable smoke, short-trial, promotion, and incumbent-validation queues.
- rlxp-monitor-training: monitor active runs and queue state for process and learning health.
- rlxp-audit-results: verify improvement claims before promotion or acceptance.
- rlxp-update-report: update the report from structured evidence.
- rlxp-autoloop: coordinate the full approved loop until budget or stop rule fires.
Legacy broad skills remain as compatibility wrappers:
- rl-task-audit dispatches to rlxp-audit-task.
- rl-experiment-plan dispatches to rlxp-make-contract, rlxp-define-metrics, and rlxp-plan-baseline.
- rl-reward-curriculum-design dispatches to the reward, curriculum, DR, review, and validation skills.
- rl-experiment-loop dispatches to rlxp-autoloop and the analyze/propose/validate/monitor/audit/report skills.
The normal user interface is still skill-first:
Use $rlxp-autoloop. Resume from .rlxp state and continue within the approved contract.
The bundled scripts are enforcement and state-machine helpers for the agent,
not a command-line product the user must drive by hand. The dry-run helpers
deliberately avoid launching training. The audited transition from
ready/validated candidate state to launched process state is
rlxp_autoloop_runner.py, and it must re-check the active contract, budget,
candidate scope, metric invariants, hardware approval, explicit execution flag,
and confirmation token immediately before process launch.
RLXP candidates move through draft, validated, launched, completed,
accepted, rejected, inconclusive, debug, and, when applicable,
reverted. The incumbent changes only when the approved primary metric improves
and guardrails pass under the fixed evaluation protocol.
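The incumbent rule can be written as a small predicate. This is a sketch under stated assumptions: the function name, the higher_is_better switch, and the guardrail dictionary shape are illustrative and not part of the plugin's schemas.

```python
def should_promote(candidate_metric: float, incumbent_metric: float,
                   guardrails_passed: dict, higher_is_better: bool = True) -> bool:
    """Promote only on a strict primary-metric improvement with all guardrails green."""
    if not all(guardrails_passed.values()):
        return False  # any failed guardrail vetoes promotion
    if higher_is_better:
        return candidate_metric > incumbent_metric
    return candidate_metric < incumbent_metric
```

A tie keeps the incumbent in place, since the rule asks for improvement and not merely for parity.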
Reward-related runs should produce or consume
.rlxp/tasks/<task_id>/studies/<study_id>/runs/<run_id>/reward_reflection.json
with task metrics, reward-component trends, saturation, domination, and hacking
risk. Reward tuning should use that artifact before proposing more scalar
changes; reward engineering must require it unless the contract records an
explicit exception with the reason a reflection is impossible or irrelevant.
DR design must first look for a controlled DR probe summary. Without one, the normal next candidate is a one-parameter-at-a-time probe that estimates feasible ranges and safety notes before widening randomization. A DR tuning candidate without a probe summary must record the explicit reason a probe is impossible for this study.
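A one-parameter-at-a-time probe plan varies a single randomization parameter per run while holding the others at their nominal values. The sketch below shows one way to enumerate such a plan; the function name, parameter names, and record fields are hypothetical and do not describe rlxp_make_dr_probe_plan.py.

```python
def one_at_a_time_probes(nominal: dict, probe_values: dict) -> list:
    """Build probe settings that each change exactly one parameter from nominal."""
    probes = []
    for name, values in probe_values.items():
        for value in values:
            if value == nominal[name]:
                continue  # the nominal point is the shared reference, not a probe
            settings = dict(nominal)  # copy so every probe starts from nominal
            settings[name] = value
            probes.append({"parameter": name, "value": value, "settings": settings})
    return probes
```

Because each probe differs from nominal in one coordinate, a change in the primary metric can be attributed to that parameter before any range is widened.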
Curriculum design supports sampling reweighting, difficulty schedules, and environment generation. Environment-generation candidates should add learnable intermediate cases and keep the held-out evaluation distribution fixed. Each environment-generation candidate must record learnability evidence, the difficulty bridge from incumbent failures to target cases, diversity constraints, the held-out eval invariant, and a rollback rule.
Parallel work should be isolated. When candidates patch code or mutable config,
rlxp-plan-parallel-batch should assign one candidate/run per git worktree or
equivalent sandbox, with GPU, seed, expected output, queue phase, and promotion
criteria recorded.
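One way to picture this isolation is a plan that gives every candidate its own worktree path, GPU, and seed. The sketch below is illustrative only: the function name, the worktrees/ directory, the one-GPU-per-candidate rule, and the record fields are assumptions, and it does not create any git worktree.

```python
def plan_batch(candidate_ids: list, gpu_ids: list, base_seed: int = 0,
               worktree_root: str = "worktrees") -> list:
    """Assign each candidate a distinct worktree path, GPU, and seed."""
    if len(set(candidate_ids)) != len(candidate_ids):
        raise ValueError("candidate IDs must be unique")
    if len(candidate_ids) > len(gpu_ids):
        raise ValueError("not enough GPUs for one candidate per GPU")
    return [
        {
            "candidate_id": candidate_id,
            "worktree": worktree_root + "/" + candidate_id,
            "gpu": gpu_ids[index],
            "seed": base_seed + index,
        }
        for index, candidate_id in enumerate(candidate_ids)
    ]
```

Refusing to plan when candidates outnumber GPUs keeps two runs from silently sharing a device; a real planner might queue the overflow instead.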
Autonomous execution should follow a queue/monitor/audit discipline: smoke, short trials, promoted multi-seed validation, incumbent validation, active-run monitoring, result audit, report update, then accept/reject/inconclusive/debug.
The deterministic helper surface is intentionally offline by default. These helpers do not launch experiments:
- rlxp_loop_step.py: computes the next safe autoloop action without launching jobs.
- rlxp_validate_candidate.py: validates canonical candidate status, scope, command, and invariants.
- rlxp_extract_reward_components.py: creates reward-reflection summaries from local JSON/JSONL/CSV metrics.
- rlxp_make_dr_probe_plan.py and rlxp_summarize_dr_probe.py: plan and summarize DR probes.
- rlxp_queue_step.py: selects ready queue jobs without starting processes.
- rlxp_monitor_run.py: inspects local metrics/logs for RL failure signatures.
- rlxp_audit_result.py: checks improvement claims against local evidence.
The launch-capable helper is separate:
- rlxp_autoloop_runner.py: launches one validated or approved candidate only after re-checking the active contract, budget, scope, invariants, and --execute --confirm RUN_CANDIDATE.
When this plugin is used inside a Codex session, initialization still writes to the target repository the agent passes as --root. The plugin root, Codex plugin cache, and Codex home are not the experiment target.
Preferred pattern:
Use RL Experiment Assistant to initialize .rlxp for this repository. Use the current working directory as --root. Do not launch training.
If Codex is not running from the target repository, name the absolute target path:
Use RL Experiment Assistant to initialize .rlxp for /path/to/target-rl-repo. Treat that path as --root even if the Codex session is currently elsewhere. Do not launch training.
The agent may run scripts/rlxp_init.py internally as an implementation detail. It should not ask the user to paste Python commands unless the agent cannot execute local commands.
Holosoma is supported through adapter hints, not hard-coded package behavior. IsaacSim scene traversal is an adapter profile for a real target checkout, not a package assumption or default smoke test. Start from a Holosoma checkout and ask the agent to inspect the live repository before it writes the contract or launches anything.
Good prompt:
Use RL Experiment Assistant in this Holosoma repository. Use the packaged Holosoma adapter hints only as starting hints, inspect the current train/eval commands and logging, initialize .rlxp, and keep training blocked until the contract is explicitly approved.
The packaged Holosoma example includes public G1/MJWarp command fragments,
custom-fork notes for local tasks such as scene traversal, and real-validation
notes under
plugins/rl-experiment-assistant/examples/holosoma/. Those examples are useful
for a Holosoma task, but they are not the plugin's generic contract or release
smoke scenario.
For any non-Holosoma RL repository, ask the agent to start from the generic profile and adapt from code inspection:
Use RL Experiment Assistant on this RL repository with the generic profile. Inspect the training entrypoint, config system, logging, reward files, termination logic, curriculum, and evaluation command. Initialize .rlxp, fill the adapter candidates, and keep training blocked.
The agent should fill .rlxp/tasks/<task_id>/adapter.yaml with the
repo-specific details, or legacy .rlxp/adapter.yaml only when operating on an
existing flat-layout repository:
- training and evaluation commands
- config system: Hydra, Tyro, argparse, YAML, Gin, dataclasses, or custom
- logging source: W&B, TensorBoard, CSV, JSON, stdout, or custom
- primary metric and guardrails
- reward, termination, curriculum, and randomization files
- hardware launcher: local, tmux, torchrun, Slurm, Ray, Kubernetes, or custom
If a stack becomes recurring, add a small examples/<stack>/adapter-hints.yaml
file that pre-fills defaults only. The evidence gate, metric loop, and
task/study state should remain generic.
MJLab is included as a concrete integration example under
plugins/rl-experiment-assistant/examples/mjlab/. It demonstrates how to adapt
the same contract to a uv run train / uv run play robot-learning repository
without making MJLab a package dependency.
When a user asks for setup, audit, planning, validation, or iteration, the agent should do the operational work itself:
- resolve the target repository and installed plugin root
- run bundled helper scripts internally when deterministic setup, scanning, scoring, or validation is useful
- write .rlxp/ files in the target repository
- parse metrics and update the report/ledger
- ask the user only for confirmations that cannot be inferred safely, especially task, metric, scope, budget, hardware, baseline, and evaluation protocol
- refuse GPU training or W&B/network fetches until the contract explicitly allows them
- when autonomous mode is approved, keep iterating from .rlxp/ state until the budget or stop rule fires; do not stop after only preparing the next run
The bundled Python scripts are agent implementation details and maintainer utilities, not the normal user interface.
For plugin development, run the package validator from this package root before publishing or handing off a local marketplace build:
python plugins/rl-experiment-assistant/scripts/rlxp_validate_package.py
The package validator is the release gate. Detailed maintainer checks and
optional real-Holosoma validation commands live in
plugins/rl-experiment-assistant/docs/release-validation.md.
The release gate is intentionally safe: it parses skills and schemas, compiles
helper scripts, exercises task/study state helpers, and defines the CPU-only toy
benchmark as the production smoke scenario. That smoke scenario should check the
practical helper scripts with local fixture files, verify that dry-run helpers
do not launch, verify that rlxp_autoloop_runner.py is the only bundled
launch-capable path, and verify that the real-Holosoma validator does not run
smoke training or W&B fetches unless explicitly requested.
- Profile example mismatch: treat packaged examples as hints and re-audit the checked-out target repo before launch.
- IsaacSim Holosoma focus: treat scene traversal as a Holosoma adapter profile, not a package assumption. Use the profile command only after inspecting the target checkout and approving the contract; keep launch blocked until then.
- GPU unavailable: treat the failed preflight as an environment blocker, not a plugin failure.
- W&B unavailable: skip W&B validation; the CPU-only production smoke and offline validation remain the default package gates.
- Incomplete contract: keep launch blocked until all required confirmations and approval records are explicit.