junhyekh/rlxp

RL Experiment Assistant Plugin

Dual-host Codex and Claude Code plugin for evidence-gated reinforcement-learning experiment work. It gives LLM agents the workflow and deterministic helper tools they need to audit an RL repository, create a task/metric/scope/budget contract, initialize repo-local .rlxp/ state, score results, and propose bounded next experiments without treating training reward as the only objective.

Version 0.2.0

This release turns RLXP from a small set of broad planning skills into a task-scoped skill library that can operate practical experiment loops:

  • task/study-scoped .rlxp state so multiple RL objectives do not share metrics, candidates, ledgers, or stop rules by accident
  • dedicated skill entrypoints for task audit, metric definition, baseline planning, run analysis, candidate proposal, candidate validation, reward work, curriculum, DR, monitoring, reporting, and autonomous loop control
  • deterministic helper scripts for init, active state resolution, budget gates, candidate validation, loop-step planning, reward reflection, DR probing, queues, monitoring, result audits, ledger append, and report rendering
  • one bundled launch-capable path, rlxp_autoloop_runner.py, which is the audited transition from ready/validated candidate state to launched process state
  • canonical JSON schemas for candidates, contracts, metric specs, state, reward reflections, DR probes, queue manifests, result audits, and curriculum/environment candidates
  • example adapter hints for Holosoma, generic repositories, and MJLab-style uv run train / uv run play workflows
  • distributable manifests bumped to 0.2.0 for both Codex and Claude Code
  • package validation that checks required files, skill frontmatter, scoped helper behavior, offline scenario behavior, and no-training safety gates

No training by default

Setup, audit, planning, package validation, and offline validation do not run GPU jobs, simulators, W&B/network calls, or training entry points. Training remains blocked unless the active study contract, normally .rlxp/tasks/<task_id>/studies/<study_id>/contract.yaml, records explicit approval for:

  • task definition
  • primary metric and guardrails
  • tuning scope
  • wall-clock/GPU budget
  • hardware target
  • baseline command
  • evaluation protocol
  • an approved experiment inside the confirmed scope and budget
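The approval list above can be read as an all-or-nothing gate. The following is a minimal sketch, assuming the contract has already been parsed into a dictionary with an `approvals` mapping. The key names and the `approvals` field are hypothetical illustrations, not the plugin's actual contract.yaml schema.

```python
from __future__ import annotations

# Hypothetical approval keys mirroring the list above; the real contract.yaml
# schema may use different names and structure.
REQUIRED_APPROVALS = (
    "task_definition",
    "primary_metric_and_guardrails",
    "tuning_scope",
    "budget",
    "hardware_target",
    "baseline_command",
    "evaluation_protocol",
    "approved_experiment",
)


def missing_approvals(contract: dict) -> list[str]:
    """Return the approval keys that are not explicitly set to True."""
    approvals = contract.get("approvals", {})
    # Only a literal True counts; absent keys, None, or strings stay blocked.
    return [key for key in REQUIRED_APPROVALS if approvals.get(key) is not True]


def training_allowed(contract: dict) -> bool:
    """Training stays blocked unless every required approval is recorded."""
    return not missing_approvals(contract)
```

Treating anything other than a literal `True` as missing keeps the default outcome "blocked", which matches the no-training-by-default stance described here.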

Generated .rlxp/ state belongs in the target RL repository, not in this plugin package, except for temporary validation fixtures.

The only bundled script allowed to launch an experiment command is plugins/rl-experiment-assistant/scripts/rlxp_autoloop_runner.py. It re-reads the active contract and candidate, checks launch gates, requires --execute --confirm RUN_CANDIDATE, records lifecycle state, and writes runner ledger events. All other helper scripts, including loop-step, queue, probe, monitoring, scoring, audit, and validation helpers, remain dry-run or artifact-only helpers; they must not be treated as alternate launch paths.
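As an illustration of the explicit flag-plus-token requirement, here is a minimal sketch using argparse. Only the `--execute` and `--confirm RUN_CANDIDATE` arguments come from the text above; the function names are hypothetical, and the real runner additionally re-reads the contract and candidate before launching, which this sketch does not attempt.

```python
from __future__ import annotations

import argparse

CONFIRM_TOKEN = "RUN_CANDIDATE"  # confirmation token named in the text above


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; not the actual rlxp_autoloop_runner.py interface."""
    parser = argparse.ArgumentParser(description="Illustrative launch-gate parser")
    parser.add_argument("--execute", action="store_true")
    parser.add_argument("--confirm", default="")
    return parser


def launch_requested(argv: list[str]) -> bool:
    """True only when both the execute flag and the exact token are present."""
    args = build_parser().parse_args(argv)
    return args.execute and args.confirm == CONFIRM_TOKEN
```

Requiring two independent arguments means a single stray flag in a copied command line is not enough to move from dry-run to launch.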

Layout

.
├── .agents/plugins/marketplace.json
├── .claude-plugin/marketplace.json
└── plugins/rl-experiment-assistant/
    ├── .codex-plugin/plugin.json
    ├── .claude-plugin/plugin.json
    ├── skills/
    ├── agents/
    ├── docs/
    ├── scripts/
    ├── templates/
    └── examples/

Install In Codex

From this package root:

codex plugin marketplace add .

Then open /plugins, install or update local-rl-experiment-assistant / rl-experiment-assistant, and start a fresh Codex session. The distributable manifests advertise version 0.2.0, so reload or reinstall the local plugin after pulling this commit.

Typical request from a target RL repository:

Use RL Experiment Assistant to audit this RL codebase. Prepare the task card, metric candidates, tuning scope, budget questions, and .rlxp report skeleton. Do not launch training.

After a contract explicitly approves autonomous operation, use the autoloop entry point instead of manually invoking a separate skill after every run:

Use $rlxp-autoloop. Resume from the current .rlxp state and continue until the approved budget or stopping rule is reached.

To initialize state through the plugin, ask the agent directly:

Use RL Experiment Assistant to initialize .rlxp for this repository, then audit the RL task and prepare the first experiment plan. Use the current working directory as the target repository. Do not launch training.

Install In Claude Code

From this package root:

claude plugin marketplace add .

Then in Claude Code:

/plugin install rl-experiment-assistant@local-rl-experiment-assistant
/reload-plugins

Development fallback without marketplace installation:

claude --plugin-dir ./plugins/rl-experiment-assistant

Claude-specific agents are additive wrappers around the same contract-gated rlxp-* skills:

  • experiment-director: coordinates task audit, study contract, autoloop, candidate lifecycle, and launch gates.
  • metrics-analyst: validates run evidence, monitoring output, reward reflection, audits, held-out metrics, and guardrails.
  • reward-reviewer: reviews reward/curriculum/domain-randomization candidates before validation and deployment.

Target Repository Setup

Normal setup is agent-driven. The user should not need to run the bundled Python helpers by hand. Ask Codex or Claude Code to initialize the target repository, and the plugin skill should resolve the installed plugin root, run the helper internally, and write .rlxp/ into the target RL repository.

Good prompt:

Use RL Experiment Assistant to initialize .rlxp for /path/to/target-rl-repo with project name <project-name>. Audit the codebase and leave training blocked until the contract is confirmed.

Expected target-repository state:

.rlxp/
├── registry.yaml
├── shared/
│   ├── schemas/
│   ├── metric_extractors/
│   └── templates/
└── tasks/
    └── <task_id>/
        ├── task_card.md
        ├── adapter.yaml
        └── studies/
            └── <study_id>/
                ├── contract.yaml
                ├── state.json
                ├── report.md
                ├── experiments.yaml
                ├── ledger.jsonl
                ├── candidates/
                ├── runs/
                └── audits/

registry.yaml selects the active task_id and study_id. Each task has its own task card and adapter; each study has its own contract, state, ledger, report, candidates, runs, and audits. This is the primary layout and prevents one task's metric, baseline, candidate IDs, or stopping rule from interfering with another task. Legacy flat .rlxp/ files are read only as a single implicit task/study for compatibility.
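The scoped layout makes the active study directory a pure function of the registry. A minimal sketch, assuming registry.yaml has already been parsed into a dictionary; the keys `active_task_id` and `active_study_id` are hypothetical names, since the actual registry schema is not shown here.

```python
from pathlib import Path


def active_study_dir(root: Path, registry: dict) -> Path:
    """Build the study directory for the task/study selected by the registry."""
    task_id = registry["active_task_id"]    # hypothetical key name
    study_id = registry["active_study_id"]  # hypothetical key name
    return root / ".rlxp" / "tasks" / task_id / "studies" / study_id


def active_contract_path(root: Path, registry: dict) -> Path:
    """The contract lives inside the active study directory."""
    return active_study_dir(root, registry) / "contract.yaml"
```

Because every study-level file hangs off this one directory, two tasks cannot share a ledger or contract unless the registry points at the same task and study.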

state.json, ledger.jsonl, candidates/, runs/, and audits/ are the self-resume surface for autonomous operation. A new agent session should be able to infer the next safe action from those files without relying on chat history.

Practical Skill Catalog

The plugin is a skill library for RL experiment operation. Use these entrypoints directly in Codex or Claude Code:

  • rlxp-audit-task: infer task, commands, rewards, terminations, curriculum, DR, metrics.
  • rlxp-make-contract: create a task/study contract with metric, scope, budget, hardware, automation, and stop rules.
  • rlxp-define-metrics: define primary metrics, component metrics, guardrails, diagnostics, and scorecards.
  • rlxp-plan-baseline: prepare a baseline and smoke-test plan.
  • rlxp-analyze-run: extract metrics, compare to incumbent, diagnose failures, and classify result.
  • rlxp-propose-candidates: create bounded hypothesis-driven next experiments.
  • rlxp-validate-candidate: check candidate schema, scope, task/study identity, budget, files, and invariants before launch.
  • rlxp-reward-tuning: tune existing reward weights/scales/schedules using reward-component evidence and reflection.
  • rlxp-reward-reflection: summarize task metrics and reward-component trends after reward-related runs.
  • rlxp-reward-engineering: design minimal new reward logic only when approved, reflected, and justified.
  • rlxp-reward-review: review reward changes for hacking, scale, and signal validity.
  • rlxp-curriculum-design: tune sampling, difficulty schedules, or environment-generation curricula.
  • rlxp-dr-probe: plan and summarize one-parameter-at-a-time robustness probes before DR range changes.
  • rlxp-dr-design: tune domain-randomization ranges or schedules from robustness evidence and DR probes.
  • rlxp-plan-parallel-batch: plan approved parallel candidate batches with worktree/run isolation.
  • rlxp-queue-batch: manage resumable smoke, short-trial, promotion, and incumbent-validation queues.
  • rlxp-monitor-training: monitor active runs and queue state for process and learning health.
  • rlxp-audit-results: verify improvement claims before promotion or acceptance.
  • rlxp-update-report: update the report from structured evidence.
  • rlxp-autoloop: coordinate the full approved loop until budget or stop rule fires.

Legacy broad skills remain as compatibility wrappers:

  • rl-task-audit dispatches to rlxp-audit-task.
  • rl-experiment-plan dispatches to rlxp-make-contract, rlxp-define-metrics, and rlxp-plan-baseline.
  • rl-reward-curriculum-design dispatches to the reward, curriculum, DR, review, and validation skills.
  • rl-experiment-loop dispatches to rlxp-autoloop and the analyze/propose/validate/monitor/audit/report skills.

Operational Model

The normal user interface is still skill-first:

Use $rlxp-autoloop. Resume from .rlxp state and continue within the approved contract.

The bundled scripts are enforcement and state-machine helpers for the agent, not a command-line product the user must drive by hand. The dry-run helpers deliberately avoid launching training. The audited transition from ready/validated candidate state to launched process state is rlxp_autoloop_runner.py, and it must re-check the active contract, budget, candidate scope, metric invariants, hardware approval, explicit execution flag, and confirmation token immediately before process launch.

RLXP candidates move through draft, validated, launched, completed, accepted, rejected, inconclusive, debug, and, when applicable, reverted. The incumbent changes only when the approved primary metric improves and guardrails pass under the fixed evaluation protocol.
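The incumbent rule can be sketched as a small decision function. This is an illustration under stated assumptions: the `EvalResult` type, the `higher_is_better` flag, and the `min_delta` margin are hypothetical and are not taken from the plugin's schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical summary of one run under the fixed evaluation protocol."""
    primary_metric: float
    guardrails_passed: bool


def should_promote(
    candidate: EvalResult,
    incumbent: EvalResult,
    higher_is_better: bool = True,
    min_delta: float = 0.0,
) -> bool:
    """Promote only if guardrails pass and the primary metric strictly improves."""
    if not candidate.guardrails_passed:
        return False
    delta = candidate.primary_metric - incumbent.primary_metric
    if not higher_is_better:
        delta = -delta  # for metrics such as error or time-to-goal
    return delta > min_delta
```

A tie does not replace the incumbent in this sketch, so an equally scoring candidate leaves the current baseline in place.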

Reward-related runs should produce or consume .rlxp/tasks/<task_id>/studies/<study_id>/runs/<run_id>/reward_reflection.json with task metrics, reward-component trends, saturation, domination, and hacking risk. Reward tuning should use that artifact before proposing more scalar changes; reward engineering must require it unless the contract records an explicit exception with the reason a reflection is impossible or irrelevant.
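One way to picture the "domination" signal is a share-of-magnitude check over reward components. The sketch below is an assumption for illustration only: the input format, the 0.8 threshold, and the definition of domination are hypothetical and do not describe the actual reward_reflection.json schema or the plugin's heuristics.

```python
from __future__ import annotations


def dominant_components(
    component_means: dict[str, float], threshold: float = 0.8
) -> list[str]:
    """Flag components whose share of total absolute reward meets the threshold."""
    total = sum(abs(value) for value in component_means.values())
    if total == 0:
        return []  # nothing to compare when every component is zero
    return sorted(
        name
        for name, value in component_means.items()
        if abs(value) / total >= threshold
    )
```

For example, `dominant_components({"tracking": 9.0, "energy": -0.5, "alive": 0.5})` flags `tracking`, because it accounts for 90% of the total absolute reward.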

DR design must first look for a controlled DR probe summary. Without one, the normal next candidate is a one-parameter-at-a-time probe that estimates feasible ranges and safety notes before widening randomization. A DR tuning candidate without a probe summary must record the explicit reason a probe is impossible for this study.
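A one-parameter-at-a-time probe plan varies a single randomization parameter while holding the rest at their nominal values. The following is a minimal sketch of that idea; the parameter names and the plan format are hypothetical and are not the output of rlxp_make_dr_probe_plan.py.

```python
from __future__ import annotations


def one_at_a_time_probes(
    nominal: dict[str, float], probe_values: dict[str, list[float]]
) -> list[dict]:
    """Build probe settings that change exactly one parameter from nominal."""
    plans = []
    for name, values in probe_values.items():
        for value in values:
            if value == nominal[name]:
                continue  # the nominal point is the baseline, not a probe
            settings = dict(nominal)  # copy so other parameters stay nominal
            settings[name] = value
            plans.append({"varied": name, "settings": settings})
    return plans
```

Because each probe differs from the baseline in one parameter only, a change in the evaluation metric can be attributed to that parameter before any range is widened.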

Curriculum design supports sampling reweighting, difficulty schedules, and environment generation. Environment-generation candidates should add learnable intermediate cases and keep the held-out evaluation distribution fixed. Each environment-generation candidate must record learnability evidence, the difficulty bridge from incumbent failures to target cases, diversity constraints, the held-out eval invariant, and a rollback rule.

Parallel work should be isolated. When candidates patch code or mutable config, rlxp-plan-parallel-batch should assign one candidate/run per git worktree or equivalent sandbox, with GPU, seed, expected output, queue phase, and promotion criteria recorded.
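The isolation rule can be sketched as a planning step that gives each candidate its own worktree path, GPU, and seed. This is an illustration only: the one-GPU-per-candidate policy, the directory naming, and the record fields are assumptions, and the sketch omits the expected output, queue phase, and promotion criteria that the text also asks to be recorded. It only plans; it creates no worktrees and starts no processes.

```python
from __future__ import annotations

from pathlib import Path


def plan_parallel_batch(
    candidate_ids: list[str],
    gpu_ids: list[int],
    worktree_root: Path,
    base_seed: int = 0,
) -> list[dict]:
    """Assign one worktree, one GPU, and one seed to each candidate."""
    if len(set(candidate_ids)) != len(candidate_ids):
        raise ValueError("candidate IDs must be unique")
    if len(candidate_ids) > len(gpu_ids):
        raise ValueError("not enough GPUs for one candidate per GPU")
    return [
        {
            "candidate_id": cid,
            "worktree": (worktree_root / cid).as_posix(),
            "gpu": gpu,
            "seed": base_seed + index,
        }
        for index, (cid, gpu) in enumerate(zip(candidate_ids, gpu_ids))
    ]
```

Rejecting duplicate IDs up front guarantees that no two planned runs would write into the same worktree directory.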

Autonomous execution should follow a queue/monitor/audit discipline: smoke, short trials, promoted multi-seed validation, incumbent validation, active-run monitoring, result audit, report update, then accept/reject/inconclusive/debug.

The deterministic helper surface is intentionally offline by default. These helpers do not launch experiments:

  • rlxp_loop_step.py: computes the next safe autoloop action without launching jobs.
  • rlxp_validate_candidate.py: validates canonical candidate status, scope, command, and invariants.
  • rlxp_extract_reward_components.py: creates reward-reflection summaries from local JSON/JSONL/CSV metrics.
  • rlxp_make_dr_probe_plan.py and rlxp_summarize_dr_probe.py: plan and summarize DR probes.
  • rlxp_queue_step.py: selects ready queue jobs without starting processes.
  • rlxp_monitor_run.py: inspects local metrics/logs for RL failure signatures.
  • rlxp_audit_result.py: checks improvement claims against local evidence.

The launch-capable helper is separate:

  • rlxp_autoloop_runner.py: launches one validated or approved candidate only after re-checking the active contract, budget, scope, invariants, and --execute --confirm RUN_CANDIDATE.

Initializing From Codex

When this plugin is used inside a Codex session, initialization still writes to the target repository the agent passes as --root. The plugin root, Codex plugin cache, and Codex home are not the experiment target.

Preferred pattern:

Use RL Experiment Assistant to initialize .rlxp for this repository. Use the current working directory as --root. Do not launch training.

If Codex is not running from the target repository, name the absolute target path:

Use RL Experiment Assistant to initialize .rlxp for /path/to/target-rl-repo. Treat that path as --root even if the Codex session is currently elsewhere. Do not launch training.

The agent may run scripts/rlxp_init.py internally as an implementation detail. It should not ask the user to paste Python commands unless the agent cannot execute local commands.

Use With Holosoma

Holosoma is supported through adapter hints, not hard-coded package behavior. IsaacSim scene traversal is an adapter profile for a real target checkout, not a package assumption or default smoke test. Start from a Holosoma checkout and ask the agent to inspect the live repository before it writes the contract or launches anything.

Good prompt:

Use RL Experiment Assistant in this Holosoma repository. Use the packaged Holosoma adapter hints only as starting hints, inspect the current train/eval commands and logging, initialize .rlxp, and keep training blocked until the contract is explicitly approved.

The packaged Holosoma example includes public G1/MJWarp command fragments, custom-fork notes for local tasks such as scene traversal, and real-validation notes under plugins/rl-experiment-assistant/examples/holosoma/. Those examples are useful for a Holosoma task, but they are not the plugin's generic contract or release smoke scenario.

Integrate Into Your Repo

For any non-Holosoma RL repository, ask the agent to start from the generic profile and adapt from code inspection:

Use RL Experiment Assistant on this RL repository with the generic profile. Inspect the training entrypoint, config system, logging, reward files, termination logic, curriculum, and evaluation command. Initialize .rlxp, fill the adapter candidates, and keep training blocked.

The agent should fill .rlxp/tasks/<task_id>/adapter.yaml with the repo-specific details, or the legacy .rlxp/adapter.yaml only when operating on an existing flat-layout repository:

  • training and evaluation commands
  • config system: Hydra, Tyro, argparse, YAML, Gin, dataclasses, or custom
  • logging source: W&B, TensorBoard, CSV, JSON, stdout, or custom
  • primary metric and guardrails
  • reward, termination, curriculum, and randomization files
  • hardware launcher: local, tmux, torchrun, Slurm, Ray, Kubernetes, or custom

If a stack becomes recurring, add a small examples/<stack>/adapter-hints.yaml file that pre-fills defaults only. The evidence gate, metric loop, and task/study state should remain generic.

MJLab is included as a concrete integration example under plugins/rl-experiment-assistant/examples/mjlab/. It demonstrates how to adapt the same contract to a uv run train / uv run play robot-learning repository without making MJLab a package dependency.

Agent Responsibilities

When a user asks for setup, audit, planning, validation, or iteration, the agent should do the operational work itself:

  • resolve the target repository and installed plugin root
  • run bundled helper scripts internally when deterministic setup, scanning, scoring, or validation is useful
  • write .rlxp/ files in the target repository
  • parse metrics and update the report/ledger
  • ask the user only for confirmations that cannot be inferred safely, especially task, metric, scope, budget, hardware, baseline, and evaluation protocol
  • refuse GPU training or W&B/network fetches until the contract explicitly allows them
  • when autonomous mode is approved, keep iterating from .rlxp/ state until the budget or stop rule fires; do not stop after only preparing the next run

The bundled Python scripts are agent implementation details and maintainer utilities, not the normal user interface.

Release Validation

For plugin development, run the package validator from this package root before publishing or handing off a local marketplace build:

python plugins/rl-experiment-assistant/scripts/rlxp_validate_package.py

The package validator is the release gate. Detailed maintainer checks and optional real-Holosoma validation commands live in plugins/rl-experiment-assistant/docs/release-validation.md.

The release gate is intentionally safe: it parses skills and schemas, compiles helper scripts, exercises task/study state helpers, and defines the CPU-only toy benchmark as the production smoke scenario. That smoke scenario should check the practical helper scripts with local fixture files, verify that dry-run helpers do not launch, verify that rlxp_autoloop_runner.py is the only bundled launch-capable path, and verify that the real-Holosoma validator does not run smoke training or W&B fetches unless explicitly requested.

Troubleshooting

  • Profile example mismatch: treat packaged examples as hints and re-audit the checked-out target repo before launch.
  • IsaacSim Holosoma focus: treat scene traversal as a Holosoma adapter profile, not a package assumption. Use the profile command only after inspecting the target checkout and approving the contract; keep launch blocked until then.
  • GPU unavailable: treat the failed preflight as an environment blocker, not a plugin failure.
  • W&B unavailable: skip W&B validation; the CPU-only production smoke and offline validation remain the default package gates.
  • Incomplete contract: keep launch blocked until all required confirmations and approval records are explicit.
