feat(cli): bench init onboarding wizard + bench doctor#883
Conversation
TDD-built (red-green per slice) implementation of the researched Option B 'wizard + doctor split' design: - benchflow/onboarding.py — deep logic module: private env file (KEY=value, 0600, merge-preserving; setdefault autoload so a real export or CI secret always wins), TOML prefs round-trip, provider resolution (prefixed or bare model ids), agent picker filtered by the SAME provider-protocol gate the run path enforces (never offer an agent the run would reject), 1-token model ping (GET /models can 200 while the route is broken — a max_tokens=1 completion is the cheapest honest check), shared CheckResult rows for smoke + doctor, and eval-run command assembly (final command + credential-free oracle smoke argv). - cli/init_cmd.py — bench init: model -> agent -> tasks -> sandbox -> creds (reuse env / store hidden-prompted key) -> smoke -> prints + OSC52-copies the ready-to-run command; every prompt has a flag mirror so a fully flagged invocation never blocks on stdin. bench doctor: same checks as standalone pass/fail rows, exit 1 on failure. - cli/main.py — app callback auto-loads ~/.benchflow/.env (BENCHFLOW_HOME overridable) for every subcommand. 22 new tests (env file, prefs, provider resolution, protocol-filtered picker, mocked-transport ping, doctor composition, command assembly, non-interactive CLI, interactive wizard, startup autoload); full agents+cli suites green (117 passed).
Stage-1 of the researched two-stage smoke (Harbor's oracle pattern): with --full-smoke + --smoke-task, init runs the credential-free oracle agent on one task in the chosen sandbox BEFORE the credential checks — proving install + sandbox plumbing independently of keys. Opt-in because a real sandbox eval takes minutes; the default smoke (checks + 1-token ping) stays seconds.
Multi-agent review (correctness / silent-failures / security / design-fit, every finding runtime-repro'd by an adversarial verifier) on the bench init diff; all 16 confirmed findings fixed: - model_ping rewritten: endpoint joined exactly like the run path (resolve_base_url per protocol + one segment — no /v1 guessing, fixing zai //v4/v1 and github-models URLs), anthropic-messages-only providers ping /v1/messages with x-api-key (+ Azure api-key), ADC/AWS providers are skipped honestly instead of failed, 200-but-not-a-completion bodies fail, and all server-supplied detail text is control-character-sanitized (terminal escape injection). - env file hardening: 'export ' prefixes and single-quoted values parse; malformed lines (empty/whitespace keys) are skipped; unreadable files warn-and-degrade — a corrupt ~/.benchflow/.env can no longer crash every subcommand from the autoload callback (including the ones needed to fix it). - run_doctor: unknown sandbox is a failing row (was: zero checks, "All checks passed"); new pure litellm-route resolution row catches proxy-lane errors the direct ping cannot; unregistered-but-inferable models (claude-*/gpt-*/gemini-*) check their inferred key instead of hard-fail. - init guards: stage-1 oracle failure now fails init (was silently "Ready."); --full-smoke without --smoke-task errors (exit 2, was silent skip); --api-key on non-api_key providers errors with the real auth mechanism (was silently discarded); the smoke verifies the key just provided, not a stale exported one; host subscription logins (check_subscription_auth) are detected and skip the key prompt; bare well-known models complete the wizard via infer_env_key_for_model. 17 new tests (40 total for init/onboarding); full agents+cli suites green (135 passed). Live re-verified: real deepseek ping 200 on the corrected join + litellm-route row.
… bugs) Second review round (comment-analyzer, pr-test-analyzer, silent-failure- hunter, type-design-analyzer, code-reviewer) on top of the earlier adversarial pass. Confirmed behavioral bugs fixed: - The wizard's default dataset emitted a command that DOESN'T RUN: bare 'skillsbench' fails bench eval run's <name>@<version> parsing (and --full-smoke then blamed the sandbox). Default is now skillsbench@1.1 and registry-style names are validated via parse_dataset_spec at init time. - A non-UTF-8 byte in ~/.benchflow/.env bricked EVERY subcommand (UnicodeDecodeError is a ValueError, not OSError — the autoload callback died before any command, including the repair paths). read_env_file now degrades with a warning naming the file. - write_env_file destroyed comments/unparseable lines on rewrite (data loss introduced by the crash fix). The merge is now line-preserving (replace keys in place, keep everything else verbatim) and refuses to clobber an undecodable file. - Subscription-auth setups were failed by their own smoke test: run_doctor now takes the agent and, when check_subscription_auth covers the model's key, reports a skipped row instead of red key/route/ping rows. - bench doctor tracebacked on corrupt or partial config.toml (TOMLDecodeError / KeyError); now the same "run `bench init`" remediation, exit 1. - --full-smoke tracebacked with FileNotFoundError when `bench` wasn't on PATH; now a clear message. - CheckResult gains skipped=True (rendered "○", counted in the doctor summary) so all-green cannot over-certify unverifiable setups; the anthropic ping sends the required anthropic-version header; the litellm route row names the exception type; the modal sandbox row no longer says "unknown" on success; stale-exported-key shadowing warns; env-write OSError during init exits cleanly instead of dumping a traceback over a just-typed key. Comment corrections per the accuracy review (credential precedence, "pure functions", resolve_provider custom-endpoint rot, run-path-join wording, SigV4 -> Bedrock bearer, help texts) and test hygiene (the full-smoke test no longer makes a live network call; RUF005; obfuscated assertion; resolve_provider return annotation; skipped-flag coverage). 52 init/onboarding tests green; agents+cli suites green (95 passed).
UX redesign per review — choose, don't type (the Hermes setup pattern): - Every step is a numbered menu with an Enter-able default: provider (from the registry, with model-family hints), model (from the provider's model catalog when it has one, else a hinted prompt), agent (the protocol- filtered list), task set (live entries from the dataset registry with descriptions, newest first, plus a local-tasks-dir option and a free-form fallback when the registry is unreachable), sandbox (from SANDBOX_PROVIDERS). Free text survives only as explicit "other" options and CI flags (all flag mirrors unchanged). - After the model is chosen the wizard finds credentials ITSELF (onboarding.detect_key): host subscription login > process environment (which includes the saved ~/.benchflow/.env via the startup autoload) > ./.env in the working folder — a key found there is announced and passed through into the bench setup for future runs. The hidden prompt is now the last resort, not the default. New onboarding seams, TDD-built: detect_key (4 tests: precedence + pass-through), dataset_choices (registry fetch, newest-first, offline degrade), plus a full CLI test driving the menu flow with zero key typing. 59 init/onboarding tests green; agents suite green (70 passed). Live: Enter x4 + model id -> key auto-found in ./.env -> real 1-token ping 200 -> runnable command.
UX redesign (commit
|
The agent (the thing the user came to benchmark) is now the first menu; the provider menu then narrows to what that agent can route via the new onboarding.compatible_providers — the same protocol gate as compatible_agents, applied in the other direction (codex-acp offers only Responses-capable providers; deepseek disappears). compatible_agents(model= None) returns the unfiltered list for the first menu; the post-model consistency gate still guards flag combinations. Tests updated + a new menu-filter test (60 green); live-verified: Enter x2 + model id -> key from ./.env -> ping 200.
- "a local tasks dir" answered with a bare relative name (mytasks) fell into the registry-spec validation and hard-exited, discarding every wizard answer. Normalized at the prompt site to ./name so it routes to --tasks-dir. - The ./.env pass-through persisted and immediately spent a possibly unrelated key with only a post-hoc note. Interactive runs now get an Enter-default confirm; the message names the absolute source, the destination file, and a last-4 fingerprint (non-tty stays unprompted so CI never blocks). - detect_key ordered subscription > exported key — the OPPOSITE of the run path (resolve_agent_env inherits the exported key and uses_native_subscription_auth then defers to it), so init could announce "subscription login, no key needed" while the run billed the key. Precedence now matches the run path: environment > subscription > ./.env. - dataset_choices crashed on non-dict registry entries instead of degrading; malformed entries are now filtered. 63 init/onboarding tests green; full suites 133 passed.
…clack UI Fixes from live user testing of the wizard: - The agent menu showed only the 9 core built-ins. agent_paths() now triggers the remote-manifest auto-load and groups the FULL catalog (core + benchflow-ai/agents manifests + installed plugins) by the naming law; the wizard asks path first (acp / ai-sdk / omnigent, with counts), then that path's agents — the user's preferred shape for a ~60-agent list. - Provider labels lied: they showed each provider's PRIMARY wire next to an agent that would use a different endpoint (aws-bedrock read "openai-responses" beside claude-agent-acp; the filter was right — bedrock and zai genuinely serve anthropic-messages — but the label wasn't). Labels now show model families, "BYO base URL" for vllm-style providers, or the endpoint the CHOSEN agent will actually use. - Subscription login is now a LISTED choice, not a silent auto-decision: the credentials step is an OpenClaw-style menu of every detected source (subscription login, environment key, ./.env key — fingerprinted, with absolute paths) plus manual entry; interactive-only (_isatty seam), the non-tty path keeps run-path-preference auto-detection so CI never blocks. detect_key_sources() returns all sources in run-path order; detect_key is now its first-hit wrapper. - Clack/OpenClaw-style visuals via the existing rich console (no new dependency): ◆ step headers, │ option rails, ● confirmed-step summary lines, dim descriptions; degrades to plain text on non-tty; user data escaped against rich markup. 68 init/onboarding tests green (paths classification, matched-protocol labels, auth menu w/ subscription choice, source ordering); full suites 138 passed. Live pseudo-tty run verified end-to-end incl. the credentials menu.
…talog Adversarial review round on the wizard v3 (4 confirmed findings): - --agent naming a catalog-only agent exited with a bogus "wire protocol mismatch": the flags path never triggered the manifest autoload the menus use. It now resolves through registry.resolve_agent (aliases + the same miss-driven autoload as `bench eval run`) and unknown agents say "Unknown agent", not protocol mismatch. - The credentials menu announced choices it didn't honor: picking the subscription login with a key exported left the key in force (the smoke verified the DECLINED key, and the run would bill it); picking ./.env under a different exported key was a setdefault no-op. Choices are now effective in-process (subscription pops the key so the smoke verifies the subscription setup; ./.env/manual picks override) with an explicit shadow warning that the shell export wins at run time — shared _warn_shadow helper with the --api-key branch. - Offline with a warm cache, the wizard silently shrank to built-ins even though the cached clone holds the full catalog: _source_root now falls back to the cached benchflow-ai/agents checkout when the refresh fetch fails (the clone is the catalog of record). Skipped: a fetch timeout in benchmark_repos (shared with task sources — separate change if wanted). - A dim "loading the agent catalog…" notice before the first menu (the autoload can hit the network). 72 init/onboarding tests green; agents suite green.
…nment pick ty flagged use[1] (str | None) flowing into str-typed sinks — the ./.env branch now binds a narrowed value; the environment pick also gets its ✓ confirmation line (it was silently branch-less).
bingran-you
left a comment
There was a problem hiding this comment.
Daily scan review (2026-07-03): requesting changes. This PR belongs to benchflow-ai/benchflow and CI is green, but the onboarding flow has user-facing correctness blockers that need fixes before merge.
Findings:
- Credential detection does not match the actual run path.
_wizard_auth_step()consumesdetect_key_sources()where process env wins over./.env, butresolve_agent_env()reads./.envbeforeos.environ. The wizard can validate/smoke one key while the printedbench eval runlater uses another. - Fully flagged local task-dir onboarding is not a true mirror of the prompt path.
--dataset mytasksis rejected unless the user writes./mytasks, while the interactive prompt normalizes the same bare directory answer. - Azure Foundry OpenAI smoke pings use
Authorization: Bearerfor allopenai-completionsproviders, but Azure OpenAI API-key auth requiresapi-key, sobench init/bench doctorcan falsely fail valid Azure setups. - Printed commands use raw string joining, so paths with spaces are not copy-pasteable.
Please align auth-source precedence with the runtime resolver, normalize flagged dataset paths the same way as the prompt path, use the provider-specific Azure auth header in smoke checks, and shell-quote printed commands. Keep the current tests, but add focused regressions for those cases.
|
Users Simulation review (2026-07-03): still blocked. The basic init/doctor smoke is healthy: a temp-home interactive But the existing changes-requested blockers on this same head are still visible in source:
No model-backed trajectory was generated in this smoke, so trajectory health was not applicable. |
|
Re: #883 (comment) Pushed Fixes:
Validation:
Leaving |
|
Addressing CHANGES_REQUESTED review Commit
Local validation passed: required pytest target set ( |
|
Follow-up for #883 (comment) and CHANGES_REQUESTED review The latest pushed head is now Updated local validation:
I left |
|
Automation status update (2026-07-03): GitHub checks are green on |
…browse Populating the agent menu no longer clones benchflow-ai/agents. The menu lists locally-registered ACP agents only (onboarding.acp_agents(), zero network — drops the path menu and the ai-sdk-*/omnigent-* paths); the repo is fetched lazily only when an agent is actually resolved for a run (resolve_agent's miss path), via the "other" escape or --agent <name>. A bare `bench init` therefore does no git clone. Removed agent_paths() and the "loading the agent catalog…" notice.
Two wizard gaps from live testing: - claude-agent-acp could never reach its subscription login from the menus: the provider registry has no plain "anthropic" endpoint (the run path serves it via subscription login / inferred ANTHROPIC_API_KEY), so no menu choice led to that auth path. Anthropic-native agents (subscription_auth replaces ANTHROPIC_API_KEY, or anthropic-messages wire) now get a synthetic "anthropic" provider entry — listed first and default — whose model prompt defaults to claude-sonnet-4-6; the credentials menu then offers the detected subscription login. Live-verified on a host with a real Claude login: Enter-through -> "Using claude-agent-acp's subscription login" -> runnable command. - The agent menu's "other" demanded a typed name; it now browses the FULL 3-path catalog (path menu -> agents) — fetched lazily at exactly that point (onboarding.catalog_paths), so a bare `bench init` still does no network work. 78 tests green (both flows pinned).
A 12-slot e2e workflow drove the real CLI (flags + interactive pseudo-tty) through every local ACP agent plus catalog/gemini/alias scenarios; every failure was adversarially reproduced before fixing. The fixes: - CLONE IN THE USER'S CWD (worst): benchmark_repos._repo_root() fell back to Path.cwd() outside a git checkout, so a pip-installed bench dumped a 169MB catalog clone into whatever directory the user ran init from. The cache now lands in $XDG_CACHE_HOME/benchflow/datasets (~/.cache fallback); inside a checkout the dev path is unchanged. Clones are also --quiet (361 lines of git progress were spewing through the wizard UI). Provenance helper handles the no-checkout case explicitly. - gemini was unusable via --agent: the consistency gate reused the provider protocol filter, which excludes native-wire agents, producing a bogus "wire protocol mismatch" even with a valid GEMINI_API_KEY. gemini now validates by model FAMILY (gemini-*) with a clear error otherwise, and interactively gets a synthetic "google — gemini-* via GEMINI_API_KEY" provider entry instead of a 21-provider lie (generalized from the anthropic synthetic-entry mechanism). - --agent mimo-code silently bound to the catalog's mimo-acp variant (its manifest claims the alias) instead of the canonical local mimo. mimo-code is now a builtin registry alias for mimo — local canonical wins before any autoload, matching the registry's own documented intent. - codex-acp's provider default dropped to aws-bedrock once deepseek was filtered out; the default preference is now deepseek, then openai. - Cosmetics: provider labels no longer duplicate the provider name when its only model prefix equals it; dataset descriptions truncate on a word boundary with an ellipsis instead of a mid-word 80-char slice. Known limitation (deliberate): the hidden key prompt on a pseudo-tty discards type-ahead — scripted runs should pass --api-key. 9 new tests; 182 passed + 104 provenance/dataset tests; ty + ruff clean. Live-verified: gemini flags path works, mimo-code -> mimo, catalog clone lands in ~/.cache/benchflow (cwd stays clean).
status announced at agent selection Never load the full agents repo: - Browsing is now a STATIC selection: onboarding.CATALOG_AGENTS (the acp catalog names, kilobytes, zero network) minus whatever is already registered. The 3-path browse is gone — the menu is acp-only. - Selecting an entry fetches ONLY that agent's manifest.toml (remote_manifests.fetch_one): local-dir sources read the single file, remote sources GET one raw file over HTTPS — no git clone anywhere in the wizard. Local-wins/gap-fill semantics preserved; fetch failures degrade with a clear message. catalog_paths() (full-clone browse) removed. - Subscription status is announced IMMEDIATELY after the agent choice, for every subscription-capable agent (registry-driven): a found host login says so (with the detect file); a missing one guides the user to log in with the concrete vendor command (claude setup-token / codex login) or continue with an API key. Also: test_task_download pinned to its cwd-cache assumption via one autouse fixture (production now routes to the user cache) + clone argv assertions carry --quiet; the unknown-agent init test sets the agents source off so it stops hitting the network mid-suite; suite-order-robust catalog tests. 92 init/onboarding tests + 186 full suite green; ty + ruff clean. Live: subscription found-line prints right after picking claude-agent-acp (real host login); 'other' -> static list -> single-file fetch -> runnable stakpak command with ~/.cache/benchflow EMPTY (no clone).
…lution CI runs manifest-parity first, which registers the entire catalog and left catalog_choices() empty (IndexError). The test now deterministically evicts one catalog entry for its duration and restores it.
|
@bingran-you All four findings are addressed on the branch (commit
Since your review the branch also gained: static catalog browse + single-manifest lazy fetch (no full-repo clone anywhere — a 169MB clone was previously landing in the user's cwd; the cache also moved to |
|
Users Simulation automation review (2026-07-04): ready by simulation on current head PR-scoped temp-home user simulation confirmed the four previous blockers are cleared:
Shared sweep canary also produced a healthy Docker OpenHands + DeepSeek artifact bundle with ACP+LLM trajectories, |
|
Users Simulation automation review (2026-07-05): ready by simulation on head Isolated temp-home smoke passed: Shared gates passed: uv sync, ruff, ty, 71-test CLI/task/usage slice, bench help/task check/SDK smoke. Live Docker/OpenHands + Gemini flash-lite canary is healthy: ACP+LLM trajectories, results.jsonl training_ready=1, reward 1.0, 24,869 tokens, $0.00079715, 69.8s at /tmp/benchflow-users-simulation-20260705/shared-openhands-gemini-flash-lite-default/2026-07-05__05-11-14/hello-world-task__40312589. OpenHands max-effort was quarantined because that ACP agent does not declare reasoning_effort=max. Labels: keep |
Adds first-run onboarding — the gap every user currently crosses by hand-exporting env vars and discovering broken creds mid-run.
Design researched against prior art (Harbor's flag-triple + credential-free oracle smoke; OpenClaw/Hermes/OpenCode's converged wizard-with-auth-branch + doctor split): Option B, wizard + doctor.
bench initInteractive chain (every prompt has a flag mirror — fully-flagged runs never touch stdin):
provider/model, bare family ids, or well-known families viainfer_env_key_for_model)._provider_supports_agent_protocol), so the wizard never offers an agent the run would reject (e.g. codex-acp on deepseek).check_subscription_auth→ skip prompt; existing env var → reuse; else hidden prompt /--api-key→ stored in~/.benchflow/.env(0600, merge-preserving, auto-loaded by the CLI callback with setdefault semantics — a real export or CI secret always wins).litellm routeresolution row (pure, catches proxy-lane errors) + 1-token completion ping through the run path's own endpoint join (neverGET /models— it can 200 while the route is broken; census evidence). ADC/AWS providers are skipped honestly. Optional--full-smoke --smoke-task Xruns the credential-free oracle agent in the chosen sandbox first (Harbor's pattern).bench eval run …command.bench doctorSame checks standalone: per-row ✅/❌ (sandbox, provider key, litellm route, model ping), exit 1 on failure.
Quality
--api-keysilently dropped for ADC providers, and more — see the fix commit).tests/test_onboarding.py,tests/test_init_cli.py); full agents+cli suites green (135 passed).