feat(cli): bench init onboarding wizard + bench doctor by Yiminnn · Pull Request #883 · benchflow-ai/benchflow

Yiminnn · 2026-07-02T23:00:03Z

Adds first-run onboarding — the gap every user currently crosses by hand-exporting env vars and discovering broken creds mid-run.

Design researched against prior art (Harbor's flag-triple + credential-free oracle smoke; OpenClaw/Hermes/OpenCode's converged wizard-with-auth-branch + doctor split): Option B, wizard + doctor.

`bench init`

Interactive chain (every prompt has a flag mirror — fully-flagged runs never touch stdin):

Model → provider resolution (registered provider/model, bare family ids, or well-known families via infer_env_key_for_model).
Agent — offered list filtered by the same provider-protocol gate the run path enforces (_provider_supports_agent_protocol), so the wizard never offers an agent the run would reject (e.g. codex-acp on deepseek).
Task set / sandbox (dataset name or tasks dir; docker/daytona).
Credentials: host subscription login detected via check_subscription_auth → skip prompt; existing env var → reuse; else hidden prompt / --api-key → stored in ~/.benchflow/.env (0600, merge-preserving, auto-loaded by the CLI callback with setdefault semantics — a real export or CI secret always wins).
Smoke test: litellm route resolution row (pure, catches proxy-lane errors) + 1-token completion ping through the run path's own endpoint join (never GET /models — it can 200 while the route is broken; census evidence). ADC/AWS providers are skipped honestly. Optional --full-smoke --smoke-task X runs the credential-free oracle agent in the chosen sandbox first (Harbor's pattern).
Prints (and OSC52-copies) the canonical bench eval run … command.

`bench doctor`

Same checks standalone: per-row ✅/❌ (sandbox, provider key, litellm route, model ping), exit 1 on failure.

Quality

TDD-built: 13 red-green slices, then a 4-dimension multi-agent review with adversarial verification — 16 confirmed findings, all fixed (endpoint-join correctness for zai/github-models/azure-anthropic, corrupt-env-file crash of every subcommand, silent oracle-failure "Ready.", terminal-escape injection via server responses, --api-key silently dropped for ADC providers, and more — see the fix commit).
40 tests (tests/test_onboarding.py, tests/test_init_cli.py); full agents+cli suites green (135 passed).
Live-verified on a real box: init with real DeepSeek key → route row + ping 200; doctor green rc=0; interactive wizard over stdin; fake key → honest 401 + exit 1 with remediation.

TDD-built (red-green per slice) implementation of the researched Option B 'wizard + doctor split' design: - benchflow/onboarding.py — deep logic module: private env file (KEY=value, 0600, merge-preserving; setdefault autoload so a real export or CI secret always wins), TOML prefs round-trip, provider resolution (prefixed or bare model ids), agent picker filtered by the SAME provider-protocol gate the run path enforces (never offer an agent the run would reject), 1-token model ping (GET /models can 200 while the route is broken — a max_tokens=1 completion is the cheapest honest check), shared CheckResult rows for smoke + doctor, and eval-run command assembly (final command + credential-free oracle smoke argv). - cli/init_cmd.py — bench init: model -> agent -> tasks -> sandbox -> creds (reuse env / store hidden-prompted key) -> smoke -> prints + OSC52-copies the ready-to-run command; every prompt has a flag mirror so a fully flagged invocation never blocks on stdin. bench doctor: same checks as standalone pass/fail rows, exit 1 on failure. - cli/main.py — app callback auto-loads ~/.benchflow/.env (BENCHFLOW_HOME overridable) for every subcommand. 22 new tests (env file, prefs, provider resolution, protocol-filtered picker, mocked-transport ping, doctor composition, command assembly, non-interactive CLI, interactive wizard, startup autoload); full agents+cli suites green (117 passed).

Stage-1 of the researched two-stage smoke (Harbor's oracle pattern): with --full-smoke + --smoke-task, init runs the credential-free oracle agent on one task in the chosen sandbox BEFORE the credential checks — proving install + sandbox plumbing independently of keys. Opt-in because a real sandbox eval takes minutes; the default smoke (checks + 1-token ping) stays seconds.

Multi-agent review (correctness / silent-failures / security / design-fit, every finding runtime-repro'd by an adversarial verifier) on the bench init diff; all 16 confirmed findings fixed: - model_ping rewritten: endpoint joined exactly like the run path (resolve_base_url per protocol + one segment — no /v1 guessing, fixing zai //v4/v1 and github-models URLs), anthropic-messages-only providers ping /v1/messages with x-api-key (+ Azure api-key), ADC/AWS providers are skipped honestly instead of failed, 200-but-not-a-completion bodies fail, and all server-supplied detail text is control-character-sanitized (terminal escape injection). - env file hardening: 'export ' prefixes and single-quoted values parse; malformed lines (empty/whitespace keys) are skipped; unreadable files warn-and-degrade — a corrupt ~/.benchflow/.env can no longer crash every subcommand from the autoload callback (including the ones needed to fix it). - run_doctor: unknown sandbox is a failing row (was: zero checks, "All checks passed"); new pure litellm-route resolution row catches proxy-lane errors the direct ping cannot; unregistered-but-inferable models (claude-*/gpt-*/gemini-*) check their inferred key instead of hard-fail. - init guards: stage-1 oracle failure now fails init (was silently "Ready."); --full-smoke without --smoke-task errors (exit 2, was silent skip); --api-key on non-api_key providers errors with the real auth mechanism (was silently discarded); the smoke verifies the key just provided, not a stale exported one; host subscription logins (check_subscription_auth) are detected and skip the key prompt; bare well-known models complete the wizard via infer_env_key_for_model. 17 new tests (40 total for init/onboarding); full agents+cli suites green (135 passed). Live re-verified: real deepseek ping 200 on the corrected join + litellm-route row.

… bugs) Second review round (comment-analyzer, pr-test-analyzer, silent-failure- hunter, type-design-analyzer, code-reviewer) on top of the earlier adversarial pass. Confirmed behavioral bugs fixed: - The wizard's default dataset emitted a command that DOESN'T RUN: bare 'skillsbench' fails bench eval run's <name>@<version> parsing (and --full-smoke then blamed the sandbox). Default is now skillsbench@1.1 and registry-style names are validated via parse_dataset_spec at init time. - A non-UTF-8 byte in ~/.benchflow/.env bricked EVERY subcommand (UnicodeDecodeError is a ValueError, not OSError — the autoload callback died before any command, including the repair paths). read_env_file now degrades with a warning naming the file. - write_env_file destroyed comments/unparseable lines on rewrite (data loss introduced by the crash fix). The merge is now line-preserving (replace keys in place, keep everything else verbatim) and refuses to clobber an undecodable file. - Subscription-auth setups were failed by their own smoke test: run_doctor now takes the agent and, when check_subscription_auth covers the model's key, reports a skipped row instead of red key/route/ping rows. - bench doctor tracebacked on corrupt or partial config.toml (TOMLDecodeError / KeyError); now the same "run `bench init`" remediation, exit 1. - --full-smoke tracebacked with FileNotFoundError when `bench` wasn't on PATH; now a clear message. - CheckResult gains skipped=True (rendered "○", counted in the doctor summary) so all-green cannot over-certify unverifiable setups; the anthropic ping sends the required anthropic-version header; the litellm route row names the exception type; the modal sandbox row no longer says "unknown" on success; stale-exported-key shadowing warns; env-write OSError during init exits cleanly instead of dumping a traceback over a just-typed key. Comment corrections per the accuracy review (credential precedence, "pure functions", resolve_provider custom-endpoint rot, run-path-join wording, SigV4 -> Bedrock bearer, help texts) and test hygiene (the full-smoke test no longer makes a live network call; RUF005; obfuscated assertion; resolve_provider return annotation; skipped-flag coverage). 52 init/onboarding tests green; agents+cli suites green (95 passed).

UX redesign per review — choose, don't type (the Hermes setup pattern): - Every step is a numbered menu with an Enter-able default: provider (from the registry, with model-family hints), model (from the provider's model catalog when it has one, else a hinted prompt), agent (the protocol- filtered list), task set (live entries from the dataset registry with descriptions, newest first, plus a local-tasks-dir option and a free-form fallback when the registry is unreachable), sandbox (from SANDBOX_PROVIDERS). Free text survives only as explicit "other" options and CI flags (all flag mirrors unchanged). - After the model is chosen the wizard finds credentials ITSELF (onboarding.detect_key): host subscription login > process environment (which includes the saved ~/.benchflow/.env via the startup autoload) > ./.env in the working folder — a key found there is announced and passed through into the bench setup for future runs. The hidden prompt is now the last resort, not the default. New onboarding seams, TDD-built: detect_key (4 tests: precedence + pass-through), dataset_choices (registry fetch, newest-first, offline degrade), plus a full CLI test driving the menu flow with zero key typing. 59 init/onboarding tests green; agents suite green (70 passed). Live: Enter x4 + model id -> key auto-found in ./.env -> real 1-token ping 200 -> runnable command.

Yiminnn · 2026-07-03T06:00:00Z

UX redesign (commit `1fbfb59`): choose, don't type

Per review feedback — the wizard is now selection-driven (the Hermes setup pattern):

Provider:                       Task set:
  ...                             1) skillsbench@1.1 — 87-task benchmark set ...
  5) deepseek — deepseek          2) skillsbench@1.0 — superseded by 1.1 ...
  ...                             3) a local tasks dir — path to your own tasks
  16) other — type a model id   Select [1]: ⏎
Select [5]: ⏎
Model id (e.g. deepseek-...): deepseek-v4-flash
Agent (able to route ...):    Select [33]: ⏎   # protocol-filtered list, default pi-acp
Sandbox: 1) docker 2) daytona 3) modal          Select [1]: ⏎

✓ DEEPSEEK_API_KEY found in ./.env — saving it to your bench setup.

Every step is a numbered menu with an Enter default — provider (registry + family hints), model (provider catalog when present), agent (protocol-filtered), task set (live dataset-registry entries w/ descriptions, newest first; offline → free-form fallback), sandbox. Free text only via explicit other options; all CI flag mirrors unchanged.
Credentials are auto-detected after model selection (onboarding.detect_key): subscription login → process env (incl. the saved ~/.benchflow/.env) → ./.env in the working folder (announced + passed through into the bench setup). The hidden prompt is the last resort, not the default.

59 tests green (incl. a full menu-driven CLI test with zero key typing) · live-verified: Enter×4 + model id → key auto-found in ./.env → real 1-token ping 200 → runnable command.

The agent (the thing the user came to benchmark) is now the first menu; the provider menu then narrows to what that agent can route via the new onboarding.compatible_providers — the same protocol gate as compatible_agents, applied in the other direction (codex-acp offers only Responses-capable providers; deepseek disappears). compatible_agents(model= None) returns the unfiltered list for the first menu; the post-model consistency gate still guards flag combinations. Tests updated + a new menu-filter test (60 green); live-verified: Enter x2 + model id -> key from ./.env -> ping 200.

- "a local tasks dir" answered with a bare relative name (mytasks) fell into the registry-spec validation and hard-exited, discarding every wizard answer. Normalized at the prompt site to ./name so it routes to --tasks-dir. - The ./.env pass-through persisted and immediately spent a possibly unrelated key with only a post-hoc note. Interactive runs now get an Enter-default confirm; the message names the absolute source, the destination file, and a last-4 fingerprint (non-tty stays unprompted so CI never blocks). - detect_key ordered subscription > exported key — the OPPOSITE of the run path (resolve_agent_env inherits the exported key and uses_native_subscription_auth then defers to it), so init could announce "subscription login, no key needed" while the run billed the key. Precedence now matches the run path: environment > subscription > ./.env. - dataset_choices crashed on non-dict registry entries instead of degrading; malformed entries are now filtered. 63 init/onboarding tests green; full suites 133 passed.

…clack UI Fixes from live user testing of the wizard: - The agent menu showed only the 9 core built-ins. agent_paths() now triggers the remote-manifest auto-load and groups the FULL catalog (core + benchflow-ai/agents manifests + installed plugins) by the naming law; the wizard asks path first (acp / ai-sdk / omnigent, with counts), then that path's agents — the user's preferred shape for a ~60-agent list. - Provider labels lied: they showed each provider's PRIMARY wire next to an agent that would use a different endpoint (aws-bedrock read "openai-responses" beside claude-agent-acp; the filter was right — bedrock and zai genuinely serve anthropic-messages — but the label wasn't). Labels now show model families, "BYO base URL" for vllm-style providers, or the endpoint the CHOSEN agent will actually use. - Subscription login is now a LISTED choice, not a silent auto-decision: the credentials step is an OpenClaw-style menu of every detected source (subscription login, environment key, ./.env key — fingerprinted, with absolute paths) plus manual entry; interactive-only (_isatty seam), the non-tty path keeps run-path-preference auto-detection so CI never blocks. detect_key_sources() returns all sources in run-path order; detect_key is now its first-hit wrapper. - Clack/OpenClaw-style visuals via the existing rich console (no new dependency): ◆ step headers, │ option rails, ● confirmed-step summary lines, dim descriptions; degrades to plain text on non-tty; user data escaped against rich markup. 68 init/onboarding tests green (paths classification, matched-protocol labels, auth menu w/ subscription choice, source ordering); full suites 138 passed. Live pseudo-tty run verified end-to-end incl. the credentials menu.

…talog Adversarial review round on the wizard v3 (4 confirmed findings): - --agent naming a catalog-only agent exited with a bogus "wire protocol mismatch": the flags path never triggered the manifest autoload the menus use. It now resolves through registry.resolve_agent (aliases + the same miss-driven autoload as `bench eval run`) and unknown agents say "Unknown agent", not protocol mismatch. - The credentials menu announced choices it didn't honor: picking the subscription login with a key exported left the key in force (the smoke verified the DECLINED key, and the run would bill it); picking ./.env under a different exported key was a setdefault no-op. Choices are now effective in-process (subscription pops the key so the smoke verifies the subscription setup; ./.env/manual picks override) with an explicit shadow warning that the shell export wins at run time — shared _warn_shadow helper with the --api-key branch. - Offline with a warm cache, the wizard silently shrank to built-ins even though the cached clone holds the full catalog: _source_root now falls back to the cached benchflow-ai/agents checkout when the refresh fetch fails (the clone is the catalog of record). Skipped: a fetch timeout in benchmark_repos (shared with task sources — separate change if wanted). - A dim "loading the agent catalog…" notice before the first menu (the autoload can hit the network). 72 init/onboarding tests green; agents suite green.

…nment pick ty flagged use[1] (str | None) flowing into str-typed sinks — the ./.env branch now binds a narrowed value; the environment pick also gets its ✓ confirmation line (it was silently branch-less).

bingran-you

Daily scan review (2026-07-03): requesting changes. This PR belongs to benchflow-ai/benchflow and CI is green, but the onboarding flow has user-facing correctness blockers that need fixes before merge.

Findings:

Credential detection does not match the actual run path. _wizard_auth_step() consumes detect_key_sources() where process env wins over ./.env, but resolve_agent_env() reads ./.env before os.environ. The wizard can validate/smoke one key while the printed bench eval run later uses another.
Fully flagged local task-dir onboarding is not a true mirror of the prompt path. --dataset mytasks is rejected unless the user writes ./mytasks, while the interactive prompt normalizes the same bare directory answer.
Azure Foundry OpenAI smoke pings use Authorization: Bearer for all openai-completions providers, but Azure OpenAI API-key auth requires api-key, so bench init / bench doctor can falsely fail valid Azure setups.
Printed commands use raw string joining, so paths with spaces are not copy-pasteable.

Please align auth-source precedence with the runtime resolver, normalize flagged dataset paths the same way as the prompt path, use the provider-specific Azure auth header in smoke checks, and shell-quote printed commands. Keep the current tests, but add focused regressions for those cases.

bingran-you · 2026-07-03T12:29:49Z

Users Simulation review (2026-07-03): still blocked.

The basic init/doctor smoke is healthy: a temp-home interactive bench init --skip-smoke with a preseeded DEEPSEEK_API_KEY, patched local agent catalog, and TTY path wrote config/env successfully; bench doctor passed with the final ping stubbed. Targeted checks also passed: tests/test_init_cli.py tests/test_onboarding.py tests/test_agent_env_resolution.py tests/test_resolve_env_helpers.py -> 175 passed, plus ruff, ty, and help for bench, bench init, bench doctor, and bench eval.

But the existing changes-requested blockers on this same head are still visible in source:

Azure OpenAI smoke still uses Authorization: Bearer for openai-completions pings instead of Azure's api-key header (src/benchflow/onboarding.py).
Printed commands still use raw string joining, so paths with spaces are not copy-pasteable (src/benchflow/onboarding.py, src/benchflow/cli/init_cmd.py).
The fully flagged path still rejects bare local task dirs such as --dataset mytasks before normalizing them like the prompt path (src/benchflow/cli/init_cmd.py).

No model-backed trajectory was generated in this smoke, so trajectory health was not applicable.

bingran-you · 2026-07-03T13:19:22Z

Re: #883 (comment)

Pushed 2f3af72bb4667bc760e56e405d9abc357754c847 to feat/bench-init.

Fixes:

Azure OpenAI model ping now sends api-key for azure-foundry-openai completions smoke.
Printed final/full-smoke commands use shell quoting via shared onboarding.shell_join(...).
Fully flagged existing local task dirs such as --dataset mytasks normalize to ./mytasks before registry validation.
Credential detection order now matches resolve_agent_env: local ./.env, then process/saved env, then subscription fallback.

Validation:

uv run python -m pytest tests/test_init_cli.py tests/test_onboarding.py tests/test_agent_env_resolution.py tests/test_resolve_env_helpers.py -q -> 180 passed in 30.58s
uv run ruff check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py tests/test_onboarding.py tests/test_init_cli.py -> All checks passed
uv run ty check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py -> All checks passed

Leaving status:blocked in place until GitHub CI is green.

bingran-you · 2026-07-03T13:19:30Z

Addressing CHANGES_REQUESTED review 4624310107.

Commit 2f3af72bb4667bc760e56e405d9abc357754c847 resolves the body-level findings:

Credential precedence now follows the canonical run resolver: local ./.env, process/saved env, then subscription fallback.
Fully flagged existing local task dirs like --dataset mytasks normalize before registry-spec validation, matching the interactive local-task path.
Azure OpenAI smoke pings use the Azure api-key header instead of bearer auth.
Printed final and stage-1 smoke commands are rendered with shell quoting, so paths with spaces are copy-pasteable.

Local validation passed: required pytest target set (180 passed in 30.58s), requested ruff target, and requested ty target. CI is running on the new head, so I left the blocked labels unchanged.

bingran-you · 2026-07-03T13:21:19Z

Follow-up for #883 (comment) and CHANGES_REQUESTED review 4624310107.

The latest pushed head is now da4df73acb128616b246a593062a8fa36bf5baf6 on feat/bench-init. It includes the logical blocker fix in 2f3af72bb4667bc760e56e405d9abc357754c847 plus a formatter-only follow-up after GitHub CI caught ruff format --check on tests/test_init_cli.py.

Updated local validation:

uv run python -m pytest tests/test_init_cli.py tests/test_onboarding.py tests/test_agent_env_resolution.py tests/test_resolve_env_helpers.py -q -> 180 passed in 29.18s
uv run ruff check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py tests/test_onboarding.py tests/test_init_cli.py -> All checks passed
uv run ty check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py -> All checks passed
uv run ruff format --check src tests tools -> 521 files already formatted

I left status:blocked and review:changes-requested unchanged while GitHub CI reruns on the new head.

bingran-you · 2026-07-03T13:28:30Z

Automation status update (2026-07-03): GitHub checks are green on da4df73a (test, pip-audit, parity, detect-scope, rollout-smoke; scope matrix/review-pack skipped as expected). Moved labels from status:blocked / review:changes-requested to status:ready / review:pending; the prior CHANGES_REQUESTED review still needs human re-review before merge.

…browse Populating the agent menu no longer clones benchflow-ai/agents. The menu lists locally-registered ACP agents only (onboarding.acp_agents(), zero network — drops the path menu and the ai-sdk-*/omnigent-* paths); the repo is fetched lazily only when an agent is actually resolved for a run (resolve_agent's miss path), via the "other" escape or --agent <name>. A bare `bench init` therefore does no git clone. Removed agent_paths() and the "loading the agent catalog…" notice.

Two wizard gaps from live testing: - claude-agent-acp could never reach its subscription login from the menus: the provider registry has no plain "anthropic" endpoint (the run path serves it via subscription login / inferred ANTHROPIC_API_KEY), so no menu choice led to that auth path. Anthropic-native agents (subscription_auth replaces ANTHROPIC_API_KEY, or anthropic-messages wire) now get a synthetic "anthropic" provider entry — listed first and default — whose model prompt defaults to claude-sonnet-4-6; the credentials menu then offers the detected subscription login. Live-verified on a host with a real Claude login: Enter-through -> "Using claude-agent-acp's subscription login" -> runnable command. - The agent menu's "other" demanded a typed name; it now browses the FULL 3-path catalog (path menu -> agents) — fetched lazily at exactly that point (onboarding.catalog_paths), so a bare `bench init` still does no network work. 78 tests green (both flows pinned).

A 12-slot e2e workflow drove the real CLI (flags + interactive pseudo-tty) through every local ACP agent plus catalog/gemini/alias scenarios; every failure was adversarially reproduced before fixing. The fixes: - CLONE IN THE USER'S CWD (worst): benchmark_repos._repo_root() fell back to Path.cwd() outside a git checkout, so a pip-installed bench dumped a 169MB catalog clone into whatever directory the user ran init from. The cache now lands in $XDG_CACHE_HOME/benchflow/datasets (~/.cache fallback); inside a checkout the dev path is unchanged. Clones are also --quiet (361 lines of git progress were spewing through the wizard UI). Provenance helper handles the no-checkout case explicitly. - gemini was unusable via --agent: the consistency gate reused the provider protocol filter, which excludes native-wire agents, producing a bogus "wire protocol mismatch" even with a valid GEMINI_API_KEY. gemini now validates by model FAMILY (gemini-*) with a clear error otherwise, and interactively gets a synthetic "google — gemini-* via GEMINI_API_KEY" provider entry instead of a 21-provider lie (generalized from the anthropic synthetic-entry mechanism). - --agent mimo-code silently bound to the catalog's mimo-acp variant (its manifest claims the alias) instead of the canonical local mimo. mimo-code is now a builtin registry alias for mimo — local canonical wins before any autoload, matching the registry's own documented intent. - codex-acp's provider default dropped to aws-bedrock once deepseek was filtered out; the default preference is now deepseek, then openai. - Cosmetics: provider labels no longer duplicate the provider name when its only model prefix equals it; dataset descriptions truncate on a word boundary with an ellipsis instead of a mid-word 80-char slice. Known limitation (deliberate): the hidden key prompt on a pseudo-tty discards type-ahead — scripted runs should pass --api-key. 9 new tests; 182 passed + 104 provenance/dataset tests; ty + ruff clean. Live-verified: gemini flags path works, mimo-code -> mimo, catalog clone lands in ~/.cache/benchflow (cwd stays clean).

status announced at agent selection Never load the full agents repo: - Browsing is now a STATIC selection: onboarding.CATALOG_AGENTS (the acp catalog names, kilobytes, zero network) minus whatever is already registered. The 3-path browse is gone — the menu is acp-only. - Selecting an entry fetches ONLY that agent's manifest.toml (remote_manifests.fetch_one): local-dir sources read the single file, remote sources GET one raw file over HTTPS — no git clone anywhere in the wizard. Local-wins/gap-fill semantics preserved; fetch failures degrade with a clear message. catalog_paths() (full-clone browse) removed. - Subscription status is announced IMMEDIATELY after the agent choice, for every subscription-capable agent (registry-driven): a found host login says so (with the detect file); a missing one guides the user to log in with the concrete vendor command (claude setup-token / codex login) or continue with an API key. Also: test_task_download pinned to its cwd-cache assumption via one autouse fixture (production now routes to the user cache) + clone argv assertions carry --quiet; the unknown-agent init test sets the agents source off so it stops hitting the network mid-suite; suite-order-robust catalog tests. 92 init/onboarding tests + 186 full suite green; ty + ruff clean. Live: subscription found-line prints right after picking claude-agent-acp (real host login); 'other' -> static list -> single-file fetch -> runnable stakpak command with ~/.cache/benchflow EMPTY (no clone).

…lution CI runs manifest-parity first, which registers the entire catalog and left catalog_choices() empty (IndexError). The test now deterministically evicts one catalog entry for its duration and restores it.

Yiminnn · 2026-07-03T22:32:35Z

@bingran-you All four findings are addressed on the branch (commit 2f3af72 "resolve PR 883 onboarding blockers" + subsequent commits preserved them through the later refactors) — point-by-point with evidence:

Auth precedence now matches the runtime resolver. detect_key_sources() orders ./.env (cwd) → process environment → subscription login, mirroring resolve_agent_env's "Inherit from .env file first, then from os.environ" (env.py:683-685 via load_dotenv_env()). The wizard's auth menu lists sources in that order and the non-interactive path takes the first — the smoke now validates the same key the run will use. Regression: TestDetectKeySources::test_all_sources_listed_in_run_path_order, test_auto_key_detection_prefers_cwd_dotenv_over_exported_key.
Flagged --dataset mytasks normalizes exactly like the prompt path via onboarding.normalize_dataset_input(..., local_tasks_dir=...) at both sites. Regression: test_flagged_local_tasks_dir_bare_name_is_normalized (+ the prompt-path twin).
Azure Foundry pings send api-key: _ping_headers() branches per provider — azure-foundry-* gets {"api-key": ...}, other openai-completions keep Bearer, anthropic-messages sends x-api-key+api-key+anthropic-version. Regression in TestModelPingProviderClasses.
Printed commands are shell-quoted: final_command() renders via shell_join (shlex.join). Regression: test_final_command_shell_quotes_task_paths_with_spaces, test_full_smoke_prints_shell_quoted_command.

Since your review the branch also gained: static catalog browse + single-manifest lazy fetch (no full-repo clone anywhere — a 169MB clone was previously landing in the user's cwd; the cache also moved to $XDG_CACHE_HOME/benchflow), subscription status announced immediately after agent selection with login guidance (claude setup-token / codex login), gemini/mimo-code/codex-default fixes from an all-agents e2e sweep, and suite-order-robust tests. CI is fully green (test, parity, rollout-smoke, pip-audit). Requesting re-review.

bingran-you · 2026-07-04T12:20:58Z

Users Simulation automation review (2026-07-04): ready by simulation on current head acb98b0f.

PR-scoped temp-home user simulation confirmed the four previous blockers are cleared:

bench --help, bench init --help, and bench doctor --help passed.
Azure OpenAI ping uses api-key and no bearer header in a mocked provider probe.
bench init with conflicting exported key and cwd .env picked the cwd key, wrote the temp-home env, and printed a shell-quoted --tasks-dir './my tasks' command.
bench doctor passed against the local stub.
uv sync --extra dev --locked, tests/test_onboarding.py tests/test_init_cli.py tests/test_task_download.py -> 116 passed, focused ruff passed, and uv run ty check src/ passed.

Shared sweep canary also produced a healthy Docker OpenHands + DeepSeek artifact bundle with ACP+LLM trajectories, results.jsonl, 37,265 tokens, $0.00273728, 71.4s total, and validator healthy. Keep status:ready / review:pending; the existing CHANGES_REQUESTED review still needs human re-review before merge.

bingran-you · 2026-07-05T12:26:28Z

Users Simulation automation review (2026-07-05): ready by simulation on head acb98b0fd61f1b4f16b66374c7654467059baeaa.

Isolated temp-home smoke passed: bench --help, bench eval --help, bench tasks --help, bench init --help, bench doctor --help, real bench init against a spaced local tasks dir with conflicting project .env, and real bench doctor against the saved setup. Focused tests (tests/test_init_cli.py tests/test_onboarding.py tests/test_task_download.py), ty check src/, and ruff check . passed. No run artifacts were produced for this CLI-only PR.

Shared gates passed: uv sync, ruff, ty, 71-test CLI/task/usage slice, bench help/task check/SDK smoke. Live Docker/OpenHands + Gemini flash-lite canary is healthy: ACP+LLM trajectories, results.jsonl training_ready=1, reward 1.0, 24,869 tokens, $0.00079715, 69.8s at /tmp/benchflow-users-simulation-20260705/shared-openhands-gemini-flash-lite-default/2026-07-05__05-11-14/hello-world-task__40312589. OpenHands max-effort was quarantined because that ACP agent does not declare reasoning_effort=max.

Labels: keep status:ready + review:pending; automation is not a human approval, so I am not adding an approval label.

symphony-bot added 3 commits July 2, 2026 22:23

Yiminnn temporarily deployed to pypi-internal-preview July 2, 2026 23:00 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 2, 2026 23:18 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 05:59 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:06 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:17 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:27 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:42 — with GitHub Actions Inactive

fix(init): type-safe auth-menu branches + confirm line for the enviro…

5489f75

…nment pick ty flagged use[1] (str | None) flowing into str-typed sinks — the ./.env branch now binds a narrowed value; the environment pick also gets its ✓ confirmation line (it was silently branch-less).

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:52 — with GitHub Actions Inactive

bingran-you requested changes Jul 3, 2026

View reviewed changes

bingran-you added P1 Important debt — must fix soon, but does not block the current release. and removed P2 Anti-pattern / type safety / docs precision / minor schema drift / non-deterministic but contained. review:pending PR is ready-for-review, no reviewer engagement yet. labels Jul 3, 2026

bingran-you added the status:blocked Waiting on external dependency. Add a comment explaining why. label Jul 3, 2026

fix(init): resolve PR 883 onboarding blockers

2f3af72

bingran-you temporarily deployed to pypi-internal-preview July 3, 2026 13:18 — with GitHub Actions Inactive

style(init): format PR 883 regression test

da4df73

bingran-you temporarily deployed to pypi-internal-preview July 3, 2026 13:21 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 18:00 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 18:54 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 21:27 — with GitHub Actions Inactive

style: RUF059 — unused unpacked var in dataset-ellipsis test

f08b4d6

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 21:39 — with GitHub Actions Inactive

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 22:05 — with GitHub Actions Inactive

test: make the static-catalog test immune to suite-order registry pol…

40dcd97

…lution CI runs manifest-parity first, which registers the entire catalog and left catalog_choices() empty (IndexError). The test now deterministically evicts one catalog entry for its duration and restores it.

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 22:15 — with GitHub Actions Inactive

style: B905 zip strict=True

acb98b0

Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 22:24 — with GitHub Actions Inactive

Yiminnn requested a review from bingran-you July 3, 2026 22:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(cli): bench init onboarding wizard + bench doctor#883

feat(cli): bench init onboarding wizard + bench doctor#883
Yiminnn wants to merge 19 commits into
mainfrom
feat/bench-init

Yiminnn commented Jul 2, 2026

Uh oh!

Yiminnn commented Jul 3, 2026

Uh oh!

bingran-you left a comment

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

Yiminnn commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 4, 2026

Uh oh!

bingran-you commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Yiminnn commented Jul 2, 2026

bench init

bench doctor

Quality

Uh oh!

Yiminnn commented Jul 3, 2026

UX redesign (commit 1fbfb59): choose, don't type

Uh oh!

bingran-you left a comment

Choose a reason for hiding this comment

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 3, 2026

Uh oh!

Yiminnn commented Jul 3, 2026

Uh oh!

bingran-you commented Jul 4, 2026

Uh oh!

bingran-you commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`bench init`

`bench doctor`

UX redesign (commit `1fbfb59`): choose, don't type