Skip to content

feat(cli): bench init onboarding wizard + bench doctor#883

Open
Yiminnn wants to merge 19 commits into
mainfrom
feat/bench-init
Open

feat(cli): bench init onboarding wizard + bench doctor#883
Yiminnn wants to merge 19 commits into
mainfrom
feat/bench-init

Conversation

@Yiminnn

@Yiminnn Yiminnn commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Adds first-run onboarding — the gap every user currently crosses by hand-exporting env vars and discovering broken creds mid-run.

Design researched against prior art (Harbor's flag-triple + credential-free oracle smoke; OpenClaw/Hermes/OpenCode's converged wizard-with-auth-branch + doctor split): Option B, wizard + doctor.

bench init

Interactive chain (every prompt has a flag mirror — fully-flagged runs never touch stdin):

  1. Model → provider resolution (registered provider/model, bare family ids, or well-known families via infer_env_key_for_model).
  2. Agent — offered list filtered by the same provider-protocol gate the run path enforces (_provider_supports_agent_protocol), so the wizard never offers an agent the run would reject (e.g. codex-acp on deepseek).
  3. Task set / sandbox (dataset name or tasks dir; docker/daytona).
  4. Credentials: host subscription login detected via check_subscription_auth → skip prompt; existing env var → reuse; else hidden prompt / --api-key → stored in ~/.benchflow/.env (0600, merge-preserving, auto-loaded by the CLI callback with setdefault semantics — a real export or CI secret always wins).
  5. Smoke test: litellm route resolution row (pure, catches proxy-lane errors) + 1-token completion ping through the run path's own endpoint join (never GET /models — it can 200 while the route is broken; census evidence). ADC/AWS providers are skipped honestly. Optional --full-smoke --smoke-task X runs the credential-free oracle agent in the chosen sandbox first (Harbor's pattern).
  6. Prints (and OSC52-copies) the canonical bench eval run … command.

bench doctor

Same checks standalone: per-row ✅/❌ (sandbox, provider key, litellm route, model ping), exit 1 on failure.

Quality

  • TDD-built: 13 red-green slices, then a 4-dimension multi-agent review with adversarial verification — 16 confirmed findings, all fixed (endpoint-join correctness for zai/github-models/azure-anthropic, corrupt-env-file crash of every subcommand, silent oracle-failure "Ready.", terminal-escape injection via server responses, --api-key silently dropped for ADC providers, and more — see the fix commit).
  • 40 tests (tests/test_onboarding.py, tests/test_init_cli.py); full agents+cli suites green (135 passed).
  • Live-verified on a real box: init with real DeepSeek key → route row + ping 200; doctor green rc=0; interactive wizard over stdin; fake key → honest 401 + exit 1 with remediation.

symphony-bot added 3 commits July 2, 2026 22:23
TDD-built (red-green per slice) implementation of the researched Option B
'wizard + doctor split' design:

- benchflow/onboarding.py — deep logic module: private env file
  (KEY=value, 0600, merge-preserving; setdefault autoload so a real export
  or CI secret always wins), TOML prefs round-trip, provider resolution
  (prefixed or bare model ids), agent picker filtered by the SAME
  provider-protocol gate the run path enforces (never offer an agent the
  run would reject), 1-token model ping (GET /models can 200 while the
  route is broken — a max_tokens=1 completion is the cheapest honest
  check), shared CheckResult rows for smoke + doctor, and eval-run command
  assembly (final command + credential-free oracle smoke argv).
- cli/init_cmd.py — bench init: model -> agent -> tasks -> sandbox -> creds
  (reuse env / store hidden-prompted key) -> smoke -> prints + OSC52-copies
  the ready-to-run command; every prompt has a flag mirror so a fully
  flagged invocation never blocks on stdin. bench doctor: same checks as
  standalone pass/fail rows, exit 1 on failure.
- cli/main.py — app callback auto-loads ~/.benchflow/.env (BENCHFLOW_HOME
  overridable) for every subcommand.

22 new tests (env file, prefs, provider resolution, protocol-filtered
picker, mocked-transport ping, doctor composition, command assembly,
non-interactive CLI, interactive wizard, startup autoload); full agents+cli
suites green (117 passed).
Stage-1 of the researched two-stage smoke (Harbor's oracle pattern): with
--full-smoke + --smoke-task, init runs the credential-free oracle agent on
one task in the chosen sandbox BEFORE the credential checks — proving
install + sandbox plumbing independently of keys. Opt-in because a real
sandbox eval takes minutes; the default smoke (checks + 1-token ping) stays
seconds.
Multi-agent review (correctness / silent-failures / security / design-fit,
every finding runtime-repro'd by an adversarial verifier) on the bench init
diff; all 16 confirmed findings fixed:

- model_ping rewritten: endpoint joined exactly like the run path
  (resolve_base_url per protocol + one segment — no /v1 guessing, fixing
  zai //v4/v1 and github-models URLs), anthropic-messages-only providers
  ping /v1/messages with x-api-key (+ Azure api-key), ADC/AWS providers are
  skipped honestly instead of failed, 200-but-not-a-completion bodies fail,
  and all server-supplied detail text is control-character-sanitized
  (terminal escape injection).
- env file hardening: 'export ' prefixes and single-quoted values parse;
  malformed lines (empty/whitespace keys) are skipped; unreadable files
  warn-and-degrade — a corrupt ~/.benchflow/.env can no longer crash every
  subcommand from the autoload callback (including the ones needed to fix
  it).
- run_doctor: unknown sandbox is a failing row (was: zero checks, "All
  checks passed"); new pure litellm-route resolution row catches proxy-lane
  errors the direct ping cannot; unregistered-but-inferable models
  (claude-*/gpt-*/gemini-*) check their inferred key instead of hard-fail.
- init guards: stage-1 oracle failure now fails init (was silently
  "Ready."); --full-smoke without --smoke-task errors (exit 2, was silent
  skip); --api-key on non-api_key providers errors with the real auth
  mechanism (was silently discarded); the smoke verifies the key just
  provided, not a stale exported one; host subscription logins
  (check_subscription_auth) are detected and skip the key prompt; bare
  well-known models complete the wizard via infer_env_key_for_model.

17 new tests (40 total for init/onboarding); full agents+cli suites green
(135 passed). Live re-verified: real deepseek ping 200 on the corrected
join + litellm-route row.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 2, 2026 23:00 — with GitHub Actions Inactive
… bugs)

Second review round (comment-analyzer, pr-test-analyzer, silent-failure-
hunter, type-design-analyzer, code-reviewer) on top of the earlier
adversarial pass. Confirmed behavioral bugs fixed:

- The wizard's default dataset emitted a command that DOESN'T RUN: bare
  'skillsbench' fails bench eval run's <name>@<version> parsing (and
  --full-smoke then blamed the sandbox). Default is now skillsbench@1.1 and
  registry-style names are validated via parse_dataset_spec at init time.
- A non-UTF-8 byte in ~/.benchflow/.env bricked EVERY subcommand
  (UnicodeDecodeError is a ValueError, not OSError — the autoload callback
  died before any command, including the repair paths). read_env_file now
  degrades with a warning naming the file.
- write_env_file destroyed comments/unparseable lines on rewrite (data loss
  introduced by the crash fix). The merge is now line-preserving (replace
  keys in place, keep everything else verbatim) and refuses to clobber an
  undecodable file.
- Subscription-auth setups were failed by their own smoke test: run_doctor
  now takes the agent and, when check_subscription_auth covers the model's
  key, reports a skipped row instead of red key/route/ping rows.
- bench doctor tracebacked on corrupt or partial config.toml (TOMLDecodeError
  / KeyError); now the same "run `bench init`" remediation, exit 1.
- --full-smoke tracebacked with FileNotFoundError when `bench` wasn't on
  PATH; now a clear message.
- CheckResult gains skipped=True (rendered "○", counted in the doctor
  summary) so all-green cannot over-certify unverifiable setups; the
  anthropic ping sends the required anthropic-version header; the litellm
  route row names the exception type; the modal sandbox row no longer says
  "unknown" on success; stale-exported-key shadowing warns; env-write
  OSError during init exits cleanly instead of dumping a traceback over a
  just-typed key.

Comment corrections per the accuracy review (credential precedence, "pure
functions", resolve_provider custom-endpoint rot, run-path-join wording,
SigV4 -> Bedrock bearer, help texts) and test hygiene (the full-smoke test
no longer makes a live network call; RUF005; obfuscated assertion;
resolve_provider return annotation; skipped-flag coverage).

52 init/onboarding tests green; agents+cli suites green (95 passed).
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 2, 2026 23:18 — with GitHub Actions Inactive
UX redesign per review — choose, don't type (the Hermes setup pattern):

- Every step is a numbered menu with an Enter-able default: provider (from
  the registry, with model-family hints), model (from the provider's model
  catalog when it has one, else a hinted prompt), agent (the protocol-
  filtered list), task set (live entries from the dataset registry with
  descriptions, newest first, plus a local-tasks-dir option and a free-form
  fallback when the registry is unreachable), sandbox (from
  SANDBOX_PROVIDERS). Free text survives only as explicit "other" options
  and CI flags (all flag mirrors unchanged).
- After the model is chosen the wizard finds credentials ITSELF
  (onboarding.detect_key): host subscription login > process environment
  (which includes the saved ~/.benchflow/.env via the startup autoload) >
  ./.env in the working folder — a key found there is announced and passed
  through into the bench setup for future runs. The hidden prompt is now the
  last resort, not the default.

New onboarding seams, TDD-built: detect_key (4 tests: precedence +
pass-through), dataset_choices (registry fetch, newest-first, offline
degrade), plus a full CLI test driving the menu flow with zero key typing.
59 init/onboarding tests green; agents suite green (70 passed). Live: Enter
x4 + model id -> key auto-found in ./.env -> real 1-token ping 200 ->
runnable command.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 05:59 — with GitHub Actions Inactive
@Yiminnn

Yiminnn commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

UX redesign (commit 1fbfb59): choose, don't type

Per review feedback — the wizard is now selection-driven (the Hermes setup pattern):

Provider:                       Task set:
  ...                             1) skillsbench@1.1 — 87-task benchmark set ...
  5) deepseek — deepseek          2) skillsbench@1.0 — superseded by 1.1 ...
  ...                             3) a local tasks dir — path to your own tasks
  16) other — type a model id   Select [1]: ⏎
Select [5]: ⏎
Model id (e.g. deepseek-...): deepseek-v4-flash
Agent (able to route ...):    Select [33]: ⏎   # protocol-filtered list, default pi-acp
Sandbox: 1) docker 2) daytona 3) modal          Select [1]: ⏎

✓ DEEPSEEK_API_KEY found in ./.env — saving it to your bench setup.
  • Every step is a numbered menu with an Enter default — provider (registry + family hints), model (provider catalog when present), agent (protocol-filtered), task set (live dataset-registry entries w/ descriptions, newest first; offline → free-form fallback), sandbox. Free text only via explicit other options; all CI flag mirrors unchanged.
  • Credentials are auto-detected after model selection (onboarding.detect_key): subscription login → process env (incl. the saved ~/.benchflow/.env) → ./.env in the working folder (announced + passed through into the bench setup). The hidden prompt is the last resort, not the default.

59 tests green (incl. a full menu-driven CLI test with zero key typing) · live-verified: Enter×4 + model id → key auto-found in ./.env → real 1-token ping 200 → runnable command.

The agent (the thing the user came to benchmark) is now the first menu; the
provider menu then narrows to what that agent can route via the new
onboarding.compatible_providers — the same protocol gate as
compatible_agents, applied in the other direction (codex-acp offers only
Responses-capable providers; deepseek disappears). compatible_agents(model=
None) returns the unfiltered list for the first menu; the post-model
consistency gate still guards flag combinations. Tests updated + a new
menu-filter test (60 green); live-verified: Enter x2 + model id -> key from
./.env -> ping 200.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:06 — with GitHub Actions Inactive
- "a local tasks dir" answered with a bare relative name (mytasks) fell
  into the registry-spec validation and hard-exited, discarding every
  wizard answer. Normalized at the prompt site to ./name so it routes to
  --tasks-dir.
- The ./.env pass-through persisted and immediately spent a possibly
  unrelated key with only a post-hoc note. Interactive runs now get an
  Enter-default confirm; the message names the absolute source, the
  destination file, and a last-4 fingerprint (non-tty stays unprompted so
  CI never blocks).
- detect_key ordered subscription > exported key — the OPPOSITE of the run
  path (resolve_agent_env inherits the exported key and
  uses_native_subscription_auth then defers to it), so init could announce
  "subscription login, no key needed" while the run billed the key.
  Precedence now matches the run path: environment > subscription > ./.env.
- dataset_choices crashed on non-dict registry entries instead of
  degrading; malformed entries are now filtered.

63 init/onboarding tests green; full suites 133 passed.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:17 — with GitHub Actions Inactive
…clack UI

Fixes from live user testing of the wizard:

- The agent menu showed only the 9 core built-ins. agent_paths() now
  triggers the remote-manifest auto-load and groups the FULL catalog (core +
  benchflow-ai/agents manifests + installed plugins) by the naming law; the
  wizard asks path first (acp / ai-sdk / omnigent, with counts), then that
  path's agents — the user's preferred shape for a ~60-agent list.
- Provider labels lied: they showed each provider's PRIMARY wire next to an
  agent that would use a different endpoint (aws-bedrock read
  "openai-responses" beside claude-agent-acp; the filter was right — bedrock
  and zai genuinely serve anthropic-messages — but the label wasn't).
  Labels now show model families, "BYO base URL" for vllm-style providers,
  or the endpoint the CHOSEN agent will actually use.
- Subscription login is now a LISTED choice, not a silent auto-decision:
  the credentials step is an OpenClaw-style menu of every detected source
  (subscription login, environment key, ./.env key — fingerprinted, with
  absolute paths) plus manual entry; interactive-only (_isatty seam), the
  non-tty path keeps run-path-preference auto-detection so CI never blocks.
  detect_key_sources() returns all sources in run-path order; detect_key is
  now its first-hit wrapper.
- Clack/OpenClaw-style visuals via the existing rich console (no new
  dependency): ◆ step headers, │ option rails, ● confirmed-step summary
  lines, dim descriptions; degrades to plain text on non-tty; user data
  escaped against rich markup.

68 init/onboarding tests green (paths classification, matched-protocol
labels, auth menu w/ subscription choice, source ordering); full suites 138
passed. Live pseudo-tty run verified end-to-end incl. the credentials menu.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:27 — with GitHub Actions Inactive
…talog

Adversarial review round on the wizard v3 (4 confirmed findings):

- --agent naming a catalog-only agent exited with a bogus "wire protocol
  mismatch": the flags path never triggered the manifest autoload the menus
  use. It now resolves through registry.resolve_agent (aliases + the same
  miss-driven autoload as `bench eval run`) and unknown agents say "Unknown
  agent", not protocol mismatch.
- The credentials menu announced choices it didn't honor: picking the
  subscription login with a key exported left the key in force (the smoke
  verified the DECLINED key, and the run would bill it); picking ./.env
  under a different exported key was a setdefault no-op. Choices are now
  effective in-process (subscription pops the key so the smoke verifies the
  subscription setup; ./.env/manual picks override) with an explicit shadow
  warning that the shell export wins at run time — shared _warn_shadow
  helper with the --api-key branch.
- Offline with a warm cache, the wizard silently shrank to built-ins even
  though the cached clone holds the full catalog: _source_root now falls
  back to the cached benchflow-ai/agents checkout when the refresh fetch
  fails (the clone is the catalog of record). Skipped: a fetch timeout in
  benchmark_repos (shared with task sources — separate change if wanted).
- A dim "loading the agent catalog…" notice before the first menu (the
  autoload can hit the network).

72 init/onboarding tests green; agents suite green.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:42 — with GitHub Actions Inactive
…nment pick

ty flagged use[1] (str | None) flowing into str-typed sinks — the ./.env
branch now binds a narrowed value; the environment pick also gets its ✓
confirmation line (it was silently branch-less).
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 06:52 — with GitHub Actions Inactive
@bingran-you bingran-you added enhancement New feature or request P2 Anti-pattern / type safety / docs precision / minor schema drift / non-deterministic but contained. area:diagnostics Issue / PR lives primarily in the "diagnostics" subsystem. review:pending PR is ready-for-review, no reviewer engagement yet. review:changes-requested Author needs to push more commits before this can merge. labels Jul 3, 2026

@bingran-you bingran-you left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Daily scan review (2026-07-03): requesting changes. This PR belongs to benchflow-ai/benchflow and CI is green, but the onboarding flow has user-facing correctness blockers that need fixes before merge.

Findings:

  1. Credential detection does not match the actual run path. _wizard_auth_step() consumes detect_key_sources() where process env wins over ./.env, but resolve_agent_env() reads ./.env before os.environ. The wizard can validate/smoke one key while the printed bench eval run later uses another.
  2. Fully flagged local task-dir onboarding is not a true mirror of the prompt path. --dataset mytasks is rejected unless the user writes ./mytasks, while the interactive prompt normalizes the same bare directory answer.
  3. Azure Foundry OpenAI smoke pings use Authorization: Bearer for all openai-completions providers, but Azure OpenAI API-key auth requires api-key, so bench init / bench doctor can falsely fail valid Azure setups.
  4. Printed commands use raw string joining, so paths with spaces are not copy-pasteable.

Please align auth-source precedence with the runtime resolver, normalize flagged dataset paths the same way as the prompt path, use the provider-specific Azure auth header in smoke checks, and shell-quote printed commands. Keep the current tests, but add focused regressions for those cases.

@bingran-you bingran-you added P1 Important debt — must fix soon, but does not block the current release. and removed P2 Anti-pattern / type safety / docs precision / minor schema drift / non-deterministic but contained. review:pending PR is ready-for-review, no reviewer engagement yet. labels Jul 3, 2026
@bingran-you

Copy link
Copy Markdown
Collaborator

Users Simulation review (2026-07-03): still blocked.

The basic init/doctor smoke is healthy: a temp-home interactive bench init --skip-smoke with a preseeded DEEPSEEK_API_KEY, patched local agent catalog, and TTY path wrote config/env successfully; bench doctor passed with the final ping stubbed. Targeted checks also passed: tests/test_init_cli.py tests/test_onboarding.py tests/test_agent_env_resolution.py tests/test_resolve_env_helpers.py -> 175 passed, plus ruff, ty, and help for bench, bench init, bench doctor, and bench eval.

But the existing changes-requested blockers on this same head are still visible in source:

  • Azure OpenAI smoke still uses Authorization: Bearer for openai-completions pings instead of Azure's api-key header (src/benchflow/onboarding.py).
  • Printed commands still use raw string joining, so paths with spaces are not copy-pasteable (src/benchflow/onboarding.py, src/benchflow/cli/init_cmd.py).
  • The fully flagged path still rejects bare local task dirs such as --dataset mytasks before normalizing them like the prompt path (src/benchflow/cli/init_cmd.py).

No model-backed trajectory was generated in this smoke, so trajectory health was not applicable.

@bingran-you bingran-you added the status:blocked Waiting on external dependency. Add a comment explaining why. label Jul 3, 2026
@bingran-you bingran-you temporarily deployed to pypi-internal-preview July 3, 2026 13:18 — with GitHub Actions Inactive
@bingran-you

Copy link
Copy Markdown
Collaborator

Re: #883 (comment)

Pushed 2f3af72bb4667bc760e56e405d9abc357754c847 to feat/bench-init.

Fixes:

  • Azure OpenAI model ping now sends api-key for azure-foundry-openai completions smoke.
  • Printed final/full-smoke commands use shell quoting via shared onboarding.shell_join(...).
  • Fully flagged existing local task dirs such as --dataset mytasks normalize to ./mytasks before registry validation.
  • Credential detection order now matches resolve_agent_env: local ./.env, then process/saved env, then subscription fallback.

Validation:

  • uv run python -m pytest tests/test_init_cli.py tests/test_onboarding.py tests/test_agent_env_resolution.py tests/test_resolve_env_helpers.py -q -> 180 passed in 30.58s
  • uv run ruff check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py tests/test_onboarding.py tests/test_init_cli.py -> All checks passed
  • uv run ty check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py -> All checks passed

Leaving status:blocked in place until GitHub CI is green.

@bingran-you

Copy link
Copy Markdown
Collaborator

Addressing CHANGES_REQUESTED review 4624310107.

Commit 2f3af72bb4667bc760e56e405d9abc357754c847 resolves the body-level findings:

  1. Credential precedence now follows the canonical run resolver: local ./.env, process/saved env, then subscription fallback.
  2. Fully flagged existing local task dirs like --dataset mytasks normalize before registry-spec validation, matching the interactive local-task path.
  3. Azure OpenAI smoke pings use the Azure api-key header instead of bearer auth.
  4. Printed final and stage-1 smoke commands are rendered with shell quoting, so paths with spaces are copy-pasteable.

Local validation passed: required pytest target set (180 passed in 30.58s), requested ruff target, and requested ty target. CI is running on the new head, so I left the blocked labels unchanged.

@bingran-you bingran-you temporarily deployed to pypi-internal-preview July 3, 2026 13:21 — with GitHub Actions Inactive
@bingran-you

Copy link
Copy Markdown
Collaborator

Follow-up for #883 (comment) and CHANGES_REQUESTED review 4624310107.

The latest pushed head is now da4df73acb128616b246a593062a8fa36bf5baf6 on feat/bench-init. It includes the logical blocker fix in 2f3af72bb4667bc760e56e405d9abc357754c847 plus a formatter-only follow-up after GitHub CI caught ruff format --check on tests/test_init_cli.py.

Updated local validation:

  • uv run python -m pytest tests/test_init_cli.py tests/test_onboarding.py tests/test_agent_env_resolution.py tests/test_resolve_env_helpers.py -q -> 180 passed in 29.18s
  • uv run ruff check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py tests/test_onboarding.py tests/test_init_cli.py -> All checks passed
  • uv run ty check src/benchflow/onboarding.py src/benchflow/cli/init_cmd.py -> All checks passed
  • uv run ruff format --check src tests tools -> 521 files already formatted

I left status:blocked and review:changes-requested unchanged while GitHub CI reruns on the new head.

@bingran-you bingran-you added status:ready Triaged, unassigned, available to claim. review:pending PR is ready-for-review, no reviewer engagement yet. and removed status:blocked Waiting on external dependency. Add a comment explaining why. review:changes-requested Author needs to push more commits before this can merge. labels Jul 3, 2026
@bingran-you

Copy link
Copy Markdown
Collaborator

Automation status update (2026-07-03): GitHub checks are green on da4df73a (test, pip-audit, parity, detect-scope, rollout-smoke; scope matrix/review-pack skipped as expected). Moved labels from status:blocked / review:changes-requested to status:ready / review:pending; the prior CHANGES_REQUESTED review still needs human re-review before merge.

…browse

Populating the agent menu no longer clones benchflow-ai/agents. The menu
lists locally-registered ACP agents only (onboarding.acp_agents(), zero
network — drops the path menu and the ai-sdk-*/omnigent-* paths); the repo
is fetched lazily only when an agent is actually resolved for a run
(resolve_agent's miss path), via the "other" escape or --agent <name>. A
bare `bench init` therefore does no git clone. Removed agent_paths() and the
"loading the agent catalog…" notice.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 18:00 — with GitHub Actions Inactive
Two wizard gaps from live testing:

- claude-agent-acp could never reach its subscription login from the menus:
  the provider registry has no plain "anthropic" endpoint (the run path
  serves it via subscription login / inferred ANTHROPIC_API_KEY), so no menu
  choice led to that auth path. Anthropic-native agents (subscription_auth
  replaces ANTHROPIC_API_KEY, or anthropic-messages wire) now get a
  synthetic "anthropic" provider entry — listed first and default — whose
  model prompt defaults to claude-sonnet-4-6; the credentials menu then
  offers the detected subscription login. Live-verified on a host with a
  real Claude login: Enter-through -> "Using claude-agent-acp's subscription
  login" -> runnable command.
- The agent menu's "other" demanded a typed name; it now browses the FULL
  3-path catalog (path menu -> agents) — fetched lazily at exactly that
  point (onboarding.catalog_paths), so a bare `bench init` still does no
  network work.

78 tests green (both flows pinned).
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 18:54 — with GitHub Actions Inactive
A 12-slot e2e workflow drove the real CLI (flags + interactive pseudo-tty)
through every local ACP agent plus catalog/gemini/alias scenarios; every
failure was adversarially reproduced before fixing. The fixes:

- CLONE IN THE USER'S CWD (worst): benchmark_repos._repo_root() fell back
  to Path.cwd() outside a git checkout, so a pip-installed bench dumped a
  169MB catalog clone into whatever directory the user ran init from. The
  cache now lands in $XDG_CACHE_HOME/benchflow/datasets (~/.cache fallback);
  inside a checkout the dev path is unchanged. Clones are also --quiet (361
  lines of git progress were spewing through the wizard UI). Provenance
  helper handles the no-checkout case explicitly.
- gemini was unusable via --agent: the consistency gate reused the provider
  protocol filter, which excludes native-wire agents, producing a bogus
  "wire protocol mismatch" even with a valid GEMINI_API_KEY. gemini now
  validates by model FAMILY (gemini-*) with a clear error otherwise, and
  interactively gets a synthetic "google — gemini-* via GEMINI_API_KEY"
  provider entry instead of a 21-provider lie (generalized from the
  anthropic synthetic-entry mechanism).
- --agent mimo-code silently bound to the catalog's mimo-acp variant (its
  manifest claims the alias) instead of the canonical local mimo. mimo-code
  is now a builtin registry alias for mimo — local canonical wins before
  any autoload, matching the registry's own documented intent.
- codex-acp's provider default dropped to aws-bedrock once deepseek was
  filtered out; the default preference is now deepseek, then openai.
- Cosmetics: provider labels no longer duplicate the provider name when its
  only model prefix equals it; dataset descriptions truncate on a word
  boundary with an ellipsis instead of a mid-word 80-char slice.

Known limitation (deliberate): the hidden key prompt on a pseudo-tty
discards type-ahead — scripted runs should pass --api-key.

9 new tests; 182 passed + 104 provenance/dataset tests; ty + ruff clean.
Live-verified: gemini flags path works, mimo-code -> mimo, catalog clone
lands in ~/.cache/benchflow (cwd stays clean).
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 21:27 — with GitHub Actions Inactive
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 21:39 — with GitHub Actions Inactive
status announced at agent selection

Never load the full agents repo:

- Browsing is now a STATIC selection: onboarding.CATALOG_AGENTS (the acp
  catalog names, kilobytes, zero network) minus whatever is already
  registered. The 3-path browse is gone — the menu is acp-only.
- Selecting an entry fetches ONLY that agent's manifest.toml
  (remote_manifests.fetch_one): local-dir sources read the single file,
  remote sources GET one raw file over HTTPS — no git clone anywhere in the
  wizard. Local-wins/gap-fill semantics preserved; fetch failures degrade
  with a clear message. catalog_paths() (full-clone browse) removed.
- Subscription status is announced IMMEDIATELY after the agent choice, for
  every subscription-capable agent (registry-driven): a found host login
  says so (with the detect file); a missing one guides the user to log in
  with the concrete vendor command (claude setup-token / codex login) or
  continue with an API key.

Also: test_task_download pinned to its cwd-cache assumption via one autouse
fixture (production now routes to the user cache) + clone argv assertions
carry --quiet; the unknown-agent init test sets the agents source off so it
stops hitting the network mid-suite; suite-order-robust catalog tests.

92 init/onboarding tests + 186 full suite green; ty + ruff clean. Live:
subscription found-line prints right after picking claude-agent-acp (real
host login); 'other' -> static list -> single-file fetch -> runnable
stakpak command with ~/.cache/benchflow EMPTY (no clone).
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 22:05 — with GitHub Actions Inactive
…lution

CI runs manifest-parity first, which registers the entire catalog and left
catalog_choices() empty (IndexError). The test now deterministically evicts
one catalog entry for its duration and restores it.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 22:15 — with GitHub Actions Inactive
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview July 3, 2026 22:24 — with GitHub Actions Inactive
@Yiminnn

Yiminnn commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

@bingran-you All four findings are addressed on the branch (commit 2f3af72 "resolve PR 883 onboarding blockers" + subsequent commits preserved them through the later refactors) — point-by-point with evidence:

  1. Auth precedence now matches the runtime resolver. detect_key_sources() orders ./.env (cwd) → process environment → subscription login, mirroring resolve_agent_env's "Inherit from .env file first, then from os.environ" (env.py:683-685 via load_dotenv_env()). The wizard's auth menu lists sources in that order and the non-interactive path takes the first — the smoke now validates the same key the run will use. Regression: TestDetectKeySources::test_all_sources_listed_in_run_path_order, test_auto_key_detection_prefers_cwd_dotenv_over_exported_key.
  2. Flagged --dataset mytasks normalizes exactly like the prompt path via onboarding.normalize_dataset_input(..., local_tasks_dir=...) at both sites. Regression: test_flagged_local_tasks_dir_bare_name_is_normalized (+ the prompt-path twin).
  3. Azure Foundry pings send api-key: _ping_headers() branches per provider — azure-foundry-* gets {"api-key": ...}, other openai-completions keep Bearer, anthropic-messages sends x-api-key+api-key+anthropic-version. Regression in TestModelPingProviderClasses.
  4. Printed commands are shell-quoted: final_command() renders via shell_join (shlex.join). Regression: test_final_command_shell_quotes_task_paths_with_spaces, test_full_smoke_prints_shell_quoted_command.

Since your review the branch also gained: static catalog browse + single-manifest lazy fetch (no full-repo clone anywhere — a 169MB clone was previously landing in the user's cwd; the cache also moved to $XDG_CACHE_HOME/benchflow), subscription status announced immediately after agent selection with login guidance (claude setup-token / codex login), gemini/mimo-code/codex-default fixes from an all-agents e2e sweep, and suite-order-robust tests. CI is fully green (test, parity, rollout-smoke, pip-audit). Requesting re-review.

@Yiminnn Yiminnn requested a review from bingran-you July 3, 2026 22:32
@bingran-you

Copy link
Copy Markdown
Collaborator

Users Simulation automation review (2026-07-04): ready by simulation on current head acb98b0f.

PR-scoped temp-home user simulation confirmed the four previous blockers are cleared:

  • bench --help, bench init --help, and bench doctor --help passed.
  • Azure OpenAI ping uses api-key and no bearer header in a mocked provider probe.
  • bench init with conflicting exported key and cwd .env picked the cwd key, wrote the temp-home env, and printed a shell-quoted --tasks-dir './my tasks' command.
  • bench doctor passed against the local stub.
  • uv sync --extra dev --locked, tests/test_onboarding.py tests/test_init_cli.py tests/test_task_download.py -> 116 passed, focused ruff passed, and uv run ty check src/ passed.

Shared sweep canary also produced a healthy Docker OpenHands + DeepSeek artifact bundle with ACP+LLM trajectories, results.jsonl, 37,265 tokens, $0.00273728, 71.4s total, and validator healthy. Keep status:ready / review:pending; the existing CHANGES_REQUESTED review still needs human re-review before merge.

@bingran-you

Copy link
Copy Markdown
Collaborator

Users Simulation automation review (2026-07-05): ready by simulation on head acb98b0fd61f1b4f16b66374c7654467059baeaa.

Isolated temp-home smoke passed: bench --help, bench eval --help, bench tasks --help, bench init --help, bench doctor --help, real bench init against a spaced local tasks dir with conflicting project .env, and real bench doctor against the saved setup. Focused tests (tests/test_init_cli.py tests/test_onboarding.py tests/test_task_download.py), ty check src/, and ruff check . passed. No run artifacts were produced for this CLI-only PR.

Shared gates passed: uv sync, ruff, ty, 71-test CLI/task/usage slice, bench help/task check/SDK smoke. Live Docker/OpenHands + Gemini flash-lite canary is healthy: ACP+LLM trajectories, results.jsonl training_ready=1, reward 1.0, 24,869 tokens, $0.00079715, 69.8s at /tmp/benchflow-users-simulation-20260705/shared-openhands-gemini-flash-lite-default/2026-07-05__05-11-14/hello-world-task__40312589. OpenHands max-effort was quarantined because that ACP agent does not declare reasoning_effort=max.

Labels: keep status:ready + review:pending; automation is not a human approval, so I am not adding an approval label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:diagnostics Issue / PR lives primarily in the "diagnostics" subsystem. enhancement New feature or request P1 Important debt — must fix soon, but does not block the current release. review:pending PR is ready-for-review, no reviewer engagement yet. status:ready Triaged, unassigned, available to claim.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants