Skip to content

Delivery Review + Clean-Room Intent Capture#107

Open
gbrbks wants to merge 58 commits into
mainfrom
feature/archie-delivery-review
Open

Delivery Review + Clean-Room Intent Capture#107
gbrbks wants to merge 58 commits into
mainfrom
feature/archie-delivery-review

Conversation

@gbrbks

@gbrbks gbrbks commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Delivery Review + Clean-Room Intent Capture

Adds an AI delivery review to Archie — a PR-time check that reconciles intent, blueprint, and diff into one verdict: did the change build what was asked, and did it break anything? Plus a clean-room intent-capture layer that feeds it trustworthy, transparent intent.

Non-blocking by contract: every review posts an advisory PR comment and never fails a check or blocks an edit. The developer runs zero Archie commands in the happy path.

What's in it

  • Delivery review pipeline — edge-A (intent ⋈ diff completeness) · behavioral + conformance code review (intent-aware) · edge-C (requirement ⋈ invariant conflict) · a single editor gate (falsification-required, floor/anchor/dedup, cannot invent) · a delivery-verdict comment.
  • Three surfaces — deep-scan cold-read gate, sync.py review (manual), and the PR gate (GitHub Action, auto).
  • Clean-room intent capture — the coding agent no longer authors its own yardstick. Hooks Archie already installs log the user's verbatim planning turns; an isolated agent, blind to the code, turns them into .archie/intent.json criteria. Committed, versioned, human-ratifiable, auto-synthesized at PR time if nobody ran it.
  • Transparency — the verdict comment renders the criteria list with ✅/❌, provenance + confidence + a trust label (confirmed vs unconfirmed), and a correction footer. show/synthesize/confirm-intent are opt-in.
  • CI POC wiringdelivery_review.py runs via the existing setup-archie-intent-review.sh workflow (base-ref execution; secrets not exposed to PR-editable code).

Guarantees

  • Doesn't block developers. Capture is a silent log append; the review only comments. The enforcement hook's blocking exit 2 is preserved (proven by real subprocess tests).
  • Breaks the circularity. Criteria are authored blind to the implementation — verified end-to-end; a test enforces the prompt carries no code.

Quality

  • 181 tests pass (python3 3.9.6); verify_sync clean (52 scripts, canonical ↔ npm assets).
  • Built subagent-driven, TDD per task, with per-task reviews and adversarial final reviews — which caught and fixed a data-loss wipe, several security holes, a swallowed enforcement exit 2, and an always-0/N completeness bug before merge.
  • Provenance/originality audit in the design doc (§11) — borrowed vocabulary renamed, mechanisms re-provenanced to public prior art.

Deferred (documented follow-ups)

Linear/Jira ticket fetch (seam built) · the full contract→tracer→challenger invariant specialist (single-pass shipped) · sync auto-trigger hook · verdict→check-state/AIS teeth · richer verdict styling · honest met/unknown split (needs edge-A per-criterion verdicts).

Design + plans: docs/archie-delivery-review-design.md, docs/archie-branch-intent-design.md, docs/archie-intent-capture-design.md.

🤖 Generated with Claude Code

gbrbks and others added 30 commits July 1, 2026 15:47
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implement selector.py — non-reviewing router that intersects changed files
with blueprint anchors to pick Lane-2 specialists. Pure function, zero
dependencies. Includes TDD test suite (3 tests: invariant selection on cited
file, data-lifecycle selection on store file, empty result on untouched file).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace raw substring branch in _hit() with path-segment-aware check
to prevent short anchors like "db" matching unrelated files such as
services/redis_db_client.py. Add regression test covering this case.
Import stays as sys.path form — archie package init pulls in tomllib
(Python 3.11+) which is unavailable on this env (Python 3.9).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Editor gate: the single reliability pass every surface routes through.
Validates schema, floor-gates, anchor-checks, and dedups against store.
Never invents — every confirmed item is an input item unchanged in id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implement consumers() to invert scanner's import graph and compute
transitive blast radius. Pure stdlib, no dependencies, handles module
stem extraction and BFS cycle-free walk over dependencies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Change review() default arg from run=run_verifier (definition-time bind)
to run=None with call-time lookup so monkeypatch.setattr works correctly.
Strengthen test_review_mocked to assert a real finding flows through the
lambda; add test_parse_findings_drops_missing_falsification negative test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Append three functions to archie/standalone/intent.py to resolve
ticket IDs from mixed sources and persist branch intent records.

- ticket_ids_from: Extract unique ticket IDs from branch, PR body,
  commit messages via regex
- load_branch_record: Load .archie/intent/<branch>.json
- save_branch_record: Save record, never downgrading confidence

Records are ranked {inferred: 0, commits: 1, pr_body/prompt: 2, linear: 3}.
Higher-confidence sources never get overwritten by lower ones.

Tests: 3/3 passing (test_intent_ladder.py). Task 7 regression check:
3/3 passing (test_intent.py).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Append aggregate_verdict() function to reconcile.py for computing
{intent_completeness, breaks, conflicts, gate_signal} from confirmed
findings per intent spec.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add evidence-schema fields (kind, edge:"B", anchor, assumptions,
falsification) to step-5b-risk.md emission contract. Add contract
test + fixture that verify every risk finding satisfies
evidence_schema.has_evidence_fields.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `should_review` (intake eligibility) and `render_verdict` (Markdown
verdict) pure functions in delivery_review.py; wire GitHub Action step.
Full PR-gate orchestration deferred to a later task.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Count distinct criterion_id values when computing unmet acceptance
criteria in aggregate_verdict; fall back to per-finding count when
criterion_id is absent. Thread criterion_id onto edge-A findings in
parse_edge_a so downstream dedup works.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…verdict

A1: add extract_json_obj to evidence_schema — brace-depth scan that resumes
after failed parses so stray balanced-brace prose no longer silently drops all
findings; replace find/rfind/json.loads in behavioral_review + reconcile.

A2: add coerce_confidence to evidence_schema — null/string confidence degrades
to 0.0 instead of crashing; wired into make_finding and both parsers.

A3: aggregate_verdict now counts intent_drift findings, includes drift in
gate_signal formula (0.1 weight), and returns "drift" key.

A4: guard None acceptance_criteria in build_edge_a_prompt and aggregate_verdict
with (... or []) so a stored null doesn't crash iteration.

Tests: 23 passing (17 existing + 6 new).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…precision

B1 editor_gate: widen dedup key to (file,line,kind) so distinct findings on different
lines of the same file are not silently dropped; coerce string anchor.line to int before
membership test; treat line=None + file-in-changed-set as a file-level finding (keep).
B2 selector: remove interior-substring clause from _hit so anchor src/api/ no longer
matches vendor/src/api/x.py; drop break after first invariant match so all matching
invariant ids surface in the reason string.
B3 reachability: replace stem-only reverse graph with path-suffix resolution for
path-ish imports (containing /), avoiding cross-directory basename collisions; bare
module names still use stem fallback via __stem__: keys; remove dead stem assignment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rank merge, path collision

C1: add _TICKET_DENYLIST to reject CVE/UTF/SHA/RFC/ISO etc. from ticket extraction.
C2: use O_NOFOLLOW + unlink-before-write to prevent symlink write-through; set 0o600 perms.
C3: equal-rank saves now merge ticket_ids/goals/ac/non_goals instead of overwriting.
C4: _record_path uses slug+sha1[:8] suffix so "a/b" and "a__b" never collide.
New tests: ticket denylist, symlink refused, perms 0o600, equal-rank merge, path no-collision.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…store

D1: add _is_safe_ref guard; reject dashed refs in detect_base and changed_files;
git diff uses '--' end-of-options separator; merge-base ref validated via
rev-parse --verify; add changed_files_result for structured ok/reason signal.

D2: _merge_findings_into_store and gate_and_merge guard against bare-list store
(backs up to findings.json.corrupt sidecar, never silently wipes); all store
writes are atomic via tempfile.mkstemp + os.replace in the same dir.

Tests: 10 new tests (diff_basis x7, finalize x3) + existing 7 all green (17/17).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aping

- E1: `should_review` now uses `or 0` / `or []` guards so None-valued
  `changed_files` and absent/null `labels` from GitHub JSON no longer raise.
- E2: Added `_sanitize()` (html.escape + @ neutralization via zero-width space);
  applied to every model-derived field in `render_verdict` so LLM-generated
  problem_statement / kind / anchor.file cannot inject HTML comment markers,
  raw HTML, or live @mentions into the PR comment body.
- Tests: 5 new cases covering None inputs, marker injection (exactly 1 real
  marker survives), @mention neutralization, and HTML escaping. Existing
  loose `"1" in md` assertion tightened to `"1 break(s)" in md`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fetch

Restructure _run_codex so the finally-unlink is at the outer try level,
guaranteeing cleanup on FileNotFoundError, TimeoutExpired, and non-zero
returncode. Switch NamedTemporaryFile(delete=False) → mkstemp+os.close.
Add two targeted tests that assert both failure modes return "" and leave
no temp file. Sync agent_cli.py to npm-package/assets/.

Harden archie-check.yml: add least-privilege permissions block
(contents:read, pull-requests:write), change fetch-depth 2→0 for full
merge-base reachability, and add trust-model comment above delivery-review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eview entrypoint

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alignment

- Add resolve()/build_resolve_prompt() to intent.py: fills goals/acceptance_criteria via one LLM call; no-op when raw is empty
- Wire resolve() into sync_review.run_sync_review before review_edge_a so edge-A always sees populated criteria
- Add severity kwarg (default "medium") to evidence_schema.make_finding
- behavioral_review.parse_findings: severity="high", severity_class="tradeoff_undermined" (advisory, non-blocking)
- reconcile.parse_edge_a: per-kind severity (unmet=high/partial=medium/drift=low), severity_class="tradeoff_undermined"
- step-5b-risk.md: limit emitted kind to behavioral_break|conformance_break (remove schema_drift/mechanical_violation/pattern_divergence)
- Tests: resolve() noop/populate/bad-json, sync_review resolves before edge-A, severity field, non-blocking severity_class, risk taxonomy

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gbrbks and others added 27 commits July 1, 2026 18:43
Add tests/test_delivery_integration.py: drives selector → intent.resolve →
review_edge_a → behavioral_review → review_edge_c → editor_gate →
aggregate_verdict with only the LLM seam mocked, proving no
shape-mismatch bugs hide at module boundaries.

Second test exercises anchor-gate string/int coercion (line as "999" str)
to confirm off-anchor findings are suppressed.

Tighten tests/test_selector.py: split reason string on comma before
checking inv-1 so the assertion cannot pass for inv-10.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-ref exec + resolve persistence

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cord

Was reading PR body from an unset ARCHIE_PR_BODY env var -> completeness
always 0/0 in the real workflow. Now sourced from the event payload; branch
record keyed on PR head ref. Verified end-to-end with real LLM: 3/6 criteria.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a delivery_review.py step (base-ref exec, same trust model + secret +
blueprint prereq as intent_review) to the canonical archie-intent-review.yml
that setup-archie-intent-review.sh installs. Running the existing CI setup
script now provisions BOTH intent review and delivery review on PRs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ptional Linear)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…+ PR body)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…formance)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t.json

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The merge_specs function now captures each spec's singular ticket_id (in
addition to ticket_ids list), ensuring no tickets silently drop when merging.
load_committed_intent return type corrected to dict | None to match actual
behavior (returns None on absent/malformed files).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d blind transform)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…vents

- Implements intent_synthesize.py: regenerates .archie/intent.json from events
- Accepts ONLY the intent-event log, NOT code/diff/conversation (blindness guaranteed)
- Writes unconfirmed spec with provenance (capture_points, captured_at)
- Resynthesize can RETIRE criteria (no scope ratchet) via regeneration not union
- build_synthesis_prompt() asserts "NOT shown the implementation" in synthesis prompt
- parse_synthesis() extracts JSON spec with normalized acceptance_criteria
- synthesize(root, run=None) writes atomically to .archie/intent.json via os.replace
- Zero deps beyond stdlib; imports guarded; tests pass fake run to assert blindness
- 5 tests pass: blindness check, parse_synthesis, unconfirmed spec, retire, no-events

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…parency)

Add four transparency subcommands to sync.py: cmd_capture_intent appends a user-turn event; cmd_synthesize_intent runs the blind transform; cmd_show_intent pretty-prints goals/criteria/provenance/confirmed; cmd_confirm_intent sets confirmed=true. Includes dispatch registration, _usage() lines, and intent_imports() helper. All tests pass (3 passed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ilent SKILL step

- pre-turn.sh: append best-effort user-turn intent event via intent_capture.py
  (guarded by [ -f .archie/intent_capture.py ]; || true; exit 0 preserved)
- pre-validate.sh: append best-effort edit-transition marker after PYEOF
  (fires only on non-blocking path; blocking exit 2 still propagates correctly)
- SKILL.md: replace "Capture branch intent" agent-authoring step with
  "Branch intent (captured automatically)" — synthesize-intent / show-intent /
  confirm-intent commands; removes "author \`goals\` and concrete" instruction
- Copy mirrors to npm-package/assets/ (verify_sync.py: 52 scripts, PASS)
- Add tests/test_intent_hook_wiring.py (3 tests, all pass)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mpts

Non-goals in the intent spec are now threaded through: (a) merge_specs unions
and deduplicates them, (b) build_edge_a_prompt and build_conformance_prompt
include a NON-GOALS section when present, (c) intent_brief surfaces them so
behavioral reviewer sees them. Empty non_goals lists leave prompts unchanged
(backward compatible). Scope creep is now checkable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…esize fallback

- aggregate_verdict: unaddressed criteria count as unknown (not silently met);
  return dict gains 'unknown'; met = addressed - unmet only.
- render_verdict(verdict, confirmed, spec=None): provenance+confidence header,
  per-criterion ✅/❌ list, trust label (human-confirmed vs lower-trust),
  correction-loop footer. Caller updated to pass spec.
- run_pr_gate: hands-off auto-synthesize fallback — if no .archie/intent.json
  exists but events were captured, runs blind synthesize() before assembling intent.
  Non-blocking, always guarded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ture

Capture _rc=$? immediately after the enforcement heredoc PYEOF so the
intent-capture if-block (best-effort, || true) cannot swallow a blocking
exit 2 — the script now ends with exit $_rc. Apply the identical fix to
both canonical and npm-package mirror copies. Also fix two test-integrity
minors: use the hyphenated intent-events.jsonl filename the library
actually produces, and tighten the SKILL.md assertion from or to and.
Add tests/test_pre_validate_exit_code.py with structural + real execution
tests that prove exit 2 survives the appended capture block.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…); update stale tests

Replace unknown/addressed logic in aggregate_verdict with silence=met semantics:
a criterion not flagged unmet by edge-A is counted met (total − unmet). unknown
is now always 0, reserved pending per-criterion verdicts from edge-A. Updates
three test_verdict assertions and the stale write-intent sync-skill test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented Jul 2, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
archie Ready Ready Preview, Comment Jul 2, 2026 9:05pm
archie-viewer Ready Ready Preview, Comment Jul 2, 2026 9:05pm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant