Delivery Review + Clean-Room Intent Capture #107
Open
gbrbks wants to merge 58 commits into
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implement selector.py — non-reviewing router that intersects changed files with blueprint anchors to pick Lane-2 specialists. Pure function, zero dependencies. Includes TDD test suite (3 tests: invariant selection on cited file, data-lifecycle selection on store file, empty result on untouched file). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace raw substring branch in _hit() with path-segment-aware check to prevent short anchors like "db" matching unrelated files such as services/redis_db_client.py. Add regression test covering this case. Import stays as sys.path form — archie package init pulls in tomllib (Python 3.11+) which is unavailable on this env (Python 3.9). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
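A rough sketch of what a path-segment-aware check can look like, for readers who have not opened selector.py. This is not the actual `_hit()`: the name `anchor_hits` and its exact rules (a bare name must equal a whole path segment or the file stem; a multi-segment anchor must be a prefix of the path) are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def anchor_hits(anchor, path):
    """Return True when `anchor` matches `path` on whole path segments (hypothetical helper)."""
    anchor_parts = PurePosixPath(anchor).parts
    path_parts = PurePosixPath(path).parts
    if not anchor_parts:
        return False
    if len(anchor_parts) == 1:
        # Bare name such as "db": must equal one full segment or the file stem,
        # so it cannot match inside "redis_db_client.py".
        name = anchor_parts[0]
        return name in path_parts or PurePosixPath(path).stem == name
    # Multi-segment anchor such as "src/api/": compare segment by segment from the root.
    return path_parts[: len(anchor_parts)] == anchor_parts
```

With these rules `anchor_hits("db", "services/redis_db_client.py")` is false while `anchor_hits("db", "db/models.py")` is true.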
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Editor gate: the single reliability pass every surface routes through. Validates schema, floor-gates, anchor-checks, and dedups against store. Never invents — every confirmed item is an input item unchanged in id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implement consumers() to invert scanner's import graph and compute transitive blast radius. Pure stdlib, no dependencies, handles module stem extraction and BFS cycle-free walk over dependencies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
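For reference, inverting an import graph and walking it breadth-first can be sketched as below. The input shape (a dict mapping each module to the modules it imports) is an assumption; the real `consumers()` works on the scanner's output and also extracts module stems, which this sketch leaves out.

```python
from collections import defaultdict, deque


def consumers(import_graph, changed):
    """Return every module that directly or transitively imports `changed`."""
    # Invert the graph: dependency -> set of modules that import it.
    reverse = defaultdict(set)
    for module, deps in import_graph.items():
        for dep in deps:
            reverse[dep].add(module)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in reverse.get(current, ()):
            # The `seen` set keeps the walk finite even when imports form a cycle.
            if consumer not in seen and consumer != changed:
                seen.add(consumer)
                queue.append(consumer)
    return seen
```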
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Change review() default arg from run=run_verifier (definition-time bind) to run=None with call-time lookup so monkeypatch.setattr works correctly. Strengthen test_review_mocked to assert a real finding flows through the lambda; add test_parse_findings_drops_missing_falsification negative test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
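The difference between the two binding styles is easy to reproduce in isolation. The sketch below uses stand-in functions (`review_early`, `review_late`, and `demo` are hypothetical names, not the ones in the codebase) and rebinds the module global by hand, which is roughly what `monkeypatch.setattr` does to a module attribute.

```python
def run_verifier(prompt):
    return "real:" + prompt


def review_early(prompt, run=run_verifier):
    # Default evaluated once, when the function is defined.
    return run(prompt)


def review_late(prompt, run=None):
    if run is None:
        run = run_verifier  # looked up in module globals at call time
    return run(prompt)


def demo():
    """Patch the module-level name and show which variant notices."""
    global run_verifier
    original = run_verifier
    run_verifier = lambda prompt: "fake:" + prompt
    try:
        return review_early("x"), review_late("x")
    finally:
        run_verifier = original
```

`demo()` returns `("real:x", "fake:x")`: only the call-time lookup sees the patched function.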
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Append three functions to archie/standalone/intent.py to resolve
ticket IDs from mixed sources and persist branch intent records.
- ticket_ids_from: Extract unique ticket IDs from branch, PR body,
commit messages via regex
- load_branch_record: Load .archie/intent/<branch>.json
- save_branch_record: Save record, never downgrading confidence
Records are ranked {inferred: 0, commits: 1, pr_body/prompt: 2, linear: 3}.
Higher-confidence sources never get overwritten by lower ones.
Tests: 3/3 passing (test_intent_ladder.py). Task 7 regression check:
3/3 passing (test_intent.py).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
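The never-downgrade rule can be sketched as a pure comparison on the rank table quoted in the commit message. The record field name `source` and the helper name `pick_record` are assumptions; in this sketch an equal-rank save simply replaces the existing record, whereas a later commit in this PR merges equal-rank records instead.

```python
RANK = {"inferred": 0, "commits": 1, "pr_body": 2, "prompt": 2, "linear": 3}


def pick_record(existing, incoming):
    """Return the record that should be persisted (hypothetical helper)."""
    if existing is None:
        return incoming
    if RANK[incoming["source"]] < RANK[existing["source"]]:
        # Lower-confidence source: keep what is already stored.
        return existing
    return incoming
```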
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Append aggregate_verdict() function to reconcile.py for computing
{intent_completeness, breaks, conflicts, gate_signal} from confirmed
findings per intent spec.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add evidence-schema fields (kind, edge:"B", anchor, assumptions, falsification) to step-5b-risk.md emission contract. Add contract test + fixture that verify every risk finding satisfies evidence_schema.has_evidence_fields. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `should_review` (intake eligibility) and `render_verdict` (Markdown verdict) pure functions in delivery_review.py; wire GitHub Action step. Full PR-gate orchestration deferred to a later task. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Count distinct criterion_id values when computing unmet acceptance criteria in aggregate_verdict; fall back to per-finding count when criterion_id is absent. Thread criterion_id onto edge-A findings in parse_edge_a so downstream dedup works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…verdict A1: add extract_json_obj to evidence_schema — brace-depth scan that resumes after failed parses so stray balanced-brace prose no longer silently drops all findings; replace find/rfind/json.loads in behavioral_review + reconcile. A2: add coerce_confidence to evidence_schema — null/string confidence degrades to 0.0 instead of crashing; wired into make_finding and both parsers. A3: aggregate_verdict now counts intent_drift findings, includes drift in gate_signal formula (0.1 weight), and returns "drift" key. A4: guard None acceptance_criteria in build_edge_a_prompt and aggregate_verdict with (... or []) so a stored null doesn't crash iteration. Tests: 23 passing (17 existing + 6 new). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
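To illustrate A1, here is a minimal brace-depth scan that resumes after a failed parse. It is a sketch under assumptions, not the `extract_json_obj` in evidence_schema: the helper `_balanced_end` and the choice to skip braces inside double-quoted strings are additions of this example.

```python
import json


def _balanced_end(text, start):
    """Index of the "}" that closes the "{" at `start`, or -1 if none does."""
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1


def extract_json_obj(text):
    """Return the first balanced {...} span that parses as JSON, else None."""
    start = text.find("{")
    while start != -1:
        end = _balanced_end(text, start)
        if end != -1:
            try:
                return json.loads(text[start : end + 1])
            except json.JSONDecodeError:
                pass  # balanced prose such as "{not json}": try the next "{"
        start = text.find("{", start + 1)
    return None
```

A single `find`/`rfind` pair would have spanned from the stray `{` in the prose to the last `}`, failed to parse, and dropped every finding.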
…precision B1 editor_gate: widen dedup key to (file,line,kind) so distinct findings on different lines of the same file are not silently dropped; coerce string anchor.line to int before membership test; treat line=None + file-in-changed-set as a file-level finding (keep). B2 selector: remove interior-substring clause from _hit so anchor src/api/ no longer matches vendor/src/api/x.py; drop break after first invariant match so all matching invariant ids surface in the reason string. B3 reachability: replace stem-only reverse graph with path-suffix resolution for path-ish imports (containing /), avoiding cross-directory basename collisions; bare module names still use stem fallback via __stem__: keys; remove dead stem assignment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
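The widened dedup key from B1 could look roughly like this. The finding shape (`kind` plus an `anchor` dict with `file` and `line`) follows the field names mentioned in this PR, but the helper names `_key` and `dedup_findings` are hypothetical and the real editor_gate does more (schema, floor, and anchor checks).

```python
def _key(finding):
    """Build the (file, line, kind) dedup key, coercing string line numbers to int."""
    anchor = finding.get("anchor") or {}
    line = anchor.get("line")
    if isinstance(line, str) and line.strip().isdigit():
        line = int(line)
    return (anchor.get("file"), line, finding.get("kind"))


def dedup_findings(incoming, store):
    """Keep incoming findings whose key is not already in the store or earlier in the batch."""
    seen = {_key(f) for f in store}
    kept = []
    for finding in incoming:
        key = _key(finding)
        if key in seen:
            continue
        seen.add(key)
        kept.append(finding)
    return kept
```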
…rank merge, path collision C1: add _TICKET_DENYLIST to reject CVE/UTF/SHA/RFC/ISO etc. from ticket extraction. C2: use O_NOFOLLOW + unlink-before-write to prevent symlink write-through; set 0o600 perms. C3: equal-rank saves now merge ticket_ids/goals/ac/non_goals instead of overwriting. C4: _record_path uses slug+sha1[:8] suffix so "a/b" and "a__b" never collide. New tests: ticket denylist, symlink refused, perms 0o600, equal-rank merge, path no-collision. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
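As an illustration of C4, a slug plus a short hash of the raw branch name keeps file names readable while separating branches whose slugs coincide. The slug rule (runs of unsafe characters become `__`) and the `-` separator are assumptions of this sketch; only the `slug+sha1[:8]` idea and the `.archie/intent/` location come from the commit messages.

```python
import hashlib
import re
from pathlib import Path


def record_path(root, branch):
    """Return a collision-resistant record path for `branch` (hypothetical helper)."""
    slug = re.sub(r"[^A-Za-z0-9._-]+", "__", branch)
    # Hash the original branch name, not the slug, so "a/b" and "a__b" differ.
    digest = hashlib.sha1(branch.encode("utf-8")).hexdigest()[:8]
    return Path(root) / ".archie" / "intent" / f"{slug}-{digest}.json"
```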
…store D1: add _is_safe_ref guard; reject dashed refs in detect_base and changed_files; git diff uses '--' end-of-options separator; merge-base ref validated via rev-parse --verify; add changed_files_result for structured ok/reason signal. D2: _merge_findings_into_store and gate_and_merge guard against bare-list store (backs up to findings.json.corrupt sidecar, never silently wipes); all store writes are atomic via tempfile.mkstemp + os.replace in the same dir. Tests: 10 new tests (diff_basis x7, finalize x3) + existing 7 all green (17/17). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
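The atomic-write pattern named in D2 is small enough to show in full. This is a generic sketch of `tempfile.mkstemp` plus `os.replace` in the same directory, not the store code itself; `atomic_write_json` is a hypothetical name.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write `data` as JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        os.replace(tmp_path, path)
    except BaseException:
        # Leave no partial temp file behind if serialization or the rename fails.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```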
…aping
- E1: `should_review` now uses `or 0` / `or []` guards so None-valued `changed_files` and absent/null `labels` from GitHub JSON no longer raise.
- E2: Added `_sanitize()` (html.escape + @ neutralization via zero-width space); applied to every model-derived field in `render_verdict` so LLM-generated problem_statement / kind / anchor.file cannot inject HTML comment markers, raw HTML, or live @mentions into the PR comment body.
- Tests: 5 new cases covering None inputs, marker injection (exactly 1 real marker survives), @mention neutralization, and HTML escaping. Existing loose `"1" in md` assertion tightened to `"1 break(s)" in md`.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
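A minimal version of the escaping described in E2 might look like the following. It assumes the zero-width space is inserted directly after each `@`; the real `_sanitize()` may place it differently, and the public name `sanitize` is used here only for the example.

```python
import html

ZERO_WIDTH_SPACE = "\u200b"


def sanitize(text):
    """Escape HTML and defuse @mentions in model-derived text."""
    escaped = html.escape(str(text), quote=True)
    # "@user" becomes "@<ZWSP>user": it looks the same but no longer pings anyone.
    return escaped.replace("@", "@" + ZERO_WIDTH_SPACE)
```

After escaping, a model-supplied `<!-- marker -->` is rendered as literal text, so it can no longer be mistaken for a real comment marker.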
…fetch Restructure _run_codex so the finally-unlink is at the outer try level, guaranteeing cleanup on FileNotFoundError, TimeoutExpired, and non-zero returncode. Switch NamedTemporaryFile(delete=False) → mkstemp+os.close. Add two targeted tests that assert both failure modes return "" and leave no temp file. Sync agent_cli.py to npm-package/assets/. Harden archie-check.yml: add least-privilege permissions block (contents:read, pull-requests:write), change fetch-depth 2→0 for full merge-base reachability, and add trust-model comment above delivery-review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
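The cleanup structure described for `_run_codex` can be sketched as below. The command convention (output-file path appended as the last argument) and the `tmp_dir` parameter are assumptions made so the example is self-contained; the actual CLI invocation is not shown in this PR page.

```python
import os
import subprocess
import tempfile


def run_cli(cmd, timeout=60, tmp_dir=None):
    """Run `cmd` with an output-file path appended; return its contents or ""."""
    fd, out_path = tempfile.mkstemp(suffix=".txt", dir=tmp_dir)
    os.close(fd)  # the child process reopens the path itself
    try:
        try:
            proc = subprocess.run(cmd + [out_path], capture_output=True, timeout=timeout)
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return ""
        if proc.returncode != 0:
            return ""
        with open(out_path, encoding="utf-8") as handle:
            return handle.read()
    finally:
        # Outer-level cleanup: runs on every return path above.
        if os.path.exists(out_path):
            os.unlink(out_path)
```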
…eview entrypoint Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alignment
- Add resolve()/build_resolve_prompt() to intent.py: fills goals/acceptance_criteria via one LLM call; no-op when raw is empty
- Wire resolve() into sync_review.run_sync_review before review_edge_a so edge-A always sees populated criteria
- Add severity kwarg (default "medium") to evidence_schema.make_finding
- behavioral_review.parse_findings: severity="high", severity_class="tradeoff_undermined" (advisory, non-blocking)
- reconcile.parse_edge_a: per-kind severity (unmet=high/partial=medium/drift=low), severity_class="tradeoff_undermined"
- step-5b-risk.md: limit emitted kind to behavioral_break|conformance_break (remove schema_drift/mechanical_violation/pattern_divergence)
- Tests: resolve() noop/populate/bad-json, sync_review resolves before edge-A, severity field, non-blocking severity_class, risk taxonomy
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add tests/test_delivery_integration.py: drives selector → intent.resolve → review_edge_a → behavioral_review → review_edge_c → editor_gate → aggregate_verdict with only the LLM seam mocked, proving no shape-mismatch bugs hide at module boundaries. Second test exercises anchor-gate string/int coercion (line as "999" str) to confirm off-anchor findings are suppressed. Tighten tests/test_selector.py: split reason string on comma before checking inv-1 so the assertion cannot pass for inv-10. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-ref exec + resolve persistence Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cord Was reading PR body from an unset ARCHIE_PR_BODY env var -> completeness always 0/0 in the real workflow. Now sourced from the event payload; branch record keyed on PR head ref. Verified end-to-end with real LLM: 3/6 criteria. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a delivery_review.py step (base-ref exec, same trust model + secret + blueprint prereq as intent_review) to the canonical archie-intent-review.yml that setup-archie-intent-review.sh installs. Running the existing CI setup script now provisions BOTH intent review and delivery review on PRs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ptional Linear) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…+ PR body) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…formance) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t.json Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The merge_specs function now captures each spec's singular ticket_id (in addition to ticket_ids list), ensuring no tickets silently drop when merging. load_committed_intent return type corrected to dict | None to match actual behavior (returns None on absent/malformed files). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d blind transform) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…vents
- Implements intent_synthesize.py: regenerates .archie/intent.json from events
- Accepts ONLY the intent-event log, NOT code/diff/conversation (blindness guaranteed)
- Writes unconfirmed spec with provenance (capture_points, captured_at)
- Resynthesize can RETIRE criteria (no scope ratchet) via regeneration not union
- build_synthesis_prompt() asserts "NOT shown the implementation" in synthesis prompt
- parse_synthesis() extracts JSON spec with normalized acceptance_criteria
- synthesize(root, run=None) writes atomically to .archie/intent.json via os.replace
- Zero deps beyond stdlib; imports guarded; tests pass fake run to assert blindness
- 5 tests pass: blindness check, parse_synthesis, unconfirmed spec, retire, no-events
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…parency) Add four transparency subcommands to sync.py: cmd_capture_intent appends a user-turn event; cmd_synthesize_intent runs the blind transform; cmd_show_intent pretty-prints goals/criteria/provenance/confirmed; cmd_confirm_intent sets confirmed=true. Includes dispatch registration, _usage() lines, and intent_imports() helper. All tests pass (3 passed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ilent SKILL step
- pre-turn.sh: append best-effort user-turn intent event via intent_capture.py (guarded by [ -f .archie/intent_capture.py ]; || true; exit 0 preserved)
- pre-validate.sh: append best-effort edit-transition marker after PYEOF (fires only on non-blocking path; blocking exit 2 still propagates correctly)
- SKILL.md: replace "Capture branch intent" agent-authoring step with "Branch intent (captured automatically)" (synthesize-intent / show-intent / confirm-intent commands); removes "author `goals` and concrete" instruction
- Copy mirrors to npm-package/assets/ (verify_sync.py: 52 scripts, PASS)
- Add tests/test_intent_hook_wiring.py (3 tests, all pass)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mpts Non-goals in the intent spec are now threaded through: (a) merge_specs unions and deduplicates them, (b) build_edge_a_prompt and build_conformance_prompt include a NON-GOALS section when present, (c) intent_brief surfaces them so behavioral reviewer sees them. Empty non_goals lists leave prompts unchanged (backward compatible). Scope creep is now checkable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
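The threading described here amounts to an order-preserving union plus a prompt section that disappears when the list is empty. A sketch, with hypothetical helper names (`merge_non_goals`, `non_goals_section`) and an assumed section layout; only the `non_goals` key and the `NON-GOALS` heading come from the commit message.

```python
def merge_non_goals(specs):
    """Union the non_goals of several specs, keeping first-seen order."""
    merged, seen = [], set()
    for spec in specs:
        for item in spec.get("non_goals") or []:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def non_goals_section(non_goals):
    """Render the prompt section; empty input yields "" so prompts stay unchanged."""
    if not non_goals:
        return ""
    lines = ["NON-GOALS"] + ["- " + goal for goal in non_goals]
    return "\n".join(lines) + "\n"
```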
…esize fallback
- aggregate_verdict: unaddressed criteria count as unknown (not silently met); return dict gains 'unknown'; met = addressed - unmet only.
- render_verdict(verdict, confirmed, spec=None): provenance+confidence header, per-criterion ✅/❌ list, trust label (human-confirmed vs lower-trust), correction-loop footer. Caller updated to pass spec.
- run_pr_gate: hands-off auto-synthesize fallback: if no .archie/intent.json exists but events were captured, runs blind synthesize() before assembling intent. Non-blocking, always guarded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ture Capture _rc=$? immediately after the enforcement heredoc PYEOF so the intent-capture if-block (best-effort, || true) cannot swallow a blocking exit 2 — the script now ends with exit $_rc. Apply the identical fix to both canonical and npm-package mirror copies. Also fix two test-integrity minors: use the hyphenated intent-events.jsonl filename the library actually produces, and tighten the SKILL.md assertion from or to and. Add tests/test_pre_validate_exit_code.py with structural + real execution tests that prove exit 2 survives the appended capture block. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…); update stale tests Replace unknown/addressed logic in aggregate_verdict with silence=met semantics: a criterion not flagged unmet by edge-A is counted met (total − unmet). unknown is now always 0, reserved pending per-criterion verdicts from edge-A. Updates three test_verdict assertions and the stale write-intent sync-skill test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
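The silence=met arithmetic, combined with the earlier distinct-`criterion_id` counting, can be sketched like this. The finding shape (`kind == "unmet"`, optional `criterion_id`), the function name `completeness`, and the clamp to `total` are assumptions; the real `aggregate_verdict` also returns breaks, conflicts, drift, and gate_signal.

```python
def completeness(acceptance_criteria, findings):
    """Count met/unmet criteria, treating unflagged criteria as met."""
    total = len(acceptance_criteria or [])  # a stored null means no criteria
    unmet_findings = [f for f in findings if f.get("kind") == "unmet"]
    # Several findings on one criterion count once; findings without an id count each.
    distinct_ids = {f["criterion_id"] for f in unmet_findings if f.get("criterion_id") is not None}
    without_id = sum(1 for f in unmet_findings if f.get("criterion_id") is None)
    unmet = min(len(distinct_ids) + without_id, total)
    return {"total": total, "unmet": unmet, "met": total - unmet, "unknown": 0}
```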
Delivery Review + Clean-Room Intent Capture
Adds an AI delivery review to Archie — a PR-time check that reconciles intent, blueprint, and diff into one verdict: did the change build what was asked, and did it break anything? Plus a clean-room intent-capture layer that feeds it trustworthy, transparent intent.
Non-blocking by contract: every review posts an advisory PR comment and never fails a check or blocks an edit. The developer runs zero Archie commands in the happy path.
What's in it
sync.py review (manual), and the PR gate (GitHub Action, auto).
.archie/intent.json criteria. Committed, versioned, human-ratifiable, auto-synthesized at PR time if nobody ran it.
show/synthesize/confirm-intent are opt-in.
delivery_review.py runs via the existing setup-archie-intent-review.sh workflow (base-ref execution; secrets not exposed to PR-editable code).
Guarantees
exit 2 is preserved (proven by real subprocess tests).
Quality
verify_sync clean (52 scripts, canonical ↔ npm assets).
exit 2, and an always-0/N completeness bug before merge.
Deferred (documented follow-ups)
Linear/Jira ticket fetch (seam built) · the full contract→tracer→challenger invariant specialist (single-pass shipped) · sync auto-trigger hook · verdict→check-state/AIS teeth · richer verdict styling · honest met/unknown split (needs edge-A per-criterion verdicts).
Design + plans: docs/archie-delivery-review-design.md, docs/archie-branch-intent-design.md, docs/archie-intent-capture-design.md.
🤖 Generated with Claude Code