perf(code-graph): batch index inserts in one transaction (large-repo first index 86s → 5s)#31
Conversation
…first index 86s → 5s) `insertMany` prepared its statement once but `await`ed `execute` per row with NO transaction, so Kùzu auto-committed (fsync'd) every row. A large first index was ~14k serial commit round-trips — ~70s of pure commit wait on a ~1.3k-file repo (real 86s, of which only 8.6s was CPU). Since the first review AUTO-INDEXES, a stranger's first run on a real repo blocked ~90s. Wrap each batch's row loop in a single `BEGIN TRANSACTION … COMMIT` (ROLLBACK on error). One commit instead of thousands, and the batch is now atomic — a failure rolls back instead of leaving a half-written graph (safe now that symbol ids are collision-proof, so no mid-batch PK abort). Measured on playright (1328 files): full index 86.2s → 4.98s (~17×), identical graph (1328 files / 4759 symbols / 4255 imports). All indexing integration scenarios pass (build, gitignore-robust, neighborhood, incremental, blast-hub, precise, cochange-inc, worktree-seed — the incremental + worktree paths use the same insertMany). 552 unit green; typecheck + build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…31 review) Address the Plex review of PR #31: - NIT (item 3): a whole batch in ONE transaction buffers the full WAL/undo set — fine for the measured 1.3k-file repo, but a huge monorepo could balloon. Commit per CHUNK (10k rows) and re-BEGIN: still a handful of commits (the ~17× win holds — playright's largest batch <5k rows commits once, so normal repos are unchanged), with a bounded buffer. Smaller transactions also shrink the failed-COMMIT blast radius (item 2). - IMPROVEMENT (item 1): the prior comment overclaimed "atomic — rolls back instead of a half-written graph." Not true across the build: callers issue several insertMany's, and the incremental path's DETACH DELETE auto-commits BEFORE the re-insert, so delete+reinsert was never one unit (and wasn't pre-PR). Reworded to scope this as a PERF change; a partial build is recovered by the next full rebuild (sweep GraphFreshness / fresh index). True delete+reinsert atomicity would need an outer transaction spanning multiple insertMany's (which would nest — item 2's concern) — deferred. - IMPROVEMENT (item 2): only ROLLBACK when a transaction is actually open (`inTxn` guard), so a BEGIN failure doesn't trigger a spurious "no active transaction" rollback; documented that ROLLBACK closes the txn so the reused long-lived connection isn't wedged for the next insertMany, and the ORIGINAL error is surfaced. Re-measured playright: still 5.08s (chunking doesn't regress normal repos). Indexing scenarios (build, gitignore-robust, incremental, worktree-seed) pass; 554 unit green; typecheck + build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Plex review feedback addressed — ready for re-review. The Plex review (in-session) raised 1 nit + 2 improvements — all about the transaction wrap's interaction with callers and its atomicity claim, not the perf change itself.
Re-measured: playright (1328 files) still 5.08s — chunking doesn't regress normal repos. Tests: Checks: No GitHub review threads existed to reply to/resolve (the review was in-session). |
What changed
insertMany(the core index insert path) prepared its statement once butawaitedexecuteper row with no transaction, so Kùzu auto-committed (fsync'd) every row. A large first index was ~14k serial commit round-trips — most of a ~70s wait. Since the first review auto-indexes, a stranger's first run on a real repo blocked ~90s.Wrap each batch's row loop in a single
BEGIN TRANSACTION … COMMIT(ROLLBACK on error): one commit instead of thousands, and the batch is now atomic — a failure rolls back instead of leaving a half-written graph (safe now that symbol ids are collision-proof, so no mid-batch PK abort).Measured (playright, 1328 files)
How to test
pnpm test:integrationindexing scenarios all pass —build,gitignore-robust,neighborhood,incremental,blast-hub,precise,cochange-inc,worktree-seed(the incremental + worktree-seed paths exercise the sameinsertMany).pnpm test:unit(554) +pnpm typecheck+pnpm buildgreen.plex index <large-repo>— seconds instead of a minute-plus; the graph counts are unchanged.Notes
Found during the launch-readiness audit (npm advisories came back clean; error/empty states are mostly good). One minor UX item noted for a follow-up:
plex reviewwith nothing staged dumps the full agent-notes context instead of a clean "nothing to review" message.🤖 Generated with Claude Code