Skip to content

perf(code-graph): batch index inserts in one transaction (large-repo first index 86s → 5s)#31

Merged
sshanzel merged 2 commits into
mainfrom
perf/index-batch-inserts
Jun 17, 2026
Merged

perf(code-graph): batch index inserts in one transaction (large-repo first index 86s → 5s)#31
sshanzel merged 2 commits into
mainfrom
perf/index-batch-inserts

Conversation

@sshanzel

Copy link
Copy Markdown
Owner

What changed

insertMany (the core index insert path) prepared its statement once but awaited execute per row with no transaction, so Kùzu auto-committed (fsync'd) every row. A large first index was ~14k serial commit round-trips — most of a ~70s wait. Since the first review auto-indexes, a stranger's first run on a real repo blocked ~90s.

Wrap each batch's row loop in a single BEGIN TRANSACTION … COMMIT (ROLLBACK on error): one commit instead of thousands, and the batch is now atomic — a failure rolls back instead of leaving a half-written graph (safe now that symbol ids are collision-proof, so no mid-batch PK abort).

Measured (playright, 1328 files)

Before After
Full index wall-clock 86.2s 4.98s (~17×)
Graph produced 1328 files / 4759 symbols / 4255 imports identical

How to test

  • pnpm test:integration indexing scenarios all pass — build, gitignore-robust, neighborhood, incremental, blast-hub, precise, cochange-inc, worktree-seed (the incremental + worktree-seed paths exercise the same insertMany).
  • pnpm test:unit (554) + pnpm typecheck + pnpm build green.
  • Manual: plex index <large-repo> — seconds instead of a minute-plus; the graph counts are unchanged.

Notes

Found during the launch-readiness audit (npm advisories came back clean; error/empty states are mostly good). One minor UX item noted for a follow-up: plex review with nothing staged dumps the full agent-notes context instead of a clean "nothing to review" message.

🤖 Generated with Claude Code

sshanzel and others added 2 commits June 18, 2026 01:59
…first index 86s → 5s)

`insertMany` prepared its statement once but `await`ed `execute` per row with NO transaction, so
Kùzu auto-committed (fsync'd) every row. A large first index was ~14k serial commit round-trips —
~70s of pure commit wait on a ~1.3k-file repo (real 86s, of which only 8.6s was CPU). Since the
first review AUTO-INDEXES, a stranger's first run on a real repo blocked ~90s.

Wrap each batch's row loop in a single `BEGIN TRANSACTION … COMMIT` (ROLLBACK on error). One commit
instead of thousands, and the batch is now atomic — a failure rolls back instead of leaving a
half-written graph (safe now that symbol ids are collision-proof, so no mid-batch PK abort).

Measured on playright (1328 files): full index 86.2s → 4.98s (~17×), identical graph (1328 files /
4759 symbols / 4255 imports). All indexing integration scenarios pass (build, gitignore-robust,
neighborhood, incremental, blast-hub, precise, cochange-inc, worktree-seed — the incremental +
worktree paths use the same insertMany). 552 unit green; typecheck + build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…31 review)

Address the Plex review of PR #31:

- NIT (item 3): a whole batch in ONE transaction buffers the full WAL/undo set — fine for the
  measured 1.3k-file repo, but a huge monorepo could balloon. Commit per CHUNK (10k rows) and
  re-BEGIN: still a handful of commits (the ~17× win holds — playright's largest batch <5k rows
  commits once, so normal repos are unchanged), with a bounded buffer. Smaller transactions also
  shrink the failed-COMMIT blast radius (item 2).
- IMPROVEMENT (item 1): the prior comment overclaimed "atomic — rolls back instead of a half-written
  graph." Not true across the build: callers issue several insertMany's, and the incremental path's
  DETACH DELETE auto-commits BEFORE the re-insert, so delete+reinsert was never one unit (and wasn't
  pre-PR). Reworded to scope this as a PERF change; a partial build is recovered by the next full
  rebuild (sweep GraphFreshness / fresh index). True delete+reinsert atomicity would need an outer
  transaction spanning multiple insertMany's (which would nest — item 2's concern) — deferred.
- IMPROVEMENT (item 2): only ROLLBACK when a transaction is actually open (`inTxn` guard), so a BEGIN
  failure doesn't trigger a spurious "no active transaction" rollback; documented that ROLLBACK closes
  the txn so the reused long-lived connection isn't wedged for the next insertMany, and the ORIGINAL
  error is surfaced.

Re-measured playright: still 5.08s (chunking doesn't regress normal repos). Indexing scenarios
(build, gitignore-robust, incremental, worktree-seed) pass; 554 unit green; typecheck + build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sshanzel

Copy link
Copy Markdown
Owner Author

Plex review feedback addressed — ready for re-review.

The Plex review (in-session) raised 1 nit + 2 improvements — all about the transaction wrap's interaction with callers and its atomicity claim, not the perf change itself.

Item Class Outcome
3. Whole batch in one transaction → unbounded WAL/undo on a huge monorepo nit Fixed in 50cd2e9: commit per CHUNK (10k rows) and re-BEGIN. The ~17× win holds (playright's largest batch <5k rows still commits once — normal repos unchanged); only >10k-row batches split, bounding the buffer.
1. "Atomic — rolls back instead of a half-written graph" overclaimed improvement Fixed in 50cd2e9: reworded to scope this as a PERF change. Callers issue several insertManys and the incremental DETACH DELETE auto-commits before the re-insert, so delete+reinsert was never one unit (and wasn't pre-PR); a partial build is recovered by the next full rebuild (sweep GraphFreshness / fresh index). True delete+reinsert atomicity (an outer transaction) would nest — deferred.
2. A thrown COMMIT / failed ROLLBACK could wedge the reused connection improvement Fixed in 50cd2e9: ROLLBACK only when a transaction is open (inTxn guard) so a BEGIN failure doesn't spawn a spurious rollback; documented that ROLLBACK closes the txn so the reused long-lived connection isn't left mid-transaction (which would make the next insertMany's BEGIN fail with an unrelated error), while the ORIGINAL error is surfaced.

Re-measured: playright (1328 files) still 5.08s — chunking doesn't regress normal repos.

Tests: insertMany is a Kùzu-opening core path (not unit-testable, ADR-17); covered by the indexing integration scenarios (build, gitignore-robust, incremental, worktree-seed — the incremental + worktree paths use the same insertMany), all green, plus the empirical playright re-index (identical graph: 1328 files / 4759 symbols / 4255 imports).

Checks: pnpm typecheck, pnpm test:unit (554), pnpm build, indexing scenarios — all green.

No GitHub review threads existed to reply to/resolve (the review was in-session).

@sshanzel sshanzel merged commit 333783e into main Jun 17, 2026
1 check passed
@sshanzel sshanzel deleted the perf/index-batch-inserts branch June 17, 2026 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant