Skip to content

fix(code-graph): respect .gitignore + collision-safe symbol ids (indexing never crashes on a repo)#30

Merged
sshanzel merged 2 commits into
mainfrom
fix/index-respect-gitignore
Jun 17, 2026
Merged

fix(code-graph): respect .gitignore + collision-safe symbol ids (indexing never crashes on a repo)#30
sshanzel merged 2 commits into
mainfrom
fix/index-respect-gitignore

Conversation

@sshanzel

Copy link
Copy Markdown
Owner

What changed

Indexing a repo that contains build output (e.g. a Playwright playwright-report/ full of minified bundles) crashed the entire index:

Found duplicated primary key value
apps/console/playwright-report/trace/assets/codeMirrorModule-Ds_H_9Yq.js#h#5

Two root causes, both fixed:

  1. We indexed gitignored files. discoverFiles now uses git ls-files (TRACKED files only) in a git repo, so .gitignored build output, vendored bundles, and generated reports are never indexed — the principled fix vs. a hardcoded skip-list, and consistent with co-change already being git-based. A non-git directory falls back to the SKIP_DIRS filesystem walk. (The incremental path already respected ignores — git diff never reports ignored files.)
  2. The Symbol PK file#name#startLine wasn't collision-safe. A minified file packs many declarations on one line, so two same-named symbols at the same line duplicated the PK and Kùzu aborted the whole index. uniqueSymbolId appends a stable #<n> suffix on collision (deterministic extraction order ⇒ stable across re-indexes), so one pathological file can never take down the graph — belt-and-suspenders even for a committed minified/vendored file.

Verification

Re-indexing the repo that crashed (playright) now succeeds: 1328 files / 4759 symbols / 4255 imports / 100 co-change pairs (was: crash → an edgeless partial graph). The viz code-graph landing renders it with structure (80 nodes / 214 edges) instead of edgeless.

How to test

  • pnpm test:integration — new gitignore-robust scenario: a tracked file with two same-named declarations on one line survives the build (both symbols, disambiguated ids), and a .gitignored file is not indexed.
  • pnpm test:unit (552) + pnpm typecheck + pnpm build green.
  • Manual: plex index <repo-with-a-gitignored-report-dir> → succeeds; the report dir is absent from the graph.

Notes

This was the launch-checklist "generated-file index crash" item — a real "user indexes a normal repo with a dist//report dir → broken graph" bug. Independent of the viz PRs.

🤖 Generated with Claude Code

sshanzel and others added 2 commits June 18, 2026 01:28
…xing never crashes on a repo)

Indexing a repo with build output (e.g. a `playwright-report/` of minified bundles) crashed the
WHOLE index: `Found duplicated primary key value …codeMirrorModule-Ds_H_9Yq.js#h#5`. Two root
causes, both fixed:

1. We indexed gitignored files. `discoverFiles` now uses `git ls-files` (TRACKED files only) in a
   git repo, so .gitignore'd build output / vendored bundles / reports are never indexed — the
   principled fix vs. a hardcoded skip-list, and consistent with co-change being git-based. A non-git
   dir still falls back to the SKIP_DIRS filesystem walk. (Incremental already respected ignores —
   `git diff` never reports ignored files.)
2. Symbol PK `file#name#startLine` wasn't collision-safe — a minified file packs many declarations on
   one line, so two same-named symbols at the same line duplicated the PK and Kùzu aborted the entire
   index. `uniqueSymbolId` appends a stable `#<n>` suffix on collision (deterministic extraction order
   ⇒ stable across re-indexes), so one pathological file can never take down the whole graph.

Verified: re-indexing playright now SUCCEEDS — 1328 files / 4759 symbols / 4255 imports / 100
co-change pairs (was: crash → an edgeless partial graph). The viz code-graph landing now renders it
with structure (80 nodes / 214 edges).

Tests: new `gitignore-robust` integration scenario (a tracked dup-symbol file survives; a gitignored
file is not indexed). 552 unit green; typecheck + build green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ly (PR #30 review)

Address the Plex review of PR #30:

- IMPROVEMENT (item 1): `git ls-files` lists TRACKED files only, so a brand-new source file created
  mid-feature but not yet `git add`ed got no File/Symbol nodes — an empty blast radius for the very
  files a review is about. Switch to `git ls-files --cached --others --exclude-standard` (rename
  listTrackedFiles → listWorktreeFiles): the whole working tree MINUS ignored paths. Build output
  (`playwright-report/`, `dist/`, the self-ignored `.plex`) is still excluded; new uncommitted source
  is now included. The correct implementation of "respect .gitignore".
- NIT (item 2): documented the intentional git-vs-fallback divergence on committed dot-directory
  source (`.storybook/*.ts`) — the git path indexes it as real code, the non-git walk skips it; git is
  authoritative, the walk is best-effort. No behaviour change.
- NOTE (item 3): no change — `uniqueSymbolId`'s `seen` loop guarantees PK uniqueness regardless of
  whether the `#<n>` suffix could theoretically collide (the reviewer agreed it's not a live bug).

Tests: the `gitignore-robust` scenario now also asserts an UNTRACKED-not-ignored file IS indexed
(`res.files === 2`, the report/ dir excluded). 552 unit green; typecheck + build green; the scenario
passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sshanzel

Copy link
Copy Markdown
Owner Author

Plex review feedback addressed — ready for re-review.

The Plex review (in-session) raised 1 improvement + 1 nit + 1 note. The core fix (collision-safe symbol ids; gitignore-respect) was verified sound — the seen-set scope, -z parsing, the full-path indexable filter, and the incremental delete-then-re-extract ordering all checked out.

Item Class Outcome
1. Untracked-but-not-ignored files dropped from the full index improvement Fixed in 9233970: git ls-files (tracked-only) → git ls-files --cached --others --exclude-standard (working tree minus ignored). Build output stays excluded; a brand-new uncommitted source file is now indexed, so a mid-feature review's blast radius isn't empty for its own files. listTrackedFileslistWorktreeFiles.
2. Tracked dot-dir source indexed in git path but not the non-git walk nit Intentional, now documented in 9233970: the git path indexes committed .storybook/*.ts as real code; the filesystem walk (non-git fallback) skips dot-dirs. Git is authoritative; the walk is best-effort. No behaviour change.
3. uniqueSymbolId #<n> suffix not provably disjoint from real ids note No change — the seen-set loop guarantees PK uniqueness within a build regardless of the suffix; not a live bug (reviewer concurred).

Tests: the gitignore-robust integration scenario now also asserts an untracked-not-ignored file IS indexed (res.files === 2, the report/ dir excluded), alongside the existing dup-symbol-no-crash + gitignored-file-skipped checks.

Checks: pnpm typecheck, pnpm test:unit (552), pnpm build, and the gitignore-robust scenario — all green.

No GitHub review threads existed to reply to/resolve (the review was in-session).

@sshanzel sshanzel merged commit 6920de9 into main Jun 17, 2026
1 check passed
@sshanzel sshanzel deleted the fix/index-respect-gitignore branch June 17, 2026 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant