Clemg/diff byte arena #775

Draft
clemg wants to merge 3 commits into pierrecomputer:beta-1.3 from clemg:clemg/diff-byte-arena

clemg (Contributor) commented Jun 3, 2026

Description

This changes patch line storage from a string[] to a byte arena (one contiguous byte array). The memory savings grow with the size of the PR.
For full explanation and details, see: #760

Note that I haven't updated any documentation yet.

Motivation & Context

About half of the time, I was getting OOM crashes on huge PRs like the linux v6..v7 comparison (not too surprising given the size of the patch, but still annoying). I think this way of representing the lines is more efficient and can help either with rendering bigger diffs on good hardware or with rendering normal diffs on older hardware.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Refactoring (non-breaking change)
  • New feature (non-breaking change which adds functionality). You must have
    first discussed with the dev team and they should be aware that this PR is
    being opened
  • Breaking change (fix or feature that would change existing functionality).
    You must have first discussed with the dev team and they should be aware
    that this PR is being opened
  • Documentation update

Checklist

  • I have read the contributing guidelines
  • My code follows the code style of the project (bun run lint)
  • My code is formatted properly (bun run format)
  • I have updated the documentation accordingly (if applicable)
  • I have added tests to cover my changes (if applicable)
  • All new and existing tests pass (bun run diffs:test)

How was AI used in generating this PR

The tests have been fully generated by opus 4.8

Related issues

See: #760

clemg added 3 commits June 3, 2026 16:45
`additionLines`/`deletionLines` change from `string[]` to `DiffLines`: a plain
data object holding a file's lines as one UTF-8 byte arena plus an offset table,
decoded on demand via `lineAt` / `joinLines`. On a huge diff (linux v6..v7,
~22.8M lines across ~77k files) this avoids tens of millions of tiny `String`
objects, so the V8 heap drops ~33% on that compare and the parser is faster: it
no longer encode+decode-detaches every line, it encodes once on seal and decodes
only the visible (virtualized) lines.
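
A minimal sketch of the arena idea described above, not the PR's actual code: the field names (`bytes`, `offsets`) and the `sealLines` helper are assumptions made for illustration, and only `lineAt` / `joinLines` are names taken from the commit message. It assumes every line is well-formed UTF-16.

```typescript
// Hypothetical shape for illustration; field names are assumptions, not the PR's actual type.
interface DiffLines {
  bytes: Uint8Array; // every line, UTF-8 encoded, stored back to back
  offsets: Uint32Array; // offsets[i] is where line i starts; the last entry is the arena end
  length: number; // line count, kept as a plain field so `.length` readers still work
}

const encoder = new TextEncoder();
// ignoreBOM: true keeps a leading U+FEFF instead of silently stripping it on decode.
const decoder = new TextDecoder('utf-8', { ignoreBOM: true });

// Hypothetical "seal" step: copy all lines into one arena in a single pass.
function sealLines(lines: readonly string[]): DiffLines {
  let codeUnits = 0;
  for (const line of lines) codeUnits += line.length;
  // A UTF-16 code unit never needs more than 3 UTF-8 bytes, so this is a safe upper bound.
  const scratch = new Uint8Array(codeUnits * 3);
  const offsets = new Uint32Array(lines.length + 1);
  let pos = 0;
  lines.forEach((line, i) => {
    offsets[i] = pos;
    pos += encoder.encodeInto(line, scratch.subarray(pos)).written;
  });
  offsets[lines.length] = pos;
  // slice() copies just the used prefix so the oversized scratch buffer can be freed.
  return { bytes: scratch.slice(0, pos), offsets, length: lines.length };
}

// Decode a single line on demand (for example, only the lines currently on screen).
function lineAt(lines: DiffLines, i: number): string {
  return decoder.decode(lines.bytes.subarray(lines.offsets[i], lines.offsets[i + 1]));
}

// Decode the whole side at once; stands in for the old `x.join('')`.
function joinLines(lines: DiffLines): string {
  return decoder.decode(lines.bytes);
}
```

Because the object holds only typed arrays and a number, it needs no class or prototype, which is what lets it pass through structured clone unchanged.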

It is plain data on purpose, so it survives structured clone (the highlight
worker), `structuredClone`, and IndexedDB without a revive step (no class, no
prototype to drop). `.length` stays a field, so the many `.length` consumers are
unchanged; only content reads migrate (`x[i]` -> `lineAt(x, i)`,
`x.join('')` -> `joinLines(x)`). Per-file offsets use the smallest int width that
fits the file. A file with a lone surrogate keeps exact strings as a fallback,
and merge-conflict diffs keep plain strings (no encode) so their parse stays at
parity. The parsed model is byte-identical to before (snapshot + content-hash).
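
Two of the details above can be sketched as follows, under stated assumptions: `allocOffsets` and `hasNoLoneSurrogates` are hypothetical helper names, not functions from this PR. The first picks the narrowest unsigned typed array that can hold a file's largest byte offset. The second detects lone surrogates, which matter because `TextEncoder` replaces them with U+FFFD, so such a file could not round-trip through a UTF-8 arena.

```typescript
type OffsetTable = Uint8Array | Uint16Array | Uint32Array;

// Hypothetical helper: choose the smallest offset width that fits `maxOffset`
// (the arena's byte length). Arenas past 4 GiB are out of scope for this sketch.
function allocOffsets(count: number, maxOffset: number): OffsetTable {
  if (maxOffset <= 0xff) return new Uint8Array(count);
  if (maxOffset <= 0xffff) return new Uint16Array(count);
  return new Uint32Array(count);
}

// Hypothetical helper: true when every surrogate in `s` is part of a valid pair.
// Newer runtimes also offer String.prototype.isWellFormed() for the same check.
function hasNoLoneSurrogates(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = s.charCodeAt(i + 1); // NaN at end of string, so the test below fails
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++; // skip the low half of the pair
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}
```

A seal step could call `hasNoLoneSurrogates` on each line and keep the original strings for the whole file as soon as one line fails, which matches the fallback described in the commit message.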
Adds diffLines.test.ts (arena round-trip, multibyte, emoji-keeps-arena, lone-surrogate fallback, BOM, offset-width, plainLines, joinLines, isWellFormed) and a withPlainLines snapshot converter so the existing parsed-model snapshots assert byte-identical line content.
The byte-arena type change makes additionLines/deletionLines a DiffLines, so the editor's FileDiff whole-side accessors (getDeletionFile/getAdditionFile) read them with joinLines(...) instead of .join('').
vercel Bot commented Jun 3, 2026

@clemg is attempting to deploy a commit to the Pierre Computer Company Team on Vercel.

A member of the Team first needs to authorize it.
