Clemg/diff byte arena #775
Draft
clemg wants to merge 3 commits into
`additionLines`/`deletionLines` change from `string[]` to `DiffLines`: a plain
data object holding a file's lines as one UTF-8 byte arena plus an offset table,
decoded on demand via `lineAt` / `joinLines`. On a huge diff (linux v6..v7,
~22.8M lines across ~77k files) this avoids tens of millions of tiny `String`
objects, so the V8 heap drops ~33% on that compare and the parser is faster: it
no longer encode+decode-detaches every line, it encodes once on seal and decodes
only the visible (virtualized) lines.
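As a rough illustration of the arena-plus-offset-table layout described above, here is a minimal sketch. It is not the PR's implementation: the field names `bytes` and `offsets`, the type name `DiffLinesSketch`, and the helper `sealLines` are assumptions made for this example; only `length`, `lineAt`, and `joinLines` are named in the description.

```typescript
// Hypothetical shape (field names are assumptions, not the PR's definition).
interface DiffLinesSketch {
  bytes: Uint8Array; // every line concatenated as UTF-8
  offsets: Uint32Array; // offsets[i] = start of line i; offsets[length] = end of arena
  length: number; // line count, kept as a plain field
}

const encoder = new TextEncoder();
// ignoreBOM: true keeps a leading U+FEFF in the decoded text instead of stripping it.
const decoder = new TextDecoder("utf-8", { ignoreBOM: true });

// Encode once ("seal"): build the arena and the offset table from parsed lines.
// Note: TextEncoder replaces lone surrogates with U+FFFD, so this path is only
// lossless for well-formed strings.
function sealLines(lines: string[]): DiffLinesSketch {
  const chunks = lines.map((line) => encoder.encode(line));
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const bytes = new Uint8Array(total);
  const offsets = new Uint32Array(lines.length + 1);
  let pos = 0;
  chunks.forEach((chunk, i) => {
    offsets[i] = pos;
    bytes.set(chunk, pos);
    pos += chunk.length;
  });
  offsets[lines.length] = pos;
  return { bytes, offsets, length: lines.length };
}

// Decode a single line on demand (replaces `x[i]`).
function lineAt(x: DiffLinesSketch, i: number): string {
  return decoder.decode(x.bytes.subarray(x.offsets[i], x.offsets[i + 1]));
}

// Decode the whole side in one pass (replaces `x.join('')`).
function joinLines(x: DiffLinesSketch): string {
  return decoder.decode(x.bytes);
}
```

In this sketch a file costs two typed arrays regardless of line count, and a `String` is only created for a line when something actually reads it.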
It is plain data on purpose, so it survives structured clone (the highlight
worker), `structuredClone`, and IndexedDB without a revive step (no class, no
prototype to drop). `.length` stays a field, so the many `.length` consumers are
unchanged; only content reads migrate (`x[i]` -> `lineAt(x, i)`,
`x.join('')` -> `joinLines(x)`). Per-file offsets use the smallest int width that
fits the file. A file with a lone surrogate keeps exact strings as a fallback,
and merge-conflict diffs keep plain strings (no encode) so their parse stays at
parity. The parsed model is byte-identical to before (snapshot + content-hash).
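Two of the details above, the per-file offset width and the lone-surrogate fallback, can be sketched as follows. This is an illustration under assumptions: the helper names `allocOffsets` and `hasLoneSurrogate` are hypothetical, and the exact set of widths the PR chooses from is not stated in the description. The surrogate check is written by hand here; where available, the built-in `String.prototype.isWellFormed()` answers the same question.

```typescript
type OffsetTable = Uint8Array | Uint16Array | Uint32Array;

// Pick the narrowest unsigned typed array whose range covers the largest
// offset in the file (assumed widths: 8, 16, 32 bits).
function allocOffsets(count: number, maxOffset: number): OffsetTable {
  if (maxOffset <= 0xff) return new Uint8Array(count);
  if (maxOffset <= 0xffff) return new Uint16Array(count);
  return new Uint32Array(count);
}

// A string with an unpaired surrogate cannot round-trip through UTF-8,
// so a file containing one would keep plain strings instead of the arena.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = s.charCodeAt(i + 1); // NaN past the end, which fails the range test
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the valid pair
        continue;
      }
      return true;
    }
    // Low surrogate with no preceding high surrogate.
    if (c >= 0xdc00 && c <= 0xdfff) return true;
  }
  return false;
}
```

A valid emoji such as `🙂` is a complete surrogate pair, so it passes the check and stays in the arena; only an unpaired half triggers the fallback.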
Adds diffLines.test.ts (arena round-trip, multibyte, emoji-keeps-arena, lone-surrogate fallback, BOM, offset-width, plainLines, joinLines, isWellFormed) and a withPlainLines snapshot converter so the existing parsed-model snapshots assert byte-identical line content.
The byte-arena type change makes additionLines/deletionLines a DiffLines, so the editor's FileDiff whole-side accessors (getDeletionFile/getAdditionFile) read them with joinLines(...) instead of .join('').
Description
This changes how we represent patch storage from a `string[]` to a byte arena (a contiguous byte array), whose memory savings grow with the size of the PR. For a full explanation and details, see: #760
Note that I haven't updated any documentation yet.
Motivation & Context
About half of the time, I was getting OOM crashes on huge PRs like the linux v6..v7 comparison, which is not too surprising given the size of the patch, but still annoying. I think this way of representing the lines is more efficient and can help either with rendering bigger diffs on good hardware or with rendering normal diffs on older hardware.
Type of changes
You must have first discussed with the dev team and they should be aware
that this PR is being opened
Checklist
contributing guidelines
(`bun run lint`)
(`bun run format`)
(`bun run diffs:test`)
How was AI used in generating this PR
The tests have been fully generated by opus 4.8
Related issues
See: #760