Detect word breaks between elements against a reconstructed advance width by splitbrain · Pull Request #430 · PrinsFrank/pdfparser

splitbrain · 2026-06-26T18:51:19Z

This improves the space detection as alluded to in #429 using Option A: reconstructing the advance.

Basically it overhauls the gap comparison by correctly reconstructing the cursor advancement including kerning and whitespace hints. This allows for using a general threshold of 0.25em to detect spaces. In addition a simple workaround to keep subscripts in line was added.

IMHO The change in whitespace detection is overall positive. It does introduce some spurious whitespaces around punctuation, but I think that's tolerable.

Sidenote: I think it might make sense to define the goals of this library. Personally I am interested in extracting text content for search indexing. Most important to me are proper word boundaries (no word splits, no words mushed together). Additional whitespace and punctuation are not very important to me.

PrinsFrank

I don't fully know yet what to do about #429 , and how to handle the leakage of text extraction logic into the positionedTextElement and ContentStream classes.

Can you at least break this up further into smaller targeted PRs? I think you discovered some missing x-offsets in the spaceWidths that was unaccounted for, I think that could be a separate PR. The same applies to the changes in the TextOverlapStrategy, I don't think those changes are related at all to the changes in space insertions?

I have a suspicion that there are also some refactors that don't actually fix anything. For example, a new iteration over the positionedTextElements was introduced that does nothing other than increase the memory footprint.

Having the smallest changes as possible in here also makes PRs easier to review. ;)

splitbrain · 2026-06-27T17:08:28Z

Quick note, before I address your individual questions. I tried to keep this as focused as possible. I would recommend reviewing each commit separately. I did not submit these separately because I am already going insane from having to juggle too many branches. Each of these commits touch the same files. Having them all base off main makes it even harder to work with. The main work is in afc69cb - i can not split this any smaller.

Also this is were your code comment policy comes back to bite us both. If this were my library, each step would be heavily commented inline making it much easier to review...

splitbrain · 2026-06-30T16:07:48Z

@PrinsFrank I talked to another SoftwareEngineer who shares your views on comment styles. She said that her team is using review comments to annotate pull requests for the reviewer so that reading the diff becomes easier without cluttering the code itself.

Would that help you? If so I can offer two ways to do this. I can either attach a separate markdown file here that basically walks you through the code with files and line numbers. Or I can add my own review here via github's review comment interface.

Let me know if either of the two would be helpful.

PS. I rebased the branch on your recent main commits.

PrinsFrank · 2026-06-30T21:08:40Z

@splitbrain I do understand what you're doing in each file, and I really appreciate all the changes you've done in this PR!

The reason I prefer small changes is that I still have a full mental model of the library, and with each PR my understanding gets updated so that I still have an overview. The resulting changes in the test fixtures also make it quite clear what kind of effect the changes have on the output. Having many changes in one PR makes getting a full overview a bit more difficult, and all the compounding changes in the test fixtures cannot be isolated to get an understanding of the changes.

I'm sorry reviewing this takes a bit more time. I know trying to get a mental model of a codebase is not common practice anymore in 2026 with the amount of AI generated code, but this is also my hobby project so I like doing that! ;)

PrinsFrank · 2026-06-30T22:09:44Z

@splitbrain I've cherry picked the other changes besides the space between text element changes in #440 and #441 (And added you as co-author in the commits), so the changes left in this PR are only those for the proper advance width. Would you mind rebasing one more time? I think this PR is then ready to be merged as well!

…idth The gap that signals a word break between two adjacent text elements is next.offsetX - prev.offsetX - advance(prev), so it is only as accurate as the reconstructed advance. Tj/TJ do not move the text matrix in this parser, so the advance cannot be read off the matrix and must be rebuilt. PositionedTextElement::getAdvanceWidth() reconstructs the cursor advance per spec §9.4.4: per-glyph width and Tc, plus Tw for single-byte code 32 (gated off for composite fonts), minus the TJ kerning numbers, scaled by Th and the text matrix, and measured with the element's own font rather than the next element's. With an accurate advance, within-word gaps collapse near zero, so the same WORD_BREAK_THRESHOLD_EM of 0.25 that already gates the within-element break now also separates between-element word breaks from kerning, replacing the slack 0.40 threshold and next-element-font width that hid the advance error.

splitbrain · 2026-07-01T07:32:18Z

rebased again. I left out the subscript unit tests, assuming you intentionally left them out when merging the subscript feature into main.

PrinsFrank

Thank you very much!

splitbrain force-pushed the sketch-nonadvance branch from c9d0af9 to bd274dc Compare June 27, 2026 16:22

splitbrain mentioned this pull request Jun 27, 2026

Correctly support rotated texts #431

Closed

PrinsFrank requested changes Jun 27, 2026

View reviewed changes

splitbrain force-pushed the sketch-nonadvance branch 2 times, most recently from 465569f to 0093365 Compare June 30, 2026 15:38

This was referenced Jun 30, 2026

Detect word breaks between text segments using consistent 0.25EM threshold #440

Merged

Detect enclosed subscript when grouping text elements that are on the same line #441

Merged

splitbrain force-pushed the sketch-nonadvance branch from 0093365 to 855471a Compare July 1, 2026 07:29

PrinsFrank changed the title ~~Better Space Detection~~ Detect word breaks between elements against a reconstructed advance width Jul 1, 2026

PrinsFrank approved these changes Jul 1, 2026

View reviewed changes

PrinsFrank merged commit f4119fd into PrinsFrank:main Jul 1, 2026
21 checks passed

splitbrain deleted the sketch-nonadvance branch July 1, 2026 18:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Detect word breaks between elements against a reconstructed advance width#430

Detect word breaks between elements against a reconstructed advance width#430
PrinsFrank merged 1 commit into
PrinsFrank:mainfrom
cosmocode:sketch-nonadvance

splitbrain commented Jun 26, 2026 •

edited

Loading

Uh oh!

PrinsFrank left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

splitbrain commented Jun 27, 2026 •

edited

Loading

Uh oh!

splitbrain commented Jun 30, 2026

Uh oh!

PrinsFrank commented Jun 30, 2026

Uh oh!

PrinsFrank commented Jun 30, 2026

Uh oh!

splitbrain commented Jul 1, 2026 •

edited

Loading

Uh oh!

PrinsFrank left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

splitbrain commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PrinsFrank left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

splitbrain commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

splitbrain commented Jun 30, 2026

Uh oh!

PrinsFrank commented Jun 30, 2026

Uh oh!

PrinsFrank commented Jun 30, 2026

Uh oh!

splitbrain commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PrinsFrank left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

splitbrain commented Jun 26, 2026 •

edited

Loading

splitbrain commented Jun 27, 2026 •

edited

Loading

splitbrain commented Jul 1, 2026 •

edited

Loading