Skip to content

Detect word breaks between elements against a reconstructed advance width#430

Merged
PrinsFrank merged 1 commit into
PrinsFrank:mainfrom
cosmocode:sketch-nonadvance
Jul 1, 2026
Merged

Detect word breaks between elements against a reconstructed advance width#430
PrinsFrank merged 1 commit into
PrinsFrank:mainfrom
cosmocode:sketch-nonadvance

Conversation

@splitbrain

@splitbrain splitbrain commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

This improves the space detection as alluded to in #429 using Option A: reconstructing the advance.

Basically it overhauls the gap comparison by correctly reconstructing the cursor advancement including kerning and whitespace hints. This allows for using a general threshold of 0.25em to detect spaces. In addition a simple workaround to keep subscripts in line was added.

IMHO The change in whitespace detection is overall positive. It does introduce some spurious whitespaces around punctuation, but I think that's tolerable.

Sidenote: I think it might make sense to define the goals of this library. Personally I am interested in extracting text content for search indexing. Most important to me are proper word boundaries (no word splits, no words mushed together). Additional whitespace and punctuation are not very important to me.

@PrinsFrank PrinsFrank left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully know yet what to do about #429 , and how to handle the leakage of text extraction logic into the positionedTextElement and ContentStream classes.

Can you at least break this up further into smaller targeted PRs? I think you discovered some missing x-offsets in the spaceWidths that was unaccounted for, I think that could be a separate PR. The same applies to the changes in the TextOverlapStrategy, I don't think those changes are related at all to the changes in space insertions?

I have a suspicion that there are also some refactors that don't actually fix anything. For example, a new iteration over the positionedTextElements was introduced that does nothing other than increase the memory footprint.

Having the smallest changes as possible in here also makes PRs easier to review. ;)

Comment thread src/Document/ContentStream/ContentStream.php Outdated
Comment thread src/Document/ContentStream/PositionedText/PositionedTextElement.php Outdated
Comment thread src/Document/ContentStream/PositionedText/PositionedTextElement.php Outdated
@splitbrain

splitbrain commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Quick note, before I address your individual questions. I tried to keep this as focused as possible. I would recommend reviewing each commit separately. I did not submit these separately because I am already going insane from having to juggle too many branches. Each of these commits touch the same files. Having them all base off main makes it even harder to work with. The main work is in afc69cb - i can not split this any smaller.

Also this is were your code comment policy comes back to bite us both. If this were my library, each step would be heavily commented inline making it much easier to review...

@splitbrain splitbrain force-pushed the sketch-nonadvance branch 2 times, most recently from 465569f to 0093365 Compare June 30, 2026 15:38
@splitbrain

Copy link
Copy Markdown
Contributor Author

@PrinsFrank I talked to another SoftwareEngineer who shares your views on comment styles. She said that her team is using review comments to annotate pull requests for the reviewer so that reading the diff becomes easier without cluttering the code itself.

Would that help you? If so I can offer two ways to do this. I can either attach a separate markdown file here that basically walks you through the code with files and line numbers. Or I can add my own review here via github's review comment interface.

Let me know if either of the two would be helpful.

PS. I rebased the branch on your recent main commits.

@PrinsFrank

Copy link
Copy Markdown
Owner

@splitbrain I do understand what you're doing in each file, and I really appreciate all the changes you've done in this PR!

The reason I prefer small changes is that I still have a full mental model of the library, and with each PR my understanding gets updated so that I still have an overview. The resulting changes in the test fixtures also make it quite clear what kind of effect the changes have on the output. Having many changes in one PR makes getting a full overview a bit more difficult, and all the compounding changes in the test fixtures cannot be isolated to get an understanding of the changes.

I'm sorry reviewing this takes a bit more time. I know trying to get a mental model of a codebase is not common practice anymore in 2026 with the amount of AI generated code, but this is also my hobby project so I like doing that! ;)

@PrinsFrank

Copy link
Copy Markdown
Owner

@splitbrain I've cherry picked the other changes besides the space between text element changes in #440 and #441 (And added you as co-author in the commits), so the changes left in this PR are only those for the proper advance width. Would you mind rebasing one more time? I think this PR is then ready to be merged as well!

…idth

The gap that signals a word break between two adjacent text elements is
next.offsetX - prev.offsetX - advance(prev), so it is only as accurate as the
reconstructed advance. Tj/TJ do not move the text matrix in this parser, so the
advance cannot be read off the matrix and must be rebuilt.

PositionedTextElement::getAdvanceWidth() reconstructs the cursor advance per
spec §9.4.4: per-glyph width and Tc, plus Tw for single-byte code 32 (gated off
for composite fonts), minus the TJ kerning numbers, scaled by Th and the text
matrix, and measured with the element's own font rather than the next element's.

With an accurate advance, within-word gaps collapse near zero, so the same
WORD_BREAK_THRESHOLD_EM of 0.25 that already gates the within-element break now
also separates between-element word breaks from kerning, replacing the slack
0.40 threshold and next-element-font width that hid the advance error.
@splitbrain splitbrain force-pushed the sketch-nonadvance branch from 0093365 to 855471a Compare July 1, 2026 07:29
@splitbrain

splitbrain commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

rebased again. I left out the subscript unit tests, assuming you intentionally left them out when merging the subscript feature into main.

@PrinsFrank PrinsFrank changed the title Better Space Detection Detect word breaks between elements against a reconstructed advance width Jul 1, 2026

@PrinsFrank PrinsFrank left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!

@PrinsFrank PrinsFrank merged commit f4119fd into PrinsFrank:main Jul 1, 2026
21 checks passed
@splitbrain splitbrain deleted the sketch-nonadvance branch July 1, 2026 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants