Detect word breaks between elements against a reconstructed advance width#430
Conversation
c9d0af9 to
bd274dc
Compare
PrinsFrank
left a comment
There was a problem hiding this comment.
I don't fully know yet what to do about #429 , and how to handle the leakage of text extraction logic into the positionedTextElement and ContentStream classes.
Can you at least break this up further into smaller targeted PRs? I think you discovered some missing x-offsets in the spaceWidths that was unaccounted for, I think that could be a separate PR. The same applies to the changes in the TextOverlapStrategy, I don't think those changes are related at all to the changes in space insertions?
I have a suspicion that there are also some refactors that don't actually fix anything. For example, a new iteration over the positionedTextElements was introduced that does nothing other than increase the memory footprint.
Having the smallest changes as possible in here also makes PRs easier to review. ;)
|
Quick note, before I address your individual questions. I tried to keep this as focused as possible. I would recommend reviewing each commit separately. I did not submit these separately because I am already going insane from having to juggle too many branches. Each of these commits touch the same files. Having them all base off main makes it even harder to work with. The main work is in afc69cb - i can not split this any smaller. Also this is were your code comment policy comes back to bite us both. If this were my library, each step would be heavily commented inline making it much easier to review... |
465569f to
0093365
Compare
|
@PrinsFrank I talked to another SoftwareEngineer who shares your views on comment styles. She said that her team is using review comments to annotate pull requests for the reviewer so that reading the diff becomes easier without cluttering the code itself. Would that help you? If so I can offer two ways to do this. I can either attach a separate markdown file here that basically walks you through the code with files and line numbers. Or I can add my own review here via github's review comment interface. Let me know if either of the two would be helpful. PS. I rebased the branch on your recent main commits. |
|
@splitbrain I do understand what you're doing in each file, and I really appreciate all the changes you've done in this PR! The reason I prefer small changes is that I still have a full mental model of the library, and with each PR my understanding gets updated so that I still have an overview. The resulting changes in the test fixtures also make it quite clear what kind of effect the changes have on the output. Having many changes in one PR makes getting a full overview a bit more difficult, and all the compounding changes in the test fixtures cannot be isolated to get an understanding of the changes. I'm sorry reviewing this takes a bit more time. I know trying to get a mental model of a codebase is not common practice anymore in 2026 with the amount of AI generated code, but this is also my hobby project so I like doing that! ;) |
|
@splitbrain I've cherry picked the other changes besides the space between text element changes in #440 and #441 (And added you as co-author in the commits), so the changes left in this PR are only those for the proper advance width. Would you mind rebasing one more time? I think this PR is then ready to be merged as well! |
…idth The gap that signals a word break between two adjacent text elements is next.offsetX - prev.offsetX - advance(prev), so it is only as accurate as the reconstructed advance. Tj/TJ do not move the text matrix in this parser, so the advance cannot be read off the matrix and must be rebuilt. PositionedTextElement::getAdvanceWidth() reconstructs the cursor advance per spec §9.4.4: per-glyph width and Tc, plus Tw for single-byte code 32 (gated off for composite fonts), minus the TJ kerning numbers, scaled by Th and the text matrix, and measured with the element's own font rather than the next element's. With an accurate advance, within-word gaps collapse near zero, so the same WORD_BREAK_THRESHOLD_EM of 0.25 that already gates the within-element break now also separates between-element word breaks from kerning, replacing the slack 0.40 threshold and next-element-font width that hid the advance error.
0093365 to
855471a
Compare
|
rebased again. I left out the subscript unit tests, assuming you intentionally left them out when merging the subscript feature into main. |
This improves the space detection as alluded to in #429 using Option A: reconstructing the advance.
Basically it overhauls the gap comparison by correctly reconstructing the cursor advancement including kerning and whitespace hints. This allows for using a general threshold of 0.25em to detect spaces. In addition a simple workaround to keep subscripts in line was added.
IMHO The change in whitespace detection is overall positive. It does introduce some spurious whitespaces around punctuation, but I think that's tolerable.