Correctly support rotated texts#431
Closed
splitbrain wants to merge 9 commits into
Closed
Conversation
…vance width Word-break detection compares the gap between two adjacent text elements against a threshold. The gap is next.offsetX - prev.offsetX - advance(prev), so it is only as accurate as the reconstructed advance. PositionedTextElement::getAdvanceWidth() reconstructs the cursor advance per spec §9.4.4: per-glyph width and Tc, minus the TJ kerning numbers, plus Tw for single-byte code 32 (gated off for composite fonts), scaled by Th and the text matrix, and measured with the element's own font rather than the next element's. With an accurate advance, within-word gaps collapse near zero, so the gap that signals a word break is a roughly constant fraction of the em across producers: a single WORD_BREAK_THRESHOLD of 0.25 separates word breaks from kerning. The intra-element large-negative-offset rule reuses the same constant.
parseOperand() runs the operand-splitting regex once and returns the
tokens as {chars, offset} pairs, with the TJ adjustment already
converted to a float or null. getText(), getCodePoints() and
getTotalOffset() consume those pairs, so the pattern and its guards -
including the unrecognized-format check on an empty match - live in one
place instead of three.
A smaller glyph whose baseline is shifted off the line, such as the "2" in "CO2", overlaps the line too little to clear the line-overlap threshold and ended up on a line of its own. When such an element partially overlaps the line and sits horizontally within the run already collected, it is enclosed by surrounding text and belongs to that line. A smaller element that merely sits close but is not enclosed, like a subtitle beside a heading, keeps its own line.
A Vector captures a 2D direction in page space (length, normalized, dot product, perpendicular normal). TransformationMatrix::baselineVector() names the baseline (text advance) direction in page space: the matrix's first column (scaleX, shearX). Nothing consumes them yet; the glyph geometry and line-grouping that follow are built on these.
Font::getWidthForChar(s) drop their TransformationMatrix parameter and return the unscaled text-space width (w0·Tfs + Tc) rather than pre-multiplying by scaleX. getAdvanceWidth() takes a page-space direction Vector (default page X) and returns the signed projection of the reconstructed advance onto it -- a dot product against the baseline vector -- so the advance can later be measured along a rotated baseline. The default direction reproduces the previous advance · scaleX exactly, so the extracted text is unchanged. The perpendicular-to-baseline glyph height that rotated text needs lands with the baseline-aware strategy.
The word-break decision moves out of ContentStream into the strategy, because a gap between two runs is meaningful only relative to how that strategy laid them out. LineGroupingStrategy gains requiresSpaceBetween(); ContentStream keeps only the empty-run and already-spaced guards and asks the strategy. A MatrixOffsetSpacing trait holds the axis-aligned heuristic the two upright strategies share (TextOverlapStrategy, StrictLineGrouping): a space belongs when the gap between X offsets, less the previous run's reconstructed advance width, reaches WORD_BREAK_THRESHOLD of the em -- the same comparison ContentStream made inline before, so the extracted text is unchanged.
BaselineClusterStrategy is orientation-agnostic and two-pass. Pass 1 clusters runs into lines by their actual baseline -- baselines pointing nearly the same way whose glyph extents overlap along the baseline normal -- and orders each line along its baseline vector, so multi-run rotated text (a watermark, a vertical label, a diagonal overlay) is reassembled and reads in order at any angle rather than fragmenting to one line per run. Pass 2 segments the lines into orientation-bucketed blocks, orders the lines within a block along the block's own normal, and orders the blocks by their highest point on the page, so a rotated overlay stays contiguous instead of interleaving with the body it crosses. Lines and blocks are modelled as Line and Block value objects built on Vector; requiresSpaceBetween() measures the gap and the advance along the baseline, reducing to the axis-aligned comparison for horizontal text. getHeight() now measures the glyph extent perpendicular to the baseline (hypot(shearY, scaleY)) so rotated text keeps a non-zero height; this is what lets the strategy cluster rotated runs, and it changes grouping only where text is rotated. For horizontal text the baseline normal is the page Y axis and there is a single orientation bucket, so this is byte-identical to TextOverlapStrategy on every non-rotated page; Page::getText switches to it as the default. The only existing sample it changes is text-objects-across-multiple-streams (pages 5 and 7), whose rotated channel labels now read in baseline order with the repeated device labels separated by the reconstructed advance; its fixture is updated to match. TextOverlapStrategy remains as a non-default, axis-aligned alternative.
Regression fixtures exercising BaselineClusterStrategy, where the previous default fragmented rotated text to one line per run: - text-rotated-angles: one word per page rotated around the full circle (every 30 degrees), drawn glyph-by-glyph along its baseline; every angle is reassembled onto a single line. - text-rotated-watermark: a single 90-degree word drawn glyph-by-glyph, reassembled onto one line. - text-rotated-label: a LibreOffice document that emits a rotated word as per-glyph runs in the content stream; reassembled into one line. - text-rotated-overlay: a justified body paragraph crossed by two diagonal overlay paragraphs (35 and 265 degrees); each overlay stays contiguous and reads in order rather than interleaving with the body.
The fallback font size for content streams that show text without ever setting one lived as a bare literal at five call sites, and disagreed: getHeight() assumed 12 while the rest assumed 10. Funnel them all through TextState::getFontSize(), backed by one private DEFAULT_FONT_SIZE constant.
PrinsFrank
requested changes
Jun 27, 2026
PrinsFrank
left a comment
Owner
There was a problem hiding this comment.
I'm fine with stacked diffs, but pointing them both to the main branch results in the second PR to be impossible to review. When branching like Main->A->B you can point the PR for A to main and the PR for B to branch A. In that case, when the PR for branch A gets merged the PR for branch B will also automatically point to the main branch.
I'll focus my review time on the previous PR first as these are already quite some changes to understand. ;)
Contributor
Author
I can't. Because I can only create PRs against any of your branches. As long as A is not merged into a branch at your repo I can not target it in a PR. |
Contributor
Author
|
I will close this until #430 is merged |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR introduces a new default LineGrouping strategy named BaselineClusterStrategy. Instead of grouping text elements by their position on the page only, it groups them by their rotation as well. This will, for example, prevent watermarks to interweave with "normal" text. Or rotated glyph based text to be split into individual lines.
For horizontal text, the results are exactly the same as before.
Note that this changes the LineGrouping interface by moving the space decision into it - it now depends on the used strategy. A few data object classes were introduced to make the code cleaner.
I believe this to be the better overall strategy, but the old strategies are still available. If it's worth keeping them is up for discussion.
This PR builds on top of the changes from #430 (eg. includes the commits over there)