Correctly support rotated texts#444
Open
splitbrain wants to merge 6 commits into
Open
Conversation
A Vector captures a 2D direction in page space (length, normalized, dot product, perpendicular normal). TransformationMatrix::baselineVector() names the baseline (text advance) direction in page space: the matrix's first column (scaleX, shearX). Nothing consumes them yet; the glyph geometry and line-grouping that follow are built on these.
Font::getWidthForChar(s) drop their TransformationMatrix parameter and return the unscaled text-space width (w0·Tfs + Tc) rather than pre-multiplying by scaleX. getAdvanceWidth() takes a page-space direction Vector (default page X) and returns the signed projection of the reconstructed advance onto it -- a dot product against the baseline vector -- so the advance can later be measured along a rotated baseline. The default direction reproduces the previous advance · scaleX exactly, so the extracted text is unchanged. The perpendicular-to-baseline glyph height that rotated text needs lands with the baseline-aware strategy.
The word-break decision moves out of ContentStream into the strategy, because a gap between two runs is meaningful only relative to how that strategy laid them out. LineGroupingStrategy gains requiresSpaceBetween(); ContentStream keeps only the empty-run and already-spaced guards and asks the strategy. A MatrixOffsetSpacing trait holds the axis-aligned heuristic the two upright strategies share (TextOverlapStrategy, StrictLineGrouping): a space belongs when the gap between X offsets, less the previous run's reconstructed advance width, reaches WORD_BREAK_THRESHOLD of the em -- the same comparison ContentStream made inline before, so the extracted text is unchanged.
BaselineClusterStrategy is orientation-agnostic and two-pass. Pass 1 clusters runs into lines by their actual baseline -- baselines pointing nearly the same way whose glyph extents overlap along the baseline normal -- and orders each line along its baseline vector, so multi-run rotated text (a watermark, a vertical label, a diagonal overlay) is reassembled and reads in order at any angle rather than fragmenting to one line per run. Pass 2 segments the lines into orientation-bucketed blocks, orders the lines within a block along the block's own normal, and orders the blocks by their highest point on the page, so a rotated overlay stays contiguous instead of interleaving with the body it crosses. Lines and blocks are modelled as Line and Block value objects built on Vector; requiresSpaceBetween() measures the gap and the advance along the baseline, reducing to the axis-aligned comparison for horizontal text. getHeight() now measures the glyph extent perpendicular to the baseline (hypot(shearY, scaleY)) so rotated text keeps a non-zero height; this is what lets the strategy cluster rotated runs, and it changes grouping only where text is rotated. For horizontal text the baseline normal is the page Y axis and there is a single orientation bucket, so this is byte-identical to TextOverlapStrategy on every non-rotated page; Page::getText switches to it as the default. The only existing sample it changes is text-objects-across-multiple-streams (pages 5 and 7), whose rotated channel labels now read in baseline order with the repeated device labels separated by the reconstructed advance; its fixture is updated to match. TextOverlapStrategy remains as a non-default, axis-aligned alternative.
Regression fixtures exercising BaselineClusterStrategy, where the previous default fragmented rotated text to one line per run: - text-rotated-angles: one word per page rotated around the full circle (every 30 degrees), drawn glyph-by-glyph along its baseline; every angle is reassembled onto a single line. - text-rotated-watermark: a single 90-degree word drawn glyph-by-glyph, reassembled onto one line. - text-rotated-label: a LibreOffice document that emits a rotated word as per-glyph runs in the content stream; reassembled into one line. - text-rotated-overlay: a justified body paragraph crossed by two diagonal overlay paragraphs (35 and 265 degrees); each overlay stays contiguous and reads in order rather than interleaving with the body.
4ffa5b8 to
426c28c
Compare
k00ni
reviewed
Jul 2, 2026
The fallback font size for content streams that show text without ever setting one lived as a bare literal at five call sites, and disagreed: getHeight() assumed 12 while the rest assumed 10. Funnel them all through TextState::getFontSize(), backed by one private DEFAULT_FONT_SIZE constant.
426c28c to
6a975fa
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR introduces a new default LineGrouping strategy named BaselineClusterStrategy. Instead of grouping text elements by their position on the page only, it groups them by their rotation as well. This will, for example, prevent watermarks to interweave with "normal" text. Or rotated glyph based text to be split into individual lines.
For horizontal text, the results are exactly the same as before.
Note that this changes the LineGrouping interface by moving the space decision into it - it now depends on the used strategy. A few data object classes were introduced to make the code cleaner.
I believe this to be the better overall strategy, but the old strategies are still available. If it's worth keeping them is up for discussion.
This was previously submitted as #431 but is now rebased on current main. I assume you will want to dissect it into multiple PRs again. I will keep rebasing this branch whenever parts of it land in main.
I include an (AI generated) WALKTHROUGH.md here (not committed). It might help reviewing the code. Feel free to ignore if that's not helpful to you.