Skip to content

Correctly support rotated texts#431

Closed
splitbrain wants to merge 9 commits into
PrinsFrank:mainfrom
cosmocode:rotation-by-cluster-split
Closed

Correctly support rotated texts#431
splitbrain wants to merge 9 commits into
PrinsFrank:mainfrom
cosmocode:rotation-by-cluster-split

Conversation

@splitbrain

Copy link
Copy Markdown
Contributor

This PR introduces a new default LineGrouping strategy named BaselineClusterStrategy. Instead of grouping text elements by their position on the page only, it groups them by their rotation as well. This will, for example, prevent watermarks to interweave with "normal" text. Or rotated glyph based text to be split into individual lines.

For horizontal text, the results are exactly the same as before.

Note that this changes the LineGrouping interface by moving the space decision into it - it now depends on the used strategy. A few data object classes were introduced to make the code cleaner.

I believe this to be the better overall strategy, but the old strategies are still available. If it's worth keeping them is up for discussion.

This PR builds on top of the changes from #430 (eg. includes the commits over there)

splitbrain and others added 9 commits June 27, 2026 18:20
…vance width

Word-break detection compares the gap between two adjacent text elements
against a threshold. The gap is next.offsetX - prev.offsetX - advance(prev),
so it is only as accurate as the reconstructed advance.

PositionedTextElement::getAdvanceWidth() reconstructs the cursor advance per
spec §9.4.4: per-glyph width and Tc, minus the TJ kerning numbers, plus Tw for
single-byte code 32 (gated off for composite fonts), scaled by Th and the text
matrix, and measured with the element's own font rather than the next
element's.

With an accurate advance, within-word gaps collapse near zero, so the gap that
signals a word break is a roughly constant fraction of the em across producers:
a single WORD_BREAK_THRESHOLD of 0.25 separates word breaks from kerning. The
intra-element large-negative-offset rule reuses the same constant.
parseOperand() runs the operand-splitting regex once and returns the
tokens as {chars, offset} pairs, with the TJ adjustment already
converted to a float or null. getText(), getCodePoints() and
getTotalOffset() consume those pairs, so the pattern and its guards -
including the unrecognized-format check on an empty match - live in one
place instead of three.
A smaller glyph whose baseline is shifted off the line, such as the "2" in
"CO2", overlaps the line too little to clear the line-overlap threshold and
ended up on a line of its own. When such an element partially overlaps the
line and sits horizontally within the run already collected, it is enclosed
by surrounding text and belongs to that line. A smaller element that merely
sits close but is not enclosed, like a subtitle beside a heading, keeps its
own line.
A Vector captures a 2D direction in page space (length, normalized, dot
product, perpendicular normal). TransformationMatrix::baselineVector()
names the baseline (text advance) direction in page space: the matrix's
first column (scaleX, shearX). Nothing consumes them yet; the glyph
geometry and line-grouping that follow are built on these.
Font::getWidthForChar(s) drop their TransformationMatrix parameter and
return the unscaled text-space width (w0·Tfs + Tc) rather than
pre-multiplying by scaleX. getAdvanceWidth() takes a page-space direction
Vector (default page X) and returns the signed projection of the
reconstructed advance onto it -- a dot product against the baseline
vector -- so the advance can later be measured along a rotated baseline.

The default direction reproduces the previous advance · scaleX exactly,
so the extracted text is unchanged. The perpendicular-to-baseline glyph
height that rotated text needs lands with the baseline-aware strategy.
The word-break decision moves out of ContentStream into the strategy,
because a gap between two runs is meaningful only relative to how that
strategy laid them out. LineGroupingStrategy gains requiresSpaceBetween();
ContentStream keeps only the empty-run and already-spaced guards and asks
the strategy.

A MatrixOffsetSpacing trait holds the axis-aligned heuristic the two
upright strategies share (TextOverlapStrategy, StrictLineGrouping): a
space belongs when the gap between X offsets, less the previous run's
reconstructed advance width, reaches WORD_BREAK_THRESHOLD of the em -- the
same comparison ContentStream made inline before, so the extracted text is
unchanged.
BaselineClusterStrategy is orientation-agnostic and two-pass. Pass 1
clusters runs into lines by their actual baseline -- baselines pointing
nearly the same way whose glyph extents overlap along the baseline normal
-- and orders each line along its baseline vector, so multi-run rotated
text (a watermark, a vertical label, a diagonal overlay) is reassembled
and reads in order at any angle rather than fragmenting to one line per
run. Pass 2 segments the lines into orientation-bucketed blocks, orders
the lines within a block along the block's own normal, and orders the
blocks by their highest point on the page, so a rotated overlay stays
contiguous instead of interleaving with the body it crosses. Lines and
blocks are modelled as Line and Block value objects built on Vector;
requiresSpaceBetween() measures the gap and the advance along the
baseline, reducing to the axis-aligned comparison for horizontal text.

getHeight() now measures the glyph extent perpendicular to the baseline
(hypot(shearY, scaleY)) so rotated text keeps a non-zero height; this is
what lets the strategy cluster rotated runs, and it changes grouping only
where text is rotated.

For horizontal text the baseline normal is the page Y axis and there is a
single orientation bucket, so this is byte-identical to TextOverlapStrategy
on every non-rotated page; Page::getText switches to it as the default.
The only existing sample it changes is text-objects-across-multiple-streams
(pages 5 and 7), whose rotated channel labels now read in baseline order
with the repeated device labels separated by the reconstructed advance;
its fixture is updated to match. TextOverlapStrategy remains as a
non-default, axis-aligned alternative.
Regression fixtures exercising BaselineClusterStrategy, where the previous
default fragmented rotated text to one line per run:

- text-rotated-angles: one word per page rotated around the full circle
  (every 30 degrees), drawn glyph-by-glyph along its baseline; every angle
  is reassembled onto a single line.
- text-rotated-watermark: a single 90-degree word drawn glyph-by-glyph,
  reassembled onto one line.
- text-rotated-label: a LibreOffice document that emits a rotated word as
  per-glyph runs in the content stream; reassembled into one line.
- text-rotated-overlay: a justified body paragraph crossed by two diagonal
  overlay paragraphs (35 and 265 degrees); each overlay stays contiguous
  and reads in order rather than interleaving with the body.
The fallback font size for content streams that show text without ever
setting one lived as a bare literal at five call sites, and disagreed:
getHeight() assumed 12 while the rest assumed 10. Funnel them all through
TextState::getFontSize(), backed by one private DEFAULT_FONT_SIZE constant.

@PrinsFrank PrinsFrank left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with stacked diffs, but pointing them both to the main branch results in the second PR to be impossible to review. When branching like Main->A->B you can point the PR for A to main and the PR for B to branch A. In that case, when the PR for branch A gets merged the PR for branch B will also automatically point to the main branch.

I'll focus my review time on the previous PR first as these are already quite some changes to understand. ;)

@splitbrain

Copy link
Copy Markdown
Contributor Author

you can point the PR for A to main and the PR for B to branch A

I can't. Because I can only create PRs against any of your branches. As long as A is not merged into a branch at your repo I can not target it in a PR.

@splitbrain

Copy link
Copy Markdown
Contributor Author

I will close this until #430 is merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants