Correctly support rotated texts by splitbrain · Pull Request #431 · PrinsFrank/pdfparser

splitbrain · 2026-06-27T16:34:19Z

This PR introduces a new default LineGrouping strategy named BaselineClusterStrategy. Instead of grouping text elements by their position on the page only, it groups them by their rotation as well. This will, for example, prevent watermarks to interweave with "normal" text. Or rotated glyph based text to be split into individual lines.

For horizontal text, the results are exactly the same as before.

Note that this changes the LineGrouping interface by moving the space decision into it - it now depends on the used strategy. A few data object classes were introduced to make the code cleaner.

I believe this to be the better overall strategy, but the old strategies are still available. If it's worth keeping them is up for discussion.

This PR builds on top of the changes from #430 (eg. includes the commits over there)

…vance width Word-break detection compares the gap between two adjacent text elements against a threshold. The gap is next.offsetX - prev.offsetX - advance(prev), so it is only as accurate as the reconstructed advance. PositionedTextElement::getAdvanceWidth() reconstructs the cursor advance per spec §9.4.4: per-glyph width and Tc, minus the TJ kerning numbers, plus Tw for single-byte code 32 (gated off for composite fonts), scaled by Th and the text matrix, and measured with the element's own font rather than the next element's. With an accurate advance, within-word gaps collapse near zero, so the gap that signals a word break is a roughly constant fraction of the em across producers: a single WORD_BREAK_THRESHOLD of 0.25 separates word breaks from kerning. The intra-element large-negative-offset rule reuses the same constant.

parseOperand() runs the operand-splitting regex once and returns the tokens as {chars, offset} pairs, with the TJ adjustment already converted to a float or null. getText(), getCodePoints() and getTotalOffset() consume those pairs, so the pattern and its guards - including the unrecognized-format check on an empty match - live in one place instead of three.

A smaller glyph whose baseline is shifted off the line, such as the "2" in "CO2", overlaps the line too little to clear the line-overlap threshold and ended up on a line of its own. When such an element partially overlaps the line and sits horizontally within the run already collected, it is enclosed by surrounding text and belongs to that line. A smaller element that merely sits close but is not enclosed, like a subtitle beside a heading, keeps its own line.

A Vector captures a 2D direction in page space (length, normalized, dot product, perpendicular normal). TransformationMatrix::baselineVector() names the baseline (text advance) direction in page space: the matrix's first column (scaleX, shearX). Nothing consumes them yet; the glyph geometry and line-grouping that follow are built on these.

Font::getWidthForChar(s) drop their TransformationMatrix parameter and return the unscaled text-space width (w0·Tfs + Tc) rather than pre-multiplying by scaleX. getAdvanceWidth() takes a page-space direction Vector (default page X) and returns the signed projection of the reconstructed advance onto it -- a dot product against the baseline vector -- so the advance can later be measured along a rotated baseline. The default direction reproduces the previous advance · scaleX exactly, so the extracted text is unchanged. The perpendicular-to-baseline glyph height that rotated text needs lands with the baseline-aware strategy.

The word-break decision moves out of ContentStream into the strategy, because a gap between two runs is meaningful only relative to how that strategy laid them out. LineGroupingStrategy gains requiresSpaceBetween(); ContentStream keeps only the empty-run and already-spaced guards and asks the strategy. A MatrixOffsetSpacing trait holds the axis-aligned heuristic the two upright strategies share (TextOverlapStrategy, StrictLineGrouping): a space belongs when the gap between X offsets, less the previous run's reconstructed advance width, reaches WORD_BREAK_THRESHOLD of the em -- the same comparison ContentStream made inline before, so the extracted text is unchanged.

BaselineClusterStrategy is orientation-agnostic and two-pass. Pass 1 clusters runs into lines by their actual baseline -- baselines pointing nearly the same way whose glyph extents overlap along the baseline normal -- and orders each line along its baseline vector, so multi-run rotated text (a watermark, a vertical label, a diagonal overlay) is reassembled and reads in order at any angle rather than fragmenting to one line per run. Pass 2 segments the lines into orientation-bucketed blocks, orders the lines within a block along the block's own normal, and orders the blocks by their highest point on the page, so a rotated overlay stays contiguous instead of interleaving with the body it crosses. Lines and blocks are modelled as Line and Block value objects built on Vector; requiresSpaceBetween() measures the gap and the advance along the baseline, reducing to the axis-aligned comparison for horizontal text. getHeight() now measures the glyph extent perpendicular to the baseline (hypot(shearY, scaleY)) so rotated text keeps a non-zero height; this is what lets the strategy cluster rotated runs, and it changes grouping only where text is rotated. For horizontal text the baseline normal is the page Y axis and there is a single orientation bucket, so this is byte-identical to TextOverlapStrategy on every non-rotated page; Page::getText switches to it as the default. The only existing sample it changes is text-objects-across-multiple-streams (pages 5 and 7), whose rotated channel labels now read in baseline order with the repeated device labels separated by the reconstructed advance; its fixture is updated to match. TextOverlapStrategy remains as a non-default, axis-aligned alternative.

Regression fixtures exercising BaselineClusterStrategy, where the previous default fragmented rotated text to one line per run: - text-rotated-angles: one word per page rotated around the full circle (every 30 degrees), drawn glyph-by-glyph along its baseline; every angle is reassembled onto a single line. - text-rotated-watermark: a single 90-degree word drawn glyph-by-glyph, reassembled onto one line. - text-rotated-label: a LibreOffice document that emits a rotated word as per-glyph runs in the content stream; reassembled into one line. - text-rotated-overlay: a justified body paragraph crossed by two diagonal overlay paragraphs (35 and 265 degrees); each overlay stays contiguous and reads in order rather than interleaving with the body.

The fallback font size for content streams that show text without ever setting one lived as a bare literal at five call sites, and disagreed: getHeight() assumed 12 while the rest assumed 10. Funnel them all through TextState::getFontSize(), backed by one private DEFAULT_FONT_SIZE constant.

PrinsFrank

I'm fine with stacked diffs, but pointing them both to the main branch results in the second PR to be impossible to review. When branching like Main->A->B you can point the PR for A to main and the PR for B to branch A. In that case, when the PR for branch A gets merged the PR for branch B will also automatically point to the main branch.

I'll focus my review time on the previous PR first as these are already quite some changes to understand. ;)

splitbrain · 2026-06-27T17:41:27Z

you can point the PR for A to main and the PR for B to branch A

I can't. Because I can only create PRs against any of your branches. As long as A is not merged into a branch at your repo I can not target it in a PR.

splitbrain · 2026-06-27T18:07:10Z

I will close this until #430 is merged

splitbrain and others added 9 commits June 27, 2026 18:20

PrinsFrank requested changes Jun 27, 2026

View reviewed changes

splitbrain closed this Jun 27, 2026

splitbrain mentioned this pull request Jul 1, 2026

Correctly support rotated texts #444

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Correctly support rotated texts#431

Correctly support rotated texts#431
splitbrain wants to merge 9 commits into
PrinsFrank:mainfrom
cosmocode:rotation-by-cluster-split

splitbrain commented Jun 27, 2026

Uh oh!

PrinsFrank left a comment

Uh oh!

splitbrain commented Jun 27, 2026

Uh oh!

splitbrain commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

splitbrain commented Jun 27, 2026

Uh oh!

PrinsFrank left a comment

Choose a reason for hiding this comment

Uh oh!

splitbrain commented Jun 27, 2026

Uh oh!

splitbrain commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants