refactor(pos-utils): unify CoNLL-U training-data extraction and fix multiword-token span-shift by marcbachmann · Pull Request #3694 · Automattic/harper

marcbachmann · 2026-06-21T15:48:19Z

Issues

N/A — no tracking issue.
This is a standalone refactor that also fixes a latent multiword-token / contraction bug in the training-data extraction.

Description

Five CoNLL-U training-data extraction sites each hand-rolled their own sentence
walk:

tagger::brill_tagger::BrillTagger::epoch (transformation training)
tagger::freq_dict_builder::FreqDictBuilder::inc_from_conllu_file (base freq dict)
chunker::brill_chunker training (epoch)
chunker::burn_chunker::BurnChunker::extract_sents_from_files
chunker::upos_freq_dict::UPOSFreqDict::inc_from_conllu_file

Each merged contraction suffixes with a copy-pasted list, iterated raw token
rows, and (the chunkers) located noun phrases over those raw rows. Locating noun
phrases on raw sent.tokens shifts every span in a sentence containing a
multiword-token range row or an ellipsis node, because the head-link walk
resolves head ids by position and the extra rows offset those positions. The
result: wrong tokens, tags and NP labels for any sentence with a multiword token
(e.g. English cannot = can + not, don't = do + n't) or an ellipsis
node.

This PR moves all extraction into one place — conllu_utils — built on a single
multiword-token/contraction-aware sentence walk:

syntactic_words / walk_sentence merge UD component rows back into the
surface tokens Harper actually tokenizes at runtime (Harper lexes don't as a
single token), keeping a span map so NP flags computed on the syntactic words
fold back onto the merged tokens without shifting.
sentence_to_training_pair → (tokens, tags) and
sentence_to_training_record → (tokens, tags, NP mask), plus the batch
helpers extract_pairs_from_files / extract_records_from_files.
locate_noun_phrases_in_words — moved here from
chunker::np_extraction (deleted) and switched to operate on syntactic
words — is the shared NP locator.

All five sites now call these helpers. The duplicated loops are removed and the
multiword-token / contraction span-shift is fixed.

Scope: inference and the shipped models are unaffected — these are
training-data paths only. Regenerating the Brill tagger, the Burn chunker, or the
frequency dictionaries will produce corrected data on sentences containing
multiword tokens or contractions.

Net: 8 files, ~+430 / −240; the per-call-site loops shrink, the duplicated
np_extraction.rs is deleted, and the shared toolkit lands in conllu_utils.

Demo

CoNLL-U input with an English multiword-token range (cannot = can + not):

1   I       PRON   nsubj
2-3 cannot  _      _          ← multiword-token range row
2   can     AUX    aux
3   not     PART   advmod
4   see     VERB   root
5   the     DET    det
6   door    NOUN   obj

	tokens	tags	NP mask
before (raw walk)	`[I, cannot, can, not, see, the, door]`	`[PRON, _, AUX, PART, VERB, DET, NOUN]`	`[T,F,F,F,F,F,T]` — "the door" mis-located
after (this PR)	`[I, cannot, see, the, door]`	`[PRON, AUX, VERB, DET, NOUN]`	`[T,F,F,T,T]` — "the door" correct

How Has This Been Tested?

cargo build -p harper-pos-utils and cargo build -p harper-pos-utils --features training — both compile warning-free (the previously-duplicated loops are gone; no dead code).
cargo test -p harper-pos-utils --features training — green, including the conllu_utils unit tests that pin the merging/alignment: multiword-token collapse, contraction-suffix merge, and record/NP-mask alignment.
The divergence between the old hand-rolled loops and the shared helpers was verified token-by-token on concrete CoNLL-U inputs (multiword-token ranges incl. English cannot/don't, the I do n't know contraction, and an ellipsis 5.1 node) by running both code paths side by side.
The extraction lives in training-only methods, so there are no unit tests asserting their previous (buggy) output; correctness is covered by the conllu_utils helper tests they now delegate to.

AI Disclosure

I am a human and didn't use any AI.
I used LLM features of my editor, but not an agent.
I used an AI agent interactively.
I am an agent or I got an agent to do the work autonomously.

If Your PR Implements or Enhances a Linter

N/A — this changes harper-pos-utils training-data extraction, not a linter.

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes
I have considered splitting this into smaller pull requests

…_utils Five training-data extraction sites — the Brill tagger (`epoch`, `FreqDictBuilder`) and the Brill / Burn / UPOS-frequency chunkers — each hand-rolled their own CoNLL-U sentence walk: merging contraction suffixes with a copy-pasted list, iterating raw token rows, and (for the chunkers) locating noun phrases over those raw rows. That last part shifts every noun-phrase span in a sentence containing a multiword-token range (e.g. English "cannot", "don't") or an ellipsis node, so the extracted tokens, tags and NP labels are wrong for such sentences. Move the extraction into one place — `conllu_utils` — built on a single multiword-token/contraction-aware sentence walk: - `syntactic_words` / `walk_sentence` merge UD component rows back into the surface tokens Harper actually tokenizes at runtime ("don't" is one token), keeping a span map so noun-phrase flags (computed on the syntactic words) fold back onto the merged tokens without shifting. - `sentence_to_training_pair` / `sentence_to_training_record` and the `extract_pairs_from_files` / `extract_records_from_files` batch helpers return `(tokens, tags[, NP mask])`. - `locate_noun_phrases_in_words` (moved here from `chunker::np_extraction`, now operating on syntactic words) is the shared NP locator. All five sites call these helpers; the duplicated loops are gone and the multiword-token / contraction span-shift is fixed. Inference and the shipped models are unaffected — these are training-data paths only. Regenerating the Brill tagger, Burn chunker or frequency dictionaries will produce corrected data on sentences with multiword tokens or contractions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(pos-utils): unify CoNLL-U training-data extraction and fix multiword-token span-shift#3694

refactor(pos-utils): unify CoNLL-U training-data extraction and fix multiword-token span-shift#3694
marcbachmann wants to merge 1 commit into
Automattic:masterfrom
marcbachmann:conllu-training-extraction

marcbachmann commented Jun 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marcbachmann commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issues

Description

Demo

How Has This Been Tested?

AI Disclosure

If Your PR Implements or Enhances a Linter

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

marcbachmann commented Jun 21, 2026 •

edited

Loading