Skip to content

refactor(pos-utils): unify CoNLL-U training-data extraction and fix multiword-token span-shift#3694

Open
marcbachmann wants to merge 1 commit into
Automattic:masterfrom
marcbachmann:conllu-training-extraction
Open

refactor(pos-utils): unify CoNLL-U training-data extraction and fix multiword-token span-shift#3694
marcbachmann wants to merge 1 commit into
Automattic:masterfrom
marcbachmann:conllu-training-extraction

Conversation

@marcbachmann

@marcbachmann marcbachmann commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Issues

N/A — no tracking issue.
This is a standalone refactor that also fixes a latent multiword-token / contraction bug in the training-data extraction.

Description

Five CoNLL-U training-data extraction sites each hand-rolled their own sentence
walk:

  • tagger::brill_tagger::BrillTagger::epoch (transformation training)
  • tagger::freq_dict_builder::FreqDictBuilder::inc_from_conllu_file (base freq dict)
  • chunker::brill_chunker training (epoch)
  • chunker::burn_chunker::BurnChunker::extract_sents_from_files
  • chunker::upos_freq_dict::UPOSFreqDict::inc_from_conllu_file

Each merged contraction suffixes with a copy-pasted list, iterated raw token
rows, and (the chunkers) located noun phrases over those raw rows. Locating noun
phrases on raw sent.tokens shifts every span in a sentence containing a
multiword-token range row or an ellipsis node, because the head-link walk
resolves head ids by position and the extra rows offset those positions. The
result: wrong tokens, tags and NP labels for any sentence with a multiword token
(e.g. English cannot = can + not, don't = do + n't) or an ellipsis
node.

This PR moves all extraction into one place — conllu_utils — built on a single
multiword-token/contraction-aware sentence walk:

  • syntactic_words / walk_sentence merge UD component rows back into the
    surface tokens Harper actually tokenizes at runtime (Harper lexes don't as a
    single token), keeping a span map so NP flags computed on the syntactic words
    fold back onto the merged tokens without shifting.
  • sentence_to_training_pair(tokens, tags) and
    sentence_to_training_record(tokens, tags, NP mask), plus the batch
    helpers extract_pairs_from_files / extract_records_from_files.
  • locate_noun_phrases_in_words — moved here from
    chunker::np_extraction (deleted) and switched to operate on syntactic
    words — is the shared NP locator.

All five sites now call these helpers. The duplicated loops are removed and the
multiword-token / contraction span-shift is fixed.

Scope: inference and the shipped models are unaffected — these are
training-data paths only. Regenerating the Brill tagger, the Burn chunker, or the
frequency dictionaries will produce corrected data on sentences containing
multiword tokens or contractions.

Net: 8 files, ~+430 / −240; the per-call-site loops shrink, the duplicated
np_extraction.rs is deleted, and the shared toolkit lands in conllu_utils.

Demo

CoNLL-U input with an English multiword-token range (cannot = can + not):

1   I       PRON   nsubj
2-3 cannot  _      _          ← multiword-token range row
2   can     AUX    aux
3   not     PART   advmod
4   see     VERB   root
5   the     DET    det
6   door    NOUN   obj
tokens tags NP mask
before (raw walk) [I, cannot, can, not, see, the, door] [PRON, _, AUX, PART, VERB, DET, NOUN] [T,F,F,F,F,F,T] — "the door" mis-located
after (this PR) [I, cannot, see, the, door] [PRON, AUX, VERB, DET, NOUN] [T,F,F,T,T] — "the door" correct

How Has This Been Tested?

  • cargo build -p harper-pos-utils and cargo build -p harper-pos-utils --features training — both compile warning-free (the previously-duplicated loops are gone; no dead code).
  • cargo test -p harper-pos-utils --features training — green, including the conllu_utils unit tests that pin the merging/alignment: multiword-token collapse, contraction-suffix merge, and record/NP-mask alignment.
  • The divergence between the old hand-rolled loops and the shared helpers was verified token-by-token on concrete CoNLL-U inputs (multiword-token ranges incl. English cannot/don't, the I do n't know contraction, and an ellipsis 5.1 node) by running both code paths side by side.
  • The extraction lives in training-only methods, so there are no unit tests asserting their previous (buggy) output; correctness is covered by the conllu_utils helper tests they now delegate to.

AI Disclosure

  • I am a human and didn't use any AI.
  • I used LLM features of my editor, but not an agent.
  • I used an AI agent interactively.
  • I am an agent or I got an agent to do the work autonomously.

If Your PR Implements or Enhances a Linter

N/A — this changes harper-pos-utils training-data extraction, not a linter.

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes
  • I have considered splitting this into smaller pull requests

…_utils

Five training-data extraction sites — the Brill tagger (`epoch`,
`FreqDictBuilder`) and the Brill / Burn / UPOS-frequency chunkers — each
hand-rolled their own CoNLL-U sentence walk: merging contraction suffixes with
a copy-pasted list, iterating raw token rows, and (for the chunkers) locating
noun phrases over those raw rows. That last part shifts every noun-phrase span
in a sentence containing a multiword-token range (e.g. English "cannot",
"don't") or an ellipsis node, so the extracted tokens, tags and NP labels are
wrong for such sentences.

Move the extraction into one place — `conllu_utils` — built on a single
multiword-token/contraction-aware sentence walk:

- `syntactic_words` / `walk_sentence` merge UD component rows back into the
  surface tokens Harper actually tokenizes at runtime ("don't" is one token),
  keeping a span map so noun-phrase flags (computed on the syntactic words)
  fold back onto the merged tokens without shifting.
- `sentence_to_training_pair` / `sentence_to_training_record` and the
  `extract_pairs_from_files` / `extract_records_from_files` batch helpers
  return `(tokens, tags[, NP mask])`.
- `locate_noun_phrases_in_words` (moved here from `chunker::np_extraction`,
  now operating on syntactic words) is the shared NP locator.

All five sites call these helpers; the duplicated loops are gone and the
multiword-token / contraction span-shift is fixed.

Inference and the shipped models are unaffected — these are training-data
paths only. Regenerating the Brill tagger, Burn chunker or frequency
dictionaries will produce corrected data on sentences with multiword tokens
or contractions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant