refactor(pos-utils): unify CoNLL-U training-data extraction and fix multiword-token span-shift#3694
Open
marcbachmann wants to merge 1 commit into
Open
Conversation
…_utils
Five training-data extraction sites — the Brill tagger (`epoch`,
`FreqDictBuilder`) and the Brill / Burn / UPOS-frequency chunkers — each
hand-rolled their own CoNLL-U sentence walk: merging contraction suffixes with
a copy-pasted list, iterating raw token rows, and (for the chunkers) locating
noun phrases over those raw rows. That last part shifts every noun-phrase span
in a sentence containing a multiword-token range (e.g. English "cannot",
"don't") or an ellipsis node, so the extracted tokens, tags and NP labels are
wrong for such sentences.
Move the extraction into one place — `conllu_utils` — built on a single
multiword-token/contraction-aware sentence walk:
- `syntactic_words` / `walk_sentence` merge UD component rows back into the
surface tokens Harper actually tokenizes at runtime ("don't" is one token),
keeping a span map so noun-phrase flags (computed on the syntactic words)
fold back onto the merged tokens without shifting.
- `sentence_to_training_pair` / `sentence_to_training_record` and the
`extract_pairs_from_files` / `extract_records_from_files` batch helpers
return `(tokens, tags[, NP mask])`.
- `locate_noun_phrases_in_words` (moved here from `chunker::np_extraction`,
now operating on syntactic words) is the shared NP locator.
All five sites call these helpers; the duplicated loops are gone and the
multiword-token / contraction span-shift is fixed.
Inference and the shipped models are unaffected — these are training-data
paths only. Regenerating the Brill tagger, Burn chunker or frequency
dictionaries will produce corrected data on sentences with multiword tokens
or contractions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issues
N/A — no tracking issue.
This is a standalone refactor that also fixes a latent multiword-token / contraction bug in the training-data extraction.
Description
Five CoNLL-U training-data extraction sites each hand-rolled their own sentence
walk:
tagger::brill_tagger::BrillTagger::epoch(transformation training)tagger::freq_dict_builder::FreqDictBuilder::inc_from_conllu_file(base freq dict)chunker::brill_chunkertraining (epoch)chunker::burn_chunker::BurnChunker::extract_sents_from_fileschunker::upos_freq_dict::UPOSFreqDict::inc_from_conllu_fileEach merged contraction suffixes with a copy-pasted list, iterated raw token
rows, and (the chunkers) located noun phrases over those raw rows. Locating noun
phrases on raw
sent.tokensshifts every span in a sentence containing amultiword-token range row or an ellipsis node, because the head-link walk
resolves head ids by position and the extra rows offset those positions. The
result: wrong tokens, tags and NP labels for any sentence with a multiword token
(e.g. English
cannot=can+not,don't=do+n't) or an ellipsisnode.
This PR moves all extraction into one place —
conllu_utils— built on a singlemultiword-token/contraction-aware sentence walk:
syntactic_words/walk_sentencemerge UD component rows back into thesurface tokens Harper actually tokenizes at runtime (Harper lexes
don'tas asingle token), keeping a span map so NP flags computed on the syntactic words
fold back onto the merged tokens without shifting.
sentence_to_training_pair→(tokens, tags)andsentence_to_training_record→(tokens, tags, NP mask), plus the batchhelpers
extract_pairs_from_files/extract_records_from_files.locate_noun_phrases_in_words— moved here fromchunker::np_extraction(deleted) and switched to operate on syntacticwords — is the shared NP locator.
All five sites now call these helpers. The duplicated loops are removed and the
multiword-token / contraction span-shift is fixed.
Scope: inference and the shipped models are unaffected — these are
training-data paths only. Regenerating the Brill tagger, the Burn chunker, or the
frequency dictionaries will produce corrected data on sentences containing
multiword tokens or contractions.
Net: 8 files, ~+430 / −240; the per-call-site loops shrink, the duplicated
np_extraction.rsis deleted, and the shared toolkit lands inconllu_utils.Demo
CoNLL-U input with an English multiword-token range (
cannot=can+not):[I, cannot, can, not, see, the, door][PRON, _, AUX, PART, VERB, DET, NOUN][T,F,F,F,F,F,T]— "the door" mis-located[I, cannot, see, the, door][PRON, AUX, VERB, DET, NOUN][T,F,F,T,T]— "the door" correctHow Has This Been Tested?
cargo build -p harper-pos-utilsandcargo build -p harper-pos-utils --features training— both compile warning-free (the previously-duplicated loops are gone; no dead code).cargo test -p harper-pos-utils --features training— green, including theconllu_utilsunit tests that pin the merging/alignment: multiword-token collapse, contraction-suffix merge, and record/NP-mask alignment.cannot/don't, theI do n't knowcontraction, and an ellipsis5.1node) by running both code paths side by side.conllu_utilshelper tests they now delegate to.AI Disclosure
If Your PR Implements or Enhances a Linter
N/A — this changes
harper-pos-utilstraining-data extraction, not a linter.Checklist