feat(dataset push): text family + generic sidecar staging #14
👋 Heads-up: Code review queue is at 20 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
aptracebloc
left a comment
Approve ✅ — text family + sidecar engine validated
Live on EKS dev: staging verified for the new shapes:
- `text_classification`: `texts/` sidecar (5 files: 4 `.txt` + `labels.csv`) staged correctly.
- `masked_language_modeling`: `sequences/` sidecar + `tokenizer.json` as an ExtraFile at the table root (6 files total).

`FileCount` is accurate across both.
The generic Sidecars/ExtraFiles machinery (the piece OD/segmentation reuse) works as designed.
MLM caveat reproduced live (not a CLI bug): a live MLM push is rejected by the deployed jobs-manager with HTTP 400 — its embedded schema's category enum lacks masked_language_modeling and rejects sequences. The CLI surfaces this cleanly as exit 8. This reproduced even on chart 1.4.2.
👉 Ops follow-up (still open): refresh the deployed jobs-manager's embedded `ingest.v1.json` to current so MLM (and keypoint, #15) can ingest. Tracks with the image-currency work; not a blocker for this PR.
Note (non-blocking): Schema drift check red is expected — re-sync lands in #15.
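
The MLM caveat comes down to two copies of the same schema disagreeing about the category enum. A minimal sketch of that mismatch follows. It assumes a plain string-list enum, and `validateCategory` and both lists are hypothetical illustrations, not the real `ingest.v1.json` contents or the jobs-manager's validator:

```go
package main

import "fmt"

// validateCategory reports an error when category is absent from the enum.
// Hypothetical helper: the real validation is schema-driven.
func validateCategory(enum []string, category string) error {
	for _, c := range enum {
		if c == category {
			return nil
		}
	}
	return fmt.Errorf("category %q not in schema enum", category)
}

func main() {
	// Illustrative subsets only, not the real enums.
	current := []string{"image_classification", "text_classification", "masked_language_modeling"}
	stale := []string{"image_classification", "text_classification"}

	category := "masked_language_modeling"
	// The CLI validates against its vendored (current) schema and passes.
	fmt.Println("local check: ", validateCategory(current, category))
	// The deployed server re-validates against its older copy and rejects.
	fmt.Println("server check:", validateCategory(stale, category))
}
```

Under these assumptions the local check passes while the server check fails, which matches the reported behavior: the push validates locally and is then rejected with HTTP 400.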
Adds text_classification and masked_language_modeling, and generalizes the staging machinery from images-only to arbitrary sidecar directories plus extra root files: the shared piece the remaining families need.

- Generic sidecar staging (walk.go, stream.go): LocalLayout gains Sidecars (dir name -> files, staged under "<name>/") and ExtraFiles (dest -> src, staged at the table root), plus FileCount(). The tar writer packages them after images, sorted for determinism. Images stays for image_classification.
- text.go: DiscoverText for text_classification (labels.csv + texts/) and masked_language_modeling (labels.csv + sequences/ + a required tokenizer.json at root, staged as an ExtraFile). discoverSidecarFiles is a reusable walker (object detection / segmentation will reuse it).
- Build (spec.go): the text branch emits texts/ or sequences/ + a label (text_classification only; MLM has none).
- dataset.go: category dispatch + gate now accept the text family; pre-flight is text-aware (sidecar file count + tokenizer line).

Validated live: text_classification, staged labels.csv + 5 texts/ -> ingestor Job 100% (5/5), rows confirmed in MySQL.

MLM note: code-complete + unit-tested + accepted by the current data-ingestors schema/engine (proven by the e2e), but the *deployed* ingdemo jobs-manager carries a stale embedded schema predating MLM, so a live MLM push is rejected server-side (HTTP 400) until that image is refreshed. Not a CLI issue: the CLI surfaces the 400 cleanly.

Tests: push/text_test.go (DiscoverText for both categories incl. MLM-requires-tokenizer + missing-dir errors; text Build passes the schema, MLM has no label); updated the unsupported-category gate test. go build / vet / test green.

Stacked on cli#13 (tabular family) -> cli#12 (live-ingestion fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
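
For readers who have not opened walk.go, the layout described in the commit message can be sketched roughly as follows. This is a sketch under assumptions, not the PR's code: the field types, the `StagedPaths` helper, and the exact ordering within each group are guesses, and the sketch leaves `labels.csv` out because the source does not say where the real struct tracks it.

```go
package main

import (
	"fmt"
	"sort"
)

// LocalLayout mirrors the field names from the commit message.
// The field types are assumptions.
type LocalLayout struct {
	Images     []string            // image files (image_classification)
	Sidecars   map[string][]string // dir name -> file names, staged under "<name>/"
	ExtraFiles map[string]string   // dest at the table root -> local source path
}

// FileCount returns how many files this sketch would stage.
func (l LocalLayout) FileCount() int {
	n := len(l.Images) + len(l.ExtraFiles)
	for _, files := range l.Sidecars {
		n += len(files)
	}
	return n
}

// StagedPaths (hypothetical helper) lists destination paths deterministically:
// images first, then sidecar files, then extra root files, each group sorted.
func (l LocalLayout) StagedPaths() []string {
	out := append([]string(nil), l.Images...)
	sort.Strings(out)

	dirs := make([]string, 0, len(l.Sidecars))
	for name := range l.Sidecars {
		dirs = append(dirs, name)
	}
	sort.Strings(dirs) // map iteration order is random, so sort the keys
	for _, name := range dirs {
		files := append([]string(nil), l.Sidecars[name]...)
		sort.Strings(files)
		for _, f := range files {
			out = append(out, name+"/"+f)
		}
	}

	dests := make([]string, 0, len(l.ExtraFiles))
	for dest := range l.ExtraFiles {
		dests = append(dests, dest)
	}
	sort.Strings(dests)
	return append(out, dests...)
}

func main() {
	layout := LocalLayout{
		Sidecars:   map[string][]string{"sequences": {"b.txt", "a.txt"}},
		ExtraFiles: map[string]string{"tokenizer.json": "/data/mlm/tokenizer.json"},
	}
	fmt.Println("files:", layout.FileCount())
	for _, p := range layout.StagedPaths() {
		fmt.Println(p)
	}
}
```

Sorting the map keys before emitting paths is what makes the output stable across runs, which is the property the commit message calls "sorted for determinism".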
419af18 to a90c92c
Summary
Adds the text family to `dataset push` (`text_classification` and `masked_language_modeling`) and lands the generic sidecar-staging refactor the remaining families need (7 of 10 modalities now). Validated live: text_classification (5/5, 100%, rows in MySQL).

What's added
- Generic sidecar staging (`walk.go`, `stream.go`): `LocalLayout` gains `Sidecars` (dir → files, staged under `<name>/`) and `ExtraFiles` (dest → src, staged at the table root), plus `FileCount()`. The tar writer packages them after images, sorted for determinism. `Images` stays for image_classification. This is the shared piece object-detection / segmentation will reuse.
- `DiscoverText` (`text.go`): `text_classification` (labels.csv + `texts/`); `masked_language_modeling` (labels.csv + `sequences/` + a required `tokenizer.json` at the root, staged as an ExtraFile so the ingestor reads it from `SRC_PATH/tokenizer.json`). `discoverSidecarFiles` is the reusable walker.
- Build (`spec.go`): emits `texts/` or `sequences/` + a label (text_classification only; MLM has none).

Test plan
- `go build/vet/test ./...` green; new `push/text_test.go` (both categories, MLM-requires-tokenizer, missing-dir errors; Build passes the schema with MLM label-less).
- Live `text_classification`: staged labels.csv + 5 `texts/` → ingestor Job 100% (5/5); 5 rows in `training_test_datasets.clidemo_txt_train`.

MLM live caveat (not a CLI bug)
A live MLM push to the current ingdemo client is rejected server-side with HTTP 400: the deployed jobs-manager carries a stale embedded schema that predates `masked_language_modeling` (its enum lists every category except MLM, and rejects `sequences`). The CLI's vendored schema is current (matches data-ingestors), so it validates locally; the older server re-validates and rejects. MLM is correct against the current contract (unit-tested + the data-ingestors e2e proves the engine accepts it) and will work once the client's jobs-manager image is refreshed. The CLI surfaces the 400 cleanly (exit 8).

Follow-ups
- `object_detection` and `keypoint_detection` reuse `discoverSidecarFiles` for `annotations/`: next increment → 9/10.
- `semantic_segmentation` stays blocked on data-ingestors #136.

🤖 Generated with Claude Code
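
The discovery rules listed under "What's added" (a `texts/` or `sequences/` directory next to labels.csv, plus a mandatory `tokenizer.json` for MLM) can be sketched as below. This is an illustration under assumptions rather than the contents of `text.go`: `discoverText`, `textLayout`, `newFixture`, the error messages, and the flat (non-recursive) directory listing are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// textLayout is a hypothetical result type for this sketch.
type textLayout struct {
	SidecarDir string   // "texts" or "sequences"
	Files      []string // file names inside the sidecar directory
	Tokenizer  string   // local path to tokenizer.json; empty for text_classification
}

// discoverText checks the on-disk shape for the two text categories.
func discoverText(root, category string) (textLayout, error) {
	var dir string
	switch category {
	case "text_classification":
		dir = "texts"
	case "masked_language_modeling":
		dir = "sequences"
	default:
		return textLayout{}, fmt.Errorf("unsupported category %q", category)
	}
	if _, err := os.Stat(filepath.Join(root, "labels.csv")); err != nil {
		return textLayout{}, fmt.Errorf("labels.csv: %w", err)
	}
	// os.ReadDir returns entries sorted by filename.
	entries, err := os.ReadDir(filepath.Join(root, dir))
	if err != nil {
		return textLayout{}, fmt.Errorf("missing %s/ directory: %w", dir, err)
	}
	out := textLayout{SidecarDir: dir}
	for _, e := range entries {
		if !e.IsDir() {
			out.Files = append(out.Files, e.Name())
		}
	}
	if category == "masked_language_modeling" {
		tok := filepath.Join(root, "tokenizer.json")
		if _, err := os.Stat(tok); err != nil {
			return textLayout{}, fmt.Errorf("masked_language_modeling requires tokenizer.json: %w", err)
		}
		out.Tokenizer = tok
	}
	return out, nil
}

// newFixture builds a throwaway dataset directory for the demo.
func newFixture(sidecar string, withTokenizer bool) (string, error) {
	root, err := os.MkdirTemp("", "textpush")
	if err != nil {
		return "", err
	}
	if err := os.Mkdir(filepath.Join(root, sidecar), 0o755); err != nil {
		return "", err
	}
	files := []string{"labels.csv", filepath.Join(sidecar, "a.txt")}
	if withTokenizer {
		files = append(files, "tokenizer.json")
	}
	for _, f := range files {
		// File contents are placeholders; the sketch only checks presence.
		if err := os.WriteFile(filepath.Join(root, f), []byte("placeholder\n"), 0o644); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := newFixture("sequences", false)
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	_, err = discoverText(root, "masked_language_modeling")
	fmt.Println("without tokenizer.json:", err)
}
```

Failing at discovery time keeps the MLM-requires-tokenizer rule a local error, before any tar is built or any request reaches the server.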