feat(dataset push): text family + generic sidecar staging#14

Merged
aptracebloc merged 1 commit into develop from feat/dataset-push-text on Jun 2, 2026

Conversation

@LukasWodka

Summary

Adds the text family to dataset push (text_classification and masked_language_modeling) and lands the generic sidecar-staging refactor the remaining families need (7 of 10 modalities now). Validated live: text_classification (5/5, 100%, rows in MySQL).

🔗 Stacked on #13 (tabular family) → #12 (live-ingestion fixes). Review those first; this PR's net-new is the feat(dataset push): support the text family … commit. The diff collapses as the stack merges.

What's added

  • Generic sidecar staging (walk.go, stream.go): LocalLayout gains Sidecars (dir → files, staged under <name>/) and ExtraFiles (dest → src, staged at the table root), plus FileCount(). The tar writer packages them after images, sorted for determinism. Images stays for image_classification. This is the shared piece object-detection / segmentation will reuse.
  • Text discovery (text.go): text_classification (labels.csv + texts/); masked_language_modeling (labels.csv + sequences/ + a required tokenizer.json at the root, staged as an ExtraFile so the ingestor reads it from SRC_PATH/tokenizer.json). discoverSidecarFiles is the reusable walker.
  • Build (spec.go): emits texts/ or sequences/ + a label (text_classification only; MLM has none).
  • Category dispatch + gate + pre-flight are text-aware.
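
For reviewers who want a concrete picture of the staging shape, here is a minimal sketch, not the code in walk.go or stream.go. The field names Sidecars, ExtraFiles, Images and the method FileCount() come from the description above; the field types, the LabelsCSV field, the StagedPaths helper, the rule that labels.csv counts as one file, and the choice to list sidecars before extra files are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"sort"
)

// LocalLayout is a hypothetical reconstruction. Field names follow the PR
// description; the types and the LabelsCSV field are assumptions.
type LocalLayout struct {
	LabelsCSV  string              // local path to labels.csv (assumed field)
	Images     []string            // used by image_classification only
	Sidecars   map[string][]string // dir name -> local files, staged under "<name>/"
	ExtraFiles map[string]string   // dest at the table root -> local source path
}

// FileCount returns how many files would be staged.
// Assumption: labels.csv counts as one file when it is set.
func (l LocalLayout) FileCount() int {
	n := len(l.Images) + len(l.ExtraFiles)
	if l.LabelsCSV != "" {
		n++
	}
	for _, files := range l.Sidecars {
		n += len(files)
	}
	return n
}

// StagedPaths (hypothetical helper) lists archive destinations for sidecars
// and extra files in a deterministic order: sidecar dirs sorted by name,
// files sorted within each dir, then extra files sorted by destination.
func (l LocalLayout) StagedPaths() []string {
	var out []string
	dirs := make([]string, 0, len(l.Sidecars))
	for d := range l.Sidecars {
		dirs = append(dirs, d)
	}
	sort.Strings(dirs)
	for _, d := range dirs {
		files := append([]string(nil), l.Sidecars[d]...)
		sort.Strings(files)
		for _, f := range files {
			// Archive paths use forward slashes, hence path.Join.
			out = append(out, path.Join(d, filepath.Base(f)))
		}
	}
	dests := make([]string, 0, len(l.ExtraFiles))
	for dest := range l.ExtraFiles {
		dests = append(dests, dest)
	}
	sort.Strings(dests)
	return append(out, dests...)
}

func main() {
	mlm := LocalLayout{
		LabelsCSV: "data/labels.csv",
		Sidecars: map[string][]string{
			"sequences": {"data/sequences/b.txt", "data/sequences/a.txt"},
		},
		ExtraFiles: map[string]string{"tokenizer.json": "data/tokenizer.json"},
	}
	fmt.Println("files:", mlm.FileCount())
	for _, p := range mlm.StagedPaths() {
		fmt.Println(p)
	}
}
```

Sorting map keys before emitting entries is what makes the archive order reproducible, since Go map iteration order is deliberately unspecified.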

Test plan

  • go build/vet/test ./... green; new push/text_test.go (both categories, MLM-requires-tokenizer, missing-dir errors; Build passes the schema with MLM label-less).
  • Live (text_classification): staged labels.csv + 5 texts/ → ingestor Job 100% (5/5); 5 rows in training_test_datasets.clidemo_txt_train.

MLM live caveat (not a CLI bug)

A live MLM push to the current ingdemo client is rejected server-side with HTTP 400: the deployed jobs-manager carries a stale embedded schema that predates masked_language_modeling (its enum lists every category except MLM, and rejects sequences). The CLI's vendored schema is current (matches data-ingestors), so it validates locally; the older server re-validates and rejects. MLM is correct against the current contract (unit-tested + the data-ingestors e2e proves the engine accepts it) and will work once the client's jobs-manager image is refreshed. The CLI surfaces the 400 cleanly (exit 8).
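
To make the drift concrete, the toy below validates the same category against two enum lists: one standing in for the CLI's vendored schema and one for the stale server copy. It is a simplified illustration only. The real check is JSON Schema validation of ingest.v1.json, the validateCategory helper is hypothetical, and the enum lists here are shortened examples rather than the actual schema contents.

```go
package main

import "fmt"

// validateCategory (hypothetical) mimics an enum check: the category must
// appear in the schema's list of allowed values.
func validateCategory(enum []string, category string) error {
	for _, c := range enum {
		if c == category {
			return nil
		}
	}
	return fmt.Errorf("category %q is not in the schema enum", category)
}

func main() {
	// Shortened example lists, not the real ingest.v1.json enums.
	current := []string{"image_classification", "text_classification", "masked_language_modeling"}
	stale := []string{"image_classification", "text_classification"}

	category := "masked_language_modeling"
	// The CLI validates against the current list and accepts the request.
	fmt.Println("CLI (vendored schema):", validateCategory(current, category))
	// The server re-validates against its older copy and rejects it.
	fmt.Println("server (stale schema):", validateCategory(stale, category))
}
```

Because both sides validate independently, a request can pass locally and still be rejected remotely until the two schema copies match again.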

👉 Ops follow-up: refresh the deployed jobs-manager's embedded ingest.v1.json to current (it carries MLM). Tracks with the image-currency / multi-arch-digest work.

Follow-ups

  • Image-extras (object_detection, keypoint_detection) reuse discoverSidecarFiles for annotations/ — next increment → 9/10. semantic_segmentation stays blocked on data-ingestors #136.

🤖 Generated with Claude Code

@LukasWodka
Author

👋 Heads-up — Code review queue is at 20 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@aptracebloc aptracebloc left a comment

Approve ✅ — text family + sidecar engine validated

Live on EKS dev: staging verified for the new shapes —

  • text_classification: texts/ sidecar (5 files: 4 .txt + labels.csv) staged correctly.
  • masked_language_modeling: sequences/ sidecar + tokenizer.json as an ExtraFile at the table root (6 files total) — FileCount accurate across both.

The generic Sidecars/ExtraFiles machinery (the piece OD/segmentation reuse) works as designed.

MLM caveat reproduced live (not a CLI bug): a live MLM push is rejected by the deployed jobs-manager with HTTP 400 — its embedded schema's category enum lacks masked_language_modeling and rejects sequences. The CLI surfaces this cleanly as exit 8. This reproduced even on chart 1.4.2.

👉 Ops follow-up (still open): refresh the deployed jobs-manager's embedded ingest.v1.json to current so MLM (and keypoint, #15) can ingest. Tracks with the image-currency work — not a blocker for this PR.

Note (non-blocking): the red Schema drift check is expected; the re-sync lands in #15.

@LukasWodka
Author

👋 Heads-up — Code review queue is at 26 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

  • averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
  • cli#12 — fix(dataset push): make live ingestion work end-to-end · author: @LukasWodka · no reviewer assigned
  • cli#13 — feat(dataset push): tabular / time-series modality family · author: @LukasWodka · no reviewer assigned
  • cli#15 — feat(dataset push): object_detection + keypoint_detection · author: @LukasWodka · no reviewer assigned
  • cli#16 — test(cli): coverage wins (preflight/progress/errors) + smoke-test hardening · author: @LukasWodka · no reviewer assigned
  • cli#17 — test(cli): integration harness for the real-I/O seams (kind e2e) · author: @LukasWodka · no reviewer assigned
  • client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
  • client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
  • client-runtime#65 — fix(#64): re-sync jobs-manager ingest schema (accept masked_language_modeling) + anti-drift · author: @LukasWodka · no reviewer assigned
  • client-runtime#67 — ci: publish jobs-manager images on merge (closes the deploy-delivery gap) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Adds text_classification and masked_language_modeling, and generalizes
the staging machinery from images-only to arbitrary sidecar directories
plus extra root files — the shared piece the remaining families need.

- Generic sidecar staging (walk.go, stream.go): LocalLayout gains
  Sidecars (dir name -> files, staged under "<name>/") and ExtraFiles
  (dest -> src, staged at the table root), plus FileCount(). The tar
  writer packages them after images, sorted for determinism. Images
  stays for image_classification.
- text.go: DiscoverText for text_classification (labels.csv + texts/)
  and masked_language_modeling (labels.csv + sequences/ + a required
  tokenizer.json at root, staged as an ExtraFile). discoverSidecarFiles
  is a reusable walker (object detection / segmentation will reuse it).
- Build (spec.go): the text branch emits texts/ or sequences/ + a label
  (text_classification only; MLM has none).
- dataset.go: category dispatch + gate now accept the text family;
  pre-flight is text-aware (sidecar file count + tokenizer line).

Validated live: text_classification — staged labels.csv + 5 texts/ ->
ingestor Job 100% (5/5), rows confirmed in MySQL.

MLM note: code-complete + unit-tested + accepted by the current
data-ingestors schema/engine (proven by the e2e), but the *deployed*
ingdemo jobs-manager carries a stale embedded schema predating MLM, so
a live MLM push is rejected server-side (HTTP 400) until that image is
refreshed. Not a CLI issue — the CLI surfaces the 400 cleanly.

Tests: push/text_test.go (DiscoverText for both categories incl.
MLM-requires-tokenizer + missing-dir errors; text Build passes the
schema, MLM has no label); updated the unsupported-category gate test.
go build / vet / test green.

Stacked on cli#13 (tabular family) -> cli#12 (live-ingestion fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@aptracebloc force-pushed the feat/dataset-push-text branch from 419af18 to a90c92c on June 2, 2026 11:01
@aptracebloc merged commit 9712728 into develop on Jun 2, 2026
7 of 8 checks passed