feat(dataset push): tabular / time-series modality family by LukasWodka · Pull Request #13 · tracebloc/cli

LukasWodka · 2026-06-02T08:47:10Z

Summary

Widens tracebloc dataset push from image_classification-only to the tabular / time-series family — tabular_classification, tabular_regression, time_series_forecasting, time_to_event_prediction (5 of 10 modalities now). The Python ingestor already supports these; this adds the CLI-side flag / layout / spec surface. Validated end-to-end on a live cluster (tabular_classification, 8/8 records, 100%, rows confirmed in MySQL).
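To make the family split concrete, here is a minimal sketch of how a category string could be routed to either the image or the tabular path. The category names come from this PR; the `Family` type, `familyOf`, and `checkCategory` are hypothetical names chosen for illustration and are not taken from push/category.go.

```go
package main

import "fmt"

// Family is a hypothetical grouping of dataset categories by local layout.
type Family int

const (
	FamilyUnsupported Family = iota
	FamilyImage
	FamilyTabular
)

// familyOf maps a --category value to its family. The category names are the
// ones listed in this PR; the shape of the mapping is an assumption.
func familyOf(category string) Family {
	switch category {
	case "image_classification":
		return FamilyImage
	case "tabular_classification", "tabular_regression",
		"time_series_forecasting", "time_to_event_prediction":
		return FamilyTabular
	default:
		return FamilyUnsupported
	}
}

// checkCategory rejects categories that neither family covers yet.
func checkCategory(category string) error {
	if familyOf(category) == FamilyUnsupported {
		return fmt.Errorf("category %q is not supported by dataset push yet", category)
	}
	return nil
}
```

Under this sketch, text and image-extras categories fall through to the unsupported branch, which matches the follow-ups listed for later PRs.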

🔗 Stacked on #12 (the dataset-push live-ingestion fixes). Review #12 first: the only net-new change in this PR is the feat(dataset push): support the tabular … commit, and the diff collapses to that commit once #12 merges to develop.

What's added

Category dispatch (push/category.go): image vs tabular families, mirroring data-ingestors' conventions.py groupings.
Tabular layout (push/tabular.go): a single CSV (no sidecar files), staged via the existing machinery (CSV + empty image list — no changes to the tar/stream path).
Schema — auto-inferred from the CSV (INT/FLOAT/VARCHAR) so customers don't hand-write one; --schema col:TYPE,... overrides. Reserved framework columns (id, data_id, …) are skipped so a CSV carrying an id column doesn't produce a schema the ingestor rejects (the #135b guard in data-ingestors).
Label — string form for tabular_classification; object form with policy=bucket (default) for the regression-class categories so the raw numeric target never leaves the cluster. Added --label-policy and --time-column.
Pre-flight is category-aware (data CSV + column count for tabular).

Test plan

go build/vet/test ./... green; new push/tabular_test.go (DiscoverTabular, InferSchema incl. reserved-skip, ParseSchema) + spec_test.go tabular cases (all three label shapes pass the schema; regression defaults to bucket).
Live: dataset push ./tab --category tabular_classification --label-column label → schema auto-inferred (id skipped) → staged → ingestor Job ran → 100% (8/8); 8 rows confirmed in training_test_datasets.clidemo_tab_train.

Follow-ups (not in this PR)

Text family (text_classification, masked_language_modeling) and image-extras (object_detection, keypoint_detection) need sidecar-dir staging — next increments.
Same environment caveat as fix(dataset push): make live ingestion work end-to-end #12 (amd64-only ingestor digest on arm64 nodes) applies to live runs; not a CLI issue.

🤖 Generated with Claude Code

LukasWodka · 2026-06-02T08:47:55Z

👋 Heads-up — Code review queue is at 19 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
client#171 — Installer: verify readiness & credentials before reporting success; RHEL-family support · author: @LukasWodka · no reviewer assigned
client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
data-ingestors#132 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
data-ingestors#133 — docs: fix declarative-ingest path/column drift (issue #131 A-series) · author: @divyasinghds · no reviewer assigned
data-ingestors#138 — test(e2e): end-to-end ingestion equivalence suite (all modalities, real MySQL) · author: @LukasWodka · no reviewer assigned
data-ingestors#139 — fix(ingest): MLM tokenizer + relax PascalVOC difficult + reserved-id guard (#137, #135) · author: @LukasWodka · no reviewer assigned
design-system#19 — fix: un-track coverage/ and node_modules/ from git · author: @LukasWodka · no reviewer assigned
design-system#22 — ci: add Vitest test workflow · author: @LukasWodka · reviewer: @aptracebloc

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

aptracebloc

Approve ✅ — tabular family validated end-to-end

Live on EKS dev (chart 1.4.2, amd64): tabular_regression ingested 240/240 records, 100%, rows confirmed in training_test_datasets (real anonymized dataset).

Verified:

Schema auto-inference (INT/FLOAT/VARCHAR) with reserved-column skip — a CSV carrying id correctly skips it (18 cols inferred), avoiding the database.create_table collision.
--schema col:TYPE override parsing (bad form → exit 2).
Regression-class object label defaults to policy: bucket; tabular_classification emits the plain string label. All three label shapes pass schema validation (dry-run, 7/7 categories).
Category-aware pre-flight (data CSV + column count).
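
For readers who want a picture of the --schema col:TYPE,... override that was verified above, here is a hedged sketch of a parser for that form. `parseSchema` and `validTypes` are hypothetical names, and limiting the accepted types to INT, FLOAT, and VARCHAR is an assumption based on the types the auto-inference emits; the real ParseSchema may accept more.

```go
package main

import (
	"fmt"
	"strings"
)

// validTypes is an assumption: the three types the auto-inference emits.
var validTypes = map[string]bool{"INT": true, "FLOAT": true, "VARCHAR": true}

// parseSchema turns "col:TYPE,col2:TYPE2" into a column-to-type map and
// returns an error for any entry that is not in col:TYPE form.
func parseSchema(spec string) (map[string]string, error) {
	out := make(map[string]string)
	for _, pair := range strings.Split(spec, ",") {
		name, typ, ok := strings.Cut(pair, ":")
		name = strings.TrimSpace(name)
		typ = strings.ToUpper(strings.TrimSpace(typ))
		if !ok || name == "" {
			return nil, fmt.Errorf("bad --schema entry %q: want col:TYPE", pair)
		}
		if !validTypes[typ] {
			return nil, fmt.Errorf("bad --schema entry %q: unknown type %q", pair, typ)
		}
		out[name] = typ
	}
	return out, nil
}
```

The review reports that a malformed override exits with code 2. Mapping the returned error to that exit code would happen in the command layer, which this sketch leaves out.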

Note (non-blocking): the red Schema drift check is expected. The schema re-sync is stacked in #15, so the check goes green once #15 lands.

LukasWodka · 2026-06-02T10:58:16Z

👋 Heads-up — Code review queue is at 26 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
cli#12 — fix(dataset push): make live ingestion work end-to-end · author: @LukasWodka · no reviewer assigned
cli#14 — feat(dataset push): text family + generic sidecar staging · author: @LukasWodka · no reviewer assigned
cli#15 — feat(dataset push): object_detection + keypoint_detection · author: @LukasWodka · no reviewer assigned
cli#16 — test(cli): coverage wins (preflight/progress/errors) + smoke-test hardening · author: @LukasWodka · no reviewer assigned
cli#17 — test(cli): integration harness for the real-I/O seams (kind e2e) · author: @LukasWodka · no reviewer assigned
client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
client-runtime#65 — fix(#64): re-sync jobs-manager ingest schema (accept masked_language_modeling) + anti-drift · author: @LukasWodka · no reviewer assigned
client-runtime#67 — ci: publish jobs-manager images on merge (closes the deploy-delivery gap) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Extends `dataset push` from image_classification-only to also cover tabular_classification, tabular_regression, time_series_forecasting, and time_to_event_prediction. The Python ingestor already supports these; this adds the CLI-side flag / layout / spec surface. Validated end-to-end on a live cluster (tabular_classification, 8/8 records, 100%, rows confirmed in MySQL).

- Category dispatch (push/category.go): image vs tabular families, mirroring data-ingestors' conventions.py groupings.
- Tabular local layout (push/tabular.go): a single CSV (no sidecar files), staged via the existing machinery (CSV + empty image list).
- Schema: auto-inferred from the CSV (INT/FLOAT/VARCHAR) so customers don't hand-write one; --schema col:TYPE,... overrides. Reserved framework columns (id, data_id, ...) are skipped so a CSV carrying an id column doesn't produce a schema the ingestor rejects (the #135b guard).
- Label: string form for tabular_classification; object form with policy=bucket (default) for the regression-class categories so the raw numeric target never leaves the cluster. Added --label-policy and --time-column.
- Build() branches by category (push/spec.go); pre-flight is category-aware (data CSV + column count for tabular).

Tests: push/tabular_test.go (DiscoverTabular, InferSchema incl. reserved-skip, ParseSchema); spec_test.go (tabular Build passes the schema for all three label shapes, regression defaults to bucket); updated the unsupported-category gate test.
go build / vet / test green.

Stacked on cli#12 (the dataset-push live-ingestion fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…e-sync)

Adds the two remaining engine-supported image categories, taking the CLI to 9/10 modalities (only semantic_segmentation remains, blocked on the ingestor: data-ingestors#136).

- object_detection: reuses the generic sidecar walker for annotations/ (.xml). Validated live end-to-end: 128 records (bounding boxes) ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the CSV's Annotation column, read server-side). Adds --number-of-keypoints (required; no default). Emits target_size + number_of_keypoints as TOP-LEVEL fields, which the schema's keypoint conditional requires.
- Re-synced the embedded schema from data-ingestors develop. The vendored copy was stale: it lacked keypoint's top-level target_size + number_of_keypoints and their required-for-keypoint conditional, so the CLI couldn't validate a keypoint spec at all. `ingest validate` and dataset push now validate keypoint correctly.

Schema-skew findings (deployment/release hygiene, NOT CLI bugs):

* sync-schema.sh defaults to data-ingestors *master*, which is stale (lacks both MLM and keypoint); the current schema is on *develop*. Repoint the sync source to develop, or promote develop -> master. (sync --check vs master flags this drift; it is pre-existing and only surfaced here.)
* The deployed ingdemo client runs jobs-manager and the ingestor on DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's top-level fields; jobs-manager (older) REJECTS them as additional properties. So keypoint can't be ingested there until both components are refreshed to a matching schema. The CLI's emission is correct against the current/consistent schema (unit-verified). OD is unaffected (no new fields).

Tests: push/image_extras_test.go (DiscoverObjectDetection + missing-annotations); spec_test.go (OD emits annotations; keypoint emits top-level target_size + number_of_keypoints; both pass the schema); updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.

Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka assigned saadqbal Jun 2, 2026

This was referenced Jun 2, 2026

feat(dataset push): text family + generic sidecar staging #14

Merged

feat(dataset push): object_detection + keypoint_detection #15

Merged

LukasWodka assigned aptracebloc and unassigned saadqbal Jun 2, 2026

aptracebloc mentioned this pull request Jun 2, 2026

fix(dataset push): make live ingestion work end-to-end #12

Merged

aptracebloc marked this pull request as ready for review June 2, 2026 10:57

aptracebloc approved these changes Jun 2, 2026

aptracebloc force-pushed the feat/dataset-push-tabular branch from 3e5d32e to ff09484 Compare June 2, 2026 11:00

aptracebloc merged commit 857df92 into develop Jun 2, 2026
7 of 8 checks passed

aptracebloc merged 1 commit into develop from feat/dataset-push-tabular
