
feat(dataset push): tabular / time-series modality family #13

Merged: aptracebloc merged 1 commit into develop from feat/dataset-push-tabular on Jun 2, 2026

@LukasWodka

Summary

Widens tracebloc dataset push from image_classification-only to the tabular / time-series family: tabular_classification, tabular_regression, time_series_forecasting, and time_to_event_prediction (5 of 10 modalities are now supported). The Python ingestor already supports these; this PR adds the CLI-side flag / layout / spec surface. Validated end-to-end on a live cluster (tabular_classification, 8/8 records, 100%, rows confirmed in MySQL).

🔗 Stacked on #12 (the dataset-push live-ingestion fixes). Review #12 first — this PR's net-new is the feat(dataset push): support the tabular … commit; the diff collapses once #12 merges to develop.

What's added

  • Category dispatch (push/category.go): image vs tabular families, mirroring data-ingestors' conventions.py groupings.
  • Tabular layout (push/tabular.go): a single CSV (no sidecar files), staged via the existing machinery (CSV + empty image list — no changes to the tar/stream path).
  • Schema — auto-inferred from the CSV (INT/FLOAT/VARCHAR) so customers don't hand-write one; --schema col:TYPE,... overrides. Reserved framework columns (id, data_id, …) are skipped so a CSV carrying an id column doesn't produce a schema the ingestor rejects (the #135b guard in data-ingestors).
  • Label — string form for tabular_classification; object form with policy=bucket (default) for the regression-class categories so the raw numeric target never leaves the cluster. Added --label-policy and --time-column.
  • Pre-flight is category-aware (data CSV + column count for tabular).

Test plan

  • go build/vet/test ./... green; new push/tabular_test.go (DiscoverTabular, InferSchema incl. reserved-skip, ParseSchema) + spec_test.go tabular cases (all three label shapes pass the schema; regression defaults to bucket).
  • Live: dataset push ./tab --category tabular_classification --label-column label → schema auto-inferred (id skipped) → staged → ingestor Job ran → 100% (8/8); 8 rows confirmed in training_test_datasets.clidemo_tab_train.

Follow-ups (not in this PR)

  • Text family (text_classification, masked_language_modeling) and image-extras (object_detection, keypoint_detection) need sidecar-dir staging — next increments.
  • Same environment caveat as fix(dataset push): make live ingestion work end-to-end #12 (amd64-only ingestor digest on arm64 nodes) applies to live runs; not a CLI issue.

🤖 Generated with Claude Code

@LukasWodka
Author

👋 Heads-up — Code review queue is at 19 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@aptracebloc left a comment

Approve ✅ — tabular family validated end-to-end

Live on EKS dev (chart 1.4.2, amd64): tabular_regression ingested 240/240 records, 100%, rows confirmed in training_test_datasets (real anonymized dataset).

Verified:

  • Schema auto-inference (INT/FLOAT/VARCHAR) with reserved-column skip — a CSV carrying id correctly skips it (18 cols inferred), avoiding the database.create_table collision.
  • --schema col:TYPE override parsing (bad form → exit 2).
  • Regression-class object label defaults to policy: bucket; tabular_classification emits the plain string label. All three label shapes pass schema validation (dry-run, 7/7 categories).
  • Category-aware pre-flight (data CSV + column count).
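
To illustrate the `--schema col:TYPE,...` override and its "bad form → exit 2" behaviour verified above, here is a rough sketch of such a parser. It is an illustration only: `parseSchema` and `allowedTypes` are hypothetical names, and both the accepted type set (the three types the inference emits) and the upper-casing of the type are assumptions rather than confirmed CLI behaviour.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// allowedTypes is an assumption: the three types the PR says are inferred.
var allowedTypes = map[string]bool{"INT": true, "FLOAT": true, "VARCHAR": true}

// parseSchema turns "col:TYPE,col:TYPE" into a column-to-type map and
// rejects any entry that is not of the form col:TYPE.
func parseSchema(flag string) (map[string]string, error) {
	out := make(map[string]string)
	for _, pair := range strings.Split(flag, ",") {
		name, typ, ok := strings.Cut(strings.TrimSpace(pair), ":")
		name = strings.TrimSpace(name)
		typ = strings.ToUpper(strings.TrimSpace(typ))
		if !ok || name == "" || !allowedTypes[typ] {
			return nil, fmt.Errorf("invalid --schema entry %q: want col:TYPE", pair)
		}
		out[name] = typ
	}
	return out, nil
}

func main() {
	schema, err := parseSchema("age:INT,score:float")
	if err != nil {
		// Mirrors the "bad form → exit 2" behaviour noted in the review.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println(schema)
}
```

A map loses the column order given on the command line; an implementation that needs to preserve order would return a slice instead.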

Note (non-blocking): the red Schema drift check is expected. The schema re-sync is stacked in #15, and the check goes green once #15 lands.

@LukasWodka
Author

👋 Heads-up — Code review queue is at 26 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

  • averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
  • cli#12 — fix(dataset push): make live ingestion work end-to-end · author: @LukasWodka · no reviewer assigned
  • cli#14 — feat(dataset push): text family + generic sidecar staging · author: @LukasWodka · no reviewer assigned
  • cli#15 — feat(dataset push): object_detection + keypoint_detection · author: @LukasWodka · no reviewer assigned
  • cli#16 — test(cli): coverage wins (preflight/progress/errors) + smoke-test hardening · author: @LukasWodka · no reviewer assigned
  • cli#17 — test(cli): integration harness for the real-I/O seams (kind e2e) · author: @LukasWodka · no reviewer assigned
  • client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
  • client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
  • client-runtime#65 — fix(#64): re-sync jobs-manager ingest schema (accept masked_language_modeling) + anti-drift · author: @LukasWodka · no reviewer assigned
  • client-runtime#67 — ci: publish jobs-manager images on merge (closes the deploy-delivery gap) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Extends `dataset push` from image_classification-only to also cover
tabular_classification, tabular_regression, time_series_forecasting,
and time_to_event_prediction. The Python ingestor already supports
these; this adds the CLI-side flag / layout / spec surface. Validated
end-to-end on a live cluster (tabular_classification, 8/8 records,
100%, rows confirmed in MySQL).

- Category dispatch (push/category.go): image vs tabular families,
  mirroring data-ingestors' conventions.py groupings.
- Tabular local layout (push/tabular.go): a single CSV (no sidecar
  files), staged via the existing machinery (CSV + empty image list).
- Schema: auto-inferred from the CSV (INT/FLOAT/VARCHAR) so customers
  don't hand-write one; --schema col:TYPE,... overrides. Reserved
  framework columns (id, data_id, ...) are skipped so a CSV carrying
  an id column doesn't produce a schema the ingestor rejects (the
  #135b guard).
- Label: string form for tabular_classification; object form with
  policy=bucket (default) for the regression-class categories so the
  raw numeric target never leaves the cluster. Added --label-policy
  and --time-column.
- Build() branches by category (push/spec.go); pre-flight is
  category-aware (data CSV + column count for tabular).
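
As a sketch of how the category dispatch and the label branching in this commit could fit together (not the actual push/category.go or push/spec.go): `familyOf`, `buildLabel`, and `labelObject` are hypothetical names, the JSON field name `column` is an assumption (only `policy` and its `bucket` default appear in this PR), and treating every tabular category other than tabular_classification as "regression-class" is an inference from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type family int

const (
	familyUnsupported family = iota
	familyImage
	familyTabular
)

// familyOf groups categories the way the commit message lists them.
func familyOf(category string) family {
	switch category {
	case "image_classification":
		return familyImage
	case "tabular_classification", "tabular_regression",
		"time_series_forecasting", "time_to_event_prediction":
		return familyTabular
	}
	return familyUnsupported
}

// labelObject is the object label form. "policy" comes from the PR;
// the "column" field name is hypothetical.
type labelObject struct {
	Column string `json:"column"`
	Policy string `json:"policy"`
}

// buildLabel returns a plain string for tabular_classification and an
// object with policy defaulting to "bucket" for the other tabular categories.
func buildLabel(category, column, policy string) (any, error) {
	if familyOf(category) != familyTabular {
		return nil, fmt.Errorf("category %q is not in the tabular family", category)
	}
	if category == "tabular_classification" {
		return column, nil
	}
	if policy == "" {
		policy = "bucket" // default keeps the raw numeric target in the cluster
	}
	return labelObject{Column: column, Policy: policy}, nil
}

func main() {
	for _, cat := range []string{"tabular_classification", "tabular_regression"} {
		label, err := buildLabel(cat, "target", "")
		if err != nil {
			panic(err)
		}
		b, _ := json.Marshal(map[string]any{"category": cat, "label": label})
		fmt.Println(string(b))
	}
}
```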

Tests: push/tabular_test.go (DiscoverTabular, InferSchema incl.
reserved-skip, ParseSchema); spec_test.go (tabular Build passes the
schema for all three label shapes, regression defaults to bucket);
updated the unsupported-category gate test. go build / vet / test green.

Stacked on cli#12 (the dataset-push live-ingestion fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@aptracebloc aptracebloc force-pushed the feat/dataset-push-tabular branch from 3e5d32e to ff09484 Compare June 2, 2026 11:00
@aptracebloc aptracebloc merged commit 857df92 into develop Jun 2, 2026
7 of 8 checks passed
aptracebloc pushed a commit that referenced this pull request Jun 2, 2026
…e-sync)

Adds the two remaining engine-supported image categories, taking the CLI
to 9/10 modalities (only semantic_segmentation remains, blocked on the
ingestor — data-ingestors#136).

- object_detection: reuses the generic sidecar walker for annotations/
  (.xml). Validated live end-to-end — 128 records (bounding boxes)
  ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the
  CSV's Annotation column, read server-side). Adds --number-of-keypoints
  (required; no default). Emits target_size + number_of_keypoints as
  TOP-LEVEL fields, which the schema's keypoint conditional requires.

- Re-synced the embedded schema from data-ingestors develop. The vendored
  copy was stale: it lacked keypoint's top-level target_size +
  number_of_keypoints and their required-for-keypoint conditional, so the
  CLI couldn't validate a keypoint spec at all. `ingest validate` and
  dataset push now validate keypoint correctly.

Schema-skew findings (deployment/release hygiene, NOT CLI bugs):
  * sync-schema.sh defaults to data-ingestors *master*, which is stale
    (lacks both MLM and keypoint); the current schema is on *develop*.
    Repoint the sync source to develop, or promote develop -> master.
    (sync --check vs master flags this drift — pre-existing, surfaced here.)
  * The deployed ingdemo client runs jobs-manager and the ingestor on
    DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's
    top-level fields; jobs-manager (older) REJECTS them as additional
    properties. So keypoint can't be ingested there until both components
    are refreshed to a matching schema. The CLI's emission is correct
    against the current/consistent schema (unit-verified). OD is
    unaffected (no new fields).
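
The version skew above can be reproduced in miniature. The sketch below uses Go's strict JSON decoding as a stand-in for JSON Schema validation (it is not how jobs-manager or the ingestor validate specs): `oldSpec`, `newSpec`, `validateOld`, and `validateNew` are hypothetical, the `category` field name and the example values for target_size and number_of_keypoints are assumptions, and only the two keypoint field names come from this commit.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// oldSpec models a validator built before the keypoint fields existed.
type oldSpec struct {
	Category string `json:"category"`
}

// newSpec knows the keypoint fields. RawMessage avoids assuming their types.
type newSpec struct {
	Category          string          `json:"category"`
	TargetSize        json.RawMessage `json:"target_size"`
	NumberOfKeypoints json.RawMessage `json:"number_of_keypoints"`
}

// validateOld rejects any field it does not know, like a schema that
// forbids additional properties.
func validateOld(doc []byte) error {
	dec := json.NewDecoder(bytes.NewReader(doc))
	dec.DisallowUnknownFields()
	var s oldSpec
	return dec.Decode(&s)
}

// validateNew additionally requires the keypoint fields for keypoint specs.
func validateNew(doc []byte) error {
	dec := json.NewDecoder(bytes.NewReader(doc))
	dec.DisallowUnknownFields()
	var s newSpec
	if err := dec.Decode(&s); err != nil {
		return err
	}
	if s.Category == "keypoint_detection" &&
		(s.TargetSize == nil || s.NumberOfKeypoints == nil) {
		return errors.New("keypoint_detection requires target_size and number_of_keypoints")
	}
	return nil
}

func main() {
	// Example values are made up for illustration.
	withFields := []byte(`{"category":"keypoint_detection","target_size":[256,256],"number_of_keypoints":17}`)
	withoutFields := []byte(`{"category":"keypoint_detection"}`)

	fmt.Println("old, with fields:   ", validateOld(withFields))
	fmt.Println("new, with fields:   ", validateNew(withFields))
	fmt.Println("old, without fields:", validateOld(withoutFields))
	fmt.Println("new, without fields:", validateNew(withoutFields))
}
```

Under these assumptions each document is accepted by exactly one of the two validators, which is why no keypoint spec can pass both components until they share a schema version.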

Tests: push/image_extras_test.go (DiscoverObjectDetection +
missing-annotations); spec_test.go (OD emits annotations; keypoint emits
top-level target_size + number_of_keypoints; both pass the schema);
updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.

Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
aptracebloc pushed a commit that referenced this pull request Jun 2, 2026
…e-sync) (#15)

3 participants