feat(dataset push): tabular / time-series modality family #13
👋 Heads-up: the Code review queue is at 19 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
aptracebloc
left a comment
Approve ✅ — tabular family validated end-to-end
Live on EKS dev (chart 1.4.2, amd64): tabular_regression ingested 240/240 records, 100%, rows confirmed in training_test_datasets (real anonymized dataset).
Verified:
- Schema auto-inference (INT/FLOAT/VARCHAR) with reserved-column skip: a CSV carrying `id` correctly skips it (18 cols inferred), avoiding the `database.create_table` collision.
- `--schema col:TYPE` override parsing (bad form → exit 2).
- Regression-class object label defaults to `policy: bucket`; `tabular_classification` emits the plain string label. All three label shapes pass schema validation (dry-run, 7/7 categories).
- Category-aware pre-flight (data CSV + column count).
Note (non-blocking): Schema drift check red is expected — the schema re-sync is stacked in #15; goes green once #15 lands.
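For readers who want a concrete picture of the first verified item, here is a minimal sketch of INT/FLOAT/VARCHAR inference with a reserved-column skip. It is not the code in push/tabular.go: the names `inferType`, `inferSchema`, and `Column` are hypothetical, the reserved set contains only the two column names this PR mentions (the full list is elided there), and the narrowest-type rule is an assumption about how the inference behaves.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// reserved holds framework-owned column names. Only "id" and "data_id" are
// named in the PR; the real list is longer, so this set is an assumption.
var reserved = map[string]bool{"id": true, "data_id": true}

// Column pairs a CSV header with its inferred type (hypothetical type).
type Column struct {
	Name string
	Type string // "INT", "FLOAT", or "VARCHAR"
}

// inferType returns the narrowest type that fits every non-empty value.
// Columns with no non-empty values fall back to VARCHAR.
func inferType(values []string) string {
	isInt, isFloat, seen := true, true, false
	for _, v := range values {
		v = strings.TrimSpace(v)
		if v == "" {
			continue
		}
		seen = true
		if _, err := strconv.ParseInt(v, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			isFloat = false
		}
	}
	switch {
	case !seen:
		return "VARCHAR"
	case isInt:
		return "INT"
	case isFloat:
		return "FLOAT"
	default:
		return "VARCHAR"
	}
}

// inferSchema reads CSV text and infers one Column per non-reserved header.
func inferSchema(data string) ([]Column, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("empty CSV")
	}
	var cols []Column
	for i, name := range rows[0] {
		if reserved[strings.ToLower(name)] {
			continue // skip framework columns such as id
		}
		var vals []string
		for _, r := range rows[1:] {
			vals = append(vals, r[i])
		}
		cols = append(cols, Column{Name: name, Type: inferType(vals)})
	}
	return cols, nil
}

func main() {
	cols, err := inferSchema("id,age,score,city\n1,34,0.5,Berlin\n2,41,1.25,Paris\n")
	if err != nil {
		panic(err)
	}
	for _, c := range cols {
		fmt.Printf("%s:%s\n", c.Name, c.Type)
	}
}
```

With the sample CSV in `main`, the `id` column is dropped and the remaining three columns are typed from their values, which is the behaviour the review describes for a CSV carrying `id`.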
Extends `dataset push` from image_classification-only to also cover tabular_classification, tabular_regression, time_series_forecasting, and time_to_event_prediction. The Python ingestor already supports these; this adds the CLI-side flag / layout / spec surface. Validated end-to-end on a live cluster (tabular_classification, 8/8 records, 100%, rows confirmed in MySQL).
- Category dispatch (push/category.go): image vs tabular families, mirroring data-ingestors' conventions.py groupings.
- Tabular local layout (push/tabular.go): a single CSV (no sidecar files), staged via the existing machinery (CSV + empty image list).
- Schema: auto-inferred from the CSV (INT/FLOAT/VARCHAR) so customers don't hand-write one; --schema col:TYPE,... overrides. Reserved framework columns (id, data_id, ...) are skipped so a CSV carrying an id column doesn't produce a schema the ingestor rejects (the #135b guard).
- Label: string form for tabular_classification; object form with policy=bucket (default) for the regression-class categories so the raw numeric target never leaves the cluster. Added --label-policy and --time-column.
- Build() branches by category (push/spec.go); pre-flight is category-aware (data CSV + column count for tabular).
Tests: push/tabular_test.go (DiscoverTabular, InferSchema incl. reserved-skip, ParseSchema); spec_test.go (tabular Build passes the schema for all three label shapes, regression defaults to bucket); updated the unsupported-category gate test.
go build / vet / test green.
Stacked on cli#12 (the dataset-push live-ingestion fixes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
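The category dispatch and the label rule described above can be sketched as follows. This is an illustration under assumptions, not the contents of push/category.go or push/spec.go: `Family`, `familyOf`, and `buildLabel` are hypothetical names, and the object label's `column` key is a guess (only `policy` with its `bucket` default is stated in the PR). It covers the string form and the default object form only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Family groups categories that share a local layout (hypothetical type).
type Family int

const (
	FamilyUnsupported Family = iota
	FamilyImage
	FamilyTabular
)

// familyOf maps a category name to its family. The category names are the
// ones listed in the PR description.
func familyOf(category string) Family {
	switch category {
	case "image_classification":
		return FamilyImage
	case "tabular_classification", "tabular_regression",
		"time_series_forecasting", "time_to_event_prediction":
		return FamilyTabular
	default:
		return FamilyUnsupported
	}
}

// buildLabel returns the label in the shape the PR describes: a plain string
// for tabular_classification, and an object carrying a policy (default
// "bucket") for the regression-class categories. The "column" key is assumed.
func buildLabel(category, column, policy string) (interface{}, error) {
	if familyOf(category) != FamilyTabular {
		return nil, fmt.Errorf("category %q is not in the tabular family", category)
	}
	if category == "tabular_classification" {
		return column, nil
	}
	if policy == "" {
		policy = "bucket"
	}
	return map[string]string{"column": column, "policy": policy}, nil
}

func main() {
	for _, c := range []string{"tabular_classification", "tabular_regression"} {
		label, err := buildLabel(c, "target", "")
		if err != nil {
			panic(err)
		}
		out, _ := json.Marshal(label)
		fmt.Printf("%s -> %s\n", c, out)
	}
}
```

Defaulting to `bucket` in one place keeps the "raw numeric target never leaves the cluster" guarantee from depending on the caller remembering to pass a policy.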
3e5d32e to ff09484
…e-sync)
Adds the two remaining engine-supported image categories, taking the CLI
to 9/10 modalities (only semantic_segmentation remains, blocked on the
ingestor — data-ingestors#136).
- object_detection: reuses the generic sidecar walker for annotations/
(.xml). Validated live end-to-end — 128 records (bounding boxes)
ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the
CSV's Annotation column, read server-side). Adds --number-of-keypoints
(required; no default). Emits target_size + number_of_keypoints as
TOP-LEVEL fields, which the schema's keypoint conditional requires.
- Re-synced the embedded schema from data-ingestors develop. The vendored
copy was stale: it lacked keypoint's top-level target_size +
number_of_keypoints and their required-for-keypoint conditional, so the
CLI couldn't validate a keypoint spec at all. `ingest validate` and
dataset push now validate keypoint correctly.
Schema-skew findings (deployment/release hygiene, NOT CLI bugs):
* sync-schema.sh defaults to data-ingestors *master*, which is stale
(lacks both MLM and keypoint); the current schema is on *develop*.
Repoint the sync source to develop, or promote develop -> master.
(sync --check vs master flags this drift — pre-existing, surfaced here.)
* The deployed ingdemo client runs jobs-manager and the ingestor on
DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's
top-level fields; jobs-manager (older) REJECTS them as additional
properties. So keypoint can't be ingested there until both components
are refreshed to a matching schema. The CLI's emission is correct
against the current/consistent schema (unit-verified). OD is
unaffected (no new fields).
Tests: push/image_extras_test.go (DiscoverObjectDetection +
missing-annotations); spec_test.go (OD emits annotations; keypoint emits
top-level target_size + number_of_keypoints; both pass the schema);
updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.
Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
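The jobs-manager / ingestor skew described in this commit message can be reproduced in miniature. The sketch below is an analogy only: the real components validate against a JSON Schema, whereas this uses Go structs with `json.Decoder.DisallowUnknownFields` as a stand-in for a schema that forbids additional properties. The struct names, the `category` key, and the sample values (256, 17) are illustrative assumptions; only `target_size` and `number_of_keypoints` come from the text above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// oldSpec stands in for the older schema: it knows no keypoint fields.
type oldSpec struct {
	Category string `json:"category"`
}

// newSpec stands in for the newer schema with keypoint's top-level fields.
type newSpec struct {
	Category          string `json:"category"`
	TargetSize        []int  `json:"target_size"`
	NumberOfKeypoints int    `json:"number_of_keypoints"`
}

// strictDecode rejects unknown fields, similar in spirit to a schema that
// disallows additional properties.
func strictDecode(doc []byte, v interface{}) error {
	dec := json.NewDecoder(bytes.NewReader(doc))
	dec.DisallowUnknownFields()
	return dec.Decode(v)
}

// validateOld applies the older rule set: any unknown field is an error.
func validateOld(doc []byte) error {
	var s oldSpec
	return strictDecode(doc, &s)
}

// validateNew applies the newer rule set: keypoint specs must carry both
// top-level fields.
func validateNew(doc []byte) error {
	var s newSpec
	if err := strictDecode(doc, &s); err != nil {
		return err
	}
	if s.Category == "keypoint_detection" &&
		(len(s.TargetSize) == 0 || s.NumberOfKeypoints == 0) {
		return errors.New("keypoint_detection requires target_size and number_of_keypoints")
	}
	return nil
}

func main() {
	with := []byte(`{"category":"keypoint_detection","target_size":[256,256],"number_of_keypoints":17}`)
	without := []byte(`{"category":"keypoint_detection"}`)

	fmt.Println("new validator, with fields:   ", validateNew(with))
	fmt.Println("old validator, with fields:   ", validateOld(with))
	fmt.Println("new validator, without fields:", validateNew(without))
	fmt.Println("old validator, without fields:", validateOld(without))
}
```

No single keypoint document passes both validators, which is why the commit message treats this as a deployment issue to fix by refreshing both components rather than by changing what the CLI emits.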
Summary
Widens `tracebloc dataset push` from image_classification-only to the tabular / time-series family: `tabular_classification`, `tabular_regression`, `time_series_forecasting`, `time_to_event_prediction` (5 of 10 modalities now). The Python ingestor already supports these; this adds the CLI-side flag / layout / spec surface. Validated end-to-end on a live cluster (tabular_classification, 8/8 records, 100%, rows confirmed in MySQL).
What's added
- Category dispatch (`push/category.go`): image vs tabular families, mirroring data-ingestors' `conventions.py` groupings.
- Tabular local layout (`push/tabular.go`): a single CSV (no sidecar files), staged via the existing machinery (CSV + empty image list; no changes to the tar/stream path).
- Schema: auto-inferred from the CSV (`INT`/`FLOAT`/`VARCHAR`) so customers don't hand-write one; `--schema col:TYPE,...` overrides. Reserved framework columns (`id`, `data_id`, …) are skipped so a CSV carrying an `id` column doesn't produce a schema the ingestor rejects (the #135b guard in data-ingestors).
- Label: string form for `tabular_classification`; object form with `policy=bucket` (default) for the regression-class categories so the raw numeric target never leaves the cluster. Added `--label-policy` and `--time-column`.
Test plan
- `go build` / `vet` / `test ./...` green; new `push/tabular_test.go` (DiscoverTabular, InferSchema incl. reserved-skip, ParseSchema) + `spec_test.go` tabular cases (all three label shapes pass the schema; regression defaults to bucket).
- `dataset push ./tab --category tabular_classification --label-column label` → schema auto-inferred (`id` skipped) → staged → ingestor Job ran → 100% (8/8); 8 rows confirmed in `training_test_datasets.clidemo_tab_train`.
Follow-ups (not in this PR)
- Text categories (`text_classification`, `masked_language_modeling`) and image-extras (`object_detection`, `keypoint_detection`) need sidecar-dir staging; these are the next increments.
🤖 Generated with Claude Code
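As a companion to the `--schema col:TYPE,...` override mentioned in the summary, here is a minimal parsing sketch. It is not the PR's ParseSchema: the function name `parseSchema`, the case-insensitive type matching, and the restriction to exactly INT/FLOAT/VARCHAR are assumptions. The exit status 2 on a malformed value follows the review comment ("bad form → exit 2").

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// allowedTypes lists the three types the PR mentions. Whether the real CLI
// accepts others is not stated, so this set is an assumption.
var allowedTypes = map[string]bool{"INT": true, "FLOAT": true, "VARCHAR": true}

// parseSchema parses "col:TYPE,col:TYPE" into ordered (name, type) pairs.
// It returns an error for entries that are not of the form col:TYPE or that
// name a type outside allowedTypes.
func parseSchema(s string) ([][2]string, error) {
	var out [][2]string
	for _, part := range strings.Split(s, ",") {
		kv := strings.SplitN(strings.TrimSpace(part), ":", 2)
		if len(kv) != 2 || strings.TrimSpace(kv[0]) == "" {
			return nil, fmt.Errorf("bad schema entry %q: want col:TYPE", part)
		}
		name := strings.TrimSpace(kv[0])
		typ := strings.ToUpper(strings.TrimSpace(kv[1]))
		if !allowedTypes[typ] {
			return nil, fmt.Errorf("bad schema type %q for column %q", kv[1], name)
		}
		out = append(out, [2]string{name, typ})
	}
	return out, nil
}

func main() {
	cols, err := parseSchema("age:INT,score:float,city:VARCHAR")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2) // the review notes that a bad form exits with status 2
	}
	for _, c := range cols {
		fmt.Printf("%s:%s\n", c[0], c[1])
	}
}
```

Returning an ordered slice rather than a map preserves the column order the user typed, which matters if the override is later merged with the inferred schema.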