feat(dataset push): object_detection + keypoint_detection#15

Merged
aptracebloc merged 1 commit into develop from feat/dataset-push-image-extras
Jun 2, 2026

@LukasWodka
Contributor

Summary

Adds object_detection and keypoint_detection — the CLI now covers 9 of 10 modalities (only semantic_segmentation remains, blocked on data-ingestors#136). OD is live-validated (128 records, rows in MySQL); keypoint is code-complete + correct against the current schema, live-blocked by a deployment schema skew (below).

🔗 Stacked on #14 (text) → #13 (tabular) → #12 (fixes). Net-new is the feat(dataset push): object_detection + keypoint_detection … commit.

What's added

  • object_detection: reuses the generic sidecar walker for annotations/ (Pascal VOC .xml). Live: staged images + annotations → Pascal-VOC + resolution validators passed → 128 bounding-box records ingested, confirmed in MySQL.
  • keypoint_detection: labels.csv + images/ (keypoint coords are in the CSV's Annotation column, read server-side). New --number-of-keypoints (required). Emits target_size + number_of_keypoints top-level — the schema's keypoint conditional requires them there.
  • Embedded-schema re-sync (internal/schema/ingest.v1.json): the vendored copy was stale — missing keypoint's top-level fields + the conditional, so the CLI couldn't validate a keypoint spec at all. Re-synced from data-ingestors develop.
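The keypoint emission described above can be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: the `keypointSpec` struct, the `buildKeypointSpec` helper, the `category` key, and the `[width, height]` shape of `target_size` are all assumptions; only the names `target_size` and `number_of_keypoints` and their top-level placement come from this PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// keypointSpec is a hypothetical, trimmed-down ingest spec. The two keypoint
// fields sit at the top level (not under file_options), as the PR describes.
type keypointSpec struct {
	Category          string `json:"category"`            // assumed key name
	TargetSize        [2]int `json:"target_size"`         // assumed shape
	NumberOfKeypoints int    `json:"number_of_keypoints"`
}

// buildKeypointSpec marshals a keypoint spec and refuses a missing
// (non-positive) keypoint count, since the flag has no default.
func buildKeypointSpec(targetSize [2]int, numKeypoints int) ([]byte, error) {
	if numKeypoints <= 0 {
		return nil, errors.New("--number-of-keypoints is required and must be positive")
	}
	return json.Marshal(keypointSpec{
		Category:          "keypoint_detection",
		TargetSize:        targetSize,
		NumberOfKeypoints: numKeypoints,
	})
}

func main() {
	out, err := buildKeypointSpec([2]int{256, 256}, 17)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

The values 256 and 17 are placeholders chosen for the example, not figures from this PR.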

⚠️ Schema-skew findings (deployment / release hygiene — not CLI bugs)

  1. CLI sync source is stale. sync-schema.sh defaults to data-ingestors master, which lacks both MLM and keypoint; the current schema is on develop. → Repoint the sync source to develop, or promote develop→master. sync --check vs master flags this (pre-existing drift, surfaced here).
  2. Deployed jobs-manager and ingestor are on different schema versions. On ingdemo, the ingestor (newer) requires keypoint's top-level fields while jobs-manager (older) rejects them as additional properties — so keypoint can't be ingested there until both are refreshed to a matching schema. Evidenced both ways (jobs-manager HTTP 400 on top-level; ingestor HTTP 400 when the same fields were under file_options). The CLI's emission is correct for a consistent/current deployment.
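To make finding 2 concrete, here is a toy model of the skew. Both validators are hypothetical stand-ins written for this illustration (the real components validate against JSON Schema, and the allowed-key set here is invented); the point is only that no placement of the keypoint fields satisfies an older strict schema and a newer one at the same time.

```go
package main

import (
	"fmt"
	"sort"
)

const keypoint = "keypoint_detection"

// olderValidate stands in for a component whose schema predates the keypoint
// fields and forbids additional top-level properties (hypothetical key set).
func olderValidate(spec map[string]any) error {
	allowed := map[string]bool{"category": true, "file_options": true}
	keys := make([]string, 0, len(spec))
	for k := range spec {
		keys = append(keys, k)
	}
	sort.Strings(keys) // report the first offending key deterministically
	for _, k := range keys {
		if !allowed[k] {
			return fmt.Errorf("additional property %q is not allowed", k)
		}
	}
	return nil
}

// newerValidate stands in for a component on the current schema, where the
// keypoint conditional requires both fields at the top level.
func newerValidate(spec map[string]any) error {
	if spec["category"] != keypoint {
		return nil
	}
	for _, k := range []string{"target_size", "number_of_keypoints"} {
		if _, ok := spec[k]; !ok {
			return fmt.Errorf("missing required top-level property %q", k)
		}
	}
	return nil
}

func main() {
	fields := map[string]any{"target_size": []int{256, 256}, "number_of_keypoints": 17}
	candidates := []struct {
		name string
		spec map[string]any
	}{
		{"top-level", map[string]any{"category": keypoint, "target_size": fields["target_size"], "number_of_keypoints": fields["number_of_keypoints"]}},
		{"under file_options", map[string]any{"category": keypoint, "file_options": fields}},
	}
	for _, c := range candidates {
		fmt.Printf("%s: older=%v newer=%v\n", c.name, olderValidate(c.spec), newerValidate(c.spec))
	}
}
```

Each candidate is rejected by exactly one of the two validators, which mirrors the pair of HTTP 400s reported above.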

Test plan

  • go build/vet/test ./... green; push/image_extras_test.go (OD discover + missing-annotations); spec_test.go (OD emits annotations; keypoint emits top-level target_size + number_of_keypoints; both pass the schema).
  • Live (OD): dataset push ./od --category object_detection --label-column image_label → 128 records, 100%, rows in training_test_datasets.clidemo_od3_train.
  • keypoint: unit-validated against the current schema; live run blocked by schema-skew finding 2 above (jobs-manager and ingestor on different schema versions).

Matrix after this PR

✅ live: image_classification, object_detection, tabular_classification, tabular_regression, time_series_forecasting, time_to_event_prediction, text_classification
✅ code-complete (live-blocked by deployment skew): keypoint_detection, masked_language_modeling
⏸ semantic_segmentation — blocked on data-ingestors#136

🤖 Generated with Claude Code

@LukasWodka
Contributor Author

👋 Heads-up — Code review queue is at 20 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Contributor

@aptracebloc aptracebloc left a comment

Approve ✅ — object_detection + keypoint + schema re-sync

Brings the CLI to 9/10 modalities and resolves the schema-drift on the rest of the stack (this is the PR that re-syncs internal/schema/ingest.v1.json; the Schema drift check is green here, and develop is clean once this lands).

Verified:

  • object_detection: annotations/ (Pascal VOC .xml) packaged via the shared sidecar walker — 9 files staged (4 img + 4 xml + labels.csv); reaches the ingestor's Pascal-VOC validator on a live run.
  • keypoint_detection: spec synthesis validates against the re-synced schema (dry-run) with top-level target_size + number_of_keypoints (the keypoint conditional); missing --number-of-keypoints → exit 2.
  • Embedded schema re-sync matches data-ingestors develop (keypoint/MLM fields present); per the PR description, master still lacks both.

Known (deployment, not CLI): keypoint can't ingest on the current deployed jobs-manager (older embedded schema rejects the top-level fields) — same ops follow-up as #14's MLM. The CLI's emission is correct for a current deployment.

Merge last in the stack (#12 → #13 → #14 → #15); this one turns develop fully green.

@LukasWodka
Contributor Author

👋 Heads-up — Code review queue is at 26 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

  • averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
  • cli#12 — fix(dataset push): make live ingestion work end-to-end · author: @LukasWodka · no reviewer assigned
  • cli#13 — feat(dataset push): tabular / time-series modality family · author: @LukasWodka · no reviewer assigned
  • cli#14 — feat(dataset push): text family + generic sidecar staging · author: @LukasWodka · no reviewer assigned
  • cli#16 — test(cli): coverage wins (preflight/progress/errors) + smoke-test hardening · author: @LukasWodka · no reviewer assigned
  • cli#17 — test(cli): integration harness for the real-I/O seams (kind e2e) · author: @LukasWodka · no reviewer assigned
  • client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
  • client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
  • client-runtime#65 — fix(#64): re-sync jobs-manager ingest schema (accept masked_language_modeling) + anti-drift · author: @LukasWodka · no reviewer assigned
  • client-runtime#67 — ci: publish jobs-manager images on merge (closes the deploy-delivery gap) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…e-sync)

Adds the two remaining engine-supported image categories, taking the CLI
to 9/10 modalities (only semantic_segmentation remains, blocked on the
ingestor — data-ingestors#136).

- object_detection: reuses the generic sidecar walker for annotations/
  (.xml). Validated live end-to-end — 128 records (bounding boxes)
  ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the
  CSV's Annotation column, read server-side). Adds --number-of-keypoints
  (required; no default). Emits target_size + number_of_keypoints as
  TOP-LEVEL fields, which the schema's keypoint conditional requires.

- Re-synced the embedded schema from data-ingestors develop. The vendored
  copy was stale: it lacked keypoint's top-level target_size +
  number_of_keypoints and their required-for-keypoint conditional, so the
  CLI couldn't validate a keypoint spec at all. `ingest validate` and
  dataset push now validate keypoint correctly.

Schema-skew findings (deployment/release hygiene, NOT CLI bugs):
  * sync-schema.sh defaults to data-ingestors *master*, which is stale
    (lacks both MLM and keypoint); the current schema is on *develop*.
    Repoint the sync source to develop, or promote develop -> master.
    (sync --check vs master flags this drift — pre-existing, surfaced here.)
  * The deployed ingdemo client runs jobs-manager and the ingestor on
    DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's
    top-level fields; jobs-manager (older) REJECTS them as additional
    properties. So keypoint can't be ingested there until both components
    are refreshed to a matching schema. The CLI's emission is correct
    against the current/consistent schema (unit-verified). OD is
    unaffected (no new fields).

Tests: push/image_extras_test.go (DiscoverObjectDetection +
missing-annotations); spec_test.go (OD emits annotations; keypoint emits
top-level target_size + number_of_keypoints; both pass the schema);
updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.

Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@aptracebloc aptracebloc force-pushed the feat/dataset-push-image-extras branch from 56c9d8d to 3595665 Compare June 2, 2026 11:04
@aptracebloc aptracebloc merged commit 67ecb78 into develop Jun 2, 2026
8 checks passed