Skip to content

fix(dataset push): make live ingestion work end-to-end#12

Merged
aptracebloc merged 1 commit into
developfrom
fix/dataset-push-e2e-fixes
Jun 2, 2026
Merged

fix(dataset push): make live ingestion work end-to-end#12
aptracebloc merged 1 commit into
developfrom
fix/dataset-push-e2e-fixes

Conversation

@LukasWodka
Copy link
Copy Markdown
Contributor

Summary

The first real-cluster run of tracebloc dataset push (against a live tracebloc/client release) surfaced three issues the mock-only test suite couldn't catch. This PR fixes all three. End-to-end validated: image_classification, 6/6 records, 100% success, rows confirmed in MySQL, metadata sent to the backend API.

Fixes

1. Staging collided with the ingestor's DEST_PATH (the blocker)

The CLI staged source files into /data/shared/<table> — exactly the ingestor's DEST_PATH (data-ingestors config.DEST_PATH = STORAGE_PATH/TABLE_NAME). The ingestor's DuplicateValidator rejects a pre-existing, non-empty destination, so the CLI's own staging tripped the check on every run (Destination directory '/data/shared/<table>' already exists).

StagedPrefix now stages under /data/shared/.tracebloc-staging/<table>, leaving the destination fresh. Added FinalDestPrefix for the real destination shown in pre-flight.

2. Image target_size had no override

image_classification defaults to 512x512 and the ingestor validates the resolution (it does not resize) — so any other size hard-failed the Image Resolution Validator with no escape from the CLI.

→ Added --target-size WxH, plus auto-detect from the first image when omitted (emitted as spec.file_options.target_size).

3. Watch treated a broken log stream as fatal

A Pod replaced / restarted / deleted mid-follow (e.g. a backoffLimit retry) returned exit 9 even when the Job ultimately succeeded.

→ The Job's terminal status is now the source of truth; a non-ctx stream error is fatal only when the Job outcome can't be determined.

Test plan

  • go build ./... && go vet ./... && go test ./... — all green.
  • New/updated unit tests: push/detect_test.go (ParseTargetSize, DetectImageSize); spec_test.go (staging≠dest regression + target_size passes schema); watch_test.go (Job status wins on stream break); stream_test.go (derives paths from StagedPrefix).
  • Live: built the binary and ran dataset push end-to-end against a real client → staged → submitted → ingestor Job ran → INGESTION SUMMARY 100%; 6 rows confirmed in training_test_datasets.clidemo_clean_train.

Notes / out of scope

  • The live run also re-confirmed two environment/chart issues not addressed here (tracked separately): the pinned INGESTOR_IMAGE_DIGEST is amd64-only (arm64 nodes can't pull it), and the jobs-manager SA needs system:auth-delegator for TokenReview. Neither is a CLI bug — the CLI submits and watches correctly.
  • The README still says "Pre-alpha, Phase 0"; a docs refresh + v0.1 release is the planned follow-up.

🤖 Generated with Claude Code

The first real-cluster run of `dataset push` surfaced three issues that
the mock-only test suite couldn't catch. All three are now validated
end-to-end against a live tracebloc/client release (image_classification,
6/6 records, 100% success, rows confirmed in MySQL).

1. Staging collided with the ingestor's DEST_PATH (the blocker). The CLI
   staged source files into /data/shared/<table>, which is exactly the
   ingestor's DEST_PATH (data-ingestors config.DEST_PATH =
   STORAGE_PATH/TABLE_NAME). Its DuplicateValidator rejects a
   pre-existing, non-empty destination, so the CLI's own staging tripped
   the check on every run. StagedPrefix now stages under
   /data/shared/.tracebloc-staging/<table>, leaving the destination
   fresh. Added FinalDestPrefix for the real destination shown in
   pre-flight.

2. Image target_size had no override. image_classification defaults to
   512x512 and the ingestor VALIDATES the resolution (it does not
   resize), so any other size hard-failed the Image Resolution
   Validator. Added --target-size WxH, plus auto-detect from the first
   image when the flag is omitted, emitted as spec.file_options.target_size.

3. Watch treated a broken log stream as fatal. A Pod replaced/restarted/
   deleted mid-follow (e.g. a backoffLimit retry) returned exit 9 even
   when the Job ultimately succeeded. The Job's terminal status is now
   the source of truth: a non-ctx stream error is only fatal if the Job
   outcome can't be determined.

Tests: new push/detect_test.go (ParseTargetSize, DetectImageSize);
spec_test.go (staging != dest regression, target_size passes schema);
watch_test.go (Job status wins on stream break); stream_test.go updated
to derive paths from StagedPrefix. go build / vet / test all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka
Copy link
Copy Markdown
Contributor Author

👋 Heads-up — Code review queue is at 18 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Copy link
Copy Markdown
Contributor

@aptracebloc aptracebloc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve ✅ — live-validated end-to-end

Tested the whole stack from this branch's binary against a real tracebloc/client release (EKS dev, chart 1.4.2, amd64).

This PR's three fixes, verified:

  • Staging path — files land under /data/shared/.tracebloc-staging/<table> (confirmed on the PVC); the ingestor's DEST_PATH stays fresh, so DuplicateValidator passes on a real run. No collision.
  • --target-size + auto-detect — auto-detected 64x64 from the first image; explicit --target-size/W,H parsing validated, bad values → exit 2.
  • Watch resilience — a live run hit Retrying (attempt 1) and the watch still reported success from the Job's terminal status → exit 0, 240/240 ingested.

Also confirmed: multi-file tar stream, SIGINT-safe stage-Pod cleanup (no orphans), and the full exit-code contract (2/3 guards all fire correctly).

Note (non-blocking): the Schema drift check is red here only because the embedded-schema re-sync lands in #15 (this branch doesn't touch ingest.v1.json). It's an UNSTABLE/non-required check and goes green once #15 merges. Merging in stack order #12#13#14#15.

@LukasWodka
Copy link
Copy Markdown
Contributor Author

👋 Heads-up — Code review queue is at 26 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

  • averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
  • cli#13 — feat(dataset push): tabular / time-series modality family · author: @LukasWodka · no reviewer assigned
  • cli#14 — feat(dataset push): text family + generic sidecar staging · author: @LukasWodka · no reviewer assigned
  • cli#15 — feat(dataset push): object_detection + keypoint_detection · author: @LukasWodka · no reviewer assigned
  • cli#16 — test(cli): coverage wins (preflight/progress/errors) + smoke-test hardening · author: @LukasWodka · no reviewer assigned
  • cli#17 — test(cli): integration harness for the real-I/O seams (kind e2e) · author: @LukasWodka · no reviewer assigned
  • client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
  • client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
  • client-runtime#65 — fix(#64): re-sync jobs-manager ingest schema (accept masked_language_modeling) + anti-drift · author: @LukasWodka · no reviewer assigned
  • client-runtime#67 — ci: publish jobs-manager images on merge (closes the deploy-delivery gap) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@aptracebloc aptracebloc merged commit 806f17b into develop Jun 2, 2026
13 of 14 checks passed
aptracebloc pushed a commit that referenced this pull request Jun 2, 2026
…e-sync)

Adds the two remaining engine-supported image categories, taking the CLI
to 9/10 modalities (only semantic_segmentation remains, blocked on the
ingestor — data-ingestors#136).

- object_detection: reuses the generic sidecar walker for annotations/
  (.xml). Validated live end-to-end — 128 records (bounding boxes)
  ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the
  CSV's Annotation column, read server-side). Adds --number-of-keypoints
  (required; no default). Emits target_size + number_of_keypoints as
  TOP-LEVEL fields, which the schema's keypoint conditional requires.

- Re-synced the embedded schema from data-ingestors develop. The vendored
  copy was stale: it lacked keypoint's top-level target_size +
  number_of_keypoints and their required-for-keypoint conditional, so the
  CLI couldn't validate a keypoint spec at all. `ingest validate` and
  dataset push now validate keypoint correctly.

Schema-skew findings (deployment/release hygiene, NOT CLI bugs):
  * sync-schema.sh defaults to data-ingestors *master*, which is stale
    (lacks both MLM and keypoint); the current schema is on *develop*.
    Repoint the sync source to develop, or promote develop -> master.
    (sync --check vs master flags this drift — pre-existing, surfaced here.)
  * The deployed ingdemo client runs jobs-manager and the ingestor on
    DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's
    top-level fields; jobs-manager (older) REJECTS them as additional
    properties. So keypoint can't be ingested there until both components
    are refreshed to a matching schema. The CLI's emission is correct
    against the current/consistent schema (unit-verified). OD is
    unaffected (no new fields).

Tests: push/image_extras_test.go (DiscoverObjectDetection +
missing-annotations); spec_test.go (OD emits annotations; keypoint emits
top-level target_size + number_of_keypoints; both pass the schema);
updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.

Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
aptracebloc pushed a commit that referenced this pull request Jun 2, 2026
…e-sync) (#15)

Adds the two remaining engine-supported image categories, taking the CLI
to 9/10 modalities (only semantic_segmentation remains, blocked on the
ingestor — data-ingestors#136).

- object_detection: reuses the generic sidecar walker for annotations/
  (.xml). Validated live end-to-end — 128 records (bounding boxes)
  ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the
  CSV's Annotation column, read server-side). Adds --number-of-keypoints
  (required; no default). Emits target_size + number_of_keypoints as
  TOP-LEVEL fields, which the schema's keypoint conditional requires.

- Re-synced the embedded schema from data-ingestors develop. The vendored
  copy was stale: it lacked keypoint's top-level target_size +
  number_of_keypoints and their required-for-keypoint conditional, so the
  CLI couldn't validate a keypoint spec at all. `ingest validate` and
  dataset push now validate keypoint correctly.

Schema-skew findings (deployment/release hygiene, NOT CLI bugs):
  * sync-schema.sh defaults to data-ingestors *master*, which is stale
    (lacks both MLM and keypoint); the current schema is on *develop*.
    Repoint the sync source to develop, or promote develop -> master.
    (sync --check vs master flags this drift — pre-existing, surfaced here.)
  * The deployed ingdemo client runs jobs-manager and the ingestor on
    DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's
    top-level fields; jobs-manager (older) REJECTS them as additional
    properties. So keypoint can't be ingested there until both components
    are refreshed to a matching schema. The CLI's emission is correct
    against the current/consistent schema (unit-verified). OD is
    unaffected (no new fields).

Tests: push/image_extras_test.go (DiscoverObjectDetection +
missing-annotations); spec_test.go (OD emits annotations; keypoint emits
top-level target_size + number_of_keypoints; both pass the schema);
updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.

Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants