fix(dataset push): make live ingestion work end-to-end#12
Conversation
The first real-cluster run of `dataset push` surfaced three issues that the mock-only test suite couldn't catch. All three are now validated end-to-end against a live tracebloc/client release (image_classification, 6/6 records, 100% success, rows confirmed in MySQL). 1. Staging collided with the ingestor's DEST_PATH (the blocker). The CLI staged source files into /data/shared/<table>, which is exactly the ingestor's DEST_PATH (data-ingestors config.DEST_PATH = STORAGE_PATH/TABLE_NAME). Its DuplicateValidator rejects a pre-existing, non-empty destination, so the CLI's own staging tripped the check on every run. StagedPrefix now stages under /data/shared/.tracebloc-staging/<table>, leaving the destination fresh. Added FinalDestPrefix for the real destination shown in pre-flight. 2. Image target_size had no override. image_classification defaults to 512x512 and the ingestor VALIDATES the resolution (it does not resize), so any other size hard-failed the Image Resolution Validator. Added --target-size WxH, plus auto-detect from the first image when the flag is omitted, emitted as spec.file_options.target_size. 3. Watch treated a broken log stream as fatal. A Pod replaced/restarted/ deleted mid-follow (e.g. a backoffLimit retry) returned exit 9 even when the Job ultimately succeeded. The Job's terminal status is now the source of truth: a non-ctx stream error is only fatal if the Job outcome can't be determined. Tests: new push/detect_test.go (ParseTargetSize, DetectImageSize); spec_test.go (staging != dest regression, target_size passes schema); watch_test.go (Job status wins on stream break); stream_test.go updated to derive paths from StagedPrefix. go build / vet / test all green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
👋 Heads-up — Code review queue is at 18 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
aptracebloc
left a comment
There was a problem hiding this comment.
Approve ✅ — live-validated end-to-end
Tested the whole stack from this branch's binary against a real tracebloc/client release (EKS dev, chart 1.4.2, amd64).
This PR's three fixes, verified:
- Staging path — files land under
/data/shared/.tracebloc-staging/<table>(confirmed on the PVC); the ingestor'sDEST_PATHstays fresh, so DuplicateValidator passes on a real run. No collision. --target-size+ auto-detect — auto-detected64x64from the first image; explicit--target-size/W,Hparsing validated, bad values → exit 2.- Watch resilience — a live run hit
Retrying (attempt 1)and the watch still reported success from the Job's terminal status → exit 0,240/240ingested.
Also confirmed: multi-file tar stream, SIGINT-safe stage-Pod cleanup (no orphans), and the full exit-code contract (2/3 guards all fire correctly).
Note (non-blocking): the Schema drift check is red here only because the embedded-schema re-sync lands in #15 (this branch doesn't touch ingest.v1.json). It's an UNSTABLE/non-required check and goes green once #15 merges. Merging in stack order #12→#13→#14→#15.
|
👋 Heads-up — Code review queue is at 26 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
…e-sync)
Adds the two remaining engine-supported image categories, taking the CLI
to 9/10 modalities (only semantic_segmentation remains, blocked on the
ingestor — data-ingestors#136).
- object_detection: reuses the generic sidecar walker for annotations/
(.xml). Validated live end-to-end — 128 records (bounding boxes)
ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the
CSV's Annotation column, read server-side). Adds --number-of-keypoints
(required; no default). Emits target_size + number_of_keypoints as
TOP-LEVEL fields, which the schema's keypoint conditional requires.
- Re-synced the embedded schema from data-ingestors develop. The vendored
copy was stale: it lacked keypoint's top-level target_size +
number_of_keypoints and their required-for-keypoint conditional, so the
CLI couldn't validate a keypoint spec at all. `ingest validate` and
dataset push now validate keypoint correctly.
Schema-skew findings (deployment/release hygiene, NOT CLI bugs):
* sync-schema.sh defaults to data-ingestors *master*, which is stale
(lacks both MLM and keypoint); the current schema is on *develop*.
Repoint the sync source to develop, or promote develop -> master.
(sync --check vs master flags this drift — pre-existing, surfaced here.)
* The deployed ingdemo client runs jobs-manager and the ingestor on
DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's
top-level fields; jobs-manager (older) REJECTS them as additional
properties. So keypoint can't be ingested there until both components
are refreshed to a matching schema. The CLI's emission is correct
against the current/consistent schema (unit-verified). OD is
unaffected (no new fields).
Tests: push/image_extras_test.go (DiscoverObjectDetection +
missing-annotations); spec_test.go (OD emits annotations; keypoint emits
top-level target_size + number_of_keypoints; both pass the schema);
updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.
Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e-sync) (#15) Adds the two remaining engine-supported image categories, taking the CLI to 9/10 modalities (only semantic_segmentation remains, blocked on the ingestor — data-ingestors#136). - object_detection: reuses the generic sidecar walker for annotations/ (.xml). Validated live end-to-end — 128 records (bounding boxes) ingested, rows confirmed in MySQL. - keypoint_detection: labels.csv + images/ (keypoint coords live in the CSV's Annotation column, read server-side). Adds --number-of-keypoints (required; no default). Emits target_size + number_of_keypoints as TOP-LEVEL fields, which the schema's keypoint conditional requires. - Re-synced the embedded schema from data-ingestors develop. The vendored copy was stale: it lacked keypoint's top-level target_size + number_of_keypoints and their required-for-keypoint conditional, so the CLI couldn't validate a keypoint spec at all. `ingest validate` and dataset push now validate keypoint correctly. Schema-skew findings (deployment/release hygiene, NOT CLI bugs): * sync-schema.sh defaults to data-ingestors *master*, which is stale (lacks both MLM and keypoint); the current schema is on *develop*. Repoint the sync source to develop, or promote develop -> master. (sync --check vs master flags this drift — pre-existing, surfaced here.) * The deployed ingdemo client runs jobs-manager and the ingestor on DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's top-level fields; jobs-manager (older) REJECTS them as additional properties. So keypoint can't be ingested there until both components are refreshed to a matching schema. The CLI's emission is correct against the current/consistent schema (unit-verified). OD is unaffected (no new fields). Tests: push/image_extras_test.go (DiscoverObjectDetection + missing-annotations); spec_test.go (OD emits annotations; keypoint emits top-level target_size + number_of_keypoints; both pass the schema); updated the unsupported-category gate test (now segmentation only). go build / vet / test green. Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
The first real-cluster run of
tracebloc dataset push(against a livetracebloc/clientrelease) surfaced three issues the mock-only test suite couldn't catch. This PR fixes all three. End-to-end validated:image_classification, 6/6 records, 100% success, rows confirmed in MySQL, metadata sent to the backend API.Fixes
1. Staging collided with the ingestor's
DEST_PATH(the blocker)The CLI staged source files into
/data/shared/<table>— exactly the ingestor'sDEST_PATH(data-ingestors config.DEST_PATH = STORAGE_PATH/TABLE_NAME). The ingestor'sDuplicateValidatorrejects a pre-existing, non-empty destination, so the CLI's own staging tripped the check on every run (Destination directory '/data/shared/<table>' already exists).→
StagedPrefixnow stages under/data/shared/.tracebloc-staging/<table>, leaving the destination fresh. AddedFinalDestPrefixfor the real destination shown in pre-flight.2. Image
target_sizehad no overrideimage_classificationdefaults to512x512and the ingestor validates the resolution (it does not resize) — so any other size hard-failed the Image Resolution Validator with no escape from the CLI.→ Added
--target-size WxH, plus auto-detect from the first image when omitted (emitted asspec.file_options.target_size).3. Watch treated a broken log stream as fatal
A Pod replaced / restarted / deleted mid-follow (e.g. a
backoffLimitretry) returned exit 9 even when the Job ultimately succeeded.→ The Job's terminal status is now the source of truth; a non-ctx stream error is fatal only when the Job outcome can't be determined.
Test plan
go build ./... && go vet ./... && go test ./...— all green.push/detect_test.go(ParseTargetSize,DetectImageSize);spec_test.go(staging≠dest regression +target_sizepasses schema);watch_test.go(Job status wins on stream break);stream_test.go(derives paths fromStagedPrefix).dataset pushend-to-end against a real client → staged → submitted → ingestor Job ran → INGESTION SUMMARY 100%; 6 rows confirmed intraining_test_datasets.clidemo_clean_train.Notes / out of scope
INGESTOR_IMAGE_DIGESTis amd64-only (arm64 nodes can't pull it), and the jobs-manager SA needssystem:auth-delegatorfor TokenReview. Neither is a CLI bug — the CLI submits and watches correctly.🤖 Generated with Claude Code