Skip to content

feat(dataset): pre-flight for dataset push (PR-a of #151)#8

Merged
saadqbal merged 5 commits into
developfrom
feat/151-dataset-push-preflight
May 22, 2026
Merged

feat(dataset): pre-flight for dataset push (PR-a of #151)#8
saadqbal merged 5 commits into
developfrom
feat/151-dataset-push-preflight

Conversation

@saadqbal
Copy link
Copy Markdown
Collaborator

@saadqbal saadqbal commented May 21, 2026

Summary

Phase 3 (tracebloc/client#151) lands in two PRs to keep diffs reviewable.
This PR-a is the no-op-safe pre-flight: synthesize the ingest spec from flags, validate against the embedded ingest.v1 schema, walk the local directory, discover the cluster's parent release + shared PVC, print a summary. Then either stop (--dry-run) or fail cleanly with "wait for PR-b" so a customer running PR-a's binary doesn't see "0 files transferred" confusion.

PR-b (next) plugs the novel-engineering core into the TODO: PR-b branch at the end of runDatasetPush: ephemeral stage Pod (alpine 3.20 pinned by digest), client-go remotecommand SPDY executor + tar-over-exec stream, schollz/progressbar/v3, SIGINT-safe cleanup.

What this PR adds

  • internal/push/ (new package): SpecArgs.Build() (flag→spec, schema-driven), Discover() (layout + size caps: 1 GiB total, 500 MiB per file, accepts .jpg/.jpeg/.png/.webp case-insensitive).
  • internal/cluster/pvc.go: DiscoverSharedPVC() against the chart's client-pvc (hardcoded by _helpers.tpl), verifies Bound phase, surfaces access modes for PR-b's stage-Pod scheduling warning.
  • internal/cli/dataset.go: tracebloc dataset push <local-path> with --table/--category/--intent/--label-column/--dry-run/--ingestor-sa plus the kubeconfig flag trio. Pre-flight ordering "fail fast, fail local" — every check that doesn't need the cluster runs first.

Exit code contract

Code Cause
0 --dry-run succeeded (or — once PR-b lands — the full push)
2 Schema validation failed on the synthesized spec
3 Local-layout or kubeconfig error
4 Cluster reachable but parent release / shared PVC missing
6 Pre-flight succeeded but staging not yet implemented (PR-b will deliver)

Exit 6 is new and distinct from Phase 2's exit 5 (no usable SA token) so the next PR's tests can assert the exact code without conflicting with Phase 2's contract.

Why image_classification only

Per epic tracebloc/client#147 v0.1 non-goals — other categories are one-PR additions in v0.2. --category does not pre-validate; the embedded schema's enum check is the single source of truth (avoiding Go-side drift the moment data-ingestors adds a new category).

Test plan

  • go vet ./... — green
  • go test -race -cover ./... — green (push 81.0%, cluster 83.2%, schema 80.7%, cli 52.0%)
  • gofmt -s -l . — no drift
  • errcheck ./... — green
  • Binary smoke: tracebloc dataset push <real-layout> --kubeconfig=/missing → exit 3 with clear error chain
  • Schema-failure smoke: --intent=potato → exit 2 with "value must be one of 'train', 'test'" on stderr
  • Real-cluster --dry-run against EKS (deferred to PR-b's full integration test)

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com


Note

Medium Risk
Adds a new CLI subcommand that validates user input, walks local files, and queries Kubernetes for the parent release/PVC; while it’s pre-flight only, it touches filesystem and cluster discovery paths that can affect user workflows and error/exit-code contracts.

Overview
Introduces a new tracebloc dataset push <local-path> command that performs a no-op-safe pre-flight: validates --table against path traversal, synthesizes an ingest spec from flags and validates it against the embedded schema, walks the local dataset layout (including v0.1 size caps), discovers the target cluster’s parent release and shared PVC, then prints a single-screen readiness summary.

Adds a new internal/push package for spec construction (SpecArgs.Build, StagedPrefix, ValidateTableName) and local layout discovery (Discover, HumanBytes), plus cluster.DiscoverSharedPVC to verify client-pvc exists and is Bound (with access-mode warning support). When run without --dry-run, the command currently exits with code 6 to explicitly signal staging is not implemented until PR-b; comprehensive tests pin exit codes and key error paths.

Reviewed by Cursor Bugbot for commit b1419f2. Bugbot is set up for automated code reviews on this repo. Configure here.

Phase 3 (tracebloc/client#151) lands in two PRs to keep diffs
reviewable. PR-a (this one) is the no-op-safe path: synthesize the
ingest spec from flags, validate against the embedded ingest.v1
schema, walk the local directory, discover the cluster's parent
release + shared PVC. Print a summary, then either stop (--dry-run)
or fail cleanly with "wait for PR-b" so a customer who pulled the
PR-a binary doesn't see "0 files transferred" confusion.

PR-b adds the novel-engineering piece: ephemeral stage Pod (alpine
3.20 pinned by digest), client-go remotecommand SPDY executor + tar
stream, schollz/progressbar/v3, SIGINT-safe cleanup. It plugs into
the "TODO: PR-b" branch at the end of runDatasetPush() without
touching anything upstream.

Surface area added:

  internal/push/                       (new package)
    spec.go         SpecArgs -> ingest.v1.json-conforming map.
                    Path is leave-validation-to-schema; the schema
                    is the single source of truth.
    walk.go         Discover() walks <root>/labels.csv +
                    <root>/images/, enforces v0.1 size caps
                    (1 GiB total, 500 MiB per file), accepts
                    {.jpg,.jpeg,.png,.webp} case-insensitively.

  internal/cluster/pvc.go              (new file)
    DiscoverSharedPVC reads the chart's `client-pvc` (hardcoded by
    _helpers.tpl, not parameterized) in the namespace, verifies it
    is Bound, returns access modes. IsReadWriteMany() drives the
    stage-Pod scheduling warning in PR-b.

  internal/cli/dataset.go              (new file)
    `tracebloc dataset push <local-path>` command. Order:
      1. flag->spec synthesize + schema-validate (stderr diagnostic,
         exit 2 — same wording style as ingest validate)
      2. walk local layout + size caps (exit 3)
      3. load kubeconfig + clientset (exit 3)
      4. discover parent release (exit 4)
      5. discover shared PVC (exit 4)
      6. print pre-flight summary
      7. --dry-run stop, or exit 6 "wait for PR-b"

  internal/cli/root.go                 (1-line wire-up)
    Register the new command.

Why image_classification only: the epic (tracebloc/client#147) v0.1
non-goals defer other categories to v0.2 as one-PR additions. The
--category flag does NOT pre-validate so the schema's enum check
produces the canonical error (rather than CLI-side drift).

Why exit code 6 for "not implemented yet": distinct from the
existing schema (2), local (3), discovery (4), and Phase-2's
token-mint (5) codes. Tests assert the exact code so the contract
sticks across PR-b's refactor.

Tests:
  spec_test.go        Build() ↔ embedded schema contract (the
                      core "this must produce something the
                      validator accepts" pin)
  walk_test.go        Layout + size caps, all happy + sad paths
                      (case-insensitive ext, skip .DS_Store,
                      missing files, byte-sum sanity)
  pvc_test.go         Fake-clientset coverage: happy path,
                      not-found diagnostic, Pending-phase
                      diagnostic, mixed-mode IsReadWriteMany
  dataset_test.go     CLI integration: schema-fail exit 2,
                      walk-fail exit 3, kubeconfig-fail exit 3,
                      cobra args check

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.
Coverage: push 81.0%, cluster 83.2%, schema 80.7%, cli 52.0%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWodka
Copy link
Copy Markdown
Contributor

👋 Heads-up — Code review queue is at 17 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Comment thread internal/cli/dataset.go Outdated
Cursor Bugbot (Low severity) on PR #8: pluralS() in dataset.go was
an exact duplicate of plural() already defined in ingest.go — same
cli package, identical body. Removed pluralS, the schema-error
diagnostic now calls plural() directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/spec.go
Cursor Bugbot on PR #8 (commit 4240097), Medium severity: the
`table` value flows unsanitized into the /data/shared/<table>/ PVC
path. The embedded ingest.v1 schema only enforces minLength:1 on
`table` — no pattern — so --table=../../etc would resolve (via
path-cleaning) to /etc, and PR-b's stage Pod would write outside
the intended subtree, potentially clobbering another table's
staged data.

Fix:

  - push.ValidateTableName(table): rejects anything not matching
    ^[A-Za-z0-9_]+$ — the intersection of "valid unquoted MySQL
    identifier" and "safe single path segment". Called as step 1
    of runDatasetPush, before SpecArgs.Build().

  - StagedPrefix hardened: drops path.Join (whose ".."-cleaning
    is the silent footgun) for plain concatenation, and panics if
    handed an unsafe name. Callers MUST ValidateTableName first;
    the panic is a defense-in-depth backstop so a PR-b call site
    that forgets validation fails loudly in tests instead of
    silently escaping the prefix.

  - Build() doc gains the precondition note.

The upstream half — adding a `pattern` constraint to the schema's
`table` field so the helm flow + jobs-manager get the same guard —
is filed as tracebloc/data-ingestors#116. Once that lands and the
CLI re-syncs the schema, ValidateTableName collapses to a thin
wrapper.

Tests:
  - TestValidateTableName_Accepts / _RejectsUnsafe: the security
    regression pin (../../etc, ../foo, slashes, dots, etc.)
  - TestStagedPrefix_PanicsOnUnsafeName: the backstop
  - TestDatasetPush_TraversalTableName_ExitsTwo: CLI-layer,
    --table=../../etc → exit 2 before any cluster work

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.
Coverage: push 83.3%, cluster 83.2%, schema 80.7%, cli 52.2%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/cli/dataset.go Outdated
Cursor Bugbot on PR #8 (commit 837157f), Low severity:
humanBytesForSummary in dataset.go was a line-for-line copy of
humanBytes in walk.go — both new in this PR. A future tweak (TiB
support, precision change) would likely patch one and not the
other, drifting the size shown in an over-cap error from the size
shown in the dry-run summary.

Fix: export push.HumanBytes as the single implementation; delete
the dataset.go copy. Both the over-cap diagnostics and the
pre-flight summary now format through the same function.

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal self-assigned this May 22, 2026
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9915866. Configure here.

Comment thread internal/push/walk.go
Cursor Bugbot on PR #8 (commit 9915866), Low severity: Discover
checked imagesStat.IsDir() for the images/ entry but never the
reverse for labels.csv. A directory literally named "labels.csv"
passes os.Stat, so the pre-flight would accept it and PR-b's tar
stream would later fail confusingly trying to read a directory as
a CSV.

Added the symmetric labelsStat.IsDir() guard with a clear error.
TestDiscover_LabelsCSVIsDirectory pins it.

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.
Coverage: push 83.8%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal merged commit d2a7906 into develop May 22, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants