Skip to content

feat(dataset): stage Pod + tar-over-exec stream (PR-b of #151)#9

Merged
saadqbal merged 12 commits into
developfrom
feat/151-dataset-push-stream
May 22, 2026
Merged

feat(dataset): stage Pod + tar-over-exec stream (PR-b of #151)#9
saadqbal merged 12 commits into
developfrom
feat/151-dataset-push-stream

Conversation

@saadqbal
Copy link
Copy Markdown
Collaborator

@saadqbal saadqbal commented May 22, 2026

Summary

Completes Phase 3 (tracebloc/client#151). PR-a (#8) delivered the no-op-safe pre-flight; this PR-b plugs in the actual file staging.

Flow now end-to-end:

  1. validate table name + flag→spec synth + v0.1 category gate (NEW)
  2. schema validation
  3. local layout walk + size caps
  4. kubeconfig + parent release + shared PVC discovery
  5. pre-flight summary
  6. orphan-Pod scan + warning (NEW)
  7. create ephemeral stage Pod (alpine 3.20 by digest, PSA-restricted) (NEW)
  8. wait Ready with image-pull / scheduler-failure hints (NEW)
  9. tar-over-exec stream via client-go SPDY executor, with progress bar (NEW)
  10. defer-delete the Pod with a fresh ctx (SIGINT-safe) (NEW)

New files

  • internal/push/pod.go — Pod spec builder + Create/Wait/Delete (alpine 3.20 pinned by digest sha256:d9e853e8..., PSA-restricted, activeDeadlineSeconds=600 defense-in-depth)
  • internal/push/stream.go — Executor interface (SPDYExecutor in prod, fakeExecutor in tests) + tar packager
  • internal/push/progress.go — schollz/progressbar/v3 wrapper, TTY-detected (no-op in CI)
  • internal/push/orphan.go — find + warn about stale stage Pods (warn-only in v0.1; auto-cleanup is v0.2)
  • internal/push/stage.go — orchestrator with SIGINT-safe cleanup via fresh-context defer

Medium issues from PR-a self-review — addressed

  • v0.1 category gate: schema-valid-but-unsupported categories (e.g. tabular_classification) now hit a CLI-side gate BEFORE schema validation. Customers see category "tabular_classification" is not supported in v0.1 (only image_classification) instead of the confusing missing property 'schema' from the schema validator.
  • --ingestor-sa test coverage: TestStage_IngestorSANameFlowsToPod pins that the override actually flows to the stage Pod's ServiceAccountName.

Tricky bits worth a closer review

  • Pipe deadlock guard (stream.go): if exec returns early (ctx cancel, remote tar dies), the tar-write goroutine would block on pipe.Write forever — fixed by closing the pipe Reader after exec.Exec returns. Caught by TestStage_CancelledContext_StillCleansUp.
  • Fresh-context defer (stage.go): the deferred DeleteStagePod uses context.WithTimeout(context.Background(), 30s) so cleanup runs even when the parent ctx is cancelled (SIGINT). Without this, every Ctrl-C leaks an orphan Pod.
  • Named-return tar Close (stream.go): writeLayoutTar propagates the tar trailer-write error if-and-only-if the function otherwise succeeded — silent truncation would be much worse than a noisy error.

Exit codes (updated)

Code Cause
0 files staged successfully (Phase 4 will add: submitted + completed)
2 schema validation failed OR v0.1-unsupported category
3 local-layout or kubeconfig error
4 cluster reachable but parent release / shared PVC missing
7 pre-flight succeeded but staging failed (Pod create, image pull, exec, remote tar)

Test plan

  • go vet ./... — green
  • go test -race -cover ./... — green (push 79.8%, cluster 83.2%, schema 80.7%, cli 52.2%)
  • gofmt -s -l . — no drift
  • errcheck ./... — green
  • Local smoke: v0.1 gate fires with actionable message
  • Real EKS smoke against a tracebloc/client install — staging cats_dogs sample

After this PR merges, only Phase 4 (tracebloc/client#152 — submit-to-jobs-manager + watch + summary) and Phase 5 (#153 — release distribution) remain for v0.1.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com


Note

Medium Risk
Adds new Kubernetes Pod lifecycle + exec streaming logic and changes CLI signal/exit-code behavior, which can impact cluster resources and data staging if regressions occur. Risk is mitigated by extensive unit tests and restrictive Pod security context, but it still touches critical transfer/cleanup paths.

Overview
Completes Phase 3 dataset push by replacing the previous “pre-flight only” behavior with actual file staging to the shared PVC: it now creates an ephemeral, PSA-restricted stage Pod, waits for readiness with better diagnostics, streams labels.csv + images/* via tar-over-exec (SPDY) with an optional TTY progress bar, and always cleans up the Pod (including on Ctrl-C).

Adds safeguards and UX improvements: a v0.1-only category gate, stricter table-name validation (including max length), best-effort orphan stage-Pod detection with actionable warnings, symlink rejection in local layout discovery, a new --stage-pod-image override for air-gapped environments, and updated exit codes/docs. Also wires SIGINT/SIGTERM into Cobra’s context so long-running staging cancels cleanly instead of skipping defers.

Reviewed by Cursor Bugbot for commit 93a9ba2. Bugbot is set up for automated code reviews on this repo. Configure here.

Completes Phase 3 (tracebloc/client#151). PR-a delivered the
no-op-safe pre-flight; this PR-b plugs in the actual file
staging — ephemeral Pod, client-go SPDY executor, tar stream,
SIGINT-safe cleanup.

## What lands

internal/push/ (5 new files):
- pod.go        Stage Pod spec builder + Create/Wait/Delete
                lifecycle. Alpine 3.20 pinned by digest, PSA-
                restricted security context, activeDeadline-
                Seconds=600 in-cluster self-kill defense-in-depth,
                random 8-hex-char suffix for parallel-push
                collision avoidance.
- stream.go     Executor interface (SPDYExecutor in prod,
                fakeExecutor in tests). StreamLayout wires a
                tar.Writer through an io.Pipe to the exec stdin,
                running `tar -xf - -C /data/shared/<table>` in
                the Pod. Captures remote stderr into the error
                surface so customers see "no space left on
                device" verbatim, not a generic exec failure.
- progress.go   schollz/progressbar/v3 wrapper, TTY-detected via
                golang.org/x/term. No-op in CI / non-TTY output
                so the bar doesn't pollute log streams.
- orphan.go     Scans for stage Pods labeled
                managed-by=tracebloc-cli older than 5 min;
                renders an actionable warning with the
                kubectl-delete command. Warn-only in v0.1
                (auto-cleanup is v0.2 — can't distinguish
                "crashed previous push" from "still-running
                parallel push from another workstation").
- stage.go      The Stage() orchestrator. Order: orphan scan →
                CreateStagePod → defer DeleteStagePod with a
                FRESH context (SIGINT-safe — defers fire even
                when the parent ctx is cancelled) → WaitForReady
                → StreamLayout → cleanup. On any error past
                Create, the deferred delete still runs.

internal/cli/dataset.go:
- Replace the exit-6 stub with push.Stage(...) wired to
  cluster-discovered ns/PVC/SA + the new SPDYExecutor.
- Add v0.1 category gate (Medium-1 from PR-a self-review).
  Runs BEFORE schema validation so unsupported-but-schema-
  valid categories like tabular_classification get the
  actionable "v0.1 supports only image_classification" message
  instead of the schema's confusing "missing property 'schema'".
- Add --stage-pod-image flag for air-gapped customers.
- Update exit-code doc: drop exit 6 (PR-b stub), add exit 7
  (staging-step failed, distinct from pre-flight 3/4).

internal/push/stage_test.go covers Medium-2 from PR-a review:
TestStage_IngestorSANameFlowsToPod pins that the --ingestor-sa
override actually lands on the stage Pod's ServiceAccountName.

## Tricky bits

Pipe deadlock guard: when the executor returns early (ctx
cancel, remote tar dies immediately), the tar-write goroutine
would block on pipe.Write forever because nothing reads. Fix:
close the pipe Reader after exec.Exec returns, which unblocks
the writer with io.ErrClosedPipe. Caught by
TestStage_CancelledContext_StillCleansUp.

SIGINT safety: the deferred DeleteStagePod uses a fresh ctx
with a 30s deadline rather than the (possibly cancelled)
parent ctx. Without this, Ctrl-C right after pod-create leaks
an orphan Pod. activeDeadlineSeconds=600 is the in-cluster
backstop for the truly-pathological case (hard-kill, network
partition between laptop and cluster).

errcheck cleanup: writeLayoutTar uses a named return + deferred
close that promotes a tar trailer-write error if and only if
the function otherwise succeeded — silent truncation would be
much worse than a noisy error.

## Test plan

- [x] go vet ./... — green
- [x] go test -race -cover ./... — green
      (push 79.8%, cluster 83.2%, schema 80.7%, cli 52.2%)
- [x] gofmt -s -l . — no drift
- [x] errcheck ./... — green
- [x] go build — binary builds
- [x] Local smoke: --category=tabular_classification → exit 2
      with "v0.1 supports only image_classification" message
      (the new gate's actionable diagnostic)
- [ ] Real EKS smoke (manual; out of CI scope)

## Open items deferred

Per PR-a review's Low/Nit list:
- runDatasetPushArgs.Context shadowing context.Context
- printPushPreflight rendering AccessModes with %v
- testutil package consolidation for imgcDir / imgcLayout
- dataset_test kubeconfig path traversal via subtests
- Discover hint for nested-image-dirs

These are tracked for v0.2 cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWodka
Copy link
Copy Markdown
Contributor

👋 Heads-up — Code review queue is at 21 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@saadqbal saadqbal self-assigned this May 22, 2026
Comment thread internal/push/pod.go Outdated
Comment thread internal/push/stage.go
…+Med)

Two real bugs caught by Bugbot's re-review on PR-b:

## 1. High — Stage Pod name breaks DNS-1123

BuildStagePodSpec used the raw table name as a Pod name segment.
ValidateTableName accepts [A-Za-z0-9_]+ (MySQL identifier rules), but
K8s Pod names follow DNS-1123 subdomain rules: lowercase alnum +
hyphen, must start/end alnum. The canonical example throughout
tracebloc docs (cats_dogs_train, snake_case) would fail Pod create
post-pre-flight — worst-of-both-worlds UX (pre-flight says "good!"
then create fails).

Fix: transform the table name into a DNS-1123-safe segment for the
Pod name only — lowercase, _→-, trim leading/trailing hyphens, cap
length, fallback to "tbl" for the pathological all-underscore case.
The original verbatim name still goes in the tracebloc.io/table
label so orphan warnings stay traceable.

TestDNS1123SafeTableSegment covers cats_dogs → cats-dogs, MyTable →
mytable, _leading_underscore → leading-underscore, _ → tbl,
50-char → 30-char truncation. Each result is cross-validated
against an inline DNS-1123 regex check.

## 2. Medium — SIGINT skips Pod cleanup

push.Stage documented SIGINT-safe cleanup via context cancellation +
fresh-ctx deferred DeleteStagePod, but cmd/tracebloc/main.go called
Execute() without signal.NotifyContext. Default Go runtime behaviour
on SIGINT is to exit without running defers — so every Ctrl-C during
staging leaked an orphan Pod until activeDeadlineSeconds (10 min)
fired. The docstring was false advertising.

Fix: signal.NotifyContext(ctx, SIGINT, SIGTERM) in main, passed to
ExecuteContext. First Ctrl-C cancels the cobra ctx → push.Stage's
in-flight HTTP cancels → deferred cleanup runs (with its own fresh
ctx, also kept). Second Ctrl-C does normal hard-kill (stop() in
defer un-registers the handler) — important if the cleanup itself
hangs.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/pod.go
Comment thread internal/push/pod.go
Two new findings from Bugbot's re-review on commit 35ae044:

## Medium — WaitForStagePodReady polled through terminal errors

The poll callback returned (false, nil) for EVERY Pods.Get error,
which meant Pod-deleted-out-of-band (NotFound) or RBAC-revoked-mid-
push (Forbidden) waited the full 60s timeout despite the situation
being unrecoverable. Customer-facing impact: 60s of "Waiting for
stage Pod to be Ready..." for an error that should surface in <1s.

Fix: classify the error. NotFound + Forbidden terminate the poll
immediately (return (false, err)); everything else stays transient
(network blip, brief API unavailability). The
apierrors.Is{NotFound,Forbidden} helpers were already imported
from the previous fixes, so this is a single-block change.

Two new tests pin the contract via t.Now() bounds — if a regression
makes either case transient again, the test waits the full 60s and
the assertion catches it.

## Low — BuildStagePodSpec swallowed crypto/rand error

`suffix, _ := randomSuffix(4)` — if crypto/rand failed (rare but
possible on systems with exhausted entropy), the suffix was empty
and the Pod name became `tracebloc-stage-cats-dogs-` with a bare
trailing hyphen. DNS-1123 rejects that → opaque API error from
kube-apiserver instead of a clear local diagnostic.

Fix: change BuildStagePodSpec signature to (*Pod, error) and
propagate the rand failure. CreateStagePod (the prod caller)
already returns an error; tests use a mustBuildStagePod helper
that t.Fatal's on the rare path. 8 test call sites updated via
find-and-replace, plus the override-image test by hand.

The third inline comment ("SIGINT skips pod cleanup", bug_id
24ab9106) is a stale carry-over of the prior-round finding I
already fixed in 35ae044 — same bug_id, GitHub auto-shifted the
anchor onto the new commit. Replied in-thread pointing at the fix.
Bugbot's review-body confirms "2 NEW findings", not 3.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/pod.go
Comment thread internal/push/stream.go Outdated
Round 4 of Bugbot fixes on PR-b. Two new findings on commit 7baa321
(plus 3 stale carry-overs from earlier rounds — same bug IDs as
already-fixed findings, GitHub keeps re-anchoring them onto each
new commit; Bugbot's review-body confirms "2 NEW findings").

## Medium-Security — symlink images bypassed size caps

The vulnerability: Discover sized image entries via DirEntry.Info()
(Lstat-equivalent: returns the symlink's own ~100-byte size).
writeTarFile read them via os.Stat + os.Open (which follow
symlinks). So a customer with `images/evil.jpg` symlinked to
`/var/log/system.log` (multi-GB) or `~/.ssh/id_rsa` could:

  1. Pass Discover's MaxSingleFileBytes + MaxTotalBytes caps
     trivially (cap is 1 GiB; the symlink itself is ~100 bytes).
  2. Stream the target's FULL contents into the cluster PVC, where
     the cluster admin can read them via `kubectl exec`.

This is both a size-cap bypass AND an arbitrary-local-file
disclosure. Worse, it can fire unintentionally — a customer who
ran `ln -s ~/datasets/big-images images` to share data across
projects would silently stream gigabytes during what they thought
was a small test push.

Fix in three layers:

  1. walk.go: os.Stat → os.Lstat for the labels.csv check (so the
     mode bits include ModeSymlink) + new rejectSymlink() helper
     called for both labels.csv and each image entry. v0.1 rejects
     symlinks outright with a clear "v0.1 doesn't allow symlinks
     (security: ...)" message pointing at `cp -L` for materializing
     and the v0.2 cloud-source story for distributed data.
  2. stream.go: writeTarFile's os.Stat → os.Lstat + symlink check
     too. Defense-in-depth — Discover is the primary fix, but a
     future refactor that calls writeTarFile directly (e.g. a new
     resume-from-partial-transfer code path in v0.2) would
     re-introduce the hole without this layer.
  3. New tests pin both rejections (labels.csv-as-symlink + image-
     as-symlink). Skipped on Windows where symlinks require admin.

## Medium — WaitForStagePodReady didn't short-circuit on Phase=Failed

Companion to the NotFound/Forbidden short-circuit landed in the
previous commit. The poll positively-terminated on Ready=True
but never negatively-terminated on Phase=Failed (container
crashed at startup: PSA rejection, image crashloop, OOM at
container init) or Phase=Succeeded (stage container's sleep
exited unexpectedly). Same 60s timeout symptom for an outcome
that's actually decided in <5s.

Fix: in the poll body, check Phase after the conditions loop;
return a structured error including the podReadyTimeoutHint
output (container-status reason + message) so the customer sees
"terminated in phase Failed (OOMKilled — ...)" instead of a
generic timeout.

New test TestWaitForStagePodReady_FailedPhaseIsTerminal pins
the <3s elapsed contract for an OOMKilled startup.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/walk.go
Comment thread internal/push/walk.go
Comment thread internal/push/pod.go
Comment thread internal/push/stage.go
Comment thread internal/push/stream.go Outdated
…ress defer

Round 5 of Bugbot fixes on PR-b. The "5 potential issues" review-body
count corresponds to 5 NEW findings; the other 5 inline comments are
stale carry-overs of bug IDs already fixed in earlier rounds (24ab9106,
15e29a3e, 258869d3, 5c577688, 7fcf137f — GitHub auto-anchors them onto
each new commit).

## Medium — Symlinked images/ directory bypassed all checks

Round 4 fixed symlinked FILES inside images/ (Lstat on each entry).
It missed the case where images/ ITSELF is a symlink — `ln -s /etc
images` passed os.Stat (which follows symlinks), the IsDir check
succeeded, and the loop walked /etc. Same disclosure pattern as the
round-4 finding, one layer up.

Fix: os.Stat → os.Lstat on imagesDir + rejectSymlink before the
IsDir check. Symmetric with the labels.csv treatment from round 4.
New test TestDiscover_RejectsSymlinkedImagesDir pins the rejection.

## Medium — Table label could exceed Kubernetes 63-char limit

ValidateTableName accepted [A-Za-z0-9_]+ without a length cap, but
the stage Pod's tracebloc.io/table label carries the raw name. K8s
label values cap at 63 chars (DNS-1123 label rule), so a 100-char
name passed pre-flight then failed Pod creation with an opaque
label-validation error.

Fix: cap at 63 chars (also matches MySQL's 64-char identifier limit
with 1 char of headroom for any future ingestion-run-id suffix).
New MaxTableNameLength const + boundary test pin the cap.

## Medium — Re-push left stale PVC files

The remote command was `mkdir -p <dest> && tar -xf - -C <dest>`. If
a previous push had 3 images and the new push has 2, the PVC ends
up with the union — and labels.csv now disagrees with the stage
directory, silently breaking ingestion.

Fix: `rm -rf <dest> && mkdir -p <dest> && tar -xf - -C <dest>`. Safe
because dest = StagedPrefix(table) = /data/shared/<table> and
ValidateTableName has ensured `table` is a single safe segment, so
rm -rf can only nuke that one per-table subdir, never the parent
/data/shared or sibling tables. TestStreamLayout_RemoteCommand
updated to assert the rm AND its ordering before mkdir.

## Low — Progress bar not Finish'd on early Stage failure

Stage's deferred Finish lives inside StreamLayout, so a failure
earlier in the lifecycle (CreateStagePod, WaitForStagePodReady)
returned without clearing the TTY bar. Visual artifact on the
customer's terminal after a Pod-create failure.

Fix: defer progress.Finish() at the construction site in
runDatasetPush. Schollz Finish is idempotent so double-call on
happy path is a no-op.

## Hard-link bypass — documented, not fixed

Bugbot also flagged that hard links to outside files aren't
rejected. Filesystem mode bits don't distinguish a hard link from
a regular file the way ModeSymlink distinguishes symlinks, and a
high-Nlink check has too many false positives.

The implicit security boundary is the CLI's process-level read
permissions: a customer can only hard-link files they already
have read access to, so this isn't a privilege escalation. v0.2
may add openat(O_NOFOLLOW) sandboxing if customers need harder
isolation. Documented in rejectSymlink's docstring as a known
limitation alongside the v0.2 plan.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/pod.go
Comment thread internal/push/stream.go Outdated
Round 6. Review-body says "2 NEW + 3 unresolved" — Bugbot itself is
flagging the 9 carry-overs at this point. Two new findings, both
real:

## High — Pod activeDeadlineSeconds (600s) cut too close

The deadline starts at Pod CREATION, but image pull (up to 60s) +
WaitForStagePodReady ceiling (60s) + worst-case stream (1 GiB at
2 MB/s ≈ 8.5 min) eat the budget. Near-cap customers on slow
uplinks could see the kubelet terminate mid-transfer with no
useful diagnostic.

Fix: bump StagePodActiveDeadline from 600 → 1800 (30 min). Budget
breakdown in the const's docstring; comfortable margin for
variance. Cost is "an idle alpine Pod with sleep idles for ~22min
after a successful push" — ~5 MiB cluster RAM, zero CPU, deleted
seconds later by the defer'd cleanup anyway.

Test pin: TestStagePodActiveDeadline_CoversFullLifecycle asserts
the floor at 1500s so a future regression to 600 is caught.

## Medium — Windows tar paths used backslashes

`filepath.Join("images", filepath.Base(abs))` produces
`images\file.jpg` on Windows. USTAR / POSIX tar requires forward
slashes; the Linux stage Pod's `tar -xf` either rejects or
extracts as a flat-named file (collapsing the images/ subdir
the ingest spec expects). Breaks the Windows-built CLI.

Fix: switch to `path.Join` for the tar HEADER name. Keep
`filepath.Join` everywhere else (where the OS-native separator
is the right thing). Test pin
TestStreamLayout_TarPathsAreForwardSlash asserts no backslashes
in any tar entry name, regardless of host OS.

## Carry-overs

Bugbot's review body now explicitly says "There are 3 total
unresolved" — it knows the carry-overs aren't on the new commit
anymore. Counts I'm tracking as known-fixed across earlier
rounds: 24ab9106 (SIGINT, r3), 15e29a3e (Pod-Get spin, r3),
258869d3 (rand error, r3), 5c577688 (Failed Pod, r4), 7fcf137f
(symlink files, r4), de426248 (hard links, r5 — documented
limitation), cd462de9 (symlink dir, r5), a8e9e5c7 (label len,
r5), 0b296807 (progress finish, r5).

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/orphan.go
Comment thread internal/push/orphan.go Outdated
Comment thread internal/push/stream.go Outdated
Comment thread internal/push/pod.go Outdated
…re-push

Round 7 of Bugbot fixes on PR-b. Review-body confirmed "4 NEW + ...".
All four real, all four fixed:

## Medium — Wait errors mislabeled as timeout

WaitForStagePodReady wrapped EVERY poll exit (NotFound, Forbidden,
Phase=Failed, ctx.Canceled from SIGINT) in "did not become Ready
within 60s." Misleading: the customer who hit Ctrl-C 2 seconds
into the wait would see a "timed out" diagnostic.

Fix: branch on errors.Is(err, context.DeadlineExceeded). True
timeout keeps the "did not become Ready within %s" wording;
early-exit cases surface as "did not reach Ready state: %w" with
the actual cause front and center.

## Medium — Orphan-delete hint would nuke parallel pushes

FormatOrphansWarning's kubectl-delete command used the
`managed-by=tracebloc-cli,component=stage-pod` label selector,
which matches every stage Pod in the namespace including
LEGITIMATE RUNNING ONES from parallel pushes (other workstations,
or even this one's just-started Pod). A customer copy-pasting the
suggested cleanup could silently kill someone else's in-progress
push.

Fix: list specific orphan Pod names in the delete command —
`kubectl delete pod -n <ns> <name1> <name2> ...`. Test regression-
pins the absence of the label-selector form.

## Medium — Re-push deleted before transfer succeeded

Round 5's hermetic-re-push fix (rm -rf $DEST && mkdir + tar)
satisfied "no stale files" but not "preserve on failure." A
tar mid-stream failure (Ctrl-C, network drop, container OOM) left
the customer with NOTHING on the PVC — the previous push's data
already nuked, the new push aborted before completing.

Fix: extract to <dest>.staging, swap on success:

  rm -rf <dest>.staging              # cleanup any prior failure
  mkdir -p <dest>.staging && tar -xf - -C <dest>.staging
  rm -rf <dest> && mv <dest>.staging <dest>    # swap on success
  rm -rf <dest>.staging              # defensive cleanup

The shell's && sequencing means swap only fires if tar succeeded.
Lost the prior `exec /bin/tar` micro-optimization (can't exec
mid-chain) — fine, the shell process is tiny. The window between
the destination rm and the mv is sub-ms; closing it fully would
need a double-mv (old/new/cleanup) which is v0.2 territory.

stream_test.go pins:
  - mkdir-p of .staging happens
  - tar extracts to .staging (not directly to dest)
  - mv from .staging to dest exists
  - tar happens BEFORE any rm of $DEST (transactional property)
  - no rm of the parent /data/shared (sibling-table safety)

## Medium — Orphan scan flagged active pushes

FindOrphanStagePods's only filter was age > 5min. But pod.go itself
budgets ~8.5 min for a 1 GiB transfer — a legitimately-running
near-cap push would be mislabeled as orphan by the same customer's
next concurrent push, and the (newly-fixed) delete hint would now
correctly target that specific running Pod by name, killing it.

Fix: skip Phase=Running pods entirely. A Running stage Pod is
presumed to be doing real work; activeDeadlineSeconds is its
cluster-side safety net. Pods in non-Running phases past grace
(Pending stuck on image pull, Failed from crash) still flag —
those are the genuine orphan shapes.

Two new tests:
  - TestFindOrphanStagePods_SkipsRunningPods: 30-min-old Running
    Pod doesn't surface as orphan
  - TestFindOrphanStagePods_FlagsNonRunningPastGrace: 30-min-old
    Failed Pod DOES surface (regression guard — narrowing the
    filter should reduce false positives, not eliminate the
    warning entirely)

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/stream.go Outdated
Comment thread internal/push/stream.go
Round 8 of Bugbot fixes. Two new findings on commit 42c5e93,
both real:

## Medium — Parallel pushes corrupted staging

Two concurrent `dataset push` runs for the same `--table` shared
one `$DEST.staging` path on the PVC. Each Pod's `rm -rf
$DEST.staging`, `tar -xf -`, `mv` sequence raced against the
other, with no coordination. Worst case: push A's `rm -rf`
wipes push B's mid-extraction state, B's tar then writes into
a partially-removed dir, customer ends up with an interleaved-
corrupted-and-half-committed dataset.

Fix: each invocation generates a fresh 8-hex-char suffix for the
staging dir name (`$DEST.staging-<8hex>`), reusing pod.go's
randomSuffix. Two parallel pushes now have distinct staging
dirs — no interleaved-write hole.

The FINAL `rm $DEST && mv $STAGING $DEST` step is still
last-write-wins under concurrent commits, but that's an
acceptable v0.1 semantic (concurrent pushes for the same table
are inherently undefined; whichever commits last "wins" with a
COHERENT dataset, not an interleaved one).

New test TestStreamLayout_StagingSuffixIsUniquePerInvocation
pins the contract by calling StreamLayout twice and asserting
distinct staging suffixes.

## Medium — Stream skipped size-cap enforcement (TOCTOU)

Pre-flight Discover checked MaxSingleFileBytes + MaxTotalBytes,
but writeTarFile / writeLayoutTar streamed files later via
os.Lstat → io.Copy with NO re-check. A file that grew between
pre-flight and stream (legitimate dataset prep racing the push,
or adversarial overwrites) would silently upload past the 500 MiB
/ 1 GiB advertised caps.

Fix in three layers:

  1. writeTarFile re-checks the single-file cap (os.Lstat size
     vs MaxSingleFileBytes) right before WriteHeader. A file
     that grew gets a clear "exceeds v0.1 single-file cap"
     error using the same sizeError formatter Discover uses.

  2. writeTarFile now returns (int64, error) — the declared
     header size. writeLayoutTar accumulates this into a
     running total and aborts mid-stream if it exceeds
     MaxTotalBytes. Tests added below to pin the running-total
     contract.

  3. io.Copy → io.CopyN(tw, f, st.Size()) caps the actual byte
     transfer at the declared header size. Without the cap, a
     file that grew between Lstat and io.Copy would overflow
     the tar header — header says N bytes, body has > N → tar
     archive corruption. CopyN-and-trust truncates instead.

Closes both the metadata-side (header size) and body-side
(byte stream) halves of the stream-time TOCTOU window.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/stream.go Outdated
Comment thread internal/push/stream.go Outdated
…dirs

Round 9 of Bugbot fixes. Both findings real:

## Medium — Tar goroutine could deadlock on suffix error

The randomSuffix call landed AFTER the tar-write goroutine spawned.
If crypto/rand failed (rare but possible on entropy-starved systems
— same scenario the round-3 BuildStagePodSpec fix handles), the
function returned without closing the pipe or waiting on the
goroutine. The goroutine would then block once the ~64 KB pipe
buffer filled, leaking forever.

Fix: hoist randomSuffix (and the entire remoteCmd composition)
ABOVE the io.Pipe + goroutine setup. The function now bails on
suffix failure before touching the pipe at all, so there's
nothing to deadlock on.

## Medium — Failed pushes leaked .staging-<hex> dirs

Round 8's unique-per-invocation suffix fixed parallel-push
interleaving but created a new leak: if THIS push fails before
the final `mv` step, the .staging-<hex> dir lingers on the PVC.
Round 8's `rm -rf $STAGING` at the start only cleans up THIS
invocation's path, not the previous-failed one. Repeated failed
pushes accumulate unbounded storage on the shared PVC.

Fix: append a defensive cleanup pass to the remote script:

  find $(dirname $DEST) -maxdepth 1 -name "$(basename $DEST).staging-*" \\
    -mmin +60 -exec rm -rf {} + 2>/dev/null || true

Constraints baked into the pattern:
  - -mmin +60 (1 hour) is 2x activeDeadlineSeconds — anything
    older HAS to be from a dead stage Pod, so we can't race with
    a parallel push's in-progress staging
  - -name pattern scopes to THIS table's staging siblings only;
    other tables' .staging-* dirs are none of our business
  - 2>/dev/null || true — find failures don't fail the whole
    stream (defensive cleanup is best-effort, the customer's
    actual push already committed before this runs)
  - Uses ';' (semicolon) instead of '&&' so it runs even if the
    main push sequence failed somehow — orphan cleanup should
    happen regardless

Updated TestStreamLayout_RemoteCommand asserts the find pattern
is present with the exact -mmin window and table-scoped -name.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/stream.go Outdated
Bugbot's tenth-round single finding: the previous r7 pattern
of `rm $DEST && mv $STAGING $DEST` had a small but real failure
window — if the rm succeeded and the mv failed (ENOSPC, fs-level
rename error, transient kernel weirdness), the customer's
previous dataset was already gone and the new data was stranded
under .staging-<hex> where the ingestor wouldn't see it.

Fix: backup-and-rollback. Rename the existing $DEST to a unique
$DEST.old-<hex> sibling FIRST; only then mv $STAGING into $DEST;
on the latter's failure, restore the backup. On success, drop
the backup.

  { [ -e $DEST ] && mv $DEST $DEST.old-<hex> || true; } &&
  { mv $STAGING $DEST ||
    { [ -e $DEST.old-<hex> ] && mv $DEST.old-<hex> $DEST; exit 1; }; } &&
  rm -rf $DEST.old-<hex>

Properties this gives us:

  - tar fails: $DEST untouched (transactional from r7)
  - backup mv fails: $DEST untouched
  - main mv fails: backup at .old-<hex> survives; customer
    recovers via `kubectl exec -- mv .old-<hex> $DEST`
  - final rm fails: just leaves an .old-<hex> cruft, picked up
    by the r9 find -mmin +60 sweep (extended to also catch
    .old-* siblings)
  - first-ever push (no pre-existing $DEST): `[ -e $DEST ] && ...`
    short-circuits, backup mv silently skipped via `|| true`

The .old-<hex> suffix reuses the same random as .staging-<hex>
so two parallel pushes can't collide on the backup name. Both
.staging-* AND .old-* now flow through the orphan-cleanup find
pattern (with the `-o` alternation), so r9's leak-prevention
covers both halves of the swap.

Test updates pin:
  - Backup mv (DEST → .old) appears BEFORE primary mv
    (.staging → DEST) — rollback contract
  - Backup and staging suffixes MUST agree (rollback-target
    correctness)
  - find pattern includes both -name "...staging-*" and
    -name "...old-*"

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

This is round 10. The new-finding rate has been decaying:
5→2→2→2→1. We're approaching the asymptote.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/push/stream.go
Comment thread internal/push/stream.go Outdated
…t r11)

Two new findings on r10's commit, both real:

## Medium — Remote script masked failures via POSIX `;` + `|| true`

The previous shell used `&&-chain ; find ... || true`. POSIX `;`
runs the next command regardless of prior exit status, and the
trailing `|| true` then forces exit 0. A failed tar or mv earlier
in the && chain would set $? to non-zero, then `;` would clobber
it back to 0 via the find's `|| true`. Result: a remote push
that actually failed would return exit 0 to the exec subprocess,
and StreamLayout would happily report "Staged N files" on a
catastrophic failure.

Fix: rewrite the script using `set -e` + explicit newline-
separated statements + an explicit `if ! mv; then rollback;
exit 1; fi` for the swap. With set -e:
  - Any non-guarded non-zero exits the script with that status.
  - `cmd || true` continues to suppress (find cleanup stays
    best-effort).
  - `if ! cmd; then ...; fi` is treated as guarded, so set -e
    doesn't preempt the explicit rollback handler.

Multi-line shell-c args work across busybox sh (alpine 3.20),
dash, and bash equally. No portability regression.

## Medium — Stream error masked the upstream tar error

After r8 added stream-time MaxSingleFileBytes / MaxTotalBytes
rechecks in writeLayoutTar, the LOCAL tar build can legitimately
fail mid-stream. When it does, the goroutine's CloseWithError
propagates to the pipe reader; exec sees broken-pipe and returns
a generic "exec stream against ns/pod failed" error.

The previous code checked streamErr FIRST and returned the
exec-flavored framing. Customer saw "streaming files failed"
instead of the actually-actionable "dataset exceeded v0.1 total
cap of 1.00 GiB after streaming ..." diagnostic from the tar
side.

Fix: swap the order. Check tarErr first — when both are non-nil,
the tar side is the upstream cause; the broken-pipe streamErr is
downstream noise. streamErr-only (no tarErr) is still the real
network/RBAC/remote-tar case and gets surfaced with the exec
wording.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Round 11 finding rate: 2 new (1+2+5+2+4+2+2+1+2 → still trending
roughly downward, though not monotone). One more clean round
and we're at the asymptote.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8bfada3. Configure here.

Comment thread internal/push/stream.go
Independent review pass over PR #9 (post-bugbot-r11). Five findings,
none caught by bugbot — all cross-cutting doc/style:

1. stream.go: dead comment block above remoteCmd composition. The
   block was a holdover from the && era; it claimed "&& sequencing
   means the swap only fires if tar succeeded" but we switched to
   `set -e` in r11. The canonical pipeline doc lives below the
   composition; collapsed the upstairs block into a single semantic-
   guarantees summary (HERMETIC/TRANSACTIONAL/PARALLEL-SAFE/EXIT-
   FAITHFUL) and removed the contradictory && reference.

2. stage.go: StageOptions docstring referenced a non-existent
   `OrphanLogger` field. Stale from an early design where orphan
   warnings were emitted via a logger callback rather than the
   integrated FormatOrphansWarning we ended up with. Corrected to
   the actual nil-able fields (Progress, Out).

3. orphan.go: OrphanGracePeriod comment claimed "5 minutes is
   generous: a healthy stage Pod is fully done in ~30 seconds...
   under a couple minutes at the 1 GiB cap." That directly
   contradicts pod.go's StagePodActiveDeadline budget (~8.5 min
   for a 1 GiB transfer at 2 MB/s, plus image pull, plus ready
   wait). Rewrote to reflect the post-r7 semantic: Running Pods
   are never flagged regardless of age; the 5-min grace targets
   the Pending/Failed/Unknown shapes that are genuinely stuck.

4. orphan.go: joinNames was a 9-line wrapper around `strings.Join(
   names, " ")`. The comment justifying it ("single point of
   change for tests") doesn't hold — the tests assert on the
   final output string, not on this helper's surface. Inlined.

5. cli/dataset.go: runDatasetPush docstring still said "the PR-a
   slim implementation. It performs every pre-flight check... the
   actual file staging is gated behind a clear 'not yet
   implemented' error." PR-b (this PR) actually implements the
   staging — the doc lagged the code. Updated to reflect the
   complete Phase 3 flow.

No behavioral change. Tests + lint green. The doc fixes matter
because future readers (humans and the next bugbot pass) get
contradictory signals when the comments disagree with the code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal merged commit b3efc62 into develop May 22, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants