Skip to content

Sync develop → main for v0.1.0-alpha#22

Merged
aptracebloc merged 17 commits into
mainfrom
develop
Jun 3, 2026
Merged

Sync develop → main for v0.1.0-alpha#22
aptracebloc merged 17 commits into
mainfrom
develop

Conversation

@aptracebloc
Copy link
Copy Markdown
Contributor

@aptracebloc aptracebloc commented Jun 3, 2026

Summary

Roll-up sync of developmain for the v0.1.0-alpha release. Brings main (last at ddbe987, an early Phase-0/1 state) up to the full v0.1 feature set: the complete tracebloc dataset push flow across 9/10 modalities, the release pipeline, and the test/integration hardening. No divergence — main is an ancestor of develop (17 commits).

Merge as a merge commit (not squash) per the repo's branch-flow convention, to preserve the full per-PR history on main.

Changelog (main..develop)

Core ingestion CLI

Modality expansion — dataset push now covers 9 of 10 categories

Testing

Docs

Review fixes

Release: v0.1.0-alpha

First tagged release, cut from develop (a661a9a) and validated end-to-end:

  • https://github.com/tracebloc/cli/releases/tag/v0.1.0-alpha (prerelease)
  • 5 cosign-signed Go binaries — linux/{amd64,arm64}, darwin/{amd64,arm64}, windows/amd64 — plus SHA256SUMS and the install scripts.
  • Pipeline (release.yml) succeeded on its first run; downloaded binary verifies against SHA256SUMS and self-reports tracebloc 0.1.0-alpha.
  • Ships under the current tracebloc name (provisional — the stable v0.1.0 will carry the final product name). Homebrew deferred.
  • Live-validated on EKS dev: tabular ingestion 100% (rows confirmed in MySQL); all 9 supported categories pass submit after the jobs-manager schema refresh.

🤖 Generated with Claude Code

saadqbal and others added 15 commits May 21, 2026 19:09
Phase 1: embed ingest.v1.json + `tracebloc ingest validate`
* feat(cluster): kubeconfig discovery, parent release detection, SA token

Phase 2 of the v0.1 roadmap. Adds the plumbing the future `tracebloc
dataset push` flow needs: where does the customer's kubeconfig point,
which tracebloc release lives there, and how do we authenticate to
its jobs-manager.

End-to-end validated against the dev EKS cluster
(tb-client-dev-templates): discovers chart 1.3.5, resolves
jobs-manager.tracebloc-templates.svc:8080, reads INGESTOR_IMAGE_DIGEST
out of the deployment env, mints a 10-minute SA token via
TokenRequest.

What lands:

- internal/cluster/kubeconfig.go — Load() that honors --kubeconfig,
  $KUBECONFIG, ~/.kube/config (via clientcmd's full default loading
  rules — *not* an empty ExplicitPath, which silently refuses to
  fall back to defaults; that was the first bug the real-cluster
  smoke caught).

- internal/cluster/discover.go — DiscoverParentRelease() finds the
  tracebloc/client release in a namespace by listing chart-managed
  Deployments and filtering by name suffix (-jobs-manager). The
  chart shares app.kubernetes.io/name=client across mysql/jobs-
  manager/requests-proxy, so suffix matching is what distinguishes
  jobs-manager. Returns a friendly multi-release error when
  ambiguous, with remediation text in the message.

- internal/cluster/token.go — MintIngestorToken() tries the modern
  TokenRequest path first, falls back to a static
  service-account-token Secret on RBAC denial / older clusters / SA
  missing. Errors propagate verbatim on non-recoverable failures
  (network, context cancellation) so customers see the real
  problem instead of a misleading "static fallback also failed."

- internal/cli/cluster.go — `tracebloc cluster info` command. Prints
  context, server, namespace, parent release info, SA + token state
  (with SHA256(token)[:8] instead of the raw bytes — token must
  never appear in scrollback). Exit codes 3 (kubeconfig issue) / 4
  (no parent release) / 5 (token mint failed).

Tests:

- 5 new test files covering happy path, multi-release ambiguity,
  service name fallback, the sibling-deployment filter regression
  (mysql + requests-proxy + jobs-manager all share chart-level
  labels — discovery must pick jobs-manager by name suffix),
  TokenRequest happy path, static-secret fallback, non-recoverable
  error pass-through, and the combined-failure remediation message.

Coverage: internal/cluster 83.8%, internal/cli 59.5% (cluster info
itself is hard to unit-test without a real cluster — Phase 3+ adds
integration tests against a kind cluster).

Library footprint: brings in k8s.io/client-go + apimachinery + api
(@v0.31.0) and sigs.k8s.io/yaml. Cross-compiled binaries grow from
~10MB to ~30MB; cost is acceptable for the customer-experience
upside of "your kubeconfig is all you need."

Closes tracebloc/client#150.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): resolve Go version + dep conflicts for lint to typecheck

CI on PR #2 hit a lint typecheck error
("undefined: jsonschema / yaml (typecheck)") that took a few iterations
to diagnose. Root cause: the k8s.io/client-go v0.31 line pulls in
transitive deps (sigs.k8s.io/structured-merge-diff/v6 vs v4) that
fight in go.mod, and golangci-lint v1.61's bundled Go SDK can't
typecheck a module whose `go` directive is newer than what the linter
supports.

Resolution:

1. Bump k8s.io/client-go + api + apimachinery from v0.31.0 to
   v0.36.1 (latest stable). Fixes the structured-merge-diff
   version split — v0.36 uses v6 consistently across the
   dependency chain.
2. Accept whatever `go mod tidy` writes to go.mod's `go` directive
   (currently 1.26 on this dev machine, 1.24 on others — same
   either way since Go modules are forward-compatible). Stop
   fighting tidy; pinning a stale version produces typecheck
   errors instead of real findings.
3. Bump golangci-lint in the workflow from v1.61.0 to v1.64.7, the
   first version that handles the Go 1.24+ source the dep tree
   now requires.
4. Update .golangci.yml `run.go: "1.24"` to match go.mod's effective
   minimum.
5. Refresh the go.mod comment so future readers understand why
   the version directive isn't pinned low.

Local validation: `make vet test fmt-check schema-check` all
green; cluster-info smoke against the dev EKS still discovers
chart 1.3.5 + mints a TokenRequest token.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: trim linters that whole-program-analyze the k8s.io dep tree

CI's Lint job has now failed three runs in a row with
"The runner has received a shutdown signal" after ~2 minutes of
golangci-lint actually running. Not a flaky runner — reproducible.

Root cause: `staticcheck` and `unused` do full-program SSA analysis
across every transitive dep. With k8s.io/client-go + apimachinery +
api in the graph (≈80 indirect modules), that exceeds whatever
budget the standard 4-CPU GitHub-hosted runner allots for the job.
The runner gets preempted before lint completes.

Drop `staticcheck` and `unused` from the active linter set. Keep
the cheap per-file linters that catch the bugs we've actually hit
this week (errcheck, govet, ineffassign, gofmt, goimports, misspell,
unconvert).

Filed v0.2 ticket to bring them back via either (a) a self-hosted
larger runner, (b) `-skip-files=k8s.io/*` patterns that don't
exist in golangci-lint v1.64 but do in v2.x, or (c) split the lint
job to run staticcheck only on `./internal/...` (our own code) and
skip module-cache packages.

The dropped linters' value relative to CI cycles spent debugging
this:
  - staticcheck SA-checks are valuable but redundant with `govet`
    for the most-likely-to-bite cases (printf, lock copies, etc.)
  - `unused` rarely fires on a brand-new codebase where every
    symbol is just-introduced.

Pragmatic tradeoff for v0.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: replace golangci-lint-action with standalone errcheck + gofmt -s

Four consecutive Lint failures on PR #2, all hitting the same
"shutdown signal received" at ~2 minutes (15:17, 15:24, 15:33, 15:35).
Not lint config (the trim from staticcheck/unused didn't help),
not a flaky runner (reproducible), not an OOM (no resource warning).
golangci-lint-action@v6 + the k8s.io/* dep tree appears to be the
incompatible combination in early 2026's GitHub Actions environment.

Rather than spend another iteration debugging the action, replace
it with standalone tools that already work locally and have
predictable behavior:

  - errcheck v1.7.0   (the bug class we've actually hit this week)
  - gofmt -s          (the formatting check; matches what `make fmt-check` does)
  - ineffassign v0.1  (cheap dead-assignment detection)
  - misspell          (typo guard)

Combined runtime in standalone mode: ~10s. golangci-lint's value-add
beyond these was staticcheck + unused — both already deferred to #6
as v0.2 work pending a strategy for the dep tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: unpin lint-tool versions; their old tags don't build with current Go

errcheck v1.7.0 + ineffassign v0.1.0 + misspell v0.3.4 all transitively
import a golang.org/x/tools version (v0.17 era) that fails to compile
under current Go ("invalid array length -delta * delta" in
tokeninternal.go). Pinning was the right instinct for reproducibility,
but the upstream tools haven't shipped current-Go-compat tags yet.

Use @latest for now; reproducibility tradeoff is acceptable given
these are lint tools, not runtime deps. Document in #6 as
"pin once upstream tags newer versions" follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lint): explicit-discard 20 unchecked fmt.Fprintf in cluster.go

errcheck caught 20 unchecked Fprintf/Fprintln returns in
runClusterInfo — same class of finding that was previously caught
+ fixed in internal/cli/ingest.go and cmd/tracebloc/main.go.
I missed cluster.go when I added the explicit-discard pattern there.

Same rationale as the other sites: the exit code is the contract;
a pipe-write failure shouldn't convert a successful diagnostic into
a non-zero exit. Wrap each call with `_, _ =`.

Now caught by CI thanks to the standalone-errcheck swap from the
previous commit. The whole reason for the lint-job rework was to
catch this exact bug class earlier in the loop — we just had to
trade the golangci-lint-action for a working setup first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lint): explicit-discard the stderr Fprintln in main.go too

Preempts the next errcheck cycle. Same explicit-discard rationale as
the cluster.go fixes — stderr unreachable shouldn't change the exit
code we propagate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cluster): admit hardcoded ingestor SA name + add --ingestor-sa override

Bugbot caught a contradiction in PR #2's Phase 2 code:

- The comment said "Read INGESTOR_IMAGE_DIGEST + ingestor SA name
  from the Deployment's pod-spec env"
- The struct doc said "customers can override; the value comes from
  jobs-manager's environment"
- But the switch only handled INGESTOR_IMAGE_DIGEST. SA name was
  hardcoded to "ingestor", silently ignoring any customer override.

Truthful fix:

1. Update the comment and struct doc to admit the limitation.
2. Add a `--ingestor-sa` flag on `cluster info` so customers who
   set `ingestionAuthz.serviceAccountName` to a non-default value
   in the parent client chart can still use the CLI today.
3. Plumb the override through `runClusterInfo` -> applied to the
   discovered ParentRelease before token mint.
4. Drop the now-only-INGESTOR_IMAGE_DIGEST switch statement to a
   plain `if` — clearer, errcheck-friendlier, and signals there's
   only one env var being read.

File #7 in tracebloc/cli for the proper fix: discover the SA name
from the chart-rendered ingestionAuthz ConfigMap so the flag
becomes unnecessary. v0.2 work, not blocking Phase 2 ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cluster): defer to apierrors stdlib; close go.mod/lint Go-version drift

Cursor Bugbot's re-review on 123b56a flagged two new issues in the
Phase 2 (#150) tree:

1. internal/cluster/token.go reimplemented isForbidden / isNotFound /
   isMethodNotSupported / statusCode against Status.Code (numeric HTTP
   code) when k8s.io/apimachinery/pkg/api/errors already exports
   IsForbidden / IsNotFound / IsMethodNotSupported that key off
   Status.Reason (the typed enum). The two can diverge silently for
   non-standard status errors. The test file already imports apierrors
   and constructs fake errors via apierrors.NewForbidden(), so deferring
   to the stdlib is both safer AND removes ~20 lines of homegrown code.

   The four token tests still pass at 82.1% pkg coverage because
   NewForbidden() sets both Code and Reason fields.

2. go.mod's top-of-file comment claimed "Minimum Go is 1.22" but the
   actual `go 1.26.0` directive (forced by k8s.io/* v0.36.x deps)
   contradicted it, and .golangci.yml pinned `go: "1.24"` — also stale.
   Rewrote the go.mod comment to admit reality + tell future-me to
   bump both together, and bumped the lint config to "1.26" to match.

The third inline comment from Bugbot's re-review is a stale carry-over
of the SA-name finding fixed in 123b56a (same bug ID 5e4b5df0…, GitHub
auto-shifted its anchor onto the new lines). Bugbot's own review-body
count confirms 2 new findings, not 3.

Local: go vet, go test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset): pre-flight for `dataset push` (PR-a of #151)

Phase 3 (tracebloc/client#151) lands in two PRs to keep diffs
reviewable. PR-a (this one) is the no-op-safe path: synthesize the
ingest spec from flags, validate against the embedded ingest.v1
schema, walk the local directory, discover the cluster's parent
release + shared PVC. Print a summary, then either stop (--dry-run)
or fail cleanly with "wait for PR-b" so a customer who pulled the
PR-a binary doesn't see "0 files transferred" confusion.

PR-b adds the novel-engineering piece: ephemeral stage Pod (alpine
3.20 pinned by digest), client-go remotecommand SPDY executor + tar
stream, schollz/progressbar/v3, SIGINT-safe cleanup. It plugs into
the "TODO: PR-b" branch at the end of runDatasetPush() without
touching anything upstream.

Surface area added:

  internal/push/                       (new package)
    spec.go         SpecArgs -> ingest.v1.json-conforming map.
                    Path is leave-validation-to-schema; the schema
                    is the single source of truth.
    walk.go         Discover() walks <root>/labels.csv +
                    <root>/images/, enforces v0.1 size caps
                    (1 GiB total, 500 MiB per file), accepts
                    {.jpg,.jpeg,.png,.webp} case-insensitively.

  internal/cluster/pvc.go              (new file)
    DiscoverSharedPVC reads the chart's `client-pvc` (hardcoded by
    _helpers.tpl, not parameterized) in the namespace, verifies it
    is Bound, returns access modes. IsReadWriteMany() drives the
    stage-Pod scheduling warning in PR-b.

  internal/cli/dataset.go              (new file)
    `tracebloc dataset push <local-path>` command. Order:
      1. flag->spec synthesize + schema-validate (stderr diagnostic,
         exit 2 — same wording style as ingest validate)
      2. walk local layout + size caps (exit 3)
      3. load kubeconfig + clientset (exit 3)
      4. discover parent release (exit 4)
      5. discover shared PVC (exit 4)
      6. print pre-flight summary
      7. --dry-run stop, or exit 6 "wait for PR-b"

  internal/cli/root.go                 (1-line wire-up)
    Register the new command.

Why image_classification only: the epic (tracebloc/client#147) v0.1
non-goals defer other categories to v0.2 as one-PR additions. The
--category flag does NOT pre-validate so the schema's enum check
produces the canonical error (rather than CLI-side drift).

Why exit code 6 for "not implemented yet": distinct from the
existing schema (2), local (3), discovery (4), and Phase-2's
token-mint (5) codes. Tests assert the exact code so the contract
sticks across PR-b's refactor.

Tests:
  spec_test.go        Build() ↔ embedded schema contract (the
                      core "this must produce something the
                      validator accepts" pin)
  walk_test.go        Layout + size caps, all happy + sad paths
                      (case-insensitive ext, skip .DS_Store,
                      missing files, byte-sum sanity)
  pvc_test.go         Fake-clientset coverage: happy path,
                      not-found diagnostic, Pending-phase
                      diagnostic, mixed-mode IsReadWriteMany
  dataset_test.go     CLI integration: schema-fail exit 2,
                      walk-fail exit 3, kubeconfig-fail exit 3,
                      cobra args check

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.
Coverage: push 81.0%, cluster 83.2%, schema 80.7%, cli 52.0%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dataset): drop duplicate pluralS, reuse existing plural helper

Cursor Bugbot (Low severity) on PR #8: pluralS() in dataset.go was
an exact duplicate of plural() already defined in ingest.go — same
cli package, identical body. Removed pluralS, the schema-error
diagnostic now calls plural() directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): reject path-traversal table names (Bugbot, Medium)

Cursor Bugbot on PR #8 (commit 4240097), Medium severity: the
`table` value flows unsanitized into the /data/shared/<table>/ PVC
path. The embedded ingest.v1 schema only enforces minLength:1 on
`table` — no pattern — so --table=../../etc would resolve (via
path-cleaning) to /etc, and PR-b's stage Pod would write outside
the intended subtree, potentially clobbering another table's
staged data.

Fix:

  - push.ValidateTableName(table): rejects anything not matching
    ^[A-Za-z0-9_]+$ — the intersection of "valid unquoted MySQL
    identifier" and "safe single path segment". Called as step 1
    of runDatasetPush, before SpecArgs.Build().

  - StagedPrefix hardened: drops path.Join (whose ".."-cleaning
    is the silent footgun) for plain concatenation, and panics if
    handed an unsafe name. Callers MUST ValidateTableName first;
    the panic is a defense-in-depth backstop so a PR-b call site
    that forgets validation fails loudly in tests instead of
    silently escaping the prefix.

  - Build() doc gains the precondition note.

The upstream half — adding a `pattern` constraint to the schema's
`table` field so the helm flow + jobs-manager get the same guard —
is filed as tracebloc/data-ingestors#116. Once that lands and the
CLI re-syncs the schema, ValidateTableName collapses to a thin
wrapper.

Tests:
  - TestValidateTableName_Accepts / _RejectsUnsafe: the security
    regression pin (../../etc, ../foo, slashes, dots, etc.)
  - TestStagedPrefix_PanicsOnUnsafeName: the backstop
  - TestDatasetPush_TraversalTableName_ExitsTwo: CLI-layer,
    --table=../../etc → exit 2 before any cluster work

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.
Coverage: push 83.3%, cluster 83.2%, schema 80.7%, cli 52.2%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(push): export HumanBytes, drop the copy in dataset.go

Cursor Bugbot on PR #8 (commit 837157f), Low severity:
humanBytesForSummary in dataset.go was a line-for-line copy of
humanBytes in walk.go — both new in this PR. A future tweak (TiB
support, precision change) would likely patch one and not the
other, drifting the size shown in an over-cap error from the size
shown in the dry-run summary.

Fix: export push.HumanBytes as the single implementation; delete
the dataset.go copy. Both the over-cap diagnostics and the
pre-flight summary now format through the same function.

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): reject a directory named labels.csv in Discover

Cursor Bugbot on PR #8 (commit 9915866), Low severity: Discover
checked imagesStat.IsDir() for the images/ entry but never the
reverse for labels.csv. A directory literally named "labels.csv"
passes os.Stat, so the pre-flight would accept it and PR-b's tar
stream would later fail confusingly trying to read a directory as
a CSV.

Added the symmetric labelsStat.IsDir() guard with a clear error.
TestDiscover_LabelsCSVIsDirectory pins it.

Locally: vet, test -race -cover, gofmt -s, errcheck — all green.
Coverage: push 83.8%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset): stage Pod + tar-over-exec stream (PR-b of #151)

Completes Phase 3 (tracebloc/client#151). PR-a delivered the
no-op-safe pre-flight; this PR-b plugs in the actual file
staging — ephemeral Pod, client-go SPDY executor, tar stream,
SIGINT-safe cleanup.

## What lands

internal/push/ (5 new files):
- pod.go        Stage Pod spec builder + Create/Wait/Delete
                lifecycle. Alpine 3.20 pinned by digest, PSA-
                restricted security context, activeDeadline-
                Seconds=600 in-cluster self-kill defense-in-depth,
                random 8-hex-char suffix for parallel-push
                collision avoidance.
- stream.go     Executor interface (SPDYExecutor in prod,
                fakeExecutor in tests). StreamLayout wires a
                tar.Writer through an io.Pipe to the exec stdin,
                running `tar -xf - -C /data/shared/<table>` in
                the Pod. Captures remote stderr into the error
                surface so customers see "no space left on
                device" verbatim, not a generic exec failure.
- progress.go   schollz/progressbar/v3 wrapper, TTY-detected via
                golang.org/x/term. No-op in CI / non-TTY output
                so the bar doesn't pollute log streams.
- orphan.go     Scans for stage Pods labeled
                managed-by=tracebloc-cli older than 5 min;
                renders an actionable warning with the
                kubectl-delete command. Warn-only in v0.1
                (auto-cleanup is v0.2 — can't distinguish
                "crashed previous push" from "still-running
                parallel push from another workstation").
- stage.go      The Stage() orchestrator. Order: orphan scan →
                CreateStagePod → defer DeleteStagePod with a
                FRESH context (SIGINT-safe — defers fire even
                when the parent ctx is cancelled) → WaitForReady
                → StreamLayout → cleanup. On any error past
                Create, the deferred delete still runs.

internal/cli/dataset.go:
- Replace the exit-6 stub with push.Stage(...) wired to
  cluster-discovered ns/PVC/SA + the new SPDYExecutor.
- Add v0.1 category gate (Medium-1 from PR-a self-review).
  Runs BEFORE schema validation so unsupported-but-schema-
  valid categories like tabular_classification get the
  actionable "v0.1 supports only image_classification" message
  instead of the schema's confusing "missing property 'schema'".
- Add --stage-pod-image flag for air-gapped customers.
- Update exit-code doc: drop exit 6 (PR-b stub), add exit 7
  (staging-step failed, distinct from pre-flight 3/4).

internal/push/stage_test.go covers Medium-2 from PR-a review:
TestStage_IngestorSANameFlowsToPod pins that the --ingestor-sa
override actually lands on the stage Pod's ServiceAccountName.

## Tricky bits

Pipe deadlock guard: when the executor returns early (ctx
cancel, remote tar dies immediately), the tar-write goroutine
would block on pipe.Write forever because nothing reads. Fix:
close the pipe Reader after exec.Exec returns, which unblocks
the writer with io.ErrClosedPipe. Caught by
TestStage_CancelledContext_StillCleansUp.

SIGINT safety: the deferred DeleteStagePod uses a fresh ctx
with a 30s deadline rather than the (possibly cancelled)
parent ctx. Without this, Ctrl-C right after pod-create leaks
an orphan Pod. activeDeadlineSeconds=600 is the in-cluster
backstop for the truly-pathological case (hard-kill, network
partition between laptop and cluster).

errcheck cleanup: writeLayoutTar uses a named return + deferred
close that promotes a tar trailer-write error if and only if
the function otherwise succeeded — silent truncation would be
much worse than a noisy error.

## Test plan

- [x] go vet ./... — green
- [x] go test -race -cover ./... — green
      (push 79.8%, cluster 83.2%, schema 80.7%, cli 52.2%)
- [x] gofmt -s -l . — no drift
- [x] errcheck ./... — green
- [x] go build — binary builds
- [x] Local smoke: --category=tabular_classification → exit 2
      with "v0.1 supports only image_classification" message
      (the new gate's actionable diagnostic)
- [ ] Real EKS smoke (manual; out of CI scope)

## Open items deferred

Per PR-a review's Low/Nit list:
- runDatasetPushArgs.Context shadowing context.Context
- printPushPreflight rendering AccessModes with %v
- testutil package consolidation for imgcDir / imgcLayout
- dataset_test kubeconfig path traversal via subtests
- Discover hint for nested-image-dirs

These are tracked for v0.2 cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): DNS-1123-safe Pod names + wire SIGINT to ctx (Bugbot, High+Med)

Two real bugs caught by Bugbot's re-review on PR-b:

## 1. High — Stage Pod name breaks DNS-1123

BuildStagePodSpec used the raw table name as a Pod name segment.
ValidateTableName accepts [A-Za-z0-9_]+ (MySQL identifier rules), but
K8s Pod names follow DNS-1123 subdomain rules: lowercase alnum +
hyphen, must start/end alnum. The canonical example throughout
tracebloc docs (cats_dogs_train, snake_case) would fail Pod create
post-pre-flight — worst-of-both-worlds UX (pre-flight says "good!"
then create fails).

Fix: transform the table name into a DNS-1123-safe segment for the
Pod name only — lowercase, _→-, trim leading/trailing hyphens, cap
length, fallback to "tbl" for the pathological all-underscore case.
The original verbatim name still goes in the tracebloc.io/table
label so orphan warnings stay traceable.

TestDNS1123SafeTableSegment covers cats_dogs → cats-dogs, MyTable →
mytable, _leading_underscore → leading-underscore, _ → tbl,
50-char → 30-char truncation. Each result is cross-validated
against an inline DNS-1123 regex check.

## 2. Medium — SIGINT skips Pod cleanup

push.Stage documented SIGINT-safe cleanup via context cancellation +
fresh-ctx deferred DeleteStagePod, but cmd/tracebloc/main.go called
Execute() without signal.NotifyContext. Default Go runtime behaviour
on SIGINT is to exit without running defers — so every Ctrl-C during
staging leaked an orphan Pod until activeDeadlineSeconds (10 min)
fired. The docstring was false advertising.

Fix: signal.NotifyContext(ctx, SIGINT, SIGTERM) in main, passed to
ExecuteContext. First Ctrl-C cancels the cobra ctx → push.Stage's
in-flight HTTP cancels → deferred cleanup runs (with its own fresh
ctx, also kept). Second Ctrl-C does normal hard-kill (stop() in
defer un-registers the handler) — important if the cleanup itself
hangs.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): surface NotFound/Forbidden + propagate rand error (Bugbot)

Two new findings from Bugbot's re-review on commit 35ae044:

## Medium — WaitForStagePodReady polled through terminal errors

The poll callback returned (false, nil) for EVERY Pods.Get error,
which meant Pod-deleted-out-of-band (NotFound) or RBAC-revoked-mid-
push (Forbidden) waited the full 60s timeout despite the situation
being unrecoverable. Customer-facing impact: 60s of "Waiting for
stage Pod to be Ready..." for an error that should surface in <1s.

Fix: classify the error. NotFound + Forbidden terminate the poll
immediately (return (false, err)); everything else stays transient
(network blip, brief API unavailability). The
apierrors.Is{NotFound,Forbidden} helpers were already imported
from the previous fixes, so this is a single-block change.

Two new tests pin the contract via t.Now() bounds — if a regression
makes either case transient again, the test waits the full 60s and
the assertion catches it.

## Low — BuildStagePodSpec swallowed crypto/rand error

`suffix, _ := randomSuffix(4)` — if crypto/rand failed (rare but
possible on systems with exhausted entropy), the suffix was empty
and the Pod name became `tracebloc-stage-cats-dogs-` with a bare
trailing hyphen. DNS-1123 rejects that → opaque API error from
kube-apiserver instead of a clear local diagnostic.

Fix: change BuildStagePodSpec signature to (*Pod, error) and
propagate the rand failure. CreateStagePod (the prod caller)
already returns an error; tests use a mustBuildStagePod helper
that t.Fatal's on the rare path. 8 test call sites updated via
find-and-replace, plus the override-image test by hand.

The third inline comment ("SIGINT skips pod cleanup", bug_id
24ab9106) is a stale carry-over of the prior-round finding I
already fixed in 35ae044 — same bug_id, GitHub auto-shifted the
anchor onto the new commit. Replied in-thread pointing at the fix.
Bugbot's review-body confirms "2 NEW findings", not 3.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): close symlink size-cap bypass + Phase=Failed short-circuit

Round 4 of Bugbot fixes on PR-b. Two new findings on commit 7baa321
(plus 3 stale carry-overs from earlier rounds — same bug IDs as
already-fixed findings, GitHub keeps re-anchoring them onto each
new commit; Bugbot's review-body confirms "2 NEW findings").

## Medium-Security — symlink images bypassed size caps

The vulnerability: Discover sized image entries via DirEntry.Info()
(Lstat-equivalent: returns the symlink's own ~100-byte size).
writeTarFile read them via os.Stat + os.Open (which follow
symlinks). So a customer with `images/evil.jpg` symlinked to
`/var/log/system.log` (multi-GB) or `~/.ssh/id_rsa` could:

  1. Pass Discover's MaxSingleFileBytes + MaxTotalBytes caps
     trivially (cap is 1 GiB; the symlink itself is ~100 bytes).
  2. Stream the target's FULL contents into the cluster PVC, where
     the cluster admin can read them via `kubectl exec`.

This is both a size-cap bypass AND an arbitrary-local-file
disclosure. Worse, it can fire unintentionally — a customer who
ran `ln -s ~/datasets/big-images images` to share data across
projects would silently stream gigabytes during what they thought
was a small test push.

Fix in three layers:

  1. walk.go: os.Stat → os.Lstat for the labels.csv check (so the
     mode bits include ModeSymlink) + new rejectSymlink() helper
     called for both labels.csv and each image entry. v0.1 rejects
     symlinks outright with a clear "v0.1 doesn't allow symlinks
     (security: ...)" message pointing at `cp -L` for materializing
     and the v0.2 cloud-source story for distributed data.
  2. stream.go: writeTarFile's os.Stat → os.Lstat + symlink check
     too. Defense-in-depth — Discover is the primary fix, but a
     future refactor that calls writeTarFile directly (e.g. a new
     resume-from-partial-transfer code path in v0.2) would
     re-introduce the hole without this layer.
  3. New tests pin both rejections (labels.csv-as-symlink + image-
     as-symlink). Skipped on Windows where symlinks require admin.

## Medium — WaitForStagePodReady didn't short-circuit on Phase=Failed

Companion to the NotFound/Forbidden short-circuit landed in the
previous commit. The poll positively-terminated on Ready=True
but never negatively-terminated on Phase=Failed (container
crashed at startup: PSA rejection, image crashloop, OOM at
container init) or Phase=Succeeded (stage container's sleep
exited unexpectedly). Same 60s timeout symptom for an outcome
that's actually decided in <5s.

Fix: in the poll body, check Phase after the conditions loop;
return a structured error including the podReadyTimeoutHint
output (container-status reason + message) so the customer sees
"terminated in phase Failed (OOMKilled — ...)" instead of a
generic timeout.

New test TestWaitForStagePodReady_FailedPhaseIsTerminal pins
the <3s elapsed contract for an OOMKilled startup.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): Lstat images dir + table-len cap + hermetic re-push + progress defer

Round 5 of Bugbot fixes on PR-b. The "5 potential issues" review-body
count corresponds to 5 NEW findings; the other 5 inline comments are
stale carry-overs of bug IDs already fixed in earlier rounds (24ab9106,
15e29a3e, 258869d3, 5c577688, 7fcf137f — GitHub auto-anchors them onto
each new commit).

## Medium — Symlinked images/ directory bypassed all checks

Round 4 fixed symlinked FILES inside images/ (Lstat on each entry).
It missed the case where images/ ITSELF is a symlink — `ln -s /etc
images` passed os.Stat (which follows symlinks), the IsDir check
succeeded, and the loop walked /etc. Same disclosure pattern as the
round-4 finding, one layer up.

Fix: os.Stat → os.Lstat on imagesDir + rejectSymlink before the
IsDir check. Symmetric with the labels.csv treatment from round 4.
New test TestDiscover_RejectsSymlinkedImagesDir pins the rejection.

## Medium — Table label could exceed Kubernetes 63-char limit

ValidateTableName accepted [A-Za-z0-9_]+ without a length cap, but
the stage Pod's tracebloc.io/table label carries the raw name. K8s
label values cap at 63 chars (DNS-1123 label rule), so a 100-char
name passed pre-flight then failed Pod creation with an opaque
label-validation error.

Fix: cap at 63 chars (also matches MySQL's 64-char identifier limit
with 1 char of headroom for any future ingestion-run-id suffix).
New MaxTableNameLength const + boundary test pin the cap.

## Medium — Re-push left stale PVC files

The remote command was `mkdir -p <dest> && tar -xf - -C <dest>`. If
a previous push had 3 images and the new push has 2, the PVC ends
up with the union — and labels.csv now disagrees with the stage
directory, silently breaking ingestion.

Fix: `rm -rf <dest> && mkdir -p <dest> && tar -xf - -C <dest>`. Safe
because dest = StagedPrefix(table) = /data/shared/<table> and
ValidateTableName has ensured `table` is a single safe segment, so
rm -rf can only nuke that one per-table subdir, never the parent
/data/shared or sibling tables. TestStreamLayout_RemoteCommand
updated to assert the rm AND its ordering before mkdir.

## Low — Progress bar not Finish'd on early Stage failure

Stage's deferred Finish lives inside StreamLayout, so a failure
earlier in the lifecycle (CreateStagePod, WaitForStagePodReady)
returned without clearing the TTY bar. Visual artifact on the
customer's terminal after a Pod-create failure.

Fix: defer progress.Finish() at the construction site in
runDatasetPush. Schollz Finish is idempotent so double-call on
happy path is a no-op.

## Hard-link bypass — documented, not fixed

Bugbot also flagged that hard links to outside files aren't
rejected. Filesystem mode bits don't distinguish a hard link from
a regular file the way ModeSymlink distinguishes symlinks, and a
high-Nlink check has too many false positives.

The implicit security boundary is the CLI's process-level read
permissions: a customer can only hard-link files they already
have read access to, so this isn't a privilege escalation. v0.2
may add openat(O_NOFOLLOW) sandboxing if customers need harder
isolation. Documented in rejectSymlink's docstring as a known
limitation alongside the v0.2 plan.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): 30m activeDeadline + portable tar paths (Bugbot round 6)

Round 6. Review-body says "2 NEW + 3 unresolved" — Bugbot itself is
flagging the 9 carry-overs at this point. Two new findings, both
real:

## High — Pod activeDeadlineSeconds (600s) cut too close

The deadline starts at Pod CREATION, but image pull (up to 60s) +
WaitForStagePodReady ceiling (60s) + worst-case stream (1 GiB at
2 MB/s ≈ 8.5 min) eat the budget. Near-cap customers on slow
uplinks could see the kubelet terminate mid-transfer with no
useful diagnostic.

Fix: bump StagePodActiveDeadline from 600 → 1800 (30 min). Budget
breakdown in the const's docstring; comfortable margin for
variance. Cost is "an idle alpine Pod with sleep idles for ~22min
after a successful push" — ~5 MiB cluster RAM, zero CPU, deleted
seconds later by the defer'd cleanup anyway.

Test pin: TestStagePodActiveDeadline_CoversFullLifecycle asserts
the floor at 1500s so a future regression to 600 is caught.

## Medium — Windows tar paths used backslashes

`filepath.Join("images", filepath.Base(abs))` produces
`images\file.jpg` on Windows. USTAR / POSIX tar requires forward
slashes; the Linux stage Pod's `tar -xf` either rejects or
extracts as a flat-named file (collapsing the images/ subdir
the ingest spec expects). Breaks the Windows-built CLI.

Fix: switch to `path.Join` for the tar HEADER name. Keep
`filepath.Join` everywhere else (where the OS-native separator
is the right thing). Test pin
TestStreamLayout_TarPathsAreForwardSlash asserts no backslashes
in any tar entry name, regardless of host OS.

## Carry-overs

Bugbot's review body now explicitly says "There are 3 total
unresolved" — it knows the carry-overs aren't on the new commit
anymore. Counts I'm tracking as known-fixed across earlier
rounds: 24ab9106 (SIGINT, r3), 15e29a3e (Pod-Get spin, r3),
258869d3 (rand error, r3), 5c577688 (Failed Pod, r4), 7fcf137f
(symlink files, r4), de426248 (hard links, r5 — documented
limitation), cd462de9 (symlink dir, r5), a8e9e5c7 (label len,
r5), 0b296807 (progress finish, r5).

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): wait-error labels + orphan-scan precision + transactional re-push

Round 7 of Bugbot fixes on PR-b. Review-body confirmed "4 NEW + ...".
All four real, all four fixed:

## Medium — Wait errors mislabeled as timeout

WaitForStagePodReady wrapped EVERY poll exit (NotFound, Forbidden,
Phase=Failed, ctx.Canceled from SIGINT) in "did not become Ready
within 60s." Misleading: the customer who hit Ctrl-C 2 seconds
into the wait would see a "timed out" diagnostic.

Fix: branch on errors.Is(err, context.DeadlineExceeded). True
timeout keeps the "did not become Ready within %s" wording;
early-exit cases surface as "did not reach Ready state: %w" with
the actual cause front and center.

## Medium — Orphan-delete hint would nuke parallel pushes

FormatOrphansWarning's kubectl-delete command used the
`managed-by=tracebloc-cli,component=stage-pod` label selector,
which matches every stage Pod in the namespace including
LEGITIMATE RUNNING ONES from parallel pushes (other workstations,
or even this one's just-started Pod). A customer copy-pasting the
suggested cleanup could silently kill someone else's in-progress
push.

Fix: list specific orphan Pod names in the delete command —
`kubectl delete pod -n <ns> <name1> <name2> ...`. Test regression-
pins the absence of the label-selector form.

## Medium — Re-push deleted before transfer succeeded

Round 5's hermetic-re-push fix (rm -rf $DEST && mkdir + tar)
satisfied "no stale files" but not "preserve on failure." A
tar mid-stream failure (Ctrl-C, network drop, container OOM) left
the customer with NOTHING on the PVC — the previous push's data
already nuked, the new push aborted before completing.

Fix: extract to <dest>.staging, swap on success:

  rm -rf <dest>.staging              # cleanup any prior failure
  mkdir -p <dest>.staging && tar -xf - -C <dest>.staging
  rm -rf <dest> && mv <dest>.staging <dest>    # swap on success
  rm -rf <dest>.staging              # defensive cleanup

The shell's && sequencing means swap only fires if tar succeeded.
Lost the prior `exec /bin/tar` micro-optimization (can't exec
mid-chain) — fine, the shell process is tiny. The window between
the destination rm and the mv is sub-ms; closing it fully would
need a double-mv (old/new/cleanup) which is v0.2 territory.

stream_test.go pins:
  - mkdir-p of .staging happens
  - tar extracts to .staging (not directly to dest)
  - mv from .staging to dest exists
  - tar happens BEFORE any rm of $DEST (transactional property)
  - no rm of the parent /data/shared (sibling-table safety)

## Medium — Orphan scan flagged active pushes

FindOrphanStagePods's only filter was age > 5min. But pod.go itself
budgets ~8.5 min for a 1 GiB transfer — a legitimately-running
near-cap push would be mislabeled as orphan by the same customer's
next concurrent push, and the (newly-fixed) delete hint would now
correctly target that specific running Pod by name, killing it.

Fix: skip Phase=Running pods entirely. A Running stage Pod is
presumed to be doing real work; activeDeadlineSeconds is its
cluster-side safety net. Pods in non-Running phases past grace
(Pending stuck on image pull, Failed from crash) still flag —
those are the genuine orphan shapes.

Two new tests:
  - TestFindOrphanStagePods_SkipsRunningPods: 30-min-old Running
    Pod doesn't surface as orphan
  - TestFindOrphanStagePods_FlagsNonRunningPastGrace: 30-min-old
    Failed Pod DOES surface (regression guard — narrowing the
    filter should reduce false positives, not eliminate the
    warning entirely)

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): unique staging suffix + stream-time size-cap recheck

Round 8 of Bugbot fixes. Two new findings on commit 42c5e93,
both real:

## Medium — Parallel pushes corrupted staging

Two concurrent `dataset push` runs for the same `--table` shared
one `$DEST.staging` path on the PVC. Each Pod's `rm -rf
$DEST.staging`, `tar -xf -`, `mv` sequence raced against the
other, with no coordination. Worst case: push A's `rm -rf`
wipes push B's mid-extraction state, B's tar then writes into
a partially-removed dir, customer ends up with an interleaved-
corrupted-and-half-committed dataset.

Fix: each invocation generates a fresh 8-hex-char suffix for the
staging dir name (`$DEST.staging-<8hex>`), reusing pod.go's
randomSuffix. Two parallel pushes now have distinct staging
dirs — no interleaved-write hole.

The FINAL `rm $DEST && mv $STAGING $DEST` step is still
last-write-wins under concurrent commits, but that's an
acceptable v0.1 semantic (concurrent pushes for the same table
are inherently undefined; whichever commits last "wins" with a
COHERENT dataset, not an interleaved one).

New test TestStreamLayout_StagingSuffixIsUniquePerInvocation
pins the contract by calling StreamLayout twice and asserting
distinct staging suffixes.

## Medium — Stream skipped size-cap enforcement (TOCTOU)

Pre-flight Discover checked MaxSingleFileBytes + MaxTotalBytes,
but writeTarFile / writeLayoutTar streamed files later via
os.Lstat → io.Copy with NO re-check. A file that grew between
pre-flight and stream (legitimate dataset prep racing the push,
or adversarial overwrites) would silently upload past the 500 MiB
/ 1 GiB advertised caps.

Fix in three layers:

  1. writeTarFile re-checks the single-file cap (os.Lstat size
     vs MaxSingleFileBytes) right before WriteHeader. A file
     that grew gets a clear "exceeds v0.1 single-file cap"
     error using the same sizeError formatter Discover uses.

  2. writeTarFile now returns (int64, error) — the declared
     header size. writeLayoutTar accumulates this into a
     running total and aborts mid-stream if it exceeds
     MaxTotalBytes. Tests added below to pin the running-total
     contract.

  3. io.Copy → io.CopyN(tw, f, st.Size()) caps the actual byte
     transfer at the declared header size. Without the cap, a
     file that grew between Lstat and io.Copy would overflow
     the tar header — header says N bytes, body has > N → tar
     archive corruption. CopyN-and-trust truncates instead.

Closes both the metadata-side (header size) and body-side
(byte stream) halves of the stream-time TOCTOU window.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): hoist randomSuffix above goroutine + clean orphan staging dirs

Round 9 of Bugbot fixes. Both findings real:

## Medium — Tar goroutine could deadlock on suffix error

The randomSuffix call landed AFTER the tar-write goroutine spawned.
If crypto/rand failed (rare but possible on entropy-starved systems
— same scenario the round-3 BuildStagePodSpec fix handles), the
function returned without closing the pipe or waiting on the
goroutine. The goroutine would then block once the ~64 KB pipe
buffer filled, leaking forever.

Fix: hoist randomSuffix (and the entire remoteCmd composition)
ABOVE the io.Pipe + goroutine setup. The function now bails on
suffix failure before touching the pipe at all, so there's
nothing to deadlock on.

## Medium — Failed pushes leaked .staging-<hex> dirs

Round 8's unique-per-invocation suffix fixed parallel-push
interleaving but created a new leak: if THIS push fails before
the final `mv` step, the .staging-<hex> dir lingers on the PVC.
Round 8's `rm -rf $STAGING` at the start only cleans up THIS
invocation's path, not the previous-failed one. Repeated failed
pushes accumulate unbounded storage on the shared PVC.

Fix: append a defensive cleanup pass to the remote script:

  find $(dirname $DEST) -maxdepth 1 -name "$(basename $DEST).staging-*" \\
    -mmin +60 -exec rm -rf {} + 2>/dev/null || true

Constraints baked into the pattern:
  - -mmin +60 (1 hour) is 2x activeDeadlineSeconds — anything
    older HAS to be from a dead stage Pod, so we can't race with
    a parallel push's in-progress staging
  - -name pattern scopes to THIS table's staging siblings only;
    other tables' .staging-* dirs are none of our business
  - 2>/dev/null || true — find failures don't fail the whole
    stream (defensive cleanup is best-effort, the customer's
    actual push already committed before this runs)
  - Uses ';' (semicolon) instead of '&&' so it runs even if the
    main push sequence failed somehow — orphan cleanup should
    happen regardless

Updated TestStreamLayout_RemoteCommand asserts the find pattern
is present with the exact -mmin window and table-scoped -name.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): backup-and-rollback dir swap (Bugbot r10)

Bugbot's tenth-round single finding: the previous r7 pattern
of `rm $DEST && mv $STAGING $DEST` had a small but real failure
window — if the rm succeeded and the mv failed (ENOSPC, fs-level
rename error, transient kernel weirdness), the customer's
previous dataset was already gone and the new data was stranded
under .staging-<hex> where the ingestor wouldn't see it.

Fix: backup-and-rollback. Rename the existing $DEST to a unique
$DEST.old-<hex> sibling FIRST; only then mv $STAGING into $DEST;
on the latter's failure, restore the backup. On success, drop
the backup.

  { [ -e $DEST ] && mv $DEST $DEST.old-<hex> || true; } &&
  { mv $STAGING $DEST ||
    { [ -e $DEST.old-<hex> ] && mv $DEST.old-<hex> $DEST; exit 1; }; } &&
  rm -rf $DEST.old-<hex>

Properties this gives us:

  - tar fails: $DEST untouched (transactional from r7)
  - backup mv fails: $DEST untouched
  - main mv fails: backup at .old-<hex> survives; customer
    recovers via `kubectl exec -- mv .old-<hex> $DEST`
  - final rm fails: just leaves an .old-<hex> cruft, picked up
    by the r9 find -mmin +60 sweep (extended to also catch
    .old-* siblings)
  - first-ever push (no pre-existing $DEST): `[ -e $DEST ] && ...`
    short-circuits, backup mv silently skipped via `|| true`

The .old-<hex> suffix reuses the same random as .staging-<hex>
so two parallel pushes can't collide on the backup name. Both
.staging-* AND .old-* now flow through the orphan-cleanup find
pattern (with the `-o` alternation), so r9's leak-prevention
covers both halves of the swap.

Test updates pin:
  - Backup mv (DEST → .old) appears BEFORE primary mv
    (.staging → DEST) — rollback contract
  - Backup and staging suffixes MUST agree (rollback-target
    correctness)
  - find pattern includes both -name "...staging-*" and
    -name "...old-*"

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

This is round 10. The new-finding rate has been decaying:
5→2→2→2→1. We're approaching the asymptote.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(push): propagate remote-script failures + prefer tar error (Bugbot r11)

Two new findings on r10's commit, both real:

## Medium — Remote script masked failures via POSIX `;` + `|| true`

The previous shell used `&&-chain ; find ... || true`. POSIX `;`
runs the next command regardless of prior exit status, and the
trailing `|| true` then forces exit 0. A failed tar or mv earlier
in the && chain would set $? to non-zero, then `;` would clobber
it back to 0 via the find's `|| true`. Result: a remote push
that actually failed would return exit 0 to the exec subprocess,
and StreamLayout would happily report "Staged N files" on a
catastrophic failure.

Fix: rewrite the script using `set -e` + explicit newline-
separated statements + an explicit `if ! mv; then rollback;
exit 1; fi` for the swap. With set -e:
  - Any non-guarded non-zero exits the script with that status.
  - `cmd || true` continues to suppress (find cleanup stays
    best-effort).
  - `if ! cmd; then ...; fi` is treated as guarded, so set -e
    doesn't preempt the explicit rollback handler.

Multi-line shell-c args work across busybox sh (alpine 3.20),
dash, and bash equally. No portability regression.

## Medium — Stream error masked the upstream tar error

After r8 added stream-time MaxSingleFileBytes / MaxTotalBytes
rechecks in writeLayoutTar, the LOCAL tar build can legitimately
fail mid-stream. When it does, the goroutine's CloseWithError
propagates to the pipe reader; exec sees broken-pipe and returns
a generic "exec stream against ns/pod failed" error.

The previous code checked streamErr FIRST and returned the
exec-flavored framing. Customer saw "streaming files failed"
instead of the actually-actionable "dataset exceeded v0.1 total
cap of 1.00 GiB after streaming ..." diagnostic from the tar
side.

Fix: swap the order. Check tarErr first — when both are non-nil,
the tar side is the upstream cause; the broken-pipe streamErr is
downstream noise. streamErr-only (no tarErr) is still the real
network/RBAC/remote-tar case and gets surfaced with the exec
wording.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Round 11 finding rate: 2 new (1+2+5+2+4+2+2+1+2 → still trending
roughly downward, though not monotone). One more clean round
and we're at the asymptote.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: doc/cleanup cross-cutting drift from 11 rounds of bugbot fixes

Independent review pass over PR #9 (post-bugbot-r11). Five findings,
none caught by bugbot — all cross-cutting doc/style:

1. stream.go: dead comment block above remoteCmd composition. The
   block was a holdover from the && era; it claimed "&& sequencing
   means the swap only fires if tar succeeded" but we switched to
   `set -e` in r11. The canonical pipeline doc lives below the
   composition; collapsed the upstairs block into a single semantic-
   guarantees summary (HERMETIC/TRANSACTIONAL/PARALLEL-SAFE/EXIT-
   FAITHFUL) and removed the contradictory && reference.

2. stage.go: StageOptions docstring referenced a non-existent
   `OrphanLogger` field. Stale from an early design where orphan
   warnings were emitted via a logger callback rather than the
   integrated FormatOrphansWarning we ended up with. Corrected to
   the actual nil-able fields (Progress, Out).

3. orphan.go: OrphanGracePeriod comment claimed "5 minutes is
   generous: a healthy stage Pod is fully done in ~30 seconds...
   under a couple minutes at the 1 GiB cap." That directly
   contradicts pod.go's StagePodActiveDeadline budget (~8.5 min
   for a 1 GiB transfer at 2 MB/s, plus image pull, plus ready
   wait). Rewrote to reflect the post-r7 semantic: Running Pods
   are never flagged regardless of age; the 5-min grace targets
   the Pending/Failed/Unknown shapes that are genuinely stuck.

4. orphan.go: joinNames was a 9-line wrapper around `strings.Join(
   names, " ")`. The comment justifying it ("single point of
   change for tests") doesn't hold — the tests assert on the
   final output string, not on this helper's surface. Inlined.

5. cli/dataset.go: runDatasetPush docstring still said "the PR-a
   slim implementation. It performs every pre-flight check... the
   actual file staging is gated behind a clear 'not yet
   implemented' error." PR-b (this PR) actually implements the
   staging — the doc lagged the code. Updated to reflect the
   complete Phase 3 flow.

No behavioral change. Tests + lint green. The doc fixes matter
because future readers (humans and the next bugbot pass) get
contradictory signals when the comments disagree with the code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…52) (#10)

* feat(dataset): Phase 4 — submit to jobs-manager + watch + summary (#152)

Closes the v0.1 happy path. After Phase 3 stages files on the PVC,
Phase 4 POSTs the ingest spec to jobs-manager, watches the spawned
ingestor Job, and renders the parsed INGESTION SUMMARY panel.
Customer sees a single end-to-end flow:

  $ tracebloc dataset push ./cats-dogs --table=cats_dogs_train ...
  Local dataset: ...
  Target cluster: ...
  Synthesized ingest spec: ...
  Created stage Pod tracebloc/tracebloc-stage-...
  Streaming N files (X MiB) to /data/shared/cats_dogs_train/...
  Staged N files to /data/shared/cats_dogs_train
  Submitted: jobs-manager spawned ingestor Job tracebloc/ingestor-...
  Streaming logs from Job tracebloc/ingestor-...:
   [ingestor's own log output streams here, verbatim]
  ┌─ Ingestion summary ──────────────────────────┐
  │ Ingestor ID:               run-abc           │
  │ Total records:             1,234             │
  │ Inserted:                  1,200             │
  │ ... etc                                       │
  └──────────────────────────────────────────────┘

## What lands

internal/submit/ (5 new files):

  - body.go       SubmitRequest struct (ingest_config + idempotency_key
                  + optional image_digest); BuildRequest auto-generates
                  a 16-byte hex idempotency key per invocation unless
                  --idempotency-key overrides.
  - client.go     HTTPSubmitter with bearer-token auth; 4xx/5xx surface
                  jobs-manager's body verbatim via SubmitError;
                  IsAuthError(err) for the 401/403 exit-code branch.
  - watch.go     Poll-based watch (per the design discussion): wait
                  for the Job's Pod to exist + Running, stream logs via
                  GetLogs(Follow=true), terminal poll on Job conditions
                  to distinguish Succeeded/Failed/Unknown.
  - summary.go    Streaming parser for the 📊 INGESTION SUMMARY banner.
                  Strips ANSI codes, accumulates fields by prefix match,
                  finalizes on the closing ═-rule. RenderPanel formats
                  the parsed Summary as a box-drawn panel.
  - submit.go     Run() orchestrator: BuildRequest → POST → announce
                  (Spawned vs Replayed) → either --detach exit or watch
                  loop. SIGINT during watch produces Outcome=Detached
                  with a kubectl-logs reconnect hint.

internal/cli/dataset.go: wired after push.Stage. Mints a 1-hour SA
token (vs cluster info's 10 min — the watch loop can run that long).
Adds --detach, --idempotency-key, --image-digest flags. Maps Submit
+ Watch outcomes to exit codes 5/8/9.

## Design choices (per pre-build discussion)

  - Poll every 2s for Job status (not client-go Watch). Matches the
    Phase 3 WaitForStagePodReady pattern; simpler test surface, no
    long-lived connection to babysit. ~30 API calls/min is negligible.
  - SIGINT-after-201 → graceful detach. The cluster has already
    accepted the run; Ctrl-C just stops watching. Customer sees the
    kubectl logs reconnect hint and exit 0. Matches kubectl logs
    behavior.

## Exit codes (full set, v0.1)

  0 — success (or --detach success: 201 came back clean)
  2 — schema / v0.1-unsupported-category
  3 — local-layout or kubeconfig
  4 — cluster discovery (parent release or shared PVC missing)
  5 — SA token couldn't be obtained, OR jobs-manager 401/403
  7 — Phase 3 stage step failed
  8 — jobs-manager 4xx/5xx other than auth
  9 — ingestion Job exited non-zero, OR summary reports row failures

## Tests

submit/  ≥85% line coverage across 5 files. Per-file:
  - body_test.go     idempotency-key generation, override, uniqueness,
                     JSON shape pinning (image_digest omitempty)
  - client_test.go   httptest.Server: 201 happy path, replay path,
                     4xx body framing, 401/403 → IsAuthError,
                     5xx ≠ auth, missing-job_name, network drop,
                     ctx-cancel
  - summary_test.go  Real ingestor banner end-to-end, byte-by-byte
                     streamed feed, no-banner path (early failure),
                     post-banner lines ignored, ANSI stripping,
                     panel rendering, comma formatting
  - watch_test.go    Fake clientset: Running Pod surfaces, fast-
                     completion path, Forbidden short-circuits,
                     no-Pod timeout, Job Conditions to outcome,
                     parserWriter contract
  - submit_test.go   Run orchestrator: --detach happy path, replay,
                     SubmitError propagation, request fields plumb
                     through, nil-Out defaults to Discard, IsAuthError

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Phase 5 (#153: GitHub Releases, Homebrew tap, install.sh) is the
last item on the v0.1 epic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submit): enforce JobWatchTimeout + flush parser + typed watch errors + Unknown exit code

Four Bugbot findings on PR #10's initial commit, all real:

## Medium — JobWatchTimeout was declared but never enforced

The constant existed and documented "absolute cap on a single
dataset-push's watch phase," but WatchJob never used it. Per-step
timeouts (PodReadyTimeout, finalJobStatus's 30s) bounded sub-loops,
but the log-stream phase ran indefinitely against ctx.Done() with
no overall cap. A stuck ingestor would hang the CLI forever.

Fix: context.WithTimeout(ctx, JobWatchTimeout) at the top of
WatchJob. Cancel via defer. Cap fires as ctx.DeadlineExceeded
through the inner loops.

## Medium — Watch failures used the submit exit code

submit.Run wrapped both POST errors AND watch errors in the same
return path. cli/dataset.go's switch mapped IsAuthError → 5 and
"everything else" → 8 (submit-failed). But waitForJobPod /
streamPodLogsAndParse / finalJobStatus failures aren't submit
failures — jobs-manager already accepted the run, the cluster is
doing the work, the CLI just lost the ability to follow along.
Exit 8 misled customers into thinking the submit was rejected.

Fix: introduce a WatchError type wrapping watch-phase errors;
add IsWatchError predicate. cli/dataset.go now routes:
  IsAuthError      → 5 (token / auth)
  IsWatchError     → 9 (ingest-side; cluster keeps running)
  else             → 8 (submit-side rejection)

Test TestIsWatchError pins the typing including wrapped errors.

## Low — Log parser didn't flush partial last lines

streamPodLogsAndParse used bufio.Scanner for line-based stream
display, but the SummaryParser's own internal buffer (for
chunked-write line assembly) was never flushed at end-of-stream.
If the Pod terminated without a trailing '\n' on its final
stdout write — rare but possible if the container's stdout was
killed mid-write — the final line would be lost, potentially
including the closing ═-rule that finalizes the banner.

Fix: parser.FlushLine() at end-of-stream AND on non-EOF scanner
errors.

## Medium — Outcome=Unknown exited zero

The exit-code switch only handled Failed + Succeeded. Outcome=
Unknown (returned by finalJobStatus when the 30s post-stream
poll doesn't see a terminal condition) fell through to the
default `return nil` → exit 0. But "we don't know if it worked"
is not success.

Fix: add an explicit Unknown branch that exits 9 with a
diagnostic pointing the customer at `kubectl get job ...` to
check the actual outcome.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submit): zero-metric finalize + namespace check + fresh ctx for final status

Round 2 of Bugbot fixes on PR #10. Four findings, all real — three
are second-order effects of the r1 fixes:

## Medium — Zero-metric banner never finalized

The hasParsedAnyField check (which gates the closing ═-rule's
finalization) required ANY metric counter to be > 0. But an
ingestion that ran-then-died-early can legitimately produce a
banner where ALL counts are 0 — the ingestor still prints the
structure, just with zeroes. The parser would never finalize on
that case, leaving the parser open to perturbation from
post-banner shutdown logs.

Fix: track a dedicated sawAnyField bool that latches on ANY
successful field-line parse, regardless of the value. The
finalization check now keys off "did we see any banner field"
instead of "did we see a non-zero one."

## Medium — Missing submit response namespace not rejected

client.go rejected 2xx responses with no job_name but happily
accepted responses with no namespace. A malformed-but-parsing
response would then drive the watch loop's k8s API calls at the
empty namespace, plus produce a broken `kubectl logs` reconnect
hint.

Fix: parallel check for empty Namespace, same error shape as the
job_name check.

## Medium — Watch timeout starved finalJobStatus

R1 wrapped the WHOLE WatchJob in ctx.WithTimeout(JobWatchTimeout)
to enforce the 1-hour cap. Side effect: finalJobStatus (which
runs AFTER log streaming) inherited the SAME ctx — so a
59-minute log stream left finalJobStatus with 1 minute of
budget, and a 60-minute stream left it with 0. Successful slow
ingestions misreported as Unknown → exit 9.

Fix: split the ctx hierarchy.
  - customerCtx — original input, carries SIGINT cancellation
  - watchCtx = WithTimeout(customerCtx, JobWatchTimeout) — caps
    the pod-wait + log-stream phases
  - finalCtx = WithTimeout(customerCtx, 30s) — fresh budget
    derived from customerCtx, not from watchCtx

The customerCtx-vs-watchCtx distinction also fixes the detach
detection: check customerCtx.Err() not ctx.Err() so a
JobWatchTimeout expiry doesn't masquerade as a SIGINT-Detach.

## Medium — SIGINT after logs yielded exit 9

After log streaming ended cleanly, Ctrl-C during the brief
finalJobStatus poll returned a watch error → exit 9. But the
contract is "post-201 SIGINT = graceful detach, exit 0" — the
cluster already has the work, the customer is just done watching.

Fix: if customerCtx is Canceled inside the finalJobStatus error
branch, return Outcome=Detached with the same reconnect-hint
path the log-stream-cancel case uses. Consistent detach
semantics across both watch sub-phases.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submit): port-forward to jobs-manager from off-cluster (Bugbot r3)

Major architectural finding from Bugbot's third PR #10 review:
the CLI runs on the customer's laptop (or off-cluster CI), but
the discovered JobsManagerService URL is a *.svc.cluster.local
name that the laptop can't resolve. The submit POST would always
fail. The original Phase 4 design quietly assumed in-cluster
network reachability — wrong for the dominant v0.1 use case.

The chart's existing flow doesn't have this problem because the
helm hook runs as a Pod IN the cluster.

## Fix: in-process port-forward, same mechanism as kubectl pf

New internal/submit/portforward.go:
- PortForwardJobsManager opens an SPDY tunnel via the customer's
  kubeconfig-authenticated apiserver connection.
- pickServicePod resolves the jobs-manager Service to a Running
  Pod backing it (Services aren't directly port-forwardable;
  client-go's portforward API speaks Pod names).
- ForwardedConnection.Close tears it down on defer; idempotent so
  the orchestrator's defer-Close + any inner cleanup don't risk
  a double-close panic.

Wiring in cli/dataset.go:
- After token mint, open the port-forward to
  JobsManagerServiceName:JobsManagerPort (8080 today).
- Defer Close — runs on every path, including the SIGINT-detach
  + error paths from submit.Run.
- HTTPSubmitter targets http://localhost:<allocated-port>; the
  in-cluster URL is no longer used at all from the CLI.

cluster/discover.go: expose the bare Service name + port alongside
the existing FQDN URL. Phase 2 still constructs the FQDN for the
diagnostic output of `cluster info`, but Phase 4 needs the
unqualified pieces for the port-forward setup.

## Tests

internal/submit/portforward_test.go covers the pickServicePod
resolution (5 cases: happy path, skip-non-Running, no-matching-
Pod, missing-Service, selector-less). The full port-forward
loop needs a real apiserver — covered by EKS smoke.

internal/cluster/discover_test.go: updated the struct-literal
comparison in TestDiscoverParentRelease_HappyPath to include the
two new fields.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

## What we got wrong in the initial design

Designing Phase 4, I read the chart's existing flow (which runs
in-cluster) + assumed the CLI could use the same URL. That's
exactly the "cargo-cult the URL" mistake. The right framing
would have been: "Phase 4 needs to reach jobs-manager from the
laptop — what mechanism does kubectl have for that?" → port-
forward.

The fix is correct + well-tested where unit-testable. The full
loop will get exercised on the EKS smoke test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submit): pick most-recent useful-phase Pod, not items[0] (Bugbot r4)

Bugbot's fourth PR #10 finding: waitForJobPod picked pods.Items[0]
from an unordered List API response. A Job with backoffLimit > 0
(or any scenario where jobs-manager re-spawned the Pod for retry,
node drain, etc.) can have multiple Pods bearing the same
`job-name=<j>` label. Picking items[0] could grab the old Failed
Pod from a prior retry while the current Running one waits — the
CLI then streams stale logs and reports the prior-attempt
outcome.

Fix: iterate all Pods, prefer the most recent (by creationTimestamp)
in a useful phase (Running | Succeeded | Failed). Pending Pods
don't count — no log stream yet, so we keep polling rather than
"return a Pending pod's name."

Behaviorally:
  - Single Pod (common): same as before — that one Pod returns.
  - Old Failed + new Running: returns the Running. Correct.
  - Old Failed + new Pending: returns nothing yet; keeps polling
    until Pending → Running. Correct.
  - Multiple Running (parallelism > 1): returns the most recent
    — defensible choice; either would work for log streaming.

Two new tests:
  - TestWaitForJobPod_PicksMostRecentNotFirst: seeds an older
    Failed + newer Running; asserts the Running's name returns.
  - TestWaitForJobPod_AllPendingKeepsPolling: seeds an only-
    Pending Pod; asserts DeadlineExceeded on a tight ctx (no
    early return with the Pending name).

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submit): Pod-wait timeout = Detached (not failed); non-blocking errCh send

Round 5 Bugbot findings on PR #10:

## Medium — Pod-wait timeout reported ingest failure

waitForJobPod times out after PodReadyTimeout (5min) if the
ingestor Pod doesn't reach Running — slow image pull, scheduling
backlog, PSA still rejecting. The previous flow wrapped that as
WatchError → exit 9 ("ingestion failed"). That's a false
positive: jobs-manager already accepted the submit, the cluster
will run the ingestion (eventually), the CLI just gave up
observing within the timeout.

Fix: distinguish errors.Is(err, context.DeadlineExceeded) from
waitForJobPod and map to Outcome=Detached (same semantic as
SIGINT-mid-watch). Customer sees the kubectl-logs reconnect hint
and exits 0. The submit was successful; the ingestion runs.

New test TestWatchJob_PodWaitTimeoutMapsToDetach pins the
contract via a tight outer ctx that triggers DeadlineExceeded
through the same code path.

## Medium — Defensive non-blocking errCh send in portforward.go

Bugbot's read: "unbuffered handoff can block the goroutine
indefinitely." False on the literal code (errCh is
make(chan error, 1), buffered) — but the spirit of the concern
is reasonable: a future refactor that adds a second send path,
or changes the buffer size, could re-introduce the leak.

Fix: keep the cap-1 buffer AND wrap the send in a non-blocking
select with default. Two layers of safety, so the goroutine
never blocks regardless of how the channel is used downstream.
Closes out the finding structurally rather than relying on the
code reader to spot that the buffer was already sized correctly.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submit): JobWatchTimeout-during-stream = Detached, not exit 9

Bugbot PR #10 r6 caught the inconsistency r5 introduced: the
PodReadyTimeout path was mapped to Detached (commit 0b02cd4),
but the SAME JobWatchTimeout constant capping the log-stream
phase still mapped to exit 9 ("ingestion failed") if it expired
during streaming. Both should mean the same thing: "the cluster
keeps running; the CLI just gave up observing."

Fix: accept context.DeadlineExceeded as well as context.Canceled
in the post-stream branch. If watchCtx (the 1-hour-capped one)
expired but customerCtx (the SIGINT-aware parent) didn't, that's
the JobWatchTimeout-hit case — map to Detached, same UX as the
SIGINT path. Customer sees the kubectl-logs reconnect hint and
exits 0.

The watchCtx-vs-customerCtx distinction (from r3) is what lets
us tell SIGINT from JobWatchTimeout at this branch point —
without it, the only way to detect timeout-vs-cancel would be
the inner error.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submit): per-reason detach message (Bugbot r7)

R5 + r6 expanded "Detached" to cover PodReadyTimeout and
JobWatchTimeout in addition to the original SIGINT case, but the
orchestrator's diagnostic still said "Detached on signal" for
all three. Customers who hit the 5-min Pod-ready cap or 1-hour
watch cap aren't pressing Ctrl-C; the framing was misleading.

Fix: WatchResult gains a DetachReason field (DetachReasonSignal,
DetachReasonPodWaitTimeout, DetachReasonWatchCap, or None).
Every detach branch in WatchJob sets the appropriate reason.
submit.Run's print-detach block now switches on the reason:

  Signal           → "Detached on signal."
  PodWaitTimeout   → "Detached: the ingestor Pod didn't reach
                      Ready within the observation window..."
  WatchCap         → "Detached: the watch window (1 hour)
                      elapsed while the ingestion was still
                      running."

Common reconnect hint (`kubectl logs -f`) follows in all cases.

The customerCtx-vs-watchCtx distinction (introduced in r3) makes
the reason classification trivially correct: customerCtx.Canceled
takes precedence over watchCtx.DeadlineExceeded if both fired, so
a customer hitting Ctrl-C exactly at the 1-hour cap still gets
the "on signal" message.

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ormula (#153) (#11)

* feat(release): GitHub Release workflow, install scripts, Homebrew formula (#153)

Closes the v0.1 epic (#147). Phase 5 is the distribution
infrastructure that makes the CLI installable without `go install`.

## What lands in this PR

### .github/workflows/release.yml

Triggered by `v*.*.*` tags (and workflow_dispatch for manual
re-runs). Two-job pipeline:

  - `release` (matrix): cross-compiles linux/{amd64,arm64},
    darwin/{amd64,arm64}, windows/amd64. Each binary gets a
    cosign keyless OIDC signature (.sig + .cert), per-platform
    SHA256, and a stamped version string via -ldflags.

  - `publish`: aggregates SHA256SUMS, stages the install scripts,
    creates the GitHub Release with auto-generated notes +
    every artifact attached. prerelease=true when the tag
    contains a hyphen (e.g. v0.1.0-rc1).

A third `bump-homebrew-tap` job is wired but gated `if: false`
until the tracebloc/homebrew-tap repo + HOMEBREW_TAP_TOKEN secret
are set up. The path is: flip the gate, secret is added,
formula gets updated automatically on each tag. RELEASE_CHECKLIST
documents the one-time setup.

### scripts/install.sh

POSIX-compatible (tested against dash, busybox sh, bash) — the
customer's distro might not have bash. Detects OS+arch,
resolves the latest tag via the /releases/latest redirect-trail
(no rate-limited API call), downloads binary + SHA256SUMS,
verifies the checksum, optionally verifies cosign signature if
cosign is on PATH, installs to /usr/local/bin (falls back to
~/.local/bin with PATH advice if not writable).

One-liner: `curl -fsSL https://github.com/tracebloc/cli/releases/latest/download/install.sh | sh`

### scripts/install.ps1

PowerShell 5.1+ (ships with Windows 10 21H1+). amd64 only for
v0.1; arm64 errors with a clear "file an issue" pointer.
Strict-mode + halt-on-error so half-installs don't happen.
Same SHA256 + cosign verification surface as install.sh.
Installs to $LOCALAPPDATA\Programs\tracebloc and PATH-adds at
user scope.

One-liner: `irm https://github.com/tracebloc/cli/releases/latest/download/install.ps1 | iex`

### scripts/homebrew-formula.rb.tmpl

Formula template the bump-homebrew-tap job renders per release.
Substitutes __VERSION__, __TAG__, and four per-platform
__SHA_*__ placeholders via sed (no Ruby in the runner needed).
Ships pre-built binaries rather than building from source — the
k8s.io transitive dep tree is too heavy to build on customer
laptops.

### scripts/RELEASE_CHECKLIST.md

Per-release on-call doc. Distinguishes "what's automated"
(everything in the workflow) from "what needs one-time setup"
(homebrew-tap repo, eventually install.tracebloc.io DNS) from
"what's manual per release" (the actual tag push + sanity-test
of each install path).

## Hosting choice

GitHub Release raw URLs (per user decision). Zero infrastructure
needed; the bootstrap URL works the day v0.1.0 ships. install.
tracebloc.io is a v0.2 follow-up via CNAME / Cloudflare worker.

## Out of this PR

- Creating the tracebloc/homebrew-tap repo (separate; documented
  in RELEASE_CHECKLIST.md)
- DNS for install.tracebloc.io (v0.2)
- Tagging v0.1.0 (post-merge, after the Phase 6 dogfood)

Local: vet, test -race -cover, gofmt -s, errcheck — all green.

After this merges, the v0.1 epic (#147) is mechanically complete
— `tag v0.1.0` produces the full release pipeline output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(install.sh): no sha256sum/shasum = hard error, not silent skip

Bugbot PR #11 round 1: the previous flow printed "✓ checksum
matches" unconditionally after the verify block, even when
neither sha256sum nor shasum was on PATH (the no-tool branch
set actual="" and the mismatch check guarded on -n "$actual",
so it was silently skipped). Two failures in one:

  1. Misleading log: claimed verification succeeded when it
     didn't run.
  2. Security regression: an attacker could MITM the binary and
     the customer would never notice on a tool-missing host.

Fix: refuse to install when no SHA256 tool is available. Error
message lists the per-distro install commands so the customer
knows the fix.

Trade-off: a "warn-but-continue" mode would be friendlier on
exotic hosts, but the whole point of this script is to verify a
binary the customer pulls off the internet. "Friendly" defaults
that skip the security check are exactly the install-script
anti-pattern that's bitten the industry repeatedly. Hard error
is the right call.

`sh -n` syntax check clean.

* fix(install): cosign-fail-installs (.ps1) + custom-prefix-ignored (.sh)

Two Bugbot PR #11 r2 findings, both real:

## Medium — install.ps1: failed cosign verify treated as "skip"

The previous flow's try-block contained BOTH the .sig/.cert
download AND the cosign verify + Write-Error. With
$ErrorActionPreference = 'Stop', Write-Error throws an exception
caught by the SAME catch that handled missing-sig downloads — so
a verified-failed binary went through the "couldn't download
.sig/.cert — release may pre-date signing" branch and continued
to install.

Fix: separate the two concerns. Download in a try/catch (failures
= "predates signing, skip"). Verify OUTSIDE the try via & cosign
(external process, doesn't interact with $ErrorActionPreference);
$LASTEXITCODE check + Write-Host + explicit exit 1 if non-zero.
A failed signature now correctly refuses to install.

## Medium — install.sh: custom --prefix silently ignored if missing

POSIX `test -w` returns false for nonexistent paths. The previous
flow used `[ ! -w "$PREFIX" ]` to decide whether to fall back to
~/.local/bin — but that meant a legitimate `--prefix /opt/tracebloc`
(a dir that doesn't exist yet) silently triggered the fallback,
overriding the customer's explicit choice.

Fix: try `mkdir -p "$PREFIX"` first; if THAT succeeds AND the
resulting dir is -w, use it. Only fall back if mkdir fails (no
write perms on parent) OR the existing dir is unwriteable. The
dominant default-prefix case (`/usr/local/bin` without sudo)
still routes to ~/.local/bin; custom prefixes get respected.

Local: `sh -n scripts/install.sh` clean; full Go test suite +
gofmt + errcheck green.

* fix(install.ps1): Fail helper for clean errors + null-PATH guard (Bugbot r3)

Two more PowerShell-specific Bugbot findings on PR #11:

## Medium — Write-Error + exit 1 = dead exit, ugly UX

With $ErrorActionPreference = 'Stop', every Write-Error call
throws a terminating error BEFORE any subsequent statement runs,
so the `exit 1` lines after them were dead code. The result: a
customer hitting an error saw PowerShell's full error record
(stack trace + script line numbers + ErrorCategoryInfo) instead
of a clean one-line message.

The cosign branch was fixed in r2 with Write-Host + explicit
exit. The earlier error paths (Get-Arch, Resolve-Tag, SHA256
mismatch, missing SHA256SUMS entry) still used the broken
Write-Error pattern.

Fix: add a Fail helper that does `Write-Host "Error: $msg" -ForegroundColor Red; exit 1` and replace every Write-Error call
site with `Fail "..."`. DRY + consistent UX.

## Medium-Security — Null user PATH = leading semicolon = PATH-injection

GetEnvironmentVariable('Path', 'User') returns $null on fresh
Windows installs (or accounts that never set a user-scope PATH).
The naive `"$userPath;$InstallPrefix"` interpolation then
produced `";C:\Users\...\Programs\tracebloc"` — a leading
semicolon = empty PATH entry, which Windows resolves as the
CURRENT WORKING DIRECTORY. That's a well-known PATH-injection
vector: any binary planted in the user's cwd runs ahead of real
PATH entries.

Fix: null-guard $userPath before concatenation. If $userPath is
falsy (null or empty), use just $InstallPrefix as the new value.
The `$existingEntries -notcontains` check now also handles the
null case correctly via the `if ($userPath) { ... } else { @() }`
fallback.

Local: go test green, gofmt + errcheck clean.

This is r3 on PR #11 — the install scripts have surfaced the
most findings because they're shell+PowerShell where bugbot's
language-specific coverage is sharpest. Each finding has been
real.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first real-cluster run of `dataset push` surfaced three issues that
the mock-only test suite couldn't catch. All three are now validated
end-to-end against a live tracebloc/client release (image_classification,
6/6 records, 100% success, rows confirmed in MySQL).

1. Staging collided with the ingestor's DEST_PATH (the blocker). The CLI
   staged source files into /data/shared/<table>, which is exactly the
   ingestor's DEST_PATH (data-ingestors config.DEST_PATH =
   STORAGE_PATH/TABLE_NAME). Its DuplicateValidator rejects a
   pre-existing, non-empty destination, so the CLI's own staging tripped
   the check on every run. StagedPrefix now stages under
   /data/shared/.tracebloc-staging/<table>, leaving the destination
   fresh. Added FinalDestPrefix for the real destination shown in
   pre-flight.

2. Image target_size had no override. image_classification defaults to
   512x512 and the ingestor VALIDATES the resolution (it does not
   resize), so any other size hard-failed the Image Resolution
   Validator. Added --target-size WxH, plus auto-detect from the first
   image when the flag is omitted, emitted as spec.file_options.target_size.

3. Watch treated a broken log stream as fatal. A Pod replaced/restarted/
   deleted mid-follow (e.g. a backoffLimit retry) returned exit 9 even
   when the Job ultimately succeeded. The Job's terminal status is now
   the source of truth: a non-ctx stream error is only fatal if the Job
   outcome can't be determined.

Tests: new push/detect_test.go (ParseTargetSize, DetectImageSize);
spec_test.go (staging != dest regression, target_size passes schema);
watch_test.go (Job status wins on stream break); stream_test.go updated
to derive paths from StagedPrefix. go build / vet / test all green.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…#13)

Extends `dataset push` from image_classification-only to also cover
tabular_classification, tabular_regression, time_series_forecasting,
and time_to_event_prediction. The Python ingestor already supports
these; this adds the CLI-side flag / layout / spec surface. Validated
end-to-end on a live cluster (tabular_classification, 8/8 records,
100%, rows confirmed in MySQL).

- Category dispatch (push/category.go): image vs tabular families,
  mirroring data-ingestors' conventions.py groupings.
- Tabular local layout (push/tabular.go): a single CSV (no sidecar
  files), staged via the existing machinery (CSV + empty image list).
- Schema: auto-inferred from the CSV (INT/FLOAT/VARCHAR) so customers
  don't hand-write one; --schema col:TYPE,... overrides. Reserved
  framework columns (id, data_id, ...) are skipped so a CSV carrying
  an id column doesn't produce a schema the ingestor rejects (the
  #135b guard).
- Label: string form for tabular_classification; object form with
  policy=bucket (default) for the regression-class categories so the
  raw numeric target never leaves the cluster. Added --label-policy
  and --time-column.
- Build() branches by category (push/spec.go); pre-flight is
  category-aware (data CSV + column count for tabular).

Tests: push/tabular_test.go (DiscoverTabular, InferSchema incl.
reserved-skip, ParseSchema); spec_test.go (tabular Build passes the
schema for all three label shapes, regression defaults to bucket);
updated the unsupported-category gate test. go build / vet / test green.

Stacked on cli#12 (the dataset-push live-ingestion fixes).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…#14)

Adds text_classification and masked_language_modeling, and generalizes
the staging machinery from images-only to arbitrary sidecar directories
plus extra root files — the shared piece the remaining families need.

- Generic sidecar staging (walk.go, stream.go): LocalLayout gains
  Sidecars (dir name -> files, staged under "<name>/") and ExtraFiles
  (dest -> src, staged at the table root), plus FileCount(). The tar
  writer packages them after images, sorted for determinism. Images
  stays for image_classification.
- text.go: DiscoverText for text_classification (labels.csv + texts/)
  and masked_language_modeling (labels.csv + sequences/ + a required
  tokenizer.json at root, staged as an ExtraFile). discoverSidecarFiles
  is a reusable walker (object detection / segmentation will reuse it).
- Build (spec.go): the text branch emits texts/ or sequences/ + a label
  (text_classification only; MLM has none).
- dataset.go: category dispatch + gate now accept the text family;
  pre-flight is text-aware (sidecar file count + tokenizer line).

Validated live: text_classification — staged labels.csv + 5 texts/ ->
ingestor Job 100% (5/5), rows confirmed in MySQL.

MLM note: code-complete + unit-tested + accepted by the current
data-ingestors schema/engine (proven by the e2e), but the *deployed*
ingdemo jobs-manager carries a stale embedded schema predating MLM, so
a live MLM push is rejected server-side (HTTP 400) until that image is
refreshed. Not a CLI issue — the CLI surfaces the 400 cleanly.

Tests: push/text_test.go (DiscoverText for both categories incl.
MLM-requires-tokenizer + missing-dir errors; text Build passes the
schema, MLM has no label); updated the unsupported-category gate test.
go build / vet / test green.

Stacked on cli#13 (tabular family) -> cli#12 (live-ingestion fixes).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…e-sync) (#15)

Adds the two remaining engine-supported image categories, taking the CLI
to 9/10 modalities (only semantic_segmentation remains, blocked on the
ingestor — data-ingestors#136).

- object_detection: reuses the generic sidecar walker for annotations/
  (.xml). Validated live end-to-end — 128 records (bounding boxes)
  ingested, rows confirmed in MySQL.
- keypoint_detection: labels.csv + images/ (keypoint coords live in the
  CSV's Annotation column, read server-side). Adds --number-of-keypoints
  (required; no default). Emits target_size + number_of_keypoints as
  TOP-LEVEL fields, which the schema's keypoint conditional requires.

- Re-synced the embedded schema from data-ingestors develop. The vendored
  copy was stale: it lacked keypoint's top-level target_size +
  number_of_keypoints and their required-for-keypoint conditional, so the
  CLI couldn't validate a keypoint spec at all. `ingest validate` and
  dataset push now validate keypoint correctly.

Schema-skew findings (deployment/release hygiene, NOT CLI bugs):
  * sync-schema.sh defaults to data-ingestors *master*, which is stale
    (lacks both MLM and keypoint); the current schema is on *develop*.
    Repoint the sync source to develop, or promote develop -> master.
    (sync --check vs master flags this drift — pre-existing, surfaced here.)
  * The deployed ingdemo client runs jobs-manager and the ingestor on
    DIFFERENT schema versions: the ingestor (newer) REQUIRES keypoint's
    top-level fields; jobs-manager (older) REJECTS them as additional
    properties. So keypoint can't be ingested there until both components
    are refreshed to a matching schema. The CLI's emission is correct
    against the current/consistent schema (unit-verified). OD is
    unaffected (no new fields).

Tests: push/image_extras_test.go (DiscoverObjectDetection +
missing-annotations); spec_test.go (OD emits annotations; keypoint emits
top-level target_size + number_of_keypoints; both pass the schema);
updated the unsupported-category gate test (now segmentation only).
go build / vet / test green.

Stacked on cli#14 (text) -> #13 (tabular) -> #12 (fixes).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
)

Lifts unit coverage on cheap-to-test, previously-0% functions that
don't need a cluster, and beefs up the CI smoke test to exercise the
real binary's schema validation.

- progress_test.go: NewProgress on a non-TTY returns the no-op sink;
  isTTY false cases.
- coverage_test.go (cli): printPushPreflight renders the key facts;
  exitError.Error/Code; runClusterInfo exits 3 on a bad kubeconfig.
- watcherr_test.go (submit): WatchError.Unwrap + IsWatchError
  classification (drives the exit-9 vs exit-8 mapping).
- CI smoke test now runs `ingest validate` against committed fixtures
  (valid -> exit 0; missing-images -> exit 2) + `dataset push --help`,
  so a schema/validation/wiring regression is caught on the built
  binary without a cluster.

Per-package coverage after: cli 61%, cluster 84%, push 84%, schema 81%,
submit 66%. The remaining gaps are the real-I/O seams
(SPDYExecutor.Exec, PortForwardJobsManager, NewClientset) — those need
a real cluster and are covered by the integration harness (separate PR).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
The unit suite mocks every cluster boundary, leaving the highest-risk
code — the SPDYExecutor tar-over-exec stream — at 0% coverage. Every
live bug this project hit (staging<->DEST collision, validator drift,
schema skew) was only findable by running it by hand. This adds an
automated integration harness against a real cluster.

- test/integration/ (//go:build integration):
  * Connectivity — cluster.Load + NewClientset against a live API server.
  * StageAndVerify — creates a throwaway namespace + PVC + stage pod,
    streams a dataset via the REAL push.SPDYExecutor tar-over-exec, then
    exec's back into the pod to confirm the files landed. Covers
    SPDYExecutor.Exec + StreamLayout + CreateStagePod /
    WaitForStagePodReady — the seams with zero unit coverage.
- Makefile: `make test-integration` (ambient kubeconfig; -count=1,
  build-tagged).
- .github/workflows/e2e.yml: boots kind + runs the suite. Gated to
  nightly / manual dispatch / `e2e`-labeled PRs (kind boot is too heavy
  to run per-PR).

Validated locally against a live k3d cluster: both tests pass; the
stream lands files on the PVC and the exec-back confirms content. The
files are build-tag-gated, so normal `go test ./...` and the CI test
job are unaffected. The kind CI job gets its first real run on the
nightly schedule / manual dispatch / an `e2e`-labeled PR.

Follow-ups (test/integration/README.md): a PortForwardJobsManager
fixture; a full `dataset push` through a real jobs-manager + ingestor
(needs the chart in-cluster) to catch cross-component schema skew e2e.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
README was stuck at Phase 0; updates Status (9/10 modalities), Customer-experience/install caveats, Go 1.26 floor, and the roadmap. Also fixes three user-facing help/output strings (cluster info 'coming in Phase 3'; dataset and dataset push --help referencing Phase 4 / PR-b / image-only) that contradicted the shipped binary. Docs only.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The tracebloc/homebrew-tap repo isn't set up yet and Homebrew is deferred past the v0.1 alpha. Removes the brew install line + softens the Status/roadmap mentions (framed as a deferred follow-up). GitHub Release + install.sh/ps1 paths remain. Docs only; bump-homebrew-tap was already if:false.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The root command's Long help still claimed the binary "implements only
version and completion" and pointed at the roadmap for future phases —
stale since v0.1 shipped the dataset/ingest/cluster commands. PR #18
refreshed the README and the three other user-facing --help strings but
missed this one (it scoped to cluster info / dataset / dataset push).
Describe the shipped command set instead.

Docs/strings only; no behavior change.

Closes #19

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@aptracebloc aptracebloc requested a review from saadqbal June 3, 2026 09:10
@aptracebloc aptracebloc self-assigned this Jun 3, 2026
@LukasWodka
Copy link
Copy Markdown
Contributor

👋 Heads-up — Code review queue is at 11 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Comment thread internal/push/detect.go
Comment thread Makefile
v0.1.0-alpha is published (pre-release). The README still said "first
release pending" / "no tagged release exists yet" and advertised
`curl https://install.tracebloc.io | sh` — but that vanity domain isn't
wired up, and GitHub's `latest` redirect skips pre-releases, so neither
the domain shortcut nor a `latest`-based one-liner can install the alpha.

- Status: "first release pending" -> v0.1.0-alpha published (pre-release).
- Install: replace the dead install.tracebloc.io commands with the
  working explicit-tag invocation + a link to the signed release assets;
  note the domain / `latest` / Homebrew all await the first stable tag.
- Roadmap: Phase 5 "awaiting first v* tag" -> alpha published.

Docs only; no behavior change.

Closes #23

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread internal/cli/dataset.go
1. target_size: emit [height, width] (was [width, height]) to match the ingest.v1 schema + ingestor; only affected non-square datasets. Added TestBuild_TargetSize_EmittedHeightWidth (non-square) regression test.

2. Makefile: make ci lint now runs the same standalone tools as CI (errcheck + ineffassign + misspell) instead of golangci-lint, restoring the local==CI invariant while golangci-lint-action stays disabled (#6); golangci-lint moved to make lint-full.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aptracebloc
Copy link
Copy Markdown
Contributor Author

@BugBot run.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 534d75b. Configure here.

Comment thread internal/cluster/kubeconfig.go
@aptracebloc aptracebloc merged commit 8346110 into main Jun 3, 2026
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants