Installer UX: drop PriorityClass, fix namespace, one-per-machine guard, surface version by LukasWodka · Pull Request #192 · tracebloc/client

LukasWodka · 2026-06-03T14:06:13Z

Installer-onboarding UX — five related changes that simplify the install and fix the "second deploy on one machine" experience.

1. Drop the data-plane PriorityClass (chart)

tracebloc-data-plane (cluster-scoped, fixed-name) was the only thing forcing one client per cluster — a second release collided with a cryptic Helm error — and it blocked multiple tracebloc namespaces in a BYO cluster. mysql doesn't need it (memory requests==limits, PVC durability, a PDB). Default create: false + name: ""; opt-in preserved. helm-unittest 144/144.

2. Fixed namespace + drop the workspace prompt (installer)

"Choose a workspace name" made users invent a non-identity label (the backend identifies a client by its credentials; this is just the local k8s namespace). It defaulted to default and was the field that collided on a second install. Dropped → TB_NAMESPACE defaults to tracebloc (env-overridable).

3. One-client-per-machine guard (bash + PowerShell)

After credentials verify, compare the entered Client ID against any client already installed here — scanning all namespaces (helm list -A / ConvertFrom-Json), so it catches both fresh tracebloc installs and ones an older installer left in another namespace. Same ID → normal upgrade; a different ID hard-blocks with options (repair / switch via k3d cluster delete / use another machine) instead of a silent client switch.

4. Surface the client version (installer)

Users had no easy way to see their client version (the CLI isn't shipped yet; helm list needs the namespace). Show the chart version where they already look: a Version line in the install summary, and the first line of --diagnose + the bundle header.

Why as a set

Dropping the PriorityClass removes the accidental one-per-cluster guard; the fixed namespace makes a second install an upgrade-in-place; the clientId guard makes one-per-machine explicit; the version line closes the discoverability gap that surfaced along the way.

Testing — all green

chart: helm lint (aks/eks/bm/oc) + template + helm-unittest 144/144
bash: bats (incl. the guard + version tests)
PowerShell: Pester on ubuntu + windows + PSScriptAnalyzer
end-to-end: real k3d install E2E (ubuntu 22.04/24.04/arm) + the 9-distro prereq matrix

Targets develop.

The cluster-scoped, fixed-name `tracebloc-data-plane` PriorityClass was the only thing forcing one tracebloc client per cluster (a second release collided on it with a cryptic Helm error) and blocking multiple tracebloc namespaces in one BYO cluster. mysql doesn't need it: memory requests==limits (last evicted under memory pressure), data on a PVC (eviction = transient restart, not data loss), and a PDB guards voluntary disruptions. Its only unique benefit was letting the scheduler preempt training jobs to keep mysql scheduled on a packed node — a narrow case. Default priorityClass.create=false + name="" so new installs template no PriorityClass and mysql carries no priorityClassName. Opt back in (create:true + name) on contended clusters, or reference an out-of-band one (create:false + name:<existing>). helm-unittest updated; 144/144 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ace prompt The "Choose a workspace name" prompt asked the user to invent a label that isn't their identity (the backend identifies a client by its credentials, not this string — it's just the local k8s namespace / Helm release name; the installer even discards the auth response body). It defaulted to a meaningless "default" and was the field that collided on a second install. - Drop the prompt; TB_NAMESPACE defaults to a fixed "tracebloc" (env-overridable for advanced/GitOps setups). - One-client-per-machine guard: after credentials verify, compare the entered Client ID against any client already installed here (helm get values). Same ID = a normal re-run/upgrade; a DIFFERENT ID hard-blocks with an explanation and options (repair / switch via `k3d cluster delete` / use another machine) instead of silently re-pointing the machine. This replaces the accidental PriorityClass collision (now dropped) with an intentional, explained guard. - Document the TB_NAMESPACE override; update bats (input sequences + 2 new guard tests). bats 26/27 — the 1 failure is a pre-existing macOS-bash-3.2 quirk in _extract_yaml_value, unrelated (CI bash 5 passes it). NOTE: install-k8s.ps1 + its Pester tests still need the same mirror (follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-03T14:07:14Z

👋 Heads-up — Code review queue is at 12 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
cli#32 — feat(Update Helm chart release workflow to use 'client' naming convention #26): installer-style terminal UX package (internal/ui) · author: @aptracebloc · no reviewer assigned
client#191 — fix(image-refresh: surface persistent ingestor (ghcr) digest-resolve failures instead of silently skipping #190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve · author: @saadqbal · no reviewer assigned
client-runtime#76 — docs(chore: real merge commit to unblock PR #57 #59): PodTokenReconciler design — Option A decided, loopholes closed · author: @saadqbal · no reviewer assigned
data-ingestors#146 — test(schema): lock category enum to the engine's TaskCategory (anti-drift) · author: @LukasWodka · reviewer: @saadqbal
data-ingestors#148 — Develop to prod · author: @divyasinghds · reviewer: @saadqbal
design-system#19 — fix: un-track coverage/ and node_modules/ from git · author: @LukasWodka · no reviewer assigned
design-system#22 — ci: add Vitest test workflow · author: @LukasWodka · reviewer: @aptracebloc
design-system#23 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
frontend-app#499 — ci: add Vitest + Cypress test workflow · author: @LukasWodka · reviewer: @aptracebloc

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…owerShell) Mirrors the bash change in install-k8s.ps1: - drop the "Choose a workspace name" prompt; TB_NAMESPACE defaults to a fixed "tracebloc" (override via $env:TB_NAMESPACE). - one-client-per-machine guard: after credentials verify, compare the entered Client ID against any client already installed here (helm get values); a different ID hard-blocks with the same explanation/options as bash. - Pester: 2 new guard tests (block-different / allow-same). The existing Install-ClientHelm tests use dispatch-by-prompt Read-Host mocks, so the prompt removal doesn't disturb them. No pwsh locally -> verified via CI (Pester ubuntu+windows + PSScriptAnalyzer). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-03T14:27:21Z

👋 Heads-up — Code review queue is at 12 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
client#191 — fix(image-refresh: surface persistent ingestor (ghcr) digest-resolve failures instead of silently skipping #190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve · author: @saadqbal · no reviewer assigned
client-runtime#76 — docs(chore: real merge commit to unblock PR #57 #59): PodTokenReconciler design — Option A decided, loopholes closed · author: @saadqbal · no reviewer assigned
data-ingestors#146 — test(schema): lock category enum to the engine's TaskCategory (anti-drift) · author: @LukasWodka · reviewer: @saadqbal
data-ingestors#148 — Develop to prod · author: @divyasinghds · reviewer: @saadqbal
design-system#19 — fix: un-track coverage/ and node_modules/ from git · author: @LukasWodka · no reviewer assigned
design-system#22 — ci: add Vitest test workflow · author: @LukasWodka · reviewer: @aptracebloc
design-system#23 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
frontend-app#499 — ci: add Vitest + Cypress test workflow · author: @LukasWodka · reviewer: @aptracebloc
model-zoo#77 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

The guard checked only the `tracebloc` namespace, so a client installed by an older installer version (default namespace `default`, or a custom name) wasn't detected -- a re-run could create a second coexisting client. Now enumerate all client-chart releases (helm list -A) and compare each one's clientId, covering both fresh and migrated installs. bash uses jq (already a dependency; falls back to the tracebloc namespace if absent); PowerShell uses ConvertFrom-Json. The block message names the namespace. bats + Pester guard tests updated. Verified: bats green locally (jq path); PowerShell via CI Pester. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Users had no easy way to see which client version they're on (the CLI isn't shipped yet; `helm list` needs the namespace, and nothing surfaced it). Show the chart version where they already look: - install summary: a "Version" line next to Workspace. - --diagnose: as the first console line + recorded in the bundle header (the #1 thing support needs). Adds a best-effort `_chart_version` / Get-ChartVersion helper (greps helm's CHART column -> no jq). bash + PowerShell; bats + Pester coverage added. Verified: summary.bats + diagnose.bats green locally; ps1 via CI Pester. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(resource-monitor): always grant read-only ClusterRole (decouple from clusterScope) Under clusterScope: false the chart rendered only a namespace-scoped Role in the release namespace. But the resource-monitor's code: * calls core_v1_api.list_pod_for_all_namespaces(field_selector=spec.nodeName=...) -- a CLUSTER-SCOPED list verb a namespaced Role can never satisfy; and * read_namespaced_pod()s its OWN pod, which lives in .Values.nodeAgents.namespace.name (NOT .Release.Namespace). So with clusterScope: false the DaemonSet 403'd on startup and crashlooped (70+ restarts observed on a live cluster). Per-node monitoring is intrinsically cluster-scoped. Always render the read-only ClusterRole + ClusterRoleBinding regardless of clusterScope (get/list/watch on pods/nodes/namespaces + metrics; no write, exec, or secret access). resourceMonitor: false still fully disables the component. clusterScope continues to gate the training/jobs isolation footprint elsewhere -- it must not leave the node monitor without permissions it cannot run without. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(resource-monitor): assert always-cluster-scoped RBAC under clusterScope=false Follow-up to the RBAC fix: node_agents_namespace_test.yaml still asserted the old behavior (namespaced Role + RoleBinding in the release namespace when clusterScope=false). Update that case to assert the corrected contract -- a ClusterRole + ClusterRoleBinding always render (with no metadata.namespace), while the subject SA still lives in the node-agents namespace. The clusterScope=false path stays under test; only the asserted behavior changes to match the fix. Verified with `helm unittest` (all resource-monitor suites pass). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(rbac): grant `get` on configmaps/secrets to jobs-manager SA The ingestion endpoint's orphan-resource verify path (client-runtime#52) and missing-row self-heal (client-runtime#54) read the existing ConfigMap/Secret on a create-409 to confirm content matches before reuse. The Role/ClusterRole only granted `create`, so those reads returned Forbidden and the endpoint 500'd instead of the intended 409/200-replay — verified live on the dev cluster. Add `get` alongside `create` in both the ClusterRole (clusterScope: true) and namespace Role (clusterScope: false) branches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(helm): guard that the pinned ingestor digest is multi-arch (closes #186) (#187) Add a helm-ci job (ingestor-multiarch) that parses images.ingestor.digest and fails the build unless it's a multi-arch index (linux/amd64 + linux/arm64). Greenfield installs spawn the ingestor Job from this PINNED digest before image-refresh first ticks, so an amd64-only pin breaks data ingestion on arm64 hosts (Apple Silicon, Graviton) with "no match for platform" / ImagePullBackOff. This would have caught #160 (the amd64-only v0.3.1 pin) before it shipped. ghcr.io/tracebloc/ingestor is public -> no secrets. Verified: passes on the current multi-arch baseline (sha256:d361fa77, v0.3.2 / #184), fails on the old amd64-only sha256:a0861ea9. Note: the digest is already multi-arch on develop as of v0.3.2 (#184 — the same d361fa77 index this PR previously bumped to), so #187 no longer touches values.yaml; it adds only the regression guard so an amd64-only pin can't slip back in. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * fix(#190): fail image-refresh loudly when the ingestor (ghcr) digest can't resolve (#191) image-refresh silently skipped every tick when get_latest_digest returned empty for the ghcr.io ingestor image (egress/proxy/firewall to ghcr.io, or a blocked token endpoint) — never reaching the registry-drift branch that sets the new digest. jobs-manager + pods-monitor pull from docker.io and refreshed fine, so the CronJob looked healthy while the ingestor digest stayed pinned on the install-time baseline. That's why the berlin-team arm64 install sat on the amd64-only v0.3.1 digest even after :0.3 went multi-arch (#186 follow-up #2). Now count consecutive ingestor-resolve failures on a deployment annotation: - below imageRefresh.ingestorResolveFailureThreshold (default 3, ~45 min at the 15-min schedule) -> WARN + skip, as before (tolerate transient blips); - at/above it -> ERROR with actionable guidance, a tracebloc.io/ingestor-refresh-last-error annotation, and a non-zero exit so the Job fails visibly in `kubectl get cronjob` / monitoring — the same surfacing idiom Pass 2's stuck-rollout check already relies on; - a successful resolve clears the streak. Threshold is nil-guarded (default 3) for --reuse-values upgrades and schema-validated (integer >= 1). The digest-resolution logic itself is unchanged (verified correct: it returns the multi-arch index digest). helm unittest 146/146, helm lint clean, shellcheck + sh -n clean. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * test(requests-proxy): add helm-unittest coverage for requests-proxy Deployment (#194) requests-proxy-deployment.yaml was the only data-plane workload template without a unit test. This suite pins the properties most costly to regress: - security-context invariants (no SA-token automount, runAsNonRoot, seccomp RuntimeDefault, runAsUser 1001, no privilege escalation, drop ALL caps, read-only root filesystem) — see docs/SECURITY.md - the single-replica / single gunicorn worker constraint (the pod token registry is process-local; >1 worker silently shards token lookups) - the docker.io/tracebloc/jobs-manager image source and port 8888 - the nil-guarded resource defaults, plus an override case that exercises the default-through-dict fallthrough (guards the historic `readOnlyRootFilesystem: trueresources:` newline-eating regression) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * fix(#196): allow training-pod egress to the requests-proxy (8888) (#197) The training-egress NetworkPolicy denies all pod-to-pod / ClusterIP egress (rule 2 excepts the cluster CIDRs) and re-permits only MySQL (rule 3). When the requests-proxy architecture shipped — training pods POST epoch results / FLOPs to requests-proxy-service:8888 instead of holding Service Bus credentials — this template was never updated to re-permit egress to the proxy. Result on every install with the policy enabled: pods hit "requests-proxy-service:8888 ... [Errno 111] Connection refused" at the first epoch finalize → CrashLoopBackOff → all experiments fail. Add rule 4 mirroring the MySQL rule: TCP/8888 to podSelector app=requests-proxy (same namespace). Service selector + port from templates/requests-proxy-service.yaml. Verified: `helm template -f ci/bm-values.yaml --show-only templates/network-policy-training.yaml` renders the new rule as valid YAML. Found live on a fresh client (tracebloc-amazon / k3d): jobs-manager reached the proxy (HTTP 401) while training pods got connection-refused — the only differentiator was this egress policy. Interim: live-patched the cluster + suspended its auto-upgrade CronJob (so reuse-values wouldn't revert the patch); re-enable once this lands + releases. Closes #196. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Installer UX: drop PriorityClass, fix namespace, one-per-machine guard, surface version (#192) * fix(chart): drop the data-plane PriorityClass by default The cluster-scoped, fixed-name `tracebloc-data-plane` PriorityClass was the only thing forcing one tracebloc client per cluster (a second release collided on it with a cryptic Helm error) and blocking multiple tracebloc namespaces in one BYO cluster. mysql doesn't need it: memory requests==limits (last evicted under memory pressure), data on a PVC (eviction = transient restart, not data loss), and a PDB guards voluntary disruptions. Its only unique benefit was letting the scheduler preempt training jobs to keep mysql scheduled on a packed node — a narrow case. Default priorityClass.create=false + name="" so new installs template no PriorityClass and mysql carries no priorityClassName. Opt back in (create:true + name) on contended clusters, or reference an out-of-band one (create:false + name:<existing>). helm-unittest updated; 144/144 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): fixed namespace + one-per-machine guard; drop workspace prompt The "Choose a workspace name" prompt asked the user to invent a label that isn't their identity (the backend identifies a client by its credentials, not this string — it's just the local k8s namespace / Helm release name; the installer even discards the auth response body). It defaulted to a meaningless "default" and was the field that collided on a second install. - Drop the prompt; TB_NAMESPACE defaults to a fixed "tracebloc" (env-overridable for advanced/GitOps setups). - One-client-per-machine guard: after credentials verify, compare the entered Client ID against any client already installed here (helm get values). Same ID = a normal re-run/upgrade; a DIFFERENT ID hard-blocks with an explanation and options (repair / switch via `k3d cluster delete` / use another machine) instead of silently re-pointing the machine. This replaces the accidental PriorityClass collision (now dropped) with an intentional, explained guard. - Document the TB_NAMESPACE override; update bats (input sequences + 2 new guard tests). bats 26/27 — the 1 failure is a pre-existing macOS-bash-3.2 quirk in _extract_yaml_value, unrelated (CI bash 5 passes it). NOTE: install-k8s.ps1 + its Pester tests still need the same mirror (follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer-ps): mirror fixed namespace + one-per-machine guard (PowerShell) Mirrors the bash change in install-k8s.ps1: - drop the "Choose a workspace name" prompt; TB_NAMESPACE defaults to a fixed "tracebloc" (override via $env:TB_NAMESPACE). - one-client-per-machine guard: after credentials verify, compare the entered Client ID against any client already installed here (helm get values); a different ID hard-blocks with the same explanation/options as bash. - Pester: 2 new guard tests (block-different / allow-same). The existing Install-ClientHelm tests use dispatch-by-prompt Read-Host mocks, so the prompt removal doesn't disturb them. No pwsh locally -> verified via CI (Pester ubuntu+windows + PSScriptAnalyzer). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): scan all namespaces in the one-per-machine guard The guard checked only the `tracebloc` namespace, so a client installed by an older installer version (default namespace `default`, or a custom name) wasn't detected -- a re-run could create a second coexisting client. Now enumerate all client-chart releases (helm list -A) and compare each one's clientId, covering both fresh and migrated installs. bash uses jq (already a dependency; falls back to the tracebloc namespace if absent); PowerShell uses ConvertFrom-Json. The block message names the namespace. bats + Pester guard tests updated. Verified: bats green locally (jq path); PowerShell via CI Pester. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): show client (chart) version in summary + --diagnose Users had no easy way to see which client version they're on (the CLI isn't shipped yet; `helm list` needs the namespace, and nothing surfaced it). Show the chart version where they already look: - install summary: a "Version" line next to Workspace. - --diagnose: as the first console line + recorded in the bundle header (the #1 thing support needs). Adds a best-effort `_chart_version` / Get-ChartVersion helper (greps helm's CHART column -> no jq). bash + PowerShell; bats + Pester coverage added. Verified: summary.bats + diagnose.bats green locally; ps1 via CI Pester. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * chore: bump chart 1.4.4 → 1.4.5 to ship the training-egress proxy fix (#198) 1.4.4 is already published on the tracebloc.github.io/client Pages channel and is what clusters run. The training-egress NetworkPolicy fix (#197, allow training → requests-proxy:8888) merged to develop without a version bump, so it is currently undeliverable: chart-releaser won't overwrite the existing 1.4.4 release, and clusters already on 1.4.4 would see no version change and pull nothing. Bump to 1.4.5 (lockstep version/appVersion, matching 1.4.3/1.4.4 history) so a v1.4.5 release publishes a new version that auto-upgrade actually pulls. Chart-only change; no image change. Ref #196 / #197. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: shujaat hasan <shujaathasan@shujaats-MacBook-Pro.local> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: shujaat_tracebloc <153823837+shujaatTracebloc@users.noreply.github.com> Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>

LukasWodka and others added 2 commits June 3, 2026 16:05

LukasWodka assigned saadqbal Jun 3, 2026

LukasWodka requested a review from saadqbal June 3, 2026 14:06

LukasWodka marked this pull request as ready for review June 3, 2026 14:26

LukasWodka mentioned this pull request Jun 3, 2026

feat(kanban): nightly drift reconciler + per-repo .kanban.yml override tracebloc/.github#55

Merged

4 tasks

saadqbal previously approved these changes Jun 3, 2026

View reviewed changes

LukasWodka mentioned this pull request Jun 3, 2026

feat(#27): installer-style ingestion summary screen tracebloc/cli#33

Merged

LukasWodka dismissed saadqbal’s stale review via d1d44dd June 3, 2026 15:24

LukasWodka changed the title ~~Simplify install: drop data-plane PriorityClass + fixed namespace + one-per-machine guard~~ Installer UX: drop PriorityClass, fix namespace, one-per-machine guard, surface version Jun 3, 2026

saadqbal approved these changes Jun 4, 2026

View reviewed changes

saadqbal merged commit 36b78e4 into develop Jun 4, 2026
28 checks passed

This was referenced Jun 4, 2026

Release v1.4.5: promote develop → main #199

Merged

Windows installer: one-client-per-machine guard misses unquoted/single-quoted clientId #200

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Installer UX: drop PriorityClass, fix namespace, one-per-machine guard, surface version#192

Installer UX: drop PriorityClass, fix namespace, one-per-machine guard, surface version#192
saadqbal merged 5 commits into
developfrom
feat/fixed-namespace-drop-priorityclass

LukasWodka commented Jun 3, 2026 •

edited

Loading

Uh oh!

LukasWodka commented Jun 3, 2026

Uh oh!

LukasWodka commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LukasWodka commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Drop the data-plane PriorityClass (chart)

2. Fixed namespace + drop the workspace prompt (installer)

3. One-client-per-machine guard (bash + PowerShell)

4. Surface the client version (installer)

Why as a set

Testing — all green

Uh oh!

LukasWodka commented Jun 3, 2026

Uh oh!

LukasWodka commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LukasWodka commented Jun 3, 2026 •

edited

Loading