feat(installer): --diagnose support bundle (redacted) by LukasWodka · Pull Request #175 · tracebloc/client

LukasWodka · 2026-06-01T19:27:03Z

🚫 STACKED ON #171 + #172 + #173 + #174 — DO NOT MERGE BEFORE THEM. Branched off fix/reboot-persistence (#174). Until the stack merges, this PR's range includes their commits — review only the latest commit / the 7 files under "What changed". Flips draft → ready as the stack lands.

Summary

A one-command, redacted support bundle. When a customer hits an install/runtime problem, instead of a multi-round log-gathering email thread (the Charité thread is the archetype) they run one command and send us a single file. bash <(curl … i.sh) --diagnose (or install-k8s.ps1 -Diagnose) writes ~/.tracebloc/tracebloc-diagnose-<timestamp>.tgz. Installer-only — no chart changes.

Type of change

Feature

What changed

scripts/lib/diagnose.sh (new) — run_diagnose(): host/versions, docker/k3d, kubectl overview + describe of non-Running pods, workload logs (namespace auto-discovered), helm, install log + values.yaml, proxy env → redact every file → tar. _redact_file() does the redaction. Runs under set +e (best-effort) and short-circuits before any install work, so it works when the install is broken.
scripts/install-k8s.sh — sources it + the --diagnose branch at the top of main() (clears the EXIT trap).
scripts/install.sh — adds diagnose.sh to the bootstrap download manifest (verified by the sourced-libs cross-check + a live bootstrap run).
scripts/install-k8s.ps1 — -Diagnose + Invoke-DiagnoseBundle + Edit-Redaction (Windows mirror; Compress-Archive).
scripts/lib/common.sh — documents --diagnose in --help.

Security — redaction (the crux; this file goes to support)

clientPassword, proxy credentials (user:pass@host), and password=/token/secret values are redacted from every file before archiving. clientId is kept (the identifier support needs, not a secret). The bats/Pester redaction tests are the gate.

Test plan

scripts/tests/diagnose.bats (7, incl. the end-to-end redaction gate) + Pester Edit-Redaction + Invoke-DiagnoseBundle (extract-and-grep) — green (bats 114, Pester 51/0/3-skip).
Real --diagnose on a Linux VM: produced a 16-file bundle; the seeded dev password + auth-proxy creds had 0 occurrences in the archive (clientId kept, [REDACTED] markers present); also runs cleanly with no cluster.
Real curl-bootstrap: bash <(curl … install.sh@branch) --diagnose fetched diagnose.sh, ran it, produced the bundle, 0 leaks.
⚠️ PowerShell runtime not validated on Windows (logic Pester-covered) — Windows reviewer confirms Invoke-DiagnoseBundle live.

Checklist

Tests added / updated and passing locally
Docs updated (--diagnose in --help)
No secrets / credentials in the diff (tests use synthetic secrets; redaction verified)
Windows reviewer for the ps1 runtime

LukasWodka · 2026-06-01T19:27:49Z

👋 Heads-up — Code review queue is at 16 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
client#171 — Installer: verify readiness & credentials before reporting success; RHEL-family support · author: @LukasWodka · no reviewer assigned
client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
data-ingestors#132 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
data-ingestors#133 — docs: fix declarative-ingest path/column drift (issue feat(#129): parent client chart owns the shared ingestor ServiceAccount (1.3.4) #131 A-series) · author: @divyasinghds · no reviewer assigned
design-system#19 — fix: un-track coverage/ and node_modules/ from git · author: @LukasWodka · no reviewer assigned
design-system#22 — ci: add Vitest test workflow · author: @LukasWodka · reviewer: @aptracebloc
design-system#23 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
docs#46 — docs: make declarative-ingest staging self-contained (data-ingestors#131 B/C) · author: @divyasinghds · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

LukasWodka · 2026-06-01T20:04:06Z

Pre-merge security review (covered the whole stack #171–#175). Proxy temp-file handling, perms (0600 under umask 077 + cleaned up), injection, and the other PRs came back clean. Two redaction gaps in the support bundle were found and fixed in 4b70123:

Redaction missed non-clientPassword keys — only clientPassword:/password= were matched, so dockerRegistry.password (registry token) and HTTP_PROXY_PASSWORD survived into the bundle. Broadened _redact_file (bash) + Edit-Redaction (ps1) to redact any *password key, case-insensitive, : or = form.
helm get manifest shipped base64 Secrets (bash only) — rendered Secret objects carry base64 CLIENT_PASSWORD + .dockerconfigjson that text redaction can't see. Dropped the manifest collection.

Regression tests added; re-verified end-to-end (real --diagnose): clientPassword, registry token, HTTP_PROXY_PASSWORD, and proxy URL creds all have 0 occurrences in the archive. bats 119 / Pester 52 green.

LukasWodka · 2026-06-01T20:22:06Z

Pre-merge correctness review (high effort, whole stack) — three real bugs fixed in 51dede6 (their failure paths weren't exercised by the passing tests):

cluster.sh — k3d "${K3D_ARGS[@]}" was a bare command under set -e, so a k3d-create failure aborted the script before the 'already exists' reuse, the error dump, and the proxy temp-dir cleanup could run. Now && create_rc=0 || create_rc=$?. Proven under set -e.
summary.sh — CLIENT_STATE was defaulted to starting at source time, so install_cleanup's [[ -z … ]] guard was always false → the 'did not complete / check the log / safe to re-run' hint never printed on early failures. Defaulted empty.
preflight.sh — with curl absent (direct ./install-k8s.sh on a minimal VM, before deps install curl), connectivity probes returned nocurl and hard-failed with a misleading 'egress blocked'. Now skips with a warning.

Plus a defensive switch ("$env:CLIENT_ENV") in Get-BackendUrl (refuted as a live bug on pwsh 7, but cheap insurance for the unvalidated 5.1 path). Regression test added; bats 120 / Pester 52 green. Note: these touch files owned by #171/#172/#173 but land here on the tip — they ride along when the stack merges in order.

Adds a one-command support bundle so a customer hitting an install/runtime problem can send a single file instead of a multi-round log-gathering email thread (the Charité thread is the archetype). `bash <(curl ... i.sh) --diagnose` (or `install-k8s.ps1 -Diagnose`) collects logs + cluster/host status into ~/.tracebloc/tracebloc-diagnose-<ts>.tgz. Two guarantees: - Best-effort: the whole collection runs under `set +e` and short-circuits before any install work, so it works even when the install is broken. - Credential-safe: clientPassword, proxy credentials (user:pass@host), and password=/token/secret values are REDACTED from every file before archiving; clientId is kept (it's the identifier support needs, not a secret). scripts/lib/diagnose.sh -- run_diagnose() (host/versions, docker/k3d, kubectl overview + describe of non-Running pods, workload logs with namespace auto-discovery, helm, install log + values.yaml, proxy env) + _redact_file(). install-k8s.sh sources it + adds the --diagnose short-circuit (clears the EXIT trap so the post-install message doesn't fire); install.sh adds it to the bootstrap download manifest. install-k8s.ps1 mirrors with -Diagnose / Invoke-DiagnoseBundle / Edit-Redaction. Documented in --help. Tests: diagnose.bats (7, incl. the end-to-end redaction gate) + Pester (+3). Verified on a Linux VM: the real --diagnose flag produced a 16-file bundle and the seeded dev password + proxy credentials had ZERO occurrences in the archive (clientId kept); also works with no cluster present. Stacked on #171 + #172 + #173 + #174. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…eal preflight probes) Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's ~96%) because the new code added integration-only branches the mocked unit suites skipped. Recover the unit-testable portion: - diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true + mocked tools) -> diagnose.sh 64% -> 90%. - preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping, the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%. Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is integration-only (real k3d/docker create + macOS/Windows-specific branches + MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy, preflight blocked-egress/arm64, diagnose redaction grep). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A pre-merge security review of the support bundle found two ways credentials could land in the (supposedly redacted) archive the customer sends to support: 1. Redaction only matched `clientPassword:` and `password=` -- it missed any other *password key in colon form, so `dockerRegistry.password` (a registry token) and `HTTP_PROXY_PASSWORD` survived. Broadened _redact_file (bash) and Edit-Redaction (ps1) to redact ANY *password key, case-insensitive, in : or = form (portable explicit char classes -- BSD sed has no I flag). 2. (bash only) The bundle collected `helm get manifest`, which renders the k8s Secret objects with base64-encoded CLIENT_PASSWORD + .dockerconfigjson that text redaction can't see. Dropped the manifest collection (helm get values + kubectl output already cover triage). Regression tests added (diagnose.bats + Pester). Re-verified end-to-end on a Linux VM with the real --diagnose flag: clientPassword, the dockerRegistry token, HTTP_PROXY_PASSWORD, and proxy URL creds all have ZERO occurrences in the archive; clientId kept; manifest no longer collected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A pre-merge correctness review (high effort) over the stack found three real bugs — the failure paths weren't exercised by the passing tests: 1. cluster.sh: `k3d "${K3D_ARGS[@]}" > ...` was a bare command under `set -e`, so a k3d-create FAILURE aborted the script immediately, skipping the 'already exists' graceful reuse, the error dump, AND the proxy temp-dir cleanup. Capture rc set-e-safely (`&& create_rc=0 || create_rc=$?`). Proven under set -e: both the error-dump and reuse paths now run. 2. summary.sh: CLIENT_STATE was defaulted to "starting" at source time, so install_cleanup's `[[ -z "$CLIENT_STATE" ]]` guard was always false and the "did not complete / check the log / safe to re-run" hint never printed on an early failure (preflight / docker / cluster / helm). Default it empty; the readiness gate sets the real state. 3. preflight.sh: with curl absent (direct ./install-k8s.sh on a minimal VM, before install_system_deps adds curl), the connectivity probes returned 'nocurl' and hard-failed with a misleading "egress blocked". Skip the check with a warning when curl isn't present yet. Also defensively quote `switch ("$env:CLIENT_ENV")` in Get-BackendUrl so the prod default fires regardless of PowerShell version (refuted as a live bug on pwsh 7 -- default does fire -- but cheap insurance for the unvalidated 5.1 path). Tests: +nocurl-skip test, fixed an over-blunt has() mock; bats 120 / Pester 52 green. (#1's set -e abort can't be caught by bats -- no set -e there -- so it was verified manually under set -e.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nosing it as a group issue) Asad's AlmaLinux 9 / EC2 test: docker-ce installed fine but `dockerd` crashed on startup (exit 1), systemd throttled it ("Start request repeated too quickly"), and the installer then printed "Could not connect to Docker -- try logging out and back in" -- the GROUP-not-active hint, which is wrong for a dead daemon and sent him in circles (logout/login didn't help). The throttle also means a bare re-run can't recover. install_docker_engine now: - `systemctl enable docker` WITHOUT `--now` (a start failure no longer hard-aborts the script under `set -e` at that line); - `systemctl reset-failed docker` before starting, so a throttled/failed unit from a prior attempt can be retried (a plain re-run now recovers); - when `docker info` fails AND the daemon isn't active (vs. the group-not-active case, which is still re-exec'd via `sg docker`), surface Docker's OWN error (systemctl status + the journalctl error lines) with likely RHEL/AlmaLinux causes, instead of the misleading group hint. Test: setup-linux.bats daemon-won't-start case; bats 121 green. NOTE: this fixes the installer's HANDLING. Asad's root cause (why dockerd exits 1 on that box) is still masked by the systemd throttle and is being chased separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…d on minimal RHEL/AlmaLinux) Root cause of Asad's AlmaLinux 9 / EC2 failure: dockerd died on startup with "failed to register bridge driver: iptables ... addrtype ... missing kernel module". Minimal RHEL/AlmaLinux cloud images (incl. AWS) ship kernel-modules-core but NOT the full kernel-modules package, so xt_addrtype (+ br_netfilter, overlay) aren't available and Docker can't program its bridge NAT rules. New _ensure_kernel_modules() (setup-linux.sh), called before starting Docker: modprobe overlay / br_netfilter / xt_addrtype / iptable_nat / ip_tables; on RHEL-family, if a load fails, `dnf install kernel-modules-$(uname -r)` and retry; persist to /etc/modules-load.d for reboots. Best-effort + idempotent (verified clean on a healthy Ubuntu box). Also sharpened the daemon-won't-start hint to point at the kernel-modules remedy when the error mentions addrtype/missing module. This is the hospital-VM profile (minimal RHEL/Alma), so it's a real install-side fix, not just error handling. Test: setup-linux.bats _ensure_kernel_modules; bats 122. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tic gate Adds .github/workflows/installer-tests.yaml to validate the installer across the breadth of environments customers actually run — not just Ubuntu-amd64: • static — shellcheck (clean at --severity=warning) + bash -n + PSScriptAnalyzer • unit-bash — bats (mocked), 124 tests • unit-pester — Pester on Linux pwsh AND real windows-latest (the .ps1's true target) • distro-prereqs — NEW: runs the REAL Linux prereq path (PM detect, system deps, Docker branch, kernel modules, kubectl/k3d/helm) in a fresh container per distro family: ubuntu 22.04/24.04, debian 12, almalinux 9/8, rockylinux 9, amazonlinux 2023, fedora, opensuse leap. The matrix paid for itself before it even shipped: validating it locally against real distro containers surfaced a genuine gap — minimal Amazon Linux 2023 ships no openssl/tar, so helm's get-helm-3 fails ("openssl must first be installed"). Fixed in install_system_deps (ensure openssl + tar; package names are uniform across apt/dnf/yum/zypper/pacman), with bats coverage. All 9 validated distro branches now install every prerequisite. Installer-test jobs moved out of helm-ci.yaml into their own workflow (no more duplicate runs; helm-ci no longer triggers on scripts/** changes). Arch omitted (x86-only image + bare-container keyring friction; pacman branch covered by bats). Real k3d cluster-up (e2e) intentionally deferred — needs a stubbed backend; tracked. Validated locally via mac Docker: ubuntu:22.04, almalinux:9, amazonlinux:2023, opensuse/leap:15.6 → all PASS; bats 124 green; shellcheck 0 findings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@test

…Config test The new workflow's first run caught two issues — exactly its job: 1. Static analysis failed: `bash -n` ran on .bats files, which are bats DSL (@test "name" { … }), not valid bash. Restrict the syntax check to *.sh; .bats are validated by being run in the unit-bash job. 2. Pester on real windows-latest failed 1/55: the Confirm-Config test set $env:USERPROFILE = $env:HOME, but $env:HOME is empty on Windows, so [System.IO.Path]::GetFullPath("") threw "path is empty". The INSTALLER is correct (defaults to $env:USERPROFILE, always set on Windows) — the test fixture was Linux-centric. Derive a profile dir valid on both OSes. For the record, the first run's wins: all 9 distro prereq jobs (ubuntu 22.04/ 24.04, debian 12, almalinux 8/9, rockylinux 9, amazonlinux 2023, fedora, opensuse leap) + bats + Linux Pester passed on GHA's amd64 runners. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…alse positives) The libs are sourced together as one program, so single-file shellcheck reports SC2034 "unused" for shared vars defined in common.sh and consumed in other sourced files (CURL_SECURE, ARCH_DL, colours…). Gate at --severity=error (0 findings); warnings still printed as advisory. With this, Static analysis joins the already-green distro matrix + Windows/Linux Pester + bats. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds scripts/tests/e2e-cluster.sh + an e2e-cluster matrix job — the highest- fidelity check CI can run. It drives the installer's OWN create_cluster() to bring up an actual k3d cluster on a real kernel (Docker is preinstalled on the runner), asserts every node reaches Ready, then proves the cluster can pull, schedule, and run a public workload (nginx:alpine), and tears down. It deliberately stops BEFORE the tracebloc helm install / backend registration (private images + real credentials), so it needs no secrets. Runs on ubuntu-22.04, ubuntu-24.04, and ubuntu-24.04-arm (arm64 runners are free on this public repo) — covering the real cluster path on both architectures. Validated locally on an arm64 Ubuntu VM: create_cluster() → server+agent Ready (k3s v1.29.4) → nginx pod Running → teardown. shellcheck clean (0 errors). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The amd64 runners proved the cluster comes up fine (nodes Ready), but the probe pod failed with "serviceaccount default not found" — kubectl run raced the SA controller, which creates default/default asynchronously after the node goes Ready. arm64 dodged it by timing. Wait for the SA before running the pod. Pure test-harness fix; the installer cluster path is correct on all arches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adds scripts/tests/e2e-proxy.sh + an e2e-proxy job. Stands up a squid that REQUIRES basic auth, brings up a k3d cluster via the installer's create_cluster() with HTTP(S)_PROXY=http://user:pass@host.k3d.internal:3128, and proves the nodes pull a workload image THROUGH the authed proxy — the squid access log shows an authenticated CONNECT to auth.docker.io (which only a real image pull makes, never the readiness probe), closing the "proxy silently bypassed" false positive. It also asserts anonymous requests are refused, so auth is genuinely enforced. Guards the corporate-proxy hardening end-to-end (#172/#174, the Charité/hospital archetype): _write_k3d_proxy_config passes proxy env via a k3d config FILE so the '@' in user:pass@host survives (k3d splits --env on '@'), plus _augment_no_proxy. If the credentials regress, squid 407s and the pull hangs — the test fails loudly. Stops before the helm install / backend registration; no secrets. Validated locally on an arm64 Ubuntu VM: anonymous refused → cluster up via the authed proxy → nginx pulled through it (auth.docker.io + registry-1.docker.io CONNECTs by the proxy user) → teardown. shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…#176) dockerd crash-loops on minimal RHEL/AlmaLinux images because xt_addrtype/ iptable_nat/br_netfilter live in kernel-modules-extra, not the base kernel-modules package. The prior fix installed kernel-modules-$(uname -r) — the wrong package — so the self-heal never took. Install kernel-modules-extra (unversioned). When the repo's extra modules target a newer kernel than the running one (stale AMI), they can't load until reboot: detect that, set KMODS_REBOOT_REQUIRED, and have install_docker_engine print a clear reboot-and-re-run message instead of a raw Docker error. Modules persist via /etc/modules-load.d/tracebloc.conf. Verified end-to-end on a pristine AlmaLinux 10.1 MINIMAL EC2 box: reboot gate fires, post-reboot modules load, re-run reaches Connected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

LukasWodka · 2026-06-02T11:38:07Z

👋 Heads-up — Code review queue is at 24 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#98 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
cli#16 — test(cli): coverage wins (preflight/progress/errors) + smoke-test hardening · author: @LukasWodka · no reviewer assigned
cli#17 — test(cli): integration harness for the real-I/O seams (kind e2e) · author: @LukasWodka · no reviewer assigned
client-runtime#45 — Staging: Enhance jobs-manager with HTTP proxy, ingestion endpoint, and fixes · author: @saadqbal · no reviewer assigned
client-runtime#61 — docs: record MySQL credential threat-model decision · author: @saadqbal · no reviewer assigned
client-runtime#65 — fix(ci: add advance-deploy-env caller workflow #64): re-sync jobs-manager ingest schema (accept masked_language_modeling) + anti-drift · author: @LukasWodka · no reviewer assigned
client-runtime#67 — ci: publish jobs-manager images on merge (closes the deploy-delivery gap) · author: @LukasWodka · no reviewer assigned
data-ingestors#132 — ci: add fr-gate caller workflow · author: @LukasWodka · no reviewer assigned
data-ingestors#133 — docs: fix declarative-ingest path/column drift (issue feat(#129): parent client chart owns the shared ingestor ServiceAccount (1.3.4) #131 A-series) · author: @divyasinghds · no reviewer assigned
data-ingestors#138 — test(e2e): end-to-end ingestion equivalence suite (all modalities, real MySQL) · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…177) (#178) Installer-only patch release — promotes #171–#175 (RHEL-family support, credential/readiness verification, preflight gate, reboot persistence, --diagnose bundle) to production via develop→main. Chart templates/values are unchanged from 1.4.2. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@test

* Sync main → develop after v1.4.2 release (#170) * Installer: verify readiness & credentials before reporting success; RHEL-family support (#171) * fix(installer): run on RHEL-family Linux (#718, #719, #720) Docker: install docker-ce from the official Docker CentOS dnf repo on AlmaLinux/Rocky/Oracle, which get.docker.com rejects as unsupported. k3d: preserve PATH through sudo so the post-install lookup survives RHEL secure_path (which omits /usr/local/bin). conntrack: use the conntrack apt package on Debian/Ubuntu and conntrack-tools elsewhere. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): verify credentials and readiness before reporting success (#716, #717) Credentials entered at the prompt are validated against the backend api-token-auth endpoint (the same call jobs-manager makes) with a re-prompt loop, so a wrong Client ID or password is caught immediately instead of after a full deploy. After helm apply, wait_for_client_ready polls rollout status and classifies the outcome; print_summary reports connected, starting, bad_creds, image_pull or crash, and prints the data-never-leaves message only when the client is verifiably connected. Exit code now reflects the real outcome. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): mirror credential and readiness checks on Windows (#716, #717) Test-Credentials, Wait-ForClientReady and Get-NotReadyState mirror the bash logic in install-k8s.ps1, and Print-Summary is now state-branched. Validated with the PowerShell 7.4 parser; runtime behavior still needs a check on a Windows host. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(installer): bats + Pester unit suites, wired into CI scripts/tests/: 64 bats tests (summary, install-client-helm, setup-linux, common) + 32 Pester tests for install-k8s.ps1 — all mocked, no Docker/k3d/network needed. Changed-line coverage measured with kcov (bash 96.2%) and Pester (PowerShell 97.4%); residual lines are the real RHEL Docker-install commands + the guarded main() orchestration, exercised by the integration E2E. A TB_PESTER guard lets the suite dot-source install-k8s.ps1 without running the installer. New installer-tests job in helm-ci.yaml runs both suites on PRs (scripts/ added to path filters). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): harden corporate-proxy support (auth proxy, NO_PROXY, 0.0.0.0 detect, Windows parity) (#172) A customer running behind an authenticated corporate HTTP proxy hit install failures. The 0.0.0.0 kubeconfig headline was fixed for the bash path in #166/#167, but adverse testing (a forward proxy + real k3d on Linux VMs) surfaced three remaining gaps plus a Windows parity hole: - Gap A: authenticated proxies (http://user:pass@host) were silently SKIPPED — k3d's --env KEY=VALUE@FILTER can't carry an '@' in the value. Now propagated via a k3d --config file (structured YAML env) so credentials survive intact. Verified on k3d v5.8.3 (it merges the --config env with the existing CLI flags). - Gap B: NO_PROXY was propagated verbatim. Now auto-augmented with the cluster-internal ranges (loopback + RFC1918 + .svc/.cluster.local + host.k3d.internal), both into the cluster and host-side, so in-cluster traffic never routes through the proxy — fixes the misroute AND the observed `k3d cluster create --wait` hang. - Gap C: a cluster created outside the installer and bound to 0.0.0.0 is now detected (serverlb HostIp) and flagged with a non-destructive recreate remedy. - Windows parity: install-k8s.ps1::New-K3dCluster had NONE of the bash fixes — it still bound --api-port 0.0.0.0:6550 (the original headline bug, still live on Windows), normalized only host.docker.internal in the kubeconfig, and propagated zero proxy env. Now mirrors bash: 127.0.0.1:6550, a 0.0.0.0->127.0.0.1 kubeconfig rewrite, and Get-EffectiveNoProxy + Write-K3dProxyConfig (auth + augmented NO_PROXY, written UTF-8 without a BOM). Tests: new scripts/tests/cluster.bats (15) + Pester for the two ps1 helpers (6), both green. Verified end-to-end on Linux VMs: auth creds propagated into the node, no startup hang behind an unreachable proxy, and 0.0.0.0 detection firing. Stacked on #171 (the installer test scaffolding + final install-k8s.ps1 live there). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): preflight gate (arch, egress, disk, RAM, CPU) — fail fast, clearly (#173) Most install failures fall into two shapes: the environment can't support what the installer does and it fails CRYPTICALLY minutes in, or it claims success it hasn't earned. #171/#172 fixed specific cases; this attacks the first pattern systematically with a preflight gate that runs before any install/cluster work and fails in seconds with a precise, actionable reason. scripts/lib/preflight.sh — run_preflight() runs at the top of Step 1/4: - Architecture (arm64 guard): the tracebloc client images (e.g. mysql-client) are amd64-only. amd64 -> ok; arm64 on Docker Desktop -> info (emulated); arm64 Linux without QEMU binfmt -> hard fail with the `tonistiigi/binfmt --install amd64` remedy. Override: TRACEBLOC_ALLOW_ARM64=1. (This is exactly the `exec format error` we hit on arm64.) - Egress connectivity: probes the endpoints the install needs and reports which are blocked. Hard-fail on the criticals (registry-1.docker.io, ghcr.io, the CLIENT_ENV backend, tracebloc.github.io); warn-only on tool-download hosts when the tool isn't already installed. A TLS/cert error emits a break-and-inspect proxy hint. The probe honors HTTP_PROXY. - Disk / RAM / CPU: hard-fail on critically low disk; warn on low disk/RAM/CPU. - Aggregates results (runs ALL checks, then exits once with a summary). Escape hatch: TRACEBLOC_SKIP_PREFLIGHT=1. install-k8s.sh sources + calls run_preflight; install.sh adds preflight.sh to the curl-bootstrap download manifest (verified: every sourced lib is downloaded). install-k8s.ps1: Test-Preflight + Get-Pf* helpers mirror the bash logic for Windows (Get-CimInstance disk/RAM/CPU, Invoke-WebRequest connectivity). docs/INSTALL.md: a "Network requirements (egress allowlist)" section the preflight error points users to. Tests: scripts/tests/preflight.bats (21) + Pester Describes (Test-PfUrl, Test-Preflight; the Get-Cim* readers are Windows-only so they skip off-Windows). Verified end-to-end on a Linux VM: healthy run passes; a blocked ghcr.io fails in ~1s naming the host; arm64 without binfmt fails with the remedy. Stacked on #171 + #172. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): guarantee reboot persistence + make it visible (#174) A real VM reboot test showed the k3d cluster already auto-recovers on Linux (k3d sets --restart unless-stopped on its nodes; the Docker install enables docker.service on boot). So no systemd unit is needed — this GUARANTEES that behavior even on edge setups, and tells the user about it: - cluster.sh: new ensure_cluster_autostart() (called from create_cluster) -- `docker update --restart unless-stopped` on the k3d nodes (covers externally-created clusters / a future k3d default change) + `systemctl enable docker` on Linux (covers the installed-but-disabled re-run case the fresh-install path misses). Opt out with TRACEBLOC_NO_AUTOSTART=1. - summary.sh: _reboot_note() in the connected summary -- Linux: "Survives reboot"; macOS/Windows: enable Docker Desktop start-on-login. - install-k8s.ps1: Set-ClusterAutostart (defensive unless-stopped on the nodes) + the Docker Desktop reboot note in Print-Summary. Tests: cluster.bats (+4), summary.bats (+3), Pester (+2) -- all green. Verified by a real reboot on a Linux VM: with the restart policy stripped (policy=no), ensure_cluster_autostart restored it to unless-stopped + enabled docker; after `limactl stop/start`, the cluster, node, AND a deployed workload all returned Running with no intervention. Stacked on #171 + #172 + #173. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): --diagnose support bundle (redacted) (#175) * feat(installer): --diagnose support bundle (redacted) Adds a one-command support bundle so a customer hitting an install/runtime problem can send a single file instead of a multi-round log-gathering email thread (the Charité thread is the archetype). `bash <(curl ... i.sh) --diagnose` (or `install-k8s.ps1 -Diagnose`) collects logs + cluster/host status into ~/.tracebloc/tracebloc-diagnose-<ts>.tgz. Two guarantees: - Best-effort: the whole collection runs under `set +e` and short-circuits before any install work, so it works even when the install is broken. - Credential-safe: clientPassword, proxy credentials (user:pass@host), and password=/token/secret values are REDACTED from every file before archiving; clientId is kept (it's the identifier support needs, not a secret). scripts/lib/diagnose.sh -- run_diagnose() (host/versions, docker/k3d, kubectl overview + describe of non-Running pods, workload logs with namespace auto-discovery, helm, install log + values.yaml, proxy env) + _redact_file(). install-k8s.sh sources it + adds the --diagnose short-circuit (clears the EXIT trap so the post-install message doesn't fire); install.sh adds it to the bootstrap download manifest. install-k8s.ps1 mirrors with -Diagnose / Invoke-DiagnoseBundle / Edit-Redaction. Documented in --help. Tests: diagnose.bats (7, incl. the end-to-end redaction gate) + Pester (+3). Verified on a Linux VM: the real --diagnose flag produced a 16-file bundle and the seeded dev password + proxy credentials had ZERO occurrences in the archive (clientId kept); also works with no cluster present. Stacked on #171 + #172 + #173 + #174. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(installer): raise changed-line coverage (diagnose collection + real preflight probes) Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's ~96%) because the new code added integration-only branches the mocked unit suites skipped. Recover the unit-testable portion: - diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true + mocked tools) -> diagnose.sh 64% -> 90%. - preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping, the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%. Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is integration-only (real k3d/docker create + macOS/Windows-specific branches + MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy, preflight blocked-egress/arm64, diagnose redaction grep). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): close two redaction gaps in --diagnose (security review) A pre-merge security review of the support bundle found two ways credentials could land in the (supposedly redacted) archive the customer sends to support: 1. Redaction only matched `clientPassword:` and `password=` -- it missed any other *password key in colon form, so `dockerRegistry.password` (a registry token) and `HTTP_PROXY_PASSWORD` survived. Broadened _redact_file (bash) and Edit-Redaction (ps1) to redact ANY *password key, case-insensitive, in : or = form (portable explicit char classes -- BSD sed has no I flag). 2. (bash only) The bundle collected `helm get manifest`, which renders the k8s Secret objects with base64-encoded CLIENT_PASSWORD + .dockerconfigjson that text redaction can't see. Dropped the manifest collection (helm get values + kubectl output already cover triage). Regression tests added (diagnose.bats + Pester). Re-verified end-to-end on a Linux VM with the real --diagnose flag: clientPassword, the dockerRegistry token, HTTP_PROXY_PASSWORD, and proxy URL creds all have ZERO occurrences in the archive; clientId kept; manifest no longer collected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): correctness fixes from code review A pre-merge correctness review (high effort) over the stack found three real bugs — the failure paths weren't exercised by the passing tests: 1. cluster.sh: `k3d "${K3D_ARGS[@]}" > ...` was a bare command under `set -e`, so a k3d-create FAILURE aborted the script immediately, skipping the 'already exists' graceful reuse, the error dump, AND the proxy temp-dir cleanup. Capture rc set-e-safely (`&& create_rc=0 || create_rc=$?`). Proven under set -e: both the error-dump and reuse paths now run. 2. summary.sh: CLIENT_STATE was defaulted to "starting" at source time, so install_cleanup's `[[ -z "$CLIENT_STATE" ]]` guard was always false and the "did not complete / check the log / safe to re-run" hint never printed on an early failure (preflight / docker / cluster / helm). Default it empty; the readiness gate sets the real state. 3. preflight.sh: with curl absent (direct ./install-k8s.sh on a minimal VM, before install_system_deps adds curl), the connectivity probes returned 'nocurl' and hard-failed with a misleading "egress blocked". Skip the check with a warning when curl isn't present yet. Also defensively quote `switch ("$env:CLIENT_ENV")` in Get-BackendUrl so the prod default fires regardless of PowerShell version (refuted as a live bug on pwsh 7 -- default does fire -- but cheap insurance for the unvalidated 5.1 path). Tests: +nocurl-skip test, fixed an over-blunt has() mock; bats 120 / Pester 52 green. (#1's set -e abort can't be caught by bats -- no set -e there -- so it was verified manually under set -e.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): handle a Docker daemon that won't start (stop misdiagnosing it as a group issue) Asad's AlmaLinux 9 / EC2 test: docker-ce installed fine but `dockerd` crashed on startup (exit 1), systemd throttled it ("Start request repeated too quickly"), and the installer then printed "Could not connect to Docker -- try logging out and back in" -- the GROUP-not-active hint, which is wrong for a dead daemon and sent him in circles (logout/login didn't help). The throttle also means a bare re-run can't recover. install_docker_engine now: - `systemctl enable docker` WITHOUT `--now` (a start failure no longer hard-aborts the script under `set -e` at that line); - `systemctl reset-failed docker` before starting, so a throttled/failed unit from a prior attempt can be retried (a plain re-run now recovers); - when `docker info` fails AND the daemon isn't active (vs. the group-not-active case, which is still re-exec'd via `sg docker`), surface Docker's OWN error (systemctl status + the journalctl error lines) with likely RHEL/AlmaLinux causes, instead of the misleading group hint. Test: setup-linux.bats daemon-won't-start case; bats 121 green. NOTE: this fixes the installer's HANDLING. Asad's root cause (why dockerd exits 1 on that box) is still masked by the systemd throttle and is being chased separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): load the kernel modules Docker/k3s need (fixes dockerd on minimal RHEL/AlmaLinux) Root cause of Asad's AlmaLinux 9 / EC2 failure: dockerd died on startup with "failed to register bridge driver: iptables ... addrtype ... missing kernel module". Minimal RHEL/AlmaLinux cloud images (incl. AWS) ship kernel-modules-core but NOT the full kernel-modules package, so xt_addrtype (+ br_netfilter, overlay) aren't available and Docker can't program its bridge NAT rules. New _ensure_kernel_modules() (setup-linux.sh), called before starting Docker: modprobe overlay / br_netfilter / xt_addrtype / iptable_nat / ip_tables; on RHEL-family, if a load fails, `dnf install kernel-modules-$(uname -r)` and retry; persist to /etc/modules-load.d for reboots. Best-effort + idempotent (verified clean on a healthy Ubuntu box). Also sharpened the daemon-won't-start hint to point at the kernel-modules remedy when the error mentions addrtype/missing module. This is the hospital-VM profile (minimal RHEL/Alma), so it's a real install-side fix, not just error handling. Test: setup-linux.bats _ensure_kernel_modules; bats 122. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): cross-distro prereq matrix + real-Windows Pester + static gate Adds .github/workflows/installer-tests.yaml to validate the installer across the breadth of environments customers actually run — not just Ubuntu-amd64: • static — shellcheck (clean at --severity=warning) + bash -n + PSScriptAnalyzer • unit-bash — bats (mocked), 124 tests • unit-pester — Pester on Linux pwsh AND real windows-latest (the .ps1's true target) • distro-prereqs — NEW: runs the REAL Linux prereq path (PM detect, system deps, Docker branch, kernel modules, kubectl/k3d/helm) in a fresh container per distro family: ubuntu 22.04/24.04, debian 12, almalinux 9/8, rockylinux 9, amazonlinux 2023, fedora, opensuse leap. The matrix paid for itself before it even shipped: validating it locally against real distro containers surfaced a genuine gap — minimal Amazon Linux 2023 ships no openssl/tar, so helm's get-helm-3 fails ("openssl must first be installed"). Fixed in install_system_deps (ensure openssl + tar; package names are uniform across apt/dnf/yum/zypper/pacman), with bats coverage. All 9 validated distro branches now install every prerequisite. Installer-test jobs moved out of helm-ci.yaml into their own workflow (no more duplicate runs; helm-ci no longer triggers on scripts/** changes). Arch omitted (x86-only image + bare-container keyring friction; pacman branch covered by bats). Real k3d cluster-up (e2e) intentionally deferred — needs a stubbed backend; tracked. Validated locally via mac Docker: ubuntu:22.04, almalinux:9, amazonlinux:2023, opensuse/leap:15.6 → all PASS; bats 124 green; shellcheck 0 findings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): fix static gate (.bats ≠ bash) + Windows-safe Confirm-Config test The new workflow's first run caught two issues — exactly its job: 1. Static analysis failed: `bash -n` ran on .bats files, which are bats DSL (@test "name" { … }), not valid bash. Restrict the syntax check to *.sh; .bats are validated by being run in the unit-bash job. 2. Pester on real windows-latest failed 1/55: the Confirm-Config test set $env:USERPROFILE = $env:HOME, but $env:HOME is empty on Windows, so [System.IO.Path]::GetFullPath("") threw "path is empty". The INSTALLER is correct (defaults to $env:USERPROFILE, always set on Windows) — the test fixture was Linux-centric. Derive a profile dir valid on both OSes. For the record, the first run's wins: all 9 distro prereq jobs (ubuntu 22.04/ 24.04, debian 12, almalinux 8/9, rockylinux 9, amazonlinux 2023, fedora, opensuse leap) + bats + Linux Pester passed on GHA's amd64 runners. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): gate ShellCheck at error severity (SC2034 cross-file false positives) The libs are sourced together as one program, so single-file shellcheck reports SC2034 "unused" for shared vars defined in common.sh and consumed in other sourced files (CURL_SECURE, ARCH_DL, colours…). Gate at --severity=error (0 findings); warnings still printed as advisory. With this, Static analysis joins the already-green distro matrix + Windows/Linux Pester + bats. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): real k3d cluster-up E2E on Ubuntu (amd64 + arm64) Adds scripts/tests/e2e-cluster.sh + an e2e-cluster matrix job — the highest- fidelity check CI can run. It drives the installer's OWN create_cluster() to bring up an actual k3d cluster on a real kernel (Docker is preinstalled on the runner), asserts every node reaches Ready, then proves the cluster can pull, schedule, and run a public workload (nginx:alpine), and tears down. It deliberately stops BEFORE the tracebloc helm install / backend registration (private images + real credentials), so it needs no secrets. Runs on ubuntu-22.04, ubuntu-24.04, and ubuntu-24.04-arm (arm64 runners are free on this public repo) — covering the real cluster path on both architectures. Validated locally on an arm64 Ubuntu VM: create_cluster() → server+agent Ready (k3s v1.29.4) → nginx pod Running → teardown. shellcheck clean (0 errors). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): fix E2E probe race on the default ServiceAccount The amd64 runners proved the cluster comes up fine (nodes Ready), but the probe pod failed with "serviceaccount default not found" — kubectl run raced the SA controller, which creates default/default asynchronously after the node goes Ready. arm64 dodged it by timing. Wait for the SA before running the pod. Pure test-harness fix; the installer cluster path is correct on all arches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): authenticated corporate-proxy E2E (squid) Adds scripts/tests/e2e-proxy.sh + an e2e-proxy job. Stands up a squid that REQUIRES basic auth, brings up a k3d cluster via the installer's create_cluster() with HTTP(S)_PROXY=http://user:pass@host.k3d.internal:3128, and proves the nodes pull a workload image THROUGH the authed proxy — the squid access log shows an authenticated CONNECT to auth.docker.io (which only a real image pull makes, never the readiness probe), closing the "proxy silently bypassed" false positive. It also asserts anonymous requests are refused, so auth is genuinely enforced. Guards the corporate-proxy hardening end-to-end (#172/#174, the Charité/hospital archetype): _write_k3d_proxy_config passes proxy env via a k3d config FILE so the '@' in user:pass@host survives (k3d splits --env on '@'), plus _augment_no_proxy. If the credentials regress, squid 407s and the pull hangs — the test fails loudly. Stops before the helm install / backend registration; no secrets. Validated locally on an arm64 Ubuntu VM: anonymous refused → cluster up via the authed proxy → nginx pulled through it (auth.docker.io + registry-1.docker.io CONNECTs by the proxy user) → teardown. shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): install kernel-modules-extra + handle reboot-required (#176) dockerd crash-loops on minimal RHEL/AlmaLinux images because xt_addrtype/ iptable_nat/br_netfilter live in kernel-modules-extra, not the base kernel-modules package. The prior fix installed kernel-modules-$(uname -r) — the wrong package — so the self-heal never took. Install kernel-modules-extra (unversioned). When the repo's extra modules target a newer kernel than the running one (stale AMI), they can't load until reboot: detect that, set KMODS_REBOOT_REQUIRED, and have install_docker_engine print a clear reboot-and-re-run message instead of a raw Docker error. Modules persist via /etc/modules-load.d/tracebloc.conf. Verified end-to-end on a pristine AlmaLinux 10.1 MINIMAL EC2 box: reboot gate fires, post-reboot modules load, re-run reaches Connected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com> * chore(chart): bump version to 1.4.3 for installer-hardening release (#177) (#178) Installer-only patch release — promotes #171–#175 (RHEL-family support, credential/readiness verification, preflight gate, reboot persistence, --diagnose bundle) to production via develop→main. Chart templates/values are unchanged from 1.4.2. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka assigned saadqbal Jun 1, 2026

saadqbal changed the base branch from develop to fix/reboot-persistence June 2, 2026 10:16

saadqbal force-pushed the fix/reboot-persistence branch from e69b9e1 to 8953324 Compare June 2, 2026 10:32

saadqbal force-pushed the fix/diagnose-bundle branch from 9af4990 to 54bccd7 Compare June 2, 2026 10:32

saadqbal force-pushed the fix/reboot-persistence branch from 8953324 to ab4b555 Compare June 2, 2026 11:08

saadqbal force-pushed the fix/diagnose-bundle branch from 54bccd7 to decf881 Compare June 2, 2026 11:08

saadqbal force-pushed the fix/reboot-persistence branch from ab4b555 to cba4c35 Compare June 2, 2026 11:13

saadqbal force-pushed the fix/diagnose-bundle branch from decf881 to ba4fb8c Compare June 2, 2026 11:13

LukasWodka and others added 13 commits June 2, 2026 16:36

saadqbal force-pushed the fix/diagnose-bundle branch from ba4fb8c to d97e20d Compare June 2, 2026 11:36

saadqbal changed the base branch from fix/reboot-persistence to develop June 2, 2026 11:36

saadqbal marked this pull request as ready for review June 2, 2026 11:37

saadqbal closed this Jun 2, 2026

saadqbal reopened this Jun 2, 2026

LukasWodka requested a review from saadqbal June 2, 2026 11:42

saadqbal approved these changes Jun 2, 2026

View reviewed changes

saadqbal merged commit da9f45d into develop Jun 2, 2026
33 checks passed

saadqbal deleted the fix/diagnose-bundle branch June 2, 2026 11:50

This was referenced Jun 2, 2026

Release v1.4.3 — promote installer hardening to production #177

Open

chore(chart): bump to 1.4.3 for installer-hardening release #178

Merged

saadqbal mentioned this pull request Jun 2, 2026

Sync develop → main for v1.4.3 chart release #179

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(installer): --diagnose support bundle (redacted)#175

feat(installer): --diagnose support bundle (redacted)#175
saadqbal merged 13 commits into
developfrom
fix/diagnose-bundle

LukasWodka commented Jun 1, 2026

Uh oh!

LukasWodka commented Jun 1, 2026

Uh oh!

LukasWodka commented Jun 1, 2026

Uh oh!

LukasWodka commented Jun 1, 2026

Uh oh!

LukasWodka commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LukasWodka commented Jun 1, 2026

Summary

Related

Type of change

What changed

Security — redaction (the crux; this file goes to support)

Test plan

Checklist

Uh oh!

LukasWodka commented Jun 1, 2026

Uh oh!

LukasWodka commented Jun 1, 2026

Uh oh!

LukasWodka commented Jun 1, 2026

Uh oh!

LukasWodka commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants