feat(installer): --diagnose support bundle (redacted)#175
Conversation
|
👋 Heads-up — Code review queue is at 16 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
|
Pre-merge security review (covered the whole stack #171–#175). Proxy temp-file handling, perms (0600 under umask 077 + cleaned up), injection, and the other PRs came back clean. Two redaction gaps in the support bundle were found and fixed in
Regression tests added; re-verified end-to-end (real |
|
Pre-merge correctness review (high effort, whole stack) — three real bugs fixed in
Plus a defensive |
e69b9e1 to
8953324
Compare
9af4990 to
54bccd7
Compare
8953324 to
ab4b555
Compare
54bccd7 to
decf881
Compare
ab4b555 to
cba4c35
Compare
decf881 to
ba4fb8c
Compare
Adds a one-command support bundle so a customer hitting an install/runtime problem can send a single file instead of a multi-round log-gathering email thread (the Charité thread is the archetype). `bash <(curl ... i.sh) --diagnose` (or `install-k8s.ps1 -Diagnose`) collects logs + cluster/host status into ~/.tracebloc/tracebloc-diagnose-<ts>.tgz. Two guarantees: - Best-effort: the whole collection runs under `set +e` and short-circuits before any install work, so it works even when the install is broken. - Credential-safe: clientPassword, proxy credentials (user:pass@host), and password=/token/secret values are REDACTED from every file before archiving; clientId is kept (it's the identifier support needs, not a secret). scripts/lib/diagnose.sh -- run_diagnose() (host/versions, docker/k3d, kubectl overview + describe of non-Running pods, workload logs with namespace auto-discovery, helm, install log + values.yaml, proxy env) + _redact_file(). install-k8s.sh sources it + adds the --diagnose short-circuit (clears the EXIT trap so the post-install message doesn't fire); install.sh adds it to the bootstrap download manifest. install-k8s.ps1 mirrors with -Diagnose / Invoke-DiagnoseBundle / Edit-Redaction. Documented in --help. Tests: diagnose.bats (7, incl. the end-to-end redaction gate) + Pester (+3). Verified on a Linux VM: the real --diagnose flag produced a 16-file bundle and the seeded dev password + proxy credentials had ZERO occurrences in the archive (clientId kept); also works with no cluster present. Stacked on #171 + #172 + #173 + #174. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eal preflight probes) Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's ~96%) because the new code added integration-only branches the mocked unit suites skipped. Recover the unit-testable portion: - diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true + mocked tools) -> diagnose.sh 64% -> 90%. - preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping, the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%. Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is integration-only (real k3d/docker create + macOS/Windows-specific branches + MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy, preflight blocked-egress/arm64, diagnose redaction grep). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A pre-merge security review of the support bundle found two ways credentials could land in the (supposedly redacted) archive the customer sends to support: 1. Redaction only matched `clientPassword:` and `password=` -- it missed any other *password key in colon form, so `dockerRegistry.password` (a registry token) and `HTTP_PROXY_PASSWORD` survived. Broadened _redact_file (bash) and Edit-Redaction (ps1) to redact ANY *password key, case-insensitive, in : or = form (portable explicit char classes -- BSD sed has no I flag). 2. (bash only) The bundle collected `helm get manifest`, which renders the k8s Secret objects with base64-encoded CLIENT_PASSWORD + .dockerconfigjson that text redaction can't see. Dropped the manifest collection (helm get values + kubectl output already cover triage). Regression tests added (diagnose.bats + Pester). Re-verified end-to-end on a Linux VM with the real --diagnose flag: clientPassword, the dockerRegistry token, HTTP_PROXY_PASSWORD, and proxy URL creds all have ZERO occurrences in the archive; clientId kept; manifest no longer collected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A pre-merge correctness review (high effort) over the stack found three real
bugs — the failure paths weren't exercised by the passing tests:
1. cluster.sh: `k3d "${K3D_ARGS[@]}" > ...` was a bare command under `set -e`,
so a k3d-create FAILURE aborted the script immediately, skipping the
'already exists' graceful reuse, the error dump, AND the proxy temp-dir
cleanup. Capture rc set-e-safely (`&& create_rc=0 || create_rc=$?`).
Proven under set -e: both the error-dump and reuse paths now run.
2. summary.sh: CLIENT_STATE was defaulted to "starting" at source time, so
install_cleanup's `[[ -z "$CLIENT_STATE" ]]` guard was always false and the
"did not complete / check the log / safe to re-run" hint never printed on an
early failure (preflight / docker / cluster / helm). Default it empty; the
readiness gate sets the real state.
3. preflight.sh: with curl absent (direct ./install-k8s.sh on a minimal VM,
before install_system_deps adds curl), the connectivity probes returned
'nocurl' and hard-failed with a misleading "egress blocked". Skip the check
with a warning when curl isn't present yet.
Also defensively quote `switch ("$env:CLIENT_ENV")` in Get-BackendUrl so the
prod default fires regardless of PowerShell version (refuted as a live bug on
pwsh 7 -- default does fire -- but cheap insurance for the unvalidated 5.1 path).
Tests: +nocurl-skip test, fixed an over-blunt has() mock; bats 120 / Pester 52
green. (#1's set -e abort can't be caught by bats -- no set -e there -- so it
was verified manually under set -e.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nosing it as a group issue)
Asad's AlmaLinux 9 / EC2 test: docker-ce installed fine but `dockerd` crashed on
startup (exit 1), systemd throttled it ("Start request repeated too quickly"),
and the installer then printed "Could not connect to Docker -- try logging out
and back in" -- the GROUP-not-active hint, which is wrong for a dead daemon and
sent him in circles (logout/login didn't help). The throttle also means a bare
re-run can't recover.
install_docker_engine now:
- `systemctl enable docker` WITHOUT `--now` (a start failure no longer hard-aborts
the script under `set -e` at that line);
- `systemctl reset-failed docker` before starting, so a throttled/failed unit from
a prior attempt can be retried (a plain re-run now recovers);
- when `docker info` fails AND the daemon isn't active (vs. the group-not-active
case, which is still re-exec'd via `sg docker`), surface Docker's OWN error
(systemctl status + the journalctl error lines) with likely RHEL/AlmaLinux
causes, instead of the misleading group hint.
Test: setup-linux.bats daemon-won't-start case; bats 121 green.
NOTE: this fixes the installer's HANDLING. Asad's root cause (why dockerd exits 1
on that box) is still masked by the systemd throttle and is being chased separately.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d on minimal RHEL/AlmaLinux) Root cause of Asad's AlmaLinux 9 / EC2 failure: dockerd died on startup with "failed to register bridge driver: iptables ... addrtype ... missing kernel module". Minimal RHEL/AlmaLinux cloud images (incl. AWS) ship kernel-modules-core but NOT the full kernel-modules package, so xt_addrtype (+ br_netfilter, overlay) aren't available and Docker can't program its bridge NAT rules. New _ensure_kernel_modules() (setup-linux.sh), called before starting Docker: modprobe overlay / br_netfilter / xt_addrtype / iptable_nat / ip_tables; on RHEL-family, if a load fails, `dnf install kernel-modules-$(uname -r)` and retry; persist to /etc/modules-load.d for reboots. Best-effort + idempotent (verified clean on a healthy Ubuntu box). Also sharpened the daemon-won't-start hint to point at the kernel-modules remedy when the error mentions addrtype/missing module. This is the hospital-VM profile (minimal RHEL/Alma), so it's a real install-side fix, not just error handling. Test: setup-linux.bats _ensure_kernel_modules; bats 122. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tic gate
Adds .github/workflows/installer-tests.yaml to validate the installer across the
breadth of environments customers actually run — not just Ubuntu-amd64:
• static — shellcheck (clean at --severity=warning) + bash -n + PSScriptAnalyzer
• unit-bash — bats (mocked), 124 tests
• unit-pester — Pester on Linux pwsh AND real windows-latest (the .ps1's true target)
• distro-prereqs — NEW: runs the REAL Linux prereq path (PM detect, system deps,
Docker branch, kernel modules, kubectl/k3d/helm) in a fresh
container per distro family: ubuntu 22.04/24.04, debian 12,
almalinux 9/8, rockylinux 9, amazonlinux 2023, fedora, opensuse leap.
The matrix paid for itself before it even shipped: validating it locally against
real distro containers surfaced a genuine gap — minimal Amazon Linux 2023 ships no
openssl/tar, so helm's get-helm-3 fails ("openssl must first be installed"). Fixed
in install_system_deps (ensure openssl + tar; package names are uniform across
apt/dnf/yum/zypper/pacman), with bats coverage. All 9 validated distro branches now
install every prerequisite.
Installer-test jobs moved out of helm-ci.yaml into their own workflow (no more
duplicate runs; helm-ci no longer triggers on scripts/** changes). Arch omitted
(x86-only image + bare-container keyring friction; pacman branch covered by bats).
Real k3d cluster-up (e2e) intentionally deferred — needs a stubbed backend; tracked.
Validated locally via mac Docker: ubuntu:22.04, almalinux:9, amazonlinux:2023,
opensuse/leap:15.6 → all PASS; bats 124 green; shellcheck 0 findings.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Config test The new workflow's first run caught two issues — exactly its job: 1. Static analysis failed: `bash -n` ran on .bats files, which are bats DSL (@test "name" { … }), not valid bash. Restrict the syntax check to *.sh; .bats are validated by being run in the unit-bash job. 2. Pester on real windows-latest failed 1/55: the Confirm-Config test set $env:USERPROFILE = $env:HOME, but $env:HOME is empty on Windows, so [System.IO.Path]::GetFullPath("") threw "path is empty". The INSTALLER is correct (defaults to $env:USERPROFILE, always set on Windows) — the test fixture was Linux-centric. Derive a profile dir valid on both OSes. For the record, the first run's wins: all 9 distro prereq jobs (ubuntu 22.04/ 24.04, debian 12, almalinux 8/9, rockylinux 9, amazonlinux 2023, fedora, opensuse leap) + bats + Linux Pester passed on GHA's amd64 runners. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alse positives) The libs are sourced together as one program, so single-file shellcheck reports SC2034 "unused" for shared vars defined in common.sh and consumed in other sourced files (CURL_SECURE, ARCH_DL, colours…). Gate at --severity=error (0 findings); warnings still printed as advisory. With this, Static analysis joins the already-green distro matrix + Windows/Linux Pester + bats. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds scripts/tests/e2e-cluster.sh + an e2e-cluster matrix job — the highest- fidelity check CI can run. It drives the installer's OWN create_cluster() to bring up an actual k3d cluster on a real kernel (Docker is preinstalled on the runner), asserts every node reaches Ready, then proves the cluster can pull, schedule, and run a public workload (nginx:alpine), and tears down. It deliberately stops BEFORE the tracebloc helm install / backend registration (private images + real credentials), so it needs no secrets. Runs on ubuntu-22.04, ubuntu-24.04, and ubuntu-24.04-arm (arm64 runners are free on this public repo) — covering the real cluster path on both architectures. Validated locally on an arm64 Ubuntu VM: create_cluster() → server+agent Ready (k3s v1.29.4) → nginx pod Running → teardown. shellcheck clean (0 errors). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The amd64 runners proved the cluster comes up fine (nodes Ready), but the probe pod failed with "serviceaccount default not found" — kubectl run raced the SA controller, which creates default/default asynchronously after the node goes Ready. arm64 dodged it by timing. Wait for the SA before running the pod. Pure test-harness fix; the installer cluster path is correct on all arches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds scripts/tests/e2e-proxy.sh + an e2e-proxy job. Stands up a squid that REQUIRES basic auth, brings up a k3d cluster via the installer's create_cluster() with HTTP(S)_PROXY=http://user:pass@host.k3d.internal:3128, and proves the nodes pull a workload image THROUGH the authed proxy — the squid access log shows an authenticated CONNECT to auth.docker.io (which only a real image pull makes, never the readiness probe), closing the "proxy silently bypassed" false positive. It also asserts anonymous requests are refused, so auth is genuinely enforced. Guards the corporate-proxy hardening end-to-end (#172/#174, the Charité/hospital archetype): _write_k3d_proxy_config passes proxy env via a k3d config FILE so the '@' in user:pass@host survives (k3d splits --env on '@'), plus _augment_no_proxy. If the credentials regress, squid 407s and the pull hangs — the test fails loudly. Stops before the helm install / backend registration; no secrets. Validated locally on an arm64 Ubuntu VM: anonymous refused → cluster up via the authed proxy → nginx pulled through it (auth.docker.io + registry-1.docker.io CONNECTs by the proxy user) → teardown. shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#176) dockerd crash-loops on minimal RHEL/AlmaLinux images because xt_addrtype/ iptable_nat/br_netfilter live in kernel-modules-extra, not the base kernel-modules package. The prior fix installed kernel-modules-$(uname -r) — the wrong package — so the self-heal never took. Install kernel-modules-extra (unversioned). When the repo's extra modules target a newer kernel than the running one (stale AMI), they can't load until reboot: detect that, set KMODS_REBOOT_REQUIRED, and have install_docker_engine print a clear reboot-and-re-run message instead of a raw Docker error. Modules persist via /etc/modules-load.d/tracebloc.conf. Verified end-to-end on a pristine AlmaLinux 10.1 MINIMAL EC2 box: reboot gate fires, post-reboot modules load, re-run reaches Connected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ba4fb8c to
d97e20d
Compare
|
👋 Heads-up — Code review queue is at 24 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
…177) (#178) Installer-only patch release — promotes #171–#175 (RHEL-family support, credential/readiness verification, preflight gate, reboot persistence, --diagnose bundle) to production via develop→main. Chart templates/values are unchanged from 1.4.2. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Sync main → develop after v1.4.2 release (#170) * Installer: verify readiness & credentials before reporting success; RHEL-family support (#171) * fix(installer): run on RHEL-family Linux (#718, #719, #720) Docker: install docker-ce from the official Docker CentOS dnf repo on AlmaLinux/Rocky/Oracle, which get.docker.com rejects as unsupported. k3d: preserve PATH through sudo so the post-install lookup survives RHEL secure_path (which omits /usr/local/bin). conntrack: use the conntrack apt package on Debian/Ubuntu and conntrack-tools elsewhere. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): verify credentials and readiness before reporting success (#716, #717) Credentials entered at the prompt are validated against the backend api-token-auth endpoint (the same call jobs-manager makes) with a re-prompt loop, so a wrong Client ID or password is caught immediately instead of after a full deploy. After helm apply, wait_for_client_ready polls rollout status and classifies the outcome; print_summary reports connected, starting, bad_creds, image_pull or crash, and prints the data-never-leaves message only when the client is verifiably connected. Exit code now reflects the real outcome. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): mirror credential and readiness checks on Windows (#716, #717) Test-Credentials, Wait-ForClientReady and Get-NotReadyState mirror the bash logic in install-k8s.ps1, and Print-Summary is now state-branched. Validated with the PowerShell 7.4 parser; runtime behavior still needs a check on a Windows host. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(installer): bats + Pester unit suites, wired into CI scripts/tests/: 64 bats tests (summary, install-client-helm, setup-linux, common) + 32 Pester tests for install-k8s.ps1 — all mocked, no Docker/k3d/network needed. Changed-line coverage measured with kcov (bash 96.2%) and Pester (PowerShell 97.4%); residual lines are the real RHEL Docker-install commands + the guarded main() orchestration, exercised by the integration E2E. A TB_PESTER guard lets the suite dot-source install-k8s.ps1 without running the installer. New installer-tests job in helm-ci.yaml runs both suites on PRs (scripts/ added to path filters). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): harden corporate-proxy support (auth proxy, NO_PROXY, 0.0.0.0 detect, Windows parity) (#172) A customer running behind an authenticated corporate HTTP proxy hit install failures. The 0.0.0.0 kubeconfig headline was fixed for the bash path in #166/#167, but adverse testing (a forward proxy + real k3d on Linux VMs) surfaced three remaining gaps plus a Windows parity hole: - Gap A: authenticated proxies (http://user:pass@host) were silently SKIPPED — k3d's --env KEY=VALUE@FILTER can't carry an '@' in the value. Now propagated via a k3d --config file (structured YAML env) so credentials survive intact. Verified on k3d v5.8.3 (it merges the --config env with the existing CLI flags). - Gap B: NO_PROXY was propagated verbatim. Now auto-augmented with the cluster-internal ranges (loopback + RFC1918 + .svc/.cluster.local + host.k3d.internal), both into the cluster and host-side, so in-cluster traffic never routes through the proxy — fixes the misroute AND the observed `k3d cluster create --wait` hang. - Gap C: a cluster created outside the installer and bound to 0.0.0.0 is now detected (serverlb HostIp) and flagged with a non-destructive recreate remedy. - Windows parity: install-k8s.ps1::New-K3dCluster had NONE of the bash fixes — it still bound --api-port 0.0.0.0:6550 (the original headline bug, still live on Windows), normalized only host.docker.internal in the kubeconfig, and propagated zero proxy env. Now mirrors bash: 127.0.0.1:6550, a 0.0.0.0->127.0.0.1 kubeconfig rewrite, and Get-EffectiveNoProxy + Write-K3dProxyConfig (auth + augmented NO_PROXY, written UTF-8 without a BOM). Tests: new scripts/tests/cluster.bats (15) + Pester for the two ps1 helpers (6), both green. Verified end-to-end on Linux VMs: auth creds propagated into the node, no startup hang behind an unreachable proxy, and 0.0.0.0 detection firing. Stacked on #171 (the installer test scaffolding + final install-k8s.ps1 live there). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): preflight gate (arch, egress, disk, RAM, CPU) — fail fast, clearly (#173) Most install failures fall into two shapes: the environment can't support what the installer does and it fails CRYPTICALLY minutes in, or it claims success it hasn't earned. #171/#172 fixed specific cases; this attacks the first pattern systematically with a preflight gate that runs before any install/cluster work and fails in seconds with a precise, actionable reason. scripts/lib/preflight.sh — run_preflight() runs at the top of Step 1/4: - Architecture (arm64 guard): the tracebloc client images (e.g. mysql-client) are amd64-only. amd64 -> ok; arm64 on Docker Desktop -> info (emulated); arm64 Linux without QEMU binfmt -> hard fail with the `tonistiigi/binfmt --install amd64` remedy. Override: TRACEBLOC_ALLOW_ARM64=1. (This is exactly the `exec format error` we hit on arm64.) - Egress connectivity: probes the endpoints the install needs and reports which are blocked. Hard-fail on the criticals (registry-1.docker.io, ghcr.io, the CLIENT_ENV backend, tracebloc.github.io); warn-only on tool-download hosts when the tool isn't already installed. A TLS/cert error emits a break-and-inspect proxy hint. The probe honors HTTP_PROXY. - Disk / RAM / CPU: hard-fail on critically low disk; warn on low disk/RAM/CPU. - Aggregates results (runs ALL checks, then exits once with a summary). Escape hatch: TRACEBLOC_SKIP_PREFLIGHT=1. install-k8s.sh sources + calls run_preflight; install.sh adds preflight.sh to the curl-bootstrap download manifest (verified: every sourced lib is downloaded). install-k8s.ps1: Test-Preflight + Get-Pf* helpers mirror the bash logic for Windows (Get-CimInstance disk/RAM/CPU, Invoke-WebRequest connectivity). docs/INSTALL.md: a "Network requirements (egress allowlist)" section the preflight error points users to. Tests: scripts/tests/preflight.bats (21) + Pester Describes (Test-PfUrl, Test-Preflight; the Get-Cim* readers are Windows-only so they skip off-Windows). Verified end-to-end on a Linux VM: healthy run passes; a blocked ghcr.io fails in ~1s naming the host; arm64 without binfmt fails with the remedy. Stacked on #171 + #172. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): guarantee reboot persistence + make it visible (#174) A real VM reboot test showed the k3d cluster already auto-recovers on Linux (k3d sets --restart unless-stopped on its nodes; the Docker install enables docker.service on boot). So no systemd unit is needed — this GUARANTEES that behavior even on edge setups, and tells the user about it: - cluster.sh: new ensure_cluster_autostart() (called from create_cluster) -- `docker update --restart unless-stopped` on the k3d nodes (covers externally-created clusters / a future k3d default change) + `systemctl enable docker` on Linux (covers the installed-but-disabled re-run case the fresh-install path misses). Opt out with TRACEBLOC_NO_AUTOSTART=1. - summary.sh: _reboot_note() in the connected summary -- Linux: "Survives reboot"; macOS/Windows: enable Docker Desktop start-on-login. - install-k8s.ps1: Set-ClusterAutostart (defensive unless-stopped on the nodes) + the Docker Desktop reboot note in Print-Summary. Tests: cluster.bats (+4), summary.bats (+3), Pester (+2) -- all green. Verified by a real reboot on a Linux VM: with the restart policy stripped (policy=no), ensure_cluster_autostart restored it to unless-stopped + enabled docker; after `limactl stop/start`, the cluster, node, AND a deployed workload all returned Running with no intervention. Stacked on #171 + #172 + #173. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): --diagnose support bundle (redacted) (#175) * feat(installer): --diagnose support bundle (redacted) Adds a one-command support bundle so a customer hitting an install/runtime problem can send a single file instead of a multi-round log-gathering email thread (the Charité thread is the archetype). `bash <(curl ... i.sh) --diagnose` (or `install-k8s.ps1 -Diagnose`) collects logs + cluster/host status into ~/.tracebloc/tracebloc-diagnose-<ts>.tgz. Two guarantees: - Best-effort: the whole collection runs under `set +e` and short-circuits before any install work, so it works even when the install is broken. - Credential-safe: clientPassword, proxy credentials (user:pass@host), and password=/token/secret values are REDACTED from every file before archiving; clientId is kept (it's the identifier support needs, not a secret). scripts/lib/diagnose.sh -- run_diagnose() (host/versions, docker/k3d, kubectl overview + describe of non-Running pods, workload logs with namespace auto-discovery, helm, install log + values.yaml, proxy env) + _redact_file(). install-k8s.sh sources it + adds the --diagnose short-circuit (clears the EXIT trap so the post-install message doesn't fire); install.sh adds it to the bootstrap download manifest. install-k8s.ps1 mirrors with -Diagnose / Invoke-DiagnoseBundle / Edit-Redaction. Documented in --help. Tests: diagnose.bats (7, incl. the end-to-end redaction gate) + Pester (+3). Verified on a Linux VM: the real --diagnose flag produced a 16-file bundle and the seeded dev password + proxy credentials had ZERO occurrences in the archive (clientId kept); also works with no cluster present. Stacked on #171 + #172 + #173 + #174. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(installer): raise changed-line coverage (diagnose collection + real preflight probes) Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's ~96%) because the new code added integration-only branches the mocked unit suites skipped. Recover the unit-testable portion: - diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true + mocked tools) -> diagnose.sh 64% -> 90%. - preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping, the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%. Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is integration-only (real k3d/docker create + macOS/Windows-specific branches + MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy, preflight blocked-egress/arm64, diagnose redaction grep). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): close two redaction gaps in --diagnose (security review) A pre-merge security review of the support bundle found two ways credentials could land in the (supposedly redacted) archive the customer sends to support: 1. Redaction only matched `clientPassword:` and `password=` -- it missed any other *password key in colon form, so `dockerRegistry.password` (a registry token) and `HTTP_PROXY_PASSWORD` survived. Broadened _redact_file (bash) and Edit-Redaction (ps1) to redact ANY *password key, case-insensitive, in : or = form (portable explicit char classes -- BSD sed has no I flag). 2. (bash only) The bundle collected `helm get manifest`, which renders the k8s Secret objects with base64-encoded CLIENT_PASSWORD + .dockerconfigjson that text redaction can't see. Dropped the manifest collection (helm get values + kubectl output already cover triage). Regression tests added (diagnose.bats + Pester). Re-verified end-to-end on a Linux VM with the real --diagnose flag: clientPassword, the dockerRegistry token, HTTP_PROXY_PASSWORD, and proxy URL creds all have ZERO occurrences in the archive; clientId kept; manifest no longer collected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): correctness fixes from code review A pre-merge correctness review (high effort) over the stack found three real bugs — the failure paths weren't exercised by the passing tests: 1. cluster.sh: `k3d "${K3D_ARGS[@]}" > ...` was a bare command under `set -e`, so a k3d-create FAILURE aborted the script immediately, skipping the 'already exists' graceful reuse, the error dump, AND the proxy temp-dir cleanup. Capture rc set-e-safely (`&& create_rc=0 || create_rc=$?`). Proven under set -e: both the error-dump and reuse paths now run. 2. summary.sh: CLIENT_STATE was defaulted to "starting" at source time, so install_cleanup's `[[ -z "$CLIENT_STATE" ]]` guard was always false and the "did not complete / check the log / safe to re-run" hint never printed on an early failure (preflight / docker / cluster / helm). Default it empty; the readiness gate sets the real state. 3. preflight.sh: with curl absent (direct ./install-k8s.sh on a minimal VM, before install_system_deps adds curl), the connectivity probes returned 'nocurl' and hard-failed with a misleading "egress blocked". Skip the check with a warning when curl isn't present yet. Also defensively quote `switch ("$env:CLIENT_ENV")` in Get-BackendUrl so the prod default fires regardless of PowerShell version (refuted as a live bug on pwsh 7 -- default does fire -- but cheap insurance for the unvalidated 5.1 path). Tests: +nocurl-skip test, fixed an over-blunt has() mock; bats 120 / Pester 52 green. (#1's set -e abort can't be caught by bats -- no set -e there -- so it was verified manually under set -e.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): handle a Docker daemon that won't start (stop misdiagnosing it as a group issue) Asad's AlmaLinux 9 / EC2 test: docker-ce installed fine but `dockerd` crashed on startup (exit 1), systemd throttled it ("Start request repeated too quickly"), and the installer then printed "Could not connect to Docker -- try logging out and back in" -- the GROUP-not-active hint, which is wrong for a dead daemon and sent him in circles (logout/login didn't help). The throttle also means a bare re-run can't recover. install_docker_engine now: - `systemctl enable docker` WITHOUT `--now` (a start failure no longer hard-aborts the script under `set -e` at that line); - `systemctl reset-failed docker` before starting, so a throttled/failed unit from a prior attempt can be retried (a plain re-run now recovers); - when `docker info` fails AND the daemon isn't active (vs. the group-not-active case, which is still re-exec'd via `sg docker`), surface Docker's OWN error (systemctl status + the journalctl error lines) with likely RHEL/AlmaLinux causes, instead of the misleading group hint. Test: setup-linux.bats daemon-won't-start case; bats 121 green. NOTE: this fixes the installer's HANDLING. Asad's root cause (why dockerd exits 1 on that box) is still masked by the systemd throttle and is being chased separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): load the kernel modules Docker/k3s need (fixes dockerd on minimal RHEL/AlmaLinux) Root cause of Asad's AlmaLinux 9 / EC2 failure: dockerd died on startup with "failed to register bridge driver: iptables ... addrtype ... missing kernel module". Minimal RHEL/AlmaLinux cloud images (incl. AWS) ship kernel-modules-core but NOT the full kernel-modules package, so xt_addrtype (+ br_netfilter, overlay) aren't available and Docker can't program its bridge NAT rules. New _ensure_kernel_modules() (setup-linux.sh), called before starting Docker: modprobe overlay / br_netfilter / xt_addrtype / iptable_nat / ip_tables; on RHEL-family, if a load fails, `dnf install kernel-modules-$(uname -r)` and retry; persist to /etc/modules-load.d for reboots. Best-effort + idempotent (verified clean on a healthy Ubuntu box). Also sharpened the daemon-won't-start hint to point at the kernel-modules remedy when the error mentions addrtype/missing module. This is the hospital-VM profile (minimal RHEL/Alma), so it's a real install-side fix, not just error handling. Test: setup-linux.bats _ensure_kernel_modules; bats 122. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): cross-distro prereq matrix + real-Windows Pester + static gate Adds .github/workflows/installer-tests.yaml to validate the installer across the breadth of environments customers actually run — not just Ubuntu-amd64: • static — shellcheck (clean at --severity=warning) + bash -n + PSScriptAnalyzer • unit-bash — bats (mocked), 124 tests • unit-pester — Pester on Linux pwsh AND real windows-latest (the .ps1's true target) • distro-prereqs — NEW: runs the REAL Linux prereq path (PM detect, system deps, Docker branch, kernel modules, kubectl/k3d/helm) in a fresh container per distro family: ubuntu 22.04/24.04, debian 12, almalinux 9/8, rockylinux 9, amazonlinux 2023, fedora, opensuse leap. The matrix paid for itself before it even shipped: validating it locally against real distro containers surfaced a genuine gap — minimal Amazon Linux 2023 ships no openssl/tar, so helm's get-helm-3 fails ("openssl must first be installed"). Fixed in install_system_deps (ensure openssl + tar; package names are uniform across apt/dnf/yum/zypper/pacman), with bats coverage. All 9 validated distro branches now install every prerequisite. Installer-test jobs moved out of helm-ci.yaml into their own workflow (no more duplicate runs; helm-ci no longer triggers on scripts/** changes). Arch omitted (x86-only image + bare-container keyring friction; pacman branch covered by bats). Real k3d cluster-up (e2e) intentionally deferred — needs a stubbed backend; tracked. Validated locally via mac Docker: ubuntu:22.04, almalinux:9, amazonlinux:2023, opensuse/leap:15.6 → all PASS; bats 124 green; shellcheck 0 findings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): fix static gate (.bats ≠ bash) + Windows-safe Confirm-Config test The new workflow's first run caught two issues — exactly its job: 1. Static analysis failed: `bash -n` ran on .bats files, which are bats DSL (@test "name" { … }), not valid bash. Restrict the syntax check to *.sh; .bats are validated by being run in the unit-bash job. 2. Pester on real windows-latest failed 1/55: the Confirm-Config test set $env:USERPROFILE = $env:HOME, but $env:HOME is empty on Windows, so [System.IO.Path]::GetFullPath("") threw "path is empty". The INSTALLER is correct (defaults to $env:USERPROFILE, always set on Windows) — the test fixture was Linux-centric. Derive a profile dir valid on both OSes. For the record, the first run's wins: all 9 distro prereq jobs (ubuntu 22.04/ 24.04, debian 12, almalinux 8/9, rockylinux 9, amazonlinux 2023, fedora, opensuse leap) + bats + Linux Pester passed on GHA's amd64 runners. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): gate ShellCheck at error severity (SC2034 cross-file false positives) The libs are sourced together as one program, so single-file shellcheck reports SC2034 "unused" for shared vars defined in common.sh and consumed in other sourced files (CURL_SECURE, ARCH_DL, colours…). Gate at --severity=error (0 findings); warnings still printed as advisory. With this, Static analysis joins the already-green distro matrix + Windows/Linux Pester + bats. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): real k3d cluster-up E2E on Ubuntu (amd64 + arm64) Adds scripts/tests/e2e-cluster.sh + an e2e-cluster matrix job — the highest- fidelity check CI can run. It drives the installer's OWN create_cluster() to bring up an actual k3d cluster on a real kernel (Docker is preinstalled on the runner), asserts every node reaches Ready, then proves the cluster can pull, schedule, and run a public workload (nginx:alpine), and tears down. It deliberately stops BEFORE the tracebloc helm install / backend registration (private images + real credentials), so it needs no secrets. Runs on ubuntu-22.04, ubuntu-24.04, and ubuntu-24.04-arm (arm64 runners are free on this public repo) — covering the real cluster path on both architectures. Validated locally on an arm64 Ubuntu VM: create_cluster() → server+agent Ready (k3s v1.29.4) → nginx pod Running → teardown. shellcheck clean (0 errors). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): fix E2E probe race on the default ServiceAccount The amd64 runners proved the cluster comes up fine (nodes Ready), but the probe pod failed with "serviceaccount default not found" — kubectl run raced the SA controller, which creates default/default asynchronously after the node goes Ready. arm64 dodged it by timing. Wait for the SA before running the pod. Pure test-harness fix; the installer cluster path is correct on all arches. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(installer): authenticated corporate-proxy E2E (squid) Adds scripts/tests/e2e-proxy.sh + an e2e-proxy job. Stands up a squid that REQUIRES basic auth, brings up a k3d cluster via the installer's create_cluster() with HTTP(S)_PROXY=http://user:pass@host.k3d.internal:3128, and proves the nodes pull a workload image THROUGH the authed proxy — the squid access log shows an authenticated CONNECT to auth.docker.io (which only a real image pull makes, never the readiness probe), closing the "proxy silently bypassed" false positive. It also asserts anonymous requests are refused, so auth is genuinely enforced. Guards the corporate-proxy hardening end-to-end (#172/#174, the Charité/hospital archetype): _write_k3d_proxy_config passes proxy env via a k3d config FILE so the '@' in user:pass@host survives (k3d splits --env on '@'), plus _augment_no_proxy. If the credentials regress, squid 407s and the pull hangs — the test fails loudly. Stops before the helm install / backend registration; no secrets. Validated locally on an arm64 Ubuntu VM: anonymous refused → cluster up via the authed proxy → nginx pulled through it (auth.docker.io + registry-1.docker.io CONNECTs by the proxy user) → teardown. shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(installer): install kernel-modules-extra + handle reboot-required (#176) dockerd crash-loops on minimal RHEL/AlmaLinux images because xt_addrtype/ iptable_nat/br_netfilter live in kernel-modules-extra, not the base kernel-modules package. The prior fix installed kernel-modules-$(uname -r) — the wrong package — so the self-heal never took. Install kernel-modules-extra (unversioned). When the repo's extra modules target a newer kernel than the running one (stale AMI), they can't load until reboot: detect that, set KMODS_REBOOT_REQUIRED, and have install_docker_engine print a clear reboot-and-re-run message instead of a raw Docker error. Modules persist via /etc/modules-load.d/tracebloc.conf. Verified end-to-end on a pristine AlmaLinux 10.1 MINIMAL EC2 box: reboot gate fires, post-reboot modules load, re-run reaches Connected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com> * chore(chart): bump version to 1.4.3 for installer-hardening release (#177) (#178) Installer-only patch release — promotes #171–#175 (RHEL-family support, credential/readiness verification, preflight gate, reboot persistence, --diagnose bundle) to production via develop→main. Chart templates/values are unchanged from 1.4.2. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
A one-command, redacted support bundle. When a customer hits an install/runtime problem, instead of a multi-round log-gathering email thread (the Charité thread is the archetype) they run one command and send us a single file.
bash <(curl … i.sh) --diagnose(orinstall-k8s.ps1 -Diagnose) writes~/.tracebloc/tracebloc-diagnose-<timestamp>.tgz. Installer-only — no chart changes.Related
Ref tracebloc/backend#722 · stacked on #171, #172, #173, #174
Type of change
What changed
scripts/lib/diagnose.sh(new) —run_diagnose(): host/versions, docker/k3d, kubectl overview +describeof non-Running pods, workload logs (namespace auto-discovered), helm, install log +values.yaml, proxy env → redact every file →tar._redact_file()does the redaction. Runs underset +e(best-effort) and short-circuits before any install work, so it works when the install is broken.scripts/install-k8s.sh— sources it + the--diagnosebranch at the top ofmain()(clears the EXIT trap).scripts/install.sh— addsdiagnose.shto the bootstrap download manifest (verified by the sourced-libs cross-check + a live bootstrap run).scripts/install-k8s.ps1—-Diagnose+Invoke-DiagnoseBundle+Edit-Redaction(Windows mirror;Compress-Archive).scripts/lib/common.sh— documents--diagnosein--help.Security — redaction (the crux; this file goes to support)
clientPassword, proxy credentials (user:pass@host), andpassword=/token/secretvalues are redacted from every file before archiving.clientIdis kept (the identifier support needs, not a secret). The bats/Pester redaction tests are the gate.Test plan
scripts/tests/diagnose.bats(7, incl. the end-to-end redaction gate) + PesterEdit-Redaction+Invoke-DiagnoseBundle(extract-and-grep) — green (bats 114, Pester 51/0/3-skip).--diagnoseon a Linux VM: produced a 16-file bundle; the seeded dev password + auth-proxy creds had 0 occurrences in the archive (clientId kept,[REDACTED]markers present); also runs cleanly with no cluster.bash <(curl … install.sh@branch) --diagnosefetcheddiagnose.sh, ran it, produced the bundle, 0 leaks.Invoke-DiagnoseBundlelive.Checklist
--diagnosein--help)