Skip to content

feat(installer): --diagnose support bundle (redacted)#175

Merged
saadqbal merged 13 commits into
developfrom
fix/diagnose-bundle
Jun 2, 2026
Merged

feat(installer): --diagnose support bundle (redacted)#175
saadqbal merged 13 commits into
developfrom
fix/diagnose-bundle

Conversation

@LukasWodka
Copy link
Copy Markdown
Contributor

🚫 STACKED ON #171 + #172 + #173 + #174 — DO NOT MERGE BEFORE THEM. Branched off fix/reboot-persistence (#174). Until the stack merges, this PR's range includes their commits — review only the latest commit / the 7 files under "What changed". Flips draft → ready as the stack lands.

Summary

A one-command, redacted support bundle. When a customer hits an install/runtime problem, instead of a multi-round log-gathering email thread (the Charité thread is the archetype) they run one command and send us a single file. bash <(curl … i.sh) --diagnose (or install-k8s.ps1 -Diagnose) writes ~/.tracebloc/tracebloc-diagnose-<timestamp>.tgz. Installer-only — no chart changes.

Related

Ref tracebloc/backend#722 · stacked on #171, #172, #173, #174

Type of change

  • Feature

What changed

  • scripts/lib/diagnose.sh (new) — run_diagnose(): host/versions, docker/k3d, kubectl overview + describe of non-Running pods, workload logs (namespace auto-discovered), helm, install log + values.yaml, proxy env → redact every file → tar. _redact_file() does the redaction. Runs under set +e (best-effort) and short-circuits before any install work, so it works when the install is broken.
  • scripts/install-k8s.sh — sources it + the --diagnose branch at the top of main() (clears the EXIT trap).
  • scripts/install.sh — adds diagnose.sh to the bootstrap download manifest (verified by the sourced-libs cross-check + a live bootstrap run).
  • scripts/install-k8s.ps1-Diagnose + Invoke-DiagnoseBundle + Edit-Redaction (Windows mirror; Compress-Archive).
  • scripts/lib/common.sh — documents --diagnose in --help.

Security — redaction (the crux; this file goes to support)

clientPassword, proxy credentials (user:pass@host), and password=/token/secret values are redacted from every file before archiving. clientId is kept (the identifier support needs, not a secret). The bats/Pester redaction tests are the gate.

Test plan

  • scripts/tests/diagnose.bats (7, incl. the end-to-end redaction gate) + Pester Edit-Redaction + Invoke-DiagnoseBundle (extract-and-grep) — green (bats 114, Pester 51/0/3-skip).
  • Real --diagnose on a Linux VM: produced a 16-file bundle; the seeded dev password + auth-proxy creds had 0 occurrences in the archive (clientId kept, [REDACTED] markers present); also runs cleanly with no cluster.
  • Real curl-bootstrap: bash <(curl … install.sh@branch) --diagnose fetched diagnose.sh, ran it, produced the bundle, 0 leaks.
  • ⚠️ PowerShell runtime not validated on Windows (logic Pester-covered) — Windows reviewer confirms Invoke-DiagnoseBundle live.

Checklist

  • Tests added / updated and passing locally
  • Docs updated (--diagnose in --help)
  • No secrets / credentials in the diff (tests use synthetic secrets; redaction verified)
  • Windows reviewer for the ps1 runtime

@LukasWodka
Copy link
Copy Markdown
Contributor Author

👋 Heads-up — Code review queue is at 16 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@LukasWodka
Copy link
Copy Markdown
Contributor Author

Pre-merge security review (covered the whole stack #171#175). Proxy temp-file handling, perms (0600 under umask 077 + cleaned up), injection, and the other PRs came back clean. Two redaction gaps in the support bundle were found and fixed in 4b70123:

  1. Redaction missed non-clientPassword keys — only clientPassword:/password= were matched, so dockerRegistry.password (registry token) and HTTP_PROXY_PASSWORD survived into the bundle. Broadened _redact_file (bash) + Edit-Redaction (ps1) to redact any *password key, case-insensitive, : or = form.
  2. helm get manifest shipped base64 Secrets (bash only) — rendered Secret objects carry base64 CLIENT_PASSWORD + .dockerconfigjson that text redaction can't see. Dropped the manifest collection.

Regression tests added; re-verified end-to-end (real --diagnose): clientPassword, registry token, HTTP_PROXY_PASSWORD, and proxy URL creds all have 0 occurrences in the archive. bats 119 / Pester 52 green.

@LukasWodka
Copy link
Copy Markdown
Contributor Author

Pre-merge correctness review (high effort, whole stack) — three real bugs fixed in 51dede6 (their failure paths weren't exercised by the passing tests):

  1. cluster.shk3d "${K3D_ARGS[@]}" was a bare command under set -e, so a k3d-create failure aborted the script before the 'already exists' reuse, the error dump, and the proxy temp-dir cleanup could run. Now && create_rc=0 || create_rc=$?. Proven under set -e.
  2. summary.shCLIENT_STATE was defaulted to starting at source time, so install_cleanup's [[ -z … ]] guard was always false → the 'did not complete / check the log / safe to re-run' hint never printed on early failures. Defaulted empty.
  3. preflight.sh — with curl absent (direct ./install-k8s.sh on a minimal VM, before deps install curl), connectivity probes returned nocurl and hard-failed with a misleading 'egress blocked'. Now skips with a warning.

Plus a defensive switch ("$env:CLIENT_ENV") in Get-BackendUrl (refuted as a live bug on pwsh 7, but cheap insurance for the unvalidated 5.1 path). Regression test added; bats 120 / Pester 52 green. Note: these touch files owned by #171/#172/#173 but land here on the tip — they ride along when the stack merges in order.

@saadqbal saadqbal changed the base branch from develop to fix/reboot-persistence June 2, 2026 10:16
@saadqbal saadqbal force-pushed the fix/reboot-persistence branch from e69b9e1 to 8953324 Compare June 2, 2026 10:32
@saadqbal saadqbal force-pushed the fix/diagnose-bundle branch from 9af4990 to 54bccd7 Compare June 2, 2026 10:32
@saadqbal saadqbal force-pushed the fix/reboot-persistence branch from 8953324 to ab4b555 Compare June 2, 2026 11:08
@saadqbal saadqbal force-pushed the fix/diagnose-bundle branch from 54bccd7 to decf881 Compare June 2, 2026 11:08
@saadqbal saadqbal force-pushed the fix/reboot-persistence branch from ab4b555 to cba4c35 Compare June 2, 2026 11:13
@saadqbal saadqbal force-pushed the fix/diagnose-bundle branch from decf881 to ba4fb8c Compare June 2, 2026 11:13
LukasWodka and others added 13 commits June 2, 2026 16:36
Adds a one-command support bundle so a customer hitting an install/runtime
problem can send a single file instead of a multi-round log-gathering email
thread (the Charité thread is the archetype). `bash <(curl ... i.sh) --diagnose`
(or `install-k8s.ps1 -Diagnose`) collects logs + cluster/host status into
~/.tracebloc/tracebloc-diagnose-<ts>.tgz.

Two guarantees:
- Best-effort: the whole collection runs under `set +e` and short-circuits
  before any install work, so it works even when the install is broken.
- Credential-safe: clientPassword, proxy credentials (user:pass@host), and
  password=/token/secret values are REDACTED from every file before archiving;
  clientId is kept (it's the identifier support needs, not a secret).

scripts/lib/diagnose.sh -- run_diagnose() (host/versions, docker/k3d, kubectl
overview + describe of non-Running pods, workload logs with namespace
auto-discovery, helm, install log + values.yaml, proxy env) + _redact_file().
install-k8s.sh sources it + adds the --diagnose short-circuit (clears the EXIT
trap so the post-install message doesn't fire); install.sh adds it to the
bootstrap download manifest. install-k8s.ps1 mirrors with -Diagnose /
Invoke-DiagnoseBundle / Edit-Redaction. Documented in --help.

Tests: diagnose.bats (7, incl. the end-to-end redaction gate) + Pester (+3).
Verified on a Linux VM: the real --diagnose flag produced a 16-file bundle and
the seeded dev password + proxy credentials had ZERO occurrences in the archive
(clientId kept); also works with no cluster present.

Stacked on #171 + #172 + #173 + #174.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eal preflight probes)

Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's
~96%) because the new code added integration-only branches the mocked unit
suites skipped. Recover the unit-testable portion:
- diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true +
  mocked tools) -> diagnose.sh 64% -> 90%.
- preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping,
  the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers
  (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%.

Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is
integration-only (real k3d/docker create + macOS/Windows-specific branches +
MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy,
preflight blocked-egress/arm64, diagnose redaction grep).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A pre-merge security review of the support bundle found two ways credentials
could land in the (supposedly redacted) archive the customer sends to support:

1. Redaction only matched `clientPassword:` and `password=` -- it missed any
   other *password key in colon form, so `dockerRegistry.password` (a registry
   token) and `HTTP_PROXY_PASSWORD` survived. Broadened _redact_file (bash) and
   Edit-Redaction (ps1) to redact ANY *password key, case-insensitive, in : or =
   form (portable explicit char classes -- BSD sed has no I flag).
2. (bash only) The bundle collected `helm get manifest`, which renders the k8s
   Secret objects with base64-encoded CLIENT_PASSWORD + .dockerconfigjson that
   text redaction can't see. Dropped the manifest collection (helm get values +
   kubectl output already cover triage).

Regression tests added (diagnose.bats + Pester). Re-verified end-to-end on a
Linux VM with the real --diagnose flag: clientPassword, the dockerRegistry
token, HTTP_PROXY_PASSWORD, and proxy URL creds all have ZERO occurrences in the
archive; clientId kept; manifest no longer collected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A pre-merge correctness review (high effort) over the stack found three real
bugs — the failure paths weren't exercised by the passing tests:

1. cluster.sh: `k3d "${K3D_ARGS[@]}" > ...` was a bare command under `set -e`,
   so a k3d-create FAILURE aborted the script immediately, skipping the
   'already exists' graceful reuse, the error dump, AND the proxy temp-dir
   cleanup. Capture rc set-e-safely (`&& create_rc=0 || create_rc=$?`).
   Proven under set -e: both the error-dump and reuse paths now run.
2. summary.sh: CLIENT_STATE was defaulted to "starting" at source time, so
   install_cleanup's `[[ -z "$CLIENT_STATE" ]]` guard was always false and the
   "did not complete / check the log / safe to re-run" hint never printed on an
   early failure (preflight / docker / cluster / helm). Default it empty; the
   readiness gate sets the real state.
3. preflight.sh: with curl absent (direct ./install-k8s.sh on a minimal VM,
   before install_system_deps adds curl), the connectivity probes returned
   'nocurl' and hard-failed with a misleading "egress blocked". Skip the check
   with a warning when curl isn't present yet.

Also defensively quote `switch ("$env:CLIENT_ENV")` in Get-BackendUrl so the
prod default fires regardless of PowerShell version (refuted as a live bug on
pwsh 7 -- default does fire -- but cheap insurance for the unvalidated 5.1 path).

Tests: +nocurl-skip test, fixed an over-blunt has() mock; bats 120 / Pester 52
green. (#1's set -e abort can't be caught by bats -- no set -e there -- so it
was verified manually under set -e.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nosing it as a group issue)

Asad's AlmaLinux 9 / EC2 test: docker-ce installed fine but `dockerd` crashed on
startup (exit 1), systemd throttled it ("Start request repeated too quickly"),
and the installer then printed "Could not connect to Docker -- try logging out
and back in" -- the GROUP-not-active hint, which is wrong for a dead daemon and
sent him in circles (logout/login didn't help). The throttle also means a bare
re-run can't recover.

install_docker_engine now:
- `systemctl enable docker` WITHOUT `--now` (a start failure no longer hard-aborts
  the script under `set -e` at that line);
- `systemctl reset-failed docker` before starting, so a throttled/failed unit from
  a prior attempt can be retried (a plain re-run now recovers);
- when `docker info` fails AND the daemon isn't active (vs. the group-not-active
  case, which is still re-exec'd via `sg docker`), surface Docker's OWN error
  (systemctl status + the journalctl error lines) with likely RHEL/AlmaLinux
  causes, instead of the misleading group hint.

Test: setup-linux.bats daemon-won't-start case; bats 121 green.

NOTE: this fixes the installer's HANDLING. Asad's root cause (why dockerd exits 1
on that box) is still masked by the systemd throttle and is being chased separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d on minimal RHEL/AlmaLinux)

Root cause of Asad's AlmaLinux 9 / EC2 failure: dockerd died on startup with
"failed to register bridge driver: iptables ... addrtype ... missing kernel
module". Minimal RHEL/AlmaLinux cloud images (incl. AWS) ship kernel-modules-core
but NOT the full kernel-modules package, so xt_addrtype (+ br_netfilter, overlay)
aren't available and Docker can't program its bridge NAT rules.

New _ensure_kernel_modules() (setup-linux.sh), called before starting Docker:
modprobe overlay / br_netfilter / xt_addrtype / iptable_nat / ip_tables; on
RHEL-family, if a load fails, `dnf install kernel-modules-$(uname -r)` and retry;
persist to /etc/modules-load.d for reboots. Best-effort + idempotent (verified
clean on a healthy Ubuntu box). Also sharpened the daemon-won't-start hint to
point at the kernel-modules remedy when the error mentions addrtype/missing module.

This is the hospital-VM profile (minimal RHEL/Alma), so it's a real install-side
fix, not just error handling. Test: setup-linux.bats _ensure_kernel_modules; bats 122.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tic gate

Adds .github/workflows/installer-tests.yaml to validate the installer across the
breadth of environments customers actually run — not just Ubuntu-amd64:

  • static         — shellcheck (clean at --severity=warning) + bash -n + PSScriptAnalyzer
  • unit-bash      — bats (mocked), 124 tests
  • unit-pester    — Pester on Linux pwsh AND real windows-latest (the .ps1's true target)
  • distro-prereqs — NEW: runs the REAL Linux prereq path (PM detect, system deps,
                     Docker branch, kernel modules, kubectl/k3d/helm) in a fresh
                     container per distro family: ubuntu 22.04/24.04, debian 12,
                     almalinux 9/8, rockylinux 9, amazonlinux 2023, fedora, opensuse leap.

The matrix paid for itself before it even shipped: validating it locally against
real distro containers surfaced a genuine gap — minimal Amazon Linux 2023 ships no
openssl/tar, so helm's get-helm-3 fails ("openssl must first be installed"). Fixed
in install_system_deps (ensure openssl + tar; package names are uniform across
apt/dnf/yum/zypper/pacman), with bats coverage. All 9 validated distro branches now
install every prerequisite.

Installer-test jobs moved out of helm-ci.yaml into their own workflow (no more
duplicate runs; helm-ci no longer triggers on scripts/** changes). Arch omitted
(x86-only image + bare-container keyring friction; pacman branch covered by bats).
Real k3d cluster-up (e2e) intentionally deferred — needs a stubbed backend; tracked.

Validated locally via mac Docker: ubuntu:22.04, almalinux:9, amazonlinux:2023,
opensuse/leap:15.6 → all PASS; bats 124 green; shellcheck 0 findings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Config test

The new workflow's first run caught two issues — exactly its job:

1. Static analysis failed: `bash -n` ran on .bats files, which are bats DSL
   (@test "name" { … }), not valid bash. Restrict the syntax check to *.sh;
   .bats are validated by being run in the unit-bash job.

2. Pester on real windows-latest failed 1/55: the Confirm-Config test set
   $env:USERPROFILE = $env:HOME, but $env:HOME is empty on Windows, so
   [System.IO.Path]::GetFullPath("") threw "path is empty". The INSTALLER is
   correct (defaults to $env:USERPROFILE, always set on Windows) — the test
   fixture was Linux-centric. Derive a profile dir valid on both OSes.

For the record, the first run's wins: all 9 distro prereq jobs (ubuntu 22.04/
24.04, debian 12, almalinux 8/9, rockylinux 9, amazonlinux 2023, fedora, opensuse
leap) + bats + Linux Pester passed on GHA's amd64 runners.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…alse positives)

The libs are sourced together as one program, so single-file shellcheck reports
SC2034 "unused" for shared vars defined in common.sh and consumed in other
sourced files (CURL_SECURE, ARCH_DL, colours…). Gate at --severity=error (0
findings); warnings still printed as advisory. With this, Static analysis joins
the already-green distro matrix + Windows/Linux Pester + bats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds scripts/tests/e2e-cluster.sh + an e2e-cluster matrix job — the highest-
fidelity check CI can run. It drives the installer's OWN create_cluster() to
bring up an actual k3d cluster on a real kernel (Docker is preinstalled on the
runner), asserts every node reaches Ready, then proves the cluster can pull,
schedule, and run a public workload (nginx:alpine), and tears down.

It deliberately stops BEFORE the tracebloc helm install / backend registration
(private images + real credentials), so it needs no secrets. Runs on
ubuntu-22.04, ubuntu-24.04, and ubuntu-24.04-arm (arm64 runners are free on this
public repo) — covering the real cluster path on both architectures.

Validated locally on an arm64 Ubuntu VM: create_cluster() → server+agent Ready
(k3s v1.29.4) → nginx pod Running → teardown. shellcheck clean (0 errors).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The amd64 runners proved the cluster comes up fine (nodes Ready), but the probe
pod failed with "serviceaccount default not found" — kubectl run raced the SA
controller, which creates default/default asynchronously after the node goes
Ready. arm64 dodged it by timing. Wait for the SA before running the pod. Pure
test-harness fix; the installer cluster path is correct on all arches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds scripts/tests/e2e-proxy.sh + an e2e-proxy job. Stands up a squid that
REQUIRES basic auth, brings up a k3d cluster via the installer's create_cluster()
with HTTP(S)_PROXY=http://user:pass@host.k3d.internal:3128, and proves the nodes
pull a workload image THROUGH the authed proxy — the squid access log shows an
authenticated CONNECT to auth.docker.io (which only a real image pull makes, never
the readiness probe), closing the "proxy silently bypassed" false positive. It
also asserts anonymous requests are refused, so auth is genuinely enforced.

Guards the corporate-proxy hardening end-to-end (#172/#174, the Charité/hospital
archetype): _write_k3d_proxy_config passes proxy env via a k3d config FILE so the
'@' in user:pass@host survives (k3d splits --env on '@'), plus _augment_no_proxy.
If the credentials regress, squid 407s and the pull hangs — the test fails loudly.
Stops before the helm install / backend registration; no secrets.

Validated locally on an arm64 Ubuntu VM: anonymous refused → cluster up via the
authed proxy → nginx pulled through it (auth.docker.io + registry-1.docker.io
CONNECTs by the proxy user) → teardown. shellcheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#176)

dockerd crash-loops on minimal RHEL/AlmaLinux images because xt_addrtype/
iptable_nat/br_netfilter live in kernel-modules-extra, not the base
kernel-modules package. The prior fix installed kernel-modules-$(uname -r)
— the wrong package — so the self-heal never took.

Install kernel-modules-extra (unversioned). When the repo's extra modules
target a newer kernel than the running one (stale AMI), they can't load
until reboot: detect that, set KMODS_REBOOT_REQUIRED, and have
install_docker_engine print a clear reboot-and-re-run message instead of a
raw Docker error. Modules persist via /etc/modules-load.d/tracebloc.conf.

Verified end-to-end on a pristine AlmaLinux 10.1 MINIMAL EC2 box: reboot
gate fires, post-reboot modules load, re-run reaches Connected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@saadqbal saadqbal force-pushed the fix/diagnose-bundle branch from ba4fb8c to d97e20d Compare June 2, 2026 11:36
@saadqbal saadqbal changed the base branch from fix/reboot-persistence to develop June 2, 2026 11:36
@saadqbal saadqbal marked this pull request as ready for review June 2, 2026 11:37
@saadqbal saadqbal closed this Jun 2, 2026
@saadqbal saadqbal reopened this Jun 2, 2026
@LukasWodka
Copy link
Copy Markdown
Contributor Author

👋 Heads-up — Code review queue is at 24 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@LukasWodka LukasWodka requested a review from saadqbal June 2, 2026 11:42
@saadqbal saadqbal merged commit da9f45d into develop Jun 2, 2026
33 checks passed
@saadqbal saadqbal deleted the fix/diagnose-bundle branch June 2, 2026 11:50
saadqbal added a commit that referenced this pull request Jun 2, 2026
…177) (#178)

Installer-only patch release — promotes #171#175 (RHEL-family support,
credential/readiness verification, preflight gate, reboot persistence,
--diagnose bundle) to production via develop→main. Chart templates/values
are unchanged from 1.4.2.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
saadqbal added a commit that referenced this pull request Jun 2, 2026
* Sync main → develop after v1.4.2 release (#170)

* Installer: verify readiness & credentials before reporting success; RHEL-family support (#171)

* fix(installer): run on RHEL-family Linux (#718, #719, #720)

Docker: install docker-ce from the official Docker CentOS dnf repo on AlmaLinux/Rocky/Oracle, which get.docker.com rejects as unsupported. k3d: preserve PATH through sudo so the post-install lookup survives RHEL secure_path (which omits /usr/local/bin). conntrack: use the conntrack apt package on Debian/Ubuntu and conntrack-tools elsewhere.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): verify credentials and readiness before reporting success (#716, #717)

Credentials entered at the prompt are validated against the backend api-token-auth endpoint (the same call jobs-manager makes) with a re-prompt loop, so a wrong Client ID or password is caught immediately instead of after a full deploy. After helm apply, wait_for_client_ready polls rollout status and classifies the outcome; print_summary reports connected, starting, bad_creds, image_pull or crash, and prints the data-never-leaves message only when the client is verifiably connected. Exit code now reflects the real outcome.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): mirror credential and readiness checks on Windows (#716, #717)

Test-Credentials, Wait-ForClientReady and Get-NotReadyState mirror the bash logic in install-k8s.ps1, and Print-Summary is now state-branched. Validated with the PowerShell 7.4 parser; runtime behavior still needs a check on a Windows host.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(installer): bats + Pester unit suites, wired into CI

scripts/tests/: 64 bats tests (summary, install-client-helm, setup-linux, common) + 32 Pester tests for install-k8s.ps1 — all mocked, no Docker/k3d/network needed. Changed-line coverage measured with kcov (bash 96.2%) and Pester (PowerShell 97.4%); residual lines are the real RHEL Docker-install commands + the guarded main() orchestration, exercised by the integration E2E. A TB_PESTER guard lets the suite dot-source install-k8s.ps1 without running the installer. New installer-tests job in helm-ci.yaml runs both suites on PRs (scripts/ added to path filters).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): harden corporate-proxy support (auth proxy, NO_PROXY, 0.0.0.0 detect, Windows parity) (#172)

A customer running behind an authenticated corporate HTTP proxy hit install
failures. The 0.0.0.0 kubeconfig headline was fixed for the bash path in
#166/#167, but adverse testing (a forward proxy + real k3d on Linux VMs)
surfaced three remaining gaps plus a Windows parity hole:

- Gap A: authenticated proxies (http://user:pass@host) were silently SKIPPED —
  k3d's --env KEY=VALUE@FILTER can't carry an '@' in the value. Now propagated
  via a k3d --config file (structured YAML env) so credentials survive intact.
  Verified on k3d v5.8.3 (it merges the --config env with the existing CLI flags).
- Gap B: NO_PROXY was propagated verbatim. Now auto-augmented with the
  cluster-internal ranges (loopback + RFC1918 + .svc/.cluster.local +
  host.k3d.internal), both into the cluster and host-side, so in-cluster traffic
  never routes through the proxy — fixes the misroute AND the observed
  `k3d cluster create --wait` hang.
- Gap C: a cluster created outside the installer and bound to 0.0.0.0 is now
  detected (serverlb HostIp) and flagged with a non-destructive recreate remedy.
- Windows parity: install-k8s.ps1::New-K3dCluster had NONE of the bash fixes —
  it still bound --api-port 0.0.0.0:6550 (the original headline bug, still live
  on Windows), normalized only host.docker.internal in the kubeconfig, and
  propagated zero proxy env. Now mirrors bash: 127.0.0.1:6550, a 0.0.0.0->127.0.0.1
  kubeconfig rewrite, and Get-EffectiveNoProxy + Write-K3dProxyConfig (auth +
  augmented NO_PROXY, written UTF-8 without a BOM).

Tests: new scripts/tests/cluster.bats (15) + Pester for the two ps1 helpers (6),
both green. Verified end-to-end on Linux VMs: auth creds propagated into the node,
no startup hang behind an unreachable proxy, and 0.0.0.0 detection firing.

Stacked on #171 (the installer test scaffolding + final install-k8s.ps1 live there).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): preflight gate (arch, egress, disk, RAM, CPU) — fail fast, clearly (#173)

Most install failures fall into two shapes: the environment can't support what
the installer does and it fails CRYPTICALLY minutes in, or it claims success it
hasn't earned. #171/#172 fixed specific cases; this attacks the first pattern
systematically with a preflight gate that runs before any install/cluster work
and fails in seconds with a precise, actionable reason.

scripts/lib/preflight.sh — run_preflight() runs at the top of Step 1/4:
- Architecture (arm64 guard): the tracebloc client images (e.g. mysql-client)
  are amd64-only. amd64 -> ok; arm64 on Docker Desktop -> info (emulated); arm64
  Linux without QEMU binfmt -> hard fail with the `tonistiigi/binfmt --install
  amd64` remedy. Override: TRACEBLOC_ALLOW_ARM64=1. (This is exactly the
  `exec format error` we hit on arm64.)
- Egress connectivity: probes the endpoints the install needs and reports which
  are blocked. Hard-fail on the criticals (registry-1.docker.io, ghcr.io, the
  CLIENT_ENV backend, tracebloc.github.io); warn-only on tool-download hosts
  when the tool isn't already installed. A TLS/cert error emits a
  break-and-inspect proxy hint. The probe honors HTTP_PROXY.
- Disk / RAM / CPU: hard-fail on critically low disk; warn on low disk/RAM/CPU.
- Aggregates results (runs ALL checks, then exits once with a summary). Escape
  hatch: TRACEBLOC_SKIP_PREFLIGHT=1.

install-k8s.sh sources + calls run_preflight; install.sh adds preflight.sh to the
curl-bootstrap download manifest (verified: every sourced lib is downloaded).

install-k8s.ps1: Test-Preflight + Get-Pf* helpers mirror the bash logic for
Windows (Get-CimInstance disk/RAM/CPU, Invoke-WebRequest connectivity).

docs/INSTALL.md: a "Network requirements (egress allowlist)" section the preflight
error points users to.

Tests: scripts/tests/preflight.bats (21) + Pester Describes (Test-PfUrl,
Test-Preflight; the Get-Cim* readers are Windows-only so they skip off-Windows).
Verified end-to-end on a Linux VM: healthy run passes; a blocked ghcr.io fails in
~1s naming the host; arm64 without binfmt fails with the remedy.

Stacked on #171 + #172.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): guarantee reboot persistence + make it visible (#174)

A real VM reboot test showed the k3d cluster already auto-recovers on Linux
(k3d sets --restart unless-stopped on its nodes; the Docker install enables
docker.service on boot). So no systemd unit is needed — this GUARANTEES that
behavior even on edge setups, and tells the user about it:

- cluster.sh: new ensure_cluster_autostart() (called from create_cluster) --
  `docker update --restart unless-stopped` on the k3d nodes (covers
  externally-created clusters / a future k3d default change) + `systemctl enable
  docker` on Linux (covers the installed-but-disabled re-run case the
  fresh-install path misses). Opt out with TRACEBLOC_NO_AUTOSTART=1.
- summary.sh: _reboot_note() in the connected summary -- Linux: "Survives
  reboot"; macOS/Windows: enable Docker Desktop start-on-login.
- install-k8s.ps1: Set-ClusterAutostart (defensive unless-stopped on the nodes)
  + the Docker Desktop reboot note in Print-Summary.

Tests: cluster.bats (+4), summary.bats (+3), Pester (+2) -- all green.
Verified by a real reboot on a Linux VM: with the restart policy stripped
(policy=no), ensure_cluster_autostart restored it to unless-stopped + enabled
docker; after `limactl stop/start`, the cluster, node, AND a deployed workload
all returned Running with no intervention.

Stacked on #171 + #172 + #173.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* feat(installer): --diagnose support bundle (redacted) (#175)

* feat(installer): --diagnose support bundle (redacted)

Adds a one-command support bundle so a customer hitting an install/runtime
problem can send a single file instead of a multi-round log-gathering email
thread (the Charité thread is the archetype). `bash <(curl ... i.sh) --diagnose`
(or `install-k8s.ps1 -Diagnose`) collects logs + cluster/host status into
~/.tracebloc/tracebloc-diagnose-<ts>.tgz.

Two guarantees:
- Best-effort: the whole collection runs under `set +e` and short-circuits
  before any install work, so it works even when the install is broken.
- Credential-safe: clientPassword, proxy credentials (user:pass@host), and
  password=/token/secret values are REDACTED from every file before archiving;
  clientId is kept (it's the identifier support needs, not a secret).

scripts/lib/diagnose.sh -- run_diagnose() (host/versions, docker/k3d, kubectl
overview + describe of non-Running pods, workload logs with namespace
auto-discovery, helm, install log + values.yaml, proxy env) + _redact_file().
install-k8s.sh sources it + adds the --diagnose short-circuit (clears the EXIT
trap so the post-install message doesn't fire); install.sh adds it to the
bootstrap download manifest. install-k8s.ps1 mirrors with -Diagnose /
Invoke-DiagnoseBundle / Edit-Redaction. Documented in --help.

Tests: diagnose.bats (7, incl. the end-to-end redaction gate) + Pester (+3).
Verified on a Linux VM: the real --diagnose flag produced a 16-file bundle and
the seeded dev password + proxy credentials had ZERO occurrences in the archive
(clientId kept); also works with no cluster present.

Stacked on #171 + #172 + #173 + #174.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(installer): raise changed-line coverage (diagnose collection + real preflight probes)

Measured changed-line coverage of the stack had dropped (bash ~84% vs #171's
~96%) because the new code added integration-only branches the mocked unit
suites skipped. Recover the unit-testable portion:
- diagnose.bats: exercise the kubectl/docker/helm collection path (has()=true +
  mocked tools) -> diagnose.sh 64% -> 90%.
- preflight.bats: test the REAL _pf_probe_url curl-exit-code -> token mapping,
  the missing-curl path, and the _pf_ncpu/_pf_total_mem_kb/_pf_free_kb readers
  (re-sourced past the setup stubs) -> preflight.sh 79% -> 90%.

Bash changed-line coverage: 83.6% -> 92.3% (kcov, 383/415). The residual ~8% is
integration-only (real k3d/docker create + macOS/Windows-specific branches +
MAIN orchestration), validated by the live VM E2Es (reboot recovery, auth-proxy,
preflight blocked-egress/arm64, diagnose redaction grep).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): close two redaction gaps in --diagnose (security review)

A pre-merge security review of the support bundle found two ways credentials
could land in the (supposedly redacted) archive the customer sends to support:

1. Redaction only matched `clientPassword:` and `password=` -- it missed any
   other *password key in colon form, so `dockerRegistry.password` (a registry
   token) and `HTTP_PROXY_PASSWORD` survived. Broadened _redact_file (bash) and
   Edit-Redaction (ps1) to redact ANY *password key, case-insensitive, in : or =
   form (portable explicit char classes -- BSD sed has no I flag).
2. (bash only) The bundle collected `helm get manifest`, which renders the k8s
   Secret objects with base64-encoded CLIENT_PASSWORD + .dockerconfigjson that
   text redaction can't see. Dropped the manifest collection (helm get values +
   kubectl output already cover triage).

Regression tests added (diagnose.bats + Pester). Re-verified end-to-end on a
Linux VM with the real --diagnose flag: clientPassword, the dockerRegistry
token, HTTP_PROXY_PASSWORD, and proxy URL creds all have ZERO occurrences in the
archive; clientId kept; manifest no longer collected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): correctness fixes from code review

A pre-merge correctness review (high effort) over the stack found three real
bugs — the failure paths weren't exercised by the passing tests:

1. cluster.sh: `k3d "${K3D_ARGS[@]}" > ...` was a bare command under `set -e`,
   so a k3d-create FAILURE aborted the script immediately, skipping the
   'already exists' graceful reuse, the error dump, AND the proxy temp-dir
   cleanup. Capture rc set-e-safely (`&& create_rc=0 || create_rc=$?`).
   Proven under set -e: both the error-dump and reuse paths now run.
2. summary.sh: CLIENT_STATE was defaulted to "starting" at source time, so
   install_cleanup's `[[ -z "$CLIENT_STATE" ]]` guard was always false and the
   "did not complete / check the log / safe to re-run" hint never printed on an
   early failure (preflight / docker / cluster / helm). Default it empty; the
   readiness gate sets the real state.
3. preflight.sh: with curl absent (direct ./install-k8s.sh on a minimal VM,
   before install_system_deps adds curl), the connectivity probes returned
   'nocurl' and hard-failed with a misleading "egress blocked". Skip the check
   with a warning when curl isn't present yet.

Also defensively quote `switch ("$env:CLIENT_ENV")` in Get-BackendUrl so the
prod default fires regardless of PowerShell version (refuted as a live bug on
pwsh 7 -- default does fire -- but cheap insurance for the unvalidated 5.1 path).

Tests: +nocurl-skip test, fixed an over-blunt has() mock; bats 120 / Pester 52
green. (#1's set -e abort can't be caught by bats -- no set -e there -- so it
was verified manually under set -e.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): handle a Docker daemon that won't start (stop misdiagnosing it as a group issue)

Asad's AlmaLinux 9 / EC2 test: docker-ce installed fine but `dockerd` crashed on
startup (exit 1), systemd throttled it ("Start request repeated too quickly"),
and the installer then printed "Could not connect to Docker -- try logging out
and back in" -- the GROUP-not-active hint, which is wrong for a dead daemon and
sent him in circles (logout/login didn't help). The throttle also means a bare
re-run can't recover.

install_docker_engine now:
- `systemctl enable docker` WITHOUT `--now` (a start failure no longer hard-aborts
  the script under `set -e` at that line);
- `systemctl reset-failed docker` before starting, so a throttled/failed unit from
  a prior attempt can be retried (a plain re-run now recovers);
- when `docker info` fails AND the daemon isn't active (vs. the group-not-active
  case, which is still re-exec'd via `sg docker`), surface Docker's OWN error
  (systemctl status + the journalctl error lines) with likely RHEL/AlmaLinux
  causes, instead of the misleading group hint.

Test: setup-linux.bats daemon-won't-start case; bats 121 green.

NOTE: this fixes the installer's HANDLING. Asad's root cause (why dockerd exits 1
on that box) is still masked by the systemd throttle and is being chased separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): load the kernel modules Docker/k3s need (fixes dockerd on minimal RHEL/AlmaLinux)

Root cause of Asad's AlmaLinux 9 / EC2 failure: dockerd died on startup with
"failed to register bridge driver: iptables ... addrtype ... missing kernel
module". Minimal RHEL/AlmaLinux cloud images (incl. AWS) ship kernel-modules-core
but NOT the full kernel-modules package, so xt_addrtype (+ br_netfilter, overlay)
aren't available and Docker can't program its bridge NAT rules.

New _ensure_kernel_modules() (setup-linux.sh), called before starting Docker:
modprobe overlay / br_netfilter / xt_addrtype / iptable_nat / ip_tables; on
RHEL-family, if a load fails, `dnf install kernel-modules-$(uname -r)` and retry;
persist to /etc/modules-load.d for reboots. Best-effort + idempotent (verified
clean on a healthy Ubuntu box). Also sharpened the daemon-won't-start hint to
point at the kernel-modules remedy when the error mentions addrtype/missing module.

This is the hospital-VM profile (minimal RHEL/Alma), so it's a real install-side
fix, not just error handling. Test: setup-linux.bats _ensure_kernel_modules; bats 122.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(installer): cross-distro prereq matrix + real-Windows Pester + static gate

Adds .github/workflows/installer-tests.yaml to validate the installer across the
breadth of environments customers actually run — not just Ubuntu-amd64:

  • static         — shellcheck (clean at --severity=warning) + bash -n + PSScriptAnalyzer
  • unit-bash      — bats (mocked), 124 tests
  • unit-pester    — Pester on Linux pwsh AND real windows-latest (the .ps1's true target)
  • distro-prereqs — NEW: runs the REAL Linux prereq path (PM detect, system deps,
                     Docker branch, kernel modules, kubectl/k3d/helm) in a fresh
                     container per distro family: ubuntu 22.04/24.04, debian 12,
                     almalinux 9/8, rockylinux 9, amazonlinux 2023, fedora, opensuse leap.

The matrix paid for itself before it even shipped: validating it locally against
real distro containers surfaced a genuine gap — minimal Amazon Linux 2023 ships no
openssl/tar, so helm's get-helm-3 fails ("openssl must first be installed"). Fixed
in install_system_deps (ensure openssl + tar; package names are uniform across
apt/dnf/yum/zypper/pacman), with bats coverage. All 9 validated distro branches now
install every prerequisite.

Installer-test jobs moved out of helm-ci.yaml into their own workflow (no more
duplicate runs; helm-ci no longer triggers on scripts/** changes). Arch omitted
(x86-only image + bare-container keyring friction; pacman branch covered by bats).
Real k3d cluster-up (e2e) intentionally deferred — needs a stubbed backend; tracked.

Validated locally via mac Docker: ubuntu:22.04, almalinux:9, amazonlinux:2023,
opensuse/leap:15.6 → all PASS; bats 124 green; shellcheck 0 findings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(installer): fix static gate (.bats ≠ bash) + Windows-safe Confirm-Config test

The new workflow's first run caught two issues — exactly its job:

1. Static analysis failed: `bash -n` ran on .bats files, which are bats DSL
   (@test "name" { … }), not valid bash. Restrict the syntax check to *.sh;
   .bats are validated by being run in the unit-bash job.

2. Pester on real windows-latest failed 1/55: the Confirm-Config test set
   $env:USERPROFILE = $env:HOME, but $env:HOME is empty on Windows, so
   [System.IO.Path]::GetFullPath("") threw "path is empty". The INSTALLER is
   correct (defaults to $env:USERPROFILE, always set on Windows) — the test
   fixture was Linux-centric. Derive a profile dir valid on both OSes.

For the record, the first run's wins: all 9 distro prereq jobs (ubuntu 22.04/
24.04, debian 12, almalinux 8/9, rockylinux 9, amazonlinux 2023, fedora, opensuse
leap) + bats + Linux Pester passed on GHA's amd64 runners.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(installer): gate ShellCheck at error severity (SC2034 cross-file false positives)

The libs are sourced together as one program, so single-file shellcheck reports
SC2034 "unused" for shared vars defined in common.sh and consumed in other
sourced files (CURL_SECURE, ARCH_DL, colours…). Gate at --severity=error (0
findings); warnings still printed as advisory. With this, Static analysis joins
the already-green distro matrix + Windows/Linux Pester + bats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(installer): real k3d cluster-up E2E on Ubuntu (amd64 + arm64)

Adds scripts/tests/e2e-cluster.sh + an e2e-cluster matrix job — the highest-
fidelity check CI can run. It drives the installer's OWN create_cluster() to
bring up an actual k3d cluster on a real kernel (Docker is preinstalled on the
runner), asserts every node reaches Ready, then proves the cluster can pull,
schedule, and run a public workload (nginx:alpine), and tears down.

It deliberately stops BEFORE the tracebloc helm install / backend registration
(private images + real credentials), so it needs no secrets. Runs on
ubuntu-22.04, ubuntu-24.04, and ubuntu-24.04-arm (arm64 runners are free on this
public repo) — covering the real cluster path on both architectures.

Validated locally on an arm64 Ubuntu VM: create_cluster() → server+agent Ready
(k3s v1.29.4) → nginx pod Running → teardown. shellcheck clean (0 errors).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(installer): fix E2E probe race on the default ServiceAccount

The amd64 runners proved the cluster comes up fine (nodes Ready), but the probe
pod failed with "serviceaccount default not found" — kubectl run raced the SA
controller, which creates default/default asynchronously after the node goes
Ready. arm64 dodged it by timing. Wait for the SA before running the pod. Pure
test-harness fix; the installer cluster path is correct on all arches.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(installer): authenticated corporate-proxy E2E (squid)

Adds scripts/tests/e2e-proxy.sh + an e2e-proxy job. Stands up a squid that
REQUIRES basic auth, brings up a k3d cluster via the installer's create_cluster()
with HTTP(S)_PROXY=http://user:pass@host.k3d.internal:3128, and proves the nodes
pull a workload image THROUGH the authed proxy — the squid access log shows an
authenticated CONNECT to auth.docker.io (which only a real image pull makes, never
the readiness probe), closing the "proxy silently bypassed" false positive. It
also asserts anonymous requests are refused, so auth is genuinely enforced.

Guards the corporate-proxy hardening end-to-end (#172/#174, the Charité/hospital
archetype): _write_k3d_proxy_config passes proxy env via a k3d config FILE so the
'@' in user:pass@host survives (k3d splits --env on '@'), plus _augment_no_proxy.
If the credentials regress, squid 407s and the pull hangs — the test fails loudly.
Stops before the helm install / backend registration; no secrets.

Validated locally on an arm64 Ubuntu VM: anonymous refused → cluster up via the
authed proxy → nginx pulled through it (auth.docker.io + registry-1.docker.io
CONNECTs by the proxy user) → teardown. shellcheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(installer): install kernel-modules-extra + handle reboot-required (#176)

dockerd crash-loops on minimal RHEL/AlmaLinux images because xt_addrtype/
iptable_nat/br_netfilter live in kernel-modules-extra, not the base
kernel-modules package. The prior fix installed kernel-modules-$(uname -r)
— the wrong package — so the self-heal never took.

Install kernel-modules-extra (unversioned). When the repo's extra modules
target a newer kernel than the running one (stale AMI), they can't load
until reboot: detect that, set KMODS_REBOOT_REQUIRED, and have
install_docker_engine print a clear reboot-and-re-run message instead of a
raw Docker error. Modules persist via /etc/modules-load.d/tracebloc.conf.

Verified end-to-end on a pristine AlmaLinux 10.1 MINIMAL EC2 box: reboot
gate fires, post-reboot modules load, re-run reaches Connected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>

* chore(chart): bump version to 1.4.3 for installer-hardening release (#177) (#178)

Installer-only patch release — promotes #171#175 (RHEL-family support,
credential/readiness verification, preflight gate, reboot persistence,
--diagnose bundle) to production via develop→main. Chart templates/values
are unchanged from 1.4.2.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants