Sync main → develop after v1.4.1 release#163

Merged
saadqbal merged 33 commits into develop from sync/main-to-develop-post-v1.4.1
May 26, 2026
saadqbal commented May 26, 2026

Note

Medium Risk
Changes operational behavior of the image-refresh CronJob (deployment rollouts and registry tag resolution), which affects which ingestor digest jobs-manager uses over time.

Overview
Hardens the image-refresh ingestor path so a “fully in sync” tick no longer exits silently when a prior kubectl set env left the deployment stuck (spec/annotation/registry agree but rollout never finished). The no-op branch now runs kubectl rollout status, surfacing failures as a failed CronJob.

Aligns the default ingestor floating tag from prod to 0.3 in values.yaml, the CronJob INGESTOR_TAG env (including | default "0.3" for --reuse-values upgrades missing the key), and helm tests so ghcr.io polling matches the team’s semver float tags instead of 404’ing every tick.

Reviewed by Cursor Bugbot for commit 80d2f37.

saadqbal and others added 30 commits April 24, 2026 21:18
* Add NetworkPolicy locking down training-pod egress

Training pods run untrusted ML code uploaded by external data scientists.
This policy selects on the tracebloc.io/workload=training label (injected
by jobs-manager in the companion client-runtime PR) and:

  - Denies all ingress (nothing should connect TO a training pod).
  - Allows DNS to the cluster DNS service.
  - Allows external TCP/443 only; blocks pod-to-pod and ClusterIP
    traffic inside the cluster via ipBlock with cluster-CIDR exclusions.

Training pods can still reach tracebloc backend, Azure Service Bus, and
App Insights (external HTTPS). They can no longer reach mysql-client,
the K8s API server, the jobs-manager pod IP, or other training pods.

Per-platform defaults:
  AKS:  enabled=true  (requires Azure NPM or Calico at cluster create)
  EKS:  enabled=false (AWS VPC CNI does not enforce NetworkPolicy; safer
                       to explicitly disable than silently have no effect)
  BM:   enabled=true  (requires Calico / Cilium / kube-router)
  OC:   enabled=true  (OVN-Kubernetes enforces by default; custom DNS
                       selector and OpenShift pod/service CIDRs)

The dnsSelector default is empty with a template-side fallback to
{k8s-app: kube-dns} to avoid Helm's map-merge semantics surprising
customers who override it (OpenShift's selector would otherwise be
unioned with the default rather than replacing it).

- templates/network-policy-training.yaml: new policy (gated on
  networkPolicy.training.enabled)
- values.yaml + values.schema.json: new networkPolicy.training block
- ci/{aks,eks,bm,oc}-values.yaml: per-platform overrides with notes
- tests/network_policy_test.yaml: 8 helm-unittest cases covering
  rendering, ingress denial, DNS allow, external HTTPS allow, cluster
  CIDR blocking, and the OpenShift selector override

No effect until the companion client-runtime PR lands, which adds the
tracebloc.io/workload=training label to spawned training pods.
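
A minimal sketch of the policy shape described above, written as a Python dict rather than the chart's actual template. The policy name, the example CIDR, the empty namespaceSelector on the DNS rule, and the DNS ports are assumptions for illustration; only the label, the kube-dns fallback, and the TCP/443 ipBlock rule come from the description.

```python
def training_network_policy(cluster_cidrs, dns_selector=None):
    """Build a NetworkPolicy manifest (as a dict) for training pods."""
    # Fallback described above: an empty selector falls back to kube-dns.
    dns_selector = dns_selector or {"k8s-app": "kube-dns"}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "training-egress"},  # hypothetical name
        "spec": {
            "podSelector": {
                "matchLabels": {"tracebloc.io/workload": "training"}
            },
            "policyTypes": ["Ingress", "Egress"],
            # Ingress is a listed policy type with no rules: all ingress denied.
            "ingress": [],
            "egress": [
                {
                    # DNS rule. Assumption: any namespace, UDP and TCP port 53.
                    "to": [{
                        "namespaceSelector": {},
                        "podSelector": {"matchLabels": dns_selector},
                    }],
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                },
                {
                    # External HTTPS only: everything except the cluster CIDRs.
                    "to": [{
                        "ipBlock": {
                            "cidr": "0.0.0.0/0",
                            "except": list(cluster_cidrs),
                        }
                    }],
                    "ports": [{"protocol": "TCP", "port": 443}],
                },
            ],
        },
    }


# Hypothetical cluster CIDR, for illustration only.
policy = training_network_policy(["10.0.0.0/8"])
```

Passing a custom dns_selector replaces the fallback outright, which mirrors the reason the values.yaml default is left empty.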

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add optional Namespace resource with Pod Security Admission labels (#43)

* Add optional Namespace resource with Pod Security Admission labels

Layers Kubernetes Pod Security Admission on top of the per-pod
securityContext work for defense-in-depth. Off by default -- enabling
requires a greenfield install, since the chart does not currently own
the release namespace on existing deployments.

When namespace.create is true, the chart templates a Namespace with:

    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    helm.sh/resource-policy: keep

Warn + audit surface any pod-spec violation as a kubectl warning and
an audit-log event, without rejecting the pod. This gives us a
tripwire for future regressions in our own pod specs (jobs-manager,
mysql, resource-monitor, training pods) and for any third-party pods
in the same namespace.

Enforce mode is deliberately left UNSET. Two of our own workloads
would be rejected under enforce: restricted:

  - mysql init containers run as UID 0 (needed to chown the PVC
    before the main container -- UID 999 -- starts)
  - resource-monitor DaemonSet mounts hostPath /proc and /sys

Enabling enforce before those are refactored (or moved to a separate
namespace) would break the chart. Customers who want full enforcement
can set namespace.podSecurity.enforce = restricted after auditing
their own deployment; the current defaults keep them safe.

helm.sh/resource-policy: keep prevents helm uninstall from deleting
the Namespace, which would otherwise take the PVC-backed training
data and MySQL state with it.
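
As a rough illustration of the behaviour described above (not the chart's template), the Namespace could be modelled like this. The function name and parameters are hypothetical; the label keys, the keep annotation, and the rule that enforce is omitted when empty come from the description.

```python
def namespace_manifest(name, warn="restricted", audit="restricted", enforce=""):
    """Build a Namespace manifest (as a dict) with Pod Security Admission labels."""
    labels = {
        "pod-security.kubernetes.io/warn": warn,
        "pod-security.kubernetes.io/audit": audit,
    }
    # Enforce is only stamped when explicitly set; an empty value omits the label.
    if enforce:
        labels["pod-security.kubernetes.io/enforce"] = enforce
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": labels,
            # Tells Helm to leave the Namespace in place on uninstall.
            "annotations": {"helm.sh/resource-policy": "keep"},
        },
    }


default_ns = namespace_manifest("tracebloc-templates")
enforced_ns = namespace_manifest("tracebloc-templates", enforce="restricted")
```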

- templates/namespace.yaml: new, gated on namespace.create (default false)
- values.yaml: new namespace block with long comments
- values.schema.json: schema entries for namespace.create + podSecurity
- tests/namespace_test.yaml: 8 helm-unittest cases (toggle off, toggle
  on, keep annotation, labels, version strings, enforce omitted when
  empty, enforce present when set, baseline override, namespace name
  respects release)
- docs/INSTALL.md: section explaining the greenfield vs existing-ns
  paths with copy-pasteable kubectl label commands

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix kubeVersion constraint to accept cloud pre-release suffixes

Helm's semver parser excludes pre-release versions from >= ranges by
default, so ">=1.24.0" rejected EKS ("1.34.4-eks-f69f56f"), GKE
("-gke-*"), and AKS release-tagged versions. Changing to ">=1.24.0-0"
explicitly opts the constraint into matching pre-releases, which is
how managed-Kubernetes providers encode their vendor suffix.
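
The pre-release rule can be illustrated with a simplified Python model. This is a sketch of the matching behaviour described above, not Helm's semver library; the function names are hypothetical, and pre-release ordering for equal core versions is reduced to a plain string comparison.

```python
import re

_SEMVER = re.compile(r"v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?")


def parse(version):
    """Split 'X.Y.Z[-pre]' into ((X, Y, Z), pre_or_None)."""
    m = _SEMVER.fullmatch(version)
    if not m:
        raise ValueError(f"not a semver string: {version!r}")
    core = tuple(int(part) for part in m.groups()[:3])
    return core, m.group(4)


def satisfies_gte(version, floor):
    """Model of a '>=floor' constraint with the pre-release opt-in rule."""
    v_core, v_pre = parse(version)
    f_core, f_pre = parse(floor)
    # A floor without a pre-release part never matches a pre-release version.
    if v_pre is not None and f_pre is None:
        return False
    if v_core != f_core:
        return v_core > f_core
    # Equal cores: a release outranks any pre-release of the same core.
    if v_pre is None:
        return True
    # Simplification: real semver compares dot-separated identifiers.
    return v_pre >= f_pre


eks = "1.34.4-eks-f69f56f"
rejected = satisfies_gte(eks, "1.24.0")    # vendor suffix counts as pre-release
accepted = satisfies_gte(eks, "1.24.0-0")  # "-0" opts in to pre-releases
```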

Surfaced while dry-run-installing PR #43 against a dev EKS cluster.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>

* Add consolidated SECURITY.md covering the training-pod sandbox (#44)

Brings together the threat model, defense layers, per-platform
caveats, operator responsibilities, residual risks, and verification
steps into one reviewable artifact. Covers the complete hardening
posture as shipped across the chart + jobs-manager + new-arch
training images.

Sections:

  1.  Threat model: trusted platform, untrusted external-data-
      scientist submissions. Explicit in-scope / out-of-scope.
  2.  Seven design goals (G1-G7) for the training-pod sandbox,
      each mapped to current status on new-arch vs. legacy.
  3.  Architecture overview.
  4.  Defense layers -- credential isolation, network egress,
      K8s API access, container runtime hardening, storage
      isolation, cross-tenant forgeability, admission tripwire.
  5.  Per-platform caveats -- NetworkPolicy CNI matrix (AKS/EKS/
      bare-metal/OpenShift), PSA version requirements, OpenShift
      DNS selector override, runAsUser + arbitrary UIDs, bare-
      metal hostPath note.
  6.  What operators must do themselves -- rotate secrets, verify
      CNI enforces, label existing namespaces, monitor audit,
      upgrade ordering, refactor path for enforce: restricted.
  7.  Verification -- copy-pasteable kubectl snippets for each
      defense layer.
  8.  Residual risks with explicit ownership -- global SB conn
      strings (backend), HTTPS egress (platform endgame), token
      TTL (backend), legacy arch (migration team), PSA enforce
      (chart refactor), CNI silent no-op (operator), kernel
      escape (out of scope), resource DoS (out of scope).
  9.  Compromise response playbook.
  10. Where each defense is implemented (code-path map for
      reviewers).
  11. Document history.

Also:

- README.md: add Security subsection under Deployment Guide
  linking to docs/SECURITY.md.
- docs/INSTALL.md: prerequisite note about CNI enforcement.

No code changes; documentation only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add docs/MIGRATIONS.md and CLAUDE.md for Helm chart migration safety (#47)

Document the helm.sh/resource-policy=keep gotcha: Helm reads the
annotation from the stored release manifest, not live resources, so
kubectl annotate alone does not protect PVCs from helm uninstall.
Includes the 2026-04-22 tracebloc-templates migration as a case study
and three mitigation options (helm upgrade, strip ownership, or rely
on PV Retain + recreate).

* docs(client): add pre-Helm resource-monitor cleanup step to MIGRATION.md (#49)

Early-era edges were installed with a hand-rolled `resource-monitor`
DaemonSet via raw `kubectl apply` before the per-platform charts existed.
The unified chart's `tracebloc-resource-monitor` DaemonSet replaces it,
but the legacy DS is unmanaged and keeps running after migration, mounting
hostPath /proc + /sys and blocking PSA `enforce=restricted` on the namespace.

Adds a step-6 section documenting the kubectl cleanup (DS + SA + ClusterRole
+ ClusterRoleBinding, all named `resource-monitor`) with a safety check to
confirm the ClusterRole/Binding aren't shared before deletion.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)

* feat(mysql): drop root init-containers, add PSA-restricted securityContext

Unblocks pod-security.kubernetes.io/enforce: restricted on the release
namespace. Previously the mysql-client pod had two init-containers
running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999
before mysqld started. PSA restricted rejects runAsUser: 0 on any
container, so these init-containers were the last blocker to promoting
the namespace from warn/audit to enforce.

The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch`
at the pod level, which kubelet uses to chgrp mounted volumes on first
mount. Once that is in place the init-container chowns are redundant:

- On existing PVCs (already owned 999:999 from the prior init-container
  chown) OnRootMismatch sees the correct root ownership and skips the
  recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume
  creation.

Also adds a container-level securityContext with all six fields PSA
restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's
  default user, and the entrypoint skips its root-to-mysql gosu re-exec
  when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault

Scope: client chart only (now the universal chart covering eks/aks/bm/oc).

Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType
  (EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some
  object-backed drivers do not; chart docs should flag this in a
  follow-up.

Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts
  for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).

* fix(mysql): restore chown init-container for hostPath (bare-metal)

kubelet does not apply fsGroup ownership to hostPath volumes
(kubernetes/kubernetes#138411), so bare-metal installs need a
privileged bootstrap to chown /var/lib/mysql to 999:999 on first
start. Gated on .Values.hostPath.enabled so CSI-backed deployments
(EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Move tracebloc-resource-monitor to dedicated privileged namespace (#50)

* Move tracebloc-resource-monitor to dedicated privileged namespace

Pod Security Admission's `restricted` profile bans hostPath volumes
outright, and the resource-monitor DaemonSet needs hostPath /proc and
/sys to read node-level metrics. Previously, setting
`pod-security.kubernetes.io/enforce: restricted` on the release
namespace (tracebloc-templates) would reject the DaemonSet outright,
and `warn=restricted` + `audit=restricted` already spam violations.

This isolates the DaemonSet in a new dedicated namespace
(tracebloc-node-agents, configurable via `nodeAgents.namespace.name`)
that carries `pod-security.kubernetes.io/{enforce,warn,audit}:
privileged` labels. The release namespace is no longer constrained by
the node-agent and can run `enforce: restricted` once the mysql init
refactor lands.

Changes:
- templates/node-agents-namespace.yaml: new, gated on
  nodeAgents.namespace.create (default true) and resourceMonitor
- templates/resource-monitor-daemonset.yaml: deploy into node-agents ns
- templates/resource-monitor-rbac.yaml: SA + (Cluster)RoleBinding in
  node-agents ns
- templates/resource-monitor-scc.yaml: SCC users + CRB subject updated
  (OpenShift path)
- values.yaml + values.schema.json: new `nodeAgents.namespace` block
- templates/namespace.yaml + docs/INSTALL.md: drop resource-monitor
  from the enforce-blocker list; document the new node-agents ns
- tests/node_agents_namespace_test.yaml: 12 new unittest cases

Upgrade impact: existing installs will see the DaemonSet / SA /
(Cluster)RoleBinding deleted from the release namespace and recreated
in the node-agents namespace during `helm upgrade`. Brief (~seconds)
gap in node metrics during rollout; no persistent data involved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Mirror secrets into node-agents ns; keep namespace RBAC in release ns

Two follow-ups from review of the namespace-split change:

1. Secrets are namespace-scoped — a pod in `tracebloc-node-agents`
   cannot `secretKeyRef` a Secret that only exists in the release
   namespace. The resource-monitor DaemonSet was referencing CLIENT_ID /
   CLIENT_PASSWORD from `tracebloc.secretName` and the registry pull
   secret, both of which template only into `.Release.Namespace`, so
   pods would have failed to start with CreateContainerConfigError.

   templates/secrets.yaml and templates/docker-registry-secret.yaml now
   template a second copy into `nodeAgents.namespace.name` when:
     resourceMonitor != false  AND  node-agents ns != release ns

   The mirror is skipped when the two namespaces collide (e.g. operator
   points nodeAgents.namespace.name back at the release namespace) so
   Helm does not try to create two resources with the same name.

2. When clusterScope: false, the Role must live in the RELEASE
   namespace because that is where the monitored workloads run — a
   namespace-scoped Role only grants access to its own namespace.
   Previously this PR put the Role in `tracebloc-node-agents`, which
   would have silently broken the resource-monitor for anyone not
   using ClusterRole. Role + RoleBinding are now back in
   `.Release.Namespace`; the RoleBinding subject still points at the
   ServiceAccount in the node-agents namespace (cross-namespace
   subjects in RoleBindings are valid).
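
The mirroring condition from point 1 can be read as a small predicate. This is an illustrative Python sketch with a hypothetical function name, not the Helm template logic itself.

```python
def should_mirror_secrets(resource_monitor, node_agents_ns, release_ns):
    """Mirror secrets only when the monitor is on and the namespaces differ."""
    # resourceMonitor != false  AND  node-agents ns != release ns
    return resource_monitor is not False and node_agents_ns != release_ns


mirror_on = should_mirror_secrets(True, "tracebloc-node-agents", "tracebloc-templates")
mirror_off_disabled = should_mirror_secrets(False, "tracebloc-node-agents", "tracebloc-templates")
mirror_off_collision = should_mirror_secrets(True, "tracebloc-templates", "tracebloc-templates")
```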

Tests updated accordingly; 5 new cases cover mirror-on, mirror-off
(resourceMonitor=false), mirror-off (namespaces collide), dockercfg
mirror, and the corrected Role/RoleBinding placement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(resource-monitor): pin NAMESPACE env to release ns; guard node-agents ns==release ns

Two review fixes from the PSA hardening change:

1. NAMESPACE env var was using Downward API fieldPath: metadata.namespace,
   which now resolves to the node-agents namespace (where the DaemonSet
   pods live) instead of the release namespace (where the monitored
   workloads live). Replace with the literal Release.Namespace so the
   monitor continues to watch the right namespace regardless of where
   its own pods run.

2. node-agents-namespace.yaml would stamp privileged PSA labels onto the
   release namespace if an operator set nodeAgents.namespace.name to the
   release namespace (and with namespace.create=true it would render two
   Namespace docs with the same name — a render-time collision). Add an
   equality guard so the template is a no-op in that configuration.

Adds one test covering the NAMESPACE env fix; tests: 74/74 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(mysql): set readOnlyRootFilesystem on mysql-client (#52)

Completes container runtime hardening (G4) for mysql-client. Adds three
emptyDir mounts for the paths mysqld writes to at runtime that are NOT
already on PVC or log volumes:

- /var/run/mysqld       pid file + unix socket
- /tmp                  temp tables, sort buffers, LOAD DATA staging
- /var/lib/mysql-files  default secure_file_priv dir (touched at start)

Verified via helm upgrade on EKS (tb-client-dev-templates /
tracebloc-templates): pod Ready, readOnlyRootFilesystem=true, `touch /etc/x`
rejected as Read-only, mysqld.sock + mysqld.pid present under /var/run/mysqld,
existing DB data intact in /var/lib/mysql.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(psa): enforce=restricted by default on CSI; bare-metal overrides (#51)

- values.yaml: namespace.podSecurity.enforce flipped to "restricted".
- ci/bm-values.yaml: overrides enforce to "" because kubelet does not
  apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411),
  forcing the chart to render a privileged init-mysql-data chown
  container that PSA restricted would reject. warn+audit remain on.
- namespace.yaml docstring + SECURITY.md (§4.7, §6.3, §6.6, §8.5)
  updated to document the CSI-default / bare-metal-override split.

Verified with helm template --set namespace.create=true against both
eks-values.yaml (enforce rendered) and bm-values.yaml (enforce absent).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(installer): slim k3d and add dev overrides for local testing (#54)

The tracebloc client is outbound-only: jobs-manager and pods-monitor
dial out to the platform, and the only in-cluster Service is mysql-client
(ClusterIP). The bundled k3s ingress/LB stack and metrics-server are
unused overhead, and the chart ships its own StorageClass.

Drop the loadbalancer port mappings (HTTP_PORT/HTTPS_PORT) plus their
validation/help/log references, and pass --k3s-arg "--disable=..." for
traefik, servicelb, metrics-server, and local-storage to k3d cluster
create. Applied symmetrically in scripts/install-k8s.ps1.

Also add two env vars for local-chart testing in install-client-helm.sh:

  TRACEBLOC_CHART_PATH    install from a local chart path instead of the
                          published tracebloc/client Helm repo (skips
                          helm repo add/update)
  TRACEBLOC_VALUES_FILE   use the caller-supplied values file as-is and
                          skip the clientId/password prompts + values.yaml
                          generation

With both set, the installer can exercise the full flow end-to-end
against unreleased chart changes before publishing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(client): harden image pinning and credentials (v1.0.4) (#53)

Address the High-severity findings from the client chart security review:

- Add digest support to tracebloc.image helper and images.* values for
  jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
  set, the image is rendered as repo@sha256:... and imagePullPolicy drops
  to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
  defaults to "prod". The schema rejects "latest" outright; operators
  wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
  does not apply fsGroup to hostPath volumes, k8s#138411), but now with
  drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
  readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
  The defaults are now empty strings; the schema and template both reject
  empty values and <...> placeholder patterns so deployments fail fast
  instead of silently encoding a placeholder into the Secret.
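
A hedged sketch of the digest-versus-tag rendering described in the first two bullets, in Python rather than the chart's `tracebloc.image` helper. The function name and the repository string are hypothetical, and the "Always" pull policy for tag-based images is an assumption: the description only says the policy drops to IfNotPresent when a digest is set.

```python
def render_image(repository, tag="", digest=""):
    """Return (image_reference, image_pull_policy) for a container image."""
    if digest:
        if not digest.startswith("sha256:"):
            raise ValueError("digest must look like sha256:<hex>")
        # Immutable pin: the tag is ignored and the image is never re-pulled.
        return f"{repository}@{digest}", "IfNotPresent"
    if not tag:
        raise ValueError("either a tag or a digest is required")
    if tag == "latest":
        raise ValueError('tag "latest" is rejected; use a pinned tag or a digest')
    # Assumption: floating tags are re-pulled on every pod start.
    return f"{repository}:{tag}", "Always"


# Hypothetical repository name, for illustration only.
by_tag = render_image("example/mysql-client", tag="prod")
by_digest = render_image("example/mysql-client", tag="prod", digest="sha256:abc123")
```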

Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* feat(client): require metrics-server for resource-monitor (v1.0.5) (#55)

The tracebloc-resource-monitor DaemonSet queries the metrics.k8s.io API
for node CPU/memory. Without metrics-server registered, the DaemonSet
crash-loops with 404s against /apis/metrics.k8s.io/v1beta1 — silently,
every few seconds. Found during a bare-metal smoke test on a k3d cluster
where metrics-server had been explicitly disabled.

- scripts/lib/cluster.sh: drop --disable=metrics-server from the k3d
  create args. k3s bundles metrics-server; the earlier comment claiming
  the chart "ships its own" was wrong — the DaemonSet is a consumer of
  metrics-server, not a replacement.
- client/templates/resource-monitor-daemonset.yaml: add a pre-install
  `lookup` that fails the release up front when resourceMonitor is true
  but v1beta1.metrics.k8s.io is not registered. Guarded by a kube-system
  probe so offline `helm template` still renders.
- client/values.yaml: document the dependency inline on resourceMonitor,
  with per-platform install notes (k3d/AKS bundled; EKS/OC/bare-metal
  need manual install).
- docs/SECURITY.md: call out the dependency and the escape hatch
  (resourceMonitor: false) in the architecture section.
- Chart.yaml: 1.0.4 -> 1.0.5.

Verified on a fresh k3d cluster (no --disable=metrics-server): metrics
API comes up in ~30s, smoke install succeeds, resource-monitor reaches
Running with zero ERROR/404 lines. Pre-flight fail path also verified
against a metrics-less cluster.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(mysql): drop chmod from hostPath init (v1.0.6) (#56)

The init-container runs as UID 0 with capabilities drop:[ALL] add:[CHOWN].
After 'chown 999:999' transfers ownership, the subsequent 'chmod 755' runs
as a non-owner without CAP_FOWNER and returns EPERM on re-install where
the hostPath dir already exists from a prior run. Reversing the order
does not help (chmod first still fails once the dir is 999-owned from
any previous successful run).

kubelet creates hostPath dirs at 0755 via DirectoryOrCreate, so the chmod
was a no-op on fresh installs and broken on re-installs. Drop it.

Verified on k3d/AWS VM:
- fresh install: kubelet-created root:root dir -> chown succeeds -> 999:999
- re-install: pre-existing 999:999 dir with data -> chown no-op -> data intact

* Chore/merge main into develop (#58)

* Update README.md

* Add narrow CODEOWNERS for security-sensitive paths

* Remove the metrics-server disable argument from k3d cluster creation in install-k8s.ps1. The resource-monitor DaemonSet relies on the metrics API, so this brings the PowerShell installer in line with the earlier change that made metrics-server a requirement.

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* Merge pull request #60 from tracebloc/fix/resource-monitor-digest-pinning

fix(client): pin resource-monitor by digest (v1.0.7)

* chore: add auto-add to engineer kanban workflow (#45)

* Add auto-add to engineer kanban workflow

* fix(ci): pin actions/add-to-project to v1.0.2

@v1 is not a valid tag — action publishes full semver only. Pin to v1.0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8) (#61)

* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)

When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.

Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
  helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
  defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure

Credit: bug bot finding.

* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade

Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.

- Read the digest via nested `default (dict)` so a missing `images` map
  AND a missing `resourceMonitor` entry both fall through to "" safely.
  `dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
  `images: null` and asserts the DaemonSet still renders with the tag
  fallback.
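
The nested-default lookup has a direct Python analogue. This sketch uses a hypothetical function name and plain dicts in place of Helm values; it only illustrates why both a missing `images` map and a missing `resourceMonitor` entry fall through to an empty string.

```python
def resource_monitor_digest(values):
    """Read images.resourceMonitor.digest, tolerating missing or null levels."""
    # Each level falls back to an empty dict, like a nested `default (dict)`.
    images = values.get("images") or {}
    resource_monitor = images.get("resourceMonitor") or {}
    return resource_monitor.get("digest") or ""


# Stored values from a release installed before the block existed.
old_release = {"images": None}
new_release = {"images": {"resourceMonitor": {"digest": "sha256:abc123"}}}
```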

Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.

* fix(client): scope clusterCidrs minItems guard to enabled=true only

Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.

Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.

Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).
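
The scoped guard amounts to the following check, sketched here in Python with a hypothetical function name. It stands in for both the schema if/then and the template-side fail guard: an empty list is only an error when the policy is enabled.

```python
def check_training_values(training):
    """Validate the networkPolicy.training block; return True if a policy renders."""
    enabled = bool(training.get("enabled", False))
    cidrs = training.get("clusterCidrs") or []
    # Only an enabled policy needs CIDRs: an empty list would leave the
    # 0.0.0.0/0 ipBlock with no exceptions.
    if enabled and len(cidrs) < 1:
        raise ValueError(
            "networkPolicy.training.clusterCidrs must list at least one CIDR "
            "when networkPolicy.training.enabled is true"
        )
    return enabled


# Disabled with an empty list is a legitimate minimal config.
renders = check_training_values({"enabled": False, "clusterCidrs": []})
```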

---------

Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>

* Merge pull request #62 from tracebloc/fix/release-workflow-lint

Enhance CI workflows and fix MySQL resource management issues

* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b

docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight

* chore: add default CODEOWNERS for auto-reviewer assignment (#73)

* ci: add kanban closure-routing caller workflow (#75)

* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)

Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The
second `helm install` failed with:

  Error: ServiceAccount "tracebloc-resource-monitor" in namespace
  "tracebloc-node-agents" exists and cannot be imported into the current
  release: invalid ownership metadata; ... must equal "hasan-prod":
  current value is "stg".

Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.

What changed:

- New helper `tracebloc.resourceMonitorName` -> `<Release.Name>-resource-monitor`,
  centralised in _helpers.tpl alongside the existing per-release name
  helpers (secretName, serviceAccountName, etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label
  app=, and spec.template.spec.serviceAccountName all now go through
  the helper. The selector + pod label have to move together because
  DaemonSet selectors are namespace-scoped: two DaemonSets in
  tracebloc-node-agents both selecting `app: tracebloc-resource-monitor`
  would each grab the other's pods, which is worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through
  the helper. ClusterRole / ClusterRoleBinding / Role / RoleBinding
  metadata.name were already release-scoped (`tracebloc-resource-monitor-<release>`)
  and stay as-is to avoid an unnecessary ClusterRole rename for upgrading
  installs. Only the *subject* names in (Cluster)RoleBinding change to
  point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in tracebloc-node-agents:
  the secret names were already release-scoped via
  tracebloc.secretName / tracebloc.registrySecretName so they did not
  collide. Their `app` label was the literal value, which is harmless on
  uniquely-named resources but inconsistent — updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
  resources is a behaviour change for existing installs (DaemonSet name,
  ServiceAccount name, and selector label all change), so a minor bump
  signals that operators should review.

Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
  non-colliding names
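
A minimal Python sketch of the naming scheme, assuming only the `<Release.Name>-resource-monitor` pattern stated above. The function names and the dictionary keys are hypothetical labels for the four fields that move together.

```python
def resource_monitor_name(release_name):
    """Release-scoped name, following <Release.Name>-resource-monitor."""
    return f"{release_name}-resource-monitor"


def daemonset_identity(release_name):
    """Fields that must change together so selectors never overlap."""
    name = resource_monitor_name(release_name)
    return {
        "metadata.name": name,
        "selector.matchLabels.app": name,
        "podLabel.app": name,
        "serviceAccountName": name,
    }


stg = daemonset_identity("stg")
prod = daemonset_identity("hasan-prod")
# No value is shared between the two releases.
collisions = set(stg.values()) & set(prod.values())
```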

Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.

Upgrade path from 1.1.0:

The DaemonSet and ServiceAccount rename triggers a Helm three-way merge
that DELETEs the old `tracebloc-resource-monitor` resource and CREATEs
the new release-scoped one. ~30-60s gap on each node where resource
metrics are not collected. DaemonSet selector is immutable, so the
delete-then-create path is what we want — helm upgrade handles this
automatically because the names diverge in the stored manifest. No
manual orphan cleanup needed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)

The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306 (core/utils/database.py::load_dataframe_from_sql_table),
so under any CNI that actually enforces NetworkPolicy the connect failed
with errno 111 and the Job CrashLoopBackOff'd before the first batch:

  Database connection failed: 2003 (HY000): Can't connect to MySQL server
  on 'mysql-client:3306' (111)
  RuntimeError: Database connection is not available for load_dataframe_from_sql_table

Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.

Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].
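
For illustration, the added rule has roughly this shape when written as a Python dict. This is a sketch based on the description above, not a copy of the chart's template.

```python
# Third egress rule: same-namespace mysql-client pods on TCP/3306 only.
mysql_egress_rule = {
    "to": [
        {
            # podSelector without a namespaceSelector matches pods in the
            # policy's own namespace only.
            "podSelector": {"matchLabels": {"app": "mysql-client"}}
        }
    ],
    "ports": [{"protocol": "TCP", "port": 3306}],
}


def is_same_namespace_only(rule):
    """True when no peer in the rule carries a namespaceSelector."""
    return all("namespaceSelector" not in peer for peer in rule["to"])


scoped = is_same_namespace_only(mysql_egress_rule)
```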

Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.
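
A probe along the following lines can reproduce that check. This is a sketch under assumptions: `probe_mysql` is a hypothetical helper name, the pod name is arbitrary, and the busybox image is assumed to ship an `nc` that understands `-z` and `-w`. Only the training label, the `mysql-client` host, and port 3306 come from the commit above.

```shell
# Hypothetical helper: start a throwaway pod carrying the training label
# and attempt a TCP connect to mysql-client:3306 from inside it.
# $1 = kube context, $2 = namespace of the client release.
probe_mysql() {
  kubectl --context "$1" -n "$2" run "np-probe-$$" \
    --image=busybox --restart=Never --rm -i \
    --labels=tracebloc.io/workload=training \
    -- nc -z -w 5 mysql-client 3306
}
```

Running it before and after the chart upgrade (for example `probe_mysql my-context tracebloc`) should show the connect being refused and then succeeding, provided the CNI enforces NetworkPolicy.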

* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)

* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x

Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants
(bmw, cisco, charite) and any future tenant on the legacy chart family.

What's here:

- README.md walks the workflow + recommended ordering for the pending
  set + skip rationale for chart toggles (resourceMonitor: false,
  priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per
  tenant, /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml.
  Refuses to expand placeholder __FOO__ rows so an operator running
  generate.sh against the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is
  non-destructive (mysqldump-then-chunked-cp, AWS Backup on-demand
  recovery point, dry-run render). `phase2` is one-shot per tenant
  (helm uninstall, claimRef clear, SC re-create, PVC pre-create with
  release-scoped Helm ownership stamp, helm install, verify mysql data
  + keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
  secret-bearing artifact and must stay local.

No real secrets in any committed file:

- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at
  runtime, no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env
  (only the .example variant is tracked)

Operational notes:

- Every kubectl/helm call passes --context explicitly. The 2026-04-27
  prod run hit a context-drift bug mid-migration; the explicit form
  is a hard requirement.
- values.yaml ships with resourceMonitor: false. Flip true after the
  release-scoped resource-monitor names land in client-1.2.0 (separate
  PR). Until then the shared SA in tracebloc-node-agents collides with
  the stg release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and
  one-shot per tenant. Operators should pause and eyeball Phase 1
  outputs before running Phase 2 — that's deliberately not automated.
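
One way to make the explicit-context rule hard to violate is to route every call through thin wrappers. The sketch below is illustrative only: the function names `k` and `h` and the `KUBE_CONTEXT` variable are assumptions, not names taken from migrate-tenant.sh.

```shell
# Hypothetical wrappers: k, h, and KUBE_CONTEXT are illustrative names.
# Each call aborts if KUBE_CONTEXT is unset, instead of silently falling
# back to whatever context is current in the kubeconfig.

k() {
  kubectl --context "${KUBE_CONTEXT:?KUBE_CONTEXT must be set}" "$@"
}

h() {
  helm --kube-context "${KUBE_CONTEXT:?KUBE_CONTEXT must be set}" "$@"
}
```

With these in place, a call such as `k get pods -n tracebloc` always carries the context chosen at the top of the run, so a mid-migration `kubectl config use-context` in another terminal cannot redirect it.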

Once all pending tenants are on client-1.x, this directory is
historical. client-1.x -> client-1.y upgrades follow plain `helm upgrade`
because the new chart already templates `helm.sh/resource-policy: keep`
on PVCs, so the migration protocol isn't needed for routine upgrades.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(migration-tools): address bugbot review feedback on PR #74

Three issues flagged by Cursor Bugbot on the migration scripts:

* migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for chunked-cp
  verification (HIGH). Linux operators would abort Phase 1 mid-transfer.
  Add portable `_md5` and `_size` helpers that pick md5sum on Linux,
  fall back to md5(1) on macOS, and use `wc -c` instead of stat for size.

* generate.sh placeholder gate inspected only CLIENT_ID + CLIENT_PASSWORD
  + PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and DOCKER_PASSWORD
  (MEDIUM). Literal `__FOO__` placeholders silently rendered into
  values.yaml/pvcs.yaml and only blew up at kubectl apply / helm install
  time. Iterate over every per-row field, plus a one-shot global check
  for DOCKER_PASSWORD before the loop. Error messages now name the
  offending field.

* Phase 2.5 readiness loop was an unbounded `while :; do … sleep 5; done`
  (MEDIUM). After the destructive helm uninstall, a non-converging
  install (image-pull error, mysql kill-loop recurrence, missing PVC
  binding) hung the script forever instead of surfacing the failure.
  Add a wall-clock deadline — default 600s, override via READY_TIMEOUT —
  and exit 1 with the last-seen pod state on timeout.
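
The portable helpers described in the first item could look roughly like this. It is a minimal sketch, assuming only that `md5sum` exists on Linux and `md5` on macOS; the real `_md5` and `_size` in migrate-tenant.sh may be written differently.

```shell
# Print the MD5 hex digest of a file. Prefer md5sum (GNU coreutils) and
# fall back to md5(1), which is what macOS ships.
_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'
  else
    md5 -q "$1"
  fi
}

# Print the size of a file in bytes. `wc -c` is POSIX, unlike the
# platform-specific stat flags; BSD wc pads its output, so strip spaces.
_size() {
  wc -c < "$1" | tr -d '[:space:]'
}
```

Comparing `_md5` and `_size` on the local file and on the copy inside the pod then works the same way on either operating system.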

* fix(migration-tools): address bugbot follow-up on PR #74

Two more issues raised on the previous fix commit:

* Readiness wait loop aborted on empty pod list (HIGH). With `set -euo
  pipefail`, the routine post-install window where no pods are visible
  yet caused `grep -c .` to exit 1, killing the script on the very first
  iteration before the wall-clock deadline could ever fire — defeating
  the bounded-wait intent. Guard the empty case explicitly. `wc -l`
  alone is also wrong because `echo ""` prints a newline.

* MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD,
  CLIENT_*, and PV_* now have (LOW). An operator who copied the example
  without editing this row passed the non-empty gate, then the literal
  __LEGACY_MYSQL_ROOT_PW__ went into mysqldump and Phase 1 blew up
  partway through with an opaque "Access denied" inside kubectl exec.
  Add the same `*__*__*` case guard right after the non-empty check.
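
The guards from these two fix commits can be sketched as three small functions. The function names and the `POLL_INTERVAL` variable are assumptions made for illustration; only the `*__*__*` pattern, the `grep -c .` pitfall, and the `READY_TIMEOUT` default of 600s come from the messages above.

```shell
# Succeeds when a value still looks like an unedited __FOO__ placeholder.
is_placeholder() {
  case "$1" in
    *__*__*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Count non-empty lines in a string without tripping `set -e`/`pipefail`.
# `grep -c .` exits 1 when it counts zero lines, so the empty case is
# handled up front and a zero count from grep is tolerated.
count_lines() {
  if [ -z "$1" ]; then
    echo 0
  else
    printf '%s\n' "$1" | grep -c . || true
  fi
}

# Poll a command until it succeeds or a wall-clock deadline passes.
# READY_TIMEOUT is in seconds; POLL_INTERVAL is a hypothetical knob.
wait_until() {
  local deadline=$(( $(date +%s) + ${READY_TIMEOUT:-600} ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${READY_TIMEOUT:-600}s waiting for: $*" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL:-5}"
  done
}
```

A readiness check would then be passed to `wait_until` as a command, so a non-converging install surfaces as a non-zero exit rather than a hung script.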

* fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74)

The pre-source assignment

    EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}"

was a no-op: `source "$CONFIG"` runs immediately after and the example
config (and any real tenant-config.env derived from it) unconditionally
sets EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every
time. Operators thinking they were targeting a non-default EFS would
silently start AWS Backup on-demand jobs against the hard-coded prod
filesystem.

Move the override knob to AFTER source where env genuinely wins, drop
the hard-coded fallback, and require EFS_FS to be set somewhere (config
or override) before continuing.
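
The corrected ordering can be illustrated as follows. This is a sketch, not the script's actual code: `load_config` is a hypothetical function name, while `EFS_FS` and `EFS_FS_OVERRIDE` follow the description above.

```shell
# Hypothetical helper showing the ordering: source the config first, then
# let the environment override win, then require a value from either place.
load_config() {
  local config="$1"
  . "$config"
  # Applied AFTER sourcing, so an exported EFS_FS_OVERRIDE is not clobbered.
  EFS_FS="${EFS_FS_OVERRIDE:-${EFS_FS:-}}"
  # No hard-coded fallback: refuse to continue without an explicit value.
  if [ -z "$EFS_FS" ]; then
    echo "EFS_FS must be set in $config or via EFS_FS_OVERRIDE" >&2
    return 1
  fi
}
```

The key point is the order of the two assignments; with the override applied before `source`, the config file's unconditional `EFS_FS=` line always had the last word.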

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(client): release-scope SCC SA refs (v1.2.2) (#78)

Bugbot caught a High-severity miss in v1.2.0's release-scoping work
(PR #72). The OpenShift SCC template was the one resource-monitor file
not updated when the literal `tracebloc-resource-monitor` ServiceAccount
name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC
granted access to a SA name that no longer existed, so the resource-
monitor DaemonSet pods would fail to launch (no SCC -> can't mount
hostPath /proc and /sys for node metrics).

The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name
were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` /
`tracebloc-resource-monitor-scc-<release>`), so this slipped through —
casual reading suggested it was already done.

Touchpoints in resource-monitor-scc.yaml:
- users[0]: now {{ include "tracebloc.resourceMonitorName" . }}
- ClusterRoleBinding subjects[0].name: same helper
- All `app: tracebloc-resource-monitor` labels: same helper, for
  consistency with the rest of the chart's resource-monitor templates
- Updated the kubernetes.io/description SCC annotation prose so the
  literal name doesn't appear there either (cosmetic, but easier to
  audit "no literal references" with a single grep).

Tests:
- platform_test.yaml gains 3 new cases: SCC users[0] points at
  release-scoped SA, ClusterRoleBinding subject does too, and two
  releases (stg + cisco/hasan-prod) produce non-colliding SA references.
- node_agents_namespace_test.yaml had a regression assertion checking
  the OLD literal name in users[0]; updated to the new release-scoped
  form (`RELEASE-NAME-resource-monitor`, helm-unittest's default
  release name when none is set).
- 98 -> 102 passing.

Verified end-to-end with two side-by-side `helm template` runs:
- stg     -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor
- hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor

Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that
v1.2.0 inadvertently broke).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3) — bugbot follow-up #2 (#80)

* fix(client): release-scope SCC SA refs (v1.2.2)

* fix: NOTES.txt rename + generator chart-version drift (v1.2.3)

Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues:

1. (Medium) NOTES.txt:9 still hardcoded the literal
   `tracebloc-resource-monitor` for the resource-monitor DaemonSet
   display, while the actual DaemonSet name has been
   `<release>-resource-monitor` since v1.2.0. Operators see one name
   in the post-install banner and a different name when they
   `kubectl get ds`. Now routes through the same
   tracebloc.resourceMonitorName helper as the rest of the chart.

2. (Low) docs/migration-tools/generate.sh hardcoded
   `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0`
   on every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3,
   and operators running generate.sh today get PVC labels stuck at
   1.1.0 even though the install ahead is 1.2.3. Helm adoption itself
   is unaffected (it keys on meta.helm.sh/release-name, not the chart
   label), but the labels lie until a subsequent upgrade reconciles
   them, and `kubectl get pvc -L helm.sh/chart` is misleading during
   migration debugging. Fixed by reading name + version from
   client/Chart.yaml at generate time.
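
Reading the chart name and version at generate time might be done along these lines. The helper names `chart_field` and `chart_label` are hypothetical, and the awk parsing assumes flat top-level `name:` and `version:` keys; generate.sh may use a different approach.

```shell
# Print the value of a top-level key from Chart.yaml, stripping double
# quotes if present. $1 = key, $2 = path to Chart.yaml.
chart_field() {
  awk -v key="$1" -F': *' '$1 == key { gsub(/"/, "", $2); print $2; exit }' "$2"
}

# Build the value used for the helm.sh/chart label, e.g. client-1.2.3.
chart_label() {
  printf '%s-%s\n' "$(chart_field name "$1")" "$(chart_field version "$1")"
}
```

Deriving the label this way means the pre-created PVCs follow whatever version is checked out, with no constant to update on each chart bump.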

Plus a few stale prose references caught while auditing the same path
(no functional impact, but the doc was directing operators at "client
fix in 1.2.0" as if it were still pending):

- generate.sh inline comment on `resourceMonitor: false` rephrased
  from "until client-1.2.0 is published" to "until you have verified
  the chart you're installing is 1.2.0+"
- migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to
  "mysql spec sanity (v1.1.0+ shape: ...)"
- README.md skip table cell on `resourceMonitor: false` rewritten to
  reflect that 1.2.0+ has shipped — operators on >=1.2.0 can flip it
  to true without colliding with the stg release

Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers:
- Release-scoped resource-monitor name appears in NOTES.txt
- A different release renders a different name (proves the helper
  isn't accidentally hardcoded)
- Negative regex guards against the literal `tracebloc-resource-monitor`
  reappearing followed by a non-suffix character (i.e. the bare
  pre-1.2.3 form, while still letting the SCC line `tracebloc-
  resource-monitor-<release>` further down the file pass)
- `resourceMonitor: false` removes the line entirely
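
The negative guard can be pictured with a plain extended regex. The pattern below is an assumption written for illustration (the expression in notes_test.yaml may differ): it matches the bare literal only when it is not followed by a `-<suffix>`.

```shell
# Hypothetical pattern: the literal name followed by a character other
# than "-", or by end of line. The suffixed SCC form does not match.
BARE_NAME_RE='tracebloc-resource-monitor([^-]|$)'

# Succeeds when the text contains the bare, pre-rename name.
has_bare_name() {
  printf '%s\n' "$1" | grep -Eq "$BARE_NAME_RE"
}
```

Under this pattern a line mentioning `tracebloc-resource-monitor-stg` passes the guard, while a line ending in the bare `tracebloc-resource-monitor` trips it.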

End-to-end smoke of generate.sh confirms PVCs ship with the live chart
version (`helm.sh/chart: client-1.2.3` after this commit, verified
against /tmp/tracebloc-migration-<demo>/pvcs.yaml).

Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains
the SCC SA-ref rename. Once #78 lands the diff against develop will
reduce to just this commit.

Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix +
tooling correctness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* docs(claude): require @saadqbal as PR assignee (#79)

Convention captured after a session-end ask. Every PR Claude opens for
this repo must be assigned to saadqbal — orphaned PRs without an
assignee fall through the review queue.

Pass --assignee @me on `gh pr create` (or --assignee saadqbal if running
unauthenticated). No exceptions.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
chore(client): bump chart to 1.2.3 for release
…loses #70) (#83) (#84)

The chart unification (4 per-platform charts -> unified client/ chart)
shipped in v1.1.0; the unified chart has now been at v1.2.x in production
across stg + hasan-prod for several releases. Time to retire the legacy
artifacts.

Removed:
- aks/, bm/, eks/, oc/ chart directories — 75 files, ~330KB. Each had a
  DEPRECATED.md pointing at the unified chart for ~6 months.
- 7 stale .tgz tarballs at repo root (aks-1.0.3, aks-1.0.4, bm-1.0.3,
  bm-1.0.4, eks-1.0.3, eks-1.0.4, oc-1.0.4). The release workflow
  publishes via gh-pages; these checked-in builds were dead weight.
- Root index.yaml — stale snapshot listing only 1.0.3/1.0.4 of the
  legacy charts. The live index served at tracebloc.github.io/client
  is on the gh-pages branch and is the source of truth.
- mysql.yaml at repo root — orphaned PVC manifest with hardcoded volume
  UUID and namespace. Audited: zero references anywhere in the repo.

Other:
- Added *.tgz to .gitignore so chart packages don't sneak back in.
- Updated client/MIGRATION.md Rollback section. The old "the legacy
  charts remain in aks/, bm/, eks/, oc/ and can be used at any time"
  was about to become a lie. Replaced with instructions to recover the
  directory from git history if anyone genuinely needs the old chart.

Verification:
- helm lint --strict ./client -f client/ci/eks-values.yaml — clean
  (same invocation the release workflow runs on every tag)
- helm unittest client — 105/105 still passing
- helm package ./client -d /tmp — produces a valid client-1.2.3.tgz

Net diff: 86 files changed, 17 insertions(+), 3447 deletions(-).

Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Prod: Implement self-upgrade CronJob for Helm chart automation
* chore: revert default CODEOWNERS — keep narrow security rules only (#92)

* Merge pull request #91 from tracebloc/chore/bump-chart-1.3.1

chore(client): bump chart 1.3.0 -> 1.3.1 (auto-upgrade verification)

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
The Deploy section opened with `docker pull tracebloc/client:latest`,
but this repo ships a Helm chart — the actual install is `helm install`.
External walkthrough URLs (`/local-linux`, `/local-macos`, `/aws`,
`/deployment-overview`) didn't match any path in the tracebloc/docs
tree, so they 404. The in-repo documentation (`docs/INSTALL.md`,
`docs/MIGRATIONS.md`, `docs/migration-tools/README.md`,
`client/MIGRATION.md`) was never linked from the README despite being
the operational source of truth.

Surgical change — the rest of the README stays as-is:
- Replace `docker pull` with `helm repo add` + `helm install` (matches
  docs/INSTALL.md)
- Call out chart version (v1.3.1) and platform support (AKS / EKS /
  bare-metal / OpenShift) up front
- Table linking every in-repo operational doc
- Fix external URLs to match actual tracebloc/docs paths
  (local-deployment-guide-linux, local-deployment-guide-macos,
  eks-client-deployment-guide, azure-deployment-guide)
- Pull NetworkPolicy/CNI prerequisite into a callout

Closes #101

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: fix README Deploy section (Helm not docker), surface in-repo docs
The standalone installer (bash <(curl -fsSL tracebloc.io/i.sh) /
irm tracebloc.io/i.ps1 | iex) is the one-command path for evaluation,
local dev, and first-time installs — it provisions a cluster, detects
GPU drivers, and deploys the client. Today it isn't documented anywhere
reachable from this repo, so readers see the multi-step helm install
flow as the only option.

README:
- New "Quick install" subsection at the top of Deploy with macOS/Linux
  and Windows commands, brief description of what it does, and a
  pointer to the local helper scripts under scripts/
- Existing helm flow relabeled as "Helm install (production)" — now
  positioned as the option for existing production clusters

docs/INSTALL.md:
- Top-of-doc callout pointing at the standalone installer for
  non-production users
- Production-focused content untouched

Closes #103

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous wording ("Best for evaluation, local dev, and first-time
installs" / "Just trying it out? For local dev or a quick evaluation")
implied the standalone installer produces a lesser/demo client. It
doesn't — it produces the same full client, just on a cluster the
script provisions for you.

Reframes the differentiator around cluster ownership instead of install
quality:
- README: "Use this when you don't already have a cluster — the result
  is a full client install, not a demo." Helm subsection retitled
  from "Helm install (production)" to just "Helm install" with
  "For existing Kubernetes clusters".
- INSTALL.md: callout opens with "Don't have a Kubernetes cluster
  yet?" and emphasizes "a full tracebloc client".

Refs #103
curl and PowerShell's irm both default to HTTP when no scheme is
specified, so `curl -fsSL tracebloc.io/i.sh` and `irm tracebloc.io/i.ps1`
issue plaintext requests. The downloaded body is piped straight into
bash / iex, so a network-level attacker between the user and tracebloc.io
could MITM the response and inject arbitrary code.

Add explicit `https://` to every installer URL in README.md and
docs/INSTALL.md so the request is encrypted from the first byte.

Refs #103
docs: surface standalone installer in README and INSTALL.md
…main

ci: bootstrap FR-flow callers on main
Switches the auto-upgrade CronJob default schedule from
"23 2 * * *" (daily 02:23 UTC) to "23 * * * *" (hourly at :23).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships the hourly auto-upgrade schedule landed in #113 to deployed
customers. Updates MIGRATION.md to reflect the new default cadence
("hourly at :23 UTC" replaces the prior "daily at 02:23 UTC").

Refs #111.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore(client): bump chart 1.3.1 -> 1.3.2 (hourly auto-upgrade)
* Merge pull request #88 from tracebloc/ci/add-wip-limit-caller

ci: add WIP-limit-check caller workflow

* feat(requests-proxy): register requests-proxy in Helm chart (#95)

* feat(requests-proxy): register requests-proxy in Helm chart

- Add requests-proxy Deployment and Service templates
- Auto-generate requests-proxy-admin token on first install (preserved
  across upgrades via lookup; override with requestsProxyAdminToken)
- Inject REQUESTS_PROXY_ADMIN_TOKEN into jobs-manager via the same secret
- Add images.requestsProxy and resources.requestsProxy values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update order of setting request proxy admin token

* Bugbot Fix YAML

* Bugbot fix add validation for request proxy

---------

Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Merge pull request #106 from tracebloc/docs/drop-stale-helm-charts-refs-105

docs: drop stale tracebloc-helm-charts references in INSTALL.md

* ci: add FR-pass comment caller for multi-stage kanban flow

* ci: add FR gate caller for staging/main promotions

* chore: sync main → develop after misrouted docs PRs (#108)

* docs: fix README Deploy section (Helm not docker), surface in-repo docs

* docs: surface standalone installer in README and INSTALL.md

* docs: reframe Quick install — same client, different cluster path

* docs: explicit https:// on installer URLs (security)

* ci: bootstrap FR-pass caller on main

* ci: bootstrap FR gate caller on main

---------

Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>

* chore(auto-upgrade): run cronjob hourly at :23 (#112)

Switches the auto-upgrade CronJob default schedule from
"23 2 * * *" (daily 02:23 UTC) to "23 * * * *" (hourly at :23).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Merge pull request #115 from tracebloc/chore/bump-chart-1.3.2-develop

chore(client): bump chart 1.3.1 -> 1.3.2 (develop sync)

* ci: drop push-tags trigger from release-helm-chart workflow (#117)

* ci: drop push-tags trigger from release-helm-chart workflow

`gh release create v<x.y.z>` (the established release path per
`gh release list`) fires both `push` (tag) and `release` (published)
events, which causes two parallel workflow runs to race for the
gh-pages push. The slower run fails with non-fast-forward.

Most recent example: v1.3.2 cut today — run 25492826437 (release event)
failed; run 25492826350 (push event) succeeded. Artifacts landed fine,
but the failed sibling shows up as a red X on the release and is noise
for anyone debugging future releases.

Keeping only `release: published` removes the race. The
`Upload chart to GitHub Release (on tag)` step's
`startsWith(github.ref, 'refs/tags/')` guard still evaluates true for
release events (`github.ref` is the tag ref), so the upload step
behaviour is preserved.

Closes #116

* ci: harden release-asset upload against actions/runner#2788

With the push-tags trigger removed, the upload step's
`if: startsWith(github.ref, 'refs/tags/')` guard is the only gate on
the upload, but it silently evaluates to false
when `github.ref` arrives empty — a known intermittent runner bug
(actions/runner#2788, still open as of 2026-05). The same bug also
affects `github.ref_name`, which softprops/action-gh-release@v2 uses
by default to derive the tag, so the action itself can target the
wrong release (or fail) when the bug fires.

Drop the now-redundant `if:` guard (the workflow only runs on
`release: published`, so every run is by definition a release event)
and pass `tag_name` explicitly from the release event payload, which
is unaffected by the bug.

* ci: pin checkout ref to release tag (actions/runner#2788 hardening)

actions/checkout@v4 defaults `ref` to github.ref, which is the same
field hit by actions/runner#2788 — the still-open intermittent bug
where github.ref arrives empty on release-triggered runs. Per the
action's docs, when "checking out the repository that triggered a
workflow, this defaults to the reference or SHA for that event.
Otherwise, uses the default branch." So an empty github.ref would
fall back to the repo default branch (develop here), and we'd
package the chart from develop's HEAD instead of the tagged commit.

Pin ref explicitly to github.event.release.tag_name, which is part of
the release event payload and is unaffected by the runner bug.

* Add MySQL Host to request proxy yaml file (#118)

Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local>

* Add request proxy url to jobs manager yaml file (#119)

Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local>

* Remove REQUESTS_PROXY_ADMIN_TOKEN (#120)

Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local>

* Reduce dependency on values.yaml file for requests proxy (#122)

Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local>

* feat(#86): ingestor Helm subchart + companion RBAC/service/authz for new ingestion endpoint (#123)

* feat: companion chart changes for ingestion endpoint (client-runtime#21)

Wires the cluster side of the new ingestion flow into the main client
chart so the upcoming ingestor subchart can actually reach jobs-manager.

Five small changes:

1. **rbac.yaml** — adds three permissions to jobs-manager's RBAC:
     - authentication.k8s.io/tokenreviews   create
     - configmaps                            create
     - secrets                               create
   The endpoint validates caller SA tokens via TokenReview and creates
   a per-run ConfigMap (ingest.yaml) + Secret (BACKEND_TOKEN) before
   spawning the ingestor Job.

   `tokenreviews` is cluster-scoped and only added to the ClusterRole
   branch, so with `clusterScope: false` the ingestion endpoint cannot
   authenticate callers. Documented in the rule comments.

2. **jobs-manager-service.yaml** (new) — ClusterIP exposing port 8080
   at the stable name `jobs-manager`, so the ingestor subchart's
   post-install hook doesn't need to discover Pod IPs.

3. **jobs-manager-deployment.yaml** — adds containerPort 8080 on the
   `api` container, mounts the ingestion-authz ConfigMap at
   `/etc/tracebloc/ingestion-authz.yaml`, declares the corresponding
   pod-level volume.

4. **ingestion-authz-configmap.yaml** (new) — renders the
   `ingestionAuthz.allowed` policy customers configure in values.yaml.
   Mounted into jobs-manager and read at startup by
   `submit_ingestion_run.load_authz_policy`. Each entry maps
   (namespace, service_account) → allowed table_prefixes; omitted
   `namespace` defaults to .Release.Namespace.

5. **values.yaml** — adds the `ingestionAuthz.allowed` default that
   permits the ingestor subchart's default SA (named `ingestor`) to
   ingest into any table. Customers tighten via overrides.

Verified
────────
- helm lint passes (only pre-existing icon-recommended INFO).
- helm template renders all five resources cleanly with expected
  values (Service name, RBAC verbs, container port, volume mount).
- helm unittest: 116/116 tests pass (existing snapshots unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#86): ingestor Helm subchart (post-install hook submits to jobs-manager)

The customer-facing chart that finally closes the end-to-end loop:

  helm install my-dataset tracebloc/ingestor --namespace tracebloc \
    --set-file ingestConfig=./my-ingest.yaml \
    --set image.digest=sha256:<digest>

Renders the customer's ingest.yaml into a ConfigMap, then a
post-install hook Job POSTs `{ingest_config, idempotency_key,
image_digest}` to jobs-manager's `/internal/submit-ingestion-run`
endpoint (client-runtime#21). jobs-manager validates the SA token via
TokenReview, validates the YAML against ingest.v1, mints a backend
token, creates the per-run ConfigMap + Secret + Job, returns 201
(or 200 on replay).

Layout
──────
  ingestor/
  ├── Chart.yaml          appVersion: 0.3.0-rc1 (the data-ingestors release)
  ├── values.yaml         ingestConfig (required, --set-file), image.digest
  │                       (required, sha256), jobsManager.endpoint,
  │                       serviceAccount.create, hook resources, idempotency
  ├── README.md           ownership boundaries + verification commands
  ├── .helmignore
  └── templates/
      ├── _helpers.tpl
      ├── serviceaccount.yaml             default name "ingestor"
      ├── configmap-ingest-config.yaml    hook-weight 0
      └── post-install-job.yaml           hook-weight 1, runs as the SA,
                                          reads its own token, POSTs.

Ownership boundary
──────────────────
Per #86's acceptance criteria, the README spells out what `helm uninstall`
does and doesn't clean up:

  This chart owns:    ConfigMap (ingest.yaml), the hook Job, the SA.
  jobs-manager owns:  the per-run ConfigMap, Secret, ingestor Job.
  The cluster owns:   the ingested data + metadata POSTed to the backend.

`helm uninstall my-dataset` removes only the chart's footprint. The
running ingestor Job and its data persist. This is deliberate — uninstall
is not a cancel button. The README documents the kubectl command to
cancel a run if needed.

Implementation choices
──────────────────────
- **post-install hook, not a long-lived resource.** The hook is the
  whole point of this chart — fire once, exit.
- **automountServiceAccountToken: true** for the hook Job. That's the
  whole authentication mechanism — TokenReview on the SA token. Every
  other tracebloc workload disables automount; this one needs it.
- **`hook-delete-policy: before-hook-creation`**, NOT `hook-succeeded`.
  Keeps the completed Job around so operators can `kubectl logs` the
  POST response after install. Cleaned up only on the next install
  under the same release.
- **curlimages/curl** as the hook image — small, official, and ships
  python3 which we use to JSON-encode the multi-line YAML body safely
  (jq has a JSON-escape edge case for YAML newlines that's easier
  to side-step than handle).
- **idempotencyKey defaults to `<release>-<revision>`** so a
  `helm upgrade` submits a fresh run. Customers override to a stable
  UUID if they want strict at-most-once across reinstalls.

Verified
────────
- helm lint passes.
- helm template renders all three resources (ConfigMap, Job, SA), and
  the inline templates expand cleanly with --set-file ingestConfig.
- Required-value gates fire correctly: missing image.digest fails
  template; missing ingestConfig fails template.

Closes #86

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#86): pre-render JSON body in ConfigMap, drop python3 + shell JSON escape

Three bugbot findings on the first ingestor-chart pass, all real:

  1. HIGH — curlimages/curl runtime layer doesn't include python3
     (only in the build stage; stripped in the final image). The
     hook's `python3 -c ...` JSON encoder would fail with
     "python3: not found" on every install.

  2. HIGH — even if python3 were available, the shell syntax
     `python3 -c "..." VAR=value` puts the assignments AFTER the
     command, which makes them positional argv, not env. The
     `os.environ['INGEST_CONFIG']` lookup would raise KeyError.

  3. MEDIUM — `nindent 4` after literal template-source indentation
     puts a leading blank line into the YAML block scalar, so the
     customer's ingest.yaml gets a "\n" prefix that block-scalar
     parsers tolerate but is wrong.
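
Finding 2 can be reproduced without python3 at all. A minimal sketch, using `sh -c` as a stand-in for the interpreter and a made-up variable name `GREETING` (neither comes from the chart):

```shell
#!/bin/sh
# Make sure the variable is not inherited from the caller's environment.
unset GREETING

# Assignment BEFORE the command word: exported into the child's environment.
before=$(GREETING=hello sh -c 'printf "%s" "${GREETING:-unset}"')

# Assignment AFTER the command: passed as a positional argument ($1 here),
# so the child's environment never sees it.
after=$(sh -c 'printf "%s" "${GREETING:-unset} argv1=$1"' sh GREETING=hello)

printf 'before: %s\nafter:  %s\n' "$before" "$after"
```

The second form is the shape the hook used, which is why an environment lookup in the child would have found nothing.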

Structural fix rather than tweaking the script: the three POST-body
fields (ingest_config, idempotency_key, image_digest) are ALL known
at helm-template time. Render the JSON body in the ConfigMap as
`body.json` using Helm's `toJson` filter — which handles multi-line
string escaping correctly — then the hook becomes a one-line
`curl --data-binary @body.json`. No python3 needed, no shell-side
JSON construction at all. Eliminates both HIGH bugs as a category,
not just instance-by-instance.

For bug 3: use the left-trim action delimiter (dash inside braces)
before the `required ... | nindent 4` action so it eats the
leading whitespace cleanly. Verified via `helm template` that the
rendered `ingest.yaml` now starts cleanly with `apiVersion:`.

Verified
────────
- helm lint passes on both client/ and ingestor/.
- helm template renders the JSON body with correct escaping
  (multi-line YAML → "\n"-escaped scalar in JSON).
- helm template renders ingest.yaml with no leading blank line.
- helm unittest client/: 116/116 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#86): track ingestor/values.yaml (was silently .gitignored)

bugbot caught a serious oversight: `ingestor/values.yaml` exists in
the working tree but never made it into the repository. Every `git
add ingestor/` silently dropped it because the repo's .gitignore at
line 119 has `/*/values*.yaml` — an anti-leak pattern for operator
values files — which matches `ingestor/values.yaml`.

Without the file the chart is broken on `helm install`: every template
references `.Values.hookImage.repository`, `.Values.jobsManager.endpoint`,
etc., and Helm fails with nil-pointer errors when those keys are absent.

Two-line fix:

  - Add `!ingestor/values.yaml` to .gitignore (mirrors the existing
    `!client/values*.yaml` exception for the main chart). Documents
    *why* the exception exists, so a future cleanup pass doesn't
    re-introduce the bug.
  - Commit the actual values.yaml file with the defaults already
    referenced by the README and the templates.

Local verification before pushing:

  helm template my-dataset ingestor/ --namespace tracebloc \
    --set ingestConfig=... --set image.digest=sha256:...
  # renders ServiceAccount, ConfigMap, Job correctly.

Lesson for future runs: `git add <dir>/` does *not* verify that files
were added, because gitignore patterns can silently drop them. Running
`git status` before the commit would have caught this before bugbot
did.
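
The silent drop and the negation fix can be reproduced in a throwaway repository. This is a sketch under assumptions: the file contents are placeholders, and only the one ignore pattern quoted above is recreated, not the repo's full .gitignore.

```shell
#!/bin/sh
set -eu
# Throwaway repository; nothing here touches the real chart repo.
repo=$(mktemp -d)
cd "$repo"
git init -q .

mkdir ingestor
printf 'name: ingestor\n' > ingestor/Chart.yaml
printf 'jobsManager:\n  endpoint: ""\n' > ingestor/values.yaml
# The anti-leak pattern described above.
printf '/*/values*.yaml\n' > .gitignore

# Staging the directory succeeds, but the ignored file is skipped silently.
git add ingestor/
tracked_before=$(git ls-files ingestor/ | tr '\n' ' ')

# Add the negation exception, then stage again.
printf '!ingestor/values.yaml\n' >> .gitignore
git add ingestor/
tracked_after=$(git ls-files ingestor/ | tr '\n' ' ')

printf 'before: %s\nafter:  %s\n' "$tracked_before" "$tracked_after"
```

Listing the index with `git ls-files` (or reading `git status`) after staging is the check that `git add` alone does not provide.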

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: nil-guard ingestionAuthz access for --reuse-values upgrade path (#124)

#123's ingestion-authz ConfigMap template did unguarded nested access:

    {{- range .Values.ingestionAuthz.allowed }}

This crashes with "nil pointer evaluating interface {}.allowed" when
`.Values.ingestionAuthz` is absent — which is exactly what `helm
upgrade --reuse-values` produces against a pre-#123 release. The
stored values from the previous deploy don't have the key, and
`--reuse-values` doesn't pick up new chart defaults, so the upgrade
fails before any of the new resources are created.

A real user hit this immediately after #123 merged:

    Error: UPGRADE FAILED: template: client/templates/
    ingestion-authz-configmap.yaml:20:21:
    executing "..." at <.Values.ingestionAuthz.allowed>:
    nil pointer evaluating interface {}.allowed

Fix: collapse the missing-parent and missing-child cases to an empty
list with `default dict` + `default list`. The rendered ConfigMap
becomes `allowed:` (empty), which the authz policy parser treats as
"no SAs authorized" — fail-safe, matches the intent of "operator
hasn't configured this yet".

The recommended `helm upgrade` recipe is still
`--reset-then-reuse-values` (picks up new defaults including the
non-empty `ingestionAuthz.allowed` default), but the template no
longer requires that — it renders correctly under either path.

Verified
────────
- helm template renders cleanly with default values
  (full policy), with `--set ingestionAuthz=null` (empty allowed
  list), and with `--set ingestionAuthz.allowed=null` (same).
- helm unittest client/: 116/116 pass, no snapshot changes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#125): wire INGESTOR_IMAGE_DIGEST; drop digest requirement from ingestor subchart (#126)

* feat(#125): wire INGESTOR_IMAGE_DIGEST; drop digest requirement from ingestor subchart

Companion to tracebloc/client-runtime#41 (which made the endpoint
treat the request body's `image_digest` as an optional override of a
cluster-configured default). With this PR the ingestor image fits the
same auto-update model as every other component in the chart:

  client/values.yaml
    + images.ingestor.digest: ""
      The auto-upgrade cronjob bumps this when a new chart version is
      published; jobs-manager re-rolls and the new env takes effect.

  client/templates/jobs-manager-deployment.yaml
    + INGESTOR_IMAGE_DIGEST env, nil-guarded for --reuse-values from
      a pre-this-PR release. Empty value renders cleanly (no nil
      pointer), endpoint then accepts only request-body overrides
      until the operator sets the chart value.

  ingestor/values.yaml + templates/configmap-ingest-config.yaml
    + image.digest is now an OPTIONAL override, not required.
    + body.json renders without `image_digest` when none is set; the
      key is included only when the customer explicitly pinned via
      --set image.digest=... (the override path: reproducing old runs,
      testing pre-rollout versions, air-gapped mirrors).

  ingestor/README.md
    + Removes image.digest from "Required values".
    + Adds "Pinning a specific image version" section explaining the
      override use cases and when to reach for them.
    + Top-of-README install snippet drops --set image.digest=... — the
      dominant path is now `helm install --set-file ingestConfig=...`.

Once both PRs land, the bootstrap step is a one-line bump of
client/values.yaml's images.ingestor.digest to the current
ghcr.io/tracebloc/ingestor release digest, plus a chart version bump
so the auto-upgrade cronjob promotes it. Future ingestor releases
follow the same pattern — bump digest + chart version, customers'
auto-upgrade picks it up on the next tick.

Verified
────────
- helm lint passes on both charts.
- helm template renders:
    - env populated when images.ingestor.digest is set
    - env empty (nil-guard) when images.ingestor key absent entirely
      (simulates --reuse-values from pre-this-PR release)
    - body.json without image_digest when no override
    - body.json with image_digest when explicit --set image.digest=...
- helm unittest client/: 116/116 pass.

Closes #125

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bootstrap ingestor digest + bump chart version 1.3.2 → 1.3.3

Activates the auto-update model introduced by the rest of this PR.
Without the value set, jobs-manager runs with `INGESTOR_IMAGE_DIGEST=""`
and the ingestion endpoint returns 503 for every call that doesn't
include a body override — which is the *opposite* of the "customer
doesn't have to think about digests" UX this PR is supposed to enable.

Two coupled bumps:

  client/Chart.yaml
    version: 1.3.2 → 1.3.3
    appVersion: 1.3.2 → 1.3.3
      Required for the auto-upgrade cronjob to detect this release.
      `helm search repo` orders by version; without a bump customers
      stay on 1.3.2 and never see the new env wiring.

  client/values.yaml
    images.ingestor.digest = "sha256:e6639b084d0d377072dc908db376050914ebd49c730ddaa13f838d10f5482ea9"
      The data-ingestors v0.3.0-rc1 release. Future ingestor releases
      bump both this and Chart.yaml's version; eventually a workflow
      in tracebloc/data-ingestors can raise the PR automatically when
      a new image is published.

After this lands and the chart is published to gh-pages, a
`helm upgrade --reset-then-reuse-values` on the customer's cluster
(or the daily auto-upgrade cronjob's next tick) rolls jobs-manager
with the env populated, and `helm install tracebloc/ingestor
--set-file ingestConfig=...` — no `--set image.digest=...` — works.

Verified
────────
- helm lint client/ clean.
- helm template shows INGESTOR_IMAGE_DIGEST env populated with the
  real digest.
- helm unittest client/: 116/116 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#127): ingestor chart auto-resolves jobs-manager endpoint to release namespace (#128)

The ingestor subchart's default jobsManager.endpoint hardcoded
"tracebloc" as the parent release's namespace:

    http://jobs-manager.tracebloc.svc.cluster.local:8080

Any release in a non-"tracebloc" namespace failed the post-install
hook with `curl: (6) Could not resolve host: …`, blocking end-to-end
ingestion. Surfaced today during real-cluster validation on a release
deployed to `tracebloc-templates`.

Fix shape: leave the values.yaml default empty; have the post-install
hook template the endpoint to use `.Release.Namespace` when no value
is set. The override path (cross-namespace install) keeps working —
set `jobsManager.endpoint` explicitly and it wins over the default.

  values.yaml
    jobsManager.endpoint: "" (was hardcoded to tracebloc namespace)
    + comment explaining the auto-resolve + override semantics

  templates/post-install-job.yaml
    JOBS_MANAGER_ENDPOINT defaults to
      http://jobs-manager.<.Release.Namespace>.svc.cluster.local:8080
    when .Values.jobsManager.endpoint is empty.

  README.md
    Frequently-overridden-values entry corrected.
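
The fallback described above can be sketched in plain shell. `resolve_endpoint` is a hypothetical helper for illustration only; the chart does this in the Helm template, not in a script.

```shell
#!/bin/sh
# resolve_endpoint NAMESPACE [OVERRIDE]
# Hypothetical helper mirroring the template's fallback: an explicit,
# non-empty override wins; otherwise the in-cluster DNS name is built
# from the release namespace.
resolve_endpoint() {
  namespace=$1
  override=${2:-}
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    printf 'http://jobs-manager.%s.svc.cluster.local:8080\n' "$namespace"
  fi
}

default_ep=$(resolve_endpoint tracebloc-templates)
override_ep=$(resolve_endpoint tracebloc-templates http://port-forward.localhost:8888)

printf 'default:  %s\noverride: %s\n' "$default_ep" "$override_ep"
```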

Verified
────────
- helm template into namespace `tracebloc-templates` →
  http://jobs-manager.tracebloc-templates.svc.cluster.local:8080
- helm template into namespace `some-other-ns` →
  http://jobs-manager.some-other-ns.svc.cluster.local:8080
- helm template with --set jobsManager.endpoint=http://port-forward.localhost:8888
  → wins over the default.
- helm lint clean.

Closes #127

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#129): parent client chart owns the shared ingestor ServiceAccount (#131)

The ingestor ServiceAccount is shared by every `tracebloc/ingestor`
subchart release in a namespace, but it was owned by the first such
release. Concurrent installs of a second ingestor release collided
with Helm's "cannot import into current release"; uninstalling the
first release ripped the SA out from under all the others.

Move the SA into this parent chart, which already owns the matching
`ingestionAuthz` ConfigMap, so the SA + policy have the same lifecycle
and every ingestor release in the namespace shares the SA cleanly.

Plumb the name through `ingestionAuthz.serviceAccountName` as a single
source of truth — both the new SA template and the default `allowed`
entry in the authz ConfigMap dereference it via the new
`tracebloc.ingestorServiceAccountName` helper. The helper nil-guards
pre-#129 `--reuse-values` upgrades by defaulting to "ingestor".

Document the SA adoption path in `client/MIGRATION.md` for clusters
that already have an `ingestor` SA owned by a 0.1.0 subchart release —
re-annotate before upgrading the parent chart so Helm doesn't refuse
the import.

Bumps chart to 1.3.4. Pair with tracebloc/ingestor 0.2.0, which flips
`serviceAccount.create` default to `false` so subchart releases stop
trying to own the SA themselves.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#130): default idempotency key to install-time stamp, not release revision (#132)

`ingestor.idempotencyKey` previously fell back to `<release>-<revision>`
when `.Values.idempotencyKey` was unset. Helm restarts revisions at 1
after `helm uninstall`, so reinstalling under the same release name
produced the same key. If anything dedupe-relevant changed in between
(image digest is the dominant case during testing), jobs-manager
correctly rejected the second submission with a 409 — but to a customer
following the README it looked like the chart was broken.

Default to `<release>-<unix-epoch>` instead. Each install gets a fresh
key; the opt-in stable-UUID path remains for callers who actually want
at-most-once semantics across reinstalls.

Note on the printf format: Sprig's `unixEpoch` returns a string (not an
int), so the formatter is `%s-%s`, not `%s-%d`.
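
A shell analogue of the new default, for illustration only. `default_idempotency_key` and the UUID below are made up, and `date +%s` is assumed to be available (it is on GNU and BSD userlands).

```shell
#!/bin/sh
# default_idempotency_key RELEASE [EXPLICIT_KEY]
# Hypothetical helper: an explicit key wins, otherwise the key is
# "<release>-<unix-epoch>" so every install gets a fresh value.
default_idempotency_key() {
  release=$1
  explicit=${2:-}
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    # The epoch arrives as text here too, hence %s-%s rather than %s-%d.
    printf '%s-%s\n' "$release" "$(date +%s)"
  fi
}

fresh_key=$(default_idempotency_key my-dataset)
pinned_key=$(default_idempotency_key my-dataset 0b9d6f0e-5c1e-4c59-9d1a-2f6a7c1b8e44)

printf 'fresh:  %s\npinned: %s\n' "$fresh_key" "$pinned_key"
```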

Bumps ingestor subchart 0.1.0 → 0.1.1 (default-behavior change).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(#129)!: default serviceAccount.create=false; parent chart owns the SA (#133)

The ingestor SA is shared across every `tracebloc/ingestor` release in
the namespace. The previous per-release ownership made the second
concurrent install collide with Helm's "cannot import into current
release" error, and uninstalling the first release deleted the SA out
from under any sibling release that worked around the collision with
`serviceAccount.create=false`.

The parent `tracebloc/client` chart 1.3.4 now owns the SA, exposing
its name via `ingestionAuthz.serviceAccountName`. This subchart's
default flips to `create: false` so it consumes that shared SA. The
`name` value is still required so the post-install hook Job knows
which SA's token to mount.

`serviceAccount.create=true` remains available as an escape hatch for
operators on a pre-1.3.4 parent chart, with a comment in values.yaml
explaining when (and only when) to flip it back on.

Breaking change: bumps chart to 0.2.0. Pair with the 1.3.4 parent
chart bump; see the parent's MIGRATION.md "Upgrading to 1.3.4" section
for the SA-adoption procedure on clusters where a 0.1.0 release
already created the SA.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(chart): bump ingestor digest to v0.3.0 + chart to 1.3.5 (#134)

v0.3.0 is the first production-ready ingestor release (signed +
SBOM), validated end-to-end against EKS on 2026-05-19 (6 files in
PVC + 576 MySQL rows via the declarative chart path). The previous
default (v0.3.0-rc1) had three real-cluster bugs that landed as
tracebloc/data-ingestors#106:

- #103 wheel + sdist were missing schema/ingest.v1.json
- #104 image-resolution validator tuple-vs-list comparison
- #105 _has_extension dot/case normalization (no more cat1.jpeg.jpeg)

Chart bumped to 1.3.5 so the auto-upgrade cronjob (#69) detects the
change and rolls customers onto v0.3.0 on the next tick.

ingestor image: ghcr.io/tracebloc/ingestor@sha256:463e2367...07a4a
cosign verify available; release notes contain the verification
command.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#135): publish ingestor subchart alongside parent chart (#136)

The customer-facing install path is

    helm repo add tracebloc https://tracebloc.github.io/client
    helm install my-dataset tracebloc/ingestor \
      --namespace tracebloc-templates \
      --set-file ingestConfig=./my.yaml

For `tracebloc/ingestor` to resolve from that helm repo, the ingestor
subchart must be packaged into gh-pages alongside the parent client
chart. Before this PR, `release-helm-chart.yaml` only ran
`helm package ./client`, so the second install path returned
`Error: chart "ingestor" not found`. helm-ci.yaml also only lints the
parent chart, so any future regression in `ingestor/templates/` would
land on develop without CI noticing.

Three changes:

1. release-helm-chart.yaml: package + index BOTH client and ingestor
   into a single shared index.yaml. Attach both tgzs to the GitHub
   release for download-by-tag pinning.

2. helm-ci.yaml: lint the ingestor subchart on every PR alongside the
   per-platform client lints. Plain `helm lint --strict ./ingestor`
   is enough — its only required value (ingestConfig) emits INFO not
   FAIL, and the chart's templates don't branch on platform so the
   per-platform values-file matrix doesn't apply.

3. ingestor/Chart.yaml: bump appVersion 0.3.0-rc1 → 0.3.0 to match
   the tracebloc/data-ingestors v0.3.0 release that just shipped.
   Chart version (0.2.0) is unchanged; appVersion is descriptive.

Validated locally: both charts package cleanly
(client-1.3.5.tgz, ingestor-0.2.0.tgz), all four platform-specific
client lints pass, ingestor lint passes.

Closes #135.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(ingestor): explain image vs chart update lifecycle (#138)

Customers ask: "the cluster has an auto-upgrade cronjob — does that
mean my ingestor chart updates too?" The answer is nuanced: the
image auto-updates (via INGESTOR_IMAGE_DIGEST on jobs-manager,
kept current by the cronjob), but the chart on your workstation
is independent — Helm's repo cache doesn't refresh itself.

Add a "How updates work" section that explains the two-layer model
and the strong property that the image you run is decoupled from
the chart version that submitted the request. Plus an explicit FAQ
on previously-installed ingestor releases (nothing to upgrade —
fire-and-forget).

No code change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix three bugbot findings from PR #137 review (#142)

* fix(#139): preserve idempotency key across helm upgrade

The ingestor.idempotencyKey helper defaulted to "<release>-<unix-epoch>"
and re-stamped on every render. `helm upgrade --reuse-values`
preserves the stored value "" (not the previously-rendered key), so
the template re-evaluated `now | unixEpoch` and produced a NEW key
each upgrade — accidentally creating duplicate ingestion runs from
what customers expected to be no-op upgrades. Contradicts the
documented behavior in ingestor/README.md added in #138.

Look up the existing post-install hook ConfigMap from the previous
render and reuse its idempotency_key. On fresh install (or after
uninstall) the lookup returns empty and we fall through to the
now-based default. `helm template` (no cluster connection) returns
empty for lookup too, so local previews still get a fresh key per
render — matches the in-cluster install path the first time.

Caught by bugbot on PR #137 review.

Closes #139.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#140): read requests-proxy resources from values

The requests-proxy deployment hardcoded its container resources,
ignoring the resources.requestsProxy schema entry that values.schema.json
has defined since the requests-proxy was added. Every other component
(jobsManager, podsMonitor, mysql) reads from .Values.resources.<name>.*
with defaults — bring requestsProxy in line with that pattern.

Adds the resources.requestsProxy block to values.yaml with the
existing hardcoded defaults so behavior on a fresh install is
unchanged. The template uses the default-through-dict nil-guard
idiom so `helm upgrade --reuse-values` from a pre-1.3.6 release
(where the value didn't exist) still renders cleanly without
crashing on a nil parent.

Caught by bugbot on PR #137 review.

Closes #140.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(#141): add images.ingestor entry to values.schema.json

values.yaml has had images.ingestor.digest since #126, and the
jobs-manager template surfaces it as INGESTOR_IMAGE_DIGEST, but the
schema didn't validate it — every other image (jobsManager,
podsMonitor, resourceMonitor, requestsProxy, mysqlClient, busybox)
has an entry. An operator setting --set images.ingestor.digest=foo
(not the canonical sha256:<64-hex>) bypassed schema validation and
failed only later inside submit_ingestion_run.py.

Add the missing entry mirroring the other image entries' shape.
helm template now rejects malformed digests at chart-template time
("values don't meet the specifications of the schema(s)... Does not
match pattern '^(sha256:[a-f0-9]{64})?$'") rather than waiting for
runtime.
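
The pattern quoted in the error message can be tried out locally. A rough sketch using `grep -E`, on the assumption that it treats this particular expression the same way Helm's schema validator does; `is_valid_digest` is a hypothetical helper name.

```shell
#!/bin/sh
# Same pattern the schema error quotes: empty, or sha256:<64 lowercase hex>.
digest_pattern='^(sha256:[a-f0-9]{64})?$'

is_valid_digest() {
  printf '%s\n' "$1" | grep -Eq "$digest_pattern"
}

for candidate in \
  "sha256:e6639b084d0d377072dc908db376050914ebd49c730ddaa13f838d10f5482ea9" \
  "" \
  "foo"
do
  if is_valid_digest "$candidate"; then verdict=accepted; else verdict=rejected; fi
  printf '%-9s "%s"\n' "$verdict" "$candidate"
done
```

The empty string is accepted on purpose: the trailing `?` makes the whole group optional, so an unset digest still validates.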

Caught by bugbot on PR #137 review.

Closes #141.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Syed Is Saqlain <saqlain.syed007@gmail.com>
Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
docs: surface the declarative ingestor flow from top-level docs

Brings the image-refresh CronJob feature to main for release:
- #155: feat(#154) auto-refresh jobs-manager image on Docker Hub publish
- #156: fix(#154) read annotations via jq (kubectl jsonpath bracket
  notation returns empty for keys containing dots/slashes)

Verified end-to-end on the tb-client-dev-templates EKS dev cluster:
fresh upgrade installs the new resources cleanly, first-tick records
both annotations without restart, second tick is a no-op, and a forced
digest mismatch triggers the expected rollout-restart-then-annotate
sequence. Rollout history bumps as expected.

Closes #154.

Sync develop → main for v1.4.0 chart release

Brings two changes to main for release:
- #161: chore: pin client chart's ingestor digest to v0.3.1
- #159: feat(#158): auto-refresh ingestor image digest without
  chart release (extends image-refresh CronJob from #155/v1.4.0 to
  reconcile ghcr.io/tracebloc/ingestor digests via kubectl set env)

#159 went through five iterations of bugbot findings during review
(rollout-failure retry → annotation source of truth; env drift via
kubectl rollout undo / edit / GitOps reconcile → re-apply path;
adopt-as-baseline rollout-status check). End-to-end smoke-tested on
the tb-client-dev-templates EKS dev cluster against the existing
ghcr.io 0.3 floating tag.

Closes #158, completes the rollout pattern for #154.

Caught in PR #162 review (bugbot, two medium-severity issues).

1. Env-drift rollout retry gap

   The no-op branch (annotation == registry AND spec env == recorded)
   was a bare log statement with no rollout-health verification. A
   previous tick's env-drift `kubectl set env` commits its spec change
   to etcd BEFORE `kubectl rollout status` waits for the new
   ReplicaSet to come up. If the rollout fails, `set -eu` aborts —
   but the spec write persists. Next tick: annotation, registry, and
   spec env all match (because the spec write committed), so the
   no-op branch fires and silently masks the stuck rollout. Running
   pods may be on the old or empty INGESTOR_IMAGE_DIGEST while the
   script reports success.

   Fix: call `kubectl rollout status` in the no-op branch too. On a
   healthy deployment it returns near-instantly (no active rollout
   to wait for). On a stuck deployment it times out, `set -eu` aborts,
   and the failed Job is visible in `kubectl get jobs`. The operator
   then sees the stuck state and can investigate. Image-refresh can't
   autonomously recover from a bad image push, but making the failure
   visible is the right behaviour.

2. Default ingestor tag mismatched team's publishing convention

   Chart defaulted `images.ingestor.tag: prod`. The team's
   ghcr.io/tracebloc/ingestor repo uses semver-style float tags
   (`0`, `0.3`) — there is no `prod` tag. Default install would
   silently no-op every tick because manifest resolution 404'd:

     curl ... ghcr.io/v2/.../manifests/prod → 404
     log "  WARN: could not resolve latest digest; skipping"

   The whole ingestor auto-refresh feature wouldn't work for any
   customer running the chart's defaults, despite `autoRefresh:
   true`.

   Fix: changed default to "0.3" (conservative: tracks 0.3.x patch
   releases only and won't pick up a future 0.4 with breaking
   changes). Operators can override to "0" to track every release
   within major version 0. Long-term, the team should consider
   standardising the chart default once the data-ingestors
   release-image.yml formalises its tag-publishing contract; for now
   this matches what we tested with on the dev cluster.
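
The branch selection that item 1 describes can be sketched as a pure function. The function and branch names are hypothetical and the real script's structure is not reproduced here; the comments note which kubectl command each branch would lead to according to the descriptions above.

```shell
#!/bin/sh
# decide_ingestor_action RECORDED REGISTRY SPEC_ENV
#   RECORDED  digest stored in the deployment annotation
#   REGISTRY  digest the floating tag currently resolves to ("" if it 404'd)
#   SPEC_ENV  INGESTOR_IMAGE_DIGEST currently in the deployment spec
decide_ingestor_action() {
  recorded=$1
  registry=$2
  spec_env=$3
  if [ -z "$registry" ]; then
    echo skip            # tag did not resolve: log a warning, do nothing
  elif [ "$recorded" != "$registry" ]; then
    echo refresh         # new digest: kubectl set env, wait, annotate last
  elif [ "$spec_env" != "$recorded" ]; then
    echo reapply-env     # env drift: re-apply kubectl set env
  else
    echo verify-rollout  # in sync: still run kubectl rollout status
  fi
}

in_sync=$(decide_ingestor_action sha256:aaa sha256:aaa sha256:aaa)
drifted=$(decide_ingestor_action sha256:aaa sha256:aaa "")
updated=$(decide_ingestor_action sha256:aaa sha256:bbb sha256:aaa)
missing=$(decide_ingestor_action sha256:aaa "" sha256:aaa)

printf '%s %s %s %s\n' "$in_sync" "$drifted" "$updated" "$missing"
```

The point of the fix is the last branch: "everything matches" maps to a health check rather than to a bare log line.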

Regression tests:
  * Default tag asserted as "0.3" with `notContains` of "prod" to
    guard against silent revert.
  * No-op branch asserted to call `kubectl rollout status` via
    (?s)-multiline regex matching the "verifying deployment health"
    log line + the kubectl rollout status call.
  * Existing test updated from value: prod to value: "0.3".

141/141 unit tests pass.

NB: these commits are landing on the sync branch directly to avoid
another full develop-PR cycle before release. After #162 merges,
the same content will need to flow back to develop — either via a
"sync main → develop" PR or by cherry-picking the two commits. The
divergence is two commits and is easy to resolve.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

saadqbal and others added 3 commits May 26, 2026 11:56

Caught in PR #162 review (bugbot, medium severity).

The chart default was changed from "prod" to "0.3" in the values
block — matches the team's ghcr.io publishing convention — but the
CronJob template's runtime fallback was left at `| default "prod"`.
Two render paths:

  * helm install / helm upgrade --reset-then-reuse-values: the
    chart's new default ("0.3") flows through, runtime fallback
    never fires, INGESTOR_TAG="0.3". OK.
  * helm upgrade --reuse-values from a pre-v1.4.1 stored manifest:
    the stored values lack `images.ingestor.tag` entirely. Runtime
    fallback fires, renders INGESTOR_TAG="prod", which 404s on
    ghcr.io because that tag doesn't exist. Ingestor refresh
    silently no-ops every tick.

Failure mode is graceful (log warning, no crash), but inconsistent
with the per-customer expectation that v1.4.1 enables ingestor
auto-refresh. autoUpgrade itself uses --reset-then-reuse-values, so
this only hits manual --reuse-values upgrades — narrow but real.

Fix: change runtime fallback to "0.3" so both render paths converge.

Regression test simulates the --reuse-values scenario by setting
images.ingestor.tag=null, exercising the runtime fallback. 142/142
tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Merge pull request #161 from tracebloc/chore/pin-ingestor-v0.3.1-160

chore: pin client chart's ingestor digest to v0.3.1

* feat(#158): auto-refresh ingestor image digest without chart release (#159)

* feat(#158): auto-refresh ingestor image digest without chart release

Extends the existing image-refresh CronJob (#155) to also reconcile
the ghcr.io/tracebloc/ingestor digest onto the live jobs-manager
deployment's INGESTOR_IMAGE_DIGEST env var. New ingestor image
publishes to the floating tag are now picked up within the cronjob's
poll interval (~15 min) instead of requiring a full chart release.

Why

Until now, shipping a new ghcr.io/tracebloc/ingestor image required
bumping client/values.yaml images.ingestor.digest + client/Chart.yaml
+ PR + sync to main + release tag. That's hours of overhead per
bump and asymmetric with jobs-manager (which already gets the
~15-min image-refresh path). The asymmetry hurts because the
ingestor changes frequently as the data-ingestors team iterates.

Design

Two image classes in one CronJob now:

  Class 1 (jobs-manager, pods-monitor):
    Registry: docker.io
    Tag: CLIENT_ENV
    Source of truth: deployment annotation
      `tracebloc.io/last-refreshed-<image>-digest` (#154)
    Action on change: kubectl rollout restart

  Class 2 (ingestor):
    Registry: ghcr.io
    Tag: images.ingestor.tag (default "prod")
    Source of truth: live INGESTOR_IMAGE_DIGEST env value on the
      api container of the jobs-manager deployment (no annotation
      needed — the env IS the digest jobs-manager passes to each
      spawned ingestion Job, so the most direct read of "what
      will be used next" is THIS value).
    Action on change: kubectl set env (triggers natural rollout
      via ReplicaSet rotation — no explicit `rollout restart`).

get_token and get_latest_digest are now parameterized by registry;
docker.io and ghcr.io both support anonymous pull tokens for public
images, with only the issuer URL differing.

Per-image opt-out

* jobs-manager / pods-monitor: same as #154 — set
  `images.<image>.digest` non-empty.
* ingestor: explicit `images.ingestor.autoRefresh: false` flag.
  Asymmetric because ingestor.digest must be non-empty for
  jobs-manager to function (an empty env would 503 every ingestion
  submit), so we can't use digest-presence as the signal.

When ALL THREE pin signals are active, the chart renders no
image-refresh resources at all (helper `imageRefreshEnabled`).
When at least one is unpinned, the cronjob is rendered and the
script skips pinned images via env flags at runtime.

Chart-default ingestor digest stays pinned (v0.3.0) as the
baseline for greenfield installs; image-refresh dynamically
updates the live env from there. Helm's 3-way merge preserves
image-refresh's writes across future helm upgrades as long as
the chart's pinned baseline doesn't change.

Subtle gotcha caught in dev

`default true $autoRefresh` in Go templates returns `true` even
when $autoRefresh is explicitly `false` (Sprig's `default` treats
bool false as empty, so the default overrides it). Switched to
`eq $autoRefresh false` directly, so absence (nil) and explicit
`true` both fall through to "not pinned" as intended. Test pinned
against the correct idiom.

Other changes

* `log()` continues to write to stderr (#155 fix).
* `get_container_env` helper for jq-based env-var reads —
  same kubectl-jsonpath caveat as `get_annotation` (#156).
* Chart version bumped 1.4.0 → 1.4.1.

Tests

20 image-refresh-suite tests (was 17), 140 total pass. New
assertions:
  * all-three-pinned renders zero resources
  * only-jobs-manager+pods-monitor-pinned still renders (regression
    guard for the asymmetric pin signal — without this, the
    ingestor would never auto-refresh on default installs)
  * INGESTOR_PINNED flips correctly on autoRefresh=false
  * INGESTOR_TAG is overridable, `latest` rejected by schema
  * Script must include `kubectl set env`, `ghcr.io`,
    `auth.docker.io`, `get_container_env`, the empty-env
    fill-from-registry path, and the autoRefresh-skip log line

Closes #158

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): annotation as source of truth for ingestor (rollout-failure retry)

Caught in PR #159 review (bugbot, medium severity).

The original design used the live spec env (`INGESTOR_IMAGE_DIGEST` on
the api container) as the source of truth for "what image-refresh has
reconciled to." `kubectl set env` commits the new spec to etcd BEFORE
`kubectl rollout status` waits for the rollout to complete. If the
rollout times out or the new ReplicaSet's pods fail to come up:

  * `set -eu` aborts the script.
  * But the spec already matches the registry.
  * Next tick: `get_container_env` returns the new digest, compares
    equal to registry, no-op → script appears successful.
  * Meanwhile the old ReplicaSet's pods are still running with the
    OLD env, and the new ReplicaSet is stuck failing. The deployment
    is frozen on the old version with no retry signal.

Fix: mirror the jobs-manager/pods-monitor pattern from #154. Use a
`tracebloc.io/last-refreshed-ingestor-digest` annotation on the
deployment as the source of truth. Update the annotation as the LAST
step, only after `rollout status` succeeds. A failed rollout aborts
before the annotate → next tick sees stale annotation → retries.
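
The order of operations described above can be sketched roughly as follows. This is an illustration under assumptions, not the script itself: the function name `roll_ingestor_to`, the `NS`/`DEPLOY` variables, and the 300s timeout are hypothetical, and the real script relies on `set -eu` for the abort where this sketch uses explicit `|| return 1`.

```shell
# Illustrative defaults; the real namespace and deployment name may differ.
NS="${NS:-default}"
DEPLOY="${DEPLOY:-jobs-manager}"
ANNOTATION="tracebloc.io/last-refreshed-ingestor-digest"

roll_ingestor_to() {
  digest="$1"
  # 1. Write the new env into the pod template. This persists in the
  #    deployment spec even if the rollout below never completes.
  kubectl -n "$NS" set env "deployment/$DEPLOY" -c api \
    "INGESTOR_IMAGE_DIGEST=$digest" || return 1
  # 2. Wait for the new ReplicaSet; a timeout or failure stops here.
  kubectl -n "$NS" rollout status "deployment/$DEPLOY" --timeout=300s || return 1
  # 3. Record success LAST, so a failed rollout leaves the annotation
  #    stale and the next tick retries.
  kubectl -n "$NS" annotate "deployment/$DEPLOY" "$ANNOTATION=$digest" --overwrite
}
```

The explicit returns keep the ordering guarantee even if the function is called from an `if` or `||` context, where the shell suspends `set -e`.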

First-observation contract for ingestor:
  * Non-empty spec env (the normal case — chart populates a default):
    adopt as baseline annotation, don't touch env. Same "don't churn
    on install" principle as jobs-manager first-observation.
  * Empty spec env (corrupted state, manual kubectl edit, stale
    --reuse-values): fill from registry on first tick. Empty would
    otherwise cause jobs-manager to 503 on every ingestion submit,
    so the "don't churn" trade-off is wrong in that case.
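
As a rough sketch, the two-case contract reduces to a single emptiness check on the spec env. The function name and the `adopt`/`fill` labels are hypothetical and only illustrate the decision, not the kubectl calls that follow it.

```shell
# Decide what the first tick (no annotation recorded yet) should do.
first_observation_action() {
  spec_env="$1"   # live INGESTOR_IMAGE_DIGEST on the api container; may be empty
  if [ -n "$spec_env" ]; then
    echo adopt    # record the spec env as the baseline annotation; leave env alone
  else
    echo fill     # set env from the registry digest, then annotate
  fi
}
```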

Tests pin:
  * Annotation key `tracebloc.io/last-refreshed-ingestor-digest`
    appears in the script.
  * Order-of-operations: set env → rollout status → annotate (the
    annotate MUST come last; regex matches the full sequence).

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(#158): correct stale top-of-script comment + add first-obs test

Found during PR #159 self-review.

The top-of-script comment block still described the pre-bugbot-fix
design — claimed "Source of truth: the live env value itself (no
annotation needed)" and "no 'first observation' empty-state, each
tick is a normal compare-and-patch-if-different." Both were stale
after e7cf829 switched ingestor to annotation-based source of truth
and added the first-observation branch. Anyone reading the
script-level overview would have been misled about the actual loop.

Comment now matches the code: annotation as source of truth, two-
case first-observation contract (non-empty → adopt as baseline;
empty → fill from registry).

Also adds a positive regression test for the previously-untested
first-observation "adopting spec env as baseline" branch. The
empty-spec-env branch was already covered indirectly by the existing
"would 503 on ingestion submit" regex.

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): also reconcile env drift (rollout undo / kubectl edit / GitOps)

Caught in PR #159 review (bugbot, medium severity). Follow-up to the
annotation-source-of-truth switch in e7cf829.

The previous no-op branch fired whenever annotation == registry and
NEVER read the live spec env. That meant any external actor that
reverted the spec env without touching the annotation would leave the
deployment on a stale env indefinitely. The annotation continued to
match the registry, so image-refresh kept skipping. Real scenarios
this affects:

  * `kubectl rollout undo deployment/X` — reverts pod template to a
    previous ReplicaSet's spec, including its INGESTOR_IMAGE_DIGEST
    env. Annotation on deployment metadata is untouched.
  * `kubectl edit deployment X` — operator manually changes the env.
  * Certain `helm upgrade` flag combos can reset env to the chart's
    pre-image-refresh baseline while preserving annotations (e.g.,
    --reset-values or upgrade from a chart where the digest baseline
    differs from what image-refresh had reconciled to).
  * GitOps reconcilers (Argo CD, Flux) that own the deployment spec
    will revert image-refresh's env writes back to the rendered
    template values.

In all of these, the live deployment runs a stale ingestor image
forever — exactly the failure mode #158 was meant to prevent.

Fix: each tick now reads both the annotation AND the live spec env.
Three reconciliation paths:

  * recorded != registry → "registry drift". Set env to registry,
    wait for rollout, update annotation. (Existing behaviour.)
  * recorded == registry AND spec env != recorded → "env drift". Set
    env to recorded value (NOT registry — registry matches recorded
    by definition here, but recorded is the value we last decided to
    roll to). Wait for rollout. Don't update the annotation; it's
    already correct.
  * recorded == registry AND spec env == recorded → fully in sync,
    no-op.
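
The three paths can be read as one comparison chain. A minimal sketch, with a hypothetical function name and labels; it only classifies the tick and leaves the kubectl actions to the caller.

```shell
# Classify one tick from the three observed digests.
classify_tick() {
  recorded="$1"   # annotation value (what image-refresh last rolled to)
  registry="$2"   # digest the floating tag currently resolves to
  spec_env="$3"   # live INGESTOR_IMAGE_DIGEST in the deployment spec
  if [ "$recorded" != "$registry" ]; then
    echo registry-drift   # set env to $registry, wait for rollout, update annotation
  elif [ "$spec_env" != "$recorded" ]; then
    echo env-drift        # set env back to $recorded, wait for rollout, keep annotation
  else
    echo in-sync          # nothing to change
  fi
}
```

For example, `classify_tick sha256:a sha256:a sha256:old` prints `env-drift`, which is the `kubectl rollout undo` case: the annotation still matches the registry, but the pod template has been reverted.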

Tests pin:
  * The "spec env drifted" log line.
  * The drift-recovery branch sets env to `${recorded_ingestor}`,
    not `${latest_ingestor}` (different from the registry-drift
    branch). Regex catches the variable used in `INGESTOR_IMAGE_DIGEST=`.

Top-of-script comment block updated to document the drift recovery.
140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): wait for rollout status in adopt-as-baseline branch

Caught in PR #159 review (bugbot, high severity).

Scenario the existing code mishandles:
  1. Tick N: empty-spec-first-obs branch runs `kubectl set env`
     (commits new spec to etcd) → `kubectl rollout status` times out
     → `set -eu` aborts before the annotate. Annotation stays empty.
  2. Tick N+1: annotation still empty. spec env is now non-empty
     (the failed-rollout's spec change persists). get_container_env
     returns that value, so the adopt-as-baseline branch fires.
  3. Adopt-as-baseline only annotates — it never checks rollout
     health. Annotation records "we're at D1" while running pods
     are still on the old/empty env from before tick N.

The deployment now appears reconciled (annotation == registry on
subsequent ticks) while actually being stuck on the wrong image.

Fix: call `kubectl rollout status` inside the adopt-as-baseline
branch before the annotate. On a healthy deployment it returns
near-instantly; on a stuck rollout from a previous failed
set-env it times out, `set -eu` aborts before the annotate, next
tick retries. No latency cost on the happy path.

Regression test pins the order with a (?s) (dot-matches-newline)
regex: adopting → rollout status → annotate.

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): two bugbot follow-ups + chart default tag

Caught in PR #162 review (bugbot, two medium-severity issues).

1. Env-drift rollout retry gap

   The no-op branch (annotation == registry AND spec env == recorded)
   was a bare log statement with no rollout-health verification. A
   previous tick's env-drift `kubectl set env` commits its spec change
   to etcd BEFORE `kubectl rollout status` waits for the new
   ReplicaSet to come up. If the rollout fails, `set -eu` aborts —
   but the spec write persists. Next tick: annotation, registry, and
   spec env all match (because the spec write committed), so the
   no-op branch fires and silently masks the stuck rollout. Running
   pods may be on the old or empty INGESTOR_IMAGE_DIGEST while the
   script reports success.

   Fix: call `kubectl rollout status` in the no-op branch too. On a
   healthy deployment it returns near-instantly (no active rollout
   to wait for). On a stuck deployment it times out, `set -eu` aborts,
   and the failed Job is visible in `kubectl get jobs`. The
   operator then sees the stuck state and can investigate. Image-
   refresh can't autonomously recover from a bad image push, but
   making the failure visible is the right behaviour.

2. Default ingestor tag mismatched team's publishing convention

   Chart defaulted `images.ingestor.tag: prod`. The team's
   ghcr.io/tracebloc/ingestor repo uses semver-style float tags
   (`0`, `0.3`) — there is no `prod` tag. Default install would
   silently no-op every tick because manifest resolution 404'd:

     curl ... ghcr.io/v2/.../manifests/prod → 404
     log "  WARN: could not resolve latest digest; skipping"

   The whole ingestor auto-refresh feature wouldn't work for any
   customer running the chart's defaults, despite `autoRefresh:
   true`.

   Fix: changed default to "0.3" (conservative: tracks 0.3.x patch
   releases only; won't pick up a future 0.4 with breaking changes).
   Operators can override to "0" if they want to track every 0.x
   release, minor bumps included. Long-term, the team should consider
   standardising the chart default once the data-ingestors
   release-image.yml formalises its tag-publishing contract. For now
   this matches what we tested with on the dev cluster.
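
To illustrate the silent no-op in item 2: a sketch of how a missing tag turns into "warn and skip", assuming the digest is read from the `Docker-Content-Digest` response header that registries return for manifest requests. Both function names are hypothetical, and the real `get_latest_digest` may parse the response differently.

```shell
# Extract the manifest digest from HTTP response headers on stdin.
# Prints nothing when the header is absent, e.g. on a 404 for a missing tag.
digest_from_headers() {
  tr -d '\r' | awk 'tolower($1) == "docker-content-digest:" { print $2; exit }'
}

# Hypothetical caller: warn on stderr and signal "skip" instead of crashing.
resolve_or_skip() {
  digest="$(digest_from_headers)"
  if [ -z "$digest" ]; then
    echo "  WARN: could not resolve latest digest; skipping" >&2
    return 1
  fi
  echo "$digest"
}
```

With a tag that does not exist, every tick takes the warning branch, which is why the default install never refreshed the ingestor even though nothing failed.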

Regression tests:
  * Default tag asserted as "0.3" with `notContains` of "prod" to
    guard against silent revert.
  * No-op branch asserted to call `kubectl rollout status` via a
    (?s) (dot-matches-newline) regex matching the "verifying
    deployment health" log line followed by the kubectl rollout
    status call.
  * Existing test updated from value: prod to value: "0.3".

141/141 unit tests pass.

NB: these commits are landing on the sync branch directly to avoid
another full develop-PR cycle before release. After #162 merges,
the same content will need to flow back to develop — either via a
"sync main → develop" PR or by cherry-picking the two commits. The
divergence is two commits and is easy to resolve.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): align INGESTOR_TAG runtime fallback with chart default

Caught in PR #162 review (bugbot, medium severity).

The chart default was changed from "prod" to "0.3" in the values
block — matches the team's ghcr.io publishing convention — but the
CronJob template's runtime fallback was left at `| default "prod"`.
Two render paths:

  * helm install / helm upgrade --reset-then-reuse-values: the
    chart's new default ("0.3") flows through, runtime fallback
    never fires, INGESTOR_TAG="0.3". OK.
  * helm upgrade --reuse-values from a pre-v1.4.1 release:
    the stored values lack `images.ingestor.tag` entirely. Runtime
    fallback fires, renders INGESTOR_TAG="prod", which 404s on
    ghcr.io because that tag doesn't exist. Ingestor refresh
    silently no-ops every tick.

Failure mode is graceful (log warning, no crash), but inconsistent
with the per-customer expectation that v1.4.1 enables ingestor
auto-refresh. autoUpgrade itself uses --reset-then-reuse-values, so
this only hits manual --reuse-values upgrades — narrow but real.

Fix: change runtime fallback to "0.3" so both render paths converge.

Regression test simulates the --reuse-values scenario by setting
images.ingestor.tag=null, exercising the runtime fallback. 142/142
tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Brings the bugbot-followup commits from #162 back to develop:
- edd22f2: rollout-status check in no-op branch + chart default
  tag prod → 0.3
- 15bc136: INGESTOR_TAG runtime fallback prod → 0.3

These landed on the sync branch directly during release prep to
avoid an extra develop-PR cycle. Re-applying them here so develop
stays current.
@saadqbal saadqbal self-assigned this May 26, 2026
@LukasWodka
Contributor

👋 Heads-up — Code review queue is at 16 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@saadqbal saadqbal merged commit 1bd3fcc into develop May 26, 2026
4 checks passed