Sync main → develop after v1.4.1 release #163
Merged
* Add NetworkPolicy locking down training-pod egress
Training pods run untrusted ML code uploaded by external data scientists.
This policy selects on the tracebloc.io/workload=training label (injected
by jobs-manager in the companion client-runtime PR) and:
- Denies all ingress (nothing should connect TO a training pod).
- Allows DNS to the cluster DNS service.
- Allows external TCP/443 only; blocks all pod-to-pod, ClusterIP, and
other in-cluster traffic via ipBlock with cluster-CIDR exclusions.
Training pods can still reach tracebloc backend, Azure Service Bus, and
App Insights (external HTTPS). They can no longer reach mysql-client,
the K8s API server, the jobs-manager pod IP, or other training pods.
Per-platform defaults:
AKS: enabled=true (requires Azure NPM or Calico at cluster create)
EKS: enabled=false (AWS VPC CNI does not enforce NetworkPolicy; safer
to explicitly disable than silently have no effect)
BM: enabled=true (requires Calico / Cilium / kube-router)
OC: enabled=true (OVN-Kubernetes enforces by default; custom DNS
selector and OpenShift pod/service CIDRs)
The dnsSelector default is empty with a template-side fallback to
{k8s-app: kube-dns} to avoid Helm's map-merge semantics surprising
customers who override it (OpenShift's selector would otherwise be
unioned with the default rather than replacing it).
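The map-merge concern above can be modelled in a few lines of Python. This is a minimal sketch of recursive value coalescing, not Helm's actual implementation; `deep_merge`, `effective_selector`, and the OpenShift label used in the override are illustrative assumptions.

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides into defaults; nested maps are unioned key by key."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


FALLBACK_SELECTOR = {"k8s-app": "kube-dns"}


def effective_selector(values: dict) -> dict:
    """Template-side fallback: apply the default only when nothing was set."""
    return values.get("dnsSelector") or FALLBACK_SELECTOR


# Hypothetical OpenShift-style override (label chosen for illustration only).
override = {"dnsSelector": {"dns.operator.openshift.io/daemonset-dns": "default"}}

# With a non-empty chart default, the override is unioned with it.
unioned = deep_merge({"dnsSelector": dict(FALLBACK_SELECTOR)}, override)

# With an empty chart default, the override is the only selector left.
replaced = deep_merge({"dnsSelector": {}}, override)
```

In this model `unioned["dnsSelector"]` carries both labels, while `effective_selector(replaced)` returns only the override and an untouched install still falls back to `{k8s-app: kube-dns}`.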
- templates/network-policy-training.yaml: new policy (gated on
networkPolicy.training.enabled)
- values.yaml + values.schema.json: new networkPolicy.training block
- ci/{aks,eks,bm,oc}-values.yaml: per-platform overrides with notes
- tests/network_policy_test.yaml: 8 helm-unittest cases covering
rendering, ingress denial, DNS allow, external HTTPS allow, cluster
CIDR blocking, and the OpenShift selector override
No effect until the companion client-runtime PR lands, which adds the
tracebloc.io/workload=training label to spawned training pods.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add optional Namespace resource with Pod Security Admission labels (#43)
* Add optional Namespace resource with Pod Security Admission labels
Layers Kubernetes Pod Security Admission on top of the per-pod
securityContext work for defense-in-depth. Off by default -- enabling
requires a greenfield install, since the chart does not currently own
the release namespace on existing deployments.
When namespace.create is true, the chart templates a Namespace with:
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/audit: restricted
helm.sh/resource-policy: keep
Warn + audit surface any pod-spec violation as a kubectl warning and
an audit-log event, without rejecting the pod. This gives us a
tripwire for future regressions in our own pod specs (jobs-manager,
mysql, resource-monitor, training pods) and for any third-party pods
in the same namespace.
Enforce mode is deliberately left UNSET. Two of our own workloads
would be rejected under enforce: restricted:
- mysql init containers run as UID 0 (needed to chown the PVC
before the main container -- UID 999 -- starts)
- resource-monitor DaemonSet mounts hostPath /proc and /sys
Enabling enforce before those are refactored (or moved to a separate
namespace) would break the chart. Customers who want full enforcement
can set namespace.podSecurity.enforce = restricted after auditing
their own deployment; the current defaults keep them safe.
helm.sh/resource-policy: keep prevents helm uninstall from deleting
the Namespace, which would otherwise take the PVC-backed training
data and MySQL state with it.
- templates/namespace.yaml: new, gated on namespace.create (default false)
- values.yaml: new namespace block with long comments
- values.schema.json: schema entries for namespace.create + podSecurity
- tests/namespace_test.yaml: 8 helm-unittest cases (toggle off, toggle
on, keep annotation, labels, version strings, enforce omitted when
empty, enforce present when set, baseline override, namespace name
respects release)
- docs/INSTALL.md: section explaining the greenfield vs existing-ns
paths with copy-pasteable kubectl label commands
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix kubeVersion constraint to accept cloud pre-release suffixes
Helm's semver parser excludes pre-release versions from >= ranges by
default, so ">=1.24.0" rejected EKS ("1.34.4-eks-f69f56f"), GKE
("-gke-*"), and AKS release-tagged versions. Changing to ">=1.24.0-0"
explicitly opts the constraint into matching pre-releases, which is
how managed-Kubernetes providers encode their vendor suffix.
Surfaced while dry-run-installing PR #43 against a dev EKS cluster.
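The pre-release rule described above can be sketched as a simplified Python model. It is not Helm's semver library: `parse` and `satisfies_ge` are hypothetical helpers, and the model only handles `>=` constraints whose pre-release part is absent or `0`.

```python
import re

_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")


def parse(text: str):
    """Split 'MAJOR.MINOR.PATCH[-prerelease]' into ((major, minor, patch), prerelease)."""
    match = _SEMVER.match(text)
    if match is None:
        raise ValueError(f"not a semantic version: {text!r}")
    major, minor, patch, pre = match.groups()
    return (int(major), int(minor), int(patch)), pre


def satisfies_ge(version: str, constraint: str) -> bool:
    """Model of '>=constraint' where pre-releases only match if the constraint opts in."""
    v_core, v_pre = parse(version)
    c_core, c_pre = parse(constraint)
    if c_pre not in (None, "0"):
        raise NotImplementedError("sketch only supports a '-0' pre-release suffix")
    if v_pre is not None and c_pre is None:
        # A vendor suffix such as '-eks-f69f56f' parses as a pre-release,
        # so a plain '>=1.24.0' range skips it.
        return False
    # '-0' is the lowest possible pre-release, so comparing cores is enough.
    return v_core >= c_core
```

Under this model `satisfies_ge("1.34.4-eks-f69f56f", "1.24.0")` is rejected and `satisfies_ge("1.34.4-eks-f69f56f", "1.24.0-0")` is accepted, which mirrors the behaviour the commit describes.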
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
* Add consolidated SECURITY.md covering the training-pod sandbox (#44)
Brings together the threat model, defense layers, per-platform
caveats, operator responsibilities, residual risks, and verification
steps into one reviewable artifact. Covers the complete hardening
posture as shipped across the chart + jobs-manager + new-arch
training images.
Sections:
1. Threat model: trusted platform, untrusted external-data-
scientist submissions. Explicit in-scope / out-of-scope.
2. Seven design goals (G1-G7) for the training-pod sandbox,
each mapped to current status on new-arch vs. legacy.
3. Architecture overview.
4. Defense layers -- credential isolation, network egress,
K8s API access, container runtime hardening, storage
isolation, cross-tenant forgeability, admission tripwire.
5. Per-platform caveats -- NetworkPolicy CNI matrix (AKS/EKS/
bare-metal/OpenShift), PSA version requirements, OpenShift
DNS selector override, runAsUser + arbitrary UIDs, bare-
metal hostPath note.
6. What operators must do themselves -- rotate secrets, verify
CNI enforces, label existing namespaces, monitor audit,
upgrade ordering, refactor path for enforce: restricted.
7. Verification -- copy-pasteable kubectl snippets for each
defense layer.
8. Residual risks with explicit ownership -- global SB conn
strings (backend), HTTPS egress (platform endgame), token
TTL (backend), legacy arch (migration team), PSA enforce
(chart refactor), CNI silent no-op (operator), kernel
escape (out of scope), resource DoS (out of scope).
9. Compromise response playbook.
10. Where each defense is implemented (code-path map for
reviewers).
11. Document history.
Also:
- README.md: add Security subsection under Deployment Guide
linking to docs/SECURITY.md.
- docs/INSTALL.md: prerequisite note about CNI enforcement.
No code changes; documentation only.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add docs/MIGRATIONS.md and CLAUDE.md for Helm chart migration safety (#47)
Document the helm.sh/resource-policy=keep gotcha: Helm reads the
annotation from the stored release manifest, not live resources, so
kubectl annotate alone does not protect PVCs from helm uninstall.
Includes the 2026-04-22 tracebloc-templates migration as a case study
and three mitigation options (helm upgrade, strip ownership, or rely
on PV Retain + recreate).
* docs(client): add pre-Helm resource-monitor cleanup step to MIGRATION.md (#49)
Early-era edges were installed with a hand-rolled `resource-monitor`
DaemonSet via raw `kubectl apply` before the per-platform charts existed.
The unified chart's `tracebloc-resource-monitor` DaemonSet replaces it,
but the legacy DS is unmanaged and keeps running after migration, mounting
hostPath /proc + /sys and blocking PSA `enforce=restricted` on the namespace.
Adds a step-6 section documenting the kubectl cleanup (DS + SA + ClusterRole
+ ClusterRoleBinding, all named `resource-monitor`) with a safety check to
confirm the ClusterRole/Binding aren't shared before deletion.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(mysql): drop root init-containers, add PSA-restricted securityContext (#48)
* feat(mysql): drop root init-containers, add PSA-restricted securityContext
Unblocks pod-security.kubernetes.io/enforce: restricted on the release
namespace. Previously the mysql-client pod had two init-containers
running as UID 0 to chown /var/lib/mysql and /var/log/mysql to 999:999
before mysqld started. PSA restricted rejects runAsUser: 0 on any
container, so these init-containers were the last blocker to promoting
the namespace from warn/audit to enforce.
The pod already had `fsGroup: 999` + `fsGroupChangePolicy: OnRootMismatch`
at the pod level, which kubelet uses to chgrp mounted volumes on first
mount. Once that is in place the init-container chowns are redundant:
- On existing PVCs (already owned 999:999 from the prior init-container
chown) OnRootMismatch sees the correct root ownership and skips the
recursive chgrp — mount is instant, no behavior change.
- On fresh PVCs kubelet applies fsGroup before the main container starts.
- On emptyDir (the logs volume) kubelet applies fsGroup at volume
creation.
Also adds a container-level securityContext with all six fields PSA
restricted requires:
- runAsNonRoot: true
- runAsUser / runAsGroup: 999 (matches the mysql:5.7.41 base image's
default user, and the entrypoint skips its root-to-mysql gosu re-exec
when already running as 999)
- allowPrivilegeEscalation: false
- capabilities: drop all
- seccompProfile: RuntimeDefault
Scope: client chart only (now the universal chart covering eks/aks/bm/oc).
Caveats for customers:
- Requires a CSI driver with fsGroupPolicy=File or ReadWriteOnceWithFSType
(EBS, AzureDisk, GCE-PD, CephRBD all qualify). NFS v3 and some
object-backed drivers do not; chart docs should flag this in a
follow-up.
Deferred to separate PR:
- readOnlyRootFilesystem on the mysql container (needs emptyDir mounts
for /tmp, /run/mysqld, /var/lib/mysql-files; real regression risk).
* fix(mysql): restore chown init-container for hostPath (bare-metal)
kubelet does not apply fsGroup ownership to hostPath volumes
(kubernetes/kubernetes#138411), so bare-metal installs need a
privileged bootstrap to chown /var/lib/mysql to 999:999 on first
start. Gated on .Values.hostPath.enabled so CSI-backed deployments
(EKS/AKS/OC) keep the clean no-init, PSA-restricted-compliant form.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Move tracebloc-resource-monitor to dedicated privileged namespace (#50)
* Move tracebloc-resource-monitor to dedicated privileged namespace
Pod Security Admission's `restricted` profile bans hostPath volumes
outright, and the resource-monitor DaemonSet needs hostPath /proc and
/sys to read node-level metrics. Previously, setting
`pod-security.kubernetes.io/enforce: restricted` on the release
namespace (tracebloc-templates) would reject the DaemonSet outright,
and `warn=restricted` + `audit=restricted` already spam violations.
This isolates the DaemonSet in a new dedicated namespace
(tracebloc-node-agents, configurable via `nodeAgents.namespace.name`)
that carries `pod-security.kubernetes.io/{enforce,warn,audit}:
privileged` labels. The release namespace is no longer constrained by
the node-agent and can run `enforce: restricted` once the mysql init
refactor lands.
Changes:
- templates/node-agents-namespace.yaml: new, gated on
nodeAgents.namespace.create (default true) and resourceMonitor
- templates/resource-monitor-daemonset.yaml: deploy into node-agents ns
- templates/resource-monitor-rbac.yaml: SA + (Cluster)RoleBinding in
node-agents ns
- templates/resource-monitor-scc.yaml: SCC users + CRB subject updated
(OpenShift path)
- values.yaml + values.schema.json: new `nodeAgents.namespace` block
- templates/namespace.yaml + docs/INSTALL.md: drop resource-monitor
from the enforce-blocker list; document the new node-agents ns
- tests/node_agents_namespace_test.yaml: 12 new unittest cases
Upgrade impact: existing installs will see the DaemonSet / SA /
(Cluster)RoleBinding deleted from the release namespace and recreated
in the node-agents namespace during `helm upgrade`. Brief (~seconds)
gap in node metrics during rollout; no persistent data involved.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Mirror secrets into node-agents ns; keep namespace RBAC in release ns
Two follow-ups from review of the namespace-split change:
1. Secrets are namespace-scoped — a pod in `tracebloc-node-agents`
cannot `secretKeyRef` a Secret that only exists in the release
namespace. The resource-monitor DaemonSet was referencing CLIENT_ID /
CLIENT_PASSWORD from `tracebloc.secretName` and the registry pull
secret, both of which template only into `.Release.Namespace`, so
pods would have failed to start with CreateContainerConfigError.
templates/secrets.yaml and templates/docker-registry-secret.yaml now
template a second copy into `nodeAgents.namespace.name` when:
resourceMonitor != false AND node-agents ns != release ns
The mirror is skipped when the two namespaces collide (e.g. operator
points nodeAgents.namespace.name back at the release namespace) so
Helm does not try to create two resources with the same name.
2. When clusterScope: false, the Role must live in the RELEASE
namespace because that is where the monitored workloads run — a
namespace-scoped Role only grants access to its own namespace.
Previously this PR put the Role in `tracebloc-node-agents`, which
would have silently broken the resource-monitor for anyone not
using ClusterRole. Role + RoleBinding are now back in
`.Release.Namespace`; the RoleBinding subject still points at the
ServiceAccount in the node-agents namespace (cross-namespace
subjects in RoleBindings are valid).
Tests updated accordingly; 5 new cases cover mirror-on, mirror-off
(resourceMonitor=false), mirror-off (namespaces collide), dockercfg
mirror, and the corrected Role/RoleBinding placement.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(resource-monitor): pin NAMESPACE env to release ns; guard node-agents ns==release ns
Two review fixes from the PSA hardening change:
1. NAMESPACE env var was using Downward API fieldPath: metadata.namespace,
which now resolves to the node-agents namespace (where the DaemonSet
pods live) instead of the release namespace (where the monitored
workloads live). Replace with the literal Release.Namespace so the
monitor continues to watch the right namespace regardless of where
its own pods run.
2. node-agents-namespace.yaml would stamp privileged PSA labels onto the
release namespace if an operator set nodeAgents.namespace.name to the
release namespace (and with namespace.create=true it would render two
Namespace docs with the same name — a render-time collision). Add an
equality guard so the template is a no-op in that configuration.
Adds one test covering the NAMESPACE env fix; tests: 74/74 pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(mysql): set readOnlyRootFilesystem on mysql-client (#52)
Completes container runtime hardening (G4) for mysql-client. Adds three
emptyDir mounts for the paths mysqld writes to at runtime that are NOT
already on PVC or log volumes:
- /var/run/mysqld pid file + unix socket
- /tmp temp tables, sort buffers, LOAD DATA staging
- /var/lib/mysql-files default secure_file_priv dir (touched at start)
Verified via helm upgrade on EKS (tb-client-dev-templates /
tracebloc-templates): pod Ready, readOnlyRootFilesystem=true, `touch /etc/x`
rejected as Read-only, mysqld.sock + mysqld.pid present under /var/run/mysqld,
existing DB data intact in /var/lib/mysql.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(psa): enforce=restricted by default on CSI; bare-metal overrides (#51)
- values.yaml: namespace.podSecurity.enforce flipped to "restricted".
- ci/bm-values.yaml: overrides enforce to "" because kubelet does not
apply fsGroup to hostPath volumes (kubernetes/kubernetes#138411),
forcing the chart to render a privileged init-mysql-data chown
container that PSA restricted would reject. warn+audit remain on.
- namespace.yaml docstring + SECURITY.md (§4.7, §6.3, §6.6, §8.5)
updated to document the CSI-default / bare-metal-override split.
Verified with helm template --set namespace.create=true against both
eks-values.yaml (enforce rendered) and bm-values.yaml (enforce absent).
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(installer): slim k3d and add dev overrides for local testing (#54)
The tracebloc client is outbound-only: jobs-manager and pods-monitor
dial out to the platform, and the only in-cluster Service is mysql-client
(ClusterIP). The bundled k3s ingress/LB stack and metrics-server are
unused overhead, and the chart ships its own StorageClass.
Drop the loadbalancer port mappings (HTTP_PORT/HTTPS_PORT) plus their
validation/help/log references, and pass --k3s-arg "--disable=..." for
traefik, servicelb, metrics-server, and local-storage to k3d cluster
create. Applied symmetrically in scripts/install-k8s.ps1.
Also add two env vars for local-chart testing in install-client-helm.sh:
TRACEBLOC_CHART_PATH install from a local chart path instead of the
published tracebloc/client Helm repo (skips
helm repo add/update)
TRACEBLOC_VALUES_FILE use the caller-supplied values file as-is and
skip the clientId/password prompts + values.yaml
generation
With both set, the installer can exercise the full flow end-to-end
against unreleased chart changes before publishing.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(client): harden image pinning and credentials (v1.0.4) (#53)
Address the High-severity findings from the client chart security review:
- Add digest support to tracebloc.image helper and images.* values for
jobs-manager, pods-monitor, mysql-client, and busybox. When a digest is
set, the image is rendered as repo@sha256:... and imagePullPolicy drops
to IfNotPresent (immutable pin, auditable rollout).
- Replace the hard-coded mysql-client:latest with a configurable tag that
defaults to "prod". The schema rejects "latest" outright; operators
wanting absolute pinning should set images.mysqlClient.digest.
- Harden the bare-metal mysql init-container: still runs as root (kubelet
does not apply fsGroup to hostPath volumes, k8s#138411), but now with
drop: [ALL] + add: [CHOWN], allowPrivilegeEscalation: false,
readOnlyRootFilesystem: true, and seccompProfile: RuntimeDefault.
- Remove deceptive "<CLIENT_ID>" / "<CLIENT_PASSWORD>" placeholder defaults.
The defaults are now empty strings; the schema and template both reject
empty values and <...> placeholder patterns so deployments fail fast
instead of silently encoding a placeholder into the Secret.
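The fail-fast credential check in the last bullet might look roughly like the following Python sketch. The regular expression and the `validate_credential` name are assumptions made for illustration; the chart's real schema pattern is not shown in this log.

```python
import re

# Assumed placeholder shape: a value wrapped in angle brackets, e.g. "<CLIENT_ID>".
_PLACEHOLDER = re.compile(r"^<.*>$")


def validate_credential(name: str, value: str) -> str:
    """Reject empty values and <...> placeholders instead of storing them."""
    if not value:
        raise ValueError(f"{name} must not be empty")
    if _PLACEHOLDER.match(value):
        raise ValueError(f"{name} still contains a placeholder: {value!r}")
    return value
```

Raising at validation time means a misconfigured install stops before any Secret is rendered, rather than shipping a Secret that contains the literal placeholder text.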
Bump chart version 1.0.3 -> 1.0.4. All 76 unit tests pass.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(client): require metrics-server for resource-monitor (v1.0.5) (#55)
The tracebloc-resource-monitor DaemonSet queries the metrics.k8s.io API
for node CPU/memory. Without metrics-server registered, the DaemonSet
crash-loops with 404s against /apis/metrics.k8s.io/v1beta1 — silently,
every few seconds. Found during a bare-metal smoke test on a k3d cluster
where metrics-server had been explicitly disabled.
- scripts/lib/cluster.sh: drop --disable=metrics-server from the k3d
create args. k3s bundles metrics-server; the earlier comment claiming
the chart "ships its own" was wrong — the DaemonSet is a consumer of
metrics-server, not a replacement.
- client/templates/resource-monitor-daemonset.yaml: add a pre-install
`lookup` that fails the release up front when resourceMonitor is true
but v1beta1.metrics.k8s.io is not registered. Guarded by a kube-system
probe so offline `helm template` still renders.
- client/values.yaml: document the dependency inline on resourceMonitor,
with per-platform install notes (k3d/AKS bundled; EKS/OC/bare-metal
need manual install).
- docs/SECURITY.md: call out the dependency and the escape hatch
(resourceMonitor: false) in the architecture section.
- Chart.yaml: 1.0.4 -> 1.0.5.
Verified on a fresh k3d cluster (no --disable=metrics-server): metrics
API comes up in ~30s, smoke install succeeds, resource-monitor reaches
Running with zero ERROR/404 lines. Pre-flight fail path also verified
against a metrics-less cluster.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(mysql): drop chmod from hostPath init (v1.0.6) (#56)
The init-container runs as UID 0 with capabilities drop:[ALL] add:[CHOWN].
After 'chown 999:999' transfers ownership, the subsequent 'chmod 755' runs
as a non-owner without CAP_FOWNER and returns EPERM on re-install where
the hostPath dir already exists from a prior run. Reversing the order
does not help (chmod first still fails once the dir is 999-owned from
any previous successful run).
kubelet creates hostPath dirs at 0755 via DirectoryOrCreate, so the chmod
was a no-op on fresh installs and broken on re-installs. Drop it.
Verified on k3d/AWS VM:
- fresh install: kubelet-created root:root dir -> chown succeeds -> 999:999
- re-install: pre-existing 999:999 dir with data -> chown no-op -> data intact
* Chore/merge main into develop (#58)
* Update README.md
* Add narrow CODEOWNERS for security-sensitive paths
* Remove metrics-server disable argument from k3d cluster creation in install-k8s.ps1 to ensure proper functionality of the resource-monitor DaemonSet, which relies on the metrics API. This change aligns with previous updates that emphasized the necessity of metrics-server for monitoring capabilities.
---------
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* Merge pull request #60 from tracebloc/fix/resource-monitor-digest-pinning
fix(client): pin resource-monitor by digest (v1.0.7)
* chore: add auto-add to engineer kanban workflow (#45)
* Add auto-add to engineer kanban workflow
* fix(ci): pin actions/add-to-project to v1.0.2
@v1 is not a valid tag — action publishes full semver only. Pin to v1.0.2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8) (#61)
* fix(client): reject empty clusterCidrs on training NetworkPolicy (v1.0.8)
When `networkPolicy.training.enabled: true` and `clusterCidrs: []`, the
template's range loop produced no items, so `except:` rendered as null.
Kubernetes interprets a null `except` as "no exceptions" to `cidr: 0.0.0.0/0`,
silently granting training pods unrestricted port-443 egress to MySQL, the
K8s API, jobs-manager, and every other in-cluster destination the policy
is meant to block.
Gate the misconfiguration at two levels:
- `values.schema.json`: add `minItems: 1` to clusterCidrs (fires at
helm install/upgrade validation)
- `network-policy-training.yaml`: add a `{{ fail }}` guard as
defense-in-depth for schema-bypass paths (helm template --validate=false)
- `tests/network_policy_test.yaml`: add a unit test asserting the failure
Credit: bug bot finding.
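A minimal Python sketch of the template-side guard, assuming a hypothetical `render_external_https_rule` helper that builds the same structure the NetworkPolicy template emits; the error text is illustrative.

```python
def render_external_https_rule(cluster_cidrs: list) -> dict:
    """Build the external-HTTPS egress rule, refusing to render without exclusions."""
    if not cluster_cidrs:
        # Without this guard the 'except' list would be empty, and the rule
        # would allow TCP/443 to every address, including in-cluster ones.
        raise ValueError(
            "networkPolicy.training.clusterCidrs must list at least one CIDR "
            "when the policy is enabled"
        )
    return {
        "to": [{"ipBlock": {"cidr": "0.0.0.0/0", "except": list(cluster_cidrs)}}],
        "ports": [{"protocol": "TCP", "port": 443}],
    }
```

Failing at render time is the conservative choice here: an operator sees an explicit error instead of a policy that looks present but blocks nothing inside the cluster.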
* fix(client): tolerate missing images.resourceMonitor on --reuse-values upgrade
Caught by a live k3d upgrade 1.0.6 → 1.0.8: releases installed before
PR #60 have no `images.resourceMonitor` block in their stored values, so
`helm upgrade --reuse-values` nil-pointered on `.Values.images.resourceMonitor.digest`.
- Read the digest via nested `default (dict)` so a missing `images` map
AND a missing `resourceMonitor` entry both fall through to "" safely.
`dig` would be cleaner but it rejects chartutil.Values.
- Add tests/resource_monitor_test.yaml with a regression case that sets
`images: null` and asserts the DaemonSet still renders with the tag
fallback.
Scope limited to resourceMonitor: the other images (jobsManager,
podsMonitor, mysqlClient, busybox) were introduced together in PR #53
(1.0.4), so anyone on 1.0.4+ already has those blocks in stored values.
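The nested-default lookup can be illustrated in Python. This is a sketch of the idea only; the actual fix lives in the Helm template, and `resource_monitor_digest` is a hypothetical name.

```python
def resource_monitor_digest(values: dict) -> str:
    """Return the digest, or "" when 'images' or 'resourceMonitor' is missing or null."""
    images = values.get("images") or {}
    resource_monitor = images.get("resourceMonitor") or {}
    return resource_monitor.get("digest") or ""
```

Each `or {}` plays the role of one `default (dict)` step, so stored values from a release that predates the `images.resourceMonitor` block fall through to the tag-based image reference instead of raising.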
* fix(client): scope clusterCidrs minItems guard to enabled=true only
Bug bot flagged that the unconditional minItems:1 constraint on
networkPolicy.training.clusterCidrs rejects `enabled: false` +
`clusterCidrs: []` — a legitimate minimal config for operators on
non-enforcing CNIs who disable the policy entirely.
Move the constraint behind a JSON Schema draft 7 if/then at the
`training` object level: minItems:1 applies only when enabled=true.
The template-side fail guard was already correctly scoped inside the
`.Values.networkPolicy.training.enabled` check, so no template change
is needed — this aligns the schema with the template.
Add a unittest covering `enabled: false` + `clusterCidrs: []`
(schema must pass, no policy rendered).
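To make the if/then scoping concrete, here is a small Python sketch. `TRAINING_RULE` is an assumed shape for the schema fragment, and `check_training` is a hand-written evaluator for that one shape, not a general JSON Schema validator.

```python
# Assumed shape of the conditional constraint on the `training` object.
TRAINING_RULE = {
    "if": {"properties": {"enabled": {"const": True}}, "required": ["enabled"]},
    "then": {"properties": {"clusterCidrs": {"minItems": 1}}},
}


def check_training(training: dict) -> list:
    """Apply minItems to clusterCidrs only when enabled is exactly true."""
    condition = TRAINING_RULE["if"]
    applies = all(key in training for key in condition["required"]) and all(
        training.get(key) is spec["const"]
        for key, spec in condition["properties"].items()
    )
    if not applies:
        return []
    errors = []
    for key, spec in TRAINING_RULE["then"]["properties"].items():
        if key in training and len(training[key]) < spec["minItems"]:
            errors.append(f"{key}: expected at least {spec['minItems']} item(s)")
    return errors
```

With this rule a disabled policy with an empty list passes, while an enabled policy with an empty list is reported.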
---------
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
* Merge pull request #62 from tracebloc/fix/release-workflow-lint
Enhance CI workflows and fix MySQL resource management issues
* Merge pull request #71 from tracebloc/docs/migrations-correct-option-b
docs(migrations): correct Option B + add hasan-prod case + active-jobs pre-flight
* chore: add default CODEOWNERS for auto-reviewer assignment (#73)
* ci: add kanban closure-routing caller workflow (#75)
* fix(client): release-scope resource-monitor names so multiple releases coexist (v1.2.0) (#72)
Two client releases on the same cluster could not both deploy the
resource-monitor DaemonSet because several resources templated into the
shared tracebloc-node-agents namespace used the literal name
`tracebloc-resource-monitor` rather than a release-scoped name. The
second `helm install` failed with:
Error: ServiceAccount "tracebloc-resource-monitor" in namespace "tracebloc-node-agents" exists and cannot be imported into the current release: invalid ownership metadata; ... must equal "hasan-prod": current value is "stg".
Surfaced during the 2026-04-27 hasan-prod migration on
tracebloc-templates-prod; worked around at the time by setting
resourceMonitor: false on the second release, which means prod customers
currently lose their per-CLIENT_ID metric stream until this lands.
What changed:
- New helper `tracebloc.resourceMonitorName` ->
`<Release.Name>-resource-monitor`, centralised in _helpers.tpl alongside
the existing per-release name helpers (secretName, serviceAccountName,
etc.).
- DaemonSet metadata.name, spec.selector.matchLabels.app, pod label
app=, and spec.template.spec.serviceAccountName all now go through the
helper. The selector + pod label have to move together because
DaemonSet selectors are namespace-scoped: two DaemonSets in
tracebloc-node-agents both selecting `app: tracebloc-resource-monitor`
would each grab the other's pods, which is worse than the surface bug.
- ServiceAccount metadata.name (resource-monitor-rbac.yaml) goes through
the helper.
- ClusterRole / ClusterRoleBinding / Role / RoleBinding metadata.name
were already release-scoped (`tracebloc-resource-monitor-<release>`)
and stay as-is to avoid an unnecessary ClusterRole rename for upgrading
installs. Only the *subject* names in (Cluster)RoleBinding change to
point at the new SA.
- Mirrored secrets (CLIENT_ID + dockerconfigjson) in
tracebloc-node-agents: the secret names were already release-scoped via
tracebloc.secretName / tracebloc.registrySecretName so they did not
collide. Their `app` label was the literal value, which is harmless on
uniquely-named resources but inconsistent -- updated for consistency.
- Chart bumped 1.1.0 -> 1.2.0. Per-release naming of cluster-singleton
resources is a behaviour change for existing installs (DaemonSet name,
ServiceAccount name, and selector label all change), so a minor bump
signals that operators should review.
Tests: 93 -> 98. New cases cover:
- DaemonSet name + selector + serviceAccountName all release-scoped
- ServiceAccount name release-scoped
- ClusterRoleBinding subject points at the release-scoped SA
- A second `helm template` with a different release name produces
non-colliding names
Verified end-to-end via `helm template stg ./client` and
`helm template hasan-prod ./client` on the same chart: ServiceAccount,
DaemonSet, and ClusterRoleBinding subject names all diverge per release.
Upgrade path from 1.1.0:
The DaemonSet and ServiceAccount rename triggers a Helm three-way merge
that DELETEs the old `tracebloc-resource-monitor` resource and CREATEs
the new release-scoped one. ~30-60s gap on each node where resource
metrics are not collected. DaemonSet selector is immutable, so the
delete-then-create path is what we want -- helm upgrade handles this
automatically because the names diverge in the stored manifest. No
manual orphan cleanup needed.
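The release-scoped naming for #72 can be sketched in Python. The function names are hypothetical stand-ins for the `tracebloc.resourceMonitorName` helper, and the sketch ignores any length truncation the real helper may apply.

```python
def resource_monitor_name(release_name: str) -> str:
    """Model of the helper: '<Release.Name>-resource-monitor'."""
    return f"{release_name}-resource-monitor"


def rendered_names(release_name: str) -> dict:
    """Names that must move together: DaemonSet, ServiceAccount, selector label."""
    name = resource_monitor_name(release_name)
    return {"daemonset": name, "serviceaccount": name, "selector_app": name}


def collisions(first_release: str, second_release: str) -> set:
    """Resource kinds whose names would clash in the shared node-agents namespace."""
    first = rendered_names(first_release)
    second = rendered_names(second_release)
    return {kind for kind in first if first[kind] == second[kind]}
```

For example, `collisions("stg", "hasan-prod")` is empty, whereas with the old literal name every kind would have collided.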
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(client): allow training pods to reach mysql-client (v1.2.1) (#76)
The training-egress NetworkPolicy added in v1.1.0 only permitted DNS and
external TCP/443. Training pods load their dataset from the in-namespace
mysql-client over TCP/3306
(core/utils/database.py::load_dataframe_from_sql_table), so under any
CNI that actually enforces NetworkPolicy the connect failed with errno
111 and the Job CrashLoopBackOff'd before the first batch:
Database connection failed: 2003 (HY000): Can't connect to MySQL server on 'mysql-client:3306' (111)
RuntimeError: Database connection is not available for load_dataframe_from_sql_table
Surfaced on a fresh client install (k3d / k3s, which enforces policy via
the built-in kube-router) where jobs-manager could reach mysql but every
training Job spawned with tracebloc.io/workload=training could not.
Add a third egress rule scoped to podSelector {app: mysql-client} on
TCP/3306. Same-namespace by default (no namespaceSelector), so it stays
tight to the chart's own mysql pod and does not open the namespace
generally. The egress[1] /32 ipBlock comment is updated to note that
MySQL is now explicitly re-permitted by egress[2].
Verified on a k3d cluster: pre-fix nc to mysql-client:3306 from a pod
with the training label was refused; post-fix it connects.
* docs(migration-tools): tenant migration runbook for eks-1.0.x → client-1.x (#74)
* docs(migration-tools): tenant migration runbook for eks-1.0.x -> client-1.x
Captures the operational tooling validated during the 2026-04-27 stg and
hasan-prod migrations and generalises it for the remaining tenants (bmw,
cisco, charite) and any future tenant on the legacy chart family.
What's here:
- README.md walks the workflow + recommended ordering for the pending
set + skip rationale for chart toggles (resourceMonitor: false,
priorityClass.create: false, etc).
- generate.sh consumes a tenant-config.env (gitignored) and emits, per
tenant, /tmp/tracebloc-migration-<tenant>/{values,storageclass,pvcs}.yaml.
Refuses to expand placeholder __FOO__ rows so an operator running
generate.sh against the unmodified template fails fast.
- migrate-tenant.sh is the parameterised runbook. `phase1` is
non-destructive (mysqldump-then-chunked-cp, AWS Backup on-demand
recovery point, dry-run render). `phase2` is one-shot per tenant (helm
uninstall, claimRef clear, SC re-create, PVC pre-create with
release-scoped Helm ownership stamp, helm install, verify mysql data +
keep annotation in stored manifest).
- tenant-config.example.env is the template; populated copy is the
secret-bearing artifact and must stay local.
No real secrets in any committed file:
- DOCKER_PASSWORD placeholder (__DOCKER_HUB_PERSONAL_ACCESS_TOKEN__)
- per-tenant CLIENT_ID / CLIENT_PASSWORD placeholders
- MYSQL_ROOT_PW placeholder (it's image-baked; required from env at
runtime, no committed default)
- .gitignore now excludes docs/migration-tools/tenant-config.env (only
the .example variant is tracked)
Operational notes:
- Every kubectl/helm call passes --context explicitly. The 2026-04-27
prod run hit a context-drift bug mid-migration; the explicit form is a
hard requirement.
- values.yaml ships with resourceMonitor: false. Flip true after the
release-scoped resource-monitor names land in client-1.2.0 (separate
PR). Until then the shared SA in tracebloc-node-agents collides with the
stg release.
- Phase 1 is idempotent and re-runnable. Phase 2 is destructive and
one-shot per tenant. Operators should pause and eyeball Phase 1 outputs
before running Phase 2 -- that's deliberately not automated.
Once all four pending tenants are on client-1.x, this directory is
historical. client-1.x -> client-1.y upgrades follow plain
`helm upgrade` because the new chart already templates
`helm.sh/resource-policy: keep` on PVCs, so the migration protocol isn't
needed for routine upgrades.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(migration-tools): address bugbot review feedback on PR #74
Three issues flagged by Cursor Bugbot on the migration scripts:
  * migrate-tenant.sh used macOS-only `md5 -q` and `stat -f%z` for
    chunked-cp verification (HIGH). Linux operators would abort Phase 1
    mid-transfer. Add portable `_md5` and `_size` helpers that pick
    md5sum on Linux, fall back to md5(1) on macOS, and use `wc -c`
    instead of stat for size.
  * generate.sh placeholder gate inspected only CLIENT_ID +
    CLIENT_PASSWORD + PV_MYSQL, missing PV_LOGS, PV_DATA, SC_NAME, and
    DOCKER_PASSWORD (MEDIUM). Literal `__FOO__` placeholders silently
    rendered into values.yaml/pvcs.yaml and only blew up at kubectl
    apply / helm install time. Iterate over every per-row field, plus a
    one-shot global check for DOCKER_PASSWORD before the loop. Error
    messages now name the offending field.
  * Phase 2.5 readiness loop was an unbounded
    `while :; do … sleep 5; done` (MEDIUM). After the destructive helm
    uninstall, a non-converging install (image-pull error, mysql
    kill-loop recurrence, missing PVC binding) hung the script forever
    instead of surfacing the failure. Add a wall-clock deadline (default
    600s, override via READY_TIMEOUT) and exit 1 with the last-seen pod
    state on timeout.
* fix(migration-tools): address bugbot follow-up on PR #74
Two more issues raised on the previous fix commit:
  * Readiness wait loop aborted on empty pod list (HIGH). With
    `set -euo pipefail`, the routine post-install window where no pods
    are visible yet caused `grep -c .` to exit 1, killing the script on
    the very first iteration before the wall-clock deadline could ever
    fire, defeating the bounded-wait intent. Guard the empty case
    explicitly. `wc -l` alone is also wrong because `echo ""` prints a
    newline.
  * MYSQL_ROOT_PW skipped the placeholder check that DOCKER_PASSWORD,
    CLIENT_*, and PV_* now have (LOW).
An operator who copied the example without editing this row passed the non-empty gate, then the literal __LEGACY_MYSQL_ROOT_PW__ went into mysqldump and Phase 1 blew up partway through with an opaque "Access denied" inside kubectl exec. Add the same `*__*__*` case guard right after the non-empty check. * fix(migration-tools): make EFS_FS_OVERRIDE actually override (PR #74) The pre-source assignment EFS_FS="${EFS_FS_OVERRIDE:-fs-06b3faf51675ff9f9}" was a no-op: `source "$CONFIG"` runs immediately after and the example config (and any real tenant-config.env derived from it) unconditionally sets EFS_FS=fs-06b3faf51675ff9f9, so the env override was clobbered every time. Operators thinking they were targeting a non-default EFS would silently start AWS Backup on-demand jobs against the hard-coded prod filesystem. Move the override knob to AFTER source where env genuinely wins, drop the hard-coded fallback, and require EFS_FS to be set somewhere (config or override) before continuing. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * fix(client): release-scope SCC SA refs (v1.2.2) (#78) Bugbot caught a High-severity miss in v1.2.0's release-scoping work (PR #72). The OpenShift SCC template was the one resource-monitor file not updated when the literal `tracebloc-resource-monitor` ServiceAccount name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC granted access to a SA name that no longer existed, so the resource- monitor DaemonSet pods would fail to launch (no SCC -> can't mount hostPath /proc and /sys for node metrics). The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` / `tracebloc-resource-monitor-scc-<release>`), so this slipped through — casual reading suggested it was already done. Touchpoints in resource-monitor-scc.yaml: - users[0]: now {{ include "tracebloc.resourceMonitorName" . 
}} - ClusterRoleBinding subjects[0].name: same helper - All `app: tracebloc-resource-monitor` labels: same helper, for consistency with the rest of the chart's resource-monitor templates - Updated the kubernetes.io/description SCC annotation prose so the literal name doesn't appear there either (cosmetic, but easier to audit "no literal references" with a single grep). Tests: - platform_test.yaml gains 3 new cases: SCC users[0] points at release-scoped SA, ClusterRoleBinding subject does too, and two releases (stg + cisco/hasan-prod) produce non-colliding SA references. - node_agents_namespace_test.yaml had a regression assertion checking the OLD literal name in users[0]; updated to the new release-scoped form (`RELEASE-NAME-resource-monitor`, helm-unittest's default release name when none is set). - 98 -> 102 passing. Verified end-to-end with two side-by-side `helm template` runs: - stg -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor - hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that v1.2.0 inadvertently broke). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * fix: NOTES.txt rename + generator chart-version drift (v1.2.3) — bugbot follow-up #2 (#80) * fix(client): release-scope SCC SA refs (v1.2.2) Bugbot caught a High-severity miss in v1.2.0's release-scoping work (PR #72). The OpenShift SCC template was the one resource-monitor file not updated when the literal `tracebloc-resource-monitor` ServiceAccount name moved to `<Release.Name>-resource-monitor`. On OpenShift the SCC granted access to a SA name that no longer existed, so the resource- monitor DaemonSet pods would fail to launch (no SCC -> can't mount hostPath /proc and /sys for node metrics). 
The SCC's metadata.name + ClusterRole.name + ClusterRoleBinding.name were ALREADY release-scoped (`tracebloc-resource-monitor-<release>` / `tracebloc-resource-monitor-scc-<release>`), so this slipped through — casual reading suggested it was already done. Touchpoints in resource-monitor-scc.yaml: - users[0]: now {{ include "tracebloc.resourceMonitorName" . }} - ClusterRoleBinding subjects[0].name: same helper - All `app: tracebloc-resource-monitor` labels: same helper, for consistency with the rest of the chart's resource-monitor templates - Updated the kubernetes.io/description SCC annotation prose so the literal name doesn't appear there either (cosmetic, but easier to audit "no literal references" with a single grep). Tests: - platform_test.yaml gains 3 new cases: SCC users[0] points at release-scoped SA, ClusterRoleBinding subject does too, and two releases (stg + cisco/hasan-prod) produce non-colliding SA references. - node_agents_namespace_test.yaml had a regression assertion checking the OLD literal name in users[0]; updated to the new release-scoped form (`RELEASE-NAME-resource-monitor`, helm-unittest's default release name when none is set). - 98 -> 102 passing. Verified end-to-end with two side-by-side `helm template` runs: - stg -> users[0] = system:serviceaccount:tracebloc-node-agents:stg-resource-monitor - hasan-prod -> users[0] = system:serviceaccount:tracebloc-node-agents:hasan-prod-resource-monitor Chart bumped 1.2.1 -> 1.2.2 (patch — restores OpenShift parity that v1.2.0 inadvertently broke). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix: NOTES.txt rename + generator chart-version drift (v1.2.3) Bugbot follow-up to the v1.2.0/1.2.2 rename work. Two fresh issues: 1. (Medium) NOTES.txt:9 still hardcoded the literal `tracebloc-resource-monitor` for the resource-monitor DaemonSet display, while the actual DaemonSet name has been `<release>-resource-monitor` since v1.2.0. 
Operators see one name in the post-install banner and a different name when they `kubectl get ds`. Now routes through the same tracebloc.resourceMonitorName helper as the rest of the chart. 2. (Low) docs/migration-tools/generate.sh hardcoded `app.kubernetes.io/version: "1.1.0"` and `helm.sh/chart: client-1.1.0` on every pre-create PVC. The chart has moved through 1.1.0 → 1.2.3, and operators running generate.sh today get PVC labels stuck at 1.1.0 even though the install ahead is 1.2.3. Helm adoption itself is unaffected (it keys on meta.helm.sh/release-name, not the chart label), but the labels lie until a subsequent upgrade reconciles them, and `kubectl get pvc -L helm.sh/chart` is misleading during migration debugging. Fixed by reading name + version from client/Chart.yaml at generate time. Plus a few stale prose references caught while auditing the same path (no functional impact, but the doc was directing operators at "client fix in 1.2.0" as if it were still pending): - generate.sh inline comment on `resourceMonitor: false` rephrased from "until client-1.2.0 is published" to "until you have verified the chart you're installing is 1.2.0+" - migrate-tenant.sh banner relabelled from "v1.1.0 spec sanity" to "mysql spec sanity (v1.1.0+ shape: ...)" - README.md skip table cell on `resourceMonitor: false` rewritten to reflect that 1.2.0+ has shipped — operators on >=1.2.0 can flip it to true without colliding with the stg release Tests: 102 → 105 passing. New `client/tests/notes_test.yaml` covers: - Release-scoped resource-monitor name appears in NOTES.txt - A different release renders a different name (proves the helper isn't accidentally hardcoded) - Negative regex guards against the literal `tracebloc-resource-monitor` reappearing followed by a non-suffix character (i.e. 
the bare pre-1.2.3 form, while still letting the SCC line `tracebloc- resource-monitor-<release>` further down the file pass) - `resourceMonitor: false` removes the line entirely End-to-end smoke of generate.sh confirms PVCs ship with the live chart version (`helm.sh/chart: client-1.2.3` after this commit, verified against /tmp/tracebloc-migration-<demo>/pvcs.yaml). Stacked on PR #78 (v1.2.2 SCC fix), so this branch already contains the SCC SA-ref rename. Once #78 lands the diff against develop will reduce to just this commit. Chart bumped 1.2.2 → 1.2.3 (patch — operator-facing string fix + tooling correctness). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * docs(claude): require @saadqbal as PR assignee (#79) Convention captured after a session-end ask. Every PR Claude opens for this repo must be assigned to saadqbal — orphaned PRs without an assignee fall through the review queue. Pass --assignee @me on `gh pr create` (or --assignee saadqbal if running unauthenticated). No exceptions. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
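The migration-tools fixes for PR #74 listed earlier in this commit (portable checksum/size helpers, the `*__*__*` placeholder guard, the empty-pod-list count, and the bounded readiness wait) can be sketched roughly as follows. The names `_md5` and `_size` and the `*__*__*` pattern come from the commit messages; the function bodies, and the helper names `is_placeholder`, `count_lines` and `wait_until`, are assumptions for illustration and not the actual contents of migrate-tenant.sh or generate.sh.

```shell
#!/usr/bin/env bash
# Sketch only: bodies are assumed, not copied from the migration scripts.

# Portable MD5: prefer md5sum (Linux), fall back to md5 -q (macOS).
_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'
  else
    md5 -q "$1"
  fi
}

# Portable byte count: wc -c avoids the GNU/BSD differences in stat flags.
_size() {
  wc -c < "$1" | tr -d '[:space:]'
}

# Succeeds when a value still contains an unedited __FOO__ style placeholder.
is_placeholder() {
  case "$1" in
    *__*__*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Count non-empty lines. `grep -c .` exits 1 when nothing matches, which
# would abort a script running under `set -euo pipefail`, so absorb it.
count_lines() {
  printf '%s\n' "$1" | grep -c . || true
}

# Re-run a command until it succeeds or a wall-clock deadline passes.
# Usage: wait_until <timeout-seconds> <command> [args...]
wait_until() {
  local timeout="$1"
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}
```

A caller could then write `wait_until "${READY_TIMEOUT:-600}" pods_ready || exit 1`, where `pods_ready` is a hypothetical function that checks pod state; on timeout the script fails instead of hanging.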
chore(client): bump chart to 1.2.3 for release
…loses #70) (#83) (#84)
The chart unification (4 per-platform charts -> unified client/ chart)
shipped in v1.1.0; the unified chart has now been at v1.2.x in production
across stg + hasan-prod for several releases. Time to retire the legacy
artifacts.
Removed:
- aks/, bm/, eks/, oc/ chart directories: 75 files, ~330KB. Each had a
  DEPRECATED.md pointing at the unified chart for ~6 months.
- 7 stale .tgz tarballs at repo root (aks-1.0.3, aks-1.0.4, bm-1.0.3,
  bm-1.0.4, eks-1.0.3, eks-1.0.4, oc-1.0.4). The release workflow
  publishes via gh-pages; these checked-in builds were dead weight.
- Root index.yaml: stale snapshot listing only 1.0.3/1.0.4 of the legacy
  charts. The live index served at tracebloc.github.io/client is on the
  gh-pages branch and is the source of truth.
- mysql.yaml at repo root: orphaned PVC manifest with hardcoded volume
  UUID and namespace. Audited: zero references anywhere in the repo.
Other:
- Added *.tgz to .gitignore so chart packages don't sneak back in.
- Updated client/MIGRATION.md Rollback section. The old "the legacy
  charts remain in aks/, bm/, eks/, oc/ and can be used at any time" was
  about to become a lie. Replaced with instructions to recover the
  directory from git history if anyone genuinely needs the old chart.
Verification:
- helm lint --strict ./client -f client/ci/eks-values.yaml: clean (same
  invocation the release workflow runs on every tag)
- helm unittest client: 105/105 still passing
- helm package ./client -d /tmp: produces a valid client-1.2.3.tgz
Net diff: 86 files changed, 17 insertions(+), 3447 deletions(-).
Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com>
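For anyone who does need one of the removed chart directories back, one possible way to recover it from git history is sketched below. This is an assumption about how such a recovery could be done with standard git commands, not a copy of the instructions in client/MIGRATION.md; `restore_deleted_dir` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: restore_deleted_dir is a hypothetical helper, not part of the repo.

# Restore a path that was deleted in an earlier commit.
# Usage: restore_deleted_dir <path>   (run from the repository root)
# Assumes the path is currently absent, so the most recent commit that
# touched it is the commit that removed it.
restore_deleted_dir() {
  local path="$1" deleting_commit
  deleting_commit=$(git rev-list -n 1 HEAD -- "$path") || return 1
  if [ -z "$deleting_commit" ]; then
    echo "no history found for $path" >&2
    return 1
  fi
  # Check the path out from the parent of that commit, where it still existed.
  git checkout "${deleting_commit}^" -- "$path"
}
```

Running `restore_deleted_dir aks/` stages the directory as it was immediately before the removal commit; it does not rewrite history.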
Prod: Implement self-upgrade CronJob for Helm chart automation
The Deploy section opened with `docker pull tracebloc/client:latest`,
but this repo ships a Helm chart; the actual install is `helm install`.
External walkthrough URLs (`/local-linux`, `/local-macos`, `/aws`,
`/deployment-overview`) didn't match any path in the tracebloc/docs tree,
so they 404. The in-repo documentation (`docs/INSTALL.md`,
`docs/MIGRATIONS.md`, `docs/migration-tools/README.md`,
`client/MIGRATION.md`) was never linked from the README despite being the
operational source of truth.
Surgical change; the rest of the README stays as-is:
- Replace `docker pull` with `helm repo add` + `helm install` (matches
  docs/INSTALL.md)
- Call out chart version (v1.3.1) and platform support (AKS / EKS /
  bare-metal / OpenShift) up front
- Table linking every in-repo operational doc
- Fix external URLs to match actual tracebloc/docs paths
  (local-deployment-guide-linux, local-deployment-guide-macos,
  eks-client-deployment-guide, azure-deployment-guide)
- Pull NetworkPolicy/CNI prerequisite into a callout
Closes #101
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: fix README Deploy section (Helm not docker), surface in-repo docs
The standalone installer (bash <(curl -fsSL tracebloc.io/i.sh) / irm
tracebloc.io/i.ps1 | iex) is the one-command path for evaluation, local
dev, and first-time installs; it provisions a cluster, detects GPU
drivers, and deploys the client. Today it isn't documented anywhere
reachable from this repo, so readers see the multi-step helm install flow
as the only option.
README:
- New "Quick install" subsection at the top of Deploy with macOS/Linux
  and Windows commands, brief description of what it does, and a pointer
  to the local helper scripts under scripts/
- Existing helm flow relabeled as "Helm install (production)", now
  positioned as the option for existing production clusters
docs/INSTALL.md:
- Top-of-doc callout pointing at the standalone installer for
  non-production users
- Production-focused content untouched
Closes #103
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous wording ("Best for evaluation, local dev, and first-time
installs" / "Just trying it out? For local dev or a quick evaluation")
implied the standalone installer produces a lesser/demo client. It
doesn't — it produces the same full client, just on a cluster the
script provisions for you.
Reframes the differentiator around cluster ownership instead of install
quality:
- README: "Use this when you don't already have a cluster — the result
is a full client install, not a demo." Helm subsection retitled
from "Helm install (production)" to just "Helm install" with
"For existing Kubernetes clusters".
- INSTALL.md: callout opens with "Don't have a Kubernetes cluster
yet?" and emphasizes "a full tracebloc client".
Refs #103
curl and PowerShell's irm both default to HTTP when no scheme is
specified, so `curl -fsSL tracebloc.io/i.sh` and `irm tracebloc.io/i.ps1`
issue plaintext requests. The downloaded body is piped straight into
bash / iex, so a network-level attacker between the user and tracebloc.io
could MITM the response and inject arbitrary code.
Add explicit `https://` to every installer URL in README.md and
docs/INSTALL.md so the request is encrypted from the first byte.
Refs #103
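A minimal sketch of how a wrapper could enforce the same rule on the curl side, assuming nothing about the real installer beyond its URL. `require_https` and `fetch_installer` are hypothetical helper names; `--proto`, `--proto-redir` and `--tlsv1.2` are standard curl options, but the commit itself only adds the `https://` prefix and does not mention them.

```shell
#!/usr/bin/env bash
# Sketch only: require_https and fetch_installer are hypothetical helpers.

# Succeed only when the URL carries an explicit https:// scheme.
require_https() {
  case "$1" in
    https://*) return 0 ;;
    *)
      echo "refusing non-HTTPS URL: $1" >&2
      return 1
      ;;
  esac
}

# Download a script, rejecting plaintext both in the URL string and at the
# protocol level (including redirects that try to downgrade to http).
fetch_installer() {
  require_https "$1" || return 1
  curl --proto '=https' --proto-redir '=https' --tlsv1.2 -fsSL "$1"
}
```

With these defined, `bash <(fetch_installer https://tracebloc.io/i.sh)` behaves like the documented one-liner, while a scheme-less `tracebloc.io/i.sh` is rejected before any request is made.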
docs: surface standalone installer in README and INSTALL.md
…main ci: bootstrap FR-flow callers on main
Switches the auto-upgrade CronJob default schedule from "23 2 * * *" (daily 02:23 UTC) to "23 * * * *" (hourly at :23). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches the auto-upgrade CronJob default schedule from "23 2 * * *" (daily 02:23 UTC) to "23 * * * *" (hourly at :23). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
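To make the difference between the two schedules concrete, here is a small sketch that reads the five cron fields (minute, hour, day-of-month, month, day-of-week). `describe_cron` is a hypothetical helper that only understands the two shapes used in this change, with plain numeric minute and hour fields; it is not a general cron parser.

```shell
#!/usr/bin/env bash
# Sketch only: describe_cron is a hypothetical helper covering the two
# schedule shapes in this change, not a general cron parser.
describe_cron() {
  local minute hour dom month dow
  # read splits on whitespace and does not glob-expand the literal "*" fields.
  read -r minute hour dom month dow <<< "$1"
  if [ "$hour" = "*" ]; then
    printf 'hourly at :%s\n' "$minute"
  else
    # 10# forces base 10 so a value such as "08" is not parsed as octal.
    printf 'daily at %02d:%02d UTC\n' "$((10#$hour))" "$((10#$minute))"
  fi
}

describe_cron "23 2 * * *"
describe_cron "23 * * * *"
```

The only field that changes is the hour: a fixed `2` pins the job to one run per day, while `*` lets it fire every hour at minute 23.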
chore(client): bump chart 1.3.1 -> 1.3.2 (hourly auto-upgrade)
…in-v1.3.5 # Conflicts: # client/Chart.yaml
* Merge pull request #88 from tracebloc/ci/add-wip-limit-caller ci: add WIP-limit-check caller workflow * feat(requests-proxy): register requests-proxy in Helm chart (#95) * feat(requests-proxy): register requests-proxy in Helm chart - Add requests-proxy Deployment and Service templates - Auto-generate requests-proxy-admin token on first install (preserved across upgrades via lookup; override with requestsProxyAdminToken) - Inject REQUESTS_PROXY_ADMIN_TOKEN into jobs-manager via the same secret - Add images.requestsProxy and resources.requestsProxy values Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Update order of setting request proxy admin token * Bugbot Fix YAML * Bugbot fix add validation for request proxy --------- Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Merge pull request #106 from tracebloc/docs/drop-stale-helm-charts-refs-105 docs: drop stale tracebloc-helm-charts references in INSTALL.md * ci: add FR-pass comment caller for multi-stage kanban flow * ci: add FR gate caller for staging/main promotions * chore: sync main → develop after misrouted docs PRs (#108) * docs: fix README Deploy section (Helm not docker), surface in-repo docs The Deploy section opened with `docker pull tracebloc/client:latest`, but this repo ships a Helm chart — the actual install is `helm install`. External walkthrough URLs (`/local-linux`, `/local-macos`, `/aws`, `/deployment-overview`) didn't match any path in the tracebloc/docs tree, so they 404. The in-repo documentation (`docs/INSTALL.md`, `docs/MIGRATIONS.md`, `docs/migration-tools/README.md`, `client/MIGRATION.md`) was never linked from the README despite being the operational source of truth. 
Surgical change — the rest of the README stays as-is: - Replace `docker pull` with `helm repo add` + `helm install` (matches docs/INSTALL.md) - Call out chart version (v1.3.1) and platform support (AKS / EKS / bare-metal / OpenShift) up front - Table linking every in-repo operational doc - Fix external URLs to match actual tracebloc/docs paths (local-deployment-guide-linux, local-deployment-guide-macos, eks-client-deployment-guide, azure-deployment-guide) - Pull NetworkPolicy/CNI prerequisite into a callout Closes #101 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: surface standalone installer in README and INSTALL.md The standalone installer (bash <(curl -fsSL tracebloc.io/i.sh) / irm tracebloc.io/i.ps1 | iex) is the one-command path for evaluation, local dev, and first-time installs — it provisions a cluster, detects GPU drivers, and deploys the client. Today it isn't documented anywhere reachable from this repo, so readers see the multi-step helm install flow as the only option. README: - New "Quick install" subsection at the top of Deploy with macOS/Linux and Windows commands, brief description of what it does, and a pointer to the local helper scripts under scripts/ - Existing helm flow relabeled as "Helm install (production)" — now positioned as the option for existing production clusters docs/INSTALL.md: - Top-of-doc callout pointing at the standalone installer for non-production users - Production-focused content untouched Closes #103 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: reframe Quick install — same client, different cluster path Previous wording ("Best for evaluation, local dev, and first-time installs" / "Just trying it out? For local dev or a quick evaluation") implied the standalone installer produces a lesser/demo client. It doesn't — it produces the same full client, just on a cluster the script provisions for you. 
Reframes the differentiator around cluster ownership instead of install quality: - README: "Use this when you don't already have a cluster — the result is a full client install, not a demo." Helm subsection retitled from "Helm install (production)" to just "Helm install" with "For existing Kubernetes clusters". - INSTALL.md: callout opens with "Don't have a Kubernetes cluster yet?" and emphasizes "a full tracebloc client". Refs #103 * docs: explicit https:// on installer URLs (security) curl and PowerShell's irm both default to HTTP when no scheme is specified, so `curl -fsSL tracebloc.io/i.sh` and `irm tracebloc.io/i.ps1` issue plaintext requests. The downloaded body is piped straight into bash / iex, so a network-level attacker between the user and tracebloc.io could MITM the response and inject arbitrary code. Add explicit `https://` to every installer URL in README.md and docs/INSTALL.md so the request is encrypted from the first byte. Refs #103 * ci: bootstrap FR-pass caller on main * ci: bootstrap FR gate caller on main --------- Co-authored-by: Lukas Wuttke <lukas@tracebloc.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com> * chore(auto-upgrade): run cronjob hourly at :23 (#112) Switches the auto-upgrade CronJob default schedule from "23 2 * * *" (daily 02:23 UTC) to "23 * * * *" (hourly at :23). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Merge pull request #115 from tracebloc/chore/bump-chart-1.3.2-develop chore(client): bump chart 1.3.1 -> 1.3.2 (develop sync) * ci: drop push-tags trigger from release-helm-chart workflow (#117) * ci: drop push-tags trigger from release-helm-chart workflow `gh release create v<x.y.z>` (the established release path per `gh release list`) fires both `push` (tag) and `release` (published) events, which causes two parallel workflow runs to race for the gh-pages push. 
The slower run fails with non-fast-forward. Most recent example: v1.3.2 cut today — run 25492826437 (release event) failed; run 25492826350 (push event) succeeded. Artifacts landed fine, but the failed sibling shows up as a red X on the release and is noise for anyone debugging future releases. Keeping only `release: published` removes the race. The `Upload chart to GitHub Release (on tag)` step's `startsWith(github.ref, 'refs/tags/')` guard still evaluates true for release events (`github.ref` is the tag ref), so the upload step behaviour is preserved. Closes #116 * ci: harden release-asset upload against actions/runner#2788 With the push-tags trigger removed, the upload step's `if: startsWith(github.ref, 'refs/tags/')` guard is the only thing keeping the upload from running, but it silently evaluates to false when `github.ref` arrives empty — a known intermittent runner bug (actions/runner#2788, still open as of 2026-05). The same bug also affects `github.ref_name`, which softprops/action-gh-release@v2 uses by default to derive the tag, so the action itself can target the wrong release (or fail) when the bug fires. Drop the now-redundant `if:` guard (the workflow only runs on `release: published`, so every run is by definition a release event) and pass `tag_name` explicitly from the release event payload, which is unaffected by the bug. * ci: pin checkout ref to release tag (actions/runner#2788 hardening) actions/checkout@v4 defaults `ref` to github.ref, which is the same field hit by actions/runner#2788 — the still-open intermittent bug where github.ref arrives empty on release-triggered runs. Per the action's docs, when "checking out the repository that triggered a workflow, this defaults to the reference or SHA for that event. Otherwise, uses the default branch." So an empty github.ref would fall back to the repo default branch (develop here), and we'd package the chart from develop's HEAD instead of the tagged commit. 
Pin ref explicitly to github.event.release.tag_name, which is part of the release event payload and is unaffected by the runner bug. * Add MySQL Host to request proxy yaml file (#118) Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local> * Add request proxy url to jobs manager yaml file (#119) Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local> * Remove REQUESTS_PROXY_ADMIN_TOKEN (#120) Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local> * Reduce dependency on values.yaml file for requests proxy (#122) Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local> * feat(#86): ingestor Helm subchart + companion RBAC/service/authz for new ingestion endpoint (#123) * feat: companion chart changes for ingestion endpoint (client-runtime#21) Wires the cluster side of the new ingestion flow into the main client chart so the upcoming ingestor subchart can actually reach jobs-manager. Five small changes: 1. **rbac.yaml** — adds three permissions to jobs-manager's RBAC: - authentication.k8s.io/tokenreviews create - configmaps create - secrets create The endpoint validates caller SA tokens via TokenReview and creates a per-run ConfigMap (ingest.yaml) + Secret (BACKEND_TOKEN) before spawning the ingestor Job. `tokenreviews` is cluster-scoped and only added to the ClusterRole branch; customers with `clusterScope: false` won't have the ingestion endpoint authenticate. Documented in the rule comments. 2. **jobs-manager-service.yaml** (new) — ClusterIP exposing port 8080 at the stable name `jobs-manager`, so the ingestor subchart's post-install hook doesn't need to discover Pod IPs. 3. **jobs-manager-deployment.yaml** — adds containerPort 8080 on the `api` container, mounts the ingestion-authz ConfigMap at `/etc/tracebloc/ingestion-authz.yaml`, declares the corresponding pod-level volume. 4. **ingestion-authz-configmap.yaml** (new) — renders the `ingestionAuthz.allowed` policy customers configure in values.yaml. 
Mounted into jobs-manager and read at startup by `submit_ingestion_run.load_authz_policy`. Each entry maps (namespace, service_account) → allowed table_prefixes; omitted `namespace` defaults to .Release.Namespace. 5. **values.yaml** — adds the `ingestionAuthz.allowed` default that permits the ingestor subchart's default SA (named `ingestor`) to ingest into any table. Customers tighten via overrides. Verified ──────── - helm lint passes (only pre-existing icon-recommended INFO). - helm template renders all five resources cleanly with expected values (Service name, RBAC verbs, container port, volume mount). - helm unittest: 116/116 tests pass (existing snapshots unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(#86): ingestor Helm subchart (post-install hook submits to jobs-manager) The customer-facing chart that finally closes the end-to-end loop: helm install my-dataset tracebloc/ingestor --namespace tracebloc \ --set-file ingestConfig=./my-ingest.yaml \ --set image.digest=sha256:<digest> Renders the customer's ingest.yaml into a ConfigMap, then a post-install hook Job POSTs `{ingest_config, idempotency_key, image_digest}` to jobs-manager's `/internal/submit-ingestion-run` endpoint (client-runtime#21). jobs-manager validates the SA token via TokenReview, validates the YAML against ingest.v1, mints a backend token, creates the per-run ConfigMap + Secret + Job, returns 201 (or 200 on replay). Layout ────── ingestor/ ├── Chart.yaml appVersion: 0.3.0-rc1 (the data-ingestors release) ├── values.yaml ingestConfig (required, --set-file), image.digest │ (required, sha256), jobsManager.endpoint, │ serviceAccount.create, hook resources, idempotency ├── README.md ownership boundaries + verification commands ├── .helmignore └── templates/ ├── _helpers.tpl ├── serviceaccount.yaml default name "ingestor" ├── configmap-ingest-config.yaml hook-weight 0 └── post-install-job.yaml hook-weight 1, runs as the SA, reads its own token, POSTs. 
Ownership boundary ────────────────── Per #86's acceptance criteria, the README spells out what `helm uninstall` does and doesn't clean up: This chart owns: ConfigMap (ingest.yaml), the hook Job, the SA. jobs-manager owns: the per-run ConfigMap, Secret, ingestor Job. The cluster owns: the ingested data + metadata POSTed to the backend. `helm uninstall my-dataset` removes only the chart's footprint. The running ingestor Job and its data persist. This is deliberate — uninstall is not a cancel button. The README documents the kubectl command to cancel a run if needed. Implementation choices ────────────────────── - **post-install hook, not a long-lived resource.** The hook is the whole point of this chart — fire once, exit. - **automountServiceAccountToken: true** for the hook Job. That's the whole authentication mechanism — TokenReview on the SA token. Every other tracebloc workload disables automount; this one needs it. - **`hook-delete-policy: before-hook-creation`**, NOT `hook-succeeded`. Keeps the completed Job around so operators can `kubectl logs` the POST response after install. Cleaned up only on the next install under the same release. - **curlimages/curl** as the hook image — small, official, and ships python3 which we use to JSON-encode the multi-line YAML body safely (jq has a JSON-escape edge case for YAML newlines that's easier to side-step than handle). - **idempotencyKey defaults to `<release>-<revision>`** so a `helm upgrade` submits a fresh run. Customers override to a stable UUID if they want strict at-most-once across reinstalls. Verified ──────── - helm lint passes. - helm template renders all four resources (ConfigMap, Job, SA, and the inline templates expand cleanly with --set-file ingestConfig). - Required-value gates fire correctly: missing image.digest fails template; missing ingestConfig fails template. 
Closes #86
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(#86): pre-render JSON body in ConfigMap, drop python3 + shell JSON escape
Three bugbot findings on the first ingestor-chart pass, all real:
1. HIGH: curlimages/curl runtime layer doesn't include python3 (only in
   the build stage; stripped in the final image). The hook's
   `python3 -c ...` JSON encoder would fail with "python3: not found" on
   every install.
2. HIGH: even if python3 were available, the shell syntax
   `python3 -c "..." VAR=value` puts the assignments AFTER the command,
   which makes them positional argv, not env. The
   `os.environ['INGEST_CONFIG']` lookup would raise KeyError.
3. MEDIUM: `nindent 4` after literal template-source indentation puts a
   leading blank line into the YAML block scalar, so the customer's
   ingest.yaml gets a "\n" prefix that block-scalar parsers tolerate but
   is wrong.
Structural fix rather than tweaking the script: the three POST-body
fields (ingest_config, idempotency_key, image_digest) are ALL known at
helm-template time. Render the JSON body in the ConfigMap as `body.json`
using Helm's `toJson` filter (which handles multi-line string escaping
correctly); then the hook becomes a one-line
`curl --data-binary @body.json`. No python3 needed, no shell-side JSON
construction at all. Eliminates both HIGH bugs as a category, not just
instance-by-instance.
For bug 3: use the left-trim action delimiter (dash inside braces) before
the `required ... | nindent 4` action so it eats the leading whitespace
cleanly. Verified via `helm template` that the rendered `ingest.yaml` now
starts cleanly with `apiVersion:`.
Verified
────────
- helm lint passes on both client/ and ingestor/.
- helm template renders the JSON body with correct escaping (multi-line
  YAML → "\n"-escaped scalar in JSON).
- helm template renders ingest.yaml with no leading blank line.
- helm unittest client/: 116/116 pass.
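The argv-versus-environment pitfall from finding 2 is easy to reproduce in isolation. A minimal sketch, assuming only a POSIX `sh`; the probe script is illustrative and is not the chart's hook code. `INGEST_CONFIG` is the variable name from the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: demonstrates where a VAR=value word ends up depending on
# whether it appears before or after the command name.

unset INGEST_CONFIG

# Child script that reports what it sees in its environment and in $1.
probe='printf "env=%s argv1=%s\n" "${INGEST_CONFIG:-<unset>}" "${1:-<none>}"'

# Assignment BEFORE the command: placed in the child's environment.
before=$(INGEST_CONFIG=hello sh -c "$probe" probe)

# Assignment AFTER the command: passed as an ordinary positional argument.
after=$(sh -c "$probe" probe INGEST_CONFIG=hello)

printf '%s\n%s\n' "$before" "$after"
# prints:
#   env=hello argv1=<none>
#   env=<unset> argv1=INGEST_CONFIG=hello
```

In the second form the child never sees the variable in its environment, which is why an `os.environ[...]` lookup in the original hook would have failed.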
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(#86): track ingestor/values.yaml (was silently .gitignored) bugbot caught a serious oversight: `ingestor/values.yaml` exists in the working tree but never made it into the repository. Every `git add ingestor/` silently dropped it because the repo's .gitignore at line 119 has `/*/values*.yaml` — an anti-leak pattern for operator values files — which matches `ingestor/values.yaml`. Without the file the chart is broken on `helm install`: every template references `.Values.hookImage.repository`, `.Values.jobsManager.endpoint`, etc., and Helm renders nil-pointer errors when the keys are absent. Two-line fix: - Add `!ingestor/values.yaml` to .gitignore (mirrors the existing `!client/values*.yaml` exception for the main chart). Documents *why* the exception exists, so a future cleanup pass doesn't re-introduce the bug. - Commit the actual values.yaml file with the defaults already referenced by the README and the templates. Local verification before pushing: helm template my-dataset ingestor/ --namespace tracebloc \ --set ingestConfig=... --set image.digest=sha256:... \ # renders ServiceAccount, ConfigMap, Job correctly. Lesson for future runs: `git add <dir>/` is *not* a verification that files were added — gitignore patterns can silently drop them. Should have verified with `git status` before commit; would have caught this before bugbot did. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: nil-guard ingestionAuthz access for --reuse-values upgrade path (#124) #123's ingestion-authz ConfigMap template did unguarded nested access: {{- range .Values.ingestionAuthz.allowed }} This crashes with "nil pointer evaluating interface {}.allowed" when `.Values.ingestionAuthz` is absent — which is exactly what `helm upgrade --reuse-values` produces against a pre-#123 release. 
The stored values from the previous deploy don't have the key, and `--reuse-values` doesn't pick up new chart defaults, so the upgrade fails before any of the new resources are created. A real user hit this immediately after #123 merged: Error: UPGRADE FAILED: template: client/templates/ ingestion-authz-configmap.yaml:20:21: executing "..." at <.Values.ingestionAuthz.allowed>: nil pointer evaluating interface {}.allowed Fix: collapse the missing-parent and missing-child cases to an empty list with `default dict` + `default list`. The rendered ConfigMap becomes `allowed:` (empty), which the authz policy parser treats as "no SAs authorized" — fail-safe, matches the intent of "operator hasn't configured this yet". The recommended `helm upgrade` recipe is still `--reset-then-reuse-values` (picks up new defaults including the non-empty `ingestionAuthz.allowed` default), but the template no longer requires that — it renders correctly under either path. Verified ──────── - helm template renders cleanly with default values (full policy), with `--set ingestionAuthz=null` (empty allowed list), and with `--set ingestionAuthz.allowed=null` (same). - helm unittest client/: 116/116 pass, no snapshot changes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(#125): wire INGESTOR_IMAGE_DIGEST; drop digest requirement from ingestor subchart (#126) * feat(#125): wire INGESTOR_IMAGE_DIGEST; drop digest requirement from ingestor subchart Companion to tracebloc/client-runtime#41 (which made the endpoint treat the request body's `image_digest` as an optional override of a cluster-configured default). With this PR the ingestor image fits the same auto-update model as every other component in the chart: client/values.yaml + images.ingestor.digest: "" The auto-upgrade cronjob bumps this when a new chart version is published; jobs-manager re-rolls and the new env takes effect. 
client/templates/jobs-manager-deployment.yaml + INGESTOR_IMAGE_DIGEST env, nil-guarded for --reuse-values from a pre-this-PR release. Empty value renders cleanly (no nil pointer), endpoint then accepts only request-body overrides until the operator sets the chart value. ingestor/values.yaml + templates/configmap-ingest-config.yaml + image.digest is now an OPTIONAL override, not required. + body.json renders without `image_digest` when none is set; the key is included only when the customer explicitly pinned via --set image.digest=... (the override path: reproducing old runs, testing pre-rollout versions, air-gapped mirrors). ingestor/README.md + Removes image.digest from "Required values". + Adds "Pinning a specific image version" section explaining the override use cases and when to reach for them. + Top-of-README install snippet drops --set image.digest=... — the dominant path is now `helm install --set-file ingestConfig=...`. Once both PRs land, the bootstrap step is a one-line bump of client/values.yaml's images.ingestor.digest to the current ghcr.io/tracebloc/ingestor release digest, plus a chart version bump so the auto-upgrade cronjob promotes it. Future ingestor releases follow the same pattern — bump digest + chart version, customers' auto-upgrade picks it up on the next tick. Verified ──────── - helm lint passes on both charts. - helm template renders: - env populated when images.ingestor.digest is set - env empty (nil-guard) when images.ingestor key absent entirely (simulates --reuse-values from pre-this-PR release) - body.json without image_digest when no override - body.json with image_digest when explicit --set image.digest=... - helm unittest client/: 116/116 pass. Closes #125 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bootstrap ingestor digest + bump chart version 1.3.2 → 1.3.3 Activates the auto-update model introduced by the rest of this PR. 
Without the value set, jobs-manager runs with `INGESTOR_IMAGE_DIGEST=""` and the ingestion endpoint returns 503 for every call that doesn't include a body override — which is the *opposite* of the "customer doesn't have to think about digests" UX this PR is supposed to enable. Two coupled bumps: client/Chart.yaml version: 1.3.2 → 1.3.3 appVersion: 1.3.2 → 1.3.3 Required for the auto-upgrade cronjob to detect this release. `helm search repo` orders by version; without a bump customers stay on 1.3.2 and never see the new env wiring. client/values.yaml images.ingestor.digest = "sha256:e6639b084d0d377072dc908db376050914ebd49c730ddaa13f838d10f5482ea9" The data-ingestors v0.3.0-rc1 release. Future ingestor releases bump both this and Chart.yaml's version; eventually a workflow in tracebloc/data-ingestors can raise the PR automatically when a new image is published. After this lands and the chart is published to gh-pages, a `helm upgrade --reset-then-reuse-values` on the customer's cluster (or the daily auto-upgrade cronjob's next tick) rolls jobs-manager with the env populated, and `helm install tracebloc/ingestor --set-file ingestConfig=...` — no `--set image.digest=...` — works. Verified ──────── - helm lint client/ clean. - helm template shows INGESTOR_IMAGE_DIGEST env populated with the real digest. - helm unittest client/: 116/116 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(#127): ingestor chart auto-resolves jobs-manager endpoint to release namespace (#128) The ingestor subchart's default jobsManager.endpoint hardcoded "tracebloc" as the parent release's namespace: http://jobs-manager.tracebloc.svc.cluster.local:8080 Any release in a non-"tracebloc" namespace failed the post-install hook with `curl: (6) Could not resolve host: …`, blocking end-to-end ingestion. 
Surfaced today during real-cluster validation on a release deployed to `tracebloc-templates`. Fix shape: leave the values.yaml default empty; have the post-install hook template the endpoint to use `.Release.Namespace` when no value is set. The override path (cross-namespace install) keeps working — set `jobsManager.endpoint` explicitly and it wins over the default. values.yaml jobsManager.endpoint: "" (was hardcoded to tracebloc namespace) + comment explaining the auto-resolve + override semantics templates/post-install-job.yaml JOBS_MANAGER_ENDPOINT defaults to http://jobs-manager.<.Release.Namespace>.svc.cluster.local:8080 when .Values.jobsManager.endpoint is empty. README.md Frequently-overridden-values entry corrected. Verified ──────── - helm template into namespace `tracebloc-templates` → http://jobs-manager.tracebloc-templates.svc.cluster.local:8080 - helm template into namespace `some-other-ns` → http://jobs-manager.some-other-ns.svc.cluster.local:8080 - helm template with --set jobsManager.endpoint=http://port-forward.localhost:8888 → wins over the default. - helm lint clean. Closes #127 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(#129): parent client chart owns the shared ingestor ServiceAccount (#131) The ingestor ServiceAccount is shared by every `tracebloc/ingestor` subchart release in a namespace, but it was owned by the first such release. Concurrent installs of a second ingestor release collided with Helm's "cannot import into current release"; uninstalling the first release ripped the SA out from under all the others. Move the SA into this parent chart, which already owns the matching `ingestionAuthz` ConfigMap, so the SA + policy have the same lifecycle and every ingestor release in the namespace shares the SA cleanly. 
Plumb the name through `ingestionAuthz.serviceAccountName` as a single source of truth — both the new SA template and the default `allowed` entry in the authz ConfigMap dereference it via the new `tracebloc.ingestorServiceAccountName` helper. The helper nil-guards pre-#129 `--reuse-values` upgrades by defaulting to "ingestor". Document the SA adoption path in `client/MIGRATION.md` for clusters that already have an `ingestor` SA owned by a 0.1.0 subchart release — re-annotate before upgrading the parent chart so Helm doesn't refuse the import. Bumps chart to 1.3.4. Pair with tracebloc/ingestor 0.2.0, which flips `serviceAccount.create` default to `false` so subchart releases stop trying to own the SA themselves. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(#130): default idempotency key to install-time stamp, not release revision (#132) `ingestor.idempotencyKey` previously fell back to `<release>-<revision>` when `.Values.idempotencyKey` was unset. Helm restarts revisions at 1 after `helm uninstall`, so reinstalling under the same release name produced the same key. If anything dedupe-relevant changed in between (image digest is the dominant case during testing), jobs-manager correctly rejected the second submission with a 409 — but to a customer following the README it looked like the chart was broken. Default to `<release>-<unix-epoch>` instead. Each install gets a fresh key; the opt-in stable-UUID path remains for callers who actually want at-most-once semantics across reinstalls. Note on the printf format: Sprig's `unixEpoch` returns a string (not an int), so the formatter is `%s-%s`, not `%s-%d`. Bumps ingestor subchart 0.1.0 → 0.1.1 (default-behavior change). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(#129)!: default serviceAccount.create=false; parent chart owns the SA (#133) The ingestor SA is shared across every `tracebloc/ingestor` release in the namespace. 
The previous per-release ownership made the second concurrent install collide with Helm's "cannot import into current release" error, and uninstalling the first release deleted the SA out from under any sibling release that worked around the collision with `serviceAccount.create=false`. The parent `tracebloc/client` chart 1.3.4 now owns the SA, exposing its name via `ingestionAuthz.serviceAccountName`. This subchart's default flips to `create: false` so it consumes that shared SA. The `name` value is still required so the post-install hook Job knows which SA's token to mount. `serviceAccount.create=true` remains available as an escape hatch for operators on a pre-1.3.4 parent chart, with a comment in values.yaml explaining when (and only when) to flip it back on. Breaking change: bumps chart to 0.2.0. Pair with the 1.3.4 parent chart bump; see the parent's MIGRATION.md "Upgrading to 1.3.4" section for the SA-adoption procedure on clusters where a 0.1.0 release already created the SA. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(chart): bump ingestor digest to v0.3.0 + chart to 1.3.5 (#134) v0.3.0 is the first production-ready ingestor release (signed + SBOM), validated end-to-end against EKS on 2026-05-19 (6 files in PVC + 576 MySQL rows via the declarative chart path). The previous default (v0.3.0-rc1) had three real-cluster bugs that landed as tracebloc/data-ingestors#106: - #103 wheel + sdist were missing schema/ingest.v1.json - #104 image-resolution validator tuple-vs-list comparison - #105 _has_extension dot/case normalization (no more cat1.jpeg.jpeg) Chart bumped to 1.3.5 so the auto-upgrade cronjob (#69) detects the change and rolls customers onto v0.3.0 on the next tick. ingestor image: ghcr.io/tracebloc/ingestor@sha256:463e2367...07a4a cosign verify available; release notes contain the verification command. 
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(#135): publish ingestor subchart alongside parent chart (#136) The customer-facing install path is helm repo add tracebloc https://tracebloc.github.io/client helm install my-dataset tracebloc/ingestor \ --namespace tracebloc-templates \ --set-file ingestConfig=./my.yaml For `tracebloc/ingestor` to resolve from that helm repo, the ingestor subchart must be packaged into gh-pages alongside the parent client chart. Before this PR, `release-helm-chart.yaml` only ran `helm package ./client`, so the second install path returned `Error: chart "ingestor" not found`. helm-ci.yaml also only lints the parent chart, so any future regression in `ingestor/templates/` would land on develop without CI noticing. Three changes: 1. release-helm-chart.yaml: package + index BOTH client and ingestor into a single shared index.yaml. Attach both tgzs to the GitHub release for download-by-tag pinning. 2. helm-ci.yaml: lint the ingestor subchart on every PR alongside the per-platform client lints. Plain `helm lint --strict ./ingestor` is enough — its only required value (ingestConfig) emits INFO not FAIL, and the chart's templates don't branch on platform so the per-platform values-file matrix doesn't apply. 3. ingestor/Chart.yaml: bump appVersion 0.3.0-rc1 → 0.3.0 to match the tracebloc/data-ingestors v0.3.0 release that just shipped. Chart version (0.2.0) is unchanged; appVersion is descriptive. Validated locally: both charts package cleanly (client-1.3.5.tgz, ingestor-0.2.0.tgz), all four platform-specific client lints pass, ingestor lint passes. Closes #135. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(ingestor): explain image vs chart update lifecycle (#138) Customers ask: "the cluster has an auto-upgrade cronjob — does that mean my ingestor chart updates too?" 
The answer is nuanced: the image auto-updates (via INGESTOR_IMAGE_DIGEST on jobs-manager, kept current by the cronjob), but the chart on your workstation is independent — Helm's repo cache doesn't refresh itself. Add a "How updates work" section that explains the two-layer model and the strong property that the image you run is decoupled from the chart version that submitted the request. Plus an explicit FAQ on previously-installed ingestor releases (nothing to upgrade — fire-and-forget). No code change. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix three bugbot findings from PR #137 review (#142) * fix(#139): preserve idempotency key across helm upgrade The ingestor.idempotencyKey helper defaulted to "<release>-<unix-epoch>" and re-stamped on every render. `helm upgrade --reuse-values` preserves the stored value "" (not the previously-rendered key), so the template re-evaluated `now | unixEpoch` and produced a NEW key each upgrade — accidentally creating duplicate ingestion runs from what customers expected to be no-op upgrades. Contradicts the documented behavior in ingestor/README.md added in #138. Look up the existing post-install hook ConfigMap from the previous render and reuse its idempotency_key. On fresh install (or after uninstall) the lookup returns empty and we fall through to the now-based default. `helm template` (no cluster connection) returns empty for lookup too, so local previews still get a fresh key per render — matches the in-cluster install path the first time. Caught by bugbot on PR #137 review. Closes #139. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(#140): read requests-proxy resources from values The requests-proxy deployment hardcoded its container resources, ignoring the resources.requestsProxy schema entry that values.schema.json has defined since the requests-proxy was added. 
Every other component (jobsManager, podsMonitor, mysql) reads from .Values.resources.<name>.* with defaults — bring requestsProxy in line with that pattern. Adds the resources.requestsProxy block to values.yaml with the existing hardcoded defaults so behavior on a fresh install is unchanged. The template uses the default-through-dict nil-guard idiom so `helm upgrade --reuse-values` from a pre-1.3.6 release (where the value didn't exist) still renders cleanly without crashing on a nil parent. Caught by bugbot on PR #137 review. Closes #140. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(#141): add images.ingestor entry to values.schema.json values.yaml has had images.ingestor.digest since #126, and the jobs-manager template surfaces it as INGESTOR_IMAGE_DIGEST, but the schema didn't validate it — every other image (jobsManager, podsMonitor, resourceMonitor, requestsProxy, mysqlClient, busybox) has an entry. An operator setting --set images.ingestor.digest=foo (not the canonical sha256:<64-hex>) bypassed schema validation and failed only later inside submit_ingestion_run.py. Add the missing entry mirroring the other image entries' shape. helm template now rejects malformed digests at chart-template time ("values don't meet the specifications of the schema(s)... Does not match pattern '^(sha256:[a-f0-9]{64})?$'") rather than waiting for runtime. Caught by bugbot on PR #137 review. Closes #141. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com> Co-authored-by: Syed Is Saqlain <saqlain.syed007@gmail.com> Co-authored-by: Syed Saqlain <syedsaqlain@MacBook-Pro.local> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Lukas Wuttke <lukas@tracebloc.io>
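The digest pattern quoted in the #141 fix above can be checked locally before a value ever reaches `helm template`. This is a rough sketch only: `is_valid_digest` is a hypothetical helper, not something the chart ships, and it uses `grep -E` rather than Helm's own schema validator, so it approximates the check instead of reproducing it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): tests a digest string against
# the pattern quoted in the #141 commit message. An empty value is accepted,
# matching the optional group in the pattern.
is_valid_digest() {
  printf '%s\n' "$1" | grep -Eq '^(sha256:[a-f0-9]{64})?$'
}

for candidate in \
  "" \
  "foo" \
  "sha256:e6639b084d0d377072dc908db376050914ebd49c730ddaa13f838d10f5482ea9"
do
  if is_valid_digest "$candidate"; then
    echo "accept: '$candidate'"
  else
    echo "reject: '$candidate'"
  fi
done
```

The third candidate is the v0.3.0-rc1 digest quoted earlier in this log; `foo` is the malformed example from the commit message.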
docs: surface the declarative ingestor flow from top-level docs
Brings the image-refresh CronJob feature to main for release:

- #155: feat(#154) auto-refresh jobs-manager image on Docker Hub publish
- #156: fix(#154) read annotations via jq (kubectl jsonpath bracket notation returns empty for keys containing dots/slashes)

Verified end-to-end on the tb-client-dev-templates EKS dev cluster: a fresh upgrade installs the new resources cleanly, the first tick records both annotations without a restart, the second tick is a no-op, and a forced digest mismatch triggers the expected rollout-restart-then-annotate sequence. Rollout history bumps as expected.

Closes #154.
Sync develop → main for v1.4.0 chart release
Brings two changes to main for release:

- #161: chore: pin client chart's ingestor digest to v0.3.1
- #159: feat(#158): auto-refresh ingestor image digest without chart release (extends image-refresh CronJob from #155/v1.4.0 to reconcile ghcr.io/tracebloc/ingestor digests via kubectl set env)

#159 went through five iterations of bugbot findings during review (rollout-failure retry → annotation source of truth; env drift via kubectl rollout undo / edit / GitOps reconcile → re-apply path; adopt-as-baseline rollout-status check). End-to-end smoke-tested on the tb-client-dev-templates EKS dev cluster against the existing ghcr.io 0.3 floating tag.

Closes #158, completes the rollout pattern for #154.
Caught in PR #162 review (bugbot, two medium-severity issues).

1. Env-drift rollout retry gap

The no-op branch (annotation == registry AND spec env == recorded) was a bare log statement with no rollout-health verification. A previous tick's env-drift `kubectl set env` commits its spec change to etcd BEFORE `kubectl rollout status` waits for the new ReplicaSet to come up. If the rollout fails, `set -eu` aborts, but the spec write persists. Next tick: annotation, registry, and spec env all match (because the spec write committed), so the no-op branch fires and silently masks the stuck rollout. Running pods may be on the old or empty INGESTOR_IMAGE_DIGEST while the script reports success.

Fix: call `kubectl rollout status` in the no-op branch too. On a healthy deployment it returns near-instantly (no active rollout to wait for). On a stuck deployment it times out, `set -eu` aborts, and the Job is visibly failed in `kubectl get cronjob`. The operator then sees the stuck state and can investigate. Image-refresh can't autonomously recover from a bad image push, but making the failure visible is the right behaviour.

2. Default ingestor tag mismatched team's publishing convention

Chart defaulted `images.ingestor.tag: prod`. The team's ghcr.io/tracebloc/ingestor repo uses semver-style float tags (`0`, `0.3`); there is no `prod` tag. Default install would silently no-op every tick because manifest resolution 404'd:

    curl ... ghcr.io/v2/.../manifests/prod → 404
    log " WARN: could not resolve latest digest; skipping"

The whole ingestor auto-refresh feature wouldn't work for any customer running the chart's defaults, despite `autoRefresh: true`.

Fix: changed default to "0.3" (conservative: patch-only auto-track; won't pick up a future 0.4 with breaking changes). Operators can override to "0" if they want major-version auto-tracking. Long-term, the team should consider standardising the chart default once the data-ingestors release-image.yml formalises its tag-publishing contract; for now this matches what we tested with on the dev cluster.

Regression tests:

* Default tag asserted as "0.3" with `notContains` of "prod" to guard against silent revert.
* No-op branch asserted to call `kubectl rollout status` via (?s)-multiline regex matching the "verifying deployment health" log line + the kubectl rollout status call.
* Existing test updated from value: prod to value: "0.3".

141/141 unit tests pass.

NB: these commits are landing on the sync branch directly to avoid another full develop-PR cycle before release. After #162 merges, the same content will need to flow back to develop, either via a "sync main → develop" PR or by cherry-picking the two commits. The divergence is two commits and is easy to resolve.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Caught in PR #162 review (bugbot, medium severity).

The chart default was changed from "prod" to "0.3" in the values block (matching the team's ghcr.io publishing convention), but the CronJob template's runtime fallback was left at `| default "prod"`. Two render paths:

* helm install / helm upgrade --reset-then-reuse-values: the chart's new default ("0.3") flows through, runtime fallback never fires, INGESTOR_TAG="0.3". OK.
* helm upgrade --reuse-values from a pre-v1.4.1 stored manifest: the stored values lack `images.ingestor.tag` entirely. Runtime fallback fires, renders INGESTOR_TAG="prod", which 404s on ghcr.io because that tag doesn't exist. Ingestor refresh silently no-ops every tick.

Failure mode is graceful (log warning, no crash), but inconsistent with the per-customer expectation that v1.4.1 enables ingestor auto-refresh. autoUpgrade itself uses --reset-then-reuse-values, so this only hits manual --reuse-values upgrades: narrow but real.

Fix: change runtime fallback to "0.3" so both render paths converge.

Regression test simulates the --reuse-values scenario by setting images.ingestor.tag=null, exercising the runtime fallback. 142/142 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Merge pull request #161 from tracebloc/chore/pin-ingestor-v0.3.1-160 chore: pin client chart's ingestor digest to v0.3.1 * feat(#158): auto-refresh ingestor image digest without chart release (#159) * feat(#158): auto-refresh ingestor image digest without chart release Extends the existing image-refresh CronJob (#155) to also reconcile the ghcr.io/tracebloc/ingestor digest onto the live jobs-manager deployment's INGESTOR_IMAGE_DIGEST env var. New ingestor image publishes to the floating tag are now picked up within the cronjob's poll interval (~15 min) instead of requiring a full chart release. Why Today, shipping a new ghcr.io/tracebloc/ingestor image required bumping client/values.yaml images.ingestor.digest + client/Chart.yaml + PR + sync to main + release tag. That's hours of overhead per bump and asymmetric with jobs-manager (which already gets the ~15-min image-refresh path). The asymmetry hurts because the ingestor changes frequently as the data-ingestors team iterates. Design Two image classes in one CronJob now: Class 1 (jobs-manager, pods-monitor): Registry: docker.io Tag: CLIENT_ENV Source of truth: deployment annotation `tracebloc.io/last-refreshed-<image>-digest` (#154) Action on change: kubectl rollout restart Class 2 (ingestor): Registry: ghcr.io Tag: images.ingestor.tag (default "prod") Source of truth: live INGESTOR_IMAGE_DIGEST env value on the api container of the jobs-manager deployment (no annotation needed — the env IS the digest jobs-manager passes to each spawned ingestion Job, so the most direct read of "what will be used next" is THIS value). Action on change: kubectl set env (triggers natural rollout via ReplicaSet rotation — no explicit `rollout restart`). get_token + get_latest_digest parameterized by registry; both docker.io and ghcr.io support anonymous pull tokens for public images with only the issuer URL differing. Per-image opt-out * jobs-manager / pods-monitor: same as #154 — set `images.<image>.digest` non-empty. 
* ingestor: explicit `images.ingestor.autoRefresh: false` flag. Asymmetric because ingestor.digest must be non-empty for jobs-manager to function (an empty env would 503 every ingestion submit), so we can't use digest-presence as the signal. When ALL THREE pin signals are active, the chart renders no image-refresh resources at all (helper `imageRefreshEnabled`). When at least one is unpinned, the cronjob is rendered and the script skips pinned images via env flags at runtime. Chart-default ingestor digest stays pinned (v0.3.0) as the baseline for greenfield installs; image-refresh dynamically updates the live env from there. Helm's 3-way merge preserves image-refresh's writes across future helm upgrades as long as the chart's pinned baseline doesn't change. Subtle gotcha caught in dev `default true $autoRefresh` in Go templates returns `true` even when $autoRefresh is explicitly `false` (Go treats bool false as falsy, so default overrides it). Switched to `eq $autoRefresh false` directly — absence (nil) and explicit `true` both fall through to "not pinned" as intended. Test pinned against the correct idiom. Other changes * `log()` continues to write to stderr (#155 fix). * `get_container_env` helper for jq-based env-var reads — same kubectl-jsonpath caveat as `get_annotation` (#156). * Chart version bumped 1.4.0 → 1.4.1. Tests 20 image-refresh-suite tests (was 17), 140 total pass. 
New assertions: * all-three-pinned renders zero resources * only-jobs-manager+pods-monitor-pinned still renders (regression guard for the asymmetric pin signal — without this, the ingestor would never auto-refresh on default installs) * INGESTOR_PINNED flips correctly on autoRefresh=false * INGESTOR_TAG is overridable, `latest` rejected by schema * Script must include `kubectl set env`, `ghcr.io`, `auth.docker.io`, `get_container_env`, the empty-env fill-from-registry path, and the autoRefresh-skip log line Closes #158 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#158): annotation as source of truth for ingestor (rollout-failure retry) Caught in PR #159 review (bugbot, medium severity). The original design used the live spec env (`INGESTOR_IMAGE_DIGEST` on the api container) as the source of truth for "what image-refresh has reconciled to." `kubectl set env` commits the new spec to etcd BEFORE `kubectl rollout status` waits for the rollout to complete. If the rollout times out or the new ReplicaSet's pods fail to come up: * `set -eu` aborts the script. * But the spec already matches the registry. * Next tick: `get_container_env` returns the new digest, compares equal to registry, no-op → script appears successful. * Meanwhile the old ReplicaSet's pods are still running with the OLD env, and the new ReplicaSet is stuck failing. The deployment is frozen on the old version with no retry signal. Fix: mirror the jobs-manager/pods-monitor pattern from #154. Use a `tracebloc.io/last-refreshed-ingestor-digest` annotation on the deployment as the source of truth. Update the annotation as the LAST step, only after `rollout status` succeeds. A failed rollout aborts before the annotate → next tick sees stale annotation → retries. First-observation contract for ingestor: * Non-empty spec env (the normal case — chart populates a default): adopt as baseline annotation, don't touch env. Same "don't churn on install" principle as jobs-manager first-observation. 
* Empty spec env (corrupted state, manual kubectl edit, stale --reuse-values): fill from registry on first tick. Empty would otherwise cause jobs-manager to 503 on every ingestion submit, so the "don't churn" trade is wrong in that case. Tests pin: * Annotation key `tracebloc.io/last-refreshed-ingestor-digest` appears in the script. * Order-of-operations: set env → rollout status → annotate (the annotate MUST come last; regex matches the full sequence). 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(#158): correct stale top-of-script comment + add first-obs test Found during PR #159 self-review. The top-of-script comment block still described the pre-bugbot-fix design — claimed "Source of truth: the live env value itself (no annotation needed)" and "no 'first observation' empty-state, each tick is a normal compare-and-patch-if-different." Both were stale after e7cf829 switched ingestor to annotation-based source of truth and added the first-observation branch. Anyone reading the script-level overview would have been misled about the actual loop. Comment now matches the code: annotation as source of truth, two- case first-observation contract (non-empty → adopt as baseline; empty → fill from registry). Also adds a positive regression test for the previously-untested first-observation "adopting spec env as baseline" branch. The empty-spec-env branch was already covered indirectly by the existing "would 503 on ingestion submit" regex. 140/140 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#158): also reconcile env drift (rollout undo / kubectl edit / GitOps) Caught in PR #159 review (bugbot, medium severity). Follow-up to the annotation-source-of-truth switch in e7cf829. The previous no-op branch fired whenever annotation == registry and NEVER read the live spec env. 
That meant any external actor that reverted the spec env without touching the annotation would leave the deployment on a stale env indefinitely. The annotation continued to match the registry, so image-refresh kept skipping. Real scenarios this affects: * `kubectl rollout undo deployment/X` — reverts pod template to a previous ReplicaSet's spec, including its INGESTOR_IMAGE_DIGEST env. Annotation on deployment metadata is untouched. * `kubectl edit deployment X` — operator manually changes the env. * Certain `helm upgrade` flag combos can reset env to the chart's pre-image-refresh baseline while preserving annotations (e.g., --reset-values or upgrade from a chart where the digest baseline differs from what image-refresh had reconciled to). * GitOps reconcilers (Argo CD, Flux) that own the deployment spec will revert image-refresh's env writes back to the rendered template values. In all of these, the live deployment runs a stale ingestor image forever — exactly the failure mode #158 was meant to prevent. Fix: each tick now reads both the annotation AND the live spec env. Three reconciliation paths: * recorded != registry → "registry drift". Set env to registry, wait for rollout, update annotation. (Existing behaviour.) * recorded == registry AND spec env != recorded → "env drift". Set env to recorded value (NOT registry — registry matches recorded by definition here, but recorded is the value we last decided to roll to). Wait for rollout. Don't update the annotation; it's already correct. * recorded == registry AND spec env == recorded → fully in sync, no-op. Tests pin: * The "spec env drifted" log line. * The drift-recovery branch sets env to `${recorded_ingestor}`, not `${latest_ingestor}` (different from the registry-drift branch). Regex catches the variable used in `INGESTOR_IMAGE_DIGEST=`. Top-of-script comment block updated to document the drift recovery. 140/140 tests pass. 
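The three reconciliation paths listed above reduce to a small decision function. The sketch below is an illustration under assumptions, not the script's actual code: `decide_ingestor_action` and its return strings are hypothetical names, and the first-observation case (empty annotation) described earlier in this PR is deliberately left out.

```shell
#!/bin/sh
# Hypothetical decision helper; arguments are the three digests compared on
# each tick:
#   $1 recorded  - value of the last-refreshed annotation on the deployment
#   $2 registry  - digest the floating tag currently resolves to
#   $3 spec_env  - live INGESTOR_IMAGE_DIGEST in the deployment spec
decide_ingestor_action() {
  recorded=$1
  registry=$2
  spec_env=$3
  if [ "$recorded" != "$registry" ]; then
    # Registry moved: set env to the registry digest, wait, then annotate.
    echo "registry-drift"
  elif [ "$spec_env" != "$recorded" ]; then
    # Something reverted the spec: re-apply the recorded digest, wait,
    # and leave the annotation alone.
    echo "env-drift"
  else
    # Annotation, registry and spec env all agree.
    echo "in-sync"
  fi
}

decide_ingestor_action sha256:old sha256:new sha256:old
decide_ingestor_action sha256:new sha256:new sha256:old
decide_ingestor_action sha256:new sha256:new sha256:new
```

Checking the registry comparison first mirrors the order in the commit message: env drift is only meaningful once the annotation and the registry already agree.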
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): wait for rollout status in adopt-as-baseline branch

Caught in PR #159 review (bugbot, high severity).

Scenario the existing code mishandles:

1. Tick N: empty-spec-first-obs branch runs `kubectl set env` (commits new spec to etcd) → `kubectl rollout status` times out → `set -eu` aborts before the annotate. Annotation stays empty.
2. Tick N+1: annotation still empty. spec env is now non-empty (the failed-rollout's spec change persists). get_container_env returns that value, so the adopt-as-baseline branch fires.
3. Adopt-as-baseline only annotates; it never checks rollout health. Annotation records "we're at D1" while running pods are still on the old/empty env from before tick N.

The deployment now appears reconciled (annotation == registry on subsequent ticks) while actually being stuck on the wrong image.

Fix: call `kubectl rollout status` inside the adopt-as-baseline branch before the annotate. On a healthy deployment it returns near-instantly; on a stuck rollout from a previous failed set-env it times out, `set -eu` aborts before the annotate, next tick retries. No latency cost on the happy path.

Regression test pins the (?s)-multiline order: adopting → rollout status → annotate.

140/140 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): two bugbot follow-ups + chart default tag

Caught in PR #162 review (bugbot, two medium-severity issues).

1. Env-drift rollout retry gap

The no-op branch (annotation == registry AND spec env == recorded) was a bare log statement with no rollout-health verification. A previous tick's env-drift `kubectl set env` commits its spec change to etcd BEFORE `kubectl rollout status` waits for the new ReplicaSet to come up. If the rollout fails, `set -eu` aborts, but the spec write persists. Next tick: annotation, registry, and spec env all match (because the spec write committed), so the no-op branch fires and silently masks the stuck rollout. Running pods may be on the old or empty INGESTOR_IMAGE_DIGEST while the script reports success.

Fix: call `kubectl rollout status` in the no-op branch too. On a healthy deployment it returns near-instantly (no active rollout to wait for). On a stuck deployment it times out, set -eu aborts, and the Job is visibly failed in `kubectl get cronjob`. The operator then sees the stuck state and can investigate. Image-refresh can't autonomously recover from a bad image push, but making the failure visible is the right behaviour.

2. Default ingestor tag mismatched team's publishing convention

Chart defaulted `images.ingestor.tag: prod`. The team's ghcr.io/tracebloc/ingestor repo uses semver-style float tags (`0`, `0.3`); there is no `prod` tag. Default install would silently no-op every tick because manifest resolution 404'd:

    curl ... ghcr.io/v2/.../manifests/prod → 404
    log " WARN: could not resolve latest digest; skipping"

The whole ingestor auto-refresh feature wouldn't work for any customer running the chart's defaults, despite `autoRefresh: true`.

Fix: changed default to "0.3" (conservative: patch-only auto-track; won't pick up a future 0.4 with breaking changes). Operators can override to "0" if they want major-version auto-tracking. Long-term, the team should consider standardising the chart default once the data-ingestors release-image.yml formalises its tag-publishing contract; for now this matches what we tested with on the dev cluster.

Regression tests:

* Default tag asserted as "0.3" with `notContains` of "prod" to guard against silent revert.
* No-op branch asserted to call `kubectl rollout status` via (?s)-multiline regex matching the "verifying deployment health" log line + the kubectl rollout status call.
* Existing test updated from value: prod to value: "0.3".

141/141 unit tests pass.
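The ordering both rollout fixes rely on (verify rollout health first, annotate only on success) can be sketched as below. This is an illustration under stated assumptions, not the script's code: `kubectl` is replaced by a stub so the sketch runs without a cluster, and `verify_then_annotate`, `CALL_LOG`, `ROLLOUT_OK`, and the annotation key `example.invalid/ingestor-digest` are all hypothetical names. The commit messages describe `set -eu` aborting the script; the sketch uses an explicit `|| return 1` instead, so the same ordering also holds when the function is called from a conditional, where `set -e` is ignored.

```shell
#!/bin/sh
set -eu

# Where the stub records calls; hypothetical, defaults to discarding them.
CALL_LOG=${CALL_LOG:-/dev/null}

# Stub standing in for the real kubectl so the sketch runs without a cluster.
# ROLLOUT_OK=0 simulates `kubectl rollout status` timing out.
kubectl() {
  echo "kubectl $*" >> "$CALL_LOG"
  if [ "$1 $2" = "rollout status" ] && [ "${ROLLOUT_OK:-1}" != "1" ]; then
    return 1
  fi
}

# Hypothetical adopt-as-baseline / no-op step: check health BEFORE annotating.
#   $1 deployment name, $2 digest to record
verify_then_annotate() {
  deploy=$1
  digest=$2
  # A stuck rollout fails here, so the annotate below never runs
  # and the next tick sees the same state and retries.
  kubectl rollout status "deployment/$deploy" --timeout=120s || return 1
  kubectl annotate "deployment/$deploy" \
    "example.invalid/ingestor-digest=$digest" --overwrite
}
```

With `ROLLOUT_OK=0`, a call such as `verify_then_annotate ingestor sha256:d1` returns non-zero after the rollout check and records no annotate call, which is the property the regression tests pin via the multiline regex.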
NB: these commits are landing on the sync branch directly to avoid another full develop-PR cycle before release. After #162 merges, the same content will need to flow back to develop, either via a "sync main → develop" PR or by cherry-picking the two commits. The divergence is two commits and is easy to resolve.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#158): align INGESTOR_TAG runtime fallback with chart default

Caught in PR #162 review (bugbot, medium severity).

The chart default was changed from "prod" to "0.3" in the values block (matching the team's ghcr.io publishing convention), but the CronJob template's runtime fallback was left at `| default "prod"`.

Two render paths:

* helm install / helm upgrade --reset-then-reuse-values: the chart's new default ("0.3") flows through, runtime fallback never fires, INGESTOR_TAG="0.3". OK.
* helm upgrade --reuse-values from a pre-v1.4.1 stored manifest: the stored values lack `images.ingestor.tag` entirely. Runtime fallback fires, renders INGESTOR_TAG="prod", which 404s on ghcr.io because that tag doesn't exist. Ingestor refresh silently no-ops every tick.

Failure mode is graceful (log warning, no crash), but inconsistent with the per-customer expectation that v1.4.1 enables ingestor auto-refresh.

autoUpgrade itself uses --reset-then-reuse-values, so this only hits manual --reuse-values upgrades: narrow but real.

Fix: change runtime fallback to "0.3" so both render paths converge.

Regression test simulates the --reuse-values scenario by setting images.ingestor.tag=null, exercising the runtime fallback.

142/142 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
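The two render paths in the last commit come down to one rule: when the stored values carry no tag, the fallback must equal the chart default. A rough shell analogue of the template's `| default "0.3"` is sketched below. It is only an analogy (Helm evaluates the fallback at render time, not in the script), and `render_ingestor_tag` and `CHART_DEFAULT_TAG` are hypothetical names.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for the chart default in values.yaml.
CHART_DEFAULT_TAG="0.3"

# Hypothetical analogue of `{{ .Values.images.ingestor.tag | default "0.3" }}`:
# an unset or empty stored value falls back to the chart default.
#   $1 stored tag (may be omitted or empty, as after a --reuse-values upgrade)
render_ingestor_tag() {
  stored=${1:-}
  echo "${stored:-$CHART_DEFAULT_TAG}"
}
```

Calling `render_ingestor_tag ""` models the `--reuse-values` path where the key is missing, while `render_ingestor_tag 0` models an operator override that the fallback leaves alone.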
Brings the bugbot-followup commits from #162 back to develop:

- edd22f2: rollout-status check in no-op branch + chart default tag prod → 0.3
- 15bc136: INGESTOR_TAG runtime fallback prod → 0.3

These landed on the sync branch directly during release prep to avoid an extra develop-PR cycle. Re-applying them here so develop stays current.
👋 Heads-up: Code review queue is at 16 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
shujaatTracebloc approved these changes on May 26, 2026
Note
Medium Risk
Changes operational behavior of the image-refresh CronJob (deployment rollouts and registry tag resolution), which affects which ingestor digest jobs-manager uses over time.
Overview
Hardens the image-refresh ingestor path so a “fully in sync” tick no longer exits silently when a prior `kubectl set env` left the deployment stuck (spec/annotation/registry agree but rollout never finished). The no-op branch now runs `kubectl rollout status`, surfacing failures as a failed CronJob.

Aligns the default ingestor floating tag from `prod` to `0.3` in `values.yaml`, the CronJob `INGESTOR_TAG` env (including `| default "0.3"` for `--reuse-values` upgrades missing the key), and helm tests so ghcr.io polling matches the team’s semver float tags instead of 404’ing every tick.

Reviewed by Cursor Bugbot for commit 80d2f37. Bugbot is set up for automated code reviews on this repo.