From d2527a407928a3f5660879272257382981b899c2 Mon Sep 17 00:00:00 2001
From: Lukas Wuttke
Date: Thu, 4 Jun 2026 10:13:12 +0200
Subject: [PATCH] docs(environment-setup): #192 namespace/workspace flow

Depends on client#192 (fixed namespace `tracebloc`, dropped workspace prompt, one-per-machine guard, version-in-summary). Hold until #192 ships; rebase onto develop after the accuracy PR (docs#48) lands.

- setup-guide: drop the "choose a workspace name" step; namespace fixed to `tracebloc`; Version line in the install summary; verify uses `-n tracebloc` + the real pod list (mysql-client, jobs-manager, requests-proxy, resource-monitor).
- configuration: add the TB_NAMESPACE override; installer commands use `-n tracebloc`; Manual-Helm placeholders standardized to ``.
- troubleshooting: one-per-machine guard entry; `` placeholders.

Co-Authored-By: Claude Opus 4.8
---
 environment-setup/configuration.mdx   | 25 +++++++++++++------------
 environment-setup/setup-guide.mdx     | 22 +++++++++++++---------
 environment-setup/troubleshooting.mdx | 19 ++++++++++---------
 3 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/environment-setup/configuration.mdx b/environment-setup/configuration.mdx
index d626845..fd6aeb0 100644
--- a/environment-setup/configuration.mdx
+++ b/environment-setup/configuration.mdx
@@ -15,6 +15,7 @@ Override defaults by setting environment variables before the install command. U
 | `SERVERS` | `1` | Number of control-plane nodes |
 | `AGENTS` | `1` | Number of worker nodes |
 | `K8S_VERSION` | `v1.29.4-k3s1` | k3s image tag |
+| `TB_NAMESPACE` | `tracebloc` | Kubernetes namespace + Helm release name for the client |
 | `HTTP_PORT` | `80` | Host port mapped to cluster HTTP ingress |
 | `HTTPS_PORT` | `443` | Host port mapped to cluster HTTPS ingress |
 | `HOST_DATA_DIR` | `~/.tracebloc` | Persistent data directory on host |
@@ -45,7 +46,7 @@ k3d cluster delete tracebloc
 The jobs manager is the main tracebloc process.
Check its logs when debugging connectivity or job execution issues:

 ```bash
-kubectl logs -n -l app=tracebloc-jobs-manager
+kubectl logs -n tracebloc -l app=tracebloc-jobs-manager
 ```

 ### Useful commands
@@ -55,9 +56,9 @@ Common kubectl commands for inspecting cluster state:
 ```bash
 kubectl get nodes -o wide # Node status and IPs
 kubectl get pods -A # All pods across namespaces
-kubectl get pods -n # Pods in your workspace
-kubectl get pvc -n # Persistent volume claims
-kubectl get services -n # Services and endpoints
+kubectl get pods -n tracebloc # Pods in your workspace
+kubectl get pvc -n tracebloc # Persistent volume claims
+kubectl get services -n tracebloc # Services and endpoints
 ```

 Install logs are saved to `~/.tracebloc/install-*.log`.
@@ -352,7 +353,7 @@ namespace:
 When `create: false` (default) and you want PSA labels on an existing namespace:

 ```bash
-kubectl label namespace \
+kubectl label namespace \
   pod-security.kubernetes.io/warn=restricted \
   pod-security.kubernetes.io/audit=restricted \
   pod-security.kubernetes.io/enforce=restricted
@@ -392,8 +393,8 @@ podDisruptionBudget:
 Install the chart into a new namespace:

 ```bash
-helm upgrade --install tracebloc/client \
-  --namespace \
+helm upgrade --install tracebloc/client \
+  --namespace \
   --create-namespace \
   --values values.yaml
 ```
@@ -404,8 +405,8 @@ The auto-upgrade CronJob handles routine version bumps. To upgrade manually:

 ```bash
 helm repo update
-helm upgrade tracebloc/client \
-  --namespace \
+helm upgrade tracebloc/client \
+  --namespace \
   --reset-then-reuse-values \
   --values values.yaml
 ```
@@ -417,14 +418,14 @@ When upgrading **into** chart 1.3.0 from 1.2.x, use `--reset-then-reuse-values`
 ### Uninstall

 ```bash
-helm uninstall -n
+helm uninstall -n
 ```

 PVCs and the PriorityClass are annotated `helm.sh/resource-policy: keep` so your data and shared cluster resources survive uninstall.
To remove them too:

 ```bash
-kubectl delete pvc --all -n
-kubectl delete namespace
+kubectl delete pvc --all -n
+kubectl delete namespace
 ```

 ### Migrating from legacy charts
diff --git a/environment-setup/setup-guide.mdx b/environment-setup/setup-guide.mdx
index 9ba80e6..9f63963 100644
--- a/environment-setup/setup-guide.mdx
+++ b/environment-setup/setup-guide.mdx
@@ -98,7 +98,7 @@ Verifies Docker is installed and running, detects GPU hardware (falls back to CP
 Provisions a lightweight local Kubernetes cluster inside Docker. First run takes 1–2 minutes to download components.

 **Step 3/4 — Install tracebloc client**
-Prompts for a **workspace name** (e.g. `berlin-team`, `vision-lab`, `ml-mardan`). This identifies the client on your machine and becomes the Kubernetes namespace.
+Deploys the tracebloc client into the cluster — no input required. The client runs in a fixed local namespace (`tracebloc`), and one client runs per machine.

 **Step 4/4 — Connect to tracebloc network**
 Prompts for your **Client ID** and **password** from step 2 above. This links your secure local environment to the tracebloc platform so vendors can submit models for evaluation.
@@ -108,17 +108,18 @@ When it finishes you'll see a summary like:
 ```
 tracebloc client installed successfully

-Workspace :
-Mode : CPU # or GPU
+Workspace : tracebloc
+Version : 1.4.4 # the client version you're running
+Mode : CPU # or GPU

 Logs: ~/.tracebloc/
-Data: /tracebloc/
+Data: /tracebloc/tracebloc
 ```

 Install logs are kept in `~/.tracebloc/` if you need to debug anything.

-To upgrade a one-liner install later, run `helm upgrade tracebloc/client -n --reset-then-reuse-values` (append `--version ` to pin). See [Configuration → Upgrade](/environment-setup/configuration#upgrade) for details — `--reset-then-reuse-values` is required so the values applied by the installer are preserved.
+To upgrade a one-liner install later, run `helm upgrade tracebloc tracebloc/client -n tracebloc --reset-then-reuse-values` (append `--version ` to pin). See [Configuration → Upgrade](/environment-setup/configuration#upgrade) for details — `--reset-then-reuse-values` is required so the values applied by the installer are preserved.

 ### GPU Support

@@ -136,15 +137,18 @@ See [Configuration > GPU](/environment-setup/configuration#gpu-support) for deta
 After the installer finishes, confirm that your workspace is running:

 ```bash
-kubectl get pods -n
+kubectl get pods -n tracebloc
 ```

-You should see two pods in `Running` state:
+You should see these pods in `Running` state:

 | Pod | Role |
 |-----|------|
-| `mysql-...` | Local metadata store — tracks jobs, metrics, and configuration |
-| `tracebloc-jobs-manager-...` | The client — executes training jobs and communicates with the platform |
+| `mysql-client-...` | Local metadata store — tracks jobs, metrics, and configuration |
+| `tracebloc-jobs-manager-...` | The client — runs training jobs and communicates with the platform |
+| `tracebloc-requests-proxy-...` | Routes the client's outbound traffic to the platform |
+
+A `tracebloc-resource-monitor` DaemonSet also runs in the `tracebloc-node-agents` namespace, reporting node capacity.

 Then open [ai.tracebloc.io](https://ai.tracebloc.io) and check that your client status shows **Online**. This confirms the client has established a secure connection to the tracebloc backend.

diff --git a/environment-setup/troubleshooting.mdx b/environment-setup/troubleshooting.mdx
index 2f5ac9c..b963dd7 100644
--- a/environment-setup/troubleshooting.mdx
+++ b/environment-setup/troubleshooting.mdx
@@ -5,14 +5,14 @@ description: "Common issues and debugging commands for your tracebloc workspace.

 Most issues fall into a few categories: pods not starting, client not connecting, or resource limits being hit. Start with the quick checks below — they cover the most common problems.
-For real-time cluster monitoring, try [k9s](https://k9scli.io/) — run `k9s -n ` to get a live view of pods, logs, and events.
+For real-time cluster monitoring, try [k9s](https://k9scli.io/) — run `k9s -n ` to get a live view of pods, logs, and events.

 ## Quick Checks

 | Symptom | Check | Fix |
 |---------|-------|-----|
-| Pods not starting | `kubectl describe pod -n ` | Check resource limits, Docker status |
-| Client shows Offline | `kubectl logs -n -l app=tracebloc-jobs-manager` | Verify client ID/password, check network |
+| Pods not starting | `kubectl describe pod -n ` | Check resource limits, Docker status |
+| Client shows Offline | `kubectl logs -n -l app=tracebloc-jobs-manager` | Verify client ID/password, check network |
 | Docker not running | `docker info` | Start Docker Desktop or daemon |
 | Cluster not found | `k3d cluster list` | Re-run the installer |
 | GPU not detected | `nvidia-smi` | Install NVIDIA drivers, reboot, re-run installer |
@@ -34,6 +34,7 @@ Issues specific to local (k3d) deployments:
 | Error Message | Description | Resolution |
 |---------------|-------------|------------|
 | ServiceBus connection error after Docker restart | When Docker overutilizes local resources and restarts, the ServiceBus connection may fail with `NoneType` errors. | Monitor resources via Docker Dashboard. Restart the jobs manager pod (e.g. in k9s, exit with Ctrl+D) to restore the connection. |
+| Installer stops: *"This machine already runs the tracebloc client '…'"* | A **different** client is already installed here. tracebloc runs one client per machine. | **Update** it: re-run with the same Client ID. **Switch** clients: `k3d cluster delete tracebloc` (wipes the current client + its local data), then re-run. **Run both:** use a separate machine. |

 ## Debugging Commands

@@ -42,9 +43,9 @@ When the quick checks don't resolve the issue, use these commands to dig deeper.
 ### Pod status and logs

 ```bash
-kubectl get pods -n
-kubectl logs -n
-kubectl describe pod -n
+kubectl get pods -n
+kubectl logs -n
+kubectl describe pod -n
 ```

 ### Resource usage
@@ -53,7 +54,7 @@ See if your nodes or pods are running out of CPU or memory:

 ```bash
 kubectl top nodes
-kubectl top pods -n
+kubectl top pods -n
 ```

 ### Storage
@@ -61,7 +62,7 @@ kubectl top pods -n
 Check that persistent volume claims are bound and have enough capacity:

 ```bash
-kubectl get pvc -n
+kubectl get pvc -n
 kubectl get pv
 ```

@@ -70,7 +71,7 @@ kubectl get pv
 If pods fail with `ErrImagePull`, verify that the Docker registry secret exists:

 ```bash
-kubectl get secret regcred -n
+kubectl get secret regcred -n
 ```

 ## CPU and Memory Optimization