diff --git a/environment-setup/configuration.mdx b/environment-setup/configuration.mdx
index d626845..b09faa4 100644
--- a/environment-setup/configuration.mdx
+++ b/environment-setup/configuration.mdx
@@ -3,11 +3,13 @@ title: "Configuration"
 description: "Customize your tracebloc workspace — environment variables, cluster management, GPU support, and manual Helm deployment."
 ---

-The installer uses sensible defaults. This page covers everything you can change — from cluster naming and port mapping to GPU configuration, manual Helm deployment, and day-to-day cluster management.
+The installer uses sensible defaults; this page covers what you can change.
+
+**Installed with the one-liner?** See [Installer Options](#installer-options), [Cluster Management](#cluster-management), and [GPU Support](#gpu-support). **Deploying into your own cluster with Helm** (EKS, AKS, bare-metal)? Jump to [Manual Deployment](#manual-deployment).

 ## Installer Options

-Override defaults by setting environment variables before the install command. Useful when you need a custom cluster name, multiple worker nodes, or non-standard ports.
+Override defaults by setting environment variables before the install command. Useful for a custom cluster name, extra worker nodes, or a different data directory.

 | Variable | Default | Description |
 |----------|---------|-------------|
@@ -15,8 +17,6 @@ Override defaults by setting environment variables before the install command. U
 | `SERVERS` | `1` | Number of control-plane nodes |
 | `AGENTS` | `1` | Number of worker nodes |
 | `K8S_VERSION` | `v1.29.4-k3s1` | k3s image tag |
-| `HTTP_PORT` | `80` | Host port mapped to cluster HTTP ingress |
-| `HTTPS_PORT` | `443` | Host port mapped to cluster HTTPS ingress |
 | `HOST_DATA_DIR` | `~/.tracebloc` | Persistent data directory on host |

 Example — custom cluster name with two worker nodes:
@@ -64,7 +64,7 @@ Install logs are saved to `~/.tracebloc/install-*.log`.
 ## GPU Support

-The installer auto-detects GPU hardware and configures the cluster accordingly. No manual setup required on Linux — the installer handles drivers, container toolkit, and Kubernetes device plugin.
+GPU is automatic on Linux — the installer detects your hardware and sets up drivers, the container toolkit, and the Kubernetes device plugin.

 ### NVIDIA (Linux)

@@ -94,7 +94,7 @@ The installer does **not** install GPU drivers on Windows. Pre-install NVIDIA dr

 Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment.

-A single unified chart — **`tracebloc/client`** — supports AKS, EKS, bare-metal, and OpenShift. Platform behaviour is selected via values overrides; reference defaults live in the repo at [`client/ci/{aks,eks,bm,oc}-values.yaml`](https://github.com/tracebloc/client/tree/main/client/ci).
+A single chart — **`tracebloc/client`** — supports AKS, EKS, bare-metal, and OpenShift; choose your platform via values overrides. Reference defaults live at [`client/ci/{aks,eks,bm,oc}-values.yaml`](https://github.com/tracebloc/client/tree/main/client/ci).

 ### Add the Helm repository

@@ -266,7 +266,7 @@ env:

 #### Auto-upgrade (on by default)

-Releases of chart `1.3.0+` install a `-auto-upgrade` CronJob that polls `https://tracebloc.github.io/client` daily and runs `helm upgrade --reset-then-reuse-values` whenever a newer chart version is published. Closes [tracebloc/client#69](https://github.com/tracebloc/client/issues/69) — older deployed clients no longer drift from the latest secure release.
+Releases of chart `1.3.0+` install a `-auto-upgrade` CronJob that polls `https://tracebloc.github.io/client` daily and runs `helm upgrade --reset-then-reuse-values` whenever a newer chart version is published — so clients auto-update instead of staying pinned to the version they were installed with.
 ```yaml
 autoUpgrade:
diff --git a/environment-setup/eks-client-deployment-guide.mdx b/environment-setup/eks-client-deployment-guide.mdx
index bde8421..140bb7a 100644
--- a/environment-setup/eks-client-deployment-guide.mdx
+++ b/environment-setup/eks-client-deployment-guide.mdx
@@ -5,12 +5,15 @@ description: "Step-by-step guide to deploy Tracebloc on Amazon EKS. Build a prod

 ## Overview
+
+**Use EKS for production** — multi-node, autoscaling, or shared GPU clusters on AWS. For a single machine (a laptop or one server), the [local installer](/environment-setup/setup-guide) is simpler and faster.
+
 Running machine learning workloads in the cloud often requires a reliable, secure, and scalable infrastructure—yet setting it up can be complex. This guide walks you through building a complete Amazon EKS (Elastic Kubernetes Service) environment from scratch using the AWS CLI. By following these steps, you'll create a production-ready foundation with networking, GPU-optional compute, storage, and security fully aligned with AWS and Kubernetes best practices.

 Once the infrastructure is in place, you'll deploy and configure the tracebloc client to securely train and benchmark AI models. This setup ensures that your proprietary data stays within your environment, while still allowing external AI models to be tested and fine-tuned in a controlled, isolated way.

 The result: a scalable, secure platform for high-performance ML workloads that accelerates collaboration with external experts while maintaining full control over your data and IP.

-The entire setup can be completed in about 1–2 hours.
+The entire setup takes ~1–2 hours. If the cluster is already up and you are just adding another client to it, skip the cluster-creation steps and go straight to ["Client Configuration"](#5-client-configuration).
@@ -44,7 +47,7 @@ aws configure set region eu-central-1
 ```

 #### Required Permissions
-Your AWS user/role should have permissions for:
+Requires permissions for:
 - Amazon EKS cluster management
 - VPC and networking resources
 - EC2 instances and security groups
@@ -137,6 +140,8 @@ Together, these measures ensure that external models can be deployed safely into

 ## Quick Setup

+Quick Setup runs an automated script that builds the whole cluster in one go. Want step-by-step control (or to customize networking)? Use [Detailed Setup](#detailed-setup) instead.
+
 ### Purpose

 Spin up a production-ready EKS baseline (VPC, subnets, internet gateway, EKS cluster, managed nodegroup, EFS + CSI driver) in one go. Includes basic validation, colored logging, and a cleanup mode.
@@ -167,7 +172,7 @@ Run `./setup_eks.sh cleanup` to remove cluster, nodegroup, EFS, subnets, gateway
 - **Costs**: This creates billable resources (EC2, EKS, EFS, data transfer). Remove when not needed.
 - **Network model**: Subnets are configured to auto-assign public IPs for simplicity. Adjust to private subnets + NAT as needed.
 - **Kubernetes version**: The script requests `--kubernetes-version 1.32`; update if your region/account supports a different current version.
-- **Security hardening**: Treat this as a solid baseline; adapt SGs, private subnets, IRSA, and PodSecurity/OPA as required by your environment.
+- **Security hardening**: This is a production baseline; harden further for your environment (security groups, private subnets, IRSA, Pod Security/OPA).

 If you prefer more control over your setup and want to customize the environment to your needs, follow the step-by-step guide below.

@@ -454,7 +459,7 @@ Creates a nodegroup with `t3.medium` instances (2 vCPUs, 4 GiB memory) spread ac

 #### Training Nodegroup

-This group runs your ML training workloads and **must be sized appropriately** to provide sufficient memory and compute. Consider dataset size, model type, the number of parallel workloads and whether GPU acceleration is needed. Select instance types and scaling parameters carefully, based on the kind of models you expect to train and the resources they demand.
+This group runs your ML training workloads — size it for your dataset, model type, number of parallel workloads, and whether you need GPUs.

 Refer to the [EC2 instance types list](https://aws.amazon.com/ec2/instance-types) and [EKS managed nodegroups docs](https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html) for guidance.

diff --git a/environment-setup/setup-guide.mdx b/environment-setup/setup-guide.mdx
index 9ba80e6..e1431e9 100644
--- a/environment-setup/setup-guide.mdx
+++ b/environment-setup/setup-guide.mdx
@@ -23,7 +23,7 @@ The installer runs on any modern machine (one host per workspace). These are the

 **Supported platforms:** macOS (Intel & Apple Silicon) · Linux (x86_64 & arm64) · Windows (x86_64 & arm64)

-**Outbound access needed:** The installer downloads container images and connects to the tracebloc platform. Make sure your network allows traffic to `*.docker.io`, `*.tracebloc.io`, `github.com`, and `pypi.org`.
+**Outbound access needed:** The installer pulls container images, the install scripts, and the Helm chart, then connects to the tracebloc platform. Allow traffic to `*.docker.io`, `ghcr.io`, `raw.githubusercontent.com`, `*.github.io`, `*.tracebloc.io`, and `pypi.org`.

 ---

@@ -123,7 +123,7 @@ To upgrade a one-liner install later, run `helm upgrade tracebloc/cl

 ### GPU Support

-The installer auto-detects GPU hardware and configures the cluster accordingly:
+The installer detects your GPU and configures the cluster:

 - **Linux (NVIDIA/AMD)** — drivers, container toolkit, and Kubernetes device plugin are installed automatically. A reboot may be required after driver installation.
 - **macOS** — CPU-only. For GPU workloads, deploy on a Linux machine or use [AWS (EKS)](/environment-setup/eks-client-deployment-guide).
diff --git a/environment-setup/troubleshooting.mdx b/environment-setup/troubleshooting.mdx
index 2f5ac9c..5d72ae8 100644
--- a/environment-setup/troubleshooting.mdx
+++ b/environment-setup/troubleshooting.mdx
@@ -7,6 +7,16 @@ Most issues fall into a few categories: pods not starting, client not connecting
 For real-time cluster monitoring, try [k9s](https://k9scli.io/) — run `k9s -n ` to get a live view of pods, logs, and events.
+
+**Stuck? Generate a support bundle.** Re-run the installer with `--diagnose`:
+
+```bash
+bash <(curl -fsSL https://tracebloc.io/i.sh) --diagnose
+```
+
+It writes a redacted `~/.tracebloc/tracebloc-diagnose-.tgz` — logs, pod status, and versions with **credentials removed** — that you can send to support. The first line of output shows your client version.
+
+
 ## Quick Checks

 | Symptom | Check | Fix |