14 changes: 7 additions & 7 deletions environment-setup/configuration.mdx
@@ -3,20 +3,20 @@ title: "Configuration"
description: "Customize your tracebloc workspace — environment variables, cluster management, GPU support, and manual Helm deployment."
---

The installer uses sensible defaults. This page covers everything you can change — from cluster naming and port mapping to GPU configuration, manual Helm deployment, and day-to-day cluster management.
The installer uses sensible defaults; this page covers what you can change.

**Installed with the one-liner?** See [Installer Options](#installer-options), [Cluster Management](#cluster-management), and [GPU Support](#gpu-support). **Deploying into your own cluster with Helm** (EKS, AKS, bare-metal)? Jump to [Manual Deployment](#manual-deployment).

## Installer Options

Override defaults by setting environment variables before the install command. Useful when you need a custom cluster name, multiple worker nodes, or non-standard ports.
Override defaults by setting environment variables before the install command. Useful for a custom cluster name, extra worker nodes, or a different data directory.

| Variable | Default | Description |
|----------|---------|-------------|
| `CLUSTER_NAME` | `tracebloc` | Name of the k3d cluster |
| `SERVERS` | `1` | Number of control-plane nodes |
| `AGENTS` | `1` | Number of worker nodes |
| `K8S_VERSION` | `v1.29.4-k3s1` | k3s image tag |
| `HTTP_PORT` | `80` | Host port mapped to cluster HTTP ingress |
| `HTTPS_PORT` | `443` | Host port mapped to cluster HTTPS ingress |
| `HOST_DATA_DIR` | `~/.tracebloc` | Persistent data directory on host |
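
As a rough sketch of how these overrides resolve, the helper below prints each variable from the table, falling back to the documented default when it is unset. The function name `tracebloc_effective_config` is hypothetical and is not part of the installer; it only illustrates the "set the variable before the command" pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the installer): print each setting
# from the table above, using the documented default when it is unset.
tracebloc_effective_config() {
  printf 'CLUSTER_NAME=%s\n'  "${CLUSTER_NAME:-tracebloc}"
  printf 'SERVERS=%s\n'       "${SERVERS:-1}"
  printf 'AGENTS=%s\n'        "${AGENTS:-1}"
  printf 'K8S_VERSION=%s\n'   "${K8S_VERSION:-v1.29.4-k3s1}"
  printf 'HTTP_PORT=%s\n'     "${HTTP_PORT:-80}"
  printf 'HTTPS_PORT=%s\n'    "${HTTPS_PORT:-443}"
  # The table lists ~/.tracebloc; $HOME is used here so it expands in quotes.
  printf 'HOST_DATA_DIR=%s\n' "${HOST_DATA_DIR:-$HOME/.tracebloc}"
}

# Assignments placed before a command are passed to that command.
CLUSTER_NAME=demo AGENTS=2 tracebloc_effective_config
```

Running it with no overrides shows the defaults, which can be a quick sanity check before starting a real install.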

Example — custom cluster name with two worker nodes:
@@ -64,7 +64,7 @@ Install logs are saved to `~/.tracebloc/install-*.log`.

## GPU Support

The installer auto-detects GPU hardware and configures the cluster accordingly. No manual setup required on Linux — the installer handles drivers, container toolkit, and Kubernetes device plugin.
GPU is automatic on Linux — the installer detects your hardware and sets up drivers, the container toolkit, and the Kubernetes device plugin.

### NVIDIA (Linux)

@@ -94,7 +94,7 @@ The installer does **not** install GPU drivers on Windows. Pre-install NVIDIA dr

Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment.

A single unified chart — **`tracebloc/client`** — supports AKS, EKS, bare-metal, and OpenShift. Platform behaviour is selected via values overrides; reference defaults live in the repo at [`client/ci/{aks,eks,bm,oc}-values.yaml`](https://github.com/tracebloc/client/tree/main/client/ci).
A single chart — **`tracebloc/client`** — supports AKS, EKS, bare-metal, and OpenShift; choose your platform via values overrides. Reference defaults live at [`client/ci/{aks,eks,bm,oc}-values.yaml`](https://github.com/tracebloc/client/tree/main/client/ci).

### Add the Helm repository

@@ -266,7 +266,7 @@ env:

#### Auto-upgrade (on by default)

Releases of chart `1.3.0+` install a `<release>-auto-upgrade` CronJob that polls `https://tracebloc.github.io/client` daily and runs `helm upgrade --reset-then-reuse-values` whenever a newer chart version is published. Closes [tracebloc/client#69](https://github.com/tracebloc/client/issues/69) — older deployed clients no longer drift from the latest secure release.
Releases of chart `1.3.0+` install a `<release>-auto-upgrade` CronJob that polls `https://tracebloc.github.io/client` daily and runs `helm upgrade --reset-then-reuse-values` whenever a newer chart version is published — so clients auto-update instead of staying pinned to the version they were installed with.

```yaml
autoUpgrade:
13 changes: 9 additions & 4 deletions environment-setup/eks-client-deployment-guide.mdx
@@ -5,12 +5,15 @@ description: "Step-by-step guide to deploy Tracebloc on Amazon EKS. Build a prod

## Overview

<Note>
**Use EKS for production** — multi-node, autoscaling, or shared GPU clusters on AWS. For a single machine (a laptop or one server), the [local installer](/environment-setup/setup-guide) is simpler and faster.
</Note>

Running machine learning workloads in the cloud often requires a reliable, secure, and scalable infrastructure—yet setting it up can be complex. This guide walks you through building a complete Amazon EKS (Elastic Kubernetes Service) environment from scratch using the AWS CLI. By following these steps, you'll create a production-ready foundation with networking, GPU-optional compute, storage, and security fully aligned with AWS and Kubernetes best practices.

Once the infrastructure is in place, you'll deploy and configure the tracebloc client to securely train and benchmark AI models. This setup ensures that your proprietary data stays within your environment, while still allowing external AI models to be tested and fine-tuned in a controlled, isolated way. The result: a scalable, secure platform for high-performance ML workloads that accelerates collaboration with external experts while maintaining full control over your data and IP.

The entire setup can be completed in about 1–2 hours.
The entire setup takes ~1–2 hours.

If the cluster is already up and you are just adding another client to it, skip the cluster-creation steps and go straight to ["Client Configuration"](#5-client-configuration).

@@ -44,7 +47,7 @@ aws configure set region eu-central-1
```

#### Required Permissions
Your AWS user/role should have permissions for:
Requires permissions for:
- Amazon EKS cluster management
- VPC and networking resources
- EC2 instances and security groups
@@ -137,6 +140,8 @@ Together, these measures ensure that external models can be deployed safely into

## Quick Setup

Quick Setup runs an automated script that builds the whole cluster in one go. Want step-by-step control (or to customize networking)? Use [Detailed Setup](#detailed-setup) instead.

### Purpose
Spin up a production-ready EKS baseline (VPC, subnets, internet gateway, EKS cluster, managed nodegroup, EFS + CSI driver) in one go. Includes basic validation, colored logging, and a cleanup mode.

@@ -167,7 +172,7 @@ Run `./setup_eks.sh cleanup` to remove cluster, nodegroup, EFS, subnets, gateway
- **Costs**: This creates billable resources (EC2, EKS, EFS, data transfer). Remove when not needed.
- **Network model**: Subnets are configured to auto-assign public IPs for simplicity. Adjust to private subnets + NAT as needed.
- **Kubernetes version**: The script requests `--kubernetes-version 1.32`; update if your region/account supports a different current version.
- **Security hardening**: Treat this as a solid baseline; adapt SGs, private subnets, IRSA, and PodSecurity/OPA as required by your environment.
- **Security hardening**: This is a production baseline; harden further for your environment (security groups, private subnets, IRSA, Pod Security/OPA).

If you prefer more control over your setup and want to customize the environment to your needs, follow the step-by-step guide below.

@@ -454,7 +459,7 @@ Creates a nodegroup with `t3.medium` instances (2 vCPUs, 4 GiB memory) spread ac

#### Training Nodegroup

This group runs your ML training workloads and **must be sized appropriately** to provide sufficient memory and compute. Consider dataset size, model type, the number of parallel workloads and whether GPU acceleration is needed. Select instance types and scaling parameters carefully, based on the kind of models you expect to train and the resources they demand.
This group runs your ML training workloads — size it for your dataset, model type, number of parallel workloads, and whether you need GPUs.

Refer to the [EC2 instance types list](https://aws.amazon.com/ec2/instance-types) and [EKS managed nodegroups docs](https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html) for guidance.

4 changes: 2 additions & 2 deletions environment-setup/setup-guide.mdx
@@ -23,7 +23,7 @@ The installer runs on any modern machine (one host per workspace). These are the

**Supported platforms:** macOS (Intel & Apple Silicon) · Linux (x86_64 & arm64) · Windows (x86_64 & arm64)

**Outbound access needed:** The installer downloads container images and connects to the tracebloc platform. Make sure your network allows traffic to `*.docker.io`, `*.tracebloc.io`, `github.com`, and `pypi.org`.
**Outbound access needed:** The installer pulls container images, the install scripts, and the Helm chart, then connects to the tracebloc platform. Allow traffic to `*.docker.io`, `ghcr.io`, `raw.githubusercontent.com`, `*.github.io`, `*.tracebloc.io`, and `pypi.org`.
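
One way to verify the allowlist before installing is a small connectivity probe. This is a sketch only: `check_outbound` is a hypothetical helper, it assumes `curl` is available, and it tests HTTPS on port 443 alone. Wildcard entries cannot be probed directly, so the commented usage line substitutes concrete hosts (`registry-1.docker.io` and `tracebloc.github.io`) as assumed stand-ins; add whichever `*.tracebloc.io` host your workspace actually uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which hosts accept an HTTPS connection.
# Any HTTP response (even 4xx) counts as reachable; only connection,
# DNS, or TLS failures are reported as blocked.
check_outbound() {
  local host failed=0
  for host in "$@"; do
    if curl --silent --head --output /dev/null --connect-timeout 5 "https://${host}/"; then
      echo "ok      ${host}"
    else
      echo "BLOCKED ${host}"
      failed=1
    fi
  done
  return "$failed"
}

# Example call (stand-in hosts for the wildcard entries are assumptions):
# check_outbound registry-1.docker.io ghcr.io raw.githubusercontent.com tracebloc.github.io pypi.org
```

The function returns non-zero if any host is blocked, so it can gate a wrapper script around the install command.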

---

@@ -123,7 +123,7 @@ To upgrade a one-liner install later, run `helm upgrade <workspace> tracebloc/cl

### GPU Support

The installer auto-detects GPU hardware and configures the cluster accordingly:
The installer detects your GPU and configures the cluster:

- **Linux (NVIDIA/AMD)** — drivers, container toolkit, and Kubernetes device plugin are installed automatically. A reboot may be required after driver installation.
- **macOS** — CPU-only. For GPU workloads, deploy on a Linux machine or use [AWS (EKS)](/environment-setup/eks-client-deployment-guide).
10 changes: 10 additions & 0 deletions environment-setup/troubleshooting.mdx
@@ -7,6 +7,16 @@ Most issues fall into a few categories: pods not starting, client not connecting

For real-time cluster monitoring, try [k9s](https://k9scli.io/) — run `k9s -n <workspace>` to get a live view of pods, logs, and events.

<Note>
**Stuck? Generate a support bundle.** Re-run the installer with `--diagnose`:

```bash
bash <(curl -fsSL https://tracebloc.io/i.sh) --diagnose
```

It writes a redacted `~/.tracebloc/tracebloc-diagnose-<timestamp>.tgz` — logs, pod status, and versions with **credentials removed** — that you can send to support. The first line of output shows your client version.
</Note>
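
Before sending a bundle, it can help to look at what it contains. The sketch below assumes only the file-name pattern stated above; `latest_diagnose_bundle` is a hypothetical helper, not an installer command, and the exact `<timestamp>` format is not assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the most recently modified
# diagnose bundle in a directory (default: ~/.tracebloc). Prints nothing
# if no bundle exists.
latest_diagnose_bundle() {
  local dir="${1:-$HOME/.tracebloc}"
  ls -t "$dir"/tracebloc-diagnose-*.tgz 2>/dev/null | head -n 1
}

# To list the archive contents without extracting them:
# tar -tzf "$(latest_diagnose_bundle)"
```

Listing the archive first lets you confirm for yourself that nothing sensitive is included before the file leaves your machine.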

## Quick Checks

| Symptom | Check | Fix |