feat(workspace): two-tier runtime — long-lived VM hosts running per-run containers #564

## Goal

A single harnessd should be able to provision **one (or a few) long-lived VMs** and then run **many concurrent per-run containers** on those VMs. Today's behavior is one VM per run, which is the wrong runtime model: every run pays a 15-second VM cold start, every run pays a full hour of VM billing for a 30-second task, and concurrency scales linearly with VM count.

## Today's state (relevant code)

- `internal/harness/runner.go:683`: `provisionRunWorkspace` is called inside `StartRun`, and `Workspace.Destroy` runs on terminal events. So one `POST /v1/runs` with `workspace_type: "vm"` boots a fresh Hetzner VM, and 10 concurrent runs boot 10 VMs.
- `internal/workspace/pool.go` exists: `Pool` and `PoolWorkspace` keep N pre-provisioned workspaces and lease them per run. **But standalone harnessd doesn't use the pool**; only `internal/symphd/orchestrator.go:316` does.
- `internal/workspace/container.go` always uses a **local** Docker daemon (`client.NewClientWithOpts(client.FromEnv)`). It can't talk to a Docker daemon on a remote VM.
- `internal/workspace/vm.go` exposes `HarnessURL` and `WorkspacePath`, but neither is consumed: tools run on the host harnessd, not inside the VM.
- Verified live in `docs/investigations/2026-04-29-vm-mode-test.md` (forthcoming): a Hetzner VM was provisioned at `178.105.38.116` for a single run; the agent's file write landed at `/tmp/harness-vm-test/` on the host, not on the VM. The VM was destroyed cleanly at run end.

## Target architecture

Two-tier workspace runtime:

```
                  ┌─────────────────────────────────────┐
   harnessd ────► │  VM Pool (1-N warm VMs)             │
   (orchestrator) │  ┌───────────────────────────────┐  │
                  │  │  VM (long-lived, ~15s boot)   │  │
                  │  │  ├─ Docker daemon             │  │
                  │  │  ├─ Container A (run_xyz)     │  │  ← per-run container
                  │  │  ├─ Container B (run_abc)     │  │  ← per-run container
                  │  │  └─ Container C (run_def)     │  │  ← per-run container
                  │  └───────────────────────────────┘  │
                  └─────────────────────────────────────┘
```

- **VM tier**: pool of long-lived VMs, sized to expected concurrency. Each VM has Docker installed via cloud-init. VMs come up once and stay up across many runs.
- **Container tier**: each run gets a fresh container on one of the warm VMs. Container teardown is sub-second. Boot cost is amortized to ~zero per run.

## Design proposal

### New type: `HostVMWorkspace`

A workspace whose role is *not* to be the run's workspace but to be a **container host**. It exposes:

```go
type HostVMWorkspace interface {
    Workspace                          // Provision / Destroy / WorkspacePath / HarnessURL (the VM's own harnessd)
    DockerHostURL() string             // e.g. "ssh://root@1.2.3.4" or "tcp://1.2.3.4:2376" (TLS)
}
```

Provision: boot the VM, install Docker, and expose its daemon, either over SSH (tunneling the daemon's Unix socket) or over a TLS-secured TCP socket. The VM itself does NOT run a per-run harnessd; it runs Docker.

### Extension: `ContainerWorkspace` accepts a remote Docker host

`internal/workspace/container.go` currently does:
```go
cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
```

Add a `HostURL` field to `ContainerWorkspace`, set from a new `DockerHostURL` option in `Options`, and build the client from it:
```go
cli, err := client.NewClientWithOpts(client.WithHost(opts.DockerHostURL), client.WithAPIVersionNegotiation())
```

When `opts.DockerHostURL == ""`, fall back to local Docker (today's behavior).
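The fallback rule can be sketched as a small pure function that decides how the client should be built before any Docker call is made. `Options` here carries only the one field discussed above, and `dockerTarget` and `resolveDockerTarget` are hypothetical names. One caveat worth checking during Phase 2: to my knowledge the Docker Go SDK does not dial `ssh://` hosts from `client.WithHost` alone (the Docker CLI goes through a connection helper), so the `ssh` case would likely also need a custom dialer passed via `client.WithDialContext`.

```go
package main

import (
	"fmt"
	"net/url"
)

// Options mirrors only the field discussed above; the real struct has more.
type Options struct {
	DockerHostURL string
}

// dockerTarget describes how the Docker client should be constructed.
// It is a hypothetical helper type, not part of the existing code.
type dockerTarget struct {
	UseEnv bool   // true: use client.FromEnv, i.e. today's local behavior
	Scheme string // "ssh", "tcp" or "unix" when UseEnv is false
	Host   string // value to hand to client.WithHost
}

// resolveDockerTarget applies the fallback rule and rejects unknown schemes
// early, so a typo fails at provision time rather than on the first API call.
func resolveDockerTarget(opts Options) (dockerTarget, error) {
	if opts.DockerHostURL == "" {
		return dockerTarget{UseEnv: true}, nil
	}
	u, err := url.Parse(opts.DockerHostURL)
	if err != nil {
		return dockerTarget{}, fmt.Errorf("parse docker host %q: %w", opts.DockerHostURL, err)
	}
	switch u.Scheme {
	case "ssh", "tcp", "unix":
		return dockerTarget{Scheme: u.Scheme, Host: opts.DockerHostURL}, nil
	default:
		return dockerTarget{}, fmt.Errorf("unsupported docker host scheme %q", u.Scheme)
	}
}

func main() {
	for _, h := range []string{"", "ssh://root@1.2.3.4", "http://1.2.3.4"} {
		t, err := resolveDockerTarget(Options{DockerHostURL: h})
		fmt.Printf("%q -> %+v, err=%v\n", h, t, err)
	}
}
```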

### New workspace type: `vm-pool`

A composite workspace that wraps the existing `workspace.Pool` mechanism with a two-tier shape:

- The pool is a pool of `HostVMWorkspace`.
- When `Provision` is called for a run, lease a host VM, then create a `ContainerWorkspace` configured with that VM's `DockerHostURL`. The leased host stays leased only until the container is created; subsequent runs can lease the same host immediately.
- On `Destroy`, destroy the container and return the host to the pool. The host VM keeps running.
- Pool sizes the host tier; container concurrency is bounded by host capacity (configurable).

```go
type VMPoolWorkspace struct {
    pool      *Pool                    // pool of HostVMWorkspace
    leased    HostVMWorkspace          // currently held host
    container *ContainerWorkspace      // the actual per-run workspace
}
```

### Tool execution: still run inside the container

Today, container mode "works" because the bind-mount source is on the host filesystem and tools resolve there. For a remote Docker host the bind-mount source is on the VM, so the host harnessd cannot see it.

Two options:

- **Option A: Tool proxy via inner harnessd** (consistent with symphd's whole-run dispatch). The container runs harnessd, and the per-run registry on the outer harnessd builds tools that proxy to `<container-harnessd>/v1/runs`. Heavyweight but pure.
- **Option B: Tool proxy via `docker exec`**. The outer harnessd's per-run registry builds tools that all do `docker exec <container-id> <cmd>`. ~150 lines of adapter code. Reuses the same Docker client connection that already exists for create/destroy.

**Recommendation: Option B for v1**. `docker exec` is cheaper than running a full inner harnessd, the Docker SDK already has an exec API, and it generalizes to local Docker the same way (today's container mode could move to docker exec instead of host-bind, fixing the "tools run on host" gap there too).

### Bootstrap script: install Docker, expose daemon over TLS or SSH

`internal/workspace/bootstrap.go` currently writes a systemd unit for a harnessd that doesn't exist on the VM. Replace it with:

```bash
# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable --now docker

# Open a TLS-secured TCP socket for the local harnessd to reach over the public internet
# OR: rely on SSH tunneling, where the host harnessd's docker client uses the ssh://... transport
```

SSH transport is cleaner: Hetzner already supports SSH-key injection at create time (`ServerCreateOpts.SSHKeys`), so inject a per-VM key and point the docker client at `ssh://root@<vm-ip>`. There is then no public Docker daemon to secure.

## Acceptance criteria

1. `workspace_type: "vm-pool"` (or similar name) is registered in `internal/workspace/registry.go`.
2. Starting harnessd with pool config (e.g., `HARNESS_VM_POOL_SIZE=1`) provisions one warm VM at boot.
3. POST `/v1/runs` with `workspace_type: "vm-pool"` produces a per-run container on the warm VM in **under 5 seconds** (vs ~15s today for a fresh VM).
4. Tools (`read`, `write`, `bash`, `grep`, `git_*`) execute inside the container: `bash pwd` returns a path that exists on the VM, not on the harness host.
5. Two concurrent runs against the same pool share one VM and produce two distinct containers.
6. `workspace.destroyed` event fires per run; the VM stays running after run completion.
7. Pool shutdown (harnessd termination) destroys all warm VMs cleanly. No leaked Hetzner servers.
8. End-to-end test (gated on `HETZNER_API_KEY`): provision pool → run two concurrent agent tasks → both succeed → pool teardown → `curl https://api.hetzner.cloud/v1/servers` returns 0 servers.

## Phased implementation

| Phase | Scope | Estimate |
|---|---|---|
| **1** | Make today's container mode use `docker exec` for tools instead of host bind-mount. Establishes the tool-proxy pattern. | ~half day |
| **2** | Add `HostURL` plumbing to `ContainerWorkspace`. Verify a single container against a remote Docker daemon over SSH. | ~half day |
| **3** | Implement `HostVMWorkspace` (Hetzner-only initially; reuses existing `HetznerProvider`). Bootstrap installs Docker, exposes daemon over SSH. | ~1 day |
| **4** | Wire `workspace.Pool` for host VMs in standalone harnessd (currently only symphd uses it). New `workspace_type: "vm-pool"`. | ~half day |
| **5** | End-to-end live test, document, polish. | ~half day |

Total: ~3 days.

## Existing code that helps

- `internal/workspace/pool.go`: pool plumbing already exists.
- `internal/workspace/container.go`: Docker SDK usage, just needs `HostURL` parameterization.
- `internal/workspace/vm.go` + `hetzner.go`: VM provisioning works (verified live this session).
- `internal/symphd/orchestrator.go:288`: `buildWorkspaceFactory` is a precedent for composing workspace types.
- `cmd/harnessd/runtime_container.go:105`: subagent path already shows how to rebind tool registries per workspace.

## Out of scope for this issue

- Multi-cloud (only Hetzner via existing `VMProvider` interface).
- Multi-tenant resource isolation (cgroups, network namespaces beyond what Docker gives by default).
- Pool autoscaling (fixed size for v1).
- Cross-VM container migration.
