Problem
`HetznerProvider.Create` (`internal/workspace/hetzner.go:33`) polls Hetzner's API until the server reports `ServerStatusRunning`, then returns. `ContainerWorkspace.Provision` (`internal/workspace/container.go:51`) similarly polls Docker's `ContainerInspect` until `State.Running == true`.
Both definitions of "running" are kernel-level, not application-level. They don't mean:
- cloud-init has finished
- the harnessd binary is installed
- harnessd is bound to port 8080
- the security group / firewall permits inbound traffic from the harnessd host
So the runner gets the workspace's `HarnessURL` (e.g. `http://1.2.3.4:8080`) and immediately tries to use it. The first request targets a port that may not be open yet, fails with ECONNREFUSED, and the run fails for reasons that look like a code bug but are actually a race.
`internal/symphd/dispatcher.go` already has `waitForHarnessReady` to work around this for the orchestrator dispatch path. The standalone runner has no equivalent. The workaround should be promoted into the workspace itself so any caller benefits.
Proposed fix
Add a `WaitReady(ctx context.Context) error` method to the `Workspace` interface (`internal/workspace/workspace.go`):
```go
type Workspace interface {
	Provision(ctx context.Context, opts Options) error
	WaitReady(ctx context.Context) error // NEW: returns nil when the workspace is serving requests
	Destroy(ctx context.Context) error
	HarnessURL() string
	WorkspacePath() string
}
```
For `LocalWorkspace` and `WorktreeWorkspace`: `WaitReady` is a no-op returning nil (no inner harnessd to wait for).
For `ContainerWorkspace` and `VMWorkspace`: probe `HarnessURL() + "/healthz"` with exponential backoff, time out after 2 minutes, return a clean error ("harnessd inside never became ready: ").
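The probe loop described above could look roughly like the sketch below. It is an illustration under assumptions, not the intended implementation: `waitHealthy`, `probeConfig`, the backoff bounds, the `kind` label, and the rule that only HTTP 200 counts as ready are all hypothetical. Only the `/healthz` path, the exponential backoff, the 2-minute timeout, and the error prefix come from this issue. The timeout is a parameter so that tests do not have to wait two minutes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// probeConfig is a hypothetical set of knobs. The proposal only fixes the
// overall timeout (2 minutes); the backoff bounds are assumptions.
type probeConfig struct {
	Timeout      time.Duration // overall deadline for the whole wait
	InitialDelay time.Duration // pause after the first failed probe
	MaxDelay     time.Duration // upper bound for the doubled pause
}

// waitHealthy polls baseURL+"/healthz" until it answers 200 OK (assumed to
// mean "ready"), the timeout elapses, or the parent context is cancelled.
func waitHealthy(ctx context.Context, client *http.Client, baseURL, kind string, cfg probeConfig) error {
	ctx, cancel := context.WithTimeout(ctx, cfg.Timeout)
	defer cancel()

	lastErr := errors.New("no probe completed")
	delay := cfg.InitialDelay
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/healthz", nil)
		if err != nil {
			return fmt.Errorf("build health probe: %w", err)
		}
		resp, err := client.Do(req)
		switch {
		case err == nil && resp.StatusCode == http.StatusOK:
			resp.Body.Close()
			return nil
		case err == nil:
			resp.Body.Close()
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
		case ctx.Err() == nil:
			// Keep the last real probe failure, not the deadline error itself.
			lastErr = err
		}

		select {
		case <-ctx.Done():
			return fmt.Errorf("harnessd inside never became ready: %s workspace, last probe error: %v: %w",
				kind, lastErr, ctx.Err())
		case <-time.After(delay):
		}
		// Exponential backoff, capped at MaxDelay.
		delay *= 2
		if delay > cfg.MaxDelay {
			delay = cfg.MaxDelay
		}
	}
}

// demo simulates a slow-booting harnessd: two 503 responses, then 200.
func demo() (int32, error) {
	var hits int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&hits, 1) < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	cfg := probeConfig{Timeout: 2 * time.Second, InitialDelay: 10 * time.Millisecond, MaxDelay: 100 * time.Millisecond}
	err := waitHealthy(context.Background(), srv.Client(), srv.URL, "container", cfg)
	return atomic.LoadInt32(&hits), err
}

func main() {
	probes, err := demo()
	fmt.Println("probes:", probes, "err:", err)
}
```

In a real backend, `WaitReady` would presumably call something like this with `HarnessURL()` and a 2-minute `Timeout`. Wrapping the context error with `%w` lets callers use `errors.Is(err, context.DeadlineExceeded)` to tell a timeout from a cancellation.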
In `internal/harness/runner.go` after `provisionRunWorkspace` succeeds: call `ws.WaitReady(ctx)` and treat its error as a provisioning failure (emit `workspace.provision_failed` and fail the run).
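A minimal sketch of that runner-side wiring follows. The names `prepareWorkspace`, `emitFunc`, and the trimmed `workspace` interface are hypothetical stand-ins, since the real `provisionRunWorkspace` signature and event API are not shown in this issue. Only the ordering (provision, then wait) and the `workspace.provision_failed` event name come from the text above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// workspace is a trimmed stand-in for the Workspace interface; the real
// Provision also takes an opts Options argument.
type workspace interface {
	Provision(ctx context.Context) error
	WaitReady(ctx context.Context) error
}

// emitFunc is a hypothetical event hook; the runner's actual event API
// is an assumption here.
type emitFunc func(event string, err error)

// prepareWorkspace provisions the workspace and then blocks until it is
// ready. A readiness failure is reported exactly like a provisioning failure.
func prepareWorkspace(ctx context.Context, ws workspace, emit emitFunc) error {
	if err := ws.Provision(ctx); err != nil {
		emit("workspace.provision_failed", err)
		return fmt.Errorf("provision workspace: %w", err)
	}
	if err := ws.WaitReady(ctx); err != nil {
		emit("workspace.provision_failed", err)
		return fmt.Errorf("wait for workspace readiness: %w", err)
	}
	return nil
}

// neverReady is a test double: provisioning succeeds, harnessd never comes up.
type neverReady struct{}

func (neverReady) Provision(context.Context) error { return nil }
func (neverReady) WaitReady(context.Context) error {
	return errors.New("harnessd inside never became ready: connection refused")
}

// runDemo records which events a never-ready workspace triggers.
func runDemo() ([]string, error) {
	var events []string
	err := prepareWorkspace(context.Background(), neverReady{}, func(event string, _ error) {
		events = append(events, event)
	})
	return events, err
}

func main() {
	events, err := runDemo()
	fmt.Println(events, err)
}
```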
Acceptance criteria
- `Workspace.WaitReady` is implemented for all four backends.
- The runner calls `WaitReady` after `Provision` and before any tool dispatch.
- A timeout produces a clear error message identifying which workspace type was probed and what the last probe error was.
- Local and worktree modes are not slowed down (their `WaitReady` is an immediate no-op).
- Regression test: a fast-boot workspace (local) is leased without delay; a slow-boot workspace (VM with bad bootstrap) fails with a recognizable timeout error within ~2 minutes.
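The regression criterion could be checked along the lines of the sketch below. It uses stubs rather than the real backends, and `readier`, `localStub`, `stuckStub`, and `checkReadiness` are hypothetical names. It also assumes the timeout can be injected, so the slow-boot case finishes in milliseconds instead of the full 2 minutes. A real version would more likely live in a `_test.go` file and use `testing.T`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// readier is the slice of the Workspace interface this check exercises.
type readier interface {
	WaitReady(ctx context.Context) error
}

// localStub models LocalWorkspace: there is no inner harnessd, so
// WaitReady is an immediate no-op.
type localStub struct{}

func (localStub) WaitReady(context.Context) error { return nil }

// stuckStub models a VM whose bootstrap never starts harnessd. The
// injectable timeout is an assumption made for testability.
type stuckStub struct{ timeout time.Duration }

func (s stuckStub) WaitReady(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, s.timeout)
	defer cancel()
	<-ctx.Done() // nothing ever becomes ready, so only the deadline fires
	return fmt.Errorf("harnessd inside never became ready: %w", ctx.Err())
}

// checkReadiness asserts both halves of the regression criterion.
func checkReadiness() error {
	var fast readier = localStub{}
	start := time.Now()
	if err := fast.WaitReady(context.Background()); err != nil {
		return fmt.Errorf("local workspace should be ready at once: %w", err)
	}
	if elapsed := time.Since(start); elapsed > 100*time.Millisecond {
		return fmt.Errorf("local WaitReady took %v, expected a no-op", elapsed)
	}

	var slow readier = stuckStub{timeout: 50 * time.Millisecond}
	err := slow.WaitReady(context.Background())
	if !errors.Is(err, context.DeadlineExceeded) {
		return fmt.Errorf("expected a recognizable timeout error, got %v", err)
	}
	return nil
}

func main() {
	if err := checkReadiness(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```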
References
- `internal/symphd/dispatcher.go` — `waitForHarnessReady` (the existing workaround to study).
- `internal/workspace/workspace.go` — `Workspace` interface.
- `internal/workspace/container.go:125-136` — current readiness check (kernel-level, not app-level).
- `internal/workspace/hetzner.go:57-88` — same shape, same gap.