fix(workspace): readiness probe — Provision returns before harnessd inside VM is serving #566

@dennisonbertram

Problem

`HetznerProvider.Create` (`internal/workspace/hetzner.go:33`) polls Hetzner's API until the server reports `ServerStatusRunning`, then returns. `ContainerWorkspace.Provision` (`internal/workspace/container.go:51`) similarly polls Docker's `ContainerInspect` until `State.Running == true`.

Both definitions of "running" are kernel-level, not application-level. They don't mean:

  • cloud-init has finished
  • the harnessd binary is installed
  • harnessd is bound to port 8080
  • the security group / firewall permits inbound traffic from the harnessd host

So the runner gets the workspace's `HarnessURL` (e.g. `http://1.2.3.4:8080`) and immediately tries to use it. The first request hits a port that may not be open yet and fails with ECONNREFUSED, so the run fails in a way that looks like a code bug but is actually a race.

`internal/symphd/dispatcher.go` already has `waitForHarnessReady` to work around this for the orchestrator dispatch path. The standalone runner has no equivalent. The workaround should be promoted into the workspace itself so any caller benefits.

Proposed fix

Add a `WaitReady(ctx context.Context) error` method to the `Workspace` interface (`internal/workspace/workspace.go`):

```go
type Workspace interface {
	Provision(ctx context.Context, opts Options) error
	WaitReady(ctx context.Context) error // NEW: returns nil when the workspace is serving requests
	Destroy(ctx context.Context) error
	HarnessURL() string
	WorkspacePath() string
}
```

For `LocalWorkspace` and `WorktreeWorkspace`: `WaitReady` is a no-op returning nil (no inner harnessd to wait for).

For `ContainerWorkspace` and `VMWorkspace`: probe `HarnessURL() + "/healthz"` with exponential backoff, time out after 2 minutes, return a clean error ("harnessd inside never became ready: ").

In `internal/harness/runner.go` after `provisionRunWorkspace` succeeds: call `ws.WaitReady(ctx)` and treat its error as a provisioning failure (emit `workspace.provision_failed` and fail the run).

Acceptance criteria

  1. `Workspace.WaitReady` is implemented for all four backends.
  2. The runner calls `WaitReady` after `Provision` and before any tool dispatch.
  3. A timeout produces a clear error message identifying which workspace type was probed and what the last probe error was.
  4. Local and worktree modes are not slowed down (their `WaitReady` is an immediate no-op).
  5. Regression test: a fast-boot workspace (local) is leased without delay; a slow-boot workspace (VM with bad bootstrap) fails with a recognizable timeout error within ~2 minutes.

Related

References

  • `internal/symphd/dispatcher.go` — `waitForHarnessReady` (the existing workaround to study).
  • `internal/workspace/workspace.go` — `Workspace` interface.
  • `internal/workspace/container.go:125-136` — current readiness check (kernel-level, not app-level).
  • `internal/workspace/hetzner.go:57-88` — same shape, same gap.

Labels: bug
