DKP-2859 start: bound docker readiness wait and dump logs on timeout #36
Merged
Run each `docker ps` under `timeout 60` so a hung daemon returns instead of blocking forever, and give up at a 600s deadline. On timeout, dump the host and VM logs (host/*.log and vm/console.log) so a failed start leaves a trace instead of hanging until the job is cancelled.
Small runners (the 2 vCPU / 7 GB ubuntu-latest on private/internal repos) starve dockerd at startup and the action hangs. Document the requirement and the public-vs-private ubuntu-latest gotcha.
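The sizing requirement above could also be checked at run time. The following is a minimal sketch, not part of this PR: the function name `check_runner_size` and the `MIN_CPUS` / `MIN_MEM_GB` variables are hypothetical, and the 4 vCPU / 16 GB defaults are taken from the README recommendation mentioned in the PR description. It assumes a Linux runner with `nproc` and `/proc/meminfo`.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check (illustration only, not the action's code).
# Usage: check_runner_size [cpus] [mem_kb]
# With no arguments, the values are read from the running machine.
check_runner_size() {
  local cpus="${1:-$(nproc)}"
  local mem_kb="${2:-$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)}"
  local min_cpus="${MIN_CPUS:-4}"        # assumed threshold
  local min_mem_gb="${MIN_MEM_GB:-16}"   # assumed threshold
  local mem_gb

  # MemTotal is reported in kB and is slightly below the nominal size,
  # so round up to the next whole GB before comparing.
  mem_gb=$(( (mem_kb + 1048575) / 1048576 ))

  if [ "$cpus" -lt "$min_cpus" ] || [ "$mem_gb" -lt "$min_mem_gb" ]; then
    echo "warning: runner has ${cpus} vCPU / ${mem_gb} GB;" \
         "recommended is ${min_cpus} vCPU / ${min_mem_gb} GB" >&2
    return 1
  fi
}
```

Calling `check_runner_size || true` before starting Docker would print the warning without failing the job; whether a small runner should warn or fail outright is a design choice left open here.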
3c16fd0 to 6116eeb
ebriney approved these changes Jun 9, 2026
    cat ~/.docker/desktop/log/vm/*.log 2>/dev/null || true
    exit 1
  fi
  echo "docker not ready, sleep 10 s and try again"
"within 600s" vs. "sleep 10 s": I prefer it without the space. Either way, better to stay consistent across the two occurrences.
Ah sorry, I’ll open a follow-up.
Good catch. Opened #38 to make the spacing consistent in a follow-up.
Closes #3.
What this PR does
Bounds the "wait for Docker" loop so a daemon that never comes up fails fast instead of hanging until the job is cancelled, and documents the runner sizing that avoids the hang in the first place.
Notes for the reviewer
The loop ran `until docker ps`. Turns out when the daemon is wedged that call blocks forever rather than returning non-zero, so the loop never iterates and we never sleep, never retry, never give up. The step just sits there until the job timeout kills it (~34 min in this run). You can tell because "docker not ready…" is never printed, not even once.

So each `docker ps` now runs under `timeout 60`, which lets a hung daemon return and the loop give up at a 600s deadline. On timeout I dump the desktop logs so a failed start leaves a trace instead of nothing.

A step/action-level timeout isn't supported for composite actions (see the discussion in #3), hence doing it inside the loop. And the per-call `timeout 60` matters: an outer deadline check alone wouldn't fire, since `docker ps` itself blocks on a wedged daemon.

Paths checked against the pinata `paths` package on Linux:

- `~/.docker/desktop/log/host/` holds the backend log (the monitor tees the backend output to `host/monitor.log`).
- `~/.docker/desktop/log/vm/` holds the VM console (`console.log`, written straight to file by the qemu engine), which is really the one that'll tell us why the VM didn't come up.

While debugging a hang with this, the console showed the VM and containerd starting fine but `dockerd` stalling at startup on a small 2 vCPU / 7 GB runner. So I also added a "Choosing a runner" section to the README recommending 4 vCPU / 16 GB, and calling out that `ubuntu-latest` is smaller on private/internal repos.

This makes the failure visible and fast; it doesn't fix whatever is keeping the VM from booting (looks like nested KVM on the runner, but that's another story).
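For readers who want the shape of the bounded wait without opening the diff, here is a minimal sketch of the approach described in these notes. It is not the code merged in this PR: the function name `wait_for_docker` and the `PER_CALL`, `DEADLINE`, `INTERVAL`, and `LOG_DIR` variables are hypothetical knobs added so the behaviour can be tried with a stand-in probe. It assumes GNU `timeout`, which exits with status 124 when it has to kill the command.

```shell
#!/usr/bin/env bash
# Sketch of a bounded readiness wait (illustration only, not the merged code).
# Usage: wait_for_docker [probe command...]   (defaults to `docker ps`)
wait_for_docker() {
  local per_call="${PER_CALL:-60}"     # seconds before one probe is killed
  local deadline="${DEADLINE:-600}"    # overall budget in seconds
  local interval="${INTERVAL:-10}"     # pause between attempts
  local log_dir="${LOG_DIR:-$HOME/.docker/desktop/log}"
  local start now

  [ "$#" -gt 0 ] || set -- docker ps
  start=$(date +%s)

  # Each probe runs under timeout, so a call that blocks on a wedged daemon
  # comes back as a failure and the deadline check below gets a chance to run.
  until timeout "$per_call" "$@" >/dev/null 2>&1; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline" ]; then
      echo "docker not ready within ${deadline}s, dumping logs" >&2
      cat "$log_dir"/host/*.log 2>/dev/null || true
      cat "$log_dir"/vm/console.log 2>/dev/null || true
      return 1
    fi
    echo "docker not ready, sleep ${interval}s and try again"
    sleep "$interval"
  done
}
```

Running `PER_CALL=1 DEADLINE=3 INTERVAL=1 wait_for_docker sleep 30` simulates a probe that never returns: the function retries and then gives up with status 1 after a few seconds. Without the `timeout` wrapper, the same call would sit in the first probe for the full 30 seconds before the deadline was ever checked.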