Stabilize Linux KVM CI on shared runners#284

Open
hiroTamada wants to merge 3 commits into main from fix/deft-runner-contention

@hiroTamada hiroTamada commented Jun 9, 2026

Contributor

Summary

  • Serialize the Linux KVM CI job so multiple hypeman integration suites do not overload the same self-hosted runner host.
  • Run Linux tests with a run-scoped temp directory and lower Go test parallelism, then clean up orphaned VM helper processes tied to that run.
  • Extend Firecracker restored-guest readiness polling to tolerate slower exec-agent/vsock startup under load.

Test plan

  • go test ./lib/instances -run TestWaitForProcessExit -count=1
  • Parsed .github/workflows/test.yml locally to verify the new concurrency and temp-dir env blocks.
  • Inspected failing run 27217664557 / job 80411086183; it did not include these changes and failed on TestFCUFFDOneShotLifecycle exec-agent readiness timeout.

Made with Cursor


Note

Low Risk
Changes are limited to CI workflow, Makefile test invocation, and integration-test cleanup/timeouts; no production auth or API behavior changes.

Overview
Serializes Linux KVM CI on self-hosted runners via a shared linux-kvm-ci-test concurrency group (cancel-in-progress: false) so overlapping integration suites do not pile onto one host.

CI test isolation uses a run-scoped TMPDIR (/tmp/hci{run_attempt}), kills orphaned firecracker / cloud-hypervisor / hypeman-uffd-pager processes tied to that path before and after tests, and runs make test with GO_TEST_PARALLELISM=4. The Makefile adds an optional -parallel flag and forwards TMPDIR / HYPEMAN_TEST_NETWORK_TMPDIR into the sudo go test environment.
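The commit messages on this PR say the temp root is kept short because of VMM Unix socket path limits. On Linux, `sun_path` in `sockaddr_un` is 108 bytes including the terminating NUL, so a socket path can be at most 107 bytes. A minimal sketch of a pre-flight check follows; `checkSocketPath` and the directory names are hypothetical and do not come from the repository.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// linuxSunPathLen is the size of sockaddr_un.sun_path on Linux,
// including the terminating NUL byte.
const linuxSunPathLen = 108

// checkSocketPath joins root and parts and rejects results too long to bind
// as a Unix socket on Linux. Hypothetical helper for illustration.
func checkSocketPath(root string, parts ...string) (string, error) {
	p := filepath.Join(append([]string{root}, parts...)...)
	if len(p) >= linuxSunPathLen {
		return "", fmt.Errorf("socket path is %d bytes, limit is %d: %s", len(p), linuxSunPathLen-1, p)
	}
	return p, nil
}

func main() {
	// "/tmp/hci1" stands in for a run-scoped root; "vm-a" is a made-up instance dir.
	p, err := checkSocketPath("/tmp/hci1", "vm-a", "fc.sock")
	fmt.Println(p, err)
}
```

Every directory level beneath the root counts against the same limit, which is why a short prefix leaves more room for per-instance subdirectories.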

Test harness hardening: integration cleanup now scans /proc for hypervisor helpers whose cmdline references the test data dir (used from manager and QEMU setups). Firecracker requireRunningSleepInstance readiness polling is extended from 30s to 90s to tolerate slower startup under load.

Reviewed by Cursor Bugbot for commit 483a064. Bugbot is set up for automated code reviews on this repo.

hiroTamada and others added 3 commits June 9, 2026 16:07
Limit host-level contention in CI and clean up VM helpers that survive timed-out tests, so Firecracker/QEMU integration runs do not leave pressure on deft-kernel-dev.

Co-authored-by: Cursor <cursoragent@cursor.com>

Keep the run-scoped cleanup root short enough for Firecracker and Cloud Hypervisor Unix socket path limits.

Co-authored-by: Cursor <cursoragent@cursor.com>

Keep the test temp root short for VMM socket limits while placing it under /tmp so the runner can create it.

Co-authored-by: Cursor <cursoragent@cursor.com>