Faster macOS VM standby/restore: compress machine state + overlap copy by rgarcia · Pull Request #277 · kernel/hypeman

rgarcia · 2026-06-07T02:56:09Z

Summary

Implements the remaining work from the Faster macOS VM standby/restore RFC (docs/proposals/faster-standby-restore.md) for vz (Apple Virtualization.framework) guests on Apple Silicon. Section C (APFS clonefile) landed in #276; this PR adds Sections A and B:

A.1 — compress machine-state.vzm: the vz machine state is added to the snapshot compression candidate set, so the existing async compression job compresses it and the existing restore-prep decompresses it — no new lifecycle code.
A.2 — synchronous shim compression: a darwin/arm64 helper zstd-compresses machine-state.vzm inside the shim when requested, gated behind a new optional CompressMachineState field on the shim snapshot request. The vz Save/Restore API is path-based and cannot be byte-streamed, so compression is a post-write pass.
B — overlap quiesce with the disk copy: for a standby snapshot taken from a running vz source, the machine-state save and the guest-disk CoW clone now run concurrently under one errgroup inside the paused window, instead of serially.

The new options default to off, so every non-vz hypervisor and the auto-standby path are byte-identical to before.

Notes

A race the original Section B sketch missed — the disk copy racing the save over snapshots/snapshot-latest/ — is handled by pruning that subtree from the overlapped copy and copying it in after both legs finish (equivalent to the serial copy for vz, which has no snapshot-base).
Tests: host-portable unit tests (candidate set, compress/decompress round-trip, errgroup cancel/single-leg, overlap fault-injection) run on Linux; a darwin/arm64 file-level zstd round-trip runs via the signed shim; and an end-to-end TestVZRunningSnapshotCompressionOverlap snapshots a running vz guest with compression then forks + restores it — verified passing on the macOS 15 runner.
The design doc is included for context, with Section C marked as landed (Add APFS clonefile fast path for macOS VM disk forks #276).

Note

Medium Risk
Changes standby/snapshot orchestration and vz shim snapshot behavior on a narrow path (running-source vz snapshots with overlap); failure handling resumes the VM and removes orphaned snapshot dirs, but incorrect overlap could still affect snapshot correctness or pause duration.

Overview
Speeds up vz standby/restore on Apple Silicon by shrinking saved RAM state and overlapping work while the guest is paused.

Compression: machine-state.vzm is now part of the existing snapshot compression candidate list, so async jobs and restore prep (ensureSnapshotMemoryReady) handle .zst like other memory artifacts. The vz shim can optionally zstd-compress the state file right after SaveMachineStateToPath via new CompressMachineState / CompressionLevel on hypervisor.SnapshotOptions and the shim snapshot API (defaults off; non-vz hypervisors unchanged).

Overlap (running-source standby snapshots only): For vz, the control plane runs guest-disk clone and machine-state save in parallel (runPausedSnapshotLegs + standbyOverlap) instead of serially. The overlapped copy skips the whole snapshots/ tree to avoid racing the shim, then copyStandbySnapshotStateIntoPayload copies snapshot-latest after both legs finish. With zstd compression enabled on the snapshot, overlap also turns on synchronous shim compression so the payload is already .zst when copied.

Supporting changes: forkvm skip paths now prune skipped directories; snapshot path constants move to build-tag-free shimconfig/paths.go; RFC doc added. Unit and darwin integration tests cover round-trips, errgroup semantics, and overlap failure cleanup.

^{Reviewed by Cursor Bugbot for commit d2319d4. Bugbot is set up for automated code reviews on this repo. Configure here.}

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Reframe the APFS clonefile disk-copy work from proposal to landed context now that #276 merged it. The Summary, Motivation, current-behavior, testing plan, rollout (M1 done), and risks now describe the shipped implementation — real symbol names (copyRegularFileReflink / isReflinkUnsupportedError), the CLONE_NOFOLLOW|CLONE_NOOWNERCOPY flags, and the narrow ENOTSUP/EOPNOTSUPP/EXDEV fallback set with EINVAL/ENOTDIR/EEXIST propagating — rather than the original sketch. Sections A and B remain the live proposals. Also refresh restore.go line references shifted by #279. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Implements the two remaining sections of the faster-standby-restore RFC (Section C / APFS clonefile already shipped in #276). Section A.1 — teach the snapshot-compression subsystem about the vz machine-state file. Add machine-state.vzm to snapshotMemoryFileCandidates at both the standby-latest level and the stored-payload-nested level, so the existing async job compresses it and ensureSnapshotMemoryReady decompresses it before restore. The filename constant moves to a build-tag-free file in shimconfig (mirroring RosettaMountTag) so the host-portable compression code can name it without pulling the darwin-only package. Section A.2 — optional synchronous compression in the shim. Add a darwin/arm64 compressMachineStateFile helper (klauspost zstd; the vz Save API is path-based and cannot stream-compress, so it is a post-write pass) with a matching error stub off arm64 so the shim never references zstd there. Gated behind a new CompressMachineState flag on the shim snapshot request; the manifest still names the raw file and the .zst is discovered by suffix. Section B — overlap quiesce with the disk copy. After the VM is paused, standbyInstance now runs the machine-state save and the guest-disk CoW-clone concurrently under errgroup instead of serially, via an optional overlap the running-source snapshot path supplies (vz only; other hypervisors and auto-standby keep the serial path, byte-identical). The overlapped disk copy prunes the snapshots subtree the save is writing and copies it afterward, so the two legs never race; on either leg failing the source is resumed and the partial snapshot dir is cleaned, matching the serial path. forkvm's SkipRelativePaths now prunes a matched directory's whole subtree (files still drop individually). Plumbs CompressMachineState (+ optional level) through hypervisor.SnapshotOptions -> vz Client.Snapshot -> shim request -> handleSnapshot. When false (the default and the only value non-vz hypervisors ever see) behavior is byte-identical to before. Tests: machine-state participates in the compression candidate set and .zst discovery; compress/decompress round-trips a synthetic .vzm; the errgroup cancels its sibling on failure; standbyInstance resumes the source when the disk leg fails; forkvm prunes a skipped directory subtree; and a darwin/arm64 round-trip for compressMachineStateFile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wrap compressMachineStateFile errors with step context to match runGoCompression, and clarify three review-flagged comments: note the shim's compression-level clamp is trust-boundary hardening for already validated input, document why the overlap helper intentionally drops the re-resolved compression-policy error, and trim the redundant rationale on the standby disk-copy comment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drives Sections A and B on a real vz guest: a standby snapshot from a RUNNING source runs the machine-state save and disk clone concurrently and synchronously zstd-compresses machine-state.vzm, then forks and restores the snapshot to prove the compressed state decompresses and resumes with guest state intact. The existing darwin tests snapshot from standby with compression off, so they never hit the running-source overlap or the synchronous shim compression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

firetiger-agent · 2026-06-09T23:29:34Z

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: This is a documentation-only design proposal (RFC) with no code changes; it doesn't clearly indicate which service in the kernel mono repo it affects, and design docs typically don't require deploy monitoring.

To monitor this PR anyway, reply with @firetiger monitor this.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Orphan snapshot after overlap failure
- On overlap snapshot failure, standby now explicitly removes snapshot-latest when no retained base was promoted, preventing orphaned snapshot artifacts after resuming the VM.

Or push these changes by commenting:

@cursor push 0836f242df

Preview (0836f242df)

diff --git a/lib/instances/snapshot_overlap_test.go b/lib/instances/snapshot_overlap_test.go
--- a/lib/instances/snapshot_overlap_test.go
+++ b/lib/instances/snapshot_overlap_test.go
@@ -188,6 +188,8 @@
 	_, err := mgr.standbyInstance(context.Background(), instanceID, StandbyInstanceRequest{}, true, overlap)
 	require.Error(t, err)
 	assert.Equal(t, int32(1), hv.resumes.Load(), "source should be resumed when a leg fails")
+	_, statErr := os.Stat(mgr.paths.InstanceSnapshotLatest(instanceID))
+	assert.True(t, os.IsNotExist(statErr), "failed standby should clean snapshot-latest artifacts")
 
 	// The instance must remain Running (the failed standby left the VM paused and
 	// then resumed it), not transition to Standby.

diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -197,6 +197,10 @@
 			if rollbackErr := discardPromotedRetainedSnapshotTarget(snapshotDir); rollbackErr != nil {
 				log.WarnContext(ctx, "failed to discard promoted snapshot target after snapshot error", "instance_id", id, "error", rollbackErr)
 			}
+		} else {
+			if cleanupErr := os.RemoveAll(snapshotDir); cleanupErr != nil {
+				log.WarnContext(ctx, "failed to remove snapshot target after snapshot error", "instance_id", id, "error", cleanupErr)
+			}
 		}
 		return nil, fmt.Errorf("create snapshot: %w", err)
 	}

_{You can send follow-ups to the cloud agent here.}

In the running-source overlap the save leg can write a complete snapshot-latest while the concurrent disk-copy leg fails. The failure handler resumed the VM but left that snapshot-latest on disk, so the running source kept stale standby artifacts (with metadata still describing a running guest). Remove it on the overlap failure path; base reuse is off there, so there is no promoted base to preserve. Extend the fault-injection test so the mock save leg writes a snapshot-latest and assert it is cleaned up after the failure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Overlap abort races shim save
- I changed overlap snapshot orchestration to keep the save leg running after disk-copy failure and wait for it to finish before cleanup, eliminating the RemoveAll-vs-shim-write race.

Or push these changes by commenting:

@cursor push 81111d0582

Preview (81111d0582)

diff --git a/lib/instances/snapshot_overlap_test.go b/lib/instances/snapshot_overlap_test.go
--- a/lib/instances/snapshot_overlap_test.go
+++ b/lib/instances/snapshot_overlap_test.go
@@ -76,29 +76,65 @@
 	assert.Equal(t, payload, got)
 }
 
-// TestRunPausedSnapshotLegsCancelsSiblingOnFailure verifies the errgroup
-// restructure runs the two legs concurrently and cancels the surviving leg when
-// the other fails (Section B mechanism).
-func TestRunPausedSnapshotLegsCancelsSiblingOnFailure(t *testing.T) {
+// TestRunPausedSnapshotLegsWaitsForSaveOnCopyFailure verifies a copy-leg
+// failure does not cancel the save leg and the helper only returns after save
+// completion.
+func TestRunPausedSnapshotLegsWaitsForSaveOnCopyFailure(t *testing.T) {
 	t.Parallel()
 
 	diskErr := errors.New("disk copy failed")
-	saveObservedCancel := make(chan struct{})
+	saveStarted := make(chan struct{})
+	releaseSave := make(chan struct{})
 	saveLeg := func(ctx context.Context) error {
-		<-ctx.Done()
-		close(saveObservedCancel)
-		return ctx.Err()
+		close(saveStarted)
+		select {
+		case <-releaseSave:
+			return nil
+		case <-ctx.Done():
+			return ctx.Err()
+		}
 	}
 	copyDisk := func(context.Context) error {
 		return diskErr
 	}
 
+	errCh := make(chan error, 1)
+	go func() {
+		errCh <- runPausedSnapshotLegs(context.Background(), saveLeg, copyDisk)
+	}()
+	<-saveStarted
+	select {
+	case err := <-errCh:
+		require.FailNow(t, "runPausedSnapshotLegs returned before save completed", "unexpected error: %v", err)
+	default:
+	}
+	close(releaseSave)
+	err := <-errCh
+	require.ErrorIs(t, err, diskErr)
+}
+
+// TestRunPausedSnapshotLegsCancelsCopyOnSaveFailure verifies a save-leg failure
+// still cancels the copy leg to avoid waiting on unnecessary work.
+func TestRunPausedSnapshotLegsCancelsCopyOnSaveFailure(t *testing.T) {
+	t.Parallel()
+
+	saveErr := errors.New("save failed")
+	copyObservedCancel := make(chan struct{})
+	saveLeg := func(context.Context) error {
+		return saveErr
+	}
+	copyDisk := func(ctx context.Context) error {
+		<-ctx.Done()
+		close(copyObservedCancel)
+		return ctx.Err()
+	}
+
 	err := runPausedSnapshotLegs(context.Background(), saveLeg, copyDisk)
-	require.ErrorIs(t, err, diskErr)
+	require.ErrorIs(t, err, saveErr)
 	select {
-	case <-saveObservedCancel:
+	case <-copyObservedCancel:
 	case <-time.After(time.Second):
-		t.Fatal("save leg was not cancelled after disk leg failed")
+		require.FailNow(t, "copy leg was not cancelled after save leg failed")
 	}
 }
 

diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -16,7 +16,6 @@
 	"github.com/kernel/hypeman/lib/network"
 	snapshotstore "github.com/kernel/hypeman/lib/snapshot"
 	"go.opentelemetry.io/otel/attribute"
-	"golang.org/x/sync/errgroup"
 )
 
 // standbyOverlap lets a caller run a disk copy concurrently with the
@@ -193,11 +192,12 @@
 		if resumeErr := hv.Resume(ctx); resumeErr != nil {
 			log.ErrorContext(ctx, "failed to resume VM after snapshot error", "instance_id", id, "error", resumeErr)
 		}
-		// With an overlap the save leg can still complete (writing a full
-		// snapshot-latest) while the disk leg fails. We are aborting the standby and
-		// resuming the VM, so remove that orphaned snapshot-latest to keep the
-		// running source free of standby artifacts. Base reuse is off on the overlap
-		// path, so there is no promoted base to preserve here.
+		// On overlap failures we are aborting standby and resuming the VM, so remove
+		// the now-orphaned snapshot-latest to keep the running source free of
+		// standby artifacts. runPausedSnapshotLegs waits for the save leg to finish
+		// before returning, so this cleanup will not race an in-flight shim write.
+		// Base reuse is off on the overlap path, so there is no promoted base to
+		// preserve here.
 		if overlap != nil {
 			if rmErr := os.RemoveAll(snapshotDir); rmErr != nil {
 				log.WarnContext(ctx, "failed to remove orphaned snapshot after overlap failure", "instance_id", id, "error", rmErr)
@@ -325,23 +325,33 @@
 }
 
 // runPausedSnapshotLegs runs the machine-state save and, when supplied, the
-// guest-disk copy concurrently under one errgroup. The disk is quiescent while
-// the VM is paused, so the copy overlaps the (RAM-sized) save instead of waiting
-// for it. With no disk leg the group holds a single goroutine and is equivalent
-// to calling saveLeg directly. errgroup cancels the shared context on the first
-// failure, so the surviving leg observes cancellation and the first error is
-// returned.
+// guest-disk copy concurrently while the VM is paused. If the save leg fails we
+// cancel the copy leg, but a copy failure must not cancel the save because the
+// shim can still be writing snapshot-latest even after the HTTP call returns.
+// The function waits for both legs to finish before returning an error, so
+// callers can safely clean up snapshot-latest on overlap failures.
 func runPausedSnapshotLegs(ctx context.Context, saveLeg, copyDisk func(context.Context) error) error {
-	g, gctx := errgroup.WithContext(ctx)
-	g.Go(func() error {
-		return saveLeg(gctx)
-	})
-	if copyDisk != nil {
-		g.Go(func() error {
-			return copyDisk(gctx)
-		})
+	if copyDisk == nil {
+		return saveLeg(ctx)
 	}
-	return g.Wait()
+
+	copyCtx, cancelCopy := context.WithCancel(ctx)
+	defer cancelCopy()
+
+	copyErrCh := make(chan error, 1)
+	go func() {
+		copyErrCh <- copyDisk(copyCtx)
+	}()
+
+	saveErr := saveLeg(ctx)
+	if saveErr != nil {
+		cancelCopy()
+	}
+	copyErr := <-copyErrCh
+	if saveErr != nil {
+		return saveErr
+	}
+	return copyErr
 }
 
 // createSnapshot creates a snapshot using the hypervisor interface

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit bcaeb8b. Configure here.}

errgroup.WithContext cancelled the save leg when the disk-copy leg failed, so the vz HTTP snapshot call returned while the shim was still writing snapshot-latest — letting the failure handler resume the VM and remove that directory over an in-flight save. The disk leg ignores cancellation anyway, so run both legs to completion on the caller's context and handle the result only once neither leg is in flight. Rewrite the leg test to assert the save runs to completion (is not cancelled) when the disk leg fails. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rgarcia and others added 4 commits June 9, 2026 21:36

Add design proposal: faster macOS VM standby/restore

be90a8d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rgarcia force-pushed the hypeship/spec-faster-standby-restore branch from 3e195d8 to 6a79048 Compare June 9, 2026 22:42

rgarcia force-pushed the hypeship/spec-faster-standby-restore branch from 6794f8d to 6c36054 Compare June 9, 2026 23:28

rgarcia marked this pull request as ready for review June 9, 2026 23:29

rgarcia changed the title ~~Add design proposal: faster macOS VM standby/restore~~ Faster macOS VM standby/restore: compress machine state + overlap copy Jun 9, 2026

cursor Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread lib/instances/standby.go Outdated

cursor Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread lib/instances/standby.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster macOS VM standby/restore: compress machine state + overlap copy#277

Faster macOS VM standby/restore: compress machine state + overlap copy#277
rgarcia wants to merge 7 commits into
mainfrom
hypeship/spec-faster-standby-restore

rgarcia commented Jun 7, 2026 •

edited by cursor Bot

Loading

Uh oh!

firetiger-agent Bot commented Jun 9, 2026

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rgarcia commented Jun 7, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes

Uh oh!

firetiger-agent Bot commented Jun 9, 2026

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rgarcia commented Jun 7, 2026 •

edited by cursor Bot

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading