Skip to content

Faster macOS VM standby/restore: compress machine state + overlap copy#277

Open
rgarcia wants to merge 7 commits into
mainfrom
hypeship/spec-faster-standby-restore
Open

Faster macOS VM standby/restore: compress machine state + overlap copy#277
rgarcia wants to merge 7 commits into
mainfrom
hypeship/spec-faster-standby-restore

Conversation

@rgarcia

@rgarcia rgarcia commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements the remaining work from the Faster macOS VM standby/restore RFC (docs/proposals/faster-standby-restore.md) for vz (Apple Virtualization.framework) guests on Apple Silicon. Section C (APFS clonefile) landed in #276; this PR adds Sections A and B:

  • A.1 — compress machine-state.vzm: the vz machine state is added to the snapshot compression candidate set, so the existing async compression job compresses it and the existing restore-prep decompresses it — no new lifecycle code.
  • A.2 — synchronous shim compression: a darwin/arm64 helper zstd-compresses machine-state.vzm inside the shim when requested, gated behind a new optional CompressMachineState field on the shim snapshot request. The vz Save/Restore API is path-based and cannot be byte-streamed, so compression is a post-write pass.
  • B — overlap quiesce with the disk copy: for a standby snapshot taken from a running vz source, the machine-state save and the guest-disk CoW clone now run concurrently under one errgroup inside the paused window, instead of serially.

The new options default to off, so every non-vz hypervisor and the auto-standby path are byte-identical to before.

Notes

  • A race the original Section B sketch missed — the disk copy racing the save over snapshots/snapshot-latest/ — is handled by pruning that subtree from the overlapped copy and copying it in after both legs finish (equivalent to the serial copy for vz, which has no snapshot-base).
  • Tests: host-portable unit tests (candidate set, compress/decompress round-trip, errgroup cancel/single-leg, overlap fault-injection) run on Linux; a darwin/arm64 file-level zstd round-trip runs via the signed shim; and an end-to-end TestVZRunningSnapshotCompressionOverlap snapshots a running vz guest with compression then forks + restores it — verified passing on the macOS 15 runner.
  • The design doc is included for context, with Section C marked as landed (Add APFS clonefile fast path for macOS VM disk forks #276).

Note

Medium Risk
Changes standby/snapshot orchestration and vz shim snapshot behavior on a narrow path (running-source vz snapshots with overlap); failure handling resumes the VM and removes orphaned snapshot dirs, but incorrect overlap could still affect snapshot correctness or pause duration.

Overview
Speeds up vz standby/restore on Apple Silicon by shrinking saved RAM state and overlapping work while the guest is paused.

Compression: machine-state.vzm is now part of the existing snapshot compression candidate list, so async jobs and restore prep (ensureSnapshotMemoryReady) handle .zst like other memory artifacts. The vz shim can optionally zstd-compress the state file right after SaveMachineStateToPath via new CompressMachineState / CompressionLevel on hypervisor.SnapshotOptions and the shim snapshot API (defaults off; non-vz hypervisors unchanged).

Overlap (running-source standby snapshots only): For vz, the control plane runs guest-disk clone and machine-state save in parallel (runPausedSnapshotLegs + standbyOverlap) instead of serially. The overlapped copy skips the whole snapshots/ tree to avoid racing the shim, then copyStandbySnapshotStateIntoPayload copies snapshot-latest after both legs finish. With zstd compression enabled on the snapshot, overlap also turns on synchronous shim compression so the payload is already .zst when copied.

Supporting changes: forkvm skip paths now prune skipped directories; snapshot path constants move to build-tag-free shimconfig/paths.go; RFC doc added. Unit and darwin integration tests cover round-trips, errgroup semantics, and overlap failure cleanup.

Reviewed by Cursor Bugbot for commit d2319d4. Bugbot is set up for automated code reviews on this repo. Configure here.

rgarcia and others added 4 commits June 9, 2026 21:36
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reframe the APFS clonefile disk-copy work from proposal to landed
context now that #276 merged it. The Summary, Motivation,
current-behavior, testing plan, rollout (M1 done), and risks now
describe the shipped implementation — real symbol names
(copyRegularFileReflink / isReflinkUnsupportedError), the
CLONE_NOFOLLOW|CLONE_NOOWNERCOPY flags, and the narrow
ENOTSUP/EOPNOTSUPP/EXDEV fallback set with EINVAL/ENOTDIR/EEXIST
propagating — rather than the original sketch. Sections A and B
remain the live proposals. Also refresh restore.go line references
shifted by #279.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the two remaining sections of the faster-standby-restore RFC
(Section C / APFS clonefile already shipped in #276).

Section A.1 — teach the snapshot-compression subsystem about the vz
machine-state file. Add machine-state.vzm to snapshotMemoryFileCandidates
at both the standby-latest level and the stored-payload-nested level, so
the existing async job compresses it and ensureSnapshotMemoryReady
decompresses it before restore. The filename constant moves to a
build-tag-free file in shimconfig (mirroring RosettaMountTag) so the
host-portable compression code can name it without pulling the darwin-only
package.

Section A.2 — optional synchronous compression in the shim. Add a
darwin/arm64 compressMachineStateFile helper (klauspost zstd; the vz
Save API is path-based and cannot stream-compress, so it is a post-write
pass) with a matching error stub off arm64 so the shim never references
zstd there. Gated behind a new CompressMachineState flag on the shim
snapshot request; the manifest still names the raw file and the .zst is
discovered by suffix.

Section B — overlap quiesce with the disk copy. After the VM is paused,
standbyInstance now runs the machine-state save and the guest-disk
CoW-clone concurrently under errgroup instead of serially, via an optional
overlap the running-source snapshot path supplies (vz only; other
hypervisors and auto-standby keep the serial path, byte-identical). The
overlapped disk copy prunes the snapshots subtree the save is writing and
copies it afterward, so the two legs never race; on either leg failing the
source is resumed and the partial snapshot dir is cleaned, matching the
serial path. forkvm's SkipRelativePaths now prunes a matched directory's
whole subtree (files still drop individually).

Plumbs CompressMachineState (+ optional level) through
hypervisor.SnapshotOptions -> vz Client.Snapshot -> shim request ->
handleSnapshot. When false (the default and the only value non-vz
hypervisors ever see) behavior is byte-identical to before.

Tests: machine-state participates in the compression candidate set and
.zst discovery; compress/decompress round-trips a synthetic .vzm; the
errgroup cancels its sibling on failure; standbyInstance resumes the
source when the disk leg fails; forkvm prunes a skipped directory subtree;
and a darwin/arm64 round-trip for compressMachineStateFile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wrap compressMachineStateFile errors with step context to match
runGoCompression, and clarify three review-flagged comments: note the
shim's compression-level clamp is trust-boundary hardening for already
validated input, document why the overlap helper intentionally drops the
re-resolved compression-policy error, and trim the redundant rationale on
the standby disk-copy comment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rgarcia rgarcia force-pushed the hypeship/spec-faster-standby-restore branch from 3e195d8 to 6a79048 Compare June 9, 2026 22:42
Drives Sections A and B on a real vz guest: a standby snapshot from a
RUNNING source runs the machine-state save and disk clone concurrently
and synchronously zstd-compresses machine-state.vzm, then forks and
restores the snapshot to prove the compressed state decompresses and
resumes with guest state intact. The existing darwin tests snapshot from
standby with compression off, so they never hit the running-source
overlap or the synchronous shim compression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rgarcia rgarcia force-pushed the hypeship/spec-faster-standby-restore branch from 6794f8d to 6c36054 Compare June 9, 2026 23:28
@rgarcia rgarcia marked this pull request as ready for review June 9, 2026 23:29
@rgarcia rgarcia changed the title Add design proposal: faster macOS VM standby/restore Faster macOS VM standby/restore: compress machine state + overlap copy Jun 9, 2026
@firetiger-agent

Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: This is a documentation-only design proposal (RFC) with no code changes; it doesn't clearly indicate which service in the kernel mono repo it affects, and design docs typically don't require deploy monitoring.

To monitor this PR anyway, reply with @firetiger monitor this.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Orphan snapshot after overlap failure
    • On overlap snapshot failure, standby now explicitly removes snapshot-latest when no retained base was promoted, preventing orphaned snapshot artifacts after resuming the VM.

Create PR

Or push these changes by commenting:

@cursor push 0836f242df
Preview (0836f242df)
diff --git a/lib/instances/snapshot_overlap_test.go b/lib/instances/snapshot_overlap_test.go
--- a/lib/instances/snapshot_overlap_test.go
+++ b/lib/instances/snapshot_overlap_test.go
@@ -188,6 +188,8 @@
 	_, err := mgr.standbyInstance(context.Background(), instanceID, StandbyInstanceRequest{}, true, overlap)
 	require.Error(t, err)
 	assert.Equal(t, int32(1), hv.resumes.Load(), "source should be resumed when a leg fails")
+	_, statErr := os.Stat(mgr.paths.InstanceSnapshotLatest(instanceID))
+	assert.True(t, os.IsNotExist(statErr), "failed standby should clean snapshot-latest artifacts")
 
 	// The instance must remain Running (the failed standby left the VM paused and
 	// then resumed it), not transition to Standby.

diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -197,6 +197,10 @@
 			if rollbackErr := discardPromotedRetainedSnapshotTarget(snapshotDir); rollbackErr != nil {
 				log.WarnContext(ctx, "failed to discard promoted snapshot target after snapshot error", "instance_id", id, "error", rollbackErr)
 			}
+		} else {
+			if cleanupErr := os.RemoveAll(snapshotDir); cleanupErr != nil {
+				log.WarnContext(ctx, "failed to remove snapshot target after snapshot error", "instance_id", id, "error", cleanupErr)
+			}
 		}
 		return nil, fmt.Errorf("create snapshot: %w", err)
 	}

You can send follow-ups to the cloud agent here.

Comment thread lib/instances/standby.go Outdated
In the running-source overlap the save leg can write a complete
snapshot-latest while the concurrent disk-copy leg fails. The failure
handler resumed the VM but left that snapshot-latest on disk, so the
running source kept stale standby artifacts (with metadata still
describing a running guest). Remove it on the overlap failure path; base
reuse is off there, so there is no promoted base to preserve.

Extend the fault-injection test so the mock save leg writes a
snapshot-latest and assert it is cleaned up after the failure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Overlap abort races shim save
    • I changed overlap snapshot orchestration to keep the save leg running after disk-copy failure and wait for it to finish before cleanup, eliminating the RemoveAll-vs-shim-write race.

Create PR

Or push these changes by commenting:

@cursor push 81111d0582
Preview (81111d0582)
diff --git a/lib/instances/snapshot_overlap_test.go b/lib/instances/snapshot_overlap_test.go
--- a/lib/instances/snapshot_overlap_test.go
+++ b/lib/instances/snapshot_overlap_test.go
@@ -76,29 +76,65 @@
 	assert.Equal(t, payload, got)
 }
 
-// TestRunPausedSnapshotLegsCancelsSiblingOnFailure verifies the errgroup
-// restructure runs the two legs concurrently and cancels the surviving leg when
-// the other fails (Section B mechanism).
-func TestRunPausedSnapshotLegsCancelsSiblingOnFailure(t *testing.T) {
+// TestRunPausedSnapshotLegsWaitsForSaveOnCopyFailure verifies a copy-leg
+// failure does not cancel the save leg and the helper only returns after save
+// completion.
+func TestRunPausedSnapshotLegsWaitsForSaveOnCopyFailure(t *testing.T) {
 	t.Parallel()
 
 	diskErr := errors.New("disk copy failed")
-	saveObservedCancel := make(chan struct{})
+	saveStarted := make(chan struct{})
+	releaseSave := make(chan struct{})
 	saveLeg := func(ctx context.Context) error {
-		<-ctx.Done()
-		close(saveObservedCancel)
-		return ctx.Err()
+		close(saveStarted)
+		select {
+		case <-releaseSave:
+			return nil
+		case <-ctx.Done():
+			return ctx.Err()
+		}
 	}
 	copyDisk := func(context.Context) error {
 		return diskErr
 	}
 
+	errCh := make(chan error, 1)
+	go func() {
+		errCh <- runPausedSnapshotLegs(context.Background(), saveLeg, copyDisk)
+	}()
+	<-saveStarted
+	select {
+	case err := <-errCh:
+		require.FailNow(t, "runPausedSnapshotLegs returned before save completed", "unexpected error: %v", err)
+	default:
+	}
+	close(releaseSave)
+	err := <-errCh
+	require.ErrorIs(t, err, diskErr)
+}
+
+// TestRunPausedSnapshotLegsCancelsCopyOnSaveFailure verifies a save-leg failure
+// still cancels the copy leg to avoid waiting on unnecessary work.
+func TestRunPausedSnapshotLegsCancelsCopyOnSaveFailure(t *testing.T) {
+	t.Parallel()
+
+	saveErr := errors.New("save failed")
+	copyObservedCancel := make(chan struct{})
+	saveLeg := func(context.Context) error {
+		return saveErr
+	}
+	copyDisk := func(ctx context.Context) error {
+		<-ctx.Done()
+		close(copyObservedCancel)
+		return ctx.Err()
+	}
+
 	err := runPausedSnapshotLegs(context.Background(), saveLeg, copyDisk)
-	require.ErrorIs(t, err, diskErr)
+	require.ErrorIs(t, err, saveErr)
 	select {
-	case <-saveObservedCancel:
+	case <-copyObservedCancel:
 	case <-time.After(time.Second):
-		t.Fatal("save leg was not cancelled after disk leg failed")
+		require.FailNow(t, "copy leg was not cancelled after save leg failed")
 	}
 }
 

diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -16,7 +16,6 @@
 	"github.com/kernel/hypeman/lib/network"
 	snapshotstore "github.com/kernel/hypeman/lib/snapshot"
 	"go.opentelemetry.io/otel/attribute"
-	"golang.org/x/sync/errgroup"
 )
 
 // standbyOverlap lets a caller run a disk copy concurrently with the
@@ -193,11 +192,12 @@
 		if resumeErr := hv.Resume(ctx); resumeErr != nil {
 			log.ErrorContext(ctx, "failed to resume VM after snapshot error", "instance_id", id, "error", resumeErr)
 		}
-		// With an overlap the save leg can still complete (writing a full
-		// snapshot-latest) while the disk leg fails. We are aborting the standby and
-		// resuming the VM, so remove that orphaned snapshot-latest to keep the
-		// running source free of standby artifacts. Base reuse is off on the overlap
-		// path, so there is no promoted base to preserve here.
+		// On overlap failures we are aborting standby and resuming the VM, so remove
+		// the now-orphaned snapshot-latest to keep the running source free of
+		// standby artifacts. runPausedSnapshotLegs waits for the save leg to finish
+		// before returning, so this cleanup will not race an in-flight shim write.
+		// Base reuse is off on the overlap path, so there is no promoted base to
+		// preserve here.
 		if overlap != nil {
 			if rmErr := os.RemoveAll(snapshotDir); rmErr != nil {
 				log.WarnContext(ctx, "failed to remove orphaned snapshot after overlap failure", "instance_id", id, "error", rmErr)
@@ -325,23 +325,33 @@
 }
 
 // runPausedSnapshotLegs runs the machine-state save and, when supplied, the
-// guest-disk copy concurrently under one errgroup. The disk is quiescent while
-// the VM is paused, so the copy overlaps the (RAM-sized) save instead of waiting
-// for it. With no disk leg the group holds a single goroutine and is equivalent
-// to calling saveLeg directly. errgroup cancels the shared context on the first
-// failure, so the surviving leg observes cancellation and the first error is
-// returned.
+// guest-disk copy concurrently while the VM is paused. If the save leg fails we
+// cancel the copy leg, but a copy failure must not cancel the save because the
+// shim can still be writing snapshot-latest even after the HTTP call returns.
+// The function waits for both legs to finish before returning an error, so
+// callers can safely clean up snapshot-latest on overlap failures.
 func runPausedSnapshotLegs(ctx context.Context, saveLeg, copyDisk func(context.Context) error) error {
-	g, gctx := errgroup.WithContext(ctx)
-	g.Go(func() error {
-		return saveLeg(gctx)
-	})
-	if copyDisk != nil {
-		g.Go(func() error {
-			return copyDisk(gctx)
-		})
+	if copyDisk == nil {
+		return saveLeg(ctx)
 	}
-	return g.Wait()
+
+	copyCtx, cancelCopy := context.WithCancel(ctx)
+	defer cancelCopy()
+
+	copyErrCh := make(chan error, 1)
+	go func() {
+		copyErrCh <- copyDisk(copyCtx)
+	}()
+
+	saveErr := saveLeg(ctx)
+	if saveErr != nil {
+		cancelCopy()
+	}
+	copyErr := <-copyErrCh
+	if saveErr != nil {
+		return saveErr
+	}
+	return copyErr
 }
 
 // createSnapshot creates a snapshot using the hypervisor interface

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit bcaeb8b. Configure here.

Comment thread lib/instances/standby.go
errgroup.WithContext cancelled the save leg when the disk-copy leg
failed, so the vz HTTP snapshot call returned while the shim was still
writing snapshot-latest — letting the failure handler resume the VM and
remove that directory over an in-flight save. The disk leg ignores
cancellation anyway, so run both legs to completion on the caller's
context and handle the result only once neither leg is in flight.

Rewrite the leg test to assert the save runs to completion (is not
cancelled) when the disk leg fails.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant