Faster macOS VM standby/restore: compress machine state + overlap copy#277
Faster macOS VM standby/restore: compress machine state + overlap copy#277rgarcia wants to merge 7 commits into
Conversation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reframe the APFS clonefile disk-copy work from proposal to landed context now that #276 merged it. The Summary, Motivation, current-behavior, testing plan, rollout (M1 done), and risks now describe the shipped implementation — real symbol names (copyRegularFileReflink / isReflinkUnsupportedError), the CLONE_NOFOLLOW|CLONE_NOOWNERCOPY flags, and the narrow ENOTSUP/EOPNOTSUPP/EXDEV fallback set with EINVAL/ENOTDIR/EEXIST propagating — rather than the original sketch. Sections A and B remain the live proposals. Also refresh restore.go line references shifted by #279. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the two remaining sections of the faster-standby-restore RFC (Section C / APFS clonefile already shipped in #276). Section A.1 — teach the snapshot-compression subsystem about the vz machine-state file. Add machine-state.vzm to snapshotMemoryFileCandidates at both the standby-latest level and the stored-payload-nested level, so the existing async job compresses it and ensureSnapshotMemoryReady decompresses it before restore. The filename constant moves to a build-tag-free file in shimconfig (mirroring RosettaMountTag) so the host-portable compression code can name it without pulling the darwin-only package. Section A.2 — optional synchronous compression in the shim. Add a darwin/arm64 compressMachineStateFile helper (klauspost zstd; the vz Save API is path-based and cannot stream-compress, so it is a post-write pass) with a matching error stub off arm64 so the shim never references zstd there. Gated behind a new CompressMachineState flag on the shim snapshot request; the manifest still names the raw file and the .zst is discovered by suffix. Section B — overlap quiesce with the disk copy. After the VM is paused, standbyInstance now runs the machine-state save and the guest-disk CoW-clone concurrently under errgroup instead of serially, via an optional overlap the running-source snapshot path supplies (vz only; other hypervisors and auto-standby keep the serial path, byte-identical). The overlapped disk copy prunes the snapshots subtree the save is writing and copies it afterward, so the two legs never race; on either leg failing the source is resumed and the partial snapshot dir is cleaned, matching the serial path. forkvm's SkipRelativePaths now prunes a matched directory's whole subtree (files still drop individually). Plumbs CompressMachineState (+ optional level) through hypervisor.SnapshotOptions -> vz Client.Snapshot -> shim request -> handleSnapshot. When false (the default and the only value non-vz hypervisors ever see) behavior is byte-identical to before. Tests: machine-state participates in the compression candidate set and .zst discovery; compress/decompress round-trips a synthetic .vzm; the errgroup cancels its sibling on failure; standbyInstance resumes the source when the disk leg fails; forkvm prunes a skipped directory subtree; and a darwin/arm64 round-trip for compressMachineStateFile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wrap compressMachineStateFile errors with step context to match runGoCompression, and clarify three review-flagged comments: note the shim's compression-level clamp is trust-boundary hardening for already validated input, document why the overlap helper intentionally drops the re-resolved compression-policy error, and trim the redundant rationale on the standby disk-copy comment. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3e195d8 to
6a79048
Compare
Drives Sections A and B on a real vz guest: a standby snapshot from a RUNNING source runs the machine-state save and disk clone concurrently and synchronously zstd-compresses machine-state.vzm, then forks and restores the snapshot to prove the compressed state decompresses and resumes with guest state intact. The existing darwin tests snapshot from standby with compression off, so they never hit the running-source overlap or the synchronous shim compression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6794f8d to
6c36054
Compare
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: This is a documentation-only design proposal (RFC) with no code changes; it doesn't clearly indicate which service in the kernel mono repo it affects, and design docs typically don't require deploy monitoring. To monitor this PR anyway, reply with |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Orphan snapshot after overlap failure
- On overlap snapshot failure, standby now explicitly removes
snapshot-latestwhen no retained base was promoted, preventing orphaned snapshot artifacts after resuming the VM.
- On overlap snapshot failure, standby now explicitly removes
Or push these changes by commenting:
@cursor push 0836f242df
Preview (0836f242df)
diff --git a/lib/instances/snapshot_overlap_test.go b/lib/instances/snapshot_overlap_test.go
--- a/lib/instances/snapshot_overlap_test.go
+++ b/lib/instances/snapshot_overlap_test.go
@@ -188,6 +188,8 @@
_, err := mgr.standbyInstance(context.Background(), instanceID, StandbyInstanceRequest{}, true, overlap)
require.Error(t, err)
assert.Equal(t, int32(1), hv.resumes.Load(), "source should be resumed when a leg fails")
+ _, statErr := os.Stat(mgr.paths.InstanceSnapshotLatest(instanceID))
+ assert.True(t, os.IsNotExist(statErr), "failed standby should clean snapshot-latest artifacts")
// The instance must remain Running (the failed standby left the VM paused and
// then resumed it), not transition to Standby.
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -197,6 +197,10 @@
if rollbackErr := discardPromotedRetainedSnapshotTarget(snapshotDir); rollbackErr != nil {
log.WarnContext(ctx, "failed to discard promoted snapshot target after snapshot error", "instance_id", id, "error", rollbackErr)
}
+ } else {
+ if cleanupErr := os.RemoveAll(snapshotDir); cleanupErr != nil {
+ log.WarnContext(ctx, "failed to remove snapshot target after snapshot error", "instance_id", id, "error", cleanupErr)
+ }
}
return nil, fmt.Errorf("create snapshot: %w", err)
}You can send follow-ups to the cloud agent here.
In the running-source overlap the save leg can write a complete snapshot-latest while the concurrent disk-copy leg fails. The failure handler resumed the VM but left that snapshot-latest on disk, so the running source kept stale standby artifacts (with metadata still describing a running guest). Remove it on the overlap failure path; base reuse is off there, so there is no promoted base to preserve. Extend the fault-injection test so the mock save leg writes a snapshot-latest and assert it is cleaned up after the failure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Overlap abort races shim save
- I changed overlap snapshot orchestration to keep the save leg running after disk-copy failure and wait for it to finish before cleanup, eliminating the RemoveAll-vs-shim-write race.
Or push these changes by commenting:
@cursor push 81111d0582
Preview (81111d0582)
diff --git a/lib/instances/snapshot_overlap_test.go b/lib/instances/snapshot_overlap_test.go
--- a/lib/instances/snapshot_overlap_test.go
+++ b/lib/instances/snapshot_overlap_test.go
@@ -76,29 +76,65 @@
assert.Equal(t, payload, got)
}
-// TestRunPausedSnapshotLegsCancelsSiblingOnFailure verifies the errgroup
-// restructure runs the two legs concurrently and cancels the surviving leg when
-// the other fails (Section B mechanism).
-func TestRunPausedSnapshotLegsCancelsSiblingOnFailure(t *testing.T) {
+// TestRunPausedSnapshotLegsWaitsForSaveOnCopyFailure verifies a copy-leg
+// failure does not cancel the save leg and the helper only returns after save
+// completion.
+func TestRunPausedSnapshotLegsWaitsForSaveOnCopyFailure(t *testing.T) {
t.Parallel()
diskErr := errors.New("disk copy failed")
- saveObservedCancel := make(chan struct{})
+ saveStarted := make(chan struct{})
+ releaseSave := make(chan struct{})
saveLeg := func(ctx context.Context) error {
- <-ctx.Done()
- close(saveObservedCancel)
- return ctx.Err()
+ close(saveStarted)
+ select {
+ case <-releaseSave:
+ return nil
+ case <-ctx.Done():
+ return ctx.Err()
+ }
}
copyDisk := func(context.Context) error {
return diskErr
}
+ errCh := make(chan error, 1)
+ go func() {
+ errCh <- runPausedSnapshotLegs(context.Background(), saveLeg, copyDisk)
+ }()
+ <-saveStarted
+ select {
+ case err := <-errCh:
+ require.FailNow(t, "runPausedSnapshotLegs returned before save completed", "unexpected error: %v", err)
+ default:
+ }
+ close(releaseSave)
+ err := <-errCh
+ require.ErrorIs(t, err, diskErr)
+}
+
+// TestRunPausedSnapshotLegsCancelsCopyOnSaveFailure verifies a save-leg failure
+// still cancels the copy leg to avoid waiting on unnecessary work.
+func TestRunPausedSnapshotLegsCancelsCopyOnSaveFailure(t *testing.T) {
+ t.Parallel()
+
+ saveErr := errors.New("save failed")
+ copyObservedCancel := make(chan struct{})
+ saveLeg := func(context.Context) error {
+ return saveErr
+ }
+ copyDisk := func(ctx context.Context) error {
+ <-ctx.Done()
+ close(copyObservedCancel)
+ return ctx.Err()
+ }
+
err := runPausedSnapshotLegs(context.Background(), saveLeg, copyDisk)
- require.ErrorIs(t, err, diskErr)
+ require.ErrorIs(t, err, saveErr)
select {
- case <-saveObservedCancel:
+ case <-copyObservedCancel:
case <-time.After(time.Second):
- t.Fatal("save leg was not cancelled after disk leg failed")
+ require.FailNow(t, "copy leg was not cancelled after save leg failed")
}
}
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -16,7 +16,6 @@
"github.com/kernel/hypeman/lib/network"
snapshotstore "github.com/kernel/hypeman/lib/snapshot"
"go.opentelemetry.io/otel/attribute"
- "golang.org/x/sync/errgroup"
)
// standbyOverlap lets a caller run a disk copy concurrently with the
@@ -193,11 +192,12 @@
if resumeErr := hv.Resume(ctx); resumeErr != nil {
log.ErrorContext(ctx, "failed to resume VM after snapshot error", "instance_id", id, "error", resumeErr)
}
- // With an overlap the save leg can still complete (writing a full
- // snapshot-latest) while the disk leg fails. We are aborting the standby and
- // resuming the VM, so remove that orphaned snapshot-latest to keep the
- // running source free of standby artifacts. Base reuse is off on the overlap
- // path, so there is no promoted base to preserve here.
+ // On overlap failures we are aborting standby and resuming the VM, so remove
+ // the now-orphaned snapshot-latest to keep the running source free of
+ // standby artifacts. runPausedSnapshotLegs waits for the save leg to finish
+ // before returning, so this cleanup will not race an in-flight shim write.
+ // Base reuse is off on the overlap path, so there is no promoted base to
+ // preserve here.
if overlap != nil {
if rmErr := os.RemoveAll(snapshotDir); rmErr != nil {
log.WarnContext(ctx, "failed to remove orphaned snapshot after overlap failure", "instance_id", id, "error", rmErr)
@@ -325,23 +325,33 @@
}
// runPausedSnapshotLegs runs the machine-state save and, when supplied, the
-// guest-disk copy concurrently under one errgroup. The disk is quiescent while
-// the VM is paused, so the copy overlaps the (RAM-sized) save instead of waiting
-// for it. With no disk leg the group holds a single goroutine and is equivalent
-// to calling saveLeg directly. errgroup cancels the shared context on the first
-// failure, so the surviving leg observes cancellation and the first error is
-// returned.
+// guest-disk copy concurrently while the VM is paused. If the save leg fails we
+// cancel the copy leg, but a copy failure must not cancel the save because the
+// shim can still be writing snapshot-latest even after the HTTP call returns.
+// The function waits for both legs to finish before returning an error, so
+// callers can safely clean up snapshot-latest on overlap failures.
func runPausedSnapshotLegs(ctx context.Context, saveLeg, copyDisk func(context.Context) error) error {
- g, gctx := errgroup.WithContext(ctx)
- g.Go(func() error {
- return saveLeg(gctx)
- })
- if copyDisk != nil {
- g.Go(func() error {
- return copyDisk(gctx)
- })
+ if copyDisk == nil {
+ return saveLeg(ctx)
}
- return g.Wait()
+
+ copyCtx, cancelCopy := context.WithCancel(ctx)
+ defer cancelCopy()
+
+ copyErrCh := make(chan error, 1)
+ go func() {
+ copyErrCh <- copyDisk(copyCtx)
+ }()
+
+ saveErr := saveLeg(ctx)
+ if saveErr != nil {
+ cancelCopy()
+ }
+ copyErr := <-copyErrCh
+ if saveErr != nil {
+ return saveErr
+ }
+ return copyErr
}
// createSnapshot creates a snapshot using the hypervisor interfaceYou can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit bcaeb8b. Configure here.
errgroup.WithContext cancelled the save leg when the disk-copy leg failed, so the vz HTTP snapshot call returned while the shim was still writing snapshot-latest — letting the failure handler resume the VM and remove that directory over an in-flight save. The disk leg ignores cancellation anyway, so run both legs to completion on the caller's context and handle the result only once neither leg is in flight. Rewrite the leg test to assert the save runs to completion (is not cancelled) when the disk leg fails. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


Summary
Implements the remaining work from the Faster macOS VM standby/restore RFC (
docs/proposals/faster-standby-restore.md) forvz(Apple Virtualization.framework) guests on Apple Silicon. Section C (APFSclonefile) landed in #276; this PR adds Sections A and B:machine-state.vzm: the vz machine state is added to the snapshot compression candidate set, so the existing async compression job compresses it and the existing restore-prep decompresses it — no new lifecycle code.machine-state.vzminside the shim when requested, gated behind a new optionalCompressMachineStatefield on the shim snapshot request. The vz Save/Restore API is path-based and cannot be byte-streamed, so compression is a post-write pass.errgroupinside the paused window, instead of serially.The new options default to off, so every non-vz hypervisor and the auto-standby path are byte-identical to before.
Notes
snapshots/snapshot-latest/— is handled by pruning that subtree from the overlapped copy and copying it in after both legs finish (equivalent to the serial copy for vz, which has no snapshot-base).TestVZRunningSnapshotCompressionOverlapsnapshots a running vz guest with compression then forks + restores it — verified passing on the macOS 15 runner.Note
Medium Risk
Changes standby/snapshot orchestration and vz shim snapshot behavior on a narrow path (running-source vz snapshots with overlap); failure handling resumes the VM and removes orphaned snapshot dirs, but incorrect overlap could still affect snapshot correctness or pause duration.
Overview
Speeds up vz standby/restore on Apple Silicon by shrinking saved RAM state and overlapping work while the guest is paused.
Compression:
machine-state.vzmis now part of the existing snapshot compression candidate list, so async jobs and restore prep (ensureSnapshotMemoryReady) handle.zstlike other memory artifacts. The vz shim can optionally zstd-compress the state file right afterSaveMachineStateToPathvia newCompressMachineState/CompressionLevelonhypervisor.SnapshotOptionsand the shim snapshot API (defaults off; non-vz hypervisors unchanged).Overlap (running-source standby snapshots only): For vz, the control plane runs guest-disk clone and machine-state save in parallel (
runPausedSnapshotLegs+standbyOverlap) instead of serially. The overlapped copy skips the wholesnapshots/tree to avoid racing the shim, thencopyStandbySnapshotStateIntoPayloadcopiessnapshot-latestafter both legs finish. With zstd compression enabled on the snapshot, overlap also turns on synchronous shim compression so the payload is already.zstwhen copied.Supporting changes:
forkvmskip paths now prune skipped directories; snapshot path constants move to build-tag-freeshimconfig/paths.go; RFC doc added. Unit and darwin integration tests cover round-trips, errgroup semantics, and overlap failure cleanup.Reviewed by Cursor Bugbot for commit d2319d4. Bugbot is set up for automated code reviews on this repo. Configure here.