Add live memory ceiling for macOS (vz) VMs #278

Open
rgarcia wants to merge 16 commits into main from hypeship/spec-memory-hotplug-resize

Conversation

@rgarcia rgarcia commented Jun 7, 2026

Contributor

Summary

Implements live memory resize for macOS (vz) VMs via a boot ceiling. A vz VM can be created with a memory ceiling above its baseline size; the shim boots the VM at the ceiling and immediately balloons it down to the baseline. The existing host-pressure controller then moves the balloon target within [floor, ceiling], growing toward the ceiling and shrinking under host pressure, with no reboot. This is vz only: Cloud Hypervisor and Firecracker have real hotplug, and a ceiling on those backends is rejected at create time.

The committed design doc (docs/proposals/memory-hotplug-resize.md) is the contract; this PR implements milestones 1–2 plus the grow-on-demand plumbing.

What's included

  • API: memory_ceiling (human-readable string, mirroring size/hotplug_size) on the create-instance request and the Instance response (openapi.yaml + regenerated lib/oapi/oapi.go).
  • Threading: MemoryCeilingBytes flows CreateInstanceRequest → stored instance → hypervisor.VMConfig → buildShimConfigFromVMConfig → ShimConfig, the same path the merged Platform/EnableRosetta fields take, and persists through the snapshot manifest (no restore change).
  • Shim: boot-at-ceiling, balloon mandatory when a ceiling is active, balloon-to-baseline applied after the VM is Running (cold boot only).
  • Controller: the ceiling becomes the balloon's upper bound; the protected floor is anchored on the baseline (so a ceiling VM idling at baseline never squeezes its co-tenants), and the proportional reclaim split is computed in 128 bits to stay overflow-safe at large ceilings.
  • Validation: reject a ceiling ≤ size, reject a ceiling on a non-vz hypervisor, and reject a ceiling above the host RAM maximum (fail fast in the shim rather than silently booting smaller).
  • Capability: SupportsLiveMemoryCeiling (derived, internal).
  • Tests: Linux unit tests (grow policy, Normalize clamps, validation, providers ceiling clamp, floor anchor, overflow split) + a darwin integration test (TestVZMemoryCeiling) that boots at the ceiling, asserts the guest sees ~ceiling MemTotal, balloons to baseline at low host RSS, and grows live to the ceiling without a reboot.
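The create-time validation rules listed above can be sketched as a single function. This is a minimal illustration, not the PR's code: `validateMemoryCeiling`, `errInvalidCeiling`, the string hypervisor type, and the "0 means no limit" convention for `perInstanceLimitBytes` are all assumptions made for the example. The host-RAM bound is not shown because the description says it is enforced in the shim.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidCeiling is a hypothetical sentinel error for this sketch.
var errInvalidCeiling = errors.New("invalid memory ceiling")

// validateMemoryCeiling is an illustrative stand-in for the create-time checks.
// The hvType string and perInstanceLimitBytes parameter are assumptions; a
// limit of 0 means "no per-instance limit" in this sketch.
func validateMemoryCeiling(hvType string, sizeBytes, ceilingBytes, perInstanceLimitBytes int64) error {
	if ceilingBytes == 0 {
		return nil // no ceiling: behave exactly as before
	}
	if hvType != "vz" {
		return fmt.Errorf("%w: only supported on vz, got %q", errInvalidCeiling, hvType)
	}
	if ceilingBytes <= sizeBytes {
		return fmt.Errorf("%w: ceiling %d must exceed size %d", errInvalidCeiling, ceilingBytes, sizeBytes)
	}
	if perInstanceLimitBytes > 0 && ceilingBytes > perInstanceLimitBytes {
		return fmt.Errorf("%w: ceiling %d exceeds per-instance limit %d", errInvalidCeiling, ceilingBytes, perInstanceLimitBytes)
	}
	return nil
}

func main() {
	const gib = int64(1) << 30
	fmt.Println(validateMemoryCeiling("vz", gib, 0, 8*gib))              // no ceiling set
	fmt.Println(validateMemoryCeiling("vz", gib, 4*gib, 8*gib))          // valid ceiling
	fmt.Println(validateMemoryCeiling("vz", gib, gib, 8*gib))            // ceiling == size
	fmt.Println(validateMemoryCeiling("firecracker", gib, 4*gib, 8*gib)) // non-vz backend
}
```

Checking the zero case first keeps the "no ceiling set" path free of any new failure mode, which matches the stated goal that behavior is unchanged when the field is absent.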

Deferred (out of scope here)

  • Automatic grow-on-demand is plumbed and unit-tested but inert: it needs a per-guest memory-demand signal that does not exist yet (RFC milestone 4). The config knob is documented as not-yet-active; live grow works today via the balloon API.
  • A restore-path balloon re-apply (if vz does not persist the balloon target across save/restore) is left pending verification on a darwin runner.

Validation

  • Reviewed across API/data-model, concurrency/runtime-correctness, vz/darwin-platform, and scope/quality dimensions; all high/medium findings fixed (ceiling-VM reclaim accounting, int64 overflow in the reclaim split, vz-only gating, reject-ceiling-above-host-RAM).
  • Full CI green on the restored suite (test, test-darwin, e2e-install). TestVZMemoryCeiling was verified to run and pass on the real macOS arm64 runner (a temporary CI step, since reverted).

With no ceiling set, behavior is byte-identical to today.


Note

Medium Risk
Touches instance creation, vz boot/shim paths, and host memory pressure accounting; mis-accounting could affect multi-tenant reclaim, though defaults preserve prior behavior when no ceiling is set.

Overview
Adds live memory elasticity for macOS (vz) VMs via an optional memory_ceiling on create (and on the Instance API): when ceiling > baseline size, the guest can use RAM between baseline and ceiling without reboot, using boot-at-ceiling plus the virtio balloon instead of true hotplug.

vz-shim boots at the ceiling when set, requires the balloon, rejects ceilings above the host RAM max, and on cold start balloons the guest down to baseline after vm.Start. Create/validation threads MemoryCeilingBytes through instance metadata and VMConfig, enforces vz-only and ceiling > size, and caps ceiling against per-instance memory limits.

Active ballooning now distinguishes BaselineMemoryBytes from the ceiling, which is reported as AssignedMemoryBytes. Reclaim is measured from the baseline anchor (so idle ceiling VMs don’t steal reclaim from neighbors), and on a healthy host the controller holds each guest at baseline or preserves a manual/API grow. The change also adds optional grow_on_demand_* config (still inert until a guest utilization signal exists) and a 128-bit proportional reclaim split to avoid overflow. SupportsLiveMemoryCeiling is exposed on the vz client after start/restore.

Docs include an RFC (docs/proposals/memory-hotplug-resize.md), darwin example config, and unit plus darwin integration tests for ceiling boot, baseline settle, and live balloon grow.

Reviewed by Cursor Bugbot for commit 74504d3.

rgarcia and others added 2 commits June 9, 2026 19:57
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update file:line references that drifted when the clonefile and
rosetta/multi-platform changes merged to main (createVM gained a device
block, shifting computeMemorySize and the balloon block by +4; create.go
balloon-policy wiring moved into guestMemoryConfig; HotplugSize moved).
Correct the integration-test path to lib/instances/, and note that the
proposed MemoryCeilingBytes threading and derived-capability flag now have
a merged precedent (Platform / derived EnableRosetta take the identical
request -> VMConfig -> buildShimConfigFromVMConfig -> ShimConfig path).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rgarcia rgarcia force-pushed the hypeship/spec-memory-hotplug-resize branch from d69eebd to 527ecf7 Compare June 9, 2026 20:08
rgarcia and others added 3 commits June 9, 2026 22:01
Apple's Virtualization.framework cannot grow a guest's RAM above its boot
size; the only runtime lever is the traditional balloon, which reclaims
down from boot size. This lets a vz VM boot at a configured memory ceiling,
balloon down to its baseline (the normal Size), and have the existing
host-pressure controller move the balloon target within [floor, ceiling] —
elastic usable memory with no reboot.

- ShimConfig.MemoryCeilingBytes: createVM boots at the ceiling when it
  exceeds the baseline and makes the balloon mandatory (attach regardless of
  the enable/require flags, fail creation if it cannot attach). After Start,
  the shim balloons the guest down to the baseline on cold boot; restore
  resumes an already-ballooned guest.
- Thread the ceiling request -> stored instance -> hypervisor.VMConfig ->
  buildShimConfigFromVMConfig -> ShimConfig, mirroring the Platform/
  EnableRosetta path. It rides ShimConfig through the snapshot manifest, so
  restore needs no extra change.
- Validation: ceiling 0 means no ceiling (identical to today); a ceiling at
  or below Size is rejected; a ceiling above the per-instance memory limit is
  rejected; the host-RAM bound is enforced by the shim's vz Validate().
- providers ListBalloonVMs reports AssignedMemoryBytes = max(size+hotplug,
  ceiling) so the controller's upper clamp becomes the ceiling, and a
  baseline so the controller holds there while the host is healthy. The
  protected floor is anchored on the baseline rather than the ceiling.
- Grow-on-demand: GrowOnDemandEnabled (default off) and
  GrowUtilizationPercent (default 85, clamped 1..99) on ActiveBallooningConfig
  plus growthTargetBytes; with the flag off behavior is byte-identical to
  today. A measured guest-memory signal is left as a follow-up.
- SupportsLiveMemoryCeiling capability, derived per-instance for vz when a
  ceiling is configured; SupportsHotplugMemory stays false.
- config.example.darwin.yaml gains grow_on_demand_enabled/
  grow_utilization_percent and a corrected hotplug note.

Builds on the threading precedent from #279 and the fork fast path in #276.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
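The boot-at-ceiling rule in the commit above can be sketched as a small planning step. This is a rough illustration under stated assumptions: `shimConfig` mirrors only the two fields named in the commit message, while `bootPlan`, `planBoot`, and the `hostMaxBytes` parameter are hypothetical names. The fail-fast check on the host maximum reflects the PR description rather than this commit alone.

```go
package main

import "fmt"

// shimConfig mirrors only the two fields this sketch needs.
type shimConfig struct {
	MemoryBytes        int64 // baseline (the normal Size)
	MemoryCeilingBytes int64 // 0 means no ceiling
}

// bootPlan is a hypothetical summary of what the shim would do at boot.
type bootPlan struct {
	BootBytes       int64 // memory the VM is configured with at boot
	BalloonRequired bool  // true when a ceiling is active
	SettleBytes     int64 // balloon target applied after a cold-boot Start; 0 = none
}

// planBoot boots at the ceiling only when it exceeds the baseline, and fails
// rather than clamping when the ceiling is above the host maximum.
func planBoot(cfg shimConfig, hostMaxBytes int64) (bootPlan, error) {
	if cfg.MemoryCeilingBytes <= cfg.MemoryBytes {
		return bootPlan{BootBytes: cfg.MemoryBytes}, nil
	}
	if cfg.MemoryCeilingBytes > hostMaxBytes {
		return bootPlan{}, fmt.Errorf("memory ceiling %d exceeds host maximum %d",
			cfg.MemoryCeilingBytes, hostMaxBytes)
	}
	return bootPlan{
		BootBytes:       cfg.MemoryCeilingBytes,
		BalloonRequired: true,
		SettleBytes:     cfg.MemoryBytes,
	}, nil
}

func main() {
	const gib = int64(1) << 30
	plan, err := planBoot(shimConfig{MemoryBytes: gib, MemoryCeilingBytes: 4 * gib}, 16*gib)
	fmt.Println(plan, err)
}
```

On restore the settle step would be skipped, since the commit message says a restored guest resumes already ballooned.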
Add a human-readable memory_ceiling string to the create-instance request
and the Instance response (mirroring size/hotplug_size), regenerate the
OpenAPI bindings, and wire it through the create handler and the
instance->response mapping. The request parses to bytes via the same
datasize helper; the response reports the resolved ceiling, omitted when no
ceiling is set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Linux unit tests (run everywhere): growthTargetBytes (no-op when disabled,
grows to the ceiling only above the utilization threshold, never beyond the
ceiling, never below the floor); ActiveBallooningConfig.Normalize clamps for
the new grow fields; the controller holds a ceiling VM at its baseline when
grow is off and still recovers an ordinary reclaimed VM to full; ceiling
validation (reject <= size, accept > size); providers AssignedMemoryBytes =
max(size+hotplug, ceiling) with baseline = size+hotplug.

Darwin integration test TestVZMemoryCeiling (gated by
HYPEMAN_RUN_GUESTMEMORY_TESTS=1, arm64): boots an nginx:alpine vz VM with
Size=1GiB and a 4GiB ceiling, asserts a balloon device is attached, that
/proc/meminfo MemTotal reflects the ~4GiB boot ceiling, that the balloon
settles near 1GiB with a low host RSS, then deflates to 4GiB and asserts
usable memory climbs without a reboot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
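The grow policy exercised by these unit tests can be approximated as follows. This is a sketch, not the repository's `growthTargetBytes` or `Normalize`: the type `growConfig`, the helper names, treating a zero percent as "unset" before clamping, and the strict "above the threshold" comparison are assumptions drawn from the commit wording.

```go
package main

import "fmt"

// growConfig holds only the two grow fields named in the commit messages.
type growConfig struct {
	GrowOnDemandEnabled    bool
	GrowUtilizationPercent int
}

// normalize applies the default (85) and the 1..99 clamp. Treating 0 as
// "unset" is an assumption of this sketch.
func (c growConfig) normalize() growConfig {
	switch {
	case c.GrowUtilizationPercent == 0:
		c.GrowUtilizationPercent = 85
	case c.GrowUtilizationPercent < 1:
		c.GrowUtilizationPercent = 1
	case c.GrowUtilizationPercent > 99:
		c.GrowUtilizationPercent = 99
	}
	return c
}

// growTarget returns hold unless grow is enabled and utilization is above the
// threshold, in which case it returns the ceiling. The result is always kept
// within [floor, ceiling].
func growTarget(cfg growConfig, hold, ceiling, floor int64, utilizationPercent int) int64 {
	target := hold
	if cfg.GrowOnDemandEnabled && utilizationPercent > cfg.GrowUtilizationPercent {
		target = ceiling
	}
	if target > ceiling {
		target = ceiling
	}
	if target < floor {
		target = floor
	}
	return target
}

func main() {
	const gib = int64(1) << 30
	cfg := growConfig{GrowOnDemandEnabled: true}.normalize()
	fmt.Println(cfg.GrowUtilizationPercent)
	fmt.Println(growTarget(cfg, gib, 4*gib, gib/2, 90) == 4*gib)
	fmt.Println(growTarget(cfg, gib, 4*gib, gib/2, 0) == gib)
}
```

Because the utilization input is reported as 0 today, the enabled branch never fires, which is consistent with the knob being described as plumbed but inert.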
github-actions Bot commented Jun 9, 2026


✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Add design proposal: live memory resize for macOS (vz) VMs

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-openapi studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅

hypeman-go studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅ build ✅ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@68600abf17bbd618e661c1c28aa8d72496d5ac0e

hypeman-typescript studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅ build ✅ lint ❗ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/fddac36c47d2448f6af0473fa9e80cef992c9784/dist.tar.gz

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-06-10 00:09:38 UTC

rgarcia and others added 7 commits June 9, 2026 22:26
Measure currentTotalReclaim below the baseline anchor instead of the
ceiling: a ceiling VM idling at its baseline reclaims nothing real (the
ballooned pages were never resident), so counting its headroom as reclaim
under the Stressed branch squeezed co-tenants below their own baseline.

Compute the proportional reclaim split in 128 bits. With a large ceiling
the operands approach total headroom and the int64 product overflowed
once they exceed ~2.8GiB, wrapping to a negative reclaim that corrupted
the split (one VM absorbed everything to its floor, its peer gave up
nothing) and silently failed to reclaim under real pressure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
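The overflow described in the commit above, and one way to avoid it, can be shown with Go's `math/bits` package. This is a sketch of the general technique (a 64x64 to 128-bit multiply followed by a 128/64-bit divide), not the PR's actual code; `proportionalShare` and its parameter names are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// proportionalShare computes totalReclaim * headroom / totalHeadroom using a
// 128-bit intermediate product so large byte counts cannot wrap.
func proportionalShare(totalReclaim, headroom, totalHeadroom int64) int64 {
	if totalReclaim <= 0 || headroom <= 0 || totalHeadroom <= 0 {
		return 0
	}
	if headroom > totalHeadroom {
		headroom = totalHeadroom // keeps the quotient <= totalReclaim
	}
	hi, lo := bits.Mul64(uint64(totalReclaim), uint64(headroom))
	// Safe: headroom <= totalHeadroom implies hi < totalHeadroom, so Div64
	// cannot panic on quotient overflow.
	q, _ := bits.Div64(hi, lo, uint64(totalHeadroom))
	return int64(q)
}

func main() {
	const gib = int64(1) << 30
	var reclaim, headroom, total int64 = 4 * gib, 3 * gib, 6 * gib

	// The naive int64 product is 3 * 2^62, which exceeds 2^63 and wraps.
	naive := reclaim * headroom / total
	fmt.Println("naive is negative:", naive < 0)

	safe := proportionalShare(reclaim, headroom, total)
	fmt.Println("safe share in GiB:", safe/gib)
}
```

With these inputs the VM holds half of the total headroom, so its share of a 4 GiB reclaim is 2 GiB; the naive int64 expression wraps to a negative value instead.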
Reject a non-zero memory ceiling at create time for any backend other
than vz: backends with real hotplug ignore the ceiling and boot at size,
but the controller was still told their assigned max was the ceiling,
mis-accounting their reclaim headroom.

Fail VM creation in the shim when a ceiling exceeds the host maximum
instead of silently clamping the boot size: a clamped boot left the
controller treating the full (unreachable) ceiling as the balloon's
upper bound. This makes the boot-ceiling contract's reject rule real, so
a running ceiling VM's assigned max always equals its actual boot size.

Also log the real boot size on start, drop the inaccurate "Resolved"
wording from the response field, and document that the live-ceiling
capability is only known on the start/restore client.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Temporary CI step to prove the gated TestVZMemoryCeiling darwin
integration test (boot-at-ceiling, balloon-to-baseline, live grow)
actually passes on the real macOS arm64 runner. Reverted before merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The traditional balloon target is a request the guest fulfills lazily, so
reading guest MemAvailable immediately after the target reaches baseline
captured a near-ceiling value and made the subsequent grow assertion
impossible. Wait for the guest's MemAvailable to actually fall below the
ceiling midpoint (proving the balloon inflated in-guest) before measuring
the grow baseline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ilable

Usable memory on the traditional balloon is the target the controller sets;
the guest fulfills it lazily and may not reflect it in MemAvailable, and the
design accounts on the target, not the achieved size. Drop the guest-visible
MemAvailable grow/shrink assertions and verify the actual contract instead:
the guest boots at the ceiling (MemTotal), the target settles at the baseline
with low resident host RSS, and the target can be raised back to the ceiling
with the VM still Running (no reboot).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The gated TestVZMemoryCeiling integration test was verified passing on the
macOS arm64 runner; restore test.yml to match main. Feature and test code
are unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The grow_on_demand_enabled knob is plumbed but inert until a per-guest
memory-demand signal exists, so the operator-facing example should say so
rather than presenting it as a working toggle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rgarcia rgarcia marked this pull request as ready for review June 9, 2026 23:29
@rgarcia rgarcia changed the title Add design proposal: live memory resize for macOS (vz) VMs Add live memory ceiling for macOS (vz) VMs Jun 9, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Pressure reclaim grows ceiling VMs
    • Pressure planning now computes ceiling-VM reclaim headroom and targets from baseline (not assigned ceiling), preventing pressure reclaim from increasing balloon targets.

Push these changes by commenting:

@cursor push 339e327cdb
Preview (339e327cdb)
diff --git a/lib/guestmemory/controller.go b/lib/guestmemory/controller.go
--- a/lib/guestmemory/controller.go
+++ b/lib/guestmemory/controller.go
@@ -187,12 +187,13 @@
 		}
 		currentTotalReclaim += currentReclaim
 
+		reclaimBase := clampInt64(floorAnchorBytes(vm), protectedFloor, vm.AssignedMemoryBytes)
 		candidates = append(candidates, candidateState{
 			vm:                      vm,
 			hv:                      hv,
 			currentTargetGuestBytes: currentTarget,
 			protectedFloorBytes:     protectedFloor,
-			maxReclaimBytes:         maxInt64(0, vm.AssignedMemoryBytes-protectedFloor),
+			maxReclaimBytes:         maxInt64(0, reclaimBase-protectedFloor),
 		})
 	}
 	summary.eligibleVMs = len(candidates)

diff --git a/lib/guestmemory/planner.go b/lib/guestmemory/planner.go
--- a/lib/guestmemory/planner.go
+++ b/lib/guestmemory/planner.go
@@ -41,7 +41,7 @@
 	var totalHeadroom int64
 	for _, candidate := range candidates {
 		totalHeadroom += candidate.maxReclaimBytes
-		targets[candidate.vm.ID] = candidate.vm.AssignedMemoryBytes
+		targets[candidate.vm.ID] = candidate.baselineGuestBytes()
 	}
 	if totalHeadroom <= 0 {
 		return targets
@@ -61,7 +61,7 @@
 		if reclaim > candidate.maxReclaimBytes {
 			reclaim = candidate.maxReclaimBytes
 		}
-		targets[candidate.vm.ID] = candidate.vm.AssignedMemoryBytes - reclaim
+		targets[candidate.vm.ID] = candidate.baselineGuestBytes() - reclaim
 		remainder -= reclaim
 	}
 
@@ -69,7 +69,7 @@
 		if remainder <= 0 {
 			break
 		}
-		currentReclaim := candidate.vm.AssignedMemoryBytes - targets[candidate.vm.ID]
+		currentReclaim := candidate.baselineGuestBytes() - targets[candidate.vm.ID]
 		headroomLeft := candidate.maxReclaimBytes - currentReclaim
 		if headroomLeft <= 0 {
 			continue

diff --git a/lib/guestmemory/planner_test.go b/lib/guestmemory/planner_test.go
--- a/lib/guestmemory/planner_test.go
+++ b/lib/guestmemory/planner_test.go
@@ -24,6 +24,33 @@
 	}
 }
 
+func TestPlanGuestTargetsCeilingVMReclaimStartsFromBaseline(t *testing.T) {
+	const gib = int64(1024 * 1024 * 1024)
+	const baseline = 1 * gib
+	const ceiling = 4 * gib
+	const floor = baseline / 2
+
+	// Ceiling VMs run at baseline when healthy; pressure reclaim must start from
+	// that baseline, never by "reclaiming" from the ceiling and growing the
+	// balloon target above baseline.
+	candidates := []candidateState{
+		{
+			vm: BalloonVM{
+				ID:                  "vz-ceiling",
+				AssignedMemoryBytes: ceiling,
+				BaselineMemoryBytes: baseline,
+			},
+			protectedFloorBytes: floor,
+			maxReclaimBytes:     baseline - floor,
+		},
+	}
+
+	targets := planGuestTargets(ActiveBallooningConfig{}, candidates, gib)
+	if got := targets["vz-ceiling"]; got != floor {
+		t.Fatalf("ceiling VM reclaim should plan down from baseline to floor %d, got %d", floor, got)
+	}
+}
+
 func TestFloorAnchorBytesUsesBaselineForCeilingVM(t *testing.T) {
 	const gib = int64(1024 * 1024 * 1024)
 	const baseline = 1 * gib

Comment thread lib/guestmemory/controller.go
planGuestTargets computed each guest's reclaim target as
AssignedMemoryBytes - reclaim, and maxReclaimBytes as AssignedMemoryBytes
- floor. For a ceiling VM idling at its baseline (target well below the
ceiling) that lands above the baseline, so under host pressure the
reconcile loop raised the balloon target — inflating the guest instead of
reclaiming it. Anchor both on floorAnchorBytes (the baseline; equal to the
assigned size for ordinary VMs, so their behavior is unchanged) so reclaim
moves a ceiling VM from its baseline toward its floor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
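A small worked example makes the difference between the two anchors concrete. The types and helpers below (`vmView`, `anchorBytes`) are hypothetical simplifications written for this illustration; in particular, treating a zero baseline as "ordinary VM" is an assumption.

```go
package main

import "fmt"

// vmView is a pared-down, hypothetical view of a balloon VM.
type vmView struct {
	AssignedBytes int64 // the ceiling for a ceiling VM
	BaselineBytes int64 // 0 for an ordinary VM in this sketch
}

// anchorBytes returns the baseline for a ceiling VM and the assigned size for
// an ordinary VM.
func anchorBytes(vm vmView) int64 {
	if vm.BaselineBytes > 0 && vm.BaselineBytes < vm.AssignedBytes {
		return vm.BaselineBytes
	}
	return vm.AssignedBytes
}

func main() {
	const mib = int64(1) << 20
	vm := vmView{AssignedBytes: 4096 * mib, BaselineBytes: 1024 * mib}
	current := 1024 * mib // idling at baseline
	reclaim := 256 * mib  // reclaim demanded under host pressure

	oldTarget := vm.AssignedBytes - reclaim // anchored on the ceiling
	newTarget := anchorBytes(vm) - reclaim  // anchored on the baseline

	fmt.Println("old target raises the guest:", oldTarget > current)
	fmt.Println("new target in MiB:", newTarget/mib)
}
```

Anchored on the ceiling, a 256 MiB reclaim yields a 3840 MiB target, well above the 1024 MiB the guest currently holds; anchored on the baseline it yields 768 MiB, which is an actual reduction.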

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Controller undoes balloon API grow
    • When healthy with grow-on-demand disabled, reconcile now holds at max(baseline,current target) so externally grown balloon targets are preserved instead of being forced back to baseline, and a regression test covers this case.

Push these changes by commenting:

@cursor push 5c22aead5a
Preview (5c22aead5a)
diff --git a/lib/guestmemory/controller.go b/lib/guestmemory/controller.go
--- a/lib/guestmemory/controller.go
+++ b/lib/guestmemory/controller.go
@@ -227,18 +227,24 @@
 	summary.manualHoldActive = state.manualHold != nil
 
 	plannedTargets := planGuestTargets(c.config, candidates, totalTarget)
-	// No reclaim demanded means the host is healthy: hold each guest at its
-	// baseline (baseline == assigned for ordinary VMs, so this recovers prior
-	// reclaim unchanged). growthTargetBytes returns the baseline while grow-on-demand
-	// is off. This per-VM grow has no aggregate host-RAM cap, which is safe only
-	// because utilizationPercent() is 0 today; a real signal (RFC milestone 4) must
-	// route grow through a host-aware planner.
+	// No reclaim demanded means the host is healthy: recover reclaimed guests up to
+	// their baseline (baseline == assigned for ordinary VMs), but do not pull down
+	// a target that was already grown externally (for example through the balloon
+	// API) while grow-on-demand is disabled.
+	//
+	// This per-VM grow has no aggregate host-RAM cap, which is safe only because
+	// utilizationPercent() is 0 today; a real signal (RFC milestone 4) must route
+	// grow through a host-aware planner.
 	if totalTarget == 0 {
 		for _, candidate := range candidates {
 			baseline := candidate.baselineGuestBytes()
+			holdTarget := baseline
+			if !c.config.GrowOnDemandEnabled {
+				holdTarget = maxInt64(holdTarget, candidate.currentTargetGuestBytes)
+			}
 			plannedTargets[candidate.vm.ID] = growthTargetBytes(
 				c.config,
-				baseline,
+				holdTarget,
 				candidate.vm.AssignedMemoryBytes,
 				candidate.protectedFloorBytes,
 				candidate.utilizationPercent(),

diff --git a/lib/guestmemory/controller_test.go b/lib/guestmemory/controller_test.go
--- a/lib/guestmemory/controller_test.go
+++ b/lib/guestmemory/controller_test.go
@@ -189,6 +189,41 @@
 	assert.Equal(t, baseline, hv.target, "balloon target must remain at baseline")
 }
 
+func TestHealthyPreservesExternallyGrownCeilingVMWhenGrowOnDemandOff(t *testing.T) {
+	const mib = int64(1024 * 1024)
+	const baseline = 1024 * mib
+	const ceiling = 4096 * mib
+
+	src := &stubSource{
+		vms: []BalloonVM{
+			{ID: "a", Name: "a", HypervisorType: hypervisor.TypeVZ, SocketPath: "a", AssignedMemoryBytes: ceiling, BaselineMemoryBytes: baseline},
+		},
+	}
+	// Simulate a prior live grow via the balloon API.
+	hv := &stubHypervisor{target: ceiling, capabilities: hypervisor.Capabilities{SupportsBalloonControl: true}}
+
+	c := NewController(Policy{Enabled: true, ReclaimEnabled: true}, ActiveBallooningConfig{
+		Enabled:                true,
+		ProtectedFloorPercent:  50,
+		ProtectedFloorMinBytes: 0,
+		MinAdjustmentBytes:     1,
+		PerVMMaxStepBytes:      ceiling,
+		PerVMCooldown:          time.Millisecond,
+		GrowOnDemandEnabled:    false,
+	}, src, slog.New(slog.NewTextHandler(io.Discard, nil))).(*controller)
+	c.sampler = &stubSampler{sample: HostPressureSample{TotalBytes: 64 * 1024 * mib, AvailableBytes: 32 * 1024 * mib, AvailablePercent: 50}}
+	c.reconcileMu.newClient = func(_ hypervisor.Type, _ string) (hypervisor.Hypervisor, error) {
+		return hv, nil
+	}
+
+	resp, err := c.TriggerReclaim(context.Background(), ManualReclaimRequest{ReclaimBytes: 0})
+	require.NoError(t, err)
+	require.Len(t, resp.Actions, 1)
+	assert.Equal(t, "unchanged", resp.Actions[0].Status)
+	assert.Equal(t, ceiling, resp.Actions[0].PlannedTargetGuestMemoryBytes, "healthy reconcile must not undo external balloon API grow")
+	assert.Equal(t, ceiling, hv.target, "balloon target must remain at externally grown value")
+}
+
 func TestStressedCeilingVMAtBaselineDoesNotSqueezeCoTenant(t *testing.T) {
 	const mib = int64(1024 * 1024)
 	const baseline = 1024 * mib

Comment thread lib/guestmemory/controller.go
The healthy-state reconcile forced every guest to its baseline, so a
ceiling VM grown above baseline via the balloon API (or a future
auto-grow signal) was pulled back down on the next poll, making the
ceiling headroom unreachable whenever active ballooning is enabled. Hold
at max(currentTarget, baseline) instead: reclaimed guests still recover up
to baseline, but an explicit grow is preserved. Reclaim under pressure is
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
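The healthy-host hold rule from the commit above reduces to a single max. The sketch below walks the three cases the message mentions; `healthyHoldTarget` is a hypothetical name, and the values are illustrative.

```go
package main

import "fmt"

// healthyHoldTarget returns the target to hold when no reclaim is demanded:
// reclaimed guests recover to baseline, and grown guests keep their grow.
func healthyHoldTarget(currentTarget, baseline int64) int64 {
	if currentTarget > baseline {
		return currentTarget
	}
	return baseline
}

func main() {
	const mib = int64(1) << 20
	baseline := 1024 * mib

	fmt.Println(healthyHoldTarget(512*mib, baseline) / mib)  // previously reclaimed guest
	fmt.Println(healthyHoldTarget(1024*mib, baseline) / mib) // idling at baseline
	fmt.Println(healthyHoldTarget(4096*mib, baseline) / mib) // grown via the balloon API
}
```

Only the third case changes relative to the earlier behavior: a guest grown to the ceiling is no longer pulled back to baseline on the next healthy poll.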

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Reclaim metrics use ceiling anchor
    • Reconcile now computes planned and applied reclaim from each VM’s floor anchor (baseline for ceiling VMs) with a zero floor, eliminating the ceiling-minus-baseline phantom reclaim in response totals and per-action metrics.
  • ✅ Fixed: Baseline ignores vz hotplug split
    • Provider mapping now forces vz baseline memory to Size (ignoring hotplug) so controller baseline/floor behavior matches the vz shim’s balloon baseline and avoids inflated baseline targets.

Push these changes by commenting:

@cursor push 4ee4790473
Preview (4ee4790473)
diff --git a/lib/guestmemory/controller.go b/lib/guestmemory/controller.go
--- a/lib/guestmemory/controller.go
+++ b/lib/guestmemory/controller.go
@@ -307,7 +307,8 @@
 			Status:                         "unchanged",
 		}
 
-		resp.PlannedReclaimBytes += candidate.vm.AssignedMemoryBytes - plannedTarget
+		anchorBytes := floorAnchorBytes(candidate.vm)
+		resp.PlannedReclaimBytes += maxInt64(0, anchorBytes-plannedTarget)
 
 		if !req.dryRun && appliedTarget != candidate.currentTargetGuestBytes {
 			if err := candidate.hv.SetTargetGuestMemoryBytes(applyCtx, appliedTarget); err != nil {
@@ -335,7 +336,7 @@
 			action.TargetGuestMemoryBytes = appliedTarget
 		}
 		if !req.dryRun {
-			action.AppliedReclaimBytes = candidate.vm.AssignedMemoryBytes - action.TargetGuestMemoryBytes
+			action.AppliedReclaimBytes = maxInt64(0, anchorBytes-action.TargetGuestMemoryBytes)
 		}
 		resp.AppliedReclaimBytes += action.AppliedReclaimBytes
 		resp.Actions = append(resp.Actions, action)

diff --git a/lib/guestmemory/controller_test.go b/lib/guestmemory/controller_test.go
--- a/lib/guestmemory/controller_test.go
+++ b/lib/guestmemory/controller_test.go
@@ -184,7 +184,10 @@
 	resp, err := c.TriggerReclaim(context.Background(), ManualReclaimRequest{ReclaimBytes: 0})
 	require.NoError(t, err)
 	require.Len(t, resp.Actions, 1)
+	assert.Equal(t, int64(0), resp.PlannedReclaimBytes, "baseline-held ceiling VM should not report phantom planned reclaim")
+	assert.Equal(t, int64(0), resp.AppliedReclaimBytes, "baseline-held ceiling VM should not report phantom applied reclaim")
 	assert.Equal(t, "unchanged", resp.Actions[0].Status)
+	assert.Equal(t, int64(0), resp.Actions[0].AppliedReclaimBytes, "per-action reclaim should be anchored to baseline")
 	assert.Equal(t, baseline, resp.Actions[0].TargetGuestMemoryBytes, "ceiling VM should hold at baseline, not grow to ceiling, when grow-on-demand is off")
 	assert.Equal(t, baseline, hv.target, "balloon target must remain at baseline")
 }

diff --git a/lib/providers/providers.go b/lib/providers/providers.go
--- a/lib/providers/providers.go
+++ b/lib/providers/providers.go
@@ -305,10 +305,13 @@
 // balloonVMForInstance maps a stored instance onto the controller's view. The
 // baseline is the guest's normal running size; a vz boot ceiling is the live
 // maximum the balloon can deflate to, so it drives the controller's upper clamp
-// while the baseline is the size held when the host is healthy. HotplugSize is 0
-// on vz, so the max keeps non-vz backends correct if hotplug is ever populated.
+// while the baseline is the size held when the host is healthy. vz ignores
+// hotplug, so its baseline remains Size; other backends keep size+hotplug.
 func balloonVMForInstance(inst instances.Instance) guestmemory.BalloonVM {
 	baseline := inst.Size + inst.HotplugSize
+	if inst.HypervisorType == hypervisor.TypeVZ {
+		baseline = inst.Size
+	}
 	assigned := baseline
 	if inst.MemoryCeilingBytes > assigned {
 		assigned = inst.MemoryCeilingBytes

diff --git a/lib/providers/providers_test.go b/lib/providers/providers_test.go
--- a/lib/providers/providers_test.go
+++ b/lib/providers/providers_test.go
@@ -4,6 +4,7 @@
 	"testing"
 
 	"github.com/kernel/hypeman/cmd/api/config"
+	"github.com/kernel/hypeman/lib/hypervisor"
 	"github.com/kernel/hypeman/lib/instances"
 	snapshotstore "github.com/kernel/hypeman/lib/snapshot"
 	"github.com/stretchr/testify/assert"
@@ -34,6 +35,13 @@
 	}})
 	assert.Equal(t, 2*gib, lowCeiling.AssignedMemoryBytes)
 	assert.Equal(t, 2*gib, lowCeiling.BaselineMemoryBytes)
+
+	// vz ignores hotplug sizing, so controller baseline must stay at Size.
+	vzWithHotplug := balloonVMForInstance(instances.Instance{StoredMetadata: instances.StoredMetadata{
+		Id: "d", HypervisorType: hypervisor.TypeVZ, Size: gib, HotplugSize: gib / 2, MemoryCeilingBytes: 3 * gib,
+	}})
+	assert.Equal(t, 3*gib, vzWithHotplug.AssignedMemoryBytes)
+	assert.Equal(t, gib, vzWithHotplug.BaselineMemoryBytes)
 }
 
 func TestSnapshotDefaultsFromConfigDisabledReturnsNilCompression(t *testing.T) {

Comment thread lib/guestmemory/controller.go
Comment thread lib/providers/providers.go
Two consistency gaps from baseline-anchored reclaim:
- PlannedReclaimBytes/AppliedReclaimBytes still subtracted the target from
  AssignedMemoryBytes (the ceiling), so a ceiling VM idling at baseline
  reported phantom reclaim of ~ceiling-baseline. Anchor the reported
  reclaim on floorAnchorBytes too, clamped at zero.
- balloonVMForInstance set the controller baseline to Size+HotplugSize,
  but the vz shim balloons to MemoryBytes (== Size) and ignores hotplug,
  so a vz ceiling VM with a hotplug size diverged. Mirror the shim: a
  ceiling VM baselines at Size and is capped at the ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
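The phantom-reclaim problem in the first bullet above is easy to see with numbers. This is an illustrative sketch; `reportedReclaim` is a hypothetical helper that takes the anchor as a plain argument instead of deriving it from a VM record.

```go
package main

import "fmt"

// reportedReclaim measures reclaim below the anchor and never goes negative,
// so a guest grown above its baseline reports zero rather than a deficit.
func reportedReclaim(anchor, target int64) int64 {
	if target >= anchor {
		return 0
	}
	return anchor - target
}

func main() {
	const mib = int64(1) << 20
	ceiling, baseline := 4096*mib, 1024*mib

	// Ceiling VM idling at baseline.
	fmt.Println("ceiling-anchored MiB:", (ceiling-baseline)/mib)
	fmt.Println("baseline-anchored MiB:", reportedReclaim(baseline, baseline)/mib)

	// Ceiling VM grown to the ceiling, then one reclaimed to 768 MiB.
	fmt.Println("grown MiB:", reportedReclaim(baseline, ceiling)/mib)
	fmt.Println("reclaimed MiB:", reportedReclaim(baseline, 768*mib)/mib)
}
```

Subtracting from the ceiling reports 3072 MiB of reclaim for a guest that gave up nothing; anchoring on the baseline reports 0, and only a target below baseline (768 MiB here) counts, as 256 MiB.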

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Small baseline below protected floor
    • Protected floors are now capped at each VM’s floor anchor so small-baseline ceiling VMs are held at baseline instead of being inflated to the global minimum floor.

Push these changes by commenting:

@cursor push 45bd39c88a
Preview (45bd39c88a)
diff --git a/lib/guestmemory/planner.go b/lib/guestmemory/planner.go
--- a/lib/guestmemory/planner.go
+++ b/lib/guestmemory/planner.go
@@ -89,7 +89,7 @@
 
 func protectedFloorBytes(cfg ActiveBallooningConfig, anchor int64) int64 {
 	percentFloor := (anchor * int64(cfg.ProtectedFloorPercent)) / 100
-	return maxInt64(cfg.ProtectedFloorMinBytes, percentFloor)
+	return minInt64(anchor, maxInt64(cfg.ProtectedFloorMinBytes, percentFloor))
 }
 
 // floorAnchorBytes is the size the protected floor is computed against: the

diff --git a/lib/guestmemory/planner_test.go b/lib/guestmemory/planner_test.go
--- a/lib/guestmemory/planner_test.go
+++ b/lib/guestmemory/planner_test.go
@@ -45,6 +45,22 @@
 	}
 }
 
+func TestProtectedFloorBytesCappedAtAnchor(t *testing.T) {
+	const mib = int64(1024 * 1024)
+	const baseline = 256 * mib
+
+	cfg := ActiveBallooningConfig{
+		ProtectedFloorPercent:  50,
+		ProtectedFloorMinBytes: 512 * mib,
+	}
+
+	// The protected floor must never exceed the baseline anchor; otherwise a
+	// small-baseline ceiling VM is forced above baseline on healthy reconciles.
+	if got := protectedFloorBytes(cfg, baseline); got != baseline {
+		t.Fatalf("protected floor should cap at anchor %d, got %d", baseline, got)
+	}
+}
+
 func TestAutomaticTargetBytesStressedHoldsCurrentReclaim(t *testing.T) {
 	const gib = int64(1024 * 1024 * 1024)
 	cfg := ActiveBallooningConfig{PressureLowWatermarkAvailablePercent: 15}

Reviewed by Cursor Bugbot for commit ede5689.

Comment thread lib/guestmemory/planner.go
The floor was capped at AssignedMemoryBytes (the ceiling), so a ceiling VM
whose baseline (Size) is below protected_floor_min_bytes got a floor above
its baseline — the healthy reconcile then raised the guest up to the floor
instead of holding it at the baseline the shim set. Cap the floor at
floorAnchorBytes (the baseline; == assigned for ordinary VMs, unchanged).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
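The floor cap described in the commit above can be checked with the same numbers the added unit test uses. The function below takes plain arguments rather than the repository's config struct, so its name and signature are assumptions made for a self-contained sketch.

```go
package main

import "fmt"

// cappedFloor computes max(minBytes, anchor*percent/100), then caps the result
// at the anchor so the floor can never sit above the baseline.
func cappedFloor(percent int, minBytes, anchor int64) int64 {
	floor := anchor * int64(percent) / 100
	if minBytes > floor {
		floor = minBytes
	}
	if floor > anchor {
		floor = anchor
	}
	return floor
}

func main() {
	const mib = int64(1) << 20

	// Small baseline below the global minimum floor: capped at the baseline.
	fmt.Println(cappedFloor(50, 512*mib, 256*mib) / mib)

	// Larger baseline: the percentage floor wins and the cap has no effect.
	fmt.Println(cappedFloor(50, 512*mib, 4096*mib) / mib)
}
```

For a 256 MiB baseline with a 512 MiB minimum floor, the uncapped floor would be 512 MiB, which is what forced the guest above its baseline; with the cap it is 256 MiB. For a 4096 MiB anchor the result is 2048 MiB either way.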
@firetiger-agent


Created a monitoring plan for this PR.

What this PR does: Ships the live memory ceiling infrastructure for macOS (vz) VMs — a new memory_ceiling field on instance creation lets a VM boot at a higher ceiling and balloon down to baseline, so usable memory can grow back up at runtime. Also fixes the active-ballooning controller's reclaim accounting for ceiling VMs so their headroom isn't incorrectly counted against co-tenants. Both the ceiling feature and auto-grow are off by default; no existing VMs are affected unless memory_ceiling is explicitly set.

Intended effect:

  • invalid_memory_ceiling 400 responses: baseline 0/24h; confirmed if it stays at 0 (no callers set the field yet, so zero false-positive 400s expected)
  • Instance creation errors in API logs: baseline 0–2,656/hr depending on traffic and infrastructure; confirmed if no regression above 5,000/hr
  • API 5xx rate: baseline 0.001–0.035%; confirmed if it stays below 0.2%

Risks:

  • Controller reclaim regression: the currentReclaim / floorAnchorBytes fix affects all VMs; watch for sustained "failed to create instance" errors in API logs rising above 5,000/hr for 2+ consecutive hours
  • Unexpected invalid_memory_ceiling 400s — any invalid_memory_ceiling log line in API logs signals a caller accidentally sending the new field incorrectly; alert on any occurrence
  • Balloon initialization failure — once a ceiling VM is first created, watch Railway stdout for prod-jfk-hypeman-{0,1,2} for "failed to balloon guest to baseline"; alert on any occurrence
  • Host-RAM ceiling overflow — watch API logs for "memory ceiling … exceeds host maximum"; alert on any occurrence once the feature is adopted
