Skip to content

display: retry neko reconfig when X root never converges#280

Open
hiroTamada wants to merge 1 commit into
mainfrom
hypeship/display-server-retry-on-unconverged
Open

display: retry neko reconfig when X root never converges#280
hiroTamada wants to merge 1 commit into
mainfrom
hypeship/display-server-retry-on-unconverged

Conversation

@hiroTamada

@hiroTamada hiroTamada commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

PATCH /display could return HTTP 200 with the unconverged dummy DDX default (3840×2160) in the response body instead of the requested mode. This was being papered over at the test layer by adding retry loops in e2e tests (PR #279); this PR fixes it at the server layer so SDK callers get correct behavior.

The race

setResolutionViaNeko calls neko's ScreenConfigurationChange RPC, which is asynchronous. If a PATCH lands while a previous reconfig is in flight — e.g. neko's boot 1920×1080@25 config — the new request can be dropped/clobbered. waitForXRootRealized then polls X until timeout, sees the dummy default 3840×2160, and the handler echoes that value back as the realized size.

The function had no way to signal "I gave up" vs "I converged on something close to the request", so the handler treated both the same and returned 200 either way.

Fix

Two changes in server/cmd/api/api/display.go:

  1. waitForXRootRealized returns a converged bool. Same polling logic, but the function now distinguishes a successful settle from a timeout. No behavior change for the existing exact-match and stable-within-delta success paths.
  2. Neko path retries the reconfig RPC. When convergence fails, the handler re-issues setResolutionViaNeko up to 3 times before giving up. Polling X harder can't recover a request neko dropped — only re-issuing the RPC will. The xrandr (no-neko) path skips the RPC retry (xrandr is synchronous and idempotent) but still surfaces the non-convergence honestly.

If all retries fail, the handler now returns 500 with display did not converge to WxH (X root reports W'xH') instead of 200 with the wrong size. SDK callers see an honest error and can decide to retry.

Verification

Reproduced locally on a quiet VM (no concurrent load):

Test Before After
TestDisplayResizeOddWidthHonoursLibxcvtRounding 2 fails / 5 runs (~40%) 8/8 pass
TestDisplayResizeChromiumWindow headful subtests (not measured) 12/12 pass (4 runs × 3 subtests)

The failing body before the fix was {"width":3840,"height":2160} for a request of {1365,768}. Two of the 8 runs after the fix took ~10s longer than baseline — that's the retry firing transparently and recovering, exactly the intended behavior.

Follow-ups

Once this lands, the e2e retry loops added in #279 become redundant and can be removed — keeping the waitForXRootResolution(1920, 1080) baseline-wait there is fine as a latency optimization but is no longer load-bearing.

🤖 Generated with Claude Code


Note

Medium Risk
Changes core display resize success/failure semantics for headful Neko deployments; callers that relied on 200 with wrong sizes will now see errors, which is intended but is a behavioral contract change.

Overview
PATCH /display on the Xorg + Neko path no longer returns HTTP 200 with the wrong realized size (e.g. dummy 3840×2160) when Neko drops an async screen reconfig.

waitForXRootRealized now returns a converged flag so the handler can tell a successful settle from a timeout. On the Neko path, if X never converges within the poll window, the handler re-issues setResolutionViaNeko up to 3 times before giving up; the xrandr-only path does not retry the RPC but still treats non-convergence as failure.

If convergence still fails, the API returns 500 with display did not converge to WxH (X root reports W'xH') instead of echoing the stale X root in a 200 response. Successful convergence still uses the realized X root dimensions (including libxcvt rounding) in the response body.

Reviewed by Cursor Bugbot for commit a00bbbe. Bugbot is set up for automated code reviews on this repo. Configure here.

PATCH /display could return 200 with the unconverged dummy DDX default
(3840x2160) in the body instead of the requested mode. The race: neko
applies screen reconfigs asynchronously, so a PATCH that lands while a
previous reconfig is in flight (e.g. neko's 1920x1080 boot config) can be
dropped/clobbered. waitForXRootRealized then times out polling, and the
handler echoes the bad X root reading back as the realized size — silently
violating the API contract.

Two changes:

- waitForXRootRealized now returns a converged bool so the handler can
  distinguish "settled at the requested mode" from "gave up at the deadline."
- The neko path retries the reconfig RPC up to 3 times when convergence
  fails. Polling X harder can't recover a dropped neko request — only
  re-issuing the RPC will. If all retries fail, the handler returns 500
  with the realized vs requested mismatch instead of 200 with the wrong size.

The xrandr (no-neko) path skips the RPC retry — xrandr is synchronous and
idempotent, retrying it can't change the outcome — but still returns the
honest non-convergence error.

Reproduced locally on a quiet VM:
  before: TestDisplayResizeOddWidthHonoursLibxcvtRounding flakes ~40%
          (2 fails in 5 runs); failing body is {3840,2160} for request
          {1365,768}.
  after:  8/8 passing; two runs took ~10s longer (retry firing
          transparently and recovering).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@hiroTamada hiroTamada marked this pull request as ready for review June 9, 2026 17:48
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Fixes a silent display-resize failure on headful (WebRTC/Neko-enabled) browser VMs. Previously, when a Neko display reconfiguration was lost due to an in-flight boot reconfig, the API would quietly return success with the wrong resolution. Now it retries the reconfig up to 3 times and returns an explicit HTTP 500 if the screen never converges, so callers get a real signal instead of silent corruption.

Intended effect:

  • "X root did not converge after resize, retrying neko reconfig" WARN log: baseline 0 (log is new); confirmed if it appears then resolves within the retry window, not sustained at > 50/hr
  • "successfully changed resolution via Neko API" INFO log: should appear more consistently on headful VMs that previously silently failed; watch for volume holding steady or increasing

Risks:

  • New convergence-failure 500s"display did not converge to requested mode after retries" ERROR log in API logs, alert if > 5/hr sustained; previously these failures were silently swallowed as fake 200s
  • Increased resize latency on Neko VMs — each failed convergence adds up to 10s per retry attempt (max 30s total); alert if callers report PATCH /display timeouts or sustained high latency on headful sessions
  • Retry storms during VM boot — if Neko is consistently slow to initialize, the retry loop issues the RPC 3× in 30s; alert on any "failed to change screen configuration" ERROR surge at VM startup time

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

@rgarcia rgarcia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed — lgtm. the retry-on-non-convergence approach is sound and the loop logic checks out: 3 polls / 2 rpc retries on the neko path, with the final state always polled after the last retry. returning 500 instead of echoing the stale dummy 3840×2160 is the right call. a couple non-blocking things:

questions

  • server/cmd/api/api/display.go:168-172 — worst-case latency: neko non-convergence now blocks ~30s (3×10s polls) before the 500. waitForXRootRealized honors ctx.Done() so an upstream deadline bails early, but worth confirming no request timeout sits below ~30s and would cut a legit retry short. is 3×10s the right budget, vs. shorter polls or fewer attempts?

nits

  • server/cmd/api/api/display.go:525-527 — doc comment on waitForXRootRealized still says "Always non-fatal — the response always echoes some size, never 500s." that's stale now: a non-converged (false) return drives a 500 at the caller. worth updating so the next reader isn't misled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants