display: retry neko reconfig when X root never converges#280
Conversation
PATCH /display could return 200 with the unconverged dummy DDX default
(3840x2160) in the body instead of the requested mode. The race: neko
applies screen reconfigs asynchronously, so a PATCH that lands while a
previous reconfig is in flight (e.g. neko's 1920x1080 boot config) can be
dropped/clobbered. waitForXRootRealized then times out polling, and the
handler echoes the bad X root reading back as the realized size — silently
violating the API contract.
Two changes:
- waitForXRootRealized now returns a converged bool so the handler can
distinguish "settled at the requested mode" from "gave up at the deadline."
- The neko path retries the reconfig RPC up to 3 times when convergence
fails. Polling X harder can't recover a dropped neko request — only
re-issuing the RPC will. If all retries fail, the handler returns 500
with the realized vs requested mismatch instead of 200 with the wrong size.
The xrandr (no-neko) path skips the RPC retry — xrandr is synchronous and
idempotent, retrying it can't change the outcome — but still returns the
honest non-convergence error.
Reproduced locally on a quiet VM:
before: TestDisplayResizeOddWidthHonoursLibxcvtRounding flakes ~40%
(2 fails in 5 runs); failing body is {3840,2160} for request
{1365,768}.
after: 8/8 passing; two runs took ~10s longer (retry firing
transparently and recovering).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Created a monitoring plan for this PR. What this PR does: Fixes a silent display-resize failure on headful (WebRTC/Neko-enabled) browser VMs. Previously, when a Neko display reconfiguration was lost due to an in-flight boot reconfig, the API would quietly return success with the wrong resolution. Now it retries the reconfig up to 3 times and returns an explicit HTTP 500 if the screen never converges, so callers get a real signal instead of silent corruption. Intended effect:
Risks:
Status updates will be posted automatically on this PR as monitoring progresses. |
rgarcia
left a comment
There was a problem hiding this comment.
reviewed — lgtm. the retry-on-non-convergence approach is sound and the loop logic checks out: 3 polls / 2 rpc retries on the neko path, with the final state always polled after the last retry. returning 500 instead of echoing the stale dummy 3840×2160 is the right call. a couple non-blocking things:
questions
server/cmd/api/api/display.go:168-172— worst-case latency: neko non-convergence now blocks ~30s (3×10s polls) before the 500.waitForXRootRealizedhonorsctx.Done()so an upstream deadline bails early, but worth confirming no request timeout sits below ~30s and would cut a legit retry short. is 3×10s the right budget, vs. shorter polls or fewer attempts?
nits
server/cmd/api/api/display.go:525-527— doc comment onwaitForXRootRealizedstill says "Always non-fatal — the response always echoes some size, never 500s." that's stale now: a non-converged (false) return drives a 500 at the caller. worth updating so the next reader isn't misled.
Summary
PATCH /displaycould return HTTP 200 with the unconverged dummy DDX default (3840×2160) in the response body instead of the requested mode. This was being papered over at the test layer by adding retry loops in e2e tests (PR #279); this PR fixes it at the server layer so SDK callers get correct behavior.The race
setResolutionViaNekocalls neko'sScreenConfigurationChangeRPC, which is asynchronous. If a PATCH lands while a previous reconfig is in flight — e.g. neko's boot1920×1080@25config — the new request can be dropped/clobbered.waitForXRootRealizedthen polls X until timeout, sees the dummy default3840×2160, and the handler echoes that value back as the realized size.The function had no way to signal "I gave up" vs "I converged on something close to the request", so the handler treated both the same and returned 200 either way.
Fix
Two changes in
server/cmd/api/api/display.go:waitForXRootRealizedreturns aconverged bool. Same polling logic, but the function now distinguishes a successful settle from a timeout. No behavior change for the existing exact-match and stable-within-delta success paths.setResolutionViaNekoup to 3 times before giving up. Polling X harder can't recover a request neko dropped — only re-issuing the RPC will. The xrandr (no-neko) path skips the RPC retry (xrandr is synchronous and idempotent) but still surfaces the non-convergence honestly.If all retries fail, the handler now returns 500 with
display did not converge to WxH (X root reports W'xH')instead of 200 with the wrong size. SDK callers see an honest error and can decide to retry.Verification
Reproduced locally on a quiet VM (no concurrent load):
TestDisplayResizeOddWidthHonoursLibxcvtRoundingTestDisplayResizeChromiumWindowheadful subtestsThe failing body before the fix was
{"width":3840,"height":2160}for a request of{1365,768}. Two of the 8 runs after the fix took ~10s longer than baseline — that's the retry firing transparently and recovering, exactly the intended behavior.Follow-ups
Once this lands, the e2e retry loops added in #279 become redundant and can be removed — keeping the
waitForXRootResolution(1920, 1080)baseline-wait there is fine as a latency optimization but is no longer load-bearing.🤖 Generated with Claude Code
Note
Medium Risk
Changes core display resize success/failure semantics for headful Neko deployments; callers that relied on 200 with wrong sizes will now see errors, which is intended but is a behavioral contract change.
Overview
PATCH /displayon the Xorg + Neko path no longer returns HTTP 200 with the wrong realized size (e.g. dummy 3840×2160) when Neko drops an async screen reconfig.waitForXRootRealizednow returns aconvergedflag so the handler can tell a successful settle from a timeout. On the Neko path, if X never converges within the poll window, the handler re-issuessetResolutionViaNekoup to 3 times before giving up; the xrandr-only path does not retry the RPC but still treats non-convergence as failure.If convergence still fails, the API returns 500 with
display did not converge to WxH (X root reports W'xH')instead of echoing the stale X root in a 200 response. Successful convergence still uses the realized X root dimensions (including libxcvt rounding) in the response body.Reviewed by Cursor Bugbot for commit a00bbbe. Bugbot is set up for automated code reviews on this repo. Configure here.