display: retry neko reconfig when X root never converges by hiroTamada · Pull Request #280 · kernel/kernel-images

hiroTamada · 2026-06-09T17:48:21Z

Summary

PATCH /display could return HTTP 200 with the unconverged dummy DDX default (3840×2160) in the response body instead of the requested mode. This was being papered over at the test layer by adding retry loops in e2e tests (PR #279); this PR fixes it at the server layer so SDK callers get correct behavior.

The race

setResolutionViaNeko calls neko's ScreenConfigurationChange RPC, which is asynchronous. If a PATCH lands while a previous reconfig is in flight — e.g. neko's boot 1920×1080@25 config — the new request can be dropped/clobbered. waitForXRootRealized then polls X until timeout, sees the dummy default 3840×2160, and the handler echoes that value back as the realized size.

The function had no way to signal "I gave up" vs "I converged on something close to the request", so the handler treated both the same and returned 200 either way.

Fix

Two changes in server/cmd/api/api/display.go:

waitForXRootRealized returns a converged bool. Same polling logic, but the function now distinguishes a successful settle from a timeout. No behavior change for the existing exact-match and stable-within-delta success paths.
Neko path retries the reconfig RPC. When convergence fails, the handler re-issues setResolutionViaNeko up to 3 times before giving up. Polling X harder can't recover a request neko dropped — only re-issuing the RPC will. The xrandr (no-neko) path skips the RPC retry (xrandr is synchronous and idempotent) but still surfaces the non-convergence honestly.

If all retries fail, the handler now returns 500 with display did not converge to WxH (X root reports W'xH') instead of 200 with the wrong size. SDK callers see an honest error and can decide to retry.

Verification

Reproduced locally on a quiet VM (no concurrent load):

Test	Before	After
`TestDisplayResizeOddWidthHonoursLibxcvtRounding`	2 fails / 5 runs (~40%)	8/8 pass
`TestDisplayResizeChromiumWindow` headful subtests	(not measured)	12/12 pass (4 runs × 3 subtests)

The failing body before the fix was {"width":3840,"height":2160} for a request of {1365,768}. Two of the 8 runs after the fix took ~10s longer than baseline — that's the retry firing transparently and recovering, exactly the intended behavior.

Follow-ups

Once this lands, the e2e retry loops added in #279 become redundant and can be removed — keeping the waitForXRootResolution(1920, 1080) baseline-wait there is fine as a latency optimization but is no longer load-bearing.

🤖 Generated with Claude Code

Note

Medium Risk
Changes core display resize success/failure semantics for headful Neko deployments; callers that relied on 200 with wrong sizes will now see errors, which is intended but is a behavioral contract change.

Overview
PATCH /display on the Xorg + Neko path no longer returns HTTP 200 with the wrong realized size (e.g. dummy 3840×2160) when Neko drops an async screen reconfig.

waitForXRootRealized now returns a converged flag so the handler can tell a successful settle from a timeout. On the Neko path, if X never converges within the poll window, the handler re-issues setResolutionViaNeko up to 3 times before giving up; the xrandr-only path does not retry the RPC but still treats non-convergence as failure.

If convergence still fails, the API returns 500 with display did not converge to WxH (X root reports W'xH') instead of echoing the stale X root in a 200 response. Successful convergence still uses the realized X root dimensions (including libxcvt rounding) in the response body.

^{Reviewed by Cursor Bugbot for commit a00bbbe. Bugbot is set up for automated code reviews on this repo. Configure here.}

PATCH /display could return 200 with the unconverged dummy DDX default (3840x2160) in the body instead of the requested mode. The race: neko applies screen reconfigs asynchronously, so a PATCH that lands while a previous reconfig is in flight (e.g. neko's 1920x1080 boot config) can be dropped/clobbered. waitForXRootRealized then times out polling, and the handler echoes the bad X root reading back as the realized size — silently violating the API contract. Two changes: - waitForXRootRealized now returns a converged bool so the handler can distinguish "settled at the requested mode" from "gave up at the deadline." - The neko path retries the reconfig RPC up to 3 times when convergence fails. Polling X harder can't recover a dropped neko request — only re-issuing the RPC will. If all retries fail, the handler returns 500 with the realized vs requested mismatch instead of 200 with the wrong size. The xrandr (no-neko) path skips the RPC retry — xrandr is synchronous and idempotent, retrying it can't change the outcome — but still returns the honest non-convergence error. Reproduced locally on a quiet VM: before: TestDisplayResizeOddWidthHonoursLibxcvtRounding flakes ~40% (2 fails in 5 runs); failing body is {3840,2160} for request {1365,768}. after: 8/8 passing; two runs took ~10s longer (retry firing transparently and recovering). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

firetiger-agent · 2026-06-09T17:56:38Z

Created a monitoring plan for this PR.

What this PR does: Fixes a silent display-resize failure on headful (WebRTC/Neko-enabled) browser VMs. Previously, when a Neko display reconfiguration was lost due to an in-flight boot reconfig, the API would quietly return success with the wrong resolution. Now it retries the reconfig up to 3 times and returns an explicit HTTP 500 if the screen never converges, so callers get a real signal instead of silent corruption.

Intended effect:

"X root did not converge after resize, retrying neko reconfig" WARN log: baseline 0 (log is new); confirmed if it appears then resolves within the retry window, not sustained at > 50/hr
"successfully changed resolution via Neko API" INFO log: should appear more consistently on headful VMs that previously silently failed; watch for volume holding steady or increasing

Risks:

New convergence-failure 500s — "display did not converge to requested mode after retries" ERROR log in API logs, alert if > 5/hr sustained; previously these failures were silently swallowed as fake 200s
Increased resize latency on Neko VMs — each failed convergence adds up to 10s per retry attempt (max 30s total); alert if callers report PATCH /display timeouts or sustained high latency on headful sessions
Retry storms during VM boot — if Neko is consistently slow to initialize, the retry loop issues the RPC 3× in 30s; alert on any "failed to change screen configuration" ERROR surge at VM startup time

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

rgarcia

reviewed — lgtm. the retry-on-non-convergence approach is sound and the loop logic checks out: 3 polls / 2 rpc retries on the neko path, with the final state always polled after the last retry. returning 500 instead of echoing the stale dummy 3840×2160 is the right call. a couple non-blocking things:

questions

server/cmd/api/api/display.go:168-172 — worst-case latency: neko non-convergence now blocks ~30s (3×10s polls) before the 500. waitForXRootRealized honors ctx.Done() so an upstream deadline bails early, but worth confirming no request timeout sits below ~30s and would cut a legit retry short. is 3×10s the right budget, vs. shorter polls or fewer attempts?

nits

server/cmd/api/api/display.go:525-527 — doc comment on waitForXRootRealized still says "Always non-fatal — the response always echoes some size, never 500s." that's stale now: a non-converged (false) return drives a 500 at the caller. worth updating so the next reader isn't misled.

hiroTamada marked this pull request as ready for review June 9, 2026 17:48

rgarcia approved these changes Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

display: retry neko reconfig when X root never converges#280

display: retry neko reconfig when X root never converges#280
hiroTamada wants to merge 1 commit into
mainfrom
hypeship/display-server-retry-on-unconverged

hiroTamada commented Jun 9, 2026 •

edited by cursor Bot

Loading

Uh oh!

firetiger-agent Bot commented Jun 9, 2026

Uh oh!

rgarcia left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hiroTamada commented Jun 9, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The race

Fix

Verification

Follow-ups

Uh oh!

firetiger-agent Bot commented Jun 9, 2026

Uh oh!

rgarcia left a comment

Choose a reason for hiding this comment

questions

nits

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hiroTamada commented Jun 9, 2026 •

edited by cursor Bot

Loading