test(e2e): de-flake display-resize and zip/zstd transfer tests#279
Open
rgarcia wants to merge 1 commit into
Open
test(e2e): de-flake display-resize and zip/zstd transfer tests#279rgarcia wants to merge 1 commit into
rgarcia wants to merge 1 commit into
Conversation
Two general, fork-agnostic flakiness fixes for the e2e suite. Both are
timing/external-dependency issues observed under concurrent
privileged-container load on self-hosted CI; neither is a product bug.
display-resize (TestDisplayResizeOddWidthHonoursLibxcvtRounding,
TestDisplayResizeChromiumWindow): under concurrent load neko's
ScreenConfigurationChange is slow/unconverged at boot, leaving the X root
at the dummy DDX default (3840x2160). A resize issued before neko
converges makes neko echo that stale 3840x2160 back (PATCH 200
body={3840x2160}), failing the assertions. Fix:
- gate the odd-width resize on neko's 1920x1080 boot baseline
(neko.yaml screen) before resizing,
- poll the X root for the realized mode instead of reading once
(ScreenConfigurationChange is asynchronous), and
- retry PATCH (in patchDisplayExpectingOK and the odd-width path) when
the realized size is the dummy-default "neko not converged" signal
(>64px off the request; real libxcvt rounding is <16px).
zip/zstd transfer (populateUserData): navigated to www.google.com with
the default 'load' wait, which timed out at 30s under CI network
pressure (google fans out to many subresources), failing the transfer
benchmarks. The helper only needs to generate *some* user-data state, so
use example.com/.org with 'domcontentloaded' and make the secondary
navigation best-effort.
Also drops a stray double blank line for gofmt compliance.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: PR contains only e2e test fixes with no server or image changes; unclear which service in the kernel mono-repo is affected, and this appears to be test infrastructure rather than a service deployment. To monitor this PR anyway, reply with |
hiroTamada
added a commit
that referenced
this pull request
Jun 10, 2026
## Summary `PATCH /display` could return **HTTP 200** with the unconverged dummy DDX default (`3840×2160`) in the response body instead of the requested mode. This was being papered over at the test layer by adding retry loops in e2e tests (PR #279); this PR fixes it at the server layer so SDK callers get correct behavior. ## The race `setResolutionViaNeko` calls neko's `ScreenConfigurationChange` RPC, which is asynchronous. If a PATCH lands while a previous reconfig is in flight — e.g. neko's boot `1920×1080@25` config — the new request can be dropped/clobbered. `waitForXRootRealized` then polls X until timeout, sees the dummy default `3840×2160`, and the handler echoes that value back as the realized size. The function had no way to signal "I gave up" vs "I converged on something close to the request", so the handler treated both the same and returned 200 either way. ## Fix Two changes in `server/cmd/api/api/display.go`: 1. **`waitForXRootRealized` returns a `converged bool`.** Same polling logic, but the function now distinguishes a successful settle from a timeout. No behavior change for the existing exact-match and stable-within-delta success paths. 2. **Neko path retries the reconfig RPC.** When convergence fails, the handler re-issues `setResolutionViaNeko` up to 3 times before giving up. Polling X harder can't recover a request neko dropped — only re-issuing the RPC will. The xrandr (no-neko) path skips the RPC retry (xrandr is synchronous and idempotent) but still surfaces the non-convergence honestly. If all retries fail, the handler now returns 500 with `display did not converge to WxH (X root reports W'xH')` instead of 200 with the wrong size. SDK callers see an honest error and can decide to retry. ## Verification Reproduced locally on a quiet VM (no concurrent load): | Test | Before | After | |---|---|---| | `TestDisplayResizeOddWidthHonoursLibxcvtRounding` | 2 fails / 5 runs (~40%) | 8/8 pass | | `TestDisplayResizeChromiumWindow` headful subtests | (not measured) | 12/12 pass (4 runs × 3 subtests) | The failing body before the fix was `{"width":3840,"height":2160}` for a request of `{1365,768}`. Two of the 8 runs after the fix took ~10s longer than baseline — that's the retry firing transparently and recovering, exactly the intended behavior. ## Follow-ups Once this lands, the e2e retry loops added in #279 become redundant and can be removed — keeping the `waitForXRootResolution(1920, 1080)` baseline-wait there is fine as a latency optimization but is no longer load-bearing. ## Fail-fast update (second commit) `waitForXRootRealized` now returns non-converged **early** when the root has been stable at a value far from the request (2s grace + ~500ms of identical reads at the 50ms cadence), instead of always burning the full 10s poll. A dropped reconfig parks X at the dummy default immediately and stably — polling longer never recovers it — so the retry now fires after ~2.2s instead of ~10.3s, and the worst case before the 500 drops from ~30s to ~7s. Measured via local docker e2e (fresh headful container boot + immediate `PATCH /display` 1365×768, quiet and with 3–4 concurrent boots): | | retry needed | racy resize latency | clean resize latency | false-positive retries | converge-fail 500s | |---|---|---|---|---|---| | 10s poll (first commit) | 5/43 cold boots | 10.23–10.32s | ~225ms mean | n/a | 0 | | fail-fast (this commit) | 2/55 cold boots | 2.21–2.27s | ~224ms mean | **0** | 0 | Every racy case had the same signature — root parked at 3840×2160 for the entire window, convergence ~200ms after the re-issued reconfig — and every response was 200 with the correct realized size. Zero of 98 warm resizes needed a retry; the race is purely a boot-window phenomenon. The kiosk transient (root briefly at the dummy max during mode-switch) can't trip the early-fail: the consecutive-identical-reads requirement resets on any moving value, and the `headful_kiosk` / `headful_start_maximized` / `headful_xorg_no_neko` subtests plus `TestDisplayResizeOddWidthHonoursLibxcvtRounding` all pass against an image built from this branch. Also rewrites the stale "never 500s" doc paragraph on `waitForXRootRealized` flagged in review. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes core display resize success/failure semantics for headful Neko deployments; callers that relied on 200 with wrong sizes will now see errors, which is intended but is a behavioral contract change. > > **Overview** > **`PATCH /display`** on the Xorg + Neko path no longer returns **HTTP 200** with the wrong realized size (e.g. dummy **3840×2160**) when Neko drops an async screen reconfig. > > `waitForXRootRealized` now returns a **`converged`** flag so the handler can tell a successful settle from a timeout. On the Neko path, if X never converges within the poll window, the handler **re-issues `setResolutionViaNeko` up to 3 times** before giving up; the xrandr-only path does not retry the RPC but still treats non-convergence as failure. > > If convergence still fails, the API returns **500** with `display did not converge to WxH (X root reports W'xH')` instead of echoing the stale X root in a 200 response. Successful convergence still uses the realized X root dimensions (including libxcvt rounding) in the response body. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a00bbbe. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two general, fork-agnostic e2e flakiness fixes. Both are timing / external-dependency issues seen under concurrent privileged-container load on self-hosted CI — neither is a product bug. Test/harness-only; no server or image changes.
These were first found and fixed while stabilizing the downstream private fork's e2e suite; this PR contributes back only the parts that are general-purpose (no fork-specific features), so the fixes live upstream and future syncs carry fewer conflicts.
display-resize (
e2e_display_resize_window_test.go)TestDisplayResizeOddWidthHonoursLibxcvtRoundingandTestDisplayResizeChromiumWindowflaked because, under concurrent load, neko'sScreenConfigurationChangeis slow/unconverged at boot, leaving the X root at the dummy DDX default (3840x2160). A resize issued before neko converges makes neko echo that stale 3840x2160 back (PATCH 200 body={3840x2160}), failing the assertions.Fix:
neko.yamlscreen: "1920x1080@25") before resizing.ScreenConfigurationChangeis asynchronous, so the root may not reflect the new mode the instantPATCHreturns.PATCH(inpatchDisplayExpectingOKand the odd-width path) when the realized size is the dummy-default "neko not converged" signal (isUnconvergedNekoSize: >64px off the request; real libxcvt rounding is <16px, the dummy max is >1000px off, so 64px is a safe separator). A genuine resize bug — wrong-but-close size or non-200 — is still caught, since only the far-off dummy-default echo is retried.zip/zstd transfer (
e2e_zip_transfer_bench_test.go)populateUserDatanavigated towww.google.comwith the defaultloadwait, which timed out at 30s under CI network pressure (google fans out to many subresources), failing the transfer benchmarks. The helper only needs to generate some user-data state, so it now uses example.com/.org withdomcontentloadedand makes the secondary navigation best-effort.Also drops a stray double blank line for gofmt compliance.
Verification
gofmtclean on both touched files;go vet ./e2e/clean.test-server-e2eCI.🤖 Generated with Claude Code
Note
Low Risk
Changes are limited to e2e test helpers and assertions; no server, API, or image behavior is modified.
Overview
Hardens display-resize e2e tests against neko/X timing under heavy CI load.
patchDisplayExpectingOKand the odd-width libxcvt test now retryPATCH /displayup to five times when the API reports a realized size that looks like the unconverged dummy DDX default (3840×2160, detected via newisUnconvergedNekoSizewith a 64px slack). The odd-width test also waits for the 1920×1080 boot baseline before resizing and polls the X root for the libxcvt-realized mode instead of asserting immediately afterPATCH.In zip/zstd transfer benchmarks,
populateUserDatastops navigating to Google with a fullloadwait (30s timeouts under CI). It uses example.com withdomcontentloadedand treats a secondary example.org visit as best-effort so the benchmark only needs some browser profile state.Reviewed by Cursor Bugbot for commit 6e3a269. Bugbot is set up for automated code reviews on this repo. Configure here.