test(e2e): de-flake display-resize and zip/zstd transfer tests by rgarcia · Pull Request #279 · kernel/kernel-images

rgarcia · 2026-06-09T15:24:27Z

Summary

Two general, fork-agnostic e2e flakiness fixes. Both are timing / external-dependency issues seen under concurrent privileged-container load on self-hosted CI — neither is a product bug. Test/harness-only; no server or image changes.

These were first found and fixed while stabilizing the downstream private fork's e2e suite; this PR contributes back only the parts that are general-purpose (no fork-specific features), so the fixes live upstream and future syncs carry fewer conflicts.

display-resize (`e2e_display_resize_window_test.go`)

TestDisplayResizeOddWidthHonoursLibxcvtRounding and TestDisplayResizeChromiumWindow flaked because, under concurrent load, neko's ScreenConfigurationChange is slow/unconverged at boot, leaving the X root at the dummy DDX default (3840x2160). A resize issued before neko converges makes neko echo that stale 3840x2160 back (PATCH 200 body={3840x2160}), failing the assertions.

Fix:

Gate the odd-width resize on neko's 1920x1080 boot baseline (neko.yaml screen: "1920x1080@25") before resizing.
Poll the X root for the realized mode instead of reading once — ScreenConfigurationChange is asynchronous, so the root may not reflect the new mode the instant PATCH returns.
Retry PATCH (in patchDisplayExpectingOK and the odd-width path) when the realized size is the dummy-default "neko not converged" signal (isUnconvergedNekoSize: >64px off the request; real libxcvt rounding is <16px, the dummy max is >1000px off, so 64px is a safe separator). A genuine resize bug — wrong-but-close size or non-200 — is still caught, since only the far-off dummy-default echo is retried.

zip/zstd transfer (`e2e_zip_transfer_bench_test.go`)

populateUserData navigated to www.google.com with the default load wait, which timed out at 30s under CI network pressure (google fans out to many subresources), failing the transfer benchmarks. The helper only needs to generate some user-data state, so it now uses example.com/.org with domcontentloaded and makes the secondary navigation best-effort.

Also drops a stray double blank line for gofmt compliance.

Verification

gofmt clean on both touched files; go vet ./e2e/ clean.
The equivalent changes were verified to pass locally against built chromium-headful/headless images (display-resize odd-width + all window-resize scenarios; zip + zstd transfer timing) and green on the downstream fork's test-server-e2e CI.

🤖 Generated with Claude Code

Note

Low Risk
Changes are limited to e2e test helpers and assertions; no server, API, or image behavior is modified.

Overview
Hardens display-resize e2e tests against neko/X timing under heavy CI load. patchDisplayExpectingOK and the odd-width libxcvt test now retry PATCH /display up to five times when the API reports a realized size that looks like the unconverged dummy DDX default (3840×2160, detected via new isUnconvergedNekoSize with a 64px slack). The odd-width test also waits for the 1920×1080 boot baseline before resizing and polls the X root for the libxcvt-realized mode instead of asserting immediately after PATCH.

In zip/zstd transfer benchmarks, populateUserData stops navigating to Google with a full load wait (30s timeouts under CI). It uses example.com with domcontentloaded and treats a secondary example.org visit as best-effort so the benchmark only needs some browser profile state.

^{Reviewed by Cursor Bugbot for commit 6e3a269. Bugbot is set up for automated code reviews on this repo. Configure here.}

Two general, fork-agnostic flakiness fixes for the e2e suite. Both are timing/external-dependency issues observed under concurrent privileged-container load on self-hosted CI; neither is a product bug. display-resize (TestDisplayResizeOddWidthHonoursLibxcvtRounding, TestDisplayResizeChromiumWindow): under concurrent load neko's ScreenConfigurationChange is slow/unconverged at boot, leaving the X root at the dummy DDX default (3840x2160). A resize issued before neko converges makes neko echo that stale 3840x2160 back (PATCH 200 body={3840x2160}), failing the assertions. Fix: - gate the odd-width resize on neko's 1920x1080 boot baseline (neko.yaml screen) before resizing, - poll the X root for the realized mode instead of reading once (ScreenConfigurationChange is asynchronous), and - retry PATCH (in patchDisplayExpectingOK and the odd-width path) when the realized size is the dummy-default "neko not converged" signal (>64px off the request; real libxcvt rounding is <16px). zip/zstd transfer (populateUserData): navigated to www.google.com with the default 'load' wait, which timed out at 30s under CI network pressure (google fans out to many subresources), failing the transfer benchmarks. The helper only needs to generate *some* user-data state, so use example.com/.org with 'domcontentloaded' and make the secondary navigation best-effort. Also drops a stray double blank line for gofmt compliance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

firetiger-agent · 2026-06-09T15:48:09Z

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR contains only e2e test fixes with no server or image changes; unclear which service in the kernel mono-repo is affected, and this appears to be test infrastructure rather than a service deployment.

To monitor this PR anyway, reply with @firetiger monitor this.

## Summary `PATCH /display` could return **HTTP 200** with the unconverged dummy DDX default (`3840×2160`) in the response body instead of the requested mode. This was being papered over at the test layer by adding retry loops in e2e tests (PR #279); this PR fixes it at the server layer so SDK callers get correct behavior. ## The race `setResolutionViaNeko` calls neko's `ScreenConfigurationChange` RPC, which is asynchronous. If a PATCH lands while a previous reconfig is in flight — e.g. neko's boot `1920×1080@25` config — the new request can be dropped/clobbered. `waitForXRootRealized` then polls X until timeout, sees the dummy default `3840×2160`, and the handler echoes that value back as the realized size. The function had no way to signal "I gave up" vs "I converged on something close to the request", so the handler treated both the same and returned 200 either way. ## Fix Two changes in `server/cmd/api/api/display.go`: 1. **`waitForXRootRealized` returns a `converged bool`.** Same polling logic, but the function now distinguishes a successful settle from a timeout. No behavior change for the existing exact-match and stable-within-delta success paths. 2. **Neko path retries the reconfig RPC.** When convergence fails, the handler re-issues `setResolutionViaNeko` up to 3 times before giving up. Polling X harder can't recover a request neko dropped — only re-issuing the RPC will. The xrandr (no-neko) path skips the RPC retry (xrandr is synchronous and idempotent) but still surfaces the non-convergence honestly. If all retries fail, the handler now returns 500 with `display did not converge to WxH (X root reports W'xH')` instead of 200 with the wrong size. SDK callers see an honest error and can decide to retry. ## Verification Reproduced locally on a quiet VM (no concurrent load): | Test | Before | After | |---|---|---| | `TestDisplayResizeOddWidthHonoursLibxcvtRounding` | 2 fails / 5 runs (~40%) | 8/8 pass | | `TestDisplayResizeChromiumWindow` headful subtests | (not measured) | 12/12 pass (4 runs × 3 subtests) | The failing body before the fix was `{"width":3840,"height":2160}` for a request of `{1365,768}`. Two of the 8 runs after the fix took ~10s longer than baseline — that's the retry firing transparently and recovering, exactly the intended behavior. ## Follow-ups Once this lands, the e2e retry loops added in #279 become redundant and can be removed — keeping the `waitForXRootResolution(1920, 1080)` baseline-wait there is fine as a latency optimization but is no longer load-bearing. ## Fail-fast update (second commit) `waitForXRootRealized` now returns non-converged **early** when the root has been stable at a value far from the request (2s grace + ~500ms of identical reads at the 50ms cadence), instead of always burning the full 10s poll. A dropped reconfig parks X at the dummy default immediately and stably — polling longer never recovers it — so the retry now fires after ~2.2s instead of ~10.3s, and the worst case before the 500 drops from ~30s to ~7s. Measured via local docker e2e (fresh headful container boot + immediate `PATCH /display` 1365×768, quiet and with 3–4 concurrent boots): | | retry needed | racy resize latency | clean resize latency | false-positive retries | converge-fail 500s | |---|---|---|---|---|---| | 10s poll (first commit) | 5/43 cold boots | 10.23–10.32s | ~225ms mean | n/a | 0 | | fail-fast (this commit) | 2/55 cold boots | 2.21–2.27s | ~224ms mean | **0** | 0 | Every racy case had the same signature — root parked at 3840×2160 for the entire window, convergence ~200ms after the re-issued reconfig — and every response was 200 with the correct realized size. Zero of 98 warm resizes needed a retry; the race is purely a boot-window phenomenon. The kiosk transient (root briefly at the dummy max during mode-switch) can't trip the early-fail: the consecutive-identical-reads requirement resets on any moving value, and the `headful_kiosk` / `headful_start_maximized` / `headful_xorg_no_neko` subtests plus `TestDisplayResizeOddWidthHonoursLibxcvtRounding` all pass against an image built from this branch. Also rewrites the stale "never 500s" doc paragraph on `waitForXRootRealized` flagged in review. 🤖 Generated with [Claude Code](https://claude.com/claude-code)  --- > [!NOTE] > **Medium Risk** > Changes core display resize success/failure semantics for headful Neko deployments; callers that relied on 200 with wrong sizes will now see errors, which is intended but is a behavioral contract change. > > **Overview** > **`PATCH /display`** on the Xorg + Neko path no longer returns **HTTP 200** with the wrong realized size (e.g. dummy **3840×2160**) when Neko drops an async screen reconfig. > > `waitForXRootRealized` now returns a **`converged`** flag so the handler can tell a successful settle from a timeout. On the Neko path, if X never converges within the poll window, the handler **re-issues `setResolutionViaNeko` up to 3 times** before giving up; the xrandr-only path does not retry the RPC but still treats non-convergence as failure. > > If convergence still fails, the API returns **500** with `display did not converge to WxH (X root reports W'xH')` instead of echoing the stale X root in a 200 response. Successful convergence still uses the realized X root dimensions (including libxcvt rounding) in the response body. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a00bbbe. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>  --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

rgarcia marked this pull request as ready for review June 9, 2026 15:48

rgarcia assigned hiroTamada Jun 9, 2026

hiroTamada mentioned this pull request Jun 9, 2026

display: retry neko reconfig when X root never converges #280

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(e2e): de-flake display-resize and zip/zstd transfer tests#279

test(e2e): de-flake display-resize and zip/zstd transfer tests#279
rgarcia wants to merge 1 commit into
mainfrom
raf/e2e-deflake-display-zstd

rgarcia commented Jun 9, 2026 •

edited by cursor Bot

Loading

Uh oh!

firetiger-agent Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rgarcia commented Jun 9, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

display-resize (e2e_display_resize_window_test.go)

zip/zstd transfer (e2e_zip_transfer_bench_test.go)

Verification

Uh oh!

firetiger-agent Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rgarcia commented Jun 9, 2026 •

edited by cursor Bot

Loading

display-resize (`e2e_display_resize_window_test.go`)

zip/zstd transfer (`e2e_zip_transfer_bench_test.go`)