Skip to content

test(e2e): de-flake display-resize and zip/zstd transfer tests#279

Open
rgarcia wants to merge 1 commit into
mainfrom
raf/e2e-deflake-display-zstd
Open

test(e2e): de-flake display-resize and zip/zstd transfer tests#279
rgarcia wants to merge 1 commit into
mainfrom
raf/e2e-deflake-display-zstd

Conversation

@rgarcia

@rgarcia rgarcia commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Two general, fork-agnostic e2e flakiness fixes. Both are timing / external-dependency issues seen under concurrent privileged-container load on self-hosted CI — neither is a product bug. Test/harness-only; no server or image changes.

These were first found and fixed while stabilizing the downstream private fork's e2e suite; this PR contributes back only the parts that are general-purpose (no fork-specific features), so the fixes live upstream and future syncs carry fewer conflicts.

display-resize (e2e_display_resize_window_test.go)

TestDisplayResizeOddWidthHonoursLibxcvtRounding and TestDisplayResizeChromiumWindow flaked because, under concurrent load, neko's ScreenConfigurationChange is slow/unconverged at boot, leaving the X root at the dummy DDX default (3840x2160). A resize issued before neko converges makes neko echo that stale 3840x2160 back (PATCH 200 body={3840x2160}), failing the assertions.

Fix:

  • Gate the odd-width resize on neko's 1920x1080 boot baseline (neko.yaml screen: "1920x1080@25") before resizing.
  • Poll the X root for the realized mode instead of reading once — ScreenConfigurationChange is asynchronous, so the root may not reflect the new mode the instant PATCH returns.
  • Retry PATCH (in patchDisplayExpectingOK and the odd-width path) when the realized size is the dummy-default "neko not converged" signal (isUnconvergedNekoSize: >64px off the request; real libxcvt rounding is <16px, the dummy max is >1000px off, so 64px is a safe separator). A genuine resize bug — wrong-but-close size or non-200 — is still caught, since only the far-off dummy-default echo is retried.

zip/zstd transfer (e2e_zip_transfer_bench_test.go)

populateUserData navigated to www.google.com with the default load wait, which timed out at 30s under CI network pressure (google fans out to many subresources), failing the transfer benchmarks. The helper only needs to generate some user-data state, so it now uses example.com/.org with domcontentloaded and makes the secondary navigation best-effort.

Also drops a stray double blank line for gofmt compliance.

Verification

  • gofmt clean on both touched files; go vet ./e2e/ clean.
  • The equivalent changes were verified to pass locally against built chromium-headful/headless images (display-resize odd-width + all window-resize scenarios; zip + zstd transfer timing) and green on the downstream fork's test-server-e2e CI.

🤖 Generated with Claude Code


Note

Low Risk
Changes are limited to e2e test helpers and assertions; no server, API, or image behavior is modified.

Overview
Hardens display-resize e2e tests against neko/X timing under heavy CI load. patchDisplayExpectingOK and the odd-width libxcvt test now retry PATCH /display up to five times when the API reports a realized size that looks like the unconverged dummy DDX default (3840×2160, detected via new isUnconvergedNekoSize with a 64px slack). The odd-width test also waits for the 1920×1080 boot baseline before resizing and polls the X root for the libxcvt-realized mode instead of asserting immediately after PATCH.

In zip/zstd transfer benchmarks, populateUserData stops navigating to Google with a full load wait (30s timeouts under CI). It uses example.com with domcontentloaded and treats a secondary example.org visit as best-effort so the benchmark only needs some browser profile state.

Reviewed by Cursor Bugbot for commit 6e3a269. Bugbot is set up for automated code reviews on this repo. Configure here.

Two general, fork-agnostic flakiness fixes for the e2e suite. Both are
timing/external-dependency issues observed under concurrent
privileged-container load on self-hosted CI; neither is a product bug.

display-resize (TestDisplayResizeOddWidthHonoursLibxcvtRounding,
TestDisplayResizeChromiumWindow): under concurrent load neko's
ScreenConfigurationChange is slow/unconverged at boot, leaving the X root
at the dummy DDX default (3840x2160). A resize issued before neko
converges makes neko echo that stale 3840x2160 back (PATCH 200
body={3840x2160}), failing the assertions. Fix:
- gate the odd-width resize on neko's 1920x1080 boot baseline
  (neko.yaml screen) before resizing,
- poll the X root for the realized mode instead of reading once
  (ScreenConfigurationChange is asynchronous), and
- retry PATCH (in patchDisplayExpectingOK and the odd-width path) when
  the realized size is the dummy-default "neko not converged" signal
  (>64px off the request; real libxcvt rounding is <16px).

zip/zstd transfer (populateUserData): navigated to www.google.com with
the default 'load' wait, which timed out at 30s under CI network
pressure (google fans out to many subresources), failing the transfer
benchmarks. The helper only needs to generate *some* user-data state, so
use example.com/.org with 'domcontentloaded' and make the secondary
navigation best-effort.

Also drops a stray double blank line for gofmt compliance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rgarcia rgarcia marked this pull request as ready for review June 9, 2026 15:48
@firetiger-agent

Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR contains only e2e test fixes with no server or image changes; unclear which service in the kernel mono-repo is affected, and this appears to be test infrastructure rather than a service deployment.

To monitor this PR anyway, reply with @firetiger monitor this.

hiroTamada added a commit that referenced this pull request Jun 10, 2026
## Summary

`PATCH /display` could return **HTTP 200** with the unconverged dummy
DDX default (`3840×2160`) in the response body instead of the requested
mode. This was being papered over at the test layer by adding retry
loops in e2e tests (PR #279); this PR fixes it at the server layer so
SDK callers get correct behavior.

## The race

`setResolutionViaNeko` calls neko's `ScreenConfigurationChange` RPC,
which is asynchronous. If a PATCH lands while a previous reconfig is in
flight — e.g. neko's boot `1920×1080@25` config — the new request can be
dropped/clobbered. `waitForXRootRealized` then polls X until timeout,
sees the dummy default `3840×2160`, and the handler echoes that value
back as the realized size.

The function had no way to signal "I gave up" vs "I converged on
something close to the request", so the handler treated both the same
and returned 200 either way.

## Fix

Two changes in `server/cmd/api/api/display.go`:

1. **`waitForXRootRealized` returns a `converged bool`.** Same polling
logic, but the function now distinguishes a successful settle from a
timeout. No behavior change for the existing exact-match and
stable-within-delta success paths.
2. **Neko path retries the reconfig RPC.** When convergence fails, the
handler re-issues `setResolutionViaNeko` up to 3 times before giving up.
Polling X harder can't recover a request neko dropped — only re-issuing
the RPC will. The xrandr (no-neko) path skips the RPC retry (xrandr is
synchronous and idempotent) but still surfaces the non-convergence
honestly.

If all retries fail, the handler now returns 500 with `display did not
converge to WxH (X root reports W'xH')` instead of 200 with the wrong
size. SDK callers see an honest error and can decide to retry.

## Verification

Reproduced locally on a quiet VM (no concurrent load):

| Test | Before | After |
|---|---|---|
| `TestDisplayResizeOddWidthHonoursLibxcvtRounding` | 2 fails / 5 runs
(~40%) | 8/8 pass |
| `TestDisplayResizeChromiumWindow` headful subtests | (not measured) |
12/12 pass (4 runs × 3 subtests) |

The failing body before the fix was `{"width":3840,"height":2160}` for a
request of `{1365,768}`. Two of the 8 runs after the fix took ~10s
longer than baseline — that's the retry firing transparently and
recovering, exactly the intended behavior.

## Follow-ups

Once this lands, the e2e retry loops added in #279 become redundant and
can be removed — keeping the `waitForXRootResolution(1920, 1080)`
baseline-wait there is fine as a latency optimization but is no longer
load-bearing.


## Fail-fast update (second commit)

`waitForXRootRealized` now returns non-converged **early** when the root
has been stable at a value far from the request (2s grace + ~500ms of
identical reads at the 50ms cadence), instead of always burning the full
10s poll. A dropped reconfig parks X at the dummy default immediately
and stably — polling longer never recovers it — so the retry now fires
after ~2.2s instead of ~10.3s, and the worst case before the 500 drops
from ~30s to ~7s.

Measured via local docker e2e (fresh headful container boot + immediate
`PATCH /display` 1365×768, quiet and with 3–4 concurrent boots):

| | retry needed | racy resize latency | clean resize latency |
false-positive retries | converge-fail 500s |
|---|---|---|---|---|---|
| 10s poll (first commit) | 5/43 cold boots | 10.23–10.32s | ~225ms mean
| n/a | 0 |
| fail-fast (this commit) | 2/55 cold boots | 2.21–2.27s | ~224ms mean |
**0** | 0 |

Every racy case had the same signature — root parked at 3840×2160 for
the entire window, convergence ~200ms after the re-issued reconfig — and
every response was 200 with the correct realized size. Zero of 98 warm
resizes needed a retry; the race is purely a boot-window phenomenon. The
kiosk transient (root briefly at the dummy max during mode-switch) can't
trip the early-fail: the consecutive-identical-reads requirement resets
on any moving value, and the `headful_kiosk` / `headful_start_maximized`
/ `headful_xorg_no_neko` subtests plus
`TestDisplayResizeOddWidthHonoursLibxcvtRounding` all pass against an
image built from this branch.

Also rewrites the stale "never 500s" doc paragraph on
`waitForXRootRealized` flagged in review.
🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes core display resize success/failure semantics for headful Neko
deployments; callers that relied on 200 with wrong sizes will now see
errors, which is intended but is a behavioral contract change.
> 
> **Overview**
> **`PATCH /display`** on the Xorg + Neko path no longer returns **HTTP
200** with the wrong realized size (e.g. dummy **3840×2160**) when Neko
drops an async screen reconfig.
> 
> `waitForXRootRealized` now returns a **`converged`** flag so the
handler can tell a successful settle from a timeout. On the Neko path,
if X never converges within the poll window, the handler **re-issues
`setResolutionViaNeko` up to 3 times** before giving up; the xrandr-only
path does not retry the RPC but still treats non-convergence as failure.
> 
> If convergence still fails, the API returns **500** with `display did
not converge to WxH (X root reports W'xH')` instead of echoing the stale
X root in a 200 response. Successful convergence still uses the realized
X root dimensions (including libxcvt rounding) in the response body.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
a00bbbe. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants