feat(test): TestChaosSuite — chaos faults via the harness (network-partition)#432

Merged
bdchatham merged 3 commits into main from feat/wsi-chaos-suite on Jun 23, 2026

@bdchatham

Collaborator

WS-I — the chaos suite, ported into the go-test harness

First chaos increment: ports the platform chaos suite's inject/verify/recover pattern into TestChaosSuite, starting with network-partition. Builds on the merged load suite (#430/#431).

Pattern (mirrors the platform verify-injection/verify-cleanup)

  • apply the Chaos-Mesh fault CR unstructured via the dynamic client — keeps the chaos-mesh API out of the module's deps (the k8s lens's de-risk recommendation); fault templates are //go:embed'd.
  • gateInjected: poll status.conditions[AllInjected]=True before asserting — the anti-false-green guard (inject is async; a 0-match selector would let the assert run against an undisturbed chain).
  • assert live-under-fault: the unfaulted RPC follower stays caught up (each fault hits at most one validator, f=1, so three of the four validators remain and the 2/3 quorum holds).
  • gateRecovered: poll AllRecovered=True (catches stuck tc/tproxy finalizers); then assert the chain reconverged.
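
The two gates above both come down to reading one entry out of status.conditions. A minimal sketch of that check, assuming the fault CR is handled as plain nested maps (the shape an unstructured object's content takes); conditionTrue and the sample status are hypothetical and not taken from the harness:

```go
package main

import "fmt"

// conditionTrue reports whether obj.status.conditions contains an entry with
// the given type and status "True". obj stands in for the content of an
// unstructured Kubernetes object (plain nested maps and slices).
func conditionTrue(obj map[string]any, condType string) bool {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return false
	}
	conds, ok := status["conditions"].([]any)
	if !ok {
		return false
	}
	for _, c := range conds {
		cond, ok := c.(map[string]any)
		if !ok {
			continue
		}
		if cond["type"] == condType && cond["status"] == "True" {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical status of a fault CR while the fault is active.
	fault := map[string]any{
		"status": map[string]any{
			"conditions": []any{
				map[string]any{"type": "Selected", "status": "True"},
				map[string]any{"type": "AllInjected", "status": "True"},
				map[string]any{"type": "AllRecovered", "status": "False"},
			},
		},
	}
	fmt.Println("injected: ", conditionTrue(fault, "AllInjected"))
	fmt.Println("recovered:", conditionTrue(fault, "AllRecovered"))
}
```

A gate would call a check like this in a polling loop with a deadline. A missing status or conditions field is treated as "not yet", which suits an object that was just created.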

network-partition isolates validator-0 from validators 1-3. Each fault is a subtest with its own fresh chain (continue-on-failure, matching the platform suite). Each fault CR carries a per-run sei.io/harness-run label.

Verification

go build ./... clean · golangci-lint 0 issues · go test -c -tags integration → TestBenchmark + TestChaosSuite · in-cluster smoke on harbor in progress.

Next

More faults (the other 6 kinds) once each passes in-cluster; container-kill (one-shot, no AllRecovered) + rpc-chaos (2 CRs) + mempool (PromQL, deferred telemetry gate) get per-scenario handling.

🤖 Generated with Claude Code

…rtition)

Ports the platform chaos suite into the go-test harness, starting with
network-partition. Each fault runs against its own fresh chain as a subtest
(continue-on-failure):

- apply the Chaos-Mesh fault CR (unstructured, via the dynamic client — keeps
  the chaos-mesh API out of the module's deps); fault templates are embedded.
- gateInjected: poll status.conditions[AllInjected]=True before asserting — the
  anti-false-green guard (inject is async; a 0-match selector would let the
  assert run against an undisturbed chain).
- assert live-under-fault: the unfaulted RPC follower stays caught up (faults
  are f=1-bounded, so 2/3 quorum holds).
- gateRecovered: poll AllRecovered=True (catches stuck tc/tproxy finalizers);
  then assert the chain reconverged.

network-partition isolates validator-0 from validators 1-3. Per-run
sei.io/harness-run label on the fault CR. More faults follow once each passes
in-cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 23, 2026

PR Summary

Medium Risk
New in-cluster integration tests and fault injection against live validator networks; SDK readiness change is additive and test-driven but shared by other harness callers.

Overview
Ports the platform chaos inject/verify/recover flow into integration TestChaosSuite, starting with network-partition (validator-0 isolated from the rest while an RPC follower observes liveness and recovery).

SDK: Adds WaitHeightAdvances (and internal latestHeight) so tests can require committed height to rise by a delta—catching a stalled chain that still reports catching_up == false, which WaitCaughtUp alone would miss.
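
To illustrate the idea behind WaitHeightAdvances, here is a minimal sketch of a "height must rise by delta" wait. The signature, the heightFunc type, and the simulate helper are assumptions made for this example and are not the SDK's actual API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// heightFunc returns the node's latest committed height. It is a hypothetical
// stand-in for the SDK's internal latestHeight call.
type heightFunc func(ctx context.Context) (int64, error)

// waitHeightAdvances polls until the height has risen by at least delta above
// the first observed height, or until ctx is done.
func waitHeightAdvances(ctx context.Context, latest heightFunc, delta int64, interval time.Duration) error {
	start, err := latest(ctx)
	if err != nil {
		return fmt.Errorf("read starting height: %w", err)
	}
	current := start
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("height advanced %d of %d required blocks: %w",
				current-start, delta, ctx.Err())
		case <-ticker.C:
			h, err := latest(ctx)
			if err != nil {
				continue // transient read errors are retried until ctx expires
			}
			current = h
			if current-start >= delta {
				return nil
			}
		}
	}
}

// simulate runs waitHeightAdvances against a fake node whose height grows by
// step on every read. step == 0 models a stalled chain.
func simulate(step, delta int64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	var height int64 = 100
	node := func(context.Context) (int64, error) {
		height += step
		return height, nil
	}
	return waitHeightAdvances(ctx, node, delta, 5*time.Millisecond)
}

func main() {
	fmt.Println("advancing chain:", simulate(1, 3))
	fmt.Println("stalled chain:  ", simulate(0, 3))
}
```

The stalled case is the one a catching_up == false check cannot see: the fake node answers every read successfully, yet the wait fails because the height never moves.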

Harness: New Chaos-Mesh helpers apply fault CRs via the dynamic client (unstructured, no chaos-mesh API in module deps), embed YAML templates, and gate on AllInjected / AllRecovered with guards against false greens (non-zero injected targets, required spec.duration). Each scenario provisions a fresh 4-validator + 1-RPC chain, injects the fault, asserts height advances on the follower during the fault window, then waits for recovery and WaitCaughtUp post-fault.

Reviewed by Cursor Bugbot for commit 0da01cc.

Correctness fixes (the two block-worthy findings):
- under-fault liveness was trivial: WaitCaughtUp has no lower bound and
  catching_up==false is satisfied by a STALLED node, so it proved nothing about
  progress under the fault. Add sdk WaitHeightAdvances and assert the follower's
  height advances within a window inside the fault duration (observed while the
  fault is programmed, not after it expires).
- 0-target false-green: AllInjected is vacuously true for an empty target set,
  and the only liveness probe is the unfaulted follower, so a no-op fault would
  pass. gateInjected now asserts status.experiment.injectedCount > 0.

Hardening:
- waitFaultCondition surfaces the last real Get error (tolerating transient
  post-create NotFound) instead of masking it as a generic timeout.
- registration tripwires: every fault must carry spec.duration (else
  gateRecovered hangs) and the <chain>-0 label value must be <= 63 chars.
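
The registration tripwires can be sketched as a small validation step. This is a hedged illustration only: validateFault, the map-based fault representation, and the sample values are hypothetical, while the 63-character limit is the Kubernetes label-value limit:

```go
package main

import (
	"errors"
	"fmt"
)

// maxLabelValueLen is the Kubernetes limit on the length of a label value.
const maxLabelValueLen = 63

// validateFault is a hypothetical registration check: the fault template must
// set spec.duration, and the "<chain>-0" label value must fit the limit.
func validateFault(fault map[string]any, chain string) error {
	// Indexing a nil map is safe, so a missing spec simply yields no duration.
	spec, _ := fault["spec"].(map[string]any)
	if d, _ := spec["duration"].(string); d == "" {
		return errors.New("fault has no spec.duration")
	}
	if label := chain + "-0"; len(label) > maxLabelValueLen {
		return fmt.Errorf("label value %q is %d chars, limit is %d",
			label, len(label), maxLabelValueLen)
	}
	return nil
}

func main() {
	// Illustrative templates; the duration and chain name are made up.
	withDuration := map[string]any{"spec": map[string]any{"duration": "60s"}}
	withoutDuration := map[string]any{"spec": map[string]any{}}

	fmt.Println(validateFault(withDuration, "chaos-np"))
	fmt.Println(validateFault(withoutDuration, "chaos-np"))
}
```

Running such a check when a fault is registered, rather than when it is applied, turns both traps into immediate test failures instead of a hung gate or a rejected selector.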

Deferred (noted): direct v0 recovery probe (validators publish no endpoint;
platform doesn't probe v0 either — port-fidelity); switching the selector off
the frozen sei.io/nodedeployment key (not a clean swap — it scopes validators-
only; seinetwork would also catch followers). Harness-SA chaos-mesh RBAC is a
platform deliverable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Collaborator Author

/xreview — 4 lenses (idiom · systems · k8s-dissenter · sei-network), RESOLVED + live-smoke-proven

Inject→recover lifecycle smoke-passed in-cluster (AllInjected/AllRecovered both fire for a duration fault — empirically resolves the dissenter's recovery-hang concern). But the reviews proved the pass was weak; fixes applied:

| # | Finding | Lens | Resolution |
|---|---------|------|------------|
| C1 | under-fault liveness trivial: WaitCaughtUp has no lower bound + catching_up==false is true for a STALLED node | systems + sei-network | added sdk.WaitHeightAdvances; assert follower height advances within a window inside the fault duration |
| C2 | 0-target false-green: AllInjected vacuously true for empty target set; only probe is the unfaulted follower | k8s-dissenter (block) | gateInjected asserts status.experiment.injectedCount > 0 |
| H1 | swallowed Get error masks RBAC/CRD/GVR failures as generic timeout | idiom + systems | surface last non-NotFound Get error (tolerate transient post-create NotFound) |
| H2 | self-expiry + label-length traps latent as the suite grows | dissenter | registration tripwires: every fault must carry spec.duration; `<chain>-0` label ≤ 63 chars |
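
The H1 resolution can be sketched as a polling loop that remembers the last real error. All names here (waitFaultCondition's signature, getFunc, errNotFound, errForbidden, poll) are assumptions for illustration; a real harness would check for the Kubernetes NotFound API error rather than a sentinel:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for Kubernetes API errors.
var (
	errNotFound  = errors.New("not found")
	errForbidden = errors.New("forbidden")
)

// getFunc fetches the fault CR and reports whether the awaited condition holds.
type getFunc func(ctx context.Context) (done bool, err error)

// waitFaultCondition polls get until the condition holds or ctx is done. On
// timeout it returns the last non-NotFound error, if any, not a bare timeout.
func waitFaultCondition(ctx context.Context, get getFunc, interval time.Duration) error {
	var lastErr error
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := get(ctx)
		switch {
		case err == nil && done:
			return nil
		case err == nil:
			lastErr = nil // a successful Get clears any stale error (a design choice)
		case !errors.Is(err, errNotFound):
			lastErr = err // NotFound right after create is tolerated, others are kept
		}
		select {
		case <-ctx.Done():
			if lastErr != nil {
				return fmt.Errorf("condition not met; last Get error: %w", lastErr)
			}
			return fmt.Errorf("condition not met: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// poll drives waitFaultCondition with a fake Get that fails `failures` times
// and then succeeds; a negative count means it never succeeds.
func poll(failure error, failures int) error {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	calls := 0
	get := func(context.Context) (bool, error) {
		calls++
		if failures >= 0 && calls > failures {
			return true, nil
		}
		return false, failure
	}
	return waitFaultCondition(ctx, get, 5*time.Millisecond)
}

func main() {
	fmt.Println("transient NotFound:", poll(errNotFound, 2))
	fmt.Println("persistent RBAC:   ", poll(errForbidden, -1))
}
```

With this shape, an RBAC or CRD problem shows up in the failure message by name, while a fault that simply never reaches the condition still reports as a timeout.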

Conceded correct: apply path, GVR, namespace handling, label selectors (validator pod labels verified), AllRecovered-for-duration-faults.

Deferred (noted): direct v0 recovery probe — validators publish no endpoint and the platform doesn't probe v0 either, so it's an expansion beyond port-fidelity (D1); frozen sei.io/nodedeployment key — not a clean swap (it scopes validators-only; sei.io/seinetwork would also catch followers, changing the fault); 2nd follower + foreground-delete (advisory). Harness-SA chaos-mesh RBAC is the platform run-model deliverable.

Re-smoke with the strengthened asserts in progress (now a pass means the chain advanced under a real fault).

A diagnostic apply revealed the injected count lives per-target at
status.experiment.containerRecords[].injectedCount (and status.instances),
NOT at a top-level status.experiment.injectedCount (which the CRD declares but
the controller never populates). The first guard read the wrong path -> always
0 -> false-failed even when the selector matched.

faultInjectedTargets now sums injectedCount across containerRecords. Verified
out-of-band: the network-partition selector (sei.io/nodedeployment=<chain> +
sei.io/node=<chain>-0) DOES match the SDK-provisioned validator pods
(Selected=True, AllInjected=True, one container record injected).
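
Summing the per-target counts might look like the sketch below. It assumes the status is available as nested maps and uses an illustrative status document; injectedTargets is a hypothetical name, and only the status.experiment.containerRecords[].injectedCount path comes from the note above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// injectedTargets sums injectedCount across status.experiment.containerRecords.
// Missing fields yield 0, so an unmatched selector reads as "nothing injected".
func injectedTargets(obj map[string]any) int64 {
	status, _ := obj["status"].(map[string]any)
	experiment, _ := status["experiment"].(map[string]any)
	records, _ := experiment["containerRecords"].([]any)
	var total int64
	for _, r := range records {
		rec, _ := r.(map[string]any)
		// Decoders differ in how they represent JSON numbers, so accept the
		// common cases.
		switch n := rec["injectedCount"].(type) {
		case int64:
			total += n
		case int:
			total += int64(n)
		case float64:
			total += int64(n)
		}
	}
	return total
}

func main() {
	// Illustrative status with one injected container record.
	raw := `{"status":{"experiment":{"containerRecords":[{"phase":"Injected","injectedCount":1}]}}}`
	var obj map[string]any
	if err := json.Unmarshal([]byte(raw), &obj); err != nil {
		panic(err)
	}
	fmt.Println("injected targets:", injectedTargets(obj))
	fmt.Println("empty status:    ", injectedTargets(map[string]any{}))
}
```

A gate built on this would require the sum to be greater than zero in addition to AllInjected=True, which is what rules out the vacuous empty-target pass.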

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bdchatham merged commit 9595620 into main on Jun 23, 2026
5 checks passed
bdchatham deleted the feat/wsi-chaos-suite branch on June 23, 2026 03:45