feat(test): TestChaosSuite — chaos faults via the harness (network-partition) #432
…rtition)

Ports the platform chaos suite into the go-test harness, starting with network-partition. Each fault runs against its own fresh chain as a subtest (continue-on-failure):

- apply the Chaos-Mesh fault CR (unstructured, via the dynamic client, which keeps the chaos-mesh API out of the module's deps); fault templates are embedded.
- gateInjected: poll status.conditions[AllInjected]=True before asserting. This is the anti-false-green guard (inject is async; a 0-match selector would let the assert run against an undisturbed chain).
- assert live-under-fault: the unfaulted RPC follower stays caught up (faults are f=1-bounded, so 2/3 quorum holds).
- gateRecovered: poll AllRecovered=True (catches stuck tc/tproxy finalizers); then assert the chain reconverged.

network-partition isolates validator-0 from validators 1-3. Per-run sei.io/harness-run label on the fault CR. More faults follow once each passes in-cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
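The gateInjected / gateRecovered checks above come down to reading one entry out of status.conditions on the fault CR. A minimal sketch of that read, assuming the CR has already been fetched and decoded into a generic map (the shape an unstructured object exposes). The helper name conditionTrue is hypothetical and not taken from the PR; only the stdlib is used so the snippet runs on its own.

```go
package main

import "fmt"

// conditionTrue reports whether status.conditions holds an entry of the given
// type whose status is "True". obj is the decoded CR as a generic map.
// Hypothetical helper for illustration; not the PR's implementation.
func conditionTrue(obj map[string]interface{}, condType string) bool {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return false // status not written yet (inject is async)
	}
	conds, ok := status["conditions"].([]interface{})
	if !ok {
		return false
	}
	for _, c := range conds {
		m, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if m["type"] == condType && m["status"] == "True" {
			return true
		}
	}
	return false
}

func main() {
	// Made-up status: targets selected, injection still in flight.
	cr := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []interface{}{
				map[string]interface{}{"type": "Selected", "status": "True"},
				map[string]interface{}{"type": "AllInjected", "status": "False"},
			},
		},
	}
	fmt.Println("Selected:", conditionTrue(cr, "Selected"))
	fmt.Println("AllInjected:", conditionTrue(cr, "AllInjected"))
}
```

A gate would call a check like this in a poll loop with a deadline, and treat a missing status block as "not yet" rather than as a failure.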
PR Summary

Medium Risk

Overview

SDK: Adds […]

Harness: New Chaos-Mesh helpers apply fault CRs via the dynamic client (unstructured, no chaos-mesh API in module deps), embed YAML templates, and gate on […]

Reviewed by Cursor Bugbot for commit 0da01cc.
Correctness fixes (the two block-worthy findings):

- under-fault liveness was trivial: WaitCaughtUp has no lower bound and catching_up==false is satisfied by a STALLED node, so it proved nothing about progress under the fault. Add sdk WaitHeightAdvances and assert the follower's height advances within a window inside the fault duration (observed while the fault is programmed, not after it expires).
- 0-target false-green: AllInjected is vacuously true for an empty target set, and the only liveness probe is the unfaulted follower, so a no-op fault would pass. gateInjected now asserts status.experiment.injectedCount > 0.

Hardening:

- waitFaultCondition surfaces the last real Get error (tolerating transient post-create NotFound) instead of masking it as a generic timeout.
- registration tripwires: every fault must carry spec.duration (else gateRecovered hangs) and the <chain>-0 label value must be <= 63 chars.

Deferred (noted): direct v0 recovery probe (validators publish no endpoint; platform doesn't probe v0 either; port-fidelity); switching the selector off the frozen sei.io/nodedeployment key (not a clean swap: it scopes validators-only; seinetwork would also catch followers). Harness-SA chaos-mesh RBAC is a platform deliverable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
/xreview: 4 lenses (idiom · systems · k8s-dissenter · sei-network), RESOLVED + live-smoke-proven

Inject→recover lifecycle smoke-passed in-cluster (AllInjected/AllRecovered both fire for a duration fault, which empirically resolves the dissenter's recovery-hang concern). But the reviews proved the pass was weak; fixes applied:
Conceded correct: apply path, GVR, namespace handling, label selectors (validator pod labels verified), AllRecovered-for-duration-faults.

Deferred (noted): direct v0 recovery probe: validators publish no endpoint and the platform doesn't probe v0 either, so it's an expansion beyond port-fidelity (D1); frozen […]

Re-smoke with the strengthened asserts in progress (now a pass means the chain advanced under a real fault).
A diagnostic apply revealed the injected count lives per-target at status.experiment.containerRecords[].injectedCount (and status.instances), NOT at a top-level status.experiment.injectedCount (which the CRD declares but the controller never populates). The first guard read the wrong path -> always 0 -> false-failed even when the selector matched. faultInjectedTargets now sums injectedCount across containerRecords.

Verified out-of-band: the network-partition selector (sei.io/nodedeployment=<chain> + sei.io/node=<chain>-0) DOES match the SDK-provisioned validator pods (Selected=True, AllInjected=True, one container record injected).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
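The corrected read path can be sketched as a walk over status.experiment.containerRecords that sums each record's injectedCount. This is an illustrative stand-in for what the commit calls faultInjectedTargets, not its actual code: the function name sumInjectedCount and the sample status are assumptions, and the CR is modelled as a plain decoded map so the snippet needs only the stdlib.

```go
package main

import "fmt"

// sumInjectedCount adds up injectedCount across
// status.experiment.containerRecords. Missing or mistyped fields count as 0,
// so an empty target set yields 0 and fails a "> 0" guard.
// Hypothetical sketch; not the PR's faultInjectedTargets.
func sumInjectedCount(obj map[string]interface{}) int64 {
	// Indexing a nil map is safe in Go, so failed assertions fall through.
	status, _ := obj["status"].(map[string]interface{})
	experiment, _ := status["experiment"].(map[string]interface{})
	records, _ := experiment["containerRecords"].([]interface{})

	var total int64
	for _, r := range records {
		rec, ok := r.(map[string]interface{})
		if !ok {
			continue
		}
		// Decoders differ on numeric types, so accept the common ones.
		switch n := rec["injectedCount"].(type) {
		case int64:
			total += n
		case int:
			total += int64(n)
		case float64:
			total += int64(n)
		}
	}
	return total
}

func main() {
	// Made-up status with one injected container record.
	cr := map[string]interface{}{
		"status": map[string]interface{}{
			"experiment": map[string]interface{}{
				"containerRecords": []interface{}{
					map[string]interface{}{"id": "example-record", "injectedCount": int64(1)},
				},
			},
		},
	}
	fmt.Println("injected targets:", sumInjectedCount(cr))
	fmt.Println("empty CR:", sumInjectedCount(map[string]interface{}{}))
}
```

Treating every missing level as zero keeps the guard strict: a fault whose selector matched nothing has no container records, so the sum stays 0 and the gate fails instead of passing vacuously.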
WS-I — the chaos suite, ported into the go-test harness
First chaos increment: ports the platform chaos suite's inject/verify/recover pattern into TestChaosSuite, starting with network-partition. Builds on the merged load suite (#430/#431).

Pattern (mirrors the platform verify-injection/verify-cleanup)
- Apply the Chaos-Mesh fault CR (unstructured, via the dynamic client); fault templates are //go:embed'd.
- gateInjected: poll status.conditions[AllInjected]=True before asserting. This is the anti-false-green guard (inject is async; a 0-match selector would let the assert run against an undisturbed chain).
- gateRecovered: poll AllRecovered=True (catches stuck tc/tproxy finalizers); then assert the chain reconverged.

network-partition isolates validator-0 from validators 1-3. Each fault is a subtest with its own fresh chain (continue-on-failure, matching the platform suite). Per-run sei.io/harness-run label on the fault CR.

Verification

go build ./... clean · golangci-lint 0 issues · go test -c -tags integration → TestBenchmark + TestChaosSuite · in-cluster smoke on harbor in progress.

Next

More faults (the other 6 kinds) once each passes in-cluster; container-kill (one-shot, no AllRecovered) + rpc-chaos (2 CRs) + mempool (PromQL, deferred telemetry gate) get per-scenario handling.
🤖 Generated with Claude Code