feat(test): port 4 more chaos faults (packet-loss, cpu/time/dns)#433
Adds the duration-bearing, validator-targeted faults across the remaining chaos KINDS, on the proven network-partition flow (apply -> gateInjected + injectedCount>0 -> height-advances-under-fault -> gateRecovered -> caught-up):

- packet-loss (NetworkChaos, 15% correlated loss, one validator)
- cpu-stress (StressChaos, saturate one validator)
- time-skew (TimeChaos, -30s clock skew on one validator)
- dns-chaos (DNSChaos, resolution errors on all validators)

All f=1-bounded so the chain degrades-not-halts; the unfaulted follower observes forward progress. One-shot faults (pod-failure, container-kill) and the follower-targeted/PromQL outliers (rpc-chaos, mempool) follow separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
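The flow named in the commit message above (apply, injected gate with a non-zero target count, progress under fault, recovered gate, caught-up) can be sketched roughly as follows. This is a minimal illustration, not the harness from the PR: the `faultEnv` interface, its method names, `runDurationFault`, `errVacuous`, and `fakeEnv` are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// faultEnv is a hypothetical stand-in for the cluster-facing helpers.
// The real harness is not shown in the PR text, so every name is assumed.
type faultEnv interface {
	Apply() error                // create the chaos resource
	InjectedCount() (int, error) // number of pods the fault actually hit
	HeightAdvances() error       // unfaulted follower sees forward progress
	Recovered() error            // fault is reported as recovered
	CaughtUp() error             // faulted validator is back at the tip
}

// errVacuous marks a run where the fault selected no targets, so any
// "progress under fault" result would be meaningless.
var errVacuous = errors.New("fault was injected into 0 targets")

// runDurationFault walks the five steps in order and stops at the first failure.
func runDurationFault(env faultEnv) error {
	if err := env.Apply(); err != nil {
		return fmt.Errorf("apply: %w", err)
	}
	n, err := env.InjectedCount()
	if err != nil {
		return fmt.Errorf("gate injected: %w", err)
	}
	if n == 0 {
		return errVacuous // the 0-target guard
	}
	if err := env.HeightAdvances(); err != nil {
		return fmt.Errorf("height under fault: %w", err)
	}
	if err := env.Recovered(); err != nil {
		return fmt.Errorf("gate recovered: %w", err)
	}
	if err := env.CaughtUp(); err != nil {
		return fmt.Errorf("caught up: %w", err)
	}
	return nil
}

// fakeEnv records which steps ran; it always succeeds.
type fakeEnv struct {
	injected int
	steps    []string
}

func (f *fakeEnv) Apply() error { f.steps = append(f.steps, "apply"); return nil }
func (f *fakeEnv) InjectedCount() (int, error) {
	f.steps = append(f.steps, "injected")
	return f.injected, nil
}
func (f *fakeEnv) HeightAdvances() error { f.steps = append(f.steps, "height"); return nil }
func (f *fakeEnv) Recovered() error      { f.steps = append(f.steps, "recovered"); return nil }
func (f *fakeEnv) CaughtUp() error       { f.steps = append(f.steps, "caught-up"); return nil }

func main() {
	env := &fakeEnv{injected: 1}
	fmt.Println(runDurationFault(env), env.steps)
	fmt.Println(runDurationFault(&fakeEnv{injected: 0}))
}
```

Returning a dedicated error for the zero-target case keeps a misconfigured selector from being reported as a passing liveness check.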
PR Summary

Low Risk

Overview

New scenarios: DNS chaos is not added; an inline comment records that it is deferred until recovery-focused asserts exist (live peers don't re-resolve DNS under the current under-fault height check).

Reviewed by Cursor Bugbot for commit 0fb7b8d.
…review)

In-cluster smoke confirmed that packet-loss, cpu-stress, and time-skew genuinely perturb the chain while it advances (real faults).

dns-chaos passed VACUOUSLY: its patterns (<chain>/<chain>-internal/<chain>-rpc) don't match the per-pod peer FQDNs seid resolves (<chain>-N-0.<chain>-N.<ns>.svc...), and live MConnections don't re-resolve mid-fault, so the under-fault progress assert is satisfied by an undisturbed chain (a false-green). It's a rediscovery/recovery fault, not a steady-state-liveness one, so it needs a recovery-focused assert + FQDN-matching patterns. It is deferred to the outlier follow-up rather than shipped as a vacuous test.

Also reword the time-skew comment to name the real mechanism (median-vote dominance, not a tolerance window).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
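The pattern mismatch described in this commit message can be illustrated with a small sketch. It assumes glob-style matching via Go's `path.Match` as a stand-in; how Chaos Mesh actually evaluates DNSChaos patterns may differ. The chain name `demo`, the namespace `ns`, and the trailing-wildcard pattern are hypothetical, and the host is cut at `.svc` because the source elides the rest of the FQDN.

```go
package main

import (
	"fmt"
	"path"
)

// matchesAny reports whether host matches at least one glob pattern.
// path.Match is only a stand-in for the real DNSChaos matcher (assumption).
func matchesAny(patterns []string, host string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, host); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	// Per-pod peer name in the shape the commit message describes,
	// with a hypothetical chain "demo" and namespace "ns".
	host := "demo-0-0.demo-0.ns.svc"

	// Service-level patterns in the shape the original scenario used.
	serviceLevel := []string{"demo", "demo-internal", "demo-rpc"}

	// A hypothetical pattern that does cover per-pod names.
	perPod := []string{"demo-*"}

	fmt.Println("service-level patterns match:", matchesAny(serviceLevel, host))
	fmt.Println("per-pod pattern matches:", matchesAny(perPod, host))
}
```

Under this assumption none of the service-level patterns covers the per-pod name, which is consistent with the fault never touching peer resolution.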
/xreview: sei-network (per-kind fidelity), RESOLVED + smoke-proven

In-cluster smoke ran all 4; the review + smoke together split them cleanly:
dns-chaos dropped from this PR. The patterns (

Also: reworded the time-skew comment to name the real mechanism (median-vote dominance). Namespace-parameterized DNS patterns were a fidelity improvement over the hardcoded-nightly original (noted, but doesn't rescue dns-chaos).

This PR ships packet-loss + cpu-stress + time-skew, all smoke-proven to genuinely perturb the chain while it stays live.
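To make "median-vote dominance" concrete: if block time is derived from a voting-power-weighted median of vote timestamps, a single validator skewed by -30s cannot move the result while the others agree. The sketch below assumes four equal-power validators and that aggregation rule; neither is confirmed by the PR, and `vote` and `weightedMedian` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// vote is a hypothetical pairing of a validator's timestamp (unix seconds)
// with its voting power.
type vote struct {
	unix  int64
	power int64
}

// weightedMedian sorts votes by time and returns the timestamp at which the
// accumulated power reaches half of the total.
func weightedMedian(votes []vote) int64 {
	sorted := append([]vote(nil), votes...) // do not mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].unix < sorted[j].unix })

	var total int64
	for _, v := range sorted {
		total += v.power
	}
	remaining := total / 2
	for _, v := range sorted {
		if remaining <= v.power {
			return v.unix
		}
		remaining -= v.power
	}
	return 0 // only reached for an empty input
}

func main() {
	votes := []vote{
		{unix: 1000, power: 1},
		{unix: 1001, power: 1},
		{unix: 1002, power: 1},
		{unix: 970, power: 1}, // the validator under -30s time skew
	}
	fmt.Println("median timestamp:", weightedMedian(votes))
}
```

With these inputs the median falls on one of the unskewed timestamps, so the skewed clock is outvoted rather than tolerated within a window.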
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
    cpu:
      workers: 4
      load: 80
  duration: "{{.Duration}}"
CPU stress lacks seid scoping
Medium Severity
The new StressChaos template omits containerNames, so Chaos Mesh applies CPU stress to every container in the selected validator pod (including sei-sidecar and observability sidecars), not just seid. The sibling time_skew template added in the same change scopes injection to seid, so the CPU fault may destabilize auxiliary containers and produce flaky or misleading under-fault liveness results.
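One way to address this is to add `containerNames` to the StressChaos template so only the `seid` container is stressed. The sketch below renders such a template with Go's `text/template`. The `workers`, `load`, and `duration` values come from the reviewed snippet; the metadata, `mode`, selector, the `params` type, and the `render` and `scopedTo` helpers are assumptions for illustration and are not taken from the PR.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// cpuStressTmpl is a hypothetical StressChaos manifest. Only the stressors
// and duration lines mirror the reviewed snippet; the rest is assumed.
const cpuStressTmpl = `apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: {{.Name}}
  namespace: {{.Namespace}}
spec:
  mode: one
  selector:
    namespaces: ["{{.Namespace}}"]
  containerNames: ["seid"]
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "{{.Duration}}"
`

// params holds the values substituted into the template.
type params struct {
	Name, Namespace, Duration string
}

var tmpl = template.Must(template.New("cpu-stress").Parse(cpuStressTmpl))

// render fills the template and returns the manifest text.
func render(p params) (string, error) {
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// scopedTo is a cheap textual check that the manifest restricts injection
// to the given container; a real test would parse the YAML instead.
func scopedTo(manifest, container string) bool {
	return strings.Contains(manifest, fmt.Sprintf("containerNames: [%q]", container))
}

func main() {
	out, err := render(params{Name: "cpu-stress-demo", Namespace: "demo-ns", Duration: "5m"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
	fmt.Println("scoped to seid:", scopedTo(out, "seid"))
}
```

A check like `scopedTo` could run as a unit test over every embedded template, so a missing container scope is caught before the in-cluster smoke run.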


WS-I — chaos faults batch 2
Adds 4 faults on the proven network-partition flow (#432), covering the remaining duration-bearing KINDS:
All f=1-bounded (chain degrades, doesn't halt); the unfaulted follower observes forward progress via WaitHeightAdvances. Each is a //go:embed template + a chaosScenarios entry; the infra (gates, 0-target guard, lifecycle) is unchanged from #432.

Excluded (follow-ups):

- one-shot faults (pod-failure/container-kill: no duration, need oneShot handling)
- follower-targeted rpc-chaos (2 HTTPChaos CRs, different observer)
- mempool (no CR + PromQL telemetry gate)

Verification

go build ./... clean · golangci-lint 0 issues · go test -c -tags integration compiles · in-cluster smoke of all 4 in progress.

🤖 Generated with Claude Code