feat(test): port 4 more chaos faults (packet-loss, cpu/time/dns)#433

Merged: bdchatham merged 2 commits into main from feat/wsi-chaos-faults-batch2 on Jun 23, 2026

@bdchatham

Collaborator

WS-I — chaos faults batch 2

Adds 4 faults on the proven network-partition flow (#432), covering the remaining duration-bearing KINDS:

| fault | kind | target | effect |
| --- | --- | --- | --- |
| packet-loss | NetworkChaos | one validator | 15% correlated loss |
| cpu-stress | StressChaos | one validator | CPU saturation |
| time-skew | TimeChaos | one validator | -30s clock skew |
| dns-chaos | DNSChaos | all validators | resolution errors |

All f=1-bounded (chain degrades, doesn't halt); the unfaulted follower observes forward progress via WaitHeightAdvances. Each is a //go:embed template + a chaosScenarios entry; the infra (gates, 0-target guard, lifecycle) is unchanged from #432.

Excluded (follow-ups): one-shot faults (pod-failure/container-kill — no duration, need oneShot handling); follower-targeted rpc-chaos (2 HTTPChaos CRs, different observer); mempool (no CR + PromQL telemetry gate).

Verification

go build ./... clean · golangci-lint 0 issues · go test -c -tags integration compiles · in-cluster smoke of all 4 in progress.

🤖 Generated with Claude Code

Adds the duration-bearing, validator-targeted faults across the remaining
chaos KINDS, on the proven network-partition flow (apply -> gateInjected +
injectedCount>0 -> height-advances-under-fault -> gateRecovered -> caught-up):

- packet-loss   (NetworkChaos, 15% correlated loss, one validator)
- cpu-stress    (StressChaos, saturate one validator)
- time-skew     (TimeChaos, -30s clock skew on one validator)
- dns-chaos     (DNSChaos, resolution errors on all validators)

All f=1-bounded so the chain degrades-not-halts; the unfaulted follower
observes forward progress. One-shot faults (pod-failure, container-kill) and
the follower-targeted/PromQL outliers (rpc-chaos, mempool) follow separately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 23, 2026


PR Summary

Low Risk
Changes are integration-test-only (Chaos-Mesh YAML templates and scenario registration) with no production or runtime behavior changes.

Overview
Extends the in-cluster Chaos-Mesh integration suite with three duration-bound faults that reuse the existing provision → inject → gate → liveness → recovery flow from network-partition.

New scenarios: packet-loss (15% correlated NetworkChaos on one validator), cpu-stress (StressChaos CPU load on one validator), and time-skew (TimeChaos −30s on one validator’s seid container). Each adds a //go:embed YAML template and a chaosScenarios entry; harness gates and asserts are unchanged.

DNS chaos is not added—an inline comment records that it is deferred until recovery-focused asserts exist (live peers don’t re-resolve DNS under the current under-fault height check).

Reviewed by Cursor Bugbot for commit 0fb7b8d.

Comment thread test/integration/faults/dns_chaos.yaml.tmpl Outdated
…review)

In-cluster smoke confirmed packet-loss, cpu-stress, time-skew genuinely perturb
the chain while it advances (real faults). dns-chaos passed VACUOUSLY: its
patterns (<chain>/<chain>-internal/<chain>-rpc) don't match the per-pod peer
FQDNs seid resolves (<chain>-N-0.<chain>-N.<ns>.svc...), and live MConnections
don't re-resolve mid-fault — so the under-fault progress assert is satisfied by
an undisturbed chain (a false-green). It's a rediscovery/recovery fault, not a
steady-state-liveness one, so it needs a recovery-focused assert + FQDN-matching
patterns — deferred to the outlier follow-up rather than ship a vacuous test.

Also reword the time-skew comment to name the real mechanism (median-vote
dominance, not a tolerance window).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham

Collaborator Author

/xreview — sei-network (per-kind fidelity), RESOLVED + smoke-proven

In-cluster smoke ran all 4; the review + smoke together split them cleanly:

| fault | verdict | smoke |
| --- | --- | --- |
| packet-loss | COMPATIBLE (real fault, valid assert) | PASS (chain degraded-yet-advanced) |
| cpu-stress | COMPATIBLE | PASS |
| time-skew | COMPATIBLE (survives via median-vote dominance, not 'tolerance') | PASS (skewed validator outvoted) |
| dns-chaos | MISMATCH (false-green) | PASS vacuously |

dns-chaos dropped from this PR. The patterns (<chain>/<chain>-internal/<chain>-rpc) don't match the per-pod peer FQDNs seid resolves (<chain>-N-0.<chain>-N.<ns>.svc…), and live MConnections don't re-resolve mid-fault — so it injects but doesn't perturb consensus, and the under-fault progress assert passes against an undisturbed chain. It's a rediscovery/recovery fault, not steady-state-liveness — needs a recovery-focused assert + FQDN-matching patterns. Deferred to the outlier follow-up rather than ship a vacuous test.
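
The pattern mismatch can be reproduced in a few lines. The sketch below is illustrative only: the chain and namespace names are made up, Go's `path.Match` stands in for whatever matcher Chaos Mesh actually uses (its semantics may differ), and the peer name stops at `.svc` because the rest of the FQDN is elided in the discussion above. The trailing-wildcard pattern is shown for contrast and is not necessarily the fix the follow-up will adopt.

```go
package main

import (
	"fmt"
	"path"
)

// matchesAny reports whether host matches at least one glob pattern.
// path.Match is a stand-in matcher, not Chaos Mesh's implementation.
func matchesAny(patterns []string, host string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, host); err == nil && ok {
			return true
		}
	}
	return false
}

// servicePatterns builds the three service-level patterns named in the review.
func servicePatterns(chain string) []string {
	return []string{chain, chain + "-internal", chain + "-rpc"}
}

// peerHost builds a per-pod peer name of the shape <chain>-N-0.<chain>-N.<ns>.svc.
func peerHost(chain string, n int, ns string) string {
	return fmt.Sprintf("%s-%d-0.%s-%d.%s.svc", chain, n, chain, n, ns)
}

func main() {
	chain, ns := "mychain", "demo" // hypothetical names
	peer := peerHost(chain, 0, ns)

	fmt.Println("peer:", peer)
	fmt.Println("service patterns match:", matchesAny(servicePatterns(chain), peer))
	fmt.Println("wildcard pattern matches:", matchesAny([]string{chain + "-*"}, peer))
}
```

Even with matching patterns, the second half of the finding still applies: connections that are already established do not re-resolve, so the assert would also have to exercise rediscovery.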

Also: reworded the time-skew comment to name the real mechanism (median-vote dominance). Namespace-parameterized DNS patterns were a fidelity improvement over the hardcoded-nightly original (noted, but doesn't rescue dns-chaos).

This PR ships packet-loss + cpu-stress + time-skew — all smoke-proven to genuinely perturb the chain while it stays live.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.


    cpu:
      workers: 4
      load: 80
  duration: "{{.Duration}}"


CPU stress lacks seid scoping

Medium Severity

The new StressChaos template omits containerNames, so Chaos Mesh applies CPU stress to every container in the selected validator pod (including sei-sidecar and observability sidecars), not just seid. The sibling time_skew template added in the same change scopes injection to seid, so the CPU fault may destabilize auxiliary containers and produce flaky or misleading under-fault liveness results.
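
One way to catch this class of omission early is a textual guard over the templates. The sketch below is an assumption-laden illustration, not the fix that was merged: the spec fragments are abbreviated, `mode: one` is assumed, `scopedTo` is a hypothetical helper, and it only recognizes `containerNames` written as an inline list on one line. A real check would parse the YAML.

```go
package main

import (
	"fmt"
	"strings"
)

// Abbreviated spec fragments; only the fields relevant to scoping are shown.
const timeSkewSpec = `spec:
  mode: one
  containerNames: ["seid"]
  timeOffset: "-30s"
  duration: "{{.Duration}}"
`

const cpuStressSpec = `spec:
  mode: one
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "{{.Duration}}"
`

// cpuStressScoped is the same fragment with the scoping the review asks for.
const cpuStressScoped = `spec:
  mode: one
  containerNames: ["seid"]
  stressors:
    cpu:
      workers: 4
      load: 80
  duration: "{{.Duration}}"
`

// scopedTo reports whether the manifest lists the container under
// containerNames. Crude: it handles only a single-line inline list.
func scopedTo(manifest, container string) bool {
	for _, line := range strings.Split(manifest, "\n") {
		l := strings.TrimSpace(line)
		if strings.HasPrefix(l, "containerNames:") &&
			strings.Contains(l, `"`+container+`"`) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("time-skew scoped to seid:", scopedTo(timeSkewSpec, "seid"))
	fmt.Println("cpu-stress scoped to seid:", scopedTo(cpuStressSpec, "seid"))
	fmt.Println("cpu-stress (with containerNames) scoped:", scopedTo(cpuStressScoped, "seid"))
}
```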


@bdchatham bdchatham merged commit 06adfe4 into main Jun 23, 2026
5 checks passed
@bdchatham bdchatham deleted the feat/wsi-chaos-faults-batch2 branch June 23, 2026 14:07