feat(test): port dns-chaos with corrected peer-FQDN patterns #436

Closed
bdchatham wants to merge 1 commit into main from feat/wsi-chaos-dns

@bdchatham

Collaborator

WS-I — dns-chaos (corrected)

Re-adds dns-chaos, which was deferred in #433 because its patterns (<chain>/<chain>-internal/<chain>-rpc) matched no name that seid resolves, making the test vacuous. The pattern is corrected to <chain>-*.<ns>.svc.cluster.local, which matches the per-pod peer FQDNs validators actually resolve (<chain>-N-0.<chain>-N.<ns>.svc…).

With the injectedTargets>0 gate confirming the DNS interceptor is actually installed on the validators, this is a legitimate DNS-resilience test: the chain keeps producing despite resolution failures on its own names (established MConnections don't re-resolve, so a DNS error doesn't tear down live peering). The template comment is explicit that it asserts resilience, not active consensus perturbation, so it is honest about what a green result means.

Fits the standard flow (gateInjected + height-advance + recovery Ready-gate). Chaos suite → 12/14 (rpc-chaos + mempool/PromQL remain).
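
The gate step in that flow can be pictured as a bounded poll: keep checking a condition until it holds or a deadline passes, and treat the timeout as a hard failure. The sketch below is only an illustration of that shape, not the suite's code; `wait_for_gate`, its parameters, and the injectable `clock`/`sleep` are hypothetical names introduced here.

```python
import time


def wait_for_gate(check, timeout_s, interval_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns True or `timeout_s` elapses.

    Returns True if the gate opened, False on timeout. The caller is
    expected to fail the scenario on False rather than continue.
    `clock` and `sleep` are injectable so the loop can be tested
    without real waiting (an assumption of this sketch).
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass something like `lambda: injected_targets() > 0` as `check` (again a hypothetical helper). If the injection never lands, the gate returns False and the scenario fails instead of reporting a green run.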

Verification

build/lint clean · in-cluster smoke in progress (validates the corrected patterns inject + the chain stays resilient + recovers).

🤖 Generated with Claude Code

Re-adds dns-chaos (deferred in #433: its old patterns matched no name seid
resolves). Corrected to <chain>-*.<ns>.svc.cluster.local, which matches the
per-pod peer FQDNs validators actually resolve. With the injected-targets gate
confirming the DNS interceptor is installed, this is a legitimate DNS-resilience
test: the chain keeps producing despite resolution failures on its own names
(established MConnections don't re-resolve). Comment is honest that it asserts
resilience, not active consensus perturbation. Fits the standard flow +
recovery gate. Chaos suite -> 12/14 (rpc-chaos + mempool remain).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 23, 2026

PR Summary

Low Risk
Changes are limited to integration test assets and scenario registration; no runtime or production paths are modified.

Overview
dns-chaos is back in the integration chaos suite after being deferred because old DNS patterns never matched names validators resolve.

A new Chaos-Mesh DNSChaos template injects resolution errors on all validators for {{.ChainID}}-*.{{.Namespace}}.svc.cluster.local, targeting the per-pod peer FQDNs seid actually uses. The scenario runs through the same inject → gateInjected → height advance → recovery gates as other duration-bearing faults; the template documents that a pass means DNS resilience (chain keeps producing while live peering does not re-resolve), not that DNS failure stops consensus.

Reviewed by Cursor Bugbot for commit ca1e185.

@bdchatham

Collaborator Author

Closing: the full slate blocked this. sei-network traced the chaos-mesh matcher: the pattern <chain>-*.<ns>.svc.cluster.local is illegal (DNSChaos requires * at the end only; a mid-string * is rejected at insert). The rule push errors → Apply returns before writing records → AllInjected never fires → dns-chaos hangs at gateInjected, then times out. Confirmed by the in-cluster smoke (provisioned + caught up, then stuck with no 'fault injected'). The injectedTargets guard correctly surfaced it as a hard fail, not a false green.
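
A pre-flight check for the trailing-wildcard rule could catch this before the scenario is ever applied. The following is a minimal sketch that encodes only the rule as stated in this comment (`*` allowed at the end only); it is not chaos-mesh's matcher, and `is_legal_pattern`, `matches`, and the `sei-test`/`chaos` names in the usage note are hypothetical.

```python
def is_legal_pattern(pattern: str) -> bool:
    """True if the pattern has no wildcard, or a single '*' as its last character."""
    star_count = pattern.count("*")
    if star_count == 0:
        return True
    return star_count == 1 and pattern.endswith("*")


def matches(pattern: str, name: str) -> bool:
    """Match `name` against a legal pattern.

    A trailing '*' is treated as a prefix match; a pattern without a
    wildcard must equal the name exactly. Illegal patterns raise
    ValueError instead of silently matching nothing.
    """
    if not is_legal_pattern(pattern):
        raise ValueError(f"wildcard must be trailing: {pattern!r}")
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    return name == pattern
```

Under these assumptions, `is_legal_pattern("sei-test-*.chaos.svc.cluster.local")` is False, while `matches("sei-test-*", "sei-test-0-0.sei-test-0.chaos.svc.cluster.local")` is True.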

Even with the legal trailing form (<chain>-*), the dissenter's point stands: with established MConnections that don't re-resolve, the under-fault height-advance check has no failing branch on the DNS-resilience dimension, so it would be a vacuous green. A real dns-chaos needs: (1) a legal trailing-* pattern matching the peer FQDNs, (2) a direct fault-effect assertion (a DNS query inside a faulted validator returns SERVFAIL during the window), and (3) ideally a forced re-dial (bounce a validator mid-fault) so resolution is actually exercised. Deferring to do it properly with that design, grouped with the other fault-effect-assert outliers (rpc-chaos, mempool).
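
Requirement (2) might look roughly like the sketch below: collect the DNS response codes observed inside each faulted validator during the window and fail unless every probe came back SERVFAIL. How the probes are collected is out of scope here; the function names and the `probes` shape (validator name mapped to a list of RCODEs) are assumptions of this sketch. Treating an empty probe list as a failure gives the assertion a real failing branch.

```python
NOERROR = 0   # DNS RCODE 0 (RFC 1035)
SERVFAIL = 2  # DNS RCODE 2 (RFC 1035)


def validators_missing_fault(probes: dict[str, list[int]]) -> list[str]:
    """Return validators whose in-window probes do not prove the fault.

    A validator is flagged if it has no probes at all (nothing was
    exercised) or if any probe returned something other than SERVFAIL.
    """
    missing = []
    for validator, rcodes in sorted(probes.items()):
        if not rcodes or any(rc != SERVFAIL for rc in rcodes):
            missing.append(validator)
    return missing


def assert_fault_effect(probes: dict[str, list[int]]) -> None:
    """Raise AssertionError unless every validator observed SERVFAIL only."""
    if not probes:
        raise AssertionError("no validators probed during the fault window")
    missing = validators_missing_fault(probes)
    if missing:
        raise AssertionError(f"fault effect not observed on: {missing}")
```

For example, `validators_missing_fault({"v0": [2, 2], "v1": [2, 0], "v2": []})` flags `v1` (one query resolved) and `v2` (never probed).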

@bdchatham bdchatham closed this Jun 23, 2026
@bdchatham bdchatham deleted the feat/wsi-chaos-dns branch June 23, 2026 16:32