Skip to content

fix(networking): explicit sync for gossipsub outbound stream setup#765

Merged
tcoratger merged 1 commit into
leanEthereum:mainfrom
tcoratger:worktree-fix-explicit-gossipsub-stream-sync
May 23, 2026
Merged

fix(networking): explicit sync for gossipsub outbound stream setup#765
tcoratger merged 1 commit into
leanEthereum:mainfrom
tcoratger:worktree-fix-explicit-gossipsub-stream-sync

Conversation

@tcoratger
Copy link
Copy Markdown
Collaborator

Summary

  • Two code paths on the dialer side could race to open an outbound gossipsub stream for the same peer: the direct call after status exchange, and the inbound-stream handler after the peer's reciprocal stream arrived. Both could pass the has-outbound-stream check before either registered, leaving an orphaned QUIC stream behind.
  • The previous fix was an await asyncio.sleep(0.1) in the deferred-setup helper — timing-dependent papering over a TOCTOU race.
  • Replaces the sleep with a per-peer asyncio.Lock that makes check-and-open atomic. The deferred-setup helper is deleted; the inbound handler now calls the idempotent setup directly. Lock entries are cleared on disconnect to bound memory.

Why the delay was unnecessary

The two comments around the removed delay claimed it "lets the dialer complete its status exchange first" and "prevents deadlock". Neither is true:

  • The dialer opens its outbound gossipsub stream after status exchange completes, so when a listener receives an inbound gossipsub stream, the dialer's status is already done.
  • Gossipsub mesh formation is heartbeat-driven (1s tick). Opening the reciprocal stream earlier strictly enlarges the GRAFT-eligible set at the next heartbeat, so faster open helps mesh convergence rather than hurts.
  • Subscription RPCs sent on first outbound open are idempotent (set add/discard), so concurrent reciprocal subscribes converge to identical state regardless of arrival order.

The delay was, in practice, only serialising against the local TOCTOU race on the dialer side — which the lock now handles deterministically.

Test plan

  • uv run pytest tests/lean_spec/subspecs/networking/ — 804 passed
  • just check — ruff check, ruff format, ty check, codespell, mdformat all pass
  • Verify two-node connect on a local devnet still completes mesh handshake without orphaned streams in the QUIC log

🤖 Generated with Claude Code

…eer lock

Two code paths concurrently attempted to open the outbound gossipsub
stream for the same peer: the dialing path called the setup helper
directly after status exchange, and the inbound-stream handler called
it once the peer's reciprocal stream arrived. Both could pass the
has-outbound-stream check before either registered, leaving an
orphaned QUIC stream behind. A 100ms sleep masked the race by giving
the direct path enough time to win.

Serialise the check-and-open under a per-peer asyncio.Lock so only
one stream is opened. The deferred-setup helper is no longer needed
and the inbound handler now calls the idempotent setup directly.
Lock entries are cleared on disconnect to bound memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tcoratger tcoratger merged commit 37f3ebc into leanEthereum:main May 23, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant