fix(network): stop integer-overflow panic on malformed gossip SSZ (Hive gossip: ignores malformed ssz)#855
Conversation
Hive `gossip: ignores malformed ssz` (test 390) publishes 1024 bytes of
`0xef` on a valid block topic and expects the client to remain healthy.
zeam_devnet4 instead panicked on the network thread:
thread 64 panic: integer overflow
handleMsgFromRustBridge -> snappyz.decodeWithMax -> uvarint
s += 7; // s: u6, after 9 iterations s == 63
Every `0xef` is a varint continuation byte; after the 10th continuation
byte `s += 7` overflows the u6 shift counter. The crash took down the
client and cascaded into 26 follow-up zeam_devnet4 failures (every test
that re-used the second node afterwards reported "client should connect
and send a request: Elapsed(())").
Two layers of defense:
1. Bump zig-snappy to a fix that rejects >9-byte continuation runs in
uvarint() instead of panicking (blockblaz/zig-snappy#9). The decoder
now returns error.Corrupt for the exact Hive payload.
2. Add validateSnappyBlockHeader() and call it in handleMsgFromRustBridge
before invoking the third-party decoder. Uses zeam's own uvarint
(multiformats) which has proper bounds checking and per-topic size
limits. This guards against any future regression in the upstream
decoder and rejects oversized declared sizes before any heap
allocation.
Edge cases covered by new tests in pkgs/network/src/ethlibp2p.zig:
- 1024 bytes of 0xef (the actual Hive payload) -> rejected, no panic
- 11 continuation bytes followed by a valid terminator -> rejected
- Empty payload -> rejected
- Declared size > MAX_GOSSIP_BLOCK_SIZE -> rejected
- Header-only buffer with non-zero declared size -> rejected
- Well-formed header + payload byte -> accepted
- Zero-length declared block (single 0x00 byte) -> accepted
- snappyz.decodeWithMax on 1024 bytes of 0xef -> error.Corrupt (no panic)
Refs: https://hive.leanroadmap.org/suite.html?suiteid=1778305924-e14785654316449971f512c113089da3.json&suitename=gossip&client=zeam_devnet4#test-390
Companion: blockblaz/zig-snappy#9
Move from the pre-review commit on the fix branch (9521bbcb) to the tagged release v0.0.5 (050f529b) which includes the merged review feedback: - 10-byte budget is now encoded directly in the uvarint loop bound (`buf[0..@min(buf.len, 10)]`) so `s += 7` cannot overflow s:u6 by construction. - Varint sentinel semantics are documented on the type itself. - Test coverage extended: 9-byte truncation, non-canonical `[0x80, 0x00]` zero encoding, symmetric value/bytesRead asserts on every malformed case. Network tests remain green (14/14, including the validateSnappyBlockHeader and snappyz.decodeWithMax regression tests added in this PR). Refs: - https://github.com/blockblaz/zig-snappy/releases/tag/v0.0.5 - blockblaz/zig-snappy#9
|
Bumped the dep to the tagged release zig-snappy v0.0.5 ( The tagged release includes the review feedback that landed on
Diff in this PR: Using
|
Substantive - 1. Naming inversion fixed. The pre-PR `validateGossipSnappyHeader` was used by RPC frame parsers (parseRequestFrame/parseResponseFrame), not by gossip. Renamed it to `validateRpcSnappyHeader` and gave the new gossip guard the (now correct) name `validateGossipSnappyHeader`. Both sit on top of a shared `validateSnappyHeader` helper that takes max_size as a parameter so both call sites use the same code path. - 2. Bool return replaced with a typed error union. Three different attacker shapes (corrupt varint, oversized claim, missing body) now surface as three distinct `SnappyHeaderValidationError` variants so ops/metrics can tell them apart instead of getting a single "malformed snappy header" line. - 3. Sender peer_id now appears in every gossip rejection log. The previous log only printed topic and length even though sender_peer_id was a fn arg \u2014 attribution is back. - 4. Peer scoring follow-up flagged with TODO at the gossip rejection site. Out of scope for the panic fix; tracked for a future PR. - 5. `writeFailedBytes` is now sample-rate limited via a process-local `shouldPersistMalformedDump()` (1-of-1024 + always-the-first). A peer spamming garbage gossip can no longer fill the disk with one debug file per message. - 6. Two-layer-defense exit criteria documented in the validator's doc comment. Three reasons the local guard stays permanent: per-topic size pre-allocation gating, typed errors for ops, and a safety net against future upstream regressions. - 7. Header-only-with-non-zero-body is no longer mislabeled as "malformed header". The header is well-formed; the body is missing. Distinct `HeaderWithoutBody` variant + distinct log reason + distinct dump label "snappy_truncated". RPC frame parser maps it to `error.Incomplete` (body bytes may not have arrived yet on this read), not to a fatal frame error. - 8. Varint-decoded-twice contract pinned with comments at both the validator and the gossip call site. Notes that both decoders MUST agree on strict `>` for the size-limit comparison. - 9. TODO at the per-kind size switch noting that attestations/ aggregations rarely approach MAX_RPC_MESSAGE_SIZE and tighter limits belong here. Nits - 10. Validator doc comment is generalised. No longer hardcodes the 10-byte continuation anecdote in the library function; the regression detail lives in the test. - 11. The decodeWithMax test is now explicitly labelled "REGRESSION CANARY" with an inline note that an unexpected panic here means the upstream zig-snappy pin has been downgraded below v0.0.5. Fixes the cryptic-failure-mode complaint. - 12. New test pinning the body-shorter-than-declared case as accepted by the header validator (decoder is authoritative for body integrity). Doc comment on `validateSnappyHeader` calls this out explicitly. - 13. Boundary test added: declared == max_size accepted, declared == max_size + 1 rejected. Pins the strict-`>` contract against upstream's `if (block.blockLen > max_size)`. The 1-byte-mismatch scenario the reviewer was worried about will surface here loudly instead of silently. - 14. TODO marker added at MAX_GOSSIP_BLOCK_SIZE noting the per-peer memory-pressure surface is real; revisit once the leanSpec lands.
|
Thanks for the review — pushed Substantive1. Naming inversion + signature unification. The pre-PR
Both sit on top of a single 2. Bool return → typed error union. New Three attacker shapes, three log lines, three dump labels ( 3. Sender peer_id in rejection log. 4. Peer scoring penalty. Out of scope but flagged: 5. 6. Two-layer-defense exit criteria. Documented in the validator's doc comment with three reasons the gossip guard stays permanent: (a) per-topic pre-allocation gating using zeam's limits, (b) typed errors for ops/metrics attribution, (c) safety net against future upstream regressions. Will not be removed once upstream stabilises — varint decode is the only overlap, and that overlap is documented at the call site. 7. Header-only-with-nonzero-body mislabel. New 8. Decoder runs varint twice. Pinned with comments at both the validator and the gossip call site. Both decoders MUST agree on strict 9. Per-kind size limits. TODO at the per-kind switch in Nits10. Generalised the validator doc comment. No longer mentions "10+ continuation bytes" or gossip; the regression detail lives in the test. 11. Regression-canary marker. The 12. Body-shorter-than-declared accepted. New test 13. Boundary 14. 50 MB × N peers. TODO marker added at VerificationDiff: |
Hive failure
gossip: ignores malformed ssz(test 390) onzeam_devnet4.The simulator publishes 1024 bytes of
0xefon a valid block topic and asserts the client stays healthy. The client log shows a clean panic on the network thread:Reproduced locally with a 30-line repro:
Root cause
uvarint()inblockblaz/zig-snappywalks every continuation byte (b >= 0x80) and increments au6shift counters += 7. After 9 iterationss == 63; on the 10ths += 7overflows the u6 in safe builds and panics. Every0xefis a continuation byte, so 1024 of them blow past the limit before any terminator is seen.The network thread crash kills the client. Hive's simulator then can't reach
/lean/v0/fork_choiceon the second node, and 26 follow-up tests cascade-fail withclient should connect and send a request: Elapsed(())and"Failed to publish message: Duplicate".Fix (two layers of defense)
1. Upstream:
blockblaz/zig-snappy#9uvarint()now returns the corrupt sentinel (bytesRead == 0) as soon as it sees the 10th continuation byte, before the nexts += 7. Also tighteneddecodedLen()'s corrupt check from<= 0to== 0(the old check was a no-op for usize).This PR bumps the dep to the tagged release v0.0.5 (
050f529).2. Local: pre-validate snappy header in
handleMsgFromRustBridgeEven with the upstream fix, we add a
validateSnappyBlockHeader()gate in front ofsnappyz.decodeWithMax. It uses zeam's ownuvarint(multiformats) — which already has proper bounds checking — to verify:MAX_GOSSIP_BLOCK_SIZEfor blocks,MAX_RPC_MESSAGE_SIZEotherwise),Malformed payloads are dropped with a
writeFailedBytesdebug dump and an error log. If the upstream decoder ever regresses again, gossip stays alive.Edge cases covered (new tests in
pkgs/network/src/ethlibp2p.zig)0xef(the actual Hive payload)MAX_GOSSIP_BLOCK_SIZE0x00byte (zero-length block)snappyz.decodeWithMaxon 1024×0xeferror.Corrupt, no panicUpstream
blockblaz/zig-snappy#9adds matching uvarint-level edge-case tests (10-byte max u64, 10-byte overflow boundary, etc.).Verification
Full test suite green.
Files
build.zig.zon— bumpzig_snappyto v0.0.5 (050f529)pkgs/network/src/ethlibp2p.zig— addvalidateSnappyBlockHeader(), call it inhandleMsgFromRustBridgebeforesnappyz.decodeWithMax, add 3 regression testsRefs: blockblaz/zig-snappy#9
cc @ch4r10t33r