Skip to content

fix(network): stop integer-overflow panic on malformed gossip SSZ (Hive gossip: ignores malformed ssz)#855

Merged
ch4r10t33r merged 3 commits into
mainfrom
fix/snappy-uvarint-overflow-malformed-ssz
May 9, 2026
Merged

fix(network): stop integer-overflow panic on malformed gossip SSZ (Hive gossip: ignores malformed ssz)#855
ch4r10t33r merged 3 commits into
mainfrom
fix/snappy-uvarint-overflow-malformed-ssz

Conversation

@zclawz
Copy link
Copy Markdown
Contributor

@zclawz zclawz commented May 9, 2026

Hive failure

gossip: ignores malformed ssz (test 390) on zeam_devnet4.

The simulator publishes 1024 bytes of 0xef on a valid block topic and asserts the client stays healthy. The client log shows a clean panic on the network thread:

thread 64 panic: integer overflow
std/mem.zig:3394:42 in handleMsgFromRustBridge (zeam)
...
pkgs/network/src/ethlibp2p.zig:1106:27 in createAndRunNetworkThread (zeam)

Reproduced locally with a 30-line repro:

$ cat /tmp/test_decode.zig
const std = @import("std");
const snappy = @import("snappy.zig");
pub fn main() !void {
    const garbage = [_]u8{0xef} ** 1024;
    _ = snappy.decodeWithMax(std.heap.page_allocator, &garbage, 16 * 1024 * 1024);
}
$ zig run /tmp/test_decode.zig
thread 628 panic: integer overflow
snappy.zig:73:11 in uvarint     s += 7;
snappy.zig:105:27 in decodedLen const varint = uvarint(src);
snappy.zig:265:33 in decodeWithMax

Root cause

uvarint() in blockblaz/zig-snappy walks every continuation byte (b >= 0x80) and increments a u6 shift counter s += 7. After 9 iterations s == 63; on the 10th s += 7 overflows the u6 in safe builds and panics. Every 0xef is a continuation byte, so 1024 of them blow past the limit before any terminator is seen.

The network thread crash kills the client. Hive's simulator then can't reach /lean/v0/fork_choice on the second node, and 26 follow-up tests cascade-fail with client should connect and send a request: Elapsed(()) and "Failed to publish message: Duplicate".

Fix (two layers of defense)

1. Upstream: blockblaz/zig-snappy#9

uvarint() now returns the corrupt sentinel (bytesRead == 0) as soon as it sees the 10th continuation byte, before the next s += 7. Also tightened decodedLen()'s corrupt check from <= 0 to == 0 (the old check was a no-op for usize).

This PR bumps the dep to the tagged release v0.0.5 (050f529).

2. Local: pre-validate snappy header in handleMsgFromRustBridge

Even with the upstream fix, we add a validateSnappyBlockHeader() gate in front of snappyz.decodeWithMax. It uses zeam's own uvarint (multiformats) — which already has proper bounds checking — to verify:

  • the leading varint decodes cleanly,
  • the declared uncompressed size fits the per-topic limit (MAX_GOSSIP_BLOCK_SIZE for blocks, MAX_RPC_MESSAGE_SIZE otherwise),
  • the buffer actually carries a body when the declared size is non-zero.

Malformed payloads are dropped with a writeFailedBytes debug dump and an error log. If the upstream decoder ever regresses again, gossip stays alive.

Edge cases covered (new tests in pkgs/network/src/ethlibp2p.zig)

case expected covered
1024 bytes of 0xef (the actual Hive payload) rejected, no panic
11 continuation bytes + valid terminator rejected
Exactly 10 continuation bytes, no terminator rejected
Empty payload rejected
Declared size > MAX_GOSSIP_BLOCK_SIZE rejected
Header-only buffer with non-zero declared size rejected
Well-formed header + payload byte accepted
Single 0x00 byte (zero-length block) accepted
snappyz.decodeWithMax on 1024×0xef error.Corrupt, no panic

Upstream blockblaz/zig-snappy#9 adds matching uvarint-level edge-case tests (10-byte max u64, 10-byte overflow boundary, etc.).

Verification

$ zig build
Finished `release` profile [optimized] target(s) in 0.36s

$ zig build test
...
11/14 ethlibp2p.test.validateGossipSnappyHeader rejects oversized declared size...OK
12/14 ethlibp2p.test.validateSnappyBlockHeader rejects malformed gossip payloads...OK
13/14 ethlibp2p.test.snappyz.decodeWithMax does not panic on 1024 bytes of 0xef...OK
All 14 tests passed.
...
All 145 tests passed.
All 11 tests passed.

Full test suite green.

Files

  • build.zig.zon — bump zig_snappy to v0.0.5 (050f529)
  • pkgs/network/src/ethlibp2p.zig — add validateSnappyBlockHeader(), call it in handleMsgFromRustBridge before snappyz.decodeWithMax, add 3 regression tests

Refs: blockblaz/zig-snappy#9
cc @ch4r10t33r

zclawz added 2 commits May 9, 2026 09:33
Hive `gossip: ignores malformed ssz` (test 390) publishes 1024 bytes of
`0xef` on a valid block topic and expects the client to remain healthy.
zeam_devnet4 instead panicked on the network thread:

    thread 64 panic: integer overflow
    handleMsgFromRustBridge -> snappyz.decodeWithMax -> uvarint
        s += 7;     // s: u6, after 9 iterations s == 63

Every `0xef` is a varint continuation byte; after the 10th continuation
byte `s += 7` overflows the u6 shift counter. The crash took down the
client and cascaded into 26 follow-up zeam_devnet4 failures (every test
that re-used the second node afterwards reported "client should connect
and send a request: Elapsed(())").

Two layers of defense:

1. Bump zig-snappy to a fix that rejects >9-byte continuation runs in
   uvarint() instead of panicking (blockblaz/zig-snappy#9). The decoder
   now returns error.Corrupt for the exact Hive payload.

2. Add validateSnappyBlockHeader() and call it in handleMsgFromRustBridge
   before invoking the third-party decoder. Uses zeam's own uvarint
   (multiformats) which has proper bounds checking and per-topic size
   limits. This guards against any future regression in the upstream
   decoder and rejects oversized declared sizes before any heap
   allocation.

Edge cases covered by new tests in pkgs/network/src/ethlibp2p.zig:

- 1024 bytes of 0xef (the actual Hive payload) -> rejected, no panic
- 11 continuation bytes followed by a valid terminator -> rejected
- Empty payload -> rejected
- Declared size > MAX_GOSSIP_BLOCK_SIZE -> rejected
- Header-only buffer with non-zero declared size -> rejected
- Well-formed header + payload byte -> accepted
- Zero-length declared block (single 0x00 byte) -> accepted
- snappyz.decodeWithMax on 1024 bytes of 0xef -> error.Corrupt (no panic)

Refs: https://hive.leanroadmap.org/suite.html?suiteid=1778305924-e14785654316449971f512c113089da3.json&suitename=gossip&client=zeam_devnet4#test-390
Companion: blockblaz/zig-snappy#9
Move from the pre-review commit on the fix branch
(9521bbcb) to the tagged release v0.0.5
(050f529b) which includes the merged review feedback:

- 10-byte budget is now encoded directly in the uvarint loop bound
  (`buf[0..@min(buf.len, 10)]`) so `s += 7` cannot overflow s:u6 by
  construction.
- Varint sentinel semantics are documented on the type itself.
- Test coverage extended: 9-byte truncation, non-canonical `[0x80, 0x00]`
  zero encoding, symmetric value/bytesRead asserts on every malformed
  case.

Network tests remain green (14/14, including the validateSnappyBlockHeader
and snappyz.decodeWithMax regression tests added in this PR).

Refs:
- https://github.com/blockblaz/zig-snappy/releases/tag/v0.0.5
- blockblaz/zig-snappy#9
@zclawz
Copy link
Copy Markdown
Contributor Author

zclawz commented May 9, 2026

Bumped the dep to the tagged release zig-snappy v0.0.5 (050f529) instead of the pre-review commit (9521bbc).

The tagged release includes the review feedback that landed on blockblaz/zig-snappy#9 after the original push:

  • uvarint loop is now bounded by buf[0..@min(buf.len, 10)] so the 10-byte budget lives in the loop bound itself; s += 7 can never overflow s:u6 by construction.
  • Varint struct has a doc block calling out the overloaded bytesRead == 0 sentinel and the upgrade path if uvarint ever goes pub.
  • Test coverage extended: 9-byte truncation, non-canonical [0x80, 0x00] zero encoding, symmetric value + bytesRead asserts on every malformed case.
  • build.zig.zon .version bumped to 0.0.5 (skipping 0.0.4 because that tag already exists pointing at the decodeWithMax commit).

Diff in this PR:

-            .url = "git+https://github.com/blockblaz/zig-snappy#9521bbcb95b59fe21f729abdba4852b88424af0c",
-            .hash = "zig_snappy-0.0.3-bDFzXuRvAABIBsAlnrHwgPDKH78O3Y5konV6Q8mFpcxy",
+            .url = "git+https://github.com/blockblaz/zig-snappy?ref=v0.0.5#050f529bc5fa11242140312ecf550c186f05af1e",
+            .hash = "zig_snappy-0.0.5-bDFzXlh4AADPRkDi2swDMAf8shH39Fc1g5oYRAhFQfTt",

Using ?ref=v0.0.5#<sha> so the URL is human-readable (tag) but content-addressed (sha) — Zig's package fetcher uses the SHA, the tag is for our eyes.

zig build is green and the two regression tests added in this PR (validateSnappyBlockHeader rejects malformed gossip payloads, snappyz.decodeWithMax does not panic on 1024 bytes of 0xef) pass against v0.0.5.

Substantive
- 1. Naming inversion fixed. The pre-PR `validateGossipSnappyHeader` was
  used by RPC frame parsers (parseRequestFrame/parseResponseFrame), not by
  gossip. Renamed it to `validateRpcSnappyHeader` and gave the new gossip
  guard the (now correct) name `validateGossipSnappyHeader`. Both sit on
  top of a shared `validateSnappyHeader` helper that takes max_size as a
  parameter so both call sites use the same code path.

- 2. Bool return replaced with a typed error union. Three different
  attacker shapes (corrupt varint, oversized claim, missing body) now
  surface as three distinct `SnappyHeaderValidationError` variants so
  ops/metrics can tell them apart instead of getting a single "malformed
  snappy header" line.

- 3. Sender peer_id now appears in every gossip rejection log. The
  previous log only printed topic and length even though sender_peer_id
  was a fn arg \u2014 attribution is back.

- 4. Peer scoring follow-up flagged with TODO at the gossip rejection
  site. Out of scope for the panic fix; tracked for a future PR.

- 5. `writeFailedBytes` is now sample-rate limited via a process-local
  `shouldPersistMalformedDump()` (1-of-1024 + always-the-first). A peer
  spamming garbage gossip can no longer fill the disk with one debug
  file per message.

- 6. Two-layer-defense exit criteria documented in the validator's doc
  comment. Three reasons the local guard stays permanent: per-topic
  size pre-allocation gating, typed errors for ops, and a safety net
  against future upstream regressions.

- 7. Header-only-with-non-zero-body is no longer mislabeled as
  "malformed header". The header is well-formed; the body is missing.
  Distinct `HeaderWithoutBody` variant + distinct log reason +
  distinct dump label "snappy_truncated". RPC frame parser maps it to
  `error.Incomplete` (body bytes may not have arrived yet on this
  read), not to a fatal frame error.

- 8. Varint-decoded-twice contract pinned with comments at both the
  validator and the gossip call site. Notes that both decoders MUST
  agree on strict `>` for the size-limit comparison.

- 9. TODO at the per-kind size switch noting that attestations/
  aggregations rarely approach MAX_RPC_MESSAGE_SIZE and tighter limits
  belong here.

Nits
- 10. Validator doc comment is generalised. No longer hardcodes the
  10-byte continuation anecdote in the library function; the
  regression detail lives in the test.

- 11. The decodeWithMax test is now explicitly labelled
  "REGRESSION CANARY" with an inline note that an unexpected panic
  here means the upstream zig-snappy pin has been downgraded below
  v0.0.5. Fixes the cryptic-failure-mode complaint.

- 12. New test pinning the body-shorter-than-declared case as
  accepted by the header validator (decoder is authoritative for body
  integrity). Doc comment on `validateSnappyHeader` calls this out
  explicitly.

- 13. Boundary test added: declared == max_size accepted, declared ==
  max_size + 1 rejected. Pins the strict-`>` contract against
  upstream's `if (block.blockLen > max_size)`. The 1-byte-mismatch
  scenario the reviewer was worried about will surface here loudly
  instead of silently.

- 14. TODO marker added at MAX_GOSSIP_BLOCK_SIZE noting the per-peer
  memory-pressure surface is real; revisit once the leanSpec lands.
@zclawz
Copy link
Copy Markdown
Contributor Author

zclawz commented May 9, 2026

Thanks for the review — pushed 189a315 addressing all 14 points.

Substantive

1. Naming inversion + signature unification. The pre-PR validateGossipSnappyHeader was used by parseRequestFrame/parseResponseFrame (RPC frame parsers), not by gossip — names backwards. Renamed:

  • validateGossipSnappyHeader (old) → validateRpcSnappyHeader (used by RPC frame parsers, hard-codes MAX_RPC_MESSAGE_SIZE).
  • validateSnappyBlockHeader (new) → validateGossipSnappyHeader (used by gossip, takes max_size parameter).

Both sit on top of a single validateSnappyHeader(bytes, max_size) helper so the actual checks live in one place.

2. Bool return → typed error union. New SnappyHeaderValidationError:

EmptyMessage             // empty buffer
InvalidVarint            // truncated/oversized/u64-overflow varint
DeclaredPayloadTooLarge  // varint OK, declared size > limit
HeaderWithoutBody        // header OK, body bytes missing

Three attacker shapes, three log lines, three dump labels (snappy_empty / snappy_varint / snappy_oversized / snappy_truncated).

3. Sender peer_id in rejection log. rejectMalformedGossip() now logs topic, len, peer, and the resolved node_namesender_peer_id was already a fn arg, just wasn't being used.

4. Peer scoring penalty. Out of scope but flagged: TODO(#855 review #4): apply a libp2p gossipsub score penalty here so a peer spamming malformed gossip is ejected by the protocol. Tracking for a follow-up PR.

5. writeFailedBytes disk-fill DoS. Added shouldPersistMalformedDump(): process-local atomic counter, persists 1-of-MALFORMED_DUMP_SAMPLE_RATE (1024) by default, plus always the very first malformed message of a process. Race-free is unnecessary; the goal is just bounded disk pressure. Both snappyz_* paths now go through it.

6. Two-layer-defense exit criteria. Documented in the validator's doc comment with three reasons the gossip guard stays permanent: (a) per-topic pre-allocation gating using zeam's limits, (b) typed errors for ops/metrics attribution, (c) safety net against future upstream regressions. Will not be removed once upstream stabilises — varint decode is the only overlap, and that overlap is documented at the call site.

7. Header-only-with-nonzero-body mislabel. New HeaderWithoutBody variant. Distinct from InvalidVarint: the header is well-formed, the body is missing. Distinct log reason (snappy header parsed but body bytes are missing), distinct dump label (snappy_truncated). For the RPC frame parser this maps to error.Incomplete (body bytes may not have arrived yet on this read), not a fatal frame error.

8. Decoder runs varint twice. Pinned with comments at both the validator and the gossip call site. Both decoders MUST agree on strict > for the size-limit comparison; if upstream ever flips to >=, the boundary test (#13 below) will catch it.

9. Per-kind size limits. TODO at the per-kind switch in handleMsgFromRustBridge: attestations/aggregations rarely approach MAX_RPC_MESSAGE_SIZE (4 MB). Tighter per-kind ceilings would let us reject earlier. Out of scope for this PR.

Nits

10. Generalised the validator doc comment. No longer mentions "10+ continuation bytes" or gossip; the regression detail lives in the test.

11. Regression-canary marker. The decodeWithMax test is now snappyz.decodeWithMax regression canary: 1024 bytes of 0xef must not panic with an explicit comment: "if this test ever panics the whole test binary instead of returning error.Corrupt, the upstream zig-snappy dependency has been downgraded below v0.0.5". Cryptic failure mode → loud failure mode.

12. Body-shorter-than-declared accepted. New test Body shorter than declared but well-formed header: validator accepts by design. Doc comment on validateSnappyHeader now explicitly says: "this is not a full body integrity check".

13. Boundary > vs >=. Confirmed against upstream (if (block.blockLen > max_size) in snappyz.decodeWithMax). New boundary test: declared == max_size accepted, declared == max_size + 1 rejected. Pinned both in code (a comment on DeclaredPayloadTooLarge) and in tests so a future upstream change to >= flips the boundary visibly instead of silently disagreeing across a 1-byte gap.

14. 50 MB × N peers. TODO marker added at MAX_GOSSIP_BLOCK_SIZE: "50 MB × N peers is a real memory-pressure surface. Track in a follow-up issue once the spec lands and we can lower this."

Verification

$ zig build
Finished `release` profile [optimized] target(s) in 0.33s

$ zig build test
...
All 14 tests passed.   # network pkg (incl. all new typed-error + boundary tests)
All 145 tests passed.  # zeam core
All 11 tests passed.   # node

Diff: +273 / -59 in pkgs/network/src/ethlibp2p.zig.

Copy link
Copy Markdown
Contributor

@ch4r10t33r ch4r10t33r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ch4r10t33r ch4r10t33r merged commit d0d50eb into main May 9, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the fix/snappy-uvarint-overflow-malformed-ssz branch May 9, 2026 11:40
@zclawz zclawz mentioned this pull request May 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants