Skip to content

fix(snappy): reject >9-byte uvarint continuation runs (no integer-overflow panic)#9

Merged
ch4r10t33r merged 2 commits into
masterfrom
fix/uvarint-overflow-on-long-continuation-runs
May 9, 2026
Merged

fix(snappy): reject >9-byte uvarint continuation runs (no integer-overflow panic)#9
ch4r10t33r merged 2 commits into
masterfrom
fix/uvarint-overflow-on-long-continuation-runs

Conversation

@zclawz
Copy link
Copy Markdown

@zclawz zclawz commented May 9, 2026

Bug

uvarint() in snappy.zig panics with integer overflow when the input contains 10+ continuation bytes (b >= 0x80).

Reproduction in zeam Hive (test gossip: ignores malformed ssz — 27 zeam_devnet4 cascade failures from this single panic):

thread 64 panic: integer overflow
snappy.zig:73:11 in uvarint     s += 7;
snappy.zig:105:27 in decodedLen const varint = uvarint(src);
snappy.zig:265:33 in decodeWithMax

The simulator publishes 1024 bytes of 0xef on a valid gossip topic. Each 0xef has the high bit set, so uvarint keeps looping. The shift counter s: u6 reaches 63 after 9 iterations, and s += 7 on the 10th iteration overflows the u6, panicking the network thread and crashing the client.

Fix

Reject as malformed (return the corrupt sentinel bytesRead == 0) as soon as the loop sees its 10th continuation byte. A 64-bit varint is at most 10 bytes long, so further bytes can't possibly be valid — and we must bail before the next s += 7.

Also tightened decodedLen()'s corrupt check from <= 0 to == 0. The old check was a no-op for any sentinel: bytesRead is usize and the previous "corrupt" path returned -%i + 1 which wraps to a huge usize, never <= 0.

Edge cases covered (new tests)

  • 1024 bytes of 0xef (the actual Hive payload) → Corrupt, no panic
  • Exactly 10 continuation bytes, no terminator → corrupt
  • 11 continuation bytes followed by a valid terminator → corrupt
  • 10th byte with value > 1 (overflows u64) → corrupt (existing rule, still enforced)
  • 10th byte with value == 1 (max u64 = 0xffff_ffff_ffff_ffff) → 10 bytes consumed, valid
  • Empty buffer → bytesRead == 0, no read
  • decode and decodeWithMax on the 1024×0xef payload → error.Corrupt

Verification

$ zig build test --summary all
Build Summary: 3/3 steps succeeded; 9/9 tests passed

The regression test decode rejects garbage with long continuation run reproduces the exact zeam Hive scenario.

Companion change

A follow-up PR on blockblaz/zeam will bump this dep so the live Hive failure clears. cc @ch4r10t33r

A malformed snappy header that consists of 10 or more continuation bytes
(byte >= 0x80) walks the loop in uvarint() past the u6 shift counter.
After 9 iterations s == 63, and on the 10th iteration s += 7 panics with
`integer overflow` in safe builds. Triggered in zeam by the Hive
`gossip: ignores malformed ssz` test (1024 bytes of 0xef on a valid
gossip topic).

Reject malformed varints as soon as we see continuation byte 10 by
returning the corrupt sentinel (`bytesRead == 0`) instead of pushing
the shift counter further. Also tighten decodedLen() to compare against
0 explicitly (the previous `<= 0` on a usize was a no-op for the wrap-
around sentinel that the old code tried to use).
…and tests

Changes from review on PR #9:

uvarint:
- Restructure loop to iterate buf[0..@min(len,10)] so the 10-byte budget is
  encoded in the loop bound itself; `s += 7` can no longer push s:u6 past 63
  by construction.
- Replace `i >= 9` continuation-bit-still-set guard with `i == 9` (loop now
  cannot reach i == 10).
- Move struct-level rationale onto Varint with explicit doc that
  `bytesRead == 0` is overloaded across truncated and malformed cases, and
  what to do if uvarint is ever made pub (return ?Varint or isize byte
  count, matching Go's encoding/binary.Uvarint convention).
- Strip the gossip-payload reference from the library function comment;
  downstream context lives in the test comment.
- Multi-line all early-return Varint{} literals to match the existing
  multi-line style.

decodedLen:
- Revert the `<= 0` -> `== 0` clarification. For usize the two are
  identical and the original PR description framed it as a tightening,
  which is misleading.

Tests:
- Symmetric `expectEqual(@as(u64, 0), vN.value)` on every malformed sub-case
  for parity with the bytesRead asserts.
- New: 9 continuation bytes with no 10th byte (truncated input falls
  through to the post-loop return, distinct from the i == 9 guard).
- New: non-canonical `[0x80, 0x00]` encoding of zero. Pins current behaviour
  (varint format does not require minimal encoding) so a future
  canonicalisation tightening is a deliberate breaking change.
@zclawz
Copy link
Copy Markdown
Author

zclawz commented May 9, 2026

Thanks for the careful review. Pushed a01e503 addressing 1, 2, 3, 4, 5, 6, 7, 9, 10. Filing 8 as a follow-up issue.

1. Restructured to for (buf[0..@min(buf.len, 10)], 0..) so the 10-byte budget lives in the loop bound. Loop can't reach i == 10 by construction; s += 7 can't push s:u6 past 63 by construction. Both internal guards are now i == 9 (the in-loop continuation-bit-still-set case stays as a guard because we still need to bail before the next iteration's s += 7 runs).

2. Added a doc block on Varint explicitly calling out that bytesRead == 0 is overloaded across truncated and malformed inputs, that the Go reference distinguishes via negative byte count, and what to do if uvarint is ever made pub (switch to ?Varint or isize).

3. Comment rewritten: "10th byte (i == 9) with continuation bit still set: varint exceeds the 64-bit budget. Bail before the next s += 7 would overflow the u6 shift counter."

4. Reverted. Original <= 0 is preserved; for usize it's identical to == 0 and the previous "tightened" framing was misleading.

5. All early-return Varint{} literals now multi-line, matching the existing style.

6. Every malformed sub-case now asserts both bytesRead == 0 and value == 0.

7. "malformed gossip payload" line is gone from the library; the regression detail stays in the test comment where it belongs.

8. Out of scope, will file a follow-up issue: a 1k-iteration random-bytes fuzz against decode/decodeWithMax would catch sibling panics in runDecode and other malformed-input crashes. Linked here once filed.

9. Added: [0xff] * 9 exactly. This is the truncated-input codepath that falls through the post-loop return, which is distinct from the i == 9 guard. Comment in the test calls out the distinction.

10. Added: [0x80, 0x00] non-canonical encoding of zero. Pins current behaviour (the varint format doesn't require minimal encoding) so any future tightening is a deliberate breaking change rather than a silent drift.

Final diff: +88 / -42 in snappy.zig. zig build test 9/9 green. The 1024×0xef repro still returns error.Corrupt instead of panicking.

@zclawz
Copy link
Copy Markdown
Author

zclawz commented May 9, 2026

Re: point 8 (fuzz follow-up) — I tried to file it as an issue here but issues are disabled on this repo. Sketch below for the next time we touch this file or open a tracker repo:

test "fuzz: decode never panics on random input" {
    const allocator = testing.allocator;
    var prng = std.Random.DefaultPrng.init(0xDEAD_BEEF_CAFE_F00D);
    const random = prng.random();

    var buf: [4096]u8 = undefined;
    var i: usize = 0;
    while (i < 1024) : (i += 1) {
        const len = random.intRangeAtMost(usize, 0, buf.len);
        random.bytes(buf[0..len]);
        if (decode(allocator, buf[0..len])) |out| { allocator.free(out); } else |_| {}
        if (decodeWithMax(allocator, buf[0..len], 16 * 1024 * 1024)) |out| { allocator.free(out); } else |_| {}
    }
}

Goal: catch sibling panics in runDecode (tag-literal length boundaries, tagCopy* offset casts, the dst[d - offset ..] slicing, etc.) the same way #9 caught the one in uvarint. Deterministic seed so CI failures reproduce locally; both decode and decodeWithMax exercised so we cover the early-bail path and the full-decode path. Out of scope for #9; happy to open a separate PR for it after this one merges.

@ch4r10t33r ch4r10t33r merged commit 8321cfa into master May 9, 2026
4 checks passed
zclawz added a commit that referenced this pull request May 9, 2026
build.zig.zon was still pinned at 0.0.3 even though tag v0.0.4 was cut on
the decodeWithMax commit. Bump to 0.0.5 to reflect the uvarint
integer-overflow fix landing on master via #9.

Skipping 0.0.4 because that tag already exists pointing at e95e113 and
re-tagging would break consumers that resolved against the existing tag.
zclawz added a commit to blockblaz/zeam that referenced this pull request May 9, 2026
Move from the pre-review commit on the fix branch
(9521bbcb) to the tagged release v0.0.5
(050f529b) which includes the merged review feedback:

- 10-byte budget is now encoded directly in the uvarint loop bound
  (`buf[0..@min(buf.len, 10)]`) so `s += 7` cannot overflow s:u6 by
  construction.
- Varint sentinel semantics are documented on the type itself.
- Test coverage extended: 9-byte truncation, non-canonical `[0x80, 0x00]`
  zero encoding, symmetric value/bytesRead asserts on every malformed
  case.

Network tests remain green (14/14, including the validateSnappyBlockHeader
and snappyz.decodeWithMax regression tests added in this PR).

Refs:
- https://github.com/blockblaz/zig-snappy/releases/tag/v0.0.5
- blockblaz/zig-snappy#9
ch4r10t33r pushed a commit to blockblaz/zeam that referenced this pull request May 9, 2026
…ve `gossip: ignores malformed ssz`) (#855)

* network: stop integer-overflow panic on malformed gossip SSZ (Hive #390)

Hive `gossip: ignores malformed ssz` (test 390) publishes 1024 bytes of
`0xef` on a valid block topic and expects the client to remain healthy.
zeam_devnet4 instead panicked on the network thread:

    thread 64 panic: integer overflow
    handleMsgFromRustBridge -> snappyz.decodeWithMax -> uvarint
        s += 7;     // s: u6, after 9 iterations s == 63

Every `0xef` is a varint continuation byte; after the 10th continuation
byte `s += 7` overflows the u6 shift counter. The crash took down the
client and cascaded into 26 follow-up zeam_devnet4 failures (every test
that re-used the second node afterwards reported "client should connect
and send a request: Elapsed(())").

Two layers of defense:

1. Bump zig-snappy to a fix that rejects >9-byte continuation runs in
   uvarint() instead of panicking (blockblaz/zig-snappy#9). The decoder
   now returns error.Corrupt for the exact Hive payload.

2. Add validateSnappyBlockHeader() and call it in handleMsgFromRustBridge
   before invoking the third-party decoder. Uses zeam's own uvarint
   (multiformats) which has proper bounds checking and per-topic size
   limits. This guards against any future regression in the upstream
   decoder and rejects oversized declared sizes before any heap
   allocation.

Edge cases covered by new tests in pkgs/network/src/ethlibp2p.zig:

- 1024 bytes of 0xef (the actual Hive payload) -> rejected, no panic
- 11 continuation bytes followed by a valid terminator -> rejected
- Empty payload -> rejected
- Declared size > MAX_GOSSIP_BLOCK_SIZE -> rejected
- Header-only buffer with non-zero declared size -> rejected
- Well-formed header + payload byte -> accepted
- Zero-length declared block (single 0x00 byte) -> accepted
- snappyz.decodeWithMax on 1024 bytes of 0xef -> error.Corrupt (no panic)

Refs: https://hive.leanroadmap.org/suite.html?suiteid=1778305924-e14785654316449971f512c113089da3.json&suitename=gossip&client=zeam_devnet4#test-390
Companion: blockblaz/zig-snappy#9

* deps: pin zig-snappy to tagged release v0.0.5

Move from the pre-review commit on the fix branch
(9521bbcb) to the tagged release v0.0.5
(050f529b) which includes the merged review feedback:

- 10-byte budget is now encoded directly in the uvarint loop bound
  (`buf[0..@min(buf.len, 10)]`) so `s += 7` cannot overflow s:u6 by
  construction.
- Varint sentinel semantics are documented on the type itself.
- Test coverage extended: 9-byte truncation, non-canonical `[0x80, 0x00]`
  zero encoding, symmetric value/bytesRead asserts on every malformed
  case.

Network tests remain green (14/14, including the validateSnappyBlockHeader
and snappyz.decodeWithMax regression tests added in this PR).

Refs:
- https://github.com/blockblaz/zig-snappy/releases/tag/v0.0.5
- blockblaz/zig-snappy#9

* network: address PR #855 review feedback

Substantive
- 1. Naming inversion fixed. The pre-PR `validateGossipSnappyHeader` was
  used by RPC frame parsers (parseRequestFrame/parseResponseFrame), not by
  gossip. Renamed it to `validateRpcSnappyHeader` and gave the new gossip
  guard the (now correct) name `validateGossipSnappyHeader`. Both sit on
  top of a shared `validateSnappyHeader` helper that takes max_size as a
  parameter so both call sites use the same code path.

- 2. Bool return replaced with a typed error union. Three different
  attacker shapes (corrupt varint, oversized claim, missing body) now
  surface as three distinct `SnappyHeaderValidationError` variants so
  ops/metrics can tell them apart instead of getting a single "malformed
  snappy header" line.

- 3. Sender peer_id now appears in every gossip rejection log. The
  previous log only printed topic and length even though sender_peer_id
  was a fn arg \u2014 attribution is back.

- 4. Peer scoring follow-up flagged with TODO at the gossip rejection
  site. Out of scope for the panic fix; tracked for a future PR.

- 5. `writeFailedBytes` is now sample-rate limited via a process-local
  `shouldPersistMalformedDump()` (1-of-1024 + always-the-first). A peer
  spamming garbage gossip can no longer fill the disk with one debug
  file per message.

- 6. Two-layer-defense exit criteria documented in the validator's doc
  comment. Three reasons the local guard stays permanent: per-topic
  size pre-allocation gating, typed errors for ops, and a safety net
  against future upstream regressions.

- 7. Header-only-with-non-zero-body is no longer mislabeled as
  "malformed header". The header is well-formed; the body is missing.
  Distinct `HeaderWithoutBody` variant + distinct log reason +
  distinct dump label "snappy_truncated". RPC frame parser maps it to
  `error.Incomplete` (body bytes may not have arrived yet on this
  read), not to a fatal frame error.

- 8. Varint-decoded-twice contract pinned with comments at both the
  validator and the gossip call site. Notes that both decoders MUST
  agree on strict `>` for the size-limit comparison.

- 9. TODO at the per-kind size switch noting that attestations/
  aggregations rarely approach MAX_RPC_MESSAGE_SIZE and tighter limits
  belong here.

Nits
- 10. Validator doc comment is generalised. No longer hardcodes the
  10-byte continuation anecdote in the library function; the
  regression detail lives in the test.

- 11. The decodeWithMax test is now explicitly labelled
  "REGRESSION CANARY" with an inline note that an unexpected panic
  here means the upstream zig-snappy pin has been downgraded below
  v0.0.5. Fixes the cryptic-failure-mode complaint.

- 12. New test pinning the body-shorter-than-declared case as
  accepted by the header validator (decoder is authoritative for body
  integrity). Doc comment on `validateSnappyHeader` calls this out
  explicitly.

- 13. Boundary test added: declared == max_size accepted, declared ==
  max_size + 1 rejected. Pins the strict-`>` contract against
  upstream's `if (block.blockLen > max_size)`. The 1-byte-mismatch
  scenario the reviewer was worried about will surface here loudly
  instead of silently.

- 14. TODO marker added at MAX_GOSSIP_BLOCK_SIZE noting the per-peer
  memory-pressure surface is real; revisit once the leanSpec lands.

---------

Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants