fix(snappy): reject >9-byte uvarint continuation runs (no integer-overflow panic)#9
Conversation
A malformed snappy header that consists of 10 or more continuation bytes (byte >= 0x80) walks the loop in uvarint() past the u6 shift counter. After 9 iterations s == 63, and on the 10th iteration s += 7 panics with `integer overflow` in safe builds. Triggered in zeam by the Hive `gossip: ignores malformed ssz` test (1024 bytes of 0xef on a valid gossip topic). Reject malformed varints as soon as we see continuation byte 10 by returning the corrupt sentinel (`bytesRead == 0`) instead of pushing the shift counter further. Also tighten decodedLen() to compare against 0 explicitly (the previous `<= 0` on a usize was a no-op for the wrap- around sentinel that the old code tried to use).
…and tests Changes from review on PR #9: uvarint: - Restructure loop to iterate buf[0..@min(len,10)] so the 10-byte budget is encoded in the loop bound itself; `s += 7` can no longer push s:u6 past 63 by construction. - Replace `i >= 9` continuation-bit-still-set guard with `i == 9` (loop now cannot reach i == 10). - Move struct-level rationale onto Varint with explicit doc that `bytesRead == 0` is overloaded across truncated and malformed cases, and what to do if uvarint is ever made pub (return ?Varint or isize byte count, matching Go's encoding/binary.Uvarint convention). - Strip the gossip-payload reference from the library function comment; downstream context lives in the test comment. - Multi-line all early-return Varint{} literals to match the existing multi-line style. decodedLen: - Revert the `<= 0` -> `== 0` clarification. For usize the two are identical and the original PR description framed it as a tightening, which is misleading. Tests: - Symmetric `expectEqual(@as(u64, 0), vN.value)` on every malformed sub-case for parity with the bytesRead asserts. - New: 9 continuation bytes with no 10th byte (truncated input falls through to the post-loop return, distinct from the i == 9 guard). - New: non-canonical `[0x80, 0x00]` encoding of zero. Pins current behaviour (varint format does not require minimal encoding) so a future canonicalisation tightening is a deliberate breaking change.
|
Thanks for the careful review. Pushed 1. Restructured to 2. Added a doc block on 3. Comment rewritten: "10th byte (i == 9) with continuation bit still set: varint exceeds the 64-bit budget. Bail before the next 4. Reverted. Original 5. All early-return 6. Every malformed sub-case now asserts both 7. "malformed gossip payload" line is gone from the library; the regression detail stays in the test comment where it belongs. 8. Out of scope, will file a follow-up issue: a 1k-iteration random-bytes fuzz against 9. Added: 10. Added: Final diff: |
|
Re: point 8 (fuzz follow-up) — I tried to file it as an issue here but issues are disabled on this repo. Sketch below for the next time we touch this file or open a tracker repo: test "fuzz: decode never panics on random input" {
const allocator = testing.allocator;
var prng = std.Random.DefaultPrng.init(0xDEAD_BEEF_CAFE_F00D);
const random = prng.random();
var buf: [4096]u8 = undefined;
var i: usize = 0;
while (i < 1024) : (i += 1) {
const len = random.intRangeAtMost(usize, 0, buf.len);
random.bytes(buf[0..len]);
if (decode(allocator, buf[0..len])) |out| { allocator.free(out); } else |_| {}
if (decodeWithMax(allocator, buf[0..len], 16 * 1024 * 1024)) |out| { allocator.free(out); } else |_| {}
}
}Goal: catch sibling panics in |
build.zig.zon was still pinned at 0.0.3 even though tag v0.0.4 was cut on the decodeWithMax commit. Bump to 0.0.5 to reflect the uvarint integer-overflow fix landing on master via #9. Skipping 0.0.4 because that tag already exists pointing at e95e113 and re-tagging would break consumers that resolved against the existing tag.
Move from the pre-review commit on the fix branch (9521bbcb) to the tagged release v0.0.5 (050f529b) which includes the merged review feedback: - 10-byte budget is now encoded directly in the uvarint loop bound (`buf[0..@min(buf.len, 10)]`) so `s += 7` cannot overflow s:u6 by construction. - Varint sentinel semantics are documented on the type itself. - Test coverage extended: 9-byte truncation, non-canonical `[0x80, 0x00]` zero encoding, symmetric value/bytesRead asserts on every malformed case. Network tests remain green (14/14, including the validateSnappyBlockHeader and snappyz.decodeWithMax regression tests added in this PR). Refs: - https://github.com/blockblaz/zig-snappy/releases/tag/v0.0.5 - blockblaz/zig-snappy#9
…ve `gossip: ignores malformed ssz`) (#855) * network: stop integer-overflow panic on malformed gossip SSZ (Hive #390) Hive `gossip: ignores malformed ssz` (test 390) publishes 1024 bytes of `0xef` on a valid block topic and expects the client to remain healthy. zeam_devnet4 instead panicked on the network thread: thread 64 panic: integer overflow handleMsgFromRustBridge -> snappyz.decodeWithMax -> uvarint s += 7; // s: u6, after 9 iterations s == 63 Every `0xef` is a varint continuation byte; after the 10th continuation byte `s += 7` overflows the u6 shift counter. The crash took down the client and cascaded into 26 follow-up zeam_devnet4 failures (every test that re-used the second node afterwards reported "client should connect and send a request: Elapsed(())"). Two layers of defense: 1. Bump zig-snappy to a fix that rejects >9-byte continuation runs in uvarint() instead of panicking (blockblaz/zig-snappy#9). The decoder now returns error.Corrupt for the exact Hive payload. 2. Add validateSnappyBlockHeader() and call it in handleMsgFromRustBridge before invoking the third-party decoder. Uses zeam's own uvarint (multiformats) which has proper bounds checking and per-topic size limits. This guards against any future regression in the upstream decoder and rejects oversized declared sizes before any heap allocation. Edge cases covered by new tests in pkgs/network/src/ethlibp2p.zig: - 1024 bytes of 0xef (the actual Hive payload) -> rejected, no panic - 11 continuation bytes followed by a valid terminator -> rejected - Empty payload -> rejected - Declared size > MAX_GOSSIP_BLOCK_SIZE -> rejected - Header-only buffer with non-zero declared size -> rejected - Well-formed header + payload byte -> accepted - Zero-length declared block (single 0x00 byte) -> accepted - snappyz.decodeWithMax on 1024 bytes of 0xef -> error.Corrupt (no panic) Refs: https://hive.leanroadmap.org/suite.html?suiteid=1778305924-e14785654316449971f512c113089da3.json&suitename=gossip&client=zeam_devnet4#test-390 Companion: blockblaz/zig-snappy#9 * deps: pin zig-snappy to tagged release v0.0.5 Move from the pre-review commit on the fix branch (9521bbcb) to the tagged release v0.0.5 (050f529b) which includes the merged review feedback: - 10-byte budget is now encoded directly in the uvarint loop bound (`buf[0..@min(buf.len, 10)]`) so `s += 7` cannot overflow s:u6 by construction. - Varint sentinel semantics are documented on the type itself. - Test coverage extended: 9-byte truncation, non-canonical `[0x80, 0x00]` zero encoding, symmetric value/bytesRead asserts on every malformed case. Network tests remain green (14/14, including the validateSnappyBlockHeader and snappyz.decodeWithMax regression tests added in this PR). Refs: - https://github.com/blockblaz/zig-snappy/releases/tag/v0.0.5 - blockblaz/zig-snappy#9 * network: address PR #855 review feedback Substantive - 1. Naming inversion fixed. The pre-PR `validateGossipSnappyHeader` was used by RPC frame parsers (parseRequestFrame/parseResponseFrame), not by gossip. Renamed it to `validateRpcSnappyHeader` and gave the new gossip guard the (now correct) name `validateGossipSnappyHeader`. Both sit on top of a shared `validateSnappyHeader` helper that takes max_size as a parameter so both call sites use the same code path. - 2. Bool return replaced with a typed error union. Three different attacker shapes (corrupt varint, oversized claim, missing body) now surface as three distinct `SnappyHeaderValidationError` variants so ops/metrics can tell them apart instead of getting a single "malformed snappy header" line. - 3. Sender peer_id now appears in every gossip rejection log. The previous log only printed topic and length even though sender_peer_id was a fn arg \u2014 attribution is back. - 4. Peer scoring follow-up flagged with TODO at the gossip rejection site. Out of scope for the panic fix; tracked for a future PR. - 5. `writeFailedBytes` is now sample-rate limited via a process-local `shouldPersistMalformedDump()` (1-of-1024 + always-the-first). A peer spamming garbage gossip can no longer fill the disk with one debug file per message. - 6. Two-layer-defense exit criteria documented in the validator's doc comment. Three reasons the local guard stays permanent: per-topic size pre-allocation gating, typed errors for ops, and a safety net against future upstream regressions. - 7. Header-only-with-non-zero-body is no longer mislabeled as "malformed header". The header is well-formed; the body is missing. Distinct `HeaderWithoutBody` variant + distinct log reason + distinct dump label "snappy_truncated". RPC frame parser maps it to `error.Incomplete` (body bytes may not have arrived yet on this read), not to a fatal frame error. - 8. Varint-decoded-twice contract pinned with comments at both the validator and the gossip call site. Notes that both decoders MUST agree on strict `>` for the size-limit comparison. - 9. TODO at the per-kind size switch noting that attestations/ aggregations rarely approach MAX_RPC_MESSAGE_SIZE and tighter limits belong here. Nits - 10. Validator doc comment is generalised. No longer hardcodes the 10-byte continuation anecdote in the library function; the regression detail lives in the test. - 11. The decodeWithMax test is now explicitly labelled "REGRESSION CANARY" with an inline note that an unexpected panic here means the upstream zig-snappy pin has been downgraded below v0.0.5. Fixes the cryptic-failure-mode complaint. - 12. New test pinning the body-shorter-than-declared case as accepted by the header validator (decoder is authoritative for body integrity). Doc comment on `validateSnappyHeader` calls this out explicitly. - 13. Boundary test added: declared == max_size accepted, declared == max_size + 1 rejected. Pins the strict-`>` contract against upstream's `if (block.blockLen > max_size)`. The 1-byte-mismatch scenario the reviewer was worried about will surface here loudly instead of silently. - 14. TODO marker added at MAX_GOSSIP_BLOCK_SIZE noting the per-peer memory-pressure surface is real; revisit once the leanSpec lands. --------- Co-authored-by: zclawz <zclawz@users.noreply.github.com>
Bug
uvarint()insnappy.zigpanics withinteger overflowwhen the input contains 10+ continuation bytes (b >= 0x80).Reproduction in zeam Hive (test
gossip: ignores malformed ssz— 27 zeam_devnet4 cascade failures from this single panic):The simulator publishes 1024 bytes of
0xefon a valid gossip topic. Each0xefhas the high bit set, souvarintkeeps looping. The shift counters: u6reaches 63 after 9 iterations, ands += 7on the 10th iteration overflows the u6, panicking the network thread and crashing the client.Fix
Reject as malformed (return the corrupt sentinel
bytesRead == 0) as soon as the loop sees its 10th continuation byte. A 64-bit varint is at most 10 bytes long, so further bytes can't possibly be valid — and we must bail before the nexts += 7.Also tightened
decodedLen()'s corrupt check from<= 0to== 0. The old check was a no-op for any sentinel:bytesReadisusizeand the previous "corrupt" path returned-%i + 1which wraps to a huge usize, never<= 0.Edge cases covered (new tests)
0xef(the actual Hive payload) →Corrupt, no panic0xffff_ffff_ffff_ffff) → 10 bytes consumed, validbytesRead == 0, no readdecodeanddecodeWithMaxon the 1024×0xef payload →error.CorruptVerification
The regression test
decode rejects garbage with long continuation runreproduces the exact zeam Hive scenario.Companion change
A follow-up PR on
blockblaz/zeamwill bump this dep so the live Hive failure clears. cc @ch4r10t33r